How to spot a high-quality repository conversion
In my last post, I inveighed against using git-svn to do whole-repository conversions from Subversion to git (as opposed to its intended use, which is working a Subversion repository live through a git remote).
Now comes word that hundreds of projects a week seem to be fleeing SourceForge because of its evil we’ll-hijack-your-repo-and-crapwarify-your-installer policy. And they’re moving to GitHub via its automatic importer. Which, sigh, uses git-svn.
I wouldn’t trust that automatic importer (or any other conversion path that uses git-svn) with anything I write, so I don’t know how badly it messes things up.
But as a public service, I follow with a description of how a really well-done repository conversion – the kind I would deliver using reposurgeon – differs from a crappy one.
In evaluating quality, we need to keep in mind why people spelunk into code histories. Typically they’re doing it to trace bugs, understand the history of a feature, or grasp the thinking behind prior design decisions.
These kinds of analyses are hard work demanding attention and cognitive exertion. The last thing anyone doing them needs is to have his or her attention jerked to the fact that back past a certain point of conversion everything was different – commit references in alien and unusable formats, comments in a different style, user IDs missing, ignore patterns not working, etc.
Thus, as a repository translator my goal is for the experience of diving into the past to be as frictionless as possible. Ideally, the converted repository should look as though modern DVCS-like practices had been in use from the beginning of time.
Some of the kinds of glitches I’m going to describe may seem like they ought to be ignorable nits. And individually they often are. But the cumulative effect of all of them is distracting. Unnecessarily distracting.
These are some key things that distinguish a really good conversion, one that’s near-frictionless to use, from a poor one.
1. Subversion/CVS/BitKeeper user IDs are properly mapped to Git-style human-name-plus-email identifications.
Sometimes this is a lot of work – for one conversion I did recently I spent many hours Googling to identify hundreds of contributors going back to 1999.
The immediate reason this is valuable is so we know who was responsible for individual commits, which can be important in bug forensics.
A more social reason is that otherwise OpenHub and sites like it in the future won’t be able to do reputation tracking properly. Contributors deserve their kudos and should have it.
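As a minimal sketch of what such a mapping might look like in practice, the Python below parses a hand-maintained table of the form "login = Full Name <email>" and formats Git-style identities from it. The entry format, the logins, the names, and the example.com addresses are all assumptions made up for illustration; they are not taken from any real conversion.

```python
import re

# Hypothetical map contents: one "login = Full Name <email>" entry per line.
AUTHOR_MAP_TEXT = """
# Lines starting with '#' are comments.
jrh = J. Random Hacker <jrh@example.com>
fred = Fred Foonly <fred@example.com>
"""

ENTRY = re.compile(r"^\s*(\S+)\s*=\s*(.+?)\s*<([^<>]+)>\s*$")


def parse_author_map(text):
    """Return {login: (name, email)} parsed from 'login = Name <email>' lines."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"malformed author entry: {line!r}")
        login, name, email = match.groups()
        mapping[login] = (name, email)
    return mapping


def git_identity(login, mapping):
    """Format a Git-style identity, failing loudly on an unmapped login."""
    if login not in mapping:
        raise KeyError(f"no identity recorded for login {login!r}")
    name, email = mapping[login]
    return f"{name} <{email}>"
```

Raising an error on an unmapped login, rather than silently passing the bare ID through, is a deliberate choice in this sketch: every gap in the table then shows up as something to research instead of leaking into the converted history.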
2. Commit references are mapped to some reasonably VCS-independent way to identify the commits they point at; I generally use either unique prefixes of commit comments or committer/date pairs.
Because ‘r1234’ is useless when you’re not in Subversion-land anymore, Toto. And appending a fossil Subversion ID to every commit comment is heavyweight, ugly, and distracting.
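One way to mechanize the committer/date variant can be sketched as follows, assuming the conversion has already produced a lookup from Subversion revision numbers to committer and timestamp. The `REVISION_INDEX` table, its single entry, and the "timestamp!email" output format are all hypothetical choices for illustration. The pattern can also hit strings that merely look like revision references, so the result would still need a human read-through.

```python
import re

# Hypothetical lookup built during conversion:
# Subversion revision number -> (committer email, commit timestamp).
REVISION_INDEX = {
    1234: ("jrh@example.com", "2009-03-14T15:09:26Z"),
}

SVN_REF = re.compile(r"\br(\d+)\b")


def rewrite_references(comment, index):
    """Replace 'rNNNN' with a committer/date pair when the revision is known."""

    def substitute(match):
        revision = int(match.group(1))
        if revision not in index:
            return match.group(0)  # leave unknown references for manual review
        committer, stamp = index[revision]
        return f"{stamp}!{committer}"  # illustrative format, not a standard

    return SVN_REF.sub(substitute, comment)
```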
3. Comments are reformatted to be in DVCS form – that is, standalone summary line plus (if there’s more) a spacer line plus following paragraphs.
Yes, this means that to do it right you need to eyeball the entire comment history and edit it into what it would have looked like if the committers had been using those conventions from the beginning. Yes, this is a lot of work. Yes, I do it, and so should you.
The reason this step is really important is that without it tools like gitk and git log can’t do their job properly. This makes it far more difficult for people reading the history to zero in efficiently on what they need to know to get real work done.
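The editing itself needs human judgment, but a small checker can at least flag which comments need attention. The sketch below tests for the summary-line, spacer-line, body shape described above. The function name and the 72-character summary limit are assumptions for illustration, not fixed rules.

```python
def needs_attention(comment, max_summary=72):
    """Return the reasons a comment is not in summary/spacer/body form.

    An empty list means the comment already has a standalone summary line
    and, if there is more text, a blank spacer line after it.
    """
    lines = comment.strip("\n").split("\n")
    problems = []
    if not lines[0].strip():
        problems.append("empty summary line")
    elif len(lines[0]) > max_summary:  # assumed limit; adjust to taste
        problems.append("summary line is too long")
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line after summary")
    return problems
```

A comment such as "Fixed a bug\nand also changed the build" would be flagged for the missing spacer line, which is the case where a summary-only view ends up showing a sentence fragment.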
4. Ignore patterns and files should be lifted from the syntax and wildcarding conventions of the old system to the syntax and wildcarding conventions of the new one.
This is one of the many things git-svn simply fluffs. Other batch-mode converters could in theory do a better job, but generally don’t.
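To make the difference in conventions concrete: a Subversion svn:ignore property applies only to entries directly inside the directory it is set on, while a slash-free pattern in a .gitignore file matches at any depth below that file. A minimal sketch of the translation, assuming the property values have already been extracted per directory, anchors each pattern with a leading slash. The function name and sample inputs are hypothetical.

```python
def svn_ignore_to_gitignore(directory, property_value):
    """Translate one directory's svn:ignore value into top-level .gitignore lines.

    `directory` is the path relative to the repository root ("" for the root).
    Each pattern is anchored with a leading slash so that it matches only in
    that directory, mirroring the non-recursive scope of svn:ignore.
    """
    cleaned = directory.strip("/")
    prefix = f"/{cleaned}/" if cleaned else "/"
    lines = []
    for pattern in property_value.splitlines():
        pattern = pattern.strip()
        if pattern:  # skip blank lines in the property value
            lines.append(prefix + pattern)
    return lines
```

For example, the value "*.o" set on a directory named src becomes the line "/src/*.o", which does not spill over into subdirectories the way a bare "*.o" would.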
5. The converted repository should not lose valuable metadata – like release tags.
Yes, I’m actually looking at a GitHub conversion that was that bad.
When the tags are missing, users will be unable to identify or do code diffs against historical release points. It’s a usability crash landing.
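A quick sanity check for this is to compare the release tags you expect against what the converted repository actually contains. The sketch below assumes the expected list comes from somewhere outside the conversion, such as release announcements or the old repository's tags directory. The function names and version strings are hypothetical; `git tag --list` is the only Git command used.

```python
import subprocess


def present_tags(repo_path):
    """Return the set of tag names in the Git repository at repo_path."""
    result = subprocess.run(
        ["git", "-C", repo_path, "tag", "--list"],
        check=True,
        capture_output=True,
        text=True,
    )
    return set(result.stdout.split())


def missing_release_tags(expected, present):
    """Return the expected release tags absent from the conversion, sorted."""
    return sorted(set(expected) - set(present))
```

Calling `missing_release_tags(["v1.0", "v1.1"], present_tags("converted-repo"))` on a conversion that dropped its tags would return both names, which is the cue to go back and fix the conversion before anyone starts depending on it.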
Eric S. Raymond's Blog
