You’re gonna need a bigger Beast
I’m taking a management-approved break from NTPsec to do a repository conversion that dwarfs any I’ve ever seen before. Yep, more history than Emacs – much, much more. More backtrail than entire BSD distributions; in fact, about an order of magnitude larger than any repo I’ve previously encountered.
Over 255,000 commits dating back to 1989 – now in Subversion, formerly in CVS and (I suspect) RCS. It’s the history of GCC, the GNU Compiler Collection.
For comparison, the entire history of NTP, including all the years before I started working on it, is 14K commits going back to 1999. That’s a long history compared to most projects, but you’d have to lay about 18 NTPsec histories end to end to even approximate the length of GCC’s.
In fact, this monstrous pile is so huge that loading it into reposurgeon OOM-crashed the Great Beast (that’s the machine I designed specifically for large-scale repository surgery). The Beast went down twice before I started to get things under control. The title isn’t quite true, but I post it in commemoration of Mark Atwood’s comment after the first OOM: “You’re gonna need a bigger boat.” Mark is the manager signing the checks so I can do this thing; all praise to Mark.
I’ve gotten the maximum memory utilization under 64GB with a series of shifty dodges, including various internal changes to throw away intermediate storage as soon as possible. The most important move was probably running reposurgeon under PyPy, which has a few bytes less overhead per Python object than CPython and pulls the maximum working set just low enough that the Beast can deal. Even so, I’ve been shutting down my browser during the test runs.
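For flavor – and to be clear, none of what follows is reposurgeon’s actual code, and the FatBlob/LeanBlob classes are made up – here’s the kind of per-object overhead these dodges are fighting. In CPython every ordinary instance drags a per-instance attribute dict around; declaring __slots__ suppresses it, and that sort of saving multiplied by a quarter-million commits plus their blobs adds up fast:

```python
import sys

class FatBlob:
    """Ordinary class: every instance carries its own __dict__."""
    def __init__(self, mark, path):
        self.mark = mark
        self.path = path

class LeanBlob:
    """__slots__ suppresses the per-instance __dict__ entirely."""
    __slots__ = ("mark", "path")
    def __init__(self, mark, path):
        self.mark = mark
        self.path = path

fat = FatBlob(":1", "gcc/ChangeLog")
lean = LeanBlob(":1", "gcc/ChangeLog")

# getsizeof() doesn't count the attribute dict, so add it in explicitly.
print("fat: ", sys.getsizeof(fat) + sys.getsizeof(fat.__dict__), "bytes per instance")
print("lean:", sys.getsizeof(lean), "bytes per instance (no __dict__ at all)")
```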
So I hear you ask: Why don’t you just put in more memory? And I answer: Duuuude, have you priced 128GB DDR4 RAM with ECC lately? Even the cheap low-end stuff is pretty damn pricey, and I can’t use the cheap stuff. The premise of the Beast’s design is maximizing memory-access speed and bandwidth (rather than raw processor speed) in order to deal with huge working sets – repository surgery is, as I’ve noted before, traversal computations on a graph gigabytes wide. A bit over three years after it first booted there probably still isn’t any other machine built from off-the-shelf parts more effective for this specific job load (yes, I’ve been keeping track), but that advantage could be thrown away by memory with poor latency.
Next I hear you ask: what about swap? Isn’t demand paging supposed to, you know, deal with this sort of thing?
I must admit to being a bit embarrassed here. After the second OOM crash John Bell and I did some digging (John is the expert admin who configured the Beast’s initial system load) and I rediscovered the fact that I had followed some advice to set swappiness really really low for interactive responsiveness. Exactly the opposite tuning from what I need now.
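For the record (and as a note to future me): the knob is vm.swappiness, settable on the fly with sysctl or persistently with a drop-in file. The filename below is arbitrary, and 60 is just the stock Linux default rather than a recommendation:

```
# /etc/sysctl.d/90-swappiness.conf   (filename is arbitrary)
# Low values make the kernel cling to RAM for interactive snappiness;
# higher values make it page out more willingly.  The Linux default is 60.
vm.swappiness = 60

# Apply without a reboot:  sudo sysctl --system
```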
I fixed that, but I’m afraid to push the machine into actual swapping lest I find that I have not done enough and it OOMs again. Each crash is a painful setback when your test-conversion runs take seven hours each (nine if you’re trying to build a live repo). So I’m leaving my browser shut down and running light with just i3, Emacs, and a couple of terminal instances.
If 7 to 9 hours sounds like a lot, consider that the first couple of tries took 13 or 14 hours before OOMing. For comparison, one of the GCC guys reported tests running 18 or 19 hours before failure on more stock hardware.
PyPy gets a lot of the credit for the speedup – I believe I’m getting at least a 2:1 speed advantage over CPython, and possibly much more – the PyPy website boldly claims 7:1 and I could believe that. But it’s hard to be sure because (a) I don’t know how long the early runs would have taken but for OOMing, and (b) I’ve been hunting down hot loops in the code and finding ways to optimize them out.
Here is a thing that happens. You code an O(n**2) method, but you don’t realize it (maybe there’s an operation with hidden O(n) inside the O(n) loop you can see). As long as n is small, it’s harmless – you never find it because the worst-case cost is smaller than your measurement noise. Then n goes up by two orders of magnitude and boom. But at least this kind of monster load does force inefficiencies into the open; if you wield a profiler properly, you may be able to pin them down and abolish them. So far I’ve nailed two rather bad ones.
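Here’s a toy Python version of the trap – not reposurgeon code, just the shape of it. The visible loop is linear, but membership testing against a list is itself a linear scan, so the whole thing is quadratic; swap the list for a set and the hidden factor evaporates:

```python
# Looks like one O(n) pass, but "mark not in seen" scans a list --
# an O(n) operation hiding inside the O(n) loop you can see.
def dedupe_slow(marks):
    seen, out = [], []
    for mark in marks:
        if mark not in seen:        # hidden linear scan
            seen.append(mark)
            out.append(mark)
    return out

# Same logic; set membership is O(1) on average, so this really is O(n).
def dedupe_fast(marks):
    seen, out = set(), []
    for mark in marks:
        if mark not in seen:
            seen.add(mark)
            out.append(mark)
    return out
```

At a few thousand entries the difference is lost in the noise; at 255,000 the first version eats your afternoon, and a run under something like python -m cProfile -s cumtime points a finger straight at it.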
There’s still one stubborn hot spot – handling mergeinfo properties – that I can’t fix, because I don’t understand it; it’s seriously gnarly and I didn’t write it. One of my cometary contributors did. I might have to get off my lazy ass and learn how to use git blame so I can hunt up whoever it was.
Today’s bug: It turns out that if your Subversion repo has branch-root tags from its shady past as a CVS repository, the naive thing you do to preserve old tags in gitspace may cause some sort of obscure name collision with gitspace branch identifiers that your tools won’t notice. You sure will, though…when git-fast-import loses its cookies and aborts the very last phase of a 9-hour test. Grrrr….
Of such vicissitudes is repository surgery made. I could have done a deep root-cause analysis, but I am fresh out of fucks to give about branch-root tags still buried 4 gigabytes deep in a Subversion repository only because nobody noticed that they are a fossil of CVS internal metadata that has been meaningless for, oh, probably about fifteen years. I just put commands to nuke them all in the translation script.
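For anyone who inherits a similar mess and does want the forensics: a dumb scan over the fast-import stream will at least name the refs that are doubling up. A hypothetical checker – not part of my actual conversion pipeline, and naive in that it doesn’t skip data sections, so blob contents can false-positive – might look like this:

```python
# Hypothetical diagnostic, not part of the real conversion pipeline.
# Reads a git-fast-import stream on stdin and reports tag names that
# also show up as branch names.
import sys

def find_collisions(stream):
    branches, tags = set(), set()
    for line in stream:
        if line.startswith(b"commit refs/heads/") or line.startswith(b"reset refs/heads/"):
            branches.add(line.split(b"refs/heads/", 1)[1].strip())
        elif line.startswith(b"tag "):
            tags.add(line[4:].strip())
        elif line.startswith(b"reset refs/tags/"):
            tags.add(line.split(b"refs/tags/", 1)[1].strip())
    return branches & tags

if __name__ == "__main__":
    for name in sorted(find_collisions(sys.stdin.buffer)):
        print("tag/branch collision:", name.decode("utf-8", "replace"))
```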
Babysitting these tests does give you a lot of time for blogging, though. When you dare have your browser up, that is…