Eric S. Raymond's Blog, page 53

March 24, 2013

Python speed optimization in the real world

I shipped reposurgeon 2.29 a few minutes ago. The main improvement in this version is speed – it now reads in and analyzes Subversion repositories at a clip of more than 11,000 commits per minute. This is, in case you are in any doubt, ridiculously fast – faster than the native Subversion tools do it, and for certain far faster than any of the rival conversion utilities can manage. It’s well over an order of magnitude faster than when I began seriously tuning for speed three weeks ago. I’ve learned some interesting lessons along the way.



The impetus for this tune-up was the Battle for Wesnoth repository. The project’s senior devs finally decided to move from Subversion to git recently. I wasn’t actively involved in the decision myself, since I’ve been semi-retired from Wesnoth for a while, but I supported it and was naturally the person they turned to to do the conversion. Doing surgical runs on that repository rubbed my nose in the fact that code with good enough performance on a repository 500 or 5000 commits long won’t necessarily cut it on a repository with over 56000 commits. Two-hour waits for the topological-analysis phase of each load to finish were kicking my ass – I decided that some serious optimization effort was a far better idea than twiddling my thumbs.


First I’ll talk about some things that didn’t work.


pypy, which is alleged to use fancy JIT compilation techniques to speed up a lot of Python programs, failed miserably on this one. My pypy runs were 20%-30% slower than plain Python. The pypy site warns that pypy’s optimization methods can be defeated by tricky, complex code, and perhaps that accounts for it; reposurgeon is nothing if not algorithmically dense.


cython didn’t emulate pypy’s comic pratfall, but didn’t deliver any speed gains distinguishable from noise either. I wasn’t very surprised by this; what it can compile is mainly control structure, which I didn’t expect to be a substantial component of the runtime compared to (for example) string-bashing during stream-file parsing.


My grandest (and perhaps nuttiest) plan was to translate the program into a Lisp dialect with a decent compiler. Why Lisp? Well…I needed (a) a language with unlimited-extent types that (b) could be compiled to machine-code for speed, and (c) minimized the semantic distance from Python to ease translation (that last point is why you Haskell and ML fans should refrain from even drawing breath to ask your obvious question; instead, go read this). After some research I found Steel Bank Common Lisp (SBCL) and began reading up on what I’d need to do to translate Python to it.


The learning process was interesting. Lisp was my second language; I loved it and was already expert in it by 1980, well before I learned C. But since 1982 the only Lisp programs I’ve written have been Emacs modes. I’ve done a whole hell of a lot of those, including some of the most widely used ones like GDB and VC, but semantically Emacs Lisp is a sort of living fossil coelacanth from the 1970s, dynamic scoping and all. Common Lisp, and more generally the evolution of Lisp implementations with decent alien type bindings, passed me by. And by the time Lisp got good enough for standalone production use in modern environments I already had Python in hand.


So, for me, reading the SBCL and Common Lisp documentation was a strange mixture of learning a new language and returning to very old roots. Yay for lexical scoping! I recoded about 6% of reposurgeon in SBCL, then hit a couple of walls. One of the lesser walls was a missing feature in Common Lisp corresponding to the __str__ special method in Python. Lisp types don’t know how to print themselves, and as it turns out reposurgeon relies on this capability in various and subtle ways. Another problem was that I couldn’t easily see how to duplicate Python’s subprocess-control interface – at all, let alone portably across Common Lisp implementations.
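
(For non-Pythonistas, here is the capability in question, in a hypothetical miniature – this is not reposurgeon’s actual class, just the shape of the feature I was missing.)

    class Blob:
        "A toy stand-in for one kind of stream-file event."
        def __init__(self, mark):
            self.mark = mark
        def __str__(self):
            # The object knows how to render itself as stream-file text.
            return "blob\nmark %s\n" % self.mark

    print(Blob(":1"))   # print() implicitly calls __str__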


But the big problem was CLOS, the Common Lisp Object System. I like most of the rest of Common Lisp now that I’ve studied it. OK, it’s a bit baroque and heavyweight and I can see where it’s had a couple of kitchen sinks pitched in – if I were choosing a language on purely esthetic grounds I’d prefer Scheme. But I could get comfortable with it, except for CLOS.


But me no buts about multimethods and the power of generics – I get that, OK? I see why it was done the way it was done, but the brute fact remains that CLOS is an ugly pile of ugly. More to the point in this particular context, CLOS objects are quite unlike Python objects (which are in many ways more like CL defstructs). It was the impedance mismatch between Python and CLOS objects that really sank my translation attempt, which I had originally hoped could be done without seriously messing with the architecture of the Python code. Alas, that was not to be. Which refocused me on algorithmic methods of improving the Python code.


Now I’ll talk about what did work.


What worked, ultimately, was finding operations that have instruction costs O(n**2) in the number of commits and squashing them. At this point a shout-out goes to Julien “FrnchFrgg” Rivaud, a very capable hacker trying to use reposurgeon for some work on the Blender repository. He got interested in the speed problem (the Blender repo is also quite large) and was substantially helpful with both patches and advice. Working together, we memoized some expensive operations and eliminated others, often by incrementally computing reverse-lookup pointers when linking objects together in order to avoid having to traverse the entire repository later on.
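
To make “memoized” concrete, here is the shape of the pattern in miniature – hypothetical names, not reposurgeon’s actual code:

    class Commit:
        def __init__(self, mark, parents=None):
            self.mark = mark
            self.parents = parents or []    # list of parent Commits

    class Repository:
        def __init__(self, commits):
            self.commits = commits
            self._ancestor_memo = {}        # mark -> set of ancestor marks

        def ancestors(self, commit):
            "Memoized transitive-ancestor computation."
            if commit.mark not in self._ancestor_memo:
                found = set()
                for parent in commit.parents:
                    found.add(parent.mark)
                    found |= self.ancestors(parent)   # hits the memo table
                self._ancestor_memo[commit.mark] = found
            return self._ancestor_memo[commit.mark]

The first call does real work; every later call for the same commit is a dictionary lookup.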


Even just finding all the O(n**2) operations isn’t necessarily easy in a language as terse and high-level as Python; they can hide in very innocuous-looking code and method calls. The biggest bad boy in this case turned out to be child-node computation. Fast import streams express “is a child of” directly; for obvious reasons, a repository analysis often has to look at all the children of a given parent. This operation blows up quite badly on very large repositories even if you memoize it; the only way to make it fast is to precompute all the reverse lookups and update them when you update the forward ones.
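
In miniature, the fix looks like this (hypothetical names again). Without the children list, every “who are my children?” query is an O(n) scan over all commits – O(n**2) across an analysis pass; with it, each query is a constant-time attribute access, paid for by a little bookkeeping at link time:

    class Commit:
        def __init__(self, mark):
            self.mark = mark
            self.parents = []
            self.children = []    # precomputed reverse lookup

        def add_parent(self, parent):
            self.parents.append(parent)
            parent.children.append(self)    # update the back-pointer

        def remove_parent(self, parent):
            self.parents.remove(parent)
            parent.children.remove(self)    # keep the reverse lookup honest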


Another time sink (the last one to get solved) was identifying all tags and resets attached to a particular commit. The brute-force method (look through all tags for any with a from member matching the commit’s mark) is expensive mainly because to look through all tags you have to look through all the events in the stream – and that’s expensive when there are 56K of them. Again, the solution was to give each commit a list of back-pointers to the tags that reference it and make sure all the mutation operations update it properly.
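
The same discipline, sketched for the tag case (hypothetical names once more): each commit carries a list of its attached tags and resets, and every operation that moves a tag maintains that list:

    class Commit:
        def __init__(self, mark):
            self.mark = mark
            self.attachments = []    # tags and resets pointing at this commit

    class Tag:
        def __init__(self, name, commit):
            self.name = name
            self.commit = None
            self.retarget(commit)

        def retarget(self, commit):
            # Every mutation that retargets a tag updates the back-pointers.
            if self.commit is not None:
                self.commit.attachments.remove(self)
            self.commit = commit
            commit.attachments.append(self)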


It all came good in the end. In the last benchmarking run before I shipped 2.29 it processed 56424 commits in 303 seconds. That’s 186 commits per second, 11160 per minute. That’s good enough that I plan to lay off serious speed-tuning efforts; the gain probably wouldn’t be worth the increased code complexity.


February 26, 2013

Mode of the Reposturgeon!

It was inevitable, I suppose; reposurgeon now has its own Emacs mode.


The most laborious task in the reposurgeon conversion of a large CVS or Subversion repository is editing the comment history. You want to do this for two reasons: (1) to massage multiline comments into the summary-line + continuation form that plays well with git log and gitk, and (2) to lift Subversion and CVS commit references from, e.g., ‘2345’ to [[SVN:2345]] so reposurgeon can recognize them unambiguously and turn them into action stamps.
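
The reference-lifting half of the job is mechanical enough to sketch in a few lines of Python. The regex below is illustrative only – the patterns the mode actually applies are its own:

    import re

    def lift_svn_refs(comment):
        "Lift bare Subversion revision references into [[SVN:n]] form."
        return re.sub(r'\br(?:evision\s+)?(\d+)\b', r'[[SVN:\1]]', comment)

    print(lift_svn_refs("Fix the bug introduced in r2345"))
    # -> Fix the bug introduced in [[SVN:2345]]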


In the new release 2.22, there’s a small Emacs mode with several functions that help semi-automate this process.


Fear the reposturgeon!


February 25, 2013

MIXAL is dead

I terminated one of my open-source projects today. MIXAL is dead; it has been replaced by the GNU MIX Development Kit, alias MDK. Open-source projects die so seldom that the circumstances deserve a minor note.



I didn’t actually write MIXAL; somebody named ‘Darius Bacon’ (probably this guy) did it, under DOS. I stumbled across it in 1998, ported it to Unix, and fixed some minor bugs. Later, when I was in semi-regular contact with Don Knuth, he contributed two of his test programs and a text description of MIX from The Art of Computer Programming. Don gets open source; he was careful to arrange with his publisher terms that allow this material to be redistributed not just by me but by any project shipping under an open-source license.


I’m not sure when the MDK project started. When I first ran across it, it seemed to me to be not as capable as MIXAL; I made a note of it in my README file but did not consider simply handing off to it. That might have been as much as a decade ago; when I re-encountered it recently, it looked a great deal more polished and mature. I, on the other hand, had barely touched MIXAL since I first ported it.


The world needs one competently-written MIX interpreter, but it doesn’t need two. So I looked up MDK’s maintainer and negotiated a handoff; he got the material Don Knuth donated to MIXAL, and I got to put MIXAL to a tidy end.


This is what the open-source version of what musicologists call “folk process” looks like. Re-use, improve, contribute – and when someone else is clearly doing a better job, let go.


February 24, 2013

Norovirus alert – how to avoid spreading it

For about 24 hours beginning last Wednesday evening I had what I thought at the time was a bout of food poisoning. It wasn’t, because my wife Cathy and then our houseguest Dave Täht got it. It was a form of extremely infectious gastroenteritis, almost certainly a new strain of norovirus that is running through the U.S. like wildfire right now. Here’s what you need to know to avoid getting it and giving it to others:



We think it was norovirus because the symptoms match its clinical profile perfectly. They are: vomiting, diarrhea, muscle pain. Duration is about 24 hours. It is difficult to convey how hard and fast this hits – you go from feeling fine to violent nausea in 20 minutes. I experienced the progression myself and watched it in two others.


1. If you experience fast-onset gastroenteritis that mimics food poisoning, it is quite likely to actually be norovirus.


2. This bug is extremely infectious and can be spread by skin contact, aerosol droplets, or virus on food-preparation surfaces.


3. Careful hygiene in the bathroom and frequent hand-washing before and after contact with other people are important to help you keep from getting it and passing it on. Sources differ on how effective conventional hand sanitizer fluid is against this bug, but it can’t hurt to use it.


4. If you get it, isolate yourself. Avoid contact with other people. Do not go out in public if you can possibly avoid it. Do not handle food that will be eaten by others; do not touch food-preparation surfaces or cookware or utensils that will be used by others.


5. It is not lethally dangerous (except to the very old, the very young, and people with compromised immune systems) and it doesn’t last long (about 24 hours).


6. Those will, however, probably be 24 of the most miserable and disgusting hours of your life. The symptoms include violent vomiting and (sometimes uncontrollable) diarrhea. You do not want to have this.


7. Do not try to hold down the initial vomiting any longer than it takes you to get to somewhere sanitary to barf into. You will feel better after you have emptied your stomach. You may, in fact, feel almost normal – until the diarrhea hits.


8. You will be losing water rapidly via the vomiting and diarrhea; you will be more miserable, and recover more slowly, if you let yourself get dehydrated. Take small drinks of water at relatively frequent intervals.


9. Washing contaminated clothes, towels and surfaces with a light chlorine solution will kill the virus.


10. Important: Continue isolation until you have been without symptoms for a minimum of 48 hours! I cannot emphasize this point enough – failing this because I didn’t know I was infectious is probably how Cathy and Dave got it.


If we had known it was infectious when I got it, Cathy and Dave might have been spared a lot of unpleasantness. Play safe and don’t infect those near you.


February 16, 2013

Why I want to confiscate every programmer’s * key

I just sent mail to the Battle for Wesnoth developer list titled “Why I want to confiscate every dev’s * key”. The points in it probably deserve a wider audience.



I want to confiscate every dev’s * key because while converting the Wesnoth repo I’ve just fixed about the hundredth comment that does this:



------------------------------------------------------------------------------
Fossil-ID: 29314

* Document weapon/second_weapon addition and changes on RC imagepath
* function

------------------------------------------------------------------------------

The obvious thing that is wrong with this is the second ‘*’; the bare word ‘function’ is not a list entry.


The unobvious thing that is wrong with this is the first ‘*’. This entire comment should not have been written as a bullet item. Because there aren’t any other items in it! There’s no list here!


A change comment like this is OK:



------------------------------------------------------------------------------

A summary of what I did to fix a random bug

* This is the first thing

* This is the second

------------------------------------------------------------------------------

A change comment like this is not OK:



------------------------------------------------------------------------------

* This is a random thing I did.

------------------------------------------------------------------------------

That leading ‘* ‘ is just a hindrance to readability, a visual bump in the road. When your comment history is over 50K commits long this sort of small additional friction matters.


The following is also *not* OK:



------------------------------------------------------------------------------

* This is one random thing I did.

* This is another random thing I did.

* This is a third random thing I did.

------------------------------------------------------------------------------

Why is this not OK?


Well, the obvious reason is that there is no summary line tying all the items together. This matters more in git-land because so many of the tools just show first summary lines.


The less-obvious reason is that if you write a comment like this, your single commit is almost certainly doing too much. You should break it up into several commits, each with one topic that becomes a single (non-bullet) item in its own change comment.


The next time you are tempted to press your ‘*’ key when writing a change comment, stop and think. Is it just going to be noise? If so, stop and slap your own hand.


More generally, give a little thought to what the shape of your comment says about the shape of your commit. Is it trying to describe too many different and unrelated changes in one go? Does it have a proper first summary line that will minimize the overhead of reading it for someone six months down the road?


There’s a programmer’s proverb: “Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. Code for readability.”


That applies to comments, too. At the scale of this project [8 years, over 55K commits], such details really matter.


February 14, 2013

Penguicon Party Preannouncement

No details yet, but this is a heads-up: I will be at Penguicon 2013, 26-28 April in Pontiac, Michigan…and there will be a second annual party for friends of Armed & Dangerous (or FOAD – now where have I seen that before?).


Convention info at http://2013.penguicon.org/


February 10, 2013

Why I don’t accept would-be disciples

Every few months I get a letter from a would-be hacker petitioning me to accept him (always a “him” so far) as my disciple. Happened again today; I think this time I’ll share part of the request, and my response, so I have it to point to next time this happens.



The aspirant writes:



Thus, I started my search for Master Foo, who will accept me as disciple. A master whom I can look at, follow, walk with, see through and ultimately become a master myself. A master in front of whom I can empty my cup. One who will introduce me to computer considering me a newbie.


I know it might take decades or even my lifetime to achieve what I pursue. But, it’s worth it.


I have written this email to ask a very simple and the most sophisticated question:


“Will you accept me as your disciple and teach me all you know ?”


Here is my reply:


No, because the communication path between us doesn’t have the required bandwidth. Sorry.


Yes, there are important aspects of being a hacker that are best learned by mimesis (actually this is true of any really skilled craft). If you lived near enough to a master hacker to be social with him face-to-face, or you worked with one day to day, you might benefit greatly from observing what he does and what kind of person produces those behaviors.


Unfortunately, the most valuable parts of those interactions won’t pass over an email link. Jokes, spontaneous reactions, small and at first sight unimportant behavioral examples that fit together into significant wholes, all the things the master is teaching when he is not thinking about teaching and the student is learning when he is not thinking about learning.


I can’t give you that experience; I don’t run an ashram for you to live in. Trying to create a counterfeit of it over email would only cheat you and frustrate me.


And that pretty much disposes of “disciple”, unless you’re a billionaire’s kid whose parents are willing to offer me enough money to persuade me to uproot my life for a few years so you can have what you want. The “you Alexander, me Aristotle” scenario wouldn’t be utterly impossible, but it would be very expensive.


February 8, 2013

The Reposturgeon from Beyond Space!

I released reposurgeon 2.20 today, with various minor improvements in the graph command and the behavior of repodiffer. Which, mainly, gives me all the excuse I need for this:



Image composed using Pulp-O-Mizer.


February 3, 2013

Sugar turns twenty

This is a bulletin for Sugar’s distributed fan club, the hackers and sword geeks and other assorted riff-raff who have guested in our commodious basement. The rest of you can go about your business.


According to the vet’s paperwork, our cat Sugar turned 20 yesterday. (Actually, the vet thinks she may be 19, but that would have required her to be only 6 months old when we got her, which we strongly doubt – she would have had to have been exceptionally large and physically mature for a kitten that age, which seems especially unlikely because her growth didn’t top out until a couple of years later.)


Even 19 would be an achievement for any cat – average lifespan for a neutered female is about 15, and five years longer is like a human hitting the century mark. It’s especially remarkable since this cat was supposed to be dead of acute nephritis sixteen months ago. Instead, she’s so healthy that we’ve been letting the interval between subcutaneous hydrations slip a little without seeing any recurrence of the symptoms we learned to associate with her kidney troubles (night yowling, disorientation, poor appetite).



We’re trying not to let our hopes about Sugar’s continued lifespan rise too high, but she’s making it difficult not to be optimistic. She does not look or act like a doddering relic. She is cheerful, active, and bright-eyed – more so than many cats half her age, if the truth be known. Some days her arthritis makes the basement stairs a little difficult, but not most days – and that’s still about the only obvious sign that she’s geriatric for a cat. Her amiable disposition, exceptional sociability and ability to charm humans have diminished not at all.


This is an occasion for quiet celebration. Go, I say to you, find a friendly cat and make nice at it. And hope with us that Sugar keeps beating the odds.


February 1, 2013

Is closed source worth it for performance?

The following question appeared in my mailbox today:



If a certain program (that you need) was proprietary, and its open-source counterpart was (currently) 40% slower. Which would you use, the open-source one or the proprietary one?


The answer is: it depends. I’m going to answer this one in public because it’s a useful exercise in thinking about larger tradeoffs.



The first subsidiary question I’d have to ask myself is, how much does that 40% speed increase matter? If the answer is “not much” (say, because even the ‘slower’ program runs pretty fast, or because even though it’s noticeably slower I won’t use it often) then I’ll use the program I can read the source code for. Because what if I trip over a bug, or need to extend what the program does to get my task finished?


The more general point here is that by using the closed-source program I’m giving up some significant options, including (a) asking the open-source program’s maintainers for help, and (b) fixing or enhancing it myself. There is a tradeoff between the value of those options and the value of the additional performance which I have to evaluate on the facts in each case. (And yes, valuing these open-source options highly is based on the assumption that help is effectively unavailable from the maintainers of the closed-source program. Sadly, this is usually the case.)


From now on, then, we’re only talking about the case where that 40% is really important. Let’s consider a couple of easy cases.


It might be the case that what these programs do is very simple and well-defined, so it’s easy to verify that they’re functionally equivalent except for speed. If that’s the case, my risk of getting locked in by the closed-source program is low – so I’ll go right ahead and use the closed-source one, knowing I can fall back to the open-source program at any time.


It might also be the case that buying faster hardware will make the open-source program fast enough for my purposes. Hardware is cheap and the benefits from improving it extend across a broad range of tasks, so I’d probably rather upgrade my hardware than accept the risks of using the closed-source program.


Another easy case is when the closed-source program jails my data, so I cannot examine or modify it with other tools. It is hard for me to imagine any scenario in which I would swallow that for a mere 40% performance boost. Forget it; no sale.


Could I make the open-source program go faster? I would at least spend an hour or two looking at the possibility.


Beyond these the decision process starts to get more difficult. I don’t think I can utter a completely general rule for when I will use closed-source software, but I hope I have illuminated some of the tradeoffs.

