CIA and the perils of overengineering

The CIA commit-notification service abruptly died two days ago, a development that surprised nobody who has been paying attention to the recent history of the codebase and its one public server site. A screwup at the cloud service hosting the CIA virtual machine irretrievably destroyed the instance data; please don’t ask me for details, I don’t know how it happened and don’t care. The CIA codebase is so screwed up that even reconstituting a virgin instance would be way too much work – and that I will talk about a bit later in this post.

Fortunately, I saw this coming and had started work on a CIA replacement in late August. I had been holding off releasing it because there was some effort going on to salvage the code, but that possibility effectively vanished when the only instance was erased. I shipped my replacement just a few minutes ago, and expect to spend much of the next week helping forge-site operators install it so we can have our notification service back.

The remainder of this post is a finished version of a design analysis of CIA I started a couple of weeks ago when the death of the service was still only a theoretical possibility. Since that theory has become actuality, the message should be heard loudly and clearly: this was a truly classic case of over-engineering, code bloat, excessive centralization, and bad practice. Read on for the cautionary tale.

I always liked the idea of CIA, the “version control informant” that relays commit notifications to IRC channels. I’ve maintained the (now obsolete) git hook scripts that talk to CIA for several years now. But recently I have been looking more closely at the design of CIA and how it’s implemented, and have concluded that it was a pretty horrible example of how not to do things.

First, a review of what CIA did and how it did it. What you saw from the outside when a CIA setup was working for a project was simple: whenever a developer committed code to the project’s public repository, the commit summary was shipped to an IRC channel associated with the project. It became part of that conversational stream, and was also echoed to a special channel (#commits on freenode) where you could watch all the open-source world’s commits flow by like a river.

A notification service like this is a very useful aid to collaboration. It makes IRC conversations among a development group more productive. It also does something unquantifiable but good to the coherence of the development groups that use it, and the coherence of the open-source community as a whole – when the service was live it was hard to watch #commits for any length of time without being impressed and encouraged.

Looking a little deeper, here’s what happened when a commit was made. The repository’s checkin procedure fired a “commit hook” – a small program, usually written in shell or Perl or Python – that was passed various metadata such as the commit’s ID, its list of files modified, and its change comment. The hook assembled an XML message in a particular format containing this information. It then used XML-RPC to call a central CIA server at cia.vc and ship it the notification.

The CIA server was then responsible for turning the XML notification into a text line that got shipped to the project channel and to #commits. It also updated a bunch of statistical summaries that could be browsed at the CIA website (now defunct).

Unfortunately, as in the old proverb about law and sausage, those who loved CIA notifications were best advised not to look too much more closely than this at how they were made. The service was notoriously subject to random outages and stalls; but that, bad as it is, is only symptomatic. Underlying this were several layers of unfortunate history, poor design decisions and shoddy implementation.

CIA hadn’t been actively maintained in several years before its collapse – the originator, one Micah Dowty, disappeared around 2007. One Karsten Behrmann, aka “BearPerson”, stepped in around 2008 but was unable to solve the problems with the software. The sole running public instance was hosted by a third party prone to loudly complaining on the #cia channel that the host box was an insecure hairball full of flaky and obsolete software that he couldn’t fix because the CIA code had dependencies on now-obsolete software versions.

That running instance is what’s now vanished. If you examine the repo of the CIA software, you’ll discover that it’s a mixture of parts in mostly Python but some Erlang, using (a) a custom web framework, (b) some Twisted, and (c) some Django. I’m told by people who have examined all this more closely than me that the individual subsystems (such as the Django code that generates most of the visible web pages in the site) aren’t too bad, but the interactions among them are messy and leaky.

The more experienced software engineers in my audience will already be getting a clue to what went wrong here, if not yet quite why. This is what software that has undergone a collapse into rubble under the weight of its own complexity looks like, complete with maintainers who have run away from their own inability to manage the resulting mess.

But the indictment wouldn’t be complete without noticing that their development practices sucked, too. My hair stood on end when BearPerson let drop on the #cia channel that the code in the running instance didn’t match the head state of the project’s CIA repository on GoogleCode – he admitted he’d “been lazy” and patched things on the site without propagating the changes back to the repo. I nudged him into fixing this. Or, at least, claiming to have fixed it, but the historical record didn’t do a lot to reassure me on that score. And it’s why those close to the problem have given up on attempting to resuscitate CIA without a running instance to look at.

Yes, before the VM was wiped there was a crew on the #cia channel trying to salvage the codebase. I helped a bit on this, but my estimate of their odds was never very optimistic. It is notoriously difficult to un-collapse a rubble pile, especially when the author and one previous rescue attempt (BearPerson’s) have already manifestly failed. Thus, I directed most of the limited energy I could spend on this problem into a different strategy.

That strategy began with asking why CIA suffered a complexity collapse, and whether much simpler code could do the job its users expect of it.

There are several aspects of the design that seemed rather iffy. Why one centralized server? Why the elaboration of XML-RPC? Why do you have to register your project on cia.vc to use the notification, with the mapping from your project to IRC channels lurking in an opaque database on a distant server, rather than being simply declared in the (arguments to your) repository hook?

The answer seems to be that the original designer fell in love with the idea of data-mining and filtering the notification stream. It is quite visible on the CIA site how much of the code is concerned with automatically massaging the commit stream into pretty reports. I’m told there is a complicated and clever feature involving XML rewrite rules that allows one to filter commit reports from any number of projects by the file subtrees they touch, then aggregate the result into a synthetic notification channel distinct from any of the ones those projects declared themselves.

Bletch! Bloat, feature creep, and overkill! With chrome like this piled on top of the original simple concept of a notification relay, the resulting complexity collapse should no longer be any surprise. Additionally, this is a near-perfect case study in how to make your service scale up poorly and be maximally vulnerable to single-point failures – if that one database gets lost or corrupted, everybody’s notifications will go haywire. The design would have been over-centralized even if the implementation weren’t broken.

Of course the way to prove this kind of indictment is to do better. But once I got this far in my thinking, I realized that wouldn’t be difficult. And started to write code. The result is irkerd, a simple service daemon. One end of it listens on a socket for JSON requests that specify a server/channel pair and a message string. The other end behaves like a specialized IRC client that maintains concurrent session state for any number of IRC-server instances. All irkerd is, really, is a message bus that routes notification requests to the right servers. (And is multithreaded so it won’t block on a server stall, and times out inactive sessions.)
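To make the message-bus idea concrete, here is a minimal sketch of the routing step. It is not irkerd's source. The JSON field names `to` and `privmsg`, the `irc://host/channel` target form, and the names `parse_request`, `Dispatcher`, and `deliver` are all assumptions made for illustration. The real IRC session is replaced by a caller-supplied `deliver` function so the routing logic can be read on its own.

```python
import json
import queue
import threading
from urllib.parse import urlparse


def parse_request(line):
    """Decode one JSON request into (server, channel, message).

    The field names "to" and "privmsg" are assumptions for this sketch.
    """
    request = json.loads(line)
    target = urlparse(request["to"])        # e.g. irc://irc.example.net/commits
    channel = "#" + target.path.lstrip("/")
    return target.netloc, channel, request["privmsg"]


class Dispatcher:
    """Route each message to a worker thread dedicated to its IRC server."""

    def __init__(self, deliver, idle_timeout=60.0):
        self.deliver = deliver              # callable(server, channel, message)
        self.idle_timeout = idle_timeout    # seconds before an idle worker exits
        self.queues = {}                    # server name -> queue.Queue
        self.lock = threading.Lock()

    def handle(self, line):
        """Accept one request line and enqueue it for the right server."""
        server, channel, message = parse_request(line)
        with self.lock:
            q = self.queues.get(server)
            if q is None:
                # First message for this server: start its worker.
                q = self.queues[server] = queue.Queue()
                threading.Thread(
                    target=self._serve, args=(server, q), daemon=True
                ).start()
            q.put((channel, message))

    def _serve(self, server, q):
        """Drain one server's queue; a stall here blocks only this server."""
        while True:
            try:
                channel, message = q.get(timeout=self.idle_timeout)
            except queue.Empty:
                with self.lock:
                    if q.empty():
                        # Idle too long: drop the session state and exit.
                        del self.queues[server]
                        return
                continue
            self.deliver(server, channel, message)
```

Because each server gets its own queue and thread, a slow or dead server delays only the messages bound for it, and the idle timeout keeps the table of sessions from growing without bound.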

That’s it. Less than 400 lines of Python replaces CIA’s core notification service. The code for a repo hook to talk to it is simpler than any existing CIA hook. And it doesn’t require a centralized server. The right way to deploy this thing will be to host multiple instances of irker on repository sites, not publicly visible (because otherwise they could too easily be used to spam IRC channels) but available to the repository’s hooks running inside the site firewall.
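For a sense of how little a hook has to do, here is a hedged sketch of the client side. The JSON field names `to` and `privmsg`, the target URL form, the host and port values, and the function names `build_request` and `notify` are assumptions for illustration, not a description of the shipped hook.

```python
import json
import socket

IRKER_HOST = "localhost"  # assumption: the daemon runs inside the site firewall
IRKER_PORT = 6659         # assumption: whatever fixed port the local daemon uses


def build_request(target, message):
    """Serialize one notification request as a single line of JSON."""
    return json.dumps({"to": target, "privmsg": message}) + "\n"


def notify(target, message, host=IRKER_HOST, port=IRKER_PORT):
    """Send one notification to the daemon over TCP, then disconnect."""
    payload = build_request(target, message).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
```

A hook would call something like `notify("irc://irc.example.net/commits", "myproject: alice * a1b2c3d / README: Fix typo")` once per commit. Note that the project-to-channel mapping lives in the hook's own arguments, not in a registration database on a distant server.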

Filtering? Aggregation? As previously noted, they don’t need to be in the transmission path. One or more IRC bots could be watching #commits, generating reports visible on the web, and aggregating synthetic feeds. The only agreement needed to make this happen is minimal regularity in the commit message formats that the hooks ship to IRC, which is really no more onerous than the current requirement to gin up an XML-RPC blob in a documented format.
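As a rough illustration of what "minimal regularity" buys a watcher bot, the sketch below parses notification lines in a hypothetical `project: author * rev / files: summary` layout and tallies commits per project. The layout, the regular expression, and the function names are assumptions invented for this example; any agreed format would serve as well.

```python
import re
from collections import Counter

# Hypothetical line layout: "project: author * rev / file1 file2: summary"
LINE_RE = re.compile(
    r"^(?P<project>[^:\s]+): (?P<author>\S+) \* (?P<rev>\S+)"
    r" / (?P<files>[^:]*): (?P<summary>.*)$"
)


def parse_commit_line(line):
    """Return the fields of one notification line, or None if it doesn't match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["files"] = fields["files"].split()  # space-separated file names
    return fields


def commits_per_project(lines):
    """Count well-formed commit notifications per project."""
    tally = Counter()
    for line in lines:
        commit = parse_commit_line(line)
        if commit is not None:
            tally[commit["project"]] += 1
    return tally
```

Everything a report generator needs falls out of one regular expression applied to what the bot sees on the channel, with nothing added to the transmission path.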

I must note one drawback to this way of partitioning things. Because IRC has a message length limit, naively shipping commits with very long metadata (due to, for example, large lists of modified files) would make only a truncated version available on IRC (and thus, to an IRC watcher bot gathering statistics).

It might be that this was the original motivation for using an XML-RPC transport on CIA’s input end. Indeed, when I first recognized the problem I started sketching a design for an auxiliary daemon that would do nothing but accept XML-RPC requests in something very close to CIA’s preferred format, then forward short digests of them to an irker instance for shipping to IRC. This auxiliary could collect statistics based on the un-truncated metadata…

Fortunately, I experienced a rush of good sense before I actually started coding this thing. It would have hugely complicated deployment and testing to handle an unusual case – observably from #commits, most commit messages are short and touch few files. We get a much simpler system if we accept two reduction rules:

1. If a commit notification would be longer than 510 bytes, we omit the filenames list. An empty filenames list is to be interpreted by filtering software as “may touch any file in the project”.

2. Then…we just ship it. If the IRC server truncates it at 510 bytes, so be it. Humans watching the commit stream won’t need more than that to put the commit in context (especially not for projects which use git’s first-line-is-a-summary convention) and the hypothetical statistics-gathering bots won’t understand natural language well enough to care that it’s truncated.
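The two reduction rules fit in a few lines. This is a sketch under stated assumptions: the line layout `project: author * rev / files: summary` and the name `format_notification` are hypothetical, and only the 510-byte threshold and the drop-the-file-list rule come from the text above.

```python
IRC_LIMIT = 510  # byte threshold from rule 1


def format_notification(project, author, rev, files, summary, limit=IRC_LIMIT):
    """Render one notification line, applying the two reduction rules."""

    def render(file_list):
        # Hypothetical layout; the real hook's format may differ.
        return "%s: %s * %s / %s: %s" % (
            project, author, rev, " ".join(file_list), summary
        )

    line = render(files)
    if len(line.encode("utf-8")) > limit:
        # Rule 1: too long, so omit the filenames list entirely.
        line = render([])
    # Rule 2: ship it as-is, even if it still exceeds the limit;
    # the IRC server is left to truncate it.
    return line
```

The length is measured in encoded bytes, not characters, since that is what the limit counts. A commit touching a handful of files passes through unchanged; one touching hundreds loses its file list and nothing else.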

This is how you keep things simple. And that is how you prevent your projects from collapsing under complexity.

I wrote irkerd to accomplish two things: (1) Light a fire under the CIA salvage crew, attempting to speed up their success, and (2) provide a viable alternative in case they didn’t succeed. To this I now add (3) illustrate what healthy minimalism in software design looks like. Antoine de Saint-Exupéry said it best: Perfection (in the design of software, as well as his airplanes) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.

Accordingly, note the nonexistence of irkerd configuration options and the complete absence of anything resembling a control dotfile. I even, quite deliberately, omitted the usual option to change the port that irker listens on. Because if you think you need an option like that, you actually have a problem you need to solve at your firewall.

But releasing irkerd, of course, is not the end of the story. For it to do any good, instances of the daemon and its repo hook will need to be running and documented at sites like SourceForge, GitHub, Gitorious, Gna, and Savannah. As I noted at the beginning of this essay, I expect pushing the deployment along will eat up a lot of my time in the near future – probably more time than it took to write and test the code. These forge sites are all chronically understaffed and have long issue backlogs.

Still, at least we now have a simple and robust design, and working code. And – this can’t be emphasized enough – single-site outages will no longer be fatal. If there’s one thing the history of the Internet should have taught us, it’s that you get robust and scalable services not by centralizing but by distributing them. It’s too bad the designer of CIA never internalized that lesson, and there can be no better finish to this tale of failure than by reinforcing it.

Published on September 27, 2012 15:37

Eric S. Raymond's Blog