Semantic locality and the Way of Unix
An important part of the Way of Unix is to try to tackle large problems with small, composable tools. This goes with a tradition of using line-oriented textual streams to represent data. But…you can’t always do either. Some kinds of data don’t serialize to text streams well (example: databases). Some problems are only tractable to large, relatively monolithic tools (example: compiling or interpreting a programming language).
Can we say anything generatively useful about where the boundary is? Anything that helps us do the Way of Unix better, or at least help us know when we have no recourse but to write something large?
Yes, in fact we can. Back in 2015 I was asked why reposurgeon, my editor for version-control repositories, is not written as a collection of small tools rather than as a relatively large interpreter for a domain-specific language. I think the answer I gave then generalizes to a deeper insight, and a productive set of classifications.
(Part of the motivation for expanding that comment into this essay now is that I recently got mail from a reposurgeon user who shipped me several good fix patches and complimented the tool as a good example of the Way of Unix. I actually had to pause to think about whether he was right or not, because reposurgeon is certainly not a small program, either in code bulk or internal complexity. At first sight it doesn’t seem very Unixy. But he had a point, as I hope to show.)
Classic Unix line-oriented text streams have what I’m going to call semantic locality. Consider as an example a Unix password file. The semantic boundaries of the records in it – each one serializing a user’s name, numeric user ID, home directory, and various other information – correspond nicely to the syntactic boundary at each end of line.
Semantic locality means you can do a lot by looking at relatively small pieces (like line-at-a-time) using simple state machines or parsers. Well-designed data serializations tend to have this property even when they’re not textual, so you can do Unix-fu tricks on things like the binary packed protocol a uBlox GPS ships.
Repository internals are different. A lot of the most important information – for example, the DAG structure of the history – is extremely de-localized; you have to do complicated and failure-prone operations on on an entire serialized dump of the repository (assuming you can get such a thing at all!) to recover it. You can’t do more than the very simplest transforms on the de-localized data without knowing a lot of fragile context.
(Another, probably more familiar example of a data structure with poor semantic locality is a database dump. It may be possible to express individual records in tables in a representation that has good locality, but the web of relationships between tables is nowhere expressed in a way that is local for parsing.)
Now, if you are a really clever Unix hacker, the way you deal with the problem of representing and editing repositories is by saying “Fsck it. I’m not going to deal with repository internals at all, only lossless textual serializations of them.” Voila, reposurgeon! All your Unix-fu is suddenly relevant again. You exile your serialization/deserialization logic into stream exporters and importers which have just one extremely well-defined job, just as the Way of Unix prescribes.
Inside those importer/exporter tools…Toto, you’re not in Unix-land anymore, at least not as far as the gospel of small separable tools is concerned. That’s OK; by using them to massage the repository structures into a shape with better semantic locality you’ve made the conceptually hard part (the editing operations) much easier to specify and implement. You can’t get all the way to line-oriented text streams, but you can get close enough that ed(1), the ancient Unix line-oriented text editor, makes a good model for reposurgeon’s interface.
To sharpen this point, consider repocutter. This companion tool in the reposurgeon distribution takes advantage of the fact that Subversion itself can serialize repository into a textual dumpfile. There’s a repertoire of useful operations that repocutter can perform on these dumpfiles; notably, one of them is dissecting a multi-project Subversion repository dump to get out any one of the project histories in a dumpfile of its own. While repocutter has a more limited repertoire than reposurgeon, it does behave like a classic Unix filter.
Stepping back from the details of reposurgeon, we can use it as a paradigmatic case for a couple of more general observations that explain and generalize traditional Unix practice.
First: semantic locality aids decomposability. Whether you get to write a flock of small separable tools or have to do one large one is largely controlled by whether your data structure has a lossless representation with good semantic locality or not.
Or, to put it more poetically, you can carve a data structure at its joints only if there’s a representation with joints to carve.
Second: There’s almost nothing magic about line-oriented text streams other than their good semantic locality. (I say “almost” only because eyeball friendliness and the ability to edit them with unspecialized tools also matter.)
Third: Because semantic locality aids decomposability, part of the Way of Unix is to choose data structures and data formats that maximize semantic locality (under the constraint that you have to represent the data’s entire ontology).
That third point is the philosophical generalization of “jam it into a line-oriented text-stream representation”; it’s why that works, when it works.
Fourth: When you can transform a data structure or representation to a form with better semantic locality, you can collect gains in decomposability.
That fourth point is the main trick that reposurgeon pulls. I had more to say about this as a design strategy in
Automatons, judgment amplifiers, and DSLs.
Eric S. Raymond's Blog
- Eric S. Raymond's profile
- 140 followers
