Eric S. Raymond's Blog, page 26

October 30, 2015

Hieratic documentation

Here’s where I attempt to revive and popularize a fine old word in a new context.


hieratic, adj. Of or concerning priests; priestly. Often used of the ancient Egyptian writing system of abridged hieroglyphics used by priests.


Earlier today I was criticizing the waf build system in email. I wanted to say that its documentation exhibits a common flaw, which is that it reads less like an explanation than like a memory aid for people who are already initiates of its inner mysteries. But this was not the main thrust of my argument; I wanted to observe it as an aside.



Here’s what I ended up writing:


waf notation itself relies on a lot of aspect-like side effects and spooky action at a distance. It has much poorer locality and compactness than plain Makefiles or SCons recipes. This is actually waf’s worst downside, other perhaps than the rather hieratic documentation.


I was using “hieratic” in a sense like this:


hieratic, adj. Of computer documentation, impenetrable because the author never sees outside his own intimate knowledge of the subject and is therefore unable to identify or meet the expository needs of newcomers. It might as well be written in hieroglyphics.


Hieratic documentation can be all of complete, correct, and nearly useless at the same time. I think we need this word to distinguish subtle disasters like the waf book – or most of the NTP documentation before I got at it – from the more obvious disasters of documentation that is incorrect, incomplete, or poorly written simply considered as expository prose.

Published on October 30, 2015 08:21

October 29, 2015

I’m not going to ‘take everyone’s guns away’

An expectation of casual, cynical lying has taken over American political culture. Seldom has this been more obviously displayed than Barack Obama’s address to police chiefs in Chicago two days ago.


Here is what everyone in the United States of America except possibly a handful of mental defectives heard:



Obama’s anti-gun-rights base: “I’m lying. I’m really about the Australian-style gun confiscation I and my media proxies were talking up last week, and you know it. But we have to pretend so the knuckle-dragging mouth-breathers in flyover country will go back to sleep, not put a Republican in the White House in 2016, and not scupper our chances of appointing another Supreme Court justice who’ll burn a hole in the Bill of Rights big enough to let us take away their eeeeevil guns. Eventually, if not next year.”


Gun owners: “I’m lying. And I think you’re so fucking stupid that you won’t notice. Go back to screwing your sisters and guzzling moonshine now, oh low-sloping foreheads, everything will be juuust fiiine.”


Everyone else: “This bullshit again?”


Of course, the mainstream media will gravely pretend to believe Obama, so that they can maintain their narrative that anyone who doesn’t is a knuckle-dragging, mouth-breathing, sister-fucking, Confederate-flag-waving RAAACIST.

Published on October 29, 2015 04:32

October 23, 2015

NTPsec is not quite a full rewrite

In the wake of the Ars Technica article on NTP vulnerabilities, and Slashdot coverage, there has been sharply increased public interest in the work NTPsec is doing.


A lot of people have gotten the idea that I’m engaged in a full rewrite of the code, however, and that’s not accurate. What’s actually going on is more like a really massive cleanup and hardening effort. To give you some idea how massive, I report that the codebase is now down to about 43% of the size we inherited – in absolute numbers, from 227KLOC to 97KLOC.


Details, possibly interesting, follow. But this is more than a summary of work; I’m going to use it to talk about good software-engineering practice by example.



The codebase we inherited, what we call “NTP Classic”, was not horrible. When I was first asked to describe it, the first thought that leapt to my mind was that it looked like really good state-of-the-art Unix systems code – from 1995. That is, before API standardization and a lot of modern practices got rolling. And well before ubiquitous Internet made security hardening the concern it is today.


Dave Mills, the original designer and implementor of NTP, was an eccentric genius, an Internet pioneer, and a systems architect with vision and exceptional technical skills. The basic design he laid down is sound. But it was old code, with old-code problems that his successor (Harlan Stenn) never solved. Problems like being full of port shims for big-iron Unixes from the Late Cretaceous, and the biggest, nastiest autoconf hairball of a build system I’ve ever seen.


Any time you try to modify a codebase like that, you tend to find yourself up to your ass in alligators before you can even get a start on draining the swamp. Not the least of the problems is that a mess like that is almost forbiddingly hard to read. You may be able to tell there’s a monument to good design underneath all the accumulated cruft, but that’s not a lot of help if the cruft is overwhelming.


One thing the cruft was overwhelming was efforts to secure and harden NTP. This was a serious problem; by late last year (2014) NTP was routinely cracked and in use as a DDoS amplifier, with consequences Ars Technica covers pretty well.


I got hired (the details are complicated) because the people who brought me on believed me to be a good enough systems architect to solve the general problem of why this codebase had become such a leaky mess, even if they couldn’t project exactly how I’d do it. (Well, if they could, they wouldn’t need me, would they?)


The approach I chose was to start by simplifying. Chiseling away all the historical platform-specific cruft in favor of modern POSIX APIs, stripping the code to its running gears, tossing out as many superannuated features as I could, and systematically hardening the remainder.


To illustrate what I mean by ‘hardening’, I’ll quote the following paragraph from our hacking guide:



* strcpy, strncpy, strcat: use strlcpy and strlcat instead.
* sprintf, vsprintf: use snprintf and vsnprintf instead.
* In scanf and friends, the %s format without a length limit is banned.
* strtok: use strtok_r() or unroll it into the obvious loop.
* gets: use fgets instead.
* gmtime(), localtime(), asctime(), ctime(): use the reentrant *_r variants.
* tmpnam(): use mkstemp() or tmpfile() instead.
* dirname(): the Linux version is re-entrant, but this property is not portable.

This formalized an approach I’d used successfully on GPSD – instead of fixing defects and security holes after the fact, constrain your code so that it cannot have defects. The specific class of defects I was going after here was buffer overruns.
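
To make this concrete, here is a tiny invented illustration of the substitution pattern the hacking guide mandates – not NTPsec code, just the shape of the change. Where strlcpy()/strlcat() are unavailable (they are BSD extensions), plain C99 snprintf() gives the same cannot-overrun guarantee:

/* Invented example: the banned call and its bounded replacement. */
#include <stdio.h>

#define HOSTBUF 64

static void describe_peer(const char *name, const char *suffix)
{
    char host[HOSTBUF];

    /* Banned: strcpy()/strcat() will run off the end of host[]
     * if name or suffix is longer than expected.
     *
     *   strcpy(host, name);
     *   strcat(host, suffix);
     */

    /* Hardened: the copy may truncate, but it can never overrun. */
    snprintf(host, sizeof(host), "%s%s", name, suffix);
    printf("peer: %s\n", host);
}

int main(void)
{
    describe_peer("ntp.example.com", ":123");
    return 0;
}

The hardened version can still truncate, but truncation is a bounded and analyzable failure mode; an overrun is not.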


OK, you experienced C programmers out there are thinking “What about wild-pointer and wild-index problems?” And it’s true that the achtung verboten above will not solve those kinds of overruns. But another prong of my strategy was systematic use of static code analyzers like Coverity, which actually is pretty good at picking up the defects that cause that sort of thing. Not 100% perfect – C will always allow you to shoot yourself in the foot – but I knew from prior success with GPSD that the combination of careful coding with automatic defect scanning can reduce the hell out of your bug load.


Another form of hardening is making better use of the type system to express invariants. In one early change, I ran through the entire codebase looking for places where integer flag variables could be turned into C99 booleans. The compiler itself doesn’t do much with this information, but it gives static analyzers more traction.
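
The names below are invented – this is a sketch of the shape of that change, not a patch from the NTP tree:

#include <stdbool.h>
#include <stdio.h>

/* Before: an int that is only ever 0 or 1, but nothing tells an
 * analyzer that, so "burst_enabled = 2" or "burst_enabled += n"
 * sails through unremarked.
 *
 *   static int burst_enabled = 0;
 */

/* After: the invariant "exactly true or false" now lives in the type,
 * which is what gives static analyzers their extra traction. */
static bool burst_enabled = false;

static void set_burst(bool on)
{
    burst_enabled = on;
}

int main(void)
{
    set_burst(true);
    printf("burst: %s\n", burst_enabled ? "on" : "off");
    return 0;
}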


Back to chiseling away code. When you do that, and simultaneously code-harden, and use static analyzers, you can find yourself in a virtuous cycle. Simplification enables better static checking. The code becomes more readable. You can remove more dead weight and make more changes with higher confidence. You’re not just flailing.


I’m really good at this game (see: 57% of the code removed). I’m stating that to make a methodological point; being good at it is not magic. I’m not sitting on a mountaintop having satori, I’m applying best practices. The method is replicable. It’s about knowing what best practices are, about being systematic and careful and ruthless. I do have an advantage because I’m very bright and can hold more complex state in my head than most people, but the best practices don’t depend on that personal advantage – its main effect is to make me faster at doing what I ought to be doing anyway.


A best practice I haven’t covered yet is to code strictly to standards. I’ve written before that one of our major early technical decisions was to assume up front that the entire POSIX.1-2001/C99 API would be available on all our target platforms and treat exceptions to that (like Mac OS X not having clock_gettime(2)) as specific defects that need to be isolated and worked around by emulating the standard calls.
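
Here is the general shape of such an emulation – a sketch only, assuming a configure-time symbol I am calling HAVE_CLOCK_GETTIME; it is not the actual NTPsec shim:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#ifndef HAVE_CLOCK_GETTIME            /* hypothetical configure symbol */

#ifndef CLOCK_REALTIME
#define CLOCK_REALTIME 0              /* platform defines no clock ids */
#endif

/* Emulate just enough of clock_gettime(2) with gettimeofday(2) that
 * the rest of the code can call the POSIX interface everywhere. */
static int shim_clock_gettime(struct timespec *ts)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    ts->tv_sec  = tv.tv_sec;
    ts->tv_nsec = tv.tv_usec * 1000;  /* microseconds -> nanoseconds */
    return 0;
}
#define clock_gettime(clk, ts) shim_clock_gettime(ts)

#endif /* !HAVE_CLOCK_GETTIME */

int main(void)
{
    struct timespec now;

    if (clock_gettime(CLOCK_REALTIME, &now) == 0)
        printf("%lld.%09ld\n", (long long)now.tv_sec, (long)now.tv_nsec);
    return 0;
}

The point is the isolation: the workaround lives in one shim, and everything above it is written to the standard call.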


This differs dramatically from the traditional Unix policy of leaving all porting shims back to the year zero in place because you never know when somebody might want to build your code on some remnant dinosaur workstation or minicomputer from the 1980s. That tradition is not harmless; the thicket of #ifdefs and odd code snippets that nobody has tested in Goddess knows how many years is a major drag on readability and maintainability. It rigidifies the code – you can wind up too frightened of breaking everything to change anything.


Little things also matter, like fixing all compiler warnings. I thought it was shockingly sloppy that the NTP Classic maintainers hadn’t done this. The pattern detectors behind those warnings are there because they often point at real defects. Also, voluminous warnings make it too easy to miss actual errors that break your build. And you never want to break your build, because later on that will make bisection testing more difficult.


Yet another important thing to do on an expedition like this is to get permission – or give yourself permission, or fscking take permission – to remove obsolete features in order to reduce code volume, complexity, and attack surface.


NTP Classic had two control programs for the main daemon, one called ntpq and one called ntpdc. ntpq used a textual (“mode 6”) packet protocol to talk to ntpd; ntpdc used a binary one (“mode 7”). Over the years it became clear that ntpd’s handler code for mode 7 messages was a major source of bugs and security vulnerabilities, and ntpq’s mode 6 was enhanced to match its capabilities. Then ntpdc was deprecated, but not removed – the NTP Classic team had developed a culture of never breaking backward compatibility with anything.


And me? I shot ntpdc through the head specifically to reduce our attack surface. We took the mode 7 handler code out of ntpd. About four days later Cisco sent us a notice of a critical DoS vulnerability that wasn’t there for us precisely because we had removed that stuff.


This is why ripping out 130KLOC is actually an even bigger win than the raw numbers suggest. The cruft we removed – the portability shims, the obsolete features, the binary-protocol handling – is disproportionately likely to have maintainability problems, defects and security holes lurking in it and implied by it. It was ever thus.


I cannot pass by the gains from taking a poleaxe to the autotools-based build system. It’s no secret that I walked into this project despising autotools. But the 31KLOC monstrosity I found would have justified a far more intense loathing than I had felt before. Its tentacles were everywhere. A few days ago, when I audited the post-fork commit history of NTP Classic in order to forward-port their recent bug fixes, I could not avoid noticing that a disproportionately large percentage of their commits were just fighting the build system, to the point where the actual C changes looked rather crowded out.


We replaced autotools with waf. It could have been scons – I like scons – but one of our guys is a waf expert and I don’t have time to do everything. It turns out waf is a huge win, possibly a bigger one than scons would have been. I think it produces faster builds than scons – it automatically parallelizes build tasks – which is important.


It’s important because when you’re doing exploratory programming, or mechanical bug-isolation procedures like bisection runs, faster builds reduce your costs. They also have much less tendency to drop you out of a good flow state.


Equally importantly, the waf build recipe is far easier to understand and modify than what it replaced. I won’t deny that waf dependency declarations are a bit cryptic if you’re used to plain Makefiles or scons productions (scons has a pretty clear readability edge over waf, with better locality of information) but the brute fact is this: when your build recipe drops from 31KLOC to 1.1KLOC you are winning big no matter what the new build engine’s language looks like.


The discerning reader will have noticed that though I’ve talked about being a systems architect, none of this sounds much like what you might think systems architects are supposed to do. Big plans! Bold refactorings! Major new features!


I do actually have plans like that. I’ll blog about them in the future. But here is truth: when you inherit a mess like NTP Classic (and you often will), the first thing you need to do is get it to a sound, maintainable, and properly hardened state. The months I’ve spent on that are now drawing to a close. Consequently, we have an internal schedule for first release; I’m not going to announce a date, but think weeks rather than months.


The NTP Classic devs fell into investing increasing effort merely fighting the friction of their own limiting assumptions because they lacked something that Dave Mills had and I have and any systems architect necessarily must have – professional courage. It’s the same quality that a surgeon needs to cut into a patient – the confidence, bordering on arrogance, that you do have what it takes to go in and solve the problem even if there’s bound to be blood on the floor before you’re done.


What justifies that confidence? The kind of best practices I’ve been describing. You have to know what you’re doing, and know that you know what you’re doing. OK, and I fibbed a little earlier. Sometimes there is a kind of Zen to it, especially on your best days. But to get to that you have to draw water and chop wood – you have to do your practice right.


As with GPSD, one of my goals for NTPsec is that it should not only be good software, but a practice model for how to do right. This essay, in addition to being a progress report, was intended as a down payment on that promise.

Published on October 23, 2015 12:13

October 21, 2015

Are tarballs obsolete?

NTPsec is preparing for a release, which brought a question to the forefront of my mind. Are tarballs obsolete?



The center of the open-source software-release ritual used to be making a tarball, dropping it somewhere publicly accessible, and telling the world to download it.


But that was before two things happened: pervasive binary-package managers and pervasive git. Now I wonder if it doesn’t make more sense to just say “Here’s the name of the release tag; git clone and checkout”.


Pervasive binary package managers mean that, generally speaking, people no longer download source code unless they’re either (a) interested in modifying it, or (b) a distributor intending to binary-package it. A repository clone is certainly better for (a) and as good or better for (b).


(Yes, I know about source-based distributions, you can pipe down now. First, they’re too tiny a minority to affect my thinking. Second, it would be trivial for their build scripts to include a clone and pull.)


Pervasive git means clones are easy and fast even for projects with a back history as long as NTPsec’s. And we’ve long since passed the point where disk storage is an issue.


Here’s an advantage of the clone/pull distribution system: every clone is implicitly validated by its SHA1 hash chain. It would be much more difficult to insert malicious code in the back history of a repo than it is to bogotify a tarball, because people trying to push to the tip of a modified branch would notice sooner.


What use cases are tarballs still good for? Discuss…

Published on October 21, 2015 17:31

October 15, 2015

SPDX: boosting the signal

High on my list of Things That Annoy Me When I Hack is sourcefiles that contain huge blobs of license text at the top. That is valuable territory which should be occupied by a header comment explaining the code, not a boatload of boilerplate that I’ve seen hundreds of times before.


Hackers have a lot of superstitious ideas about IP law and one is that these blobs are necessary for the license to be binding. They are not: incorporation by reference is a familiar concept to lawyers and courts; it suffices to unambiguously name the license you want to apply rather than quoting it in full.


This is what I do in my code. But to make the practice really comfortable for lawyers we need a registry of standardized license identifiers and an unambiguous way of specifying that we intend to include by reference.


Comes now the Software Package Data Exchange to solve this problem once and for all. It’s a great idea, I endorse it, and I will be using it in all my software projects from now on.



Here is what the hacking guide for NTPsec now says on this topic, lightly edited to remove some project-specific references:


We use the SPDX convention for inclusion by reference. You can read about this at


http://spdx.org/licenses


When you create a new file, mark it as follows (updating the year as required):


/* Copyright 2015 by the NTPsec project contributors
* SPDX-License-Identifier: BSD-2-Clause
*/

For documentation:


// Copyright 2015 by the NTPsec project contributors
// SPDX-License-Identifier: CC-BY-4.0

Modify as needed for whatever comment syntax the language or markup uses. Good places for these markings are at the end of an extended header comment, or at the very top of the file.


When you modify a file, leave existing copyrights in place. You may add a project copyright and replace the inline license with an SPDX tag. For example:


/* Copyright 2015 by the NTPsec project contributors
* SPDX-License-Identifier: NTP
*/

We recognize that occasionally a file may have changed so much that the historic copyright is no longer appropriate, but such decisions cannot be made casually. Discuss it with the project management before moving.

Published on October 15, 2015 07:14

October 9, 2015

I improved time last night

Sometimes you find performance improvements in the simplest places. Last night I improved the time-stepping precision of NTP by a factor of up to a thousand. With a change of less than 20 lines.


The reason I was able to do this is that the NTP code had not caught up to a change in the precision of modern computer clocks. When it was written, you set time with settimeofday(2), which takes a structure containing seconds and microseconds. But modern POSIX-conformant Unixes have a clock_settime(2) which takes a structure containing seconds and nanoseconds.



Internally, NTP represents times to a precision of under a nanosecond. But because the code was built around the old settimeofday(2) call, until last night it rounded to the nearest microsecond too soon, throwing away precision which clock_settime(2) was capable of passing to the system clock.


Once I noticed this it was almost trivial to fix. The round-off only has to happen if your target platform only has settimeofday(2). Moving it into the handler code for that case, and changing one argument-structure declaration, sufficed.
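
In outline the fix looks like this – invented names, with HAVE_CLOCK_SETTIME standing in for the real configure-time test; this is the shape of the change, not the actual ntpd patch:

#include <sys/time.h>
#include <time.h>

/* Carry nanoseconds all the way down; round off only inside the
 * fallback for platforms that have nothing better than settimeofday(2). */
int set_system_time(const struct timespec *ts)
{
#ifdef HAVE_CLOCK_SETTIME
    /* Modern path: the kernel gets the full nanosecond precision. */
    return clock_settime(CLOCK_REALTIME, ts);
#else
    /* Legacy path: the round-off happens here, at the last possible
     * moment, and nowhere else. */
    struct timeval tv;

    tv.tv_sec  = ts->tv_sec;
    tv.tv_usec = ts->tv_nsec / 1000;  /* nanoseconds -> microseconds */
    return settimeofday(&tv, NULL);
#endif
}

That is the “changing one argument-structure declaration” part: the interface traffics in struct timespec, and struct timeval appears only in the compatibility branch.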


Now, in practice this is not going to yield a full thousand-fold improvement in stepping accuracy, because you can’t get clock sources that accurate. (Well, not unless you’re a national time authority and can afford a cesium-fountain clock.) This change only helps to the extent that your time-server is actually delivering corrections with sub-microsecond accuracy; otherwise those extra bits will be plain noise.


You won’t get that accuracy from a plain GPS, which is seriously wobbly in the 100-millisecond range. Nor from a GPS with 1PPS, which delivers around one microsecond accuracy. But plug in a GPS-conditioned oscillator (GPSDO) and now you’re talking. These commonly have accuracy in about the 100-nanosecond range, so we can expect computing in nanoseconds to actually pass through an order of magnitude in stepping precision.


Pretty good for a 20-line change!


What are our lessons for today?


First…roundoff is insidious. You should always compute at the highest available precision and round off, when you have to, at the latest possible moment. I knew this and had a to-do item in my head to change as many instances of the old struct timeval (microsecond precision) to struct timespec (nanosecond precision) as possible. This is the first place in the NTP code I’ve found that it makes a provable difference. I’ll be hunting others.


Second…you really ought to beat the dust out of your code every couple of years even if it’s working. Because APIs will improve on you, and if you settle for a quick least-effort shim you may throw away significant functional gains without realizing it. A factor of ten is not bupkis, and this one was stupid-easy to collect; I just had to be paying attention. Clearly the NTP Classic maintainers were not.


So, this is my first non-security-related functional improvement in NTP. To be followed by many others, I hope.

Published on October 09, 2015 04:57

October 7, 2015

The FCC must not lock down device firmware!

The following is a comment I just filed on FCC Docket 15-170, “Amendment of Parts 0, 1, 2, 15, and 18 of the Commission’s Rules et al.”


Thirty years ago I had a small hand in the design of the Internet. Since then I’ve become a senior member of the informal collegium that maintains key pieces of it. You rely on my code every time you use a browser or a smartphone or an ATM. If you ever ride in a driverless car, the nav system will critically depend on code I wrote, and Google Maps already does. Today I’m deeply involved in fixing Internet time service.


I write to endorse the filings by Dave Taht and Bruce Perens (I gave Dave Taht a bit of editorial help). I’m submitting an independent comment because while I agree with the general thrust of their recommendations I think they may not go far enough.



The present state of router and wireless-access-point firmware is nothing short of a disaster with grave national-security implications. I know of people who say they could use firmware exploits to take down targeted and arbitrarily large swathes of the public Internet. I believe them because I’m pretty sure I could figure out how to do that myself in three weeks or so if I wanted to.


So far we have been lucky. The specialized technical knowledge required for Internet disruption on a massive scale is mostly confined to a small cadre of old hands like Vint Cerf and Dave Taht and myself. *We* don’t want to disrupt the internet; we created it and we love it. But the threat from others not so benign is a real and present danger.


Cyberwarfare and cyberterrorism are on the rise, with no shortage of malefactors ready to employ them. The Communist Chinese are not just a theoretical threat, they have already run major operations like the OPM crack. Add the North Koreans, the Russians, and the Iranians to a minimum list of those who might plausibly acquire the know-how to turn our own infrastructure against us in disastrous ways.


The effect of locking down router and WiFi firmware as these rules contemplate would be to lock irreparably in place the bugs and security vulnerabilities we now have. To those like myself who know or can guess the true extent of those vulnerabilities, this is a terrifying possibility.


I believe there is only one way to avoid a debacle: mandated device upgradeability and mandated open-source licensing for device firmware so that the security and reliability problems can be swarmed over by all the volunteer hands we can recruit. This is an approach proven to work by the Internet-wide ubiquity and high reliability of the Linux operating system.


In these recommendations I go a bit beyond where Taht and Perens are willing to push. Dave Taht is willing to settle for a mandate of *inspectable* source without a guarantee of permission to modify and redistribute; experience with such arrangements warns me that they scale poorly and are usually insufficient. Bruce Perens is willing to settle for permitting/licensing requirements which I believe would be both ineffective and suppressive of large-scale cooperation.


The device vendors aren’t going to solve the security and reliability problem, because they can’t profit from solving it and they’re generally running on thin margins as it is. Thus, volunteer hackers like myself (and thousands of others) are the only alternative.


We have the skill. We have the desire. We have a proud tradition of public service and mutual help. But you have to *let us do it* – and, to the extent it is in your remit, you have to make the device vendors let us do it.


There is precedent. Consider the vital role of radio hams in coordinating disaster relief. The FCC understands that it is in the public interest to support and enable their voluntarism. In an Internetted age, enabling our voluntarism is arguably even more important.


Mandated device upgradeability. Mandated open source for firmware. It’s not just a good idea, it should be the law.

Published on October 07, 2015 11:58

October 6, 2015

Vox is wrong – we don’t have too many guns, we have too many criminals

One of my followers on G+ asked me to comment on a Vox article, “What no politician wants to admit about gun control”.


I’ve studied the evidence, and I don’t believe the effect of the Australian confiscation on homicides was significant.  You can play games with statistics to make it look that way, but they are games.


As for the major contention of the article, it’s simply wrong.  80% of U.S. crime, including gun violence, is associated with the drug trade and happens in urban areas where civil order has partially or totally collapsed.


Outside those areas, the U.S. looks like Switzerland or Norway – lots of guns, very little crime.  Those huge, peaceful swathes of high-gun-ownership areas show that our problem is not too many guns, it’s too many criminals.



The reason nobody at Vox or anywhere in the punditosphere wants to admit this is because of the racial angle.  The high-crime, high-violence areas of the U.S. are populated by blacks (who, at 12.5% of the population, commit 50% of index violent crimes).  The low-crime, lots-of-guns areas are white.


The predictively correct observation would be that in the U.S., lots of legal weapons owned by white people don’t produce high levels of gun violence any more than they do in Switzerland or Norway. The U.S. has extraordinarily high levels of gun violence because American blacks (and to a lesser extent American Hispanics and other non-white, non-Asian minorities) are extraordinarily lawless. As in, Third-World tribal badlands levels of lawless.


Nobody wants to be honest about this except a handful of evil scumbag racists (and me).  Thus, the entire policy discussion around U.S. firearms is pretty much fucked from the word go.


What would I do about it? Well, since I’m not an evil scumbag racist and in fact believe all laws and regulations should be absolutely colorblind, I would start by legalizing all drugs.  Then we could watch gun violence drop by 80% and look for the next principal driver.

Published on October 06, 2015 10:01

September 24, 2015

ifdex: a tool for code archeologists

I’ve written a tool to assist intrepid code archeologists trying to comprehend the structure of ancient codebases. It’s called ifdex, and it comes with a backstory. Grab your fedora and your bullwhip, we’re going in…



One of the earliest decisions we made on NTPsec was to replace its build system. It had become so difficult to understand and modify that we knew it would be a significant drag on development.


Ancient autoconf builds tend to be crawling horrors and NTP’s is an extreme case – 31KLOC of kludgy macrology that defines enough configuration symbols to make getting a grasp on its interface with the codebase nigh-impossible even when you have a config.h to look at. And that’s a problem when you’re planning large changes!


One of our guys, Amar Takhar, is an expert on the waf build system. When he tentatively suggested moving to that I cheered the idea resoundingly. Months later he was able to land a waf recipe which, while not complete, would at least produce binaries that could be live-tested.


When I say “not complete” I mean that I could tell that there were configuration #defines in the codebase that the waf build never set. Quite a few of them – in some cases fossils that the autoconf build didn’t touch either, but in others … not. And these unreached configuration knobs tended to get lost amidst a bunch of conditional guards looking at #defines set by system headers and the compiler.


And we’re not talking a handful or even dozens. I eventually counted over 670 distinct #defines being used in #if/#ifdef/#ifndef/#elif guards – 2430 of them, as A&D regular John D. Bell pointed out in a comment on my last post. I needed some way to examine these and sort them into groups – this is from a system header, that’s a configuration knob, and over there is something else…


So I wrote an analyzer. It parses every compile-time conditional in a code tree for symbols, then reports them either as a bare list or GCC-like file/line error messages that you can step through with Emacs compilation mode.
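
For flavor, here is a toy sketch of that kind of scan over a single file – not ifdex itself, which also handles whole trees, symbol groups, and exclusion lists – just the core idea of pulling identifiers out of #if/#ifdef/#ifndef/#elif lines and reporting them in GCC error format:

/* guardscan.c: toy illustration only, not ifdex.  Reports every
 * identifier used in a preprocessor conditional, one per line, in
 * file:line: form.  It makes no attempt to handle comments or
 * line continuations. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static void report_symbols(const char *file, int lineno, const char *text)
{
    const char *p = text;

    while (*p != '\0') {
        if (isalpha((unsigned char)*p) || *p == '_') {
            char sym[128];
            size_t n = 0;

            while ((isalnum((unsigned char)*p) || *p == '_') && n < sizeof(sym) - 1)
                sym[n++] = *p++;
            sym[n] = '\0';
            if (strcmp(sym, "defined") != 0)  /* an operator, not a guard symbol */
                printf("%s:%d: %s\n", file, lineno, sym);
        } else {
            p++;
        }
    }
}

int main(int argc, char *argv[])
{
    char line[1024];
    int lineno = 0;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: guardscan <file>\n");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        char *p = line;

        lineno++;
        while (isspace((unsigned char)*p))
            p++;
        if (*p != '#')
            continue;
        p++;
        while (isspace((unsigned char)*p))
            p++;
        if (strncmp(p, "ifdef", 5) == 0 || strncmp(p, "ifndef", 6) == 0 ||
            strncmp(p, "elif", 4) == 0 || strncmp(p, "if", 2) == 0) {
            while (isalpha((unsigned char)*p))
                p++;                  /* skip the directive keyword itself */
            report_symbols(argv[1], lineno, p);
        }
    }
    fclose(fp);
    return 0;
}

Point anything that understands file:line: messages – Emacs compilation mode, for instance – at its output and you can step through the guards one by one. The real analyzer, of course, does this across an entire code tree.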


To reduce noise, it knows about a long list of guard symbols (almost 200 of them) that it should normally ignore – things like the __GNUC__ symbol that GCC predefines, or the O_NONBLOCK macro used by various system calls.


The symbols are divided into groups that you can choose to ignore individually with a command-line option. So, if you want to ignore all standardized POSIX macros in the list but see anything OS-dependent, you can do that.


Another important feature is that you can build your own exclusion lists, with comments. The way I’m exploring the jungle of NTP conditionals is by building a bigger and bigger exclusion list describing the conditional symbols I understand. Eventually (I hope) the report of unknown symbols will shrink to empty. At that point I’ll know what all the configuration knobs are with certainty.


As of now I have knocked out about 300 of them and have 373 to go. That’s most of a week’s work, even with my spiffy new tool. Oh well, nobody ever said code archeology was easy.

Published on September 24, 2015 12:07

September 23, 2015

Major progress on the NTPsec front

I’ve been pretty quiet on what’s going on with NTPsec since posting Yes, NTPsec is real and I am involved just about a month ago. But it’s what I’m spending most of my time on, and I have some truly astonishing success to report.


The fast version: in three and a half months of intensive hacking, since the NTP Classic repo was fully converted to Git on 6 June, the codebase is down to 47% of its original size. We’ve fixed our first serious security bug. Live testing on multiple platforms seems to indicate that the codebase is now solid beta quality, mostly needing cosmetic fixes and more testing before we can certify it production-ready.


Here’s the really astonishing number…



In fifteen weeks of intensive hacking, code cleanup, and security-hardening changes, the number of user-visible bugs we introduced was … one (1). Turns out that during one of my code-hardening passes, when I was converting as many flag variables as possible to C99 bool so static analyzers would have more type constraint information, I flipped one flag initialization. This produced two different minor symptoms (strange log messages at startup and incorrect drift-statistics logging).


Live testing revealed two other bugs, one of which turned out to be a build-system issue and the other some kind of Linux toolchain problem with glibc or pthreads that doesn’t show up under FreeBSD, so it doesn’t really count.


Oh, and that build system bug? Happened while we were reducing the build from 31KLOC of hideous impacted autotools cruft to a waf recipe that runs at least an order of magnitude faster and comes in at a whole 900 lines. Including the build engine.


For those of you who aren’t programmers, just two iatrogenic bugs after fifteen weeks of hacking on a 227-thousand-line codebase full of gnarly legacy crud from as far back as the 1980s – and 31KLOC more of autotools hair – is a jaw-droppingly low, you’ve-got-to-be-kidding-me, this-never-happens error rate.


This is the point at which I would normally make self-deprecating noises about how good the other people on my team are, especially because in the last week and a half they really have been. But for complicated and unfortunate reasons I won’t go into, during most of that period I was coding effectively alone. Not by choice.


Is it bragging when I say I didn’t really know I was that good? I mean, I thought I might be, and I’ve pulled off some remarkable things before, and I told my friends I felt like I was doing the best work of my life on this project, but looking at those numbers leaves me feeling oddly humbled. I wonder if I’ll ever achieve this kind of sustained performance again.


About that security bug, I’m not going to say anything more detailed until the research paper comes out of embargo, but it’s potentially really bad. Easy denial of service, no exploits in the wild yet but an estimated 700K systems vulnerable. A&D regular Daniel Franke and I collaborated on the fix; analysis mostly his, code mostly mine. We turned it around within 48 hours of reading the paper.


The August release announcement was way premature (see complicated and unfortunate reasons I won’t go into, above). But. Two days ago I told the new project manager – another A&D regular, Mark Atwood – that, speaking as architect and lead coder, I saw us as being one blocker bug and a bunch of cosmetic stuff from a release I’d be happy to ship. And yesterday the blocker got nailed.


I think what we have now is actually damn good code – maybe still a bit overengineered in spots, but that’s forgivable if you know the history. Mostly what it needed was to have thirty years of accumulated cruft chiseled off of it – at times it was such an archeological dig that I felt like I ought to be coding with a fedora on my head and a bullwhip in hand. Once I get replicable end-to-end testing in place the way GPSD has, it will be code you can bet your civilizational infrastructure on. Which is good, because you probably are going to be doing exactly that.


I need to highlight one decision we made early on and how much it has paid off. We decided to code to a POSIX.1-2001/C99 baseline and ruthlessly throw out support for legacy OSes that didn’t meet that. Partly this was informed by my experience with GPSD, from which I tossed out all the legacy-Unix porting shims in 2011 and never got a this-doesn’t-port complaint even once afterwards – which might impress you more if you knew how many weird-ass embedded deployments GPSD has. Tanks, robot submarines, you name it…


I thought that commitment would allow us to chisel off 20% or so of the bulk of the code, maybe 25% if we were lucky.


This morning it was up to 53%! And we’re not done. If reports we’ve been hearing of good POSIX conformance in current Windows are accurate, we may soon have a working Windows port and be able to drop most of another 6 KLOC.


(No, I won’t be doing the Windows port. It’ll be Chris Johns of the RTEMS project behind that, most likely.)


I don’t have a release date yet. But we are starting to reach out to developers who were not on the original rescue team. Daniel Franke will probably be the first to get commit rights. Public read-only access to the project repo will probably be made available some time before we ship 1.0.


Why didn’t we open up sooner? I’m just going to say “politics” and leave it at that. There were good reasons. Not pleasant ones, but good ones – and don’t ask because I’m not gonna talk about it.


Finally, a big shout-out to the Core Infrastructure Initiative and the Linux Foundation, who are as of about a month ago actually (gasp!) paying me to work on NTPsec. Not enough that I don’t still have some money worries, because Cathy is still among the victims-of-Obamacare unemployed, but enough to help. If you want to help and you haven’t already, there’s my Patreon page.


I have some big plans and the means to make them happen. The next six months should be good.

Published on September 23, 2015 04:44
