Major progress on the NTPsec front
I’ve been pretty quiet on what’s going on with NTPsec since posting Yes, NTPsec is real and I am involved just about a month ago. But it’s what I’m spending most of my time on, and I have some truly astonishing success to report.
The fast version: in three and a half months of intensive hacking, since the NTP Classic repo was fully converted to Git on 6 June, the codebase is down to 47% of its original size. We’ve fixed our first serious security bug. Live testing on multiple platforms seems to indicate that the codebase is now solid beta quality, mostly needing cosmetic fixes and more testing before we can certify it production-ready.
Here’s the really astonishing number…
In fifteen weeks of intensive hacking, code cleanup, and security-hardening changes, the number of user-visible bugs we introduced was … one (1). Turns out that during one of my code-hardening passes, when I was converting as many flag variables as possible to C99 bool so static analyzers would have more type constraint information, I flipped one flag initialization. This produced two different minor symptoms (strange log messages at startup and incorrect drift-statistics logging).
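For readers who want a picture of what that kind of slip looks like, here is a minimal sketch in C. The flag and function names are hypothetical, invented for illustration; this is not the actual NTPsec variable or the actual fix, just the general shape of an int-to-bool conversion where the initializer can end up inverted.

```c
#include <stdbool.h>

/* Legacy style: an int used as a flag, with negative sense.
 * (Hypothetical name, not taken from the NTP source.) */
static int legacy_stats_disabled = 0;        /* 0 means "statistics are on" */

/* C99 style: the same flag as a bool with positive sense.
 * Renaming from "disabled" to "enabled" means the initializer
 * has to be inverted along with the name. */
static bool stats_enabled = true;            /* faithful translation of disabled = 0 */

/* What a slip looks like: the sense of the name changed,
 * but the initializer was carried over as "zero-ish". */
static bool stats_enabled_flipped = false;   /* compiles cleanly, behaves wrongly */

/* Each function answers "should statistics be logged?" for one variant. */
bool legacy_should_log(void)    { return legacy_stats_disabled == 0; }
bool converted_should_log(void) { return stats_enabled; }
bool flipped_should_log(void)   { return stats_enabled_flipped; }
```

The flipped variant is type-correct, so neither the compiler nor a static analyzer has any reason to object. Only a behavioral check, such as comparing the converted predicate against the legacy one, would show the difference.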
Live testing revealed two other bugs, one of which turned out to be a build-system issue and the other some kind of Linux toolchain problem with glibc or pthreads that doesn’t show up under FreeBSD, so it doesn’t really count.
Oh, and that build system bug? Happened while we were reducing the build from 31KLOC of hideous impacted autotools cruft to a waf recipe that runs at least an order of magnitude faster and comes in at a whole 900 lines. Including the build engine.
For those of you who aren’t programmers, just two iatrogenic bugs after fifteen weeks of hacking on a 227-thousand-line codebase full of gnarly legacy crud from as far back as the 1980s – and 31KLOC more of autotools hair – is a jaw-droppingly low, you’ve-got-to-be-kidding-me, this-never-happens error rate.
This is the point at which I would normally make self-deprecating noises about how good the other people on my team are, especially because in the last week and a half they really have been. But for complicated and unfortunate reasons I won’t go into, during most of that period I was coding effectively alone. Not by choice.
Is it bragging when I say I didn’t really know I was that good? I mean, I thought I might be, and I’ve pulled off some remarkable things before, and I told my friends I felt like I was doing the best work of my life on this project, but looking at those numbers leaves me feeling oddly humbled. I wonder if I’ll ever achieve this kind of sustained performance again.
About that security bug, I’m not going to say anything more detailed until the research paper comes out of embargo, but it’s potentially really bad. Easy denial of service, no exploits in the wild yet but an estimated 700K systems vulnerable. A&D regular Daniel Franke and I collaborated on the fix; analysis mostly his, code mostly mine. We turned it around within 48 hours of reading the paper.
The August release announcement was way premature (see complicated and unfortunate reasons I won’t go into, above). But. Two days ago I told the new project manager – another A&D regular, Mark Atwood – that, speaking as architect and lead coder, I saw us as being one blocker bug and a bunch of cosmetic stuff from a release I’d be happy to ship. And yesterday the blocker got nailed.
I think what we have now is actually damn good code – maybe still a bit overengineered in spots, but that’s forgivable if you know the history. Mostly what it needed was to have thirty years of accumulated cruft chiseled off of it – at times it was such an archeological dig that I felt like I ought to be coding with a fedora on my head and a bullwhip in hand. Once I get replicable end-to-end testing in place the way GPSD has, it will be code you can bet your civilizational infrastructure on. Which is good, because you probably are going to be doing exactly that.
I need to highlight one decision we made early on and how much it has paid off. We decided to code to a POSIX.1-2001/C99 baseline and ruthlessly throw out support for legacy OSes that didn’t meet that. Partly this was informed by my experience with GPSD, from which I tossed out all the legacy-Unix porting shims in 2011 and never got a this-doesn’t-port complaint even once afterwards – which might impress you more if you knew how many weird-ass embedded deployments GPSD has. Tanks, robot submarines, you name it…
I thought that commitment would allow us to chisel off 20% or so of the bulk of the code, maybe 25% if we were lucky.
This morning it was up to 53%. 53%! And we’re not done. If reports we’ve been hearing of good POSIX conformance in current Windows are accurate, we may soon have a working Windows port and be able to drop most of another 6 KLOC.
(No, I won’t be doing the Windows port. It’ll be Chris Johns of the RTEMS project behind that, most likely.)
I don’t have a release date yet. But we are starting to reach out to developers who were not on the original rescue team. Daniel Franke will probably be the first to get commit rights. Public read-only access to the project repo will probably be made available some time before we ship 1.0.
Why didn’t we open up sooner? I’m just going to say “politics” and leave it at that. There were good reasons. Not pleasant ones, but good ones – and don’t ask because I’m not gonna talk about it.
Finally, a big shout-out to the Core Infrastructure Initiative and the Linux Foundation, who are as of about a month ago actually (gasp!) paying me to work on NTPsec. Not enough that I don’t still have some money worries, because Cathy is still among the victims-of-Obamacare unemployed, but enough to help. If you want to help and you haven’t already, there’s my Patreon page.
I have some big plans and the means to make them happen. The next six months should be good.
Eric S. Raymond's Blog
