NaNoWriMo 2014: The Little Website That Could
Earlier in November, we promised you the straight dope from our technical director, Dave Beck, himself. He reflects on the rocky beginnings of his very first NaNoWriMo, and how our tech team found triumph despite it all:
One of my takeaways from my first November as NaNoWriMo’s technical director was an intense feeling of empathy with my predecessor, Dan Duvall. Dan served well and nobly for five years, until he finally succumbed to shell shock and is now living out his days in a sanatorium with green pastures and pretty nurses in upstate New York.
…Not really. Dan now toils nobly at Wikimedia Commons, helping serve up images like the one above, but in early November 2014 I felt like the next doughboy to man the machine gun, with Sergeant Dan lying bleeding and broken nearby, propped up by the bodies of past developers. Our site, nanowrimo.org, was throwing 404 timeout errors, and nobody could update their word counts. The Young Writers Program servers were behaving like a bunch of drunken ballerinas, and the YWP ship was listing dangerously. In short, our metaphors were badly mixed.
First, a little history. NaNo has been, since its inception, the under-funded, under-manned little engine that could. The technical infrastructure was created initially by volunteer Friends of Baty, then by a succession of stalwart developers with different skills and feeble budgets.
Our online systems are a potpourri of recent Ruby on Rails code, hopelessly outdated Drupal 5, and an assortment of ancient, mostly deprecated server applications. Coding for NaNoWriMo sometimes feels like an archeological dig. Is anybody still using this stone axe? Seriously, will our email fail if we throw away this stone axe?!?
Among the stone axes is the YWP website code. At our November peak we run 96 server instances (virtual servers) on Amazon Web Services, some 32 of which are devoted exclusively to powering the Young Writers Program. From Halloween morning until midday on All Saints' Day I watched in horror as server after server crashed, recovered briefly, then crashed again. I checked and double-checked everything from asset and caching mounts to database connections, to no avail. The systems seemed to be inherently unstable.
And, in fact, they were. Things get a little technical here, so bear with me. Amazon Web Services instances start with a basic configuration called an AMI, which is sort of like the bedrock foundation for each server. With Dan’s help, we discovered that the AMI for YWP was allowing more simultaneous connections than the old Drupal 5 code could handle. A simple edit to the server configuration files solved the problem. Let loose the young writers!
On nanowrimo.org there were three separate issues contributing to the slowness. One of the perplexing questions is why the site didn’t have similar problems last year, because the buggy code had gone unchanged from one year to the next. The issues were all bottlenecks caused by server misconfigurations or improper coding, basically creating a massive traffic jam on the site. Here’s what we found and fixed:
In the code, every word-count update also updated the total word count for November 2014, which is stored in a single field in the database. With so many people trying to simultaneously modify the same single piece of information, the database server bogged down. Our new dashboard design no longer displayed the total global word count, so our steadfast developer Jezra Lickter simply removed the offending code, and voila! No bottleneck.
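To picture why that single field hurt, here is a minimal Python sketch using the standard-library `sqlite3` module. It is not the site's actual Rails code: the table and column names (`novels`, `user_id`, `word_count`) and both function names are assumptions made for illustration. Each update touches only the writer's own row, and a global total, should one ever be wanted again, is derived on demand instead of being rewritten on every save.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical novels table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE novels ("
        "user_id INTEGER PRIMARY KEY, "
        "word_count INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def update_word_count(conn, user_id, word_count):
    """Write only this writer's row; no shared 'global total' row is touched."""
    conn.execute(
        "INSERT OR REPLACE INTO novels (user_id, word_count) VALUES (?, ?)",
        (user_id, word_count),
    )
    conn.commit()


def global_word_count(conn):
    """Derive the total when (and only when) somebody asks for it."""
    row = conn.execute(
        "SELECT COALESCE(SUM(word_count), 0) FROM novels"
    ).fetchone()
    return row[0]
```

With this shape, two writers saving at the same moment modify different rows, so neither has to wait for the other to finish with one shared counter.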
In the database, the latest word count is saved in the “novels” table, and so updating one’s word count also re-saved all the other novel information (like the title and synopsis). Saving one’s novel triggers a re-indexing of the novel search data, which can be time-consuming and CPU intensive, so every word-count update precipitated a weighty and largely unnecessary process. Jezra’s fix: only re-index the novel search when a searchable field is changed.
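The idea behind that fix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Jezra's actual code: the `Novel` class, the `SEARCHABLE_FIELDS` set, and the `reindex_search` method are all hypothetical stand-ins, and the real site is written in Ruby on Rails.

```python
# Assumed set of fields that feed the novel search index.
SEARCHABLE_FIELDS = {"title", "synopsis"}


class Novel:
    def __init__(self, title, synopsis, word_count=0):
        self.fields = {
            "title": title,
            "synopsis": synopsis,
            "word_count": word_count,
        }
        self.reindex_count = 0  # how many times the costly step has run

    def reindex_search(self):
        # Stand-in for the time-consuming, CPU-intensive re-indexing.
        self.reindex_count += 1

    def save(self, **changes):
        # Work out which fields actually received a new value.
        changed = {
            name for name, value in changes.items()
            if self.fields.get(name) != value
        }
        self.fields.update(changes)
        # Re-index only if at least one searchable field changed.
        if changed & SEARCHABLE_FIELDS:
            self.reindex_search()
        return changed
```

A call such as `novel.save(word_count=1667)` leaves `reindex_count` untouched, while `novel.save(title="A New Title")` triggers exactly one re-index.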
Finally, we discovered that each of the 40 nanowrimo.org server instances only allowed a maximum of 25 concurrent database connections, for a total of 1,000 simultaneous database connections. But our database server was configured to handle over 20,000 connections. A simple configuration edit on each instance unleashed the full power of our big AWS database server.
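The arithmetic behind that mismatch is easy to check. The short Python sketch below uses only the figures quoted above; the function names are made up for illustration, 20,000 is treated as a lower bound because the post says "over 20,000", and the post does not say what the new per-instance limit was.

```python
INSTANCES = 40               # nanowrimo.org server instances
PER_INSTANCE_LIMIT = 25      # old cap on concurrent DB connections per instance
DB_MAX_CONNECTIONS = 20_000  # database server capacity (lower bound)


def total_app_connections(instances, per_instance_limit):
    """Most connections the whole fleet could ever open at once."""
    return instances * per_instance_limit


def per_instance_ceiling(instances, db_max_connections):
    """Largest per-instance cap that still fits within the database's capacity."""
    return db_max_connections // instances


old_total = total_app_connections(INSTANCES, PER_INSTANCE_LIMIT)
print(old_total)                                            # 1000
print(per_instance_ceiling(INSTANCES, DB_MAX_CONNECTIONS))  # 500
```

In other words, the old settings let the fleet use about a twentieth of what the database could accept, and each instance could have been allowed up to roughly 500 connections before the database itself became the limit.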
These three fixes (two code changes and one configuration edit) made the site orders of magnitude faster, eliminated almost all of the 404 errors, and will save us money on server costs next year. Although I’d rather the sites run flawlessly and without incident, the frustrating errors led to vast improvements in server speed and efficiency.
Thanks for your patience, all! Next year’s November will be much smoother, or I’ll be joining Dan in the asylum.
Dave Beck has a B.A. in Italian Literature and a Doctorate of Jurisprudence, and has worked as a chef, reporter, editor, museum exhibit developer, website designer, computer programmer, and now the Director of IT for the most awe-inspiring community on the planet. He and his beloved twin daughters are attempting to watch every great movie ever made.
Chris Baty's Blog
