Humble Pi: When Math Goes Wrong in the Real World
Rate it:
Open Preview
Kindle Notes & Highlights
Read between October 16 - October 19, 2022
17%
Flag icon
Checking if a data entry is equal to NULL is a handy step in programming. I wrote a program to maintain a spreadsheet of all my YouTube videos.
17%
Flag icon
In a lot of computer languages, != means “not equal to.” So, for each row, it checks if the data in the first cell is not equal to null. If it’s not equal to null, it adds 1 to the row and keeps going until the first blank row. If my spreadsheet had rows starting with people’s surnames, then Steve Null could have broken my code (depends how clever the programming language is).
17%
Flag icon
But if you are a parent, please don’t give your child a first name that will set them up for a lifetime of battling computers. And given that over three hundred children in the USA since 1990 have been named Abcde, it’s worth spelling this out:
17%
Flag icon
That last one is not even a joke. It looks like I fell asleep on my keyboard but it is actually a fully functional computer program that will scan through a database without needing to know how it is arranged.
18%
Flag icon
Typing it in as someone’s name is not a joke either. This is known as an SQL injection attack (named after the popular database system SQL; sometimes pronounced like “sequel”). It involves entering malicious code via the URL of an online form and hoping whoever is in charge of the database has not put enough precautions in place.
18%
Flag icon
It may seem ridiculous that a database would process incoming malicious code, but without the ability to run code a modern database would lose its functionality. It’s a balancing act to keep a database secure but able to support advanced features that require running code.
18%
Flag icon
That very code was used in 2008 to attack the UK government and the United Nations—except some of it had been converted into hexadecimal values to slip by security systems looking for incoming code. Once in the database, it would unzip back into computer code, find the database entries then phone home to download additional malicious programs.
18%
Flag icon
In Los Angeles there is a block of land on the corner of West First Street and South Spring Street that houses the offices of the Los Angeles Times. It is just down the street from City Hall and directly across from the LA Police Department. There may be some rough areas of LA best avoided by tourists, but this is certainly not one of them. The area looks as safe as safe can be . . . until you check the LAPD’s online map of reported crime locations. Between October 2008 and March 2009, there were 1,380 crimes on that block. That’s around 4 percent of all crimes marked on the map.
18%
Flag icon
All reported crimes have a location recorded, often handwritten, and this is automatically geocoded by computer to latitude and longitude. If the computer is unable to work out the location, it simply logs the default location for Los Angeles: the front doorstep of the LAPD headquarters.
18%
Flag icon
Null Island is a small but proud island nation off the west coast of Africa. It’s located about six hundred kilometers south of Ghana, and you can find it by putting its latitude and longitude into any mapping software of your choice: 0,0.
18%
Flag icon
Whenever a computer cannot decipher a location, it still has to fill something in, and so 0,0 became the default location. The island where bad data goes to die.
18%
Flag icon
For generations, cartographers have been sneaking fictitious places into real maps (often as a way to expose people plagiarizing their work), and it was inevitable that Null Island would take on a life of its own.
18%
Flag icon
Even when the data has made it into a database, it is not safe . . . which brings us, finally, to Microsoft Excel.
19%
Flag icon
there is one thing Excel is not, and that is a database system. Yet it is frequently used as one.
19%
Flag icon
For a start, just because something walks like a number and quacks like a number does not mean it is a number. Some things that look like numbers are just not. Phone numbers are a perfect example: despite being made from digits, they are not actually numbers.
19%
Flag icon
The rule of thumb should be: if you’re not going to do any math with it, don’t store it as a number.
19%
Flag icon
It gets worse for a phone number. If you enter the UK phone number 0141 404 2559, not only does the zero disappear but 1,414,042,559 is a really big number, well over a billion. So if you put it in Excel, it might switch the number over to a different way of writing numbers: scientific notation. I just pasted that number into a spreadsheet and now all I can see is 1.414E+9.
19%
Flag icon
Scientific notation separates the size of the number from what its specific digits are. Normally, the size of a number is indicated by how many digits it has (before a decimal point), but when we don’t know all the digits, or the digits simply are not important, then the number ends up being mostly zeros.
19%
Flag icon
Scientific notation just takes that a step further. In normal language, we like round multiples of millions and billions, but in science, the decimal point is moved all the way to the front, and then the number of digits is specified.
19%
Flag icon
If you’re ever not sure if something is a number or not, my test is to imagine asking someone for half of it. If you asked for half the height of someone 180 centimeters tall, they would say 90 centimeters. Height is a number. Ask for half of someone’s phone number, and they will give you the first half of the digits.
19%
Flag icon
A number system needs as many digits as its base, which is why base-10 has the ten digits: 0 to 9. But once you go past base-10, there are no more “normal” digits, so letters get called into service.
20%
Flag icon
If you convert the base-10 number 19,527 to base-16, you get 4C47. Here, the C is not a letter, despite still looking like one; it is a digit. Specifically, it is the digit that represents a value of twelve. Just as 7 represents a value of, well, seven. When they ran out of digits, mathematicians realized that letters were a perfect source of more symbols and already have an agreed order.
20%
Flag icon
If you try to use a letter as a digit in Excel, it reasonably assumes you’re typing a word, not a number.
20%
Flag icon
If computers are obsessed with binary numbers, their next love is base-16 numbers. It’s really easy to convert between binary and base-16 hexadecimal numbers, which is why hexadecimal is used to make computer binary a bit more human-friendly; the hexadecimal 4C47 represents the full binary number 0100110001000111 but is much easier to read.
20%
Flag icon
You can think of hexadecimal as binary in disguise. It was used in the SQL injection example before, to hide computer code in plain sight.
20%
Flag icon
The mistake is to try to store computer data that uses hexadecimal values in Excel, a mistake I’m as guilty of as anyone.
20%
Flag icon
Excel has a built-in function that converts between base-10 and base-16 called DEC2HEX (if I ever start a boy band, I’m calling it Dec2Hex). If you type in DEC2HEX(19527), it will spit out 4C47 then immediately forget that it’s a number.
20%
Flag icon
If you really want to nerd out (too late! we’re doing it!), there is one special niche case where Excel completely breaks hexadecimal by, ironically, actually treating it as a number. But it’s the wrong type of number.
20%
Flag icon
Suddenly, your 1E9 has been replaced by 1.00E+09. From 489 to a billion in the blink of a format.
20%
Flag icon
This is not only limited to us nerds using hexadecimal. I spoke off the record to a database consultant who was working with a company in Italy. They had a lot of clients, and their database would generate a client ID for each one by using something like the current year, the first letter of the client company name and then an index number to make sure each ID was unique. For some reason their database was losing companies whose names started with the letter E. It was because they were using Excel and it was converting those client IDs to be a scientific notation number, which was no longer ...more
20%
Flag icon
Reading biology texts reminds me what it is like for other people reading math: a string of words and symbols that looks like normal language yet transmits no new information to my brain as it parses them. If I fight back the need to process the actual content and just focus on the overall syntax, I can kinda get the gist of what the author is trying to say.
21%
Flag icon
Thankfully, on the tenth human chromosome is the gene that encodes for the production of this enzyme. The gene has the catchy name of MARCH5, and if you think that looks a lot like a date, then you can already see where this is going.
21%
Flag icon
Type those gene names into Excel and they’ll transform into 5-Mar and 15-Sep, encoded as 3/5/2019 and 9/15/2019 (or whatever the current year is) in the Formula Bar of the US version. All mention of MARCH5 and SEP15 has been obliterated.
21%
Flag icon
In 2016 three intrepid researchers in Melbourne analyzed eighteen journals that had published genome research between 2005 and 2015 and found a total of 35,175 publicly available Excel files associated with 3,597 different research papers. They wrote a program to autodownload the Excel files, then scan them for lists of gene names, keeping an eye out for where they had been “autocorrected” by Excel into something else.
21%
Flag icon
In 2010 WikiLeaks presented The Guardian and The New York Times with 92,000 leaked field reports from the war in Afghanistan. Julian Assange delivered them in person to the Guardian offices in London. The journalists quickly confirmed that they seemed to be real but, to their surprise, the reports ended abruptly in April 2009, although they should have gone through to the end of that year. You guessed it: Excel counted its rows as a 16-bit number, so there was a maximum of 216 = 65,536 rows available. So when the journalists opened the data in Excel, all the data after the first 65,536 entries ...more
21%
Flag icon
Excel has since been expanded to a maximum of 220 = 1,048,576 rows. But that is still a limit! Scrolling down in Excel can feel like it goes on forever, but if you drag down for long enough you will eventually hit the end of the spreadsheet.
21%
Flag icon
On the whole, doing any kind of important work in a spreadsheet is not a good idea. They are the perfect environment for mistakes to spawn and grow unchecked.
21%
Flag icon
The European Spreadsheet Risks Interest Group (yes, that is a real organization, one dedicated to examining the moments when spreadsheets go wrong) estimates that over 90 percent of all spreadsheets contain errors.
22%
Flag icon
They are able to arrive at such an oddly specific percentage because, occasionally, an entire company’s worth of spreadsheets escape at once. Dr. Felienne Hermans is an assistant professor at Delft University of Technology, where she runs their Spreadsheet Lab.
22%
Flag icon
In the aftermath of the Enron scandal of 2001, the Federal Energy Regulatory Commission published the results of its investigation into the corporation and the evidence behind it—which included around 500,000 e-mails from within the company. There were some concerns about publishing the e-mails of employees who had nothing to do with the scandal, so a sanitized version taking employee concerns into account is now available online.
22%
Flag icon
It gets really interesting when you drill down into how these spreadsheets had gone wrong. The 6,650 spreadsheets with no formulas in them were basically being used as glorified text documents listing numbers, so I’ll ignore them. I care only about spreadsheets that are doing some math that can go wrong. So that’s the remaining 9,120 spreadsheets containing 20,277,835 formulas.
22%
Flag icon
But Excel cannot make sure that you use sensible functions or point them at the correct cells to feed the right data into the formulas.
22%
Flag icon
Hermans found that 2,205 spreadsheets had one or more Excel error messages. Which means that around 24 percent of all formula-containing spreadsheets contained an error.
22%
Flag icon
one spreadsheet took first place with 83,273 errors.
22%
Flag icon
Without having a deep knowledge of what the creator was trying to do in the first place, there is no easy way to scan spreadsheets and make sure the formulas are all pointing in the right places. This is probably the biggest problem with them. It’s easy to select the wrong column accidentally and, suddenly, the data is coming from the wrong year, or the data is gross instead of net (gross data indeed!).
22%
Flag icon
In 2012 the State Office of Education in Utah miscalculated its budget to the tune of $25 million because of what State Superintendent Larry Shumway called “a faulty reference” in a spreadsheet. In 2011 the village of West Baraboo in Wisconsin miscalculated how much their borrowing would cost by $400,000 because a range being summed missed one important cell.
22%
Flag icon
Goodness knows how many minor mistakes there are in the complex webs of formulas that exist in industrial spreadsheets.
23%
Flag icon
This is before we even get to “version control,” which means making sure everyone knows what the most up-to-date spreadsheet is. Of the 68,979 Enron e-mails about spreadsheets, 14,084 were about what version of a spreadsheet people were using.
23%
Flag icon
In 2011 Kern County in California forgot to ask a company for $12 million in taxes because they used the wrong version of a spreadsheet, missing $1.26 billion worth of oil-and-gas-producing property.
23%
Flag icon
In 2012 JPMorgan Chase lost a bunch of money; it’s difficult to get a hard figure, but the agreement seems to be that it was around $6 billion. As is often the case in modern finance, there are a lot of complicated aspects to how the trading was done and structured (none of which I claim to understand). But the chain of mistakes featured some serious spreadsheet abuse, including the calculation of how big the risk was and how losses were being tracked.