Stephen Baker’s The Numerati is a serviceable introduction to the arenas where statistical analysis of large data sets is gaining prominence. Despite the title, the book does not really present the leading scientists and statisticians at the forefront of converting our analog lives into computer-friendly numbers. I would also have liked the book to grapple more with how non-statisticians should come to terms with being quantified and analyzed.
The book presents this numerification without judgment; it is simply a description of what is already happening. From Mr. Baker’s matter-of-fact presentation, we can surmise that behavior quantification is currently used mostly to market products to us or to track us. Politicians get to slice us into ever finer demographics; true believers are ignored while swing voters are targeted. Stores entice consumers to join rewards programs; the information businesses gain is cheaply bought. The debris of our personal lives is vacuumed up by governments intent on identifying the terrorists among us. The workplace becomes more divided, first by cubicle walls and then by algorithms designed to flag malingerers.
Mr. Baker does not dwell on how power accrues to those with access to the information, although most of the researchers seem to think their analyses will be used by laymen as much as by themselves. He presents two dissenting voices. One is a detective who deploys the latest face-recognition software for casinos; the expert has become an advocate for the privacy citizens deserve, noting how uncomfortable it is to receive targeted ads that presume too much about our behavior. The other is Baker himself, though only in the narrow scope of how numerification affects his own industry. He sees value in editors acting as curators of news. Otherwise, that role falls to the reader, who may be overwhelmed by the sheer number of news items. More likely, the reader will defer to search engines (the very things supplanting editors).
Mr. Baker does not really push this issue, but search engines do not have to be value-neutral. They can very well reflect the political biases of their owners, or the ranking function itself might be tuned to drive up revenue (don’t forget, Google makes money by selling ads). People tend to think of software as objective and free of bias because it is built on algorithms, machine rules, and mathematical models. One interesting aspect of numerification is that it in no way dismisses the need for human judgment, especially in selecting the mathematical rules to use, the filters and gates applied to the data, and the interpretation of results. A computer can crank out numbers, but humans decide which formulas to use.
A short while ago, I was discussing this very issue with a director of analytics at a marketing firm. We got to talking about cluster analysis; we both felt that while its results are perfect for what we want to do with our data, there is a surprising amount of ambiguity involved. In MATLAB, one function for finding groups of data points implements k-means clustering. To use it, you have to specify how many clusters the function should slice the data into. The process itself is straightforward: a number of positions are selected at random, and the algorithm then repositions each of these points until it sits at the center (the mean) of the group of points forming its cluster. Everything about it works as advertised, except for the part where the user must tell the program how many clusters there are. Not much help if you are looking for a computational method to find the clusters “objectively.” The director and I moved on to other topics, such as formulating machine rules and vetting them.
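The point is easy to see in code. Here is a minimal sketch of k-means in Python (our conversation concerned MATLAB’s built-in function, but the logic is the same; the function name, data, and defaults below are illustrative, not anyone’s production code). Notice that k is an argument the caller must supply:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Bare-bones k-means. The number of clusters, k, must be
    supplied by the user -- the algorithm cannot discover it."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting positions
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs: with k=2 the split is found cleanly, but the
# algorithm would just as happily carve the same data into k=3 pieces.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids, clusters = kmeans(points, k=2)
```

Every step is mechanical except the first decision, which is entirely human: rerun the same data with k=3 and the algorithm will dutifully report three clusters, with no complaint that the choice was wrong.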
Let’s leave aside the loss of dignity and individuality entailed in numerification; the subtle points not addressed in The Numerati are how models are built and how metrics are validated. These touch directly on what can go wrong when society is numerified. The most obvious example is bad data, either typos or out-of-date information, leading to misclassification. It’s not identity theft, but the result is the same: some agent attributes notoriety to the wrong person, and the victim gets stuck with a bill or, worse, is labeled a terrorist and detained by the authorities. Another possible error is using the wrong metric, leading to even more inefficiency than if the numbers had been ignored. Simply put: are the measures used really the most relevant ones, and how likely are we to settle on the wrong formula?