Hands down, one of the most useful layman texts I’ve read all year, perhaps in the past three years. “Dark Data”, written by the British statistician David Hand, discusses one of the least understood and least talked-about parts of the data-analysis workflow: dealing with missing data. Why this is the case is not exactly clear, but it probably comes down to two reasons. One, the rapidity of work expected from data scientists, not only in executing their scripts, writing code, and implementing algorithms, but above all in delivering the valuable “insights” the company is waiting for. When tasked to quickly build a predictive model that will serve as the engine for some scoring process, the focus is usually on “traditional” machine learning development: the train-test-validation loop, followed by the report or the push to production. Two, the complexity of the topic of missingness itself (more on that later).
What gets less attention in these fast-paced projects, whether from the team or the client, is the nature of the data itself: how it was produced, what that production implies about its fidelity, and, of course, what information is missing from it. The classical statistical practice of yesteryear was very methodical along these lines, but statistical consulting was also far more bespoke two or three decades ago than it is today, and so afforded its analysts a more leisurely pace. Not so in modern data science projects, which are often measured in weeks between significant deliverables.
So the “missingness” of data is often pigeonholed: dumped into an (often black-box) algorithm, or handled with simple business logic that quickly sieves out the offending columns or rows (if the data is tabular/rectangular), as in the sketch below. Thus, the art of really taking the time to understand your data is quickly being lost on the current crop of data science analysts and practitioners, in a field that is itself becoming ever more abstract and is de-emphasizing the “common sense” human intuitions that underlie the kind of analysis described in this text.
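To make that criticism concrete, here is a minimal sketch (my own illustration, not an example from the book) of the kind of quick, mechanical handling I mean, assuming a tabular dataset in pandas: columns with “too much” missingness are dropped, then any remaining row with a hole in it is discarded, with no question asked about why the values are absent.

```python
import numpy as np
import pandas as pd

# Toy table with holes in it (a stand-in for any rectangular dataset).
df = pd.DataFrame({
    "age":    [34, 51, np.nan, 29, 62, np.nan],
    "income": [52_000, np.nan, np.nan, 41_000, np.nan, 38_000],
    "score":  [0.71, 0.64, 0.58, np.nan, 0.80, 0.66],
})

# The "quick sieve": keep only columns that are less than half missing,
# then drop every row that still has a missing value.
cleaned = df.loc[:, df.isna().mean() < 0.5].dropna()

print(cleaned)
```

The result is a tidy rectangle that the modelling pipeline will happily accept, which is exactly why the step so rarely gets a second look.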
Expediency aside, and turning to the complexity of missingness: there are many different ways one’s data can suffer from it. That is precisely why this book is such a good layman’s introduction to the topic; the author lays out a nice, compact system for thinking about the nature of missing data, and provides starting threads of analysis one can pull on to drill down into that missingness and inform the eventual remedy.
The author’s system takes the form of a simple typology: a list of fifteen different kinds of missing data, which collectively define “Dark Data”. The types range from self-selection, knowing what’s missing, and knowing what’s not missing, to missing data that may be time-dependent, to purposefully missing data (gaming or adversarial behaviour), and many others. Each type on the list gets at least one section of commentary and an illustrative example. The first few entries (knowing, or not knowing, when something is missing) provide the perfunctory commentary on the traditional typology of missingness one may have encountered in statistical texts, missing at random (MAR), missing not at random (MNAR), and so on, material that can easily be found freely on the net if it wasn’t covered in class. From there, however, the real value of the text becomes apparent, with in-depth discussions drawn from a range of historical case studies: one covers polling during FDR’s reelection campaign; another takes up issues with experimentation, specifically the randomized controlled trial (RCT), such as differential dropout rates and what they imply for the analysis. In a way, the reader inherits the “folklore” of motivating examples from statistics, both well-known and little-known.
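To show why the MAR/MNAR distinction matters in practice, here is a small sketch of my own (not taken from the book): it simulates a variable, deletes values completely at random in one copy, and preferentially deletes high values in another (a simple not-at-random mechanism), then compares the means computed from the observed values alone. Under random deletion the complete-case mean stays close to the truth; under the value-dependent mechanism it is biased low.

```python
import numpy as np

rng = np.random.default_rng(42)
true_values = rng.normal(loc=100, scale=15, size=100_000)

# MCAR-style deletion: every value has the same 30% chance of going missing.
mcar_mask = rng.random(true_values.size) < 0.30
mcar_observed = true_values[~mcar_mask]

# A simple MNAR-style mechanism: the larger the value, the more likely it is
# to be missing (think of high earners declining to report their income).
p_missing = 1 / (1 + np.exp(-(true_values - 100) / 10))
mnar_mask = rng.random(true_values.size) < p_missing
mnar_observed = true_values[~mnar_mask]

print(f"true mean:        {true_values.mean():6.2f}")
print(f"mean under MCAR:  {mcar_observed.mean():6.2f}")  # close to the truth
print(f"mean under MNAR:  {mnar_observed.mean():6.2f}")  # biased downward
```

Nothing in the observed data flags the second case as dangerous, which is the whole point: the analyst has to reason about the mechanism, not just the rows that survived.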
Overall, I really liked the book. It isn’t a textbook; it is an introduction, and as mentioned previously, a much-needed one. If there is a weak section in the book, it would have to be the chapter dealing with fraud and dark data, which I felt didn’t fit as naturally with the other chapters. This should be on the reading list of every introductory and intermediate data scientist, and even old hands would get some use (I think) from a guided retrospection on these topics and the ways to deal with them. Highly recommended.