Dark Data: Why What You Don’t Know Matters

Name: Dark Data: Why What You Don’t Know Matters
Rating: 3.56 (28 reviews)
ISBN: 9780691182377

David J. Hand

A practical guide to making good decisions in a world of missing data

In the era of big data, it is easy to imagine that we have all the information we need to make good decisions. But in fact the data we have are never complete, and may be only the tip of the iceberg. Just as much of the universe is composed of dark matter, invisible to us but nonetheless present, the universe of information is full of dark data that we overlook at our peril. In Dark Data, data expert David Hand takes us on a fascinating and enlightening journey into the world of the data we don't see.

Dark Data explores the many ways in which we can be blind to missing data and how that can lead us to conclusions and actions that are mistaken, dangerous, or even disastrous. Examining a wealth of real-life examples, from the Challenger shuttle explosion to complex financial frauds, Hand gives us a practical taxonomy of the types of dark data that exist and the situations in which they can arise, so that we can learn to recognize and control for them. In doing so, he teaches us not only to be alert to the problems presented by the things we don't know, but also shows how dark data can be used to our advantage, leading to greater understanding and better decisions.

Today, we all make decisions using data. Dark Data shows us all how to reduce the risk of making bad ones.

Genres: Nonfiction, Science, Technology, Audiobook, Philosophy, Mathematics, Research

344 pages, Hardcover

First published February 18, 2020

94 people are currently reading

758 people want to read

About the author

David J. Hand

44 books, 59 followers

David J. Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 25 books: his next book, The Improbability Principle, is due out in February 2014. He has broad research interests in areas including classification, data mining, anomaly detection, and the foundations of statistics. His applications interests include psychology, physics, and the retail credit industry - he and his research group won the 2012 Credit Collections and Risk Award for Contributions to the Credit Industry. He was made OBE for services to research and innovation in 2013.

Community Reviews

5 stars: 38 (15%)
4 stars: 85 (34%)
3 stars: 105 (42%)
2 stars: 18 (7%)
1 star: 2 (<1%)

Displaying 1 - 28 of 28 reviews

Chris Esposo

680 reviews, 56 followers

September 21, 2020

Hands down, one of the most useful layman texts I’ve read all year, perhaps in the past 3 years. “Dark Data”, written by British statistician David Hand, discusses one of the least understood and least talked-about subjects in the data-analysis process flow: dealing with missing data. Why this is the case is not exactly clear, but it probably comes down to two reasons. One, the rapidity of work expected from data scientists, not only in executing their scripts, writing code, and implementing algorithms, but overall in gaining valuable “insights” for the company. Thus, when tasked to quickly build a predictive model that could serve as an engine for some scoring process, the focus is often on “traditional” machine learning development: the test-train-validation loop, and then the subsequent report or production deliverables. Two, the complexity of the topic of missingness (more on that later).

What the team or the client often focuses on less in these fast-paced projects is the nature of the data itself. That is, how was it produced, what does that production imply about the nature of the data (its fidelity), and of course, what information is missing from this data. Classical statistical processes of yesteryear would have been very methodical along these lines, but of course statistical consulting was even more bespoke 2 or 3 decades ago than it is today, and thus afforded its analysts a more leisurely pace. Not so in modern data science projects, where the time between significant deliverables is often measured in weeks.

So the “missingness” of data is often pigeonholed into some process to be dumped into an (often black-box) algorithm, or into simple business logic that quickly sieves out offending columns or rows (if the data is tabular/rectangular). Thus, the art of really taking time to understand your data is quickly becoming lost on the current spate of data science analysts and practitioners, in a field that is itself becoming increasingly abstract and is de-emphasizing the “common sense” human intuitions that underlie the kind of analysis described in this text.

Besides the expediency, and with respect to the complexity of missingness, there are actually many different ways one’s data can suffer from missingness. In fact, that’s why this book is a great layman intro to the topic: the author has outlined a nice, compact system for thinking about the nature of missing data, and provides starting threads of analysis one can follow to drill down on that missingness and inform the eventual ameliorating solution.

The author’s system takes the form of a simple typology, a list of 15 different kinds of missing data, which collectively form the definition of “Dark Data”. The types range from self-selection, knowing what’s missing, knowing what’s not missing, and missing data that may be time-dependent, to purposefully missing data (gaming or adversarial behavior), and many others. Each type in the list gets at least one section of commentary and an illustrative example. The first few on the list (knowing when something is or is not missing, or not knowing it) provide the perfunctory commentary on the traditional typology of missingness that one may have encountered in statistical texts, missing at random (MAR), missing not at random (MNAR), etc., and this material can easily be found for free online if not encountered in class. However, from there, the real value of the text becomes more apparent, with in-depth discussions of a range of historical case studies (polling during FDR’s reelection campaign); others discuss issues with experimentation, specifically the randomized controlled trial (RCT), such as differential dropout rates and what they may imply for analysis. In a way, a student inherits the “folklore” of motivating examples from statistics, both well-known and little-known.

Overall, I really liked the book. It isn’t a textbook; it is an introduction, and as mentioned previously, a much-needed one. If there was a weak section in the book, I think it would have to be the chapter dealing with fraud and dark data; I just didn't feel it fit as naturally with the other chapters. I think this should be on the reading list for every introductory and intermediate data scientist, and even old hands would get some use (I think) from a guided retrospection on these topics and ways to deal with them. Highly recommended.

Josh Friedlander

813 reviews, 132 followers

November 26, 2021

Pop-stats book about missing data: how to detect it, how to reason about it, how to compensate for it (single and multiple imputation, the EM algorithm, bootstrapping) and ways we use it to our advantage ("blinding" study participants, anonymisation, Bayesian priors). A clear presentation of the material, made accessible to non-statisticians; the book at times suffers from trying to cover too much.

contemporary maths pop-stats

Hristo Milev

69 reviews, 1 follower

October 19, 2022

I bought this book expecting to read about non-standard sources of data and new ways to use existing data. But it only repeats the classic problems of censoring and sample bias, which can be found in any introductory statistics textbook (often using the same textbook examples), and which anyone with basic data literacy would have had to deal with ad nauseam. I don't see why this book was necessary.

Popup-ch

890 reviews, 24 followers

August 20, 2020

A lot of important data is lost, not collected or simply not available. This book tries to shine a light on the fact that a lot of important data is dark - sometimes for very good reasons.
The book is rather sprawling and lacks focus. By trying to shoehorn subjects as disparate as Sokal's hoax and the Piltdown man into the same text as underrepresented social groups it ends up very disjointed.

2020 kindle non-fiction

Sofia

175 reviews, 6 followers

March 15, 2023

I was expecting to embark on a fascinating journey to the depths of the dark web or something equally mysterious, but unfortunately I was misled by the title. The author goes through, in detail and with numerous examples, how we live in constant darkness because we never have all the data about any one thing. Dark data is the data that's missing, and there are different types.

The first third of the book was fairly interesting with different types of dark data being explained and illustrated with examples. However, the rest was very focused on how to identify and deal with dark data when conducting scientific research. It wasn't quite textbook style but not quite popular science either. It was certainly more than I was interested in knowing though and I struggled through the final third by having it playing in the background.

The book was narrated very well by the author himself but I would stay clear of this one unless research and understanding the pitfalls of data are interesting or relevant to you.

read-in-2023

Jaime Gacitua

28 reviews, 2 followers

February 2, 2022

A bunch of stories and examples about data.
I find it risky and misleading to rename concepts that are already defined in academia.

Levi Marasco

15 reviews

December 11, 2024

Dark Data: Why What You Don't Know Matters wasn't really for me. I think that this book would be good for someone that doesn't think too much about statistics, wants to learn more about how studies are conducted, or has never had a college class in statistics and wants to learn more about the topic.

According to the author, Dark Data is similar to dark matter (and is named after it). We don't know either of their origins, but we do know that both affect the data that we are observing.

I found a lot of the information contained in this book to not be fresh. While the main points are true and important, I felt it was nothing that I haven't already learned about during my undergrad.

The main emphasis of the book is to never take the data you observe at face value and always be thoughtful, honest, and careful when collecting data. Dig deeper to see if you can find more reasons why what you are observing is happening.

If more of X leads to less of Y, Z might be in the background causing X to increase and thus actually more of Z leads to less of Y. If we were to only look at X and Y, we would have completely missed the well of information being given to us by Z.
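The X, Y, Z pattern described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not an example taken from the book: the variable names, the coefficients, the noise levels, and the sample size are all made up. Here a hidden Z pushes X up and Y down, so X and Y look negatively related even though neither influences the other.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def residuals(target, predictor):
    """What is left of `target` after a least-squares fit on `predictor`."""
    n = len(target)
    mt, mp = sum(target) / n, sum(predictor) / n
    slope = (sum((p - mp) * (t - mt) for p, t in zip(predictor, target))
             / sum((p - mp) ** 2 for p in predictor))
    return [(t - mt) - slope * (p - mp) for t, p in zip(target, predictor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000

# Assumed toy model: Z drives X up and Y down; X has no direct effect on Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [-zi + rng.gauss(0, 1) for zi in z]

raw_corr = pearson(x, y)                                # Z left in the dark
adj_corr = pearson(residuals(x, z), residuals(y, z))    # Z accounted for

print(f"correlation of X and Y ignoring Z:  {raw_corr:.2f}")
print(f"correlation of X and Y adjusting Z: {adj_corr:.2f}")
```

In this toy setup the raw correlation comes out clearly negative, while the correlation after removing the part explained by Z should sit close to zero. Looking only at X and Y would suggest a relationship that disappears once Z is brought into view.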

This is all fine and good. Good enough for me to recommend to someone wanting to learn more about statistics. The reason I gave this book 3 stars is because it is very long. There are a lot of examples, which is good. However, those examples are very exhaustive and feel like you are going down a rabbit trail. There were many times during this book where the author would wrap up an example, bring it back to the original point, and I would think to myself "oh, we're still on this subject?"

If you can power through the many examples and are interested in learning more about statistics, then I would recommend checking this book out.

audiobooks non-fiction

Todd Martinez

35 reviews

February 1, 2021

This book is basically about statistical and logical errors caused by the absence of data. Very reminiscent of Donald Rumsfeld's famous quote about known unknowns and unknown unknowns. Overall, worth reading.

Tao Hu

68 reviews, 4 followers

July 19, 2020

Too trivial and very easy to get lost in the not well-structured examples. Now giving up.

Chris

177 reviews

August 15, 2020

This book is written for a non-technical audience, and this review is written from this perspective.

The first 200 pages constitute a useful, if slightly repetitive, survey of different ways that data can be "dark" - i.e., misleading. It might be wrong, it might be missing, it might be an unrepresentative sample, it might be the wrong data for the question to hand, the underlying reality might have changed since the data was collected and so forth. To be reminded in so much depth that we need to be mindful of what data is missing and how it might be wrong is useful.

The last 100 pages are the real meat, and it's unfortunate they are so brief. There is a good discussion of how data can be missing (viz: the missing data is dependent on present data, the missing data is not dependent on any other data, or the missing data is dependent on other data which is also missing). The discussion of how to mitigate this issue is a bit brief. There is also a good if brief discussion of how deliberately darkening the data can be used to drive better analytical outcomes.
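The three ways of going missing listed above correspond to what statistics texts usually call missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). The sketch below is a rough illustration rather than anything from the book: the two-group data set, the keep probabilities, and the cutoff are invented for the example, and the group-weighted mean is just one simple adjustment that relies on the group label always being observed.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
N = 20000

# Each record is (group, value). The group is always observed;
# the value is the thing that may go dark.
records = []
for _ in range(N):
    g = rng.randint(0, 1)
    v = 10 + 2 * g + rng.gauss(0, 2)
    records.append((g, v))

true_mean = sum(v for _, v in records) / N
cutoff = 11.0  # assumed threshold used only by the MNAR mechanism

def mask(data, keep_prob):
    """Replace a value with None when it is not kept."""
    return [(g, v if rng.random() < keep_prob(g, v) else None) for g, v in data]

def naive_mean(data):
    """Mean of whatever values happen to be visible."""
    seen = [v for _, v in data if v is not None]
    return sum(seen) / len(seen)

def group_adjusted_mean(data):
    """Average the visible values per group, weighted by each group's full share."""
    total = 0.0
    for g in (0, 1):
        share = sum(1 for gg, _ in data if gg == g) / len(data)
        seen = [v for gg, v in data if gg == g and v is not None]
        total += share * sum(seen) / len(seen)
    return total

# MCAR: missingness ignores everything.
mcar = mask(records, lambda g, v: 0.5)
# MAR: missingness depends only on the observed group.
mar = mask(records, lambda g, v: 0.9 if g == 0 else 0.3)
# MNAR: missingness depends on the value that goes missing.
mnar = mask(records, lambda g, v: 0.9 if v < cutoff else 0.3)

for name, data in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
    print(f"{name}: naive={naive_mean(data):.2f} "
          f"adjusted={group_adjusted_mean(data):.2f} true={true_mean:.2f}")
```

Under these assumptions the naive mean is roughly right for MCAR, too low for MAR until the observed group is used to reweight, and too low for MNAR even after that adjustment, because the information needed to correct it is itself missing.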

All in all, for me, this book hovers between 2 stars and 3. 3, because the final section touches on important issues. 2, because it only touches on them and doesn't do more than superficially brush the sand off a rich vein of ore. Perhaps that's a limitation of a popular science book written for a non-technical audience, and it's only so obvious when you are really in the technical audience. Overall 3 stars.

tech

Amelia

590 reviews, 21 followers

October 25, 2022

“All this means, first, is that it’s necessary to be very clear about what question you are asking, and, second, that whether data are dark or not will depend on that question. Trite though it may sound, the data you need to collect, the analysis you will undertake, and the answer you will get depend on what you want to know.”

What is invisible to us when we, the non-statisticians, read about statistics? What do we lack in our own personal critical thinking? What questions are mathematicians asking? Or not asking? How can we read data and understand what went into it? These are the questions that Hand asks and answers. This is considered "Dark Data", which Hand lists, including time-dependent, purposefully obscured data, unrepresentative, and more.

As an introductory text to this way of thinking, it hits the mark. Through examples ranging from everyday statistics to military, financial, and demographic cases, Hand shows us just why it's so important to acknowledge what isn't acknowledged.

nonfiction

Ugh

662 reviews, 40 followers

February 6, 2025

Two-thirds a quite bad popular science book; one-third a seemingly good or very good practical guide for beginner-level data practitioners. Why two-thirds bad? The first part is a greatest hits meander through some basic statistics, scientific practice, industry tales etc that I think will likely bore and quite possibly annoy anyone with expertise or past reading history in those areas. It's not concise or focused, and some parts read like outright filler. Do I want a two-page summary of nuclear fusion in a book about data? I do not. The second part, on how to correct for missing data in statistical treatments, seems far more useful and better edited, and might be a great 80-odd pages for a professional researcher who isn't an expert statistician, but this part certainly isn't very interesting for a general reader, and the package as a whole doesn't really work.

Gillian Covillo

8 reviews

June 9, 2022

The book wasn’t bad but it wasn’t great either. I would have to say the first half was very solid and I loved reading it. I felt like I was learning tons of useful and fun information, but then a lot of the second half felt repetitive; as other users have mentioned, it was as if he was just providing more and more examples for things he’d already explained. I was struggling to read more than 15-20 pages at a time with interest and without falling asleep towards the end. I want to emphasize that this wasn’t because it was written badly, just not as exciting as the first half of the book.
Overall it’s a good viewpoint on data and I don’t regret reading it, as I know it will come in handy.

Brandon Kinney

18 reviews, 1 follower

April 16, 2024

Did not finish… read to page 48. Found the topic interesting but the author’s tone a bit judgmental and opinionated. The topic is statistics and the way missing or dark data can impact the results of surveys and studies. I would finish if this field had more to do with my day to day occupation as the systematic breakdown of the information is informative. My TBR stack is too big to spend more time on this topic, however.

Boris

155 reviews

May 28, 2020

A good and important read, but it takes a bit of scientific background to fully comprehend. The problem is that people with scientific backgrounds are mostly aware of dark data already. Still, a good problem overview and cool categorization. I wish it were a book for everyone, though, since this is a very important issue.

Emma Johns

108 reviews, 4 followers

September 3, 2022

This book is not quite a textbook, not quite popular science. I think it would be better if it shifted towards one or the other. It wasn’t the most thrilling read - I wasn’t really expecting it to be - but it did remind me of some aspects of working with data and made me aware of some others. A good foundation for my career change from maths to data science.

Anna Piphany

98 reviews, 2 followers

June 13, 2023

Very important concepts and ideas, which is why I rate the book highly, and also for its earnest and sweet approach and its gentle, poetic anecdotes. A very, very slow read, however: many, many words and not for much reason.

Alex S

60 reviews, 1 follower

December 19, 2024

This reads like a sequel to the old-but-gold How to Lie with Statistics, with some more up-to-date anecdotes. The “DD taxonomy” is a taxonomy only in Borges’s ironic sense; nevertheless, I think it’s worth keeping at hand and revisiting from time to time.

data-science philosophy-of-science research

Wagner Crivelini

57 reviews, 1 follower

December 20, 2024

It's a good book, despite the fancy name.
It's important that data professionals develop their skills in statistics. But most people wouldn't read it if "Statistics" appeared in the title.
All in all, I liked the book and recommend it!

Aliaksei Mukhachou

61 reviews, 1 follower

December 2, 2023

Giving it 3.6 rounded up to 4. 90% of contents is trivial for someone with any degree of econometrics training. Zeroes could avail themselves well.

João Pedro Lopes

53 reviews, 1 follower

December 24, 2023

https://medium.com/@jphpwii/dark-data...

Ashley

131 reviews

December 11, 2024

Wow, that was boring - it read like a textbook, so it literally might be one. Not at all what I expected, so it doesn’t feel fair to really rate … did not retain a thing

Duncan Murch

3 reviews, 1 follower

January 6, 2025

A must-read for any data professional

Anne

9 reviews, 1 follower

August 23, 2024

One of the best books I've read this year. Moving from a focus on data to absent, hidden, concealed, etc. data is powerful.
Insightful, funny, very clear due in large part to the writing style and abundant use of examples.
Useful for everyone with even the slightest interest in, say, understanding the news, spicy big data claims, or policy plans better. Or those with simply a knack for data and research.

economie engelse-non-fictie politiek-en-samenleving

Prabhat

32 reviews, 1 follower

December 22, 2022

It took me a generous amount of time to finish this book. I work in the field of big data and am interested in data science. I also like philosophical questions around data and what we can/cannot know with data. What I found most enlightening in this book is the typology of dark data, which encompasses missing data among many other kinds. Typologies are super helpful for professionals like myself, who constantly encounter these problems but don’t always have a framework for making distinctions between them. I think the more or less pedagogical tone of the book was helpful for understanding the framework. For example, at my work we have dark data related to known unknowns, unknown unknowns and self-selected unknowns, but at times we talk about all these different types of dark data as the same thing in one breath. While ethical concerns are at the front of our minds when dealing with dark data, we may need to approach decision making differently depending on the type of dark data. I really hope to be able to use the DD framework introduced in this book more often when discussing the quality of data.

Ashar Malik

59 reviews, 2 followers

November 2, 2020

This made for a great read. Although I was already familiar with most of the material discussed in the book, the term "Dark Data" was new to me. The way dark data is dealt with in this book made for a very interesting read. The book is well written and explains the subject matter quite nicely. The simple way in which it is written makes it, in my opinion, a good read for everyone from beginners to experts. While the book uses generic examples from the data sciences to cover common cases, the way the author explains the subject matter makes it quite easy to extend the concepts to one's own area of research. The book is definitely recommended to all, especially people working in data sciences in any capacity.