Kindle Notes & Highlights
by
Tim Harford
Read between June 25 – July 13, 2022
Big found datasets can seem comprehensive, and may be enormously useful, but “N = All” is often a seductive illusion: it’s easy to make unwarranted assumptions that we have everything that matters.
An algorithm, meanwhile, is a step-by-step recipe[*] for performing a series of actions, and in most cases “algorithm” means simply “computer program.” But over the past few years, the word has come to be associated with something quite specific: algorithms have become tools for finding patterns in large sets of data.
“Found” datasets can be huge. They are also often relatively cheap to collect, updated in real time, and messy—a collage of data points collected for disparate purposes.
As our communication, leisure, and commerce are moving to the internet, and the internet is moving into our phones, our cars, and even our spectacles, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.
cheerleaders for big data have made three exciting claims, each one reflected in the success of Google Flu Trends. First, that data analysis produces uncannily accurate results. Second, that every single data point can be captured—the “N = All” claim we met in the previous chapter—making old statistical sampling techniques obsolete (what that means here is that Flu Trends captured every single search). And finally, that scientific models are obsolete, too: there’s simply no need to develop and test theories about why searches for “flu symptoms” or “Beyoncé” might or might not be correlated
…
a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.
Making big data work is harder than it seems. Statisticians have spent the past two hundred years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster, and cheaper these days, but we must not pretend that the traps have all been made safe. They have not.
I hope I’ve persuaded you that we shouldn’t be too eager to entrust our decisions to algorithms. But I don’t want to overdo the critique, because we don’t have some infallible alternative way of making decisions. The choice is between algorithms and humans. Some humans are prejudiced. Many humans are frequently tired, harassed, and overworked. And all humans are, well, human.
we should compare the fallibility of today’s algorithms with that of the humans who would otherwise be making the decisions.
Hannah Fry’s book Hello World.
Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgment. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis.
the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate.
Alchemy is not the same as gathering big datasets and developing pattern-recognizing algorithms. For one thing, alchemy is impossible, and deriving insights from big data is not. Yet the parallels should also be obvious. The likes of Google and Target are no more keen to share their datasets and algorithms than Newton was to share his alchemical experiments.
There’s gold in the data that Amazon, Apple, Facebook, Google, and Microsoft have about us. And that gold will be worth a lot less to them if the knowledge that produces it is shared with everyone.
just as the most brilliant thinkers of the age failed to make progress while practicing in secret, secret algorithms based on secret data are likely to lead to missed opportunities for improvement.
Onora O’Neill argues that if we want to demonstrate trustworthiness, we need the basis of our decisions to be “intelligently open.” She proposes a checklist of four properties that intelligently open decisions should have. Information should be accessible: that implies it’s not hiding deep in some secret data vault. Decisions should be understandable—capable of being explained clearly and in plain language. Information should be usable—which may mean something as simple as making data available in a standard digital format. And decisions should be assessable—meaning that anyone with the time
…
anyone who is confident of the effectiveness of their algorithm should be happy to demonstrate that effectiveness in a fair and rigorous test.
We need to look on a case-by-case basis. What sort of accountability or transparency we want depends on what problem we are trying to solve.
Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes.
We should not simply trust that algorithms are doing a better job than humans, nor should we assume that if the algorithms are flawed, the humans would be flawless.
Now, it’s one thing to be wrong, or to have a view of the world that misses out on something important. But, argues Scott, because the state is powerful, its misperceptions of the world often take physical form, producing well-meaning but clumsy and oppressive modernist schemes that ignore local knowledge and stifle local autonomy.
States should be humble. Bureaucrats must recognize the limits of their knowledge. There is always a risk that the bird’s-eye view is so grand and sweeping as to induce delusions of omnipotence.
the tactic of simply refusing to collect basic statistics could only make sense for a libertarian, laissez-faire regime. And the truth is that very few people seem attracted by that prospect. For better or worse, we want our governments to take action, and if they are to take action they need information. Statistics collected by the state make for better-informed policies—on crime, education, infrastructure, and much else.
There is nothing wrong with the idea that government should collect statistics to inform itself. But there is a risk that this view slips into a proprietorial sense of ownership, when politicians believe not only that they should be using statistics to run the country, but that those statistics are none of anyone else’s business, and that external scrutiny is a distraction. The facts are no longer the facts—they become the tools of the powerful.
Good statistics don’t just serve government planners; they are valuable to a far wider group of people.
This isn’t just about making money; it’s about making sure that citizens have access to accurate information about the world in which they live.
Much of the data visualization that bombards us today is decoration at best, and distraction or even disinformation at worst. The decorative function is surprisingly common, perhaps because the data visualization teams of many media organizations are part of the art departments. They are led by people whose skills and experience are not in statistics but in illustration or graphic design.[4] The emphasis is on the visualization, not on the data. It is, above all, a picture.
Data visualization ducks can be more than tasteless: the duckness of the graph can actually obscure—or worse, it can misrepresent—the underlying information.
The most straightforward problem with a clever decorative idea is that the basic data may not be solid. The visualization then simply hides that fact—the shimmering icing over a moldering statistical cake.
So information is beautiful—but misinformation can be beautiful, too. And producing beautiful misinformation is becoming easier than ever.
“A good chart isn’t an illustration but a visual argument,” declares Alberto Cairo near the beginning of his book How Charts Lie.
by organizing and presenting the data, we are inviting people to draw certain conclusions. And just as a verbal argument can be logical or emotional, sharp or woolly, clear or baffling, honest or misleading, so too can the argument made by a chart.
When you look at data visualizations, you’ll do much better if you recognize that someone may well be trying to persuade you of something. There is nothing wrong with artfully persuasive graphs, any more than with artfully persuasive words. And there is nothing wrong with being persuaded and changing your mind.
our preconceptions are powerful things. We filter new information. If it accords with what we expect, we’ll be more likely to accept it.
Our brains are always trying to make sense of the world around us based on incomplete information. The brain makes predictions about what it expects, and tends to fill in the gaps, often based on surprisingly sparse data.
Our brains fill in the gaps—which is why we see what we expect to see and hear what we expect to hear,
we can also filter new information consciously, because we don’t want it to spoil our day.
One of the reasons facts don’t always change our minds is that we are keen to avoid uncomfortable truths.
we make a forecast with the facts that are in front of our nose.
But it is a better idea to zoom out and find one very straightforward[*] statistic: In general, how many marriages end in divorce? This number is known as the “base rate.”
The importance of the base rate was made famous by the psychologist Daniel Kahneman, who coined the phrase “the outside view and the inside view.” The inside view means looking at the specific case in front of you:
The outside view requires you to look at a more general “comparison class” of cases—here,
Ideally, a decision maker or a forecaster will combine the outside view and the inside view—or, similarly, statistics plus personal experience. But it’s much better to start with the statistical view, the outside view, and then modify it in the light of personal experience than it is to go the other way around. If you start with the inside view you have no real frame of reference, no sense of scale—and can easily come up with a probability that is ten times too large, or ten times too small.
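The "start with the outside view, then adjust" procedure can be sketched as a Bayesian update in odds form. This is a minimal illustration, not anything from the book itself: the base rate, the likelihood ratio, and the divorce figure below are made-up assumptions chosen only to show the arithmetic.

```python
# Illustrative sketch: combining the outside view (a base rate) with the
# inside view (case-specific evidence), using Bayes' rule in odds form.
# All numbers here are assumptions for demonstration, not real statistics.

def update_with_evidence(base_rate: float, likelihood_ratio: float) -> float:
    """Start from the base rate, then adjust it by a likelihood ratio
    summarizing how much more likely our inside-view evidence is
    when the event in question actually occurs."""
    prior_odds = base_rate / (1 - base_rate)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Outside view: suppose (hypothetically) 40% of marriages end in divorce.
base_rate = 0.40

# Inside view: suppose what we observe about this particular couple is
# twice as likely among marriages that last (likelihood ratio of 0.5).
estimate = update_with_evidence(base_rate, likelihood_ratio=0.5)
print(round(estimate, 2))  # 0.25
```

Starting from the base rate anchors the estimate: even strong-seeming inside-view impressions only move the probability by a factor, rather than letting us pick a number with no frame of reference at all.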
Second, keeping score was important.
Third, superforecasters tended to update their forecasts frequently as new information emerged, which suggests that a receptiveness to new evidence was important. This willingness to adjust predictions is correlated with making better predictions in the first place:
superforecasting is a matter of having an open-minded personality.
The superforecasters are what psychologists call “actively open-minded thinkers”—people who don’t cling too tightly to a single approach, are comfortable abandoning an old view in the light of fresh evidence or new arguments, and embrace disagreements with others as an opportunity to learn.
“For superforecasters, beliefs are hypotheses to be tested, not treasures to be guarded.”
superforecasting means being willing to change your mind.
“Making public commitments ‘freezes’ attitudes in place. So saying something dumb makes you a bit dumber. It becomes harder to correct yourself.”