Naked Statistics: Stripping the Dread from the Data
Kindle Notes & Highlights
1%
I am not impressed by fancy formulas that have no real-world application. I particularly disliked high school calculus for the simple reason that no one ever bothered to tell me why I needed to learn it. What is the area beneath a parabola? Who cares?
3%
Andrejs Dunkels: It’s easy to lie with statistics, but it’s hard to tell the truth without them.
4%
The Gini index measures how evenly wealth (or income) is shared within a country on a scale from zero to one.
4%
A country in which every household had identical wealth would have a Gini index of zero. By contrast, a country in which a single household held the country’s entire wealth would have a Gini index of one. As you can probably surmise, the closer a country is to one, the more unequal its distribution of wealth.
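The two extremes described above can be checked numerically. The following is a minimal sketch that assumes the Gini index is computed with the mean absolute difference formula; the household wealth figures are hypothetical and are not data from the book.

```python
def gini(values):
    """Gini index via the mean absolute difference formula.

    G = sum over all ordered pairs |x_i - x_j| / (2 * n * sum(x))
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one household and positive total wealth")
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)


# Hypothetical country A: 1,000 households with identical wealth.
equal = [50_000] * 1000

# Hypothetical country B: one household holds everything.
one_holder = [0] * 999 + [50_000_000]

print(gini(equal))       # perfectly equal distribution
print(gini(one_holder))  # maximally unequal distribution
```

With a finite number of households n, the single-holder case gives (n - 1) / n under this formula, so it approaches one as the population grows rather than equaling one exactly.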
4%
The Gini index for the United States was .41 in 1997 and grew to .45 over the next decade. (The most recent CIA data are for 2007.) This tells us in an objective way that while the United States grew richer over that period of time, the distribution of wealth grew more unequal.
4%
Inequality in Canada was basically unchanged over the same stretch. Sweden has had significant economic growth over the past two decades, but the Gini index in Sweden actually fell from .25 in 1992 to .23 in 2005, meaning that Sweden grew richer and more equal over that period.
4%
“Data is merely the raw material of knowledge.” Statistics is the most powerful tool we have for using information to some meaningful end, whether that is identifying underrated baseball players or paying teachers more fairly.
4%
We use numbers, in sports and everywhere else in life, to summarize information.
5%
That makes it a nice descriptive statistic. It’s easy to calculate, it’s easy to understand, and it’s easy to compare across students.
5%
overreliance on any descriptive statistic can lead to misleading conclusions, or cause undesirable behavior.
5%
One key function of statistics is to use the data we have to make informed conjectures about larger questions for which we do not have full information. In short, we can use data from the “known world” to make informed inferences about the “unknown world.”
5%
One important statistical practice is sampling, which is the process of gathering data for a small area, say, a handful of census tracts, and then using those data to make an informed judgment, or inference, about the homeless population for the city as a whole. Sampling requires far less resources than trying to count an entire population; done properly, it can be every bit as accurate.
5%
The polling and research firm Gallup reckons that a methodologically sound poll of 1,000 households will produce roughly the same results as a poll that attempted to contact every household in America.
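One way to see why roughly 1,000 respondents can be enough is the textbook margin of error for a proportion. This is a sketch under stated assumptions (a simple random sample, a very large population, a 95 percent confidence level with z = 1.96, and the worst case p = 0.5); it is not a description of Gallup's actual methodology.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n from a large population.
    p = 0.5 is the worst case; z = 1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (100, 1_000, 10_000):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")
```

Under these assumptions a sample of 1,000 gives a margin of about 3 percentage points, and making the sample ten times larger only shrinks that to about 1 point, which is why polling far more households buys relatively little extra accuracy.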
5%
The whole gambling industry is built on games of chance, meaning that the outcome of any particular roll of the dice or turn of the card is uncertain. At the same time, the underlying probabilities for the relevant events—drawing 21 at blackjack or spinning red in roulette—are known. When the underlying probabilities favor the casinos (as they always do), we can be increasingly certain that the “house” is going to come out ahead as the number of bets wagered gets larger and larger, even as those bells and whistles keep going off.
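The claim that the house becomes more certain to win as bets accumulate can be illustrated with a small simulation. This sketch assumes even-money bets on red at an American roulette wheel (18 red pockets out of 38), a detail that is not in the highlight; the function name and the fixed seed are illustrative choices.

```python
import random


def house_profit_per_bet(num_bets, p_player_wins=18 / 38, seed=0):
    """Simulate one-unit, even-money bets and return the house's average profit per bet."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    profit = 0
    for _ in range(num_bets):
        if rng.random() < p_player_wins:
            profit -= 1  # player wins, house pays one unit
        else:
            profit += 1  # player loses, house keeps one unit
    return profit / num_bets


expected_edge = 1 - 2 * (18 / 38)  # the house's expected profit per unit bet
print("expected edge:", round(expected_edge, 4))

for n in (10, 1_000, 200_000):
    print(n, "bets ->", round(house_profit_per_bet(n), 4))
```

Over a handful of bets the average can land well away from the expected edge, in either direction. Over hundreds of thousands of bets it settles close to it.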
5%
When individuals and firms cannot make unacceptable risks go away, they seek protection in other ways. The entire insurance industry is built upon charging customers to protect them against some adverse outcome, such as a car crash or a house fire.
6%
The scientific method dictates that if we are testing a scientific hypothesis, we should conduct a controlled experiment in which the variable of interest (e.g., smoking) is the only thing that differs between the experimental group and the control group. If we observe a marked difference in some outcome between the two groups (e.g., lung cancer), we can safely infer that the variable of interest is what caused that outcome.
6%
Regression analysis is the tool that enables researchers to isolate a relationship between two variables, such as smoking and cancer, while holding constant (or “controlling for”) the effects of other important variables, such as diet, exercise, weight, and so on.
6%
use regression analysis to do two crucial things: (1) quantify the association observed between eating bran muffins and contracting colon cancer (e.g., a hypothetical finding that people who eat bran muffins have a 9 percent lower incidence of colon cancer, controlling for other factors that may affect the incidence of the disease); and (2) quantify the likelihood that the association between bran muffins and a lower rate of colon cancer observed in this study is merely a coincidence—a quirk in the data for this sample of people—rather than a meaningful insight about the relationship between ...more
8%
descriptive statistics, which are the numbers and calculations we use to summarize raw data.
8%
Or I can just tell you that at the end of the 2011 season Derek Jeter had a career batting average of .313. That is a descriptive statistic, or a “summary statistic.” The batting average is a gross simplification of Jeter’s seventeen seasons. It is easy to understand, elegant in its simplicity—and limited in what it can tell us.
8%
Per capita income is a simple average: total income divided by the size of the population. By that measure, average income in the United States climbed from $7,787 in 1980 to $26,487 in 2010 (the latest year for which the government has data). Voilà! Congratulations to us. There is just one problem. My quick calculation is technically correct and yet totally wrong in terms of the question I set out to answer. To begin with, the figures above are not adjusted for inflation. (A per capita income of $7,787 in 1980 is equal to about $19,600 when converted to 2010 dollars.) That’s a relatively ...more
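The gap between the nominal and the inflation-adjusted comparison can be worked out from the three figures quoted in the highlight above. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Figures quoted in the highlight.
nominal_1980 = 7_787            # per capita income in 1980 dollars
nominal_2010 = 26_487           # per capita income in 2010 dollars
real_1980_in_2010 = 19_600      # the 1980 figure expressed in 2010 dollars (approximate)

# Growth when inflation is ignored.
nominal_growth = nominal_2010 / nominal_1980 - 1

# Growth when both years are expressed in 2010 dollars.
real_growth = nominal_2010 / real_1980_in_2010 - 1

# Price-level factor implied by the two 1980 figures.
implied_price_factor = real_1980_in_2010 / nominal_1980

print(f"nominal growth: {nominal_growth:.0%}")   # roughly 240%
print(f"real growth:    {real_growth:.0%}")      # roughly 35%
print(f"implied price factor: {implied_price_factor:.2f}")
```

Most of the apparent increase disappears once both years are measured in the same dollars.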
8%
explosive growth in the incomes of the top 1 percent can raise per capita income significantly without putting any more money in the pockets of the other 99 percent. In other words, average income can go up without helping the average American.
9%
The first descriptive task is often to find some measure of the “middle” of a set of data, or what statisticians might describe as its “central tendency.” What is the typical quality experience for your printers compared with those of the competition? The most basic measure of the “middle” of a distribution is the mean, or average.
9%
average number of quality problems per printer sold for your firm and for your competitor.
9%
the average number of quality problems per printer sold.
9%
That was easy. You’ve just taken information on a million printers sold by two different companies and distilled it to the essence of the problem: your printers break a lot. Clearly it’s time to send a short e-mail to your boss quantifying this quality gap and then get back to day eight of Kim Kardashian’s marriage.
9%
The mean, or average, turns out to have some problems in that regard, namely, that it is prone to distortion by “outliers,” which are observations that lie farther from the center. To get your mind around this concept, imagine that ten guys are sitting on bar stools in a middle-class drinking establishment in Seattle; each of these guys earns $35,000 a year, which makes the mean annual income for the group $35,000. Bill Gates walks into the bar with a talking parrot perched on his shoulder. (The parrot has nothing to do with the example, but it kind of spices things up.) Let’s assume for the ...more
9%
For this reason, we have another statistic that also signals the “middle” of a distribution, albeit differently: the median. The median is the point that divides a distribution in half, meaning that half of the observations lie above the median and half lie below. (If there is an even number of observations, the median is the midpoint between the two middle observations.)
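The different sensitivity of the mean and the median to an outlier can be shown with the bar-stool example. The ten $35,000 incomes come from the highlights; the newcomer's income is truncated in the source, so the $1 billion used here is a hypothetical stand-in and not the book's figure.

```python
import statistics

# Ten patrons, each earning $35,000 a year (from the highlighted example).
incomes = [35_000] * 10

# HYPOTHETICAL outlier income, chosen only for illustration.
outlier_income = 1_000_000_000

with_outlier = incomes + [outlier_income]

print("mean before:  ", statistics.mean(incomes))
print("median before:", statistics.median(incomes))
print("mean after:   ", round(statistics.mean(with_outlier)))
print("median after: ", statistics.median(with_outlier))
```

A single extreme observation pulls the mean into the tens of millions, while the median, the middle of the eleven sorted values, stays at $35,000.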
9%
For distributions without serious outliers, the median and the mean will be similar.
9%
Because the distribution is nearly symmetrical, the mean and median are relatively close to one another.
9%
The distribution is slightly skewed to the right by the small number of printers with many reported quality defects. These outliers move the mean slightly rightward but have no impact on the median.
10%
What becomes clear is that your firm does not have a uniform quality problem; you have a “lemon” problem; a small number of printers have a huge number of quality complaints. These outliers inflate the mean but not the median. More important from a production standpoint, you do not need to retool the whole manufacturing process; you need only figure out where the egregiously low-quality printers are coming from and fix that.
10%
Neither the median nor the mean is hard to calculate; the key is determining which measure of the “middle” is more accurate in a particular situation
10%
the median divides a distribution in half. The distribution can be further divided into quarters, or quartiles. The first quartile consists of the bottom 25 percent of the observations; the second quartile consists of the next 25 percent of the observations; and so on.
10%
Each percentile represents 1 percent of the distribution, so that the 1st percentile represents the bottom 1 percent of the distribution and the 99th percentile represents the top 1 percent of the distribution.
10%
The benefit of these kinds of descriptive statistics is that they describe where a particular observation lies compared with everyone else. If I tell you that your child scored in the 3rd percentile on a reading comprehension test, you should know immediately that the family should be logging more time at the library. You don’t need to know anything about the test itself, or the number of questions that your child got correct. The percentile score provides a ranking of your child’s score relative to that of all the other test takers. If the test was easy, then most test takers will have a high ...more
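A percentile score as a ranking relative to other test takers can be sketched as follows. The definition used here (the percentage of scores strictly below a given score) is one common convention among several, and both sets of test scores are made up for illustration.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores that fall strictly below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical hard test: 100 takers with scores spread evenly from 0 to 99.
hard_test = list(range(0, 100))

# Hypothetical easy test: 100 takers with scores bunched between 50 and 99.
easy_test = list(range(50, 100)) * 2

raw_score = 60
print(percentile_rank(raw_score, hard_test))  # 60.0
print(percentile_rank(raw_score, easy_test))  # 20.0
```

The same raw score ranks very differently depending on how everyone else did, which is the information the percentile carries and the raw score does not.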
10%
An “absolute” score, number, or figure has some intrinsic meaning. If I shoot 83 for eighteen holes of golf, that is an absolute figure. I may do that on a day that is 58 degrees, which is also an absolute figure. Absolute figures can usual...
10%
If I place ninth in the golf tournament, that is a relative statistic. A “relative” value or figure has meaning only in comparison to something else, or in some broader context, such as compared with the eight golfers who shot better than I did. Most standardized tests produce results that have meaning only as a relative statistic.
10%
In this case, the percentile (the relative score) is more meaningful than the number of correct answers (the absolute score).
10%
Another statistic that can help us describe what might otherwise be a jumble of numbers is the standard deviation, which is a measure of how dispersed the data are from their mean. In other words, how spread out are the observations?
10%
On the basis of the descriptive tools introduced so far, the weights of the airline passengers and the marathoners are nearly identical. But they’re not. Yes, the weights of the two groups have roughly the same “middle,” but the airline passengers have far more dispersion around that midpoint, meaning that their weights are spread farther from the midpoint.
10%
The weights of the airline passengers are “more spread out,” which is an important attribute when it comes to describing the weights of these two groups. The standard deviation is the descriptive statistic that allows us to assign a single number to this dispersion around the mean.
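How one number can capture "more spread out" is easier to see with a small calculation. The weights below are hypothetical (five people per group rather than the 250 in the example) and are chosen so that both groups share the same mean. The sketch uses the population standard deviation; whether the book divides by n or by n - 1 is not stated in these highlights.

```python
import math
import statistics


def population_std_dev(values):
    """Square root of the average squared distance from the mean."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


# Hypothetical weights in pounds; both groups have a mean of 150.
marathoners = [140, 145, 150, 155, 160]
passengers = [100, 125, 150, 175, 200]

for name, weights in (("marathoners", marathoners), ("passengers", passengers)):
    print(name,
          "mean:", statistics.mean(weights),
          "std dev:", round(population_std_dev(weights), 1))

# The hand-written function agrees with the standard library.
assert math.isclose(population_std_dev(passengers), statistics.pstdev(passengers))
```

Both groups have the same "middle," but the passengers' standard deviation is five times larger, which is the difference the mean alone hides.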
11%
There is natural variation in the HCb2 count, as there is with most biological phenomena (e.g., height). While the mean count for the fake chemical might be 122, plenty of healthy people have counts that are higher or lower. The danger arises only when the HCb2 count gets excessively high or low.
11%
the standard deviation is a measure of dispersion, meaning that it reflects how tightly the observations cluster around the mean.
11%
For many typical distributions of data, a high proportion of the observations lie within one standard deviation of the mean (meaning that they are in the range from one standard deviation below...
11%
far fewer observations lie two standard deviations from the mean, and fewer still lie three or four standard deviations away.
11%
Some distributions are more dispersed than others. Hence, the standard deviation of the weights of the 250 airline passengers will be higher than the standard deviation of the weights of the 250 marathon runners. A frequency distribution with the weights of the airline passengers would literally be fatter (more spread out) than a frequency distribution of the weights of the marathon runners.
11%
In fact, we can do even better than “not very many.” This is a good time to introduce one of the most important, helpful, and common distributions in statistics: the normal distribution. Data that are distributed normally are symmetrical around their mean in a bell shape that will look familiar to you.
11%
The normal distribution describes many common phenomena.
11%
The beauty of the normal distribution—its Michael Jordan power, finesse, and elegance—comes from the fact that we know by definition exactly what proportion of the observations in a normal distribution lie within one standard deviation of the mean (68.2 percent), within two standard deviations of the mean (95.4 percent), within three standard deviations (99.7 percent), and so on. This may sound like trivia. In fact, it is the foundation on which much of statistics is built.
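The proportions quoted above can be reproduced from the normal distribution itself. This sketch relies on the standard identity that the share of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)); the function name is illustrative.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * share_within(k):.2f}%")
```

To two decimals the values are 68.27, 95.45, and 99.73 percent, so the 68.2 and 95.4 figures in the highlight match when truncated to one decimal place.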