Measuring Up: What Educational Testing Really Tells Us
Read March 26 – April 19, 2025
3%
tests are generally very small samples of behavior that we use to make estimates of students’ mastery of very large domains of knowledge and skill.
3%
A test score is just one indicator of what a student has learned—an exceptionally useful one in many ways, but nonetheless one that is unavoidably incomplete and somewhat error prone.
3%
Most of us grew up in a school system with some simple but arbitrary rules for grading tests, such as “90 percent correct gets you an A.” But replace a few hard questions with easier ones, or vice versa—and variations of this sort occur even when people try to avoid them—and “90 percent correct” no longer signifies the level of mastery it did before.
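To see how the same mastery can yield very different percent-correct scores, here is a quick Python sketch using a simple logistic response model; the ability value and item difficulties are invented purely for illustration.

import math

def p_correct(ability, difficulty):
    # Chance a student at this ability level answers the item correctly.
    return 1 / (1 + math.exp(-(ability - difficulty)))

ability = 1.0                                      # one fixed student
easy_items = [-2, -2, -1, -1, -1, 0, 0, 0, 1, 1]   # mostly easy test form
hard_items = [-1, 0, 0, 1, 1, 1, 2, 2, 2, 3]       # mostly hard test form

for name, items in (("easier test", easy_items), ("harder test", hard_items)):
    expected = sum(p_correct(ability, d) for d in items) / len(items)
    print(name, f"expected percent correct: {expected:.0%}")

The same student clears very different percent-correct bars on the two forms, which is why a fixed cutoff like "90 percent correct" stops meaning what it did once the mix of hard and easy items shifts.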
4%
To his evident annoyance, I responded that this would not be a sensible undertaking because there is no optimal design. Rather, designing a testing program is an exercise in trade-offs and compromise, and a judgment about which compromise is best will depend on specifics, such as the particular uses to which scores will be put. For example, the assessment designs that are best for providing descriptive information about the performance of groups (such as schools, districts, states, or even entire nations) are not suitable for systems in which the performance of individual students must be …
4%
This proclivity to associate the arcane with the unimportant is both ludicrous and pernicious.
6%
if we want to measure the mathematics proficiency of eighth-graders, we need to specify what knowledge and skills we mean by “eighth-grade mathematics.” We might decide that this subsumes skills in arithmetic, measurement, plane geometry, basic algebra, and data analysis and statistics, but then we would have to decide which aspects of algebra and plane geometry matter and how much weight should be given to each component. Do students need to know the quadratic formula? Eventually, we end up with a detailed map of what the test should include, often called “test specifications” or a “test …
6%
And it has also resulted in uncountable instances of bad test preparation by teachers and others, in which instruction is focused on the small sample actually tested rather than the broader set of skills the mastery of which the test is supposed to signal.
6%
Specifically, it means only that all examinees face the same tasks, administered in the same manner and scored in the same way. The motivation for standardization is simple: to avoid irrelevant factors that might distort comparisons among individuals.
7%
the test would be too hard for them. Everyone would receive a score of zero or nearly zero, and that would make the test useless: you would gain no useful information about the relative strengths of their vocabularies. List B is no better. The odds are high that all of your applicants would know the definitions of bath, travel, and carpet. Everyone would obtain a perfect or nearly perfect score. Once again, you would learn nothing, in this case because the test would be too easy.
FIG. 2.1. Three words from each of three hypothetical word lists.
8%
Items and tests that discriminate are simply those that differentiate between students with more of whatever knowledge and skills one wants to measure and those with less. In this case, you want items that are more likely to be answered correctly by students with stronger vocabularies. Items that are too hard or too easy can’t discriminate—virtually …
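A minimal sketch of this idea in Python, using the classic upper-minus-lower discrimination index on invented response data (1 = correct, 0 = wrong; students ordered from strongest to weakest total score):

items = {
    "too_easy": [1, 1, 1, 1, 1, 1, 1, 1],   # everyone correct
    "too_hard": [0, 0, 0, 0, 0, 0, 0, 0],   # everyone wrong
    "moderate": [1, 1, 1, 1, 0, 1, 0, 0],   # varies with proficiency
}

def discrimination(responses):
    half = len(responses) // 2
    upper = sum(responses[:half]) / half   # proportion correct, top half
    lower = sum(responses[half:]) / half   # proportion correct, bottom half
    return upper - lower

for name, responses in items.items():
    print(name, discrimination(responses))
# too_easy 0.0, too_hard 0.0, moderate 0.75

The items that everyone gets right or everyone gets wrong score 0.0 and tell you nothing about relative strength; only the moderately difficult item separates the stronger group from the weaker one.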
8%
Discriminating items are simply needed if one wants to draw inferences about relative proficiency. This was clear in the vocabulary example: you chose discriminating items in order to be able to gauge the relative vocabularies of applicants. You did not create differences in vocabulary among your applicants by making this choice; you simply made it possible for the test to reveal the differences that already existed.
8%
The key is the particular inference the teacher wants to base on test scores. She would have no basis for an inference about relative proficiency if she used nondiscriminating items, but she would have a basis for an inference about mastery of that specific material.
9%
In public debate, and sometimes in statutes and regulations as well, we find reference to “valid tests,” but tests themselves are not valid or invalid. Rather, it is an inference based on test scores that is valid or not. A given test might provide good support for one inference but weak support for another. For example, a well-designed end-of-course exam in statistics might provide good support for inferences about students’ mastery of basic statistics but very weak support for conclusions about mastery of mathematics more broadly. Validity is also a continuum: inferences are rarely perfectly …
9%
Inferences of this latter sort are called absolute inferences in the trade: you are comparing a student’s performance not to the performance of others but rather to an absolute standard.
9%
In theory, a student could know no words at all other than the forty and still get a perfect score. In principle, one could teach the forty words, and nothing else, to Koko the gorilla (albeit in sign language), and she could then demonstrate a strong vocabulary on the test.
9%
So as a rough estimate, their vocabularies would have increased by twenty words, from, say, 11,000 words to 11,020, or from 17,000 to 17,020. An improvement, perhaps, but hardly enough to merit comment. There may be cases in which learning what is specifically on the test constitutes substantial improvement, but the general conclusion remains that even inferences about improvement are undermined by certain types of test preparation that focus on the specific sample included in the test.
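The arithmetic behind this passage is easy to make concrete. A rough sketch in Python, with the domain size and scores invented to match the flavor of the example:

DOMAIN = 22_000          # hypothetical size of the vocabulary domain
SAMPLE = 40              # words actually on the test

true_vocab = 11_000
expected_correct = SAMPLE * true_vocab / DOMAIN   # about 20 of 40 known
estimate = (expected_correct / SAMPLE) * DOMAIN   # back to ~11,000 words

# Now "teach to the test": drill the ~20 missed test words, nothing else.
inflated_estimate = (SAMPLE / SAMPLE) * DOMAIN    # 40/40 correct -> ~22,000
true_gain = SAMPLE - expected_correct             # only ~20 new words

print(estimate, inflated_estimate, true_vocab + true_gain)
# 11000.0  22000.0  11020.0

The score-based inference says the student's vocabulary doubled; in fact it grew by about twenty words. That gap is score inflation in miniature.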
9%
teaching students the specific content of the test, or material close enough to it to undermine the representativeness of the test—illustrates the contentious issue of score inflation, which refers to increases in scores that do not signal a commensurate increase in proficiency in the domain of interest.
10%
obscure paper published more than half a century ago by E. F. Lindquist of the University of Iowa, unappealingly entitled “Preliminary Considerations in Objective Test Construction.”
10%
The evidence shows unambiguously that standardized tests can measure a great deal that is of value, and clearly Lindquist believed this. But Lindquist was warning us that however valuable the information from an achievement test, it remains necessarily incomplete, and some of what it omits is very important.
10%
it warns that it is inappropriate to use a score from a single test, without additional information, to assign students to special education, to hold students back, to screen students for first-time enrollment, to evaluate the effectiveness of an entire educational system, or to identify the “best” teachers or schools.2 And, again, this is not the position of anti-testing advocates; it is the advice of the authors of one of the best-known achievement tests in America.
11%
“The only perfectly valid measure of the attainment of an educational objective would be one based on direct observation of the natural behavior of … individuals…. Direct measurement is that based on a sample from the natural, or criterion, behavior … for each individual.”3
11%
But this sort of measurement is clearly impractical, Lindquist maintained, for many reasons. The criterion is delayed, for one. We really can’t afford to wait a decade or two to find out whether this year’s eighth-graders can use algebra in their adult work. Even if we were to wait a decade or two, the criterion behaviors—in this case, applying algebra successfully when appropriate—are often infrequent.
11%
First, he pointed out that naturally occurring samples of behavior are not comparable. For example, suppose that I used a bit of algebra this morning, while a friend of mine, who is the dean of a law school, did not. Does that indicate that I successfully acquired more of this set of dispositions and skills than the dean did?
12%
Second, Lindquist noted that some criterion behaviors are complex, requiring a variety of skills and knowledge. In such cases, if a person performed poorly on one of these criterion behaviors, one would not know why.
12%
Lindquist would have argued that if you want to determine whether third-grade students can manage subtraction with carrying, you give them problems that require subtraction with carrying but that entail as few ancillary skills as possible. You would not embed that skill in complex text, because then a student might fail to solve the problem either for want of these arithmetic skills or because of poor reading, and it would be hard to know which. This principle is still reflected in the design of some tests, but in other cases, reformers and test developers have deliberately moved in the …
12%
For example, we know that teachers’ grading is on average much more lenient in high-poverty schools than in low-poverty schools. By assembling information from several sources that have different strengths and weaknesses, we can obtain a more complete view of what students know and can do.
13%
Scores on a single test are now routinely used as if they were a comprehensive summary of what students know or what schools produce. It is ironic and unfortunate that as testing has become more central to American education, we have strayed ever farther from the astute advice given so long ago by one of the nation’s most important and effective proponents of standardized testing.
14%
How do we know whether a high-school student who runs a mile in a bit over four minutes is a star? Norms again.
14%
In each case, we use norms to make sense of quantitative information that otherwise would be hard to interpret. Nonetheless, in many quarters, norm-referenced reporting of performance on tests has an undeservedly bad name.
14%
norm-referenced reporting can be paired with other forms of reporting that directly compare performance with expectations, such as standards-based reporting (discussed below).
15%
“Norm-reference testing, which a lot of us in the audience grew up on, create[s] winners and losers. You got the top decile, you got your bottom decile, you got your average and above-average. They are designed to designate winners and losers.”3 This is one of the most fundamental misconceptions in the current debate about testing. Tests may “designate” winners and losers, but they don’t create them. There simply are winners and losers. Anywhere you look in the world, even in much more equitable societies, there is enormous variation in how well students perform.
15%
If you choose not to measure that variation then you won’t see it, but it is there regardless.
16%
The shift from using tests for information to holding students or educators directly accountable for scores is beyond a doubt the single most important change in testing in the past half century.
16%
Studies have also begun to shed light on the factors—other than simple cheating—that cause score inflation, such as focusing instruction on material emphasized by the test at the expense of other important aspects of the curriculum; focusing on unimportant details of a particular test; and teaching test-taking tricks. Score inflation is a preoccupation of mine, both because I have been investigating this problem for more than fifteen years and because I think it is one of the most serious hurdles we need to surmount if we are to find more effective ways of using tests for accountability.
17%
At one conference of testing experts during those years, a prominent advocate of performance assessments presented a lecture in which she made precisely this argument. In such an extreme form, this is a silly position; while many real-world problems do not have single correct answers, innumerable ones do.
17%
As the speaker was leaving the building after her presentation, she stopped and told me that she did not know which of the surrounding hotels was the Hilton, where she had a reservation. I pointed out that this was a question with a single correct answer, in response to which she left in a huff without letting me tell her which one it was.
17%
Research has shown that the format of the tasks presented to students does not always reliably predict which skills they will bring to bear, and students often fail to apply higher-order skills to the solution of tasks that would seem to call for them.
18%
In traditional achievement testing, tasks were designed to extract diagnostic information that would enable teachers to improve instruction, but there was no expectation that the tasks used in instruction should resemble those in the test. The phrase “tests worth teaching to” had another connotation as well: tests would be designed such that preparing students for them—teaching to the test—would not lead to score inflation. But this was a logical sleight of hand. There is no reason to expect that a test that is “worth teaching to” in the sense of measuring higher-order skills and the like …
18%
In matrix-sampled testing, the test is broken into a number of different parts that comprise different tasks, and these are then distributed randomly within classrooms or schools. Thus the test is not standardized for comparing individual students, but it is standardized for purposes of comparing schools or states. Matrix sampling is now common; it is used, for example, in NAEP, TIMSS, and some state assessments. The significance of this seemingly arcane innovation is that it allows the testing of a broader range of knowledge and skills—a larger sample from the domain—within a given amount of …
18%
A pure matrix-sampling design does not provide useful scores for individual students because students take different, and therefore not comparable, subtests. Hence the controversy.
18%
one portion of the test is common to all students and is used to provide individual scores, while the remainder is matrix-sampled and contributes only to scores for schools.
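A toy version of this hybrid design in Python may help; the block contents, counts, and names are all invented:

import random
random.seed(0)

common_block = ["c1", "c2", "c3", "c4"]   # every student takes these items
matrix_blocks = [                          # each student gets just one block
    ["a1", "a2", "a3", "a4"],
    ["b1", "b2", "b3", "b4"],
    ["d1", "d2", "d3", "d4"],
]

forms = {f"student_{i}": common_block + random.choice(matrix_blocks)
         for i in range(30)}

covered = {item for form in forms.values() for item in form}
print(len(forms["student_0"]), "items per student;",
      len(covered), "items covered across the school")
# 8 items per student; 16 items covered across the school

Individual scores rest on the common block, which everyone took under the same conditions, while school-level results can draw on the full sixteen-item pool, a much larger sample of the domain for the same testing time.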
19%
This innovation in scoring is now almost universally accepted, it has been incorporated into federal statute, and it is widely considered desirable because it focuses on expectations and is supposedly easy to understand. In fact, however, it exacts a very high, perhaps excessive, cost. The process of setting standards—deciding just how much students have to do to pass muster—is technically complex and has a scientific aura, but in fact the standards are quite arbitrary. The simplicity of this form of reporting is therefore more apparent than real, and most people do not really have a clear …
21%
The SAT mathematics scale runs from 200 to 800, while the ACT mathematics scale runs from 1 to 36. What does this difference in scales indicate? Nothing at all. These scales are arbitrary, have no intrinsic significance, and are not comparable.
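A small illustration of why the numbers themselves are arbitrary: the same relative standing (here, one standard deviation above the mean) can be mapped onto either range by a simple linear transformation. The means and standard deviations below are invented round numbers, not the actual SAT or ACT scaling procedures, which are considerably more involved.

def rescale(z, mean, sd, lo, hi):
    # Clamp a linearly transformed standardized score to the scale's range.
    return max(lo, min(hi, mean + z * sd))

z = 1.0
print(rescale(z, mean=500, sd=100, lo=200, hi=800))   # 600 on one scale
print(rescale(z, mean=21, sd=5, lo=1, hi=36))         # 26 on the other

A 600 and a 26 here describe identical performance; neither number means anything except in relation to the distribution of scores behind it.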
22%
Some insist that the bell curve is the malicious creation of psychometricians who want to create an appearance of differences among groups. Others associate it with the pernicious and unfounded view that differences in test scores between racial and ethnic groups are biologically determined. None of these associations is warranted.
22%
However, when several common conditions are met—when tests assess broad domains, are constructed of items that have a reasonable range of difficulty, and are scaled using most of the currently common methods—scale scores often show a roughly normal distribution, with many students clustered near the average and progressively fewer as one goes both lower and higher. Exceptions are not rare, however. If a test is easy for the students taking it, the distribution will not be normal—it will be asymmetrical, with a tail of low scores but many students piled up near the maximum score. A test that is …
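A quick simulation of the ceiling effect described here, with all parameters invented: the same simulated students take a test targeted to their level and a test that is far too easy for them.

import math, random
from collections import Counter
random.seed(1)

def number_correct(ability, difficulties):
    # Number-correct score under a simple logistic response model.
    return sum(random.random() < 1 / (1 + math.exp(-(ability - d)))
               for d in difficulties)

abilities = [random.gauss(0, 1) for _ in range(10_000)]
targeted = Counter(number_correct(a, [-1.5, -1, -0.5, 0, 0, 0.5, 1, 1.5])
                   for a in abilities)
too_easy = Counter(number_correct(a, [-4, -3.5, -3, -3, -2.5, -2.5, -2, -2])
                   for a in abilities)
print("targeted:", sorted(targeted.items()))
print("too easy:", sorted(too_easy.items()))

The targeted test produces the familiar hump in the middle of the score range; the too-easy test piles most students at or near the maximum of eight points, with a straggling tail of low scores.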
23%
If standardized scales are so handy, why are they never used to present test scores to the public? Because most lay people cannot abide fractional and, worse, negative scores. Imagine a parent receiving a report from her child’s school that said, “Your daughter received a score of 0.50, which put her well above average.” A standardized score of 0.50 is well above average—if scores follow the bell curve exactly, it represents a percentile rank of 69—but it certainly does not seem like it. It is easy to envision a worried parent responding that she does not quite understand how her child …
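If scores really do follow the bell curve, the conversion mentioned here is one line of Python; the 0.50 and the 69th percentile come straight from the passage above.

from statistics import NormalDist

z = 0.50
percentile = NormalDist().cdf(z) * 100
print(f"a standardized score of {z:+.2f} is roughly the {percentile:.0f}th percentile")
# a standardized score of +0.50 is roughly the 69th percentile

This also shows exactly why such reporting confuses parents: nothing about "0.50" looks like "better than about two-thirds of test takers."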
24%
Just how big was the decline in scores that put our nation at risk? There is no single answer, but painting with a broad brush and considering many different sources of data, it would be fair to call the drop “moderately large.”
24%
Such changes are called compositional effects—changes in performance arising from changes in the composition of the tested group. In general, if subgroups that are growing have substantially different average scores than those that constitute a decreasing share of the group, the result is a change in the overall average score stemming simply from these trends in composition.
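The mechanism is just a weighted average, as a sketch with invented numbers shows: each group's mean stays fixed while the mix of test takers shifts.

def overall_mean(groups):
    # groups: (share of test takers, group mean score) pairs; shares sum to 1.
    return sum(share * mean for share, mean in groups)

year_1 = [(0.80, 520), (0.20, 450)]
year_2 = [(0.60, 520), (0.40, 450)]   # same means, lower-scoring group grows

print(overall_mean(year_1))   # 506.0
print(overall_mean(year_2))   # 492.0

The overall average drops fourteen points even though no group's performance changed at all. That is a compositional effect, and it is exactly the trap in interpreting the SAT decline discussed next.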
25%
As college attendance became more common, the proportion of high-school graduates electing to take admissions tests rose, and many of those newly added to the rolls were lower-scoring students. This was studied in considerable detail by the College Entrance Examination Board in the 1970s, and the research showed clearly that a sizable share of the drop in SAT scores was the result of this compositional change. Had the characteristics of the test-taking group remained constant, the decline would have been much smaller.
27%
I focus here on mean differences because changes in the percentage of students reaching a standard are very hard to interpret; they conflate the performance difference between groups with the level at which the standard has been set.