Kindle Notes & Highlights
Read between March 26 - April 19, 2025
In some respects, the arbitrariness of scale scores is less of a problem than that of performance standards. When performance standards are set, the levels expected of students are arbitrary, and the labels attached to them often carry additional, prior meaning, some of which may be unwarranted. Altering these decisions—selecting a different method that happens to move the standard up or down, or giving a standard a different verbal label—can have a major effect on the way people interpret student performance, even though the performance itself would not have changed. Scale scores lack this
...more
The NAEP is an interesting example because the agencies responsible for it have wrestled for years with the problem of adding meaning to scale scores and have devised quite a variety of approaches. One, of course, has been to layer performance standards (“achievement levels”) on top of the scale. But many of the approaches for giving the NAEP scores meaning rely on normative information. For example, normative data are key to making sense of NAEP’s much publicized comparisons among states. How is the commissioner of education in Minnesota to know whether that state’s mean scale score of 290 is
...more
The hoariest of these is grade equivalents, or GEs. These have fallen out of favor over the past several decades, which is a great shame, as they are quite easy to understand and provide an intuitively clear way to think about children’s development. A grade equivalent is simply the typical performance—the performance of the median student—at any grade level. It is usually shown in terms of academic years and months (with 10 academic months per school year). Thus a GE of 3.7 is the median performance of students in March of third grade on the test designed for third-graders. GEs tell you
...more
As useful as they are, GEs have a number of drawbacks. One is that the rate of growth in a subject area is not constant as children get older. For example, the typical child gains reading skills faster in the primary grades than later on. Therefore, a gain of one GE denotes greater growth in the early grades than in later grades. If you want to know, for example, whether the rate at which students learn math slows down or speeds up when they move to middle school, GEs by their very nature cannot tell you. The average student will gain one GE per year regardless.
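To make the mechanics concrete, here is a minimal sketch of how a grade equivalent is read off a norming table. The grade-level medians, the student score, and the nearest-median lookup are all invented for illustration; real norming studies interpolate between testing dates and use far larger samples.

```python
# Hypothetical norming data: median scale score of students tested at each
# point in the school year (grade.month, with 10 academic months per year).
# The numbers are invented for illustration only.
norm_medians = {
    3.0: 185, 3.7: 205, 4.0: 210, 4.7: 228, 5.0: 232,
}

def grade_equivalent(score, norms=norm_medians):
    """Return the grade.month whose median performance is closest to `score`.

    A real norming study interpolates between testing dates; this sketch
    simply picks the nearest tabled median.
    """
    return min(norms, key=lambda ge: abs(norms[ge] - score))

# A score of 207 sits closest to the third-grade, seventh-month median,
# so this (made-up) student earns a GE of about 3.7.
print(grade_equivalent(207))  # -> 3.7
```

The point is only that a GE locates a score against typical performance at each grade and month; it says nothing about how much growth separates adjacent grades.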
For example, suppose that one student showed a gain from 230 to 250 between third grade and fourth, and a second student increased from 245 to 265 between fourth grade and fifth. If these are developmental standard scores, their identical gains of 20 points would ideally mean that both improved their performance by the same amount. Despite the inconsistent and confusing labeling of these scales, you can often identify them by comparing the numbers across grades. If the numbers are similar across grades, the scale is not a developmental standard score, but if the numbers increase from grade to
...more
Just how much facility with basic multiplication is equivalent to a given gain in pre-algebra? A sensible rule of thumb is to treat these scales as approximate and to be increasingly skeptical as the grade range they cover grows larger.
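As a rough illustration of the identification heuristic mentioned earlier (comparing typical scores across grades), here is a small sketch that checks whether typical scores climb from grade to grade; the grade-level means are made up.

```python
# A quick heuristic: if typical scores rise steadily from grade to grade,
# the scale is probably a developmental (vertically linked) standard score;
# if they hover in the same range, it probably is scaled within each grade.
def looks_developmental(means_by_grade):
    values = [means_by_grade[g] for g in sorted(means_by_grade)]
    return all(later > earlier for earlier, later in zip(values, values[1:]))

print(looks_developmental({3: 230, 4: 250, 5: 265}))  # True  -> likely developmental
print(looks_developmental({3: 101, 4: 99, 5: 102}))   # False -> likely within-grade
```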
Technically, the problem is that test scores (like Fahrenheit temperature, but unlike length, speed, or any number of other common measures) are not a ratio scale, which means that zero on the test score scale does not mean “zero achievement.” Zero on most scales is just an arbitrary point. Even on a raw-score scale, where zero means zero items answered correctly, it need not mean “no knowledge of the domain”; it just means no mastery of the particular material on the test. Percentage change is a meaningful metric only in the case of ratio scales.
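A small illustration of why percentage change fails on interval scales: the same ten-degree warming reads as very different percentages in Fahrenheit and Celsius, because both zero points are arbitrary. The temperatures are chosen only for the example.

```python
# The same warming, expressed as a percentage change on two interval scales
# whose zero points are arbitrary. Because neither zero means "no heat,"
# the percentages disagree and neither is meaningful.
def pct_change(old, new):
    return 100 * (new - old) / old

f_old, f_new = 50.0, 60.0                                   # Fahrenheit
c_old, c_new = (f_old - 32) * 5 / 9, (f_new - 32) * 5 / 9   # same temperatures in Celsius

print(round(pct_change(f_old, f_new), 1))   # 20.0  ("a 20% increase")
print(round(pct_change(c_old, c_new), 1))   # 55.6  ("a 56% increase")
```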
Validity is not a characteristic of the test itself. This may seem like splitting hairs, given that we are talking about conclusions that are themselves based on test scores, but it is anything but. One reason this distinction matters is that a given test score can be used to support a wide range of different conclusions, some of which may be justified and others not.
Validity is a continuum, one end of which is anchored by inferences that simply are not justified. At the other end of the spectrum, however, we are rarely fortunate enough to be able to walk away from the table having decided that an inference is valid, pure and simple. Rather, some inferences are better supported than others, but because the evidence bearing on this point is usually limited, we have to hedge our bets.
Until the 1980s and 1990s, direct assessments of writing, in which students actually write essays that are scored, were rare in statewide testing programs. Multiple-choice tests of language arts skills were common. Many critics argued, albeit usually without using the actual phrase, that this was a clear case of construct underrepresentation. Certainly, some skills needed for writing can be assessed with multiple-choice items. But some of the essential skills implied by the construct of “proficiency in writing” can be measured only by having students write. As a consequence, direct assessments
...more
The converse of construct underrepresentation is measuring something unwanted. This goes by the yet uglier term construct-irrelevant variance.
Or suppose that a test of mathematics or science contains some unnecessarily complex language. What will happen to the scores of nonnative speakers of English who have good mastery of mathematics or science but are thrown off by these irrelevant linguistic complexities?
No test of a complex domain can be perfect. Some amount of construct underrepresentation and construct-irrelevant variance is inevitable, even in the case of a superb test. This is one reason that most inferences based on test scores cannot be perfectly valid. But often they are valid enough to be very useful. So how can one determine how valid an inference is? Many types of evidence can be brought to bear. In most discussions of the problem, one finds up to four different types of evidence: analysis of the content of the test, statistical analysis of performance on the test, statistical
...more
Reliability is necessary but not sufficient for validity. Or, to put this differently, one can have a reliable measure without validity, but one cannot have a valid inference without reliability.
In this case also, your inference about your weight, if you measured yourself only once, would not be worth much, despite the lack of bias. That is, its validity would be low because you would often reach the wrong conclusion. You could get a valid inference from this unreliable scale by weighing yourself many times and taking an average, but that is only because the reliability of the average would be much higher than that of a single observation.
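A quick simulated version of the bathroom-scale example; the true weight and the size of the error are invented, and the point is only that averaging many unbiased but noisy readings homes in on the true value.

```python
import random

# A noisy but unbiased scale: each reading is the true weight plus random
# error. Averaging many readings recovers the true weight, because the
# reliability of an average is much higher than that of a single reading.
random.seed(1)
true_weight = 150.0
readings = [true_weight + random.gauss(0, 5) for _ in range(100)]

print(round(readings[0], 1))                    # a single reading may be well off
print(round(sum(readings) / len(readings), 1))  # the average lands close to 150
```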
Results from a casual examination of a test’s content are often labeled face validity, as in “it seems valid on its face,” but people in the business of testing do not consider this real evidence of validity.
Reliance on face validity reached a high point during the wave of enthusiasm for performance assessment, when many reformers and educators assumed that complex tasks necessarily tap higher-order skills better than multiple-choice items do. People were looking for rich, realistic, and engaging tasks to include in tests, so perhaps it was only natural that the tasks themselves, rather than other forms of evidence that I will describe momentarily, became the sine qua non of validity for many people who did not know better. In response, Bill Mehrens, now professor emeritus at Michigan State
...more
Grading tends to be much harsher in mathematics and the physical sciences than in the humanities.
For example, scores on a new mathematics test ought to correlate more strongly with scores on another mathematics test than with scores on a reading test. Strong correlations between theoretically related measures are called convergent evidence of validity; weaker correlations between theoretically unrelated measures are discriminant evidence.
In the same ITBS data, schools’ average scores in reading correlated 0.88 with their average scores in mathematics (Table 9.2), indicating that knowing schools’ average scores in one of these two subjects allows one to predict more than three-fourths of the variation in school means in the second subject. As a consequence, one typically finds only small differences in the correlations between related and unrelated subjects, which makes the use of this convergent and discriminant evidence difficult.
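The "more than three-fourths" figure is simply the squared correlation, which gives the share of variance in one school mean that is predictable from the other; a one-line check:

```python
# r**2 is the proportion of variance in one measure predictable from the other.
r = 0.88
print(round(r ** 2, 2))   # 0.77 -> a bit more than three-fourths
```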
Both states used portfolio assessments in writing and mathematics but also gave other, standardized tests. In Vermont, mathematics portfolio scores correlated about as strongly with a standardized test of writing as with a standardized test of math.6 In Kentucky, the portfolio assessment of mathematics correlated more strongly with the portfolio assessment of writing than with anything else.7 These findings suggest that the mathematics portfolio assessments were measuring things other than mathematics—proficiency in writing and differences among teachers in the way portfolio tasks were
...more
a red flag goes up if members of different groups—say, boys and girls—who have the same score on the test as a whole show markedly different performance on some items. This is a sign of possible bias in those items, which would undermine the validity of inferences for one of the groups.
one expects that, on average, students with higher scores on the test as a whole will perform better on any individual test item than students with lower scores on the test as a whole. If this is not found, the individual item is measuring something different than the rest of the test.
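A bare-bones sketch of that check, using a hand-rolled Pearson correlation between performance on one item and scores on the rest of the test; the response data are invented, and operational programs use more refined item-analysis statistics.

```python
# Does performance on one item rise with performance on the rest of the test?
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

item_scores  = [0, 0, 1, 0, 1, 1, 1, 1]          # right/wrong on one item
total_scores = [12, 15, 18, 20, 24, 27, 30, 33]  # scores on the rest of the test

# A clearly positive correlation is expected; a near-zero or negative value
# suggests the item is measuring something different from the rest of the test.
print(round(pearson(item_scores, total_scores), 2))  # -> 0.75
```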
Campbell’s law in social science: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
“An overwhelming majority of cardiologists in New York say that, in certain circumstances, they do not operate on patients who might benefit from heart surgery, because they are worried about hurting their rankings on physician scorecards issued by the state.”8 Fully 83 percent of respondents said that the reporting of mortality rates had this effect, and 79 percent admitted that “the knowledge that mortality statistics would be made public” had affected their own decisions about whether to perform surgery.
Given the current high-stakes uses of tests, we can be confident of the validity of inferences about improvement only if we have an additional type of validity evidence: a comparison to a second measure less threatened by the possibility of corruption (often called an audit test). The logic of using an audit test is simple: if gains on the tested sample generalize to the domain, they should generalize to other, similar samples from the domain. In this case, the similar sample is the audit test.
For example, some people use "teaching the test" to refer to teaching specific items on the test (clearly bad) and "teaching to the test" to refer to focusing on the skills the test is supposed to represent (presumably good). Others, however, use "teaching to the test" to mean instruction that is inappropriately focused on the details of the test (presumably bad, and likely to inflate scores). I think it's best to ignore all of this and to distinguish instead between seven different types of test preparation: working more effectively; teaching more; working harder; reallocation; alignment; coaching
...more
For example, it is not clear that depriving young children of recess, which many schools are now doing in an effort to raise scores, is effective, and in my opinion it is undesirable regardless. Similarly, if students’ workload becomes excessive, it may interfere with learning. It may also generate an aversion to learning that could have serious repercussions later in life.
And reallocation is not limited to teachers. For example, some studies have found that school administrators reassign teachers to place the most effective ones in the grades in which important tests are given.10
Coaching refers to focusing instruction on small details of the test, many of which have no substantive meaning. For example, if a test happens to use the multiple-choice format for testing certain content, one can teach students tricks that work with that format. One can teach students to write in ways that are tailored to the specific scoring rubrics used with a particular test.
Coaching need not inflate scores. If the format or content of a test is sufficiently unfamiliar, a modest amount of coaching may even increase the validity of scores. For example, the first time young students are given a test that requires filling in bubbles on an optical scanning sheet, it is worth spending a very short time familiarizing them with this procedure before they start the test. Most often, however, coaching either wastes time or inflates scores. Inflation occurs when coaching generates gains that are limited to a specific test—or to others that are very similar—and that do not
...more
bias is an attribute of a specific inference, not of a test.
Similarly, a difference in scores between groups—between poor and rich kids, males and females, blacks and whites, Asian Americans and whites—does not necessarily indicate bias. Bias might contribute to the difference, or it might not. A difference in scores entails bias only if it is misleading (again, for a particular inference).
It is well known that, on average, the schools serving poor children are of lower quality than those serving students from higher-income families. Resources are more limited in schools in low-income areas, for example, and teaching positions are more likely to be filled by inexperienced and uncertified teachers. Now let’s assume—hardly a risky assumption—that some of these differences among schools matter and that, as a result, many poor students learn less in school and end up less well prepared for college. If that is true, tests designed to estimate how well prepared students are for
...more
Thus a difference in scores between groups is a reason to check for bias but not grounds to assume it.
Adverse impact can arise without bias, and conversely, bias can exist even in the absence of any adverse impact.
A common way of examining performance on individual test items goes by the cumbersome name of differential item functioning, or DIF. DIF refers to group differences in performance on a particular test item among students who are comparable in terms of their overall proficiency.
But now suppose that we match boys and girls on their total scores. We ask: do girls and boys with the same test score perform differently on this item? Ideally, the answer would be no. A substantial difference in performance between matched boys and girls would constitute DIF.
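A stripped-down sketch of the DIF idea: group students by total score, then compare how matched boys and girls did on a single item. The responses below are invented, and real DIF analyses rely on more refined methods (such as Mantel-Haenszel statistics) rather than raw comparisons like this.

```python
from collections import defaultdict

# (group, total_score, item_correct) for a handful of hypothetical students
responses = [
    ("girl", 20, 1), ("boy", 20, 0), ("girl", 20, 1), ("boy", 20, 1),
    ("girl", 30, 1), ("boy", 30, 0), ("girl", 30, 1), ("boy", 30, 0),
]

by_score = defaultdict(lambda: defaultdict(list))
for group, total, correct in responses:
    by_score[total][group].append(correct)

for total, groups in sorted(by_score.items()):
    rates = {g: sum(v) / len(v) for g, v in groups.items()}
    print(total, rates)  # large gaps at the same total score are a DIF warning sign
```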
Therefore, DIF can arise from differences in instruction experienced by the average student in various ethnic groups. The same could hold true of social-class differences in performance. Gender differences are another matter—boys and girls are similarly distributed across regions and, for the most part, schools—but at the high-school level, they may choose different courses, and that too may result in meaningful differences in performance that appear as DIF.
To play it safe, test authors often discard items that show very large amounts of DIF, even if they cannot identify its cause. But apart from that, the benefit of DIF is that it allows us to zero in on specific test items that require more examination.
group differences in scores will appear smaller on unreliable tests than on reliable ones. As one colleague of mine quipped years ago, “The easiest way to shrink group differences in performance is to write lousy, unreliable tests.”*
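A small sketch of the arithmetic behind the quip: measurement error inflates the spread of observed scores, so the same real gap between groups looks smaller in observed standard-deviation units on a less reliable test. The gap, spread, and reliability values are invented.

```python
# Reliability is the share of observed-score variance that is true-score
# variance, so lower reliability means a wider observed spread and a
# smaller standardized gap for the same real difference between groups.
true_gap = 10.0   # real difference between group means, in score points
true_sd = 20.0    # spread of true proficiency

for reliability in (0.95, 0.80, 0.50):
    observed_sd = true_sd / reliability ** 0.5        # error widens the observed spread
    print(reliability, round(true_gap / observed_sd, 2))  # standardized gap shrinks
```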
First, the fact that large score differences between social groups need not indicate bias does not imply that they never do. The appropriate response is to treat these score differences as a reason to check for bias.
Their more important argument is that the services delivered to a student should be based on each individual’s functional impediments to learning, not the broad classification into which the child’s disability places him. Two students classified as having different primary disabilities may need the same services, and two others with the same classification may require different services.
Even though accommodations are a deliberate violation of standardization, they share its primary goal: to improve the validity of conclusions based on test scores.
But what happens to a student who has a visual disability and can read the test materials only very slowly and with great strain? Your estimate of proficiency for him will not only be fuzzy but will also be biased downward, lower than his actual proficiency warrants. If you tested him repeatedly, you could reduce the fuzziness but you would zero in on the wrong score. The ideal accommodations would function like a corrective lens, offsetting the disability-related impediments to performance and raising your estimate of the student’s proficiency to the level it should be. This would make the
...more
Thus the purpose of accommodations is not to help students score better but to help them score as well as their actual proficiency warrants, and not higher. In other words, their purpose is to improve validity, not to increase scores.
the impediment this student faced, her lack of visual acuity, was unrelated to the content and skills the test was designed to measure. In the ugly jargon of the trade, she faced “construct-irrelevant” barriers to performing well on the test. Therefore, the effects of the disability on her performance on the standard test were clearly bias: if given the exam in standard form, her score would imply a lower level of mastery than she had actually attained. If we could find an accommodation that would do nothing but offset this impediment, validity would be increased.

