Kindle Notes & Highlights
Read between March 26 - April 19, 2025
Not all racial and ethnic differences favor whites. On tests of mathematics and science, one often finds whites lagging behind Asian Americans.
It is essential to keep in mind what these data do and do not tell us. They indicate only that the average differences between racial and ethnic groups are substantial. But these average differences, even when very large, tell us nothing about individual students. The variability within any of these racial or ethnic groups is very large, almost as large as in the student population as a whole. There are some black and Hispanic students who score very high relative to the distribution of white students, and there are Asian American students who score well below the white average. Also, these
...more
The first complication is that the tests used for international comparisons are, like all others, small samples of content, and the decisions about the sampling of content and the format of its presentation matter. For example, what percentage of mathematics test items should be allocated to algebra? (The decision was approximately 25 percent on the TIMSS test and NAEP tests but only 11 percent on the PISA test.) What aspects of algebra should be emphasized, and how should they be presented? What formats should be used, and in what mix? TIMSS and NAEP are roughly two-thirds multiple-choice
...more
Statements such as “the United States scored at the international average” do not mean much when the international average can move up or down depending on which countries elect to participate in a given assessment.
Yet more surprising was the ordering of countries: by and large, the social and educational homogeneity of countries does not predict homogeneity of student performance. Some small and homogeneous countries—for example, Tunisia and Norway—do have relatively small standard deviations of scores, but many do not. For example, in all three TIMSS surveys, the standard deviation of scores in Japan and Korea—both countries that are socially more homogeneous than the U.S. and that have more homogeneous education systems through the eighth grade—was roughly similar to the U.S. standard deviation.
FIG. 7.1. A sample MCAS report to parents. Massachusetts Department of Education, MCAS Tests of Spring 2002, Parent/Guardian Report.
If you were to take repeated measurements and the average of those repeated measures gradually approached the right answer, you would have measurement error. If the average of repeated measures stayed incorrect, even with a great many measurements, you would have bias.
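A minimal sketch of the distinction, using an invented bathroom-scale example; the true weight, noise level, and bias offset are all assumptions chosen for illustration.

```python
import random

random.seed(0)
true_weight = 70.0  # kg; the "right answer" the scale is trying to capture

def noisy_scale(true_value, noise_sd=0.5):
    """A scale with measurement error only: readings scatter around the truth."""
    return true_value + random.gauss(0, noise_sd)

def biased_scale(true_value, noise_sd=0.5, offset=1.5):
    """A scale with bias: readings scatter around a systematically wrong value."""
    return true_value + offset + random.gauss(0, noise_sd)

n = 10_000
avg_noisy = sum(noisy_scale(true_weight) for _ in range(n)) / n
avg_biased = sum(biased_scale(true_weight) for _ in range(n)) / n

print(f"average of noisy readings:  {avg_noisy:.2f}")   # approaches 70.0: error averages out
print(f"average of biased readings: {avg_biased:.2f}")  # stays near 71.5: bias does not
```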
The technical reports accompanying most large-scale assessments usually include estimates of reliability, called internal consistency reliability statistics, that take into account only the measurement error that arises from the selection of items to construct that particular form of the test.
This is the reason for the common but misguided advice that if you are trying to lose weight, you should weigh yourself only infrequently rather than daily. This is poor advice because if you compare only two measurements, say one week apart, the randomness of your scale’s behavior and the fluctuations in your own weight together will add error to both estimates and create a substantial risk that the comparison will be entirely misleading unless your weight change has been large enough to overwhelm these inconsistencies. A better, if compulsive, approach would be to take frequent measurements
...more
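A sketch of that "better, if compulsive" approach, with invented daily weigh-ins: averaging each week's readings before comparing weeks lets much of the noise cancel out.

```python
# Hypothetical daily weigh-ins (kg) over two weeks: scale noise plus real day-to-day fluctuation.
daily = [80.2, 79.6, 80.5, 79.9, 80.1, 79.4, 80.0,
         79.7, 79.3, 79.9, 79.2, 79.6, 79.0, 79.4]

def weekly_averages(readings, days=7):
    """Average each consecutive block of `days` readings to smooth out noise."""
    return [sum(readings[i:i + days]) / days for i in range(0, len(readings), days)]

week1, week2 = weekly_averages(daily)
print(f"week 1 average: {week1:.2f} kg")
print(f"week 2 average: {week2:.2f} kg")
print(f"estimated change: {week2 - week1:+.2f} kg")  # steadier than comparing two single weigh-ins
```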
(For this reason, all work in my classes is graded anonymously, with students identified solely by ID numbers, and students take their exams on computers to avoid issues raised by handwriting. We add names only at the end of the semester so that we can take other factors into account in assigning final grades, such as class participation and extenuating circumstances.)
The dashed lines represent a distance of one standard error of measurement above or below the average. The range in this particular case is 66 points, 33 in each direction from the mean, which is similar to the standard error of measurement on the SAT. Roughly two-thirds of the simulated observations lie within that range. This is true in general: an examinee with any given true score, taking a test once, has a probability of about two-thirds of getting a score within the range from one SEM below that score to one SEM above, and a probability of one in three of obtaining a score more than one
...more
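A quick simulation of the two-thirds figure, assuming normally distributed measurement error; the SEM of 33 echoes the example above, and the true score of 500 is an arbitrary choice.

```python
import random

random.seed(1)
true_score = 500   # arbitrary true score for one examinee
sem = 33           # standard error of measurement, echoing the example above

observed = [random.gauss(true_score, sem) for _ in range(100_000)]
share_within_one_sem = sum(abs(s - true_score) <= sem for s in observed) / len(observed)

print(f"share of simulated scores within +/- 1 SEM: {share_within_one_sem:.2f}")  # ~0.68, about two-thirds
```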
However, if scores on the test were used as only one piece of information contributing to the decision to admit or reject students, a modest amount of measurement error would have little impact.
In large-scale assessment programs, the most reliable tests have reliability coefficients in the range of .90 or a bit higher.
One is to ask, for a given reliability coefficient, how large is the band of error around an individual score? For example, even though the SAT is a highly reliable test, with a reliability coefficient over .90, the standard error of measurement is more than 30 points, similar to that shown in Figure 7.2. A reliability coefficient of .80 indicates an error band roughly 40 percent larger, and a reliability coefficient of .70 indicates an SEM almost 75 percent larger.
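These percentages follow from the usual relationship SEM = SD × sqrt(1 − reliability); the standard deviation of 100 below is purely illustrative, since only the ratios matter here.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd = 100  # illustrative score-scale standard deviation (not a published figure)
for r in (0.90, 0.80, 0.70):
    print(f"reliability {r:.2f}: SEM = {sem(sd, r):.1f}, "
          f"{sem(sd, r) / sem(sd, 0.90):.2f}x the SEM at reliability 0.90")
```

At a reliability of .80 the SEM comes out roughly 1.4 times (about 40 percent larger than) the SEM at .90, and at .70 roughly 1.7 times (about 75 percent larger), matching the figures in the passage.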
A second way is to ask, with a given reliability coefficient, how well can you predict a second score by knowing the first? If one has a first set of scores for a group of students, a reliability coefficient of .90 indicates that these first scores allow you to predict about 80 percent of the variability in the second scores. With a reliability coefficient of .70, one can predict only about half of the variability in a second set of scores.
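A plausible reading of the arithmetic, treating the reliability coefficient as the correlation between two administrations, so that the share of variability "predicted" is its square:

```python
for r in (0.90, 0.70):
    print(f"reliability {r:.2f}: about {r * r:.0%} of the variability "
          f"in a second set of scores is predictable")
```

This yields about 81 percent and 49 percent, in line with the passage's "about 80 percent" and "about half."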
The table shows that unless the standard is set either very low or very high, a substantial number of students would be reclassified if retested. If the standard is set near the middle, so that anywhere from 30 to 70 percent of students pass, 12 to 14 percent of students would be classified differently if tested a second time.
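A rough simulation of the reclassification effect; the reliability of .90, the normal score distribution, and the cut score near the middle are all assumptions, not figures from the table.

```python
import random

random.seed(2)

n_students = 100_000
reliability = 0.90
true_sd = reliability ** 0.5         # split unit total variance into true-score and error parts
error_sd = (1 - reliability) ** 0.5

true_scores = [random.gauss(0, true_sd) for _ in range(n_students)]

def observed(true_score):
    """One administration: true score plus fresh measurement error."""
    return true_score + random.gauss(0, error_sd)

cut = 0.0  # a middle cut score: roughly half of the students pass
first_pass = [observed(t) >= cut for t in true_scores]
second_pass = [observed(t) >= cut for t in true_scores]

reclassified = sum(a != b for a, b in zip(first_pass, second_pass)) / n_students
print(f"students classified differently on the retest: {reclassified:.1%}")  # around 14% under these assumptions
```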
This is yet another of the unavoidable trade-offs in measurement. In designing a test of a large domain such as fourth-grade mathematics, one would want reasonably broad coverage of the domain to support the conclusions of interest, but that breadth of content will reduce reliability. As the reliability coefficients above suggest, with careful work, the authors of tests of broad domains can in fact attain high levels of internal consistency reliability, but this is nonetheless constrained by the breadth of the test. One of the most important influences on reliability is the
...more
In educational testing, the individual items in a test are analogous to individual readings on your scale. The more of them you include in the test, the more the measurement error in each one of them will average out and the more reliable the test score will be.
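The standard Spearman-Brown prophecy formula (not named in the passage) quantifies this averaging-out effect; the base reliability of 0.70 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor` with comparable items."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)

base = 0.70  # hypothetical reliability of a short form
for k in (1, 2, 4):
    print(f"{k}x as many items: projected reliability {spearman_brown(base, k):.2f}")
```

Doubling the hypothetical short form raises its projected reliability from .70 to about .82, and quadrupling it to about .90, which is the averaging-out effect the passage describes.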
Thus, for the most part, these changes represent the greater sampling error when relatively few students are tested, not meaningful changes in the performance of small schools. In other words, it is not really the case that many small schools are rapidly improving or deteriorating while the performance of larger schools is remaining stable.
the more groups a school has that must be reported separately, the more likely it is that the school will fail solely because of sampling error.
This question was a matter of some debate among members of the profession only a few years ago, but it is now generally agreed that sampling error is indeed a problem even if every student is tested. The reason is the nature of the inference based on scores. If the inference pertaining to each school in Figure 7.3 were about the particular students in that school at that time, sampling error would not be an issue, because almost all of them were tested.
Rather, they are interested in conclusions about the performance of schools. For those inferences, each successive cohort of students enrolling in the school is just another small sample of the students who might possibly enroll, just as the people interviewed for one poll are a small sample of those who might have been.
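A small simulation of why small schools look volatile even when nothing real changes: every cohort below is drawn from the same underlying population (the mean of 250, the SD of 40, and the school sizes are invented).

```python
import random
import statistics

random.seed(3)

def cohort_mean(n_students):
    """Mean score of one year's cohort, drawn from the same unchanging population."""
    return statistics.mean(random.gauss(250, 40) for _ in range(n_students))

def year_to_year_changes(n_students, n_years=500):
    means = [cohort_mean(n_students) for _ in range(n_years)]
    return [abs(b - a) for a, b in zip(means, means[1:])]

for size in (25, 100, 400):
    changes = year_to_year_changes(size)
    print(f"school testing {size} students: median year-to-year change "
          f"{statistics.median(changes):.1f} points, from sampling error alone")
```

The smallest school shows year-to-year swings several times larger than the largest one, even though the simulated population never changes at all.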
In a meeting some years ago, teachers in a small school in Maryland were puzzling over a noticeable drop in scores that lasted a single year in each grade and moved up one grade each year, much as a recently consumed rat might be seen moving through a python. One teacher offered this explanation: “That’s Leo.” She was referring to a disruptive student who managed to bring down the performance of every class he was in. Whether or not she was correct in the specifics, I don’t know, but her explanation was reasonable in pointing to sampling error as the likely culprit. When I mention this in one
...more
When cut scores are used, a common response to measurement error is to allow students who fail the test a second chance to take it, to lessen the probability that students will fail only because of measurement error.
In their view, to label a student above average creates a false sense of success if the average itself is unacceptably low. And beyond that, the critics of norm-referenced reporting did not want value-neutral, purely descriptive reporting. They wanted to evaluate performance by comparing it with explicit goals.
content standards are statements of what students should know and be able to do, and performance standards indicate how much they should know and be able to do.
they almost always believe there is some underlying truth about performance, some real but hidden level of achievement that constitutes being “proficient,” that is somehow revealed by the complex methods used to set standards. Or, at the very least, that the standards set clearly break the continuum of performance into unambiguous categories.
That is, they imagined students whose performance barely qualified as proficient. Then they had to estimate the probability that these imaginary, marginally proficient students would answer each item correctly.
After the first round of ratings, the procedure used by NAEP and many other testing programs does introduce some actual data about performance, called impact data. Panelists are given the actual percentage of students who answered each item correctly—the percentage of all students, not the percentages of students in the imagined groups just above each of the standards, which have not yet been set. This adds a norm-referenced element to the standards because the impact data are in fact normative data about performance. NAEP also provides panelists with data about the variation in ratings among
...more
They are then asked to go through the items in order of difficulty and to stop at the item that they believe would be answered correctly by a specified percentage of the marginally proficient students they have imagined. This percentage, called the response probability, is often set at 67 percent, but there is no compelling reason why it has to be, and panels have used a variety of response probabilities ranging at least from 50 percent to 80 percent.
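A stylized sketch, not NAEP's actual procedure: assuming a Rasch-type item response model, an arbitrary ability for the imagined borderline student, and an invented booklet of ordered item difficulties, the stopping point shifts substantially as the response probability changes.

```python
import math

def p_correct(ability, difficulty):
    """Rasch-style probability that a student at `ability` answers an item of `difficulty` correctly."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# Hypothetical ordered item booklet: difficulties from easy to hard (in logits).
difficulties = [d / 10 for d in range(-20, 21, 2)]   # -2.0, -1.8, ..., 2.0
borderline_ability = 0.0                             # assumed ability of the marginally proficient student

for rp in (0.50, 0.67, 0.80):
    # The bookmark marks the last item the borderline student is expected to answer
    # correctly with probability at or above the chosen response probability.
    last_ok = max(i for i, d in enumerate(difficulties)
                  if p_correct(borderline_ability, d) >= rp)
    print(f"response probability {rp:.2f}: bookmark placed after item {last_ok + 1} of {len(difficulties)}")
```

With these invented numbers, the bookmark lands after item 11 for a response probability of .50, after item 7 for .67, and after item 4 for .80, which is the sensitivity to an arbitrary choice that a later passage points out.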
in the case of these two common methods, estimates of the item-level performance of imagined groups of students—is not entirely confidence-inspiring. Whatever the pros and cons of these methods, they are not a means of uncovering some “true” or objective standard that is waiting to be discovered. This makes the resulting standards a lot less compelling than many people think they are, but it need not render them worthless by any means, and in fact there is a long-standing debate among measurement experts about their utility.
In 1989, Richard Jaeger, then unquestionably one of the world’s leading experts on standard setting, published a comprehensive review in which he showed that the results of standard setting are generally inconsistent across methods. He reviewed thirty-two published comparisons and calculated the ratio of the percentages of students labeled as failing by different standard-setting methods. In the typical case (the median), the harsher method of standard setting categorized fully half again as many students as failing as did the more lenient method, and some studies found far larger ratios.7
...more
Judges also have been found to underestimate the difficulty of hard items and overestimate the difficulty of easy items, which can lead them to set higher standards when the items they evaluate are more difficult.8 Changing the response probability used with the bookmark method—an arbitrary choice—can have dramatic effects on the placement of the standards.9
Performance standards are also often inconsistent across grades or among subjects within a grade.
Even leaving aside all of these many inconsistencies, standards-based reporting has a serious drawback: it obscures a great deal of information about variations in student performance. This is a consequence not of the judgmental nature of standards but rather of the coarseness of the resulting scale. As described above, most standards-based systems have three or four performance standards that create four or five ranges or categories for reporting performance. Information about differences among students within any one of those ranges does not register. And those unnoted differences can be
...more
“Doesn’t this mean that the system is failing African American students and that they are falling farther behind?” To her evident annoyance, I told her that I had no idea and that I would need different data to answer her question. The problem, I explained to her, is that when performance is reported in terms of standards, comparisons of trends in performance between two groups that start out at different levels—such as whites and African Americans in Boston—are almost always misleading. There are two different statistics used for this purpose. One is the statistic she used: the composition of
...more
Therefore, I told the audibly impatient reporter, the difference between groups that she gave me did not directly measure whether African Americans in Boston schools were falling farther behind whites. That might have been the case, but it was also possible that it was not. With the simple statistics that used to be routinely reported, such as mean scale scores, I could have told her. But differences among groups in terms of standards-based statistics, particularly changes in the percent exceeding the proficient standard, now dominate reporting of the achievement gap.
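A sketch of why the two statistics can diverge; the cut score, group means, gain, and standard deviation below are all invented for illustration, not Boston data.

```python
from statistics import NormalDist

def pct_at_or_above(mean, sd, cut):
    """Share of a normally distributed group scoring at or above the cut score."""
    return 1 - NormalDist(mean, sd).cdf(cut)

cut = 250                                                           # hypothetical "proficient" cut score
groups = {"higher-scoring group": 250, "lower-scoring group": 190}  # same SD, different starting means
sd, gain = 40, 10                                                   # both groups improve by 10 scale points

for name, mean in groups.items():
    before = pct_at_or_above(mean, sd, cut)
    after = pct_at_or_above(mean + gain, sd, cut)
    print(f"{name}: percent proficient {before:.0%} -> {after:.0%} ({after - before:+.0%})")
```

Both groups gain the same 10 points in mean scores, so the gap in means is unchanged, yet the percent-proficient gap widens (roughly +10 points versus +4 points here) simply because the lower-scoring group's distribution sits farther from the cut score. That is why the reporter's statistic could not, by itself, show whether the gap in achievement was growing.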
Normative data often creep into standard setting. Sometimes this happens during the initial standard-setting process, as when panelists are given impact data. Sometimes it happens after the fact, when policymakers decide that the process has resulted in unreasonable or unacceptable standards. This modest reliance on normative data notwithstanding, standards are sometimes set at levels that normative data suggest are unreasonable.
We can find out, for example, that in 2000, 22 percent of fourth-graders in Maryland reached or exceeded the proficient standard. Standards-based, right? But here is the rub: how do you know what to make of this result? Is 22 percent high or low? One way to find out is to compare this percentage to the percentages in other states. NAEP conveniently displays the percentages for all states together, ranked from highest to lowest. (Remember the panty hose chart in Chapter 7?) These charts report performance in terms of state norms—that is, by comparison to a distribution of the performance of
...more
For example, assume that you have to appoint a new coach for a middle-school track team. One applicant comes in brimming with enthusiasm and announces that his target is to have half of the distance runners clocking three-and-a-half-minute miles within a year. What do you do? You send him packing, because he is either utterly incompetent or a liar.
In other words, you rely on norm-referenced information to tell you that the level of performance he promises is absurd. This example is contrived, but the fact is that we use normative information constantly in all aspects of life—to evaluate the gas mileage of cars, to decide whether a purchase is too expensive, and so on. Testing is no different.
Given all the problems that arise when student achievement is reported in terms of a few performance standards, what should be done? In a recent article in which he outlined a number of the most serious weaknesses of standards-based reporting, Robert Linn of the University of Colorado suggested that we distinguish the cases in which we do need to make binary, up-or-down decisions based on a test—for example, in setting a passing score on a written driving test or in using tests as a minimum criterion for professional licensure or certification—from those in which we do not need to do so. He
...more
This is the kind of test scoring we all grew up with, and it does have some utility. After every exam in my classes today, I present a graph showing the distribution of raw scores. This gives students some valuable norm-referenced information: a comparison of their performance with that of the rest of the class.
Unfortunately, many of the scales used in educational testing are nonlinear transformations of each other. For the most part, the choice of scale does not affect rankings—students who score higher on one scale will score higher on another—but it does change comparisons. Two groups that show the same improvement over time on one of these scales may not show the same improvement on another.
for example, the SAT scale (within a subject area) runs from 200 to 800, while that of the ACT runs from 1 to 36. There is no substantive reason for this difference; no one would argue that a student who reaches the top score on the SAT knows twenty-two times as much as a student who reaches the maximum score on the ACT. And the ranking of college applicants would not be altered if ACT scoring were switched to match the SAT scale, or the College Board decided to change the SAT scoring to the ACT scale.
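A toy illustration with an invented, deliberately nonlinear rescaling (not the actual SAT-ACT concordance): rankings survive a monotone transformation, but equal gaps on one scale need not stay equal on the other.

```python
def to_other_scale(score):
    """A made-up monotone but nonlinear rescaling of a 200-800 scale onto a 1-36 style scale."""
    return (score / 800) ** 2 * 36

students = {"Ana": 500, "Ben": 600, "Cal": 700}

# Rankings are preserved under any monotone transformation...
print(sorted(students, key=students.get))
print(sorted(students, key=lambda s: to_other_scale(students[s])))

# ...but two equal 100-point gaps become unequal after rescaling.
print(f"Ben - Ana: 100 points -> {to_other_scale(600) - to_other_scale(500):.1f} rescaled points")
print(f"Cal - Ben: 100 points -> {to_other_scale(700) - to_other_scale(600):.1f} rescaled points")
```

This is the sense in which two groups showing the same improvement on one scale may not show it on another: the comparison of gaps, unlike the ranking, depends on which nonlinear scale happens to be reported.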

