Measuring Up: What Educational Testing Really Tells Us
Read between March 26 - April 19, 2025
83%
For example, there was no reason to expect that lowering the lights or increasing the type size would have given her any unfair advantage.
Olivier Chabot
It wouldn't advantage other students to have large print.
83%
For example, a complex mathematics task might require that you return to the prompt repeatedly for numerical data or to extract information from a graphic. This takes additional time for students who read braille, this one student claimed, because they cannot skim quickly and must go over the entire item or a large part of it more slowly to re-locate the information. The student maintained that this places them at a disadvantage, particularly if they do not receive additional time.6
83%
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy,it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
84%
Additional time is the most common accommodation offered in systems that administer tests with time limits. But how much time should be allowed? Offering too much additional time may run the risk of overcompensating—creating an unfair advantage—rather than merely leveling the playing field.
84%
His dyslexia impedes his ability to read the test well, but his reading proficiency is precisely what we are trying to gauge by testing him.
85%
Many contemporary tests of mathematics strive to include realistic problems of the sort that students would encounter outside of school. Many of these tests entail a good bit of reading; some also require that students write explanations of their answers.
85%
Clearly, the more reading and writing a math test requires, the more likely it is that the scores of dyslexic students will be adversely affected by their disabilities. But is this adverse effect a bias that should be offset by accommodations, or is it in fact a realistic indicator of lower proficiency? The answer, the NRC panel explained, is that it depends on the specific inference you base on the scores—that is, it depends on what you mean by “proficiency in mathematics.” On the one hand, if you were using scores to estimate skills such as computational facility, the performance of dyslexic ...
86%
one group has the standard amount of time, another has half again as much, and a third group has twice as much time. We find that the first group gets the lowest scores and the third gets the highest. Which of the three sets of scores is the most accurate? With what would you compare them to find out? For most K–12 testing, we lack a trustworthy standard with which to compare them.
86%
Not all students with disabilities score poorly, of course, but overall, students with disabilities are disproportionately represented at the low end of the distribution.
86%
In practice, there is no single standard error. Rather, there are many: the margin of error for students for whom the test is appropriately difficult is smaller than those for higher- and lower-scoring students.
87%
The GRE is a computer-adaptive test (often abbreviated CAT), in which the performance of students on early items leads to their being assigned either easier or more difficult items to better match their performance level. The result is a higher level of reliability because students are not wasting time on items that are too easy or difficult for them.
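The adaptive idea in this highlight can be sketched in a few lines. This is a toy "staircase" rule, not the GRE's actual item-selection algorithm: the function name `run_adaptive_test`, the 1 to 10 difficulty scale, and the deterministic response rule (a student answers correctly whenever the item is no harder than their ability) are all assumptions made for illustration.

```python
def run_adaptive_test(ability, n_items=8, start=5, lowest=1, highest=10):
    """Return the difficulty of each item given to a hypothetical student.

    Difficulty steps up by one after a correct answer and down by one
    after a miss, clamped to the range [lowest, highest].
    """
    difficulty = start
    administered = []
    for _ in range(n_items):
        administered.append(difficulty)
        # Assumed response rule: correct whenever the item is within reach.
        correct = difficulty <= ability
        step = 1 if correct else -1
        difficulty = min(highest, max(lowest, difficulty + step))
    return administered


strong_student = run_adaptive_test(ability=8)
weak_student = run_adaptive_test(ability=2)
print(strong_student)  # difficulties seen by the stronger student
print(weak_student)    # difficulties seen by the weaker student
```

In this sketch both students start on the same mid-level item, but within a few items each is working near their own level, so little testing time goes to items that are far too easy or far too hard.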
87%
To expect mildly retarded students to outperform the entire bottom third of students in the highest-scoring countries in the world, and to expect this to happen in twelve years, no less, is remarkably naive. I have no doubt that the motivation for this mandate is good, an attempt to force schools to attend more to the achievement of students with disabilities. This stands in contrast to earlier federal requirements that focused primarily on procedural issues, such as the appropriate classroom placements for students. Nonetheless, as a former special education teacher, I consider the extremity ...
88%
On the one hand, I and many others consider the current drive to improve the performance of low-scoring students essential and long overdue. This requires that higher standards be imposed for these students. At the same time, even if we were to succeed in this respect, we will still confront a very wide distribution of performance. The dilemma is to find a way to meet the goal of confronting unwanted variations in performance while still being realistic about the variations that will persist.
88%
Now suppose they wanted to answer a third question: whether I was at that time, and with the proficiency I had then, likely to be successful in Hebrew-language university study. In that case, my low score would have been right on the money: I would have been a weak student indeed.
Olivier Chabot
.c1
89%
How they should best be tested—whether translations should be used, whether accommodations should be offered, and so on—depends on the inferences the scores will be used to support. And we must be much more specific about the intended inferences than we often are. It is not enough to refer to “mathematics proficiency” or “readiness for university study.”
Olivier Chabot
.c2
89%
For example, we face the problem of polysemy—the multiple, unrelated meanings many words have. Native speakers easily shift among them and realize, for example, that “cutting a price” means “reducing a price,” not cutting as with a knife.
89%
And Chinese uses no tenses at all. It is likely that the difficulties LEP students face in taking English examinations may differ depending on the structure of their native language. For example, there is some research that shows that substituting words with Latinate roots for words with Germanic roots makes test items easier for native speakers of Spanish—hardly a surprising finding—but this would not likely be of much help for a native speaker of Korean, which shares no roots with either language. However, research investigating the effects of these differences on test performance is barely ...
90%
The list of threats to the conclusions commonly based on test scores—threats to validity—is long. Some of the big ones: There is measurement error, to start, which creates a band of uncertainty around each student’s score. When we are concerned with aggregates, such as the average score or percent proficient in a school, there is sampling error as well, which causes meaningless fluctuations in scores from one group of students to another and from one year to the next. This is a particularly serious problem for small groups—for example, when tracking the performance of small schools or, even ...
91%
To start, the notion of an “international mean” is useless. The average can vary markedly from survey to survey, depending on the mix of nations participating in the survey.
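A small numerical sketch shows why the mix of participants matters. The country labels and scores below are invented for illustration and do not come from any real survey. The names `scores` and `international_mean` are likewise hypothetical, and the mean is assumed to be a simple unweighted average of national averages.

```python
from statistics import mean

# Hypothetical national average scores, invented purely for illustration.
scores = {
    "Country A": 570,
    "Country B": 540,
    "Country C": 500,
    "Country D": 470,
    "Country E": 430,
}


def international_mean(participants):
    """Unweighted mean of the national averages of the participating countries."""
    return mean(scores[country] for country in participants)


# Country C takes part in two surveys with different sets of participants.
survey_1 = ["Country A", "Country B", "Country C"]
survey_2 = ["Country C", "Country D", "Country E"]

gap_1 = scores["Country C"] - international_mean(survey_1)
gap_2 = scores["Country C"] - international_mean(survey_2)
print(round(gap_1, 1), round(gap_2, 1))
```

Country C's own score is identical in both surveys, yet under these assumptions it lands below the "international mean" in the first and above it in the second, purely because of who else took part.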
91%
don’t treat any single test as providing the “right,” authoritative answer. Ever. When possible, use more than one source of information about achievement—results from additional tests, or information from other sources entirely. With data from several sources—PISA, several iterations of TIMSS, and a few earlier international studies—we can see that there is little doubt: the United States always scores far below Japan, even though it does not always score above Norway.
93%
Rather than calling them below basic, basic, proficient, and advanced—labels that carry a lot of unwarranted freight—try thinking of them as four merely arbitrary levels of performance, say, level 1, level 2, level 3, and level 4. Proponents of standards-based reporting might say that this suggestion is over the top and that the standards are in some way tied to descriptions of what kids actually can do. There is some truth to that claim, but the uncomfortable fact is that the various methods used to set the performance standards can be strikingly inconsistent.
Olivier Chabot
.c1
93%
I think you are more likely to be misled by taking the descriptions of standards at face value than by treating the standards as arbitrary classifications.
Olivier Chabot
.c2
93%
one can safely assume neither that the schools with the largest score gains are in fact improving the most rapidly nor that those with the highest scores are the best. So how should you use scores to help you evaluate a school? Start by reminding yourself that scores describe some of what students can do, but they don’t describe all they can do, and they don’t explain why they can or cannot do it. Use scores as a starting point, and look for other evidence of school quality—ideally not just other aspects of student achievement but also the quality of instruction and other activities within the ...
93%
are the differences in performance between different parts of the test sufficiently reliable that they can be used as a basis for changing instruction?
Olivier Chabot
Sbg
93%
Invariably, serving one master well means serving others more poorly. A test optimized to provide information about groups, for example, is not optimal for providing scores for individual students. That is why the National Assessment of Educational Progress cannot provide scores for individuals. A test that is constructed of a small number of large, complex tasks in an effort to assess students’ proficiency in solving complex problems will be poorly suited to identifying narrow, specific skills that the students have or have not successfully mastered.
94%
To obtain valid and reliable information about what students are learning, we need to focus tests on their levels of performance and on the content that they are actually studying.
Olivier Chabot
Adaptive tests
94%
The notion that we can figure out what “proficient” students should be able to do and then require schools to get them there has its appeal, but as previous chapters have shown, the way that this is now done can be a house of cards.
94%
And in setting goals, we need to recognize that wide variations in performance are a human universal, something that our educational system would have to address even if we had the political will and the means to reduce the glaring social inequities that plague our educational system and our society.
94%
I strongly support the goal of improved accountability in public education. I saw the need for it when I was myself an elementary school and junior high school teacher, many years ago. I certainly saw it as the parent of two children in school. Nothing in more than a quarter century of education research has led me to change my mind on this point. And it seems clear that student achievement must be one of the most important things for which educators and school systems should be accountable. However, we need an effective system of accountability, one that maximizes real gains and minimizes ...
Olivier Chabot
.c1
94%
There is no reason to expect test-based accountability education to be exempt from Campbell’s law, and the research evidence indicates that it is not.
Olivier Chabot
.c2