Kindle Notes & Highlights
Read between May 29 - June 29, 2023
The coefficient of determination is the percentage of variance in one variable that is accounted for by the variance in the other variable. Quite a mouthful, huh?
The more these two variables share in common, the more they will be related.
To determine exactly how much of the variance in one variable can be accounted for by the variance in another variable, the coefficient of determination is computed by squaring the correlation coefficient. For example, if the correlation between GPA and number of hours of study time is .70 (or r_GPA.time = .70), then the coefficient of determination, represented by r², is .70², or .49. This means that 49% of the variance in GPA “can be explained by” or “is shared by” the variance in studying time. And the stronger the correlation, the more variance can be explained (which only makes good sense).
...
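The GPA example can be checked with a few lines of Python. This is only a minimal sketch: the function name `coefficient_of_determination` is made up for illustration, and the only number taken from the highlight is r = .70.

```python
def coefficient_of_determination(r: float) -> float:
    """Square a correlation coefficient to get the proportion of shared variance."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    return r * r


# Correlation between GPA and study time, as given in the highlight.
r_gpa_time = 0.70
r_squared = coefficient_of_determination(r_gpa_time)

print(f"r = {r_gpa_time:.2f}")
print(f"r squared = {r_squared:.2f}")
print(f"share of variance explained = {r_squared:.0%}")
```

Because the coefficient is a square, a correlation of -.70 gives the same .49: the sign tells you the direction of the relationship, not how much variance is shared.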
That’s because of the simple principle that correlations express the association that exists between two or more variables; they have nothing to do with causality. In other words, just because level of ice cream consumption and crime rate increase together (and decrease together as well) does not mean that a change in one results in a change in the other.
A common “extra” tool is called partial correlation, where the relationship between two variables is explored, but the impact of a third variable is removed from the relationship between the two. Sometimes that third variable is called a mediating or a confounding variable.
That’s what partial correlation does. It looks at the relationship between two variables (in this case, consumption of ice cream and crime rate) as it removes the influence of a third (in this case, outside temperature). A third variable that explains the relationship between two variables can be a mediating variable or a confounding variable. Those are different types of variables with different definitions, though, and are easy to confuse. In our example with correlations, a confounding variable is something like temperature that affects both our variables of interest and explains the correlation between them. A mediating variable is a variable that comes between our two variables of interest and explains the apparent
...
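One way to see what "removing the influence of a third variable" means is the standard first-order partial correlation formula, which needs only the three pairwise correlations. The sketch below assumes that formula (it is not quoted in the highlight), and the three correlation values are invented purely for illustration.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with the linear influence of z removed.

    Standard first-order formula:
        (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    denominator = math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
    if denominator == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical values, NOT taken from the book.
r_ice_crime = 0.70   # ice cream consumption vs. crime rate
r_ice_temp = 0.80    # ice cream consumption vs. outside temperature
r_crime_temp = 0.80  # crime rate vs. outside temperature

r_partial = partial_correlation(r_ice_crime, r_ice_temp, r_crime_temp)
print(f"zero-order correlation: {r_ice_crime:.2f}")
print(f"partial correlation (temperature removed): {r_partial:.2f}")
```

With these made-up numbers the relationship drops from .70 to roughly .17 once temperature is held constant, which is the pattern you would expect if temperature were driving both variables.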
The fundamental questions that are answered in this chapter are “How do I know that the test, scale, instrument, and so on that I use produces scores that aren’t random but actually represent an individual’s typical performance?” (that’s reliability) and “How do I know that the test, scale, instrument, and so on that I use measures what it is supposed to?” (that’s validity).
If the tools that you use to collect data are unreliable or invalid, then the results of any test or any hypothesis, and the conclusions you may reach based on those results, are necessarily inconclusive. If you are not sure that the test does what it is supposed to and that it does so consistently without randomness in its scores, how do you know that the nonsignificant results you got aren’t a function of the lousy test tools rather than an actual reflection of reality? Want a clean test of your hypothesis? Make reliability and validity an important part of your research. You may have noticed a new term at the beginning of this chapter: dependent variable. In an experiment, this is the outcome variable, or
...
That, my friend, is one type of reliability—the degree to which scores are consistent for one person measured twice.
When you take a test in this class, you get a score, such as 89 (good for you) or 65 (back to the books!). That test score consists of several elements, including the observed score (or what you actually get on the test, such as 89 or 65) and a true score (the typical score you would get if you took the same test an infinite number of times). We can’t directly measure true score (because we don’t have the time or energy to give someone the same test an infinite number of times), but we can estimate it.
Notice that reliability is not the same as validity; it does not reflect whether you are measuring what you want to. Here’s why. True score has nothing to do with whether the construct of interest is really being reflected. Rather, true score is the mean score an individual would get if he or she took a test an infinite number of times, and it represents the theoretical typical level of performance on a given test. Now, one would hope that the typical level of performance would reflect the construct of interest, but that’s another question (a question of validity). The distinction here is that
...
score (which we never really know but only theorize about) is 80. That means the 9-point difference (that’s the error score) is due to error, or the reason why individual test scores vary from being 100% true.
The less random error, the more reliable—it’s that simple.
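The link between error and reliability can be sketched with the classical test theory decomposition, observed score = true score + error score, where reliability is the share of observed-score variance that comes from true scores. That ratio is the standard textbook definition rather than something quoted in these highlights, the observed score of 89 is inferred from the 9-point difference mentioned above, and the variances are invented for illustration.

```python
def error_score(observed: float, true: float) -> float:
    """Error score: the part of an observed score that is not true score."""
    return observed - true


def reliability(true_variance: float, error_variance: float) -> float:
    """Proportion of observed-score variance that is true-score variance.

    Assumes error is random and uncorrelated with true score, so that
    observed variance = true variance + error variance.
    """
    return true_variance / (true_variance + error_variance)


# An observed score of 89 against a theoretical true score of 80.
print(f"error score: {error_score(89, 80)}")

# Hypothetical variances: hold true-score variance fixed and let error grow.
true_variance = 64.0
for error_variance in (4.0, 16.0, 36.0):
    value = reliability(true_variance, error_variance)
    print(f"error variance {error_variance:>4}: reliability = {value:.2f}")
```

As the error variance rises from 4 to 36, the reliability in this example falls from about .94 to .64.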
All of these types of reliability look at how much randomness there is in the scores by seeing if a test correlates with itself.
Test–retest reliability is used when you want to examine whether a test is reliable over time.
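As a rough illustration, test-retest reliability comes down to correlating the scores from the first administration with the scores from the second. The sketch below writes out the Pearson correlation in its deviation-score form; the function name `pearson_r` and the two sets of scores are hypothetical and are not taken from the book.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviation scores.
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical scores for the same five people, tested twice.
time_1 = [80, 72, 91, 65, 88]
time_2 = [82, 70, 93, 68, 85]

r_test_retest = pearson_r(time_1, time_2)
print(f"test-retest reliability: {r_test_retest:.2f}")
```

The same function would serve for parallel forms reliability: pass in the scores on Form A and Form B instead of Time 1 and Time 2.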
The first and last step in this process is to compute the Pearson product-moment correlation (see Chapter 5 for a refresher on this), which is equal to
Parallel forms reliability is used when you want to examine the equivalence or similarity between two different forms of the same test. For example, let’s say that you are doing a study on memory and part of the task is to look at 10 different words, memorize them as best you can, and then recite them back after 20 seconds of study and 10 seconds of rest. Because this study takes place over a 2-day period and involves some training of memory skills, you want to have another set of items that is exactly similar in task demands, but it obviously cannot be the same as far as content. So, you create another list of words that is hopefully similar to the
...
The first and only step in this process is to compute the Pearson product-moment correlation (again, see Chapter 5 for a refresher on this), which is equal to
Internal consistency reliability is quite different from the two previous types that we have explored. It is used when you want to know whether the items on a test correlate with one another strongly enough that it makes sense to assume they all measure the same thing. And if they all measure the same thing, then it makes sense to add them up into a total score.
Coefficient alpha (or α), sometimes called Cronbach’s alpha for its inventor, Lee Cronbach, is a special measure of reliability known as internal consistency. The more predictably individual item scores relate with each other, the higher the value of Cronbach’s alpha. And the higher the value, the more confidence you can have that this test is internally consistent and correlates well with itself.
When you compute Cronbach’s alpha, you are actually correlating the score for each item with the total score for each individual and then comparing that with the variability present for all individual item scores. The logic is that any individual test taker with a high total test score should have a high(er) score on each item (such as 5, 5, 3, 5, 3, 4, 4, 2, 4, 5 for a total score of 40) and that any individual test taker with a low(er) total test score should have a low(er) score on each individual item (such as 4, 1, 2, 1, 3, 2, 4, 1, 2, 1).
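For readers who want to see the arithmetic, here is a minimal sketch of coefficient alpha using the common variance-based formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. That formula is the usual textbook form and is an assumption here, since the highlight describes the logic in words only. The response data are made up.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Coefficient alpha for a table of scores.

    Each inner list holds one person's scores on every item
    (rows are people, columns are items).
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least 2 people and 2 items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every person must answer every item")
    # Sample variance of each item across people.
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Sample variance of each person's total score.
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 people, 4 items rated 1 to 5.
responses = [
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
    [2, 1, 2, 1],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

In this invented data set, people who score high on one item tend to score high on the others, so the total-score variance is large relative to the item variances and alpha comes out high.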
Not only is there coefficient alpha, but there are also split-half reliability, Kuder–Richardson 20 and 21 (KR20 and KR21), and still others that basically do the same thing, only in different ways. Most researchers, however, believe that Cronbach’s alpha is the best way to assess internal consistency.
Interrater reliability is the measure that tells you how much two raters agree on their judgments of some outcome.
And when we plug in the numbers as you see here … the resulting interrater reliability coefficient is .833.
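Interrater reliability as a proportion of agreement is simply the number of agreements divided by the number of possible agreements. The numbers behind the .833 are not shown in this highlight, so the ratings below are hypothetical, chosen so that two raters agree on 10 of 12 observations and the result matches the quoted coefficient.

```python
def proportion_agreement(rater_a: list[int], rater_b: list[int]) -> float:
    """Number of agreements divided by the number of possible agreements."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same, non-empty set")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


# Hypothetical ratings for 12 observation periods
# (1 = behavior observed, 0 = behavior not observed).
rater_a = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1]

agreement = proportion_agreement(rater_a, rater_b)
print(f"interrater reliability: {agreement:.3f}")
```

The two raters disagree on the sixth and eleventh periods only, so the coefficient is 10 / 12, or about .833.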
There are standards here, too, for interpreting reliability. The difference is we usually see much higher values than with typical research correlations, so we have tougher standards. We want only two things, and here they are:
Reliability coefficients to be positive and not to be negative
Reliability coefficients that are as large as possible (they will be between .00 and +1.00)
Typically, for most types of reliability, we want coefficients to be .70 or higher. For interrater reliability as estimated with percentage of agreement, we would hope for 80% agreement or higher.
Here are a few things to keep in mind. Remember that reliability is a function of how much random error contributes to the observed score. Lower that error, and you increase the reliability.
Make sure that the instructions are standardized and clear across all settings in which the test is administered. Increase the number of items or observations because the more questions on a test, the more opportunities for that randomness in responses to cancel itself out. Typically, the longer the test, the more reliable. Delete unclear items because some people will respond in one way and others will respond in a different fashion, regardless of their knowledge or ability level or individual traits. For an achievement test especially (such as a spelling or history test), make sure that the
...
The first step in creating an instrument that has sound psychometric (how’s that for a big word, which means measurement of the mind?) properties is to establish its reliability (and we just spent some good time on that). Why? Well, if a test or measurement instrument is not reliable, is not consistent, and does not do the same thing time after time after time, it does not matter what it measures (and that’s the validity question), right?
Validity is, most simply, the property of an assessment tool that indicates that the tool does what it says it does. A valid test is a test that measures what it is supposed to and works well for its intended purpose.
Content-based validity is the property of a test such that the test items fairly sample the universe of items for which the test is designed.
I would probably tell Alberta what the topics were, and then she would look at the items and provide a judgment as to whether they met the criterion I established—a representation of the entire universe of all items that are introductory in physics. If the answer is yes, I’m done (at least for now). If the answer is no, it’s back to the drawing board and either the creation of new items or the refinement of existing ones (until the content is deemed correct by the expert).
Criterion-based validity assesses whether a test reflects a set of abilities in a current or future setting. If the criterion is taking place in the here and now, then we talk about concurrent criterion validity. If the criterion is taking place in the future, then we talk about predictive criterion validity. For criterion validity to be present, one need not establish both concurrent and predictive validity but only the one that works for the purposes of the test.
As a criterion (and that’s the key here), you have some set of judges rank each student from 1 to 10 on overall cooking ability. Then, you simply correlate the COOK scores with the judges’ rankings. If the validity coefficient (a simple correlation) is high, you’re in business—if not, it’s back to the drawing board.
Here, we are interested in developing a test that predicts success as a chef 10 years down the line. To establish the predictive validity of the COOK test, you go to recent graduates of the program and administer the test to them. Then, wait 10 years. Measure them again on a different but related criterion, such as their level of success, and you use as measures (a) whether they own their own restaurant and (b) the restaurant’s average rating on social media. The rationale here is that if a restaurant has good ratings, then the chef must be doing something right. To complete this exercise, you
...
Construct-based validity is the most interesting and the most difficult of all the validities to establish because it is based on some underlying construct, or idea, behind a test or measurement tool. You may remember from your extensive studies in Psych 1 that a construct is an abstract trait that can’t be directly observed. For example, aggression is a construct (consisting of such variables as inappropriate touching, violence, lack of successful social interaction, etc.), as are intelligence, mother–infant attachment, and hope. And keep in mind that these constructs are generated from some
...
So, you have the FIGHT test (of aggression), an observational tool that consists of a series of items. It is an outgrowth of your theoretical view about what the construct of aggression consists of. And your theory was based on published research studies. You
The FIGHT scale includes items that describe different behaviors, some that are theoretically related to aggressive behaviors and some that are not. Once the FIGHT scale is completed, you examine the results to see whether positive scores on the FIGHT scale correlate with the presence of the kinds of behaviors you would predict given your theory (level of involvement in crime, quality of personal relationships, etc.) and don’t correlate with the kinds of behaviors that should not be related (such as handedness or preferences for certain types of food). And, if the correlation is high for the
...
In general, if you don’t have the validity evidence you want, then it’s because your test is not doing what it should. If it’s an achievement test and a satisfactory level of content-based validity is what you seek, then you probably have to rewrite the questions on your test to make sure they are more consistent with the opinions of the experts you consulted or with the state board of education’s standards for what should be taught at each grade level. If you are concerned with criterion-based validity, then you probably need to reexamine the nature of the items on the test and answer the
...
The process of establishing the reliability and validity of any instrument can take years of intensive work. And what can make matters even worse is when the naive or unsuspecting individual wants to create a new instrument to test a new hypothesis. That means that on top of everything else that comes with testing a new hypothesis, there is also the work of making sure the instrument works as it should. If you are doing original research of your own, such as for your thesis or dissertation requirement, be sure to find a measure that has already had reliability and validity evidence well
...
As we mentioned earlier in this chapter, you can have a test that is reliable but not valid. However, you cannot have a valid test without it first being reliable. Why? Well, a test can do whatever it does over and over (that’s reliability) but still not do what it is supposed to (that’s validity). But if a test does what it is supposed to, then it has to do it consistently to work.
This relationship says that the maximum level of validity is equal to the square root of the reliability coefficient. For example, if the reliability coefficient for a test of mechanical aptitude is .87, the validity coefficient can be no larger than .93 (which is the square root of .87). What this means in tech talk is that the validity of a test is constrained by how reliable it is. And that makes perfect sense if we stop to think that a test must do what it does consistently before we are sure it does what it says it does. But the relationship is closer as well. You cannot have a valid instrument without it first being reliable, because in order for something to do what
...
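The ceiling that reliability places on validity is easy to verify. This short sketch assumes nothing beyond the square-root relationship stated above; the function name `max_validity` is made up for illustration.

```python
import math


def max_validity(reliability_coefficient: float) -> float:
    """Upper bound on a validity coefficient given a reliability coefficient."""
    if not 0.0 <= reliability_coefficient <= 1.0:
        raise ValueError("reliability must lie between .00 and 1.00")
    return math.sqrt(reliability_coefficient)


# The mechanical aptitude example: reliability of .87.
ceiling = max_validity(0.87)
print(f"maximum validity: {ceiling:.2f}")

# Lower reliability means a lower ceiling on validity.
for value in (0.90, 0.70, 0.50):
    print(f"reliability {value:.2f} -> validity can be at most {max_validity(value):.2f}")
```

For the .87 example this prints .93, matching the figure in the highlight.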

