Kindle Notes & Highlights
Second, it’s the line that gives us our best guess at what college GPA would be, given each high school GPA. For example, if high school GPA is 3.0, then college GPA should be around (remember, this is only an eyeball prediction) 2.8. Take a look at Figure 16.3 to see how we did this. We located the predictor value (3.0) on the x-axis, drew a vertical line from the x-axis up to the regression line, then drew a horizontal line over to the y-axis, and read off the predicted value of Y.
Third, the distance between each individual data point and the regression line is the error in prediction—a direct reflection of the correlation between the two variables. For example, if you look at data point (3.3, 3.7), marked in Figure 16.4, you can see that this (X, Y) data point is above the regression line. The distance between that point and the line is the error in prediction, as marked in Figure 16.4, because if the prediction were perfect, then all the predicted points would fall where? Right on the regression or prediction line.
Fourth, if the correlation were perfect, the regression line would pass through every data point (just as we said earlier in the third point); with both variables expressed in the same standardized units, that line would sit at a 45° angle.
The simplest way to think of prediction is that you are determining the score on one variable (which we’ll call Y—the criterion or dependent variable) based on the value of another score (which we’ll call X—the predictor or independent variable).
The way that we find out how well X can predict Y is through the creation of the regression line we mentioned earlier in this chapter. This line is created from data that have already been collected. The equation is then used to predict scores using a new value for X, the predictor variable. Formula 16.1 shows the general formula for the regression line, which may look familiar because you may have used something very similar in a high school or college math course. In geometry, it’s the formula for any straight line:

Y′ = bX + a   (16.1)

where Y′ is the predicted score of Y based on a known value of X; b is the slope of the line; X is the score being used as the predictor; and a is the point at which the line crosses the y-axis (the intercept).
∑X, or the sum of all the X values, is 31.4.
∑Y, or the sum of all the Y values, is 29.3.
∑X², or the sum of each X value squared, is 102.5.
∑Y², or the sum of each Y value squared, is 89.99.
∑XY, or the sum of the products of X and Y, is 94.75.

Formula 16.2 is used to compute the slope of the regression line (b in the equation for a straight line):

b = (∑XY − (∑X∑Y)/n) / (∑X² − (∑X)²/n)   (16.2)
Formula 16.4 is used to compute the point at which the line crosses the y-axis (a in the equation for a straight line):

a = (∑Y − b∑X)/n   (16.4)
Why the Y′ and not just a plain Y? Remember, we are using X to predict Y, so we use Y′ to mean the predicted, and not the actual, value of Y.
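These formulas are easy to turn into code. Here is a minimal sketch in Python (ours, not the book's, which works in SPSS). The sums come from the values listed above; n = 10 is an assumption, chosen because it reproduces the regression line Y′ = 0.704X + 0.719 that appears later in the chapter.

    # A sketch using Formulas 16.2 and 16.4 with the sums given above.
    # n = 10 is an assumption (it reproduces Y' = 0.704X + 0.719).
    n = 10
    sum_x, sum_y = 31.4, 29.3
    sum_x2, sum_xy = 102.5, 94.75

    # Formula 16.2: the slope b of the regression line
    b = (sum_xy - (sum_x * sum_y) / n) / (sum_x2 - sum_x ** 2 / n)

    # Formula 16.4: the intercept a, where the line crosses the y-axis
    a = (sum_y - b * sum_x) / n

    # Predicted college GPA (Y') for a high school GPA (X) of 3.0
    print(f"Y' = {b:.3f}X + {a:.3f}")            # Y' = 0.704X + 0.720
    print(f"Y' for X = 3.0: {b * 3.0 + a:.2f}")  # about 2.83

(The 0.720 versus the book's 0.719 is only a matter of rounding order.) Note that the prediction of about 2.83 agrees with the eyeball estimate of roughly 2.8 made earlier.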
You can use this formula and the known values to compute predicted values. That’s most of what we just talked about. But you can also plot a regression line to show how well the scores (what you are trying to predict) actually fit the data from which you are predicting. Take another look at Figure 16.2, the plot of the high school–college GPA data. It includes a regression line, which is also called a trend line. How did we get this line? Easy. We used the same charting skills you learned in Chapter 5 to create a scatterplot; then we selected Add Fit Line in the SPSS Chart Editor.
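If you are not an SPSS user, the same scatterplot-plus-trend-line can be drawn in a few lines of Python. This is our sketch, not the book's procedure, and the two arrays are hypothetical stand-ins for the Table 16.1 data.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical stand-ins for the high school and college GPA data
    hs_gpa = np.array([3.5, 2.4, 3.9, 2.3, 3.0, 2.8, 3.3, 3.2, 2.6, 3.4])
    college_gpa = np.array([3.3, 2.5, 3.8, 2.2, 2.8, 3.5, 3.7, 3.0, 2.3, 3.2])

    # Degree-1 polynomial fit returns the slope b and intercept a
    b, a = np.polyfit(hs_gpa, college_gpa, 1)

    plt.scatter(hs_gpa, college_gpa)            # the scatterplot
    xs = np.linspace(hs_gpa.min(), hs_gpa.max(), 100)
    plt.plot(xs, b * xs + a)                    # the fitted trend line
    plt.xlabel("High school GPA (X)")
    plt.ylabel("College GPA (Y)")
    plt.show()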
Not all lines that fit best between a bunch of data points are straight. Rather, they could be curvilinear, just as you can have a curvilinear relationship between your variables, as we discussed in Chapter 5. For example, the relationship between anxiety and performance is such that when people are not at all anxious or very anxious, they don’t perform very well. But if they’re moderately anxious, then performance can be enhanced. The relationship between these two variables is curvilinear, and the prediction of Y from X takes that into account.
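Fitting a curved trend works the same way; you simply fit a curve instead of a line. A small sketch (ours, with made-up anxiety and performance scores illustrating the inverted-U pattern just described):

    import numpy as np

    # Made-up scores: performance peaks at moderate anxiety (inverted U)
    anxiety = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
    performance = np.array([3, 5, 7, 8, 9, 8, 6, 4, 2], dtype=float)

    # A degree-2 (quadratic) fit captures the curve a straight line misses
    c2, c1, c0 = np.polyfit(anxiety, performance, 2)

    # Predicted performance at moderate anxiety (X = 5)
    print(round(c2 * 5 ** 2 + c1 * 5 + c0, 2))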
But being practical, we can also look at the difference between the predicted value (Y′) and the actual value (Y) once we have computed the formula of the regression line. For example, if the formula for the regression line is Y′ = 0.704X + 0.719, the predicted Y (or Y′) for an X value of 2.8 is 0.704(2.8) + 0.719, or 2.69. We know that the actual Y value that corresponds to an X value of 2.8 is 3.5 (from the data set shown in Table 16.1). The difference between 3.5 and 2.69 is 0.81, and that’s the size of the error in prediction. Another measure of error that you could use is the coefficient of …
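In code, that error in prediction is just the difference between the actual and predicted values. A tiny sketch (ours) using the regression line above:

    # Error in prediction = actual Y minus predicted Y'
    def predict(x, b=0.704, a=0.719):
        """Y' from the regression line Y' = 0.704X + 0.719."""
        return b * x + a

    actual_y = 3.5                    # observed Y for X = 2.8 (Table 16.1)
    error = actual_y - predict(2.8)   # 3.5 - 2.69
    print(round(error, 2))            # 0.81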
All of the examples that we have used so far in this chapter have been for one criterion or outcome measure and one predictor variable. There is also the case of regression where more than one predictor or independent variable is used to predict a particular outcome. If one variable can predict an outcome with some degree of accuracy, then why couldn’t two do a better job? Maybe so, but there’s a big caveat. Read on. For example, if high school GPA is a pretty good indicator of college GPA, then how about high school GPA plus number of hours of extracurricular activities? So, instead of the …
In this equation (the multiple regression equation, which takes the form Y′ = bX1 + bX2 + a), X1 is the value of the first independent variable, X2 is the value of the second independent variable, b is the regression weight for that particular variable, and a is the intercept of the regression line, or where the regression line crosses the y-axis. As you may have guessed, this model is called multiple regression (multiple predictors, right?). So, in theory anyway, you are predicting an outcome from two independent variables rather than one. But you want to add additional predictor variables only under certain conditions. Read on. Any variable you add has to make a unique contribution …
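To see what estimating two regression weights looks like in practice, here is a minimal sketch in Python using ordinary least squares. It is not the book's SPSS procedure, and all three arrays are hypothetical stand-ins.

    import numpy as np

    # Hypothetical data: two predictors (X1, X2) and one outcome (Y)
    hs_gpa = np.array([3.5, 2.4, 3.9, 2.3, 3.0, 2.8, 3.3, 3.2])          # X1
    extra_hours = np.array([10.0, 4.0, 12.0, 3.0, 8.0, 6.0, 9.0, 7.0])   # X2
    college_gpa = np.array([3.3, 2.5, 3.8, 2.2, 2.8, 3.5, 3.7, 3.0])     # Y

    # Add a column of 1s so least squares also estimates the intercept a
    X = np.column_stack([hs_gpa, extra_hours, np.ones(len(hs_gpa))])
    (b1, b2, a), *_ = np.linalg.lstsq(X, college_gpa, rcond=None)

    # Y' = b1*X1 + b2*X2 + a for a new student
    print(round(b1 * 3.0 + b2 * 8.0 + a, 2))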
This is a powerful way of examining how, and how much, more than one independent variable contributes to the prediction of another variable.
If you are using more than one predictor variable, try to keep the following two important guidelines in mind:
1. When selecting a variable to predict an outcome, select a predictor variable (X) that is related to the criterion variable (Y). That way, the two share something in common (remember, they should be correlated).
2. When selecting more than one predictor variable (such as X1 and X2), try to select variables that are independent of, or uncorrelated with, one another but are both related to the outcome or predicted (Y) variable.

In effect, you want only independent or predictor variables that are related to the dependent variable and are unrelated to each other. That way, each one makes as distinct a contribution as possible. (A quick correlation check is sketched below.)
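A quick way to vet candidate predictors against these guidelines is to look at their correlations, as in this sketch (ours; the arrays are hypothetical):

    import numpy as np

    # Hypothetical candidate predictors (x1, x2) and outcome (y)
    x1 = np.array([3.5, 2.4, 3.9, 2.3, 3.0, 2.8, 3.3, 3.2])
    x2 = np.array([10.0, 4.0, 12.0, 3.0, 8.0, 6.0, 9.0, 7.0])
    y = np.array([3.3, 2.5, 3.8, 2.2, 2.8, 3.5, 3.7, 3.0])

    # Full Pearson correlation matrix for the three variables
    r = np.corrcoef([x1, x2, y])
    print("r(X1, Y)  =", round(r[0, 2], 2))   # want this substantial
    print("r(X2, Y)  =", round(r[1, 2], 2))   # want this substantial
    print("r(X1, X2) =", round(r[0, 1], 2))   # want this low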
How many predictor variables are too many? Well, if one variable predicts some outcome, and two are even more accurate, then why not three, four, or five predictor variables? In practical terms, every time you add a variable, an expense is incurred: someone has to go collect the data, it takes time (which is $$$ when it comes to research budgets), and so on. From a theoretical standpoint, there is a fixed limit on how many variables can contribute to an understanding of what we are trying to predict. Remember that it is best when the predictor or independent variables are independent or unrelated …
Almost every statistical test that we’ve covered so far in Statistics for People Who (Think They) Hate Statistics assumes that the data set with which you are working has certain characteristics. For example, one assumption underlying a t test between means is that the variances of each group are homogeneous, or similar. And this assumption can be tested. Another assumption of many parametric statistics is that the sample is large enough to represent the population. Statisticians have found that it takes a sample size of about 30 to fulfill this assumption. Many of the statistical tests we’ve covered …
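The homogeneity-of-variance assumption mentioned above can indeed be tested; one common choice is Levene's test. A sketch (ours, with made-up scores; the book does not prescribe this particular test here):

    from scipy import stats

    # Made-up scores for two groups about to be compared with a t test
    group1 = [85, 90, 78, 92, 88, 76, 81, 95]
    group2 = [70, 72, 94, 65, 99, 60, 102, 58]

    # Levene's test: a small p suggests the variances are NOT homogeneous
    stat, p = stats.levene(group1, group2)
    print(f"W = {stat:.2f}, p = {p:.3f}")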
Chi-square is an interesting nonparametric test that allows you to determine if what you observe in a distribution of frequencies is what you would expect to occur by chance. Here’s what we mean by “would expect by chance”: If there are three possible categories that a thing can fall into, there is a one-third chance of each thing landing in each category, so by chance alone we would expect one third of the things to fall into each category. If, in a sample, the things fall into the categories at rates different from one third, one third, and one third, then it’s possible that in the population, the true probabilities aren’t one third, one third, and one third. A one-sample chi-square includes only one dimension, variable, or factor, such as in the example you’ll see here, and is usually referred to as a goodness-of-fit test (just how well do the data you collect fit the pattern you expected?). A two-sample chi-square includes two dimensions, variables, or factors and is usually referred to as a test of independence. For example, it might be used to test whether preference for school vouchers is not related to, or independent of, political …
The rationale behind the one-sample chi-square or goodness-of-fit test is that for any set of occurrences, you can easily compute what you would expect by chance. You do this by dividing the total number of occurrences by the number of classes or categories. In this example, using data from the Census, the observed total number of occurrences is 84. We would expect that, by chance, 84/3 = 28 respondents (84, the total of all frequencies, divided by 3, the number of categories) would fall into each of the three categories of level of education. Then we look at how different the observed frequencies are from those we would expect by chance.
Formula 17.1 shows how to compute the chi-square value for a one-sample chi-square test:

χ² = ∑[(O − E)² / E]   (17.1)

where χ² is the chi-square value, ∑ is the summation sign, O is the observed frequency, and E is the expected frequency.
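Formula 17.1 is simple enough to compute by hand, but here is a sketch in Python using scipy (ours; the observed counts are hypothetical, chosen to sum to 84 as in the Census example, so E = 28 per category):

    from scipy import stats

    # Hypothetical observed frequencies for three education categories
    observed = [40, 25, 19]     # sums to 84, so E = 84 / 3 = 28

    # With f_exp omitted, scipy assumes equal expected frequencies,
    # i.e. it computes sum((O - E)**2 / E) exactly as in Formula 17.1
    chi2, p = stats.chisquare(observed)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}")   # 8.36 here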
We are using only three categories, but the number could be extended to fit the situation as long as each of the categories is mutually exclusive, meaning that any one observation cannot be in more than one category.
Any test between frequencies or proportions of mutually exclusive categories (such as For, Maybe, and Against) requires the use of chi-square.
Why “goodness of fit”? This name suggests that the statistic addresses the question of how well or “good” a set of data “fits” an existing set. The set of data is, of course, what you observe. The “fit” part suggests that there is another set of data to which the observed set can be matched. This standard is the set of expected frequencies that are calculated in the course of computing the χ² value. If the observed data fit, the χ² value is small: what you observed is close to what you would expect by chance and does not differ significantly from it. If the observed data do not fit, then what you observed is different …

… as our criterion to deem the research hypothesis more attractive than the null hypothesis, our conclusion is that there is a significant difference among the three sets of scores.
With this test, two different dimensions, almost always at the nominal level of measurement, are examined to see whether they are related.
What you need to keep in mind is that (1) there are two dimensions (gender and voting participation) and (2) the question asked by the chi-square test of independence is whether, in this case, gender and voting participation are independent of each other.
As with a one-dimensional chi-square, the closer the observed and expected values are to one another, the more likely the two dimensions are to be independent of one another. The less similar the observed and expected values are, the more likely the two dimensions are related. The test statistic is computed in the same manner as we did for a goodness-of-fit test (what a nice surprise); expected values are computed and used along with observed values to compute a chi-square value, which is then tested for significance. However, unlike with the one-dimensional test, the expected values are computed from the marginal totals of the table (the row and column totals).
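In Python, the whole procedure (expected values from the marginals, then the chi-square value) is one call. A sketch (ours; the 2 × 2 counts for gender by voting participation are hypothetical):

    from scipy import stats

    #                 Voted  Did not vote
    observed = [[37, 13],    # men    (hypothetical counts)
                [41, 9]]     # women  (hypothetical counts)

    # Expected values are computed from the row and column totals
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
    print(expected)   # counts we'd expect if the dimensions were independent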
On the other hand, you may very well find yourself dealing with samples that are very small (or at least fewer than 30 cases) or data that violate some of the important assumptions underlying parametric tests.
Chi-square is one of many different types of nonparametric statistics that help you answer questions based on data that violate the basic assumptions of the normal distribution or when a data set is just too small for other statistical methods to work.
You won’t be surprised to learn that there are many different renditions of analysis of variance (ANOVA), each one designed to fit a particular “averages of more than two groups being compared” situation. One of these, multivariate analysis of variance (MANOVA), is used when there is more than one dependent variable. So, instead of looking at just one outcome or dependent variable, the analysis can use more than one. If the dependent or outcome variables are related to one another (which they usually are; see the Tech Talk note in Chapter 13 about multiple t tests), it would be hard to …
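For the curious, here is what a minimal MANOVA looks like in Python's statsmodels (our sketch, not the book's SPSS route; the data frame is hypothetical):

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    # Hypothetical data: one grouping factor, two related outcomes
    df = pd.DataFrame({
        "group":    ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
        "outcome1": [4.1, 3.8, 4.5, 5.2, 5.0, 5.6, 6.1, 5.9, 6.4],
        "outcome2": [2.0, 2.3, 1.9, 3.1, 2.8, 3.3, 4.0, 3.7, 4.2],
    })

    # Both dependent variables go on the left side of the formula
    m = MANOVA.from_formula("outcome1 + outcome2 ~ group", data=df)
    print(m.mv_test())   # Wilks' lambda, Pillai's trace, and friends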

