Discovering Statistics Using IBM SPSS Statistics: North American Edition
Think about what a correlation of zero represents: it is no effect whatsoever. A confidence interval gives the boundaries within which the population value falls in 95% of samples.
The correlation coefficient squared (known as the coefficient of determination, R2) is a measure of the amount of variability in one variable that is shared by the other.
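To make this concrete (a minimal sketch in Python with made-up data, since nothing here is SPSS-specific), squaring Pearson's r gives the proportion of variability the two variables share:

import numpy as np
from scipy import stats

# Hypothetical scores on two variables
anxiety = np.array([61, 72, 55, 80, 65, 50, 78, 90, 42, 68])
performance = np.array([70, 58, 75, 40, 62, 80, 45, 30, 85, 60])

r, p = stats.pearsonr(anxiety, performance)   # Pearson correlation and its p-value
r_squared = r ** 2                            # coefficient of determination
print(f"r = {r:.3f}, R2 = {r_squared:.3f}")   # R2 = proportion of shared variability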
You’ll often see people write things about R2 that imply causality: they might write ‘the variance in y accounted for by x’, or ‘the variation in one variable explained by the other’.
Although R2 is a useful measure of the substantive importance of an effect, it cannot be used to infer causal relationships.
Spearman’s correlation coefficient, denoted by rs (Figure 8.9), is a non-parametric statistic that is useful for minimizing the effects of extreme scores or of violations of parametric assumptions.
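For example (a sketch rather than SPSS output; scipy.stats has a direct implementation, and the data are invented):

from scipy import stats

# Invented data containing one extreme score (100)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

rs, p = stats.spearmanr(x, y)   # works on ranks, so the extreme score has limited influence
print(rs, p)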
Remember that these confidence intervals are based on a random sampling procedure so the values you get will differ slightly from mine, and will change if you rerun the analysis.
Kendall’s tau, denoted by τ, is another non-parametric correlation and it should be used rather than Spearman’s coefficient when you have a small data set with a large number of tied ranks.
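A matching sketch for Kendall's tau (again invented data; SciPy's kendalltau computes the tau-b variant, which adjusts for ties):

from scipy import stats

# Small invented data set with many tied ranks
x = [1, 1, 2, 2, 2, 3, 3, 4]
y = [1, 2, 1, 2, 3, 3, 4, 4]

tau, p = stats.kendalltau(x, y)   # tau-b, appropriate when there are tied ranks
print(tau, p)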
Although Spearman’s statistic is the more popular of the two coefficients, there is much to suggest that Kendall’s statistic is a better estimate of the correlation in the population.
Often it is necessary to investigate relationships between two variables when one of the variables is dichotomous (i.e., it is categorical with only two categories).
The difference between the use of biserial and point-biserial correlations depends on whether the dichotomous variable is discrete or continuous.
A discrete, or true, dichotomy is one for which there is no underlying continuum between the categories.
The point-biserial correlation coefficient (rpb) is used when one variable is a discrete dichotomy (e.g., pregnancy), whereas the biserial correlation coefficient (rb) is used when one variable is a continuous dichotomy (e.g., passing or failing an exam).
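For the point-biserial case, a quick sketch (made-up data; rpb is simply Pearson's r with the dichotomy coded 0/1, and SciPy exposes it directly, although it has no built-in biserial coefficient):

import numpy as np
from scipy import stats

group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])                       # discrete dichotomy
score = np.array([4.1, 3.8, 5.0, 4.4, 6.2, 5.9, 6.5, 6.0, 5.7, 4.6])   # continuous variable

r_pb, p = stats.pointbiserialr(group, score)   # equivalent to pearsonr(group, score)
print(r_pb, p)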
Another way to express the unique relationship between two variables (i.e., the relationship accounting for other variables) is the partial correlation.
The semi-partial correlation expresses the unique relationship between two variables, X and Y, as a function of the total variance in Y.
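One way to see the distinction (a sketch under the usual residual-based definitions, with simulated variables X, Y and a control variable Z): the partial correlation removes Z from both X and Y, whereas the semi-partial removes Z from X only, leaving the total variance in Y intact.

import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=200)
x = 0.6 * z + rng.normal(size=200)             # X depends on Z
y = 0.5 * z + 0.4 * x + rng.normal(size=200)   # Y depends on Z and X

def residualize(a, b):
    # residuals of a after removing what a simple linear fit on b predicts
    slope, intercept = np.polyfit(b, a, 1)
    return a - (intercept + slope * b)

x_given_z = residualize(x, z)
y_given_z = residualize(y, z)

partial = np.corrcoef(x_given_z, y_given_z)[0, 1]   # Z removed from both X and Y
semi_partial = np.corrcoef(y, x_given_z)[0, 1]      # Z removed from X only
print(partial, semi_partial)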
Partial correlations can be done when variables are dichotomous (including the ‘third’ variable).
We can calculate a z-score of the difference between these correlations using equation 8.18.
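The equation itself is not in the highlight. The form equation 8.18 usually takes (comparing two independent correlations r1 and r2, from samples of size N1 and N2, after Fisher's z-transformation) is:

\[
z_{\text{difference}} = \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}},
\qquad
z_r = \frac{1}{2}\ln\frac{1 + r}{1 - r}
\]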
If you want to compare correlation coefficients that come from the same entities then things are a little more complicated. You can use a t-statistic to test whether a difference between two dependent correlations is significant.
The t-statistic is computed as shown in equation 8.20 (Chen & Popovich, 2002).
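The formula is likewise missing from the highlight. Hedging on the exact notation, the test usually given here (a Williams-type test of two dependent correlations rxy and rzy that share the variable y, with rxz the correlation between the two predictors and n the sample size) is:

\[
t_{\text{difference}} = (r_{xy} - r_{zy})
\sqrt{\frac{(n - 3)(1 + r_{xz})}{2\left(1 - r_{xy}^{2} - r_{xz}^{2} - r_{zy}^{2} + 2\,r_{xy}\,r_{xz}\,r_{zy}\right)}}
\]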
This value can be checked against the appropriate critical value for t with N − 3 degrees of freedom.
A table is a good way to report lots of correlations.
This equation keeps the fundamental idea that an outcome for a person can be predicted from a model (the stuff in parentheses) and some error associated with that prediction (εi). We still predict an outcome variable (Yi) from a predictor variable (Xi) and a parameter, b1, associated with the predictor variable that quantifies the relationship it has with the outcome variable.
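The equation being described is not captured in the highlight; with a single predictor it is the familiar linear model

\[
Y_i = (b_0 + b_1 X_i) + \varepsilon_i
\]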
Any straight line can be defined by two things: (1) the slope (or gradient) of the line (usually denoted by b1); and (2) the point at which the line crosses the vertical axis of the graph (known as the intercept of the line, b0).
A model with a positive b1 describes a positive relationship, whereas a line with a negative b1 describes a negative relationship.
Regression analysis is a term for fitting a linear model to data and using it to predict values of an outcome variable (a.k.a. dependent variable) from one or more predictor variables (a.k.a. independent variables). With one predictor variable, the technique is sometimes referred to as simple regression, but with several predictors it is called multiple regression. Both are merely terms for the linear model.
If the model is a perfect fit to the data, then for a given value of the predictor(s) the model will predict the same value of the outcome as was observed.
Sometimes the model overestimates the observed value of the outcome and sometimes it underestimates it. With the linear model, the differences between what the model predicts and the observed data are usually called residuals (they are the same as the deviations we saw when we looked at the mean).
We square the residuals before we add them up (this idea should be familiar from Section 2.5.2). Therefore, to assess the error in a linear model, just as when we assessed the fit of the mean using the variance, we use a sum of squared errors, and because we call these errors residuals, this total is called the sum of squared residuals or residual sum of squares (SSR). The residual sum of squares is a gauge of how well a linear model fits the data: if the squared differences are large, the model is not representative of the data (there is a lot of error in prediction); if the squared differences are small, the model is a good representation of the data.
As when we estimate the mean, we use the method of least squares to estimate the parameters (b) that define the regression model for which the sum of squared errors is the minimum it can be (given the data). This method is known as ordinary least squares (OLS) regression.
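A rough illustration outside SPSS (a sketch with made-up numbers; np.polyfit performs the least-squares estimation):

import numpy as np

# Hypothetical predictor and outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

b1, b0 = np.polyfit(x, y, 1)      # least-squares estimates of slope and intercept
y_hat = b0 + b1 * x               # model predictions
ss_r = np.sum((y - y_hat) ** 2)   # residual sum of squares: the quantity OLS minimizes
print(b0, b1, ss_r)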
The mean of the outcome is a model of ‘no relationship’ between the variables: as one variable changes, the prediction for the other remains constant.
This sum of squared differences is known as the total sum of squares (denoted by SST) and it represents how good the mean is as a model of the observed outcome scores.
We can use the values of SST and SSR to calculate how much better the linear model is than the baseline model of ‘no relationship’.
The improvement in prediction resulting from using the linear model rather than the mean is calculated as the difference between SST and SSR (Figure 9.5, bottom). This difference shows us the reduction in the inaccuracy of the model resulting from fitting the regression model to the data. This improvement is the model sum of squares (SSM).
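Continuing the sketch above (repeated here so it runs on its own), the three sums of squares and their ratio follow directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
b1, b0 = np.polyfit(x, y, 1)

ss_t = np.sum((y - y.mean()) ** 2)        # SST: error when the mean is the model
ss_r = np.sum((y - (b0 + b1 * x)) ** 2)   # SSR: error left after fitting the line
ss_m = ss_t - ss_r                        # SSM: improvement due to the model
print(ss_t, ss_r, ss_m, ss_m / ss_t)      # the final value is R2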
If the value of SSM is large, the linear model is very different from using the mean to predict the outcome variable. This implies that the linear model has made a big improvement to predicting the outcome variable.
If SSM is small, then using the linear model is little better than using the mean (i.e., the best model is no better than predicting from ‘no relationship’).
Like other test statistics, F is the ratio of systematic variance to unsystematic variance or, put another way, the model compared to the error in the model. This is true here: F is based upon the ratio of the improvement due to the model (SSM) to the error in the model (SSR).
For SSM the degrees of freedom are the number of predictors in the model (k), and for SSR they are the number of observations (N) minus the number of parameters being estimated (i.e., the number of b coefficients including the constant). We estimate a b for each predictor and the intercept (b0), so the total number of bs estimated will be k + 1, giving us degrees of freedom of N − (k + 1) or, more simply, N − k − 1. Thus we get equation 9.11.
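Reconstructed from that description, equation 9.11 should be the ratio of the corresponding mean squares:

\[
MS_M = \frac{SS_M}{k}, \qquad MS_R = \frac{SS_R}{N - k - 1}, \qquad F = \frac{MS_M}{MS_R}
\]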
The resulting F is a measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
The F-statistic is also used to calculate the significance of R2 using equation 9.13, in which N is the number of cases or participants and k is the number of predictors in the model.
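The equation is not in the highlight; the standard form of this test (presumably what 9.13 contains) is:

\[
F = \frac{(N - k - 1)\,R^2}{k\,(1 - R^2)}
\]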
This hypothesis is tested using a t-statistic that tests the null hypothesis that the value of b is 0. If the test is significant, we might interpret this information as supporting a hypothesis that the b-value is significantly different from 0 and that the predictor variable contributes significantly to our ability to estimate values of the outcome.
The t-statistic is based on the ratio of explained variance to unexplained variance or error.
If the standard error is very small, then most samples are likely to have a b-value similar to the one in our sample (because there is little variation across samples).
The t-test is calculated as shown in equation 9.14.
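The formula itself is cut off. Given the description above, and with the expected value of b being zero under the null hypothesis, it reduces to the estimate divided by its standard error:

\[
t = \frac{b_{\text{observed}} - b_{\text{expected}}}{SE_b} = \frac{b}{SE_b}
\]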
The statistic t has a probability distribution that differs according to the degrees of freedom for the test. In this context, the degrees of freedom are N − k − 1, where N is the total sample size and k is the number of predictors.
Using the appropriate t-distribution, it’s possible to calculate a p-value that indicates the probability of getting a t at least as large as the one we observed if the null hypothesis were true.
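A sketch of that calculation outside SPSS (the t-value, sample size and number of predictors are invented):

from scipy import stats

t_value = 2.37          # hypothetical t for a predictor's b
N, k = 50, 2            # sample size and number of predictors
df = N - k - 1          # degrees of freedom, as described above

p = 2 * stats.t.sf(abs(t_value), df)   # two-tailed probability of a t at least this extreme under H0
print(p)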
Generalization (Section 9.4) is a critical additional step, and if we find that our model is not generalizable, then we must restrict any conclusions to the sample used.
We can use standardized residuals, which are the residuals converted to z-scores (see Section 1.8.6) and so are expressed in standard deviation units.
(1) standardized residuals with an absolute value greater than 3.29 (we can use 3 as an approximation) are cause for concern because in an average sample a value this high is unlikely to occur; (2) if more than 1% of our sample cases have standardized residuals with an absolute value greater than 2.58 (2.5 will do) there is evidence that the level of error within our model may be unacceptable; and (3) if more than 5% of cases have standardized residuals with an absolute value greater than 1.96 (2 for convenience) then the model may be a poor representation of the data.
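These rules of thumb are easy to check programmatically (a sketch; z_resid stands in for the standardized residuals of whatever model was actually fitted):

import numpy as np

z_resid = np.random.default_rng(1).normal(size=200)   # placeholder for real standardized residuals
abs_z = np.abs(z_resid)

print("any |z| > 3.29:", np.any(abs_z > 3.29))           # rule 1: possible outliers
print("% of |z| > 2.58:", 100 * np.mean(abs_z > 2.58))   # rule 2: should be around 1% or less
print("% of |z| > 1.96:", 100 * np.mean(abs_z > 1.96))   # rule 3: should be around 5% or less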
A third form of residual is the studentized residual, which is the unstandardized residual divided by an estimate of its standard deviation that varies point by point.
It is also possible to look at whether certain cases exert undue influence over the parameters of the model.