Kindle Notes & Highlights
Nominal variables include nearly all demographic variables such as religious affiliation, political persuasion and socio-economic status.
Ordinal variables are simply ordered and then named. The Mohs Scale, devised by the German mineralogist Friedrich Mohs in 1822, is an example of an ordinal scale.
The American psychologist Stanley Smith Stevens (1906–73) made a further sub-division of “continuous variables” in 1947 when he introduced ratio and interval scales of measurement (most of Pearson’s continuous variables were ratio). Stevens proposed the following:
1. Ratio scales. These differ from interval variables in two ways: a) an absolute zero indicates the absence of the property being measured (e.g. height, weight and blood pressure); and b) ratio scales are additive.
2. Interval scales. The zero point is arbitrary and does not reflect the absence of an attribute (such as 0 Celsius and 0 Fahrenheit readings).
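As a minimal sketch of why the distinction matters (the numbers are illustrative, not from the book): ratios such as “twice as much” make sense only on a ratio scale with an absolute zero, such as the Kelvin temperature scale, not on an interval scale such as Celsius.

```python
# Sketch: ratios are meaningful on a ratio scale (Kelvin, absolute zero)
# but not on an interval scale (Celsius, arbitrary zero).
celsius_a, celsius_b = 10.0, 20.0
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

print(celsius_b / celsius_a)   # 2.0, but 20 C is not "twice as hot" as 10 C
print(kelvin_b / kelvin_a)     # roughly 1.035, the physically meaningful ratio
```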
Correlation, one of the most widely used statistical methods, indicates the extent to which two variables go together (e.g., height and weight). The most common type measures a linear relationship between two variables, and refers to how well they go together in a straight line. But not every pair of characters or variables can be assessed by using a statistical correlation, and different methods of correlation are used within the biological, medical, behavioural, social and environmental sciences, as well as in industry, commerce, economics and education. Different types of correlational …
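A minimal sketch of the most common measure, Pearson’s product-moment correlation (the height and weight figures below are invented for illustration):

```python
import numpy as np

# Invented illustrative data: heights (cm) and weights (kg) of ten people.
height = np.array([150, 155, 160, 163, 168, 171, 175, 178, 182, 188])
weight = np.array([52, 55, 59, 60, 64, 66, 69, 72, 74, 80])

# Pearson's product-moment correlation: the covariance of the two variables
# scaled by their standard deviations, giving a value between -1 and +1.
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 3))   # close to +1: a strong positive (direct) linear relationship

# Reversing the direction of one variable gives a negative (inverse) correlation.
print(round(np.corrcoef(height, -weight)[0, 1], 3))   # close to -1
```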
Francis Galton was the first person to come up with a method to measure correlation when he created a graph to find a relationship between mother and daughter sweet peas. Until Galton invented the idea of correlation, causation was the primary way in which two related events were explained, especially in the physical sciences.
Hence, a mathematically perfect correlation does not mean causation: it simply means that two variables are very highly correlated. This may even be the result of a spurious or illusory correlation due to the influence of a third variable, called a “lurking variable”. While students’ university qualifications are highly correlated with their income later in life (the higher the grades, the higher the salary), this correlation could be due to a third (lurking or hidden) variable, such as the tendency to work hard.
Correlation is often depicted graphically on something called a scatter diagram to see what shape the data produce. If two variables produce a narrow ellipse that resembles a straight line, this indicates a high correlation. A full-size ellipse reveals a moderate correlation, whereas a circle indicates no correlation. In this way, correlation measures the strength (high, medium or low) of the relationship.
Correlation cannot, however, be transformed to a percentage. Thus, a moderate correlation of 0.55 or a high correlation of 0.80 is not equivalent to 55% or 80%, as some people erroneously believe.
The numerical index that correlation yields also measures the direction of the relationship. Either two variables move up or down the graph together (e.g., height and weight in healthy infants go up together) or one variable moves up while the other moves down (e.g., the faster one travels in a car, the sooner the destination is reached: speed increases as time decreases). The former produces a positive or direct correlation, while the latter yields a negative or inverse correlation.
Though the numerical index provides some information about the degree of a linear relationship, a scatter plot is a useful tool, because it may reveal instead a curvilinear relationship. Pearson introduced the correlation ratio in 1905 to measure a curvilinear relationship.
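A small sketch with invented data shows the point: for a U-shaped relationship, Pearson’s r is close to zero even though y clearly depends on x, whereas the correlation ratio (eta) picks up the curvilinear relationship.

```python
import numpy as np

# Invented data with a curvilinear (U-shaped) relationship: y depends on x,
# but not in a straight line, so Pearson's r is misleadingly small.
x = np.repeat([1, 2, 3, 4, 5], 4)
y = (x - 3) ** 2 + np.tile([-0.2, -0.1, 0.1, 0.2], 5)

r = np.corrcoef(x, y)[0, 1]                 # near 0: no *linear* relationship

# Correlation ratio (eta): between-group variation of y across the x groups
# relative to the total variation of y.
grand_mean = y.mean()
groups = [y[x == level] for level in np.unique(x)]
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((y - grand_mean) ** 2).sum()
eta = np.sqrt(ss_between / ss_total)

print(round(r, 3), round(eta, 3))           # r roughly 0, eta close to 1
```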
Galton measured the diameter and weight of thousands of mother and daughter sweet pea seeds in 1875, and found that the population of the offspring reverted towards the parents and followed the normal distribution. As the size of the mother pea seed increased, so did the size of the daughter pea seed, but the offspring was not as big or as small as the mother pea; it therefore regressed back towards the size of its “ancestor pea”.
Regression to the Mean. This refers to the tendency of a characteristic in a population to move away from extreme values and closer to the average.
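A minimal simulation with invented parameters illustrates the effect: the daughters of unusually tall mothers are, on average, taller than the population mean, but closer to it than their mothers are.

```python
import numpy as np

# Minimal simulation of regression to the mean (invented parameters):
# daughters of unusually tall mothers tend to be tall, but closer to the average.
rng = np.random.default_rng(0)
n = 100_000
mother = rng.normal(165, 7, n)                          # heights in cm
daughter = 165 + 0.5 * (mother - 165) + rng.normal(0, 6, n)

tall_mothers = mother > 180
print(mother[tall_mothers].mean())     # well above 180
print(daughter[tall_mothers].mean())   # above the 165 average, but much closer to it
```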
At the end of the 19th century, Pearson’s student George Udny Yule (1871–1951) introduced a novel approach to interpreting correlation and regression with a conceptually new use of the method of least squares, which is a mathematical tool that reduces the influence of errors when fitting a regression line to a set of data points.
Using the method of least squares, a regression analysis allows statisticians to estimate the response variable “Y” (the dependent variable, or the outcome being studied) from a specified variable “X” (the independent variable, or the one being manipulated).
Although the method of least squares may be used to analyse regression lines, much of the confusion surrounding regression to the mean can be attributed to those who forget that Galton’s regression to the mean involves two regression lines, not simply one regression line used to predict future outcomes by the method of least squares.
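A short sketch with invented data shows the two lines: the slope for predicting y from x differs from the slope for predicting x from y, and the correlation is the geometric mean of the two slopes.

```python
import numpy as np

# Invented data: Galton's set-up involves TWO regression lines, the regression
# of y on x and the regression of x on y, and they differ unless the
# correlation is perfect.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1000)
y = 0.6 * x + rng.normal(0, 0.5, 1000)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
b_yx = cov_xy / np.var(x, ddof=1)   # slope of the line predicting y from x
b_xy = cov_xy / np.var(y, ddof=1)   # slope of the line predicting x from y
r = np.corrcoef(x, y)[0, 1]

print(round(b_yx, 3), round(b_xy, 3))                     # two different slopes
print(round(np.sqrt(b_yx * b_xy), 3), round(abs(r), 3))   # r is their geometric mean
```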
Though Galton wanted to measure the correlation of stature between father and son, Pearson discovered in 1896 that Galton’s procedure for finding “co-relation”, as he spelt it, actually measured the slope of the regression line (that is, the regression coefficient) rather than the correlation itself.
The covariance, ∑(xy), where x and y are the deviations of the two variables from their respective means, is a measure of how much the deviations of two random variables move together.
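A minimal worked example (invented numbers) of the covariance as the average product of the deviations:

```python
import numpy as np

# Sketch: the covariance as the average product of the two variables'
# deviations from their means (invented data).
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

dev_x = x - x.mean()
dev_y = y - y.mean()
cov_by_hand = (dev_x * dev_y).sum() / len(x)   # sum of products of deviations / n

print(cov_by_hand)                     # 3.5
print(np.cov(x, y, bias=True)[0, 1])   # same value from NumPy (population form)
```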
In 1925, R.A. Fisher (1890–1962) reconstructed Pearson’s notation, introducing Y = a + bX (the general equation for a straight line) and incorporating the terms “dependent” variable and “independent” variable. This was an essential distinction to make for regression, because the independent variable is the predictor and the dependent variable is the criterion. Fisher then produced the equation for the regression (or predicted) line: Y’ = a + bX (where b is the regression coefficient and Y’, pronounced “Y prime”, indicates a regression line).
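A small sketch with invented data, fitting a and b by least squares and then forming the predicted line Y’ = a + bX:

```python
import numpy as np

# Sketch (invented data): fitting the regression line Y' = a + bX by least
# squares, where b is the regression coefficient and a the intercept.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (predictor)
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # dependent variable (criterion)

b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = Y.mean() - b * X.mean()

Y_prime = a + b * X                        # predicted values on the regression line
print(round(a, 3), round(b, 3))
print(np.round(Y_prime, 2))
```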
Pearson introduced the term simple correlation when measuring a linear relationship between two continuous variables only, such as the relationship between stature of father and stature of son.
This work provided the basis for the development of multiple regression. Like simple regression, it involves a linear prediction, but rather than using only one variable as the predictor, a collection of predictor variables can be used instead.
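A minimal sketch with invented data: two independent variables, X1 and X2, are used together to predict a single dependent variable, and the multiple correlation R is the correlation between the observed and predicted values.

```python
import numpy as np

# Sketch (invented data): multiple regression predicts one dependent variable
# from a collection of independent variables, here x1 and x2.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))                 # roughly [1.0, 2.0, -0.5]

# Multiple correlation coefficient R: correlation between y and its prediction.
R = np.corrcoef(y, X @ coef)[0, 1]
print(round(R, 3))
```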
To calculate the multiple correlation coefficient, Pearson introduced a higher form of mathematics. This played a pivotal role in the professionalization of mathematical statistics as an academic discipline at the end of the 19th century. Pearson learnt this type of mathematics at Cambridge from J.J. Sylvester and Arthur Cayley (1821–95), who had created matrix algebra out of their discovery of the theory of invariants during the mid-19th century.
This higher level of mathematics enabled statisticians to find complex mathematical solutions for statistical problems in a multivariate (or p-dimensional) space when a bivariate (or two-dimensional) system was insufficient.
Scientists can use two types of control when undertaking research: experimental and statistical control.
Pearson offered one way to statistically control certain variables in 1895 with part correlation, which is used with multiple correlation only and thus involves three or more variables. It is the correlation between the dependent variable and one of the independent variables after the researcher statistically removes the influence of one of the other independent variables from the first independent variable. Thus, the researcher can mathematically isolate the variable when it cannot be experimentally isolated. The statistician is essentially treating the item as if one of the variables doesn’t …
George Udny Yule later introduced partial correlation, in which the statistician removes the effects of one or more of the independent variables from both the dependent variable and one of the other independent variables. Partial correlation helps to identify spurious correlations.
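A small sketch with invented data, computing both coefficients by removing the linear influence of a second predictor (from one variable for the part correlation, and from both variables for the partial correlation):

```python
import numpy as np

# Sketch (invented data): part ("semipartial") and partial correlation between
# a dependent variable y and a predictor x1, controlling for a second predictor x2.
rng = np.random.default_rng(3)
n = 500
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 0.5 * x1 + 0.8 * x2 + rng.normal(size=n)

def residuals(target, control):
    """Remove the linear influence of `control` from `target`."""
    b = np.cov(target, control, bias=True)[0, 1] / np.var(control)
    a = target.mean() - b * control.mean()
    return target - (a + b * control)

# Part correlation: x2 is removed from x1 only.
part = np.corrcoef(y, residuals(x1, x2))[0, 1]
# Partial correlation: x2 is removed from both y and x1.
partial = np.corrcoef(residuals(y, x2), residuals(x1, x2))[0, 1]

print(round(part, 3), round(partial, 3))
```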
Pearson introduced two new methods in 1900: the tetrachoric (i.e. “four-fold”) correlation coefficient (rt), and his phi coefficient (φ), known later as “Pearson’s phi coefficient”, for discrete variables. Both methods measure the association between two variables that can each be placed into two mutually exclusive categories (called “dichotomous” variables), and are designed for 2 × 2 (or four-fold) tables.
Pearson’s phi coefficient was designed for two variables where a true dichotomy exists and thus the variables are not continuous. This technique is commonly used by psychometricians for test-construction in situations where a true dichotomy exists, such as “true” or “false” test items, and by epidemiologists who use it to assess a risk factor associated with the “presence” or “absence” of a disease against the incidence of mortality.
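A minimal sketch of the phi coefficient for a 2 × 2 table of invented counts:

```python
import numpy as np

# Sketch: Pearson's phi coefficient for a 2 x 2 table of invented counts,
# e.g. exposure to a risk factor (rows) against presence of disease (columns).
#                 disease   no disease
table = np.array([[30,       70],     # exposed
                  [10,       90]])    # not exposed
a, b = table[0]
c, d = table[1]

phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))   # (30*90 - 70*10) / sqrt(100*100*40*160) = 0.25
```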
Yule proposed the Q statistic, which he named for Quetelet, in 1899 (one month after Pearson introduced the phi coefficient and tetrachoric correlation). Yule was also looking for a measure that didn’t rely on continuous variables or depend on an underlying normal distribution, as was the case with the Pearson product-moment correlation.
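A matching sketch of Yule’s Q on the same kind of invented 2 × 2 table, using only the cross-products of the cell frequencies:

```python
# Sketch: Yule's Q for a 2 x 2 table (invented counts a, b, c, d),
# based only on the cross-products of the cell frequencies.
a, b, c, d = 30, 70, 10, 90
Q = (a * d - b * c) / (a * d + b * c)
print(round(Q, 3))   # (2700 - 700) / (2700 + 700), roughly 0.588
```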
Pearson devised the biserial correlation in 1909. This is related to the product-moment correlation (in which both variables are continuous), with one difference: one of the two variables has been divided into a dichotomy, even though the underlying trait is assumed to be continuous.
The point-biserial correlation is related to Pearson’s biserial correlation, but one variable is continuous and the other is a “true dichotomy”, such as male/female. The biserial correlation, by contrast, is an estimate of what the product-moment correlation would be if the dichotomous variable were replaced by a continuous variable instead.
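As a minimal sketch (invented scores), the point-biserial correlation can be computed as the ordinary product-moment correlation with the true dichotomy coded 0/1:

```python
import numpy as np

# Sketch (invented data): the point-biserial correlation computed as the
# product-moment correlation with the true dichotomy coded 0/1.
score = np.array([12., 15., 14., 18., 20., 22., 19., 25., 23., 27.])  # continuous
group = np.array([0,   0,   0,   0,   0,   1,   1,   1,   1,   1])    # true dichotomy

r_pb = np.corrcoef(score, group)[0, 1]
print(round(r_pb, 3))
```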
Rank order correlation is the study of relationships between different rankings on the same set of items. It deals with measuring correspondence between two rankings, and assessing the statistical significance of this. Two of the main methods were devised by Charles Spearman (1863–1945, a student of Karl Pearson) and Maurice Kendall. Three other tests include the Wilcoxon signed-rank test, Mann-Whitney U test and the Kruskal-Wallis analysis of ranks.
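A small sketch with invented data: Spearman’s coefficient is the product-moment correlation applied to the ranks rather than to the raw values.

```python
import numpy as np

# Sketch (invented data): Spearman's rank correlation is the product-moment
# correlation applied to the ranks of the two variables.
x = np.array([3.1, 1.2, 5.6, 4.4, 2.0, 6.3])
y = np.array([7.0, 2.5, 9.1, 8.0, 3.0, 9.9])

def ranks(a):
    # Rank from 1 (smallest) to n (largest); ties are not handled in this sketch.
    order = a.argsort()
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

rho = np.corrcoef(ranks(x), ranks(y))[0, 1]
print(round(rho, 3))   # 1.0 here, since the two rankings agree exactly
```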
Spearman was also influenced by Galton’s ideas of measuring individual differences in human abilities and by his early ideas on intelligence testing. Using Pearson’s product-moment correlation and the principal components method that Pearson introduced in 1901, Spearman created a new statistical method, known as factor analysis, which reduces a set of complex data into a more manageable form that makes it possible to detect structures in the relationship between variables.
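A minimal sketch of the principal-components idea only (not Spearman’s full factor-analysis procedure), using invented test scores driven by a single hypothetical common factor:

```python
import numpy as np

# Minimal sketch of the principal-components idea: eigendecomposition of a
# correlation matrix to see how much of the variation a single underlying
# component can summarize. The data are invented.
rng = np.random.default_rng(4)
n = 300
g = rng.normal(size=n)                              # a hypothetical common factor
scores = np.column_stack(
    [0.8 * g + rng.normal(scale=0.6, size=n) for _ in range(4)]
)

corr = np.corrcoef(scores, rowvar=False)            # 4 x 4 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Proportion of total variance captured by the largest component.
print(round(eigenvalues[-1] / eigenvalues.sum(), 3))
```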
The English statistician Maurice Kendall (1907–83) created another ranking method of correlation in 1938, known as Kendall’s tau. This method is a scheme based on the number of agreements or disagreements in ranked data.
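A small sketch with invented, tie-free data, counting concordant and discordant pairs:

```python
import numpy as np
from itertools import combinations

# Sketch (invented data, no ties): Kendall's tau counts agreements and
# disagreements in the ordering of every pair of observations.
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 1, 4, 3, 6, 5])

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

tau = (concordant - discordant) / (len(x) * (len(x) - 1) / 2)
print(tau)   # (12 - 3) / 15 = 0.6
```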