What factors affect the magnitude of the correlation coefficient? (a) List at least two factors and (b) explain how each affects the magnitude of the correlation coefficient.
What factors influence the magnitudes of correlations?
In theory, the empirical correlation between two variables X and Y estimates the strength of the “true” association between them. In practice, it often over- or underestimates the strength of the real relationship.
Factors that affect the magnitude of correlation:
1. Is the association between X and Y linear?
An r value can only be large if the association between the X and Y variables is linear. There can be a strong non-linear association between X and Y, and yet the Pearson’s r between them can be small, because r detects only linear association.
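A minimal sketch of this point, using made-up data and a hand-written helper (`pearson_r` is a name chosen here for illustration, not something from the source). In the curved case Y is completely determined by X, yet r comes out essentially zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(-5, 6))               # illustrative X values, symmetric around 0
ys_linear = [2 * x + 1 for x in xs]   # perfectly linear in X
ys_curved = [x ** 2 for x in xs]      # perfectly determined by X, but U-shaped

r_linear = pearson_r(xs, ys_linear)
r_curved = pearson_r(xs, ys_curved)
print(f"linear: r = {r_linear:.2f}")
print(f"curved: r = {r_curved:.2f}")
```

Plotting the data first is the usual way to catch a curved pattern like this before trusting a small r.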
2. Are the distribution shapes of scores on X and Y the same?
An r of +1 tells us that there is a one-to-one mapping of locations (in z-score terms) of X and Y values. Such a mapping is not possible for all scores if the distribution shapes of X and Y differ from each other, so differing shapes keep the largest attainable r below 1 (ideally, we assume that both X and Y are normally distributed).
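As a rough illustration with invented data, the sketch below pairs an evenly spread X with a strongly skewed Y. The rank order of the two variables agrees perfectly, which is the best case, but r still falls well short of 1 because the shapes differ. The helper name `pearson_r` is an assumption made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 11))          # evenly spread scores (flat distribution)
ys = [2 ** x for x in xs]        # same rank order, but heavily right-skewed

r_shapes = pearson_r(xs, ys)
print(f"same ranks, different shapes: r = {r_shapes:.2f}")
```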
3. Are there bivariate outliers in the X, Y scatter plot?
Depending on their locations, outliers can either increase or decrease the value of r. Warner (2012, Applied Statistics) illustrates this with a scatter plot and correlation for the same data with and without an extreme outlier.
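The effect can be sketched with two tiny invented data sets (these are not Warner's data). In the first, one extreme point inflates a near-zero correlation. In the second, one extreme point destroys a perfect one. The helper `pearson_r` is a name assumed for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Case 1: an unrelated cloud of points, then one outlier far along the diagonal.
cloud_x, cloud_y = [1, 2, 3, 4, 5], [3, 1, 4, 1, 3]
r_cloud = pearson_r(cloud_x, cloud_y)
r_cloud_outlier = pearson_r(cloud_x + [20], cloud_y + [20])

# Case 2: a perfect line, then one outlier far off the line.
line_x, line_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
r_line = pearson_r(line_x, line_y)
r_line_outlier = pearson_r(line_x + [10], line_y + [0])

print(f"cloud: {r_cloud:.2f} -> with outlier {r_cloud_outlier:.2f}")
print(f"line:  {r_line:.2f} -> with outlier {r_line_outlier:.2f}")
```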
4. How reliable are the X, Y measures?
When X and Y are not reliably measured, the observed r is smaller than the “real” correlation because of “attenuation due to unreliability”.
Estimating attenuation of correlation due to unreliability of measures:
$$r_{xy} = \rho_{xy} \sqrt{r_{xx}\, r_{yy}}$$
In this equation, $\rho_{xy}$ is the “real” strength of association between X and Y if they were measured without error, $r_{xx}$ and $r_{yy}$ are the reliability coefficients for X and Y, and $r_{xy}$ is the observed correlation in the sample.
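A short worked example, assuming the standard attenuation formula, observed r = true rho times the square root of the product of the two reliabilities. The function names and the numbers are illustrative choices, not taken from the source.

```python
import math

def attenuated_r(rho_xy, r_xx, r_yy):
    """Expected observed correlation, given the true correlation and
    the reliabilities of X and Y (assumed formula: rho * sqrt(rxx * ryy))."""
    return rho_xy * math.sqrt(r_xx * r_yy)

def disattenuated_rho(r_xy, r_xx, r_yy):
    """Estimate of the true correlation, obtained by inverting the same formula."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Illustrative values: true correlation .50, both measures with reliability .80.
observed = attenuated_r(0.50, 0.80, 0.80)
recovered = disattenuated_rho(observed, 0.80, 0.80)
print(f"observed r is about {observed:.2f}")
print(f"corrected estimate is about {recovered:.2f}")
```

With reliabilities of .80 on both measures, a true correlation of .50 is expected to show up as roughly .40 in the observed data.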
5. Is there a restricted range for scores on X and/or Y?
Other factors being equal, a restricted range usually yields a smaller correlation.
6. Do the scores on either variable represent only extreme groups (and not intermediate score values)?
If extreme groups are selected, this usually results in a larger correlation.
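Points 5 and 6 can be sketched with one simulation. The setup is an assumption made for illustration: X is standard normal and Y is X plus independent standard normal noise, so the population correlation is about .71. The sample is then cut down two ways, to a narrow band of X and to the extremes of X only.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)                       # fixed seed for repeatability
pairs = []
for _ in range(5000):
    x = rng.gauss(0, 1)
    pairs.append((x, x + rng.gauss(0, 1)))    # Y = X + noise

def r_for(subset):
    """Correlation computed on a list of (x, y) pairs."""
    return pearson_r([p[0] for p in subset], [p[1] for p in subset])

r_full = r_for(pairs)
r_restricted = r_for([p for p in pairs if 0.0 <= p[0] <= 1.0])  # narrow range of X
r_extreme = r_for([p for p in pairs if abs(p[0]) > 1.5])        # extreme groups only

print(f"full range:     r = {r_full:.2f}")
print(f"restricted X:   r = {r_restricted:.2f}")
print(f"extreme groups: r = {r_extreme:.2f}")
```

The underlying X, Y relationship is identical in all three cases. Only the spread of X that reaches the analysis changes.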
7. Does the sample include groups of people for whom the X,Y association differs?
For example, if women have a positive correlation between X and Y, and men have a negative correlation between X and Y, and your sample includes both men and women, and your analysis does not control for sex, the correlation between X and Y in the entire sample may be close to zero. (The interaction between sex and X as predictors of Y will not be detected unless further analyses are done.)
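A deliberately extreme sketch of this situation, with invented scores: the association is perfectly positive in one subgroup and perfectly negative in the other, and the pooled correlation is zero. The names `group_a`, `group_b` and `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Each group is a pair of lists: X scores, then Y scores.
group_a = ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])   # Y rises with X
group_b = ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])   # Y falls with X

r_a = pearson_r(*group_a)
r_b = pearson_r(*group_b)
r_pooled = pearson_r(group_a[0] + group_b[0], group_a[1] + group_b[1])

print(f"group A: r = {r_a:.2f}")
print(f"group B: r = {r_b:.2f}")
print(f"pooled:  r = {r_pooled:.2f}")
```

Computing r separately within each subgroup, as done here, is one simple way to notice that the pooled value is hiding two opposite patterns.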
8. Are individual cases or aggregated scores for groups of cases examined?
Correlations using individual cases can be quite different from correlations based on aggregated scores. Inferences to individuals based on analyses of grouped data can lead to the ‘ecological fallacy’.
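The gap between individual-level and aggregated correlations can be sketched with invented data for three groups. Within every group Y falls as X rises, but the group means line up perfectly, so the correlation of the means says nothing reliable about individuals. All names and numbers are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Three groups of (x, y) cases; inside each group the trend is negative.
groups = [
    [(1, 3), (2, 2), (3, 1)],
    [(4, 6), (5, 5), (6, 4)],
    [(7, 9), (8, 8), (9, 7)],
]

# Individual-level correlation: every case counts separately.
all_cases = [case for g in groups for case in g]
r_individual = pearson_r([c[0] for c in all_cases], [c[1] for c in all_cases])

# Aggregated correlation: one (mean x, mean y) point per group.
mean_x = [sum(c[0] for c in g) / len(g) for g in groups]
mean_y = [sum(c[1] for c in g) / len(g) for g in groups]
r_aggregated = pearson_r(mean_x, mean_y)

# Correlation inside the first group only.
r_within = pearson_r([c[0] for c in groups[0]], [c[1] for c in groups[0]])

print(f"individual cases: r = {r_individual:.2f}")
print(f"group means:      r = {r_aggregated:.2f}")
print(f"within group 1:   r = {r_within:.2f}")
```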
9. And then of course: Sampling error (chance, the ever present rival explanation).
Due to sampling error, the sample estimate of r is often larger or smaller than the “true” strength of the association.
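Sampling error is easy to see in a simulation. The sketch below assumes a population in which X and Y are bivariate normal with a true correlation of .50, then draws many small samples and records r for each one. The sample size, number of samples and seed are arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)          # fixed seed for repeatability
rho = 0.5                       # assumed true population correlation
sample_size = 20
sample_rs = []

for _ in range(500):
    xs, ys = [], []
    for _ in range(sample_size):
        x = rng.gauss(0, 1)
        # Build Y so that corr(X, Y) equals rho in the population.
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    sample_rs.append(pearson_r(xs, ys))

print(f"smallest sample r: {min(sample_rs):.2f}")
print(f"largest sample r:  {max(sample_rs):.2f}")
print(f"mean of sample rs: {sum(sample_rs) / len(sample_rs):.2f}")
```

Every sample comes from the same population, so the spread in the printed values is due to chance alone. Larger samples narrow that spread.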