0702-Correlation Analysis

Lecture for Year 2 Semester 2 PPSD Session on Wednesday 14th February 2007 by Professor Omar Hasan Kasule Sr.


Correlation analysis is used as preliminary data analysis before applying more sophisticated methods. A correlation matrix is used to explore for pairs of variables likely to be associated. More detailed analysis is then guided by the results of the matrix.
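A minimal sketch of this exploratory step is given below, using only the Python standard library. The variable names, the data values, and the 0.5 screening threshold are hypothetical and chosen purely for illustration; they are not part of the lecture.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(data):
    """Return a nested dict: matrix[a][b] is r between variables a and b."""
    names = list(data)
    return {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}


def candidate_pairs(data, threshold=0.5):
    """List variable pairs whose |r| reaches the (assumed) screening threshold."""
    matrix = correlation_matrix(data)
    return [
        (a, b, matrix[a][b])
        for a, b in combinations(data, 2)
        if abs(matrix[a][b]) >= threshold
    ]


# Hypothetical data set: five subjects, three variables.
data = {
    "age": [25, 32, 41, 50, 63],
    "systolic_bp": [118, 121, 130, 138, 147],
    "shoe_size": [42, 38, 44, 39, 41],
}

for a, b, r in candidate_pairs(data):
    print(f"{a} vs {b}: r = {r:.2f}")
```

Pairs flagged in this way are only candidates for the more detailed analysis; the matrix itself does not establish that any relation is real or causal.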


Correlation describes the relation between two random variables (a bivariate relation) measured on the same person or object, with no prior evidence of inter-dependence.


Correlation indicates only association; the association is not necessarily causative.


Correlation measures linear relation and not variability.


Correlation analysis has the objectives of describing the relation between x and y, prediction of y if x is known, prediction of x if y is known, studying trends, and studying the effect of a third factor on the relation between x and y.


The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and to identify outliers. Pearson’s coefficient of correlation, r, is the commonest statistic for linear correlation. Its formula looks complicated, but it is easily computed by modern computers. It is essentially a measure of the scatter of the data about a straight line.
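For reference, the usual textbook definition of the sample coefficient for n paired observations (x_i, y_i) is reproduced here. The notation, with bars for sample means, is the standard one and is not taken from the lecture itself.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how x and y vary together; the denominator rescales it by the spread of each variable, which is why r always lies between -1 and 1.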



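The value of the Pearson simple linear correlation coefficient is invariant when a constant is added to the x or y variable, or when the x or y variable is multiplied or divided by a positive constant. Multiplying one of the variables by a negative constant reverses the sign of r but leaves its magnitude unchanged.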
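The invariance property can be checked numerically. The sketch below uses hypothetical height and weight values (illustrative only) and compares r before and after shifting and rescaling; it also shows that multiplying one variable by a negative constant flips the sign of r while leaving its magnitude unchanged.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements, for illustration only.
height_cm = [150.0, 160.0, 165.0, 172.0, 180.0]
weight_kg = [52.0, 58.0, 63.0, 70.0, 79.0]

r_original = pearson_r(height_cm, weight_kg)

# Adding a constant to x leaves r unchanged.
r_shifted = pearson_r([h + 10.0 for h in height_cm], weight_kg)

# Changing units (cm to inches, kg to pounds) leaves r unchanged.
r_rescaled = pearson_r(
    [h / 2.54 for h in height_cm],
    [w * 2.2 for w in weight_kg],
)

# Multiplying x by a negative constant reverses the sign of r.
r_negated = pearson_r([-h for h in height_cm], weight_kg)

print(r_original, r_shifted, r_rescaled, r_negated)
```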


The coefficient can be used to compare scatter in 2 data sets measured in different units because it is not affected by the unit of measure.


Inspecting a scatter-gram helps interpret the coefficient. The correlation is not interpretable for small samples.


Values of 0.25 - 0.50 indicate a fair degree of association. Values of 0.50 - 0.75 indicate moderate to good relation. Values above 0.75 indicate good to excellent relation. A value of r = 0 indicates either no correlation or that the two variables are related in a non-linear way. Very high correlation coefficients are suspect. They may be due to collinearity or to restriction of the range of x or y, and not to an actual biological relationship.
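The caution about r = 0 can be illustrated with a small constructed example: y is completely determined by x through y = x squared, yet the linear coefficient is zero because the relation is symmetric about x = 0. The data are artificial and serve only as a sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Artificial data: a perfect but non-linear (quadratic) relation.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r_quadratic = pearson_r(x, y)
print(f"r for y = x^2 on symmetric x: {r_quadratic:.3f}")
```

This is one reason the scatter plot should be inspected before the coefficient is interpreted.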


In perfect positive correlation, r=1. In perfect negative correlation, r=-1. In cases of no correlation, r=0. In cases of no correlation with r=0, the scatter-plot is circular.


The t test is used to test the significance of the coefficient and to compute 95% confidence intervals of the coefficient of correlation. The coefficient will be wrong or misleading for non-linear relations. The linear correlation coefficient is not used when the relation is non-linear, when outliers exist, when the observations are clustered in 2 or 4 groups, or when one of the 2 variables is fixed in advance.
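As a sketch of the significance test, the usual statistic for the null hypothesis of zero correlation is t = r * sqrt(n - 2) / sqrt(1 - r^2), referred to the t distribution with n - 2 degrees of freedom. The function name and the example values of r and n below are hypothetical; the resulting t would be compared with a tabulated critical value, which this sketch does not look up.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """Return (t, degrees_of_freedom) for testing H0: no linear correlation.

    r is the sample Pearson coefficient and n the number of paired
    observations; requires n > 2 and |r| < 1.
    """
    if n <= 2:
        raise ValueError("need more than 2 paired observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    degrees_of_freedom = n - 2
    t = r * sqrt(degrees_of_freedom) / sqrt(1 - r ** 2)
    return t, degrees_of_freedom


# Hypothetical example: r = 0.6 observed in 27 paired observations.
t, df = correlation_t_statistic(0.6, 27)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The same r gives a larger t as n grows, which is consistent with the warning that the coefficient is hard to interpret in small samples.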

© Professor Omar Hasan Kasule, Sr. February 2007