Correlation analysis is used as preliminary data analysis before applying more sophisticated methods.
A correlation matrix is used to screen for pairs of variables that are likely to be associated. More detailed analysis is then
guided by the results of the matrix.
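
As a minimal sketch of this screening step, the Python below builds a correlation matrix for three variables and lists the pairs that look worth a closer look. The data values, the variable names, the helper names `pearson_r` and `correlation_matrix`, and the 0.75 cut-off are all assumptions made for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict holding r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Invented measurements on the same six subjects (illustration only).
data = {
    "age":    [25, 32, 41, 48, 55, 63],
    "sbp":    [112, 118, 125, 131, 138, 147],
    "height": [170, 158, 181, 165, 174, 162],
}

matrix = correlation_matrix(data)
names = list(data)

# Print the matrix, one row per variable.
for a in names:
    print(a.ljust(8), "  ".join(f"{matrix[a][b]:6.2f}" for b in names))

# Flag each pair once; the 0.75 threshold is an arbitrary choice for this sketch.
strong_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(matrix[a][b]) >= 0.75
]
print("pairs to examine further:", strong_pairs)
```

Only the flagged pairs would then go on to more detailed analysis.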
Correlation describes the relation between 2 random variables (a bivariate relation) measured on the same person
or object, with no prior evidence of inter-dependence.
Correlation indicates only association; the association is not necessarily causative.
Correlation measures linear relation and not variability.
The objectives of correlation analysis are to describe the relation between x and y, to predict y
when x is known, to predict x when y is known, to study trends, and to study the effect of a third factor on the relation
between x and y.
The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual
impression of the data layout and to identify outliers. Pearson's coefficient of correlation, r, is the most common
statistic for linear correlation. Its formula looks complicated but is easily computed by software. Essentially, it
measures how tightly the data scatter around a straight line.
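
For reference, the usual definition of the sample coefficient for n paired observations is written out below. The symbols x_i, y_i and the sample means are notation introduced here, not taken from the notes.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how x and y vary together. The denominator rescales by the spread of each variable, which keeps r between -1 and 1.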
PEARSON'S CORRELATION COEFFICIENT, r
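The value of the Pearson simple linear correlation coefficient is invariant when a constant is added
to the y or x variable or when the x and y variables are multiplied or divided by a positive constant. Multiplying or dividing
one of the variables by a negative constant reverses the sign of r but leaves its magnitude unchanged.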
The coefficient can be used to compare scatter in 2 data sets measured in different units because it
is not affected by the unit of measure.
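
The unit-independence can be checked numerically. The sketch below uses invented temperature and pulse readings and a helper named `pearson_r`, both assumptions for illustration. It converts Celsius to Fahrenheit (multiply by a positive constant, then add a constant) and also negates the variable to show the sign of r reversing.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented readings on five subjects (illustration only).
celsius = [36.5, 37.0, 37.8, 38.4, 39.1]
pulse = [68, 72, 80, 85, 93]

r_original = pearson_r(celsius, pulse)

# Change of unit: scale by a positive constant and add a constant.
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r_rescaled = pearson_r(fahrenheit, pulse)

# Multiplying by a negative constant reverses the sign of r.
negated = [-c for c in celsius]
r_negated = pearson_r(negated, pulse)

print(f"r in Celsius:    {r_original:.6f}")
print(f"r in Fahrenheit: {r_rescaled:.6f}")
print(f"r after negation: {r_negated:.6f}")
```

The first two printed values agree up to floating-point rounding, and the third has the same magnitude with the opposite sign.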
Inspecting a scatter-gram helps interpret the coefficient. The correlation is not interpretable for
Values of 0.25 - 0.50 indicate a fair degree of association. Values of 0.50 - 0.75 indicate moderate
to good relation. Values above 0.75 indicate good to excellent relation. A value of r = 0 indicates either no correlation or
that the two variables are related in a non-linear way. Very high correlation coefficients are suspect. They may be due to
collinearity or to restriction of the range of x or y rather than to an actual biological relationship.
In perfect positive correlation, r = 1. In perfect negative correlation, r = -1. When there is no correlation,
r = 0 and the scatter plot is roughly circular.
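
The limiting values can be illustrated with a short sketch. The data and the helper name `pearson_r` are assumptions made for illustration. The third case is a parabola: y is completely determined by x, yet r is 0 because the relation is not linear.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]

# Points exactly on a rising line: perfect positive correlation.
r_positive = pearson_r(x, [2 * v + 1 for v in x])

# Points exactly on a falling line: perfect negative correlation.
r_negative = pearson_r(x, [-2 * v + 1 for v in x])

# Points exactly on a parabola symmetric about x = 0: strong relation, but not linear.
r_parabola = pearson_r(x, [v * v for v in x])

print(r_positive, r_negative, r_parabola)
```

This is why a coefficient near 0 should be read alongside the scatter plot before concluding that two variables are unrelated.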
The t test is used to test the significance of the coefficient and to compute 95% confidence
intervals of the coefficient of correlation. The coefficient will be wrong or misleading for non-linear relations. The linear
correlation coefficient should not be used when the relation is non-linear, when outliers exist, when the observations are
clustered in 2 or 4 groups, or when one of the 2 variables is fixed in advance.
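
A minimal sketch of the significance test follows, using the standard statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom under the null hypothesis of zero correlation. The sample values r = 0.60 and n = 10 are hypothetical, the function name `t_statistic` is an assumption, and the critical value is the usual tabulated two-sided 5% point for 8 degrees of freedom.

```python
from math import sqrt

def t_statistic(r, n):
    """t statistic for testing H0: population correlation = 0 (n - 2 degrees of freedom)."""
    if n < 3:
        raise ValueError("at least 3 paired observations are needed")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined when |r| = 1")
    return r * sqrt((n - 2) / (1 - r * r))

# Hypothetical sample: r = 0.60 from n = 10 paired observations.
r, n = 0.60, 10
t = t_statistic(r, n)
df = n - 2

# Two-sided 5% critical value of t with 8 degrees of freedom, from standard tables.
T_CRIT_DF8 = 2.306
significant = abs(t) > T_CRIT_DF8

print(f"t = {t:.3f} on {df} degrees of freedom; significant at 5%: {significant}")
```

In this hypothetical case a coefficient of 0.60 is not significant at the 5% level because the sample is small, which shows why the size of r and its significance should be read together.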