1.0 DESCRIPTION
Correlation analysis is used as a preliminary data analysis before applying more sophisticated methods. A correlation
matrix is used to screen for pairs of variables that are likely to be associated. Correlation describes the relation between 2 random
variables (bivariate relation) measured on the same person or object with no prior evidence of inter-dependence. Correlation indicates
only association; the association is not necessarily causative. It measures linear relation and not variability. Correlation
analysis has the objectives of describing the relation between x and y, predicting y if x is known, predicting x if
y is known, studying trends, and studying the effect of a third factor on the relation between x and y. The first step in
correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and identify
outliers. Pearson's coefficient of correlation (product-moment correlation), r, is the commonest statistic for
linear correlation. Its formula looks complicated but is easily computed by computer. It is essentially a measure
of how closely the data scatter about a straight line.
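The product-moment formula mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not a replacement for a statistics package; the function name pearson_r and the small data set are made up for the example. It computes r as the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustrative data, not taken from any study
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```

A scatter plot of the same data should still be inspected first, since the number alone does not reveal outliers or curvature.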
2.0 PEARSON'S CORRELATION COEFFICIENT, r
The value of the Pearson simple linear correlation coefficient is invariant when a constant is added to the y or x
variable or when the x and y variables are multiplied or divided by a positive constant (multiplying one variable by a
negative constant reverses only the sign of r). The coefficient can be used to compare scatter
in 2 data sets measured in different units because it is not affected by the unit of measure. Inspecting a scatter plot helps
interpret the coefficient. The correlation is not reliably interpretable for small samples. Values of 0.25 - 0.50 indicate a fair degree
of association. Values of 0.50 - 0.75 indicate moderate to good relation. Values above 0.75 indicate good to excellent relation.
Values of r = 0 indicate either no correlation or that the two variables are related in a non-linear way. Very high correlation
coefficients may be due to collinearity or restrictions of the range of x or y and not due to actual biological relationship.
In perfect positive correlation, r = 1. In perfect negative correlation, r = -1. In cases of no correlation, r = 0 and
the scatter plot is roughly circular. The t test is used to test the significance of the coefficient. 95% confidence
intervals of the coefficient of correlation are usually computed after Fisher's z transformation. Random measurement errors, selection bias, sample heterogeneity,
and non-linear (curvilinear) relations reduce r whereas differential (non-random) errors increase the correlation. The coefficient
will be wrong or misleading for non-linear relations. The linear correlation coefficient is not used when the relation is
non-linear, when outliers exist, when the observations are clustered in 2 or 4 groups, or when
one of the variables is fixed in advance.
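The significance test and interval described above can be sketched as follows. The t statistic uses the usual form t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The interval is sketched with Fisher's z transformation, a standard approach for the correlation coefficient. The function names and the example values r = 0.6, n = 27 are hypothetical, and the multiplier 1.96 assumes a normal approximation for a 95% interval.

```python
import math


def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for r via Fisher's z transformation."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    # Back-transform the limits to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical example values
r, n = 0.6, 27
t = t_statistic(r, n)
low, high = fisher_ci(r, n)
print(t, low, high)
```

The computed t is compared with the tabulated critical value of the t distribution for n - 2 degrees of freedom. The interval is not symmetric about r, which reflects the skewed sampling distribution of the coefficient.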
3.0 OTHER CORRELATION COEFFICIENTS
When the relation between x and y is influenced by a third variable, the coefficient of partial correlation explains the
net relationship. The correlation ratio, used for curvilinear relations, is interpreted as the variability
of y accounted for by x. The biserial correlation coefficient is used in linear relations when one variable
is quantitative and the other is qualitative with 2 categories; the tetrachoric correlation coefficient is used when both
variables are dichotomous. The contingency coefficient is used for
2 qualitative nominal (i.e. unordered) variables each of which has 2 or more categories. The coefficient of mean square contingency
is used when both variables are qualitative. The multiple correlation coefficient is used to describe the relationship in
which a given variable is being correlated with several other variables. It describes
the strength of the linear relation between y and a set of x variables. It is obtained from the multiple regression
function as the positive square root of the coefficient of determination. The partial
correlation coefficient denotes the conditional relation between one independent variable and a response variable when all other
variables are held constant.
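For the simplest case of a single third variable z, the partial correlation described above can be sketched from the three pairwise Pearson coefficients using the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The function name and the input values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, holding z constant."""
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical pairwise correlations
r_xy_given_z = partial_r(0.5, 0.5, 0.5)
print(round(r_xy_given_z, 4))
```

In this example the crude correlation of 0.5 between x and y falls to about 0.33 once the shared association with z is removed.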
4.0 THE COEFFICIENT OF DETERMINATION, r^2
The square of the linear correlation coefficient is called the coefficient
of determination. It is the proportion of variation in the dependent variable, y, explained by the variation in the independent
variable, x.
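The interpretation of r^2 as the proportion of variation explained can be checked numerically for simple linear regression. The sketch below fits a least-squares line to a small made-up data set, computes 1 - (residual sum of squares / total sum of squares), and compares it with the square of the correlation coefficient. All names and data are hypothetical.

```python
import math

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)  # total variation in y

# Least-squares line y = intercept + slope * x
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation in y left unexplained by the fitted line
ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r2_from_fit = 1 - ss_residual / syy
r2_from_r = (sxy / math.sqrt(sxx * syy)) ** 2
print(round(r2_from_fit, 4), round(r2_from_r, 4))
```

Both quantities agree, so in this example 60% of the variation in y is accounted for by its linear relation with x.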
5.0 NON-PARAMETRIC CORRELATION ANALYSIS
The Spearman rank correlation coefficient is used for non-normal data for which the Pearson linear correlation coefficient
would be invalid. Its significance is tested using the t test. The advantage of rank correlation is that comparisons can be
carried out even if actual values of the observations are not known. It suffices to know the ranks.
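One way to sketch the rank correlation is to replace each observation by its rank, giving tied observations the average of the ranks they span, and then to apply the Pearson formula to the ranks. The helper names and the data below are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation, used here on the ranks."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y increases with x but not linearly
rho_monotone = spearman_rho([10, 20, 30, 40, 50], [1, 4, 9, 16, 25])
# Hypothetical data with ties in y
rho_ties = spearman_rho([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(rho_monotone, round(rho_ties, 4))
```

Because only the ordering enters the calculation, a perfectly monotonic but curved relation gives a rank correlation of 1, and the same result would be obtained if only the ranks had been recorded.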