Synopsis of a lecture by Professor Omar Hasan Kasule Sr. for the MPH class at Universiti Malaya on 17th November 2006


Correlation analysis is used as preliminary data analysis before more sophisticated methods are applied. A correlation matrix is used to explore for pairs of variables likely to be associated. Correlation describes the relation between two random variables (a bivariate relation) measured on the same person or object, with no prior evidence of inter-dependence. Correlation indicates only association; the association is not necessarily causative. It measures linear relation and not variability. Correlation analysis has the objectives of describing the relation between x and y, predicting y if x is known, predicting x if y is known, studying trends, and studying the effect of a third factor on the relation between x and y. The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and to identify outliers. The Pearson coefficient of correlation (product-moment correlation), r, is the most common statistic for linear correlation. Its formula is complicated but is computed easily by modern computers. It is essentially a measure of the scatter of the data.
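The product-moment formula described above can be sketched in a few lines of code. This is an illustrative implementation, not part of the lecture; the function name `pearson_r` and the height/weight figures are hypothetical examples chosen for the sketch.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation:
    # r = sum((xi - xbar)(yi - ybar)) / sqrt(sum((xi - xbar)^2) * sum((yi - ybar)^2))
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# hypothetical paired measurements on the same persons, e.g. height (cm) and weight (kg)
height = [150, 160, 165, 170, 180]
weight = [50, 56, 61, 65, 72]
r = pearson_r(height, weight)
print(round(r, 3))  # close to 1: the points lie near a straight line
```

In practice one would still plot the points first, since a single r value cannot reveal outliers or a curved pattern.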



The value of the Pearson simple linear correlation coefficient is invariant when a constant is added to the y or x variable or when the x and y variables are multiplied or divided by a positive constant. Because it is not affected by the unit of measure, the coefficient can be used to compare scatter in two data sets measured in different units. Inspecting a scattergram helps interpret the coefficient. The coefficient is not interpretable for small samples. Values of 0.25 - 0.50 indicate a fair degree of association. Values of 0.50 - 0.75 indicate moderate to good relation. Values above 0.75 indicate good to excellent relation. In perfect positive correlation, r = 1. In perfect negative correlation, r = -1. A value of r = 0 indicates either no correlation or that the two variables are related in a non-linear way; with no correlation the scatter-plot is circular. Very high correlation coefficients may be due to collinearity or to restrictions of the range of x or y and not to an actual biological relationship. The t test is used to test the significance of the coefficient and to compute 95% confidence intervals for it. Random measurement errors, selection bias, sample heterogeneity, and non-linear (curvilinear) relations reduce r, whereas differential (non-random) errors increase the correlation. The coefficient will be wrong or misleading for non-linear relations. The linear correlation coefficient is not used when the relation is non-linear, when outliers exist, when the observations are clustered in 2 or 4 groups, or when one of the variables is fixed in advance.
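The invariance property and the t test mentioned above can be checked numerically. This sketch assumes hypothetical data; the test statistic shown is the standard t = r·√(n−2)/√(1−r²) with n − 2 degrees of freedom for the null hypothesis of zero correlation.

```python
import math

def pearson_r(x, y):
    # product-moment correlation coefficient
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 11.8]
r = pearson_r(x, y)

# adding a constant, or multiplying by a positive constant (e.g. a unit
# change such as kg -> lb), leaves r unchanged
r_shifted = pearson_r([xi + 100 for xi in x], y)
r_scaled = pearson_r(x, [yi * 2.2 for yi in y])

# t statistic for H0: no correlation, on n - 2 degrees of freedom
n = len(x)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
```

A large |t| relative to the t distribution with n − 2 degrees of freedom leads to rejecting the hypothesis of no correlation.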



When the relation between x and y is influenced by a third variable, the coefficient of partial correlation describes the net relationship. The correlation ratio, used for curvilinear relations, is interpreted as the variability of y accounted for by x. The biserial correlation coefficient is used when one variable is quantitative and the other is dichotomous; the tetrachoric correlation coefficient is used when both variables are dichotomous. The contingency coefficient is used for two qualitative nominal (i.e. unordered) variables, each of which has 2 or more categories. The coefficient of mean square contingency is used when both variables are qualitative. The multiple correlation coefficient is used when a given variable is correlated with several other variables. It describes the strength of the linear relation between y and a set of x variables and is obtained from the multiple regression function as the positive square root of the coefficient of determination. The partial correlation coefficient denotes the conditional relation between one independent variable and a response variable when all other variables are held constant.



The square of the linear correlation coefficient is called the coefficient of determination. It is the proportion of variation in the dependent variable, y, explained by the variation in the independent variable, x.
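A short worked example of this definition, using a hypothetical correlation value:

```python
# hypothetical correlation between x and y
r = 0.70

# coefficient of determination: proportion of variation in y explained by x
coefficient_of_determination = r ** 2  # 0.49, i.e. about 49%
```

Note that even a seemingly strong correlation of 0.70 leaves about half of the variation in y unexplained by x.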



The Spearman rank correlation coefficient is used for non-normal data for which the Pearson linear correlation coefficient would be invalid. Its significance is tested using the t test. The advantage of rank correlation is that comparisons can be carried out even if actual values of the observations are not known. It suffices to know the ranks.
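Since only the ranks are needed, the Spearman coefficient can be sketched as follows. This illustrative implementation assumes no tied observations, for which the simple formula rho = 1 − 6·Σd² / (n(n² − 1)) applies; the data are hypothetical.

```python
def spearman_rho(x, y):
    # Spearman rank correlation for data without ties:
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    # where d_i is the difference between the ranks of x_i and y_i
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# ranks need not come from measured values; any ordering suffices
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rho)  # 0.8
```

Because the calculation uses ranks only, it is unaffected by skewness or outliers in the raw measurements, which is why it remains valid where the Pearson coefficient does not.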
