Synopsis of a lecture by Professor Omar Hasan Kasule Sr. for the MPH class at Universiti Malaya on 17th November 2006


Regression to the mean, first described by Francis Galton (1822-1911), is one of the basic laws of nature, sunan al llah fi al kawn. Parametric regression models are cross-sectional (linear, logistic, or log-linear) or longitudinal (linear and proportional hazards). Regression relates independent variables to dependent variables. The variables may be raw data, dummy indicator variables, or scores. The simple linear regression equation is y = a + bx, where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable. Its validity is based on 4 assumptions: linearity of the x-y relation, normal distribution of the y variable for any given value of x, homoscedasticity (constant y variance for all x values), and independence of the y values for each value of x. The t test can be used to test the significance of the regression coefficient and to compare the regression coefficients of 2 lines. Multiple linear regression, a form of multivariate analysis, is defined by y = a + b1x1 + b2x2 + … + bnxn. The y variable can be interval, dichotomous, ordinal, or nominal, and the x variables can be interval or dichotomous but not ordinal or nominal. Interaction (product) variables can be included in the model. Linear regression is used for prediction (interpolation and extrapolation) and for analysis of variance.
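
As an illustration of the simple linear regression equation and the t test of the slope, the following is a minimal sketch in plain Python. The data (age and systolic blood pressure) and the function name simple_linear_regression are hypothetical and are not taken from the lecture; the critical value 3.182 is the two-sided 5% point of the t distribution with 3 degrees of freedom.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit of y = a + b*x.

    Returns the intercept a, the slope b, the standard error of b,
    and the t statistic for the null hypothesis b = 0 (n - 2 df).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope / regression coefficient
    a = y_bar - b * x_bar              # intercept
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return a, b, se_b, b / se_b

# Hypothetical data: age in years (x) and systolic blood pressure in mmHg (y).
age = [30, 40, 50, 60, 70]
sbp = [118, 125, 131, 140, 146]

a, b, se_b, t_stat = simple_linear_regression(age, sbp)
predicted = a + b * 55                 # interpolation within the observed x range

print(f"y = {a:.2f} + {b:.3f}x, t = {t_stat:.2f} on {len(age) - 2} df")
print(f"predicted y at x = 55: {predicted:.2f}")
```

With these made-up figures the fitted line is y = 96.5 + 0.71x, and the t statistic far exceeds 3.182, so the slope would be judged significantly different from zero. Prediction at x = 55 is interpolation; predicting at x = 90 would be extrapolation beyond the observed range and is less trustworthy.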



Logistic regression is non-linear regression with y dichotomous/binary such that logit(p) = ln[p/(1-p)] = a + b1x1 + b2x2 + … + bnxn, where p is the probability that y = 1. Logistic regression is widely used in epidemiology because outcome variables are often dichotomous and because the odds ratio can be derived directly from the regression coefficient using the formula OR = e^β, where β is the regression coefficient. Significance of the regression coefficient is tested using either the likelihood ratio test or the Wald test. Multiple logistic regression is used for matched analysis, stratified analysis to control for confounders, and prediction.
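
The link between the logistic regression coefficient and the odds ratio can be sketched as follows. This is an illustrative example only: the 2x2 table is hypothetical, the function fit_logistic is an assumed name, and Newton-Raphson is just one common way of obtaining the maximum likelihood estimates. With a single binary exposure, the exponentiated coefficient should reproduce the crude odds ratio of the table.

```python
import math

def fit_logistic(groups, iterations=25):
    """Fit logit(p) = a + b*x by Newton-Raphson on grouped data.

    groups is a list of (x, n_cases, n_total) tuples.
    Returns a, b, and the standard error of b.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = i_aa = i_ab = i_bb = 0.0
        for x, cases, total in groups:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = total * p * (1.0 - p)      # binomial variance weight
            resid = cases - total * p      # observed minus expected cases
            g_a += resid                   # score (gradient) terms
            g_b += x * resid
            i_aa += w                      # information matrix terms
            i_ab += x * w
            i_bb += x * x * w
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: parameters += inverse(information) * score
        a += (i_bb * g_a - i_ab * g_b) / det
        b += (i_aa * g_b - i_ab * g_a) / det
    se_b = math.sqrt(i_aa / det)           # from the inverse information matrix
    return a, b, se_b

# Hypothetical 2x2 table: 30 cases among 100 exposed, 10 cases among 100 unexposed.
groups = [(1, 30, 100), (0, 10, 100)]
a, b, se_b = fit_logistic(groups)

odds_ratio = math.exp(b)                   # OR = e^b
crude_or = (30 * 90) / (70 * 10)           # cross-product ratio of the table
wald_z = b / se_b                          # Wald test statistic for b

print(f"OR from coefficient = {odds_ratio:.3f}, crude OR = {crude_or:.3f}")
print(f"Wald z = {wald_z:.2f}")
```

The Wald statistic is compared with the standard normal distribution. In a multiple logistic model the same exponentiation gives an odds ratio adjusted for the other x variables in the model.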



Fitting the simple regression model is straightforward since it has only one independent variable. The multiple regression model is fitted by step-up, step-down, or step-wise selection of x variables. Step-up or forward selection starts with a minimal set of x variables and adds one x variable at a time. Step-down or backward elimination starts with a full model and eliminates one variable at a time. Step-wise selection is a combination of step-up and step-down selection. Variables are retained or eliminated on the basis of their p-value. Model validation is by using new data, data splitting, the jackknife procedure, or the bootstrap procedure. Misspecification occurs when a linear relation is assumed for a curvilinear one. Over-specification is the inclusion of unnecessary (extraneous) variables, which causes model overfit. Omitting important variables causes an under-fit model. Bias due to missing data can be dealt with by deleting incomplete observations, using an indicator variable for missing data, estimating missing values, or collecting additional data.
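
Validation by the jackknife procedure can be sketched as below for a simple linear regression. The data and the function names fit_line and jackknife_validation are hypothetical. The model is refitted n times, each time leaving out one observation; the spread of the n slopes gives a jackknife standard error, and the error in predicting each left-out observation gives an honest estimate of prediction error.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def jackknife_validation(x, y):
    """Refit the line n times, leaving out one observation each time."""
    n = len(x)
    slopes, squared_errors = [], []
    for i in range(n):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        a_i, b_i = fit_line(x_rest, y_rest)
        slopes.append(b_i)
        # Error in predicting the observation that was left out
        squared_errors.append((y[i] - (a_i + b_i * x[i])) ** 2)
    mean_slope = sum(slopes) / n
    se_jack = math.sqrt((n - 1) / n * sum((s - mean_slope) ** 2 for s in slopes))
    rmse_loo = math.sqrt(sum(squared_errors) / n)
    return slopes, se_jack, rmse_loo

# Hypothetical data lying close to the line y = 1 + 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2]

a_full, b_full = fit_line(x, y)
slopes, se_jack, rmse_loo = jackknife_validation(x, y)

# In-sample error, for comparison with the leave-one-out error
rmse_in = math.sqrt(sum((yi - (a_full + b_full * xi)) ** 2
                        for xi, yi in zip(x, y)) / len(x))

print(f"slope = {b_full:.3f}, jackknife SE = {se_jack:.3f}")
print(f"in-sample RMSE = {rmse_in:.3f}, leave-one-out RMSE = {rmse_loo:.3f}")
```

The leave-one-out error is never smaller than the in-sample error; a large gap between the two is a warning sign of overfit.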





The best model is one with the highest coefficient of determination or one for which any additions do not make any significant change in the coefficient. The model is assessed by the following: testing linearity, row diagnostics, column diagnostics, hypothesis testing, residual analysis, impact assessment of individual observations, and the coefficient of determination. Row diagnostics identify the following: outliers, influential observations, unequal variances (heteroscedasticity), and correlated errors. Column diagnostics deal mainly with multicollinearity, that is, correlations among several x variables that cause model redundancy and imprecision. Collinear variables should be dropped, leaving only the important one. Hypothesis testing of the omnibus significance of the model uses the F ratio. Hypothesis testing of individual x variables uses the t test. Residuals are defined as the differences between the observed values and the predicted values. A good model fit will have most residuals near zero, and the distribution of the residuals will be approximately normal in shape. The impact of specific observations is measured by their leverage or by Cook's distance. The coefficient of determination, defined as r^2, ranges from 0 to 1.0 and is a measure of goodness of fit. The fit of the model can be improved by using polynomial functions, linearizing transformations, creation of categorical or interaction variables, and dropping outliers.
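
The residuals, the coefficient of determination, leverage, and Cook's distance can be computed directly for a simple linear regression, as in the sketch below. The data and the function name regression_diagnostics are hypothetical; the last observation is deliberately placed far from the others to show how an influential point is flagged. The formulas used are the standard ones for a model with 2 parameters (intercept and slope).

```python
def regression_diagnostics(x, y):
    """Residuals, r^2, leverage, and Cook's distance for y = a + b*x."""
    n = len(x)
    p = 2                                   # parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar

    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # observed - predicted
    rss = sum(e ** 2 for e in residuals)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - rss / tss                      # coefficient of determination
    s2 = rss / (n - p)                      # residual variance

    # Leverage: how far each x value lies from the mean of x
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Cook's distance: combines residual size and leverage
    cooks = [e ** 2 / (p * s2) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return residuals, r2, leverage, cooks

# Hypothetical data: the last point (x = 20) is far from the rest and off their trend.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.9, 25.0]

residuals, r2, leverage, cooks = regression_diagnostics(x, y)
most_influential = max(range(len(x)), key=lambda i: cooks[i])

print(f"r^2 = {r2:.3f}")
for i, (h, d) in enumerate(zip(leverage, cooks)):
    print(f"obs {i + 1}: leverage = {h:.3f}, Cook's distance = {d:.3f}")
```

A commonly used rule of thumb treats a Cook's distance above 1 as a sign of an influential observation. Note that r^2 can remain high even when one such observation dominates the fit, which is why the row diagnostics are examined alongside it.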



Methods based on grouping and General Linear Models (GLIM) are alternatives to regression. Methods based on grouping/classification are principal components analysis, discriminant analysis, factor analysis, and cluster analysis. The General Linear Model (GLIM), unlike the general regression model, allows for the fact that explanatory variables can be linear combinations of other variables, and it does not give unique parameter estimates. It works well with continuous as well as categorical variables and places no restrictions on parameters.

Professor Omar Hasan Kasule Sr. November 2006