1.0 LINEAR REGRESSION
Regression to the mean, first described by Francis Galton (1822-1911), is one of the basic laws of nature, sunan Allah fi al-kawn (the laws of God in creation). Parametric regression models are cross-sectional (linear, logistic, or log-linear) or longitudinal (linear and proportional hazards). Regression relates independent variables to dependent variables. The variables may be raw data, dummy indicator variables, or scores. The simple linear regression equation is y = a + bx, where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable. Its validity rests on 4 assumptions: linearity of the x-y relation, normal distribution of the y variable for any given value of x, homoscedasticity (constant y variance for all x values), and independence of the y values for each value of x. The t test can be used to test the significance of the regression coefficient and to compare the regression coefficients of 2 lines. Multiple linear regression, a form of multivariable analysis, is defined by y = a + b1x1 + b2x2 + … + bnxn. In the broader multiple regression framework, y can be interval, dichotomous, ordinal, or nominal (each type calling for a different regression family), while x can be interval or dichotomous but not ordinal or nominal; ordinal or nominal predictors must first be recoded as dummy variables. Interactive (product) variables can be included in the model. Linear regression is used for prediction (interpolation and extrapolation) and for analysis of variance.
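As an illustration, the following minimal sketch (assuming the statsmodels library and invented age/blood-pressure data, both hypothetical) fits a simple linear regression y = a + bx and applies the t test to the slope:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: x = age in years, y = systolic blood pressure (mmHg)
x = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
y = np.array([118, 121, 124, 127, 126, 133, 135, 138, 141, 144])

X = sm.add_constant(x)               # adds the intercept term a
model = sm.OLS(y, X).fit()           # fits y = a + b*x by least squares

print(model.params)                  # a (const) and b (slope)
print(model.tvalues, model.pvalues)  # t test of each regression coefficient
print(model.rsquared)                # coefficient of determination r^2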
2.0 LOGISTIC REGRESSION
Logistic regression
is non-linear regression with y dichotomous/binary such that logit(y) = ln(p / (1 - p)) = a + b1x1 + b2x2 + … + bnxn, where p is the probability that y = 1. Logistic regression is used in epidemiology because the outcome variable is dichotomized and the odds ratio is derived directly from the regression coefficient by the formula OR = e^b. Significance of the regression coefficient is tested using either the likelihood ratio test or the Wald test. Multiple logistic regression is used for matched analysis, stratified analysis to control for confounders, and prediction.
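A minimal sketch (again assuming statsmodels, with an invented binary exposure and outcome) shows how OR = e^b falls out of the fitted coefficient:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: x = exposed (1) vs unexposed (0), y = diseased (1) vs not (0)
x = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0])

X = sm.add_constant(x)
model = sm.Logit(y, X).fit()   # fits logit(y) = a + b*x by maximum likelihood

b = model.params[1]            # regression coefficient of x
print(np.exp(b))               # odds ratio OR = e^b
print(model.pvalues[1])        # Wald test p-value for b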
3.0 FITTING REGRESSION MODELS
Fitting the simple regression model is straightforward since it has only one independent variable. The multiple regression model is fitted by step-up, step-down, or step-wise selection of the x variables. Step-up or forward selection starts with a minimal set of x variables, and one x variable is added at a time. Step-down or backward elimination starts with a full model, and one variable is eliminated at a time. Step-wise selection is a combination of step-up and step-down selection. Variables are retained or eliminated on the basis of their p-values, as sketched below. Model validation is by use of new data, data splitting, the jackknife procedure, or the bootstrap procedure. Misspecification occurs when a linear relation is assumed for a curvilinear one. Over-specification is the inclusion of too many unnecessary variables; such extraneous variables cause an over-fitted model. Omitting important variables causes an under-fitted model. Bias due to missing data can be dealt with by deleting incomplete observations, using an indicator variable for missing data, estimating the missing values, or collecting additional data.
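As an illustration of p-value based selection, the sketch below (assuming statsmodels and pandas; the data, the helper backward_eliminate, and the 0.05 retention threshold are all hypothetical) performs backward elimination:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """Step-down selection: drop the least significant x until all p-values < alpha."""
    kept = list(X.columns)
    while kept:
        model = sm.OLS(y, sm.add_constant(X[kept])).fit()
        pvals = model.pvalues.drop("const")  # p-values of the x variables only
        worst = pvals.idxmax()
        if pvals[worst] < alpha:             # all remaining variables significant
            return model, kept
        kept.remove(worst)                   # eliminate one variable at a time
    return None, []

# Hypothetical data: x3 is pure noise and should be eliminated
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - X["x2"] + rng.normal(size=100)
model, kept = backward_eliminate(y, X)
print(kept)   # expected: ['x1', 'x2']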
4.0 ASSESSING REGRESSION MODELS
The best model is one with the highest coefficient of determination, or one for which any further additions make no significant change in the coefficient. The model is assessed by the following: testing linearity, row diagnostics, column diagnostics, hypothesis testing, residual analysis, impact assessment of individual observations, and the coefficient of determination. Row diagnostics identify the following: outliers, influential observations, unequal variances (heteroscedasticity), and correlated errors. Column diagnostics deal mainly with multicollinearity, that is, correlations among several x variables causing model redundancy and imprecision. Collinear variables should be dropped, leaving only the most important one. Hypothesis testing of the omnibus significance of the model uses the F ratio. Hypothesis testing of individual x variables uses the t test. Residuals are defined as the difference between the observed values and the predicted values. A good model fit will have most residuals near zero, and the distribution of the residuals will be approximately normal in shape. The impact of specific observations is measured by their leverage or by Cook's distance. The coefficient of determination, defined as r^2, varies from 0 to 1.0 and is a measure of goodness of fit. The fit of the model can be improved by using polynomial functions, linearizing transformations, creation of categorical or interaction variables, and dropping outliers.
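The diagnostics named above can be computed directly; the sketch below (assuming statsmodels and simulated data with made-up coefficients) is one way to extract them from a fitted model:

import numpy as np
import statsmodels.api as sm

# Hypothetical model with two predictors and simulated noise
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
model = sm.OLS(y, X).fit()

print(model.fvalue, model.f_pvalue)    # omnibus F test of the whole model
print(model.tvalues)                   # t tests of individual coefficients
print(model.rsquared)                  # coefficient of determination r^2

residuals = model.resid                # observed minus predicted values
influence = model.get_influence()
leverage = influence.hat_matrix_diag   # leverage of each observation
cooks_d = influence.cooks_distance[0]  # Cook's distance of each observation
print(np.argmax(cooks_d))              # index of the most influential observation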
5.0 ALTERNATIVES TO REGRESSION
Methods based on grouping, or the general linear model (GLIM), are alternatives to regression. The methods based on grouping/classification are principal components analysis, discriminant analysis, factor analysis, and cluster analysis. The general linear model (GLIM), unlike the general regression model, allows explanatory variables to be linear combinations of other variables and does not give unique parameter estimates. It works well with continuous as well as categorical variables and has no restrictions on the parameters.
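As a brief illustration of the grouping approach, the sketch below (assuming the scikit-learn library and invented correlated data) reduces several x variables to a smaller set of principal components:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 observations of 4 correlated variables
rng = np.random.default_rng(2)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=100),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=100)])

pca = PCA(n_components=2)              # keep the 2 strongest components
scores = pca.fit_transform(X)          # uncorrelated component scores
print(pca.explained_variance_ratio_)   # share of variance captured by each component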