Learning Objectives:
- Use of multivariate methods for controlling confounding and for prediction
- Definition of the logistic regression line and equation
- Three methods of fitting multiple regression: step-up, step-down, and stepwise
- Assessment of goodness of fit
Key Words and Terms:
· Confounding
· Interactive factor
· Likelihood function
· Model, overfit
· Model, reduced
· Model, under-fit
· Model, validation
· Principle of parsimony
· Regression, multiple logistic
· Selection, automatic variable
· Selection, backward
· Selection, forward
· Selection, stepwise
· Standard error of the estimate
· Variable, dependent
· Variable, dummy
· Variable, independent
· Variable, indicator
1.0 LOGISTIC REGRESSION
1.1 DEFINITION
Logistic regression is a type of non-linear regression. In logistic regression, the dependent variable, y, is expressed as the logit, which is the logarithmic transformation of a proportion or probability. The outcome variable, y, is binary (dichotomous).
The derivation of the logistic model is simple and straightforward. If the outcome y is dichotomous, it can take on only two values, 0 and 1. We define p = Pr(y = 1). The odds of y = 1 are p/(1 - p). The logit is the log-transformation of the odds, thus logit(p) = ln[p/(1 - p)] = a + b1x1 + ... + bnxn. By simple mathematical manipulation we can solve for p: p = 1 / [1 + e^-(a + b1x1 + ... + bnxn)]. The parameters are fitted by maximum likelihood estimation (MLE). The logistic function, written as e^x / (1 + e^x), is the inverse of the logit function, written as ln[p/(1 - p)].
Logistic regression is very useful in epidemiological analysis for two reasons. (a) It handles a dichotomized outcome variable and allows the odds ratio to be derived directly from the regression coefficient. (b) It can also be used for classification into two categories. A cut-off point is set for dichotomizing the outcome, for example at Pr(y = 1) = 0.2, 0.5, or 0.7. The model is set up and the coefficients are computed. The coefficients are then used to classify any person with a given profile of independent x variables.
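As a brief illustration (not part of the original text), the sketch below computes the logit and its inverse (the logistic function) and classifies a hypothetical person at a 0.5 cut-off. The coefficient values a and b and the covariate profile x are made up for illustration.

import numpy as np

def logit(p):
    # Log-odds: ln(p / (1 - p))
    return np.log(p / (1 - p))

def expit(z):
    # Inverse of the logit: the logistic function e^z / (1 + e^z)
    return 1 / (1 + np.exp(-z))

# Hypothetical fitted coefficients: intercept a and slopes b1, b2
a, b = -2.0, np.array([0.8, 0.05])
x = np.array([1.0, 45.0])            # profile of independent x variables for one person

p = expit(a + b @ x)                 # predicted probability that y = 1
predicted_class = int(p >= 0.5)      # classify using a cut-off of Pr(y = 1) = 0.5
print(round(p, 3), predicted_class)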
1.2 TESTS OF SIGNIFICANCE
The likelihood ratio test is constructed from the likelihoods of two models, one with and the other without the covariate whose significance is being tested. The test statistic follows a chi-square distribution: χ² = (-2 ln L0) - (-2 ln L1) = -2 ln(L0/L1), where L0 is the likelihood of the model without the covariate and L1 the likelihood of the model with it. The alternative test is the Wald test, which is based on the regression coefficient: z = b/se(b). The likelihood ratio and Wald tests give the same result for large samples; for smaller samples the likelihood ratio test is more reliable. The confidence limits for the odds ratio can be computed as e^(b ± 1.96 se(b)).
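The sketch below (illustrative only, assuming the Python statsmodels and scipy packages and simulated data) computes the likelihood ratio test and the Wald test for one covariate, together with the confidence limits for its odds ratio. L0 comes from the model without the covariate x2 and L1 from the model with it.

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated data for illustration
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 0.7 * x1 + 0.3 * x2)))
y = rng.binomial(1, p)

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)   # with x2
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)                       # without x2

# Likelihood ratio test: (-2 ln L0) - (-2 ln L1), 1 degree of freedom
lr_stat = 2 * (full.llf - reduced.llf)
lr_p = stats.chi2.sf(lr_stat, df=1)

# Wald test: z = b / se(b)
b, se = full.params[2], full.bse[2]
wald_z = b / se
wald_p = 2 * stats.norm.sf(abs(wald_z))

# 95% confidence limits for the odds ratio: e^(b +/- 1.96 se(b))
or_ci = np.exp([b - 1.96 * se, b + 1.96 * se])
print(lr_stat, lr_p, wald_z, wald_p, np.exp(b), or_ci)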
1.3 STATISTICAL PACKAGE
When using statistical packages to model the logistic relation, care must be taken to make sure that the right response is being modeled; packages may model the logit of the non-event (y = 0) by default. The usual output of a logistic regression is: the parameter estimate (the logistic regression coefficient), the standard error of the estimate, the Wald chi-square defined as {b/se(b)}², the p-value (the probability of a result higher than the given value of the chi-square), the standardized estimate defined as b / [(π²/3)^½ / s] where s is the standard deviation of the covariate, the odds ratio (OR) defined as the exponent of b, OR = e^b, the 95% confidence interval for the OR, the global chi-square, and six statistics that describe the association of predicted probabilities and observed responses: concordant pairs, discordant pairs, tied pairs, Somers' D, Gamma, and Tau-a.
1.4 ANALYSIS OF MATCHED DATA
ANALYSIS OF 1:1 MATCHED DATA
The variables are transformed: the difference in covariate values between the case and the control in each matched pair is used as the explanatory variable, and the usual logistic regression model is fitted.
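A minimal sketch of this difference method (illustrative only, with simulated pair data and assuming statsmodels): the within-pair case-minus-control differences are used as the explanatory variables, the outcome is set to 1 for every pair, and the model is fitted without an intercept, which yields the conditional (matched) coefficient estimates.

import numpy as np
import statsmodels.api as sm

# Hypothetical matched data: one row per pair, two covariates each for case and control
rng = np.random.default_rng(1)
n_pairs = 200
x_case = rng.normal(loc=0.5, size=(n_pairs, 2))      # cases tend to have higher values
x_control = rng.normal(loc=0.0, size=(n_pairs, 2))

d = x_case - x_control        # explanatory variable = case-minus-control difference
y = np.ones(n_pairs)          # outcome is 1 for every pair

fit = sm.Logit(y, d).fit(disp=0)    # note: no intercept term is added
print(fit.params)                   # conditional log odds ratios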
ANALYSIS OF N:M MATCHED DATA
For 1:M or N:M matched data, a conditional logistic regression model is fitted using the proportional hazards procedure. A stratum is formed for each matched set based on age or some other matching variable.
A survival time variable is created such that all cases in a stratum have the same event time and the controls are censored at a later time; the event indicator is 1 for cases and 0 for controls.
2.0 MULTIPLE LOGISTIC REGRESSION
The purpose of multiple logistic regression is to adjust for many co-factors in situations with a dichotomous outcome variable. Stratified analysis is an alternative method of adjustment, but it breaks down rapidly if there are too many strata or if the strata are thin. Multiple logistic regression can model many thin strata and still give meaningful results.
In the logistic regression model, the dependent variable, y, is nominal (binary). The independent variables, x, can be nominal or continuous; nominal variables, entered as indicator (dummy) variables, are often preferred.
The multiple logistic regression model is: logit(p) = a + b1x1 + b2x2 + ... + bnxn.
It can be seen that this is mathematically equivalent to
p = 1 / (1 + e^-(a + b1x1 + b2x2 + ... + bnxn))
The logistic model is fitted using maximum likelihood estimation, MLE. The conditional logistic model is used for matched
data.
The adjusted odds ratio is estimated directly from the regression coefficient as OR = e^b.
The predicted probability is given as p̂ = 1 / [1 + exp(-(â + Σ b̂i xi))].
The predicted probabilities can then be compared with the actual observed outcomes. A 2 x 2 table is then created as follows:

                    PREDICTED
OBSERVED            0        1
      1             A        B
      0             C        D
The Brier score is used to assess prediction. The smaller the score, the better the prediction. However in assessing
model prediction we have to use a different set of data. Bias can occur if the data used for modeling is the same data used
for assessing model prediction.
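A sketch of this comparison on hypothetical held-out data (coefficients and observations made up for illustration): the predicted probabilities are computed, the 2 x 2 table of observed versus predicted classifications is built at a 0.5 cut-off, and the Brier score is calculated.

import numpy as np

def expit(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical fitted coefficients (intercept and slopes) from the modelling data
a_hat, b_hat = -1.0, np.array([0.6, 0.02])

# Held-out observations NOT used to fit the model, to avoid bias
rng = np.random.default_rng(2)
X_new = np.column_stack([rng.normal(size=300), rng.normal(50, 10, size=300)])
y_new = rng.binomial(1, expit(-1.0 + 0.6 * X_new[:, 0] + 0.02 * X_new[:, 1]))

p_hat = expit(a_hat + X_new @ b_hat)      # predicted probabilities
y_pred = (p_hat >= 0.5).astype(int)       # predicted class at a 0.5 cut-off

# Cells of the 2 x 2 table of observed versus predicted classifications
A = np.sum((y_new == 1) & (y_pred == 0))
B = np.sum((y_new == 1) & (y_pred == 1))
C = np.sum((y_new == 0) & (y_pred == 0))
D = np.sum((y_new == 0) & (y_pred == 1))

# Brier score: mean squared difference between predicted probability and outcome
brier = np.mean((p_hat - y_new) ** 2)
print(A, B, C, D, brier)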
3.0 FITTING THE MULTIPLE REGRESSION MODEL
For any data set we can set up several linear models and an infinite number of non-linear models. Selecting the best regression model can be quite complicated. Not every variable available in the data set should be included in the model; only relevant variables should be included. A model is said to be correctly specified if it contains all relevant independent variables, including interaction terms, with no redundant or extraneous terms. A model is said to be under-specified if it misses important independent variables. An over-specified model contains redundant independent variables; these extraneous variables may be unrelated to the other independent variables or to the dependent variable.
There are several approaches to reducing the number of potential variables in the model. Variables can be excluded on theoretical grounds using biological knowledge of causal or non-causal associations. A correlation matrix is useful for preliminary exploration of relations among variables. If two variables are correlated, we drop the one with more missing data, greater measurement error, or less theoretical importance. Also dropped are variables that are unrelated to the outcome variable in bivariate analysis. The number of variables can also be reduced by combining variables into a single variable or a single scale.
Four procedures are used for fitting the multiple regression model: subset (best-subset), step-up, step-down, and step-wise. The best-fitting model is one with unbiased estimates of the b coefficients and minimum variance. Residual diagnostics and evaluation of multicollinearity are carried out on the fitted model to make sure it is the best.
In subset or best-subset regression, the computer is told to compute all possible models with 1, 2, 3, or more covariates and select the best-fitting one based on the likelihood score or the chi-square statistic.
Step-up is forward entry or forward selection and it starts with a minimal model. It involves adding one variable at
a time without trying to delete any variable.
In step-down or backward elimination we start with a full model or maximal model consisting of all variables then we
delete one variable at a time without trying to add any new variables.
Step-wise selection is a combination of step-up and step-down selection. All candidate variables are screened to select the one with the largest absolute value of the t ratio, and that variable is entered first into the model. Variables are then added to the model one at a time if they make a significant contribution as assessed by a pre-specified t value. Alternatively, the selection could be based on changes in the p-value, the point estimate, or the standard error of the estimate. After each addition of a new variable, the variable with the least contribution is removed if it falls below a pre-specified t value. The following rules of thumb are used to make decisions about variable inclusion and exclusion: if the t ratio is <= 1 the variable is omitted; if the t ratio is between 1.0 and 2.0 the variable is considered and a decision is made to include or exclude it; if the t ratio is >= 2.0 the variable is included. Stepwise model selection has the following disadvantages: (a) too many models have to be checked before arriving at the best model; (b) it ignores the effect of outliers; (c) it ignores non-linear models; (d) it uses the t ratio as a criterion and ignores R² and s; (e) it does not consider the joint effects of independent variables; (f) the order in which variables are introduced may affect the final result; and (g) purely automatic routines do not consider the investigator's special knowledge.
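The sketch below (illustrative only; the forward_select helper and the data are hypothetical) shows a simplified step-up (forward) selection loop of the kind described above, assuming statsmodels and using the Wald ratio b/se(b) in place of the t ratio, with an entry threshold of 2.

import numpy as np
import statsmodels.api as sm

def forward_select(y, X, candidates, z_enter=2.0):
    # Simplified forward (step-up) selection: at each step add the candidate
    # whose ratio |b / se(b)| is largest, provided it exceeds the entry threshold.
    selected = []
    while True:
        best_var, best_z = None, 0.0
        for var in candidates:
            if var in selected:
                continue
            exog = sm.add_constant(X[:, selected + [var]])
            fit = sm.Logit(y, exog).fit(disp=0)
            z = abs(fit.params[-1] / fit.bse[-1])   # ratio for the newly added variable
            if z > best_z:
                best_var, best_z = var, z
        if best_var is None or best_z < z_enter:
            break
        selected.append(best_var)
    return selected

# Hypothetical data: only columns 0 and 1 are truly related to the outcome
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.6 * X[:, 1]))))
print(forward_select(y, X, candidates=list(range(5))))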
Variable selection procedures are useful if the purpose of the regression is prediction. They are less useful if the
purpose is study of causal relations.
Significance testing and 95% confidence intervals can be computed for the intercept and the regression coefficients using the t-test. The test hypotheses are of the form H0: a = 0 and H0: b = 0.
Data splitting is a method used to validate variable selection. The data is split into 2 parts. One part is used for
variable selection and the other part is used to evaluate the variable selection.
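A minimal sketch of data splitting (illustrative only, assuming statsmodels; columns 0 and 2 stand in for the variables chosen in the selection step): one half is used for selection and fitting, and the other half only for evaluation.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 600
X = rng.normal(size=(n, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.7 * X[:, 0] - 0.5 * X[:, 2]))))

# Randomly split the data into two parts
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

# Part one: variable selection and model fitting (columns 0 and 2 assumed selected)
fit = sm.Logit(y[train], sm.add_constant(X[train][:, [0, 2]])).fit(disp=0)

# Part two: evaluate the selected model on data not used for selection
p_test = fit.predict(sm.add_constant(X[test][:, [0, 2]]))
brier = np.mean((p_test - y[test]) ** 2)
print(brier)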
The actual fitting of the regression model can be carried out using several approaches. The most popular is the maximum likelihood method, which can be based on the Poisson, binomial, or hypergeometric distributions.
Model specification errors can occur when important variables are omitted from the model. Failure to account for non-linear relations also leads to mis-specification. Over-specification means including too many variables in the model, with the risk of introducing collinearity. Variable selection procedures can be used to overcome this problem by selecting the best subset of explanatory variables, that is, the subset that gives the maximum R² for a given number of variables. A model is said to be overfit if extraneous variables are included; these variables, however, do not bias the parameter estimates. A model is said to be under-fit if important variables are not included; under-fitting is a cause of bias in parameter estimates.
Missing data can cause bias. The extent of bias due to missing data can be assessed by comparing observations with missing data against those without missing data on the most important variables. There are several approaches to dealing with missing data. Cases with missing data can be deleted. Alternatively, a new indicator (dummy) variable can be created with value 1 if data are missing on an independent variable and 0 otherwise; this shows how many observations have missing data and, when included in the model, adjusts for missingness in the analysis. Additional efforts can be made to obtain the missing data. The number of independent variables can be reduced by combining or scaling variables, which reduces the problem of missing data. There are also several methods of estimating (imputing) the value of missing data.
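A small sketch of the missing-data indicator described above (the data vector is hypothetical): the dummy variable flags missing values so they can be counted and adjusted for in the analysis.

import numpy as np

# Hypothetical covariate with missing values coded as NaN
x = np.array([2.1, np.nan, 3.4, np.nan, 1.8, 2.7])

# Indicator (dummy) variable: 1 if the value is missing, 0 otherwise
missing = np.isnan(x).astype(int)
print(missing.sum(), "observations have missing data on this variable")

# One simple strategy: fill the missing values (here with the observed mean)
# and keep the indicator in the model so the analysis adjusts for missingness
x_filled = np.where(np.isnan(x), np.nanmean(x), x)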
Non-convergence is a common problem. When there is no convergence, the likelihood equation for a logistic regression model does not have a finite solution and the fitting routine returns a message such as 'infinite parameters'. The following actions are taken in the case of infinite parameters: (a) checking the raw data for transcription errors, (b) categorizing quantitative variables, (c) using fewer explanatory variables, (d) collecting more data, or (e) reclassifying the response variable by using a different cut-off point.
4.0 ASSESSING REGRESSION MODELS
4.1 The ideal model
Selection of the best model is guided by the coefficient of determination, the significance of the regression coefficients, and residual analysis. The best model is one with the highest coefficient of determination, or one for which any additions do not make any significant changes in the coefficient. Insignificant predictor terms are best eliminated from the model unless there is a special reason for wishing to retain them. Model mis-specification occurs when a linear relation is assumed for a curvilinear situation. A model may also be mis-specified if important variables are omitted. After fitting the model, several diagnostic procedures can be carried out to assess its validity and appropriateness. Tests of linearity are carried out first; then row and column diagnostics are performed.
4.2 Validating a regression model
There are basically four approaches to validating a regression model that has been fitted. New data may be collected and used to test the model. Alternatively, existing data may be randomly split into two parts; one part is used to develop the model and the other part is used to test it. In the jack-knife approach, observations are deleted from the data one at a time and the model is recomputed to see whether there are any differences; a valid model will not change appreciably because of the removal of a few observations. In the bootstrap approach, random samples are selected from the data (with replacement) and the model is refitted for each sample. Constancy of the model indicates its validity.
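A sketch of the bootstrap approach on simulated data (illustrative only, assuming statsmodels): random samples are drawn with replacement, the model is refitted for each sample, and roughly constant coefficients across refits suggest a stable model.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 400
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.4, 0.8, -0.6])))))

boot_coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)             # sample rows with replacement
    fit = sm.Logit(y[idx], X[idx]).fit(disp=0)   # refit the model on the bootstrap sample
    boot_coefs.append(fit.params)

boot_coefs = np.array(boot_coefs)
print(boot_coefs.mean(axis=0))   # coefficients averaged over bootstrap refits
print(boot_coefs.std(axis=0))    # their variability across refits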
4.3 Assessment of goodness of fit in logistic regression models
There are basically three options: (a) the Hosmer and Lemeshow goodness-of-fit test, (b) the generalized coefficient of determination, and (c) the adjusted generalized coefficient of determination.
The Hosmer and Lemeshow goodness-of-fit test calculates a Pearson chi-square for a 2 x g table with g groups. It essentially involves comparing observed with expected (predicted) values. The chi-square with g - 2 degrees of freedom is given by: χ² = Σ(i=1 to g) [(Oi - Ni p̄i)² / {Ni p̄i (1 - p̄i)}], where Ni = number of observations in group i, Oi = number of outcomes (events) in group i, and p̄i = average estimated probability of the event in the ith group.
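A sketch of the Hosmer-Lemeshow computation (illustrative only; the hosmer_lemeshow helper and the simulated data are not from the original text): observations are sorted by predicted probability, divided into g groups, and observed events are compared with expected events in each group.

import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p_hat, g=10):
    # Group observations by sorted predicted probability and compare
    # observed with expected events in each of the g groups.
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)
    chi2 = 0.0
    for grp in groups:
        n_i = len(grp)
        o_i = y[grp].sum()             # observed events in group i
        pbar_i = p_hat[grp].mean()     # average estimated probability in group i
        chi2 += (o_i - n_i * pbar_i) ** 2 / (n_i * pbar_i * (1 - pbar_i))
    p_value = stats.chi2.sf(chi2, df=g - 2)   # g - 2 degrees of freedom
    return chi2, p_value

# Example with simulated predicted probabilities and outcomes
rng = np.random.default_rng(7)
p_hat = rng.uniform(0.05, 0.95, size=500)
y = rng.binomial(1, p_hat)
print(hosmer_lemeshow(y, p_hat))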
The generalized coefficient of determination is given by the expression R² = 1 - [L(0)/L(b)]^(2/n), where L(0) = likelihood of a model consisting of the intercept only, L(b) = likelihood of the specified model, and n = sample size.
The adjusted generalized coefficient of determination is computed as the ratio of the observed coefficient of determination to the maximum attainable coefficient of determination. The maximum coefficient of determination is given by the expression 1 - [L(0)]^(2/n).
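A small helper (written for illustration, assuming the exponent is 2/n with n the sample size) that computes the generalized coefficient of determination and its adjusted, max-rescaled version from the two log-likelihoods.

import numpy as np

def generalized_r2(llf_null, llf_model, n):
    # Generalized R^2 = 1 - [L(0)/L(b)]^(2/n), computed from log-likelihoods
    r2 = 1 - np.exp(2 * (llf_null - llf_model) / n)
    # Maximum attainable R^2 = 1 - [L(0)]^(2/n); the adjusted version rescales by it
    r2_max = 1 - np.exp(2 * llf_null / n)
    return r2, r2 / r2_max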
4.4 Improving the fit of the regression model
The interaction term is defined as the product of two terms, for example var3 = var1 * var2. Interaction terms can produce a better fit. More than one method of creating interaction variables may be used to improve the model; for example, interaction and indicator variables may be combined. In the model y = a + b1x1 + b2x2 + b3x1x2, x2 is a dummy variable. If x2 = 0, the model becomes y = a + b1x1. If x2 = 1, the model becomes y = (a + b2) + (b1 + b3)x1. A dummy variable can be attached to each indicator variable, for example in the model y = a + b1x1 + b2x2 + b4x1x3 + b5x2x3. Some significant interactions may turn out to be difficult to interpret clinically. Interaction is suspected when a variable thought to be significant on theoretical grounds turns out to be insignificant in the regression model; this indicates that its significance holds only under certain conditions of interaction. Thus testing for interaction becomes a form of sub-group analysis.
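A short sketch of building an interaction term and the corresponding design matrix (variable names follow the example above; the data are hypothetical).

import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=100)            # continuous variable
x2 = rng.integers(0, 2, size=100)    # dummy (0/1) variable

x3 = x1 * x2                         # interaction term = product of the two variables

# Design matrix for the model y = a + b1*x1 + b2*x2 + b3*x1*x2:
# when x2 = 0 the model reduces to a + b1*x1;
# when x2 = 1 it becomes (a + b2) + (b1 + b3)*x1.
X = np.column_stack([np.ones(100), x1, x2, x3])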
Other approaches: the regression can be improved by adding a suppressor variable to the model in order to enhance the importance of other variables. The regression model can also be improved by dropping outliers. A constant can be added to or subtracted from each independent variable, as in the model y = a + b(x - 100) or y = a + b(x + 50).