
ISLAMIC MEDICAL EDUCATION RESOURCES 04

0612 ARTIFICIAL DATA SET FOR PRACTICE ANALYSIS BY POSTGRADUATE EPIDEMIOLOGY STUDENTS



Data analysis workshop by
Professor Omar Hasan Kasule for MPH candidates at Universiti Malaya, 24th November – 1st December 2006


INSTRUCTIONS
 The attached data set is quite abstract, and the numbers were selected in a roughly random way.
The aim is to give you practice in managing and analyzing data. Some of the conclusions you reach may not be logical because
the data is not natural. The advantage of this is that you will focus on what the data is telling you and not on any preconceived
ideas or prior knowledge.
 The assignment instructions are deliberately kept very general to force you to think of everything that
can be done and to make choices, so do not limit your imagination. You will have to use your ingenuity to complete the data
management and analysis, starting from converting a Word file into an SPSS file and then looking for the various analytic programs.
It is possible that some analyses are not found in SPSS, and you may have to do them by hand using formulas looked up in specialized
books, or with analytic programs other than SPSS. You may need to compute and use extra variables. Be humble: there may be some analyses
you cannot carry out.
 Please note that completing this data analysis exercise will involve a heavy time investment, so
budget your time carefully and judiciously. The analyses are numerous, and you may consider working as one of two groups so
that you discuss together and share the actual computer work, which will be extensive.
 The data set is basically a cohort study with a nested case-control study. It is also analyzable
as a cross-sectional study using the status at the point in time that the rectangular data file shows. All the analytic procedures
will have to be repeated three times, once for each of the 3 study designs: cross-sectional, case-control, and follow-up. Cross-sectional
analysis will use the data as shown in the rectangular file. For case-control analysis you will have to randomly
select 20 cases from the cases of throat cancer and 20 controls from the noncancer patients. For cohort analysis you will use
the follow-up times provided.
 You are expected to make relevant comments on the results of your computations.
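
The random selection of 20 cases and 20 controls described above could be done in many ways. The following is only a minimal Python sketch of one of them. The record layout, the variable name `throat_cancer`, the seed, and the 100-subject cohort are all assumptions made up for illustration and are not taken from the attached data set.

```python
import random

def draw_case_control_sample(records, n_cases=20, n_controls=20, seed=2006):
    """Randomly select cases and controls without replacement.

    `records` is a list of dicts with a hypothetical key 'throat_cancer'
    (1 = case, 0 = noncancer patient); the real variable name may differ.
    """
    rng = random.Random(seed)  # fixed seed so the same draw can be reproduced
    cases = [r for r in records if r["throat_cancer"] == 1]
    non_cases = [r for r in records if r["throat_cancer"] == 0]
    if len(cases) < n_cases or len(non_cases) < n_controls:
        raise ValueError("not enough cases or controls to sample from")
    return rng.sample(cases, n_cases), rng.sample(non_cases, n_controls)

# Made-up cohort of 100 subjects, 30 of them cases (illustration only).
cohort = [{"id": i, "throat_cancer": 1 if i < 30 else 0} for i in range(100)]
cases, controls = draw_case_control_sample(cohort)
print(len(cases), len(controls))
```

Recording the seed lets the members of a group reproduce exactly the same sample when they share the computer work.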
DATA MANAGEMENT
 Undertake data validation and data editing, and solve any data problems you identify, for example
by handling missing data and outliers, if any. Problems in the data should not be a bar to further analysis since this is an
exercise.
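
One possible way to screen a numeric variable for missing values and outliers is sketched below in Python. It assumes that missing entries are coded as `None` and uses Tukey's rule (values more than 1.5 interquartile ranges beyond the quartiles) as the outlier criterion. Both are illustrative choices, and the weights shown are made up rather than taken from the data set.

```python
import statistics

def screen_variable(values, k=1.5):
    """Return (indices of missing entries, indices of Tukey outliers)."""
    missing = [i for i, v in enumerate(values) if v is None]
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)  # quartiles of observed values
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    outliers = [i for i, v in enumerate(values)
                if v is not None and not (low <= v <= high)]
    return missing, outliers

# Made-up weights in kg: one missing entry and one implausible value.
weights_kg = [62, 70, 58, None, 66, 640, 73, 61, 68, 65]
missing, outliers = screen_variable(weights_kg)
print("missing at positions:", missing)
print("possible outliers at positions:", outliers)
```

A flagged value is only a candidate for editing. Whether to correct it, drop it, or keep it remains your decision and should be stated in your comments.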
DESCRIPTIVE STATISTICS
 Assess the normality of relevant variables in the data set and normalize the non-normal ones.
 Find out how you would check equality of variances of cancer and smoking prevalence as a condition
for using large-sample tests.
 Produce all relevant summary statistics for all variables in the data set giving both point
estimates and measures of variation/dispersion
 Draw and interpret a scattergram of weight against height
 Compute a linear correlation matrix for relevant variables and compute other types of correlation
coefficients between each pair of relevant variables. Test for the significance of the linear correlation coefficients.
 Construct a multiple linear regression model relating weight to height and adjusting for relevant
confounders. Interpret indicators of goodness of fit from the printout.
 Using the t test statistic, determine whether throat cancer risk is associated with weight
 Repeat the analysis above using a corresponding nonparametric test and assume for purposes
of this exercise that the data was not normally distributed.
 Compute the incidence rate of throat cancer and give a 95% confidence interval
 Compute the prevalence of throat cancer and give a 95% confidence interval
 Compute and draw a survival curve for throat cancer patients using the life-table method
 Compute and draw a survival curve for throat cancer patients using the Kaplan-Meier method
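
If you need to check a Kaplan-Meier curve by hand, the product-limit calculation can be sketched in a few lines of Python. This is a minimal illustration only. The follow-up times and event indicators are made up, and the coding 1 = event observed, 0 = censored is an assumption that may not match the coding in the data file.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(t)) at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)  # still under follow-up at t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up follow-up times for 8 patients (illustration only).
times = [2, 3, 3, 5, 7, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}")
```

Subjects censored at a given time are counted as still at risk at that time, which is the usual convention. Plotting the pairs as a step function gives the survival curve.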
ANALYTIC STATISTICS: UNSTRATIFIED ANALYSIS
 Compute the chi-square for association between throat cancer and smoking.
 Use Fisher’s exact test to test for association between throat cancer and smoking.
 Compute the rate ratio of throat cancer in smokers vs nonsmokers and give a 95% confidence interval
 Compute the rate difference of throat cancer between smokers and nonsmokers and give a 95% confidence interval
 Compute the prevalence difference of throat cancer between smokers and nonsmokers and give a 95% confidence interval
 Compute the prevalence odds ratio of throat cancer in smokers vs nonsmokers and give a 95% confidence interval
 Using the odds ratio from above compute all the various attributable measures that you know
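
Several of the unstratified measures above can be checked by hand from a single 2x2 table. The Python sketch below shows one set of textbook choices: the Pearson chi-square without continuity correction, a Wald interval for the prevalence difference, and a log-based (Woolf) interval for the odds ratio. Other formulas exist and may give slightly different limits. The cell counts are made up for illustration.

```python
import math

def two_by_two_summary(a, b, c, d, z=1.96):
    """a, b = cases/non-cases among smokers; c, d = same among nonsmokers.

    Assumes no cell is zero (the odds ratio formulas would otherwise fail).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    p1, p0 = a / (a + b), c / (c + d)         # prevalence in each group
    diff = p1 - p0
    se_diff = math.sqrt(p1 * (1 - p1) / (a + b) + p0 * (1 - p0) / (c + d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "prevalence_difference": diff,
        "diff_ci": (diff - z * se_diff, diff + z * se_diff),
        "odds_ratio": odds_ratio,
        "or_ci": (odds_ratio * math.exp(-z * se_log_or),
                  odds_ratio * math.exp(z * se_log_or)),
    }

# Made-up counts: 30 of 100 smokers and 10 of 100 nonsmokers have throat cancer.
result = two_by_two_summary(30, 70, 10, 90)
for name, value in result.items():
    print(name, value)
```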
ANALYTIC STATISTICS: STRATIFYING BY RELEVANT POTENTIAL CONFOUNDERS
 Carry out tests for homogeneity of chi-squares/odds ratios of throat cancer in smokers vs nonsmokers by different levels
of (a) potential confounding variable(s)
 Compute the Mantel-Haenszel (MH) chi-square of association between throat cancer and smoking stratifying by (a) relevant confounder(s)
 Compute the MH odds ratio with 95% confidence intervals for throat cancer in smokers vs nonsmokers stratifying for
(a) relevant confounder(s)
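
The Mantel-Haenszel summary odds ratio and chi-square can also be verified by hand from the stratum-specific 2x2 tables. The following Python sketch is one way to do so. It uses the chi-square without continuity correction and does not compute the confidence interval, and the two strata shown are made-up counts rather than values from the data set.

```python
def mantel_haenszel(strata):
    """strata: list of (a, b, c, d) tables, one per level of the confounder.

    a, b = cases/non-cases among smokers; c, d = same among nonsmokers.
    Returns the MH summary odds ratio and the MH chi-square (1 df,
    no continuity correction).
    """
    num = den = sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_expected += (a + b) * (a + c) / n  # expected exposed cases under H0
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return num / den, (sum_a - sum_expected) ** 2 / sum_var

# Made-up tables for two levels of a hypothetical confounder.
strata = [(10, 20, 5, 25), (20, 30, 10, 40)]
or_mh, chi2_mh = mantel_haenszel(strata)
print(f"MH odds ratio = {or_mh:.3f}, MH chi-square = {chi2_mh:.3f}")
```

Comparing the summary odds ratio with the crude odds ratio from the pooled table is a simple way to judge whether the stratifying variable is confounding the association.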
ANALYTIC STATISTICS: REGRESSION
 Construct a logistic regression model relating throat cancer to smoking, identifying and adjusting for (a) potential
confounder(s). Try all 3 methods of model fitting (step-up, step-down, and stepwise) and use a 0.05 cutoff point. Derive
the odds ratio, test for its significance, and derive its 95% confidence interval. Interpret the indicators of model fit
from your printouts.
 Explore interaction/effect modification by using interaction (multiplication) variables. If you find a significant
interaction term, determine whether it changes the odds ratio and show how this is done.
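
To show how an interaction term changes the odds ratio, recall that in a model of the form logit(p) = b0 + b1*smoking + b2*z + b3*(smoking*z), the odds ratio for smoking at level z of the modifier is exp(b1 + b3*z). The Python sketch below illustrates this arithmetic. The coefficients and the binary modifier z are hypothetical values chosen for illustration, not output from any fitted model.

```python
import math

def smoking_odds_ratio(b_smoking, b_interaction, z):
    """Odds ratio for smoking at level z when the model contains smoking*z."""
    return math.exp(b_smoking + b_interaction * z)

# Hypothetical coefficients, as they might be read off a logistic regression printout.
b_smoking, b_interaction = 0.9, 0.5
or_by_level = {z: smoking_odds_ratio(b_smoking, b_interaction, z) for z in (0, 1)}
for z, odds_ratio in or_by_level.items():
    print(f"z = {z}: odds ratio for smoking = {odds_ratio:.2f}")
```

With no interaction (b3 = 0) the two odds ratios coincide. A nonzero b3 means that a single odds ratio for smoking no longer summarizes the data, and level-specific odds ratios should be reported instead.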
ANALYTIC STATISTICS: SURVIVAL
 Use the life-table method to construct separate survival curves for drug A and drug B. Use (a) suitable test(s) of significance
 Use the Kaplan-Meier method to construct separate survival curves for drug A and drug B. Use (a) suitable test(s) of
significance
 Use Cox’s model to explore the effects of treatment on survival and the effect(s) of prognostic variable(s).
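
One commonly used test of significance for comparing two survival curves is the log-rank test. It is offered here as one possible choice, not as the required one. The Python sketch below computes the two-group log-rank chi-square on 1 degree of freedom. The follow-up times and the coding 1 = event, 0 = censored are assumptions made up for illustration.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-square (1 df) and its p-value."""
    pooled = ([(t, e, 0) for t, e in zip(times_a, events_a)]
              + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e == 1}):
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)    # events at time t
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df

# Made-up follow-up data for the two drugs (illustration only).
drug_a_times, drug_a_events = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
drug_b_times, drug_b_events = [1, 2, 3, 5, 8, 11], [1, 1, 1, 1, 1, 1]
chi2, p_value = logrank_test(drug_a_times, drug_a_events, drug_b_times, drug_b_events)
print(f"log-rank chi-square = {chi2:.3f}, p = {p_value:.4f}")
```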




© Professor Omar Hasan Kasule, Sr. December, 2006


