Synopsis of a lecture given on 27th October 2006 to MPH (Epidemiology) students at the Department of Social and Preventive Medicine, Universiti Malaya by Professor Omar Hasan Kasule, Sr. MB ChB (MUK), MPH (Harvard), DrPH (Harvard)


The size of the sample depends on the hypothesis, the budget, the study duration, and the precision required. If the sample is too small, the study will lack sufficient power to answer the study question. A sample bigger than necessary is a waste of resources. Power is the ability to detect a difference and is determined by the significance level, the magnitude of the difference, and the sample size. Power = 1 – β = Pr (rejecting H0 when H0 is false) = Pr (true positive), where β is the probability of a type II error. The bigger the sample size, the more powerful the study. Beyond an optimal sample size, the increase in power does not justify the cost of a larger sample. There are procedures, formulas, and computer programs for determining sample sizes for different study designs.
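
The relationship between significance level, magnitude of the difference, sample size, and power can be illustrated with a short calculation. The sketch below is not taken from the lecture; it assumes a two-sided comparison of two group means with a common known standard deviation and uses the usual normal approximation. The function names and the example figures (a difference of 5 units, a standard deviation of 10) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect a mean difference
    `delta` when both groups share standard deviation `sigma`
    (two-sided test, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power (1 - beta)
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return ceil(n)


def power_two_means(n, delta, sigma, alpha=0.05):
    """Approximate power of the same test with `n` subjects per group."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    standard_error = sigma * sqrt(2 / n)
    return z.cdf(delta / standard_error - z_alpha)


# Hypothetical example: detect a 5-unit difference when sigma is 10.
n_per_group = sample_size_two_means(delta=5, sigma=10)
print(n_per_group, round(power_two_means(n_per_group, 5, 10), 3))
```

Calling power_two_means with increasing n shows the gain in power flattening out, which is the point at which a larger sample stops justifying its cost.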



Secondary data is from decennial censuses, vital statistics, routinely collected data, epidemiological studies, and special health surveys. Census data is reliable. It is wide in scope, covering demographic, social, economic, and health information. The census describes population composition by sex, race/ethnicity, residence, marriage, and socio-economic indicators. Vital events are births, deaths, marriages, divorces, and some disease conditions. Routinely collected data are cheap but may be unavailable or incomplete. They are obtained from medical facilities, life and health insurance companies, institutions (like prisons, the army, and schools), disease registries, and administrative records. Observational epidemiological studies are of 3 types: cross-sectional, case-control, and follow-up/cohort studies. Special surveys cover a larger population than epidemiological studies and may be health, nutritional, or socio-demographic surveys.



Questionnaire design involves content, wording of questions, format, and layout. The reliability and validity of the questionnaire, as well as practical logistics, should be tested during the pilot study. Informed consent and confidentiality must be respected. A protocol sets out data collection procedures. Questionnaire administration by face-to-face interview is the best but is expensive. Questionnaire administration by telephone is cheaper. Questionnaire administration by mail is very cheap but has a lower response rate. A computer-administered questionnaire is associated with more honest responses.



Data can be obtained by clinical examination, standardized psychological/psychiatric evaluation, measurement of environmental or occupational exposure, assay of biological specimens (endobiotic or xenobiotic), and laboratory experiments. Pharmacological experiments involve bioassay, quantal dose-effect curves, dose-response curves, and studies of drug elimination. Physiology experiments involve measurements of parameters of the various body systems. Microbiology experiments involve bacterial counts, immunoassays, and serological assays. Biochemical experiments involve measurements of concentrations of various substances. Statistical and graphical techniques are used to display and summarize this data.



Self-coding or pre-coded questionnaires are preferable. Data is input as text, multiple choices, numeric, date and time, and yes/no responses. In double entry techniques, 2 data entry clerks enter the same data and a check is made by computer on items on which they differ. Data in the computer can be checked manually against the original questionnaire. Interactive data entry enables detection and correction of logical and entry errors immediately. Data replication is a copy management service that involves copying the data and also managing the copies. Synchronous data replication is instantaneous updating with no latency in data consistency. In asynchronous data replication the updating is not immediate and consistency is loose.
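
The computer check used in double entry can be sketched as a field-by-field comparison of the two clerks' files. This is a minimal illustration, not a description of any particular data entry package; the record layout (a dictionary keyed by record number) and the function name are assumptions made for the example.

```python
def compare_double_entry(entry_a, entry_b):
    """Compare two independently keyed copies of the same data.

    Each argument maps a record number to a dict of field values.
    Returns a list of (record_id, field, value_a, value_b) tuples,
    one for every item on which the two clerks differ.
    """
    discrepancies = []
    for record_id in sorted(set(entry_a) | set(entry_b)):
        rec_a = entry_a.get(record_id)
        rec_b = entry_b.get(record_id)
        if rec_a is None or rec_b is None:
            # The whole record was keyed by only one of the clerks.
            discrepancies.append((record_id, "<record>", rec_a, rec_b))
            continue
        for field in sorted(set(rec_a) | set(rec_b)):
            if rec_a.get(field) != rec_b.get(field):
                discrepancies.append(
                    (record_id, field, rec_a.get(field), rec_b.get(field))
                )
    return discrepancies


# Hypothetical data: the second clerk transposed the digits of one age.
clerk_1 = {1: {"age": 34, "sex": "F"}, 2: {"age": 51, "sex": "M"}}
clerk_2 = {1: {"age": 34, "sex": "F"}, 2: {"age": 15, "sex": "M"}}
for item in compare_double_entry(clerk_1, clerk_2):
    print(item)
```

Each flagged item is then resolved by going back to the original questionnaire, since the comparison shows only that the entries differ, not which one is correct.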


Data editing is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical, statistical, range, and consistency checks. All values should be recorded at the same level of precision (number of decimal places) to make computations consistent and decrease rounding-off errors. The kappa statistic is used to measure inter-rater agreement. Data editing identifies and corrects errors such as invalid or inconsistent values. Data is validated and its consistency is tested. The main data problems are missing data, coding and entry errors, inconsistencies, irregular patterns, digit preference, outliers, rounding-off / significant figures, questions with multiple valid responses, and record duplication. Data transformation is the process of creating new derived variables preliminary to analysis. It includes mathematical operations such as division, multiplication, addition, or subtraction, and mathematical transformations such as logarithmic, trigonometric, power, and z-transformations.
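
The kappa statistic mentioned above compares the observed agreement between two raters with the agreement expected by chance. The sketch below computes Cohen's kappa for two raters, which is one common form of the statistic; the lecture does not specify which variant is meant, and the function name and example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who classified the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    n = len(ratings_a)
    # Observed proportion of subjects on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal totals.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 50 subjects: both 'yes' on 20, both 'no' on 15,
# and the raters disagree on the remaining 15.
rater_a = ["yes"] * 20 + ["yes"] * 5 + ["no"] * 10 + ["no"] * 15
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
print(round(cohen_kappa(rater_a, rater_b), 2))
```

In this example the raters agree on 70% of subjects, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement.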


Data analysis consists of data summarization, estimation, and interpretation. Simple manual inspection of the data is needed before statistical procedures. Preliminary examination consists of looking at tables and graphics. Descriptive statistics are used to detect errors, ascertain the normality of the data, and know the size of cells. Missing values may be imputed or incomplete observations may be eliminated. Tests for association, effect, or trend involve construction and testing of hypotheses. The tests for association are the t, chi-square, linear correlation, and logistic regression tests or coefficients. The common effect measures are the Odds Ratio, Risk Ratio, and Rate Difference. Measures of trend can discover relationships that are not picked up by association and effect measures. Probability, likelihood, and regression models are used in analysis. Analytic procedures and computer programs vary for continuous and discrete data, for person-time and count data, for simple and stratified analysis, for univariate, bivariate, and multivariate analysis, and for polychotomous outcome variables. Procedures are different for large samples and small samples.
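
The effect measures named above can be computed directly from a 2x2 table of exposure by outcome. The sketch below is illustrative only: it assumes count data from a simple unstratified table, so it reports a risk difference in place of a rate difference, which would require person-time denominators. The function name, cell labels, and example counts are hypothetical.

```python
def effect_measures(a, b, c, d):
    """Effect measures from a 2x2 table of counts.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts cannot be negative")
    if (a + b) == 0 or (c + d) == 0 or c == 0 or b == 0:
        raise ValueError("a zero cell or row total leaves a measure undefined")
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        "risk_difference": risk_exposed - risk_unexposed,
    }


# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop the outcome.
for name, value in effect_measures(30, 70, 10, 90).items():
    print(name, round(value, 2))
```

The odds ratio exceeds the risk ratio here because the outcome is common; the two measures approximate each other only when the outcome is rare.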

Professor Omar Hasan Kasule Sr. October 2006