SAMPLE SIZE DETERMINATION
The size of the sample depends on the hypothesis, the budget, the study duration, and the precision
required. If the sample is too small, the study will lack sufficient power to answer the study question; a sample bigger than
necessary is a waste of resources. Power is the ability to detect a difference and is determined by the significance level, the magnitude
of the difference, and the sample size. Power = 1 − β = Pr(rejecting
H0 when H0 is false), i.e. the probability of correctly rejecting a false null hypothesis. The bigger the sample size, the more powerful the study. Beyond
an optimal sample size, the marginal gain in power does not justify the cost of a larger sample. There are procedures, formulas, and computer
programs for determining sample sizes for different study designs.
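As one such formula, the standard normal-approximation sample size for comparing two means, n per group = 2(z₁₋α/₂ + z₁₋β)²σ²/δ², can be sketched in Python (the values of σ and δ below are hypothetical illustrations):

```python
import math

def n_per_group(sigma, delta, z_alpha=1.96, z_beta=0.8416):
    """Sample size per group for comparing two means.

    Uses the normal-approximation formula
        n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2,
    with defaults for a two-sided 5% significance level (z = 1.96)
    and 80% power (z = 0.8416).
    """
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up: fractional subjects are not possible

# Hypothetical example: detect a difference of half a standard deviation
print(n_per_group(sigma=1.0, delta=0.5))  # 63 per group
```

Note how the required n grows with the square of σ/δ: halving the detectable difference quadruples the sample size.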
SOURCES OF SECONDARY DATA
Secondary data comes from decennial censuses, vital statistics, routinely collected data, epidemiological studies, and
special health surveys. Census data is reliable and wide in scope, covering demographic, social, economic, and health information.
The census describes population composition by sex, race/ethnicity, residence, marital status, and socio-economic indicators. Vital
events are births, deaths, marriages and divorces, and some disease conditions. Routinely collected data are cheap but may be unavailable or incomplete. They are obtained from medical facilities, life and health insurance companies, institutions such as prisons, disease registries, and administrative
records. Observational epidemiological studies are of three types: cross-sectional, case-control, and follow-up/cohort
studies. Special surveys cover a larger population than epidemiological studies and may be health, nutritional, or socio-demographic.
PRIMARY DATA COLLECTION BY QUESTIONNAIRE
Questionnaire design involves content, wording
of questions, format, and layout. The reliability and validity of the questionnaire, as well as practical logistics, should be tested during the pilot study.
Informed consent and confidentiality must be respected. A protocol sets out the data collection procedures. Questionnaire administration
by face-to-face interview is the best method but is expensive; administration
by telephone is cheaper; administration by mail is very cheap but has a lower response rate. Computer-administered
questionnaires are associated with more honest responses.
PHYSICAL PRIMARY DATA COLLECTION
Data can be obtained by clinical examination, standardized
psychological/psychiatric evaluation, measurement of environmental or occupational exposure, and assay of biological specimens
(endobiotic or xenobiotic) and laboratory experiments. Pharmacological experiments involve bioassay, quantal dose-effect curves,
dose-response curves, and studies of drug elimination. Physiology experiments involve measurements of parameters of the various
body systems. Microbiology experiments involve bacterial counts, immunoassays, and serological assays. Biochemical experiments
involve measurements of concentrations of various substances. Statistical and graphical techniques are used to display and
summarize this data.
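As an example from the drug-elimination studies mentioned above, first-order elimination follows C(t) = C₀·e^(−kt), with half-life t½ = ln 2 / k. A minimal sketch (the dose and rate constant below are hypothetical values):

```python
import math

def concentration(c0, k, t):
    """Plasma concentration under first-order elimination: C(t) = C0 * exp(-k*t)."""
    return c0 * math.exp(-k * t)

def half_life(k):
    """Elimination half-life: t_1/2 = ln(2) / k."""
    return math.log(2) / k

# Hypothetical: initial concentration 100 units, k = 0.1 per hour
t_half = half_life(0.1)                 # about 6.93 hours
c = concentration(100.0, 0.1, t_half)   # about 50.0, i.e. half the initial value
```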
DATA MANAGEMENT AND DATA ANALYSIS
Self-coding or pre-coded questionnaires are preferable. Data is input as text, multiple choice, numeric, date and time,
and yes/no responses. In double-entry techniques, two data entry clerks enter the same data and a check is made by computer
on items on which they differ. Data in the computer can be checked manually against the original questionnaire. Interactive
data entry enables detection and correction of logical and entry errors immediately. Data replication is a copy management
service that involves copying the data and also managing the copies. Synchronous data replication is instantaneous updating
with no latency in data consistency. In asynchronous data replication the updating is not immediate and consistency is loose.
Data editing is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical,
statistical, range, and consistency checks; editing identifies and corrects errors such as invalid or inconsistent values, and the data is validated and its consistency tested. All values are kept at the same level of precision (number of decimal places) to make
computations consistent and decrease rounding errors. The kappa statistic is used to measure inter-rater agreement. The main data problems are missing data, coding and entry errors, inconsistencies, irregular patterns, digit preference, outliers,
rounding-off / significant figures, questions with multiple valid responses, and record duplication. Data transformation is
the process of creating new derived variables preliminary to analysis and includes mathematical operations such as division,
multiplication, addition, or subtraction, and mathematical transformations such as logarithmic, trigonometric, power, and z-transformations.
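Two of the quantities above, the kappa statistic and the z-transformation, can be sketched in a few lines (the ratings and values below are hypothetical):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Inter-rater agreement: kappa = (p_o - p_e) / (1 - p_e),
    where p_o is the observed agreement and p_e the agreement expected by chance."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def z_transform(xs):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

# Hypothetical binary ratings from two raters on the same six subjects
print(round(cohens_kappa([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]), 3))  # 0.333
```

A kappa of 1 indicates perfect agreement, 0 indicates agreement no better than chance.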
Data analysis consists of data summarization,
estimation and interpretation. Simple manual inspection of the data is needed before
statistical procedures. Preliminary examination consists
of looking at tables and graphics. Descriptive statistics are used to detect errors, ascertain the normality of the data,
and check the size of cells. Missing values may be imputed, or incomplete observations may be eliminated. Tests for association,
effect, or trend involve construction and testing of hypotheses. The tests for association are the t, chi-square, linear correlation,
and logistic regression tests or coefficients. The common effect measures are the odds ratio, risk ratio, and rate difference. Measures
of trend can discover relationships that are not picked up by association and effect measures. The probability, likelihood,
and regression models are used in analysis. Analytic procedures and computer programs vary for continuous and discrete data,
for person-time and count data, for simple and stratified analysis, for univariate, bivariate and multivariate analysis, and
for polytomous outcome variables. Procedures are different for large samples and small samples.
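The common effect measures and the Pearson chi-square statistic for a simple 2x2 table can be sketched as follows (the cell counts below are hypothetical):

```python
def two_by_two(a, b, c, d):
    """Analyse a 2x2 table:
        a = exposed cases,    b = exposed non-cases,
        c = unexposed cases,  d = unexposed non-cases.
    Returns the odds ratio, risk ratio, and Pearson chi-square statistic."""
    odds_ratio = (a * d) / (b * c)
    risk_ratio = (a / (a + b)) / (c / (c + d))
    n = a + b + c + d
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    return odds_ratio, risk_ratio, chi_square

# Hypothetical cohort: 20/100 exposed and 10/100 unexposed develop disease
or_, rr, chi2 = two_by_two(20, 80, 10, 90)
# odds ratio = 2.25, risk ratio = 2.0, chi-square ≈ 3.92
```

With one degree of freedom, a chi-square of about 3.92 lies just above the conventional 5% critical value of 3.84.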