5data processing
TRANSCRIPT
-
7/31/2019 5Data Processing
1/145
Wakgari Deressa, PhD
School of Public Health, Addis Ababa University
-
Objectives
The participants should be able to:
Understand the process involved in data processing
Use computers to perform data
Interpret summary statistics and graphical displays
Understand estimation and hypothesis testing
Understand statements in published articles
-
Introduction
Describes statistical methods commonly used in health research.
Data processing, management
Data analysis
Interpretation
Use of computers and statistical software packages
Epi Info 2002: one of the software packages most commonly used by health researchers
-
Preparation of Data for Statistical Analysis
Data collected should be entered into a computer for analysis
Includes tasks such as:
1. Checking and manual editing
2. Coding
3. Creating format or views for data entry
4. Entering data into a computer
5. Cleaning
6. Transformation
-
1. Checking and manual editing
Is a function of the quality of data
Involves checking of data to detect incompleteness, inconsistencies, and other obvious errors in the questionnaire.
The majority of the errors in the data should be detected and corrected in the field before the data are sent away.
Interviewers and supervisors play a key role in correcting any error close to its source
-
Types of checks
1. Range checks: e.g., Male (=1) or Female (=2), age (1-99 yr)
2. Typographic check: 41 rather than 14
3. Consistency check: date of birth and age; e.g., a mother aged 20 with a child 15 years old
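The range and consistency checks above can be sketched in Python. This is only an illustration: the record layout (`sex`, `age`, `birth_year`, `interview_year`) and the helper name `check_record` are hypothetical and do not come from the slides.

```python
# Hypothetical record layout; field names are assumptions for illustration.
VALID_SEX_CODES = {1, 2}        # 1 = Male, 2 = Female
AGE_RANGE = range(1, 100)       # plausible ages: 1-99 years


def check_record(record):
    """Return a list of problems found in one questionnaire record."""
    problems = []
    # Range checks: the value must be one of the allowed codes.
    if record["sex"] not in VALID_SEX_CODES:
        problems.append("sex out of range")
    if record["age"] not in AGE_RANGE:
        problems.append("age out of range")
    # Consistency check: reported age should agree with the date of birth
    # (one year of slack for birthdays not yet reached).
    implied_age = record["interview_year"] - record["birth_year"]
    if abs(implied_age - record["age"]) > 1:
        problems.append("age inconsistent with date of birth")
    return problems


good = {"sex": 2, "age": 14, "birth_year": 2005, "interview_year": 2019}
typo = {"sex": 2, "age": 41, "birth_year": 2005, "interview_year": 2019}
print(check_record(good))  # no problems expected
print(check_record(typo))  # the 41-for-14 typo passes the range check but fails consistency
```

A typographic error such as 41 for 14 is within the valid range, so only the consistency check catches it.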
-
2. Coding
Coding is assigning a separate, non-overlapping numerical code to each separate response recorded in words and to missing values
Computers do not understand or accept alphabetical texts or codes or verbatim responses
Example: 1=Male, 2=Female
Codes should be both exhaustive and mutually exclusive
-
Closed-ended questions are usually pre-assigned a numerical code (pre-coding = before data collection)
What is your current marital status?
1. Never married
.
3. Widowed/separated
Post-coding = after data collection
Mainly used for open-ended questions
The code is assigned after reviewing a representative sample of responses from respondents
-
Codebook = a document that lists the codes (or keys) assigned to the values of each variable
Guides researchers to find the right code for each answer category
Some responses, such as quantitative variables, can be entered directly into a computer without coding
Example: Age, weight, body temperature, number of pregnancies and ANC visits, etc.
-
Question/Statement            Variable name   Codes
1. Your sex                   SEX             1=Male
                                              2=Female
2. What is your occupation?   OCCUP           1=Farmer
                                              2=Housewife
                                              3=Trader
                                              4=Student
                                              5=Other
Variable name = usually contains fewer than 8 characters
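A codebook like the one above maps naturally onto a lookup table. The sketch below is one possible representation: the variable names and codes are those of the slide, while the `CODEBOOK` structure and the `encode` helper are hypothetical.

```python
# A codebook as a mapping: variable name -> {code: label}.
CODEBOOK = {
    "SEX": {1: "Male", 2: "Female"},
    "OCCUP": {1: "Farmer", 2: "Housewife", 3: "Trader", 4: "Student", 5: "Other"},
}


def encode(variable, answer):
    """Translate a verbatim answer into its numeric code."""
    for code, label in CODEBOOK[variable].items():
        if label.lower() == answer.strip().lower():
            return code
    raise ValueError(f"no code for {answer!r} in {variable}")


print(encode("SEX", "Female"))    # 2
print(encode("OCCUP", "trader"))  # 3
```

Raising an error for an unlisted answer, rather than silently assigning a code, keeps the coding scheme exhaustive by forcing a decision on every new response.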
-
3. Creating data format in a computer
Designing a format or electronic questionnaire in a computer
Examples
1 Sex: # 1=male, 2=female
2 Age: ## in years
3 Occup: # 1=farmer, 2=housewife, 3=trader, 4=student, 5=other
4 Sick: # 1=yes, 2=no
5 RxSource _____________________
-
4. Data Entry
The transfer of data from a questionnaire to a computer file.
Before entry, data must be checked for errors
Data must be coded
Data are entered by a data entry clerk or a researcher
-
5. Data cleaning
Data entered must be checked for errors, impossible or implausible values, and inconsistencies
In most cases errors are inevitable
,avoidable
Errors can result from incorrect reading of codes, incorrect reporting, missed entry, incorrect coding, repeated entry, incorrect typing, and so forth.
-
Data cleaning involves two types of checks:
1. Checking for outliers
Sex: 1=male, 2=female
If 3 or another number is entered rather than 1 or 2, the error should be corrected by looking into the original source
2. Performing consistency checks
Checking whether data in one part of a record are compatible with data in another part
If sex is initially entered as 1=male, and 2 is entered for number of pregnancies, the data are not internally consistent
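A cleaning pass over entered data might look like the following sketch. The five records are invented for illustration; the checks cover an invalid code, the male-with-pregnancies inconsistency, and a repeated entry.

```python
from collections import Counter

# Hypothetical entered records: (record id, sex code, number of pregnancies).
records = [
    (1, 1, 0),
    (2, 2, 3),
    (3, 3, 0),   # invalid sex code
    (4, 1, 2),   # male with pregnancies: internally inconsistent
    (2, 2, 3),   # repeated entry of record 2
]

id_counts = Counter(rec_id for rec_id, _, _ in records)

flags = []
for rec_id, sex, pregnancies in records:
    if sex not in (1, 2):
        flags.append((rec_id, "invalid sex code"))
    elif sex == 1 and pregnancies > 0:
        flags.append((rec_id, "male with pregnancies"))

# Record ids that were entered more than once.
duplicates = sorted(rec_id for rec_id, n in id_counts.items() if n > 1)

print(flags)
print(duplicates)
```

Flagged records are listed rather than changed automatically, since the correction should come from the original questionnaire.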
-
6. Data transformation
Involves restructuring the data set and generating new variables, or recoding some of the existing data fields to define new variables
For example, transforming raw scores into standard scores
Facilitates data analysis and interpretation
Easily handled directly through instructions given to a computer
-
Example: Infants' original BWT (entered in the computer files in grams) can be categorized into a dichotomous variable
Recode BWEIGHT (LOWEST THRU 2500=1) (2501 THRU HIGHEST=2)
1=low birth weight, 2=normal weight
Age can be recoded from a continuous variable into a categorical one:
1=15-24, 2=25-34, 3=35-44, 4=45-54
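The two recodes above can be expressed as small functions. The cut-offs are the ones on the slide; the function names are hypothetical, and returning `None` for ages outside 15-54 is an assumption, since the slide does not say how such values are handled.

```python
def recode_bweight(grams):
    """Mirror the slide's rule: lowest thru 2500 -> 1, 2501 thru highest -> 2."""
    return 1 if grams <= 2500 else 2


def recode_age(years):
    """Return the age-group code, or None if the age falls outside 15-54."""
    groups = {1: (15, 24), 2: (25, 34), 3: (35, 44), 4: (45, 54)}
    for code, (low, high) in groups.items():
        if low <= years <= high:
            return code
    return None


print([recode_bweight(g) for g in (1800, 2500, 2501, 3400)])  # [1, 1, 2, 2]
print([recode_age(a) for a in (15, 30, 44, 54, 60)])          # [1, 2, 3, 4, None]
```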
-
EPI Info 2002
Epi Info 2002 is a series of programs for use by public health professionals in conducting outbreak investigations and managing databases for public health surveillance and epidemiological data
Epi Info 2002 software is in the public domain and freely available for use, copying, translation and distribution.
"Epi Info" is a trademark of the Centers for Disease Control and Prevention (CDC).
SPSS and STATA are very expensive
-
EPI Info 2002
With Epi Info and a personal computer, physicians, epidemiologists, and others can easily develop a questionnaire or form, customize the data entry process, and enter and analyze data (all in one package).
-
EPI Info 2002
Epi Info is a tool that public health professionals use to:
Create a questionnaire (format) (MakeView)
Enter data in a questionnaire (Enter Data)
Analyse the data (Analyse Data)
Additional features: shaping data entry, error checking, coding, selecting records, creating new variables, recoding data, and importing and exporting files from other systems.
-
Running EPI Info 2002
Turn on your computer
Double click Epi Info 2002
Select a program (MakeView, Enter Data, Analyze Data, etc.) from the menu.
This is the Epi Info for Windows main menu
-
Makeview Program
Is used to place prompts and data entry fields on one or many pages of a View.
Used to create questionnaires or views.
Regarded as both the form designer and the database design environment
-
Questionnaire / View
An electronic replica of paper forms or other information sources, created to enter and store data
Is created in the Epi Info MakeView program
-
Running Makeview
Type the file name you gave earlier (e.g. Mary) and click OK. You can also type another name.
Right click to create a field.
Right clicking on any space will provide a box for writing the name of the variable (e.g. name, age, sex, etc.)
-
Running Makeview
Below you will see a menu of field types for the variable
Select the appropriate field type (e.g. upper case, ______ for Occupation, numeric (# for age), label for title, etc.)
Click OK
-
Enter Program
Displays the data entry screen created in MakeView, ready to receive data
Data are entered and edited here
Controls the data entry process, using the settings specified in MakeView.
A search function is provided so that records can be located that match values specified for any combination of variables.
-
Entering data using the Enter Data program
Click Enter Data from the main menu
Across the top of the screen, click on File and select Open
You will be asked to "Select a table", with the file name you already gave as the default.
Click OK
-
The program will now make a data file from your questionnaire and display the variables on the screen, ready to receive data.
Fill in the blanks until you have completed the form.
The program will then be ready to receive the 2nd entry, and so on.
After finishing your data entry, click Exit
-
The Analyze Data Program
The analysis program produces lists, frequencies, graphs, tables, and more statistics
Select Analyze Data from the main menu
Command windows will appear
Click Read (import)
Choose your file (e.g., Mary.rec)
Click OK
-
Descriptive Statistics
Distribution or probability distribution refers to the way data are distributed, in order to draw conclusions about a set of data.
Continuous variables = the aim is to determine whether or not normality may be assumed
Categorical variables = we obtain the frequency distribution for each variable
-
Distribution of Categorical Variable
A study was conducted to assess the characteristics of a group of 234 smokers by collecting data on gender and other variables.
Gender, 1 = male, 2 = female
Gender       Frequency (n)   Relative Frequency
Male (1)     110             47.0%
Female (2)   124             53.0%
Total        234             100%
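The frequency table above can be reproduced from coded data. In this sketch the gender column is rebuilt from the slide's counts (110 males, 124 females); a real analysis would read it from the data file.

```python
from collections import Counter

# Rebuild the coded gender column from the slide's counts (1 = male, 2 = female).
gender = [1] * 110 + [2] * 124

counts = Counter(gender)
total = sum(counts.values())
for code, label in ((1, "Male"), (2, "Female")):
    n = counts[code]
    # Relative frequency = category count / total, shown as a percentage.
    print(f"{label:<8}{n:>5}{100 * n / total:>8.1f}%")
print(f"{'Total':<8}{total:>5}{100.0:>8.1f}%")
```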
-
Frequency distribution of BWT: Bar Chart
[Figure: bar chart; X-axis: BWT (Very low, Low, Normal, Big); Y-axis: Frequency]
-
Prevalence of active trachoma among children (1-9) by sex and area of residence
[Figure: bar chart of prevalence (%) of AT, TF and TI by gender and area of residence (Female, Male, Rural, Urban, Total)]
-
Pie Chart with relative frequencies of categories of BWT
-
Distribution of Continuous Variable
Examples: Age, height, weight, etc.
A continuous variable can take infinitely many values
The probability associated with any particular value is essentially zero
However, the variable will assume some value in the interval enclosed by two values: x1 and x2
The probability distribution is visualized as a curve, and probabilities are areas under the curve
-
The total area under a probability distribution is always 1.
The section marked A represents the probability of observing a value of 3 or greater, symbolically written as $\Pr(X \ge 3)$. If the area of A is, say, 0.2 units, then $\Pr(X \ge 3) = 0.2$
[Figure: probability curve over X from 0 to 5, with region B = $\Pr(X \le 1)$ and region A = $\Pr(X \ge 3)$]
-
The Normal Distribution
[Figure: normal curve labelled with the mean and the standard deviation]
-
Histograms
Histograms are frequency distributions with continuous class intervals that have been turned into graphs.
To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line.
Non-overlapping intervals that cover all of the data values must be used.
Bars are then drawn over the intervals
The area of each column is proportional to the number of observations in that interval
-
Frequency polygon
A frequency distribution can be portrayed graphically in yet another way by means of a frequency polygon.
To draw a frequency polygon we connect the mid-points of the tops of the cells of the histogram by straight lines.
-
The frequency polygon of birth weight of newborns by sex
-
Numerical Summary Measures
Measures of location
Measures of dispersion
-
Measures of Location
The most common measures:
Mean (Arithmetic Mean)
Median
Mode
-
Mean (Arithmetic mean)
The "average", which is obtained by adding all the values in a sample or population and dividing the sum by the number of values.
-
Example: 10 numbers:
19 21 20 20 34 22 24 27 27 27
Mean = (19 + 21 + ... + 27) / 10 = 24.1
-
Median
The median is the value which divides the data set into two equal parts.
If the number of values is odd, the median will be the middle value when all values are arranged in order of magnitude.
When the number of observations is even, there is no single middle value but two middle observations.
In this case the median is the mean of these two middle observations, when all observations have been arranged in order of magnitude.
-
In the above data set, arranging in increasing order:
19 20 20 21 22 24 27 27 27 34
Median = (22 + 24)/2 = 23
-
Mode
Any observation of a variable at which the distribution reaches a peak
Most distributions are unimodal
In the above example, the mode is 27.
It occurs three times (most frequent value)
It is possible to have more than one mode or no mode.
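The three measures of location can be checked against the ten-number example using Python's standard `statistics` module. `multimode` is used here rather than `mode` because it returns every value tied for most frequent, which matches the remark that a data set may have more than one mode.

```python
import statistics

data = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # mean of the two middle values when n is even
modes = statistics.multimode(data)  # every value tied for most frequent

print(mean, median, modes)
```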
-
Measures of Dispersion
Dispersion refers to the variety exhibited by the values of the data.
The amount may be small when the values are close together.
Two or more sets may have the same mean and/or median but they may be quite different.
-
These two distributions have the same mean, median, and mode, but they may be quite different
-
Range
The range is the difference between the largest and smallest values in the set of observations.
These values are often called the maximum and the minimum.
If 167 is the largest and 40 is the smallest, then the range is
167 - 40 = 127
-
a) The first quartile (Q1): 25% of all theranked observations are
-
Inter-quartile range (IQR)
The IQR encompasses the middle 50% of the observations
IQR = Q3 - Q1, where
Q3 = 3rd quartile and Q1 = 1st quartile.
-
Median = 2nd quartile (dividing into two halves)
1st quartile (Q1) = 1/4 (n + 1)th
2nd quartile (Q2) = 1/2 (n + 1)th
3rd quartile (Q3) = 3/4 (n + 1)th
-
Variance ($\sigma^2$, $S^2$)
A measure of the dispersion relative to the scatter of the values about their mean.
The mean of the squares of the deviations taken from the mean.
Population variance = $\sigma^2$
Sample variance = $S^2$
-
A sample variance is calculated for a sample of individual values ($X_1, X_2, \ldots, X_n$) and uses the sample mean ($\bar{X}$) rather than the population mean $\mu$.
-
Limitation of the Variance
The variance is a mean of squared values
It is not expressed in the same unit as the observations whose dispersion it represents
The variance of a distribution of weights is not expressed in kg, but in kg$^2$
Mean weight = 36.5 kg, $s^2$ = 257 kg$^2$
-
Standard deviation
Standard deviation = square root of the variance
$\sigma = \sqrt{\sigma^2}$
$s = \sqrt{s^2}$
m = 36.5 kg
$s^2$ = 257 kg$^2$
$s = \sqrt{257} \approx 16$ kg
The standard deviation is expressed in the same units as the measurement it represents
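A short numerical check of the variance and standard deviation follows. The slides' own variance formula did not survive extraction, so this sketch assumes the usual sample definition $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$, which is what `statistics.variance` computes. The data are the ten numbers from the mean example.

```python
import math
import statistics

data = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]  # the ten numbers from the mean example

sample_var = statistics.variance(data)   # divides by n - 1
sample_sd = statistics.stdev(data)       # square root of the sample variance
print(round(sample_var, 2), round(sample_sd, 2))

# The weight example: a variance of 257 kg^2 gives a standard deviation in kg.
sd_weight = math.sqrt(257)
print(round(sd_weight, 2))
```

Taking the square root returns the measure to the original unit, which is why the standard deviation is usually reported in preference to the variance.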
-
Box Plot
It is another way to display information about the distribution of a set of data.
Can be used to display a set of discrete
single vertical axis only certain summaries of the data are shown
First the quartiles of the data set must be defined
-
A box is drawn with the top of the box at the third quartile and the bottom at the first quartile.
The location of the mid-point of the distribution is indicated with a horizontal line.
Finally, straight lines are drawn from the centre of the top of the box to the largest observation and from the centre of the bottom of the box to the smallest observation.
-
Percentile = p(n+1), p = the required percentile
Arrange the numbers in ascending order
A. 1st quartile = 0.25(n+1)th
B. 2nd quartile = 0.5(n+1)th
C. 3rd quartile = 0.75(n+1)th
D. 20th percentile = 0.2(n+1)th
E. 15th percentile = 0.15(n+1)th
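The p(n+1) rule gives a rank that is often fractional. The sketch below assumes linear interpolation between the two neighbouring observations in that case, which the slides do not specify; the helper name `percentile` is hypothetical. It is applied to the ten numbers from the mean example.

```python
def percentile(values, p):
    """Value at rank p*(n+1) in the sorted data, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)               # 1-based rank, may be fractional
    position = min(max(position, 1), n)  # clamp to the available ranks
    lower = int(position)                # rank of the observation just below
    fraction = position - lower
    if lower == n:
        return float(ordered[-1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
q1, q2, q3 = (percentile(data, p) for p in (0.25, 0.5, 0.75))
print(q1, q2, q3, q3 - q1)    # quartiles and the inter-quartile range
print(max(data) - min(data))  # range
```

For these data the ranks are 2.75, 5.5 and 8.25, so the second quartile falls halfway between the 5th and 6th ordered values and equals the median.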
-
[Figure: box plots of birth weight (grams, 500 to 5000) by gestational age (Pre, Term, Post)]
-
Statistical Inference
Statistical inference is the process of using samples to make inferences about a population.
It takes two main forms: parameter estimation and hypothesis testing.
Often the population parameter of interest is either a mean or a proportion.
-
Parameter Estimations
Population parameter: an (unknown) characteristic of the underlying distribution of the variable of interest in a population
Sample statistic: an estimate of a population parameter obtained from a sample
-
Types of Estimates
Point estimation involves the calculation of a single number to estimate the population parameter
  Single values
Interval estimation specifies a range of reasonable values for the parameter
  Confidence interval
  Provides more information about a population characteristic than does a point estimate
-
Confidence Intervals
Used for estimating the true value of the population parameter
Provide a range of values for the anticipated true population parameter.
-
CI tells us how precise our estimate is likely to be
A narrow CI implies high precision, while a wide CI implies low precision.
A narrow CI reflects a large sample size or low variability or both.
-
95% CI commonly used
Sometimes 90% and 99%
The 95% CI is calculated in such a way that, under the conditions assumed for the underlying distribution, the interval will contain the true population parameter 95% of the time.
Loosely speaking, you might interpret a 95% CI as one which you are 95% confident contains the true parameter.
-
CIs can also answer the question of whether or not an association exists or a treatment is beneficial or harmful (analogous to p-values).
e.g., if the CI of an odds ratio includes the value 1.0, we cannot be confident that exposure is associated with disease.
-
C.I. for a population mean
a) Known variance (large sample size)
A $100(1-\alpha)\%$ C.I. for $\mu$ is $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
$\alpha$ is to be chosen by the researcher; the most common values of $\alpha$ are 0.05, 0.01, 0.001 and 0.1.
Margin of error (precision of the estimate): $z_{\alpha/2}\,\sigma/\sqrt{n}$
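As a numerical illustration, the standard known-variance interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ can be computed as below. The mean and standard deviation echo the earlier weight example (36.5 kg and 16 kg); the sample size n = 100 is an assumption made only for this sketch, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def ci_mean_known_sigma(xbar, sigma, n, confidence=0.95):
    """Return (lower, upper) for xbar +/- z * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)         # margin of error
    return xbar - margin, xbar + margin


# Mean and SD echo the weight example (36.5 kg, 16 kg); n = 100 is assumed.
low, high = ci_mean_known_sigma(36.5, 16, 100)
print(round(low, 2), round(high, 2))
```

Because the margin of error shrinks with the square root of n, quadrupling the sample size halves the width of the interval.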
-
b) Unknown variance (small sample size, n < 30)
What if the $\sigma$ for the underlying population is unknown and the sample size is small?
As an alternative we use Student's t distribution.
Based on degrees of freedom
-
Note: t approaches z as n increases
C.I. for a population proportion
-
Hence, $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ is an approximate 95% CI for the true proportion P.
-
Example:
A study on dental health practice. Of 300 adults interviewed, 123 said that they regularly had a dental check-up twice a year.
... proportion in the population? (0.36, 0.46).
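The dental check-up example can be worked through with the normal-approximation interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. The confidence level used on the original slide is not recoverable from the transcript, so the sketch prints both the 90% and the 95% interval. Under this formula the 95% interval is about (0.354, 0.466) and the 90% interval about (0.363, 0.457), so the quoted (0.36, 0.46) agrees with the 90% result to two decimals.

```python
import math
from statistics import NormalDist


def ci_proportion(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 123 of 300 adults reported regular check-ups, so p_hat = 0.41.
for level in (0.90, 0.95):
    low, high = ci_proportion(123, 300, level)
    print(level, round(low, 3), round(high, 3))
```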
-
Estimation for Two Populations
-
Hypothesis testing
The majority of statistical analyses involve comparison, most obviously between treatments or procedures or between groups.
Hypothesis: A statement about one or more populations
Is the true population mean BW equal to 3000 g?
-
The alternative hypothesis, HA, is a statement that disagrees with the null hypothesis.
The effect of interest is not zero; there is a difference
States the line of thinking of the researcher
It is the hypothesis that the researcher wants to prove
-
Examples of Research Hypotheses
Population Mean
  The average length of stay of patients admitted to the hospital is five days
  The mean BW of babies delivered by mothers with low SES is lower than that of babies from higher SES
  The economic burden of HIV/AIDS on the poor is higher than that on wealthier people
  Etc.
-
Population Proportion
  The proportion of adult smokers in Addis Ababa is p = 0.40
  The prevalence of HIV among non-married adults is higher than that in married adults
  A greater proportion of people who live in poverty may have a low health status
  Inappropriate prescription of drugs is more common in settings where microscopy is unavailable
  Etc.
-
HT using test statistics (e.g., mean):
H0: $\mu = \mu_0$      H0: $\mu \le \mu_0$    H0: $\mu \ge \mu_0$
H1: $\mu \ne \mu_0$    H1: $\mu > \mu_0$      H1: $\mu < \mu_0$
two-tailed             one-tailed             one-tailed
Decide on the appropriate test statistic for the hypothesis (Z, t, $\chi^2$, F, etc.)
Select the level of significance for the statistical test ($\alpha$ = 0.05, 0.01, 0.001, etc.)
-
Another way to state the conclusion:
Reject H0 if P-value < $\alpha$,
Accept (fail to reject) H0 if P-value $\ge \alpha$.
The smaller the P-value, the stronger the evidence against H0.
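The decision rule can be illustrated with a two-sided z-test of the earlier question, H0: mean BW = 3000 g. The sample figures (mean 2900 g, known SD 500 g, n = 100) are invented for this sketch, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_h0) for H0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    # Two-sided P-value: probability of a |Z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# H0: mean BW = 3000 g. The sample figures below are invented for illustration.
z, p, reject = two_sided_z_test(xbar=2900, mu0=3000, sigma=500, n=100)
print(round(z, 2), round(p, 4), reject)
```

Here the P-value falls just below 0.05, so H0 is rejected at the 5% level but would not be at the 1% level.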
-
P-value
The P-value is the probability that a difference at least as large as the one observed could have occurred simply by chance if H0 were true
The probability that we could be wrong if we reject the H0
Indicates the probability that the association between two variables might be due to chance
-
Types of Errors in Hypothesis Tests
Whenever we reject or accept the H0, we may commit an error.
Two types of errors can be committed:
Type I Error
Type II Error
-
Type I error ($\alpha$): the probability of rejecting H0 when it is true.
It is the probability of being wrong when H0 is true.
The typical value for $\alpha$ (significance level) is 5%
-
Type II error ($\beta$): the probability of not rejecting H0 when it is actually false.
Failure to accept HA when it is true
-
Power: the probability of rejecting H0 when it is false, or accepting HA when it is true.
Power = $1 - \beta$.
The typical value for power is 80%
-
Statistical Methods for Continuous Variables: Comparison of Groups
Is there a significant difference between two or more groups?
-
Student's t-Test (Unpaired data)
The Student's t-test, or simply t-test, is commonly used for comparison of the means of two groups.
Our null hypothesis is $H_0: \mu_1 = \mu_2$, where $\mu_1$ is the true mean of the first group and $\mu_2$ is the true mean of the second group.
Assumption:
The data are independent and normally distributed
-
Example:
Comparing mean BWT between males and females
Comparing mean blood pressure between diabetic and non-diabetic patients
Etc.
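A minimal sketch of the unpaired t statistic follows. It assumes equal variances in the two groups (the pooled-variance form), which the slides do not state explicitly, and the birth weights are invented for illustration. The standard library has no t distribution, so the statistic is compared with the tabulated two-sided 5% critical value for 8 degrees of freedom (2.306).

```python
import math
import statistics


def pooled_t(group1, group2):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group1) - statistics.mean(group2)) / se
    return t, n1 + n2 - 2


# Invented birth weights in grams, for illustration only.
males = [3200, 3400, 3100, 3500, 3300]
females = [3000, 3100, 2900, 3200, 3050]
t, df = pooled_t(males, females)
T_CRITICAL = 2.306  # two-sided 5% critical value of t with 8 degrees of freedom
print(round(t, 3), df, abs(t) > T_CRITICAL)
```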
-
Student's t-Test (Paired data)
Study subjects from one population can be matched, or paired, with particular subjects in the second population.
Comparing mean BWT between twins
Comparing mean blood pressure of diabetic patients before and after some treatment
Eyes or ears of the same individual
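For paired data the test works on the within-pair differences. The sketch below uses invented before-and-after blood pressures for five patients. The statistic is the mean difference divided by its standard error, and it would be compared with a t distribution on n - 1 = 4 degrees of freedom.

```python
import math
import statistics

# Invented systolic blood pressures (mmHg) for five patients.
before = [150, 160, 145, 155, 165]
after = [140, 150, 140, 150, 150]

# Pairing reduces the problem to a one-sample test on the differences.
differences = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(differences)
se_diff = statistics.stdev(differences) / math.sqrt(len(differences))
t_paired = mean_diff / se_diff
print(round(t_paired, 3))
```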
-
One-way analysis of variance (ANOVA)
Suitable for deciding whether differences exist between the means of more than two groups.
Is a generalization of the Student's t-test.
The H0 is $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$, where $\mu_i$ is the true mean of the ith group.
Allows us to test whether the mean of at least one of the groups differs significantly from some other group.
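The F statistic behind one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it for three small invented groups; the function name is hypothetical, and judging significance would additionally need the F distribution, which is not in the standard library.

```python
import statistics


def one_way_f(groups):
    """Return the F statistic: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Three invented groups; group means are 2, 3 and 6.
f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2))
```

A large F means the group means differ by more than the scatter within groups would explain.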
-
Correlation
Measures the strength of the relationship between two continuous variables from a single population
The relationship should be linear.
Procedure
Display the data in a scatter plot before carrying out any further analysis
One variable is plotted on the X-axis
The other on the Y-axis
Nation            % Immunized   Child mortality rate (per 1000 live births)
Bolivia           77            118
Brazil            69            65
Cambodia          32            184
Canada            85            8
China             94            43
Czech Republic    99            12
Egypt             89            55
Ethiopia          13            208
Finland           95            7
France            95            9
Greece            54            9
India             89            124
Italy             95            10
Japan             87            6
Mexico            91            33
Poland            98            16
Russia            73            32
Senegal           47            145
Turkey            76            87
UK                90            9
Percentage of children immunized against DPT and under-five mortality rate for 20 countries, 1992
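The strength of the linear relationship in these data can be summarised by Pearson's correlation coefficient. The slides do not name the coefficient, so its use here is an assumption. The pairs below are the immunization and mortality figures for the 20 countries listed above.

```python
import math

# (% immunized against DPT, under-five mortality per 1000 live births), 1992.
data = {
    "Bolivia": (77, 118), "Brazil": (69, 65), "Cambodia": (32, 184),
    "Canada": (85, 8), "China": (94, 43), "Czech Republic": (99, 12),
    "Egypt": (89, 55), "Ethiopia": (13, 208), "Finland": (95, 7),
    "France": (95, 9), "Greece": (54, 9), "India": (89, 124),
    "Italy": (95, 10), "Japan": (87, 6), "Mexico": (91, 33),
    "Poland": (98, 16), "Russia": (73, 32), "Senegal": (47, 145),
    "Turkey": (76, 87), "UK": (90, 9),
}


def pearson_r(pairs):
    """Pearson correlation: covariation of x and y over the product of their spreads."""
    xs, ys = zip(*pairs)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(list(data.values()))
print(round(r, 2))
```

The coefficient is negative: countries with higher immunization coverage tend to have lower under-five mortality. This describes an association in the data and does not by itself show cause.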
-
[Figure: scatter plot of mortality rate (Y-axis, 0 to 250) against percentage immunized (X-axis, 0 to 125)]