5data processing

7/31/2019 5Data Processing

1/145

Wakgari Deressa, PhD

School of Public Health, Addis Ababa University


2/145

Objectives
The participants should be able to:
Understand the process involved in data processing
Use computers to perform data analysis
Interpret summary statistics and graphical displays
Understand estimation and hypothesis testing
Understand statements in published articles


3/145

Introduction
Describes statistical methods commonly used in health research:
Data processing and management
Data analysis
Interpretation
Use of computers and statistical software packages
Epi Info 2002: one of the software packages most commonly used by health researchers


4/145

Preparation of Data for Statistical Analysis
Data collected should be entered into a computer for analysis.
Includes tasks such as:
1. Checking and manual editing
2. Coding
3. Creating format or views for data entry
4. Entering data into a computer
5. Cleaning
6. Transformation


5/145

1. Checking and manual editing
Is a function of the quality of data
Involves checking the data to detect incompleteness, inconsistencies, and other obvious errors in the questionnaire.
The majority of the errors in the data should be detected and corrected in the field before the data are sent away.
Interviewers and supervisors play a key role in correcting errors close to the source.


6/145

Types of checks
1. Range checks: e.g., sex coded Male (=1) or Female (=2), age (1-99 yr)
2. Typographic checks: e.g., 41 entered rather than 14
3. Consistency checks: e.g., date of birth against age; a mother aged 20 recorded with a child aged 15
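The range and consistency checks can be sketched in Python. This is a minimal illustration, not part of the original slides: the field names (`sex`, `age`, `mother_age`, `child_age`) and the minimum mother-child age gap of 12 years are assumptions chosen for the example.

```python
# Hypothetical field names and threshold, chosen only for illustration.
MIN_MOTHER_CHILD_GAP = 12  # assumed smallest plausible age gap, in years


def range_errors(record):
    """Range checks: sex must be 1 or 2, age must lie in 1-99 years."""
    errors = []
    if record.get("sex") not in (1, 2):
        errors.append("sex out of range")
    age = record.get("age")
    if age is None or not 1 <= age <= 99:
        errors.append("age out of range")
    return errors


def consistency_errors(record):
    """Consistency check: a mother must be plausibly older than her child."""
    errors = []
    if record["mother_age"] - record["child_age"] < MIN_MOTHER_CHILD_GAP:
        errors.append("mother and child ages inconsistent")
    return errors


print(range_errors({"sex": 3, "age": 41}))
print(consistency_errors({"mother_age": 20, "child_age": 15}))
```

A transposition such as 41 for 14 passes the range check, so typographic errors of that kind usually have to be found by comparing the entry against the original questionnaire.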


7/145


8/145

2. Coding
Coding is assigning a separate, non-overlapping numerical code to each response recorded in words, and to missing values.
Computers do not readily understand or accept alphabetical text, codes, or verbatim responses.
Example: 1=Male, 2=Female
Codes should be both exhaustive and mutually exclusive.


9/145

Closed-ended questions are usually pre-assigned a numerical code (pre-coding = before data collection)
What is your current marital status?
1. Never married
...
3. Widowed/separated
Post-coding = after data collection
Mainly used for open-ended questions
The code is assigned after reviewing a representative sample of responses from respondents


10/145

Codebook = a document that lists the codes (or keys) assigned to the values of each variable
Guides researchers to find the right code for each answer category
Some responses, such as quantitative variables, can be entered directly into a computer without coding
Example: age, weight, body temperature, number of pregnancies and ANC visits, etc.


11/145

Question/Statement            Variable name   Codes
1. Your sex                   SEX             1=Male, 2=Female
2. What is your occupation?   OCCUP           1=Farmer, 2=Housewife, 3=Trader, 4=Student, 5=Other

Variable name = usually contains fewer than 8 characters
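A codebook like the one above can be held as a simple lookup table. The sketch below uses the SEX and OCCUP codes from the slide; the function names `decode` and `encode` are hypothetical helpers introduced only for this example.

```python
# Codebook taken from the slide; helper names are illustrative assumptions.
CODEBOOK = {
    "SEX": {1: "Male", 2: "Female"},
    "OCCUP": {1: "Farmer", 2: "Housewife", 3: "Trader", 4: "Student", 5: "Other"},
}


def decode(variable, code):
    """Return the label for a numeric code, or None if the code is unknown."""
    return CODEBOOK[variable].get(code)


def encode(variable, label):
    """Return the numeric code for a response recorded in words."""
    wanted = label.strip().lower()
    for code, known_label in CODEBOOK[variable].items():
        if known_label.lower() == wanted:
            return code
    raise ValueError(f"no code for {label!r} in {variable}")


print(decode("OCCUP", 3))
print(encode("SEX", "female"))
```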


12/145

3. Creating data format in a computer
Designing a format or electronic questionnaire in a computer
Examples
1 Sex: # 1=male, 2=female
2 Age: ## in years
3 Occup: # 1=farmer, 2=housewife, 3=trader, 4=student, 5=other
4 Sick: # 1=yes, 2=no
5 RxSource _____________________


13/145

4. Data Entry
The transfer of data from a questionnaire to a computer file.
Before entry, data must be checked for errors
Data must be coded
Data are entered by a data entry clerk or a researcher


14/145

5. Data cleaning
Data entered must be checked for errors, impossible or implausible values, and inconsistencies

In most cases errors are inevitable

,avoidable

Errors can result from incorrect reading of codes, incorrect reporting, missed entry, incorrect coding, repeated entry, incorrect typing, and so forth.


15/145

Data cleaning involves two types of checks:
1. Checking for outliers
Sex: 1=male, 2=female
If 3 or another number is entered rather than 1 or 2, the error should be corrected by looking into the original source
2. Performing consistency checks
Checking whether data in one part of a record are compatible with data in another part
If sex is initially entered as 1=male, and 2 is entered for number of pregnancies, the data are not internally consistent
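Both cleaning checks can be run over a whole data file at once. The following is a sketch under assumptions: each record is a dictionary, and the keys `id`, `sex`, and `pregnancies` are hypothetical names rather than anything defined in the slides. It also flags repeated entries, one of the error sources listed earlier.

```python
from collections import Counter


def cleaning_report(records):
    """Return (record id, problem) pairs for records that fail a check."""
    problems = []
    id_counts = Counter(r["id"] for r in records)
    for r in records:
        if r["sex"] not in (1, 2):
            # Outlier check: only codes 1 and 2 are valid for sex.
            problems.append((r["id"], "sex code outside 1-2"))
        elif r["sex"] == 1 and r.get("pregnancies", 0) > 0:
            # Consistency check: a male record cannot have pregnancies.
            problems.append((r["id"], "male with pregnancies recorded"))
        if id_counts[r["id"]] > 1:
            problems.append((r["id"], "repeated entry"))
    return problems


records = [
    {"id": 1, "sex": 1, "pregnancies": 2},
    {"id": 2, "sex": 3},
    {"id": 3, "sex": 2, "pregnancies": 1},
]
for record_id, problem in cleaning_report(records):
    print(record_id, problem)
```

Each flagged record still has to be corrected by going back to the original questionnaire; the code only locates the problems.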


16/145

6. Data transformation
Involves restructuring the data set and generating new variables, or recoding some of the existing data fields to define new variables
To transform the raw scores into standard scores
Facilitates data analysis and interpretation
Easily handled directly through instructions to a computer


17/145

Example: Infants' original BWT (entered on the computer files in grams) can be categorized into a dichotomous variable
Recode BWEIGHT (LOWEST THRU 2500=1) (2501 THRU HIGHEST=2)
1=low birth weight, 2=normal weight
Age can be recoded from a continuous variable into a categorical one:
1=15-24, 2=25-34, 3=35-44, 4=45-54
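The same recodes can be written in Python. This is a minimal sketch: the cut-off of 2500 g and the age groups come from the slide, while the function names and the choice to return `None` for ages outside 15-54 are assumptions made for the example.

```python
# Age groups from the slide: (lower bound, upper bound, code).
AGE_GROUPS = [(15, 24, 1), (25, 34, 2), (35, 44, 3), (45, 54, 4)]


def recode_bweight(grams):
    """1 = low birth weight (up to 2500 g), 2 = normal weight (2501 g and above)."""
    return 1 if grams <= 2500 else 2


def recode_age(years):
    """Return the age-group code, or None when the age is outside 15-54."""
    for lower, upper, code in AGE_GROUPS:
        if lower <= years <= upper:
            return code
    return None  # assumption: ages outside the coded range are left missing


print([recode_bweight(g) for g in (1800, 2500, 2501, 3400)])
print([recode_age(a) for a in (15, 34, 35, 60)])
```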


18/145

EPI Info 2002
Epi Info 2002 is a series of programs for use by public health professionals in conducting outbreak investigations, managing databases for public health surveillance and epidemiological data
Epi Info 2002 software is in the public domain and freely available for use, copying, translation and distribution.
"Epi Info" is a trademark of the Centers for Disease Control and Prevention (CDC).
SPSS and STATA are very expensive


19/145

EPI Info 2002
With Epi Info and a personal computer, physicians, epidemiologists, and other public health professionals can easily develop a questionnaire or form, customize the data entry process, and enter and analyze data (all in one package).


20/145

EPI Info 2002
Epi Info is a tool that public health professionals use to:
Create a questionnaire (format) (MakeView)
Enter data in a questionnaire (Enter Data)
Analyse the data (Analyse Data)
Additional features: shaping data entry, error checking, coding, selecting records, creating new variables, recoding data, importing and exporting files from other systems.


21/145

Running EPI Info 2002
Turn on your computer
Double click Epi Info 2002
Select a program (MakeView, Enter Data, Analyze Data, etc.) from the menu.
[Figure: the Epi Info for Windows main menu]


22/145

Makeview Program
Is used to place prompts and data entry fields on one or many pages of a View.
Used to create questionnaires or views.
Regarded as both the form designer and the database design environment


23/145

Questionnaire / View
An electronic replica of a paper form or other information source, created to enter and store data
Is created in the Epi Info Makeview program


24/145


25/145

Running Makeview
Type the file name you gave earlier (e.g. Mary) and click OK. You can also type another name.
Right click to create a field.
Right clicking on any empty space will provide a box for writing the name of the variable (e.g. name, age, sex, etc.)


26/145

Running Makeview
Below you will see a menu of field types (types of variable)
Select the appropriate field type (e.g. upper-case text (______ for Occupation), numeric (# for age), label (for title), etc.)

Click OK


27/145


28/145

Enter Program
Displays the data entry screen created in MakeView, ready to receive data
Data are entered and edited here
Controls the data entry process, using the settings specified in MakeView.
A search function is provided so that records can be located that match values specified for any combination of variables.


29/145

Entering data using the Enter Data program
Click Enter Data from the main menu
Across the top of the screen, click on File and select Open
You will be asked to "Select a table", with the file name you already gave it as the default.
Click OK


30/145

The program will now make a data file from your questionnaire and display the variables on the screen ready to receive data.
Fill in the blanks until you have completed the form.
The program will then be ready to receive the 2nd entry, and so on.
After finishing your data entry, click Exit


31/145

The Analyze Data Program
The analysis program produces lists, frequencies, graphs, tables, and more statistics
Select Analyze Data from the main menu
Command windows will appear
Click Read (import)
Choose your file (e.g., Mary.rec)
Click OK


32/145

Descriptive Statistics
Distribution or probability distribution refers to the way data are distributed; it is examined in order to draw conclusions about a set of data.
Continuous variables = the aim is to determine whether or not normality may be assumed
Categorical variables = we obtain the frequency distribution for each variable


33/145

Distribution of Categorical Variable
A study was conducted to assess the characteristics of a group of 234 smokers by collecting data on gender and other variables.
Gender: 1 = male, 2 = female

Gender       Frequency (n)   Relative Frequency
Male (1)     110             47.0%
Female (2)   124             53.0%
Total        234             100%
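A frequency distribution like this one can be produced from the coded data with a counter. The sketch below rebuilds the gender codes from the reported counts (110 males, 124 females); the function name `frequency_table` is a hypothetical helper.

```python
from collections import Counter


def frequency_table(codes, labels):
    """Return (label, frequency, relative frequency in %) for each code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return [
        (labels[code], counts[code], round(100 * counts[code] / total, 1))
        for code in sorted(counts)
    ]


# Coded gender values reconstructed from the counts in the table.
genders = [1] * 110 + [2] * 124
table = frequency_table(genders, {1: "Male", 2: "Female"})
for label, n, percent in table:
    print(f"{label:<8}{n:>5}{percent:>8}%")
```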


34/145

Frequency distribution of BWT: Bar Chart
[Figure: bar chart of frequency by BWT category (Very low, Low, Normal, Big)]


35/145


36/145

Prevalence of active trachoma among children (1-9) by sex and area of residence
[Figure: bar chart of prevalence (%) of AT, TF, and TI by gender and area of residence (Female, Male, Rural, Urban, Total)]


37/145

Pie Chart with relative frequencies of

categories of BWT


38/145

Distribution of Continuous Variable
Examples: age, height, weight, etc.
A continuous variable can take infinitely many values
The probability associated with any particular value is effectively zero
However, the variable will assume some value in the interval enclosed by two values, x1 and x2
The probability distribution is visualized as a curve, and probabilities are areas under the curve


39/145

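The total area under a probability distribution is always 1.
The section marked A represents the probability of observing a value of 3 or greater, symbolically written as $\Pr(X \ge 3)$. If the area of A is, say, 0.2 units, then $\Pr(X \ge 3) = 0.2$.
[Figure: probability distribution curve over X = 0 to 5, with region A labelled $\Pr(X \ge 3)$ and a second region B defined relative to X = 1]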



45/145

The Normal Distribution
Described by two parameters: the mean ($\mu$) and the standard deviation ($\sigma$)


46/145


47/145

Histograms
Histograms are frequency distributions with continuous class intervals that have been turned into graphs.
To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line.
Non-overlapping intervals that cover all of the data values must be used.
Bars are then drawn over the intervals
The area of each column is proportional to the number of observations in that interval
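The counting step behind a histogram can be sketched as follows. The birth weights are made-up illustrative values, not data from the slides, and the equal-width intervals of the form [lower, upper) are an assumption of this example; with equal widths, bar height and bar area are both proportional to the count.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [lower, upper)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if not 0 <= index < n_bins:
            # The intervals must cover all of the data values.
            raise ValueError(f"{value} falls outside the intervals")
        counts[index] += 1
    return counts


# Hypothetical birth weights in grams, for illustration only.
weights = [2100, 2650, 2900, 3100, 3200, 3350, 3400, 3800, 4100, 4450]
counts = histogram_counts(weights, start=2000, width=500, n_bins=5)
for i, count in enumerate(counts):
    lower = 2000 + 500 * i
    print(f"{lower}-{lower + 499}: {'#' * count}")
```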


48/145


49/145

Frequency polygon

A frequency distribution can be portrayed graphically in yet another way by means of a frequency polygon.
To draw a frequency polygon, we connect the mid-points of the tops of the bars of the histogram with straight lines.


50/145

The frequency polygon of birth weight of newborns by sex


51/145

Numerical Summary Measures
Measures of location
Measures of dispersion


52/145

Measures of Location
The most common measures:
Mean (Arithmetic Mean)
Median
Mode


53/145

Mean (Arithmetic mean)
The "average", which is obtained by adding all the values in a sample or population and dividing the sum by the number of values:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


54/145

Example: 10 numbers:
19 21 20 20 34 22 24 27 27 27
Mean = (19 + 21 + ... + 27) / 10 = 24.1


55/145

Median
The median is the value which divides the data set into two equal parts.
If the number of values is odd, the median will be the middle value when all values are arranged in order of magnitude.
When the number of observations is even, there is no single middle value but two middle observations.
In this case the median is the mean of these two middle observations, when all observations have been arranged in the order of their magnitude.


56/145

In the above data set, arranging in increasing order:
19 20 20 21 22 24 27 27 27 34
Median = (22 + 24)/2 = 23


57/145

Mode
Any observation of a variable at which the distribution reaches a peak
Most distributions are unimodal
In the above example, the mode is 27.
It occurs three times (most frequent value)
It is possible to have more than one mode or no mode.
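The three measures of location for the ten example values can be checked with Python's standard `statistics` module. This is only a verification sketch; `multimode` is used because it returns every mode when there is more than one.

```python
import statistics

# The ten example values from the slides.
values = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]

mean = statistics.mean(values)        # sum of the values divided by their number
median = statistics.median(values)    # mean of the two middle values, since n is even
modes = statistics.multimode(values)  # list of the most frequent value(s)

print(mean, median, modes)
```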


58/145

Measures of Dispersion
Dispersion refers to the variety exhibited by the values of the data.
The amount may be small when the values are close together.
Two or more sets may have the same mean and/or median but they may be quite different.


59/145

These two distributions have the same mean, median, and mode, but they may be quite different


60/145

Range
The range is the difference between the largest and smallest values in the set of observations.
These values are often called the maximum and the minimum.
If 167 is the largest and 40 is the smallest, then the range is
167 − 40 = 127


61/145


62/145

a) The first quartile (Q1): 25% of all the ranked observations are less than Q1


63/145

Inter-quartile range (IQR)
The IQR encompasses the middle 50% of the observations
IQR = Q3 − Q1, where Q3 = 3rd quartile and Q1 = 1st quartile.


64/145

Median = 2nd quartile (dividing into two halves)
1st quartile (Q1) = 1/4 (n + 1)th observation
2nd quartile (Q2) = 1/2 (n + 1)th observation
3rd quartile (Q3) = 3/4 (n + 1)th observation


65/145

Variance ($\sigma^2$, $S^2$)
A measure of the dispersion relative to the scatter of the values about their mean.
It is the average of the squares of the deviations taken from the mean.
Population variance = $\sigma^2$
Sample variance = $S^2$


66/145

A sample variance is calculated for a sample of individual values ($X_1, X_2, \ldots, X_n$) and uses the sample mean ($\bar{X}$) rather than the population mean $\mu$:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$


67/145

Limitation of the Variance
The variance is a mean of squared values
It is not expressed in the same unit as the observations whose dispersion it represents
A variance of a distribution of weight is not expressed in kg, but in kg²
Mean weight = 36.5 kg, $s^2$ = 257 kg²


68/145

Standard deviation
Standard deviation = square root of the variance
$\sigma = \sqrt{\sigma^2}$ (population)
$s = \sqrt{s^2}$ (sample)
Mean = 36.5 kg
$s^2$ = 257 kg²
$s = \sqrt{257} \approx 16$ kg
The standard deviation is expressed in the same units as the measurement it represents
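The sample variance and standard deviation can be computed directly from their definitions. The sketch below reuses the ten example values from the measures-of-location slides and assumes the usual n − 1 divisor for a sample; it then repeats the weight example, where a variance of 257 kg² gives a standard deviation of about 16 kg.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


values = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
variance = sample_variance(values)
std_dev = math.sqrt(variance)   # same units as the original measurements
print(round(variance, 2), round(std_dev, 2))

# Weight example: the variance is in kg squared, the standard deviation in kg.
weight_sd = math.sqrt(257)
print(round(weight_sd, 1))
```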


69/145


70/145

Box Plot
It is another way to display information about the distribution of a set of data.
Can be used to display a set of discrete or continuous observations using a single vertical axis; only certain summaries of the data are shown
First the quartiles of the data set must be defined


71/145

A box is drawn with the top of the box at the third quartile and the bottom at the first quartile.
The location of the mid-point of the distribution (the median) is indicated with a horizontal line in the box.
Finally, straight lines are drawn from the centre of the top of the box to the largest observation and from the centre of the bottom of the box to the smallest observation.


72/145

Percentile position = p(n+1), where p = the required percentile expressed as a proportion
Arrange the numbers in ascending order
A. 1st quartile = 0.25(n+1)th
B. 2nd quartile = 0.5(n+1)th
C. 3rd quartile = 0.75(n+1)th
D. 20th percentile = 0.2(n+1)th
E. 15th percentile = 0.15(n+1)th
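The p(n+1) rule can be turned into a small function. This is a sketch under one assumption not stated in the slides: when the position p(n+1) is not a whole number, the value is found by linear interpolation between the two neighbouring ordered observations. The data are the ten example values used earlier for the mean and median.

```python
import math


def percentile_value(values, p):
    """Value at position p(n+1) in the ordered data, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)        # 1-based position in the ordered list
    lower = math.floor(position)
    fraction = position - lower
    if lower < 1:                 # position falls before the first observation
        return ordered[0]
    if lower >= n:                # position falls at or beyond the last observation
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


values = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
q1 = percentile_value(values, 0.25)
q2 = percentile_value(values, 0.50)
q3 = percentile_value(values, 0.75)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

The default `'exclusive'` method of `statistics.quantiles` in the Python standard library follows the same (n+1) positioning, so it can serve as a cross-check.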


73/145


74/145

[Figure: box plots of birth weight (grams, axis from 500 to 5000) by gestational age (Pre, Term, Post)]


75/145

Statistical Inference
Statistical inference is the process of using samples to make inferences about a population.
It involves estimation and hypothesis testing.
Often the population parameter of interest is either a mean or a proportion.


76/145

Parameter Estimations
Population parameter: a characteristic of the underlying (unknown) distribution of the variable of interest for a population
Sample statistic: an estimate of the population parameter obtained from a sample


77/145

Types of Estimates


78/145

Point estimation involves the calculation of a single number to estimate the population parameter
Single values
Interval estimation specifies a range of reasonable values for the parameter
Confidence interval
Provides more information about a population characteristic than does a point estimate


79/145

Confidence Intervals
Used for estimating the true value of the population parameter
Give a range of plausible values for the true population parameter.


80/145


81/145

CI tells us how precise our estimate is likely to be
A narrow CI implies high precision, while a wide CI implies low precision.
A narrow CI reflects a large sample size or low variability or both.


82/145

95% CI commonly used
Sometimes 90% and 99%
The 95% CI is calculated in such a way that, under the conditions assumed for the underlying distribution, the interval will contain the true population parameter 95% of the time.
Loosely speaking, you might interpret a 95% CI as one which you are 95% confident contains the true parameter.


83/145

CIs can also answer the question of whether or not an association exists, or whether a treatment is beneficial or harmful (analogous to P-values).
e.g., if the CI of an odds ratio includes the value 1.0, we cannot be confident that exposure is associated with disease.


84/145

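C.I. for a population mean
a) Known variance (large sample size)
A $100(1-\alpha)\%$ C.I. for $\mu$ is
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
$\alpha$ is to be chosen by the researcher; the most common values of $\alpha$ are 0.05, 0.01, 0.001 and 0.1.

85/145

Margin of Error (Precision of the estimate)
Margin of error = $z_{\alpha/2}\,\sigma/\sqrt{n}$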
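A confidence interval for a mean with known variance can be computed with the standard library's `NormalDist`. The numbers used here (sample mean 3000 g, standard deviation 500 g, n = 100) are hypothetical values chosen for illustration, and the function name is an assumption of this sketch.

```python
import math
from statistics import NormalDist


def mean_ci_known_sigma(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper, margin) for a mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    margin = z * sigma / math.sqrt(n)         # margin of error
    return sample_mean - margin, sample_mean + margin, margin


# Hypothetical inputs, for illustration only.
lower, upper, margin = mean_ci_known_sigma(3000, 500, 100)
print(round(lower, 1), round(upper, 1), round(margin, 1))
```

Quadrupling the sample size halves the margin of error, which is one way to see why a narrow interval reflects a large sample or low variability.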


86/145

b) Unknown variance (small sample size, n < 30)
What if $\sigma$ for the underlying population is unknown and the sample size is small?
As an alternative we use Student's t distribution.
Based on degrees of freedom
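For the small-sample case, the interval takes the usual textbook form, with the sample standard deviation $s$ in place of $\sigma$ and a $t$ quantile in place of $z$. The block below is offered as a sketch of that standard result; the notation $t_{\alpha/2,\,n-1}$ is an assumption, since the original formula does not survive in the slide text.

```latex
\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad \text{with } n - 1 \text{ degrees of freedom}
```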


87/145


88/145

Note: t approaches z as n increases

C.I. for a population proportion


89/145


90/145

Hence,
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
is an approximate 95% CI for the true proportion P.


91/145

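Example:
A study on dental health practice. Of 300 adults interviewed, 123 said that they regularly had a dental check-up twice a year. What is an approximate 95% CI for the proportion in the population?
Here $\hat{p} = 123/300 = 0.41$, so the interval is $0.41 \pm 1.96\sqrt{0.41 \times 0.59/300}$, approximately (0.35, 0.47).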
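The dental check-up example can be reproduced in a few lines. This sketch uses the normal approximation for a proportion; the function name `proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Approximate CI for a proportion using the normal approximation."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 123 of 300 adults reported a dental check-up twice a year.
low, high = proportion_ci(123, 300)
print(round(low, 3), round(high, 3))
```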



92/145

Estimation for Two Populations



93/145

Hypothesis testing
The majority of statistical analyses involve comparison, most obviously between treatments or procedures or between groups of subjects.
Hypothesis: a statement about one or more populations
Example: Is the true population mean BW equal to 3000 g?


94/145



95/145

The alternative hypothesis, HA, is a statement that disagrees with the null hypothesis.
The effect of interest is not zero; there is a difference
States the line of thinking of the researcher
It is the hypothesis that the researcher wants to prove

Examples of Research Hypotheses


96/145

Population Mean

The average length of stay of patients admitted to the hospital is five days
The mean BW of babies delivered by mothers with low SES is lower than that of babies delivered by mothers with higher SES
The economic burden of HIV/AIDS on the poor is higher than that on wealthier people
Etc.
Population Proportion


97/145

The proportion of adult smokers in Addis Ababa is p = 0.40
The prevalence of HIV among non-married adults is higher than that in married adults
A greater proportion of people who live in poverty may have a low health status
Inappropriate prescription of drugs is more common in settings where microscopy is unavailable
Etc.


98/145

HT using test statistics (e.g., mean):
$H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$ (two-tailed)
$H_0: \mu \le \mu_0$ vs $H_1: \mu > \mu_0$ (one-tailed)
$H_0: \mu \ge \mu_0$ vs $H_1: \mu < \mu_0$ (one-tailed)
Decide on the appropriate test statistic for the hypothesis (Z, t, $\chi^2$, F, etc.)
Select the level of significance for the statistical test ($\alpha$ = 0.05, 0.01, 0.001, etc.)


99/145

Another way to state conclusion:
Reject H0 if P-value < $\alpha$
Do not reject (accept) H0 if P-value $\ge \alpha$
The smaller the P-value, the stronger the evidence against H0.


100/145

P-value
The P-value is the probability that a difference at least as large as the one observed could have occurred simply by chance if H0 were true
Loosely, the probability that we could be wrong if we reject H0
Indicates how likely it is that the observed association between two variables is due to chance
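A two-tailed P-value for a test about a mean can be sketched with a one-sample z-test. All inputs below are hypothetical (a sample of 100 babies with mean birth weight 2900 g, tested against 3000 g with a known standard deviation of 500 g), and treating the standard deviation as known is an assumption of the example.

```python
import math
from statistics import NormalDist


def z_test_two_tailed(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-tailed P-value for H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs, for illustration only.
z, p_value = z_test_two_tailed(sample_mean=2900, mu0=3000, sigma=500, n=100)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(round(z, 2), round(p_value, 4), decision)
```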



101/145

Types of Errors in Hypothesis Tests
Whenever we reject or accept H0, we may commit an error.
Two types of errors can be committed:
Type I Error
Type II Error


102/145

Type I error ($\alpha$): the probability of rejecting H0 when it is true.
It is the probability of being wrong when H0 is true.
Typical value for $\alpha$ (significance level) is 5%


103/145

Type II error ($\beta$): the probability of not rejecting H0 when it is actually false.
Failure to accept HA when it is true


104/145

Power: the probability of rejecting H0 when it is false, or accepting HA when it is true.
Power = 1 − $\beta$.
Typical value for power is 80%


105/145

Statistical Methods for Continuous Variables: Comparison of Groups
Is there a significant difference between two or more groups?


106/145

Student's t-Test (Unpaired data)
The Student's t-test, or simply t-test, is commonly used for comparison of the means of two groups.
Our null hypothesis is $H_0: \mu_1 = \mu_2$, where $\mu_1$ is the true mean of the first group and $\mu_2$ is the true mean of the second group.
Assumption:
The data are independent and normally distributed
Example:


107/145

Comparing mean BWT between males and females
Comparing mean blood pressure between diabetic and non-diabetic patients
Etc.
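The test statistic for unpaired data can be computed by hand. This is a sketch under a stated assumption: it uses the pooled-variance (equal variances) form of the t statistic, which the slides do not specify. The birth weights are made-up values for illustration, and the P-value would be read from a t table with the returned degrees of freedom.

```python
import math


def unpaired_t(group1, group2):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = sum(group1) / n1, sum(group2) / n2
    ss1 = sum((x - mean1) ** 2 for x in group1)
    ss2 = sum((x - mean2) ** 2 for x in group2)
    df = n1 + n2 - 2
    pooled_variance = (ss1 + ss2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Hypothetical birth weights in grams, for illustration only.
males = [3200, 3400, 3100, 3500, 3300]
females = [3000, 3100, 2900, 3200, 3050]
t_stat, df = unpaired_t(males, females)
print(round(t_stat, 3), df)
```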



108/145

Student's t-Test (Paired data)
Study subjects from one population can be matched, or paired, with particular subjects in the second population.
Comparing mean BWT between twins
Comparing mean blood pressure of diabetic patients before and after some treatment
Eyes or ears of the same individual

109/145

One-way analysis of variance (ANOVA)
Suitable for deciding whether differences exist between the means of more than two groups.
Is a generalization of the Student's t-test.
The null hypothesis is $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$, where $\mu_i$ is the true mean of the ith group.
Allows us to test whether the mean of at least one of the groups differs significantly from that of some other group.


110/145



111/145

Correlation
Measures the strength of the relationship between two continuous variables from a single population
The relationship should be linear.
Procedure
Display the data in a scatter plot before carrying out any further analysis
One variable is plotted on the X-axis
The other on the Y-axis

112/145

Percentage of children immunized against DPT and under-five mortality rate for 20 countries, 1992

Nation            % Immunized   Child MR per 1000 LB
Bolivia           77            118
Brazil            69            65
Cambodia          32            184
Canada            85            8
China             94            43
Czech Republic    99            12
Egypt             89            55
Ethiopia          13            208
Finland           95            7
France            95            9
Greece            54            9
India             89            124
Italy             95            10
Japan             87            6
Mexico            91            33
Poland            98            16
Russia            73            32
Senegal           47            145
Turkey            76            87
UK                90            9
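The strength of the linear relationship in these data can be summarized with Pearson's correlation coefficient. The sketch below computes it from its definition; the two lists are the table values in country order (Bolivia to UK), and the function name `pearson_r` is a hypothetical helper.

```python
import math

# Table values in country order, Bolivia through UK.
immunized = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
             54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
mortality = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
             9, 124, 10, 6, 33, 16, 32, 145, 87, 9]


def pearson_r(x, y):
    """Pearson correlation: co-deviation sum over the root of the squared sums."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(immunized, mortality)
print(round(r, 2))
```

A negative coefficient here means that countries with higher immunization coverage tend to have lower under-five mortality; it does not by itself show that one causes the other.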


113/145

[Figure: scatter plot of under-five mortality rate (y-axis) against percentage immunized (x-axis)]
