Statistic I: Data Collection & Handling

Post on 20-Jun-2015


©drtamil@gmail.com 2013

Collect, Explore & Summarise

Dr Azmi Mohd Tamil

Dept of Community Health

Universiti Kebangsaan Malaysia

FK6163


Data Collection

Data collection begins after the study design and the sampling strategy have been decided.

Data Collection

Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.


Data Collection

Information is collected on certain characteristics, attributes and the qualities of interest from the samples

These data may be quantitative or qualitative in nature.


Data Collection Techniques

• Use available information
• Observation
• Interviews
• Questionnaires
• Focus group discussion

Using Available Information

Existing Records
• Hospital records - case notes
• National registry of births & deaths
• Census data
• Data from other surveys

Disadvantages of using existing records

• Incomplete records
• Cause of death may not be verified by a physician/MD
• Missing vital information
• Difficult to decipher
• May not be representative of the target group - only severe cases go to hospital

Disadvantages of using existing records

• Delayed publication - obsolete data
• Different methods of data recording between institutions, states and countries make comparison & pooling of data difficult
• Comparisons across time are difficult due to differences in classification, diagnostic tools etc.

Advantages of using existing records

• Cheap
• Convenient
• In some situations, it is the only data source, e.g. accidents & suicides

Observation

Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena

• Done using defined scales
• Participant observation e.g. PEF and asthma symptom diary
• Non-participant observation e.g. cholesterol levels

Interviews

Oral questioning of respondents either individually or as a group.

Can be loosely structured, or highly structured using a questionnaire

Administering Written Questionnaires

Self-administered:
• via mail
• by gathering the respondents in one place and getting them to fill it in
• by hand-delivering the questionnaires and collecting them later

Large non-response can distort results

Questionnaires

• Influenced by education & attitude of respondent, esp. for self-administered questionnaires
• Interviewers need to be trained
• Open ended vs close ended questions
• The need for pre-testing or a pilot study

Focus group discussion

Selecting relevant parties to the research questions at hand and discussing with them in focus groups

Examples in your own field of interest?

Plan for data collection

• Permission to proceed
• Logistics - who will collect what, when and with what resources
• Quality control

Accuracy & Reliability

Accuracy - the degree to which a measurement actually measures the characteristic it is supposed to measure

Reliability is the consistency of replicate measures
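A minimal sketch of how the two ideas can be put into numbers, assuming a known reference value. The readings, the scale names and the helper functions `bias` and `spread` are hypothetical and not from the lecture: the average distance from the truth stands in for (in)accuracy, and the standard deviation of replicate readings stands in for (un)reliability.

```python
from statistics import mean, stdev

TRUE_WEIGHT = 60.0  # hypothetical reference value (kg)

# Hypothetical replicate readings of the same subject on two scales.
scale_a = [60.1, 59.9, 60.0, 60.2, 59.8]  # centred on the true value
scale_b = [62.0, 62.1, 61.9, 62.0, 62.0]  # consistent, but reads about 2 kg high


def bias(readings, true_value):
    """Systematic error: how far the average reading is from the truth."""
    return mean(readings) - true_value


def spread(readings):
    """Random error: standard deviation of the replicate readings."""
    return stdev(readings)


for name, readings in [("A", scale_a), ("B", scale_b)]:
    print(f"scale {name}: bias = {bias(readings, TRUE_WEIGHT):+.2f}, "
          f"sd of replicates = {spread(readings):.2f}")
```

Scale B is the more reliable of the two (smaller spread) yet the less accurate (larger bias), which is why both properties have to be checked.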


Reliability & Accuracy


Accuracy & Reliability

Both are reduced by random error and systematic error from the same sources of variability:
• the data collectors
• the respondents
• the instrument

Strategies to enhance precision & accuracy

• Standardise procedures and measurement methods
• Training & certifying the data collectors
• Repetition
• Blinding

Introduction

The method of exploring and summarising data differs according to the types of variables.

Dependent/Independent

Independent Variables: Frequency of Exercise, Food Intake

Dependent Variable: Obesity

Explore

It is the first step in the analytic process:
• to explore the characteristics of the data
• to screen for errors and correct them
• to look for distribution patterns - normal distribution or not

The data may require transformation before further analysis using parametric methods, or may need analysis using non-parametric techniques.

Data Screening

By running frequencies, we may detect inappropriate responses

How many in the audience have 15 children and currently pregnant with the 16th?

PARITY

Valid    Frequency   Percent
1        67          30.7
2        44          20.2
3        36          16.5
4        22          10.1
5        21          9.6
6        8           3.7
7        3           1.4
8        7           3.2
9        5           2.3
10       3           1.4
11       1           0.5
15       1           0.5
Total    218         100.0
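Running frequencies can be sketched in a few lines. The counts below are copied from the PARITY table; the cut-off `MAX_PLAUSIBLE_PARITY` is an assumption chosen only for illustration, and a flagged value is a prompt to check the questionnaire, not proof of an error.

```python
from collections import Counter

# Frequencies taken from the PARITY table (value -> count).
parity_counts = Counter({1: 67, 2: 44, 3: 36, 4: 22, 5: 21, 6: 8,
                         7: 3, 8: 7, 9: 5, 10: 3, 11: 1, 15: 1})

MAX_PLAUSIBLE_PARITY = 12  # assumed cut-off, for illustration only

total = sum(parity_counts.values())
for value in sorted(parity_counts):
    count = parity_counts[value]
    flag = "  <-- check against questionnaire" if value > MAX_PLAUSIBLE_PARITY else ""
    print(f"{value:>5} {count:>5} {100 * count / total:>6.1f}{flag}")

# Values that deserve a second look before any analysis is run.
suspicious = sorted(v for v in parity_counts if v > MAX_PLAUSIBLE_PARITY)
```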


Data Screening

See whether the data make sense or not.

E.g. Parity 10 but age only 25.
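A cross-field check of this kind could be sketched as follows. The records, the field names `age` and `parity`, and the two limits are all hypothetical; the rule only flags records for checking against the original questionnaire.

```python
# Hypothetical records; the field names are assumptions.
records = [
    {"id": 1, "age": 25, "parity": 10},
    {"id": 2, "age": 38, "parity": 6},
    {"id": 3, "age": 19, "parity": 1},
]

MIN_AGE_AT_FIRST_BIRTH = 15   # assumed, for illustration only
MIN_YEARS_BETWEEN_BIRTHS = 1.5  # assumed, for illustration only


def needs_checking(record):
    """Flag a record whose parity looks too high for the reported age."""
    years_available = max(record["age"] - MIN_AGE_AT_FIRST_BIRTH, 0)
    plausible_maximum = years_available / MIN_YEARS_BETWEEN_BIRTHS
    return record["parity"] > plausible_maximum


flagged = [r["id"] for r in records if needs_checking(r)]
print("Records to check:", flagged)
```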


Data Screening

By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data

Descriptive Statistics

                       N     Minimum   Maximum   Mean    Std. Deviation
Pre-pregnancy weight   184   32        484       53.05   33.37
Valid N (listwise)     184

Interpreting the Box Plot

Parts of the box plot, from top to bottom:
• Outlier
• Largest non-outlier
• Upper quartile
• Median
• Lower quartile
• Smallest non-outlier
• Outlier

Each whisker extends up to 1.5 times the box width (the interquartile range) from the end of the box and ends at an observed value. Three times the box width marks the boundary between "mild" and "extreme" outliers.

"mild" = closed dots; "extreme" = open dots
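The 1.5 and 3 box-width rules can be sketched as below. The weights are hypothetical, `classify_outliers` is an illustrative helper, and the quartiles come from Python's `statistics.quantiles`, whose default method may differ slightly from the one used by a given statistics package.

```python
from statistics import quantiles


def classify_outliers(values):
    """Split values into mild and extreme outliers using the box-plot rules."""
    q1, _median, q3 = quantiles(values, n=4)  # lower quartile, median, upper quartile
    box_width = q3 - q1                       # interquartile range
    mild, extreme = [], []
    for v in values:
        distance = max(q1 - v, v - q3)        # how far v lies outside the box
        if distance > 3 * box_width:
            extreme.append(v)
        elif distance > 1.5 * box_width:
            mild.append(v)
    return mild, extreme


# Hypothetical pre-pregnancy weights (kg), including one likely typing error.
weights = [45, 48, 50, 52, 53, 55, 57, 60, 62, 85, 484]
mild, extreme = classify_outliers(weights)
print("mild:", mild, "extreme:", extreme)
```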


Data Screening

We can also make use of graphical tools such as the box plot to detect wrong data entry.

[Figure: box plot of Pre-pregnancy weight (N = 184); vertical axis 0 to 600, with the outlying cases labelled by case number]

Data Cleaning

• Identify the extreme/wrong values
• Check with the original data source, i.e. the questionnaire
• If incorrect, do the necessary correction.
• Correction must be done before transformation, recoding and analysis.

Parameters of Data Distribution

• Mean - central value of the data
• Standard deviation - measure of how the data scatter around the mean
• Symmetry (skewness) - the degree to which the data pile up on one side of the mean
• Kurtosis - how peaked or flat the distribution is, i.e. how much of the data lies far from the mean

Normal distribution

The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.

The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.

However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.


Normal distribution

If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it includes about 68.3% of the observations;

a range of two standard deviations above and two below (± 2sd) about 95.4% of the observations; and

of three standard deviations above and three below (± 3sd) about 99.7% of the observations.

[Figure: Normal curve with the 68.3%, 95.4% and 99.7% regions marked]
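These three percentages can be checked from the cumulative distribution function of the standard Normal distribution. A short sketch using `statistics.NormalDist` from the Python standard library; the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Proportion of a Normal distribution lying within k sd of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd of the mean: {100 * coverage(k):.1f}%")
```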


Normality

Why bother with normality? Because it dictates the type of analysis that you can run on the data.

Normality - Why? Parametric

Normality - Why? Non-parametric

Normality - How?

Explored graphically
• Histogram
• Stem & Leaf
• Box plot
• Normal probability plot
• Detrended normal plot

Explored statistically
• Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
• Skewness (0)
• Kurtosis (0)
  – positive (+): leptokurtic
  – zero (0): mesokurtic
  – negative (-): platykurtic

Kolmogorov-Smirnov

In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came up with an approach for comparing distributions that did not make use of parameters.

This is known as the Kolmogorov-Smirnov test.


Skewness

Skewed to the right indicates the presence of large extreme values

Skewed to the left indicates the presence of small extreme values


Kurtosis

For symmetrical distribution only.

Describes the shape of the curve

Mesokurtic - average shaped

Leptokurtic - narrow & slim

Platykurtic - flat & wide

Skewness & Kurtosis

• Skewness usually lies between -3 and 3.
• The acceptable range for normality is skewness lying between -1 and 1.
• Normality should not be based on skewness alone; the kurtosis measures the "peakedness" of the bell-curve (see Fig. 4).
• Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.
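A minimal sketch of the two statistics, using the simple moment formulas (statistics packages such as SPSS apply a small-sample correction, so their figures will differ slightly). The data and the helper names are hypothetical; kurtosis is reported as excess kurtosis so that a Normal distribution scores 0.

```python
from statistics import fmean


def central_moment(values, power):
    """Average of (x - mean) raised to the given power."""
    m = fmean(values)
    return fmean([(x - m) ** power for x in values])


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def within_normal_range(values, limit=1.0):
    """Rule of thumb from the slide: both statistics between -1 and 1."""
    return abs(skewness(values)) <= limit and abs(excess_kurtosis(values)) <= limit


symmetric = [1, 2, 3, 4, 5]            # hypothetical
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical, one large extreme value
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), within_normal_range(right_skewed))
```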


Normality - Examples: Graphically

[Figure: histogram of Height (horizontal axis 140.0 to 167.5, frequency axis 0 to 60); Std. Dev = 5.26, Mean = 151.6, N = 218]

Q-Q Plot

This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).

If the distributional shapes differ, then the points will plot along a curve instead of a line.

Take note that the interest here is the central portion of the line; severe deviations there mean non-normality. Deviations at the "ends" of the curve signify the existence of outliers.
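The points of a normal Q-Q plot can be computed as sketched below. The heights are hypothetical, `qq_points` is an illustrative helper, and the plotting position (i - 0.5)/n is one common convention among several, so a statistics package may place the points slightly differently.

```python
from statistics import NormalDist


def qq_points(values):
    """Return (observed value, expected normal quantile) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed convention.
    expected = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, expected))


heights = [151, 147, 156, 150, 153, 149, 158, 152, 145, 154]  # hypothetical (cm)
points = qq_points(heights)
for observed, expected in points:
    print(f"{observed:>5} {expected:>7.2f}")
```

Plotting the first column against the second gives the Q-Q plot; points close to a straight line suggest a normal distribution.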


Normality - Examples: Graphically

[Figure: Normal Q-Q Plot of Height; Expected Normal (-3 to 3) against Observed Value (130 to 170)]

[Figure: Detrended Normal Q-Q Plot of Height; Dev from Normal (-0.2 to 0.6) against Observed Value (130 to 170)]

Normal distribution: Mean = median = mode

Normality - Examples: Statistically

Tests of Normality

            Kolmogorov-Smirnov(a)
            Statistic   df    Sig.
Height      .060        218   .052

a. Lilliefors Significance Correction

Descriptives: Height

                                      Statistic   Std. Error
Mean                                  151.65      .356
95% Confidence Interval for Mean
   Lower Bound                        150.94
   Upper Bound                        152.35
5% Trimmed Mean                       151.59
Median                                151.50
Variance                              27.649
Std. Deviation                        5.258
Minimum                               139
Maximum                               168
Range                                 29
Interquartile Range                   8.00
Skewness                              .148        .165
Kurtosis                              .061        .328

Notes on the output:
• Shapiro-Wilk: only if sample size is less than 100.
• Normal distribution: mean = median = mode
• Skewness & kurtosis within ±1
• p > 0.05, so normal distribution

K-S Test

• Very sensitive to the sample size of the data.
• For small samples (n < 20, say), the likelihood of getting p < 0.05 is low.
• For large samples (n > 100), a slight deviation from normality will result in the distribution being reported as not normal.

Guide to deciding on normality


Normality Transformation

[Figure: Normal Q-Q Plot of PARITY; Expected Normal (-2 to 3) against Observed Value (0 to 16)]

[Figure: Normal Q-Q Plot of LN_PARIT; Expected Normal (-2 to 3) against Observed Value (-0.5 to 3.0)]

TYPES OF TRANSFORMATIONS

• Square root
• Logarithm
• Inverse
• Reflect and square root
• Reflect and logarithm
• Reflect and inverse
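A minimal sketch of the six transformations. The parity values are hypothetical, and reflecting as (largest value + 1) - x is one common convention, assumed here; it turns a left-skewed variable into a right-skewed one with a smallest value of 1, so the other transformations can then be applied. Logarithm and inverse require strictly positive values.

```python
import math


def reflect(values):
    """Reflect: subtract each value from (largest value + 1)."""
    top = max(values) + 1
    return [top - x for x in values]


TRANSFORMS = {
    "square root": lambda xs: [math.sqrt(x) for x in xs],
    "logarithm": lambda xs: [math.log(x) for x in xs],  # natural log, as in LN_PARIT
    "inverse": lambda xs: [1 / x for x in xs],
}

parity = [1, 1, 2, 2, 3, 4, 6, 15]  # hypothetical, right-skewed, all positive

ln_parity = TRANSFORMS["logarithm"](parity)
reflected_sqrt = TRANSFORMS["square root"](reflect(parity))
print([round(v, 2) for v in ln_parity])
```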


Summarise

Summarise a large set of data by a few meaningful numbers.

Single variable analysis
• For the purpose of describing the data
• Example: in one year, what kind of cases are treated by the Psychiatric Dept?
• Tables & diagrams are usually used to describe the data
• For numerical data, measures of central tendency & spread are usually used

Frequency Table

Race      F     %
Malay     760   95.84%
Chinese   5     0.63%
Indian    0     0.00%
Others    28    3.53%
TOTAL     793   100.00%

• Illustrates the frequency observed for each category

Frequency Distribution Table

Age        Count   %
0-0.99     25      3.26%
1-4.99     78      10.18%
5-14.99    140     18.28%
15-24.99   126     16.45%
25-34.99   112     14.62%
35-44.99   90      11.75%
45-54.99   66      8.62%
55-64.99   60      7.83%
65-74.99   50      6.53%
75-84.99   16      2.09%
85+        3       0.39%
TOTAL      766

• More than 20 observations are best presented as a frequency distribution table.
• Columns are divided into class & frequency.
• The modal class can be determined using such tables.

Measurement of Central Tendency & Spread

Measures of Central Tendency
• Mean
• Mode
• Median

Measures of Variability
• Standard deviation
• Inter-quartile range
• Skewness & kurtosis

Mean

• The average of the data collected
• To calculate the mean, add up the observed values and divide by the number of them.
• A major disadvantage of the mean is that it is sensitive to outlying points

Mean: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Total of x = 648; n = 20

Mean = 648/20 = 32.4

Measures of variation - standard deviation

• Tells us how closely the scores in a dataset cluster around the mean. A large S.D. indicates more varied scores.
• A summary measure of the differences of each observation from the mean.
• If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.
• Consequently the squares of the differences are added.

sd: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Mean = 32.4; n = 20

Total of (x-mean)^2 = 3050.8

Variance = 3050.8/19 = 160.5684

sd = 160.5684^0.5 = 12.67

x       (x-mean)^2     x       (x-mean)^2
12      416.16         32      0.16
13      376.36         35      6.76
17      237.16         37      21.16
21      129.96         38      31.36
24      70.56          41      73.96
24      70.56          43      112.36
26      40.96          44      134.56
27      29.16          46      184.96
27      29.16          53      424.36
30      5.76           58      655.36
TOTAL   1405.8         TOTAL   1645
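The same calculation can be followed step by step in code. This is only a sketch of the arithmetic on the slide; the last line compares the hand calculation with `statistics.stdev`, which also divides by n - 1.

```python
from statistics import mean, stdev

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

m = mean(data)                                    # total of x divided by n
sum_of_squares = sum((x - m) ** 2 for x in data)  # total of (x - mean)^2
variance = sum_of_squares / (len(data) - 1)       # divide by n - 1 = 19
sd = variance ** 0.5

print(f"mean = {m:.1f}, total of squares = {sum_of_squares:.1f}")
print(f"variance = {variance:.4f}, sd = {sd:.2f}")
print("matches statistics.stdev:", abs(sd - stdev(data)) < 1e-9)
```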


Median

the ranked value that lies in the middle of the data

the point which has the property that half the data are greater than it, and half the data are less than it.

if n is even, average the (n/2)th and the (n/2 + 1)th ranked observations

"robust" to outliers


Median:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

(20+1)/2 = 10.5, so the median lies between the 10th value, which is 30, and the 11th, which is 32.

Therefore the median is (30 + 32)/2 = 31

Measures of variation - quartiles

The range is very susceptible to what are known as outliers

A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.


Quartiles

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• 25th percentile = 24; (24+24)/2
• 50th percentile = 31; (30+32)/2 = median
• 75th percentile = 42; (41+43)/2 (interpolating at rank 0.75 × (n+1) = 15.75 instead gives 42.5)
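Quartiles can be computed in more than one way, and the methods do not always agree. The sketch below, on the same 20 values, compares taking the median of each half of the ordered data with the interpolation used by `statistics.quantiles` (default method, which works at rank p × (n + 1)): the halves method gives 42 for the upper quartile, the interpolation gives 42.5.

```python
from statistics import median, quantiles

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Method 1: median of the lower half, of all data, and of the upper half.
half = len(data) // 2
q1 = median(data[:half])
q2 = median(data)
q3 = median(data[-half:])

# Method 2: interpolation at rank p * (n + 1), the default of statistics.quantiles.
interpolated = quantiles(data, n=4)

print("median of halves:", q1, q2, q3)
print("interpolated:    ", interpolated)
print("inter-quartile range (halves):", q3 - q1)
```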


Mode

The most frequently occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.

It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.


Mode: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Modes are 24 (10%) & 27 (10%)
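A short sketch of finding every mode together with its share of the observations; the variable names are illustrative.

```python
from collections import Counter

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

counts = Counter(data)
highest = max(counts.values())                       # frequency of the mode(s)
modes = sorted(value for value, c in counts.items() if c == highest)
share = 100 * highest / len(data)                    # percentage of occurrences

print(f"modes: {modes}, each with {share:.0f}% of the occurrences")
```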


Mean or Median?

Which measure of central tendency should we use?

If the distribution is normal, the mean and sd should be presented; otherwise the median and IQR are more appropriate.

Normal distribution: use Mean & SD

Not Normal distribution: use Median & IQR

Presentation

Qualitative & Quantitative Data

Charts & Tables


Presentation

Qualitative Data


Graphing Categorical Data: Univariate Data

Categorical Data
• Tabulating Data: The Summary Table
• Graphing Data: Pie Charts, Bar Charts, Pareto Diagram

[Figures: example bar chart and Pareto diagram for the categories Stocks, Bonds, Savings and CD]

Bar Chart

[Figure: bar chart of Percent (0 to 80) by Type of work (Housewife, Office work, Field work); bar labels 20, 11 and 69]

Pie Chart

[Figure: pie chart with slices Malay, Chinese and Others]

Tabulating and Graphing Bivariate Categorical Data

Contingency tables:

Table 1: Contingency table of pregnancy induced hypertension and SGA

Count
                                        SGA
                                        Normal   SGA   Total
Pregnancy induced hypertension   No     103      94    197
                                 Yes    5        16    21
Total                                   108      110   218
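The margins of such a table, and the percentage of SGA within each row, can be derived from the four cell counts. A sketch using the counts of Table 1; the row percentages are an added illustration and are not shown on the slide.

```python
# Cell counts from Table 1: rows = pregnancy induced hypertension, columns = outcome.
table = {
    "No": {"Normal": 103, "SGA": 94},
    "Yes": {"Normal": 5, "SGA": 16},
}

row_totals = {row: sum(cells.values()) for row, cells in table.items()}
column_totals = {
    column: sum(cells[column] for cells in table.values())
    for column in ("Normal", "SGA")
}
grand_total = sum(row_totals.values())

# Percentage of SGA babies within each hypertension group.
sga_percent = {row: 100 * table[row]["SGA"] / row_totals[row] for row in table}

for row in table:
    print(f"PIH {row:<3}: {table[row]} total = {row_totals[row]}, "
          f"SGA = {sga_percent[row]:.1f}%")
print("Column totals:", column_totals, "grand total:", grand_total)
```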


Tabulating and Graphing Bivariate Categorical Data

Side by side charts

[Figure: side by side bar chart of Count (0 to 120) by Pregnancy induced hypertension (No, Yes), with separate bars for Normal and SGA; bar labels 103, 94 and 16]

Presentation

Quantitative Data


Tabulating and Graphing Numerical Data

Numerical Data
• Ordered Array: raw data 41, 24, 32, 26, 27, 27, 30, 24, 38, 21 become 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Stem and Leaf Display:
    2 | 144677
    3 | 028
    4 | 1
• Frequency Distributions and Cumulative Distributions: Tables, Histograms, Polygons, Ogive

[Figures: example histogram (frequency 0 to 7) and ogive (cumulative percentage 0 to 120), both over the values 10 to 60]

Tabulating Numerical Data: Frequency Distributions

Sort raw data in ascending order:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find range: 58 - 12 = 46

Select number of classes: 5 (usually between 5 and 15)

Compute class interval (width): 10 (46/5 then round up)

Determine class boundaries (limits): 10, 20, 30, 40, 50, 60

Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95

Count observations & assign to classes
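The steps above can be sketched as follows. The choice of 5 classes comes from the slide; starting the first class at the minimum rounded down to a multiple of the class width is an assumption that happens to reproduce the boundaries 10, 20, ..., 60.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n_classes = 5
data_range = max(data) - min(data)           # 58 - 12 = 46
width = math.ceil(data_range / n_classes)    # 46/5 = 9.2, rounded up to 10
start = (min(data) // width) * width         # assumed lower boundary: 10

boundaries = [start + i * width for i in range(n_classes + 1)]

# Count observations and assign them to classes.
counts = [0] * n_classes
for x in data:
    counts[(x - start) // width] += 1

percents = [100 * c / len(data) for c in counts]

for i, (c, p) in enumerate(zip(counts, percents)):
    print(f"{boundaries[i]:>3} to under {boundaries[i + 1]:<3} {c:>3} {p:>5.0f}%")
```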


Frequency Distributions and Percentage Distributions

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class Midpoint Freq %

10.0 - 19.9 14.95 3 15%

20.0 - 29.9 24.95 6 30%

30.0 - 39.9 34.95 5 25%

40.0 - 49.9 44.95 4 20%

50.0 - 59.9 54.95 2 10%

TOTAL   20 100%


Graphing Numerical Data: The Histogram

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: histogram of Age; Frequency (0 to 7) against class midpoints 14.95, 24.95, 34.95, 44.95, 54.95, with bar heights 3, 6, 5, 4, 2. Bars span the class boundaries, with no gaps between bars]

Graphing Numerical Data: The Frequency Polygon

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: frequency polygon; frequency (0 to 7) plotted at the class midpoints 14.95, 24.95, 34.95, 44.95, 54.95]

Linear Regression Line


Survival Function

[Figure: survival function; Cum Survival (0.0 to 1.2) against DURATION (0 to 7), with censored observations marked]

Principles of Graphical Excellence

Presents data in a way that provides substance, statistics and design

Communicates complex ideas with clarity, precision and efficiency

Gives the largest number of ideas in the most efficient manner

Almost always involves several dimensions

Tells the truth about the data