statistic i: data collection & handling

89
©[email protected] 2013 Collect, Explore & Summarise Dr Azmi Mohd Tamil Dept of Community Health Universiti Kebangsaan Malaysia FK6163

Upload: sppippukm

Post on 20-Jun-2015

323 views

Category:

Presentations & Public Speaking


1 download

DESCRIPTION

Statistic I: Data Collection & Handling

TRANSCRIPT

Page 1: Statistic I: Data Collection & Handling

©[email protected] 2013

Collect, Explore & Summarise

Dr Azmi Mohd TamilDept of Community Health

Universiti Kebangsaan Malaysia

FK6163

Page 2: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Collection

Data collection begins after deciding on design of study and the sampling strategy

Page 3: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Collection

Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.

Page 4: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Collection

Information is collected on certain characteristics, attributes and the qualities of interest from the samples

These data may be quantitative or qualitative in nature.

Page 5: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Collection Techniques

Use available information Observation Interviews Questionnaires Focus group discussion

Page 6: Statistic I: Data Collection & Handling

©[email protected] 2013

Using Available Information

Existing Records• Hospital records - case notes• National registry of births & deaths• Census data• Data from other surveys

Page 7: Statistic I: Data Collection & Handling

©[email protected] 2013

Disadvantages of using existing records

Incomplete records Cause of death may not be verified by a

physician/MD Missing vital information Difficult to decipher May not be representative of the target

group - only severe cases go to hospital

Page 8: Statistic I: Data Collection & Handling

©[email protected] 2013

Page 9: Statistic I: Data Collection & Handling

©[email protected] 2013

Disadvantages of using existing records

Delayed publication - obsolete data Different method of data recording

between institutions, states, countries, making comparison & pooling of data incompatible

Comparisons across time difficult due to difference in classification, diagnostic tools etc

Page 10: Statistic I: Data Collection & Handling

©[email protected] 2013

Advantages of using existing records

Cheap convenient in some situations, it is the only data

source i.e. accidents & suicides

Page 11: Statistic I: Data Collection & Handling

©[email protected] 2013

Observation

Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena

Done using defined scales Participant observation e.g. PEF and

asthma symptom diary Non-participant observation e.g.

cholesterol levels

Page 12: Statistic I: Data Collection & Handling

©[email protected] 2013

Interviews

Oral questioning of respondents either individually or as a group.

Can be done loosely or highly structured using a questionnaire

Page 13: Statistic I: Data Collection & Handling

©[email protected] 2013

Administering Written Questionnaires

Self-administered via mail by gathering them in one place and

getting them to fill it up hand-delivering and collecting them later Large non-response can distort results

Page 14: Statistic I: Data Collection & Handling

©[email protected] 2013

Questionnaires

Influenced by education & attitude of respondent esp. for self-administered

Interviewers need to be trained open ended vs close ended the need for pre-testing or pilot study

Page 15: Statistic I: Data Collection & Handling

©[email protected] 2013

Focus group discussion

Selecting relevant parties to the research questions at hand and discussing with them in focus groups

examples in your own field of interest?

Page 16: Statistic I: Data Collection & Handling

©[email protected] 2013

Plan for data collection

Permission to proceed Logistics - who will collect what, when

and with what resources Quality control

Page 17: Statistic I: Data Collection & Handling

©[email protected] 2013

Accuracy & Reliability

Accuracy - the degree which a measurement actually measures the measures the characteristic it is supposed to measure

Reliability is the consistency of replicate measures

Page 18: Statistic I: Data Collection & Handling

©[email protected] 2013

Reliability & Accuracy

Page 19: Statistic I: Data Collection & Handling

©[email protected] 2013

Accuracy & Reliability

Both are reduced by random error and systematic error from the same sources of variability;• the data collectors• the respondents• the instrument

Page 20: Statistic I: Data Collection & Handling

©[email protected] 2013

Strategies to enhance precision & accuracy

Standardise procedures and measurement methods

training & certifying the data collectors Repetition Blinding

Page 21: Statistic I: Data Collection & Handling

©[email protected] 2013

Introduction

Method of Exploring and Summarising Data differs

According to Types of Variables

Page 22: Statistic I: Data Collection & Handling

©[email protected] 2013

Dependent/Independent

Frequency of Exercise

Obesity

Food Intake

Independent Variables

Dependent Variable

Page 23: Statistic I: Data Collection & Handling

©[email protected] 2013

Page 24: Statistic I: Data Collection & Handling

©[email protected] 2013

Explore

It is the first step in the analytic process to explore the characteristics of the data to screen for errors and correct them to look for distribution patterns - normal

distribution or not May require transformation before further

analysis using parametric methods Or may need analysis using non-parametric

techniques

Page 25: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Screening

By running frequencies, we may detect inappropriate responses

How many in the audience have 15 children and currently pregnant with the 16th?

PARITY

67 30.7

44 20.2

36 16.5

22 10.1

21 9.6

8 3.7

3 1.4

7 3.2

5 2.3

3 1.4

1 .5

1 .5

218 100.0

1

2

3

4

5

6

7

8

9

10

11

15

Total

ValidFrequency Percent

Page 26: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Screening

See whether the data make sense or not.

E.g. Parity 10 but age only 25.

Page 27: Statistic I: Data Collection & Handling

©[email protected] 2013

Page 28: Statistic I: Data Collection & Handling

©[email protected] 2013

Page 29: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Screening

By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data

Descriptive Statistics

184 32 484 53.05 33.37

184

Pre-pregnancy weight

Valid N (listwise)

N Minimum Maximum MeanStd.

Deviation

Page 30: Statistic I: Data Collection & Handling

©[email protected] 2013

Interpreting the Box Plot

Outlier

Outlier

Upper quartile

Smallest non-outlier

Median

Lower quartile

Largest non-outlier The whiskers extend to 1.5 times the box width from both ends of the box and ends at an observed value. Three times the box width marks the boundary between "mild" and "extreme" outliers.

"mild" = closed dots "extreme"= open dots

Page 31: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Screening

We can also make use of graphical tools such as the box plot to detect wrong data entry 184N =

Pre-pregnancy weight

600

500

400

300

200

100

0

141198211181

73

Page 32: Statistic I: Data Collection & Handling

©[email protected] 2013

Data Cleaning

Identify the extreme/wrong values Check with original data source – i.e.

questionnaire If incorrect, do the necessary correction. Correction must be done before

transformation, recoding and analysis.

Page 33: Statistic I: Data Collection & Handling

©[email protected] 2013

Parameters of Data Distribution

Mean – central value of data Standard deviation – measure of how

the data scatter around the mean Symmetry (skewness) – the degree of

the data pile up on one side of the mean Kurtosis – how far data scatter from the

mean

Page 34: Statistic I: Data Collection & Handling

©[email protected] 2013

Normal distribution

The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.

The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.

However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.

Page 35: Statistic I: Data Collection & Handling

©[email protected] 2013

Normal distribution

If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it includes about 68.3% of the observations;

a range of two standard deviations above and two below (+ 2sd) about 95.4% of the observations; and

of three standard deviations above and three below (+ 3sd) about 99.7% of the observations

68.3%

95.4%

99.7%

Page 36: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality

Why bother with normality?? Because it dictates the type of analysis

that you can run on the data

Page 37: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality-Why?Parametric

Page 38: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality-Why?Non-parametric

Page 39: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality-How?

Explored graphically• Histogram• Stem & Leaf• Box plot• Normal probability

plot• Detrended normal

plot

Explored statistically• Kolmogorov-Smirnov

statistic, with Lilliefors significance level and the Shapiro-Wilks statistic

• Skew ness (0)• Kurtosis (0)

– + leptokurtic– 0 mesokurtik– - platykurtic

Page 40: Statistic I: Data Collection & Handling

©[email protected] 2013

Kolmogorov- Smirnov

In the 1930’s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came out with the approach for comparison of distributions that did not make use of parameters.

This is known as the Kolmogorov-Smirnov test.

Page 41: Statistic I: Data Collection & Handling

©[email protected] 2013

Skew ness

Skewed to the right indicates the presence of large extreme values

Skewed to the left indicates the presence of small extreme values

Page 42: Statistic I: Data Collection & Handling

©[email protected] 2013

Kurtosis

For symmetrical distribution only.

Describes the shape of the curve

Mesokurtic - average shaped

Leptokurtic - narrow & slim

Platikurtic - flat & wide

Page 43: Statistic I: Data Collection & Handling

©[email protected] 2013

Skew ness & Kurtosis

Skew ness ranges from -3 to 3. Acceptable range for normality is skew ness

lying between -1 to 1. Normality should not be based on skew ness

alone; the kurtosis measures the “peak ness” of the bell-curve (see Fig. 4).

Likewise, acceptable range for normality is kurtosis lying between -1 to 1.

Page 44: Statistic I: Data Collection & Handling

©[email protected] 2013

Page 45: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality - ExamplesGraphically

Height

167.5

165.0

162.5

160.0

157.5

155.0

152.5

150.0

147.5

145.0

142.5

140.0

60

50

40

30

20

10

0

Std. Dev = 5.26

Mean = 151.6

N = 218.00

Page 46: Statistic I: Data Collection & Handling

©[email protected] 2013

Q&Q Plot

This plot compares the quintiles of a data distribution with the quintiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).

If the distributional shapes differ, then the points will plot along a curve instead of a line.

Take note that the interest here is the central portion of the line, severe deviations means non-normality. Deviations at the “ends” of the curve signifies the existence of outliers.

Page 47: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality - ExamplesGraphically

Detrended Normal Q-Q Plot of Height

Observed Value

170160150140130

De

v fr

om

No

rma

l

.6

.5

.4

.3

.2

.1

0.0

-.1

-.2

Normal Q-Q Plot of Height

Observed Value

170160150140130

Exp

ect

ed

No

rma

l

3

2

1

0

-1

-2

-3

Page 48: Statistic I: Data Collection & Handling

©[email protected] 2013

Normal distributionMean=median=mode

Page 49: Statistic I: Data Collection & Handling

©[email protected] 2013

Tests of Normality

.060 218 .052HeightStatistic df Sig.

Kolmogorov-Smirnova

Lilliefors Significance Correctiona.

Descriptives

151.65 .356

150.94

152.35

151.59

151.50

27.649

5.258

139

168

29

8.00

.148 .165

.061 .328

Mean

Lower Bound

Upper Bound

95% ConfidenceInterval for Mean

5% Trimmed Mean

Median

Variance

Std. Deviation

Minimum

Maximum

Range

Interquartile Range

Skewness

Kurtosis

HeightStatistic Std. Error

Normality - ExamplesStatistically

Shapiro-Wilks; only if sample size less than 100.

Normal distributionMean=median=mode

Skewness & kurtosis within +1

p > 0.05, so normal distribution

Page 50: Statistic I: Data Collection & Handling

©[email protected] 2013

K-S Test

Page 51: Statistic I: Data Collection & Handling

©[email protected] 2013

K-S Test

very sensitive to the sample sizes of the data.

For small samples (n<20, say), the likelihood of getting p<0.05 is low

for large samples (n>100), a slight deviation from normality will result in being reported as abnormal distribution

Page 52: Statistic I: Data Collection & Handling

©[email protected] 2013

Guide to deciding on normality

Page 53: Statistic I: Data Collection & Handling

©[email protected] 2013

Normality Transformation

Normal Q-Q Plot of LN_PARIT

Observed Value

3.02.52.01.51.0.50.0-.5

Exp

ect

ed

No

rma

l

3

2

1

0

-1

-2

Normal Q-Q Plot of LN_PARIT

Observed Value

3.02.52.01.51.0.50.0-.5

Exp

ect

ed

No

rma

l

3

2

1

0

-1

-2

Normal Q-Q Plot of PARITY

Observed Value

1614121086420

Exp

ect

ed

No

rma

l

3

2

1

0

-1

-2

Normal Q-Q Plot of PARITY

Observed Value

1614121086420

Exp

ect

ed

No

rma

l

3

2

1

0

-1

-2

Page 54: Statistic I: Data Collection & Handling

©[email protected] 2013

Square root Logarithm Inverse

Reflect and square root

Reflect and logarithm

Reflect and inverse

TYPES OF TRANSFORMATIONS

Page 55: Statistic I: Data Collection & Handling

©[email protected] 2013

Summarise

Summarise a large set of data by a few meaningful numbers.

Single variable analysis• For the purpose of describing the data• Example; in one year, what kind of cases are

treated by the Psychiatric Dept?• Tables & diagrams are usually used to describe

the data• For numerical data, measures of central tendency

& spread is usually used

Page 56: Statistic I: Data Collection & Handling

©[email protected] 2013

Frequency Table

Race F %Malay 760 95.84%

Chinese 5 0.63%Indian 0 0.00%Others 28 3.53%TOTAL 793 100.00%

•Illustrates the frequency observed for each category

Page 57: Statistic I: Data Collection & Handling

©[email protected] 2013

Frequency Distribution Table

Umur Bil %0-0.99 25 3.26%1-4.99 78 10.18%5-14.99 140 18.28%15-24.99 126 16.45%25-34.99 112 14.62%35-44.99 90 11.75%45-54.99 66 8.62%55-64.99 60 7.83%65-74.99 50 6.53%75-84.99 16 2.09%85+ 3 0.39%JUMLAH 766

• > 20 observations, best presented as a frequency distribution table.

•Columns divided into class & frequency.

•Mod class can be determined using such tables.

Page 58: Statistic I: Data Collection & Handling

©[email protected] 2013

Measurement of Central Tendency & Spread

Page 59: Statistic I: Data Collection & Handling

©[email protected] 2013

Measures of Central Tendency

MeanModeMedian

Page 60: Statistic I: Data Collection & Handling

©[email protected] 2013

Measures of Variability

Standard deviation Inter-quartilesSkew ness & kurtosis

Page 61: Statistic I: Data Collection & Handling

©[email protected] 2013

Mean

the average of the data collected To calculate the mean, add up the

observed values and divide by the number of them.

A major disadvantage of the mean is that it is sensitive to outlying points

Page 62: Statistic I: Data Collection & Handling

©[email protected] 2013

Mean: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Total of x = 648 n= 20 Mean = 648/20 = 32.4

Page 63: Statistic I: Data Collection & Handling

©[email protected] 2013

Measures of variation - standard deviation

tells us how much all the scores in a dataset cluster around the mean. A large S.D. is indicative of a more varied data scores.

a summary measure of the differences of each observation from the mean.

If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.

Consequently the squares of the differences are added.

Page 64: Statistic I: Data Collection & Handling

©[email protected] 2013

Page 65: Statistic I: Data Collection & Handling

©[email protected] 2013

sd: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Mean = 32.4; n = 20 Total of (x-mean)2

= 3050.8 Variance = 3050.8/19

= 160.5684 sd = 160.56840.5=12.67

x (x-mean)^2 x (x-mean)^2

12 416.16 32 0.16

13 376.36 35 6.76

17 237.16 37 21.16

21 129.96 38 31.36

24 70.56 41 73.96

24 70.56 43 112.36

26 40.96 44 134.56

27 29.16 46 184.96

27 29.16 53 424.36

30 5.76 58 655.36

TOTAL 1405.8 TOTAL 1645

Page 66: Statistic I: Data Collection & Handling

©[email protected] 2013

Median

the ranked value that lies in the middle of the data

the point which has the property that half the data are greater than it, and half the data are less than it.

if n is even, average the n/2th largest and the n/2 + 1th largest observations

"robust" to outliers

Page 67: Statistic I: Data Collection & Handling

©[email protected] 2013

Median:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

(20+1)/2 = 10th which is 30, 11th is 32 Therefore median is (30 + 32)/2 = 31

Page 68: Statistic I: Data Collection & Handling

©[email protected] 2013

Measures of variation - quartiles

The range is very susceptible to what are known as outliers

A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.

Page 69: Statistic I: Data Collection & Handling

©[email protected] 2013

Quartiles

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

25th percentile 24; (24+24)/2 50th percentile 31; (30+32)/2 ; = median 75th percentile 42.5; (41+43)/2

Page 70: Statistic I: Data Collection & Handling

©[email protected] 2013

Mode

The most frequent occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.

It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.

Page 71: Statistic I: Data Collection & Handling

©[email protected] 2013

Mode: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Modes are 24 (10%) & 27 (10%)

Page 72: Statistic I: Data Collection & Handling

©[email protected] 2013

Mean or Median?

Which measure of central tendency should we use?

if the distribution is normal, the mean+sd will be the measure to be presented, otherwise the median+IQR should be more appropriate.

Page 73: Statistic I: Data Collection & Handling

©[email protected] 2013

Normal distribution;Use Mean+SD

Not Normal distribution;Use Median & IQR

Page 74: Statistic I: Data Collection & Handling

©[email protected] 2013

Presentation

Qualitative & Quantitative Data

Charts & Tables

Page 75: Statistic I: Data Collection & Handling

©[email protected] 2013

Presentation

Qualitative Data

Page 76: Statistic I: Data Collection & Handling

©[email protected] 2013

Graphing Categorical Data: Univariate Data

Categorical Data

Tabulating Data

The Summary Table

0 1 0 2 0 3 0 4 0 5 0

S to c k s

B o n d s

S a vin g s

C D

Graphing Data

Pie Charts

Pareto DiagramBar Charts

0

5

1 0

1 5

2 0

2 5

3 0

3 5

4 0

4 5

S to c k s B o n d s S a vin g s C D

0

2 0

4 0

6 0

8 0

1 0 0

1 2 0

Page 77: Statistic I: Data Collection & Handling

©[email protected] 2013

Bar Chart

Type of work

Field w orkOffice w orkHousew ife

Pe

rce

nt

80

60

40

20

0

20

11

69

Page 78: Statistic I: Data Collection & Handling

©[email protected] 2013

Pie Chart

Others

Chinese

Malay

Page 79: Statistic I: Data Collection & Handling

©[email protected] 2013

Tabulating and Graphing Bivariate

Categorical Data Contingency tables:

Table 1: Contigency table of pregnancy induced hypertension andSGA

Count

103 94 197

5 16 21

108 110 218

No

Yes

Pregnancy inducedhypertension

Total

Normal SGA

SGA

Total

Page 80: Statistic I: Data Collection & Handling

©[email protected] 2013

Tabulating and Graphing Bivariate

Categorical Data

Pregnancy induced hypertension

YesNo

Co

un

t

120

100

80

60

40

20

0

SGA

Normal

SGA

16

94

103

Side by side charts

Page 81: Statistic I: Data Collection & Handling

©[email protected] 2013

Presentation

Quantitative Data

Page 82: Statistic I: Data Collection & Handling

©[email protected] 2013

Ogive

0

20

40

60

80

100

120

10 20 30 40 50 60

Tabulating and Graphing Numerical

Data

0

1

2

3

4

5

6

7

10 20 30 40 50 60

Numerical Data

Ordered Array

Stem and LeafDisplay

Histograms Area

Tables

2 144677

3 028

4 1

41, 24, 32, 26, 27, 27, 30, 24, 38, 21

21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Frequency DistributionsCumulative Distributions

Polygons

Page 83: Statistic I: Data Collection & Handling

©[email protected] 2013

Tabulating Numerical Data: Frequency

Distributions Sort raw data in ascending order:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find range: 58 - 12 = 46

Select number of classes: 5 (usually between 5 and 15)

Compute class interval (width): 10 (46/5 then round up)

Determine class boundaries (limits): 10, 20, 30, 40, 50, 60

Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95

Count observations & assign to classes

Page 84: Statistic I: Data Collection & Handling

©[email protected] 2013

Frequency Distributions and Percentage Distributions

Data in ordered array:12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class Midpoint Freq %

10.0 - 19.9 14.95 3 15%

20.0 - 29.9 24.95 6 30%

30.0 - 39.9 34.95 5 25%

40.0 - 49.9 44.95 4 20%

50.0 - 59.9 54.95 2 10%

TOTAL   20 100%

Page 85: Statistic I: Data Collection & Handling

©[email protected] 2013

3

6

5

4

2

0

1

2

3

4

5

6

7

14.95 24.95 34.95 44.95 54.95

Age

Fre

qu

en

cy

Graphing Numerical Data:

The HistogramData in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class MidpointsClass Boundaries

No Gaps Between

Bars

Page 86: Statistic I: Data Collection & Handling

©[email protected] 2013

Graphing Numerical Data:

The Frequency Polygon

Class Midpoints

Data in ordered array:12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

0

1

2

3

4

5

6

7

14.95 24.95 34.95 44.95 54.95

Page 87: Statistic I: Data Collection & Handling

©[email protected] 2013

Linear Regression Line

Page 88: Statistic I: Data Collection & Handling

©[email protected] 2013

Survival Function

DURATION

76543210

Cu

m S

urv

iva

l

1.2

1.0

.8

.6

.4

.2

0.0

Survival Function

Censored

Page 89: Statistic I: Data Collection & Handling

©[email protected] 2013

Principles of Graphical Excellence

Presents data in a way that provides substance, statistics and design

Communicates complex ideas with clarity, precision and efficiency

Gives the largest number of ideas in the most efficient manner

Almost always involves several dimensions Tells the truth about the data