exploratory data analysis - checking for normality
Post on 07-May-2015
©drtamil@gmail.com 2012
Explore & Summarise
Dr Azmi Mohd Tamil
Dept of Community Health
Universiti Kebangsaan Malaysia
FK6163
Introduction
The method of exploring and summarising data differs according to the type of variable.
Dependent/Independent
Independent variables: Frequency of Exercise, Food Intake
Dependent variable: Obesity
Explore
• It is the first step in the analytic process
• to explore the characteristics of the data
• to screen for errors and correct them
• to look for distribution patterns (normal distribution or not)
• The data may require transformation before further analysis using parametric methods
• Or they may need analysis using non-parametric techniques
Data Screening
• By running frequencies, we may detect inappropriate responses.
• How many in the audience have 15 children and are currently pregnant with the 16th?
PARITY
Valid    Frequency   Percent
1        67          30.7
2        44          20.2
3        36          16.5
4        22          10.1
5        21          9.6
6        8           3.7
7        3           1.4
8        7           3.2
9        5           2.3
10       3           1.4
11       1           .5
15       1           .5
Total    218         100.0
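The frequency screening described above can be sketched in a few lines of Python. The parity values are rebuilt from the counts in the PARITY table; the function names and the cut-off of 12 are hypothetical choices for illustration, not part of the original analysis.

```python
from collections import Counter

# Parity values rebuilt from the PARITY frequency table (value: count).
PARITY_COUNTS = {1: 67, 2: 44, 3: 36, 4: 22, 5: 21, 6: 8,
                 7: 3, 8: 7, 9: 5, 10: 3, 11: 1, 15: 1}
parity = [value for value, count in PARITY_COUNTS.items() for _ in range(count)]


def frequency_table(values):
    """Return (value, frequency, percent) rows sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, c, round(100 * c / n, 1)) for v, c in sorted(counts.items())]


def flag_implausible(values, upper_limit):
    """Return the distinct values above an analyst-chosen plausibility limit."""
    return sorted({v for v in values if v > upper_limit})


table = frequency_table(parity)
# 12 is a hypothetical cut-off; the analyst decides what counts as implausible.
suspect = flag_implausible(parity, upper_limit=12)

for value, freq, pct in table:
    print(f"{value:>5} {freq:>5} {pct:>6}")
print("Check against the questionnaire:", suspect)
```

Flagged values are candidates for checking against the source, not automatic errors.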
Data Screening
• See whether the data make sense or not.
• E.g. parity 10 but age only 25.
Data Screening
• By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data.
Descriptive Statistics
                       N     Minimum   Maximum   Mean    Std. Deviation
Pre-pregnancy weight   184   32        484       53.05   33.37
Valid N (listwise)     184
Interpreting the Box Plot
Labels on the plot, from top to bottom: outlier, largest non-outlier, upper quartile, median, lower quartile, smallest non-outlier, outlier.
• Each whisker extends up to 1.5 times the box width (the interquartile range) from the end of the box and ends at an observed value.
• Three times the box width marks the boundary between "mild" and "extreme" outliers.
• "Mild" outliers are shown as closed dots, "extreme" outliers as open dots.
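The box-plot rule can be sketched as follows. The weights are hypothetical values (484 mimics a data entry error like the maximum seen in the descriptive statistics), and the quartile method is the default of Python's `statistics.quantiles`, which may differ slightly from what a given statistics package uses.

```python
import statistics


def classify_outliers(values):
    """Split values into mild and extreme outliers using the box-plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    box_width = q3 - q1                            # interquartile range
    inner = (q1 - 1.5 * box_width, q3 + 1.5 * box_width)
    outer = (q1 - 3.0 * box_width, q3 + 3.0 * box_width)
    mild, extreme = [], []
    for v in values:
        if v < outer[0] or v > outer[1]:
            extreme.append(v)   # beyond 3 box widths from the box
        elif v < inner[0] or v > inner[1]:
            mild.append(v)      # between 1.5 and 3 box widths from the box
    return mild, extreme


# Hypothetical pre-pregnancy weights in kg, for illustration only.
weights = [45, 48, 50, 51, 52, 53, 54, 55, 56, 58, 60, 62, 80, 484]
mild, extreme = classify_outliers(weights)
print("mild:", mild, "extreme:", extreme)
```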
Data Screening
• We can also make use of graphical tools such as the box plot to detect wrong data entry.
[Figure: box plot of pre-pregnancy weight, N = 184; vertical axis from 0 to 600; several outlying cases labelled by case number]
Data Cleaning
• Identify the extreme/wrong values.
• Check with the original data source, i.e. the questionnaire.
• If incorrect, do the necessary correction.
• Correction must be done before transformation, recoding and analysis.
Parameters of Data Distribution
• Mean: the central value of the data
• Standard deviation: a measure of how the data scatter around the mean
• Symmetry (skewness): the degree to which the data pile up on one side of the mean
• Kurtosis: how peaked or flat the distribution is, which reflects how much of the data lies far from the mean
Normal distribution
• The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.
• The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.
• However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.
Normal distribution
• If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it (± 1 SD) includes about 68.3% of the observations;
• a range of two standard deviations above and two below (± 2 SD) includes about 95.4% of the observations; and
• a range of three standard deviations above and three below (± 3 SD) includes about 99.7% of the observations.
[Figure: Normal curve with the 68.3%, 95.4% and 99.7% areas marked]
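These percentages can be checked numerically. The sketch below uses the standard identity that the proportion of a Normal distribution within k standard deviations of the mean equals erf(k/√2); the function name is a hypothetical choice.

```python
import math


def coverage_within(k):
    """Proportion of a Normal distribution lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))


# Percentages for 1, 2 and 3 SD, rounded to one decimal place.
coverage = {k: round(100 * coverage_within(k), 1) for k in (1, 2, 3)}
print(coverage)  # {1: 68.3, 2: 95.4, 3: 99.7}
```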
Normality
• Why bother with normality?
• Because it dictates the type of analysis that you can run on the data.
Normality - Why? Parametric
Each entry gives the two variables being related, the condition, and the test.
• Qualitative (dichotomous) and quantitative; normally distributed data: Student's t test
• Qualitative (polytomous) and quantitative; normally distributed data: ANOVA
• Quantitative and quantitative; repeated measurement of the same individual & item (e.g. Hb level before & after treatment), normally distributed data: paired t test
• Quantitative (continuous) and quantitative (continuous); normally distributed data: Pearson correlation & linear regression
Normality - Why? Non-parametric
Each entry gives the two variables being related, the condition, and the test.
• Qualitative (dichotomous) and quantitative; data not normally distributed: Wilcoxon rank-sum test or Mann-Whitney U test
• Qualitative (polytomous) and quantitative; data not normally distributed: Kruskal-Wallis one-way ANOVA test
• Quantitative and quantitative; repeated measurement of the same individual & item: Wilcoxon signed-rank test
• Quantitative (continuous/ordinal) and quantitative (continuous); data not normally distributed: Spearman/Kendall rank correlation
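The parametric and non-parametric slides pair up test by test, which can be written as a simple lookup. This is only a sketch of that pairing; the design labels and the function name are hypothetical, and a real choice of test depends on more than normality alone.

```python
# Each design maps to (parametric test, non-parametric counterpart),
# following the two "Normality - Why?" slides.
TEST_GUIDE = {
    "two groups": ("Student's t test",
                   "Mann-Whitney U (Wilcoxon rank-sum) test"),
    "more than two groups": ("ANOVA", "Kruskal-Wallis test"),
    "paired measurements": ("Paired t test", "Wilcoxon signed-rank test"),
    "two quantitative variables": ("Pearson correlation / linear regression",
                                   "Spearman or Kendall rank correlation"),
}


def suggest_test(design, normally_distributed):
    """Return the test listed on the slides for a design and normality status."""
    parametric, non_parametric = TEST_GUIDE[design]
    return parametric if normally_distributed else non_parametric


print(suggest_test("two groups", normally_distributed=True))
print(suggest_test("paired measurements", normally_distributed=False))
```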
Normality - How?
• Explored graphically
  – Histogram
  – Stem & leaf
  – Box plot
  – Normal probability plot
  – Detrended normal plot
• Explored statistically
  – Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
  – Skewness (0 for a normal distribution)
  – Kurtosis (0 for a normal distribution)
      positive: leptokurtic
      zero: mesokurtic
      negative: platykurtic
Kolmogorov-Smirnov
• In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came up with an approach for comparing distributions that did not make use of parameters.
• This is known as the Kolmogorov-Smirnov test.
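The idea behind the test is to measure the largest gap between the empirical cumulative distribution of the sample and the cumulative distribution of the reference curve. The sketch below computes that gap against a Normal curve fitted with the sample mean and SD, using the 20 values from the worked examples later in this deck. It does not produce a p-value: when the mean and SD are estimated from the sample, the significance level needs the Lilliefors correction, which is left to a statistics package. The function name is a hypothetical choice.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a fitted Normal CDF."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from (i-1)/n up to i/n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
d_statistic = ks_statistic_normal(data)
print(f"D = {d_statistic:.3f}")
```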
Skewness
• Skewed to the right indicates the presence of large extreme values
• Skewed to the left indicates the presence of small extreme values
Kurtosis
• For symmetrical distributions only.
• Describes the shape of the curve
• Mesokurtic: average shaped
• Leptokurtic: narrow & slim
• Platykurtic: flat & wide
Skewness & Kurtosis
• In practice, skewness values usually lie between -3 and 3.
• The acceptable range for normality is skewness lying between -1 and 1.
• Normality should not be based on skewness alone; the kurtosis measures the "peakedness" of the bell curve (see Fig. 4).
• Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.
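As a rough illustration of the -1 to 1 rule, the sketch below computes moment-based skewness and excess kurtosis for the 20 values used in the worked examples later in this deck. These are the plain moment formulas, an assumption made for simplicity; statistics packages usually apply a small-sample adjustment, so their figures will differ slightly. The function names are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean)^order over the sample."""
    m = sum(values) / len(values)
    return sum((x - m) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a symmetrical distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Normal curve scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
skew = skewness(data)
kurt = excess_kurtosis(data)
within_rule = abs(skew) <= 1 and abs(kurt) <= 1
print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}, within -1 to 1: {within_rule}")
```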
Normality - Examples: Graphically
[Figure: histogram of Height, horizontal axis from 140.0 to 167.5, vertical axis (frequency) from 0 to 60; Mean = 151.6, Std. Dev = 5.26, N = 218]
Q-Q Plot
• This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).
• If the distributional shapes differ, then the points will plot along a curve instead of a line.
• Take note that the interest here is the central portion of the line: severe deviations there mean non-normality. Deviations at the "ends" of the curve signify the existence of outliers.
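The coordinates of a normal Q-Q plot can be sketched as below, using the 20 values from the worked examples later in this deck. The plotting position (i - 0.5)/n is one common convention and is an assumption here; statistics packages may use a different one. The function name is hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair each ordered observation with its expected standard normal quantile."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()  # mean 0, SD 1
    # Plotting position (i - 0.5) / n is an assumed convention.
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(data, start=1)]


data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
points = normal_qq_points(data)
for observed, expected in points:
    print(f"{observed:>4} {expected:>7.2f}")
```

Plotting observed against expected values should give roughly a straight line if the data are close to normal.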
Normality - Examples: Graphically
[Figures: Normal Q-Q plot of Height (expected normal value, -3 to 3, against observed value, 130 to 170) and detrended Normal Q-Q plot of Height (deviation from normal, -.2 to .6, against observed value, 130 to 170)]
Normal distribution
Mean=median=mode
Normality - Examples: Statistically
Tests of Normality
          Kolmogorov-Smirnov(a)
          Statistic   df    Sig.
Height    .060        218   .052
a. Lilliefors Significance Correction
Descriptives: Height
                                       Statistic   Std. Error
Mean                                   151.65      .356
95% Confidence Interval for Mean
  Lower Bound                          150.94
  Upper Bound                          152.35
5% Trimmed Mean                        151.59
Median                                 151.50
Variance                               27.649
Std. Deviation                         5.258
Minimum                                139
Maximum                                168
Range                                  29
Interquartile Range                    8.00
Skewness                               .148        .165
Kurtosis                               .061        .328
Notes on the output:
• Shapiro-Wilk: only if the sample size is less than 100.
• Normal distribution: mean = median = mode
• Skewness & kurtosis within ± 1
• p > 0.05, so normal distribution
K-S Test
• The test is very sensitive to the sample size of the data.
• For small samples (n < 20, say), the likelihood of getting p < 0.05 is low.
• For large samples (n > 100), a slight deviation from normality will result in the distribution being reported as not normal.
Guide to deciding on normality
Normality Transformation
[Figures: Normal Q-Q plot of PARITY (expected normal value against observed value, 0 to 16) and Normal Q-Q plot of LN_PARIT (expected normal value against observed value, -.5 to 3.0)]
TYPES OF TRANSFORMATIONS
• Square root
• Logarithm
• Inverse
• Reflect and square root
• Reflect and logarithm
• Reflect and inverse
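A minimal sketch of these six transformations follows. Two details are assumptions rather than anything stated on the slide: reflection is done by subtracting each value from (largest value + 1), a common convention when data are skewed to the left, and the logarithm is the natural log. All names are hypothetical, and every value must be positive for the logarithm and inverse to work.

```python
import math

TRANSFORMS = {
    "square root": math.sqrt,
    "logarithm": math.log,           # natural log (assumed)
    "inverse": lambda v: 1 / v,
}


def reflect(values):
    """Reflect so the largest value becomes 1 (constant = max + 1, assumed)."""
    constant = max(values) + 1
    return [constant - v for v in values]


def transform(values, kind, reflected=False):
    """Apply one of the three basic transformations, optionally after reflection."""
    source = reflect(values) if reflected else list(values)
    return [TRANSFORMS[kind](v) for v in source]


sample = [1, 4, 9]  # small illustrative sample
print(transform(sample, "square root"))
print(transform(sample, "square root", reflected=True))
```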
Summarise
• Summarise a large set of data by a few meaningful numbers.
• Single variable analysis
  – For the purpose of describing the data
  – Example: in one year, what kind of cases are treated by the Psychiatric Dept?
  – Tables & diagrams are usually used to describe the data
  – For numerical data, measures of central tendency & spread are usually used
Frequency Table
Race      Frequency   %
Malay     760         95.84%
Chinese   5           0.63%
Indian    0           0.00%
Others    28          3.53%
TOTAL     793         100.00%
• Illustrates the frequency observed for each category
Frequency Distribution Table
Age        Count   %
0-0.99     25      3.26%
1-4.99     78      10.18%
5-14.99    140     18.28%
15-24.99   126     16.45%
25-34.99   112     14.62%
35-44.99   90      11.75%
45-54.99   66      8.62%
55-64.99   60      7.83%
65-74.99   50      6.53%
75-84.99   16      2.09%
85+        3       0.39%
TOTAL      766
• With more than 20 observations, data are best presented as a frequency distribution table.
• Columns are divided into class & frequency.
• The modal class can be determined using such tables.
Measurement of Central Tendency & Spread
Measures of Central Tendency
• Mean
• Mode
• Median
Measures of Variability
• Standard deviation
• Interquartile range
• Skewness & kurtosis
Mean
• The average of the data collected
• To calculate the mean, add up the observed values and divide by the number of them.
• A major disadvantage of the mean is that it is sensitive to outlying points.
Mean: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Total of x = 648
• n = 20
• Mean = 648/20 = 32.4
Measures of variation - standard deviation
• Tells us how closely the scores in a dataset cluster around the mean. A large SD indicates more varied scores.
• It is a summary measure of the differences of each observation from the mean.
• If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.
• Consequently the squares of the differences are added.
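Written out, the description above corresponds to the usual sample formulas. The divisor n - 1 is taken from the worked example in this deck, which divides the sum of squares for 20 values by 19; the symbols are the conventional ones and are an assumption, since the source's formula is not in the transcript.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}
```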
sd: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Mean = 32.4; n = 20
• Total of (x-mean)^2 = 3050.8
• Variance = 3050.8/19 = 160.5684
• sd = 160.5684^0.5 = 12.67
x       (x-mean)^2     x       (x-mean)^2
12      416.16         32      0.16
13      376.36         35      6.76
17      237.16         37      21.16
21      129.96         38      31.36
24      70.56          41      73.96
24      70.56          43      112.36
26      40.96          44      134.56
27      29.16          46      184.96
27      29.16          53      424.36
30      5.76           58      655.36
TOTAL   1405.8         TOTAL   1645
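The worked mean and standard deviation can be reproduced step by step. This is a plain sketch of the arithmetic in the example, with variable names chosen for illustration.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n = len(data)
mean = sum(data) / n                                 # 648 / 20
squared_devs = [(x - mean) ** 2 for x in data]       # (x - mean)^2 for each value
sum_of_squares = sum(squared_devs)                   # total of (x - mean)^2
variance = sum_of_squares / (n - 1)                  # divide by n - 1 = 19
sd = variance ** 0.5

print(f"mean = {mean:.1f}")
print(f"sum of squares = {sum_of_squares:.1f}")
print(f"variance = {variance:.4f}")
print(f"sd = {sd:.2f}")
```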
Median
• The ranked value that lies in the middle of the data
• The point which has the property that half the data are greater than it, and half the data are less than it
• If n is even, average the (n/2)th and (n/2 + 1)th ranked observations
• "Robust" to outliers
Median: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• (20+1)/2 = 10.5, so the median lies between the 10th value, which is 30, and the 11th, which is 32
• Therefore the median is (30 + 32)/2 = 31
Measures of variation - quartiles
• The range is very susceptible to what are known as outliers.
• A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.
Quartiles
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• 25th percentile: 24; position 0.25 × (20+1) = 5.25, between the 5th value (24) and the 6th value (24)
• 50th percentile: 31; (30+32)/2; = median
• 75th percentile: 42.5; position 0.75 × (20+1) = 15.75, so 41 + 0.75 × (43 - 41)
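The median and quartiles of the example can be checked with Python's standard library. `statistics.quantiles` with its default "exclusive" method places the quartiles at positions based on n + 1, which reproduces the figures on the slide; other quartile conventions exist and can give slightly different values.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

median = statistics.median(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # default method: 'exclusive'
iqr = q3 - q1                                  # interquartile range

print(f"median = {median}")
print(f"quartiles = {q1}, {q2}, {q3}")
print(f"IQR = {iqr}")
```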
Mode
• The most frequently occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.
• It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.
Mode: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Modes are 24 (10%) & 27 (10%)
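A short sketch of finding the mode, or modes, together with the share of observations each accounts for. The function name is a hypothetical choice.

```python
from collections import Counter


def modes_with_share(values):
    """Return {mode: percent of observations} for every most-frequent value."""
    counts = Counter(values)
    highest = max(counts.values())
    n = len(values)
    return {v: 100 * c / n for v, c in counts.items() if c == highest}


data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
modes = modes_with_share(data)
print(modes)  # {24: 10.0, 27: 10.0}
```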
Mean or Median?
• Which measure of central tendency should we use?
• If the distribution is normal, the mean and SD are the measures to present; otherwise the median and IQR are more appropriate.
Normal distribution: use mean & SD
Not normal distribution: use median & IQR
Presentation
Qualitative & Quantitative Data
Charts & Tables
Presentation
Qualitative Data
Graphing Categorical Data: Univariate Data
Categorical Data
• Tabulating data: the summary table
• Graphing data: pie charts, bar charts, Pareto diagram
[Figures: example bar chart, pie chart and Pareto diagram for the categories Stocks, Bonds, Savings and CD]
Bar Chart
[Figure: bar chart of Type of work against Percent (axis 0 to 80); category labels Field work, Office work, Housewife; bar values 20, 11, 69]
Pie Chart
[Figure: pie chart with the segments Malay, Chinese and Others]
Tabulating and Graphing Bivariate Categorical Data
• Contingency tables:
Table 1: Contingency table of pregnancy induced hypertension and SGA (count)
Pregnancy induced         SGA
hypertension        Normal   SGA   Total
No                  103      94    197
Yes                 5        16    21
Total               108      110   218
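A contingency table is just a count for each combination of two categories, plus the row and column totals. The sketch below rebuilds Table 1 from its four cell counts and derives the margins; the row percentages at the end are an extra step added for illustration, and all names are hypothetical.

```python
# Cell counts from Table 1: (pregnancy induced hypertension, SGA status).
cells = {
    ("No", "Normal"): 103, ("No", "SGA"): 94,
    ("Yes", "Normal"): 5, ("Yes", "SGA"): 16,
}

row_totals, col_totals = {}, {}
for (pih, sga), count in cells.items():
    row_totals[pih] = row_totals.get(pih, 0) + count
    col_totals[sga] = col_totals.get(sga, 0) + count
grand_total = sum(cells.values())

# Percentage of SGA babies within each hypertension group (row percentages).
sga_percent = {pih: round(100 * cells[(pih, "SGA")] / row_totals[pih], 1)
               for pih in row_totals}

print(row_totals, col_totals, grand_total)
print(sga_percent)
```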
Tabulating and Graphing Bivariate Categorical Data
• Side by side charts
[Figure: side-by-side bar chart of Count (axis 0 to 120) by pregnancy induced hypertension (No, Yes), with separate bars for Normal and SGA; bar labels 103, 94 and 16]
Presentation
Quantitative Data
Tabulating and Graphing Numerical Data
Numerical data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
• Ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Stem and leaf display:
    2 | 144677
    3 | 028
    4 | 1
• Frequency distributions and cumulative distributions: tables, histograms, polygons, ogive
[Figures: example histogram and ogive, horizontal axes from 10 to 60]
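The ordered array and the stem and leaf display for the ten values on this slide can be produced with a few lines of Python. The sketch assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf; the function name is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines with the tens digit as stem and units as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    return [f"{stem} | {''.join(str(leaf) for leaf in row)}"
            for stem, row in sorted(leaves.items())]


raw = [41, 24, 32, 26, 27, 27, 30, 24, 38, 21]
ordered = sorted(raw)
display = stem_and_leaf(raw)
print(ordered)
print("\n".join(display))
```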
Tabulating Numerical Data: Frequency Distributions
• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute class interval (width): 10 (46/5, then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
• Count observations & assign to classes
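The steps above can be sketched as a small routine. The classes are taken to run from 10.0 to 19.9, 20.0 to 29.9 and so on, so each midpoint is the average of those two printed limits; the precision of 0.1 and all names are assumptions made to match the slides.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n_classes = 5
data_range = max(data) - min(data)             # 58 - 12 = 46
width = math.ceil(data_range / n_classes)      # 46 / 5 = 9.2, rounded up to 10
start = 10                                     # first class limit, as on the slide
precision = 0.1                                # classes are printed as 10.0 - 19.9

frequencies = [0] * n_classes
for x in sorted(data):
    frequencies[(x - start) // width] += 1     # index of the class containing x

rows = []
for k, freq in enumerate(frequencies):
    lower = start + k * width
    upper = lower + width - precision
    midpoint = round((lower + upper) / 2, 2)
    percent = 100 * freq / len(data)
    rows.append((lower, round(upper, 1), midpoint, freq, percent))

for row in rows:
    print(row)
```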
Frequency Distributions and Percentage Distributions
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class         Midpoint   Freq   %
10.0 - 19.9   14.95      3      15%
20.0 - 29.9   24.95      6      30%
30.0 - 39.9   34.95      5      25%
40.0 - 49.9   44.95      4      20%
50.0 - 59.9   54.95      2      10%
TOTAL                    20     100%
Graphing Numerical Data: The Histogram
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of Age, frequency (0 to 7) against class midpoints 14.95, 24.95, 34.95, 44.95, 54.95; bar heights 3, 6, 5, 4, 2; class boundaries and class midpoints marked; no gaps between bars]
Graphing Numerical Data: The Frequency Polygon
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency polygon, frequency (0 to 7) plotted at class midpoints 14.95, 24.95, 34.95, 44.95, 54.95]
Calculate Measures of Central Tendency & Spread
• We can use the frequency distribution table to calculate:
  – Mean
  – Standard deviation
  – Median
  – Mode
Mean
$\bar{X} = \frac{\sum f \cdot mp}{n}$
Class         Midpoint   Freq   freq x m.p.
10.0 - 19.9   14.95      3      44.85
20.0 - 29.9   24.95      6      149.70
30.0 - 39.9   34.95      5      174.75
40.0 - 49.9   44.95      4      179.80
50.0 - 59.9   54.95      2      109.90
TOTAL                    20     659.00
• Mean = 659/20 = 32.95
• Compare with 32.4 from direct calculation.
Standard deviation
$s^2 = \frac{\sum f \cdot mp^2 - \frac{\left(\sum f \cdot mp\right)^2}{n}}{n - 1}$
Class         Midpoint   Freq   f.m.p.   f.mp^2
10.0 - 19.9   14.95      3      44.85    670.51
20.0 - 29.9   24.95      6      149.70   3735.02
30.0 - 39.9   34.95      5      174.75   6107.51
40.0 - 49.9   44.95      4      179.80   8082.01
50.0 - 59.9   54.95      2      109.90   6039.01
TOTAL                    20     659.00   24634.05
s^2 = (24634.05 - (659^2/20))/19
s^2 = 2920.00/19
s^2 = 153.68
s = 12.4
• Compare with 12.67 from direct measurement.
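The grouped mean and standard deviation can be recomputed from the class midpoints and frequencies. This sketch follows the same arithmetic as the table, using the sum of f x midpoint and the sum of f x midpoint squared; variable names are illustrative.

```python
midpoints = [14.95, 24.95, 34.95, 44.95, 54.95]
freqs = [3, 6, 5, 4, 2]

n = sum(freqs)                                              # 20 observations
sum_f_mp = sum(f * m for f, m in zip(freqs, midpoints))     # sum of f x mp
sum_f_mp2 = sum(f * m * m for f, m in zip(freqs, midpoints))  # sum of f x mp^2

grouped_mean = sum_f_mp / n
grouped_variance = (sum_f_mp2 - sum_f_mp ** 2 / n) / (n - 1)
grouped_sd = grouped_variance ** 0.5

print(f"sum f.mp = {sum_f_mp:.2f}, sum f.mp^2 = {sum_f_mp2:.2f}")
print(f"mean = {grouped_mean:.2f}")
print(f"variance = {grouped_variance:.2f}, sd = {grouped_sd:.1f}")
```

The grouped figures differ a little from the direct calculation because every value in a class is treated as if it sat at the class midpoint.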
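Median
Class         Freq
10.0 - 19.9   3
20.0 - 29.9   6
30.0 - 39.9   5    (median class)
40.0 - 49.9   4
50.0 - 59.9   2
TOTAL         20
• Median = $L_1 + i \times \frac{\frac{n+1}{2} - f_1}{f_{med}}$
• L1 = lower boundary of the median class; i = class width; fmed = frequency of the median class
• f1 = cumulative frequency of the classes listed above (before) the median class
• 29.95 + 10 × ((21/2) - 9)/5
• 29.95 + 15/5 = 32.95
• From direct calculation, median = 31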
Mode
Class         Freq
10.0 - 19.9   3
20.0 - 29.9   6    (mode class)
30.0 - 39.9   5
40.0 - 49.9   4
50.0 - 59.9   2
TOTAL         20
• Mode = $L_1 + i \times \frac{\text{Diff}_1}{\text{Diff}_1 + \text{Diff}_2}$
• = 19.95 + 10 × (3/(3+1))
• = 27.45
• Compare with modes of 24 & 27 from direct calculation.
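The grouped median and mode formulas can be sketched as two small functions. They follow the slides, including the use of (n+1)/2 as the median position; Diff1 and Diff2 are taken to be the differences between the modal class frequency and the frequencies of the classes before and after it, which matches the numbers in the example. The function names are hypothetical.

```python
lower_boundaries = [9.95, 19.95, 29.95, 39.95, 49.95]
freqs = [3, 6, 5, 4, 2]
width = 10


def grouped_median(lower_boundaries, freqs, width):
    """L1 + i * (((n + 1) / 2) - f1) / fmed, as on the Median slide."""
    position = (sum(freqs) + 1) / 2
    cumulative = 0                      # f1: cumulative frequency so far
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= position:  # this is the median class
            return lower + width * (position - cumulative) / f
        cumulative += f
    raise ValueError("frequencies must not be empty")


def grouped_mode(lower_boundaries, freqs, width):
    """L1 + i * Diff1 / (Diff1 + Diff2), as on the Mode slide."""
    k = freqs.index(max(freqs))         # first class with the highest frequency
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k + 1 < len(freqs) else 0
    diff1, diff2 = freqs[k] - before, freqs[k] - after
    return lower_boundaries[k] + width * diff1 / (diff1 + diff2)


median_grouped = grouped_median(lower_boundaries, freqs, width)
mode_grouped = grouped_mode(lower_boundaries, freqs, width)
print(f"median = {median_grouped:.2f}, mode = {mode_grouped:.2f}")
```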
Graphing Bivariate Numerical Data (Scatter Plot)
Linear Regression Line
Survival Function
[Figure: survival function plot, cumulative survival (0.0 to 1.2) against DURATION (0 to 7), with censored observations marked]
Principles of Graphical Excellence
• Presents data in a way that provides substance, statistics and design
• Communicates complex ideas with clarity, precision and efficiency
• Gives the largest number of ideas in the most efficient manner
• Almost always involves several dimensions
• Tells the truth about the data