exploratory data analysis - checking for normality

©drtamil@gmail.com 2012

Explore & Summarise

Dr Azmi Mohd Tamil

Dept of Community Health

Universiti Kebangsaan Malaysia

FK6163

Introduction

The method of exploring and summarising data differs according to the type of variable.

Dependent/Independent

Independent variables: frequency of exercise, food intake

Dependent variable: obesity

Explore

• It is the first step in the analytic process.

• To explore the characteristics of the data.

• To screen for errors and correct them.

• To look for distribution patterns: normal distribution or not.

• The data may require transformation before further analysis using parametric methods.

• Or they may need analysis using non-parametric techniques.

Data Screening

• By running frequencies, we may detect inappropriate responses.

• How many in the audience have 15 children and are currently pregnant with the 16th?

PARITY

Frequency   Percent
67          30.7
44          20.2
36          16.5
22          10.1
21          9.6
Total 218   100.0

Data Screening

• See whether the data make sense or not.

• E.g. parity 10 but age only 25.

Data Screening

• By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data.

Descriptive Statistics

                        N     Minimum   Maximum   Mean    Std. Deviation
Pre-pregnancy weight    184   32        484       53.05   33.37
Valid N (listwise)

Interpreting the Box Plot

[Figure: annotated box plot. Labels: outlier, largest non-outlier, upper quartile, median, lower quartile, smallest non-outlier]

• The whiskers extend up to 1.5 times the box width from both ends of the box, and each ends at an observed value.

• Three times the box width marks the boundary between "mild" and "extreme" outliers.

• "Mild" outliers = closed dots; "extreme" outliers = open dots.

Data Screening

• We can also make use of graphical tools such as the box plot to detect data entry errors.

[Figure: box plot of pre-pregnancy weight, N = 184, with outlying cases labelled]
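The box plot rule for mild and extreme outliers can be sketched in a few lines of Python. This is only an illustration: it assumes the "box width" means the inter-quartile range, takes each quartile as the median of one half of the sorted data, and uses a made-up list of weights (`weights_kg` is hypothetical, not the study data).

```python
from statistics import median

def box_plot_fences(values):
    """Quartiles plus the 1.5x and 3x fences used to flag outliers."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])        # lower quartile: median of the lower half
    q3 = median(data[-half:])       # upper quartile: median of the upper half
    iqr = q3 - q1                   # the "box width" (assumed to be the IQR)
    return {
        "q1": q1,
        "q3": q3,
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify_outliers(values):
    """Split values into mild and extreme outliers using the fences above."""
    fences = box_plot_fences(values)
    mild_lo, mild_hi = fences["mild"]
    ext_lo, ext_hi = fences["extreme"]
    mild, extreme = [], []
    for x in values:
        if x < ext_lo or x > ext_hi:
            extreme.append(x)
        elif x < mild_lo or x > mild_hi:
            mild.append(x)
    return {"mild": mild, "extreme": extreme}

# Hypothetical weights in kg; 484 mimics a data entry error (e.g. 48.4 typed as 484).
weights_kg = [45, 48, 50, 52, 53, 55, 58, 60, 62, 65, 95, 484]
result = classify_outliers(weights_kg)
print(result)
```

Values flagged this way are candidates for checking against the original questionnaire, not automatic deletions.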

Data Cleaning

• Identify the extreme/wrong values.

• Check with the original data source, i.e. the questionnaire.

• If incorrect, do the necessary correction.

• Correction must be done before transformation, recoding and analysis.

Parameters of Data Distribution

• Mean – central value of the data

• Standard deviation – measure of how the data scatter around the mean

• Symmetry (skewness) – the degree to which the data pile up on one side of the mean

• Kurtosis – how far data scatter from the

Normal distribution

• The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.

• The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.

• However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.

Normal distribution

• If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it (± 1 sd) includes about 68.3% of the observations;

• a range of two standard deviations above and two below (± 2 sd) about 95.4% of the observations; and

• a range of three standard deviations above and three below (± 3 sd) about 99.7% of the observations.
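The three percentages can be checked from the cumulative distribution function of the standard Normal distribution. A minimal sketch using Python's standard library (the function name `coverage` is just an illustrative choice):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Proportion of a Normal distribution lying within k sd of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ± {k} sd: {coverage(k):.1%}")
```

Because the result does not depend on the mean or the standard deviation, the same proportions hold for every member of the Normal family.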

Normality

• Why bother with normality?

• Because it dictates the type of analysis that you can run on the data.

Normality - Why? Parametric

Qualitative, dichotomous | Quantitative | Normally distributed data | Student's t test

Qualitative, polytomous | Quantitative | Normally distributed data | ANOVA

Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment); normally distributed data | Paired t test

Quantitative, continuous | Quantitative, continuous | Normally distributed data | Pearson correlation & linear regression

Normality - Why? Non-parametric

Qualitative, dichotomous | Quantitative | Data not normally distributed | Wilcoxon rank-sum test or Mann-Whitney U test

Qualitative, polytomous | Quantitative | Data not normally distributed | Kruskal-Wallis one-way ANOVA test

Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon signed-rank test

Quantitative, continuous/ordinal | Quantitative, continuous | Data not normally distributed | Spearman/Kendall rank correlation
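The two tables amount to a lookup from the kind of comparison and the normality of the data to a test. The sketch below only restates the tables; the key names (such as "two groups") and the function `suggest_test` are illustrative labels, not part of the slides.

```python
# (comparison, data normally distributed?) -> test named in the tables
TESTS = {
    ("two groups", True): "Student's t test",
    ("two groups", False): "Wilcoxon rank-sum / Mann-Whitney U test",
    ("three or more groups", True): "ANOVA",
    ("three or more groups", False): "Kruskal-Wallis one-way ANOVA test",
    ("paired measurements", True): "Paired t test",
    ("paired measurements", False): "Wilcoxon signed-rank test",
    ("two continuous variables", True): "Pearson correlation & linear regression",
    ("two continuous variables", False): "Spearman/Kendall rank correlation",
}

def suggest_test(comparison, normal):
    """Return the test the tables pair with this comparison and normality status."""
    return TESTS[(comparison, normal)]

print(suggest_test("two groups", normal=False))
```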

Normality-How?

• Explored graphically
  • Histogram
  • Stem & leaf
  • Box plot
  • Normal probability plot
  • Detrended normal plot

• Explored statistically
  • Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
  • Skewness (0)
  • Kurtosis (0)
    – + leptokurtic
    – 0 mesokurtic
    – - platykurtic

Kolmogorov-Smirnov

• In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came out with an approach for comparing distributions that did not make use of parameters.

• This is known as the Kolmogorov-Smirnov test.
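The slides do not give the formula, so the following is a sketch of the usual definition: the K-S statistic D is the largest vertical gap between the empirical cumulative distribution of the sample and the cumulative distribution of a Normal curve with the sample's own mean and standard deviation. The function name `ks_statistic` is illustrative, and the significance level (with the Lilliefors correction) is left to statistical software.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(values):
    """Largest gap between the empirical CDF and the fitted Normal CDF."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# The 20 values used in the worked examples of this lecture.
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
print(round(ks_statistic(data), 3))
```

A small D means the sample follows the fitted Normal curve closely; whether it is small enough depends on the sample size.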

Skewness

• Skewed to the right indicates the presence of large extreme values.

• Skewed to the left indicates the presence of small extreme values.

Kurtosis

• For symmetrical distributions only.

• Describes the shape of the curve.

• Mesokurtic - average shaped

• Leptokurtic - narrow & slim

• Platykurtic - flat

Skewness & Kurtosis

• Skewness ranges from -3 to 3.

• The acceptable range for normality is skewness lying between -1 and 1.

• Normality should not be based on skewness alone; the kurtosis measures the “peakedness” of the bell curve (see Fig. 4).

• Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.

Normality - Examples: Graphically

[Figure: histogram of Height. Mean = 151.6, Std. Dev = 5.26, N = 218]

Q&Q Plot

• This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).

• If the distributional shapes differ, then the points will plot along a curve instead of a line.

• Take note that the interest here is the central portion of the line; severe deviations there mean non-normality. Deviations at the “ends” of the curve signify the existence of outliers.
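The points of a normal Q-Q plot can be computed without any plotting library. In this sketch each sorted observation is paired with the value a Normal distribution (same mean and standard deviation as the sample) would be expected to give at that rank. The plotting position (i - 0.5)/n is one common convention and is an assumption here; software packages differ on this detail.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (observed, expected normal) pairs for a Q-Q plot."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(data, expected))

# The 20 values used in the worked examples of this lecture.
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
points = normal_qq_points(data)

# Detrended version: deviation of each observation from its expected value.
deviations = [observed - expected for observed, expected in points]
for (observed, expected), dev in zip(points, deviations):
    print(f"{observed:5d} {expected:8.2f} {dev:8.2f}")
```

Plotting observed against expected gives the Q-Q plot; plotting the deviations against the observed values gives the detrended plot.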

Normality - Examples: Graphically

[Figure: Normal Q-Q plot of Height (expected normal value against observed value, 130 to 170) and detrended normal Q-Q plot of Height (deviation from normal against observed value)]

Normal distribution: mean = median = mode

Tests of Normality

          Kolmogorov-Smirnov(a)
          Statistic   df    Sig.
Height    .060        218   .052

a. Lilliefors Significance Correction

Descriptives: Height

                                                Statistic   Std. Error
Mean                                            151.65      .356
95% Confidence Interval for Mean, Lower Bound   150.94
95% Confidence Interval for Mean, Upper Bound   152.35
5% Trimmed Mean                                 151.59
Median                                          151.50
Variance                                        27.649
Std. Deviation
Minimum
Maximum
Interquartile Range
Skewness                                        .148        .165
Kurtosis                                        .061        .328

Normality - Examples: Statistically

• Shapiro-Wilk: only if the sample size is less than 100.

• Normal distribution: mean = median = mode.

• Skewness & kurtosis within ± 1.

• p > 0.05, so normal distribution.

K-S Test

• Very sensitive to the sample size.

• For small samples (n < 20, say), the likelihood of getting p < 0.05 is low.

• For large samples (n > 100), a slight deviation from normality will result in the data being reported as not normally distributed.

Guide to deciding on normality

Normality Transformation

[Figure: Normal Q-Q plots of PARITY and of LN_PARIT; expected normal value against observed value]

TYPES OF TRANSFORMATIONS

• Square root

• Logarithm

• Inverse

• Reflect and square root

• Reflect and logarithm

• Reflect and inverse
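The six transformations can be sketched as follows. Two details are assumptions rather than something stated in the slides: "reflect" is taken to mean subtracting each value from the largest value plus one (a common convention, which turns a left skew into a right skew), and the logarithm is the natural log. The sample values are hypothetical.

```python
import math

def reflect(values):
    """Reflect values about (max + 1) so the smallest becomes the largest."""
    pivot = max(values) + 1
    return [pivot - x for x in values]

TRANSFORMS = {
    "square root": lambda xs: [math.sqrt(x) for x in xs],
    "logarithm": lambda xs: [math.log(x) for x in xs],      # needs x > 0
    "inverse": lambda xs: [1 / x for x in xs],              # needs x != 0
    "reflect and square root": lambda xs: [math.sqrt(x) for x in reflect(xs)],
    "reflect and logarithm": lambda xs: [math.log(x) for x in reflect(xs)],
    "reflect and inverse": lambda xs: [1 / x for x in reflect(xs)],
}

# Hypothetical right-skewed counts (all positive, so every transform is defined).
sample = [1, 1, 2, 2, 3, 4, 6, 9, 16]
for name, transform in TRANSFORMS.items():
    print(name, [round(v, 2) for v in transform(sample)])
```

Variables that include zero need a constant added before taking a logarithm or an inverse. After transforming, normality should be rechecked on the new variable.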

Summarise

• Summarise a large set of data by a few meaningful numbers.

• Single variable analysis
  • For the purpose of describing the data
  • Example: in one year, what kind of cases are treated by the Psychiatric Dept?
  • Tables & diagrams are usually used to describe the data
  • For numerical data, measures of central tendency & spread are usually used

Frequency Table

Race F %

Malay 760 95.84%

Chinese 5 0.63%

Indian 0 0.00%

Others 28 3.53%

TOTAL 793 100.00%

• Illustrates the frequency observed for each category.

Frequency Distribution Table

Age Count %

0-0.99 25 3.26%

1-4.99 78 10.18%

5-14.99 140 18.28%

15-24.99 126 16.45%

25-34.99 112 14.62%

35-44.99 90 11.75%

45-54.99 66 8.62%

55-64.99 60 7.83%

65-74.99 50 6.53%

75-84.99 16 2.09%

85+ 3 0.39%

TOTAL 766

• With more than 20 observations, data are best presented as a frequency distribution table.

• Columns are divided into class & frequency.

• The modal class can be determined using such tables.

Measurement of Central Tendency & Spread

Measures of Central Tendency

• Mean

• Median

• Mode

Measures of Variability

• Standard deviation

• Inter-quartile range

• Skewness & kurtosis

Mean

• The average of the data collected.

• To calculate the mean, add up the observed values and divide by the number of them.

• A major disadvantage of the mean is that it is sensitive to outlying points.

Mean: Example

• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Total of x = 648

• n = 20

• Mean = 648/20 = 32.4

Measures of variation - standard deviation

• Tells us how tightly the scores in a dataset cluster around the mean. A large SD indicates more widely varied scores.

• A summary measure of the differences of each observation from the mean.

• If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.

• Consequently the squares of the differences are added.

sd: Example

• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Mean = 32.4; n = 20

• Total of (x-mean)^2 = 3050.8

• Variance = 3050.8/19 = 160.5684

• sd = 160.5684^0.5 = 12.67

x       (x-mean)^2    x       (x-mean)^2
12      416.16        32      0.16
13      376.36        35      6.76
17      237.16        37      21.16
21      129.96        38      31.36
24      70.56         41      73.96
24      70.56         43      112.36
26      40.96         44      134.56
27      29.16         46      184.96
27      29.16         53      424.36
30      5.76          58      655.36
TOTAL   1405.8        TOTAL   1645
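The mean and standard deviation of the example data can be reproduced step by step. This is a sketch of the same arithmetic using Python's standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n = len(data)                                          # 20 observations
total = sum(data)                                      # total of x
mean_x = total / n                                     # mean
sum_sq_dev = sum((x - mean_x) ** 2 for x in data)      # total of (x-mean)^2
sample_variance = sum_sq_dev / (n - 1)                 # divide by n - 1, not n
sample_sd = sqrt(sample_variance)

print(total, mean_x, round(sum_sq_dev, 1), round(sample_variance, 4), round(sample_sd, 2))

# The library function uses the same n - 1 definition.
library_sd = stdev(data)
```

Dividing by n - 1 rather than n gives the sample standard deviation, which is what the worked example uses (3050.8/19).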

Median

• The ranked value that lies in the middle of the data.

• The point which has the property that half the data are greater than it, and half the data are less than it.

• If n is even, average the (n/2)th and the (n/2 + 1)th ranked observations.

• "Robust" to outliers.

Median: Example

• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• (20+1)/2 = 10.5, so the median lies midway between the 10th value (30) and the 11th value (32).

• Therefore the median is (30 + 32)/2 = 31.

Measures of variation - quartiles

• The range is very susceptible to what are known as outliers.

• A more robust approach is to divide the distribution of the data into four, and find the points below which 25%, 50% and 75% of the distribution lie. These are known as quartiles, and the median is the second quartile.

Quartiles

• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• 25th percentile: (24 + 24)/2 = 24

• 50th percentile: (30 + 32)/2 = 31 = median

• 75th percentile: (41 + 43)/2 = 42
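The quartiles in this example are found by splitting the sorted data into a lower and an upper half and taking the median of each. A minimal sketch of that approach (the function name is illustrative). Other quantile conventions, including the defaults of some software, can give slightly different values for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1, median and Q3 as medians of the lower half, all data and upper half."""
    data = sorted(values)
    half = len(data) // 2          # for odd n the middle value is left out of both halves
    return median(data[:half]), median(data), median(data[-half:])

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3, "IQR =", q3 - q1)
```

Here the upper half runs from 32 to 58, so its two middle values are 41 and 43 and the upper quartile is (41 + 43)/2 = 42.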

Mode

• The most frequently occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.

• It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.

Mode: Example

• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Modes are 24 (10%) & 27 (10%)
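Finding every mode together with its share of the observations is a simple counting exercise. A sketch using the standard library (`modes_with_share` is an illustrative name):

```python
from collections import Counter

def modes_with_share(values):
    """Return each most frequent value with the proportion of times it occurs."""
    counts = Counter(values)
    top = max(counts.values())
    return [(value, count / len(values))
            for value, count in sorted(counts.items()) if count == top]

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
modes = modes_with_share(data)
for value, share in modes:
    print(f"mode {value} ({share:.0%})")
```

With two values tied for the highest count, the data are bimodal, which is why both 24 and 27 are reported.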

Mean or Median?

• Which measure of central tendency should we use?

• If the distribution is normal, the mean and SD will be the measures to be presented; otherwise the median and IQR are more appropriate.

Normal distribution: use mean and SD

Not normal distribution: use median & IQR

Presentation

Qualitative & Quantitative Data

Charts & Tables

Presentation

Qualitative Data

Graphing Categorical Data: Univariate Data

Categorical Data

• Tabulating Data: The Summary Table

• Graphing Data: Pie Charts, Bar Charts, Pareto Diagram

Bar Chart

[Figure: bar chart of type of work (housewife, office work, field work); vertical axis: percent]

Pie Chart

[Figure: pie chart; visible labels: Chinese, Others]

Tabulating and Graphing Bivariate Categorical Data

• Contingency tables:

Table 1: Contingency table of pregnancy induced hypertension and

Pregnancy induced hypertension   Normal   SGA   Total
                                 103      94    197
                                 5        16    21
Total                            108      110   218

Tabulating and Graphing Bivariate Categorical Data

• Side-by-side charts

[Figure: chart of pregnancy induced hypertension; visible legend entry: Normal]

Presentation

Quantitative Data

Tabulating and Graphing Numerical Data

Numerical Data

• Ordered Array: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21 sorted to 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

• Stem and Leaf Display: 2 | 144677

• Frequency Distributions and Cumulative Distributions: Tables, Histograms, Polygons, Area
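The ordered array and the stem and leaf display for the ten values shown on this slide can be produced as follows. This is a sketch for two-digit whole numbers only, with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each tens digit (stem) to its sorted units digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}

raw = [41, 24, 32, 26, 27, 27, 30, 24, 38, 21]
ordered = sorted(raw)
display = stem_and_leaf(raw)

print(ordered)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

The row for stem 2 reads 144677, i.e. the values 21, 24, 24, 26, 27 and 27.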

Tabulating Numerical Data:

• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Find range: 58 - 12 = 46

• Select number of classes: 5 (usually between 5 and 15)

• Compute class interval (width): 10 (46/5 = 9.2, then round up)

• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60

• Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95

• Count observations & assign to classes
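The steps above can be sketched as a small function. It assumes classes written as 10.0 - 19.9, 20.0 - 29.9 and so on (values recorded to one decimal place), which is what gives midpoints such as 14.95. The function and parameter names are illustrative.

```python
def frequency_distribution(values, start=10, width=10, classes=5):
    """Return rows of (class label, midpoint, frequency, percent)."""
    counts = [0] * classes
    for x in values:
        counts[int((x - start) // width)] += 1     # index of the class containing x
    n = len(values)
    rows = []
    for k, count in enumerate(counts):
        lower = start + k * width
        upper = lower + width - 0.1                # upper class limit, e.g. 19.9
        midpoint = round((lower + upper) / 2, 2)   # e.g. (10.0 + 19.9) / 2 = 14.95
        rows.append((f"{lower:.1f} - {upper:.1f}", midpoint, count, 100 * count / n))
    return rows

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
table = frequency_distribution(data)
for label, midpoint, freq, percent in table:
    print(f"{label:12s} {midpoint:6.2f} {freq:3d} {percent:5.0f}%")
```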

Frequency and Percentage Distributions

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class         Midpoint   Freq   %
10.0 - 19.9   14.95      3      15%
20.0 - 29.9   24.95      6      30%
30.0 - 39.9   34.95      5      25%
40.0 - 49.9   44.95      4      20%
50.0 - 59.9   54.95      2      10%
TOTAL                    20     100%

Graphing Numerical Data: The Histogram

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: histogram of frequency against class midpoints (14.95 to 54.95), with class boundaries marked; no gaps between bars]

Graphing Numerical Data: The Frequency Polygon

[Figure: frequency polygon of frequency against class midpoints (14.95 to 54.95)]

Calculate Measures of Central Tendency & Spread

• We can use the frequency distribution table to calculate:
  • Mean
  • Standard deviation
  • Median
  • Mode

• Mean = Σ(freq x m.p.) / Σ(freq) = 659/20 = 32.95

• Compare with 32.4 from direct calculation.

Class         Midpoint   Freq   freq x m.p.
10.0 - 19.9   14.95      3      44.85
20.0 - 29.9   24.95      6      149.70
30.0 - 39.9   34.95      5      174.75
40.0 - 49.9   44.95      4      179.80
50.0 - 59.9   54.95      2      109.90
TOTAL                    20     659.00

Standard deviation

s^2 = (Σ(f x m.p.^2) - (Σ(f x m.p.))^2/n)/(n - 1)

s^2 = (24634.05 - (659^2/20))/19

s^2 = 2920.00/19

s^2 = 153.68

s = 12.40

• Compare with 12.67 from direct calculation.

Class         Midpoint   Freq   f x m.p.   f x m.p.^2
10.0 - 19.9   14.95      3      44.85      670.51
20.0 - 29.9   24.95      6      149.70     3735.02
30.0 - 39.9   34.95      5      174.75     6107.51
40.0 - 49.9   44.95      4      179.80     8082.01
50.0 - 59.9   54.95      2      109.90     6039.01
TOTAL                    20     659.00     24634.05
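The grouped mean and standard deviation can be reproduced from the midpoints and frequencies alone. A sketch of the same computational formula (variable names are illustrative):

```python
from math import sqrt

midpoints = [14.95, 24.95, 34.95, 44.95, 54.95]
freqs = [3, 6, 5, 4, 2]

n = sum(freqs)                                               # 20
sum_f_mp = sum(f * m for f, m in zip(freqs, midpoints))      # sum of f x m.p.
sum_f_mp2 = sum(f * m ** 2 for f, m in zip(freqs, midpoints))  # sum of f x m.p.^2

grouped_mean = sum_f_mp / n
grouped_variance = (sum_f_mp2 - sum_f_mp ** 2 / n) / (n - 1)
grouped_sd = sqrt(grouped_variance)

print(round(grouped_mean, 2), round(grouped_variance, 2), round(grouped_sd, 2))
```

The grouped figures differ a little from the direct calculation (32.4 and 12.67) because every observation is treated as if it sat at its class midpoint.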

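Median

• Median = L1 + i x (((n+1)/2 - f1)/f), where L1 is the lower boundary of the median class, i is the class width and f is the frequency of the median class

• f1 = cumulative frequency of the classes above (before) the median class in the table: 3 + 6 = 9

• 29.95 + 10 x ((21/2) - 9)/5

• 29.95 + 15/5 = 32.95

• From direct calculation, median = 31

Class         Freq
10.0 - 19.9   3
20.0 - 29.9   6
30.0 - 39.9   5    (median class)
40.0 - 49.9   4
50.0 - 59.9   2
TOTAL         20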

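Mode

• Mode = L1 + i x (Diff1/(Diff1 + Diff2)), where L1 is the lower boundary of the mode class, i is the class width, Diff1 is the frequency of the mode class minus that of the class before it, and Diff2 is the frequency of the mode class minus that of the class after it

• = 19.95 + 10 x (3/(3+1))

• = 27.45

• Compare with modes of 24 & 27 from direct calculation.

Class         Freq
10.0 - 19.9   3
20.0 - 29.9   6    (mode class)
30.0 - 39.9   5
40.0 - 49.9   4
50.0 - 59.9   2
TOTAL         20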
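Both grouped-data formulas, for the median and for the mode, can be sketched in code. The lower class boundaries (9.95, 19.95, ...) follow the pattern of the 19.95 and 29.95 used in the worked examples, and the median position follows the slide's (n+1)/2 convention; function names are illustrative.

```python
lower_boundaries = [9.95, 19.95, 29.95, 39.95, 49.95]
freqs = [3, 6, 5, 4, 2]
width = 10

def grouped_median(lower_boundaries, freqs, width):
    """L1 + i * (((n + 1) / 2 - f1) / f) for the class containing the median."""
    position = (sum(freqs) + 1) / 2
    cumulative = 0                               # f1: cumulative freq before this class
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= position:
            return lower + width * (position - cumulative) / f
        cumulative += f

def grouped_mode(lower_boundaries, freqs, width):
    """L1 + i * (Diff1 / (Diff1 + Diff2)) for the class with the highest frequency."""
    k = max(range(len(freqs)), key=freqs.__getitem__)
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k + 1 < len(freqs) else 0
    diff1 = freqs[k] - before
    diff2 = freqs[k] - after
    return lower_boundaries[k] + width * diff1 / (diff1 + diff2)

median_estimate = grouped_median(lower_boundaries, freqs, width)
mode_estimate = grouped_mode(lower_boundaries, freqs, width)
print(round(median_estimate, 2), round(mode_estimate, 2))
```

Both results are estimates that assume observations are spread evenly within each class, which is why they differ from the directly calculated median of 31 and modes of 24 and 27.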

Graphing Bivariate Numerical Data (Scatter Plot)

[Figure: scatter plot with linear regression line]

Survival Function

[Figure: survival function, cumulative survival against duration (0 to 7), with censored observations marked]

Principles of Graphical Excellence

• Presents data in a way that provides substance, statistics and design

• Communicates complex ideas with clarity, precision and efficiency

• Gives the largest number of ideas in the most efficient manner

• Almost always involves several dimensions

• Tells the truth about the data
