data exploration using r - aucklandkxio001/workshop.pdf · data exploration using r statistics...
TRANSCRIPT
Data Exploration using R
Statistics Refresher Workshop
Kai Xiong
[email protected] Consulting ServiceThe Department of StatisticsThe University of Auckland
July 1, 2011
Kai Xiong Data Exploration using R 1/47
Exploring your data
Checking the data for patterns, relationships, structures, andother features.
Can be done through graphs and summary statistics.
We recommend R for generating graphs, much more flexiblethan Excel, SPSS, SAS.
Kai Xiong Data Exploration using R 2/47
Roles of Exploratory Graphics
1 Are there any errors or outliers?2 Are there any patterns in the data?
Symmetric, Skewed, Bimodal, Clusters?
3 Are there any relationships between the variables?
Linear (increasing, decreasing), polynomial, exponential...
4 What sort of model might be appropriate?
Kai Xiong Data Exploration using R 3/47
Types of variable
1 Quantitative
Continuous:e.g. gene expression, Body length of musselsDiscrete:e.g. number of photophores in lantern fish
2 Qualitative
Categoricale.g. SNPs, Location of marine reservesOrdinale.g. Position in a food chain
Kai Xiong Data Exploration using R 4/47
One Quantitative Variable
Dotplots
Five-number-summary statistics1 Minimum2 1st Quantile/Lower Quantile/25th percentile
25% of the sorted variable is smaller than the 1st quantile75% of the sorted variable is greater than the 1st quantile
3 2nd Quantile/Median/50th percentile
Divide the sorted variable in two equal halves
4 3rd Quantile/Upper Quantile/75th percentile
25% of the sorted variable is greater than the 3rd quantile75% of the sorted variable is smaller than the 3rd quantile
5 Maximum
Boxplot: visual display of 5-number-summary statistics.
Normal QQ plot
Histogram
Kai Xiong Data Exploration using R 5/47
010
2030
4050
Minimum
1st Quantile
Median
3rd Quantile
Maximum
Outlier?●
Figure: Boxplot with 5-number-summaryKai Xiong Data Exploration using R 6/47
●●●●●●●●●●●●●●●●●●●●●
050
100
150
Figure: Boxplot with 5-number-summaryKai Xiong Data Exploration using R 7/47
●●●●●●●●●
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Figure: Boxplot with 5-number-summaryKai Xiong Data Exploration using R 8/47
Normal Quantile-Quantile (QQ) Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●●
●●
●
●
●●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●●
●●
●●
●
●●
●
●●●●
●●
●●
●●●●●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
−2 −1 0 1 2
510
1520
Theoretical Quantiles
Sam
ple
Qua
ntile
s
Kai Xiong Data Exploration using R 9/47
Histogram showing Normal Distribution
Fre
quen
cy
−4 −2 0 2 4
020
040
060
080
010
00
Figure: Histogram Example OneKai Xiong Data Exploration using R 10/47
Histogram and boxplot showing Normal Distribution
Fre
quen
cy
−4 −2 0 2 4
020
040
060
080
010
00 ● ●●●● ●●● ●● ●● ●● ●●● ●● ●● ●●●● ●● ●●● ●●● ●● ●●● ● ●●●●●● ●●● ● ●●● ● ●● ● ●● ● ● ●● ●●
−4 −2 0 2 4
Figure: Histogram with boxplot example oneKai Xiong Data Exploration using R 11/47
Histogram showing left skewed distribution
Fre
quen
cy
2 4 6 8 10
010
020
030
040
050
060
0
Figure: Histogram Example TwoKai Xiong Data Exploration using R 12/47
Histogram and boxplot showing left skewed istribution
Fre
quen
cy
2 4 6 8 10
010
020
030
040
050
060
0
●●●● ●●● ● ●● ● ●●● ● ●● ● ●● ● ●● ●● ●● ●●● ●●● ●● ●●● ●●● ●● ●●● ● ●●● ● ●● ● ●●●● ● ● ●●● ●● ● ●● ● ●●● ●●● ● ●●●● ●● ●●● ●● ● ●●●● ●●●● ●●● ● ●●● ● ●● ●●● ●● ●● ●● ●● ●●●● ● ●● ●● ●● ●● ●●●● ●●●● ● ●●● ●●● ●● ●●●● ●● ● ●●●●● ●● ●●●● ●●● ●● ●●● ●● ●●●
2 4 6 8 10
Figure: Histogram with boxplot example twoKai Xiong Data Exploration using R 13/47
Histogram showing uniform distribution
Fre
quen
cy
0.0 0.2 0.4 0.6 0.8 1.0
020
4060
80
Figure: Histogram Example ThreeKai Xiong Data Exploration using R 14/47
Histogram and boxplot showing uniform distribution
Fre
quen
cy
0.0 0.2 0.4 0.6 0.8 1.0
020
4060
80
0.0 0.2 0.4 0.6 0.8 1.0
Figure: Histogram with boxplot example threeKai Xiong Data Exploration using R 15/47
One Qualitative Variable
Frequency table
Table: One way frequency table
Phylum Frequency Percentage
Molluscs 250 61.4%Annelids 34 8.4%Arthropods 101 24.8%Echinoderms 22 5.4%Total 407 100%
Pie chart
Barplot
Kai Xiong Data Exploration using R 16/47
Molluscs
Arthropods
Annelids
Echinoderms
Figure: Piechart ExampleKai Xiong Data Exploration using R 17/47
Molluscs Arthropods Annelids Echinoderms
Phylum
Per
cent
age
010
2030
4050
60
Figure: Barplot ExampleKai Xiong Data Exploration using R 18/47
Two VariablesTwo Quantitative
Scatter Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 19/47
Two VariablesTwo Quantitative
Scatter Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
●
●
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 20/47
Two VariablesTwo Quantitative
Scatter Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
●
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 21/47
Two VariablesTwo Quantitative
Scatter Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
● Mistake?
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 22/47
Two VariablesTwo Quantitative
Scatter Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
● NOBEL PRIZE!!
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 23/47
Two VariablesTwo Quantitative
Scatter Plot
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 24/47
Two VariablesTwo Quantitative
Scatter Plot
−20 0 20 40 60 80 100
140
160
180
200
220
240
260
Weight (g)
Foo
d C
onsu
mpt
ion
(mg/
day)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●
●
●
● ●
●
●
●
●
●
●
●
y = 0.91x + 153.66
Figure: Scatterplot ExampleKai Xiong Data Exploration using R 25/47
Two VariablesTwo Qualitative
Two way frequency table
Table: Two way frequency table
Phylum Hahei Leigh Total
Molluscs 130 120 250Annelids 29 5 34Arthropods 82 19 101Echinoderms 16 6 22Total 257 150 407
Side by side barplot: frequencies or percentages?
Kai Xiong Data Exploration using R 26/47
Molluscs Arthropods Annelids Echinoderms
HaheiLeigh
Fre
quen
cy (
Abs
olut
e A
bund
ance
)
020
4060
8010
012
0
Figure: Two way frequency on a barplotKai Xiong Data Exploration using R 27/47
Molluscs Arthropods Annelids Echinoderms
HaheiLeigh
Per
cent
age
(Rel
ativ
e A
bund
ance
)
0.0
0.2
0.4
0.6
0.8
Figure: Two way frequency on a barplotKai Xiong Data Exploration using R 28/47
Two VariablesOne Qualitative and One Quantitative
Side-by-side boxplot
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
Group One Group Two
0.0
0.5
1.0
1.5
2.0
Figure: Side by side boxplot
Kai Xiong Data Exploration using R 29/47
Two VariablesOne Qualitative and One Quantitative
Side-by-side boxplot
●●
●
●●
●
●●●
●
●●●
●
●●
●
●●●●●●●●●●●●
Group One Group Two
02
46
810
Figure: Side by side boxplot
Kai Xiong Data Exploration using R 30/47
Two VariablesOne Qualitative and One Quantitative
QQ Plot
●
● ●●
●●
●●●●●●
● ●●●●●●●●●●●●●●
●●●●●●●●●●●●
●●●●●●
●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●
●●●●●●
●●●
●●
●●
● ●
0 2 4 6
01
23
45
67
Group One
Gro
up T
wo
Figure: Normal QQplot ExampleKai Xiong Data Exploration using R 31/47
Three VariablesThree Qualitative Variables
Table: Three way frequency table
Location
Hahei Leigh
Phylum Winter Summer Winter Summer
Molluscs 39 91 32 88Annelids 9 20 1 4Arthropods 25 57 6 13Echinoderms 6 10 2 4
Kai Xiong Data Exploration using R 32/47
Hahei Leigh
Mol
lusc
s
Art
hrop
ods
Ann
elid
s
Ech
inod
erm
s
Mol
lusc
s
Art
hrop
ods
Ann
elid
s
Ech
inod
erm
s
020
4060
80WinterSummer
Fre
quen
cy (
Abs
olut
e A
bund
ance
)
Figure: Three way frequency on a barplotKai Xiong Data Exploration using R 33/47
Three VariablesThree Quantitative Variables
Trellis Graphs: Display a variable or the relationship betweenvariables, conditioned on one or more other variables.
Three continuous variables -log(Concentration), Outcome andOxygen level.What is the relationship between -log(Concentration) andOutcome, conditioned on Oxygen level? In other words, howdoes the relationship between -log(Concentration) andOutcome change over different Oxygen level?
1 Plot -log(Concentration) against Outcome first.2 Break Oxygen level into ordinal groups. e.g. Divide Oxygen
level into 12 equally spaced interval.
Kai Xiong Data Exploration using R 34/47
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●●●●
●
●
●
●
●
●●
●●
●●
●
●
●
●●
●
●
●
●●●●
●
●●
●
●
●
●
●●
●
●
●
●
●●●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●●
●●
●
●●
●●
●●●
●●●
●
●
●
●
●●
●●
●
●●
●
●
●●
●
●
●
●●●●
●
●●●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●●●●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●●●
●
●●
●
●●
●
●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●●●
●
●●●●●●●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●●●●
●
●
●●
●●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●
●
●
●●●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●●
●●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●●
●
●●
●●●●
●●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●●●●
●
●
●●
●
●●
●●
●
●
●●
●●
−2 −1 0 1 2 3 4
05
1015
2025
30
−log (Concentration)
Out
com
e
Kai Xiong Data Exploration using R 35/47
−log (Concentration)
Out
com
e
0
5
10
15
20
25
30
−2 −1 0 1 2 3 4
●
●
●
●●●●●
●●●
●
●●●●●
●
●
●
●
●●●
●
●●
●●●●●●
●
●●
●
●●
●●●
●●●●●●●●●●●●●
●
●●●
●●
●
●●
●
●●
●●●●●
●
●
●
●●●
●●●●●
●●●●●
●
●
●●●
●●
●
●●●●
Oxygen
●
●●●
●●●●
●●●●
●
●●●
●
●●●
●●
●●●
●
●●●
●
●●●●●●
●
●●●●●●●●●
●
●●
●
●●●
●●
●
●●●
●
●
●●
●●●●●●●●●
●●●●●
●●●●●
●●
●●●●
●
●●●●
●
●●●
●
●●
Oxygen
−2 −1 0 1 2 3 4
●
●●
●
●●
●●
●●●●●
●●●
●●
●
●
●
●
●
●●
●●●●●●●
●●●
●
●
●●
●●●●●●●
●●●●●
●●●●●
●●
●
●●
●●●●●●●●
●
●●●●
●●
●
●●●●●●●
●●●●
●
●●
●●●●●●●
●
●
Oxygen
●●●●●●●●●●●
●
●●
●
●
●●
●●
●●
●
●●●
●●●
●
●
●●●
●
●
●●
●●
●
●●
●●●●
●●●
●
●
●●●●●
●●●●
●●
●
●●●●
●
●●●
●●●●●
●●
●●●●
●●●
●
●●●●●●●●
●
●
●
●●
Oxygen
●●●
●
●
●●
●
●
●●
●
●●
●
●
●●
●●●●●●●●●●●
●●
●
●●●●●
●
●
●●●●
●
●
●●
●
●
●
●●●●●
●
●●●●●
●●
●
●
●
●●●
●
●●
●
●
●
●●●●
●●●●●
●
●●●
●
●●●●●●
●
●
●
●
●
Oxygen
●
●
●
●●●●
●●●●●●●●
●●
●
●●●
●
●●●●●
●●●
●
●
●
●●
●
●
●
●●●●●
●
●●
●●●
●●
●
●●
●●
●●
●
●
●●●
●●
●●●●●
●
●●●
●
●
●●
●
●●●●●●●●
●
●
●●●
●●●
●
●
●
●
●
Oxygen
●
●●●●●●●●
●●
●●
●
●●●●
●
●●●
●
●
●●●
●●●●
●●
●●●●●
●
●
●
●
●●●
●●●●●
●●
●
●
●●●●
●●●
●●●
●
●
●●
●
●●●●●●●
●●
●●●
●●●●●●●●
●
●●●
●
●
●
●
●●●
Oxygen
0
5
10
15
20
25
30
●●●
●
●●●●
●
●
●●●
●●
●●
●
●
●
●●●●●●●
●●●
●
●●●●●
●●●
●
●●●
●
●
●●
●
●
●
●●●●
●
●●
●
●●●
●
●
●●●
●●
●●●●●●
●●●●
●●●●
●
●●●
●
●●
●●●●
●
●●●
●
●●
Oxygen0
5
10
15
20
25
30
●
●●
●●●●●●●●
●●●
●
●●●
●●●●
●●
●
●●
●●
●
●●
●●
●●
●
●
●●●
●●●●●●
●●●●
●●
●●●●●
●
●●
●
●●●●
●
●●●●●●
●
●
●
●
●
●●
●●
●
●●●
●●
●
●
●
●
●●
●
●●●●●
Oxygen
−2 −1 0 1 2 3 4
●●●●●
●
●
●
●
●●
●
●
●●●
●
●
●
●●●●
●
●●●●
●●●
●●●
●●●●●●
●●●●●●●●
●●●●
●
●
●
●●●●
●
●●
●●
●●
●●●●●●
●
●●
●
●●●●
●●
●●●●●●●
●●
●●●●
●●●●●
Oxygen
●
●
●
●●●
●●●
●●
●
●●
●●●●●
●●
●
●●●
●
●
●●●
●●●●●●
●
●●●●●●●●
●●
●●
●●
●●●●●●●
●●●●
●
●●
●●●●
●●
●●●
●
●
●●●●●●●
●
●
●●●●
●
●●
●●
●●
●
●
●●
Oxygen
−2 −1 0 1 2 3 4
●●●●
●
●
●
●
●
●●●
●●●●●●●
●
●●
●●
●
●●
●
●
●●
●●
●●●●
●
●
●●●
●●●●●
●
●
●●●●●●●●
●●●
●●●●
●●
●
●
●●●●●●
●
●●
●●●
●
●●
●
●
●
●●
●
●●
●
●●●●●●
●●
Oxygen
Figure: Trellis graph with three continuous variablesKai Xiong Data Exploration using R 36/47
Three VariablesTwo Qualitative Variables and One Quantitative Variable
IQ (
in 1
0s)
5
10
15
Female Male
●
●
●
●
●
●
●●
●
●
Biology
Female Male
●
●
●
●●●
●
●●
●
●●
●
●
Statistics
Figure: Trellis graph with two continuous variables and one qualitativevariable
Kai Xiong Data Exploration using R 37/47
Three VariablesTwo Quantitative Variables and One Qualitative Variable
Iris flower data set
Four continuous measure on Petal Length, Petal Width, SepalLength and Sepal Width.
One nominal variable, Species.
Suppose we are interested in the relationship between PetalLength and Petal Width, and how such relationship changesfor different species.
1 Plot Petal Length against Petal Width.2 Use different plotting character/color to represent different
species.
Kai Xiong Data Exploration using R 38/47
Three VariablesTwo Quantitative Variables and One Qualitative Variable
●●● ●●
●
●
●●
●
●●
●●
●
●●
● ●●
●
●
●
●
●●
●
●● ●●
●
●
●●●●
●
● ●
●●
●
●
●
●
●●●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
1 2 3 4 5 6 7
0.5
1.0
1.5
2.0
2.5
Petal.Length
Pet
al.W
idth
Figure: Iris scatter plot
Kai Xiong Data Exploration using R 39/47
Three VariablesTwo Quantitative Variables and One Qualitative Variable
1 2 3 4 5 6 7
0.5
1.0
1.5
2.0
2.5
Petal.Length
Pet
al.W
idth
●●● ●●
●
●
●●
●
●●
●●
●
●●
● ●●
●
●
●
●
●●
●
●● ●●
●
●
●●●●
●
● ●
●●
●
●
●
●
●●●●
● setosaversicolorvirginica
Figure: Iris scatter plot with different species
Kai Xiong Data Exploration using R 40/47
Three VariablesTwo Quantitative Variables and One Qualitative Variable
Petal.Length
Pet
al.W
idth
0.0
0.5
1.0
1.5
2.0
2.5
1 2 3 4 5 6 7
●●● ●●
●●
●●●●●
●●●
●●● ●●
●
●
●
●
●●
●
●● ●●
●
●●●●●●
● ●●●●
●
●●
●●●●
setosa0.0
0.5
1.0
1.5
2.0
2.5versicolor
0.0
0.5
1.0
1.5
2.0
2.5virginica
setosaversicolorvirginica
●
Figure: Iris trellis scatter graphs
Kai Xiong Data Exploration using R 41/47
More Than Three VariablesPairs Plot
Sepal.Length
2.0 3.0 4.0
●●
●●
●
●
●
●
●
●
●
●●
●
● ●●
●
●
●●
●
●
●●
● ●●●
●●
●●
●
●●
●
●
●
●●
● ●
● ●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●
●●
●
●●
●●
●●
●● ●
●●
●●●●
●
●
●
●
●●●
●●
●
● ●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
● ●
●●
●●
●
●
●
●
●
●
●
● ●●
●●
●
●●●
●
●●
●
●●●
●
●●●
●●
●●
●●●●
●
●
●
●
●
●
●
●●
●
●●●●
●
●●
●
●
●●
●●●●
●●
●●●
●●
●
●
●
●●
●●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●
●●
●
●●
●●
●●●
●●
●●
●●●
●
●
●
●
●
●● ●
●●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●●
●●
●
●
●
●
●
●
●
●●●
●●
●
●●●
●
●●
●
●●
●
●
●●●
●●●
●
0.5 1.5 2.5
●●●●
●
●
●
●
●
●
●
●●
●
● ●●
●
●
●●
●
●
●●● ●●●
●●
●●
●
●●
●
●
●
●●
●●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●
●●
●
●●
●●
●●
●● ●
●●
●●●
●
●
●
●
●
●●●
●●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
● ●
●●
●●
●
●
●
●
●
●
●
●●●
●●
●
●●●
●
●●
●
●●
●
●
● ●●
●●
●●
4.5
5.5
6.5
7.5
●●●●
●
●
●
●
●
●
●
●●
●
●●●●
●
●●●
●
●●●●●●
●●
●●●
●●
●
●
●
●●
●●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●●●●●●●●●
●●●●●●
●
●
●
●
●●●
●●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●●
●●
●
●
●
●
●
●
●
●●●
●●
●
●●●
●
●●
●
●●●
●
●●●
●●●●
2.0
3.0
4.0
●
●●
●
●
●
● ●
●●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●●
●●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●●
●●
●●
●●
●
●
●●●
●
●
●●
●●●
●
●●
●
●
●
●
●Sepal.Width
●
●●●
●
●
●●
●●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●●●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●●●●
●
●●
●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●●●
●●
●
●●●
● ●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
●
●
●●● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●●
●●
●●
●●
●
●
●●●
●
●
●●
●●●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●●●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●●
●●
●●●
● ●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●●
●●
●●
●
●
●●●
●
●
●●
● ●●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●●●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●●●●
●
●●
●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●●●●●●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●●
●●●●●
●
●●●
●
●
●●●●●
●
●●
●
●
●
●
●
●●●● ●
●● ●● ● ●●
●● ●
●●●
●●
●●
●
●●
●●●●●● ●● ●●
● ●●●●
●●●●●
●●
● ●●
●●
●
●
●●●
●
●
●●
●●
●
●
●●●
●
●
●
●
●●
●●●
●
●
●●●
●
●
● ●●
●●●
●●
●
●
●●● ●
●
●
●
●
●●
●
●
●
●
●●
●●
●
●●●●
●●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●●
●
●●
●●
●●
●●
●●
●
●● ●● ●
●●●● ● ●●
●● ●
●●●
●●
●●
●
●●
● ●●●●● ● ●●●● ●●●
●●● ●●
●
●●
● ●●
●●
●
●
●●●
●
●
●●
●●
●
●
●●●
●
●
●
●
●●
●●●
●
●
●●●
●
●
● ●●
●●●
●●
●
●
● ●●●
●
●
●
●
●●
●
●
●
●
●●
●●
●
● ●●
●
●●
●
●
●
●
●
●●
● ●
●●
●●
●
●
●
●
●●
●
●●
●●
●●
●●
●●
●
Petal.Length
●●●●●
●●●●●●●●
●●●●●
●●
●●
●
●●● ●●●●● ●●●●●●
●●●
●●●●
●
●●●●●
●●●
●
●●●
●
●
●●
●●
●
●
●●●
●
●
●
●
●●
●●●
●
●
●●●
●
●
●●●
●●●
●●
●
●
●●●●
●
●
●
●
●●
●
●
●
●
●●
●●
●
● ●●
●
●●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●●
●
●●
●●
●●
●●
●●
●
12
34
56
7
●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●
●●●●●
●●●
●
●●●
●
●
●●
●●
●
●
●●●●
●
●
●
●●●●●●
●
●●●●
●
●●●●●●●●
●
●
●●●●
●
●
●
●
●●●
●
●
●
●●
●●●
●●●●
●●
●
●
●
●
●
●●
●●
●●●●
●
●
●
●
●●
●
●●
●●
●●
●●●●●
0.5
1.5
2.5
●●●● ●●
●●●
●●●
●●●
●●● ●●
●●
●
●
●●●
●●●●●
●●●● ●
●● ●
●●●
●●
●●● ●●
●● ●
●●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●●
●●
●●
●●
●●●
●
●●
●●
●●●●
●●
●
●●● ●
●●
●
●●
●
●●
●●●
●
●●
●●
●●
●
●●
●
●
● ●●
●
●●●
●
●
●●
●
●●
●●
●●
●
●●
●
●●●
●●
●
●
●● ●● ●●
●●●
●●●
●●●
●●● ●●
●●
●
●
●●●
●●●●●
●●●● ●
●● ●
●●●
●●
●●● ●●
●●●
●●●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●●
●●
●●
●●
●●●
●
●●
●●
● ●●●
●●
●
●●
●●●
●
●
●●
●
●●
●●●
●
●●
●●
●●
●
●●
●
●
●●●
●
●● ●
●
●
●●
●
●●
●●
●●
●
●●
●
●●
●
●●
●
●
●●●●●●
●●●●●●●●
●●●●●●
●●
●
●
●●●●●●●●
●●●●●●●●●●●
●●
●●●●●
●● ●
●●●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●●
●●● ●
●●
●●●●
●●●●
●●●●●
●●
●●●●
●●
●
●●
●
●●
●●●
●
●●●
●
●●
●
●●
●
●
● ●●
●
●●●
●
●
●●
●
●●
●●
●●
●
●●
●
●●
●
●●
●
●
Petal.Width
●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●
●●●●●●●●●●●
●●●●●●●
●●●●●●
●
●
●●
●
●
●
●●●●
●
●
●
●
●●
●●●●
●●
●●●●
●●●●●●●●●●●
●●●●●●
●
●●
●
●●
●●●
●
●●●●
●●
●
●●
●
●
●●●
●
●●●
●
●
●●●
●●
●●
●●
●
●●
●
●●●
●●
●
●
4.5 5.5 6.5 7.5
●●●● ● ●● ●● ● ●●●● ●●●● ●● ●●● ●●●●●●●● ●● ●●● ●●● ●●●● ●●● ●● ●●
●● ●● ●● ●● ●●● ●●●● ●●● ●● ●●●● ●●●●●●●● ●●● ● ●●●●● ●●● ●●● ●● ●
●● ●●● ●● ●● ●●● ●●● ●● ●●● ●● ●● ● ●●● ● ●● ●●●● ●●●● ●●●● ●●●●●●●
●● ●● ● ●●●● ● ●●●● ● ●●● ●●● ●●●●● ●●●●● ● ●●●● ●●● ●●● ● ● ●● ●● ●●
●●●● ●● ●● ●●● ●● ●● ●●●● ● ●●● ●●●● ●●●●● ●● ● ●●● ●●● ●●● ● ●●●● ●
●● ●●●●● ●● ●●● ●● ● ●● ●●● ●●●● ●●● ●● ●● ●●●● ● ●●●●●●● ●●●● ● ●●
1 2 3 4 5 6 7
●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●
●● ●● ●●●● ●●● ●● ●● ●●● ●● ●● ●●●● ●●●●●●● ●●●●●●● ●●●● ●●●●● ●
●● ●●● ●● ●●●●●●●●●● ●●● ●● ●● ●●●● ●●●●●● ● ●●●● ●●●● ●●●●●●●
●●●●● ●●●●●●●●●● ●●●●●● ●● ●●● ●●●●● ●●●●●●●●●●●● ●●●●●●●
●●●● ●● ●● ●●● ●● ●●●●● ●● ●● ●●●●● ●●●●● ● ●●●●●●●● ●●● ●●●●● ●
●● ●● ●●●●● ●●● ●● ●●● ●●● ●●●● ●●●● ●● ●● ●●● ●●●● ● ●●● ● ●●●● ●●
1.0 2.0 3.0
1.0
2.0
3.0
Species
Figure: Pairs Plot
Kai Xiong Data Exploration using R 42/47
More Than Three Variables
Petal.Length + Petal.Width
Sep
al.L
engt
h +
Sep
al.W
idth
34
5
0.5 1.0 1.5
●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●●
●
●
●
●●
●●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●●
●●
●
●
●
●●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●
●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
setosa
23
45
67
1 2 3 4 5
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●●
●●
●●
●●
●
●●●
●●
●
●
●
●
●● ●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●●
●●
●●●
●
●
●●●
●●
●
●
●
●
●●●
●
●
●
●●●
●
●
●
●●●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●●
●●
●●●
● ●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
●●●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●●
●●
●●●
● ●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
versicolor
23
45
67
8
2 3 4 5 6 7
●
●
●
●●
●
●
●
●
●
● ●
●
●●
● ●
● ●
●
●
●
●
●
●
●
●●
●
●●
●
●●●
●
●●
●
●●
●
●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
● ●
●●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●●
●
●●
●
●●
●
●
● ●●
●●
●
●
●
●
●● ● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●●
● ●
●●
●●
●
●
●●●
●
●
●● ● ●●
●
●●
●
●
●
●
●
●
●
●● ●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●●
●●
●●
●
●
●●●
●
●
●● ● ●●
●
● ●
●
●
●
●
●
virginica
Sepal.Length * Petal.LengthSepal.Length * Petal.WidthSepal.Width * Petal.LengthSepal.Width * Petal.Width
●
●
●
●
Figure: Iris trellis graphs with more than three variablesKai Xiong Data Exploration using R 43/47
More Than Three Variables
Petal.Length
Pet
al.W
idth
0.0
0.5
1.0
1.5
2.0
2.5
1 2 3 4 5 6 7
●●●●●
●●●●●●●
Day 1
●●●
●●● ●●
●
●
●
●
Day 2
●●
●
●●●●
●
●●●●
Day 3
1 2 3 4 5 6 7
0.0
0.5
1.0
1.5
2.0
2.5
●●
●●●●●
●
●●
●●●●
Day 4
setosa versicolor virginica●
Figure: Iris trellis graphs with measurements from four different daysKai Xiong Data Exploration using R 44/47
Multivariate
Suppose a dataset has n observations (rows) and p variables(columns).
The n observations lie in a p dimensional space.
To visualise the data, we apply some mathematical proceduresto reduce the dimensionality of the data cloud, optimally intotwo dimensions.
1 Throw away some/lots of variances/information, TANSTAAFL.2 Plot the reduced data into a two/three dimensional graphs,
and visualise the MAIN components of variance.
Kai Xiong Data Exploration using R 45/47
MultivariatePrincipal Component Analysis (PCA)
−3 −2 −1 0 1 2 3 4
−3
−2
−1
01
23
97.8% of the total variance is explainedin this two dimensional graph
Component One (92.5%)
Com
pone
nt T
wo
(5.3
%)
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
● ●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● setosaversicolorvirginica
Figure: PCA on iris data with the first two components
Kai Xiong Data Exploration using R 46/47
Recommended Reading
Kai.http://www.stat.auckland.ac.nz/~kxio001.
Paul Murrell.R Graphics.Chapman & Hall/CRC, Boca Raton, FL, 2005.
Kai Xiong Data Exploration using R 47/47