outline - r-project.org
TRANSCRIPT
Letter Value BoxplotHeike Hofmann, Karen Kafadar, Hadley Wickham
IOWA STATE UNIVERSITY
Outline
• Boxplots: Definition, Strengths & Weaknesses
• Letter Value Statistics
• Letter Value Boxplots
• Examples
• Conclusion
Boxplots
• Early Version: Tukey 1972 (Snedecor Festzeitschrift, at Iowa State University)
• Most common version in EDA (1977):
• Median (Center Line), Fourths (Box Edges), adjacent values (ends of whiskers) and extreme values
• All marks correspond to actual data values
0 1 2 3 4 5 6 7
Boxplot: Strengths
• Quick summary without overwhelming amount of detail
• Approximate location, spread, shape of distribution
• Outlier identification
• Associations among variables
0 1 2 3 4 5 6 7
Boxplots: Weaknesses
• Expected rate of labeled outliers approx 0.4+ 0.007n
• For n = 100000 expect approx. 700 outliers!
Exponential Distribution,n= 100, 1000, 10000, 100000 0
12
34
~
5
02
46
02
46
8
~~
02
46
810
12
Modifications
• Notched box-and-whisker (McGill, Larsen, Tukey 1987)
• Nonparametric density estimates
• Vase plots (Benjamini, 1988)
• Violin plots (Hintze, Nelson 1998)
• Box-percentile plots (Esty, Banfield 2003)Implementations: S routines (David James), package vioplot (Adler, Romain), package HMisc bpplot (Harre#, Banfield), examples at R Graph Ga#ery
Letter Value Statistics
• Estimate quantiles corresponding to tail areas 2-j
• Median (1/2): depth =
• Fourths (1/4): depth =
• Eights (1/8): depth =
• Boxplots show median, fourths
• Large Data Sets: tail quantiles become more reliable include LVs beyond Fourths
dM = (1 + n)/2
dF = (1 + !dM")/2
dE = (1 + !dF ")/2
k = !log2 n" − 4
k =⌈log2 n − log2
(4z2
1−α/2
)⌉+ 1
1
Letter Value Boxplot
• How many boxes to show?
• Outlier identification?
• All marks are based on actual data values
x
-3 -2 -1 0 1 2 3
LVboxplot(rnorm(1000))
Stopping Rules & Outliers
• EDA: 5-8 outliers
• Percentage of data, e.g. 0.5-1%
• uncertainty in LVi extends beyond or into LVi-1 (i.e. upper limit for LVi crosses LVi-1)
dM = (1 + n)/2
dF = (1 + !dM")/2
dE = (1 + !dF ")/2
k = !log2 n" − 4
k =⌈log2 n − log2
(4z2
1−α/2
)⌉+ 1
1
dM = (1 + n)/2
dF = (1 + !dM")/2
dE = (1 + !dF ")/2
k = !log2 n" − 4
k =⌈log2 n − log2
(4z2
1−α/2
)⌉+ 1
1
Rules lead to similar answers
... Examples
Gaussian, Exponential & Normal
~/Documents/students/Maneesha/CEL files~/Documents/students/Maneesha/CEL files
Gaussian, n=10000
x
-3 -2 -1 0 1 2 3 -3 -2 -1 0
~/Documents/students/Maneesha/CEL files
1 2 3
Exponential, n=10000
x
0 2 4 6 8 0 2 4 6 8
Uniform, n=10000
x
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Gene Expression Values
~/Documents/students/Maneesha/CEL files
T1-1
x
68
10
12
14
T1-2
x
68
10
12
14
T1-3
x
68
10
12
14
~/Documents/students/Maneesha/CEL files
T2-1
x
68
10
12
14
T2-2
x
68
10
12
14
T2-3
x
68
10
12
14
WT-1
x
68
10
12
14
WT-2
x
68
10
12
14
WT-3
x
68
10
12
14
Conclusion
• appropriate for large number of values
• based on actual data values
• simple to compute
• reduce number of labeled outliers shown in conventional boxplots
• do not depend on a smoothing parameter
Letter Value Boxplots are
Download (for now) at http://www.public.iastate.edu/~hofmann
Graphical Displays of Large Data Sets
• Quick summary without overwhelming amount of detail
• Approximate location, spread, shape of distribution
• Outlier identification
• Associations among variables
“The greatest value of a picture us when it forces us to notice what we never expected to see” (Tukey 1977)