NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA
Lecture 2. Exploratory Data Analysis
Types of variables
Simple diagrams
Summary statistics (i) Location (ii) Dispersion (iii) Skewness and kurtosis
Transformations
Density estimation
Graphical display (i) Univariate data (ii) Bivariate and multivariate data
Outliers
Leverage and influence
Software
EXPLORATORY DATA ANALYSIS
TYPES OF VARIABLES
1) discrete e.g. counts
2) continuous e.g. pH, elevation
Both are random variables or variates, with random variation.
TABULAR PRESENTATION
Raw data
Frequency tables
Value   or Range    Frequency   Cumulative Frequency   % CF
0       0 - 0.99    3           3                      2
1       1 - 1.99    8           11                     6
2       2 - 2.99    3           14                     11
...     ...         ...         ...                    ...
SIMPLE DIAGRAMS
Dot diagram Line diagram or profile
Histogram
Frequency graph or cumulative frequency graph
n/10 bins
CONTINUOUS VARIABLES
DISCRETE VARIABLES
DISCRETE OR CONTINUOUS VARIABLES
HISTOGRAM BIN WIDTH Wand (1997) Amer. Statistician 51, 59-64
Histograms of the British Incomes Data based on (a) the bin width ĥ2, (b) the bin width ĥ0, and (c) the S-PLUS default bin width.
Optimal solution
$$\hat{h}_2 = \left[\frac{6}{-\hat{\psi}_2(g_{21})\,n}\right]^{1/3}$$
where g21 is a band-width parameter and ψ̂2 is a “normal scale” estimator
Solution of ψ2 and g21 is iterative, to optimise a function, the MEAN INTEGRATED SQUARED ERROR
Bin width ĥ0
$$\hat{h}_0 = 3.49\,\hat{\sigma}\,n^{-1/3}$$
where σ̂ = standard deviation, n = sample size
S-PLUS default
$$\hat{h} = \frac{\text{range of data}}{\log_2 n + 1}$$
Histogram Bin Width
In R, a good option for choosing the number of histogram bins is given by the Freedman-Diaconis rule, which is:
$$\text{number of bins} = \left\lceil \frac{n^{1/3}\,(\max - \min)}{2\,(Q_3 - Q_1)} \right\rceil$$
where n is the number of observations, max − min is the range of the data, and Q3 − Q1 is the inter-quartile range. The brackets ⌈ ⌉ represent the ceiling, which means that you round up to the next integer, thereby avoiding 4.2 bins! This is equivalent to a bin width of 2 (Q3 − Q1) n^(−1/3).
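A minimal sketch of the Freedman-Diaconis bin count in Python, using only the standard library. The function name fd_bins and the quartile definition (the "inclusive" method of statistics.quantiles) are assumptions made for illustration; other quartile definitions can give a slightly different count.

```python
import math
import statistics


def fd_bins(values):
    """Number of histogram bins from the Freedman-Diaconis rule."""
    n = len(values)
    # Quartiles by linear interpolation (one of several common definitions).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("inter-quartile range is zero; the rule is undefined")
    data_range = max(values) - min(values)
    # Ceiling: always round up to a whole number of bins.
    return math.ceil(n ** (1 / 3) * data_range / (2 * iqr))


sample = list(range(1, 17))  # hypothetical data: 1, 2, ..., 16
print(fd_bins(sample))
```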
Exploratory Data Analysis
1. Summary Statistics
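(A) Measures of location ‘typical value’
(1) Arithmetic mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
(2) Weighted mean
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
(3) Mode ‘most frequent’ value
(4) Median ‘middle value’ Robust statistic
(5) Trimmed mean 1 or 2 extreme observations at both tails deleted
(6) Geometric mean
$$GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$$
$$\log GM = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
$$GM = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log x_i\right)$$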
R
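The measures of location can be sketched in Python as follows. The function names and the small data-set are hypothetical, chosen so that one extreme value (100) shows how the mean is pulled upwards while the median, trimmed mean and geometric mean are not.

```python
import math
import statistics


def weighted_mean(x, w):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def geometric_mean(x):
    """Antilog of the mean of the logs; all values must be > 0."""
    return math.exp(sum(math.log(xi) for xi in x) / len(x))


def trimmed_mean(x, k=1):
    """Mean after deleting the k smallest and k largest observations."""
    s = sorted(x)
    return statistics.mean(s[k:len(s) - k])


x = [1, 2, 4, 8, 100]  # hypothetical data with one extreme value
print("mean          ", statistics.mean(x))
print("median        ", statistics.median(x))
print("trimmed mean  ", trimmed_mean(x))
print("geometric mean", geometric_mean(x))
```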
(B) Measures of dispersion
A 13.99 14.15 14.28 13.93 14.30 14.13
B 14.12 14.10 14.15 14.11 14.17 14.17
B has smaller scatter than A: ‘better precision’
Precision: random error, scatter (replicates)
Accuracy: systematic bias
(1) Range A = 0.37 B = 0.07
(2) Interquartile range ‘percentiles’
Q1, Q2 and Q3 divide the ordered data into four groups, each holding 25% of the values; the interquartile range is Q3 − Q1
(3) Mean absolute deviation (mean absolute difference)
$$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
i.e. ignore negative signs
x          1  5  8  2     (x̄ = 4)
|x − x̄|   3  1  4  2     sum = 10, 10/n = 2.5
(5) Coefficient of variation
Relative standard deviation, percentage relative SD (independent of units)
$$CV = \frac{s}{\bar{x}} \times 100$$
(s = SD, x̄ = mean)
(6) Standard error of mean
$$SEM = \sqrt{\frac{s^2}{n}}$$
(B) Measures of dispersion (cont.)
(4) Variance and standard deviation
Variance = mean of squares of deviation from mean
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Root mean square value
$$SD = s = \sqrt{s^2}$$
R
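As a rough check on the dispersion measures, the sketch below computes them for the replicate series A and B given above. The function name dispersion is hypothetical; the variance uses the n − 1 divisor.

```python
import math
import statistics

A = [13.99, 14.15, 14.28, 13.93, 14.30, 14.13]
B = [14.12, 14.10, 14.15, 14.11, 14.17, 14.17]


def dispersion(x):
    """Return the main dispersion measures for a list of values."""
    n = len(x)
    mean = statistics.mean(x)
    sd = statistics.stdev(x)  # square root of the n - 1 variance
    return {
        "range": max(x) - min(x),
        "mean_abs_dev": sum(abs(xi - mean) for xi in x) / n,
        "variance": sd ** 2,
        "sd": sd,
        "cv_percent": 100 * sd / mean,
        "sem": sd / math.sqrt(n),
    }


for name, series in (("A", A), ("B", B)):
    summary = dispersion(series)
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

Series B should show the smaller value for every measure, which is what ‘better precision’ means here.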
(C) Measures of skewness and kurtosis
“moment statistics”
Central moment = $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$
r = 1 deviation from mean = 0
r = 2 variance
g1 skewness, r = 3
$$g_1 = \frac{1}{n s^3}\sum_{i=1}^{n} (x_i - \bar{x})^3$$
[third central moment divided by sd³]
g2 kurtosis, r = 4
$$g_2 = \frac{1}{n s^4}\sum_{i=1}^{n} (x_i - \bar{x})^4 - 3$$
Skewness - measure of how one tail of curve is drawn out
Kurtosis - measure of peakedness of curve
g1 skewness measure g2 kurtosis measure
negative g1 skewness to left
positive g1 skewness to right
negative g2 platykurtosis: flatter, lighter tails
positive g2 leptokurtosis: taller peak, heavier tails
Skewness and kurtosis
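A minimal sketch of the moment statistics g1 and g2. It assumes that s is computed with divisor n (i.e. s² is the second central moment); with the n − 1 divisor the values differ slightly for small samples. The function names and data are hypothetical.

```python
def central_moment(x, r):
    """r-th central moment: mean of (x_i - mean)^r."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** r for xi in x) / n


def g1(x):
    """Skewness: third central moment divided by sd cubed."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def g2(x):
    """Kurtosis: fourth central moment divided by sd^4, minus 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(g1(symmetric), g2(symmetric))
print(g1(right_skewed), g2(right_skewed))
```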
DATA TRANSFORMATIONS
(1) Comparability
(2) Better fit to model
Comparability
Data centring - deviations from mean
$$x_i^* = x_i - \bar{x}$$
Data standardisation - zero mean, unit variance
$$x_i^* = \frac{x_i - \bar{x}}{sd}$$
Scaling by range
$$x_i^* = \frac{x_i}{\text{range}}$$
Better fit
Normal distribution, described by x̄ and sd
x̄ ± 1 sd = 68% of values
x̄ ± 2 sd = 95% of values
Often find data skewed to right, positive g1
Log-normal distribution
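Centring, standardisation and scaling by the range can be sketched as below. The function names are hypothetical, and the standard deviation is assumed to use the n − 1 divisor.

```python
import statistics


def centre(x):
    """Deviations from the mean."""
    mean = statistics.mean(x)
    return [xi - mean for xi in x]


def standardise(x):
    """Zero mean, unit variance."""
    sd = statistics.stdev(x)
    return [ci / sd for ci in centre(x)]


def scale_by_range(x):
    """Divide each value by the range (max - min)."""
    data_range = max(x) - min(x)
    return [xi / data_range for xi in x]


x = [2, 4, 6, 8]  # hypothetical values
print(centre(x))
print(standardise(x))
print(scale_by_range(x))
```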
LOG-NORMAL DISTRIBUTION PROPERTIES
Geometric mean = median of log-normal distribution
Antilog of mean of log values = geometric mean
SD of log values ≈ CV of original values if SD is small
If SD larger, $CV = [\exp(S^2) - 1]^{0.5}$, where S is the SD of the natural-log values
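The relation between the SD of the log values and the CV of the original values can be checked numerically. This sketch assumes natural logarithms; the function name lognormal_cv is hypothetical.

```python
import math


def lognormal_cv(s):
    """CV of a log-normal variable whose natural-log values have SD s."""
    return math.sqrt(math.exp(s ** 2) - 1)


# For small s the CV is close to s itself; for larger s it grows faster.
for s in (0.05, 0.1, 0.5, 1.0):
    print(s, round(lognormal_cv(s), 4))
```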
How to decide whether to log transform?
(1) Look at histograms. Right skewed (positive g1): log transform.
(2) If sd > mean, or maximum value of variable > 20 times the smallest value, use log xi or log (xi + 1).
(3) Improves normality.
(4) Gives less weight to ‘dominants’. VARIANCE STABILISING.
(5) Reflects linear response of many species to log of chemical variables, i.e. log response over certain ranges.
(6) In regression need normally distributed random errors. Log transformation can help to achieve this.
NORMAL AND LOG-NORMAL DISTRIBUTIONS
                        Normal             Log-Normal
Effects                 Additive           Multiplicative
Shape                   Symmetric          Skewed
Mean                    x̄, arithmetic      x̄*, geometric
Standard deviation      s, additive        s*, multiplicative
Measure of dispersion   cv = s/x̄           s*
Confidence interval
  68.3%                 x̄ ± s              x̄* ×/ s*
  95.5%                 x̄ ± 2s             x̄* ×/ (s*)²
  99.7%                 x̄ ± 3s             x̄* ×/ (s*)³
×/ = times / divide (cf. ± plus / minus); cv = coefficient of variation
METHODS FOR DESCRIBING LOG-NORMAL DISTRIBUTIONS
Graphical methods
Frequency plots, histograms, box plots
Parameters
Logarithm of x
Mean
Median
Standard deviation
Variance
Skewness and kurtosis of x
Problems
What logarithm base to use?
Parameters are not on the scale of the original data
Appear to be very common in the real world
Limpert, E, et al. 2001 BioScience 51 (5), 342-352
DATA TRANSFORMATIONS
(1) Environmental variable skewed to right
log-normal distribution
If SD > mean or maximum value of x > 20 times the smallest, use log (x + c) transformation where c is constant, usually 1.
(2) Biological data
- Stabilise variances
- Dampen effects of very abundant taxa
Choices
- No transformation
- Square root
- Log (y + 1)
- % data: square root
- Counts: log (y + 1)
Other transformations:
(1) square root $\sqrt{x}$
(2) cubic root $\sqrt[3]{x}$
(3) fourth root $\sqrt[4]{x}$
(4) log2: log2 (x + 1)
(5) logp: logp (x + 1)
(6) Box-Cox transformation - most appropriate value for exponent λ
$$x^* = \frac{x^{\lambda} - 1}{\lambda} \quad \text{where } \lambda \neq 0$$
$$x^* = \log x \quad \text{where } \lambda = 0$$
If λ = 1 no transformation
λ = 0.5 square root
λ = -1 reciprocal transformation
λ = 0 log transformation
If x = 0.0, add 0.5 or 1.0 as constant
Can also solve for best estimate of constant to add
Can calculate confidence limits for λ.
If these include 1, no need for a transformation!
TRANSFOR
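A minimal sketch of the Box-Cox transformation for a single positive value. The function name box_cox is hypothetical, and the sketch does not estimate the best λ; it only applies a chosen one.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x with exponent lam."""
    if x <= 0:
        raise ValueError("x must be > 0; add a constant (e.g. 0.5 or 1.0) first")
    if lam == 0:
        return math.log(x)  # limiting case of (x**lam - 1) / lam
    return (x ** lam - 1) / lam


for lam in (1, 0.5, 0, -1):
    print(lam, box_cox(4.0, lam))
```

With λ = 1 the values are only shifted by 1, so their shape is unchanged; as λ approaches 0 the result approaches log x.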
DENSITY ESTIMATION
A useful alternative to histograms is non-parametric density estimation which results in a smoothing of the histogram.
The kernel-density estimate at the value x of a variable X is given by
$$\hat{f}(x) = \frac{1}{nb}\sum_{j=1}^{n} K\left(\frac{x - x_j}{b}\right)$$
where xj are the n observations of X, K is a kernel function (such as the normal density), and b is a bandwidth parameter influencing the amount of smoothing. Small bandwidths produce rough density estimates, whereas large bandwidths produce smoother estimates.
Note that the histogram has been scaled to the density estimates, not the raw frequencies.
Multiple approaches
1. Histogram with density scaling (areas of histogram bars sum to 1)
2. Density estimation (default) (thick line)
3. Density estimation (half the default bandwidth) (thin line)
4. One-dimensional scatter-plot ("rugplot") to show distribution of observations at the bottom
Fox, 2002
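The kernel-density estimate can be sketched directly from its definition, using the normal density as the kernel K. The function names, the data and the bandwidths are hypothetical; no automatic bandwidth selection is attempted.

```python
import math


def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, data, b):
    """Kernel-density estimate at x with bandwidth b."""
    n = len(data)
    return sum(gaussian_kernel((x - xj) / b) for xj in data) / (n * b)


data = [1.0, 2.0, 2.5, 4.0, 7.0]  # hypothetical observations
for b in (0.5, 1.0):  # smaller bandwidth gives a rougher estimate
    print(b, [round(kde(x, data, b), 3) for x in (1, 2, 3, 4, 5, 6, 7)])
```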
QUANTILE-QUANTILE PLOTS
Quantile-quantile (Q-Q) plots are useful tools for determining if data are normally distributed. They show the relationship between the distribution of a variable and a reference or theoretical distribution.
Q-Q plot shows the relationship between the ordered data and the corresponding quantiles of the reference (in our case, normal) distribution.
If the data are normally distributed, they should plot on a straight line through the 1st and 3rd quartiles. If there is a break in slope of the plotted points, the data deviate from the reference distribution.
Note that quantiles are divisions of a frequency or probability distribution into equal, ordered subgroups (e.g. quartiles (4 parts) or percentiles (100 parts)).
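The coordinates of a normal Q-Q plot can be sketched as below. The plotting positions (i − 0.5)/n are an assumption (several conventions exist), and the reference distribution is a normal with the sample mean and SD. The function name normal_qq is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq(data):
    """Return (theoretical quantile, ordered observation) pairs."""
    n = len(data)
    ordered = sorted(data)
    reference = NormalDist(mean(data), stdev(data))
    # Cumulative probability assigned to the i-th ordered value.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(reference.inv_cdf(p), obs) for p, obs in zip(probs, ordered)]


for theoretical, observed in normal_qq([1, 2, 3, 4, 5]):
    print(round(theoretical, 3), observed)
```

If the data are close to normal, the pairs fall near a straight line when plotted.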
J.W. Tukey
(1) Stem-and-leaf displays
55 62 73 78 79 78 81
STEM | LEAF
   5 | 5
   6 | 2
   7 | 3 8 8 9
   8 | 1
“back-to-back”
      4 2 1 | 5 | 1 1 2 3 6 7
        4 3 | 6 | 3 4
9 7 5 5 3 2 | 7 | 1
        5 3 | 8 | 1 9
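A stem-and-leaf display for two-digit values can be sketched as follows, using the seven values listed above. The function name is hypothetical, and the sketch assumes the stem is the tens digit and the leaf the units digit.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted integer values by tens digit (stem) and units digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))


data = [55, 62, 73, 78, 79, 78, 81]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```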
EXPLORATORY DATA ANALYSIS
GRAPHICAL DISPLAY
Univariate data
(2) Box-and-whisker plots - box plots
95% CI around median: Median ± 1.58 (Q3 − Q1) / √n, where Q1 and Q3 are the quartiles
(3) Hanging histograms
Variations of box plots
McGill et al. Amer. Stat. 32, 12-16
Useful to label extreme points
Fox, 2002
Box plots for samples of more than ten wing lengths of adult male winged blackbirds taken in winter at 12 localities in the southern United States, and in order of generally increasing latitude. From James et al. (1984a). Box plots give the median, the range, and upper and lower quartiles of the data.
Useful to apply several approaches EDA tools
Bivariate and multivariate data
Simple scatter plot (x2 plotted against x1)
SCATTERPLOT MATRIX. The data are measurements of ozone, solar radiation, temperature, and wind speed on 111 days. Thus the measurements are 111 points in a four-dimensional space. The graphical method in this figure is a scatterplot matrix: all pairwise scatterplots of the variables are aligned into a matrix with shared scales.
Triangular arrangement of all pairwise scatter plots for four variables. Variables describe length and width of sepals and petals for 150 iris plants, comprising 3 species of 50 plants.
Three-dimensional perspective view for the first three variables of the iris data. Plants of the three species are coded A,B and C.
Can explore scatter-plot by adding box-plots for each variable, add simple linear regression line, add smoother (LOWESS – see Lecture 5), and label particular points.
Fox, 2002
Categorical variables can be encoded in a plot by using different symbols or colours for each category (e.g. type of occupation) and smoothers fitted for each category.
bc = blue collar, prof = professional, wc = white collar
Fox, 2002
Jittering scatter-plots
Discrete quantitative variables usually result in uninformative scatter-plots (e.g. education (years) and vocabulary (score on 0-10 scale)).
Only 21 distinct education values and 11 scores, so only 21 x 11 = 231 plotting positions.
Jittering data adds a small random quantity to each value to try to separate over-plotted points. Can vary the amount of jittering and also plot a smoother. Fox, 2002
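Jittering can be sketched as adding a small uniform random quantity to each value. The function name, the jitter amount of 0.25 and the example values are hypothetical; the amount would normally be tuned to the spacing of the discrete values.

```python
import random


def jitter(values, amount=0.25, seed=None):
    """Add uniform noise in [-amount, +amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


education = [12, 12, 12, 16, 16]   # hypothetical years of education
vocabulary = [6, 6, 6, 8, 8]       # hypothetical scores on a 0-10 scale
points = list(zip(jitter(education, seed=1), jitter(vocabulary, seed=2)))
for x, y in points:
    print(round(x, 3), round(y, 3))
```

Without jittering, the five observations would occupy only two plotting positions.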
Bivariate density estimation and scatter-plots
Large data-sets and weak relationships between variables.
Improve plot by jittering and making symbols smaller and apply bivariate kernel-density estimate plus regression line and LOWESS smoother.
Fox, 2002
coal-fired power station
oil-fired power station
Diagonal = density estimate for each variable
The Bagplot: A Bivariate Boxplot
Peter J. Rousseeuw
The American Statistician November 1999, Vol. 53, No. 4, 382
Car weight and engine displacement of 60 cars.
Part (a) shows the concentrations of cholesterol and triglycerides in the plasma of 320 patients. In part (b) logarithms are taken of both variables.
Part (a) shows the altitudinal range and abundance of butterflies. In part (b) the logarithm of the abundance is plotted.
Bagplot matrix of the three-dimensional aquifer data
with 85 data points.
Conditioning plots (Co-plots)
Focus on relationship between response and a predictor variable, holding other predictors constant at particular values – conditionally fixing the values of other predictors. 'Statistical control'
Co-plots provide graphical statistical control.
Focus on particular predictor and set each other predictor to a relatively narrow range (if quantitative) or to a specific value (if categorical). Subranges for a quantitative predictor are typically set to overlap (called "shingles") rather than to partition data into disjoint subsets ("bins").
For each combination of values of the conditioning predictors, construct scatter-plot to show response to the local predictor and arrange the plots in an array.
Can condition on more than one predictor (e.g. age, gender).
Six overlapping age classes, two genders (male upper, female lower), LOWESS, and linear fits
Fox, 2002
EDA and Data-Transformations
Try to linearise non-linear relationships by trial-and-error.
Mosteller & Tukey's 'bulging rule'.
Fox, 2002
When bulge points down, transform y down the ladder of powers and roots;
when the bulge points up, transform y up,
when the bulge points left, transform x down;
when the bulge points right transform x up.
Infant mortality rate and GDP per capita for 193 countries
Points down and to left, try powers and roots
Log transformation linearising, variables more symmetric
Fox, 2002
Profiles, Stars, Glyphs, Faces, and Boxes of Percentages of Republican Votes in Six Presidential Elections in Six Southern States. The circles in the Stars Are Drawn at 50%. The Assignment of Variables to Facial Features in the Faces is: 1932 – Shape of Face; 1936 – Length of nose; 1940 – Curvature of Mouth; 1960 – Width of Mouth; 1964 – Slant of Eyes; 1968 – Length of Eyebrows
Simple multivariate data
Three types of shape for representing multivariate data. In these examples glyph, stars and faces represent five, six and twelve (!) variables respectively.
Frequency of the six commonest species on the Park Grass plots using star displays.
Polygon plots
Labelled polygon plot
Chernoff faces
CHERNOFF
American city crime data
City          Murder/Manslaughter   Rape   Robbery   Assault   Burglary   Larceny   Auto theft
Atlanta       16.5                  24.8   106       147       1112       905       494
Boston        4.2                   13.3   122       90        982        669       954
Chicago       11.6                  24.7   340       242       808        609       645
Dallas        18.1                  34.2   184       293       1668       901       602
Denver        6.9                   41.5   173       191       1534       1368      780
Detroit       13                    35.7   477       220       1566       1183      788
Hartford      2.5                   8.8    68        103       1017       724       468
Honolulu      3.6                   12.7   42        28        1457       1102      637
Houston       16.8                  26.6   289       186       1509       787       697
Kansas City   10.8                  43.2   255       226       1494       955       765
Los Angeles   9.7                   51.8   286       355       1902       1386      862
New Orleans   10.3                  39.7   266       283       1056       1036      776
New York      9.4                   19.4   522       267       1674       1392      848
Portland      5                     23     157       144       1530       1281      488
Tucson        5.1                   22.9   85        148       1206       756       483
Washington    1.5                   27.6   524       217       1494       1003      739
1. Atlanta 2. Boston 3. Chicago 4. Dallas 5. Denver 6. Detroit 7. Hartford 8. Honolulu 9. Houston 10. Kansas City 11. Los Angeles 12. New Orleans 13. New York 14. Portland 15. Tucson 16. Washington
Faces representation of city crime data
CHERNOFF
Occurrence of seven vegetation groups at sites on cliffs of Snowdonia, from soils containing differing amounts of available phosphate and exchangeable calcium. The size of circles indicates the relative abundance of the vegetation.
                 1932   1936   1940   1960   1964   1968
Missouri         35     38     48     50     36     45
Maryland         36     37     41     46     35     42
Kentucky         40     40     42     54     36     44
Louisiana        7      11     14     29     57     23
Mississippi      4      3      4      25     87     14
South Carolina   2      1      4      49     59     39
Percentage of Republican Votes in Presidential Elections in Six Southern States in the Years 1932-1940, 1960-68.
A) Schematic representation of the hierarchical clustering of years by complete link of Republican vote data in six southern states. The numbers at the far left denote distances between clusters.
B) Tree for Missouri computed according to decisions (i) – (v)
Trees for Republican vote data in six southern states.
Tree of yearly yields of 15 transportation companies with all variables labelled
Tree of yearly yields of 15 transportation companies 1953-1977
FOURIER PLOTS Andrews (1972)
Plot multivariate data into a function
$$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots$$
where data are [x1, x2, x3, x4, x5 ... xm]
Plot over range -π ≤ t ≤ π
Each object is a curve. Function preserves distances between objects. Similar objects will be plotted close together.
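A minimal sketch of the Andrews (1972) function for one object, evaluated over the range -π ≤ t ≤ π. The function name andrews and the example object are hypothetical; plotting the resulting curve is left out.

```python
import math


def andrews(x, t):
    """Andrews function x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..."""
    total = x[0] / math.sqrt(2)
    for j, xj in enumerate(x[1:], start=1):
        harmonic = (j + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if j % 2 == 1:
            total += xj * math.sin(harmonic * t)
        else:
            total += xj * math.cos(harmonic * t)
    return total


obj = [1, 2, 3, 4, 5]  # hypothetical object with five variables
steps = 8
curve = [andrews(obj, -math.pi + 2 * math.pi * i / steps) for i in range(steps + 1)]
print([round(v, 3) for v in curve])
```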
MULTPLOT
Complex multivariate data
Andrews' plot for artificial data
Andrews’ plots for all twenty-two Indian tribes.
Dieldrin residues in the livers of 227 kestrels and barn owls found dead during 1970-1973. Each bird is represented by a point on the map. (Reproduced with permission from Institute of Terrestrial Ecology Annual Report for 1974).
OTHER TYPES OF GRAPHICAL DISPLAY
Map of aerial density of Sitobion avenae, 11-17 June 1984, produced using the SYMAP program. Darker areas represent higher densities on a logarithmic scale (×3 intervals). Numbers on map indicate positions of suction traps and their respective catch sizes (log3). (Reproduced with permission from Woiwod and Tatchell, 1984.)
Contour map of the aerial density (using logarithmic intervals) of the hop aphid Phorodon humuli, 28 September to 2 October 1983, produced by the program SURFACE II. Suction trap sites are marked with a +. (Reproduced with permission from Fig. 3 of Woiwod and Tatchell, 1984)
Three dimensional perspective view of the aphid densities obtained using SURFACE II. (Reproduced from Woiwod and Tatchell, 1984)
THE POWER OF GRAPHICAL DATA DISPLAY. Visualization provides insight that cannot be appreciated by any other approach to learning from data. On this graph, the top left panel displays monthly average CO2 concentrations from Mauna Loa, Hawaii. The remaining panels show frequency components of variation in the data. The heights of the five bars on the right sides of the panels portray the same changes in ppm on the five vertical scales.
Identification of ‘outliers’ or ‘rogues’.
“Observation which is, in some sense, inconsistent with the rest of the observations in the data-set. An observation can be an outlier due to the response variable(s) or any one or more of the predictor variables having values outside their expected limits.”
Identify not for rejection at this stage but for investigation and evaluation.
? Incorrect measurement, incorrect data entry, transcription or recording error.
Concept of outlier is model dependent.
OUTLIERS
LEVERAGE Potential for influence resulting from unusual values, particularly of predictor variables
INFLUENCE Observation is influential if its deletion substantially changes the results
Generalised distance of observation i plus 1/n.
$$d_i^2 = (x_i - \bar{x})'\, S^{-1}\, (x_i - \bar{x}) + \frac{1}{n}$$
Measures how extreme the observation i is from the mean vector x̄ of the complete sample.
If leverage of an observation is more than three times the average leverage, observation has high leverage. Need to check it and try to explain why it has high leverage.
Alternatively, leverage of observation i (hi) equals the ith diagonal element of the hat matrix H
$$H = X (X'X)^{-1} X'$$
where X is the n x k matrix of x values (k is the number of parameters in the model) and H is an n x n square matrix.
[Hat matrix so called because it puts “hat on Y”:
Ŷ = HY where Ŷ and Y are n x 1 vectors of predicted and observed Y values]
di2 - two or more response variables (e.g. CANOCO)
hi - one response variable (e.g. linear or multiple regression)
LEVERAGE MEASURES
Leverage ranges from 1/n to 1
Sample mean h̄ = k/n
Size-adjusted cut-off hi > 2k/n (ca. extreme 5%)
Maximum (hi)
Max (hi) ≤ 0.2 Safe
0.2 < Max (hi) ≤ 0.5 Risky
Max (hi) > 0.5 Avoid if possible
k = number of parameters
As hi approaches 1, observation i may completely control the model.
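For simple linear regression with an intercept (k = 2), the diagonal of the hat matrix reduces to hi = 1/n + (xi − x̄)² / Σ(xj − x̄)², so leverage can be sketched without matrix algebra. The function names and the x values are hypothetical; the safe/risky/avoid labels follow the thresholds listed above.

```python
def leverage(x):
    """Leverage h_i for simple linear regression with an intercept."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]


def classify_max_leverage(h_max):
    """Rule of thumb applied to the largest leverage value."""
    if h_max <= 0.2:
        return "safe"
    if h_max <= 0.5:
        return "risky"
    return "avoid if possible"


x = [1, 2, 3, 4, 5, 20]  # hypothetical predictor with one extreme value
h = leverage(x)
k, n = 2, len(x)
print([round(hi, 3) for hi in h])
print("mean leverage k/n =", k / n, " cut-off 2k/n =", round(2 * k / n, 3))
print(classify_max_leverage(max(h)))
```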
INFLUENCE MEASURES
DFBETAS
Identifies influence of observations on individual regression coefficients of the model: “LOCAL”
DFBETAS - change in regression coefficient bk, in standard errors, if observation i is deleted
$$DFBETAS_{ik} = \frac{b_k - b_{k(i)}}{s_{e(i)} / \sqrt{RSS_k}}$$
bk(i) = regression slope when i deleted
se(i) = residual standard deviation when i deleted
RSSk = residual sum of squares when i not deleted
If DFBETASik > 0, case i pulls bk up
If DFBETASik < 0, case i pulls bk down
If |DFBETASik| > 2/√n, influential case
COOK’S D
Assesses impact of observations on regression coefficients: “GLOBAL”
$$D_i = \frac{z_i^2\, h_i}{k\,(1 - h_i)}$$
zi = standardised residual
k = number of parameters
hi = leverage measure from H
If Di > 1, observation influential
If Di > 4/n (size adjusted), observation influential
High leverage - potential outlier
Low influence - good outlier, non-discordant outlier
High influence - bad outlier, discordant outlier
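Cook's D can be sketched for simple linear regression (k = 2) from the standardised residuals and the leverage values. The data are artificial and hypothetical, with one high-leverage point lying well off the trend of the others; the function name cooks_distance is also hypothetical.

```python
import math


def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression."""
    n, k = len(x), 2
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - k)           # residual variance
    h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]     # leverage
    z = [e / math.sqrt(s2 * (1 - hi)) for e, hi in zip(resid, h)]
    return [zi ** 2 * hi / (k * (1 - hi)) for zi, hi in zip(z, h)]


x = [1, 2, 3, 4, 5, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.0]  # last point is far below the trend
d = cooks_distance(x, y)
print([round(di, 3) for di in d])
print("size-adjusted cut-off 4/n =", round(4 / len(x), 3))
```

Only the last observation should exceed 1 here, marking it as a ‘bad’ (discordant) outlier in the sense used above.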
COOK'S D
‘Good’ (left) and ‘bad’ (right) outliers: ‘bad’ outliers influence the slope (artificial data)
                      ‘Good’ outlier (left)           ‘Bad’ outlier (right)
Leverage hi           0.34                            0.34
Influence DFBETASi    0.06                            -9.1
                      (much less than 2/√n = 0.2)     (much more than 2/√n = 0.2 in absolute value)
                      High leverage, low influence    High leverage, high influence
                      Non-discordant outlier          Discordant outlier
Leverage depends on x values only. Both values are ‘risky’ (between 0.2 and 0.5) and well above the size-adjusted cut-off of 2k/n = 4/100 = 0.04.
Robust leverage vs. Robust residuals plot
NEVER FORGET THE GRAPH!
“What is the use of a book, thought Alice, without pictures”
SOFTWARE FOR EXPLORATORY DATA ANALYSIS
R and S–PLUS
MINITAB
SYSTAT
AXUM