numerical analysis of biological and environmental data lecture 2. exploratory data analysis

NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA

Lecture 2. Exploratory Data Analysis

Types of variables

Simple diagrams

Summary statistics: (i) Location (ii) Dispersion (iii) Skewness and kurtosis

Transformations

Density estimation

Graphical display: (i) Univariate data (ii) Bivariate and multivariate data

Outliers

Leverage and influence

Software

EXPLORATORY DATA ANALYSIS

TYPES OF VARIABLES

1) discrete e.g. counts

2) continuous e.g. pH, elevation

Both are random variables or variates, with random variation.

TABULAR PRESENTATION

Raw data

Frequency tables

Value   Range       Frequency   Cumulative frequency   % CF
0       0 - 0.99    3           3                      2
1       1 - 1.99    8           11                     6
2       2 - 2.99    3           14                     11
...     ...         ...         ...                    ...

(The first column is a single value or a range.)

SIMPLE DIAGRAMS

Dot diagram Line diagram or profile

Histogram

Frequency graph or cumulative frequency graph

n/10 bins

CONTINUOUS VARIABLES

DISCRETE VARIABLES

DISCRETE OR CONTINUOUS VARIABLES

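HISTOGRAM BIN WIDTH

Wand (1997) Amer. Statistician 51, 59-64

[Figure: Histograms of the British Incomes Data based on (a) the bin width $\hat{h}_2$, (b) the bin width $\hat{h}_0$, and (c) the S-PLUS default bin width.]

(a) Optimal solution

$$\hat{h}_2 = \left[ \frac{6}{-\hat{\psi}_2(g_{21})\, n} \right]^{1/3}$$

where $g_{21}$ is a band-width parameter and $\hat{\psi}_2$ is a “normal scale” estimator. The solution of $\hat{\psi}_2$ and $g_{21}$ is iterative, to optimise a function: the MEAN INTEGRATED SQUARED ERROR.

(b) Normal scale rule

$$\hat{h}_0 = 3.49\, \hat{\sigma}\, n^{-1/3}$$

where $\hat{\sigma}$ = standard deviation and n = sample size.

(c) S-PLUS default

$$\hat{h} = \frac{\text{range of data}}{1 + \log_2 n}$$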

Histogram Bin Width

In R, a good option for choosing histogram bins is the Freedman-Diaconis rule, which sets the number of bins to:

$$\left\lceil \frac{n^{1/3}\,(\max - \min)}{2\,(Q_3 - Q_1)} \right\rceil$$

where n is the number of observations, max - min is the range of the data, and Q3 - Q1 is the inter-quartile range. The brackets represent the ceiling, which means that you round up to the next integer, thereby avoiding 4.2 bins!
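As a rough illustration of the Freedman-Diaconis rule, the following Python sketch computes the number of bins from n, the range, and the inter-quartile range. The function name `fd_bin_count` and the sample data are hypothetical, and the quartiles use Python's "inclusive" method, which is an assumption and may differ slightly from other quartile definitions.

```python
import math
import statistics


def fd_bin_count(values):
    """Number of histogram bins suggested by the Freedman-Diaconis rule."""
    n = len(values)
    # Quartiles; the "inclusive" interpolation method is an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("inter-quartile range is zero; the rule is undefined")
    data_range = max(values) - min(values)
    # Ceiling: round up to the next whole number of bins.
    return math.ceil(n ** (1 / 3) * data_range / (2 * iqr))


sample = [float(v) for v in range(1, 101)]  # hypothetical data: 1, 2, ..., 100
bins = fd_bin_count(sample)
print(bins)
```

The guard against a zero inter-quartile range is a design choice for the sketch: heavily tied data can make Q3 - Q1 zero, in which case another rule has to be used.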

Exploratory Data Analysis

1. Summary Statistics

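(A) Measures of location ‘typical value’

(1) Arithmetic mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

(2) Weighted mean

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

(3) Mode: ‘most frequent’ value

(4) Median: ‘middle value’. Robust statistic.

(5) Trimmed mean: 1 or 2 extreme observations at both tails deleted

(6) Geometric mean

$$\mathrm{GM} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$$

$$\log \mathrm{GM} = \frac{1}{n} \sum_{i=1}^{n} \log x_i$$

$$\mathrm{GM} = \operatorname{antilog}\left( \frac{1}{n} \sum_{i=1}^{n} \log x_i \right)$$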
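The location measures can be compared on a small data set. This is a minimal Python sketch; the data, the weights, and the helper names (`weighted_mean`, `geometric_mean`, `trimmed_mean`) are hypothetical and chosen only to show how one extreme value affects each measure.

```python
import math
import statistics


def weighted_mean(x, w):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def geometric_mean(x):
    """Antilog of the mean of the log values (all values must be > 0)."""
    return math.exp(sum(math.log(xi) for xi in x) / len(x))


def trimmed_mean(x, k=1):
    """Mean after deleting the k smallest and k largest observations."""
    ordered = sorted(x)
    return statistics.fmean(ordered[k:len(ordered) - k])


x = [1.0, 2.0, 4.0, 8.0, 100.0]              # hypothetical data with one extreme value
mean = statistics.fmean(x)                   # pulled upwards by the extreme value
median = statistics.median(x)                # robust to the extreme value
gm = geometric_mean(x)
tm = trimmed_mean(x, k=1)
wm = weighted_mean(x, [1, 1, 1, 1, 0])       # zero weight on the extreme value
print(mean, median, gm, tm, wm)
```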


(B) Measures of dispersion

A 13.99 14.15 14.28 13.93 14.30 14.13

B 14.12 14.10 14.15 14.11 14.17 14.17

B has smaller scatter than A: ‘better precision’

Precision: random error, scatter (replicates)

Accuracy: systematic bias

(1) Range: A = 0.37, B = 0.07

(2) Interquartile range ‘percentiles’: the quartiles Q1, Q2 and Q3 divide the ordered data into four parts, each holding 25% of the observations.

(3) Mean absolute deviation (mean absolute difference)

$$\frac{1}{n} \sum_{i=1}^{n} \left| x_i - \bar{x} \right|$$

(ignore negative signs)

x             1   5   8   2      mean = 4
|x - mean|    3   1   4   2      sum = 10,  10/n = 2.5

(4) Variance and standard deviation

Variance = mean of squares of deviations from the mean

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Standard deviation (root mean square value)

$$\mathrm{SD} = s = \sqrt{s^2}$$

(5) Coefficient of variation

Relative standard deviation; percentage relative SD (independent of units)

$$\mathrm{CV} = \frac{s}{\bar{x}} \times 100 \qquad \left( \frac{\text{SD}}{\text{mean}} \right)$$

(6) Standard error of mean

$$\mathrm{SEM} = \sqrt{\frac{s^2}{n}}$$
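The dispersion measures can be checked against the replicate sets A and B and the worked example x = 1, 5, 8, 2. This is a minimal Python sketch; the function name `dispersion_summary` and the "inclusive" quartile method are assumptions, not something the lecture specifies.

```python
import math
import statistics


def dispersion_summary(x):
    """Range, IQR, mean absolute deviation, variance, SD, CV (%) and SEM."""
    n = len(x)
    mean = statistics.fmean(x)
    # Quartile method is an assumption; other definitions give slightly different values.
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    sd = statistics.stdev(x)  # uses the n - 1 divisor
    return {
        "range": max(x) - min(x),
        "iqr": q3 - q1,
        "mad": sum(abs(xi - mean) for xi in x) / n,
        "variance": statistics.variance(x),
        "sd": sd,
        "cv_percent": 100 * sd / mean,
        "sem": sd / math.sqrt(n),
    }


a = [13.99, 14.15, 14.28, 13.93, 14.30, 14.13]
b = [14.12, 14.10, 14.15, 14.11, 14.17, 14.17]
summary_a = dispersion_summary(a)
summary_b = dispersion_summary(b)
worked = dispersion_summary([1, 5, 8, 2])  # mean absolute deviation example

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 4) for key, value in summary.items()})
```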


(C) Measures of skewness and kurtosis

“Moment statistics”

$$\text{Central moment} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$$

r = 1: deviation from mean = 0

r = 2: variance

Skewness - measure of how one tail of curve is drawn out

Kurtosis - measure of peakedness of curve

$g_1$ skewness measure (r = 3) [third central moment divided by sd³]

$$g_1 = \frac{1}{n s^3} \sum_{i=1}^{n} (x_i - \bar{x})^3$$

$g_2$ kurtosis measure (r = 4)

$$g_2 = \frac{1}{n s^4} \sum_{i=1}^{n} (x_i - \bar{x})^4 - 3$$

negative $g_1$: skewness to left

positive $g_1$: skewness to right

negative $g_2$: platykurtosis (flatter peak, lighter tails)

positive $g_2$: leptokurtosis (taller peak, heavier tails)

[Figure: skewness and kurtosis]
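The moment statistics g1 and g2 can be sketched directly from the central moments. In this Python example the function names and data are hypothetical, and s is taken as the square root of the second central moment (divisor n), which is an assumption; using the n - 1 divisor changes the values slightly.

```python
import math


def central_moment(x, r):
    """r-th central moment: mean of (x_i - mean) ** r."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** r for xi in x) / n


def skewness_g1(x):
    """Third central moment divided by s cubed."""
    s = math.sqrt(central_moment(x, 2))  # divisor n: an assumption
    return central_moment(x, 3) / s ** 3


def kurtosis_g2(x):
    """Fourth central moment divided by s to the fourth power, minus 3."""
    s = math.sqrt(central_moment(x, 2))
    return central_moment(x, 4) / s ** 4 - 3


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical symmetric data
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]   # hypothetical data with a long right tail
print(skewness_g1(symmetric), kurtosis_g2(symmetric))
print(skewness_g1(right_skewed), kurtosis_g2(right_skewed))
```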

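DATA TRANSFORMATIONS

(1) Comparability

(2) Better fit to model

Comparability

Data centring - deviations from mean

$$x_i^* = x_i - \bar{x}$$

Data standardisation - zero mean, unit variance

$$x_i^* = \frac{x_i - \bar{x}}{\mathrm{sd}}$$

Scaling by the range

$$x_i^* = \frac{x_i}{\text{range}}$$

Better fit

Normal distribution, described by $\bar{x}$ and sd:

mean ± 1 sd = about 68% of values

mean ± 2 sd = about 95% of values

Often find data skewed to right (positive $g_1$): log-normal distribution

[Figure: normal frequency curve marking the mean, the sd, and the ranges holding about 68% and 95% of values]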

LOG-NORMAL DISTRIBUTION PROPERTIES

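Geometric mean = median of log-normal distribution

Antilog of the mean of the log values = geometric mean

SD of log values ≈ CV of original values if the SD is small

If the SD is larger,

$$\mathrm{CV} = \left[ \exp(S^2) - 1 \right]^{0.5}$$

where S is the SD of the (natural) log values.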
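The relation between the SD of the log values and the CV of the original values can be explored numerically. This minimal Python sketch assumes natural logarithms and a log-normal variable; the function name `cv_from_log_sd` is hypothetical.

```python
import math


def cv_from_log_sd(s):
    """CV of a log-normal variable whose natural-log values have SD s."""
    return math.sqrt(math.exp(s ** 2) - 1)


small = cv_from_log_sd(0.1)  # close to the log SD itself
large = cv_from_log_sd(1.0)  # clearly larger than the log SD
print(small, large)
```

For a small SD the CV is almost equal to the SD of the logs; the two drift apart as the SD grows, which is why the exact expression is needed for strongly skewed data.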

How to decide whether to log transform?

(1) Look at histograms. If right skewed (positive g1), log transform.

(2) Log transform if sd > mean, or if the maximum value of the variable is more than 20 times the smallest value. Use log xi or log (xi + 1).

(3) Improves normality.

(4) Gives less weight to ‘dominants’; VARIANCE STABILISING.

(5) Reflects the linear response of many species to the log of chemical variables, i.e. a log response over certain ranges.

(6) Regression needs normally distributed random errors, which a log transformation can help to achieve.

NORMAL AND LOG-NORMAL DISTRIBUTIONS

                             Normal            Log-Normal
Effects                      Additive          Multiplicative
Shape                        Symmetric         Skewed
Mean                         x̄, arithmetic     x̄*, geometric
Standard deviation           s, additive       s*, multiplicative
Measure of dispersion        cv = s/x̄          s*
Confidence interval 68.3%    x̄ ± s             x̄* ×/ s*
                    95.5%    x̄ ± 2s            x̄* ×/ (s*)²
                    99.7%    x̄ ± 3s            x̄* ×/ (s*)³

×/ = times / divide (cf. ± plus / minus); cv = coefficient of variation

METHODS FOR DESCRIBING LOG-NORMAL DISTRIBUTIONS

Graphical methods

  Frequency plots, histograms, box plots

Parameters

  Logarithm of x: mean, median, standard deviation, variance

  Skewness and kurtosis of x

Problems

  What logarithm base to use?

  Parameters are not on the scale of the original data

Log-normal distributions appear to be very common in the real world

Limpert, E, et al. 2001 BioScience 51 (5), 342-352

DATA TRANSFORMATIONS

(1) Environmental variables: skewed to right, log-normal distribution

If SD > mean or maximum value of x > 20 times the smallest, use log (x + c) transformation where c is a constant, usually 1.

(2) Biological data

- Stabilise variances

- Dampen effects of very abundant taxa

Choices

- No transformation

- Square root

- Log (y + 1)

- % data: square root

- Counts: log (y + 1)

Other transformations:

(1) square root $\sqrt{x}$

(2) cubic root $\sqrt[3]{x}$

(3) fourth root $\sqrt[4]{x}$

(4) log2: log2 (x + 1)

(5) logp: logp (x + 1)

(6) Box-Cox transformation - most appropriate value for exponent λ

$$x^* = \frac{x^{\lambda} - 1}{\lambda} \quad \text{where } \lambda \neq 0, \qquad x^* = \log x \quad \text{where } \lambda = 0$$

If λ = 1: no transformation

λ = 0.5: square root

λ = -1: reciprocal transformation

λ = 0: log transformation

If x = 0.0, add 0.5 or 1.0 as constant

Can also solve for best estimate of constant to add

Can calculate confidence limits for λ.

If these include 1, no need for a transformation!

TRANSFOR
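The Box-Cox family can be sketched in a few lines. This Python example only applies the transformation for a given exponent; it does not estimate the most appropriate λ or its confidence limits. The function name `box_cox` and the data are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of positive values for a chosen exponent lam."""
    if any(xi <= 0 for xi in x):
        raise ValueError("values must be positive; add a constant such as 0.5 or 1 first")
    if lam == 0:
        return [math.log(xi) for xi in x]
    return [(xi ** lam - 1) / lam for xi in x]


x = [1.0, 4.0, 9.0]               # hypothetical data
identity_like = box_cox(x, 1)     # x - 1: only a shift, so the shape is unchanged
root_like = box_cox(x, 0.5)       # linear function of the square root
reciprocal_like = box_cox(x, -1)  # 1 - 1/x: linear function of the reciprocal
logged = box_cox(x, 0)            # natural log
print(identity_like, root_like, reciprocal_like, logged)
```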

DENSITY ESTIMATION

A useful alternative to histograms is non-parametric density estimation which results in a smoothing of the histogram.

The kernel-density estimate at the value x of a variable X is given by

$$\hat{f}(x) = \frac{1}{nb} \sum_{j=1}^{n} K\left( \frac{x - x_j}{b} \right)$$

where $x_j$ are the n observations of X, K is a kernel function (such as the normal density), and b is a bandwidth parameter influencing the amount of smoothing. Small bandwidths produce rough density estimates, whereas large bandwidths produce smoother estimates.

Note that the histogram has been scaled to the density estimates, not the raw frequencies.

Multiple approaches

1. Histogram with density scaling (areas of histogram bars sum to 1)

2. Density estimation (default) (thick line)

3. Density estimation (half the default bandwidth) (thin line)

4. One-dimensional scatter-plot ("rugplot") to show distribution of observations at the bottom

Fox, 2002
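A kernel-density estimate with a normal kernel can be sketched as follows. This is a minimal Python illustration, not the method used for the figure; the data, the two bandwidths, and the evaluation grid are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, data, b):
    """Kernel-density estimate at x with bandwidth b."""
    n = len(data)
    return sum(gaussian_kernel((x - xj) / b) for xj in data) / (n * b)


data = [1.0, 1.5, 2.0, 4.0, 4.2]          # hypothetical observations
step = 0.01
grid = [i * step for i in range(-500, 1100)]  # evaluation points from -5 to 11

rough = [kde(g, data, 0.2) for g in grid]     # small bandwidth: rough estimate
smooth = [kde(g, data, 1.0) for g in grid]    # large bandwidth: smoother estimate

# The estimate is a density, so the area under the curve should be close to 1.
area = sum(smooth) * step
print(round(area, 4), max(rough), max(smooth))
```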

QUANTILE-QUANTILE PLOTS

Quantile-quantile (Q-Q) plots are useful tools for determining if data are normally distributed. They show the relationship between the distribution of a variable and a reference or theoretical distribution.

A Q-Q plot shows the relationship between the ordered data and the corresponding quantiles of the reference (in our case, normal) distribution.

If the data are normally distributed, they should plot on a straight line through the 1st and 3rd quartiles. If there is a break in slope of the plotted points, the data deviate from the reference distribution.

Note that quantiles are divisions of a frequency or probability distribution into equal, ordered subgroups (e.g. quartiles (4 parts) or percentiles (100 parts)).
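The coordinates of a normal Q-Q plot can be computed without any plotting library. In this Python sketch the plotting positions (i - 0.5)/n and the use of the sample mean and SD for the reference normal distribution are assumptions; other conventions exist. The data and the function name are hypothetical.

```python
from statistics import NormalDist, fmean, stdev


def normal_qq_points(data):
    """Pairs of (theoretical normal quantile, ordered observation)."""
    n = len(data)
    ordered = sorted(data)
    reference = NormalDist(fmean(data), stdev(data))
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(reference.inv_cdf(p), x) for p, x in zip(probs, ordered)]


data = [2.1, 3.4, 1.9, 5.0, 4.2]  # hypothetical observations
points = normal_qq_points(data)
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

Plotting the observed values against the theoretical quantiles then gives the Q-Q plot; points close to a straight line suggest approximate normality.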

EXPLORATORY DATA ANALYSIS

GRAPHICAL DISPLAY

Univariate data

J.W. Tukey

(1) Stem-and-leaf displays

Data: 55 62 73 78 79 78 81

STEM | LEAF
   5 | 5
   6 | 2
   7 | 3 8 8 9
   8 | 1

“Back-to-back” display

      4 2 1 | 5 | 1 1 2 3 6 7
        4 3 | 6 | 3 4
9 7 5 5 3 2 | 7 | 1
        5 3 | 8 | 1 9
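A stem-and-leaf display for the seven values 55, 62, 73, 78, 79, 78, 81 can be built by splitting each value into a tens digit (stem) and a units digit (leaf). This is a minimal Python sketch that assumes non-negative integers; the function name is hypothetical, and stems with no observations are simply omitted.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the display lines, one per stem, with leaves in ascending order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit(s) and units digit
        stems[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


lines = stem_and_leaf([55, 62, 73, 78, 79, 78, 81])
print("\n".join(lines))
```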

(2) Box-and-whisker plots - box plots

95% CI around median: Median ± 1.58 (Q3 - Q1) / √n, where Q1 and Q3 are the lower and upper quartiles

(3) Hanging histograms

Variations of box plots

McGill et al. Amer. Stat. 32, 12-16

Useful to label extreme points

Fox, 2002
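The approximate 95% interval around the median used for notched box plots, median ± 1.58 (Q3 - Q1)/√n, can be sketched as below. The function name, the data, and the "inclusive" quartile method are assumptions; box-plot software may compute the quartiles (hinges) slightly differently.

```python
import math
import statistics


def median_notch(x):
    """Approximate 95% interval around the median: median +/- 1.58 * IQR / sqrt(n)."""
    n = len(x)
    # Quartile method is an assumption for this sketch.
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(n)
    return median - half_width, median + half_width


values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations
lower, upper = median_notch(values)
print(lower, upper)
```

When the intervals of two box plots do not overlap, that is usually read as rough evidence that the two medians differ.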

Box plots for samples of more than ten wing lengths of adult male winged blackbirds taken in winter at 12 localities in the southern United States, and in order of generally increasing latitude. From James et al. (1984a). Box plots give the median, the range, and upper and lower quartiles of the data.

It is useful to apply several EDA tools to the same data.

Bivariate and multivariate data

Simple scatter plot

[Figure: scatter plot of x2 against x1]

SCATTERPLOT MATRIX. The data are measurements of ozone, solar radiation, temperature, and wind speed on 111 days. Thus the measurements are 111 points in a four-dimensional space. The graphical method in this figure is a scatterplot matrix: all pairwise scatterplots of the variables are aligned into a matrix with shared scales.

Triangular arrangement of all pairwise scatter plots for four variables. Variables describe length and width of sepals and petals for 150 iris plants, comprising 3 species of 50 plants.

Three-dimensional perspective view for the first three variables of the iris data. Plants of the three species are coded A,B and C.

Can explore scatter-plot by adding box-plots for each variable, add simple linear regression line, add smoother (LOWESS – see Lecture 5), and label particular points.

Fox, 2002

Categorical variables can be encoded in a plot by using different symbols or colours for each category (e.g. type of occupation) and smoothers fitted for each category.

bc = blue collar, prof = professional, wc = white collar

Fox, 2002

Jittering scatter-plots

Discrete quantitative variables usually result in uninformative scatter-plots (e.g. education (years) and vocabulary (score on 0-10 scale)).

Only 21 distinct education values and 11 scores, so only 21 × 11 = 231 plotting positions.

Jittering data adds a small random quantity to each value to try to separate over-plotted points. Can vary the amount of jittering and also plot a smoother. Fox, 2002
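Jittering can be sketched as adding a small uniform random quantity to each value before plotting. In this Python example the function name, the amount of jitter, the seed, and the data are all hypothetical; the original values should still be used for any calculations.

```python
import random


def jitter(values, amount=0.3, seed=None):
    """Add a uniform random offset in [-amount, amount] to each value."""
    rng = random.Random(seed)  # a seed makes the jittered plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


education = [12, 12, 12, 16, 16]  # hypothetical years of education with ties
jittered = jitter(education, amount=0.3, seed=1)
print(jittered)
```

A larger `amount` separates over-plotted points more clearly but blurs the discrete structure, so it is worth trying a few values.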

Bivariate density estimation and scatter-plots

Large data-sets and weak relationships between variables.

Improve plot by jittering and making symbols smaller and apply bivariate kernel-density estimate plus regression line and LOWESS smoother.

Fox, 2002

coal-fired power station

oil-fired power station

Diagonal = density estimate for each variable

The Bagplot: A Bivariate Boxplot

Peter J. Rousseeuw

The American Statistician November 1999, Vol. 53, No. 4, 382

Car weight and engine displacement of 60 cars.

Part (a) shows the concentrations of cholesterol and triglycerides in the plasma of 320 patients. In part (b) logarithms are taken of both variables.

Part (a) shows the altitudinal range and abundance of butterflies. In part (b) the logarithm of the abundance is plotted.

Bagplot matrix of the three-dimensional aquifer data

with 85 data points.

Conditioning plots (Co-plots)

Focus on relationship between response and a predictor variable, holding other predictors constant at particular values – conditionally fixing the values of other predictors. 'Statistical control'

Co-plots provide graphical statistical control.

Focus on particular predictor and set each other predictor to a relatively narrow range (if quantitative) or to a specific value (if categorical). Subranges for a quantitative predictor are typically set to overlap (called "shingles") rather than to partition data into disjoint subsets ("bins").

For each combination of values of the conditioning predictors, construct scatter-plot to show response to the local predictor and arrange the plots in an array.

Can condition on more than one predictor (e.g. age, gender).

Six overlapping age classes, two genders (male upper, female lower), LOWESS, and linear fits

Fox, 2002

EDA and Data-Transformations

Try to linearise non-linear relationships by trial-and-error.

Mosteller & Tukey's 'bulging rule'.

Fox, 2002

When the bulge points down, transform y down the ladder of powers and roots; when the bulge points up, transform y up; when the bulge points left, transform x down; when the bulge points right, transform x up.

Infant mortality rate and GDP per capita for 193 countries

Points down and to left, try powers and roots

Log transformation linearising, variables more symmetric

Fox, 2002

Profiles, Stars, Glyphs, Faces, and Boxes of Percentages of Republican Votes in Six Presidential Elections in Six Southern States. The circles in the Stars Are Drawn at 50%. The Assignment of Variables to Facial Features in the Faces is: 1932 – Shape of Face; 1936 – Length of nose; 1940 – Curvature of Mouth; 1960 – Width of Mouth; 1964 – Slant of Eyes; 1968 – Length of Eyebrows

Simple multivariate data

Three types of shape for representing multivariate data. In these examples glyph, stars and faces represent five, six and twelve (!) variables respectively.

Frequency of the six commonest species on the Park Grass plots using star displays.

Polygon plots

Labelled polygon plot

Chernoff faces

CHERNOFF

American city crime data

City          Murder/        Rape   Robbery  Assault  Burglary  Larceny  Auto theft
              Manslaughter
Atlanta       16.5           24.8   106      147      1112      905      494
Boston        4.2            13.3   122      90       982       669      954
Chicago       11.6           24.7   340      242      808       609      645
Dallas        18.1           34.2   184      293      1668      901      602
Denver        6.9            41.5   173      191      1534      1368     780
Detroit       13             35.7   477      220      1566      1183     788
Hartford      2.5            8.8    68       103      1017      724      468
Honolulu      3.6            12.7   42       28       1457      1102     637
Houston       16.8           26.6   289      186      1509      787      697
Kansas City   10.8           43.2   255      226      1494      955      765
Los Angeles   9.7            51.8   286      355      1902      1386     862
New Orleans   10.3           39.7   266      283      1056      1036     776
New York      9.4            19.4   522      267      1674      1392     848
Portland      5              23     157      144      1530      1281     488
Tucson        5.1            22.9   85       148      1206      756      483
Washington    1.5            27.6   524      217      1494      1003     739

1. Atlanta
2. Boston
3. Chicago
4. Dallas
5. Denver
6. Detroit
7. Hartford
8. Honolulu
9. Houston
10. Kansas City
11. Los Angeles
12. New Orleans
13. New York
14. Portland
15. Tucson
16. Washington

Faces representation of city crime data

CHERNOFF

Occurrence of seven vegetation groups at sites on cliffs of Snowdonia, from soils containing differing amounts of available phosphate and exchangeable calcium. The size of circles indicates the relative abundance of the vegetation.

State            1932  1936  1940  1960  1964  1968
Missouri         35    38    48    50    36    45
Maryland         36    37    41    46    35    42
Kentucky         40    40    42    54    36    44
Louisiana        7     11    14    29    57    23
Mississippi      4     3     4     25    87    14
South Carolina   2     1     4     49    59    39

Percentage of Republican votes in Presidential elections in six Southern states in the years 1932-1940 and 1960-1968.

A) Schematic representation of the hierarchical clustering of years by complete linkage for the Republican vote data in six southern states. The numbers at the far left denote distances between clusters.

B) Tree for Missouri computed according to decisions (i) – (v)

Trees for Republican vote data in six southern states.

Tree of yearly yields of 15 transportation companies with all variables labelled

Tree of yearly yields of 15 transportation companies 1953-1977

FOURIER PLOTS Andrews (1972)

Plot multivariate data as a function:

$$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots$$

where data are [x1, x2, x3, x4, x5... xm]. Plot over range -π ≤ t ≤ π. Each object is a curve. Function preserves distances between objects. Similar objects will be plotted close together.
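The Fourier (Andrews) function for one object can be evaluated as follows. This is a minimal Python sketch; the function name `andrews_curve` and the example object are hypothetical, and plotting is left out.

```python
import math


def andrews_curve(x, t):
    """Value at t of x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ..."""
    total = x[0] / math.sqrt(2)
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2  # frequency: 1, 1, 2, 2, 3, 3, ...
        total += xj * (math.sin(k * t) if j % 2 == 1 else math.cos(k * t))
    return total


obj = [1.0, 2.0, 3.0, 4.0, 5.0]                           # hypothetical object, m = 5
ts = [-math.pi + i * (2 * math.pi / 100) for i in range(101)]
curve = [andrews_curve(obj, t) for t in ts]               # one curve per object
print(andrews_curve(obj, 0.0), andrews_curve(obj, math.pi / 2))
```

Evaluating the function on the same grid of t values for every object gives the family of curves to plot; similar objects give curves that stay close together.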

MULTPLOT

Complex multivariate data

Andrews' plot for artificial data

Andrews’ plots for all twenty-two Indian tribes.

Dieldrin residues in the livers of 227 kestrels and barn owls found dead during 1970-1973. Each bird is represented by a point on the map. (Reproduced with permission from Institute of Terrestrial Ecology Annual Report for 1974).

OTHER TYPES OF GRAPHICAL DISPLAY

Map of aerial density of Sitobion avenae, 11-17 June 1984 produced using the SYMAP program. Darker areas represent higher densities on a logarithmic scale (×3 intervals). Numbers on map indicate positions of suction traps and their respective catch sizes (log3). (Reproduced with permission from Woiwod and Tatchell, 1984.)

Contour map of the aerial density (using logarithmic intervals) of the hop aphid Phorodon humuli, 28 September to 2 October 1983, produced by the program SURFACE II. Suction trap sites are marked with a +. (Reproduced with permission from Fig. 3 of Woiwod and Tatchell, 1984)

Three dimensional perspective view of the aphid densities obtained using SURFACE II. (Reproduced from Woiwod and Tatchell, 1984)

THE POWER OF GRAPHICAL DATA DISPLAY. Visualization provides insight that cannot be appreciated by any other approach to learning from data. On this graph, the top left panel displays monthly average CO2 concentrations from Mauna Loa, Hawaii. The remaining panels show frequency components of variation in the data. The heights of the five bars on the right sides of the panels portray the same changes in ppm on the five vertical scales.

OUTLIERS

Identification of ‘outliers’ or ‘rogues’.

“Observation which is, in some sense, inconsistent with the rest of the observations in the data-set. An observation can be an outlier due to the response variable(s) or any one or more of the predictor variables having values outside their expected limits.”

Identify not for rejection at this stage but for investigation and evaluation.

Incorrect measurement, incorrect data entry, transcription or recording error?

Concept of outlier is model dependent.

LEVERAGE Potential for influence resulting from unusual values, particularly of predictor variables

INFLUENCE Observation is influential if its deletion substantially changes the results

Generalised distance of observation i plus 1/n:

$$d_i^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}) + \frac{1}{n}$$

Measures how extreme observation i is from the mean vector $\bar{x}$ of the complete sample.

If leverage of an observation is more than three times the average leverage, observation has high leverage. Need to check it and try to explain why it has high leverage.

Alternatively, leverage of observation i ($h_i$) equals the i-th diagonal element of the hat matrix H

$$H = X (X'X)^{-1} X'$$

where X is the n × k matrix of x values (k is the number of parameters in the model) and H is an n × n square matrix.

[Hat matrix so called because it puts a “hat on Y”: Ŷ = HY, where Ŷ and Y are n × 1 vectors of predicted and observed Y values]

$d_i^2$ - two or more response variables (e.g. CANOCO)

$h_i$ - one response variable (e.g. linear or multiple regression)

LEVERAGE MEASURES

Leverage ranges from 1/n to 1

Sample mean: $\bar{h} = k/n$

Size-adjusted cut-off: $h_i > 2k/n$ (ca. extreme 5%)

Maximum ($h_i$):

Max ($h_i$) ≤ 0.2: Safe

0.2 < Max ($h_i$) ≤ 0.5: Risky

Max ($h_i$) > 0.5: Avoid if possible

k = number of parameters

As $h_i$ approaches 1, observation i may completely control the model.
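For a straight-line regression with an intercept and one predictor (k = 2), the diagonal of the hat matrix reduces to h_i = 1/n + (x_i - mean)² / Σ(x_j - mean)². The Python sketch below uses that special case to avoid matrix code; the function name and the x values are hypothetical.

```python
def leverage_simple_regression(x):
    """Leverage h_i for a regression with an intercept and one predictor."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 20.0]  # hypothetical predictor values, one far from the rest
k = 2                           # parameters: intercept and slope
h = leverage_simple_regression(x)

cutoff = 2 * k / len(x)         # size-adjusted cut-off 2k/n
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print([round(hi, 3) for hi in h], flagged)
```

The leverages sum to k, so their mean is k/n, and only the x values enter the calculation; the response is not needed.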

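INFLUENCE MEASURES

DFBETAS

Identifies influence of observations on individual regression coefficients of the model: “LOCAL”

DFBETAS - change in a regression coefficient, in units of standard errors, if observation i is deleted

$$\mathrm{DFBETAS}_{ik} = \frac{b_k - b_{k(i)}}{s_{e(i)} / \sqrt{\mathrm{RSS}_k}}$$

where $b_{k(i)}$ = regression slope when i deleted, $s_{e(i)}$ = residual standard deviation when i deleted, and $\mathrm{RSS}_k$ = residual sum of squares when i not deleted.

If $\mathrm{DFBETAS}_{ik}$ > 0, case i pulls $b_k$ up

If $\mathrm{DFBETAS}_{ik}$ < 0, case i pulls $b_k$ down

If $|\mathrm{DFBETAS}_{ik}| > 2/\sqrt{n}$, influential case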

COOK’S D

Assesses impact of observations on regression coefficients: “GLOBAL”

$$D_i = \frac{z_i^2\, h_i}{k\,(1 - h_i)}$$

where $z_i$ = standardised residual, k = number of parameters, and $h_i$ = leverage measure from H.

If $D_i$ > 1, observation influential

If $D_i$ > 4/n (size adjusted), observation influential

High leverage - potential outlier

Low influence - ‘good’ outlier, non-discordant outlier

High influence - ‘bad’ outlier, discordant outlier
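Cook's D combines the standardised residual and the leverage of an observation. The Python sketch below evaluates D_i = z_i² h_i / (k (1 - h_i)) for two hypothetical observations that share a leverage of 0.34; the residual values, n and k are illustrative assumptions.

```python
def cooks_distance(z, h, k):
    """Cook's D from standardised residual z, leverage h and number of parameters k."""
    return z ** 2 * h / (k * (1 - h))


n, k = 100, 2                                  # hypothetical sample size and parameters
d_small = cooks_distance(z=0.1, h=0.34, k=k)   # high leverage, small residual
d_large = cooks_distance(z=2.0, h=0.34, k=k)   # high leverage, large residual

for d in (d_small, d_large):
    # Two rules of thumb: D > 1, and the size-adjusted D > 4/n.
    print(round(d, 4), d > 1, d > 4 / n)
```

The same leverage gives very different values of D depending on the residual, which matches the distinction between ‘good’ and ‘bad’ outliers.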

‘Good’ (left) and ‘bad’ (right) outliers: ‘bad’ outliers influence the slope (artificial data)

                          ‘Good’ outlier (left)           ‘Bad’ outlier (right)

Leverage (depends on x values only)

hi                        0.34                            0.34

(‘risky’ (between 0.2 and 0.5) and well above size-adjusted cut-off of 2k/n = 4/100 = 0.04)

Influence

DFBETASi                  0.06                            -9.1
                          (much less than 2/√n = 0.2)     (much more than 2/√n = 0.2 in absolute value)

                          High leverage, low influence    High leverage, high influence
                          Non-discordant outlier          Discordant outlier

Robust leverage vs. Robust residuals plot

NEVER FORGET THE GRAPH!

“What is the use of a book, thought Alice, without pictures”

SOFTWARE FOR EXPLORATORY DATA ANALYSIS

R and S–PLUS

MINITAB

SYSTAT

AXUM
