
Lecture 25: Inference for data visualization

Pratheepa Jeganathan

05/31/2019


Recall


- One-sample sign test, Wilcoxon signed-rank test, large-sample approximation, median, Hodges-Lehmann estimator, distribution-free confidence interval.
- Jackknife for bias and standard error of an estimator.
- Bootstrap samples, bootstrap replicates.
- Bootstrap standard error of an estimator.
- Bootstrap percentile confidence interval.
- Hypothesis testing with the bootstrap (one-sample problem).
- Assessing the error in bootstrap estimates.
- Example: inference on the ratio of heart attack rates in the aspirin-intake group to the placebo group.
- The exhaustive bootstrap distribution.


- Discrete data problems (one-sample and two-sample proportion tests, test of homogeneity, test of independence).
- Two-sample problems (location problem with equal variance, unequal variance, exact test or Monte Carlo, large-sample approximation, H-L estimator, dispersion problem, general distribution).
- Permutation tests (permutation test for continuous data, different test statistics, accuracy of permutation tests).
- Permutation tests (discrete data problems, exchangeability).
- Rank-based correlation analysis (Kendall and Spearman correlation coefficients).
- Rank-based regression (straight line, multiple linear regression, statistical inference about the unknown parameters; nonparametric procedures do not depend on the distribution of the error term).
- Smoothing (density estimation, bias-variance trade-off, curse of dimensionality).
- Nonparametric regression (local averaging, local regression, kernel smoothing, local polynomial, penalized regression).


- Cross-validation, variance estimation, confidence bands, bootstrap confidence bands.
- Wavelets (wavelet representation of a function, coefficient estimation using the discrete wavelet transform, thresholding: VisuShrink and SureShrink).
- One-way layout (general alternative (Kruskal-Wallis test), ordered alternatives), multiple comparison procedures.
- Two-way layout (complete block design (Friedman test)), multiple comparison procedures, median polish, Tukey additivity plot, profile plots.
- Better bootstrap confidence intervals (bootstrap-t, percentile interval, BCa interval, ABC interval).


Inference for data visualization


Overview

- Visualization for
  - exploratory data analysis (to search for structure/patterns in the data)
  - model diagnostics (to search for structure not captured by the fitted model)
- Magical thinking: the natural human inclination to over-interpret connections between random events (Diaconis 1985).


Overview

- Which plot shows the higher degree of association? (left/right)

[Figure: two scatterplots of y against x, drawn on different axis scales (left: 5.0-15.0 on both axes; right: x from 0 to 25, y from 0 to 30).]


- Reference: (Diaconis 1985).
- Scale changes the viewer's perception.
- We can use protocols to check whether a pattern in the data arises by chance.


Inference for plots


Inference for plots

- Reference: (Buja et al. 2009).
- 'The lineup' protocol:
  - Generate 19 null plots (assuming that structure is absent).
  - Arrange all 19 null plots, inserting the plot of the real data at a random location.
  - Ask a human viewer to single out the real plot.
  - Under the null hypothesis that all plots are the same, there is a one-in-20 chance of singling out the real one.
  - If the viewer chooses the plot of the real data, the discovery can be assigned a p-value of 1/20 = 0.05.
  - A larger number of null plots could yield a smaller p-value.
    - But there is a limit to how many plots a human can consider.


Inference for plots

- Repeat the 'lineup' protocol above with K independently recruited viewers.
- Let X be the number of viewers who choose the real-data plot.
- Suppose k ≤ K viewers selected the plot of the real data.
- p-value = P(X ≥ k), where X ∼ Binomial(n = K, p = 1/20).
- If all the viewers picked the real-data plot, the p-value is 0.05^K.


Inference for plots (Example)

- Scatter plot.
- Null hypothesis: the two variables are independent.
  - Null data sets can be produced by permuting the values of one variable against the other (a sketch of such a permutation lineup follows below).
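To make the permutation scheme concrete, here is a minimal R sketch of a lineup; the example data set (mtcars), the panel count, and all object names are our own choices, not from Buja et al. (2009).

library(ggplot2)
set.seed(2019)

# real data: any two numeric variables
real = data.frame(x = mtcars$wt, y = mtcars$mpg)

pos = sample(20, 1)  # secret panel position of the real plot
lineup = do.call(rbind, lapply(1:20, function(i) {
  # permute y in the 19 null panels to break any association with x
  y = if (i == pos) real$y else sample(real$y)
  data.frame(x = real$x, y = y, panel = i)
}))

ggplot(lineup, aes(x, y)) +
  geom_point(size = 0.8) +
  facet_wrap(~ panel, ncol = 5)

pos  # reveal only after the viewer has guessed

The nullabor R package provides a fuller implementation of lineup protocols.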


Inference for plots (Scatter plot Example)

- Cities across the USA were rated in 1984 according to many features (Boyer and Savageau 1984).
  - Census data, not a random sample.
- Consider the 'Climate-Terrain' and 'Housing' variables.
  - Climate-Terrain: low values mean uncomfortable temperatures, higher values mean moderate temperatures.
  - Housing: higher values mean a higher cost of owning a single-family residence.
- Expected association: more comfortable climates call for higher average housing costs.


Inference for plots (Scatter plot Example)

Figure 1: Source: Buja et al. (2009).


Inference for plots (Scatter plot Example)

Figure 2: Source: Buja et al. (2009).


Inference for plots (Scatter plot Example)

# number of students
K = 8
# number of correct picks
k = 2
pvalue = sum(dbinom(k:K, K, 1/20)); pvalue

## [1] 0.05724465

- The p-value is about 0.057, so at the 5% level we cannot quite reject the null hypothesis that Housing is independent of Climate-Terrain; the evidence for association is borderline.


Inference for plots (time series plot example)

- HSBC ('The Hongkong and Shanghai Banking Corporation') daily stock returns, shown in two panels:
  - 2005 returns only.
  - 1998–2005 returns.
- In each panel, select the plot that is most different and explain why.


Inference for plots (time series plot example)

Figure 3: Source: Buja et al. (2009).


Inference for plots (time series plot example)

Figure 4: Source: Buja et al. (2009).


Inference for plots (time series plot example)

# number of students
K = 8
# number of correct picks
k = 4
pvalue = sum(dbinom(k:K, K, 1/8)); pvalue

## [1] 0.01124781


Inference for plots (time series plot example)

- For the 2005 return data, the viewer should have had difficulty selecting the real-data plot.
  - This was a year of low and stable volatility in returns.
- For the 1998–2005 return data, it should be easy.
  - There are two volatility bursts:
    - in 1998, due to the Russian bond default and the LTCM (Long-Term Capital Management) collapse;
    - in 2001, due to the 9/11 attacks.
  - After 2001, volatility stabilizes at a low level.


Principal component analysis (PCA)


PCA

- PCA is primarily an exploratory tool.
- PCA finds a low-dimensional subspace that minimizes the distances between the projected points and the subspace.
- Consider observations x1, x2, · · · , xn, where each xi is a p-dimensional column vector.
- Combine the xi into a matrix X of dimension n × p.
- Start by centering X; from now on consider X centered, i.e. 1ₙᵀX = 0.
- PCA finds matrices U, S, and V such that X = USVᵀ (the singular value decomposition, SVD).
- The columns of V are the new variables.
- The principal components are C = US (see the sketch below).
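A minimal R sketch of this construction via the SVD; the example data set (USArrests) and the object names are our own choices.

# PCA by SVD: center X, decompose, and form the principal components C = US
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # n x p, centered
s = svd(X)             # X = U S V^T
C = s$u %*% diag(s$d)  # principal components (scores)
V = s$v                # new variables (loadings)

# sanity check against R's built-in PCA: the scores agree up to column signs
max(abs(abs(C) - abs(prcomp(X)$x)))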


PCA

- PCA as an optimization problem.
- The first column v1 of V is such that 〈v1, v1〉 = 1 and

  v̂1 = argmax over v1 of Var(Xv1).

- Find v2 such that 〈v1, v2〉 = 0, 〈v2, v2〉 = 1, and

  v̂2 = argmax over v2 of Var(Xv2).

- Keep going in the same way until v1, v2, · · · , vq have been collected, and put them in V̂ of dimension p × q (a quick numerical check follows this list).
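As a sanity check of the variance-maximization characterization (our own simulation, not from the slides), the first right singular vector should achieve at least the projected variance of any random unit vector:

set.seed(1)
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
v1 = svd(X)$v[, 1]  # first right singular vector

# variance of the projection Xv for 1000 random unit vectors v
rand_var = replicate(1000, {
  v = rnorm(ncol(X)); v = v / sqrt(sum(v^2))
  var(as.vector(X %*% v))
})
var(as.vector(X %*% v1)) >= max(rand_var)  # TRUE: v1 maximizes Var(Xv)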


Bootstrap PCA

- There are two ways to bootstrap PCA when the rows of X are random:
  - partial bootstrap;
  - total bootstrap.
- Partial bootstrap:
  - Project the B bootstrap replications onto the initial subspace.
  - The initial subspace is obtained by PCA on the original X.
  - Underestimates the variation of the parameters (Milan and Whittaker 1995).
- Total bootstrap:
  - Perform a new PCA on each replication.
  - Nuisance variation in the PCAs of bootstrap samples: reflections and rotations.
  - Align the PCAs of the bootstrap samples.


Bootstrap PCA (need of Procrustes analysis)

- For the total bootstrap, we need to align the PCAs of the bootstrap samples.
- This is usually done using Procrustes analysis.
- Procrustes refers to a bandit from Greek mythology who made his victims fit his bed by stretching their limbs (or cutting them off).
- Procrustes analysis is used in statistical shape analysis to align shapes and minimize the differences between them, retaining the real shape by removing nuisance parameters (the rotation step is sketched in code after this list):
  - translation in space,
  - rotation in space,
  - sometimes scaling of the objects.
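For the rotation step alone, the orthogonal Procrustes problem has a closed-form solution via the SVD. Below is a minimal R sketch with our own (hypothetical) helper name and a toy check.

# orthogonal Procrustes: rotation/reflection R minimizing ||A - BR||_F;
# if svd(t(B) %*% A) = U D V^T, the minimizer is R = U V^T
procrustes_rotation = function(A, B) {
  s = svd(t(B) %*% A)
  s$u %*% t(s$v)
}

# toy check: recover a known rotation of a 2-D point cloud
set.seed(1)
theta = pi / 5
R0 = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
A = matrix(rnorm(40), 20, 2)
B = A %*% t(R0)                                # B is A rotated by R0
max(abs(B %*% procrustes_rotation(A, B) - A))  # ~ 0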


Bootstrap PCA (need of Procrustes analysis)

- Shapes differ by rotation, translation, and scaling.
- We need to remove the translation, rotation, and scaling effects before comparing the shapes of the spinal cord defined by the landmarks (red points).


Bootstrap PCA (need of Procrustes analysis)

- In PCA, the shapes are the projected points in the lower-dimensional subspace spanned by, say, PC1 and PC2.
- Procrustes rotation maps the rows of the PCA estimators onto the initial configuration. The first table corresponds to X and the others to X̂∗b, b = 1, · · · , B.


Bootstrap PCA

- Collect the B bootstrapped PCAs obtained by resampling the rows of the data matrix X:

  V*1_q, V*2_q, · · · , V*B_q.

- Align all the projected point sets using Procrustes alignment.
- Find the rotation for each bootstrap replicate:

  R̂_b = argmin over R of ‖ XV_q − X*b V*b_q R ‖.

- Apply the rotation to the projected data points of each bootstrap sample:

  X*b V*b_q R̂_b.

- Overlay the points and draw contours around them.
- This nonparametric bootstrap approach modifies the structure of the rows of X in each bootstrap replication (see the sketch below).
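Putting the pieces together, here is a minimal R sketch of the total bootstrap with Procrustes alignment; the data set, the number of replicates, and all object names are our own choices.

set.seed(1)
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
q = 2
V_q = svd(X)$v[, 1:q]
ref = X %*% V_q                        # initial configuration X V_q

procrustes_rotation = function(A, B) { # rotation R minimizing ||A - BR||_F
  s = svd(t(B) %*% A)
  s$u %*% t(s$v)
}

B_reps = 200
aligned = vector("list", B_reps)
for (b in 1:B_reps) {
  idx = sample(nrow(X), replace = TRUE)        # resample rows of X
  Xb = scale(X[idx, ], center = TRUE, scale = FALSE)
  V_b = svd(Xb)$v[, 1:q]                       # new PCA on the replication
  proj = Xb %*% V_b                            # X*b V*b_q
  R_b = procrustes_rotation(ref[idx, ], proj)  # align to the initial configuration
  aligned[[b]] = proj %*% R_b                  # X*b V*b_q R_b
}
# overlay the aligned scores per observation and draw contours (e.g. with ggplot2)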


Parametric bootstrap PCA

- When the rows and the columns of X are fixed, we can use the parametric bootstrap.
- Perform PCA on X to estimate V̂_q.
- Estimate the residual variance σ̂² from the residual matrix E = X − XV̂_qV̂_qᵀ.
- Draw ε_ij ∼ N(0, σ̂²) and collect them in E∗b.
- Generate a new matrix X∗b = XV̂_qV̂_qᵀ + E∗b.
- Perform PCA on each X∗b (see the sketch below).
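A minimal R sketch of this parametric bootstrap; the data set and object names are our own choices, and σ̂² here is a simple plug-in estimate with no degrees-of-freedom correction (an assumption on our part).

set.seed(1)
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
n = nrow(X); p = ncol(X); q = 2

V_hat = svd(X)$v[, 1:q]
fitted = X %*% V_hat %*% t(V_hat)  # rank-q signal estimate X V V^T
E = X - fitted                     # residual matrix
sigma2_hat = mean(E^2)             # naive plug-in estimate of sigma^2

boot_V = lapply(1:200, function(b) {
  E_b = matrix(rnorm(n * p, sd = sqrt(sigma2_hat)), n, p)  # eps_ij ~ N(0, sigma^2)
  X_b = fitted + E_b                                       # X*b = X V V^T + E*b
  svd(scale(X_b, center = TRUE, scale = FALSE))$v[, 1:q]   # PCA on X*b
})
# (as with the total bootstrap, the replicates may need sign/Procrustes alignment)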


Parametric bootstrap PCA (Example)

- Consumers describe 10 white wines with 14 sensory attributes.
- Consumers score each wine between 1 and 10 on each attribute.
- Collect the averages across consumers in a 10 × 14 matrix X.

library(SensoMineR)
data(napping)
data(napping.words)
library(ade4)
library(magrittr)
library(dplyr)

df = napping.words
wine.pca = dudi.pca(df, scannf = F, nf = 3)
row.scores = data.frame(li = wine.pca$li,
                        SampleCode = rownames(df))
row.scores = row.scores %>%
  left_join(data.frame(SampleCode = rownames(df)))
evals.prop = 100 * (wine.pca$eig / sum(wine.pca$eig))


Parametric bootstrap PCA (Example)

library(ggplot2)

evals = wine.pca$eig
p = ggplot(data = row.scores, aes(x = li.Axis1, y = li.Axis2,
                                  label = SampleCode, col = SampleCode)) +
  geom_point(size = 1) +
  geom_text(aes(label = SampleCode), hjust = 0, vjust = 0.1) +
  labs(x = sprintf("Axis 1 [%s%% variance]", round(evals.prop[1], 1)),
       y = sprintf("Axis 2 [%s%% variance]", round(evals.prop[2], 1))) +
  ggtitle("PCA scores for wine data")


Parametric bootstrap PCA (Example)

[Figure: PCA scores for the wine data, Axis 1 (32.6% variance) versus Axis 2 (21.5% variance), with the 10 wines labeled: T Michaud, T Renaudie, T Trotignon, T Buisse Domaine, T Buisse Cristal, V Aub. Silex, V Aub. Marigny, V Font. Domaine, V Font. Brules, V Font Coteaux.]

- The first two PCs explain almost 54% of the total variability.


Parametric bootstrap PCA (Example)

- With the bootstrap confidence ellipses.


Parametric bootstrap PCA (Example)

- PCA on the wine data set (Josse, Wager, and Husson 2016).


Parametric bootstrap PCA (Example)

- Bootstrap for PCA scores (Josse, Wager, and Husson 2016).


Parametric bootstrap PCA (Notes)

- Read (Josse, Wager, and Husson 2016) for confidence areas for PCA based on the jackknife.


References for this lecture

- HW 2018: Holmes and Huber, Modern Statistics for Modern Biology.

- Christof 2016: Lecture notes on inference for visualization.


References for this lecture

Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F. Swayne, and Hadley Wickham. 2009. "Statistical Inference for Exploratory Data Analysis and Model Diagnostics." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4361–83.

Diaconis, Persi. 1985. "Theories of Data Analysis: From Magical Thinking Through Classical Statistics." In Exploring Data Tables, Trends, and Shapes. Wiley.

Josse, Julie, Stefan Wager, and François Husson. 2016. "Confidence Areas for Fixed-Effects PCA." Journal of Computational and Graphical Statistics 25 (1): 28–48.

Milan, Luis, and Joe Whittaker. 1995. "Application of the Parametric Bootstrap to Models That Incorporate a Singular Value Decomposition." Journal of the Royal Statistical Society: Series C (Applied Statistics) 44 (1): 31–49.