
Lecture 25: Inference for data visualization

Pratheepa Jeganathan

05/31/2019


Recall


- One-sample sign test, Wilcoxon signed-rank test, large-sample approximation, median, Hodges-Lehmann estimator, distribution-free confidence interval.
- Jackknife for bias and standard error of an estimator.
- Bootstrap samples, bootstrap replicates.
- Bootstrap standard error of an estimator.
- Bootstrap percentile confidence interval.
- Hypothesis testing with the bootstrap (one-sample problem).
- Assessing the error in bootstrap estimates.
- Example: inference on the ratio of heart attack rates in the aspirin-intake group to the placebo group.
- The exhaustive bootstrap distribution.


- Discrete data problems (one-sample and two-sample proportion tests, test of homogeneity, test of independence).
- Two-sample problems (location problem with equal variance, unequal variance, exact test or Monte Carlo, large-sample approximation, H-L estimator, dispersion problem, general distribution).
- Permutation tests (permutation test for continuous data, different test statistics, accuracy of permutation tests).
- Permutation tests (discrete data problems, exchangeability).
- Rank-based correlation analysis (Kendall and Spearman correlation coefficients).
- Rank-based regression (straight line, multiple linear regression, statistical inference about the unknown parameters; nonparametric procedures do not depend on the distribution of the error term).
- Smoothing (density estimation, bias-variance trade-off, curse of dimensionality).
- Nonparametric regression (local averaging, local regression, kernel smoothing, local polynomial, penalized regression).


- Cross-validation, variance estimation, confidence bands, bootstrap confidence bands.
- Wavelets (wavelet representation of a function, coefficient estimation using the discrete wavelet transform, thresholding: VisuShrink and SureShrink).
- One-way layout (general alternative (Kruskal-Wallis test), ordered alternatives), multiple comparison procedures.
- Two-way layout (complete block design (Friedman test)), multiple comparison procedures, median polish, Tukey additivity plot, profile plots.
- Better bootstrap confidence intervals (bootstrap-t, percentile interval, BCa interval, ABC interval).


Inference for data visualization


Overview

- Visualization for
  - exploratory data analysis (to search for structure/patterns in the data)
  - model diagnostics (to search for structure not captured by the fitted model)
- Magical thinking: the natural human inclination to over-interpret connections between random events (Diaconis 1985).


Overview

- Which plot shows the higher degree of association? (left/right)

[Figure: two scatterplots of y against x, drawn on different axis scales (left: 5.0-15.0 on both axes; right: x from 0 to 25, y from 0 to 30).]


- Reference: (Diaconis 1985).
- Scale changes the viewer's perception.
- We can use protocols to check whether a pattern in the data arises by chance.


Inference for plots


Inference for plots

- Reference: (Buja et al. 2009).
- 'The lineup' protocol:
  - Generate 19 null plots (assuming that structure is absent).
  - Arrange all 19 null plots, inserting the plot of the real data at a random location.
  - Ask a human viewer to single out the real plot.
  - Under the null hypothesis that all plots are the same, there is a one-in-20 chance of singling out the real one.
  - If the viewer chooses the plot of the real data, the discovery can be assigned a p-value of 1/20 = 0.05.
  - A larger number of null plots could yield a smaller p-value.
    - But there is a limit to how many plots a human can consider.


Inference for plots

- Repeat the 'lineup' protocol above with K independently recruited viewers.
- Let X be the number of viewers who choose the real-data plot.
- Suppose k ≤ K viewers selected the plot of the real data.
- p-value = P(X ≥ k), where X ∼ Binomial(n = K, p = 1/20).
- If all the viewers picked the real-data plot, the p-value is 0.05^K.


Inference for plots (Example)

- Scatter plot.
- Null hypothesis: the two variables are independent.
  - Null data sets can be produced by permuting the values of one variable against the other (a sketch of such a permutation lineup follows below).
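To make the permutation scheme concrete, here is a minimal R sketch of a lineup; the example data set (mtcars), the panel count, and all object names are our own choices, not from Buja et al. (2009).

library(ggplot2)
set.seed(2019)

# real data: any two numeric variables
real = data.frame(x = mtcars$wt, y = mtcars$mpg)

pos = sample(20, 1)  # secret panel position of the real plot
lineup = do.call(rbind, lapply(1:20, function(i) {
  # permute y in the 19 null panels to break any association with x
  y = if (i == pos) real$y else sample(real$y)
  data.frame(x = real$x, y = y, panel = i)
}))

ggplot(lineup, aes(x, y)) +
  geom_point(size = 0.8) +
  facet_wrap(~ panel, ncol = 5)

pos  # reveal only after the viewer has guessed

The nullabor R package provides a fuller implementation of lineup protocols.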


Inference for plots (Scatter plot Example)

- Cities across the USA were rated in 1984 according to many features (Boyer and Savageau 1984).
  - Census data, not a random sample.
- Consider the 'Climate-Terrain' and 'Housing' variables.
  - Climate-Terrain: low values mean uncomfortable temperatures, higher values mean moderate temperatures.
  - Housing: higher values mean a higher cost of owning a single-family residence.
- Expected association: more comfortable climates call for higher average housing costs.


Inference for plots (Scatter plot Example)

Figure 1: Source: Buja et al. (2009).


Inference for plots (Scatter plot Example)

Figure 2: Source: Buja et al. (2009).


Inference for plots (Scatter plot Example)

# number of students
K = 8
# number of correct picks
k = 2
pvalue = sum(dbinom(k:K, K, 1/20)); pvalue

## [1] 0.05724465

- The p-value is about 0.057, so at the 5% level we cannot quite reject the null hypothesis that Housing is independent of Climate-Terrain; the evidence for association is borderline.


Inference for plots (time series plot example)

- HSBC ('The Hongkong and Shanghai Banking Corporation') daily stock returns, shown in two panels:
  - 2005 returns only.
  - 1998–2005 returns.
- In each panel, select the plot that is most different and explain why.


Inference for plots (time series plot example)

Figure 3: Source: Buja et al. (2009).


Inference for plots (time series plot example)

Figure 4: Source: Buja et al. (2009).


Inference for plots (time series plot example)

# number of students
K = 8
# number of correct picks
k = 4
pvalue = sum(dbinom(k:K, K, 1/8)); pvalue

## [1] 0.01124781


Inference for plots (time series plot example)

- For the 2005 return data, the viewer should have had difficulty selecting the real-data plot.
  - This was a year of low and stable volatility in returns.
- For the 1998–2005 return data, it should be easy.
  - There are two volatility bursts:
    - in 1998, due to the Russian bond default and the LTCM (Long-Term Capital Management) collapse;
    - in 2001, due to the 9/11 attacks.
  - After 2001, volatility stabilizes at a low level.


Principal component analysis (PCA)


PCA

- PCA is primarily an exploratory tool.
- PCA finds a low-dimensional subspace that minimizes the distances between the projected points and the subspace.
- Consider observations x1, x2, · · · , xn, where each xi is a p-dimensional column vector.
- Combine the xi into a matrix X of dimension n × p.
- Start by centering X; from now on consider X centered, i.e. 1ₙᵀX = 0.
- PCA finds matrices U, S, and V such that X = USVᵀ (the singular value decomposition, SVD).
- The columns of V are the new variables.
- The principal components are C = US (see the sketch below).
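A minimal R sketch of this construction via the SVD; the example data set (USArrests) and the object names are our own choices.

# PCA by SVD: center X, decompose, and form the principal components C = US
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # n x p, centered
s = svd(X)             # X = U S V^T
C = s$u %*% diag(s$d)  # principal components (scores)
V = s$v                # new variables (loadings)

# sanity check against R's built-in PCA: the scores agree up to column signs
max(abs(abs(C) - abs(prcomp(X)$x)))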


PCA

- PCA as an optimization problem.
- The first column v1 of V is such that 〈v1, v1〉 = 1 and

  v̂1 = argmax over v1 of Var(Xv1).

- Find v2 such that 〈v1, v2〉 = 0, 〈v2, v2〉 = 1, and

  v̂2 = argmax over v2 of Var(Xv2).

- Keep going in the same way until v1, v2, · · · , vq have been collected, and put them in V̂ of dimension p × q (a quick numerical check follows this list).
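As a sanity check of the variance-maximization characterization (our own simulation, not from the slides), the first right singular vector should achieve at least the projected variance of any random unit vector:

set.seed(1)
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
v1 = svd(X)$v[, 1]  # first right singular vector

# variance of the projection Xv for 1000 random unit vectors v
rand_var = replicate(1000, {
  v = rnorm(ncol(X)); v = v / sqrt(sum(v^2))
  var(as.vector(X %*% v))
})
var(as.vector(X %*% v1)) >= max(rand_var)  # TRUE: v1 maximizes Var(Xv)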


Bootstrap PCA

- There are two ways to bootstrap PCA when the rows of X are random:
  - partial bootstrap;
  - total bootstrap.
- Partial bootstrap:
  - Project the B bootstrap replications onto the initial subspace.
  - The initial subspace is obtained by PCA on the original X.
  - Underestimates the variation of the parameters (Milan and Whittaker 1995).
- Total bootstrap:
  - Perform a new PCA on each replication.
  - Nuisance variation in the PCAs of bootstrap samples: reflections and rotations.
  - Align the PCAs of the bootstrap samples.


Bootstrap PCA (need of Procrustes analysis)

- For the total bootstrap, we need to align the PCAs of the bootstrap samples.
- This is usually done using Procrustes analysis.
- Procrustes refers to a bandit from Greek mythology who made his victims fit his bed by stretching their limbs (or cutting them off).
- Procrustes analysis is used in statistical shape analysis to align shapes and minimize the differences between them, retaining the real shape by removing nuisance parameters (the rotation step is sketched in code after this list):
  - translation in space,
  - rotation in space,
  - sometimes scaling of the objects.
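For the rotation step alone, the orthogonal Procrustes problem has a closed-form solution via the SVD. Below is a minimal R sketch with our own (hypothetical) helper name and a toy check.

# orthogonal Procrustes: rotation/reflection R minimizing ||A - BR||_F;
# if svd(t(B) %*% A) = U D V^T, the minimizer is R = U V^T
procrustes_rotation = function(A, B) {
  s = svd(t(B) %*% A)
  s$u %*% t(s$v)
}

# toy check: recover a known rotation of a 2-D point cloud
set.seed(1)
theta = pi / 5
R0 = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
A = matrix(rnorm(40), 20, 2)
B = A %*% t(R0)                                # B is A rotated by R0
max(abs(B %*% procrustes_rotation(A, B) - A))  # ~ 0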


Bootstrap PCA (need of Procrustes analysis)

- Shapes differ by rotation, translation, and scaling.
- We need to remove the translation, rotation, and scaling effects before comparing the shapes of the spinal cord defined by the landmarks (red points).


Bootstrap PCA (need of Procrustes analysis)

- In PCA, the shapes are the projected points in the lower-dimensional subspace spanned by, say, PC1 and PC2.
- Procrustes rotation maps the rows of the PCA estimators onto the initial configuration. The first table corresponds to X and the others to X̂∗b, b = 1, · · · , B.


Bootstrap PCA

- Collect the B bootstrapped PCAs obtained by resampling the rows of the data matrix X:

  V*1_q, V*2_q, · · · , V*B_q.

- Align all the projected point sets using Procrustes alignment.
- Find the rotation for each bootstrap replicate:

  R̂_b = argmin over R of ‖ XV_q − X*b V*b_q R ‖.

- Apply the rotation to the projected data points of each bootstrap sample:

  X*b V*b_q R̂_b.

- Overlay the points and draw contours around them.
- This nonparametric bootstrap approach modifies the structure of the rows of X in each bootstrap replication (see the sketch below).
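Putting the pieces together, here is a minimal R sketch of the total bootstrap with Procrustes alignment; the data set, the number of replicates, and all object names are our own choices.

set.seed(1)
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
q = 2
V_q = svd(X)$v[, 1:q]
ref = X %*% V_q                        # initial configuration X V_q

procrustes_rotation = function(A, B) { # rotation R minimizing ||A - BR||_F
  s = svd(t(B) %*% A)
  s$u %*% t(s$v)
}

B_reps = 200
aligned = vector("list", B_reps)
for (b in 1:B_reps) {
  idx = sample(nrow(X), replace = TRUE)        # resample rows of X
  Xb = scale(X[idx, ], center = TRUE, scale = FALSE)
  V_b = svd(Xb)$v[, 1:q]                       # new PCA on the replication
  proj = Xb %*% V_b                            # X*b V*b_q
  R_b = procrustes_rotation(ref[idx, ], proj)  # align to the initial configuration
  aligned[[b]] = proj %*% R_b                  # X*b V*b_q R_b
}
# overlay the aligned scores per observation and draw contours (e.g. with ggplot2)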


Parametric bootstrap PCA

- When the rows and the columns of X are fixed, we can use the parametric bootstrap.
- Perform PCA on X to estimate V̂_q.
- Estimate the residual variance σ̂² from the residual matrix E = X − XV̂_qV̂_qᵀ.
- Draw ε_ij ∼ N(0, σ̂²) and collect them in E∗b.
- Generate a new matrix X∗b = XV̂_qV̂_qᵀ + E∗b.
- Perform PCA on each X∗b (see the sketch below).
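A minimal R sketch of this parametric bootstrap; the data set and object names are our own choices, and σ̂² here is a simple plug-in estimate with no degrees-of-freedom correction (an assumption on our part).

set.seed(1)
X = scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
n = nrow(X); p = ncol(X); q = 2

V_hat = svd(X)$v[, 1:q]
fitted = X %*% V_hat %*% t(V_hat)  # rank-q signal estimate X V V^T
E = X - fitted                     # residual matrix
sigma2_hat = mean(E^2)             # naive plug-in estimate of sigma^2

boot_V = lapply(1:200, function(b) {
  E_b = matrix(rnorm(n * p, sd = sqrt(sigma2_hat)), n, p)  # eps_ij ~ N(0, sigma^2)
  X_b = fitted + E_b                                       # X*b = X V V^T + E*b
  svd(scale(X_b, center = TRUE, scale = FALSE))$v[, 1:q]   # PCA on X*b
})
# (as with the total bootstrap, the replicates may need sign/Procrustes alignment)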


Parametric bootstrap PCA (Example)

- Consumers describe 10 white wines with 14 sensory attributes.
- Consumers score each wine between 1 and 10 on each attribute.
- Collect the averages across consumers in a 10 × 14 matrix X.

library(SensoMineR)
data(napping)
data(napping.words)
library(ade4)
library(magrittr)
library(dplyr)

df = napping.words
wine.pca = dudi.pca(df, scannf = F, nf = 3)
row.scores = data.frame(li = wine.pca$li,
                        SampleCode = rownames(df))
row.scores = row.scores %>%
  left_join(data.frame(SampleCode = rownames(df)))
evals.prop = 100 * (wine.pca$eig / sum(wine.pca$eig))


Parametric bootstrap PCA (Example)

library(ggplot2)

evals = wine.pca$eig
p = ggplot(data = row.scores, aes(x = li.Axis1, y = li.Axis2,
                                  label = SampleCode, col = SampleCode)) +
  geom_point(size = 1) +
  geom_text(aes(label = SampleCode), hjust = 0, vjust = 0.1) +
  labs(x = sprintf("Axis 1 [%s%% variance]", round(evals.prop[1], 1)),
       y = sprintf("Axis 2 [%s%% variance]", round(evals.prop[2], 1))) +
  ggtitle("PCA scores for wine data")


Parametric bootstrap PCA (Example)

[Figure: PCA scores for the wine data, Axis 1 (32.6% variance) versus Axis 2 (21.5% variance), with the 10 wines labeled: T Michaud, T Renaudie, T Trotignon, T Buisse Domaine, T Buisse Cristal, V Aub. Silex, V Aub. Marigny, V Font. Domaine, V Font. Brules, V Font Coteaux.]

- The first two PCs explain almost 54% of the total variability.


Parametric bootstrap PCA (Example)

- With the bootstrap confidence ellipses.


Parametric bootstrap PCA (Example)

- PCA on the wine data set (Josse, Wager, and Husson 2016).


Parametric bootstrap PCA (Example)

- Bootstrap for PCA scores (Josse, Wager, and Husson 2016).


Parametric bootstrap PCA (Notes)

- Read (Josse, Wager, and Husson 2016) for confidence areas for PCA based on the jackknife.


References for this lecture

- HW 2018: Holmes and Huber, Modern Statistics for Modern Biology.

- Christof 2016: Lecture notes on inference for visualization.


References for this lecture

Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F. Swayne, and Hadley Wickham. 2009. "Statistical Inference for Exploratory Data Analysis and Model Diagnostics." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4361–83.

Diaconis, Persi. 1985. "Theories of Data Analysis: From Magical Thinking Through Classical Statistics." In Exploring Data Tables, Trends, and Shapes. Wiley.

Josse, Julie, Stefan Wager, and François Husson. 2016. "Confidence Areas for Fixed-Effects PCA." Journal of Computational and Graphical Statistics 25 (1): 28–48.

Milan, Luis, and Joe Whittaker. 1995. "Application of the Parametric Bootstrap to Models That Incorporate a Singular Value Decomposition." Journal of the Royal Statistical Society: Series C (Applied Statistics) 44 (1): 31–49.