
  • Data Mining

    Robert A Stine

    stine @ wharton.upenn.edu

    Dept of Statistics, Wharton School

    University of Pennsylvania

    St Gallen

    June 2016

  • Introduction

    Syllabus, materials, topics

  • Welcome and Introductions

    Background

    Undergraduate: political science and math. Interest in voting behavior and election polls

    PhD in Statistics from Princeton: bootstrap resampling in time series analysis

    Interest in data mining: research in variable selection for time series and large models with 1000s of predictors

    Recent and on-going projects

    Streaming variable selection

    Spatial and sparse time series

    Models for text


  • What is data mining?

    An insult when I learned statistics!

    Make up results to suit your theory

    Discovering predictive patterns in data

    Data: typically large, loosely structured data sets

    Seldom much better than hand-tuned regression

    DM often faster and provides a useful diagnostic

    Science

    Testable claims

    Tasks

    Prediction and classification

    Estimation and interpretation


  • Caution!

    We find patterns in surprising places with fascinating consequences: a $28,000 grilled cheese!

  • Data Mining in Social Science

    Poorly suited to social science?

    Empiricism run wild, lack of theory or hypotheses

    Post hoc inference

    Response

    Leverage technology

    Tukey: the cost of theory remains expensive compared to computing

    Honest

    A better match to what most do in common practice

    Diagnostic

    Have I missed something?

    Deep connections

    Multidimensional scaling, likelihood, modern regression


  • Course Objectives

    For you to leave confident that you can

    Recognize when data mining can help

    Apply new techniques to your own data

    Explain the intuition behind methods

    Expand your knowledge of statistical methods

    Big picture

    Wide variety of tools for building models

    Save time/energy for finding data, asking questions

    More likely to happen if you ask questions here!

    So don't hesitate...


  • Textbooks

    Machine learning

    Modern term for statistical techniques designed for large amounts of complex data

    Originated in Computer Science

    An Introduction to Statistical Learning (ISL 2013) James, Witten, Hastie, Tibshirani

    Examples in R for each chapter

    Basis for much of the afternoon sessions

    Elements of Statistical Learning (ESL 2009) Hastie, Tibshirani, and Friedman

    More details, additional topics not covered in ISL


  • Plan for Lectures

    Loosely follow the James text

    Models and averaging (Ch 1-4): regression, bootstrap, classification

    Feature selection (Ch 4-6): cross-validation, shrinkage, lasso

    Nonlinearity (Ch 7, HO, ESL Ch 11): GAM, neural networks, boosting

    Tree-based methods (Ch 8): CART, random forests, bagging

    Practical modeling (Ch 9-10, readings): case studies, text, unsupervised features

    Morning: lecture

    Afternoon: R and JMP

  • Requirements

    Taking the course for a grade?

    Doing the exercises is useful follow-up even if not

    Daily exercises

    James textbook

    Start on during the afternoon lab sessions

    Several for each class

    Submit package within 2 weeks

    Analysis in R and/or JMP

    Explanation as called for in exercise

    Appropriate use of graphics


  • Additional References and Software

  • Research Papers

    Fitting regressions, big models, inference

    Chatfield (1995), JRSS: Model Uncertainty, Data Mining and Statistical Inference

    Sala-I-Martin (1997), AER: I Just Ran Two Million Regressions

    Hand et al (2000), Statistical Science: Data Mining for Fun and Profit

    Foster and Stine (2004), JASA: Variable Selection in Data Mining

    Ward, Greenhill, and Bakke (2010), J of Peace Research: The Perils of Policy by p-Value: Predicting Civil Conflicts

    Stine (1989), SMR: An Introduction to Bootstrap Methods


    https://dl.dropboxusercontent.com/u/5425615/stgallenDM/sgdm_papers.tgz

  • Research Papers

    Neural networks

    Zeng (1999), SMR: Prediction and Classification with Neural Networks

    Beck, King, and Zeng (2000), APSR: Improving Quantitative Studies of International Conflict

    DeMarchi, Gelpi, and Grynaviski (2004), APSR: Untangling Neural Nets

    Trees

    Berk (2006), SMR: An Introduction to Ensemble Methods for Data Analysis

    Weerts and Ronca (2010), Education Economics: Using Classification Trees to Predict Alumni Giving

    Kastellec (2010), J of Empirical Legal Studies: Statistical Analysis of Judicial Decisions with Trees


  • Research Papers

    Text analysis

    Hopkins and King (2007): Extracting Social Science Meaning from Text

    Hopkins and King (2010), AJPS: A Method of Automated Nonparametric Content Analysis for Social Science

    Monroe, Colaresi, and Quinn (2008), Pol Analysis: Fightin' Words: Identifying Political Conflict

    Turney and Pantel (2010): From Frequency to Meaning: Vector Space Models of Semantics

    Grimmer and Stewart (2013), ArXiv: Text as Data: Promise and Pitfalls


  • Software

    Considerations

    Interface

    Type commands at prompt (R) vs point-and-click (JMP)

    Data input and output

    Ease of getting your data in and out

    Extensibility

    Ability to add your own customization

    Graphics

    Presentation (R) vs interactive (JMP)

    Scope

    Cutting (bleeding) edge (R) vs ease of use (JMP)

    Cost

    Free (R) vs commercial (JMP from SAS, free trial)

    Support, user community

    Particularly within your own company or collaborators

    Class

    Blend of R (text) and JMP (interactive)

    Alternatives from IBM, Stata...


    You don't mine data by hand!

  • Books

    Principles of Data Mining (2001) Hand, Mannila, and Smyth

    Personal favorite, but dated

    More like a textbook with blend of theory and practice, but limited examples.

    Large number of background citations

    Data Mining (3rd Ed, 2011) Witten, Frank, and Hall

    Emphasis on trees, association rules and domain knowledge

    Terse but comprehensive coverage, with terminology

    Bound to the WEKA software, small examples

    Data Mining Techniques (3rd Ed, 2011) Linoff and Berry

    Business decision making emphasis (100+ pages intro)

    Wide scope, with emphasis on business communication, visualization

    "Data mining cannot be bought, it must be mastered." Discusses privacy concerns

    Data Mining with R: Case Studies (2011) Torgo

    Introduction to R with five case studies

    Smallish data: Algae blooms has 7 cases to predict.

    Others cover stock trading scheme, fraud detection. Special coverage of microarrays

    Includes SVM, MARS, along with basics


  • More Books

    Practical Applications of Data Mining (2012) Suh

    Eclectic topics, with emphasis on association rules, fuzzy sets, and neural networks. Less statistics.

    Covers data retrieval, SQL, database management issues

    Numerous detailed algorithm examples for small, in-text data tables

    Data Mining (2011, 2nd Edition) Kantardzic

    Broad coverage, from business aspects to ensemble learning to text mining

    Usefully annotated bibliography, encyclopedic extent.

    Precious little in the way of real examples.

    Background and specific methods

    Modern Multivariate Statistical Techniques (2008). Izenman

    Principal Component Analysis (2002). Jolliffe

    Independent Component Analysis (2004). Stone

    Neural Networks for Pattern Recognition (1996). Bishop

    Classification and Regression Trees (1984). Breiman, Friedman, Stone, Olshen


  • Data

    What would be the ideal data for answering my questions?

  • Data Collection Issues

    Collecting, organizing, cleaning

    90% of work on any real task

    Weak aspect of statistics education, training

    Audit trail, value in scripting, reproducible

    Common problems

    Data are not random samples

    Secondary source

    Data gathered for other task (eg, accounting), repurposed for modeling

    Merging data from multiple sources

    Inconsistent labels, conventions

    Changes over time in coverage, definitions

    Mismatching cases from different sources

    Missing data

    Accuracy

    Data entry errors

    Mislabeled variable

    Categories as numbers


  • Data Table

    Traditional rectangular layout

    Rows are n cases

    Columns are variables

    Getting taller and wider these days

    Often more columns than rows (Arcene)

    When tall, much taller! (Income)

    Most software uses tabular convention

    Some customize, separating data into multiple files

    Longitudinal data

    Transactional data: customer history, patient records

    Dependence exaggerates amount of data

    [Schematic: rows case1, case2, case3, ..., casen by columns var1, var2, ..., vark]

  • Tall Data Table

    Classical big data, becoming less common...

    Income data for targeted marketing from UCI

    Goal: Identify high-income households based on location and descriptive characteristics

    32,500 rows x 15 columns

    US Census data on households

    Banking data from UCI

    Goal: Identify customers who will respond to solicitation for a term deposit program

    45,000 rows x 18 columns

    Portuguese bank study

    Sparse response of 10% (fraud is much less)

    UCI = Univ of California, Irvine ML repository, http://archive.ics.uci.edu

  • Square Data Table

    Election survey

    Expanding domain of questions balanced against the cost of more cases

    American National Election Study (ANES) 2008

    Goal: Explain why voters pick the candidate that they choose

    2,000 registered voters (rows), with about the same number of features (columns)

    All sorts of variables

    Age, party affiliation, open-ended responses

    Complications

    Missing data

    Interviewer effects

    Sampling weights, ...


  • Wide Data Table

    Automation

    Automated data collection allows much more extensive measurement

    Text data is very wide

    Arcene example from UCI

    200 cases, positive/negative cancer

    10,000 features: mass spectrometer measurements

    Challenge: Separate normal cells from cancerous cells (prostate, ovarian)

    Complications galore

    Easy to fit a model that can separate the observed cases perfectly because n < p

    Hard to test out-of-sample because few cases


    p = #possible explanatory variables

    Data: arcene

  • Some of Both

    Financial data from Imperial

    Defaults of customers

    Sample of training data 25,000 cases, 770 features

    Sparse: only 3,244 non-zero losses

    Anonymous

    Don't know the definitions of the features

    Don't know which are categorical!

    Huge amount of data

    Training data alone is 488 MB

    105,471 cases

    Takes a while just to read data


  • Is big data really big?

    Sample from a homogeneous population?

    Populations change over time

    Household energy survey in US

    Longitudinal data

    Repeated measures

    Instrumentation allows data on a finer scale

    Components replace averages, revealing noise

    Time series

    Financial and economic data

    The data from Imperial are really a time series, but that's hidden

    Assumption of stationarity

    Lurking issue of dependence


  • Once Data Are Identified...

    Deal with the Vs of big data

    Volume, velocity, variety, validity, ...

    Data preparation often requires programming

    Building variables from accounting, billing data

    Example: the structure of a mortgage loan recorded in the data

    Build new columns based on presence of certain strings in this column

    Scripting tools

    Perl, python, ruby, R, unix utilities (Makefiles)

    Find a tool that you like to use and master it


  • Data from ISL

    R: Use the str command to see the structure of a data set, as sketched below
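    A minimal sketch, assuming the ISLR companion package for the James et al. text is installed:

        # install.packages("ISLR")   # companion package for the ISL text
        library(ISLR)

        str(Auto)       # type and first few values of every column
        dim(Auto)       # number of rows and columns
        summary(Auto)   # numeric summaries and factor counts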

  • Regression Models

    Building block of all approaches

  • Preliminary Analysis

  • Getting Started

    Plots remain important, even with millions

    Experience in modeling bankruptcies

    Outliers in text analysis

    Sampling

    Don't have to plot every observation

    We know a lot about sampling

    Objective

    Trade-off: Pay now or pay later

    Outliers, leverage points

    Transformations

    Combinations


  • Univariate Plots

    Plot linking and brushing

    Interactive tools add value to univariate plots

    Things to look for

    Scales, particularly when unfamiliar with data

    Age is not the age of residents; it's the proportion of homes built before 1940

    Skewness, multimodal: square peg, round hole

    Some methods automate conversion to symmetry

    Outliers

    Outliers matter, even in large data tables.

    Scope

    At least inspect variables of primary interest!

    Data: boston_from_R

    JMP: Analyze > Distribution

  • Scatterplot Matrix

    A visual correlation matrix

    Mix data types

    Numerical/Categorical

    Visual tables

    Embellishments

    Coloring subset

    Subsampling

    Interactive

    Brushing

    Plot linking

    Weakness of R

    JMP: Graph > Scatterplot Matrix. See Figure 3.6 of ISL and the sketch below.
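    A small R sketch using the Boston housing data (assuming the MASS package, which ships with R):

        library(MASS)   # provides the Boston housing data used in ISL
        pairs(Boston[, c("medv", "lstat", "rm", "age")],
              pch = 16,
              col = adjustcolor("steelblue", alpha.f = 0.4))  # translucent points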

  • Rounded Data

    Data have often been rounded

    Are categories ratio-level data?

    Plots can hide a great deal

    Dithering, adding a little random noise, helps (see the sketch below)

    Data: caravans

    JMP graphs categorical data in a grid, with dithering
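    A minimal R sketch of dithering with jitter(); the data here are simulated purely for illustration:

        set.seed(1)
        x <- round(rnorm(500))            # heavily rounded predictor
        y <- x + rnorm(500, sd = 0.5)

        plot(x, y)                        # rounding stacks the points in columns
        plot(jitter(x, amount = 0.2), y)  # a little noise reveals the density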

  • Summary of Exploratory Analysis

    Data are key to any modeling

    Take time to understand nuances of your data

    Simple, interpretable plots

    Substantive insight + exploration = new features

    Transformations

    Substantive combinations (eg, ratios)

    Clustering

    Plots remain important, even if many variables

    Use samples to speed viewing (though may miss outliers)


    Would be slow to plot 25,000 points at zero for Imperial data

  • Regression Methods

    Quick review, emphasizing link to DM

    Regression provides an opportunity to study issues from DM in a familiar context

  • Why Emphasize Regression?

    Claim

    Regression can match the predictive performance of black-box models...

    Just need the right explanatory variables!

    Opportunity for substantive insights

    Regression is familiar

    Recognize then fix problems

    Shares problems with black-boxes

    Foundation for understanding complex, opaque models

    Familiarity allows improvements

    Several given in Foster and Stine (2004)


  • Classical Regression

    Two-part model

    Mean of the response is linear in the explanatory variables: E(Y|X) = μy|x = β0 + β1X1 + … + βpXp

    Unexplained, idealized iid random variation: Var(Y − μy|x) = σ²

    Discussion

    No limits on the Xs: powers, logs, products

    Model assumes you know which variables to use; just a question of estimating the βs

    Error assumptions are reasonable if the model is correctly specified. CLT: a sum of small, omitted contributions is normal

    Interpretation becomes difficult as the model grows

  • Least Squares

    Assume we know that yi = β0 + β1xi1 + … + βpxip + εi, with the εi independent and of equal variance

    Data: (n × 1) response vector Y, (n × p) matrix of explanatory variables X

    Fitted values: ŷ = b0 + b1X1 + … + bpXp

    OLS criterion: minb Σ(yi − ŷi)², solved by b = (X′X)⁻¹X′Y

    Normal equations: Σ(yi − ŷi)xi = 0

    Goodness of fit: R² = SS(ŷ)/SS(y)

    Standard errors: Var(b) = σ²(X′X)⁻¹, estimated by var(b) = s²(X′X)⁻¹

    var(b1) ≈ s²/(n var(X))

    Equation is KNOWN
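    A short R sketch computing the OLS solution by hand and checking it against lm(); the data are simulated for illustration:

        set.seed(1)
        n <- 100
        x <- rnorm(n)
        y <- 2 + 3 * x + rnorm(n)

        X <- cbind(1, x)                    # design matrix with intercept
        b <- solve(t(X) %*% X, t(X) %*% y)  # b = (X'X)^{-1} X'Y
        b
        coef(lm(y ~ x))                     # matches the matrix solution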

  • Example: Depreciation

    How should BMW set the lease price for cars in its 3-series sedans?

    Depends on residual value at end of lease

    Data: BMW-prices

  • Example: Depreciation

    How should BMW set the lease price for cars in its 3-series sedans?

    Sample of 218 used BMW 3-series cars

    Model includes mileage, age, and type of car as well as an interaction of age and type.

    Effect-coded estimates: R² = 71%, s = $2,469

  • Careful Interpretation

    What do these estimates tell us?

    Regression concerns the comparison of means under different conditions

    A highly beefed-up two-sample testing procedure!

    Association ≠ causation

    Carefully read the ISL text discussion of the advertising example.


  • What's an interaction?

    What are these terms in the model?

    Dummy variables

    Represent all but one category

    Intercept represents the reference category, the left-out group

    Effect coding

    Intercept estimates an overall average


  • View in 3-D

    Geometry

    Two numerical variables define a plane in 3D


  • Further Visualization

    Interactive tools: profiling

    Model profiling: viewing properties of the fitted least squares equation

    Interactive view of the equation

  • Linear Regression ≠ Lines

    What does the word "linear" mean? That the coefficients can be estimated as weighted averages of the response, not that the model fits lines.

    Curvature modeled by

    Powers (such as squares of variables)

    Interactions (products of variables)

    Transformations (most often logs)

    Examples

    Interactions: Boston housing data

    Transformations: diamond prices


  • Interactions

    Boston housing data: a classic economic analysis of the impact of pollution levels on housing values

    Impact of social class and home sizes

    Repeated from previous scatterplot matrix example

    Note censoring of values, outliers, curvature

    Marginal association consistent with partial slopes

    Data: boston_from_R

  • Interactions

    Initial linear fit ignores the interaction between percentage Lower Status and Rooms.

    Adding the interaction term to the model reveals a strong effect, as sketched below

    Interpretation from coefficients?

    From interactive profile view?

    How'd you know to do that?
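    A brief R sketch of this fit on the Boston data (assuming the MASS package); lstat is the percentage lower status, rm the average number of rooms:

        library(MASS)
        fit_main <- lm(medv ~ lstat + rm, data = Boston)
        fit_int  <- lm(medv ~ lstat * rm, data = Boston)  # adds the lstat:rm product
        anova(fit_main, fit_int)   # does the interaction improve the fit?
        coef(summary(fit_int))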

  • Response Surface

    Add squares of the features → a response surface

    Response surface

    Common in industrial design to locate an optimal mixture.

    Surface plot shows the model fit is not linear in the initial features

    Caution: Illustration only!

    Model has several highly leveraged outliers and omits relevant Xs


  • Transformations

    Regression relies on appropriate scaling

    Association may be highly nonlinear

    Logs (ie, percentage change) often useful

    Most stat tools won't know to do this!

    Prices of 7,568 diamonds: log price ≈ 8.6 + 2 log(carat)
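    A compact R sketch of the same kind of log-log fit, using the diamonds data from ggplot2 as a stand-in for the 7,568-diamond sample on the slide:

        library(ggplot2)   # supplies a diamonds data set
        fit <- lm(log(price) ~ log(carat), data = diamonds)
        coef(fit)          # slope is an elasticity: % change in price per 1% change in carat

        plot(log(price) ~ log(carat), data = diamonds, pch = ".")
        abline(fit, col = "red")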

  • Inference

    Classical inference

    Relies on validity of the assumed model

    Systematic approach

    Test the overall fit of the model first

    If we reject H0, then proceed to the individual effects.

    Why do this preliminary test?

    Data: BMW-prices

  • Collinearity and F

    Highly collinear features

    Significant separately, but not individually when together.

    Data: BMW-prices

  • Inference

    Testing components of the fitted model

    Standard error: variation of the estimate from sample to sample

    t-ratio: count of standard errors separating the estimate from 0

    p-value: smallest level at which one can reject the null hypothesis. Assumes a normal sampling distribution for the slopes.

    Statistical significance ≠ substantive importance


  • Bootstrapping

    Standard error is key to inference

    What are standard errors?

    BS is alternative method for obtaining standard errors and confidence intervals

    Estimates standard error by simulating sampling procedure

    Simulates by sampling with replacement from observed distribution of data

    Implementation

    R: the bootstrap package; also easy to do yourself

    JMP: basic tools throughout JMP Pro


  • Bootstrap Sampling

    Standard error

    Standard deviation of statistic

    Repeated independent samples from the population

    Bootstrap standard errors

    Simulate standard error

    Draw B samples from the observed sample itself.

    Sampling with replacement

    Collection of repeated statistics estimates sampling distribution

    [Diagram: population → sampling distribution → SE; the bootstrap draws B samples with replacement from the observed sample to produce replicate estimates θ1, θ2, …, θB]

    R script. See Figure 5.11, p. 190 of ISL, and the sketch below.
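    A do-it-yourself R sketch of a bootstrap standard error for a regression slope; the data are simulated for illustration:

        set.seed(1)
        n <- 100
        dat <- data.frame(x = rnorm(n))
        dat$y <- 2 + 3 * dat$x + rnorm(n)

        B <- 500
        slopes <- replicate(B, {
          idx <- sample(n, replace = TRUE)   # resample rows with replacement
          coef(lm(y ~ x, data = dat[idx, ]))["x"]
        })
        sd(slopes)                           # bootstrap standard error
        quantile(slopes, c(0.025, 0.975))    # percentile confidence interval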

  • Bootstrap Example

    Bootstrap regression in JMP

    Illustrative model with Age and Mileage

    Various conventions for representing categorical variables

    Compare results of least squares to those found by resampling

    BS finds: modest collinearity, a normal sampling distribution, and higher standard errors

    B = 500

    Least squares

  • Outliers

    Relevant even when modeling millions

    CLT not a guarantee

    Most of the money often remains in the outliers. Hiding outliers conceals risks, variation, and profits.

    Complex models often result in sparse data

    Sparse = most values are 0

    e.g., interaction between two rare properties

    What's the p-value for the regression in this figure?

    n = 10,000

    [Figure: at x = 0, ten 1s; at x = 1, one 1]

    What would bootstrapping do?

  • Sandwich Estimator

    Conventional standard errors are not robust

    Assume the model is true; test adding Xp+1 to the model

    Math produces expression for variance

    Sandwich estimators

    Robustness of validity, even for dependence

    Seldom implemented in software, particularly stepwise

    If the model is true: Var(b) = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = σ²(X′X)⁻¹(X′X)(X′X)⁻¹ = σ²(X′X)⁻¹, estimated by s²p+1(X′X)⁻¹

    Heteroscedasticity: var(b) = (X′X)⁻¹X′D²X(X′X)⁻¹, with D² diagonal (aka the White estimator)

    Dependence: var(b) = (X′X)⁻¹X′BX(X′X)⁻¹, with B block diagonal
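    A brief R sketch using the sandwich and lmtest packages (both on CRAN) to get heteroscedasticity-consistent standard errors:

        library(sandwich)
        library(lmtest)

        set.seed(1)
        x <- rnorm(200)
        y <- 1 + 2 * x + rnorm(200, sd = abs(x))   # error variance grows with |x|
        fit <- lm(y ~ x)

        coeftest(fit)                                    # conventional standard errors
        coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # White/sandwich standard errors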

  • Better Standard Error

    Heteroscedastic error

    Estimate standard error with outlier

    Sandwich estimator allowing heteroscedastic error variances gives a t-stat ≈ 2, not 10.

    Dependent error

    An even greater need for accurate SEs

    Netflix contest: Bonferroni (or hard thresholding) overfits due to dependence in the responses.

    Credit modeling with 100,000s of cases: everything is significant unless you incorporate longitudinal dependence


    R Script

  • Bias-Variance Tradeoff

    The more we allow a model to adapt to data, the more easily it adapts to random noise.

  • Statistics = Averaging

    Averages estimate parameters and predict; it is just a question of which data to average.

    Underlying rationale

    Find μ to minimize E(Y − μ)² given known features X

    Best predictor is always the conditional mean μ = E(Y|X)

    Ideal data

    Replications: a large count of cases that share each relevant set of explanatory characteristics

    Unlikely to find so many so similar


  • Example

    Problem: estimate an unknown function

    Y = μ(x) + error

    Data spread over x-axis from 0 to 1

    Estimator

    Average values of response for similar values of x

    Divide the x-axis from 0 to 1 into d bins: μ̂j = ave(Y | (j−1)/d ≤ x ≤ j/d), j = 1, 2, …, d

    bias_variance.R

    μ(x) = sin(2πx)

    [Three panels: Response vs Explanatory Variable, n = 100, with d = 5, 10, and 20 bins]

    What's the best choice for d, the number of bins? A sketch of the estimator follows.
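    A reconstruction in R of the binning estimator, in the spirit of the bias_variance.R script (assumed; not reproduced from it):

        set.seed(1)
        n <- 100; d <- 10
        x <- runif(n)
        y <- sin(2 * pi * x) + rnorm(n, sd = 0.25)

        bins <- cut(x, breaks = seq(0, 1, length.out = d + 1), include.lowest = TRUE)
        mu_j <- tapply(y, bins, mean)                # average the responses within each bin

        plot(x, y)
        curve(sin(2 * pi * x), add = TRUE, lty = 2)  # the true mean function
        points((seq_len(d) - 0.5) / d, mu_j, col = "red", pch = 16)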

  • Trade-Offs

    Bias versus variance

    A large number of bins is able to track the changing values of μ(x): lower bias

    A large number of bins implies few data values within a bin: higher variance

    [Plot: Bias², Variance, and Total squared error vs Number of Bins; sigma = 0.25, n = 100. Larger d gives smaller bias but larger variance.]


  • Looks Easy?

    Problem

    You don't get to see that plot!

    Residual SS

    Residual variation gets smaller and smaller as the model complexity grows (ie, as bins are added)

    [Plot: Sqrt Average Squared Residual vs Number of Bins; sigma = 0.25, n = 100]

  • Harder Still

    Optimal number of bins (model complexity)

    Depends on the amount of random noise relative to the curvature of the underlying, unknown μ(x)

    Implication

    With less noise, need less averaging

    But we dont know how much noise there is!

    Look back at the prior slide to see that we don't know the level of noise

    [Plots: Bias², Variance, and Total squared error vs Number of Bins for sigma = 0.15, 0.25, and 0.35, n = 100; the best d changes with the noise level]


  • Bias-Variance Review

    Model

    Data are independent observations of Y = μ(x) + ε, ε ~ N(0, σ²)

    Choice

    Rougher: average nearby cases, where |xi − xj| is small → small bias, large variance

    Smoother: average cases where |xi − xj| is large → large bias, small variance

    Best choice

    Depends on the underlying mean function μ(x)

    The smoother the mean function (small curvature), the more benefit from averaging


  • Animated Example

    True model is very simple: Y = X + ε, ε ~ N(0, 2²)

    A diagonal line

    Smoothing spline (ISL 7.5)

    Smooth curve that estimates Ave(Y|X)

    Controllable degree of smoothness

    JMP version is animated

    Which fits the observed data best?

    Which predicts new data best?
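    A small R sketch of the same idea with smooth.spline(); the true mean is the diagonal line:

        set.seed(1)
        x <- runif(100, 0, 10)
        y <- x + rnorm(100, sd = 2)

        plot(x, y)
        abline(0, 1, lty = 2)                              # the true mean function
        lines(smooth.spline(x, y, df = 2), col = "blue")   # very smooth
        lines(smooth.spline(x, y, df = 25), col = "red")   # flexible; chases the noise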

  • Example Suggests Approach

    Complex model has the best fit to the observed data: Line R² = 0.889, Smooth R² = 0.918, Smoother R² = 0.968

    Average squared errors from true mean

    Line predicts the held-back data best

  • Averaging in DM: KNN

    Ideal predictor, ideal data

    Best predictor is always the conditional mean μ = E(Y|X)

    Average cases with same explanatory characteristics

    Practical version

    Identify a subset of, say, d relevant characteristics

    Average k cases that are similar

    KNN = k-nearest neighbors

    ŷ = average of the k nearest values (closest on X)

    Need to pick k

    Small k: adapts to local behavior

    Large k: smoother with smaller variance (recall Var(ȳ) = σ²/n)

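    A bare-bones R sketch of KNN regression, not from the course materials: predict at x0 by averaging the k nearest cases (simulated data):

        knn_predict <- function(x0, x, y, k = 10) {
          nearest <- order(abs(x - x0))[1:k]   # indices of the k closest x values
          mean(y[nearest])
        }

        set.seed(1)
        x <- runif(200)
        y <- sin(2 * pi * x) + rnorm(200, sd = 0.25)

        grid <- seq(0, 1, length.out = 100)
        fit  <- sapply(grid, knn_predict, x = x, y = y, k = 10)
        plot(x, y)
        lines(grid, fit, col = "red")   # small k tracks wiggles; large k smooths them away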

  • KNN Classifier

    The majority of your nearest k neighbors determines your estimated group

    Figure 2.14, p. 40 of ISL

    k=3

    Note the importance of scales
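    A short R sketch with class::knn (the class package ships with R), using the iris data as an illustration; note the scaling step:

        library(class)
        set.seed(1)
        idx   <- sample(nrow(iris), 100)   # split into training and test cases
        train <- scale(iris[idx, 1:4])     # scales matter for distance
        test  <- scale(iris[-idx, 1:4],
                       center = attr(train, "scaled:center"),
                       scale  = attr(train, "scaled:scale"))

        pred <- knn(train, test, cl = iris$Species[idx], k = 3)
        mean(pred == iris$Species[-idx])   # out-of-sample accuracy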

  • Bias or Variance?

    Few neighbors (small k): flexible boundary, low bias, high variance

    Many neighbors (large k): smooth boundary, high bias, low variance

    Figure 2.16, p. 41 of ISL

    dashed uses k = 10

    n = 200 cases, 100 of each color

  • Best Choice?

    Simulated data, so we know the right answer

    Figure 2.17, p. 42 of ISL (fewer neighbors → more flexible)

    See example in lab

    How far do you need to look to find 10 neighbors?

  • Connections

    Data mining tools resemble KNN and smoothing

    Adjustable fit, one or more tuning parameters

    Flexible, adaptive computational algorithm

    Consequence

    Easy to fit random variation

    Artificially good fit that predicts poorly

    Traditional statistical summaries inappropriate

    Emergent solution

    Use reserved, held-back data to judge model

    Essence of cross-validation


  • Missing Data

  • Missing Data

    Always present

    In a medical example, 170 out of 1,200 cases are complete

    Often informative

    In the bankruptcy model, half of the predictors indicate the presence of missing data

    Is data ever missing at random?

    Handle as part of the modeling process?

    Offer a simple patch that requires few assumptions

    Main idea

    Done as a data preparation step

    Add indicator column for missing values

    Fill the missing value

    JMP Pro does this in some platforms

  • Handle Missing by Adding Vars

    Add another variable

    Add indicator column for missing values

    Fill the missing values with the average of those seen

    Simple approach, fewer assumptions

    Expands the domain of the feature search

    Allows missing cases to behave differently

    Conservative evaluation of variable

    Part of the modeling process

    Distinguish missing subsets only if predictive

    Missing in a categorical variable: not a problem

    Missing values define another category

    Only for Xs, never Y

  • Example

    Data frame with missing values

    Filled in data with added indicator columns

    missing_data.R
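    A sketch in R of the fill-and-flag step, in the spirit of the missing_data.R script (assumed; not reproduced from it):

        # Add a 0/1 indicator for missingness, then fill the hole with the observed mean
        fill_and_flag <- function(x) {
          miss <- is.na(x)
          x[miss] <- mean(x, na.rm = TRUE)
          data.frame(filled = x, missing = as.integer(miss))
        }

        x <- c(1.2, NA, 3.5, NA, 2.8)
        fill_and_flag(x)   # two columns: filled values and a missing indicator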

  • Example of Procedure

    Simple regression, missing at random

    Conservative: unbiased estimate, inflated SE

    n = 100, β0 = 0, β1 = 3

    30% missing at random, β1 = 3

    [Scatterplots: complete data, data with missing cases removed, and filled-in data]

    Complete: b0 est −0.25 (SE 1.0), b1 est 3.05 (SE 0.17)

    Filled in: b0 est −1.5 (SE 1.4), b1 est 3.01 (SE 0.27)

  • Example of Procedure

    Simple regression, not missing at random

    Conservative: unbiased estimate, inflated SE

    n = 100, β0 = 0, β1 = 3

    30% missing follow a steeper line

    Filled in: b0 est −0.02 (SE 2.6), b1 est 2.82 (SE 0.44)

    Requires robust variance estimate

    [Scatterplots: complete data and filled-in data, with the missing cases following a steeper line]

  • Imperial Example

    Variable 331 is interesting, but has missing values

    Look at only those who default

    Missing for 11% of these

    Regression results

    Note results are conservative (smaller t)

    Missingness is informative

    Retain all cases for subsequent analysis


    variables are anonymous

  • Calibration

    A model should be right on average.