
  • Data Mining

    Robert A Stine

    stine @ wharton.upenn.edu

    Dept of Statistics, Wharton School

    University of Pennsylvania

    St Gallen

    June 2016

  • Introduction

    Syllabus, materials, topics

  • Welcome and Introductions

    Background

    Undergraduate: political science and math. Interest in voting behavior and election polls

    PhD in Statistics from Princeton: bootstrap resampling in time series analysis

    Interest in data mining: research in variable selection for time series and large models with 1000s of predictors

    Recent and on-going projects

    Streaming variable selection

    Spatial and sparse time series

    Models for text


  • What is data mining?

    An insult when I learned statistics!

    Make up results to suit your theory

    Discovering predictive patterns in data

    Data: typically large, loosely structured data sets

    Seldom much better than hand-tuned regression

    DM often faster and provides a useful diagnostic

    Science

    Testable claims

    Tasks

    Prediction and classification

    Estimation and interpretation


  • Caution!

    We find patterns in surprising places with fascinating consequences: a $28,000 grilled cheese!

  • Data Mining in Social Science

    Poorly suited to social science?

    Empiricism run wild, lack of theory or hypotheses

    Post hoc inference

    Response

    Leverage technology

    Tukey: the cost of theory remains expensive compared to computing

    Honest

    A better match to what most do in common practice

    Diagnostic

    Have I missed something?

    Deep connections

    Multidimensional scaling, likelihood, modern regression


  • Course Objectives

    For you to leave confident that you can

    Recognize when data mining can help

    Apply new techniques to your own data

    Explain the intuition behind methods

    Expand your knowledge of statistical methods

    Big picture

    Wide variety of tools for building models

    Save time/energy for finding data, asking questions

    More likely to happen if you ask questions here!

    So don't hesitate...


  • Textbooks

    Machine learning

    Modern term for statistical techniques designed for large amounts of complex data

    Originated in Computer Science

    An Introduction to Statistical Learning (ISL 2013) James, Witten, Hastie, Tibshirani

    Examples in R for each chapter

    Basis for much of the afternoon sessions

    Elements of Statistical Learning (ESL 2009) Hastie, Tibshirani, and Friedman

    More details, additional topics not covered in ISL


  • Plan for Lectures

    Loosely follow the James text

    Models and averaging (Ch 1-4): regression, bootstrap, classification

    Feature selection (Ch 4-6): cross-validation, shrinkage, lasso

    Nonlinearity (Ch 7, HO, ESL Ch 11): GAM, neural networks, boosting

    Tree-based methods (Ch 8): CART, random forests, bagging

    Practical modeling (Ch 9-10, readings): case studies, text, unsupervised features

    Morning: lecture

    Afternoon: R and JMP

  • Requirements

    Taking the course for a grade?

    Doing the exercises is useful follow-up even if not

    Daily exercises

    James textbook

    Start on during the afternoon lab sessions

    Several for each class

    Submit package within 2 weeks

    Analysis in R and/or JMP

    Explanation as called for in exercise

    Appropriate use of graphics


  • Additional References and Software

  • Research Papers

    Fitting regressions, big models, inference

    Chatfield (1995), JRSS: Model Uncertainty, Data Mining and Statistical Inference

    Sala-I-Martin (1997), AER: I Just Ran Two Million Regressions

    Hand et al (2000), Statistical Science: Data Mining for Fun and Profit

    Foster and Stine (2004), JASA: Variable Selection in Data Mining

    Ward, Greenhill, and Bakke (2010), J of Peace Research: The Perils of Policy by p-Value: Predicting Civil Conflicts

    Stine (1989), SMR: An Introduction to Bootstrap Methods


    https://dl.dropboxusercontent.com/u/5425615/stgallenDM/sgdm_papers.tgz

  • Research Papers

    Neural networks

    Zeng (1999), SMR: Prediction and Classification with Neural Networks

    Beck, King, and Zeng (2000), APSR: Improving Quantitative Studies of International Conflict

    DeMarchi, Gelpi, and Grynaviski (2004), APSR: Untangling Neural Nets

    Trees

    Berk (2006), SMR: An Introduction to Ensemble Methods for Data Analysis

    Weerts and Ronca (2010), Education Economics: Using Classification Trees to Predict Alumni Giving

    Kastellec (2010), J of Empirical Legal Studies: Statistical Analysis of Judicial Decisions with Trees


  • Research Papers

    Text analysis

    Hopkins and King (2007): Extracting Social Science Meaning from Text

    Hopkins and King (2010), AJPS: A Method of Automated Nonparametric Content Analysis for Social Science

    Monroe, Colaresi, and Quinn (2008), Pol Analysis: Fightin' Words: Identifying Political Conflict

    Turney and Pantel (2010): From Frequency to Meaning: Vector Space Models of Semantics

    Grimmer and Stewart (2013), ArXiv: Text as Data: Promise and Pitfalls


  • Software

    Considerations

    Interface

    Type commands at prompt (R) vs point-and-click (JMP)

    Data input and output

    Ease of getting your data in and out

    Extensibility

    Ability to add your own customization

    Graphics

    Presentation (R) vs interactive (JMP)

    Scope

    Cutting (bleeding) edge (R) vs ease of use (JMP)

    Cost

    Free (R) vs commercial (JMP from SAS, free trial)

    Support, user community

    Particularly within your own company or collaborators

    Class

    Blend of R (text) and JMP (interactive)

    Alternatives from IBM, Stata...


    You don't mine data by hand!

  • Books

    Principles of Data Mining (2001) Hand, Mannila, and Smyth

    Personal favorite, but dated

    More like a textbook with blend of theory and practice, but limited examples.

    Large number of background citations

    Data Mining (3rd Ed, 2011) Witten, Frank, and Hall

    Emphasis on trees, association rules and domain knowledge

    Terse but comprehensive coverage, with terminology

    Bound to the WEKA software, small examples

    Data Mining Techniques (3rd Ed, 2011) Linoff and Berry

    Business decision making emphasis (100+ pages intro)

    Wide scope, with emphasis on business communication, visualization

    "Data mining cannot be bought, it must be mastered." Discusses privacy concerns

    Data Mining with R: Case Studies (2011) Torgo

    Introduction to R with five case studies

    Smallish data: Algae blooms has 7 cases to predict.

    Others cover stock trading scheme, fraud detection. Special coverage of microarrays

    Includes SVM, MARS, along with basics


  • More Books

    Practical Applications of Data Mining (2012) Suh

    Eclectic topics, with emphasis on association rules, fuzzy sets, and neural networks. Less statistics.

    Covers data retrieval, SQL, database management issues

    Numerous detailed algorithm examples for small, in-text data tables

    Data Mining (2011, 2nd Edition) Kantardzic

    Broad coverage, from business aspects to ensemble learning to text mining

    Usefully annotated bibliography, encyclopedic extent.

    Precious little in the way of real examples.

    Background and specific methods

    Modern Multivariate Statistical Techniques (2008). Izenman

    Principal Component Analysis (2002). Jolliffe

    Independent Component Analysis (2004). Stone

    Neural Networks for Pattern Recognition (1996). Bishop

    Classification and Regression Trees (1984). Breiman, Friedman, Stone, Olshen


  • Data

    What would be the ideal data for answering my questions?

  • Data Collection Issues

    Collecting, organizing, cleaning

    90% of work on any real task

    Weak aspect of statistics education, training

    Audit trail, value in scripting, reproducible

    Common problems

    Data are not random samples

    Secondary source

    Data gathered for other task (eg, accounting), repurposed for modeling

    Merging data from multiple sources

    Inconsistent labels, conventions

    Changes over time in coverage, definitions

    Mismatching cases from different sources

    Missing data

    Accuracy

    Data entry errors

    Mislabeled variable

    Categories as numbers


  • Data Table

    Traditional rectangular layout

    Rows are n cases

    Columns are variables

    Getting taller and wider these days

    Often more columns than rows (Arcene)

    When tall, much taller! (Income)

    Most software uses tabular convention

    Some customize, separating data into multiple files

    Longitudinal data

    Transactional data: customer history, patient records

    Dependence exaggerates amount of data

    [Schematic: rows case1, case2, case3, ..., casen by columns var1, var2, ..., vark]

  • Tall Data Table

    Classical big data, becoming less common...

    Income data for targeted marketing from UCI

    Goal: Identify high-income households based on location and descriptive characteristics

    32,500 rows x 15 columns

    US Census data on households

    Banking data from UCI

    Goal: Identify customers who will respond to solicitation for a term deposit program

    45,000 rows x 18 columns

    Portuguese bank study

    Sparse response of 10% (fraud is much less)

    UCI = Univ of California, Irvine ML repository, http://archive.ics.uci.edu

  • Square Data Table

    Election survey

    Expanding domain of questions balanced against the cost of more cases

    American National Election Study (ANES) 2008

    Goal: Explain why voters pick the candidate that they choose

    2,000 registered voters (rows), with about the same number of features (columns)

    All sorts of variables

    Age, party affiliation, open-ended responses

    Complications

    Missing data

    Interviewer effects

    Sampling weights, ...


  • Wide Data Table

    Automation

    Automated data collection allows much more extensive measurement

    Text data is very wide

    Arcene example from UCI

    200 cases, positive/negative cancer

    10,000 features: mass spectrometer measurements

    Challenge: Separate normal cells from cancerous cells (prostate, ovarian)

    Complications galore

    Easy to fit a model that can separate the observed cases perfectly because n < p

    Hard to test out-of-sample because few cases


    p = #possible explanatory variables

    Data: arcene

  • Some of Both

    Financial data from Imperial

    Defaults of customers

    Sample of training data 25,000 cases, 770 features

    Sparse: only 3,244 non-zero losses

    Anonymous

    Don't know the definitions of the features

    Don't know which are categorical!

    Huge amount of data

    Training data alone is 488 MB

    105,471 cases

    Takes a while just to read data


  • Is big data really big?

    Sample from a homogeneous population?

    Populations change over time

    Household energy survey in US

    Longitudinal data

    Repeated measures

    Instrumentation allows data on a finer scale

    Components replace averages, revealing noise

    Time series

    Financial and economic data

    The data from Imperial are really a time series, but that's hidden

    Assumption of stationarity

    Lurking issue of dependence


  • Once Data Are Identified...

    Deal with the Vs of big data

    Volume, velocity, variety, validity, ...

    Data preparation often requires programming

    Building variables from accounting, billing data

    Example: the structure of a mortgage loan recorded in the data

    Build new columns based on presence of certain strings in this column

    Scripting tools

    Perl, python, ruby, R, unix utilities (Makefiles)

    Find a tool that you like to use and master it


  • Data from ISL

    R: Use the str command to see the structure of a data set, as sketched below
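    A minimal sketch, assuming the ISLR companion package for the James et al. text is installed:

        # install.packages("ISLR")   # companion package for the ISL text
        library(ISLR)

        str(Auto)       # type and first few values of every column
        dim(Auto)       # number of rows and columns
        summary(Auto)   # numeric summaries and factor counts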

  • Regression Models

    Building block of all approaches

  • Preliminary Analysis

  • Getting Started

    Plots remain important, even with millions

    Experience in modeling bankruptcies

    Outliers in text analysis

    Sampling

    Don't have to plot every observation

    We know a lot about sampling

    Objective

    Trade-off: Pay now or pay later

    Outliers, leverage points

    Transformations

    Combinations


  • Univariate Plots

    Plot linking and brushing

    Interactive tools add value to univariate plots

    Things to look for

    Scales, particularly when unfamiliar with data

    Age is not the age of residents; it's the proportion of homes built before 1940

    Skewness, multimodal: square peg, round hole

    Some methods automate conversion to symmetry

    Outliers

    Outliers matter, even in large data tables.

    Scope

    At least inspect variables of primary interest!

    Data: boston_from_R

    JMP: Analyze > Distribution

  • Scatterplot Matrix

    A visual correlation matrix

    Mix data types

    Numerical/Categorical

    Visual tables

    Embellishments

    Coloring subset

    Subsampling

    Interactive

    Brushing

    Plot linking

    Weakness of R

    JMP: Graph > Scatterplot Matrix. See Figure 3.6 of ISL and the sketch below.
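    A small R sketch using the Boston housing data (assuming the MASS package, which ships with R):

        library(MASS)   # provides the Boston housing data used in ISL
        pairs(Boston[, c("medv", "lstat", "rm", "age")],
              pch = 16,
              col = adjustcolor("steelblue", alpha.f = 0.4))  # translucent points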

  • Rounded Data

    Data have often been rounded

    Are categories ratio-level data?

    Plots can hide a great deal

    Dithering, adding a little random noise, helps (see the sketch below)

    Data: caravans

    JMP graphs categorical data in a grid, with dithering
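    A minimal R sketch of dithering with jitter(); the data here are simulated purely for illustration:

        set.seed(1)
        x <- round(rnorm(500))            # heavily rounded predictor
        y <- x + rnorm(500, sd = 0.5)

        plot(x, y)                        # rounding stacks the points in columns
        plot(jitter(x, amount = 0.2), y)  # a little noise reveals the density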

  • Summary of Exploratory Analysis

    Data are key to any modeling

    Take time to understand nuances of your data

    Simple, interpretable plots

    Substantive insight + exploration = new features

    Transformations

    Substantive combinations (eg, ratios)

    Clustering

    Plots remain important, even if many variables

    Use samples to speed viewing (though may miss outliers)


    Would be slow to plot 25,000 points at zero for Imperial data

  • Regression Methods

    Quick review, emphasizing link to DM

    Regression provides an opportunity to study issues from DM in a familiar context

  • Why Emphasize Regression?

    Claim

    Regression can match the predictive performance of black-box models...

    Just need the right explanatory variables!

    Opportunity for substantive insights

    Regression is familiar

    Recognize then fix problems

    Shares problems with black-boxes

    Foundation for understanding complex, opaque models

    Familiarity allows improvements

    Several given in Foster and Stine (2004)


  • Classical Regression

    Two-part model

    Mean of the response is linear in the explanatory variables: E(Y|X) = μy|x = β0 + β1X1 + … + βpXp

    Unexplained, idealized iid random variation: Var(Y − μy|x) = σ²

    Discussion

    No limits on the Xs: powers, logs, products

    Model assumes you know which variables to use; just a question of estimating the βs

    Error assumptions are reasonable if the model is correctly specified. CLT: a sum of small, omitted contributions is normal

    Interpretation becomes difficult as the model grows

  • Least Squares

    Assume we know that yi = β0 + β1xi1 + … + βpxip + εi, with the εi independent and of equal variance

    Data: (n × 1) response vector Y, (n × p) matrix of explanatory variables X

    Fitted values: ŷ = b0 + b1X1 + … + bpXp

    OLS criterion: minb Σ(yi − ŷi)², solved by b = (X′X)⁻¹X′Y

    Normal equations: Σ(yi − ŷi)xi = 0

    Goodness of fit: R² = SS(ŷ)/SS(y)

    Standard errors: Var(b) = σ²(X′X)⁻¹, estimated by var(b) = s²(X′X)⁻¹

    var(b1) ≈ s²/(n var(X))

    Equation is KNOWN
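    A short R sketch computing the OLS solution by hand and checking it against lm(); the data are simulated for illustration:

        set.seed(1)
        n <- 100
        x <- rnorm(n)
        y <- 2 + 3 * x + rnorm(n)

        X <- cbind(1, x)                    # design matrix with intercept
        b <- solve(t(X) %*% X, t(X) %*% y)  # b = (X'X)^{-1} X'Y
        b
        coef(lm(y ~ x))                     # matches the matrix solution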

  • Example: Depreciation

    How should BMW set the lease price for cars in its 3-series sedans?

    Depends on residual value at end of lease

    Data: BMW-prices

  • Example: Depreciation

    How should BMW set the lease price for cars in its 3-series sedans?

    Sample of 218 used BMW 3-series cars

    Model includes mileage, age, and type of car as well as an interaction of age and type.

    Effect-coded estimates: R² = 71%, s = $2,469

  • Careful Interpretation

    What do these estimates tell us?

    Regression concerns the comparison of means under different conditions

    A highly beefed-up two-sample testing procedure!

    Association ≠ causation

    Carefully read the ISL text discussion of the advertising example.


  • What's an interaction?

    What are these terms in the model?

    Dummy variables

    Represent all but one category

    Intercept represents the reference category, the left-out group

    Effect coding

    Intercept estimates an overall average


  • View in 3-D

    Geometry

    Two numerical variables define a plane in 3D


  • Further Visualization

    Interactive tools: profiling

    Model profiling: viewing properties of the fitted least squares equation

    Interactive view of the equation

  • Linear Regression ≠ Lines

    What does the word "linear" mean? That the coefficients can be estimated as weighted averages of the response, not that the model fits lines.

    Curvature modeled by

    Powers (such as squares of variables)

    Interactions (products of variables)

    Transformations (most often logs)

    Examples

    Interactions: Boston housing data

    Transformations: diamond prices


  • Interactions

    Boston housing data: a classic economic analysis of the impact of pollution levels on housing values

    Impact of social class and home sizes

    Repeated from previous scatterplot matrix example

    Note censoring of values, outliers, curvature

    Marginal association consistent with partial slopes

    Data: boston_from_R

  • Interactions

    Initial linear fit ignores the interaction between percentage Lower Status and Rooms.

    Adding the interaction term to the model reveals a strong effect, as sketched below

    Interpretation from coefficients?

    From interactive profile view?

    How'd you know to do that?
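    A brief R sketch of this fit on the Boston data (assuming the MASS package); lstat is the percentage lower status, rm the average number of rooms:

        library(MASS)
        fit_main <- lm(medv ~ lstat + rm, data = Boston)
        fit_int  <- lm(medv ~ lstat * rm, data = Boston)  # adds the lstat:rm product
        anova(fit_main, fit_int)   # does the interaction improve the fit?
        coef(summary(fit_int))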

  • Response Surface

    Add squares of the features → a response surface

    Response surface

    Common in industrial design to locate an optimal mixture.

    Surface plot shows the model fit is not linear in the initial features

    Caution: Illustration only!

    Model has several highly leveraged outliers and omits relevant Xs


  • Transformations

    Regression relies on appropriate scaling

    Association may be highly nonlinear

    Logs (ie, percentage change) often useful

    Most stat tools won't know to do this!

    Prices of 7,568 diamonds: log price ≈ 8.6 + 2 log(carat)
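    A compact R sketch of the same kind of log-log fit, using the diamonds data from ggplot2 as a stand-in for the 7,568-diamond sample on the slide:

        library(ggplot2)   # supplies a diamonds data set
        fit <- lm(log(price) ~ log(carat), data = diamonds)
        coef(fit)          # slope is an elasticity: % change in price per 1% change in carat

        plot(log(price) ~ log(carat), data = diamonds, pch = ".")
        abline(fit, col = "red")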

  • Inference

    Classical inference

    Relies on validity of the assumed model

    Systematic approach

    Test the overall fit of the model first

    If we reject H0, then proceed to the individual effects.

    Why do this preliminary test?

    Data: BMW-prices

  • Collinearity and F

    Highly collinear features

    Significant separately, but not individually when together.

    Data: BMW-prices

  • Inference

    Testing components of the fitted model

    Standard error: variation of the estimate from sample to sample

    t-ratio: count of standard errors separating the estimate from 0

    p-value: smallest level at which one can reject the null hypothesis. Assumes a normal sampling distribution for the slopes.

    Statistical significance ≠ substantive importance


  • Bootstrapping

    Standard error is key to inference

    What are standard errors?

    BS is alternative method for obtaining standard errors and confidence intervals

    Estimates standard error by simulating sampling procedure

    Simulates by sampling with replacement from observed distribution of data

    Implementation

    R: the bootstrap package; also easy to do yourself

    JMP: basic tools throughout JMP Pro


  • Bootstrap Sampling

    Standard error

    Standard deviation of statistic

    Repeated independent samples from the population

    Bootstrap standard errors

    Simulate standard error

    Draw B samples from the observed sample itself.

    Sampling with replacement

    Collection of repeated statistics estimates sampling distribution

    [Diagram: population → sampling distribution → SE; the bootstrap draws B samples with replacement from the observed sample to produce replicate estimates θ1, θ2, …, θB]

    R script. See Figure 5.11, p. 190 of ISL, and the sketch below.
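    A do-it-yourself R sketch of a bootstrap standard error for a regression slope; the data are simulated for illustration:

        set.seed(1)
        n <- 100
        dat <- data.frame(x = rnorm(n))
        dat$y <- 2 + 3 * dat$x + rnorm(n)

        B <- 500
        slopes <- replicate(B, {
          idx <- sample(n, replace = TRUE)   # resample rows with replacement
          coef(lm(y ~ x, data = dat[idx, ]))["x"]
        })
        sd(slopes)                           # bootstrap standard error
        quantile(slopes, c(0.025, 0.975))    # percentile confidence interval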

  • Bootstrap Example

    Bootstrap regression in JMP

    Illustrative model with Age and Mileage

    Various conventions for representing categorical variables

    Compare results of least squares to those found by resampling

    BS finds: modest collinearity, a normal sampling distribution, and higher standard errors

    B = 500

    Least squares

  • Outliers

    Relevant even when modeling millions

    CLT not a guarantee

    Most of the money often remains in the outliers. Hiding outliers conceals risks, variation, and profits.

    Complex models often result in sparse data

    Sparse = most values are 0

    e.g., interaction between two rare properties

    What's the p-value for the regression in this figure?

    n = 10,000

    [Figure: at x = 0, ten 1s; at x = 1, one 1]

    What would bootstrapping do?

  • Sandwich Estimator

    Conventional standard errors are not robust

    Assume the model is true; test adding Xp+1 to the model

    Math produces expression for variance

    Sandwich estimators

    Robustness of validity, even for dependence

    Seldom implemented in software, particularly stepwise

    If the model is true: Var(b) = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = σ²(X′X)⁻¹(X′X)(X′X)⁻¹ = σ²(X′X)⁻¹, estimated by s²p+1(X′X)⁻¹

    Heteroscedasticity: var(b) = (X′X)⁻¹X′D²X(X′X)⁻¹, with D² diagonal (aka the White estimator)

    Dependence: var(b) = (X′X)⁻¹X′BX(X′X)⁻¹, with B block diagonal
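    A brief R sketch using the sandwich and lmtest packages (both on CRAN) to get heteroscedasticity-consistent standard errors:

        library(sandwich)
        library(lmtest)

        set.seed(1)
        x <- rnorm(200)
        y <- 1 + 2 * x + rnorm(200, sd = abs(x))   # error variance grows with |x|
        fit <- lm(y ~ x)

        coeftest(fit)                                    # conventional standard errors
        coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # White/sandwich standard errors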

  • Better Standard Error

    Heteroscedastic error

    Estimate standard error with outlier

    Sandwich estimator allowing heteroscedastic error variances gives a t-stat ≈ 2, not 10.

    Dependent error

    An even greater need for accurate SEs

    Netflix contest: Bonferroni (or hard thresholding) overfits due to dependence in the responses.

    Credit modeling with 100,000s of cases: everything is significant unless you incorporate longitudinal dependence


    R Script

  • Bias-Variance Tradeoff

    The more we allow a model to adapt to data, the more easily it adapts to random noise.

  • Statistics = Averaging

    Averages estimate parameters and predict; it is just a question of which data to average.

    Underlying rationale

    Find μ to minimize E(Y − μ)² given known features X

    Best predictor is always the conditional mean μ = E(Y|X)

    Ideal data

    Replications: a large count of cases that share each relevant set of explanatory characteristics

    Unlikely to find so many so similar


  • Example

    Problem: estimate an unknown function

    Y = μ(x) + error

    Data spread over x-axis from 0 to 1

    Estimator

    Average values of response for similar values of x

    Divide the x-axis from 0 to 1 into d bins: μ̂j = ave(Y | (j−1)/d ≤ x ≤ j/d), j = 1, 2, …, d

    bias_variance.R

    μ(x) = sin(2πx)

    [Three panels: Response vs Explanatory Variable, n = 100, with d = 5, 10, and 20 bins]

    What's the best choice for d, the number of bins? A sketch of the estimator follows.
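    A reconstruction in R of the binning estimator, in the spirit of the bias_variance.R script (assumed; not reproduced from it):

        set.seed(1)
        n <- 100; d <- 10
        x <- runif(n)
        y <- sin(2 * pi * x) + rnorm(n, sd = 0.25)

        bins <- cut(x, breaks = seq(0, 1, length.out = d + 1), include.lowest = TRUE)
        mu_j <- tapply(y, bins, mean)                # average the responses within each bin

        plot(x, y)
        curve(sin(2 * pi * x), add = TRUE, lty = 2)  # the true mean function
        points((seq_len(d) - 0.5) / d, mu_j, col = "red", pch = 16)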

  • Trade-Offs

    Bias versus variance

    A large number of bins is able to track the changing values of μ(x): lower bias

    A large number of bins implies few data values within a bin: higher variance

    [Plot: Bias², Variance, and Total squared error vs Number of Bins; sigma = 0.25, n = 100. Larger d gives smaller bias but larger variance.]


  • Looks Easy?

    Problem

    You don't get to see that plot!

    Residual SS

    Residual variation gets smaller and smaller as the model complexity grows (ie, as bins are added)

    [Plot: Sqrt Average Squared Residual vs Number of Bins; sigma = 0.25, n = 100]

  • Harder Still

    Optimal number of bins (model complexity)

    Depends on the amount of random noise relative to the curvature of the underlying, unknown μ(x)

    Implication

    With less noise, need less averaging

    But we dont know how much noise there is!

    Look back at the prior slide to see that we don't know the level of noise

    [Plots: Bias², Variance, and Total squared error vs Number of Bins for sigma = 0.15, 0.25, and 0.35, n = 100; the best d changes with the noise level]


  • Bias-Variance Review

    Model

    Data are independent observations of Y = μ(x) + ε, ε ~ N(0, σ²)

    Choice

    Rougher: average nearby cases, where |xi − xj| is small → small bias, large variance

    Smoother: average cases where |xi − xj| is large → large bias, small variance

    Best choice

    Depends on the underlying mean function μ(x)

    The smoother the mean function (small curvature), the more benefit from averaging


  • Animated Example

    True model is very simple: Y = X + ε, ε ~ N(0, 2²)

    A diagonal line

    Smoothing spline (ISL 7.5)

    Smooth curve that estimates Ave(Y|X)

    Controllable degree of smoothness

    JMP version is animated

    Which fits the observed data best?

    Which predicts new data best?
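    A small R sketch of the same idea with smooth.spline(); the true mean is the diagonal line:

        set.seed(1)
        x <- runif(100, 0, 10)
        y <- x + rnorm(100, sd = 2)

        plot(x, y)
        abline(0, 1, lty = 2)                              # the true mean function
        lines(smooth.spline(x, y, df = 2), col = "blue")   # very smooth
        lines(smooth.spline(x, y, df = 25), col = "red")   # flexible; chases the noise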

  • Example Suggests Approach

    Complex model has the best fit to the observed data: Line R² = 0.889, Smooth R² = 0.918, Smoother R² = 0.968

    Average squared errors from true mean

    Line predicts the held-back data best

  • Averaging in DM: KNN

    Ideal predictor, ideal data

    Best predictor is always the conditional mean μ = E(Y|X)

    Average cases with same explanatory characteristics

    Practical version

    Identify a subset of, say, d relevant characteristics

    Average k cases that are similar

    KNN = k-nearest neighbors

    ŷ = average of the k nearest values (closest on X)

    Need to pick k

    Small k: adapts to local behavior

    Large k: smoother with smaller variance (recall Var(ȳ) = σ²/n)

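    A bare-bones R sketch of KNN regression, not from the course materials: predict at x0 by averaging the k nearest cases (simulated data):

        knn_predict <- function(x0, x, y, k = 10) {
          nearest <- order(abs(x - x0))[1:k]   # indices of the k closest x values
          mean(y[nearest])
        }

        set.seed(1)
        x <- runif(200)
        y <- sin(2 * pi * x) + rnorm(200, sd = 0.25)

        grid <- seq(0, 1, length.out = 100)
        fit  <- sapply(grid, knn_predict, x = x, y = y, k = 10)
        plot(x, y)
        lines(grid, fit, col = "red")   # small k tracks wiggles; large k smooths them away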

  • KNN Classifier

    The majority of your nearest k neighbors determines your estimated group

    Figure 2.14, p. 40 of ISL

    k=3

    Note the importance of scales
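    A short R sketch with class::knn (the class package ships with R), using the iris data as an illustration; note the scaling step:

        library(class)
        set.seed(1)
        idx   <- sample(nrow(iris), 100)   # split into training and test cases
        train <- scale(iris[idx, 1:4])     # scales matter for distance
        test  <- scale(iris[-idx, 1:4],
                       center = attr(train, "scaled:center"),
                       scale  = attr(train, "scaled:scale"))

        pred <- knn(train, test, cl = iris$Species[idx], k = 3)
        mean(pred == iris$Species[-idx])   # out-of-sample accuracy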

  • Bias or Variance?

    Few neighbors (small k): flexible boundary, low bias, high variance

    Many neighbors (large k): smooth boundary, high bias, low variance

    Figure 2.16, p. 41 of ISL

    dashed uses k = 10

    n = 200 cases, 100 of each color

  • Best Choice?

    Simulated data, so we know the right answer

    Figure 2.17, p. 42 of ISL (fewer neighbors → more flexible)

    See example in lab

    How far do you need to look to find 10 neighbors?

  • Connections

    Data mining tools resemble KNN and smoothing

    Adjustable fit, one or more tuning parameters

    Flexible, adaptive computational algorithm

    Consequence

    Easy to fit random variation

    Artificially good fit that predicts poorly

    Traditional statistical summaries inappropriate

    Emergent solution

    Use reserved, held-back data to judge model

    Essence of cross-validation


  • Missing Data

  • Missing Data

    Always present

    In a medical example, 170 out of 1,200 cases are complete

    Often informative

    In the bankruptcy model, half of the predictors indicate the presence of missing data

    Is data ever missing at random?

    Handle as part of the modeling process?

    Offer a simple patch that requires few assumptions

    Main idea

    Done as a data preparation step

    Add indicator column for missing values

    Fill the missing value

    JMP Pro does this in some platforms

  • Handle Missing by Adding Vars

    Add another variable

    Add indicator column for missing values

    Fill the missing values with the average of those seen

    Simple approach, fewer assumptions

    Expands the domain of the feature search

    Allows missing cases to behave differently

    Conservative evaluation of variable

    Part of the modeling process

    Distinguish missing subsets only if predictive

    Missing in a categorical variable: not a problem

    Missing values define another category

    Only for Xs, never Y

  • Example

    Data frame with missing values

    Filled in data with added indicator columns

    missing_data.R
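    A sketch in R of the fill-and-flag step, in the spirit of the missing_data.R script (assumed; not reproduced from it):

        # Add a 0/1 indicator for missingness, then fill the hole with the observed mean
        fill_and_flag <- function(x) {
          miss <- is.na(x)
          x[miss] <- mean(x, na.rm = TRUE)
          data.frame(filled = x, missing = as.integer(miss))
        }

        x <- c(1.2, NA, 3.5, NA, 2.8)
        fill_and_flag(x)   # two columns: filled values and a missing indicator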

  • Example of Procedure

    Simple regression, missing at random

    Conservative: unbiased estimate, inflated SE

    n = 100, β0 = 0, β1 = 3

    30% missing at random, β1 = 3

    [Scatterplots: complete data, data with missing cases removed, and filled-in data]

    Complete: b0 est −0.25 (SE 1.0), b1 est 3.05 (SE 0.17)

    Filled in: b0 est −1.5 (SE 1.4), b1 est 3.01 (SE 0.27)

  • Example of Procedure

    Simple regression, not missing at random

    Conservative: unbiased estimate, inflated SE

    n = 100, β0 = 0, β1 = 3

    30% missing follow a steeper line

    Filled in: b0 est −0.02 (SE 2.6), b1 est 2.82 (SE 0.44)

    Requires robust variance estimate

    [Scatterplots: complete data and filled-in data, with the missing cases following a steeper line]

  • Imperial Example

    Variable 331 is interesting, but has missing values

    Look at only those who default

    Missing for 11% of these

    Regression results

    Note results are conservative (smaller t)

    Missingness is informative

    Retain all cases for subsequent analysis


    variables are anonymous

  • Calibration

    A model should be right on average.