Uploaded by christina-stevens, 02-Jan-2016

© Andrew Ho, Harvard Graduate School of Education

Unit 1c: Detecting Influential Data Points and Assessing Their Impact

Unit 1c – Slide 1
http://xkcd.com/539/

• Three measures of atypicality
• Leverage × Discrepancy = Influence

• Sensitivity Analyses

Unit 1c – Slide 2

Course Roadmap: Unit 1c

Multiple Regression Analysis (MRA): Y_i = β_0 + β_1·X_1i + β_2·X_2i + ε_i

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical data points. (← Today's topic area)
• If your residuals are not independent, replace OLS by GLS regression analysis: use individual growth modeling, or specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use… binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Factor Analysis (EFA or CFA?); use Cluster Analysis.
• If your outcome vs. predictor relationship is non-linear: transform the outcome or predictor, or use non-linear regression analysis.

Unit 1c – Slide 3

Anscombe's Quartet: Four datasets with identical summary statistics. Same means, standard deviations, correlations, regression lines, t statistics, F statistics, and R² values.

A powerful argument for Exploratory Data Analysis.

(The model: Unit 1d, next class.)

How might we detect and describe these "atypical" observations?

Unit 1c – Slide 4

Predicting children's understanding of causes of illness (n=194)

              Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
D            -0.837***  -0.968***   3.591      3.478     -4.444
A            -0.936***  -0.898***   2.728*     2.969*    -1.457
LAGE                     2.090***   2.494***   2.511***   3.010***   2.511***
DxLAGE                             -0.939*    -0.899*     0.684
AxLAGE                             -0.754**   -0.788**    0.134
SES                                           -0.0857     1.268     -0.0875
DxSES                                                     2.549
AxSES                                                     0.999
LAGExSES                                                 -0.277
DxLAGExSES                                               -0.505
AxLAGExSES                                               -0.211
ILL                                                                  3.130**
ILLxLAGE                                                            -0.823***
_cons         4.604*** -5.464***  -7.411***  -7.338***  -9.773***  -7.337***
R-sq          0.197      0.671      0.689      0.694      0.708      0.693
mss           39.74      135.2      138.8      139.8      142.8      139.7
rss           161.8      66.37      62.77      61.77      58.75      61.82
df_m          2          3          5          6          11         4
df_r          191        190        188        187        182        189
F             23.45      129.0      83.13      70.52      40.21      106.8

* p<0.05, ** p<0.01, *** p<0.001. D=Diabetic, A=Asthmatic, ILL=Diabetic or Asthmatic, LAGE=log-Age in months.

When do we check for atypical observations? At least first… and last…

• Atypical observations are often spotted in initial exploratory analyses.
• However, an atypical observation may not be revealed until the addition of a particular predictor. A "multivariate outlier" is a multidimensional generalization of this bivariate outlier: an outlier on neither X nor Y alone, but an outlier on both jointly.
• In practice, we conduct initial exploratory analyses, build our model, and then conduct additional exploratory analyses on our "final" model.

We should be nervous here that there is some atypical observation that has particular influence on all of these statistics. We check these assumptions now.

Unit 1c – Slide 5

Classifying Atypical Observations

Discrepancy: Extreme-on-Y. A large residual.
• Does not necessarily have influence on the estimated regression coefficients.
• However, a large residual inflates SS_Error:
  • Increases MSE and RMSE
  • Reduces R²
  • Increases standard errors
  • Decreases t statistics
  • Increases p-values
  • Makes it more difficult to reject H₀

Leverage: Extreme-on-X. Distant on one or more predictors.
• Does not necessarily have influence on the estimated regression coefficients.
• However, the influence is unpredictable and may be quite large:
  • Erratic impact on SSE, MSE, and RMSE
  • Erratic impact on standard errors, t statistics, and hypothesis testing

Influence: A high-discrepancy, high-leverage observation will have a strong influence on estimated regression coefficients and an impact on all model fit statistics and hypothesis testing.

Unit 1c – Slide 6

Discrepancy (1): The Raw Residual

*--------------------------------------------------------------------------------
* Refit final regression model, estimate & output selected influence statistics
*--------------------------------------------------------------------------------
* Refit Model 6, the final regression model:
eststo: regress ILLCAUSE ILL LAGE ILLxLAGE SES

* Output and summarize the predicted values ("capture" suppresses an error that
* would otherwise be generated if you ran this again, redefining the variable):
capture predict PREDICTED, xb
summarize PREDICTED

* Output and summarize the residuals:
* Raw residuals:
capture predict RESID, residuals
summarize RESID

Refit Model 6. Save predicted values, Ŷ_i, that is, the fitted values from the estimated regression equation. Save residuals, e_i = Y_i − Ŷ_i, the "miss."

A one-predictor illustration of residuals:
http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html

. summarize PREDICTED
    Variable       Obs        Mean    Std. Dev.        Min        Max
    PREDICTED      205    4.141373    .8490271    2.384122    5.86962

. summarize RESID
    Variable       Obs        Mean    Std. Dev.        Min        Max
    RESID          194    1.08e-09    .5659794   -1.649514   1.729051

The mean of residuals will always be 0. Why? The standard deviation will be close to the RMSE. Why?
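The Stata workflow above can be sketched in Python on simulated data (the course's ILLCAUSE dataset is not reproduced here; the dimensions n=194 and four predictors mirror Model 6, but the values are invented for illustration). With an intercept in the model, the residuals average exactly zero and their standard deviation sits just below the RMSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the ILLCAUSE data: an outcome regressed on an
# intercept plus four predictors, as in Model 6.
n, k = 194, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(scale=0.5, size=n)

# OLS fit, predicted values (Y-hat), and raw residuals (the "miss")
b, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ b
resid = y - predicted

# Residuals are orthogonal to the intercept column, so their mean is 0;
# their SD divides SSE by n-1 while RMSE divides by n-k-1, so SD < RMSE.
rmse = np.sqrt(resid @ resid / (n - k - 1))
print(round(resid.mean(), 10), round(resid.std(ddof=1), 4), round(rmse, 4))
```

This answers both "Why?" questions on the slide: the zero mean follows from the normal equations (the intercept column absorbs any constant shift), and the SD differs from the RMSE only in its degrees-of-freedom divisor.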

Unit 1c – Slide 7

Discrepancy (2, 3): The PRESS and Standardized Residual

* PRESS (Predicted Residual Sum of Squares) residuals:
capture predict HATSTAT, leverage
generate PRESS = RESID/(1-HATSTAT)
summarize PRESS

* Standardized residuals:
capture predict STDRESID, rstandard
summarize STDRESID

Funny thing about atypical observations… they mask themselves. They draw the regression line to themselves, reducing their residuals.

The PRESS residual "unmasks" the atypicality of an observation by calculating a residual for a regression line that is estimated from a dataset that does not include the observation itself.

The standardized residual (also, confusingly, called the standardized PRESS residual and the internally studentized residual) is the PRESS residual expressed in terms of predicted standard deviations of residuals. This arguably results in a more interpretable statistic, where a residual of 2 or 3 standard deviations starts to seem "atypical."

. summarize PRESS
    Variable      Obs        Mean     Std. Dev.        Min        Max
    PRESS         194   -.0002356     .5822095   -1.70661    1.799791

. summarize STDRESID
    Variable      Obs        Mean     Std. Dev.        Min        Max
    STDRESID      194   -.0002017     1.003628   -2.933572   3.084371

This does not mean that we should discard observations with standardized residuals > 3, say. If we did, and we recalculated standardized residuals, what might we find?
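A sketch of what Stata's `predict ..., leverage` and `predict ..., rstandard` compute, again on invented data. The useful algebraic fact is that the PRESS residual e_i/(1 − h_i) equals the residual you would get by actually refitting the model without case i, so no refitting is needed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative simulated data (not the course dataset)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Hat (leverage) values: diagonal of H = X (X'X)^{-1} X'
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# PRESS residual: the residual each case would have if the line were fit
# without it; algebraically equal to e_i / (1 - h_i)
press = resid / (1 - hat)

# Standardized (internally studentized) residual: e_i / (RMSE * sqrt(1 - h_i))
rmse = np.sqrt(resid @ resid / (n - k - 1))
stdresid = resid / (rmse * np.sqrt(1 - hat))

# Leverages are positive and sum to the number of coefficients, k + 1
print(round(hat.sum(), 6))  # prints 3.0
```

Because the masking observation has a large h_i, dividing by (1 − h_i) inflates exactly the residuals that the fitted line has pulled toward itself.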

Unit 1c – Slide 8

This outlier has a large residual. This outlier does not: it masks itself.
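The masking phenomenon can be demonstrated directly with a small simulation (invented data, not from the slides): a high-leverage point drags the fitted line toward itself, so its raw residual looks modest while its PRESS residual reveals the atypicality.

```python
import numpy as np

rng = np.random.default_rng(6)

# A tight cluster on the line y = 2x, plus one extreme-on-X point far below it
n = 30
x = np.linspace(0, 1, n)
y = 2 * x + rng.normal(scale=0.1, size=n)
x = np.append(x, 5.0)   # extreme on X...
y = np.append(y, 4.0)   # ...and far below the trend (y = 2x would give 10)

X = np.column_stack([np.ones(x.size), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
press = resid / (1 - hat)

# The outlier's leverage is near 1, so its raw residual is shrunken
# ("masked") while its PRESS residual is several times larger.
print(round(hat[-1], 3), round(resid[-1], 3), round(press[-1], 3))
```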

Leverage: Extremity of an observation along predictor variables

• We should consider both the discrepancy from the regression line and the leverage exerted on the regression line.
• With simple linear regression, it looks somewhat familiar (see bottom left).
• With many predictors, the leverage of an observation is a single expression of distance from many predictor means.
• Called h_i because leverage is an element of the "hat matrix": it puts the hat on the Y.

http://www.stat.sc.edu/~west/javahtml/Regression.html

e_std(i) = (Y_i − Ŷ_i) / (RMSE · √(1 − h_i))

. summarize HATSTAT
    Variable      Obs        Mean     Std. Dev.        Min        Max
    HATSTAT       205    .0258016     .0127337   .0103571   .0827226

Larger values, more leverage. All h_i sum to k + 1, so they tend to seem small in magnitude.
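The "single expression of distance from many predictor means" can be made concrete: for a model with an intercept, the leverage of case i is exactly 1/n plus its squared Mahalanobis distance from the predictor means, scaled by n − 1. A sketch on simulated predictors (assumed data, not the course dataset) verifies the identity and the sum-to-(k + 1) property:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated predictor matrix with an intercept column
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Identity: h_i = 1/n + MD_i^2 / (n - 1), where MD_i is the Mahalanobis
# distance of case i's predictors from the predictor means.
Xc = X[:, 1:] - X[:, 1:].mean(axis=0)
S = np.cov(Xc, rowvar=False)                       # sample covariance (ddof=1)
md2 = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(S), Xc)
hat_alt = 1 / n + md2 / (n - 1)

print(np.allclose(hat, hat_alt), round(hat.sum(), 6))  # prints True 4.0
```

So a case's leverage depends only on where it sits in the predictor cloud, never on its outcome value, which is why high leverage alone need not move the coefficients.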

[Figure: leverage (vertical, 0 to .08) versus normalized residual squared (horizontal, 0 to .05), points labeled by child ID.]

Unit 1c – Slide 9

Cook’s Distance: Influence = Discrepancy × Leverage

Discrepancy: Extreme-on-Y, a large residual.
Leverage: Extreme-on-X, distant on one or more predictors.
Influence: High discrepancy, high leverage.

d_i = (h_i / (1 − h_i)) · (e_std(i)² / (k + 1))

The first factor captures leverage; the second, the squared standardized residual. The lvr2plot displays leverage against the normalized residual squared.

Influence Demo:
. findit regpt
. regpt
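Cook's distance can also be defined directly as the (scaled) total shift in all fitted values when a case is deleted; the formula above is an algebraic shortcut that avoids refitting. A sketch on simulated data (not the course dataset) checks the two against each other:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data for illustration
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.3, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
mse = resid @ resid / (n - k - 1)
stdresid = resid / np.sqrt(mse * (1 - hat))

# Shortcut formula: leverage factor times squared standardized residual
cooks_d = (hat / (1 - hat)) * stdresid**2 / (k + 1)

# Direct definition for case 0: squared shift in all fitted values when
# case 0 is deleted, scaled by (k + 1) * MSE
mask = np.arange(n) != 0
b0, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
d0_direct = ((X @ b - X @ b0) ** 2).sum() / ((k + 1) * mse)
print(np.isclose(cooks_d[0], d0_direct))  # prints True
```

The product structure makes the slide's slogan literal: a case needs both factors, leverage and discrepancy, to register a large Cook's distance.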

Unit 1c – Slide 10

Exploratory Analysis of Discrepancy, Leverage, and Influence

* Identify cases that are extreme-on-Y:
* Plot standardized residuals versus ID to pick out the extreme-on-Y cases:
graph twoway (scatter STDRESID ID, mlabel(ID) msize(small)), name(Unit1c_g2, replace)
* Sort and list only those cases atypical on the standardized residuals.
* Recall that they can be both positive and negative:
sort STDRESID
list ID STDRESID if STDRESID !=. in F/20, clean
list ID STDRESID if STDRESID !=. in -20/L, clean

* Second, identify the cases that are extreme-on-X:
* Plot HATSTAT versus ID to pick out the extreme-on-X cases:
graph twoway (scatter HATSTAT ID, mlabel(ID) msize(small)), name(Unit1c_g3, replace)
* Sort and list only the cases atypical on HATSTAT.
* Recall that the HAT statistic measures the "horizontal" distance of each
* case from the "center" of the data in the predictor plane. It can only
* take on positive values, and so only atypical cases with large positive
* and non-missing values are in contention:
sort HATSTAT
list ID HATSTAT if HATSTAT !=. in -20/L, clean

* Third, identify the cases that are most influential overall:
* (COOKSDSTAT must first be created from the fitted model:)
capture predict COOKSDSTAT, cooksd
* Plot COOKSDSTAT versus ID to pick out the most influential cases overall:
graph twoway (scatter COOKSDSTAT ID, mlabel(ID) msize(small)), name(Unit1c_g4, replace)
* Sort and list the atypical cases that are most influential overall.
* Recall that Cook's D statistic measures overall impact on the generic fit.
* It can only take on positive values and so it is only atypical cases with
* large positive and non-missing values that are in contention:
sort COOKSDSTAT
list ID COOKSDSTAT if COOKSDSTAT !=. in -20/L, clean

• Plot the value of each influence statistic versus the case ID.
• List the IDs of the cases with extreme values.
• Use scatterplots of statistics on ID and list sorted statistics by ID.
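The sort-and-list step can be sketched outside Stata as well. Here is a minimal Python version on hypothetical arrays standing in for the dataset's ID, STDRESID, HATSTAT, and COOKSDSTAT variables (all values invented):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical stand-ins for the Stata variables
ids = np.arange(300, 320)
stdresid = rng.normal(size=ids.size)
hatstat = rng.uniform(0.01, 0.08, size=ids.size)
cooksd = rng.uniform(0.0, 0.08, size=ids.size)

# Extreme-on-Y cases: both tails of the standardized residuals matter
order = np.argsort(stdresid)
print("lowest STDRESID:", ids[order[:3]])
print("highest STDRESID:", ids[order[-3:]])

# Extreme-on-X and high-influence cases: only large positive values matter
print("highest HATSTAT:", ids[np.argsort(hatstat)[-3:]])
print("highest Cook's D:", ids[np.argsort(cooksd)[-3:]])
```

Note the asymmetry the slides emphasize: residual-based flags need both tails inspected, while leverage and Cook's distance are non-negative, so only the upper tail is in contention.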

Unit 1c – Slide 11

[Figure: standardized residuals (vertical, −4 to 4) versus Child Identification Code (horizontal, 300 to 800), points labeled by child ID.]

Exploring Discrepancy with Standardized Residuals: Extreme on Y

         ID   STDRESID
   1.   444  -2.933572
   2.   307  -2.640082
   3.   441  -2.495741
   4.   617  -2.462099
   5.   424  -1.921746
   6.   310  -1.759873
   7.   602  -1.722051

         ID   STDRESID
 188.   745   2.100213
 189.   502   2.155126
 190.   702   2.262468
 191.   726   2.450768
 192.   621   2.522492
 193.   423   2.744562
 194.   553   3.084371

Note that extreme values of residuals can be positive or negative.

[Figure: leverage (vertical, 0 to .08) versus Child Identification Code (horizontal, 300 to 800), points labeled by child ID.]

Unit 1c – Slide 12

Exploring Leverage with the Hat Statistic: Extreme on X

         ID    HATSTAT
 199.   634   .0540155
 200.   331   .0542818
 201.   573   .0542818
 202.   537   .0563484
 203.   322   .0621393
 204.   568   .0677331
 205.   700   .0827226

Note that extreme values of leverage can only be positive.

[Figure: Cook's D (vertical, 0 to .08) versus Child Identification Code (horizontal, 300 to 800), points labeled by child ID.]

Unit 1c – Slide 13

Exploring Influence with Cook’s Distance: Extreme on X and Y

         ID   COOKSDSTAT
 188.   336     .0387276
 189.   424     .0418666
 190.   441     .0432583
 191.   745     .0460144
 192.   307     .0557836
 193.   444     .0595759
 194.   553     .0778421

Note that extreme values of Cook’s Distance can only be positive.

[Figure: leverage (vertical, 0 to .08) versus normalized residual squared (horizontal, 0 to .05), points labeled by child ID.]

Unit 1c – Slide 14

Back to the Leverage by Residual Squared Plot

For our sensitivity analysis, we’ll take a look at these five points. In a real analysis, we would conduct a full substantive investigation of these children using all available qualitative and quantitative data.

Unit 1c – Slide 15

         ID   HICOOK  HIRESID  HIHAT  ILLCAUSE  PREDIC~D  ILL  AGE  SES
 205.   553     1        1       0     4.571    2.841949   1    80   4
 204.   423     0        1       0     5.143    3.585113   1   112   2
 203.   307     0        1       0     2.857    4.337627   1   194   4
 202.   444     0        1       0     2.571    4.220514   1   181   4
 201.   700     0        0       1     3        2.910231   0    68   4

*--------------------------------------------------------------------------------
* Describe atypical observations.
*--------------------------------------------------------------------------------
* Identify atypical cases by their ID #s from the previous analysis.
* First, tag high-discrepancy observations:
generate HIRESID=(ID==444|ID==307|ID==423|ID==553)
* Second, tag high-leverage observations:
generate HIHAT=(ID==700)
* Third, tag cases w/ high overall influence:
generate HICOOK=(ID==553)

* Sort the atypical cases by their ID and selected characteristics:
sort HICOOK HIRESID HIHAT ILLCAUSE ILL AGE SES ID
* List cases for inspection, in a table, sorted by type of atypicality:
list ID HICOOK HIRESID HIHAT ILLCAUSE PREDICTED ILL AGE SES ///
    if HIRESID==1 | HIHAT==1 | HICOOK==1, sepby(HICOOK HIRESID HIHAT)

Exploring Atypical Observations

Classify atypical observations by their basis for being flagged.

Sort and list cases by selected variables.

Similar observations may be chunked together for sensitivity analyses, e.g., #307 & #444

Unit 1c – Slide 16

Bivariate Visualizations of Atypical Observations

[Figures: three scatterplots with the five flagged cases (700, 444, 307, 423, 553) labeled: LAGE versus Hollingshead SES; Understanding of Illness Causality Score versus LAGE; Understanding of Illness Causality Score versus Hollingshead SES.]

Unit 1c – Slide 17

Sensitivity of predictions of children's understanding of causes of illness to atypical observations (n=194)

            Baseline    No 700     No 553     No 423     No 444     No 307     No 307&444
ILL          3.130**    3.160**    2.794*     3.026**    2.845*     2.827*     2.519*
LAGE         2.511***   2.517***   2.515***   2.510***   2.508***   2.509***   2.505***
ILLxLAGE    -0.823***  -0.829***  -0.754**   -0.806***  -0.764**   -0.760**   -0.695**
SES         -0.0875    -0.0892    -0.104*    -0.0789    -0.0722    -0.0737    -0.0572
_cons       -7.337***  -7.362***  -7.323***  -7.344***  -7.350***  -7.349***  -7.363***
R-sq         0.693      0.691      0.708      0.704      0.704      0.702      0.714
mss          139.7      138.4      142.6      141.2      140.1      140.4      140.9
rss          61.82      61.82      58.71      59.36      59.01      59.54      56.53
df_m         4          4          4          4          4          4          4
df_r         189        188        188        188        188        188        187
F            106.8      105.3      114.2      111.8      111.6      110.8      116.5

* p<0.05, ** p<0.01, *** p<0.001. ILL=Diabetic or Asthmatic, LAGE=log-Age in months.

Sensitivity Analysis

Effects of ILL and logAGE, both as main effects and in a two-way interaction, are fairly robust to the removal of the atypical data points. The basic substantive story is preserved throughout.

• The main effect of SES fluctuates in magnitude and significance as atypical observations are excluded.
• The usefulness of socioeconomic status in the prediction of ILLCAUSE is sensitive to the inclusion of atypical observations in this model.
• In practice, we typically restrict investigation to high-influence points, unless there is a serious question about whether the other atypical observations are actually part of the target population.

Goodness-of-fit (the R² statistic) is robust to the removal of atypical observations. It even rises a little when low-SES children are omitted, due to the removal of two large negative residuals from SSE.
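The refit-without-each-flagged-case loop behind a table like this can be sketched in a few lines. This is a hedged illustration on simulated data (the row indices below are hypothetical stand-ins for the flagged IDs; the real analysis would drop cases by ID in the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-in for the ILLCAUSE data, with coefficients loosely
# echoing Model 6 (invented, for illustration only)
n, k = 194, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([-7.3, 3.1, 2.5, -0.8, -0.1]) + rng.normal(scale=0.55, size=n)

def ols(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

baseline = ols(X, y)
flagged = [0, 5, 17]  # hypothetical row indices for the flagged cases

for i in flagged:
    keep = np.arange(n) != i
    b_i = ols(X[keep], y[keep])
    # Report the largest coefficient shift caused by deleting this case
    print(i, np.abs(b_i - baseline).max().round(4))
```

Reading the shifts row by row, exactly as in the table above, shows which conclusions are robust (here, coefficients barely move when any single typical case is dropped) and which would hinge on a single child's data.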