local outlier detection in data forensics: data mining approach to flag unusual schools


Page 1: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools

Mayuko Simon

Data Recognition Corporation

May 2012

Page 2: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Statistical methods for data forensics

• Univariate distributional techniques, e.g., average wrong-to-right erasures.
• Multivariate techniques
  – Simple regression, e.g., the 2011 Reading score is predicted by the 2010 Reading score.
  – A school is flagged if the observed dependent variable differs significantly from the model’s prediction.
• A school is flagged when it is an outlier compared to ALL other schools, i.e., a global outlier.
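The regression-based global flag can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the function name `flag_global_outliers`, the cutoff of 2 residual standard errors, and the toy scores are all assumptions introduced here, since the slides do not state a threshold.

```python
import statistics


def flag_global_outliers(prior, current, cutoff=2.0):
    """Return indices of schools whose current score departs from the
    simple-regression prediction by more than `cutoff` residual SEs.

    `cutoff` is a hypothetical threshold; the slides do not give one.
    """
    mean_x = statistics.fmean(prior)
    mean_y = statistics.fmean(current)
    sxx = sum((x - mean_x) ** 2 for x in prior)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prior, current))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(prior, current)]
    # Residual standard error with n - 2 degrees of freedom.
    rse = (sum(r * r for r in residuals) / (len(prior) - 2)) ** 0.5
    if rse == 0:
        return []  # perfect fit: nothing differs from the prediction
    return [i for i, r in enumerate(residuals) if abs(r) / rse > cutoff]


# Toy data (made up): ten schools on a straight line, one inflated score.
prior_scores = [float(i) for i in range(10)]
current_scores = prior_scores.copy()
current_scores[5] += 10.0
print(flag_global_outliers(prior_scores, current_scores))
```

Every school is compared against one model fitted to all schools, which is why a school that is unusual only relative to similar schools can go unflagged.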

Page 3: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

What if a school is suspicious but not extreme?

• Schools with suspicious behavior may not be extreme enough to be outliers in comparison to all schools.

• Nevertheless, it is reasonable to assume that their scores will be higher than those of their peers, i.e., schools that are very similar in many relevant respects. Such a school is a local outlier.

Page 4: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Local vs. global outlier

• Traditional statistical data forensic techniques lack the ability to detect local outliers.
  – Regression will miss the blue rectangle.
  – A univariate approach (e.g., using only variable a) will miss both the blue rectangle and the red triangle.
  – Cluster analysis is not designed for outlier detection.

Page 5: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

The goal of RegLOD

• A regression-based local outlier detection algorithm, RegLOD, is introduced.
• We wish to find schools that are very similar to their peers in most respects (in terms of most independent variables) but differ significantly in the current year’s score (the dependent variable).

Page 6: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Assumptions of RegLOD

• When most independent variables are very similar, we expect the dependent variable to be similar, as well.

• This assumption is very reasonable and is frequently exploited: it is the principle on which regression trees and nearest-neighbor regression are built (Hastie, 2009).
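The principle behind this assumption can be illustrated with a nearest-neighbor regression sketch: a school's dependent variable is predicted as the mean of the dependent variable among the schools closest to it on the independent variables. The function name `knn_predict`, the Euclidean distance, the choice of k = 3, and the toy data are assumptions made for illustration; this is not the RegLOD algorithm itself.

```python
import math


def knn_predict(target_ivs, schools, k=3):
    """Predict the DV for `target_ivs` as the mean DV of its k nearest schools.

    `schools` is a list of (ivs, dv) pairs; distance is plain Euclidean
    distance on the independent variables (an illustrative choice).
    """
    ranked = sorted(schools, key=lambda s: math.dist(target_ivs, s[0]))
    nearest = ranked[:k]
    return sum(dv for _, dv in nearest) / len(nearest)


# Toy data (made up): two IVs per school and one DV.
schools = [
    ((0.10, 0.20), 0.15),
    ((0.12, 0.18), 0.17),
    ((0.11, 0.21), 0.16),
    ((0.90, 0.95), 0.92),
    ((0.88, 0.97), 0.94),
]
prediction = knn_predict((0.11, 0.20), schools, k=3)
print(round(prediction, 2))
```

A school whose observed score sits far from the scores of its nearest neighbors contradicts the assumption, and that is the kind of case a local outlier method looks for.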

Page 7: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Data and variables

• Data:
  – A large-scale standardized state assessment
• Variables:
  – School-level Math and Reading scale scores in 2010 and 2011.
  – School-level Math and Reading cohort scale scores in 2010 (e.g., for grade 4, the scale scores when the students were in grade 3).
  – School-level average wrong-to-right erasures in 2010 and 2011.

Page 8: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Data and variables

• Scale scores were transformed into logits for a better sense of school level during the analysis.
• Dependent variable
  – 2011 Reading or 2011 Math
• Five independent variables
  – 2011 Math or Reading (the subject not used as the dependent variable), 2010 Math and Reading, and 2010 cohort Math and Reading.
  – Erasure counts were not used in the algorithm.

Page 9: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

RegLOD algorithm overview

1. Select a set of independent variables.
2. Find local weights.
3. Make a peer group for each school.
4. Obtain empirical p-values and flag schools when the criteria are met.

Page 10: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

RegLOD Example: Grade 4 Reading

Step 1. Select a set of independent variables

[Diagram: the dependent variable (DV), 2011 Reading (G4), is linked to five independent variables (IVs): 2011 Math (G4), 2010 Reading (G4), 2010 Math (G4), 2010 Cohort Reading (G3), and 2010 Cohort Math (G3). R² = 0.99.]

Page 11: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

RegLOD Example: Grade 4 Reading

Step 2. Determine the local weights

Weights on the logit scores:

Subject  Grade  P1: 2011 Math  P2: 2010 Reading  P3: 2010 Math  P4: 2010 Cohort Reading  P5: 2010 Cohort Math
R        4      0.35818        0.23228           -0.12216       0.23214                  -0.05164

Page 12: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

RegLOD Example: Grade 4 Reading

School i  School j  Distance/Similarity
1         2          0.051504
1         3          0.144772
1         4         -0.03006
1         5          0.21838
1         6         -0.03232
1         7          0.084222
1         8          0.097555
...       ...        ...
1600      1603       0.293038
1601      1602       2.342625
1601      1603       0.050075
1602      1603       1.848821

Step 3. Select peer schools

• Compute the pair-wise distance using the weights.
• Select peer schools within +/- 0.03 (Dist value) of a school.
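Step 3 can be sketched in Python under one loudly stated assumption: the slides do not give the distance formula, and because the listed distances can be negative, this sketch takes the distance between two schools to be the difference of their weighted sums of independent variables. The weights and the 0.03 window come from the slides; the function names and the three toy schools are hypothetical.

```python
# Local weights P1..P5 for Grade 4 Reading, as shown on the Step 2 slide.
WEIGHTS = [0.35818, 0.23228, -0.12216, 0.23214, -0.05164]
DIST = 0.03  # peer window from the Step 3 slide


def weighted_score(ivs, weights=WEIGHTS):
    """Weighted sum of a school's independent variables (logit scores)."""
    return sum(w * x for w, x in zip(weights, ivs))


def signed_distance(ivs_i, ivs_j, weights=WEIGHTS):
    """ASSUMPTION: distance = difference of weighted sums (may be negative)."""
    return weighted_score(ivs_j, weights) - weighted_score(ivs_i, weights)


def peer_group(target_id, schools, dist=DIST, weights=WEIGHTS):
    """Ids of schools within +/- dist of the target, target included."""
    target_ivs = schools[target_id]
    return [
        school_id
        for school_id, ivs in schools.items()
        if abs(signed_distance(target_ivs, ivs, weights)) <= dist
    ]


# Toy logit scores (made up), ordered P1..P5.
schools = {
    "A": [0.50, 0.40, 0.30, 0.20, 0.10],
    "B": [0.52, 0.41, 0.30, 0.20, 0.10],
    "C": [1.50, 1.40, 1.30, 1.20, 1.10],
}
print(peer_group("A", schools))
```

The target school is always within distance 0 of itself, so it is counted in its own peer group, in line with the peer counts reported later in the deck ("including the school").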

Page 13: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

RegLOD Example: Grade 4 Reading

Step 4. Obtain empirical p-value

• Bootstrap the 2011 Reading grade 4 scores of the peer schools.
• Obtain an empirical p-value for each bootstrap replication and average them.
• Flag a school if the empirical p-value is 0.05 or less.
• Flag a school if the number of peer schools is 10 or less.

[Figure: histogram of the bootstrapped school mean (x-axis, roughly 1250 to 1400) against frequency (y-axis).]
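Step 4 can be sketched as follows. The slides leave the details open, so this is one possible reading rather than the presenter's procedure: each replication resamples the peers' scores with replacement, the replicate p-value is the share of resampled peer scores at or above the school's own score (a one-sided choice), and the replicate p-values are averaged. The function names, the 1000 replications, the fixed seed, and the exclusion of the school from `peer_scores` are all assumptions. The 0.05 and 10-peer thresholds are from the slide.

```python
import random


def empirical_p_value(school_score, peer_scores, n_boot=1000, seed=0):
    """Average, over bootstrap replications, of the share of resampled
    peer scores that are at least as high as the school's score."""
    rng = random.Random(seed)
    n = len(peer_scores)
    total = 0.0
    for _ in range(n_boot):
        resample = rng.choices(peer_scores, k=n)  # sample with replacement
        total += sum(score >= school_score for score in resample) / n
    return total / n_boot


def flag_school(school_score, peer_scores, alpha=0.05, min_peers=10):
    """Return (flagged, p_value); p_value is None when peers are too few."""
    # +1 because the slides count the school itself in its peer group.
    if len(peer_scores) + 1 <= min_peers:
        return True, None
    p_value = empirical_p_value(school_score, peer_scores)
    return p_value <= alpha, p_value


# Toy peer scores (made up), spread over 1250..1348.
peers = [1250.0 + 2 * i for i in range(50)]
print(flag_school(1400.0, peers))      # far above every peer
print(flag_school(1300.0, peers))      # in the middle of the peers
print(flag_school(1400.0, peers[:3]))  # too few peers to test
```

Returning `None` for the p-value when the peer group is too small mirrors the "-" entries that appear in the P-value column of the comparison table.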

Page 14: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

How many schools were flagged?

Local weights (Dist = 0.03)

Subject  Grade  Total School N  Flagged School N  Proportion Flagged (%)
R        4      1603            49                3.06
R        5      1493            37                2.48
R        6      1056            38                3.60
R        7      836             20                2.39
R        8      836             22                2.63
M        4      1603            33                2.06
M        5      1493            63                4.22
M        6      1056            36                3.41
M        7      836             36                4.31
M        8      836             37                4.43
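The Proportion Flagged column appears to be a percentage, i.e., 100 × flagged schools / total schools. The short check below recomputes it from the counts in the table; the dictionary layout and the function name `percent_flagged` are illustrative choices made here.

```python
# (subject, grade): (total schools, flagged schools), copied from the table.
COUNTS = {
    ("R", 4): (1603, 49), ("R", 5): (1493, 37), ("R", 6): (1056, 38),
    ("R", 7): (836, 20), ("R", 8): (836, 22),
    ("M", 4): (1603, 33), ("M", 5): (1493, 63), ("M", 6): (1056, 36),
    ("M", 7): (836, 36), ("M", 8): (836, 37),
}


def percent_flagged(total, flagged):
    """Percentage of schools flagged, rounded to two decimals."""
    return round(100 * flagged / total, 2)


for (subject, grade), (total, flagged) in COUNTS.items():
    print(subject, grade, percent_flagged(total, flagged))
```

For example, Reading grade 4 gives 100 × 49 / 1603 ≈ 3.06, which matches the reported value.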

Page 15: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Compare to the results of other statistical methods for data forensics

• SS: scale score analysis, e.g., the 2011 scale score is predicted by the 2010 scale score.
• PL: performance level analysis, e.g., the proportion of proficient or above in 2011 is predicted by the 2010 proportion of proficient or above.
• Reg: regression analysis using two subjects, e.g., 2011 Reading is predicted by 2011 Mathematics.
• Rasch: use of the Rasch residual.
• WR: wrong-to-right erasure count.
• SSCo: scale score analysis using cohort students, e.g., the 2011 scale score is predicted by the 2010 scale score using cohort students.
• StdRes: standardized residual of a multiple regression using the same variables as the RegLOD analysis.

Page 16: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Grade 4 Reading: Comparison of local and global outlier detection

The Num, Peer N, and P-value columns are from RegLOD; the remaining columns are from the statistical methods (global outlier detection).

Num  Peer N  P-value  SS1  PL1  Reg1  Rasch1  WR1   SSCo1  StdRes2
1    3       -        3.3  1.1  9.9   0.0     4.9   11.4   3.12
2    9       -        9.3  4.6  1.4   1.0     0.2   13.8   1.09
3    93      0.000    6.4  3.5  1.1   0.0     0.0   6.7    3.02
4    57      0.000    4.6  4.0  8.1   0.0     2.8   17.4   2.41
5    6       -        0.9  2.6  8.8   0.0     0.0   7.1    0.64
6    3       -        3.9  2.2  0.0   0.0     20.8  16.8   1.83
7    424     0.005    5.2  1.0  11.0  1.6     0.0   8.4    3.91
8    312     0.006    1.5  1.1  9.7   0.3     2.4   7.2    3.61
9    380     0.008    9.1  4.6  11.8  0.1     4.4   5.0    4.19
10   119     0.008    2.4  2.2  8.7   0.6     0.0   1.1    3.49
11   435     0.009    2.8  1.1  8.0   0.0     0.0   5.0    3.34
12   147     0.014    6.4  4.5  4.9   0.0     4.0   3.2    4.14
13   112     0.018    0.1  1.1  16.9  0.0     2.2   6.8    1.68
14   418     0.019    4.2  3.1  14.3  0.0     0.0   4.7    2.23
15   52      0.019    1.1  0.4  8.6   0.0     2.5   3.0    2.77

Page 17: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

A school with 10 or fewer peers

• E.g., school number 1 in the table
  – In 2011, wrong-to-right erasures were at the 96th percentile, which is rather high.
  – RegLOD found only three peer schools, including the school itself, indicating that this school is an outlier.
  – Large increase in percentile (26th to 95th), indicating a suspicious increase in score.
  – There is reasonable evidence that this school needs further scrutiny.

Percentiles

Variable                  Percentile
DV: 2011 Reading          95
P1: 2011 Math             52
P2: 2010 Reading          100
P3: 2010 Math             93
P4: 2010 Cohort Reading   26
P5: 2010 Cohort Math      20
2010 WR                   60
2011 WR                   96

Page 18: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

A school with many peer schools

• School number 3 in the table
  – RegLOD found 93 peer schools, including the school itself.
  – Large increase in percentile (23rd to 76th).
  – Since there are many peers, we can plot the variables against those of the peer schools.

Percentiles

Variable                  Percentile
DV: 2011 Reading          76
P1: 2011 Math             75
P2: 2010 Reading          22
P3: 2010 Math             67
P4: 2010 Cohort Reading   23
P5: 2010 Cohort Math      41
2010 WR                   3
2011 WR                   52

Page 19: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Comparison to peers for the IVs

• 2010 Reading and 2010 cohort Reading are around the 20th percentile.
• The school is within the peer distribution.

Page 20: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Comparison to peers with 2011 Reading score (DV)

• This school was at the 23rd percentile in 2010 cohort Reading and at the 76th percentile in 2011 Reading.
• The school is an outlier among its peers in 2011 Reading.

Page 21: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

A lower achieving school with many peer schools

• School number 12 in the table
  – RegLOD found 147 peer schools, including the school itself.
  – Moderate increase in percentile (13th to 42nd).
  – The 48th percentile in 2010 cohort Math seems a little strange given that 2011 Math is at the 14th percentile.
  – Erasures were at the 96th percentile, which is rather high.
  – We can take a look at the histograms.

Percentiles

Variable                  Percentile
DV: 2011 Reading          42
P1: 2011 Math             14
P2: 2010 Reading          12
P3: 2010 Math             27
P4: 2010 Cohort Reading   13
P5: 2010 Cohort Math      48
2010 WR                   35
2011 WR                   96

Page 22: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Comparison to peers for the IVs

• 2010 Math is in the right tail because of its oddly high percentile. Other than that, the school is well within the peer distribution.
• The school is within the peer distribution.

Page 23: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Comparison to peers with 2011 Reading score (DV)

• The school is an outlier among its peers in 2011 Reading.
• This school had only a moderate increase in percentile, but since it is an outlier compared to its peers, it is a local outlier.

Page 24: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Did all flagged schools exhibit suspicious behavior?

• 12 schools were potentially flagged incorrectly (extremely high or low achievement).

• The majority of the flagged schools exhibited suspicious behavior.

• Some schools flagged by RegLOD were also flagged by other statistical methods; these were both local and global outliers.

• Other schools were flagged by RegLOD but not by the other statistical methods; these schools were local outliers but not global outliers.

Page 25: Local outlier detection in data forensics:  data mining approach  to flag unusual schools

Conclusion

• RegLOD has shown great promise in data forensics and is a valuable addition to our data forensic tools.
• Its applicability is not limited to cheating detection in educational testing.
• RegLOD’s robust design, specifically its model-based design (the concept of dependent and independent variables in data mining) and its ability to adapt, makes it applicable to a wide range of outlier detection problems.
• We continue to study its capabilities and to extend and apply it to other contexts and tasks.

Page 26: Local outlier detection in data forensics:  data mining approach  to flag unusual schools


Thank you!