Summarization and Deviation Detection: What Is New?

Outline

Summarization

KEFIR – Key Findings Reporter

WSARE – What is Strange About Recent Events

What is New?

Old data new data

Summarization

Concisely summarize what is new, different, or unexpected

with respect to previous values

with respect to expected values

…

Focus on what is actionable!

Problem: Healthcare Costs

Healthcare costs in the US: 1 out of every 7 GDP dollars, and rising

Potential problems: fraud, misuse, …

Understanding where the problems are is the first step to fixing them

GTE – self-insured for medical costs

GTE healthcare costs – $X00,000,000

Task: Analyze employee health care data and generate a report that describes the major problems

GTE Key Findings Reporter: KEFIR

KEFIR Approach:

Analyze all possible deviations

Select interesting findings

Augment key findings with:

Explanations of plausible causes

Recommendations of appropriate actions

Convert findings to a user-friendly report with text and graphics

KEFIR Search Space

Drill-Down Example

What Change Is Important?

Deviation Detection

Drill down through the search space

Generate a finding for each measure:

deviation from previous period

deviation from norm

deviation projected for next period, if no action

Interestingness of Deviations

Impact: how much the deviation affects the bottom line

Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action
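
As a rough illustration of how impact and savings percentage might be combined into one ranking, the sketch below scores each finding by their product. The product rule, the `Finding` class, and the sample numbers are assumptions made for illustration; the slides do not state a formula.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    measure: str
    impact: float              # hypothetical dollar effect of the deviation on the bottom line
    savings_percentage: float  # fraction of the deviation an action is expected to recover


def interestingness(finding: Finding) -> float:
    """Assumed scoring rule: expected savings = impact * savings percentage."""
    return finding.impact * finding.savings_percentage


def rank_findings(findings):
    """Return findings ordered from most to least interesting."""
    return sorted(findings, key=interestingness, reverse=True)


# Made-up example values, not taken from the GTE data.
findings = [
    Finding("average length of stay", impact=800_000.0, savings_percentage=0.05),
    Finding("admission rate per 1000", impact=500_000.0, savings_percentage=0.20),
]
ranked = rank_findings(findings)
print([f.measure for f in ranked])
```

Under this assumed rule, a smaller deviation with a high savings percentage can outrank a larger one that is hard to act on, which matches the emphasis on actionable findings.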

Recommendations

Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.

Example:

If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10

Then utilization review is needed in the area of admission certification.

Expected Savings: 20%
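
A minimal sketch of how such a rule could be encoded and matched, using the example rule above. The `RecommendationRule` class and `recommend` function are hypothetical names, and the rule list here is flat rather than hierarchical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecommendationRule:
    measure: str
    study_area: str
    min_percent_change: float   # rule fires when percent_change exceeds this
    recommendation: str
    expected_savings: float     # fraction of the deviation expected to be saved


RULES = [
    RecommendationRule(
        measure="admission rate per 1000",
        study_area="Inpatient admissions",
        min_percent_change=0.10,
        recommendation="Utilization review is needed in the area of admission certification.",
        expected_savings=0.20,
    ),
]


def recommend(measure, study_area, percent_change, rules=RULES) -> Optional[RecommendationRule]:
    """Return the first rule whose conditions all hold, or None if no rule applies."""
    for rule in rules:
        if (rule.measure == measure
                and rule.study_area == study_area
                and percent_change > rule.min_percent_change):
            return rule
    return None


hit = recommend("admission rate per 1000", "Inpatient admissions", 0.15)
print(hit.recommendation if hit else "no recommendation")
```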

Explanation

A measure is explained by finding the path of related measures with the highest impact

The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13.
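
One way to picture the path search is a greedy walk that always follows the related measure with the highest impact. The graph, the impact values, and the greedy strategy below are assumptions for illustration only; the slides do not describe how related measures are stored or searched.

```python
# Hypothetical graph: each measure maps to its related measures and the
# impact attributed to each of them. Assumed to contain no cycles.
RELATED = {
    "m1": {"m2": 10.0, "m3": 40.0},
    "m3": {"m4": 5.0, "m5": 30.0},
}


def explanation_path(measure, related=RELATED):
    """Follow the highest-impact related measure at each step (greedy walk)."""
    path = [measure]
    while related.get(measure):
        children = related[measure]
        measure = max(children, key=children.get)
        path.append(measure)
    return path


print(explanation_path("m1"))  # ['m1', 'm3', 'm5']
```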

Report Generation

Automatic generation of business-user-oriented reports

Natural language generation with template matching

Graphics

Delivered via browser
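
Template-based generation can be sketched as filling slots in a fixed sentence pattern. The template wording and the `render_finding` helper are hypothetical and are not KEFIR's actual templates.

```python
# Hypothetical sentence template with named slots.
TEMPLATE = ("The {measure} in {study_area} {direction} by {change:.0%} "
            "compared with the previous period.")


def render_finding(measure, study_area, percent_change):
    """Fill the template; the verb is chosen from the sign of the change."""
    direction = "rose" if percent_change > 0 else "fell"
    return TEMPLATE.format(measure=measure, study_area=study_area,
                           direction=direction, change=abs(percent_change))


sentence = render_finding("admission rate per 1000", "Inpatient admissions", 0.15)
print(sentence)
```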

Sample KEFIR pages

Overview

Inpatient admissions

Status

Prototype implemented in GTE in 1995

KEFIR received GTE’s highest award for technical achievement in 1995

Key business user left GTE in 1996 and the system was no longer used

Publication: C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, "Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data," in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

http://aaai.org/Press/Books/Fayyad/fayyad.html

What’s Strange About Recent Events (WSARE)

Weng-Keen Wong (Carnegie Mellon University)

Andrew Moore (Carnegie Mellon University)

Gregory Cooper (University of Pittsburgh)

Michael Wagner (University of Pittsburgh)

http://www.autonlab.org/wsare

Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream

Motivation

Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :

Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)

Traditional Approaches

We need to build a univariate detector to monitor each interesting combination of attributes:

Diarrhea cases among children

Respiratory syndrome cases among females

Viral syndrome cases involving senior citizens from the eastern part of the city

Number of children from downtown hospital

Number of cases involving people working in the southern part of the city

Number of cases involving teenage girls living in the western part of the city

Botulinic syndrome cases

And so on…

You’ll need hundreds of univariate detectors!

We would like to identify the groups with the strangest behavior in recent events.

WSARE Approach

Rule-Based Anomaly Pattern Detection

Association rules used to characterize anomalous patterns. For example, a two-component rule would be:

Gender = Male AND 40 <= Age < 50

WSARE v2.0 Overview

1. Obtain Recent and Baseline datasets from all data

2. Search for rule with best score

3. Determine p-value of best-scoring rule through randomization test

4. If p-value is less than threshold, signal alert

Step 1: Obtain Recent and Baseline Data

Recent data: data from the last 24 hours

Baseline: assumed to capture non-outbreak behavior. We use data from 35, 42, 49, and 56 days prior to the current day.
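
A minimal sketch of selecting the two datasets, assuming each record is a `(date, case)` pair and approximating "last 24 hours" by the current calendar day. Both the record layout and the function names are hypothetical. Because every offset is a multiple of 7, the baseline days fall on the same weekday as the current day.

```python
from datetime import date, timedelta

BASELINE_OFFSETS_DAYS = (35, 42, 49, 56)  # offsets stated on the slide


def baseline_dates(current_day, offsets=BASELINE_OFFSETS_DAYS):
    """Calendar days used as the baseline for current_day."""
    return [current_day - timedelta(days=k) for k in offsets]


def split_recent_and_baseline(records, current_day):
    """records: iterable of (date, case) pairs (assumed layout)."""
    wanted = set(baseline_dates(current_day))
    recent = [case for day, case in records if day == current_day]
    baseline = [case for day, case in records if day in wanted]
    return recent, baseline


today = date(2001, 12, 23)
records = [
    (date(2001, 12, 23), "case A"),
    (date(2001, 11, 18), "case B"),   # 35 days earlier
    (date(2001, 12, 16), "case C"),   # 7 days earlier, not in the baseline
]
recent, baseline = split_recent_and_baseline(records, today)
print(recent, baseline)
```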

Example

Sat 12-23-2001

35.8% (48/134) of today's cases have 30 <= age < 40

17.0% (45/265) of other (baseline) cases have 30 <= age < 40

Step 2: Search for Best Rule

For each rule, form a 2x2 contingency table, e.g.:

                 Count Recent  Count Baseline
Age Decile = 3   48            45
Age Decile ≠ 3   86            220

Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data, 0.00005)

Find the rule RBEST with the lowest score.

Caution: This score is not the true p-value of RBEST because of multiple tests
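
For readers who want to reproduce the score, here is a minimal standard-library sketch of a two-sided Fisher's Exact Test applied to the 2x2 table above. The function name and the choice of the two-sided variant are assumptions; the slides do not say which variant WSARE uses.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    p = sum(q for q in map(prob, support) if q <= observed * (1 + 1e-9))
    return min(1.0, p)


# Rows: rule matches / rule does not match. Columns: recent / baseline.
p_value = fisher_exact_two_sided(48, 45, 86, 220)
print(p_value)
```

The small tolerance factor guards against floating-point ties when comparing table probabilities.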

Step 3: Randomization Test

Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand

Find the rule with the best score on DBRand.

Case  Date (original)  Date (DBRand)
C2    June 4, 2002     June 4, 2002
C3    June 5, 2002     June 12, 2002
C4    June 12, 2002    July 31, 2002
C5    June 19, 2002    June 26, 2002
C6    June 26, 2002    July 31, 2002
C7    June 26, 2002    June 5, 2002
C8    July 2, 2002     July 2, 2002
C9    July 3, 2002     July 3, 2002
C10   July 10, 2002    July 10, 2002
C11   July 17, 2002    July 17, 2002
C12   July 24, 2002    July 24, 2002
C13   July 30, 2002    July 30, 2002
C14   July 31, 2002    June 19, 2002
C15   July 31, 2002    June 26, 2002

Step 3: Randomization Test

Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score.

If the original score were here, it would place in the top 1% of the 1000 scores from the randomization test. We would be impressed and an alert should be raised.

Estimated p-value of the rule = (# better scores) / (# iterations)
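
The randomization test can be sketched as follows, assuming single-attribute rules, records stored as dictionaries, and a two-sided Fisher score. These names and the data layout are hypothetical. Shuffling the recent/baseline labels has the same effect as shuffling the date field.

```python
import random
from math import comb


def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    return min(1.0, sum(q for q in map(prob, support) if q <= observed * (1 + 1e-9)))


def best_score(records, is_recent, rules):
    """Lowest p-value over all candidate rules; each rule is an (attribute, value) pair."""
    best = 1.0
    for attr, value in rules:
        a = b = c = d = 0
        for rec, recent in zip(records, is_recent):
            match = rec[attr] == value
            if match and recent:
                a += 1
            elif match:
                b += 1
            elif recent:
                c += 1
            else:
                d += 1
        best = min(best, fisher_p(a, b, c, d))
    return best


def randomization_p_value(records, is_recent, rules, iterations=1000, seed=0):
    """Estimated p-value = (# shuffled datasets with a better best score) / (# iterations)."""
    rng = random.Random(seed)
    original = best_score(records, is_recent, rules)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # builds one randomized dataset (DBRand)
        if best_score(records, labels, rules) < original:
            better += 1
    return better / iterations


# Synthetic data: fever is rare in the baseline and common in the recent cases.
records = [{"symptom": "fever" if i % 10 == 0 else "other"} for i in range(200)]
records += [{"symptom": "fever"}] * 45 + [{"symptom": "other"}] * 5
is_recent = [False] * 200 + [True] * 50
rules = [("symptom", "fever"), ("symptom", "other")]
p = randomization_p_value(records, is_recent, rules, iterations=200)
print(p)
```

Because the best rule is re-selected on every shuffled dataset, the estimate accounts for the multiple tests that make the raw Fisher score unreliable.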

Results on Actual ED Data from 2001

1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000

14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False

7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000

12.42% ( 58/467) of today's cases have Respiratory Syndrome = True

6.53% (653/10000) of baseline have Respiratory Syndrome = True

3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000

1.44% ( 9/625) of today's cases have 100 <= Age < 110

0.08% ( 8/10000) of baseline have 100 <= Age < 110

4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000

83.80% (481/574) of today's cases have Unknown Syndrome = False

74.29% (7430/10001) of baseline have Unknown Syndrome = False

5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000

14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False

7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

WSARE v3.0: Improving the Baseline

Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.

What if this assumption isn’t true? What if data from 7, 14, 21, and 28 days prior is better?

We would like to determine the baseline automatically!

Temporal Trends

From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)

WSARE v3.0

Generate the baseline…

“Taking into account recent flu levels…”

“Taking into account that today is a public holiday…”

“Taking into account that this is Spring…”

“Taking into account recent heatwave…”

“Taking into account that there’s a known natural Food-borne outbreak in progress…”

Bonus: More efficient use of historical data

Idea: Bayesian Networks

Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables

“On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems”

“Patients from West Park Hospital are less likely to be young”

“On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”

“The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”

Obtaining Baseline Data

1. Learn Bayesian Network from all historical data

2. Generate baseline given today’s environment

Baseline: what should be happening today given today’s environment
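
The two steps can be illustrated with a deliberately simplified stand-in: a single conditional table P(attribute | environment) estimated by counting, then sampled for today's environment. This is not a full Bayesian network learner, and the field names (`season`, `prodrome`) and functions are hypothetical.

```python
import random
from collections import Counter, defaultdict


def learn_conditional(history, env_key, attr_key):
    """Count attribute values per environment value (a one-edge stand-in for a learned network)."""
    counts = defaultdict(Counter)
    for rec in history:
        counts[rec[env_key]][rec[attr_key]] += 1
    return counts


def generate_baseline(counts, todays_env, n, seed=0):
    """Sample n baseline attribute values conditioned on today's environment."""
    table = counts.get(todays_env)
    if not table:
        raise ValueError(f"no historical data for environment {todays_env!r}")
    rng = random.Random(seed)
    values = list(table)
    weights = [table[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


# Synthetic history: respiratory cases are more common in winter.
history = (
    [{"season": "winter", "prodrome": "respiratory"}] * 80
    + [{"season": "winter", "prodrome": "other"}] * 20
    + [{"season": "summer", "prodrome": "respiratory"}] * 30
    + [{"season": "summer", "prodrome": "other"}] * 70
)
counts = learn_conditional(history, "season", "prodrome")
baseline = generate_baseline(counts, "winter", n=1000)
print(Counter(baseline))
```

Sampling from the learned distribution uses all historical records that inform the condition, rather than only four fixed days.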

Simulation

[Figure: Bayesian network used for the simulation. Nodes: DATE, DAY OF WEEK, SEASON, FLU LEVEL, WEATHER, REGION, AGE, GENDER, Region Grassiness, Region Anthrax Concentration, Region Food Condition, Immune System, Outside Activity, Has Anthrax, Has Flu, Has Allergy, Has Heart Attack, Has Sunburn, Has Cold, Heart Health, Has Food Poisoning, Disease, ACTION, Actual Symptom, REPORTED SYMPTOM, DRUG]

Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.

Simulation

100 different data sets

Each data set covered a two-year period

Anthrax release occurred at a random point during the second year

Algorithms were allowed to train on data from the current day back to the first day in the simulation

Any alert before the actual anthrax release is considered a false positive

Detection time is calculated from the first alert after the anthrax release. If no alerts are raised, detection time is capped at 14 days
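
The two evaluation quantities above can be computed as in this sketch. It assumes days are integer indices, that an alert on the release day itself counts as detection with time 0, and that late detections are also capped at 14 days; the slides leave these details unstated.

```python
MAX_DETECTION_DAYS = 14  # cap stated on the slide


def evaluate_alerts(alert_days, release_day, cap=MAX_DETECTION_DAYS):
    """Return (false_positives, detection_time) for one simulated data set.

    alert_days: day indices on which the algorithm raised an alert.
    release_day: day index of the anthrax release.
    """
    false_positives = sum(1 for day in alert_days if day < release_day)
    delays = [day - release_day for day in alert_days if day >= release_day]
    detection_time = min(min(delays), cap) if delays else cap
    return false_positives, detection_time


print(evaluate_alerts([400, 500, 612, 640], release_day=610))  # (2, 2)
```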

Simulation Plot

[Figure: simulation plot with the anthrax release marked (not the highest peak)]

Results on Simulation

Summary

Summarization of what is new and interesting

Key ideas:

search many possible findings

compare to past data and expected data

avoid overfitting

focus on actionable changes

Example systems:

KEFIR (GTE, 1992-1995)

WSARE (CMU/Pitt, 2002-3)
