summarization and deviation detection -- what is new?
Post on 20-Dec-2015
Summarization
Concisely summarize what is new and different, unexpected
with respect to previous values
with respect to expected values
…
Focus on what is actionable!
Problem: Healthcare Costs
Healthcare costs in US: 1 out of 7 GDP $ and rising
potential problems: fraud, misuse, …
understanding where the problems are is the first step to fixing them
GTE – self insured for medical costs
GTE healthcare costs – $X00,000,000
Task: Analyze employee health care data and generate a report that describes the major problems
GTE Key Findings Reporter: KEFIR
KEFIR Approach:
Analyze all possible deviations
Select interesting findings
Augment key findings with:
Explanations of plausible causes
Recommendations of appropriate actions
Convert findings to a user-friendly report with text and graphics
Deviation Detection
Drill Down through the search space
Generate a finding for each measure:
deviation from previous period
deviation from norm
deviation projected for next period, if no action
Interestingness of Deviations
Impact: how much the deviation affects the bottom line
Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action
Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10
Then Utilization review is needed in the area of admission certification.
Expected Savings: 20%
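A minimal sketch of how the example rule could be encoded and applied to a finding. The `Finding` fields, the function names, and the idea of multiplying a dollar impact by the savings percentage are illustrative assumptions, not KEFIR's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Finding:
    # Hypothetical record layout for one deviation finding.
    measure: str
    study_area: str
    percent_change: float
    impact: float  # assumed dollar effect on the bottom line


def recommend(finding: Finding) -> Optional[Tuple[str, float]]:
    """Return (recommendation text, savings percentage) if the example rule fires."""
    if (finding.measure == "admission rate per 1000"
            and finding.study_area == "Inpatient admissions"
            and finding.percent_change > 0.10):
        return ("Utilization review is needed in the area of "
                "admission certification.", 0.20)
    return None  # no rule matched


def expected_savings(finding: Finding) -> float:
    """Assumption: expected savings = impact * savings percentage."""
    rec = recommend(finding)
    return 0.0 if rec is None else finding.impact * rec[1]
```

A real rule base would hold many such rules arranged hierarchically; this sketch hard-codes only the single rule from the slide.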
Explanation
A measure is explained by finding the path of related measures with the highest impact
The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13.
Report Generation
Automatic generation of business-user-oriented reports
Natural language generation with template matching
Graphics
delivered via browser
Status
Prototype implemented in GTE in 1995
KEFIR received GTE’s highest award for technical achievement in 1995
Key business user left GTE in 1996 and system was no longer used
Publication: Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data, C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
What’s Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
http://www.autonlab.org/wsare
Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream
Motivation
Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)
Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
Diarrhea cases among children
Respiratory syndrome cases among females
Viral syndrome cases involving senior citizens from eastern part of city
Number of children from downtown hospital
Number of cases involving people working in southern part of the city
Number of cases involving teenage girls living in the western part of the city
Botulinic syndrome cases
And so on…
You’ll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.
WSARE Approach
Rule-Based Anomaly Pattern Detection
Association rules used to characterize anomalous patterns. For example, a two-component rule would be:
Gender = Male AND 40 <= Age < 50
WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets from All Data
2. Search for rule with best score
3. Determine p-value of best scoring rule through randomization test
4. If p-value is less than threshold, signal alert
Step 1: Obtain Recent and Baseline Data
Recent Data: data from last 24 hours
Baseline: baseline data is assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day
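The split described in Step 1 might look like the sketch below. The record layout (a dict with a `"date"` key) and the function name are assumptions for illustration, and "last 24 hours" is simplified here to "records dated today".

```python
from datetime import date, timedelta

# Offsets taken from the slide: 35, 42, 49 and 56 days before the current day.
BASELINE_OFFSETS_DAYS = (35, 42, 49, 56)


def split_recent_and_baseline(records, today):
    """Split records into (recent, baseline) lists by their 'date' field."""
    baseline_days = {today - timedelta(days=k) for k in BASELINE_OFFSETS_DAYS}
    recent = [r for r in records if r["date"] == today]
    baseline = [r for r in records if r["date"] in baseline_days]
    return recent, baseline
```

All four offsets are multiples of seven, so every baseline day falls on the same weekday as the current day.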
Example
Sat 12-23-2001
35.8% (48/134) of today's cases have 30 <= age < 40
17.0% (45/265) of other (baseline) cases have 30 <= age < 40
Step 2. Search for Best Rule
For each rule, form a 2x2 contingency table, e.g.

                  Count Recent   Count Baseline
Age Decile = 3         48              45
Age Decile ≠ 3         86             220

Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data 0.00005)
Find rule R-best with the lowest score.
Caution: This score is not the true p-value of R-best because of multiple tests
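For the contingency table in this step, the score can be computed with a small pure-Python version of Fisher's exact test. This is a sketch, not the WSARE code: the function name is hypothetical, and the two-sided form (summing all tables no more probable than the observed one) is an assumption, since the slides do not say which variant is used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows: cases matching / not matching the rule.
    Columns: recent / baseline counts.
    """
    row1 = a + b          # cases matching the rule
    col1 = a + c          # recent cases
    n = a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x matching cases among the recent ones.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as likely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Counts from the Age Decile = 3 example.
    print(fisher_exact_two_sided(48, 45, 86, 220))
```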
Step 3: Randomization Test
Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand
Find the rule with the best score on DBRand.
Case  Original date   Shuffled date
C2    June 4, 2002    June 4, 2002
C3    June 5, 2002    June 12, 2002
C4    June 12, 2002   July 31, 2002
C5    June 19, 2002   June 26, 2002
C6    June 26, 2002   July 31, 2002
C7    June 26, 2002   June 5, 2002
C8    July 2, 2002    July 2, 2002
C9    July 3, 2002    July 3, 2002
C10   July 10, 2002   July 10, 2002
C11   July 17, 2002   July 17, 2002
C12   July 24, 2002   July 24, 2002
C13   July 30, 2002   July 30, 2002
C14   July 31, 2002   June 19, 2002
C15   July 31, 2002   June 26, 2002
Step 3: Randomization Test
Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score.
If the original score were here, it would place in the top 1% of the 1000 scores from the randomization test. We would be impressed and an alert should be raised.
Estimated p-value of the rule is:
# better scores / # iterations
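The randomization test can be sketched as follows. Several things here are assumptions made for illustration: only one-component rules (attribute = value) are searched, shuffling the date field is modelled as shuffling the recent/baseline labels, "better" is read as a strictly lower score, and all function names and the toy data are hypothetical.

```python
import random
from math import comb


def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    probs = [comb(row1, x) * comb(n - row1, col1 - x) / denom
             for x in range(max(0, col1 - (n - row1)), min(row1, col1) + 1)]
    p_obs = comb(row1, a) * comb(n - row1, col1 - a) / denom
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


def best_rule_score(records, is_recent):
    """Lowest Fisher score over all one-component rules (attribute = value)."""
    n_recent = sum(is_recent)
    n_base = len(records) - n_recent
    counts = {}  # (attribute, value) -> [recent count, baseline count]
    for rec, recent in zip(records, is_recent):
        for item in rec.items():
            counts.setdefault(item, [0, 0])[0 if recent else 1] += 1
    return min(fisher_p(r, b, n_recent - r, n_base - b)
               for r, b in counts.values())


def randomization_p_value(records, is_recent, iterations=1000, seed=0):
    """Estimated p-value = (# better scores) / (# iterations)."""
    rng = random.Random(seed)
    original = best_rule_score(records, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # stands in for shuffling the date field
        if best_rule_score(records, labels) < original:
            better += 1
    return better / iterations


def make_toy_data():
    """Synthetic data: recent cases are heavily skewed toward age '30s'."""
    ages = ["20s", "30s", "40s", "50s"]
    baseline = [{"gender": "MF"[i % 2], "age": ages[i % 4]} for i in range(160)]
    recent = [{"gender": "MF"[i % 2], "age": "30s" if i < 30 else "50s"}
              for i in range(40)]
    return baseline + recent, [False] * 160 + [True] * 40
```

Because the best rule is re-searched on every shuffled dataset, the estimate accounts for the multiple tests that make the raw Fisher score unreliable.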
Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
1.44% ( 9/625) of today's cases have 100 <= Age < 110
0.08% ( 8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
83.80% (481/574) of today's cases have Unknown Syndrome = False
74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
WSARE 3.0: Improving the Baseline
Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.
What if this assumption isn’t true? What if data from 7, 14, 21 and 28 days prior is better?
We would like to determine the baseline automatically!
Temporal Trends
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)
WSARE v3.0 Generate the baseline…
“Taking into account recent flu levels…”
“Taking into account that today is a public holiday…”
“Taking into account that this is Spring…”
“Taking into account recent heatwave…”
“Taking into account that there’s a known natural Food-borne outbreak in progress…”
Bonus: More efficient use of historical data
Idea: Bayesian Networks
Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables
“On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems”
“Patients from West Park Hospital are less likely to be young”
“On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
“The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”
Obtaining Baseline Data
1. Learn Bayesian Network from All Historical Data
2. Generate baseline given Today’s Environment
Baseline: What should be happening today given today’s environment
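A toy sketch of the two steps, reduced to the simplest possible case: one conditional probability table with the environment variables as parents and a single case attribute as the child. WSARE 3.0 learns a full network over all attributes; the fixed structure, the function names, and the field names below are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def learn_cpt(history, env_keys, target):
    """Step 1 (toy): estimate P(target | environment) by counting."""
    counts = defaultdict(Counter)
    for rec in history:
        env = tuple(rec[k] for k in env_keys)
        counts[env][rec[target]] += 1
    return {env: {value: n / sum(c.values()) for value, n in c.items()}
            for env, c in counts.items()}


def generate_baseline(cpt, today_env, n, seed=0):
    """Step 2 (toy): sample n baseline values given today's environment."""
    rng = random.Random(seed)
    dist = cpt[today_env]
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=n)
```

Sampling a large baseline conditioned on today's environment gives the "what should be happening today" dataset that replaces the fixed 35 to 56 day window.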
Simulation
[Figure: Bayesian network used for the simulation. Nodes: DATE, DAY OF WEEK, SEASON, FLU LEVEL, WEATHER, REGION, AGE, GENDER, Region Grassiness, Region Anthrax Concentration, Region Food Condition, Immune System, Outside Activity, Has Anthrax, Has Flu, Has Allergy, Has Heart Attack, Has Sunburn, Has Cold, Heart Health, Has Food Poisoning, Disease, ACTION, Actual Symptom, REPORTED SYMPTOM, DRUG]
Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.
Simulation
100 different data sets
Each data set consisted of a two year period
Anthrax release occurred at a random point during the second year
Algorithms allowed to train on data from the current day back to the first day in the simulation
Any alert before the actual anthrax release is considered a false positive
Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days
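The scoring rules for one simulated data set can be sketched as below. Days are assumed to be integer indices into the simulation; treating an alert on the release day itself as a detection, and capping late detections (not only missing ones) at 14 days, are assumptions the slides do not spell out.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false positive count, detection time in days) for one data set."""
    # Alerts raised before the release count as false positives.
    false_positives = sum(1 for d in alert_days if d < release_day)
    # Detection time is measured to the first alert on or after the release.
    after = [d for d in alert_days if d >= release_day]
    detection_time = min(after) - release_day if after else cap_days
    return false_positives, min(detection_time, cap_days)
```

For example, `evaluate_alerts([100, 200, 403, 410], 400)` counts two false positives and a detection time of three days.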