What’s Strange About Recent Events (WSARE)
Weng-Keen Wong (University of Pittsburgh)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
This work funded by DARPA, the State of Pennsylvania, and NSF
Motivation
Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :
Suppose we have real-time access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)
The Problem
From this data, can we detect if a disease outbreak is happening? How early can we detect it?
We’re talking about non-specific disease detection.
The question we’re really asking: What’s strange about recent events?
Traditional Approaches
What about using traditional anomaly detection?
• Typically assumes data is generated by a model
• Finds individual data points that have low probability with respect to this model
• These outliers have rare attributes or combinations of attributes
• Need to identify anomalous patterns, not isolated data points
Traditional Approaches
What about monitoring aggregate daily counts of certain attributes?
• We’ve now turned multivariate data into univariate data
• Lots of algorithms have been developed for monitoring univariate data:
  – Time series algorithms
  – Regression techniques
  – Statistical Quality Control methods
• Need to know a priori which attributes to form daily aggregates for!
[Figure: Number of ED Visits per Day. x-axis: Day Number (1 to 100); y-axis: Number of ED Visits (0 to 50)]
Traditional Approaches
What if we don’t know what attributes to monitor?
What if we want to exploit the spatial, temporal and/or demographic characteristics of the epidemic to detect the outbreak as early as possible?
Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
• Diarrhea cases among children
• Respiratory syndrome cases among females
• Viral syndrome cases involving senior citizens from eastern part of city
• Number of children from downtown hospital
• Number of cases involving people working in southern part of the city
• Number of cases involving teenage girls living in the western part of the city
• Botulinic syndrome cases
• And so on…
You’ll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.
One Possible Approach
Primary Key  Date     Time   Gender  Age  Hospital  Many more…
Today’s Records
100          8/24/03  9:12   M       20s  1         …
101          8/24/03  10:45  F       40s  1         …
:            :        :      :       :    :         :
Yesterday’s Records
2243         8/17/03  11:07  M       60s  2         …
2244         8/17/03  12:15  M       60s  1         …
:            :        :      :       :    :         :
Last Year’s Records
12567        8/24/02  13:05  F       40s  3         …
12568        8/24/02  13:57  M       50s  2         …
:            :        :      :       :    :         :
Idea: Can use association rules to find patterns in today’s records that weren’t there in past data
One Possible Approach
Recent records (from today):
Primary Key  Date     Time   Gender  Age     …
100          8/24/03  9:12   M       Child   …
101          8/24/03  10:45  M       Senior  …
:            :        :      :       :       :
Baseline records (from 7 days ago):
Primary Key  Date     Time   Gender  Age     …
2164         8/17/03  13:05  F       Senior  …
2165         8/17/03  13:57  F       Senior  …
:            :        :      :       :       :
Both sets combined, with a Source column:
Primary Key  Date     Time   …  Source
100          8/24/03  9:12   …  Recent
101          8/24/03  10:45  …  Recent
:            :        :      :  :
2164         8/17/03  13:05  …  Baseline
2165         8/17/03  13:57  …  Baseline
:            :        :      :  :
Find which rules predict unusually high proportions in recent records when compared to the baseline, e.g.
52/200 records from “recent” have Gender = Male AND Age = Senior
90/180 records from “baseline” have Gender = Male AND Age = Senior
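The counting step behind these proportions can be sketched in Python. This is only an illustration: the record layout (one dict per record) and the helper name `rule_proportion` are assumptions, and the toy records are invented rather than taken from the ED data.

```python
def rule_proportion(records, rule):
    """Return (matches, total) for a rule given as {attribute: required_value}."""
    matches = sum(
        all(rec.get(attr) == val for attr, val in rule.items())
        for rec in records
    )
    return matches, len(records)


# Toy data only: two tiny record sets standing in for "recent" and "baseline".
recent = [
    {"Gender": "M", "Age": "Senior"},
    {"Gender": "M", "Age": "Child"},
    {"Gender": "F", "Age": "Senior"},
    {"Gender": "M", "Age": "Senior"},
]
baseline = [
    {"Gender": "F", "Age": "Senior"},
    {"Gender": "M", "Age": "Senior"},
    {"Gender": "F", "Age": "Child"},
    {"Gender": "F", "Age": "Senior"},
]

rule = {"Gender": "M", "Age": "Senior"}  # a two-component rule
for name, records in (("recent", recent), ("baseline", baseline)):
    hits, total = rule_proportion(records, rule)
    print(f"{hits}/{total} records from {name} match the rule")
```

A rule with more components is just a dict with more entries, so the same helper covers every rule size the search considers.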
Which rules do we report?
• Search over all rules up to a maximum number of components
• For each rule, form a 2x2 contingency table, e.g.

                      Count Recent  Count Baseline
  Home Location = NW  48            45
  Home Location ≠ NW  86            220

• Perform Fisher’s Exact Test to get a p-value for each rule (call this the score)
• Report the rule with the lowest score
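As a rough illustration of the scoring step, the sketch below computes a one-sided Fisher's exact test for the table above using only the standard library. The slides do not say whether the test is one-sided or two-sided; the one-sided "greater" alternative used here is an assumption, chosen because the goal is to find unusually high proportions in recent records. The function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(recent_match, baseline_match, recent_other, baseline_other):
    """One-sided Fisher's exact test on a 2x2 table.

    Returns the probability, with all row and column totals held fixed, of
    seeing at least `recent_match` matching records among the recent records.
    """
    row_match = recent_match + baseline_match      # records matching the rule
    n_recent = recent_match + recent_other         # records labelled recent
    n = row_match + recent_other + baseline_other  # all records
    total = comb(n, n_recent)
    tail = sum(
        comb(row_match, k) * comb(n - row_match, n_recent - k)
        for k in range(recent_match, min(row_match, n_recent) + 1)
    )
    return tail / total


# Table from the slide: Home Location = NW versus Home Location != NW.
score = fisher_exact_greater(48, 45, 86, 220)
print(f"score = {score:.3g}")
```

A low score means the rule's share of recent records is hard to explain by chance given the baseline, which is why the rule with the lowest score is the one reported.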
Problems with the Approach
1. Multiple Hypothesis Testing
2. A Changing Baseline
Problem #1: Multiple Hypothesis Testing
• Can’t interpret the rule scores as p-values
• Suppose we reject the null hypothesis when score < α, where α = 0.05
• For a single hypothesis test, the probability of making a false discovery = α
• Suppose we do 1000 tests, one for each possible rule
• Probability(false discovery) could be as bad as:
  $1 - (1 - 0.05)^{1000} \gg 0.05$
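The bound above is easy to evaluate numerically. The sketch below assumes the tests are independent, which is the worst case the slide has in mind; the function name is made up for illustration.

```python
def prob_any_false_discovery(alpha, n_tests):
    """Chance of at least one false discovery across n independent tests,
    each run at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 10, 100, 1000):
    print(n, prob_any_false_discovery(0.05, n))
```

With a single test the value is α itself; by 1000 tests it is essentially 1, so the raw score of the best rule cannot be read as a p-value.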
Randomization Test
• Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand
• Find the rule with the best score on DBRand.

Case  Original Date  Shuffled Date (DBRand)
C2    Aug 16, 2003   Aug 16, 2003
C3    Aug 17, 2003   Aug 17, 2003
C4    Aug 17, 2003   Aug 24, 2003
C5    Aug 17, 2003   Aug 17, 2003
C6    Aug 17, 2003   Aug 24, 2003
C7    Aug 17, 2003   Aug 17, 2003
C8    Aug 21, 2003   Aug 21, 2003
C9    Aug 21, 2003   Aug 21, 2003
C10   Aug 22, 2003   Aug 22, 2003
C11   Aug 22, 2003   Aug 22, 2003
C12   Aug 23, 2003   Aug 23, 2003
C13   Aug 23, 2003   Aug 23, 2003
C14   Aug 24, 2003   Aug 17, 2003
C15   Aug 24, 2003   Aug 17, 2003
Randomization Test
Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score.
If the original score placed in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
Corrected p-value of the rule is:
# better scores / # iterations
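The randomization test can be sketched as follows. Several simplifications are assumptions rather than details from the slides: shuffling the date field is modelled as shuffling a recent/baseline flag across records, only single-component rules are searched, the score is a one-sided Fisher's exact p-value, and 200 iterations are used in the example instead of 1000 to keep it quick. All names and the toy data are hypothetical.

```python
import random
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(at least `a` matching recent records) with all table margins fixed."""
    row_match, n_recent, n = a + b, a + c, a + b + c + d
    tail = sum(
        comb(row_match, k) * comb(n - row_match, n_recent - k)
        for k in range(a, min(row_match, n_recent) + 1)
    )
    return tail / comb(n, n_recent)


def best_score(labels, records):
    """Lowest score over all single-component rules `attribute == value`."""
    rules = {(attr, val) for rec in records for attr, val in rec.items()}
    best = 1.0
    for attr, val in rules:
        a = b = c = d = 0
        for is_recent, rec in zip(labels, records):
            match = rec.get(attr) == val
            if match and is_recent:
                a += 1
            elif match:
                b += 1
            elif is_recent:
                c += 1
            else:
                d += 1
        best = min(best, fisher_one_sided(a, b, c, d))
    return best


def corrected_p_value(labels, records, iterations=1000, seed=0):
    """Fraction of shuffled datasets whose best score beats the original one."""
    rng = random.Random(seed)
    original = best_score(labels, records)
    shuffled = list(labels)
    better = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # stands in for shuffling the date field
        if best_score(shuffled, records) < original:
            better += 1
    return better / iterations


# Toy data: 30 recent records, all Fever; 200 baseline records, 20 of them Fever.
records = [{"Prodrome": "Fever"}] * 50 + [{"Prodrome": "Respiratory"}] * 180
labels = [True] * 30 + [False] * 200
print(corrected_p_value(labels, records, iterations=200))
```

Because the best score is recomputed over all rules on every shuffled dataset, the corrected p-value accounts for the fact that the original rule was itself picked as the best of many.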
Reporting Multiple Rules on each Day
But reporting only the best scoring rule can hide other more interesting anomalous patterns!
For example:
1. The best scoring rule is statistically significant but not a public health concern
2. The top 5 scoring rules indicate anomalous patterns in 5 neighboring zip codes but individually their p-values do not cause an alarm to be raised
Our Solution: FDR
False Discovery Rate [Benjamini and Hochberg]
• Can determine which of these p-values are significant
• Specifically, given an α_FDR, FDR guarantees that, in expectation,
  $\dfrac{\#\ \text{false positives}}{\#\ \text{tests in which null hyp. was rejected}} \le \alpha_{FDR}$
• Given an α_FDR, FDR produces a threshold below which any p-values in the history are considered significant
Our Solution: FDR
Once we have the set of all possible rules and their scores, use FDR to determine which ones are significant
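A minimal sketch of the Benjamini-Hochberg step-up procedure shows how such a threshold can be obtained from a list of p-values. The function name, the return convention (0.0 when nothing is significant), and the example p-values are assumptions for illustration only.

```python
def benjamini_hochberg(p_values, alpha_fdr=0.05):
    """Return the Benjamini-Hochberg cutoff for a list of p-values.

    Every p-value at or below the returned cutoff is declared significant.
    A cutoff of 0.0 means no p-value passed its rank threshold.
    """
    m = len(p_values)
    cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank * alpha_fdr / m:
            cutoff = p  # largest p-value so far that passes its rank threshold
    return cutoff


p_values = [0.6, 0.001, 0.2, 0.012, 0.025]  # made-up values
cutoff = benjamini_hochberg(p_values, alpha_fdr=0.05)
significant = [p for p in p_values if p <= cutoff]
print(cutoff, significant)
```

Note the step-up behaviour: a p-value that fails its own rank threshold is still declared significant if a larger p-value passes a later one.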
Problem #2: A Changing Baseline
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)
Problem #2: A Changing Baseline
• Baseline is affected by temporal trends in health care data:
  – Seasonal effects in temperature and weather
  – Day of Week effects
  – Holidays
  – Etc.
• Choosing the wrong baseline distribution can affect the detection time and false positive rate
Generating the Baseline…
• “Taking into account that today is a public holiday…”
• “Taking into account that this is Spring…”
• “Taking into account recent heatwave…”
• “Taking into account recent flu levels…”
• “Taking into account that there’s a known natural Food-borne outbreak in progress…”
Use a Bayes net to model the joint probability distribution of the attributes
Obtaining Baseline Data
[Diagram: All Historical Data and Today’s Environment feed into the Baseline]
1. Learn Bayesian Network using Optimal Reinsertion [Moore and Wong 2003]
2. Generate baseline given today’s environment
Environmental Attributes
Divide the data into two types of attributes:
• Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels
• Response attributes: all other non-environmental attributes
Environmental Attributes
When learning the Bayesian network structure, do not allow environmental attributes to have parents.
Why?
• We are not interested in predicting their distributions
• Instead, we use them to predict the distributions of the response attributes
Side Benefit: We can speed up the structure search by avoiding DAGs that assign parents to the environmental attributes
Environmental attributes in this example: Season, Day of Week, Weather, Flu Level
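One way to picture the structural constraint is as a filter applied to candidate DAGs during the search. The sketch below is not the Optimal Reinsertion algorithm; it only checks the "no parents for environmental attributes" rule on a hypothetical parent-set representation, and the response attribute names are invented.

```python
ENVIRONMENTAL = {"Season", "Day of Week", "Weather", "Flu Level"}


def respects_environmental_constraint(parents, environmental=ENVIRONMENTAL):
    """Check a candidate DAG given as {node: set of parent nodes}.

    Returns True only if no environmental attribute has any parent.
    """
    return all(not parents.get(node) for node in environmental)


# Hypothetical candidate structures over a few attributes.
allowed = {
    "Syndrome": {"Season", "Flu Level"},
    "Hospital": {"Syndrome", "Day of Week"},
}
rejected = {
    "Syndrome": {"Season"},
    "Weather": {"Syndrome"},  # an environmental attribute with a parent
}
print(respects_environmental_constraint(allowed))
print(respects_environmental_constraint(rejected))
```

Discarding such candidates before they are scored is where the speed-up in the structure search would come from.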
Generate Baseline Given Today’s Environment
Suppose we know the following for today:

       Season  Day of Week  Weather  Flu Level
Today  Winter  Monday       Snow     High

We fill in these values for the environmental attributes in the learned Bayesian network: Season = Winter, Day of Week = Monday, Weather = Snow, Flu Level = High
We sample 10000 records from the Bayesian network and make this data set the baseline
Sampling is easy because environmental attributes are at the top of the Bayes Net
An alternate possible technique is to use inference
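The sampling step can be sketched with a toy network. Everything concrete here is an assumption made for illustration: the two response attributes, their parents, and the conditional probability tables are invented, whereas a real network would be learned from historical data. The environmental attributes are clamped to today's values and the response attributes are then sampled in topological order.

```python
import random

# Hypothetical network: each response attribute lists its parents and
# P(value | parent values). A real network would be learned from history.
NETWORK = {
    "Syndrome": {
        "parents": ("Season", "Flu Level"),
        "cpt": {
            ("Winter", "High"): {"Respiratory": 0.6, "Viral": 0.3, "Other": 0.1},
            ("Winter", "Low"): {"Respiratory": 0.3, "Viral": 0.2, "Other": 0.5},
        },
    },
    "Hospital": {
        "parents": ("Syndrome",),
        "cpt": {
            ("Respiratory",): {"1": 0.5, "2": 0.5},
            ("Viral",): {"1": 0.7, "2": 0.3},
            ("Other",): {"1": 0.2, "2": 0.8},
        },
    },
}
ORDER = ("Syndrome", "Hospital")  # topological order of the response attributes


def sample_baseline(environment, n_records=10000, seed=0):
    """Draw baseline records with the environmental attributes fixed to today."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(n_records):
        record = dict(environment)  # environmental values are never sampled
        for attr in ORDER:
            node = NETWORK[attr]
            key = tuple(record[parent] for parent in node["parents"])
            dist = node["cpt"][key]
            record[attr] = rng.choices(list(dist), weights=list(dist.values()))[0]
        baseline.append(record)
    return baseline


today = {"Season": "Winter", "Flu Level": "High"}
baseline = sample_baseline(today)
print(len(baseline), baseline[0])
```

Since the environmental attributes have no parents, fixing them requires no inference: each response attribute is drawn from its table once its parents have values.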
What’s Strange About Recent Events (WSARE) 3.0
[Diagram: All Data is split into Recent Data and Baseline]
1. Obtain Recent and Baseline datasets
2. Search for rule with best score
3. Determine p-value of best scoring rule
4. If p-value is less than threshold, signal alert
Simulator
Simulation
• 100 different data sets
• Each data set consisted of a two year period
• Anthrax release occurred at a random point during the second year
• Algorithms allowed to train on data from the current day back to the first day in the simulation
• Any alerts before actual anthrax release are considered a false positive
• Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days
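The evaluation rules above can be expressed as a small scoring function. Two details are assumptions not stated on the slide: an alert on the release day itself counts as a detection, and a first alert arriving more than 14 days after the release is also capped at 14. The function name and day numbering are hypothetical.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false_positives, detection_time) for one simulated data set.

    `alert_days` holds the day numbers on which an algorithm raised an alert.
    """
    false_positives = sum(1 for day in alert_days if day < release_day)
    after = [day for day in alert_days if day >= release_day]
    detection_time = min(after) - release_day if after else cap_days
    return false_positives, min(detection_time, cap_days)


# Made-up example: release on day 505 of a two year (730 day) simulation.
print(evaluate_alerts([400, 500, 510], release_day=505))
print(evaluate_alerts([], release_day=505))
```

Running this per algorithm and per data set gives the false positive counts and detection times that the comparison is based on.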
Other Algorithms used in Simulation
1. Control Chart: Mean + multiplier * standard deviation
2. Moving Average: 7 day window
3. ANOVA Regression: Linear regression with extra covariates for season, day of week, count from yesterday
4. WSARE 2.0: Create baseline using raw historical data
5. WSARE 2.5: Use raw historical data that matches environmental attributes
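The first two comparison algorithms are simple enough to sketch directly on a series of daily counts. The multiplier value, the choice of the sample standard deviation, and the toy counts below are assumptions; the slides only give the general form of each detector.

```python
from statistics import mean, stdev


def control_chart_alert(history, today, multiplier=2.0):
    """Alert when today's count exceeds mean + multiplier * standard deviation."""
    return today > mean(history) + multiplier * stdev(history)


def moving_average_prediction(history, window=7):
    """Predict today's count as the mean of the last `window` daily counts."""
    return mean(history[-window:])


# Made-up daily ED visit counts.
history = [20, 22, 19, 21, 20, 23, 21, 20, 22, 19]
print(control_chart_alert(history, today=30))
print(moving_average_prediction(history))
```

Both operate on a single univariate count, which is exactly the limitation the earlier slides raise: someone has to decide in advance which count to monitor.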
Results on Simulation
Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
   14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
   7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
   12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
   6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
   1.44% ( 9/625) of today's cases have 100 <= Age < 110
   0.08% ( 8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
   83.80% (481/574) of today's cases have Unknown Syndrome = False
   74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
   14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
6. Thu 2001-12-09: SCORE = -0.00000000 PVALUE = 0.00000000
   8.58% ( 38/443) of today's cases have Hospital ID = 1 and Viral Syndrome = True
   2.40% (240/10000) of baseline have Hospital ID = 1 and Viral Syndrome = True
Limitations of WSARE
• Works on categorical data
• Works on lower dimensional, dense data
• Cannot monitor aggregate counts; relies on changes in ratios
• Assumes that given the environmental variables, the baseline ratios are fairly stationary over time
Related Work
• Contrast sets [Bay and Pazzani]
• Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et al.]
• Spatial Scan Statistic [Kulldorff]
• WRSARE: What’s Really Strange About Recent Events [Singh and Moore]
• Bayesian Biosurveillance of Disease Outbreaks, to appear in UAI04 [Cooper, Dash, Levander, Wong, Hogan, Wagner]
P( Age = Senior, Gender = Male | Season = Winter, Day of Week = Monday) =
Conclusion
• One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data, instead of hundreds of univariate detectors
• WSARE is best used as a general purpose safety net in combination with other detectors
• Careful evaluation of statistical significance
• Modeling historical data with Bayesian Networks to allow conditioning on unique features of today
Software: http://www.autonlab.org/