![Page 1: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/1.jpg)
Extracting microbial threats from big data
Robert Munro
CTO, EpidemicIQ
@WWRob
![Page 2: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/2.jpg)
The New Virus Hunters
EpidemicIQ
@LuckOrChance
![Page 3: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/3.jpg)
Yellow Fever
![Page 4: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/4.jpg)
![Page 5: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/5.jpg)
Epidemics
Greatest cause of death globally
Any transmission is a chance for deadly mutation
No organization is (yet) tracking all outbreaks
![Page 6: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/6.jpg)
Epidemics
Eradication of diseases in the last century:
1979: Smallpox
Progression of air-travel in the last century:
![Page 7: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/7.jpg)
Math, Engineering, Writing, Skepticism, Curiosity, (Linguistics)
![Page 8: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/8.jpg)
Daily potential language exposure
How many languages could you hear on any given day?
How has this changed?
[Chart: # of languages (y-axis) by year (x-axis)]
![Page 9: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/9.jpg)
Daily potential language exposure
[Chart: # of languages (y-axis) by year (x-axis)]
![Page 10: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/10.jpg)
Daily potential language exposure
[Chart: # of languages (y-axis) by year (x-axis)]
![Page 11: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/11.jpg)
Daily potential language exposure
[Chart: # of languages (y-axis) by year (x-axis)]
![Page 12: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/12.jpg)
Daily potential language exposure
[Chart: # of languages (y-axis) by year (x-axis)]
Our potential communications will never be so diverse as right now
![Page 13: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/13.jpg)
The communication age
90% of the world’s ecological diversity
90% of the world’s linguistic diversity
![Page 14: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/14.jpg)
CDC vs Google Flu Trends?
![Page 15: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/15.jpg)
CDC vs Google Flu Trends?
Source: http://www.google.org/flutrends/
![Page 16: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/16.jpg)
Traditional Media?
"I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase here"
CDC vs Google Flu Trends?
![Page 17: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/17.jpg)
Traditional Media?
"I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase here" Jan 4th
CDC vs Google Flu Trends?
Winner!
![Page 18: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/18.jpg)
The first signal is linguistic
Every outbreak predicted by Google Flu Trends has been preceded by open, online reports
The same is true for all other search-term-based disease predictions
NB: Google Flu Trends members have also discovered this!
![Page 19: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/19.jpg)
The first signal is linguistic
"Improved Response to Disasters and Outbreaks by Tracking Population Movements with Mobile Phone Network Data: A Post-Earthquake Geospatial Study in Haiti" Bengtsson et al. 2011.
… or you could just ask: "I am going to Jeremie next week"
![Page 20: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/20.jpg)
The first signal is linguistic
I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase here
… but hidden in plain view
We're worried about the markets.
We're going to take you to Kenya where the U.S. has dispatched some diplomatic help to try to get the country back on political balance.
Is individualism an endangered concept in Saudi Arabia?
Well, in St. John's County, one man lost his home trying to keep his pig warm. The pig did not make it.
He had everything but the cape. A good samaritan in Ohio saved a family from this ferocious house fire.
A spunky boy reels in a 550-pound shark.
![Page 21: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/21.jpg)
… in 1000s of languages
в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа
(2 outbreaks predicted for the Ukraine)
مزيد من انفلونزا الطيور في مصر
(more flu in Egypt)
香港现 1例 H5N1禽流感病例曾游上海南京等地
(Hong Kong had a case of avian influenza that traveled to Shanghai and Nanjing)
![Page 22: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/22.jpg)
Reported before identification
H1N1 (Swine Flu) – months
HIV – years
H5N1 (Bird Flu) – weeks
![Page 23: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/23.jpg)
HIV in the 1950s
HIV – years
People were:
talking locally
reporting locally
We can now access local
![Page 24: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/24.jpg)
Outbreak information processing
Health-care professionals need to:
Evaluate reports of potential outbreaks.
Find new sources of information.
Stay ahead of the disease (especially) during information spikes.
![Page 25: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/25.jpg)
Most existing solutions
Keyword-based search:
language-specific
non-adaptive
A room full of humans:
inefficient
capped-volume
![Page 26: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/26.jpg)
epidemicIQ
Volume:
10x the processing of existing solutions
Greater languages / independence
Capable of short 100x spikes
Efficiency:
First evaluation in seconds
Adapts to new information in minutes
1/10 the running cost
![Page 27: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/27.jpg)
Targeted machine-processing
Broad machine-processing
Human (manual) processing
Low-volume processing
High-volume processing
Data input
“there is a new flu-like illness here”
Discovered by crawler
Relevance evaluated by machine learning
Relevance evaluated by microtasker
Information stored from the reports
Relevance evaluated by in-house analyst
Sources monitor frequency updated
Maximally relevant phrases used to search more data
Direct report from field staff / partner organization
Reports for each outbreak aggregated
![Page 28: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/28.jpg)
Scale – machine learning
Millions of reports daily from 100,000s of sources
Stress-tested to billions per day
>70 languages
![Page 29: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/29.jpg)
Scale – microtaskers
Our virtual (but real) workforce
>2,000 people from 50 nations
On many platforms (via CrowdFlower)
13 languages (English, Spanish, Portuguese, Chinese, Arabic, Russian, French, Hindi, Urdu, Italian, Japanese, Korean, German)
Stress-tested to 10,000s per day
![Page 30: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/30.jpg)
Virtual good Real good
For 600 new seeds, please answer this question:
Does this sentence refer to a disease outbreak:
"E Coli spreads to Spain, sprouts suspected"
Yes/no: __
What disease: _______
What location: _______
![Page 31: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/31.jpg)
“In a real-life setting, it is expensive to prepare a training data set … classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles.”
Torii, Yin, Nguyen, Mazumdar, Liu, Hartley and Nelson. 2011. An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics. International Journal of Medical Informatics, 80(1)
ARGUS
![Page 32: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/32.jpg)
ARGUS
What can we extrapolate from just 298 data points?
Let’s compare 298 … to 100,000 data points
… and a purely human rule-based filtering (giving the humans infinite time)
20:1 relevance ratio
10% hold-out evaluation data
20% hard cases
Bernoulli Naïve Bayes
L1 regularization on a linear model to select 1,000 best words/sequences
MaxEnt
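The slide names Bernoulli Naïve Bayes as one of the learners in the comparison. As a reminder of what that model does, here is a minimal sketch in plain Python. The toy reports, the function names and the smoothing constant are all assumptions made for illustration; this is not the code of either system evaluated in the talk.

```python
import math
from collections import Counter

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on tokenised documents.

    alpha is a Laplace smoothing constant (an assumed default).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    # For each class, count the documents that contain each word at least once.
    doc_freq = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        doc_freq[label].update(set(doc))
    model = {"vocab": vocab, "log_prior": {}, "p_word": {}}
    for c, n_c in class_counts.items():
        model["log_prior"][c] = math.log(n_c / len(docs))
        model["p_word"][c] = {
            w: (doc_freq[c][w] + alpha) / (n_c + 2 * alpha) for w in vocab
        }
    return model

def predict(model, doc):
    """Return the most probable class for a tokenised document."""
    present = set(doc)
    scores = {}
    for c, log_prior in model["log_prior"].items():
        score = log_prior
        for w in model["vocab"]:
            p = model["p_word"][c][w]
            # Bernoulli model: a word's absence is evidence too.
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data, not taken from the talk.
docs = [
    ["new", "flu", "outbreak"],
    ["cholera", "outbreak", "reported"],
    ["stock", "markets", "fall"],
    ["shark", "caught", "boy"],
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = train_bernoulli_nb(docs, labels)
print(predict(model, ["flu", "outbreak", "spreads"]))
```

Words that never appeared in training (here "spreads") are simply ignored, which is one reason a model trained on a few hundred articles can struggle on unseen report types.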
![Page 33: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/33.jpg)
Machine-learning evaluation
F1 accuracy at increasing % of training data
[Chart: F1 (y-axis, 0 to 1) against % of training data (x-axis, 0.18% upward). Series: epidemicIQ; ARGUS (Torii et al, 2011); Humans with infinite time. Annotation: 298 data points]
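F1, the measure these evaluation charts report, is the harmonic mean of precision and recall. A minimal sketch of the calculation follows; the function name and the example counts are illustrative assumptions, not figures from the evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: perfect precision, but half the relevant reports missed.
print(f1_score(true_positives=50, false_positives=0, false_negatives=50))
```

Because it is a harmonic mean, F1 stays low when either precision or recall is low, which matters with a skewed relevance ratio like the 20:1 described for this evaluation.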
![Page 34: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/34.jpg)
Machine-learning evaluation
F1 accuracy at increasing % of training data
[Chart: F1 (y-axis, 0 to 1) against % of training data (x-axis, 0.18% upward). Series: epidemicIQ; ARGUS (Torii et al, 2011); Humans with infinite time. Annotation: 298 data points]
![Page 35: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/35.jpg)
Machine-learning evaluation
F1 accuracy at increasing % of training data
[Chart: F1 (y-axis, 0 to 1) against % of training data (x-axis, 0.18% upward). Series: epidemicIQ; ARGUS (Torii et al, 2011); Humans with infinite time. Annotations: ~7% of data; 298 data points]
![Page 36: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/36.jpg)
Machine-learning evaluation
Big-data conclusions cannot be drawn from small, balanced data sets.
Choose your algorithm wisely: generative or discriminative? The choice changes data-collection and labeling strategies.
Natural Language Processing systems outperform rule-based systems, even highly tuned ones.
![Page 37: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/37.jpg)
Targeted-search evaluation
Using the (human and machine) labeled data, we extract time-sensitive predictive key-phrases.
@lildata
We leverage search APIs and our machine-learner to find new sources/reports.
How useful are the new sources of information?
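The slides do not say how the predictive key-phrases are scored. As one possible illustration, the sketch below ranks words and two-word sequences by a smoothed ratio of how often they occur in relevant versus irrelevant labeled reports. The scoring rule, function names and toy reports are all assumptions, and time-sensitivity is not modeled here.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Yield every word and word sequence up to n_max tokens long."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def rank_phrases(reports, top_k=3, smoothing=1.0):
    """Rank phrases by how much more often they occur in relevant reports.

    reports: list of (text, is_relevant) pairs.
    """
    relevant, irrelevant = Counter(), Counter()
    n_rel = n_irr = 0
    for text, is_relevant in reports:
        phrases = set(ngrams(text.lower().split()))
        if is_relevant:
            relevant.update(phrases)
            n_rel += 1
        else:
            irrelevant.update(phrases)
            n_irr += 1

    def log_ratio(phrase):
        p_rel = (relevant[phrase] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irrelevant[phrase] + smoothing) / (n_irr + 2 * smoothing)
        return math.log(p_rel / p_irr)

    candidates = set(relevant) | set(irrelevant)
    # Highest ratio first; ties are broken alphabetically for a stable order.
    return sorted(candidates, key=lambda p: (-log_ratio(p), p))[:top_k]

# Hypothetical labeled reports.
reports = [
    ("new flu-like illness reported here", True),
    ("flu-like illness spreads in village", True),
    ("markets fall on new fears", False),
    ("boy reels in shark", False),
]
print(rank_phrases(reports))
```

The top-ranked phrases would then be sent as queries to a search API, and whatever comes back would be scored by the existing relevance classifier before anyone reads it.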
![Page 38: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/38.jpg)
Targeted-search evaluation
F1 accuracy at increasing % of training data
[Chart: F1 (y-axis, 0 to 1) against % of training data (x-axis, 0.18% upward). Series: epidemicIQ; Without targeted search-data]
consistent improvement, wholly in recall
![Page 39: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/39.jpg)
Targeted-search evaluation
Increases variety of report types and sources, increasing overall recall.
There is a place for search-engine-based epidemiology
![Page 40: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/40.jpg)
Human in the loop
Give everything with >10% machine-learning confidence to microtaskers to confirm/reject:
~1000 reports per day, from 1,000,000s that the learner evaluates
Give a capped amount of persistent ambiguities to professional analysts.
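The routing rule on this slide can be sketched as a small triage function. Only the 10% threshold comes from the slide. The vote-counting rule, the definition of an ambiguity as a tied microtasker vote, the analyst cap of 2 and all names are assumptions made for illustration.

```python
from dataclasses import dataclass

MICROTASK_THRESHOLD = 0.10  # from the slide: >10% confidence goes to microtaskers
ANALYST_CAP = 2             # hypothetical cap on escalations to analysts

@dataclass
class Report:
    text: str
    confidence: float              # machine-learning relevance confidence, 0..1
    microtasker_votes: tuple = ()  # True = confirm, False = reject

def triage(reports, analyst_cap=ANALYST_CAP):
    """Sort reports into queues by confidence and microtasker votes."""
    queues = {"dropped": [], "confirmed": [], "rejected": [],
              "analysts": [], "pending": []}
    for report in reports:
        if report.confidence <= MICROTASK_THRESHOLD:
            queues["dropped"].append(report)
            continue
        yes = sum(report.microtasker_votes)
        no = len(report.microtasker_votes) - yes
        if yes > no:
            queues["confirmed"].append(report)
        elif no > yes:
            queues["rejected"].append(report)
        elif len(queues["analysts"]) < analyst_cap:
            # A tie (including no votes at all) is treated as ambiguous.
            queues["analysts"].append(report)
        else:
            queues["pending"].append(report)
    return queues

# Hypothetical confidences and votes for sentences quoted in the slides.
reports = [
    Report("E Coli spreads to Spain, sprouts suspected", 0.92, (True, True, True)),
    Report("A spunky boy reels in a 550-pound shark", 0.02),
    Report("there is a new flu-like illness here", 0.35, (True, False)),
    Report("We're worried about the markets", 0.15, (False, False, True)),
]
queues = triage(reports)
print({name: len(items) for name, items in queues.items()})
```

Capping the analyst queue keeps the most expensive reviewers at a fixed workload even when report volume spikes.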
![Page 41: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/41.jpg)
Human in the loop
F1 accuracy at increasing % of training data
[Chart: F1 (y-axis, 0 to 1) against % of training data (x-axis, 0.18% upward). Series: epidemicIQ; With micro-tasker corrections]
![Page 42: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/42.jpg)
Human in the loop
Gives near 100% precision
Improves with the machine-learning algorithm as candidates have greater recall
95% recall in seen data
We see more reports than other orgs … but how many more are still out there?
Good-Turing Estimates & analysts expect more
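The slide does not show how the Good-Turing estimate was applied. In its simplest form, Good-Turing estimates the probability that the next observation is of a type never seen before as the number of types seen exactly once divided by the total number of observations. A minimal sketch, with hypothetical outbreak identifiers standing in for real report data:

```python
from collections import Counter

def unseen_mass(observations):
    """Good-Turing estimate of the probability that the next item is a new type."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total

# Hypothetical: which outbreak each of seven confirmed reports refers to.
observations = [
    "h5n1-egypt", "h5n1-egypt", "ecoli-spain", "flu-ukraine",
    "h5n1-egypt", "ecoli-spain", "cholera-haiti",
]
print(unseen_mass(observations))
```

A high value means many outbreaks have been seen only once, which suggests that many more have not been seen at all.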
![Page 43: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/43.jpg)
Teaser
Transmission characteristics of H5N1:
… …
Better network analysis
![Page 44: Extracting microbial threats from big data](https://reader036.vdocuments.site/reader036/viewer/2022062302/56816348550346895dd3d6ad/html5/thumbnails/44.jpg)
[Diagram: pathogens by degree of human adaptation. Labels: not human adapted; weakly human adapted; human adapted; human exclusive; transmissible. Pathogens shown: Influenza, HIV-1, Yellow Fever, Rabies, SARS/Ebola]
Conclusions
The earliest signals are often in plain sight, but also in plain language.
The right architecture has a place for: machine-learning/natural language processing, microtasking, targeted search and professional analysts.
[Chart: y-axis 0 to 1 against % of training data (x-axis, 0.18% upward). Series: epidemicIQ; ARGUS (Torii et al, 2011); Humans with infinite time]
@WWRob