![Page 1: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/1.jpg)
1
Language, Science and Data Science
Kathleen McKeownDepartment of Computer Science
Columbia University
![Page 2: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/2.jpg)
2
![Page 3: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/3.jpg)
3
Science and Language
Predicting impact of scientific journal articles
Generating updates on disaster
![Page 4: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/4.jpg)
4
Machine learning framework
Data (often labeled)
Extraction of “features” from text data
Prediction of output
![Page 5: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/5.jpg)
5
Machine learning framework
Data (often labeled)
Extraction of “features” from text data
Prediction of output
What data is available for learning?
![Page 6: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/6.jpg)
6
Machine learning framework
Data (often labeled)
Extraction of “features” from text data
Prediction of output
What features yield good predictions?
![Page 7: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/7.jpg)
7
Science and Language
Predicting impact of scientific journal articles
Generating updates on disaster
![Page 8: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/8.jpg)
8
![Page 9: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/9.jpg)
9
What can articles yield?
![Page 10: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/10.jpg)
10
Alexandrescu and Kirchhoff, 2007
Lee and Ng, 2002
Ando, 2006Niu et al., 2005
Stevenson and Wilks, 2001
Ng et al., 2003
Rigau et al., 1997
Yarowsky, 1995
Stetina et al., 1998
Ng and Lee, 1996
Peh and Ng, 1997
Yarowsky, 1992
Khapra et al., 2011
Chan et al., 2007
Tim
e
Qazvinian et al, JAIR 2013Citation Network
![Page 11: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/11.jpg)
11
What can articles yield? Argumentative zoning (Teufel, 1996)
“Here, we present quantitative estimates of the global biological impacts of climate changes.” [AIM]
Citation sentiment and scope “The approach of economists takes a
broader view. … We argue that this approach misses biologically important phenomena.”
![Page 12: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/12.jpg)
12
Prediction of Scientific Impact
Climate change, Climate model
Input: term, document
Extract features from full text and metadata
Predict prominence of
term, document
![Page 13: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/13.jpg)
13
Approach
Metadata features Citation network Author collaboration
network
Text features Argumentative
zoning Citation sentiment Primary concepts Relations
Logistic regression to predict future impact
![Page 14: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/14.jpg)
14
Data
4 million full-text Elsevier journal articles
48 million Web of Science metadata records
Fields: medical, chemistry, biology, computer science, …
Ground truth function
![Page 15: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/15.jpg)
15
ExperimentsExperimental comparisons of text features vs metadata features
Data: Selected Elsevier articles and counterparts in WOS
![Page 16: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/16.jpg)
16
What have We Learned?
![Page 17: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/17.jpg)
17
What have we learned?
![Page 18: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/18.jpg)
18
Text features alone outperform metadataElsevier data
Text
Metadata
![Page 19: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/19.jpg)
19
Metadata adds value when combined with textElsevier data
Text
Text+metadata
![Page 20: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/20.jpg)
20
ExperimentsExperimental comparisons of text features vs metadata features
Data: All WOS and Elsevier data
![Page 21: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/21.jpg)
21
What have We Learned?
![Page 22: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/22.jpg)
22
Text better than metadata on Web of Science
Text
Metadata
![Page 23: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/23.jpg)
23
Science and Language
Predicting impact of scientific journal articles
Generating updates on disaster
![Page 24: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/24.jpg)
24
It’s October 28th, 2012
![Page 25: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/25.jpg)
25
![Page 26: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/26.jpg)
26
![Page 27: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/27.jpg)
27
![Page 28: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/28.jpg)
28
![Page 29: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/29.jpg)
29
![Page 30: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/30.jpg)
30
Monitor events over time
Input: streaming data
News, social media, web pages
At every hour, what’s new
![Page 31: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/31.jpg)
31
Data NIST: Trec Temporal Summarization
Track hourly web crawl October 2011 - February 2013 16.1TB
Natural disasters, industrial disasters, social unrest, …
Training Data drawn from Wikipedia
![Page 32: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/32.jpg)
32
![Page 33: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/33.jpg)
33
The U.S. Pacific Tsunami Warning Center said there was a possibility of a local tsunami, within 100 or 200 miles of the epicenter,
![Page 34: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/34.jpg)
34
Would you like to contribute to this story?Start a discussion.
![Page 35: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/35.jpg)
35
Add to Digg. Add to del.icio.us. Add to Facebook.
![Page 36: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/36.jpg)
36
System Update
![Page 37: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/37.jpg)
37
Temporal Summarization Approach
At time t:
1. Predict salience for input sentences Disaster-specific features for predicting
salience
2. Remove redundant sentences
3. Cluster and select exemplar sentences for t
Incorporate salience prediction as a prior
Kedzie & al, Bloomberg Social Good Workshop, KDD 2014Kedzie & al, ACL 2015
![Page 38: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/38.jpg)
38
Predicting Salience: Model FeaturesBasic sentence level features
sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms
![Page 39: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/39.jpg)
39
Predicting Salience: Model FeaturesBasic sentence level features
sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 40: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/40.jpg)
40
Predicting Salience: Model FeaturesBasic sentence level features
sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 41: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/41.jpg)
41
Predicting Salience: Model FeaturesBasic sentence level features
sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 42: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/42.jpg)
42
Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia
articles)
![Page 43: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/43.jpg)
43
Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia
articles)
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 44: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/44.jpg)
44
Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia
articles)
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 45: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/45.jpg)
45
Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)
generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia
articles)
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 46: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/46.jpg)
46
Predicting Salience: Model FeaturesGeographic Features
tag input with Named-Entity tagger get coordinates for locations and mean distance to event
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 47: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/47.jpg)
47
Predicting Salience: Model FeaturesGeographic Features
tag input with Named-Entity tagger get coordinates for locations and mean distance to event
High SalienceNicaragua's disaster management said it had issued a local tsunami alert.
Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.
Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace
![Page 48: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/48.jpg)
48
Predicting Salience: Model FeaturesTemporal Features measuring burstiness of words
average tf-idf at the current time t0 difference of average tf-idf at time t0 and t - 1 difference of average tf-idf at time t0 and t - 5 hours since event start time
![Page 49: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/49.jpg)
49
SubEvent Identification
➔Decompose articles on a main event into related sub-events:
Manhattan Blackout Breezy Point fire Public Transit Outage
Hurricane Sandy
![Page 50: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/50.jpg)
50
What have We Learned?
![Page 51: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/51.jpg)
51
What have We Learned?
Salience predictions increase precision
![Page 52: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/52.jpg)
52
What have We Learned?
Salience predictions increase precision
Penalizing redundancy removes too many important facts
![Page 53: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/53.jpg)
53
Integrate lens of the media and scientific knowledge
To better predict unfolding impact oof disaster
Social media to feed into modelsof resilience
![Page 54: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/54.jpg)
54
In Sum
Can exploit the words on the web
Language analysis can help the world of science
![Page 55: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University](https://reader033.vdocuments.site/reader033/viewer/2022051621/5697bf7d1a28abf838c84ccd/html5/thumbnails/55.jpg)
55
Thank You!
The research presented here has been supported in part by DARPA BOLT, IARPA FUSE, and NSF.