1 language, science and data science kathleen mckeown department of computer science columbia...

55
1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

Upload: claribel-lawson

Post on 17-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

1

Language, Science and Data Science

Kathleen McKeownDepartment of Computer Science

Columbia University

Page 2: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

2

Page 3: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

3

Science and Language

Predicting impact of scientific journal articles

Generating updates on disaster

Page 4: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

4

Machine learning framework

Data (often labeled)

Extraction of “features” from text data

Prediction of output

Page 5: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

5

Machine learning framework

Data (often labeled)

Extraction of “features” from text data

Prediction of output

What data is available for learning?

Page 6: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

6

Machine learning framework

Data (often labeled)

Extraction of “features” from text data

Prediction of output

What features yield good predictions?

Page 7: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

7

Science and Language

Predicting impact of scientific journal articles

Generating updates on disaster

Page 8: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

8

Page 9: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

9

What can articles yield?

Page 10: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

10

Alexandrescu and Kirchhoff, 2007

Lee and Ng, 2002

Ando, 2006Niu et al., 2005

Stevenson and Wilks, 2001

Ng et al., 2003

Rigau et al., 1997

Yarowsky, 1995

Stetina et al., 1998

Ng and Lee, 1996

Peh and Ng, 1997

Yarowsky, 1992

Khapra et al., 2011

Chan et al., 2007

Tim

e

Qazvinian et al, JAIR 2013Citation Network

Page 11: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

11

What can articles yield? Argumentative zoning (Teufel, 1996)

“Here, we present quantitative estimates of the global biological impacts of climate changes.” [AIM]

Citation sentiment and scope “The approach of economists takes a

broader view. … We argue that this approach misses biologically important phenomena.”

Page 12: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

12

Prediction of Scientific Impact

Climate change, Climate model

Input: term, document

Extract features from full text and metadata

Predict prominence of

term, document

Page 13: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

13

Approach

Metadata features Citation network Author collaboration

network

Text features Argumentative

zoning Citation sentiment Primary concepts Relations

Logistic regression to predict future impact

Page 14: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

14

Data

4 million full-text Elsevier journal articles

48 million Web of Science metadata records

Fields: medical, chemistry, biology, computer science, …

Ground truth function

Page 15: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

15

ExperimentsExperimental comparisons of text features vs metadata features

Data: Selected Elsevier articles and counterparts in WOS

Page 16: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

16

What have We Learned?

Page 17: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

17

What have we learned?

Page 18: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

18

Text features alone outperform metadataElsevier data

Text

Metadata

Page 19: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

19

Metadata adds value when combined with textElsevier data

Text

Text+metadata

Page 20: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

20

ExperimentsExperimental comparisons of text features vs metadata features

Data: All WOS and Elsevier data

Page 21: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

21

What have We Learned?

Page 22: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

22

Text better than metadata on Web of Science

Text

Metadata

Page 23: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

23

Science and Language

Predicting impact of scientific journal articles

Generating updates on disaster

Page 24: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

24

It’s October 28th, 2012

Page 25: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

25

Page 26: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

26

Page 27: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

27

Page 28: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

28

Page 29: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

29

Page 30: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

30

Monitor events over time

Input: streaming data

News, social media, web pages

At every hour, what’s new

Page 31: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

31

Data NIST: Trec Temporal Summarization

Track hourly web crawl October 2011 - February 2013 16.1TB

Natural disasters, industrial disasters, social unrest, …

Training Data drawn from Wikipedia

Page 32: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

32

Page 33: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

33

The U.S. Pacific Tsunami Warning Center said there was a possibility of a local tsunami, within 100 or 200 miles of the epicenter,

Page 34: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

34

Would you like to contribute to this story?Start a discussion.

Page 35: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

35

Add to Digg. Add to del.icio.us. Add to Facebook.

Page 36: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

36

System Update

Page 37: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

37

Temporal Summarization Approach

At time t:

1. Predict salience for input sentences Disaster-specific features for predicting

salience

2. Remove redundant sentences

3. Cluster and select exemplar sentences for t

Incorporate salience prediction as a prior

Kedzie & al, Bloomberg Social Good Workshop, KDD 2014Kedzie & al, ACL 2015

Page 38: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

38

Predicting Salience: Model FeaturesBasic sentence level features

sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms

Page 39: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

39

Predicting Salience: Model FeaturesBasic sentence level features

sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 40: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

40

Predicting Salience: Model FeaturesBasic sentence level features

sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 41: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

41

Predicting Salience: Model FeaturesBasic sentence level features

sentence length punctuation count number of capitalized words number of event type synonyms, hypernyms, and hyponyms

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 42: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

42

Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)

generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia

articles)

Page 43: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

43

Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)

generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia

articles)

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 44: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

44

Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)

generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia

articles)

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 45: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

45

Predicting Salience: Model FeaturesLanguage Models (5-gram Kneser-Ney model)

generic news corpus (10 years AP and NY Times articles) domain specific corpus (disaster related Wikipedia

articles)

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 46: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

46

Predicting Salience: Model FeaturesGeographic Features

tag input with Named-Entity tagger get coordinates for locations and mean distance to event

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 47: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

47

Predicting Salience: Model FeaturesGeographic Features

tag input with Named-Entity tagger get coordinates for locations and mean distance to event

High SalienceNicaragua's disaster management said it had issued a local tsunami alert.

Medium SaliencePeople streamed out of homes, schools and oce buildings as far north as Mexico City.

Low SalienceAdd to Digg Add to del.icio.us Add to Facebook Add to Myspace

Page 48: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

48

Predicting Salience: Model FeaturesTemporal Features measuring burstiness of words

average tf-idf at the current time t0 difference of average tf-idf at time t0 and t - 1 difference of average tf-idf at time t0 and t - 5 hours since event start time

Page 49: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

49

SubEvent Identification

➔Decompose articles on a main event into related sub-events:

Manhattan Blackout Breezy Point fire Public Transit Outage

Hurricane Sandy

Page 50: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

50

What have We Learned?

Page 51: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

51

What have We Learned?

Salience predictions increase precision

Page 52: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

52

What have We Learned?

Salience predictions increase precision

Penalizing redundancy removes too many important facts

Page 53: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

53

Integrate lens of the media and scientific knowledge

To better predict unfolding impact oof disaster

Social media to feed into modelsof resilience

Page 54: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

54

In Sum

Can exploit the words on the web

Language analysis can help the world of science

Page 55: 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

55

Thank You!

The research presented here has been supported in part by DARPA BOLT, IARPA FUSE, and NSF.