University of Sheffield, NLP

TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A. Greenwood, Diana Maynard, Niraj Aswani

© The University of Sheffield, 1995-2013. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence.

The Problem

• Running ANNIE on 300 news articles: 87% F-score

• Running ANNIE on some tweets: below 40% F-score

Example: Persons in news articles

Example: Persons in tweets

Genre Differences in Entity Types

PER
• News: Politicians, business leaders, journalists, celebrities
• Tweets: Sportsmen, actors, TV personalities, celebrities, names of friends

LOC
• News: Countries, cities, rivers, and other places related to current affairs
• Tweets: Restaurants, bars, local landmarks/areas, cities, rarely countries

ORG
• News: Public and private companies, government organisations
• Tweets: Bands, internet companies, sports clubs

Tweet-specific NER challenges

• Capitalisation is not indicative of named entities
  • All uppercase, e.g. APPLE IS AWSOME
  • All lowercase, e.g. all welcome, joe included
  • All words with an initial capital, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School

• Unusual spelling, acronyms, and abbreviations

• Social media conventions:
  • Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance
  • @Mentions, e.g. @edchi (PER), @mcg_graz (LOC), @BBC (ORG)

TwitIE: GATE’s new Twitter NER pipeline

Importing tweets into GATE

• GATE now supports JSON format import for tweets
  • Located in the Format_Twitter plugin
  • Automatically used for *.json files
  • Alternatively, specify text/x-json-twitter as a MIME type

• The tweet text becomes the document, all other JSON fields become features
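The mapping from a tweet's JSON to a document can be pictured with a small sketch. This is plain Python for illustration only, not the Format_Twitter plugin's code or GATE's API; the function name tweet_to_document and the "content"/"features" keys are assumptions made for this example.

```python
import json

def tweet_to_document(raw_json):
    """Split one tweet's JSON into document text and document features.

    Hypothetical helper: it only mirrors the idea that the "text" field
    becomes the document content and every other field becomes a feature.
    """
    tweet = json.loads(raw_json)
    text = tweet.pop("text", "")  # the tweet text becomes the document
    return {"content": text, "features": tweet}  # the rest become features

raw = '{"id": 1, "lang": "en", "text": "hello #world"}'
doc = tweet_to_document(raw)
print(doc["content"])
print(doc["features"])
```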

Language Detection: Less than 50% English

The main challenges on tweets and Facebook status updates:

• the small number of tokens (10 tokens per tweet on average)
• the noisy nature of the words (abbreviations, misspellings)

Because the text is so short, we can assume that each tweet is written in only one language

We have adapted the TextCat language identification plugin

• Provided fingerprints for 5 languages: DE, EN, FR, ES, NL
• You can extend it to new languages easily
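TextCat-style identification compares a ranked character n-gram profile of the text against one fingerprint per language. The sketch below illustrates that idea in plain Python with the classic "out-of-place" rank distance; it is not the adapted GATE plugin. The one-sentence training samples, the function names, and the parameters max_n and top_k are assumptions chosen to keep the example short, and real fingerprints are built from far more text.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Return the text's character n-grams (n = 1..max_n), most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the fingerprint get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - rank[g]) if g in rank else penalty
               for r, g in enumerate(doc_profile))

# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "en": "the weather is nice and this is what we think about the match tonight",
    "de": "das wetter ist schön und wir gehen heute abend zusammen ins kino",
    "fr": "le temps est beau et nous allons au cinéma ensemble ce soir",
}
FINGERPRINTS = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

def detect_language(text):
    """Pick the single language whose fingerprint is closest to the text's profile."""
    profile = ngram_profile(text)
    return min(FINGERPRINTS, key=lambda lang: out_of_place(profile, FINGERPRINTS[lang]))
```

Adding a language in this sketch means adding one more entry to SAMPLES. Returning a single label per text matches the one-language-per-tweet assumption.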

Language Detection Examples

Tokenisation

Splitting a text into its constituent parts

Plenty of “unusual”, but very important tokens in social media:

– Mentions such as @Apple (company, brand, or person names)
– Hashtags such as #fail, #SteveJobs (expressing sentiment, person or company names)
– Emoticons such as :-(, :-), :-P (punctuation and optionally letters)
– URLs

Tokenisation is key for entity recognition and opinion mining

A study of 1.1 million tweets found that 26% of English tweets contain a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter, 2013].

Example

– #WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, #SteveJobs was like wow I didn't expect this at all.

– Tokenising on white space doesn't work that well:
  • Nike and Apple are company names, but tokens such as #nike and @Apple make entity recognition harder, because it has to look at the sub-token level

– Tokenising on white space and punctuation characters doesn't work well either: URLs get split up (http, nikeplus), as do emoticons and email addresses
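Both naive strategies are easy to try on the example tweet. The sketch below uses only Python's standard library; the regular expression is one simple way to split on white space and punctuation, chosen here for illustration.

```python
import re

tweet = ("#WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, "
         "#SteveJobs was like wow I didn't expect this at all.")

# Strategy 1: split on white space only.
# "#nike" and "@Apple" stay glued to their markers, and "did," keeps its comma.
whitespace_tokens = tweet.split()

# Strategy 2: split on white space and punctuation.
# Every run of word characters or single punctuation mark becomes a token,
# so the URL falls apart into "http", ":", "/", "/", "nikeplus", ".", "com".
punct_tokens = re.findall(r"\w+|[^\w\s]", tweet)

print(whitespace_tokens)
print(punct_tokens)
```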

The TwitIE Tokeniser

Treat RTs and URLs as 1 token each

#nike is two tokens (# and nike) plus a separate HashTag annotation covering both. The same applies to @mentions, which get a UserID annotation

Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase

Date and phone number normalisation, lowercasing, and emoticon handling are optionally done later in separate modules

Consequently, tokenisation is faster and more generic

It is also more tailored to our NER module
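The behaviour described above can be approximated in a few lines of Python. This is a hedged sketch, not the TwitIE tokeniser's actual rules: the regular expression, the Annotation class, and the feature names "string" and "orth" are assumptions for illustration, and the three orthography values simply follow the labels on the slide.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

# Alternatives are tried in order, so URLs and RT win over plain words.
TOKEN_RE = re.compile(r"""
    (?P<url>https?://\S+)        # a URL is kept as one token
  | (?P<rt>\bRT\b)               # the retweet marker is one token
  | (?P<hashtag>\#\w+)           # hashtag: split into "#" and the tag below
  | (?P<mention>@\w+)            # mention: split into "@" and the name below
  | (?P<word>\w+(?:'\w+)?)       # ordinary word, allowing one apostrophe
  | (?P<punct>[^\w\s])           # any other single non-space character
""", re.VERBOSE)

def orth(s):
    """Orthography label; anything that is not uniformly cased is mixCase."""
    if s.isupper():
        return "allCaps"
    if s.islower():
        return "lowercase"
    return "mixCase"

def tokenise(text):
    anns = []
    for m in TOKEN_RE.finditer(text):
        kind, start, end = m.lastgroup, m.start(), m.end()
        if kind in ("hashtag", "mention"):
            body = text[start + 1:end]
            # Two Token annotations: the marker and the word after it ...
            anns.append(Annotation("Token", start, start + 1, {"string": text[start]}))
            anns.append(Annotation("Token", start + 1, end,
                                   {"string": body, "orth": orth(body)}))
            # ... plus one covering annotation over both.
            anns.append(Annotation("HashTag" if kind == "hashtag" else "UserID", start, end))
        else:
            feats = {"string": m.group()}
            if kind == "word" and m.group().isalpha():
                feats["orth"] = orth(m.group())  # original casing is left untouched
            anns.append(Annotation("Token", start, end, feats))
    return anns
```

Because the covering HashTag and UserID annotations sit on top of ordinary tokens, a downstream gazetteer or NER rule can match "nike" or "Apple" directly without looking inside a token.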

POS Tagging

• The accuracy of the Stanford POS tagger drops from about 97% on news to 80% on tweets (Ritter, 2011)

• Need for an adapted POS tagger, specifically for tweets

• We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts

• Next we compare the ANNIE POS Tagger and the Tweet POS Tagger on the example tweets

POS Tagging Example

• TwitIE POS tagger on the left

• ANNIE POS tagger on the right

• The TwitIE POS tagger is described in a separate paper at RANLP 2013
  • Beats Ritter (2011); uses a grown-up tag set (cf. Gimpel, 2011)

Tweet Normalisation

“RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”

OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!

Similar to SMS normalisation

For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token

However, uppercasing and the repetition of letters and exclamation marks often convey strong sentiment

Therefore some systems do not normalise, while others keep both versions of each token
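Keeping both versions can be as simple as storing the normalised form next to the original string. The sketch below is an illustration only, not the TwitIE normaliser: the tiny abbreviation table, the function name normalise, and the crude rule of collapsing any character repeated three or more times to a single character are all assumptions made for this example.

```python
import re

# Hypothetical abbreviation list; a real normaliser would use much larger lists.
ABBREVIATIONS = {"gr8": "great", "omg": "oh my god", "u": "you"}

# Three or more repetitions of the same character.
REPEAT_RE = re.compile(r"(.)\1{2,}")

def normalise(token):
    """Return the original token together with a normalised form."""
    squeezed = REPEAT_RE.sub(r"\1", token.lower())  # "thankkkks" becomes "thanks"
    normalised = ABBREVIATIONS.get(squeezed, squeezed)
    # Both versions are kept: the original still carries the sentiment cues
    # (capitals, repetition), the normalised one suits a POS tagger or parser.
    return {"string": token, "normalised": normalised}

for tok in ["thankkkks", "ARGHHHHHH", "gr8", "!!!!!!"]:
    print(normalise(tok))
```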

A normalised example

The normaliser is currently based on spelling correction and some lists of common abbreviations

Outstanding issues:

• Should new Token annotations be inserted to make POS tagging etc. easier? For example, “trying to” is currently 1 annotation
• Some abbreviations which span token boundaries (e.g. gr8, do n’t) are difficult to handle
• Capitalisation and punctuation normalisation

TwitIE NER Results

Trying TwitIE

• Plugin in the latest GATE snapshot and forthcoming 7.2 release

• Download details at: https://gate.ac.uk/wiki/twitie.html

• Available soon as a web service on the forthcoming AnnoMarket NLP cloud marketplace: https://annomarket.com/

Coming Soon: TwitIE-as-a-Service

Preview of some text analytics services on AnnoMarket.com

Acknowledgements

• Kalina Bontcheva is supported by a Career Acceleration Fellowship from the Engineering and Physical Sciences Research Council (grant EP/I004327/1)

• This research is also partially supported by the EU-funded FP7 TrendMiner project (http://www.trendminer-project.eu) and the CHIST-ERA uComp project (http://www.ucomp.eu)

Thank you for your time!