
Social Media and Text Analytics II WebST (18/7/2016)

Social Media and Text Analytics II: User Geolocation; Twitter POS Tagging, Parsing and NER

Timothy Baldwin


Talk Outline

1 Geolocation Prediction
  - Background
  - Representation
  - Text-based Geolocation
  - Results
  - Network-based Geolocation
  - Summary

2 Twitter POS Tagging, Parsing and NER
  - POS Tagging
  - Dependency Parsing
  - Named Entity Recognition
  - “Social” Features?






What is Geolocation Prediction?

Example (1)

Given a collection of messages, e.g.:

- Waiting for a tram in the rain in Collins St. A more typical Melbourne day today.
- Why you keep me up? I ain’t got no worries.
- New Aussie Hip Hop News: The Yarra stinks.
- Just had a rather thrilling albeit bumpy camel ride around Uluru - SO. MUCH. FUN!! Fancy joining me? Enter my comp here

predict the location associated with those messages

A: melbourne-au


The Wisdom of the Crowd: Predict the User Geolocation I

Example (2)

- Its snowing ! <3
- Praying everyday for the strength and a clear path. Jeremiah 29:11 ♥
- @USER that is a normal amount of sleep! I would kill for 7 hours a night!
- sons of anarchy, gets better every episode. #waitingonmyjax
- Love me some cute couple thompson square.
- “@USER: Who’s better in bed? Him or me? #LadiesWeWantAnswers” him, duh........
- Just aced my omm practical, do i go to the gym or nap? Hmmmm tough decision.

A: hinsdale-il043-us


The Wisdom of the Crowd: Predict the User Geolocation II

Example (3)

- I am officially in hell (@ Bamboo Bernie’s) http://t.co/ribXOCJ
- I’m at Chipotle Mexican Grill (2503 Brandermill Blvd, Gambrills) http://t.co/nIimrTLl
- I’m at ICAT Logistics, Inc. w/ @USER http://t.co/lCF6xXbu
- I’m at The Greene Turtle w/ @USER http://t.co/BEK0Hyxx
- I’m at The Greene Turtle (7556 Teague Rd #100, Arundel Mills Corporate Park, Hanover) w/ 4 others http://t.co/9irDbxB9
- I’m at ICAT Logistics, Inc. (6805 Douglas Legum Drive, Elkridge) http://t.co/Z6h8NK0P
- I’m at National Museum of the American Indian (300 Maryland Ave SW, at Independence Ave and 4th St, Washington) http://t.co/0HwHvQOT

A: elkridge-md027-us


Task Granularity

We will focus on the task of user geolocation:

given the posts of a given user, predict their location

Also the (less-researched) task of message geolocation:

given a single message, predict its location

Inherent uncertainty in both tasks, as there is no guarantee that there is any geospatially-identifying information in a post/collection of posts

Note other end of extreme: check-in message, where the geolocation is provided directly (as geotag AND URL AND name of location AND ...)

All of the experiments we will report on in this lecture assume that check-ins have been removed from the data


Why Geolocation Information?

Event detection

Sentiment analysis

Advertising

Recommendation


Research Motivation: Geospatial Information

Event detection

Sentiment analysis


Predict Geolocations for Twitter Users

Challenges:

- IP-based method? Data availability issues
- Tweets with GPS labels (i.e., geotagged tweets)? Only 1–1.7% of tweets
- User self-declared locations? Often not a real place, e.g., “in your heart”

Research objective:

predict a user’s primary location based on the text content associated with their recent posts

Input: aggregated tweets from a Twitter user

⇒ Output: a location in the world




An Intuitive Example

Where is @BarackObama?

Source(s): http://maps.google.com

http://hum.csse.unimelb.edu.au:9000/geo.html



Can you Guess the City?






Geographic Representation of Location

The main approaches to geographically representing user locations have been:

- point-based
- geopolitical (e.g. one of the 48 contiguous US states) [Eisenstein et al., 2010]
- city-based (e.g. one of 3709 cities of a certain size in the world) [Han et al., 2012]
- grid cell-based
  - latlong uniform-sized grid cell
  - population-based uniform-sized grid cell [Roller et al., 2012]
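As a rough illustration of the latlong uniform-sized grid cell representation, the sketch below maps a coordinate pair to a cell index. The function name `grid_cell` and the default cell size of one degree are assumptions made for the example; the population-based (k-d tree) variant of Roller et al. [2012] is not covered here.

```python
import math

def grid_cell(lat, lon, cell_deg=1.0):
    """Map (lat, lon) in degrees to a (row, col) cell of a uniform lat/long grid.

    Hypothetical helper: rows count up from the south pole, columns count
    eastwards from the antimeridian.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("latitude/longitude out of range")
    n_rows = math.ceil(180.0 / cell_deg)
    n_cols = math.ceil(360.0 / cell_deg)
    # Clamp so that lat = 90 and lon = 180 fall into the last cell.
    row = min(int((lat + 90.0) // cell_deg), n_rows - 1)
    col = min(int((lon + 180.0) // cell_deg), n_cols - 1)
    return row, col
```

With this representation, geolocation becomes classification over cell indices; all users whose coordinates fall in the same cell share one class label.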






Stochastic Representation of Location

Additional dimension of how to stochastically represent the user:

- “one-hot” (the user is at a unique location)
- discrete probabilistic (probability distribution over a discrete geographical representation)
- continuous probabilistic (2D probability density function over a region, based on a discrete geographical representation) [Priedhorsky et al., 2014]






Different Dimensions of Locative Aboutness

Location can be described in terms of:

- the about location: what location is a tweet about
- the tweeting location: where was the tweet sent from (the GPS coordinates of the sending location)
- a user’s primary location: what is the “home base” of the user

For example (sent from MAD, e.g.):

En route to Bilbao; looking forward to a productive summer school

about = Bilbao, ES
tweeting = Madrid, ES
primary = Melbourne, AU

Source(s): Han et al. [2014]






Location Indicative Words

“Location indicative words” (LIWs) explicitly or implicitly indicate locations, and take the form, e.g., of:

- Gazetted terms: Melbourne, London, Boston
- Dialectal terms: tata, aloha
- Local features: tram, tube, the G
- Sports terms: footy, superbowl, hockey
- Weather related words: cold, windy


Varying Levels of Location Indicativeness

Local words: yinz, hoagie, dippy

Somewhat local words: ferry, chinatown, tram

Common words: today, twitter, iphone

How to identify and select location indicative words?

Do LIWs improve prediction accuracy over using all tokens?




Feature Selection I

LIW identification = feature selection, which can take a number of forms:

Statistical:

- chi square (CHI and MaxCHI)
- log likelihood (LOGLIKE)

Information-theoretic:

- information gain (IG)
- information gain ratio (IGR)
- maximum entropy weight (MEW)


Feature Selection II

Spatial measures:

- TF-ICF (inverse city frequency)
- geographical spread (geo) [Laere et al., 2014]
  1. divide the earth into 1° × 1° cells
  2. calculate which cells word w occurs in
  3. merge neighbouring cells containing w until no more cells can be merged, and calculate the resulting number of cells containing w

$$\mathrm{GeoSpread}(w) = \frac{\text{\# of cells containing } w \text{ after merging}}{\mathrm{Max}(w)}$$

where Max(w) = max frequency of w in an unmerged cell.
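The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Laere et al. [2014]: "merging neighbouring cells" is read as counting connected groups of occupied cells, cells are treated as neighbours when they share an edge, and the function name `geo_spread` is hypothetical.

```python
from collections import Counter, deque

def geo_spread(points):
    """GeoSpread for one word, given the (lat, lon) points at which it occurs."""
    # Steps 1-2: count occurrences of the word in each 1-degree cell.
    counts = Counter(
        (int((lat + 90) // 1), int((lon + 180) // 1) % 360) for lat, lon in points
    )
    if not counts:
        raise ValueError("word has no occurrences")
    # Step 3: merge neighbouring occupied cells, i.e. count connected components
    # (assumption: 4-neighbourhood, wrapping around in longitude).
    unvisited = set(counts)
    merged = 0
    while unvisited:
        merged += 1
        queue = deque([unvisited.pop()])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (r + dr, (c + dc) % 360)
                if nb in unvisited:
                    unvisited.remove(nb)
                    queue.append(nb)
    # Max(w): highest frequency of the word in any unmerged cell.
    return merged / max(counts.values())
```

Under this reading, a word used heavily in one contiguous region gets a small score (few merged cells, large Max(w)), while a word scattered thinly across the globe gets a large one.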


Feature Selection III

- Ripley K function (Ripley) [Laere et al., 2014]: measures whether a given set of points is generated from a homogeneous Poisson distribution

$$K(\lambda) = A \times \frac{|\{p, q \in Q_w : \mathrm{distance}(p, q) \le \lambda\}|}{|Q_w|^2}$$

Source(s): Han et al. [2014]
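A small sketch of the Ripley K statistic for a single word follows. Several details are assumptions, since the slide does not define them: Q_w is taken to be the set of points at which word w occurs, A the area of the region under study, distance the great-circle (haversine) distance in km, and pairs are counted as ordered pairs of distinct points. The names `haversine_km` and `ripley_k` are hypothetical.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))

def ripley_k(points, lam_km, area_km2):
    """K(lambda) = A * |{(p, q) : distance(p, q) <= lambda}| / |Q_w|^2."""
    n = len(points)
    if n == 0:
        raise ValueError("word has no occurrences")
    # Count ordered pairs of distinct points within lam_km of each other.
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and haversine_km(p, q) <= lam_km
    )
    return area_km2 * close_pairs / n ** 2
```

The pairwise loop is quadratic in the number of occurrences, so a sketch like this would only be practical on a sample of points for frequent words.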






Experimental Setup

Datasets:

- North America dataset (NA, Roller et al. [2012]): 500K users, 38M tweets
- World dataset (WORLD, Han et al. [2012]): 1.4M users, 192M tweets

Evaluation metrics:

- Accuracy (Acc)
- Accuracy within 161km (Acc@161), e.g., Bilbao and San Sebastian
- Country-level accuracy (Acc@C)
- Median error distance
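The four metrics can be computed along the following lines. This is a sketch under assumptions: predictions and gold labels are given as (city label, (lat, lon)) pairs, the country code is the last hyphen-separated field of the city label (as in melbourne-au), and distances are great-circle distances. The function names and the second example label are hypothetical. 161 km corresponds to roughly 100 miles.

```python
import math
from statistics import median

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))

def evaluate(pred, gold):
    """pred and gold are parallel lists of (city_label, (lat, lon)) per user."""
    n = len(gold)
    errors = [haversine_km(p[1], g[1]) for p, g in zip(pred, gold)]
    country = lambda label: label.rsplit("-", 1)[-1]  # assumed label format
    return {
        "Acc": sum(p[0] == g[0] for p, g in zip(pred, gold)) / n,
        "Acc@161": sum(e <= 161.0 for e in errors) / n,
        "Acc@C": sum(country(p[0]) == country(g[0]) for p, g in zip(pred, gold)) / n,
        "Median": median(errors),
    }
```

A prediction of San Sebastian for a user in Bilbao counts as wrong under Acc but correct under Acc@161 and Acc@C, which is why the relaxed metrics are reported alongside exact accuracy.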



Experimental Parameters

Many variables and parameters to explore, including:

- feature set: all tokens vs. LIWs (vs. L2 regularisation)
- learner: multinomial naive Bayes (NB), Kullback-Leibler divergence (KL), logistic regression (LR)
- location representation: City, k-d tree partitioned earth grid [Roller et al., 2012]
- language: English only, multilingual
- data: geotagged data, geotagged and non-geotagged data, metadata



Multinomial Naive Bayes

The basic formulation for multinomial NB is:

$$P(D|c_i) = \prod_{j=1}^{|V|} \frac{P(t_j|c_i)^{N_{D,t_j}}}{N_{D,t_j}!}$$

where $N_{D,t_j}$ is the frequency of the jth term in D, V is the set of all terms, and:

$$P(t|c_i) = \frac{1 + \sum_{k=1}^{|\mathcal{D}|} N_{k,t} P(c_i|D_k)}{|V| + \sum_{j=1}^{|V|} \sum_{k=1}^{|\mathcal{D}|} N_{k,t_j} P(c_i|D_k)}$$

In practice, use addition of log-likelihoods rather than product of likelihoods

Source(s): McCallum and Nigam [1998]
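A minimal sketch of the formulation above, working in log space as recommended. It assumes hard training labels, so that P(c_i|D_k) is 1 for the document's own class and 0 otherwise, and it drops the factorial term, which is constant across classes and so does not affect the argmax. No class prior is applied, since the formulation only gives P(D|c_i). The names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns (log P(t|c) per class, vocabulary)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    counts = defaultdict(Counter)
    for tokens, label in docs:
        counts[label].update(tokens)
    log_p = {}
    for label, c in counts.items():
        # Add-one smoothing: (1 + count) / (|V| + total count in class).
        denom = len(vocab) + sum(c.values())
        log_p[label] = {t: math.log((1 + c[t]) / denom) for t in vocab}
    return log_p, vocab

def predict_nb(log_p, vocab, tokens):
    """Return the class maximising the sum of N_{D,t} * log P(t|c)."""
    freq = Counter(t for t in tokens if t in vocab)  # ignore unseen terms
    scores = {
        label: sum(n * lp[t] for t, n in freq.items())
        for label, lp in log_p.items()
    }
    return max(scores, key=scores.get)
```

Summing logs avoids the numerical underflow that a product of many small probabilities would cause when a user's aggregated tweets contain thousands of tokens.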


Experimental Parameters

Many variables and parameters to explore, including:

- feature set: all tokens vs. LIWs
- learner: multinomial naive Bayes (NB), Kullback-Leibler divergence (KL), maximum entropy (ME)
- location representation: City, k-d tree partitioned earth grid [Roller et al., 2012]
- language: English only, multilingual
- data: geotagged data, geotagged and non-geotagged data, metadata



Results using LIWs (NB, NA)

Features    Acc    Acc@161  Acc@C  Median

Most Freq.  0.003  0.062    0.947  3089
Full        0.171  0.308    0.831  571

CHI         0.233  0.402    0.850  385
MaxCHI      0.238  0.412    0.848  356
LOGLIKE     0.191  0.343    0.836  489

IG          0.184  0.336    0.838  491
IGR         0.260  0.450    0.811  260
MEW         0.183  0.326    0.836  520

ICF         0.209  0.359    0.841  533
GEO         0.188  0.336    0.834  491
Ripley      0.236  0.432    0.849  306



Models and Location Representation (NA)

Partition  Method  Acc    Acc@161  Acc@C  Median

k-d tree   KL      0.117  0.344    –      469
k-d tree   KL+IGR  0.161  0.437    –      273
k-d tree   NB      0.122  0.367    –      404
k-d tree   NB+IGR  0.153  0.432    –      280

City       NB      0.171  0.308    0.831  571
City       NB+IGR  0.260  0.450    0.811  260
City       ME      0.129  0.232    0.756  878
City       ME+IGR  0.229  0.406    0.842  369

* Acc is not comparable between different class representations



Models and Location Representation (WORLD)

Partition  Method  Acc    Acc@161  Acc@C  Median

City       NB      0.081  0.200    0.807  886
City       NB+IGR  0.126  0.262    0.684  913

k-d tree   KL      0.116  0.283    –      564
k-d tree   KL+IGR  0.121  0.286    –      602
k-d tree   NB      0.119  0.289    –      553
k-d tree   NB+IGR  0.134  0.290    –      577

Summary:

- Feature selection improves geolocation prediction accuracy
- Less impact of model and location representation choice than on NA



Adding Non-geotagged Data

In addition to the geotagged tweets from each user, we often have non-geotagged tweets, which we can potentially use to expand the training/test user representation

Train  Test      Acc    Acc@161  Acc@C  Median

G      G         0.126  0.262    0.684  913
G+NG   G         0.170  0.323    0.733  615
G      G+NG      0.187  0.366    0.835  398
G+NG   G+NG      0.280  0.492    0.878  170

G      G-small   0.121  0.258    0.675  960
G      NG-small  0.114  0.248    0.666  1057

Incorporating NG improves the prediction accuracy

The difference between G-small and NG-small is minor



Exploration of Language Influence

All results to date on English data; some languages highly predictive of location (e.g. Japanese, Finnish)

Investigate interaction between language and geolocation accuracy:

- Partition: city
- Learner: multinomial naive Bayes
- Training: IGR on geotagged multilingual data

Method                         Acc    Acc@161  Acc@C  Median

Per-language majority classes  0.107  0.189    0.693  2805
Unified multilingual model     0.196  0.343    0.772  466
Monolingual partitioned model  0.255  0.425    0.802  302

Table: WORLD in multilingual settings.

Language is a good indicator of location (EN hard!)

Source(s): Han et al. [2014]


User Metadata in Tweets

Examples of user-declared location in public profile:

- Calgary, Alberta
- Iowa, USA
- heat of Arizona
- north east side of indy
- -iN A Veryy Dope Place (:
- hugging my big sister

Examples of user-declared real names in public profile:

- Michael Jordan
- Yuji Matsumoto
- Hinrich Schutze

Train NB classifier for each of user-declared location, timezone, self-description, registered real name

Source(s): Han et al. [2014]


Exploration of User Metadata (WORLD)

Classifier  Acc    Acc@161  Acc@C  Median

text        0.280  0.492    0.878  170
loc         0.405  0.525    0.834  92
tz          0.064  0.171    0.565  1330
desc        0.048  0.117    0.526  2907
rname       0.045  0.109    0.550  2611



Stacking Metadata Classifiers

[Figure: stacking architecture. Level-0 classifiers TEXT, LOC, TZ, DESC and RNAME, over the tweet text, location, time zone, description and real name fields respectively, pass their level-0 predictions to a level-1 logistic regression classifier, which outputs the final prediction.]


Stacking-based Geolocation Prediction

Features       Acc    Acc@161  Acc@C  Median

0. text        0.280  0.492    0.878  170
1. 0. + loc    0.483  0.653    0.903  14
2. 1. + tz     0.490  0.665    0.917  9
3. 2. + desc   0.490  0.666    0.919  9
4. 3. + rname  0.491  0.667    0.919  9



Temporal Influence I

Can a model trained on “old” data generalise to “new”data?

WORLD: 10K time-homogeneous users

Features       Acc    Acc@161  Acc@C  Median
1. text        0.280  0.492    0.878  170
2. loc         0.405  0.525    0.834  92
3. tz          0.064  0.171    0.565  1330
1. + 2. + 3.   0.490  0.665    0.917  9

LIVE: 32K time-heterogeneous users

Features       Acc    Acc@161  Acc@C  Median
1. text        0.268  0.510    0.901  151
2. loc         0.326  0.465    0.813  306
3. tz          0.065  0.160    0.525  1529
1. + 2. + 3.   0.406  0.614    0.901  40


Temporal Influence II

Similar effect recently observed by Dredze et al. [2016] at the message-level, with a drop in message-level accuracy of 19% a week after the model was trained (to the level of a model trained on two orders of magnitude less data(!))

... but by incrementally re-training a model trained on 1% of geotagged tweets, accuracy greater than that of a statically-trained model can be achieved within 20 days (online training more important than the volume of static training data)


Prediction Confidence I

More geolocatable user:

- Porting my mobile to Telstra is a brilliant idea, #vodafail
- @USER1 @USER2 @USER3 actually Kevin Rudd also has an active weibo account.
- @USER good memory, I can hardly remember the day I came to Melbourne.

Less geolocatable user:

- happy birthday to me
- i just finished my hw, oooh, too much
- Yes! all things are diffcult before they re easy


Prediction Confidence II

Rank users by confidence: probability (AP), probability ratio of 1st and 2nd prediction (PR), geo-proximity in top-10 predictions (PC), accumulated counts (FN) and weights (FW) of optimised features:


Prediction Confidence III

[Figure: Acc@161 (y-axis, 0.2 to 0.8) against recall (x-axis, 0.05 to 0.95) for each confidence measure: Absolute Probability (AP), Prediction Coherence (PC), Prediction Ratio (PR), Feature Number (FN) and Feature Weight (FW).]
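The two probability-based confidence measures, and the accuracy-at-recall style of evaluation, can be sketched as below. This assumes each user comes with a posterior distribution over locations, and it reads "recall" as the fraction of users (ranked by confidence) for which a prediction is kept; both the reading and the function names are assumptions made for the example.

```python
def confidence_scores(probs):
    """probs: dict mapping location -> posterior probability for one user.

    Returns (AP, PR): the top probability, and the ratio of the top
    probability to the second-highest one.
    """
    ranked = sorted(probs.values(), reverse=True)
    ap = ranked[0]
    if len(ranked) > 1 and ranked[1] > 0:
        pr = ranked[0] / ranked[1]
    else:
        pr = float("inf")
    return ap, pr

def acc_at_recall(users, recall):
    """users: list of (confidence, is_correct). Accuracy over the most
    confident `recall` fraction of users."""
    ranked = sorted(users, key=lambda u: u[0], reverse=True)
    k = max(1, round(recall * len(ranked)))
    return sum(ok for _, ok in ranked[:k]) / k
```

If a confidence measure is informative, accuracy over the most confident users should be higher than accuracy over all users, which is the trade-off such a plot traces out.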







Inter-document Graphs

Various possible sources of inter-document graphs:

- explicit inter-document interactions (e.g. mentions)
- explicit author-level interactions (e.g. following)
- implicit inter-document similarity (e.g. document overlap/similarity)

Various possibilities for graph semantics:

- directed vs. undirected
- weighted vs. unweighted
- single vs. multiple graphs


Network Analytics 101

One of the core concepts in network analytics is homophily: the tendency of individuals to associate and bond with similar others

- the corollary for network analytics is that strongly connected subgraphs tend to share the same label
- obvious analogies in clustering and classification, with the main difference being the presence/absence of an explicit graph

Sometimes connections actually represent heterophily, esp. in adversarial contexts such as debates


Approaches to Network Inference I

Popular approaches to network inference:

- label propagation: nearest neighbour-style iterative semi-supervised approach
- collective classification: combine base and network classifiers to optimise consistency in the network
- matrix factorisation: factorise the matrix into a product of lower-dimensional matrices


Label Propagation I

Given a graph G = (V, E, W) where V is the set of nodes with |V| = n = n_l + n_u (where n_l nodes are labelled and n_u nodes are unlabelled), E is the set of edges, and W is an edge weight matrix.

Simple iterative algorithm [Zhu and Ghahramani, 2002]:

1. for each node $u_u^{(i)} \in V_u$, get the set of labelled neighbours based on E, and label $u_u^{(i)}$ based on the (weighted) mean/median of the neighbours
2. repeat until convergence
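The iterative algorithm can be sketched for geolocation as follows. The choices here are assumptions made for illustration: labels are (lat, lon) pairs, the graph is unweighted and undirected, each unlabelled node takes the coordinate-wise median of its currently labelled neighbours, and updates are applied synchronously. The function name `label_propagation` is hypothetical.

```python
from statistics import median

def label_propagation(edges, labelled, max_iter=10):
    """edges: iterable of undirected (u, v) pairs.
    labelled: dict mapping training node -> (lat, lon).
    Returns a dict with an estimated (lat, lon) for every reachable node."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    est = dict(labelled)
    for _ in range(max_iter):
        updates = {}
        for node in nbrs:
            if node in labelled:
                continue  # a-priori labels stay fixed
            known = [est[n] for n in nbrs[node] if n in est]
            if known:
                updates[node] = (
                    median(lat for lat, _ in known),
                    median(lon for _, lon in known),
                )
        # Converged when no estimate changes.
        if all(est.get(n) == loc for n, loc in updates.items()):
            break
        est.update(updates)
    return est
```

Nodes with no path to any labelled node never receive an estimate, which is one motivation for backing off to a text-based prediction for disconnected users.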


Label Propagation II

Modified Adsorption [Talukdar and Crammer, 2009]:

- Intuition: want a semi-supervised method that:

(a) predicts labels for a-priori labelled vertices ($u_l \in V_l$) as close as possible to the original;

(b) labels proximate vertices similarly; and

(c) generates an output that is as uninformative as possible (à la logistic regression)


Label Propagation III

- Formally, we capture each of these with a dedicated term:

(a) $(Y_l - \hat{Y}_l)^{T} S (Y_l - \hat{Y}_l)$

(b) $\hat{Y}_l^{T} L \hat{Y}_l$

(c) $\|\hat{Y}_l - R_l\|^2$

where $Y_l$ and $\hat{Y}_l$ are the columns of the a-priori label matrix $Y$ and predicted label matrix $\hat{Y}$, respectively, associated with label l; S is a diagonal matrix used to identify the a-priori labelled vertices; L is the Laplacian of an undirected graph derived from G; and $R_l$ is the lth column of regularisation matrix R of dimensions n × (m+1), set to zero for all but a unique “dummy” label.


Label Propagation IV

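- These are combined as follows:

$$C(\hat{Y}) = \sum_l \left[ \mu_1 (Y_l - \hat{Y}_l)^{T} S (Y_l - \hat{Y}_l) + \mu_2 \hat{Y}_l^{T} L \hat{Y}_l + \mu_3 \|\hat{Y}_l - R_l\|^2 \right]$$

where $\mu_1$, $\mu_2$ and $\mu_3$ are hyperparameters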


Collective Classification I

Collective classification: given a network and an object o in the network, use (up to) three types of correlations to infer a label for o:

1. the correlations between the label of o and its observed attributes
2. the correlations between the label of o and the observed attributes and labels of nodes connected to o
3. the correlations between the label of o and the unobserved labels of objects connected to o

Source(s): Sen et al. [2008]


Collective Classification II

Formally, collective classification takes a graph, made up of:

- nodes V = {V_1, ..., V_n}
- edges E

The task is to label the nodes V_i ∈ V from a label set L = {L_1, ..., L_q}, making use of the graph in the form of a neighborhood function N = {N_1, ..., N_n}, where N_i ⊆ V \ {V_i}.


Approaches to Collective Classification I

Two general approaches to capturing the first two correlations:

- iterative classification: bootstrap node labels with a content-only classifier and generate a random ordering over nodes V, then iteratively update the estimate of $v_i$ based on the current $N_i$, and update $\vec{a}_i$ accordingly [local approach]
- dual classifier + graph inference: train separate content-only and link classifiers, and use graph inference (mean field, loopy belief propagation, min-cut, etc.) to “smooth” the predictions over the graph [global approach]

Source(s): Sen et al. [2008]


User Geolocation: Enter the Network I

The easiest way to generate a network for Twitter user geolocation is via @user mentions (e.g. @eltimster lovin the talk)

Question of what to do with user mentions outside the training/dev/test data sample

(one) solution = collapse edges through out-of-network nodes into direct edges
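One way to read the collapsing step is sketched below: whenever two in-sample users mention the same out-of-sample account, they are connected directly, and the out-of-sample account is dropped. This is an interpretation for illustration, producing an unweighted, undirected graph; the function name `collapse_mentions` and the input format are assumptions.

```python
from itertools import combinations

def collapse_mentions(mentions, in_sample):
    """mentions: iterable of (user, mentioned_user) pairs.
    in_sample: set of users in the training/dev/test sample.
    Returns a set of undirected edges (frozensets) over in-sample users only."""
    edges = set()
    via = {}  # out-of-sample account -> in-sample users who mention it
    for user, mentioned in mentions:
        if user not in in_sample:
            continue
        if mentioned in in_sample:
            if user != mentioned:
                edges.add(frozenset((user, mentioned)))
        else:
            via.setdefault(mentioned, set()).add(user)
    # Replace each path user - out-of-sample account - user with a direct edge.
    for users in via.values():
        for a, b in combinations(sorted(users), 2):
            edges.add(frozenset((a, b)))
    return edges
```

Note that an out-of-sample account mentioned by k users contributes k(k-1)/2 direct edges, so very popular accounts can inflate the collapsed graph considerably.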


User Geolocation: Enter the Network II

[Figure: two panels. Left, the @-mention network over train nodes (tr1 to tr3), test nodes (te1 to te5) and mentioned nodes (m1 to m3). Right, the collapsed network plus dongle nodes (d1 to d5). Legend: tr_i = train node, m_i = mentioned node, te_i = test node, d_i = dongle node.]

Weighted, directed graph the most obvious approach, but I present results for unweighted, undirected graphs in this talk


User Geolocation: Enter the Network III

First, results for the simple network-based methods:

- label propagation (LP)
- modified adsorption (MAD)

Method     Acc@161
Text only  0.39
LP         0.52
MAD        0.50

Source(s): Jurgens [2013], Rahimi et al. [2015a,b]


User Geolocation: “Celebrity Nodes” I

For the larger datasets (e.g. Twitter-World) we observe: (a) MAD doesn’t scale; and (b) highly-connected nodes bias the output heavily

Based on this observation, we automatically detect and remove highly-connected (“celebrity”) nodes:

[Figure: mean error (in km, 700 to 860) and graph size (# edges, log scale from 10^5 to 10^9) plotted against the celebrity threshold T (# of mentions, from 2 to 5k).]
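Celebrity removal can be sketched as a simple degree filter. The sketch assumes the number of mentions of a node can be approximated by its degree in the edge list, and that a node counts as a celebrity when this exceeds the threshold T; the function name `remove_celebrities` is hypothetical.

```python
from collections import Counter

def remove_celebrities(edges, threshold):
    """edges: list of undirected (u, v) pairs.
    Drops every node whose degree exceeds `threshold`, along with its edges.
    Returns (remaining_edges, set_of_removed_nodes)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    celebs = {node for node, d in degree.items() if d > threshold}
    kept = [(u, v) for u, v in edges if u not in celebs and v not in celebs]
    return kept, celebs
```

Lower thresholds remove more nodes and shrink the graph, so the threshold trades off tractability against losing informative connections.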


User Geolocation: “Celebrity Nodes” II

The removal of celebrity nodes empirically boosts both LP and MAD (over all datasets):

Acc@161

Method  +CEL  −CEL
LP      0.52  0.54
MAD     0.50  0.56


User Geolocation: How to Combine the Text and the Network? I

The easiest way to integrate graph- and text-based classification is to use the text as a source of priors

Approach 1: use pointwise text-based user priors as backoff for disconnected nodes [post-processing] [Rahimi et al., 2015b]

Approach 2: use pointwise text-based user priors as priors for all unlabelled nodes [pre-processing] [Rahimi et al., 2015a]

- incorporate the priors directly on the nodes
- incorporate the priors as “dongle” nodes (uniquely) connected to a given user
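The backoff approach and the dongle-node variant of the second approach can be sketched as below. Both functions are illustrations under assumptions: predictions are held in plain dictionaries keyed by user, a dongle node is represented as a ("dongle", user) tuple, and the names `backoff` and `add_dongles` are hypothetical.

```python
def backoff(network_pred, text_pred):
    """Approach 1 (post-processing): keep the network prediction where one
    exists, and fall back to the text-based prediction otherwise."""
    return {user: network_pred.get(user, loc) for user, loc in text_pred.items()}

def add_dongles(edges, labelled, text_pred):
    """Approach 2, dongle variant (pre-processing): give each unlabelled user a
    dedicated extra neighbour whose label is the text-based prediction."""
    edges = list(edges)
    labelled = dict(labelled)
    for user, loc in text_pred.items():
        if user in labelled:
            continue  # training users keep their gold label only
        dongle = ("dongle", user)
        edges.append((user, dongle))
        labelled[dongle] = loc
    return edges, labelled
```

The augmented edge list and label dictionary returned by `add_dongles` can then be handed to any graph-based method, so the text prior acts as one more labelled neighbour rather than as a hard decision.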



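User Geolocation: How to Combine the Text and the Network? II

[Figure: two panels. Left, the @-mention network over train nodes (tr1 to tr3), test nodes (te1 to te5) and mentioned nodes (m1 to m3). Right, the collapsed network plus dongle nodes (d1 to d5). Legend: tr_i = train node, m_i = mentioned node, te_i = test node, d_i = dongle node.]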



User Geolocation: How to Combine the Text and the Network? III

Acc@161

Method  −Text  +Backoff  +Direct  +Dongle
LP      0.54   0.55      0.43     0.53
MAD     0.56   0.59      0.49     0.59


User Geolocation: Findings

Network-only results generally better than text-only results (Text only < LP < MAD)

Both network-based methods improve with the incorporation of text-based user priors (with dongle nodes or simple backoff being the most effective way of integrating the two)

In terms of computational efficiency, LP > Text only ≫ MAD

Removal of highly-connected nodes leads to greater tractability for MAD, and more accurate results for both methods






The Road Ahead

Network features highly effective in user geolocation (more so than text features); much more work to be done in combining the two

Message-level geolocation still very much an unsolved task

How to keep the model temporally-relevant?

Interaction between lexical normalisation and geolocation


Summary

User geolocation: supervised text-based multi-class classification problem

Location Indicative Words improve model effectiveness and efficiency

Model choice and location partitions are less crucial than feature selection

Adding non-geotagged data, identifying languages and incorporating user metadata all improve the prediction accuracy

Network-based models more accurate than text-based models, and small gains possible in combining network-based models with text-based geolocation priors


A Plug!

WNUT 2016 Shared Task on User and Message Geolocation

If you are interested in this space, we are running a “shared task” on user and message geolocation as part of The 2nd Workshop on Noisy User-generated Text (W-NUT):

http://noisy-text.github.io/2016/




... and Another Plug!

New Python toolkit for geolocation research

We have just released a new toolkit for geolocation research (text- and network-based models):

https://github.com/afshinrahimi/pigeo








Twitter POS Tagging

How is POS tagging for social media data (focusing on Twitter) different to POS tagging for any other text source?

- deterministically taggable tokens (URLs, emoticons)
- higher proportion of OOV words → lexical normalisation, beef-up novel word handling rules, add word cluster information
- lower reliability of casing → add more gazetteers
- lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)
- some POS tag distinctions hard to make in social media → possibly tweak POS tagset to remove certain distinctions (and add others)

Source(s): Gimpel et al. [2011], Derczynski et al. [2013], Owoputi et al. [2013]


Penn ↔ CMU POS Tagset

Penn POS tag(s)        CMU POS tag

NN, NNS                N
PRP, WP                O
NNP, NNPS              ^
MD, V*                 V
J*                     A
RB, WRB                R
UH                     !
WDT, DT, WP$, PRP$     D
IN, TO                 P
CC                     &
RP                     T
EX, PDT                X
CD                     $
(none)                 # (hashtag)
(none)                 @ (mention)
(none)                 U (URL)

...
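The correspondences listed in the table can be encoded as a small lookup, sketched below. It covers only the rows shown above (the table is elided, so this is a partial mapping), treats the trailing * as a wildcard, and returns None for any Penn tag that is not listed. The names `PENN_TO_CMU` and `penn_to_cmu` are hypothetical.

```python
from fnmatch import fnmatchcase

# Rows transcribed from the table above; "*" is a wildcard over Penn tags.
PENN_TO_CMU = [
    (("NN", "NNS"), "N"),
    (("PRP", "WP"), "O"),
    (("NNP", "NNPS"), "^"),
    (("MD", "V*"), "V"),
    (("J*",), "A"),
    (("RB", "WRB"), "R"),
    (("UH",), "!"),
    (("WDT", "DT", "WP$", "PRP$"), "D"),
    (("IN", "TO"), "P"),
    (("CC",), "&"),
    (("RP",), "T"),
    (("EX", "PDT"), "X"),
    (("CD",), "$"),
]

def penn_to_cmu(tag):
    """Return the CMU tag for a Penn tag, or None if the tag is not in the table."""
    for patterns, cmu in PENN_TO_CMU:
        if any(fnmatchcase(tag, pattern) for pattern in patterns):
            return cmu
    return None
```

The Twitter-specific CMU tags (#, @, U) have no Penn counterpart, so they cannot be produced by such a mapping and have to come from the tagger or from deterministic token rules.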


Example POS Tag Assignment

Example

ikr  smh  he  asked  fir  yo  last  name  so  he  can  add  u  on  fb  lololol
!    G    O   V      P    D   A     N     P   O   V    V    O  P   ^   !


Approach of Owoputi et al. [2013]

Model: first-order maximum entropy Markov model (MEMM)

Features:

- word cluster features (1000× Brown clusters), pre-trained on uniformly-distributed sample of tweets from 4 year time period (100K tweets per day): cluster membership + cluster prefix + extra tokenisation to match OOV tokens to clusters
- careful tokenisation of emoji
- tag dictionary (most frequent POS tag)
- gazetteer match tag for names, locations, etc.

Source(s): Owoputi et al. [2013]


Other Notable Approaches

Possible to use tweets with URLs to edited documents to automatically generate (very good) silver-standard annotations based on the document-based POS tag distribution [Plank et al., 2014]; also applicable to NER

Domain adapt Stanford CoreNLP POS tagger (based on the Penn POS tagset) by: (a) lexical normalisation; (b) adding tag probabilities; (c) adding a gazetteer; and (d) generating extra silver-standard training through cross-tagger agreement [Derczynski et al., 2013]






Dependency Parsing: What is it?

Dependency parsing = determine the syntax of an input in terms of the binary (labelled, directed) links between words and the words that directly govern them

Example (from Kong et al. [2014])

OMG I ♥ the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber

[Figure: dependency analysis of the tweet above, annotated to show multiple roots (ROOT), coordination (COORD), multiword expressions (MWE) and noun phrase internal structure.]

Features: joint sentence tokenisation and parsing; treat multiword expressions (MWEs) as atomic; don’t explicitly include dependency links for punctuation; explicitly disambiguate the internal structure of noun phrases

Source(s): Kong et al. [2014]


Twitter Dependency Parsing

How is dependency parsing for social media data different to dependency parsing for any other text source?

- dependency parser carries out sentence tokenisation hand in hand with syntactic disambiguation
- allow “orphan” tokens which are not part of the syntactic structure of the message (e.g. non-syntactic hashtags)
- lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)

Source(s): Kong et al. [2014]


Twitter Dependency Parsing Approach of Kong et al. [2014]

Model: TurboParser (integer linear program with multiple passes of dual decomposition, to include second- and third-order features: Martins et al. [2010])

Features:

- add extra arcs to capture tokens that are not part of the dependency
- Brown clusters
- stacking-style interpretation of Penn Treebank features
- POS tagger






Named Entity Recognition: What is it?

Named entity recognition (“NER”) = the process of (uniquely) identifying token sequences which denote anchored entities with encyclopaedic knowledge associated with them (≈ proper names), and the type of each entity relative to a pre-determined class set (e.g. PERSON, LOCATION, ORGANISATION, ...):

Example

[PER Wolff] , currently a journalist in [LOC Argentina] , played with [PER Del Bosque] in the final years of the seventies in [ORG Real Madrid] .

Source(s): Tjong Kim Sang and De Meulder [2003]


Twitter Named Entity Recognition

How is NER for social media data different to NER for any other text source?

- NER (for English) relies heavily on casing information, which can be inconsistent in Twitter
- NER is largely a game of learning lexical priors/transition probabilities correctly, and inconsistent lexicalisation in Twitter complicates this
- NEs come and go on Twitter burstily, and there is also general temporal burstiness
- the standard NE label sets are not necessarily well suited to Twitter
- lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)

Source(s): Ritter et al. [2011], Baldwin et al. [2015]


Twitter NER: Success Stories

Generally speaking, things that work well for Twitter POS tagging also work well for Twitter NER (clustering/distributed representations, gazetteers, structured classifiers)

One very exciting result of the recent W-NUT 2015 Shared Task [Baldwin et al., 2015] was the demonstration of the (phenomenal) success of entity linking for NER [Yamada et al., 2015]

- NE linking = disambiguation of NEs relative to a knowledge base (Wikipedia, in this case)
- why does it help?
  - popularity of different NEs, NE “coherence”, context fit, ...
  - particularly effective for Twitter because of the limited textual context/lack of redundancy
  - rare instance of semantics helping an NLP task

Source(s): Ritter et al. [2011], Baldwin et al. [2015]


A(nother) Plug!

WNUT 2016 Shared Task on Named Entity Recognition

If you are interested in this space, we are running a “shared task” on named entity recognition as part of The 2nd Workshop on Noisy User-generated Text (W-NUT):









“Social” POS Tagging, Dependency Parsing and NER?

Little work on using the “social” nature of social media for POS tagging etc.; possibilities include:

- learning user-specific tag probabilities (e.g. style of use of user mentions and hashtags) [Hovy and Søgaard, 2015]
- constraining the model to ensure consistency between comments and retweeted content
- learning NE usage patterns across Twitter (e.g. neologistic NEs) and dynamically updating gazetteers etc.


Summary of POS Tagging, Parsing and NER

Adaptations of computational (morpho-)syntactic analysis to social media text largely focused on robustness + gazetteering ... relatively little use made of social features


References I

Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. Shared tasks of the 2015 Workshop on Noisy User-generated Text: Twitter lexical normalization and named entity recognition. In Proceedings of the ACL 2015 Workshop on Noisy User-generated Text (W-NUT), pages 126–135, Beijing, China, 2015.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of RANLP 2013 (Recent Advances in Natural Language Processing), Hissar, Bulgaria, 2013.

Mark Dredze, Miles Osborne, and Prabhanjan Kambadur. Geolocation for Twitter: Timing matters. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1064–1069, 2016. URL http://aclweb.org/anthology/N16-1122.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 1277–1287, Cambridge, USA, 2010. URL http://www.aclweb.org/anthology/D10-1124.


References II

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, JacobEisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith.Part-of-speech tagging for Twitter: Annotation, features, and experiments. InProceedings of the 49th Annual Meeting of the Association for ComputationalLinguistics: Human Language Technologies (ACL HLT 2011), pages 42–47, Portland,USA, 2011. URL http://www.aclweb.org/anthology/P11-2008.

Bo Han, Paul Cook, and Timothy Baldwin. Geolocation prediction in social media databy finding location indicative words. In Proceedings of the 24th InternationalConference on Computational Linguistics (COLING 2012), pages 1045–1062, Mumbai,India, 2012.

Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocationprediction. Journal of Artificial Intelligence Research, 49:451–500, 2014.

Dirk Hovy and Anders Søgaard. Tagging performance correlates with author age. In Proceedings of the Joint conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 483–488, 2015. URL http://aclweb.org/anthology/P15-2079.


References III

David Jurgens. That’s what friends are for: Inferring location in online social media platforms based on social relationships. In Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM 2013), pages 273–282, Dublin, Ireland, 2013.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1001–1012, Doha, Qatar, 2014.

Olivier Van Laere, Jonathan Quinn, Steven Schockaert, and Bart Dhoedt. Spatially-aware term selection for geotagging. IEEE Transactions on Knowledge and Data Engineering, 26(1):221–234, 2014.

Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar, and Mario Figueiredo. Turbo Parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 34–44, 2010. URL http://aclweb.org/anthology/D10-1004.

Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization (available as Technical Report WS-98-05, AAAI Press), Madison, USA, 1998.


References IV

Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 380–390, Atlanta, USA, 2013.

Barbara Plank, Dirk Hovy, Ryan McDonald, and Anders Søgaard. Adapting taggers to Twitter with not-so-distant supervision. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1783–1792, 2014. URL http://aclweb.org/anthology/C14-1168.

Reid Priedhorsky, Aron Culotta, and Sara Y. Del Valle. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW 2014), pages 1523–1536, Baltimore, USA, 2014.

Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. Twitter user geolocation using a unified text and network prediction model. In Proceedings of the Joint conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 630–636, Beijing, China, 2015a.


References V

Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. Exploiting text and network context for geolocation of social media users. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2015), pages 1362–1367, Denver, USA, 2015b.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534, Edinburgh, UK, 2011.

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2012 (EMNLP-CoNLL 2012), pages 1500–1510, Jeju Island, Korea, 2012. URL http://www.aclweb.org/anthology/D12-1137.

Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.


References VI

Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Proceedings of the European Conference on Machine Learning (ECML-PKDD) 2009, pages 442–457, Bled, Slovenia, 2009.

Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL-2003), pages 142–147, Edmonton, Canada, 2003. URL http://www.aclweb.org/anthology/W03-0419.pdf.

Ikuya Yamada, Hideaki Takeda, and Yoshiyasu Takefuji. Enhancing named entity recognition in Twitter messages using entity linking. In Proceedings of the Workshop on Noisy User-generated Text, pages 136–140, 2015. URL http://aclweb.org/anthology/W15-4320.

Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
