TRANSCRIPT
Social Media and Text Analytics II WebST (18/7/2016)
Social Media and Text Analytics II
User Geolocation; Twitter POS Tagging, Parsing and NER
Timothy Baldwin
Talk Outline
1 Geolocation Prediction
    Background
    Representation
    Text-based Geolocation
    Results
    Network-based Geolocation
    Summary
2 Twitter POS Tagging, Parsing and NER
    POS Tagging
    Dependency Parsing
    Named Entity Recognition
    “Social” Features?
What is Geolocation Prediction?
Example (1)
Given a collection of messages, e.g.:
- Waiting for a tram in the rain in Collins St. A more typical Melbourne day today.
- Why you keep me up? I ain’t got no worries.
- New Aussie Hip Hop News: The Yarra stinks.
- Just had a rather thrilling albeit bumpy camel ride around Uluru - SO. MUCH. FUN!! Fancy joining me? Enter mycomp here
predict the location associated with those messages
A: melbourne-au
The Wisdom of the Crowd: Predict the User Geolocation I
Example (2)
- Its snowing ! <3
- Praying everyday for the strength and a clear path. Jeremiah 29:11 ♥
- @USER that is a normal amount of sleep! I would kill for 7 hours a night!
- sons of anarchy, gets better every episode. #waitingonmyjax
- Love me some cute couple thompson square.
- “@USER: Who’s better in bed? Him or me? #LadiesWeWantAnswers” him, duh........
- Just aced my omm practical, do i go to the gym or nap? Hmmmm tough decision.
A: hinsdale-il043-us
The Wisdom of the Crowd: Predict the User Geolocation II
Example (3)
- I am officially in hell (@ Bamboo Bernie’s) http://t.co/ribXOCJ
- I’m at Chipotle Mexican Grill (2503 Brandermill Blvd, Gambrills) http://t.co/nIimrTLl
- I’m at ICAT Logistics, Inc. w/ @USER http://t.co/lCF6xXbu
- I’m at The Greene Turtle w/ @USER http://t.co/BEK0Hyxx
- I’m at The Greene Turtle (7556 Teague Rd #100, Arundel Mills Corporate Park, Hanover) w/ 4 others http://t.co/9irDbxB9
- I’m at ICAT Logistics, Inc. (6805 Douglas Legum Drive, Elkridge) http://t.co/Z6h8NK0P
- I’m at National Museum of the American Indian (300 Maryland Ave SW, at Independence Ave and 4th St, Washington) http://t.co/0HwHvQOT
A: elkridge-md027-us
Task Granularity
We will focus on the task of user geolocation:
given the posts of a given user, predict their location
Also the (less-researched) task of message geolocation:
given a single message, predict its location
Inherent uncertainty in both tasks, as there is no guarantee that there is any geospatially-identifying information in a post/collection of posts
Note other end of extreme: check-in message, where the geolocation is provided directly (as geotag AND URL AND name of location AND ...)
All of the experiments we will report on in this lecture assume that checkins have been removed from the data
Why Geolocation Information?
Event detection
Sentiment analysis
Advertising
Recommendation
Research Motivation: Geospatial Information
Event detection
Sentiment analysis
Predict Geolocations for Twitter Users
Challenges:
- IP-based method? Data availability issues
- Tweets with GPS labels (i.e., geotagged tweets)? Only 1–1.7% of tweets
- User self-declared locations? Not always usable, e.g., “in your heart”
Research objective:
- predict a user’s primary location based on the text content associated with their recent posts
Input: Aggregated tweets from a Twitter user ⇒ Output: A location in the world
An Intuitive Example
Where is @BarackObama?
Source(s): http://maps.google.com
Can you Guess the City?
Geographic Representation of Location
The main approaches to geographically representing user locations have been:
- point-based
- geopolitical (e.g. one of the 48 contiguous US states) [Eisenstein et al., 2010]
- city-based (e.g. one of 3709 cities of a certain size in the world) [Han et al., 2012]
- grid cell-based
    - latlong uniform-sized grid cell
    - population-based uniform-sized grid cell [Roller et al., 2012]
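To make the lat/long uniform-sized grid cell representation concrete, here is a minimal sketch that maps a coordinate to a cell index. The function name `latlong_cell`, the `cell_deg` parameter and the (row, col) indexing are assumptions for illustration, not something taken from the cited papers.

```python
import math


def latlong_cell(lat, lon, cell_deg=1.0):
    """Map a (lat, lon) point to the (row, col) index of a uniform lat/long grid.

    Hypothetical helper: rows count up from the south pole, columns count
    eastwards from longitude -180.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    rows = math.ceil(180.0 / cell_deg)
    cols = math.ceil(360.0 / cell_deg)
    # Clamp so that lat = 90 and lon = 180 fall into the last cell.
    row = min(int((lat + 90.0) // cell_deg), rows - 1)
    col = min(int((lon + 180.0) // cell_deg), cols - 1)
    return row, col
```

Each cell then acts as one class label. Cells of equal angular size hold very different numbers of users, which is the motivation for the population-based alternative listed above.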
Stochastic Representation of Location
Additional dimension of how to stochastically represent the user:
- “one-hot” (the user is at a unique location)
- discrete probabilistic (probability distribution over a discrete geographical representation)
- continuous probabilistic (2D probability density function over a region, based on a discrete geographical representation) [Priedhorsky et al., 2014]
Different Dimensions of Locative Aboutness
Location can be described in terms of:
- the about location: what location is a tweet about
- the tweeting location: where was the tweet sent from (the GPS coordinates of the sending location)
- a user’s primary location: what is the “home base” of the user
For example (sent from MAD, e.g.):
    En route to Bilbao; looking forward to a productive summer school
about = Bilbao, ES
tweeting = Madrid, ES
primary = Melbourne, AU
Source(s): Han et al. [2014]
Location Indicative Words
“Location indicative words” (LIWs) either explicitly or implicitly indicate locations, and take the form, e.g., of:
- Gazetted terms: Melbourne, London, Boston
- Dialectal terms: tata, aloha
- Local features: tram, tube, the G
- Sports terms: footy, superbowl, hockey
- Weather related words: cold, windy
Varying Levels of Location Indicativeness
Local words: yinz, hoagie, dippy
Somewhat local words: ferry, chinatown, tram
Common words: today, twitter, iphone
How to identify and select location indicative words?
Do LIWs improve prediction accuracy over using all tokens?
Feature Selection I
LIW identification = feature selection, which can take a number of forms:
Statistical:
- chi square (CHI and MaxCHI)
- log likelihood (LOGLIKE)
Information-theoretic:
- information gain (IG)
- information gain ratio (IGR)
- maximum entropy weight (MEW)
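As an illustration of one of the information-theoretic measures, the sketch below scores a word by information gain ratio. It assumes the textbook definition (information gain of the city label given presence/absence of the word, divided by the entropy of the presence/absence split); the slides do not give the formula, so treat the details, the function names and the toy city labels as assumptions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def info_gain_ratio(word, users):
    """users: list of (set_of_tokens, city) pairs, one per training user."""
    with_w = Counter(city for toks, city in users if word in toks)
    without_w = Counter(city for toks, city in users if word not in toks)
    n_w, n_not = sum(with_w.values()), sum(without_w.values())
    n = n_w + n_not
    if n_w == 0 or n_not == 0:
        return 0.0  # the word splits nothing, so the intrinsic value is zero
    h_city = entropy(list((with_w + without_w).values()))
    h_cond = (n_w / n) * entropy(list(with_w.values())) \
        + (n_not / n) * entropy(list(without_w.values()))
    intrinsic = entropy([n_w, n_not])
    return (h_city - h_cond) / intrinsic
```

Words are then ranked by score and only the top-ranked ones are kept as features; how many to keep is a tuning decision.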
Feature Selection II
Spatial measures:
- TF-ICF (inverse city frequency)
- geographical spread (geo) [Laere et al., 2014]
    1. divide the earth into 1° × 1° cells
    2. calculate which cells word w occurs in
    3. merge neighbouring cells containing w until no more cells can be merged, and calculate the resulting number of cells containing w

\[
\mathrm{GeoSpread}(w) = \frac{\#\text{ of cells containing } w \text{ after merging}}{\mathrm{Max}(w)}
\]

where Max(w) = max frequency of w in an unmerged cell.
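The three steps and the GeoSpread ratio can be sketched as follows. The function name `geo_spread` is hypothetical, and two details are assumptions: cells count as neighbours if they touch at an edge or a corner, and there is no wrap-around at the date line.

```python
import math
from collections import Counter


def geo_spread(points):
    """points: iterable of (lat, lon) occurrences of a single word w."""
    # Steps 1-2: count occurrences of w in each 1 x 1 degree cell.
    counts = Counter((math.floor(lat), math.floor(lon)) for lat, lon in points)
    if not counts:
        raise ValueError("word has no geotagged occurrences")
    # Step 3: merge neighbouring cells with a flood fill and count the
    # resulting merged regions (assumption: 8-neighbourhood).
    unseen = set(counts)
    merged = 0
    while unseen:
        merged += 1
        stack = [unseen.pop()]
        while stack:
            r, c = stack.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nb = (r + dr, c + dc)
                    if nb in unseen:
                        unseen.remove(nb)
                        stack.append(nb)
    # Max(w): highest frequency of w in any single unmerged cell.
    return merged / max(counts.values())
```

Under this definition a word that is frequent in one compact region gets a low score, while a word scattered thinly over many separate regions gets a high one.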
Feature Selection III
- Ripley K function (Ripley) [Laere et al., 2014]: measures whether a given set of points is generated from a homogeneous Poisson distribution

\[
K(\lambda) = A \times \frac{|\{p, q \in Q_w : \mathrm{distance}(p, q) \le \lambda\}|}{|Q_w|^2}
\]

Source(s): Han et al. [2014]
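A brute-force sketch of the K statistic, under stated assumptions: $Q_w$ is taken to be the list of (lat, lon) points where word w occurs, $A$ is an area constant supplied by the caller, distance is great-circle distance in km, and ordered pairs with p ≠ q are counted. None of these choices is spelled out on the slide, and the function names are hypothetical.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def ripley_k(points, lam_km, area):
    """K(lambda) = area * (number of point pairs within lam_km) / |Q_w|^2."""
    n = len(points)
    if n == 0:
        raise ValueError("word has no geotagged occurrences")
    close = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and haversine_km(p, q) <= lam_km
    )
    return area * close / n ** 2
```

The pair count is quadratic in the number of occurrences, so frequent words would need sampling or a spatial index in practice.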
Experimental Setup
Datasets:
- North America dataset (NA, Roller et al. [2012]): 500K users, 38M tweets
- World dataset (WORLD, Han et al. [2012]): 1.4M users, 192M tweets
Evaluation metrics:
- Accuracy (Acc)
- Accuracy within 161km, i.e. 100 miles (Acc@161), e.g., predicting Bilbao for a user in San Sebastian counts as correct
- Country-level accuracy (Acc@C)
- Median error distance
Source(s): Han et al. [2014]
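The four metrics can be computed as in the sketch below. It assumes each prediction and gold label is a (city, country, (lat, lon)) tuple and measures error as great-circle distance between city coordinates; the tuple layout and function names are illustrative assumptions.

```python
import math
from statistics import median


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def evaluate(pred, gold):
    """pred, gold: equal-length lists of (city, country, (lat, lon)) tuples."""
    dists = [haversine_km(p[2], g[2]) for p, g in zip(pred, gold)]
    n = len(gold)
    return {
        "Acc": sum(p[0] == g[0] for p, g in zip(pred, gold)) / n,
        "Acc@161": sum(d <= 161.0 for d in dists) / n,
        "Acc@C": sum(p[1] == g[1] for p, g in zip(pred, gold)) / n,
        "Median": median(dists),  # median error distance in km
    }
```

Acc@161 is the more forgiving metric: a near miss such as a neighbouring city counts as correct, whereas Acc requires the exact city label.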
Experimental Parameters
Many variables and parameters to explore, including:
- feature set: all tokens vs. LIWs (vs. L2 regularisation)
- learner: multinomial naive Bayes (NB), Kullback-Leibler divergence (KL), logistic regression (LR)
- location representation: City, k-d tree partitioned earth grid [Roller et al., 2012]
- language: English only, multilingual
- data: geotagged data, geotagged and non-geotagged data, metadata
Source(s): Han et al. [2014]
Multinomial Naive Bayes
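The basic formulation for multinomial NB is:

\[
P(D \mid c_i) = \prod_{j=1}^{|V|} \frac{P(t_j \mid c_i)^{N_{D,t_j}}}{N_{D,t_j}!}
\]

where $N_{D,t_j}$ is the frequency of the $j$th term in $D$, $V$ is the set of all terms, and (with $\mathcal{D}$ the set of training documents):

\[
P(t \mid c_i) = \frac{1 + \sum_{k=1}^{|\mathcal{D}|} N_{k,t}\, P(c_i \mid D_k)}{|V| + \sum_{j=1}^{|V|} \sum_{k=1}^{|\mathcal{D}|} N_{k,t_j}\, P(c_i \mid D_k)}
\]

In practice, use addition of log-likelihoods rather than product of likelihoods
Source(s): McCallum and Nigam [1998]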
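The sketch below instantiates these formulas for hard training labels, i.e. $P(c_i \mid D_k)$ is 1 for the labelled city and 0 otherwise. It sums log-likelihoods as suggested, drops the factorial term (it is the same for every class), ignores out-of-vocabulary tokens and uses no class prior; the last two are simplifying assumptions, and the function names and toy city labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, city) pairs, one aggregated document per user."""
    vocab = {t for tokens, _ in docs for t in tokens}
    counts = defaultdict(Counter)
    for tokens, city in docs:
        counts[city].update(tokens)
    log_p = {}
    for city, c in counts.items():
        # Add-one smoothing: (1 + N_t) / (|V| + total term count in the class).
        denom = len(vocab) + sum(c.values())
        log_p[city] = {t: math.log((1 + c[t]) / denom) for t in vocab}
    return log_p


def predict_nb(log_p, tokens):
    """Return the city with the highest summed log-likelihood for the tokens."""
    def score(city):
        return sum(log_p[city][t] for t in tokens if t in log_p[city])
    return max(log_p, key=score)
```

Repeated tokens contribute once per occurrence, which matches the exponent $N_{D,t_j}$ in the likelihood.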
Experimental Parameters
Many variables and parameters to explore, including:
- feature set: all tokens vs. LIWs
- learner: multinomial naive Bayes (NB), Kullback-Leibler divergence (KL), maximum entropy (ME)
- location representation: City, k-d tree partitioned earth grid [Roller et al., 2012]
- language: English only, multilingual
- data: geotagged data, geotagged and non-geotagged data, metadata
Source(s): Han et al. [2014]
Results using LIWs (NB, NA)
Features    Acc    Acc@161  Acc@C  Median (km)
Most Freq.  0.003  0.062    0.947  3089
Full        0.171  0.308    0.831  571
CHI         0.233  0.402    0.850  385
MaxCHI      0.238  0.412    0.848  356
LOGLIKE     0.191  0.343    0.836  489
IG          0.184  0.336    0.838  491
IGR         0.260  0.450    0.811  260
MEW         0.183  0.326    0.836  520
ICF         0.209  0.359    0.841  533
GEO         0.188  0.336    0.834  491
Ripley      0.236  0.432    0.849  306
Source(s): Han et al. [2014]
Models and Location Representation (NA)
Partition  Method  Acc    Acc@161  Acc@C  Median (km)
k-d tree   KL      0.117  0.344    –      469
k-d tree   KL+IGR  0.161  0.437    –      273
k-d tree   NB      0.122  0.367    –      404
k-d tree   NB+IGR  0.153  0.432    –      280
City       NB      0.171  0.308    0.831  571
City       NB+IGR  0.260  0.450    0.811  260
City       ME      0.129  0.232    0.756  878
City       ME+IGR  0.229  0.406    0.842  369
* Acc is not comparable between different class representations
Source(s): Han et al. [2014]
Models and Location Representation (WORLD)

Partition  Method  Acc    Acc@161  Acc@C  Median (km)
City       NB      0.081  0.200    0.807  886
City       NB+IGR  0.126  0.262    0.684  913
KD-tree    KL      0.116  0.283    –      564
KD-tree    KL+IGR  0.121  0.286    –      602
KD-tree    NB      0.119  0.289    –      553
KD-tree    NB+IGR  0.134  0.290    –      577

Summary:
- Feature selection improves geolocation prediction accuracy
- Less impact of model and location representation choice than NA
Source(s): Han et al. [2014]
Adding Non-geotagged Data
In addition to the geotagged tweets from each user, we often have non-geotagged tweets, which we can potentially use to expand the training/test user representation (G = geotagged, NG = non-geotagged)

Train  Test      Acc    Acc@161  Acc@C  Median (km)
G      G         0.126  0.262    0.684  913
G+NG   G         0.170  0.323    0.733  615
G      G+NG      0.187  0.366    0.835  398
G+NG   G+NG      0.280  0.492    0.878  170
G      G-small   0.121  0.258    0.675  960
G      NG-small  0.114  0.248    0.666  1057

Incorporating NG improves the prediction accuracy
The difference between G-small and NG-small is minor
Source(s): Han et al. [2014]
Exploration of Language Influence
All results to date on English data; some languages highly predictive of location (e.g. Japanese, Finnish)
Investigate interaction between language and geolocation accuracy:
- Partition: city
- Learner: multinomial naive Bayes
- Training: IGR on geotagged multilingual data

Method                         Acc    Acc@161  Acc@C  Median (km)
Per-language majority classes  0.107  0.189    0.693  2805
Unified multilingual model     0.196  0.343    0.772  466
Monolingual partitioned model  0.255  0.425    0.802  302

Table: WORLD in multilingual settings.

Language is a good indicator of location (EN hard!)
Source(s): Han et al. [2014]
User Metadata in Tweets
Examples of user-declared location in public profile:
- Calgary, Alberta
- Iowa, USA
- heat of Arizona
- north east side of indy
- -iN A Veryy Dope Place (:
- hugging my big sister
Examples of user-declared real names in public profile:
- Michael Jordan
- Yuji Matsumoto
- Hinrich Schutze
Train NB classifier for each of user-declared location, timezone, self-description, registered real name
Source(s): Han et al. [2014]
Exploration of User Metadata (WORLD)
Classifier  Acc    Acc@161  Acc@C  Median (km)
text        0.280  0.492    0.878  170
loc         0.405  0.525    0.834  92
tz          0.064  0.171    0.565  1330
desc        0.048  0.117    0.526  2907
rname       0.045  0.109    0.550  2611
Source(s): Han et al. [2014]
Stacking Metadata Classifiers
[Figure: Stacking-based Geolocation Prediction. Level-0 classifiers TEXT, LOC, TZ, DESC and RNAME (over tweet text, location, time zone, description and real name, respectively) produce level-0 predictions, which are fed to a level-1 logistic regression classifier that outputs the final prediction.]

Features       Acc    Acc@161  Acc@C  Median (km)
0. text        0.280  0.492    0.878  170
1. 0. + loc    0.483  0.653    0.903  14
2. 1. + tz     0.490  0.665    0.917  9
3. 2. + desc   0.490  0.666    0.919  9
4. 3. + rname  0.491  0.667    0.919  9
Source(s): Han et al. [2014]
Temporal Influence I
Can a model trained on “old” data generalise to “new” data?

WORLD: 10K time-homogeneous users
Features       Acc    Acc@161  Acc@C  Median (km)
1. text        0.280  0.492    0.878  170
2. loc         0.405  0.525    0.834  92
3. tz          0.064  0.171    0.565  1330
1. + 2. + 3.   0.490  0.665    0.917  9

LIVE: 32K time-heterogeneous users
Features       Acc    Acc@161  Acc@C  Median (km)
1. text        0.268  0.510    0.901  151
2. loc         0.326  0.465    0.813  306
3. tz          0.065  0.160    0.525  1529
1. + 2. + 3.   0.406  0.614    0.901  40
Temporal Influence II
Similar effect recently observed by Dredze et al. [2016] at the message-level, with a drop in message-level accuracy of 19% a week after the model was trained (to the level of a model trained on two orders of magnitude less data(!))
... but by incrementally re-training a model trained on 1% of geotagged tweets, accuracy greater than that of a statically-trained model can be achieved within 20 days (online training more important than the volume of static training data)
Prediction Confidence I
More geolocatable user:
- Porting my mobile to Telstra is a brilliant idea, #vodafail
- @USER1 @USER2 @USER3 actually Kevin Rudd also has an active weibo account.
- @USER good memory, I can hardly remember the day I came to Melbourne.
Less geolocatable user:
- happy birthday to me
- i just finished my hw, oooh, too much
- Yes! all things are diffcult before they re easy
Prediction Confidence II
Rank users by confidence: probability (AP), probability ratio of 1st and 2nd prediction (PR), geo-proximity in top-10 predictions (PC), accumulated counts (FN) and weights (FW) of optimised features.
Prediction Confidence III
[Figure: Acc@161 (y-axis) against Recall (x-axis) for each confidence measure: Absolute Probability (AP), Prediction Coherence (PC), Prediction Ratio (PR), Feature Number (FN), Feature Weight (FW)]
Source(s): Han et al. [2014]
Inter-document Graphs
Various possible sources of inter-document graphs:
- explicit inter-document interactions (e.g. mentions)
- explicit author-level interactions (e.g. following)
- implicit inter-document similarity (e.g. document overlap/similarity)
Various possibilities for graph semantics:
- directed vs. undirected
- weighted vs. unweighted
- single vs. multiple graphs
Network Analytics 101
One of the core concepts in network analytics is homophily: the tendency of individuals to associate and bond with similar others
- the corollary for network analytics is that strongly connected subgraphs tend to share the same label
- obvious analogies in clustering and classification, with the main difference being the presence/absence of an explicit graph
Sometimes connections actually represent heterophily, esp. in adversarial contexts such as debates
Approaches to Network Inference I
Popular approaches to network inference:
- label propagation: nearest neighbour-style iterative semi-supervised approach
- collective classification: combine base and network classifiers to optimise consistency in the network
- matrix factorisation: factorise the matrix into a product of lower-dimensional matrices
Label Propagation I
Given a graph $G = (V, E, W)$, where $V$ is the set of nodes with $|V| = n = n_l + n_u$ ($n_l$ nodes are labelled and $n_u$ nodes are unlabelled), $E$ is the set of edges, and $W$ is an edge weight matrix.

Simple iterative algorithm [Zhu and Ghahramani, 2002]:
1. for each node $u_u^{(i)} \in V_u$, get the set of labelled neighbours based on $E$, and label $u_u^{(i)}$ based on the (weighted) mean/median of the neighbours
2. repeat until convergence
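A minimal sketch of the two-step loop for geolocation, assuming labels are (lat, lon) pairs, the graph is undirected with positive edge weights, and the weighted mean is used rather than the median. The function name, the dictionary-based graph encoding and the stopping tolerance are all illustrative assumptions.

```python
def label_propagation(edges, seeds, max_iters=100, tol=1e-6):
    """edges: {(u, v): weight} for an undirected graph (positive weights).
    seeds: {node: (lat, lon)} for the a-priori labelled nodes."""
    nbrs = {}
    for (u, v), w in edges.items():
        nbrs.setdefault(u, {})[v] = w
        nbrs.setdefault(v, {})[u] = w
    labels = dict(seeds)
    for _ in range(max_iters):
        updated = dict(labels)
        change = 0.0
        for node in nbrs:
            if node in seeds:
                continue  # a-priori labels stay fixed
            # Step 1: neighbours that currently carry a label.
            known = [(labels[n], w) for n, w in nbrs[node].items() if n in labels]
            if not known:
                continue
            total = sum(w for _, w in known)
            new = tuple(sum(lab[d] * w for lab, w in known) / total for d in range(2))
            old = labels.get(node)
            if old is None:
                change = float("inf")
            else:
                change = max(change, max(abs(a - b) for a, b in zip(new, old)))
            updated[node] = new
        labels = updated
        # Step 2: repeat until no label moves by more than tol.
        if change < tol:
            break
    return labels
```

Averaging latitude and longitude directly is a simplification that breaks near the date line; taking the per-coordinate median, as the slide also allows, is more robust to outlying neighbours.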
Label Propagation II
Modified Adsorption [Talukdar and Crammer, 2009]:
- Intuition: want a semi-supervised method that:
    (a) predicts labels for a-priori labelled vertices ($u_l \in V_l$) as close as possible to the original;
    (b) labels proximate vertices similarly; and
    (c) generates an output that is as uninformative as possible (à la logistic regression)
Label Propagation III
- Formally, we capture each of these with a dedicated term:
    (a) $(Y_l - \hat{Y}_l)^T S (Y_l - \hat{Y}_l)$
    (b) $\hat{Y}_l^T L \hat{Y}_l$
    (c) $\|\hat{Y}_l - R_l\|^2$
  where $Y_l$ and $\hat{Y}_l$ are the columns of the a-priori label matrix $Y$ and predicted label matrix $\hat{Y}$, respectively, associated with label $l$; $S$ is a diagonal matrix used to identify the a-priori labelled vertices; $L$ is the Laplacian of an undirected graph derived from $G$; and $R_l$ is the $l$th column of regularisation matrix $R$ of dimensions $n \times (m+1)$, set to zero for all but a unique “dummy” label.
Label Propagation IV
- These are combined as follows:
  $$C(\hat{Y}) = \sum_l \left[ \mu_1 (Y_l - \hat{Y}_l)^{\top} S (Y_l - \hat{Y}_l) + \mu_2 \hat{Y}_l^{\top} L \hat{Y}_l + \mu_3 \lVert \hat{Y}_l - R_l \rVert^2 \right]$$
  where $\mu_1$, $\mu_2$ and $\mu_3$ are hyperparameters
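To make the three terms concrete, the objective can be evaluated directly for a toy graph. The sketch below only computes the cost for a given predicted label matrix; it does not perform the optimisation. It assumes plain list-of-lists matrices and, as a simplification, takes the Laplacian to be degree matrix minus weight matrix. The helper names (`laplacian`, `quad_form`, `mad_objective`) are hypothetical.

```python
def laplacian(W):
    """Unnormalised Laplacian D - W of a symmetric weight matrix (assumed form)."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quad_form(x, M, y):
    """Return x^T M y for list-based vectors and matrices."""
    return sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def column(M, l):
    return [row[l] for row in M]

def mad_objective(Y, Y_hat, S, L, R, mu1, mu2, mu3):
    """Sum over labels of the fit, smoothness and regularisation terms."""
    total = 0.0
    for l in range(len(Y[0])):
        y, y_hat, r = column(Y, l), column(Y_hat, l), column(R, l)
        diff = [a - b for a, b in zip(y, y_hat)]
        fit = quad_form(diff, S, diff)                      # term (a)
        smooth = quad_form(y_hat, L, y_hat)                 # term (b)
        prior = sum((a - b) ** 2 for a, b in zip(y_hat, r)) # term (c)
        total += mu1 * fit + mu2 * smooth + mu3 * prior
    return total

# Toy setup: two connected nodes, node 0 labelled; column 1 is the dummy label.
W = [[0.0, 1.0], [1.0, 0.0]]
Y = [[1.0, 0.0], [0.0, 0.0]]
S = [[1.0, 0.0], [0.0, 0.0]]
R = [[0.0, 0.5], [0.0, 0.5]]
cost = mad_objective(Y, Y, S, laplacian(W), R, mu1=1.0, mu2=0.5, mu3=0.1)
```

With the prediction set equal to the a-priori labels, the fit term vanishes and the cost comes entirely from the smoothness and regularisation terms, which shows how the hyperparameters trade the three goals off against each other.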
Collective Classification I
Collective classification: given a network and an object o in the network, use (up to) three types of correlations to infer a label for o:
1. the correlations between the label of o and its observed attributes
2. the correlations between the label of o and the observed attributes and labels of nodes connected to o
3. the correlations between the label of o and the unobserved labels of objects connected to o
Source(s): Sen et al. [2008]
Collective Classification II
Formally, collective classification takes a graph, made up of:
- nodes $V = \{V_1, \ldots, V_n\}$
- edges $E$
The task is to label the nodes $V_i \in V$ from a label set $L = \{L_1, \ldots, L_q\}$, making use of the graph in the form of a neighborhood function $N = \{N_1, \ldots, N_n\}$, where $N_i \subseteq V \setminus \{V_i\}$.
Approaches to Collective Classification I
Two general approaches to capturing the first two correlations:
- iterative classification: bootstrap node labels with a content-only classifier and generate a random ordering over nodes $V$, then iteratively update the estimate of $v_i$ based on the current $N_i$, and update $\vec{a}_i$ accordingly [local approach]
- dual classifier + graph inference: train separate content-only and link classifiers, and use graph inference (mean field, loopy belief propagation, min-cut, etc.) to “smooth” the predictions over the graph [global approach]
Source(s): Sen et al. [2008]
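The local (iterative classification) approach can be sketched as follows. This is an illustration under strong assumptions, not the algorithm as specified by Sen et al.: the content-only classifier is represented by precomputed per-label scores, the relational features are simply the proportions of neighbour labels, and the two are combined by addition with a hypothetical `link_weight`. A fresh random ordering is drawn in each round.

```python
import random
from collections import Counter

def iterative_classification(content_scores, neighbours, n_rounds=5,
                             link_weight=1.0, seed=0):
    """content_scores: node -> {label: score}; neighbours: node -> iterable of nodes."""
    rng = random.Random(seed)
    # Bootstrap every node with its content-only prediction.
    labels = {v: max(s, key=s.get) for v, s in content_scores.items()}
    order = list(content_scores)
    for _ in range(n_rounds):
        rng.shuffle(order)  # random ordering over the nodes
        changed = False
        for v in order:
            # Relational features: label proportions among current neighbours.
            votes = Counter(labels[u] for u in neighbours.get(v, ()) if u in labels)
            n = sum(votes.values())
            combined = {
                lab: score + (link_weight * votes[lab] / n if n else 0.0)
                for lab, score in content_scores[v].items()
            }
            best = max(combined, key=combined.get)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break  # labels are stable, so stop early
    return labels
```

Because each update sees the most recent labels of the neighbours, a node with weak content evidence can be pulled towards the label shared by its neighbourhood.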
User Geolocation: Enter the Network I
The easiest way to generate a network for Twitter user geolocation is via @user mentions (e.g. @eltimster lovin the talk)
Question of what to do with user mentions outside the training/dev/test data sample
(one) solution = collapse edges through out-of-network nodes into direct edges
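The collapsing step can be illustrated with a short sketch. It assumes an unweighted, undirected graph, represents each edge as a `frozenset` of two user names, and links every pair of in-sample users that mention (or are mentioned by) the same out-of-sample account. The function name `collapse_mentions` is hypothetical and the exact procedure used in the cited work may differ.

```python
from itertools import combinations

def collapse_mentions(mention_edges, in_sample):
    """Replace paths through out-of-sample nodes with direct in-sample edges."""
    direct = set()
    via = {}  # out-of-sample node -> in-sample users connected to it
    for a, b in mention_edges:
        if a == b:
            continue  # ignore self-mentions
        a_in, b_in = a in in_sample, b in in_sample
        if a_in and b_in:
            direct.add(frozenset((a, b)))
        elif a_in:
            via.setdefault(b, set()).add(a)
        elif b_in:
            via.setdefault(a, set()).add(b)
        # edges between two out-of-sample nodes are dropped
    for users in via.values():
        for u, v in combinations(sorted(users), 2):
            direct.add(frozenset((u, v)))
    return direct
```

Note that a heavily mentioned out-of-sample account produces an edge between every pair of users connected to it, so the number of collapsed edges grows quadratically with its popularity.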
User Geolocation: Enter the Network II
[Figure: two panels. “@-mention Network”: train nodes tr1–tr3, test nodes te1–te5 and mentioned nodes m1–m3. “Collapsed Network plus Dongle Nodes”: train nodes tr1–tr3, test nodes te1–te5 and dongle nodes d1–d5. Legend: tr_i = train node, m_i = mentioned node, te_i = test node, d_i = dongle node.]
Weighted, directed graph the most obvious approach, but I present results for unweighted, undirected graphs in this talk
User Geolocation: Enter the Network III
First, results for the simple network-based methods:
- label propagation (LP)
- modified adsorption (MAD)
[Figure: bar chart of Acc@161. Text only: 0.39; LP: 0.52; MAD: 0.50.]
Source(s): Jurgens [2013], Rahimi et al. [2015a,b]
User Geolocation: “Celebrity Nodes” I
For the larger datasets (e.g. Twitter-World) we observe: (a) MAD doesn’t scale; and (b) highly-connected nodes bias the output heavily
Based on this observation, we automatically detect and remove highly-connected (“celebrity”) nodes:
[Figure: mean error (in km; left axis, 700 to 860) and graph size (# edges; right axis, log scale from 10^5 to 10^9) plotted against the celebrity threshold T (# of mentions; 2, 5, 15, 50, 500, 5k).]
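A thresholded removal of this kind might look like the sketch below. It assumes that a node counts as a celebrity when it has more than `threshold` distinct neighbours in an undirected edge list; the slides define T in terms of the number of mentions, so this degree-based reading and the function name `remove_celebrities` are assumptions.

```python
from collections import defaultdict

def remove_celebrities(edges, threshold):
    """Drop every node with more than `threshold` distinct neighbours."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    celebs = {node for node, ns in nbrs.items() if len(ns) > threshold}
    # Keep only edges whose endpoints both survive the filter.
    kept = [(u, v) for u, v in edges if u not in celebs and v not in celebs]
    return kept, celebs
```

Lowering the threshold removes more nodes and shrinks the graph, which is the trade-off between graph size and mean error that the figure plots.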
User Geolocation: “Celebrity Nodes” II
The removal of celebrity nodes empirically boosts both LP and MAD (over all datasets):
[Figure: bar chart of Acc@161 with (+CEL) and without (−CEL) celebrity nodes. LP: +CEL 0.52, −CEL 0.54; MAD: +CEL 0.50, −CEL 0.56.]
User Geolocation: How to Combine the Text and the Network? I
The easiest way to integrate graph- and text-based classification is to use the text as a source of priors
Approach 1: use pointwise text-based user priors as backoff for disconnected nodes [post-processing] [Rahimi et al., 2015b]
Approach 2: use pointwise text-based user priors as priors for all unlabelled nodes [pre-processing] [Rahimi et al., 2015a]
- incorporate the priors directly on the nodes
- incorporate the priors as “dongle” nodes (uniquely) connected to a given user
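Both approaches amount to small pieces of glue code around the network model. The sketch below assumes that predictions are dictionaries from user to location label and that a user with no network prediction is simply absent from `network_pred`; the function names, the edge-list format and the `("dongle", user)` node naming are all hypothetical.

```python
def backoff_to_text(network_pred, text_pred):
    """Approach 1: keep network predictions, fall back to text where none exists."""
    return {user: network_pred.get(user, text_pred[user]) for user in text_pred}

def add_dongle_nodes(edges, text_pred, seed_labels):
    """Approach 2 (dongle variant): attach one labelled dongle per unlabelled user."""
    new_edges = list(edges)
    new_seeds = dict(seed_labels)
    for user, label in text_pred.items():
        if user in seed_labels:
            continue  # training users already carry a gold label
        dongle = ("dongle", user)
        new_edges.append((user, dongle))  # the dongle's only edge
        new_seeds[dongle] = label         # text prior acts as a fixed label
    return new_edges, new_seeds
```

The backoff variant leaves the graph untouched and only patches the output, whereas the dongle variant changes the input graph so that the text prior takes part in propagation as one more labelled neighbour.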
User Geolocation: How to Combine the Text and the Network? II
[Figure: two panels. “@-mention Network”: train nodes tr1–tr3, test nodes te1–te5 and mentioned nodes m1–m3. “Collapsed Network plus Dongle Nodes”: train nodes tr1–tr3, test nodes te1–te5 and dongle nodes d1–d5. Legend: tr_i = train node, m_i = mentioned node, te_i = test node, d_i = dongle node.]
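User Geolocation: How to Combine the Text and the Network? III
[Figure: bar chart of Acc@161 for LP and MAD without text priors (−Text) and with text priors added via backoff (+Backoff), directly on the nodes (+Direct) or as dongle nodes (+Dongle). LP: −Text 0.54, +Backoff 0.55, +Direct 0.43, +Dongle 0.53; MAD: −Text 0.56, +Backoff 0.59, +Direct 0.49, +Dongle 0.59.]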
User Geolocation: Findings
Network-only results generally better than text-only results (Text only < LP < MAD)
Both network-based methods improve with the incorporation of text-based user priors (with dongle nodes or simple backoff being the most effective way of integrating the two)
In terms of computational efficiency, LP > Text only ≫ MAD
Removal of highly-connected nodes leads to greater tractability for MAD, and more accurate results for both methods
The Road Ahead
Network features highly effective in user geolocation (more so than text features); much more work to be done in combining the two
Message-level geolocation still very much an unsolved task
How to keep the model temporally-relevant?
Interaction between lexical normalisation and geolocation
Summary
User geolocation: supervised text-based multi-class classification problem
Location Indicative Words improve model effectiveness and efficiency
Model choice and location partitions are less crucial than feature selection
Adding non-geotagged data, identifying languages and incorporating user metadata all improve the prediction accuracy
Network-based models more accurate than text-based models, and small gains possible in combining network-based models with text-based geolocation priors
A Plug!
WNUT 2016 Shared Task on User and Message Geolocation
If you are interested in this space, we are running a “shared task” on user and message geolocation as part of The 2nd Workshop on Noisy User-generated Text (W-NUT):
http://noisy-text.github.io/2016/
... and Another Plug!
New Python toolkit for geolocation research
We have just released a new toolkit for geolocation research (text- and network-based models):
https://github.com/afshinrahimi/pigeo
Twitter POS Tagging
How is POS tagging for social media data (focusing on Twitter) different to POS tagging for any other text source?
- deterministically taggable tokens (URLs, emoticons)
- higher proportion of OOV words → lexical normalisation, beef-up novel word handling rules, add word cluster information
- lower reliability of casing → add more gazetteers
- lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)
- some POS tag distinctions hard to make in social media → possibly tweak POS tagset to remove certain distinctions (and add others)
Source(s): Gimpel et al. [2011], Derczynski et al. [2013], Owoputi et al. [2013]
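As an illustration of the first point, tokens with an unambiguous surface form can be tagged by pattern before any statistical model is applied. The sketch below is a simplification: the regular expressions are deliberately narrow and hypothetical, the function name `pretag` is invented for this example, and the tags `U` (URL), `@` (mention) and `E` (emoticon) follow the CMU tagset of Gimpel et al. [2011].

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
MENTION_RE = re.compile(r"^@\w+$")
EMOTICON_RE = re.compile(r"^[:;=][-']?[()\[\]DPp/\\|]$")

def pretag(tokens):
    """Return a CMU-style tag for pattern-matched tokens, None for the rest."""
    tags = []
    for tok in tokens:
        if URL_RE.match(tok):
            tags.append("U")
        elif MENTION_RE.match(tok):
            tags.append("@")
        elif EMOTICON_RE.match(tok):
            tags.append("E")
        else:
            tags.append(None)  # left for the statistical tagger
    return tags
```

Tokens that come back as `None` are the ones the tagger proper still has to resolve; hashtags are left out on purpose because they can also function as ordinary words.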
Penn ↔ CMU POS Tagset
Penn POS tag(s)           CMU POS tag
NN, NNS                   N
PRP, WP                   O
NNP, NNPS                 ^
MD, V*                    V
J*                        A
RB, WRB                   R
UH                        !
WDT, DT, WP$, PRP$        D
IN, TO                    P
CC                        &
RP                        T
EX, PDT                   X
CD                        $
(none)                    # (hashtag)
(none)                    @ (mention)
(none)                    U (URL)
...
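The correspondences listed on this slide can be written down as a lookup, which is handy when comparing the output of a Penn-style tagger against CMU-style annotations. The sketch covers only the rows shown above (the slide elides the rest), and the names `PENN_TO_CMU` and `penn_to_cmu` are hypothetical.

```python
# Exact-match rows from the slide; wildcard rows (V*, J*) are handled below.
PENN_TO_CMU = {
    "NN": "N", "NNS": "N",
    "PRP": "O", "WP": "O",
    "NNP": "^", "NNPS": "^",
    "MD": "V",
    "RB": "R", "WRB": "R",
    "UH": "!",
    "WDT": "D", "DT": "D", "WP$": "D", "PRP$": "D",
    "IN": "P", "TO": "P",
    "CC": "&",
    "RP": "T",
    "EX": "X", "PDT": "X",
    "CD": "$",
}

def penn_to_cmu(tag):
    """Map a Penn tag to its CMU tag, or None if the slide lists no mapping."""
    if tag in PENN_TO_CMU:
        return PENN_TO_CMU[tag]
    if tag.startswith("V"):
        return "V"  # V* row: all verb tags
    if tag.startswith("J"):
        return "A"  # J* row: all adjective tags
    return None
```

The mapping is many-to-one, so it can coarsen Penn tags into CMU tags but cannot be inverted.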
Example POS Tag Assignment
Example
ikr/! smh/G he/O asked/V fir/P yo/D last/A name/N so/P he/O can/V add/V u/O on/P fb/^ lololol/!
Approach of Owoputi et al. [2013]
Model: first-order maximum entropy Markov model (MEMM)
Features:
- word cluster features (1000 Brown clusters), pre-trained on uniformly-distributed sample of tweets from 4 year time period (100K tweets per day): cluster membership + cluster prefix + extra tokenisation to match OOV tokens to clusters
- careful tokenisation of emoji
- tag dictionary (most frequent POS tag)
- gazetteer match tag for names, locations, etc.
Source(s): Owoputi et al. [2013]
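The cluster membership and cluster prefix features can be illustrated with a small sketch. Brown clustering assigns each word a bit-string path in a binary hierarchy, so prefixes of that path give progressively coarser clusters. The bit strings in the usage note, the prefix lengths, the lowercasing step and the feature-name format are all assumptions made for this example, not details taken from the slides.

```python
def cluster_features(word, word_to_path, prefix_lengths=(2, 4, 6, 8)):
    """Return cluster membership and prefix features for a word.

    word_to_path: dict mapping lowercased word -> Brown cluster bit string
    """
    path = word_to_path.get(word.lower())
    if path is None:
        return []  # OOV with respect to the clustering
    feats = ["cluster=" + path]
    feats += ["cluster_prefix%d=%s" % (k, path[:k])
              for k in prefix_lengths if k <= len(path)]
    return feats
```

For example, with hypothetical paths `{"lol": "11100101", "lololol": "1110010110"}`, the two spelling variants share every prefix feature up to length 8, which is how cluster prefixes let the tagger generalise across spelling variation.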
Other Notable Approaches
Possible to use tweets with URLs to edited documents to automatically generate (very good) silver-standard annotations based on the document-based POS tag distribution [Plank et al., 2014]; also applicable to NER
Domain adapt Stanford CoreNLP POS tagger (based on the Penn POS tagset) by: (a) lexical normalisation; (b) adding tag probabilities; (c) adding a gazetteer; and (d) generating extra silver-standard training through cross-tagger agreement [Derczynski et al., 2013]
Dependency Parsing: What is it?
Dependency parsing = determine the syntax of an input in terms of the binary (labelled, directed) links between words and the words that directly govern them
Example (from Kong et al. [2014])
[Figure: dependency analysis of the tweet “OMG I ♥ the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber”, annotated for multiple roots (ROOT), coordination (COORD), multiword expressions (MWE) and noun phrase internal structure.]
Features: joint sentence tokenisation and parsing; treat multiword expressions (MWEs) as atomic; don’t explicitly include dependency links for punctuation; explicitly disambiguate the internal structure of noun phrases
Source(s): Kong et al. [2014]
Twitter Dependency Parsing
How is dependency parsing for social media data different to dependency parsing for any other text source?
- dependency parser carries out sentence tokenisation hand in hand with syntactic disambiguation
- allow “orphan” tokens which are not part of the syntactic structure of the message (e.g. non-syntactic hashtags)
- lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)
Source(s): Kong et al. [2014]
Twitter Dependency Parsing Approach of Kong et al. [2014]
Model: TurboParser (integer linear program with multiple passes of dual decomposition, to include second- and third-order features: Martins et al. [2010])
Features:
- add extra arcs to capture tokens that are not part of the dependency
- Brown clusters
- stacking-style interpretation of Penn Treebank features
- POS tagger
Named Entity Recognition: What is it?
Named entity recognition (“NER”) = the process of (uniquely) identifying token sequences which denote anchored entities with encyclopaedic knowledge associated with them (≈ proper names), and the type of each entity relative to a pre-determined class set (e.g. PERSON, LOCATION, ORGANISATION, ...):
Example
[PER Wolff] , currently a journalist in [LOC Argentina] , played with [PER Del Bosque] in the final years of the seventies in [ORG Real Madrid] .
Source(s): Tjong Kim Sang and De Meulder [2003]
Twitter Named Entity Recognition
How is NER for social media data different to NER for any other text source?
- NER (for English) relies heavily on casing information, which can be inconsistent in Twitter
- NER is largely a game of learning lexical priors/transition probabilities correctly, and inconsistent lexicalisation in Twitter complicates this
- NEs come and go on Twitter burstily, and there is also general temporal burstiness
- the standard NE label sets are not necessarily well suited to Twitter
- lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)
Source(s): Ritter et al. [2011], Baldwin et al. [2015]
Twitter NER: Success Stories
Generally speaking, things that work well for Twitter POS tagging also work well for Twitter NER (clustering/distributed representations, gazetteers, structured classifiers)
One very exciting result of the recent W-NUT 2015 Shared Task [Baldwin et al., 2015] was the demonstration of the (phenomenal) success of entity linking for NER [Yamada et al., 2015]
- NE linking = disambiguation of NEs relative to a knowledge base (Wikipedia, in this case)
- why does it help?
  - popularity of different NEs, NE “coherence”, context fit, ...
  - particularly effective for Twitter because of the limited textual context/lack of redundancy
  - rare instance of semantics helping an NLP task
Source(s): Ritter et al. [2011], Baldwin et al. [2015]
A(nother) Plug!
WNUT 2016 Shared Task on Named Entity Recognition
If you are interested in this space, we are running a “shared task” on named entity recognition as part of The 2nd Workshop on Noisy User-generated Text (W-NUT):
http://noisy-text.github.io/2016/
“Social” POS Tagging, Dependency Parsing and NER?
Little work on using the “social” nature of social media for POS tagging etc.; possibilities include:
- learning user-specific tag probabilities (e.g. style of use of user mentions and hashtags) [Hovy and Søgaard, 2015]
- constraining the model to ensure consistency between comments and retweeted content
- learning NE usage patterns across Twitter (e.g. neologistic NEs) and dynamically updating gazetteers etc.
Summary of POS Tagging, Parsing and NER
Adaptations of computational (morpho-)syntactic analysis to social media text largely focused on robustness + gazetteering ... relatively little use made of social features
References
Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. Shared tasks of the 2015 Workshop on Noisy User-generated Text: Twitter lexical normalization and named entity recognition. In Proceedings of the ACL 2015 Workshop on Noisy User-generated Text (W-NUT), pages 126–135, Beijing, China, 2015.
Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of RANLP 2013 (Recent Advances in Natural Language Processing), Hissar, Bulgaria, 2013.
Mark Dredze, Miles Osborne, and Prabhanjan Kambadur. Geolocation for Twitter: Timing matters. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1064–1069, 2016. URL http://aclweb.org/anthology/N16-1122.
Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 1277–1287, Cambridge, USA, 2010. URL http://www.aclweb.org/anthology/D10-1124.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 42–47, Portland, USA, 2011. URL http://www.aclweb.org/anthology/P11-2008.
Bo Han, Paul Cook, and Timothy Baldwin. Geolocation prediction in social media data by finding location indicative words. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1045–1062, Mumbai, India, 2012.
Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451–500, 2014.
Dirk Hovy and Anders Søgaard. Tagging performance correlates with author age. In Proceedings of the Joint conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 483–488, 2015. URL http://aclweb.org/anthology/P15-2079.
David Jurgens. That’s what friends are for: Inferring location in online social media platforms based on social relationships. In Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM 2013), pages 273–282, Dublin, Ireland, 2013.
Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1001–1012, Doha, Qatar, 2014.
Olivier Van Laere, Jonathan Quinn, Steven Schockaert, and Bart Dhoedt. Spatially-aware term selection for geotagging. IEEE Transactions on Knowledge and Data Engineering, 26(1):221–234, 2014.
Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar, and Mario Figueiredo. Turbo Parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 34–44, 2010. URL http://aclweb.org/anthology/D10-1004.
Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA, 1998. Available as Technical Report WS-98-05, AAAI Press.
Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 380–390, Atlanta, USA, 2013.
Barbara Plank, Dirk Hovy, Ryan McDonald, and Anders Søgaard. Adapting taggers to Twitter with not-so-distant supervision. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1783–1792, 2014. URL http://aclweb.org/anthology/C14-1168.
Reid Priedhorsky, Aron Culotta, and Sara Y. Del Valle. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW 2014), pages 1523–1536, Baltimore, USA, 2014.
Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. Twitter user geolocation using a unified text and network prediction model. In Proceedings of the Joint conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 630–636, Beijing, China, 2015a.
Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. Exploiting text and network context for geolocation of social media users. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2015), pages 1362–1367, Denver, USA, 2015b.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534, Edinburgh, UK, 2011.
Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2012 (EMNLP-CoNLL 2012), pages 1500–1510, Jeju Island, Korea, 2012. URL http://www.aclweb.org/anthology/D12-1137.
Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Proceedings of the European Conference on Machine Learning (ECML-PKDD) 2009, pages 442–457, Bled, Slovenia, 2009.
Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL-2003), pages 142–147, Edmonton, Canada, 2003. URL http://www.aclweb.org/anthology/W03-0419.pdf.
Ikuya Yamada, Hideaki Takeda, and Yoshiyasu Takefuji. Enhancing named entity recognition in Twitter messages using entity linking. In Proceedings of the Workshop on Noisy User-generated Text, pages 136–140, 2015. URL http://aclweb.org/anthology/W15-4320.
Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.