

Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks

Afshin Rahimi Timothy Baldwin Trevor Cohn

School of Computing and Information Systems

The University of Melbourne

[email protected]

{tbaldwin,t.cohn}@unimelb.edu.au

Abstract

We propose a method for embedding two-dimensional locations in a continuous vector space using a neural network-based model incorporating mixtures of Gaussian distributions, presenting two model variants for text-based geolocation and lexical dialectology. Evaluated over Twitter data, the proposed model outperforms conventional regression-based geolocation and provides a better estimate of uncertainty. We also show the effectiveness of the representation for predicting words from location in lexical dialectology, and evaluate it using the DARE dataset.

1 Introduction

Geolocation is an essential component of applications such as traffic monitoring (Emadi et al., 2017), human mobility pattern analysis (McNeill et al., 2016; Dredze et al., 2016) and disaster response (Ashktorab et al., 2014; Wakamiya et al., 2016), as well as targeted advertising (Anagnostopoulos et al., 2016) and local recommender systems (Ho et al., 2012). Although Twitter provides users with the means to geotag their messages, less than 1% of users opt to turn on geotagging, so third-party service providers tend to use profile data, text content and network information to infer the location of users. Text content is the most widely used source of geolocation information, due to its prevalence across social media services.

Text-based geolocation systems use the geographical bias of language to infer the location of a user or message using models trained on geotagged posts. The models often use a representation of text (e.g. based on a bag-of-words, convolutional or recurrent model) to predict the location either in real-valued latitude/longitude coordinate space or in discretised region-based space, using regression or classification, respectively. Regression models, as a consequence of minimising squared loss for a unimodal distribution, predict inputs with multiple targets to lie between the targets (e.g. a user who mentions content in both NYC and LA is predicted to be in the centre of the U.S.). Classification models, while eliminating this problem by predicting a more granular target, don't provide fine-grained predictions (e.g. specific locations in NYC), and also require heuristic discretisation of locations into regions (e.g. using clustering).

Mixture Density Networks ("MDNs": Bishop (1994)) alleviate these problems by representing location as a mixture of Gaussian distributions. Given a text input, an MDN can generate a mixture model in the form of a probability distribution over all location points. In the example of a user who talks about both NYC and LA, e.g., the model will predict a strong Gaussian component in NYC and another in LA, and also provide an estimate of uncertainty over the whole coordinate space.

Although MDNs are not new, they have not found widespread use in inverse regression problems, where multiple correct outputs are possible for a single input. Given that the integration of NLP technologies into devices (e.g. phones or robots with natural language interfaces) is growing quickly, there is a potential need for interfacing language with continuous variables as input or target. MDNs can also be used in general text regression problems such as risk assessment (Wang and Hua, 2014), sentiment analysis (Joshi et al., 2010) and loan rate prediction (Bitvai and Cohn, 2015), not only to improve prediction but also to use the mixture model as a representation for the continuous variables. We apply MDNs to geotagged Twitter data in two different settings: (a) predicting location given text; and (b) predicting text given location.



Geotagged text content is not only useful for geolocation, but can also be used in lexical dialectology. Lexical dialectology is (in part) the converse of text-based geolocation (Eisenstein, 2015): instead of predicting location from language, language (e.g. dialect terms) is predicted from a given location. This is a much more challenging task, as the lexical items are not known beforehand, and there is no notion of dialect regions in the continuous space of latitude/longitude coordinates. A lexical dialectology model should not only be able to predict dialect terms, but also be able to automatically learn dialect regions.

In this work, we use bivariate Gaussian mixtures over geotagged Twitter data in two different settings, and demonstrate their use for geolocation and lexical dialectology. Our contributions are as follows: (1) we propose a continuous representation of location using bivariate Gaussian mixtures; (2) we show that our geolocation model outperforms regression-based models and achieves comparable results with classification models, but with added uncertainty over the continuous output space; (3) we show that our lexical dialectology model is able to predict geographical dialect terms from latitude/longitude input with state-of-the-art accuracy; and (4) we show that the automatically learned Gaussian regions match expert-generated dialect regions of the U.S.[1]

[1] Code available at https://github.com/afshinrahimi/geomdn

2 Related Work

2.1 Text-based Geolocation

Text-based geolocation models are defined as either a regression or a classification problem. In regression geolocation, the model learns to predict a real-valued latitude/longitude from a text input. This is a very challenging task for data types such as Twitter, as they are often heavily biased toward population centres and urban areas, and far from uniform. As an example, Norwalk is the name of a few cities in the U.S., among which Norwalk, California (West Coast) and Norwalk, Connecticut (East Coast) are the two most populous. Assuming that occurrences of the city's name are almost equal in both city regions within the training set, a trained regression-based geolocation model, given Norwalk as input, would geolocate it to a point in the middle of the U.S. instead of choosing one of the cities. In the machine learning literature, regression problems where there are multiple real-valued outputs for a given input are called inverse problems (Bishop, 1994). Here, standard regression models predict an average point in the middle of all training target points to minimise squared error loss. Bishop (1994) proposes mixture density networks to model such inverse problems, as we discuss in detail in Section 3.

In addition, non-Bayesian interpretations of regression models, which are often used in practice, don't produce any prediction of uncertainty, so other than the predicted point, we have little idea where else the term could have high or low probability. Priedhorsky et al. (2014) propose a Gaussian Mixture Model (GMM) approach instead of squared loss regression, whereby they learn a mixture of bivariate Gaussian distributions for each individual n-gram in the training set. During prediction, they add the Gaussian mixtures of the n-grams in the input text, resulting in a new Gaussian mixture which can be used to predict a coordinate with associated uncertainty. To add the mixture components they use a weighted sum, where the weight of each n-gram is assigned by several heuristic features. Learning a GMM for each n-gram is resource-intensive if the size of the training set (and thus the number of n-grams) is large.

Assuming sufficient training samples containing the term Norwalk in the two main cities, a trained classification model would, given this term as input, predict a probability distribution over all regions, and assign higher probabilities to the regions containing the two major cities. The challenge, though, is that the coordinates in the training data must first be partitioned into regions using administrative regions (Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Han et al., 2012, 2014), a uniform grid (Serdyukov et al., 2009), or a clustering method such as a k-d tree (Wing and Baldridge, 2011) or K-means (Rahimi et al., to appear). The cluster/region labels can then be used as targets. Once we have a prediction about where a user is most likely to be from, there is no further information about the coordinates inside the predicted region. If a region that contains Wyoming is predicted as the home location of a user, we have no idea which city or county within Wyoming the user might be from, unless we retrain the model using a more fine-grained discretisation or a hierarchical discretisation (Wing and Baldridge, 2014), which is both time-consuming and challenging due to data sparseness.

2.2 Lexical Dialectology

The traditional linguistic approach to lexical dialectology is to find the geographical distributions of known contrast sets such as {you, yall, yinz} (Labov et al., 2005; Nerbonne et al., 2008; Goncalves and Sanchez, 2014; Doyle, 2014; Huang et al., 2015; Nguyen and Eisenstein, to appear). This usually involves surveying a large geographically-uniform sample of people from different locations and analysing where each known alternative is used more frequently. Then, the coordinates are clustered heuristically into dialect regions, based on the lexical choices of users in each region relative to the contrast set. This processing is very costly and time-consuming, and relies critically on knowing the lexical alternatives a priori. For example, it would require a priori knowledge of the fact that people in different regions of the US use pop and soda to refer to the same type of drink, and a posteriori analysis of the empirical geographical distribution of the different terms. Language, particularly in social media and among younger speakers, is evolving so quickly, in ways that can be measured over large-scale data samples such as Twitter, that we ideally want to be able to infer such contrast sets dynamically.

The first step in automatically collecting dialect words is to find terms that are disproportionately distributed in different locations. The two predominant approaches to this problem are model-based (Eisenstein et al., 2010; Ahmed et al., 2013; Eisenstein, 2015) and through the use of statistical metrics (Monroe et al., 2008; Cook et al., 2014). Model-based approaches use a topic model, e.g., to extract region-specific topics, and from this, predict the probability of seeing a word given a geographical region (Eisenstein et al., 2010). However, there are scalability issues, limiting the utility of such models.

In this paper, we propose a neural network architecture that learns a mixture of Gaussian distributions as its activation function, and predicts both locations from word-based inputs (geolocation), and words from location-based inputs (lexical dialectology).

3 Model

3.1 Bivariate Gaussian Distribution

A bivariate Gaussian distribution is a probability distribution over 2d space (in our case, a latitude/longitude coordinate pair). The probability density function is given by:

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}

where \mu is the 2-dimensional mean vector, \Sigma = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} is the covariance matrix, and |\Sigma| is its determinant. \sigma_1 and \sigma_2 are the standard deviations of the two dimensions, and \rho_{12} is their correlation. x is a latitude/longitude coordinate whose probability we are seeking to predict.
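
To make the notation concrete, the density is straightforward to evaluate directly from µ, σ1, σ2 and ρ. The following NumPy sketch is our illustration, not the authors' released code:

```python
import numpy as np

def bivariate_gaussian_pdf(x, mu, sigma1, sigma2, rho):
    """Density of a bivariate Gaussian at point x (e.g. a lat/lon pair).

    Parameters follow Section 3.1: mu is the 2d mean, sigma1/sigma2 are
    the per-dimension standard deviations, rho is their correlation.
    """
    cov = np.array([[sigma1 ** 2, rho * sigma1 * sigma2],
                    [rho * sigma1 * sigma2, sigma2 ** 2]])
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    exponent = -0.5 * diff @ np.linalg.solve(cov, diff)
    return norm * np.exp(exponent)

# e.g. the density of Melbourne's coordinates under a nearby component
p = bivariate_gaussian_pdf([-37.8, 144.9], mu=[-37.0, 145.0],
                           sigma1=1.0, sigma2=1.0, rho=0.1)
```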

3.2 Mixtures of Gaussians

A mixture of Gaussians is a probabilistic model that represents subpopulations within a global population as a weighted sum of K Gaussian distributions, where a higher weight for a component Gaussian indicates a stronger association with that component. The probability density function of a Gaussian mixture model is given by:

P(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where \sum_{k=1}^{K} \pi_k = 1, and the number of components K is a hyper-parameter.
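
Evaluating a mixture is then just a π-weighted sum of component densities. A minimal sketch, leaning on scipy.stats for the component evaluation (again our illustration, not the paper's code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, covs):
    """Mixture density at x: the pi-weighted sum of the K bivariate
    Gaussian component densities N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

# e.g. a two-component mixture with one component near NYC and one near LA
pis = [0.5, 0.5]
mus = [np.array([40.7, -74.0]), np.array([34.1, -118.2])]
covs = [np.eye(2), np.eye(2)]
p = gmm_pdf([40.0, -75.0], pis, mus, covs)
```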

3.3 Mixture Density Network (MDN)

A mixture density network ("MDN": Bishop (1994)) is a latent variable model where the conditional probability p(y|x) is modelled as a mixture of K Gaussians whose mixing coefficients \pi and Gaussian parameters \mu and \Sigma are computed as a function of the input by a neural network:

P(y \mid x) = \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}\big(y \mid \mu_k(x), \Sigma_k(x)\big)

In the bivariate case of latitude/longitude, each Gaussian component has 6 parameters (\pi_k(x), \mu_{1k}(x), \mu_{2k}(x), \rho_k(x), \sigma_{1k}(x), \sigma_{2k}(x)), which are learnt in the output layer of a regular neural network as a function of the input x. The output size of the network for K components is therefore 6 \times K. The output of an MDN for N samples (e.g. where N is the mini-batch size) is an N \times 6K matrix, which is then sliced and reshaped into (N \times 2 \times K), (N \times 2 \times K), (N \times 1 \times K) and (N \times 1 \times K) matrices, providing the model parameters \mu, \sigma, \rho and \pi. Each parameter type has its own constraints: \sigma (the standard deviation) must be positive, \rho (the correlation) must lie in the interval (-1, 1), and \pi must be positive and sum to one, as a probability distribution. To enforce these constraints, the following transformations are often applied to each parameter set:

\sigma = \mathrm{SoftPlus}(\sigma') = \log(\exp(\sigma') + 1) \in (0, +\infty)
\pi = \mathrm{SoftMax}(\pi')
\rho = \mathrm{SoftSign}(\rho') = \frac{\rho'}{1 + |\rho'|} \in (-1, 1)

As an alternative, it is possible to use transformations such as exp for \sigma and tanh for \rho. After applying the transformations to enforce the range constraints, the negative log-likelihood loss of each sample x given a 2d coordinate label y is computed as:

\mathcal{L}(y \mid x) = -\log \left\{ \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}\big(y \mid \mu_k(x), \Sigma_k(x)\big) \right\}
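
As a hedged sketch of this step (the slicing layout and all names here are our assumptions, not the released implementation): given the raw N × 6K network output, slice out the parameter blocks, apply the constraint transformations, and evaluate the negative log-likelihood:

```python
import numpy as np

def mdn_nll(raw, y, K):
    """Mean negative log-likelihood of 2d targets y under MDN outputs.

    raw: (N, 6K) unconstrained network outputs; we assume the layout
         [mu (2K) | sigma (2K) | rho (K) | pi (K)].
    y:   (N, 2) latitude/longitude targets.
    """
    N = raw.shape[0]
    mu = raw[:, :2 * K].reshape(N, 2, K)
    sigma = np.log1p(np.exp(raw[:, 2 * K:4 * K])).reshape(N, 2, K)  # SoftPlus
    rho_raw = raw[:, 4 * K:5 * K]
    rho = rho_raw / (1.0 + np.abs(rho_raw))            # SoftSign, in (-1, 1)
    pi_raw = raw[:, 5 * K:]
    pi = np.exp(pi_raw - pi_raw.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)                # SoftMax

    # Bivariate Gaussian density of each y under each of the K components.
    d1 = (y[:, 0:1] - mu[:, 0, :]) / sigma[:, 0, :]
    d2 = (y[:, 1:2] - mu[:, 1, :]) / sigma[:, 1, :]
    z = d1 ** 2 + d2 ** 2 - 2 * rho * d1 * d2
    norm = 2 * np.pi * sigma[:, 0, :] * sigma[:, 1, :] * np.sqrt(1 - rho ** 2)
    density = np.exp(-z / (2 * (1 - rho ** 2))) / norm  # shape (N, K)

    return -np.log((pi * density).sum(axis=1) + 1e-12).mean()
```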

To predict a location for an unseen input, the output of the network is reshaped into a mixture of Gaussians, and the \mu_k of one of the K components is chosen as the prediction. The selection criterion is either the strongest component (highest \pi), or the component mean that maximises the overall mixture probability:

\arg\max_{\mu_i \in \{\mu_1 \ldots \mu_K\}} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_i \mid \mu_k, \Sigma_k)

For further details on selection criteria, see Bishop (1994).
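
Both selection criteria take only a few lines given the mixture parameters; this sketch reuses the hypothetical gmm_pdf helper from the Section 3.2 sketch:

```python
import numpy as np

def predict_location(pis, mus, covs, by="pi"):
    """Pick one component mean as the point prediction.

    by="pi":      mean of the strongest component (highest mixing weight)
    by="mixture": candidate mean maximising the overall mixture density
    """
    if by == "pi":
        return mus[int(np.argmax(pis))]
    scores = [gmm_pdf(mu, pis, mus, covs) for mu in mus]
    return mus[int(np.argmax(scores))]
```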

3.4 Mixture Density Network with Shared Parameters (MDN-SHARED)

In the original MDN model proposed by Bishop (1994), the parameters of the mixture model are separate functions of the input, which is appropriate when the inputs and outputs relate directly to each other; in the case of geolocation or lexical dialectology, however, the relationship between inputs and outputs is not so obvious, and it might be difficult for the model to learn all the parameters of each sample correctly. Instead of using the output to predict all the parameters, we share \mu and \Sigma among all samples as parameters of the output layer, and only use the input to predict \pi, the mixture probabilities, using a SoftMax layer. We initialise \mu by applying K-means clustering to the training coordinates and setting each value of \mu to the centroid of one of the K clusters; we initialise \Sigma randomly between 0 and 10. We use the original cost function to update the weight matrices, biases and the global shared parameters of the mixture model through backpropagation. Prediction is performed in the same way as for the MDN.
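
A minimal sketch of this initialisation, assuming scikit-learn's KMeans for the clustering step (initialising ρ at zero is our assumption; the paper only specifies the range for Σ):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_shared_params(train_coords, K, seed=0):
    """Initialise the globally shared mixture parameters of MDN-SHARED.

    train_coords: (n, 2) array of training latitude/longitude pairs.
    mu is set to the K-means centroids of the training coordinates;
    the standard deviations are drawn uniformly from (0, 10).
    """
    rng = np.random.default_rng(seed)
    mu = KMeans(n_clusters=K, n_init=10,
                random_state=seed).fit(train_coords).cluster_centers_
    sigma = rng.uniform(0, 10, size=(K, 2))  # per-dimension std devs
    rho = np.zeros(K)                        # our assumption: start at zero
    return mu, sigma, rho  # all three are then updated by backpropagation

# The network itself only outputs pi (a SoftMax over the K components);
# mu, sigma and rho are global output-layer parameters shared across all
# samples and trained with the same negative log-likelihood objective.
```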

3.5 Continuous Representation of Location

Gaussian mixtures are usually used as the output layer of neural networks (as in the MDN) for inverse regression problems. We extend their application by using them as an input representation when the input is a multidimensional continuous variable. In problems such as lexical dialectology, the input is a real-valued 2d coordinate, and the goal is to predict dialect words from a given location. Small differences in latitude/longitude may result in big shifts in language use (e.g. in regions such as Switzerland or Gibraltar). One way to model this is to discretise the input space (similar to the discretisation of the output space in classification), with the significant downside that the model is not able to learn/fine-tune regions in a data-driven way. A better solution is to use a K-component Gaussian mixture representation of location, where \mu and \Sigma are shared among all samples, and the output of the layer is the probability of the input under each of the mixture components. Note that in this representation there is no need for the \pi parameters, as we just need to represent the association of an input location with the K regions; this representation is then used as input to the next layer of the neural network to predict the targets. We use this continuous representation of location to predict dialect words from location input.
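
As a sketch (our naming, not the paper's), the Gaussian representation layer maps a coordinate to its K component densities under the shared parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_layer(coord, mus, covs):
    """Continuous representation of a 2d location: the density of the
    input coordinate under each of K shared Gaussian components.  No
    mixing weights pi are involved; the K densities themselves are the
    features fed to the next layer of the network."""
    return np.array([multivariate_normal.pdf(coord, mean=mu, cov=cov)
                     for mu, cov in zip(mus, covs)])
```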

4 Experiments

We apply the two MDN models described above to two widely-used geotagged Twitter datasets for geolocation, and compare the results with state-of-the-art classification and regression baselines. We also use the mixture of Gaussians representation of location to predict dialect terms from coordinates.

4.1 Data

In our experiments, we use two existing Twitter user geolocation datasets: (1) GEOTEXT (Eisenstein et al., 2010), and (2) TWITTER-US (Roller et al., 2012). Each dataset has fixed training, development and test partitions; a user is represented by the concatenation of their tweets, and labelled with the latitude/longitude of the first collected geotagged tweet.[2] GEOTEXT and TWITTER-US cover the continental U.S. with 9k and 449k users, respectively.[3]

Figure 1: (a) The lexical dialectology model, using a Gaussian representation layer (input: location coordinates → Gaussian layer: K Gaussian components → tanh hidden layer → output layer: term probabilities). (b) The MDN-SHARED geolocation model (input: text BoW → tanh hidden layer → output layer: mixing coefficients π1 . . . πK), where the mixture weights π are predicted for each sample, and µ and Σ are parameters of the output layer, shared between all samples.

DAREDS is a dialect-term dataset derived from the Dictionary of American Regional English (DARE; Cassidy, 1985) by Rahimi et al. (to appear), consisting of dialect regions, terms and the meaning of each term.[4] It represents the aggregation of a number of dialectal surveys over different regions of the U.S., to identify shared dialect regions. Because the dialect regions in DARE maps are not machine readable, populous cities within each dialect region were manually extracted and associated with their dialect region terms. The dataset is made up of around 4.3k dialect terms across 99 U.S. dialect regions.

[2] This geolocation representation is naive, but was made by the creators of the original datasets and has been used by others. It has been preserved in this work for comparability with the results of others, despite misgivings about whether this is a faithful representation of the location for a given user.
[3] The datasets can be obtained from https://github.com/utcompling/textgrounder.
[4] http://www.daredictionary.com/

4.2 Geolocation

We use a 3-layer neural network, as shown in Figure 1b, where the input is the l2-normalised bag-of-words model of a given user, with stop words, @-mentions and words with document frequency less than 10 removed. The input is fed to a hidden layer with tanh nonlinearity, which produces the output of the network (with no nonlinearity applied). The output is the collection of Gaussian mixture parameters (\mu, \Sigma, \pi) from the MDN. For prediction, the \mu_k of the mixture component with the highest probability within the mixture is selected. In the case of MDN-SHARED, the output is only \pi, a vector of size K, but the output layer contains extra global parameters \mu and \Sigma (\sigma_{lat}, \sigma_{lon}, \rho), which are shared between all the samples. The negative log-likelihood objective is optimised using Adam (Kingma and Ba, 2014), and early stopping is used to prevent overfitting. The hidden layer is subject to drop-out and elastic net regularisation (with equal l1 and l2 shares). As our baseline, we use a multilayer perceptron regressor with the same input and hidden architecture, but with a 2d output with linear activation that predicts the location from the text input. The regularisation and drop-out rates, hidden layer size, and the number of Gaussian components K (for MDN and MDN-SHARED) are tuned over the development set of each dataset, as shown in Table 1.

We evaluate the predictions of the geolocation models based on three measures (following Cheng et al. (2010) and Eisenstein et al. (2010)):

1. the classification accuracy within a 161km (= 100 mile) radius of the actual location ("Acc@161"); i.e., if the predicted location is within 161km of the actual location, it is considered to be correct;

2. the mean error ("Mean") between the predicted location and the actual location of the user, in kilometres; and

3. the median error ("Median") between the predicted location and the actual location of the user, in kilometres.
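
All three measures reduce to a great-circle distance computation; a minimal sketch using the haversine formula (our code, not part of the released implementation):

```python
import numpy as np

def haversine_km(pred, actual):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    R = 6371.0  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (*pred, *actual))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

def geolocation_metrics(preds, actuals):
    """Acc@161, mean and median error over paired predicted/gold points."""
    dists = np.array([haversine_km(p, a) for p, a in zip(preds, actuals)])
    return {"Acc@161": float((dists <= 161).mean()),
            "Mean": float(dists.mean()),
            "Median": float(np.median(dists))}
```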

4.3 Lexical Dialectology

To predict dialect words from location, we use a 4-layer neural network, as shown in Figure 1a. The input is a latitude/longitude coordinate; the first hidden layer is a Gaussian mixture with K components, which has \mu and \Sigma as its parameters and produces a probability for each component as its activation; the second hidden layer, with tanh nonlinearity, captures the association between the different Gaussians; and the output is a SoftMax layer, which produces a probability distribution over the vocabulary. For a user label, we use an l1-normalised bag-of-words representation of their text content, with binary tf and idf term-weighting. The model should learn to predict the probability distribution over the vocabulary, and so be capable of predicting dialect words with higher probability. It also learns regions (the parameters of the K Gaussians) that represent dialect regions.

We evaluate the lexical dialectology model (MDN-layer) using the perplexity of the predicted unigram distribution, and compare it with a baseline where the Gaussian mixture layer is replaced with a tanh hidden layer (tanh-layer). We also retrieve words given points within a region from the DAREDS dataset, and measure recall with respect to the relevant dialect terms from DAREDS. To do this, we randomly sample P = 10000 latitude/longitude points from the training set and predict the corresponding word distribution. To produce a ranking over words given region r as the query, we use the following measure:

\mathrm{score}(w_i \mid r) = \frac{1}{N} \sum_{p_j \in r} \log P(w_i \mid p_j) \;-\; \frac{1}{P} \sum_{j=1}^{P} \log P(w_i \mid p_j)

where N equals the number of points (out of 10000) inside the query dialect region r, and P equals the total number of points (here 10000). For example, if we are querying dialect terms for the dialect region South (r), N is the number of randomly selected points that fall within the constituent states of the South. score(w_i|r) measures the (log) probability ratio of a word w_i inside region r compared to its global score: if a word is local to region r, the ratio will be higher. We use this measure to create a ranking over the vocabulary, from which we measure precision and recall at k given the gold-standard dialect terms in DAREDS.
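
The ranking is a difference of two averages of per-point log-probabilities; a short sketch, with hypothetical names such as word_logprobs:

```python
import numpy as np

def region_word_scores(word_logprobs, in_region):
    """score(w|r) for every word in the vocabulary.

    word_logprobs: (P, V) array of log P(w_i | p_j) over P sampled points
    in_region:     (P,) boolean mask, True for points inside region r
    Returns a (V,) array; higher means the word is more local to r.
    """
    regional = word_logprobs[in_region].mean(axis=0)  # 1/N sum over p_j in r
    overall = word_logprobs.mean(axis=0)              # 1/P sum over all p_j
    return regional - overall

# ranking of vocabulary indices by descending locality to the region:
# ranking = np.argsort(-region_word_scores(word_logprobs, in_region))
```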

5 Results

5.1 Geolocation

The performance of Regression, MDN and MDN-SHARED, along with several state-of-the-art classification models, is shown in Table 2. The MDN and MDN-SHARED models clearly outperform Regression, and achieve competitive or slightly worse results than the classification models, while providing uncertainty over the whole output space. The geographical distribution of error for MDN-SHARED over the development set of TWITTER-US is shown in Figure 3, indicating larger errors in the Midwest and particularly in the Pacific Northwest (e.g. Oregon).

5.2 Dialectology

The perplexity of the lexical dialectology model using the Gaussian mixture representation (MDN-layer) is 840 over the 54k vocabulary of the TWITTER-US dataset, 1% lower than a similar network architecture with a tanh hidden layer (tanh-layer), which is not a significant improvement. We also evaluated the model using recall at k (Figure 4): it is again competitive with the tanh-layer model, but with the advantage of learning dialect regions simultaneously. Because the DARE dialect terms are not used frequently on Twitter, many of the words are not covered in our dataset, despite its size. However, our model is able to retrieve dialect terms that are distinctly associated with regions. The top dialect words for New York, Louisiana, Illinois and Pennsylvania are shown in Table 3, and include named entities, dialect words and local hashtags.

We also visualised the learned Gaussians of the dialectology model in Figure 2, which, as expected, show several smaller regions (Gaussians with lower σ) in densely populated areas, and larger regions in less populated areas. It is interesting to see that the shape of the learned Gaussians matches natural borders such as coastal regions.

We also visualised the log probability of the dialect terms hella (an intensifier mainly used in Northern California) and yall (meaning "you all", used in the Southern U.S.) under the Gaussian representation model. As shown in Figure 5, the heatmaps match the expected regions.

6 Conclusion

We proposed a continuous representation of location using a mixture of Gaussians, and applied it to geotagged Twitter data in two different settings: (1) geolocation of social media users; and (2) lexical dialectology. We used an MDN (Bishop, 1994) in a multilayer neural network as a geolocation model, and showed that it outperforms regression models by a large margin. There is also very recent work in tweet-level geolocation (Iso et al., 2017) that shows the effectiveness of MDNs.


                                          GEOTEXT                          TWITTER-US
                               regul.  dropout  hidden    K      regul.  dropout  hidden    K
Baseline (Regression)          0       0        (100,50)  —      10^-5   0        (100,50)  —
Proposed method (MDN)          0       0.5      100       100    10^-5   0        300       100
Proposed method (MDN-SHARED)   0       0        100       300    0       0        900       900

Table 1: Hyper-parameter settings of the geolocation models, tuned over the development set of each dataset. K is the number of Gaussian components in MDN and MDN-SHARED. Regression has a tanh hidden layer instead of the Gaussian layer. "—" means the parameter is not applicable to the model.

                                              GEOTEXT                  TWITTER-US
                                      Acc@161  Mean  Median    Acc@161  Mean  Median
Baseline (Regression)                 4        951   733       9        746   557
Proposed method (MDN)                 24       983   505       29       696   281
Proposed method (MDN-SHARED)          39       865   412       42       655   216
CLASSIFICATION METHODS
Rahimi et al. (2015) (LR)             38       880   397       50       686   159
Wing and Baldridge (2014) (uniform)   —        —     —         49       703   170
Wing and Baldridge (2014) (k-d tree)  —        —     —         48       686   191
Melo and Martins (2015)               —        —     —         —        702   208
Cha et al. (2015)                     —        581   425       —        —     —
Liu and Inkpen (2015)                 —        —     —         —        733   377

Table 2: Geolocation results over the GEOTEXT and TWITTER-US datasets for the Regression, MDN and MDN-SHARED methods, compared against state-of-the-art classification methods. "—" signifies that no results were reported for the given metric or dataset.

Figure 2: Learned Gaussian representation of input locations over TWITTER-US in the lexical dialectology model. The number of Gaussian components, K, is set to 100. The red points are the µk and the contours are drawn at p = 0.01.

We modified the MDN by sharing the parameters of the Gaussian mixtures in MDN-SHARED, improving upon MDN and achieving competitive results with state-of-the-art classification models. We also applied the Gaussian mixture representation to predict dialect words from location, and showed that it is competitive with a simple tanh activation in terms of both the perplexity of the predicted unigram model and recall at k when retrieving DARE dialect words from location input. Furthermore, we showed that the learned Gaussian mixtures have interesting properties, such as covering high population density regions (e.g. NYC and LA) with a larger number of small Gaussians, and low-density areas (e.g. the Midwest) with a smaller number of larger Gaussians.


Figure 3: The distribution of geolocation error (in km) for MDN-SHARED over the development set of TWITTER-US, plotted over the U.S. states.

Figure 4: Recall at k% of the 54k vocabulary for retrieving DAREDS dialect words given points within a dialect region (tanh-layer vs. MDN-layer, k from 1% to 20%).

Figure 5: Log probabilities of terms over the continental U.S.: (a) hella (an intensifier mostly used in Northern California, and also the name of a company in Illinois); (b) yall (meaning "you all", mainly used in the Southern U.S.).


New York     Louisiana    Illinois   Pennsylvania
flatbush     kmsl         metra      cdfu
lirr         jtfo         osco       jawn
reade        wassam       halsted    ard
rivington    kmfsl        kedzie     erked
mta          ndc          lbvs       cthu
nostrand     bookoo       damen      septa
stuyvesant   #icantdeal   niu        prussia
pathmark     #drove       orland     drawlin
bleecker     slangs       cermak     youngbull
bowery       daq          uic        prolli
macdougal    #gramfam     oms        #ttm
broome       gtf          xsport     dickeating
driggs       metairie     cta        ctfu

Table 3: Top terms retrieved from the lexical dialectology model given lat/lon training points within a state, ranked by the score(w|r) measure of Section 4.3.

Although we applied the mixture of Gaussians to location data, it can be used in other settings where the input or output is drawn from a continuous multivariate distribution. For example, it can be applied to predict financial risk (Wang and Hua, 2014) and sentiment (Joshi et al., 2010) given text. We showed that when a global structure exists (e.g. population centres, in the case of geolocation), it is better to share the global parameters of the mixture model to improve generalisation. In this work, we used the bivariate Gaussian distribution in the MDN's mixture, leaving the use of other distributions which might better suit the geolocation task for future research.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This work was supported in part by the Australian Research Council.

References

Amr Ahmed, Liangjie Hong, and Alexander J Smola. 2013. Hierarchical geographical modeling of user locations from social media posts. In Proceedings of the 22nd International Conference on World Wide Web (WWW 2013), pages 25–36, Rio de Janeiro, Brazil.

Aris Anagnostopoulos, Fabio Petroni, and Mara Sorella. 2016. Targeted interest-driven advertising in cities using Twitter. In Proceedings of the 10th International Conference on Weblogs and Social Media (ICWSM 2016), pages 527–530, Cologne, Germany.

Zahra Ashktorab, Christopher Brown, Manojit Nandi, and Aron Culotta. 2014. Tweedr: Mining Twitter to inform disaster response. In Proceedings of The 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2014), pages 354–358, University Park, USA.

Christopher Bishop. 1994. Mixture density networks. Technical report, Aston University.

Zsolt Bitvai and Trevor Cohn. 2015. Predicting peer-to-peer loan rates using Bayesian non-linear regression. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), pages 2203–2209, Austin, USA.

Frederic Gomes Cassidy. 1985. Dictionary of American Regional English. Belknap Press of Harvard University Press.

Miriam Cha, Youngjune Gwon, and H.T. Kung. 2015. Twitter geolocation and regional classification via sparse coding. In Proceedings of the 9th International Conference on Weblogs and Social Media (ICWSM 2015), pages 582–585, Oxford, UK.

Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: a content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), pages 759–768, Toronto, Canada.

Paul Cook, Bo Han, and Timothy Baldwin. 2014. Statistical methods for identifying local dialectal terms from GPS-tagged documents. Dictionaries: Journal of the Dictionary Society of North America, 35(35):248–271.

Gabriel Doyle. 2014. Mapping dialectal variation by querying social media. In Proceedings of the 14th Conference of the EACL (EACL 2014), pages 98–106, Gothenburg, Sweden.

Mark Dredze, Manuel García-Herranz, Alex Rutherford, and Gideon Mann. 2016. Twitter as a source of global mobility patterns for social good. In ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, USA.

Jacob Eisenstein. 2015. Written dialect variation in online social media. In Charles Boberg, John Nerbonne, and Dom Watt, editors, Handbook of Dialectology. Wiley.

Jacob Eisenstein, Brendan O'Connor, Noah A Smith, and Eric P Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 1277–1287, Boston, USA.

Noora Al Emadi, Sofiane Abbar, Javier Borge-Holthoefer, Francisco Guzman, and Fabrizio Sebastiani. 2017. QT2S: A system for monitoring road traffic via fine grounding of tweets. In Proceedings of the 11th International Conference on Weblogs and Social Media (ICWSM 2017), pages 456–459, Montreal, Canada.

Bruno Goncalves and David Sanchez. 2014. Crowdsourcing dialect characterization through Twitter. PloS One, 9(11).

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Geolocation prediction in social media data by finding location indicative words. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1045–1062, Mumbai, India.

Bo Han, Paul Cook, and Timothy Baldwin. 2014. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451–500.

Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H. Chi. 2011. Tweets from Justin Bieber's heart: the dynamics of the location field in user profiles. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 237–246, Vancouver, Canada.

Shen-Shyang Ho, Mike Lieberman, Pu Wang, and Hanan Samet. 2012. Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system. In Proceedings of the First ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems, pages 25–32, Redondo Beach, USA.

Yuan Huang, Diansheng Guo, Alice Kasakoff, and Jack Grieve. 2015. Understanding US regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59:244–255.

Hayate Iso, Shoko Wakamiya, and Eiji Aramaki. 2017. Density estimation for geolocation via convolutional mixture density network. arXiv preprint arXiv:1705.02750.

Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proceedings of The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2010), pages 293–296, Los Angeles, USA.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sheila Kinsella, Vanessa Murdock, and Neil O'Hare. 2011. "I'm eating a sandwich in Glasgow": Modeling locations with tweets. In Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents, pages 61–68, Glasgow, UK.

William Labov, Sharon Ash, and Charles Boberg. 2005. The Atlas of North American English: Phonetics, Phonology and Sound Change. Walter de Gruyter.

Ji Liu and Diana Inkpen. 2015. Estimating user location in social media with stacked denoising auto-encoders. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics — Human Language Technologies (NAACL HLT 2015), pages 201–210, Denver, USA.

Graham McNeill, Jonathan Bright, and Scott A Hale. 2016. Estimating local commuting patterns from geolocated Twitter data. arXiv preprint arXiv:1612.01785.

Fernando Melo and Bruno Martins. 2015. Geocoding textual documents through the usage of hierarchical classifiers. In Proceedings of the 9th Workshop on Geographic Information Retrieval (GIR 2015), Paris, France.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

John Nerbonne, Peter Kleiweg, Wilbert Heeringa, and Franz Manni. 2008. Projecting dialect distances to geography: Bootstrap clustering vs. noisy clustering. In Proceedings of the 31st Annual Meeting of the German Classification Society, pages 647–654.

Dong Nguyen and Jacob Eisenstein. to appear. A kernel independence test for geographical language variation. Computational Linguistics.

Reid Priedhorsky, Aron Culotta, and Sara Y. Del Valle. 2014. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 1523–1536, Baltimore, USA.

Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. to appear. A neural model for user geolocation and lexical dialectology. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada.

Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. 2015. Exploiting text and network context for geolocation of social media users. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics — Human Language Technologies (NAACL HLT 2015), Denver, USA.

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), pages 1500–1510, Jeju, South Korea.

Pavel Serdyukov, Vanessa Murdock, and Roelof Van Zwol. 2009. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 484–491, Boston, USA.

Shoko Wakamiya, Yukiko Kawai, and Eiji Aramaki. 2016. After the boom no one tweets: Microblog-based influenza detection incorporating indirect information. In Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and Theory, pages 17–25, Jeju, South Korea.

William Yang Wang and Zhenhao Hua. 2014. A semiparametric Gaussian copula regression model for predicting financial risks from earnings calls. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1155–1165, Baltimore, USA.

Benjamin P Wing and Jason Baldridge. 2011. Simple supervised document geolocation with geodesic grids. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies — Volume 1 (ACL-HLT 2011), pages 955–964, Portland, USA.

Benjamin P Wing and Jason Baldridge. 2014. Hierarchical discriminative classification for text-based geolocation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 336–348, Doha, Qatar.