Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks
Afshin Rahimi Timothy Baldwin Trevor Cohn
School of Computing and Information Systems
The University of Melbourne
{tbaldwin,t.cohn}@unimelb.edu.au
Abstract
We propose a method for embedding two-
dimensional locations in a continuous vec-
tor space using a neural network-based
model incorporating mixtures of Gaussian
distributions, presenting two model vari-
ants for text-based geolocation and lexi-
cal dialectology. Evaluated over Twitter
data, the proposed model outperforms con-
ventional regression-based geolocation and
provides a better estimate of uncertainty.
We also show the effectiveness of the rep-
resentation for predicting words from loca-
tion in lexical dialectology, and evaluate it
using the DARE dataset.
1 Introduction
Geolocation is an essential component of appli-
cations such as traffic monitoring (Emadi et al.,
2017), human mobility pattern analysis (McNeill
et al., 2016; Dredze et al., 2016) and disaster re-
sponse (Ashktorab et al., 2014; Wakamiya et al.,
2016), as well as targeted advertising (Anagnos-
topoulos et al., 2016) and local recommender sys-
tems (Ho et al., 2012). Although Twitter provides
users with the means to geotag their messages, less
than 1% of users opt to turn on geotagging, so third-
party service providers tend to use profile data, text
content and network information to infer the lo-
cation of users. Text content is the most widely
used source of geolocation information, due to its
prevalence across social media services.
Text-based geolocation systems use the geo-
graphical bias of language to infer the location of a
user or message using models trained on geotagged
posts. The models often use a representation of
text (e.g. based on a bag-of-words, convolutional
or recurrent model) to predict the location either in
real-valued latitude/longitude coordinate space or
in discretised region-based space, using regression
or classification, respectively. Regression models,
as a consequence of minimising squared loss for a
unimodal distribution, predict inputs with multiple
targets to lie between the targets (e.g. a user who
mentions content in both NYC and LA is predicted
to be in the centre of the U.S.). Classification mod-
els, while eliminating this problem by predicting
a more granular target, don’t provide fine-grained
predictions (e.g. specific locations in NYC), and
also require heuristic discretisation of locations into
regions (e.g. using clustering).
Mixture Density Networks (“MDNs”: Bishop
(1994)) alleviate these problems by representing
location as a mixture of Gaussian distributions.
Given a text input, an MDN can generate a mixture
model in the form of a probability distribution over
all location points. In the example of a user who
talks about both NYC and LA, e.g., the model will
predict a strong Gaussian component in NYC and
another one in LA, and also provide an estimate of
uncertainty over all the coordinate space.
Although MDNs are not new, they have not
found widespread use in inverse regression prob-
lems where for a single input, multiple correct
outputs are possible. Given the integration of
NLP technologies into devices (e.g. phones or
robots with natural language interfaces) is growing
quickly, there is a potential need for interfacing lan-
guage with continuous variables as input or target.
MDNs can also be used in general text regression
problems such as risk assessment (Wang and Hua,
2014), sentiment analysis (Joshi et al., 2010) and
loan rate prediction (Bitvai and Cohn, 2015), not
only to improve prediction but also to use the mix-
ture model as a representation for the continuous
variables. We apply MDNs to geotagged Twit-
ter data in two different settings: (a) predicting
location given text; and (b) predicting text given
location.
arXiv:1708.04358v1 [cs.CL] 14 Aug 2017
Geotagged text content is not only useful in ge-
olocation, but can also be used in lexical dialectol-
ogy. Lexical dialectology is (in part) the converse
of text-based geolocation (Eisenstein, 2015): instead
of predicting location from language, language
(e.g. dialect terms) is predicted from a
given location. This is a much more challenging
task as the lexical items are not known beforehand,
and there is no notion of dialect regions in the con-
tinuous space of latitude/longitude coordinates. A
lexical dialectology model should not only be able
to predict dialect terms but also be able to automat-
ically learn dialect regions.
In this work, we use bivariate Gaussian mix-
tures over geotagged Twitter data in two different
settings, and demonstrate their use for geoloca-
tion and lexical dialectology. Our contributions
are as follows: (1) we propose a continuous rep-
resentation of location using bivariate Gaussian
mixtures; (2) we show that our geolocation model
outperforms regression-based models and achieves
comparable results with classification models, but
with added uncertainty over the continuous output
space; (3) we show that our lexical dialectology
model is able to predict geographical dialect terms
from latitude/longitude input with state-of-the-art
accuracy; and (4) we show that the automatically
learned Gaussian regions match expert-generated
dialect regions of the U.S. (footnote 1)
2 Related Work
2.1 Text-based Geolocation
Text-based geolocation models are defined as ei-
ther a regression or a classification problem. In
regression geolocation, the model learns to predict
a real-valued latitude/longitude from a text input.
This is a very challenging task for data types such
as Twitter, as they are often heavily biased toward
population centres and urban areas, and far from
uniform. As an example, Norwalk is the name
of a few cities in the U.S., among which Norwalk,
California (West Coast) and Norwalk, Connecticut
(East Coast) are the two most populous.
Assuming that occurrences of the city’s name are
almost equal in both city regions within the training
set, a trained regression-based geolocation model
given Norwalk as input would geolocate it to a
point in the middle of the U.S. instead of choosing
one of the cities. In the machine learning literature,
regression problems where there are multiple
real-valued outputs for a given input are called inverse
problems (Bishop, 1994). Here, standard regression
models predict an average point in the middle
of all training target points to minimise squared
error loss. Bishop (1994) proposes mixture density
networks to model such inverse problems, as we
discuss in detail in Section 3.
Footnote 1: Code available at https://github.com/afshinrahimi/geomdn
In addition, non-Bayesian interpretations of re-
gression models, which are often used in practice,
don’t produce any prediction of uncertainty, so
other than the predicted point, we have little idea
where else the term could have high or low prob-
ability. Priedhorsky et al. (2014) propose a Gaus-
sian Mixture Model (GMM) approach instead of
squared loss regression, whereby they learn a mix-
ture of bivariate Gaussian distributions for each
individual n-gram in the training set. During pre-
diction, they add the Gaussian mixture of each n-
gram in the input text, resulting in a new Gaussian
mixture which can be used to predict a coordinate
with associated uncertainty. To add the mixture
components they use a weighted sum, where the
weight of each n-gram is assigned by several heuris-
tic features. Learning a GMM for each n-gram is
resource-intensive if the size of the training set —
and thus the number of n-grams — is large.
Assuming sufficient training samples containing
the term Norwalk in the two main cities, a trained
classification model would, given this term as input,
predict a probability distribution over all regions,
and assign higher probabilities to the regions con-
taining the two major cities. The challenge, though,
is that the coordinates in the training data must
first be partitioned into regions using administra-
tive regions (Cheng et al., 2010; Hecht et al., 2011;
Kinsella et al., 2011; Han et al., 2012, 2014), a uni-
form grid (Serdyukov et al., 2009), or a clustering
method such as a k-d tree (Wing and Baldridge,
2011) or K-means (Rahimi et al., to appear). The
cluster/region labels can then be used as targets.
Once we have a prediction about where a user is
more likely to be from, there is no more infor-
mation about the coordinates inside the predicted
region. If a region that contains Wyoming is pre-
dicted as the home location of a user, we have no
idea which city or county within Wyoming the user
might be from, unless we retrain the model using
a more fine-grained discretisation or a hierarchical
discretisation (Wing and Baldridge, 2014), which
is both time-consuming and challenging due to data
sparseness.
2.2 Lexical Dialectology
The traditional linguistic approach to lexical di-
alectology is to find the geographical distributions
of known contrast sets such as {you, yall, yinz}
(Labov et al., 2005; Nerbonne et al., 2008;
Goncalves and Sanchez, 2014; Doyle, 2014;
Huang et al., 2015; Nguyen and Eisenstein, to
appear). This usually involves surveying a large
geographically-uniform sample of people from dif-
ferent locations and analysing where each known
alternative is used more frequently. Then, the co-
ordinates are clustered heuristically into dialect re-
gions, based on the lexical choices of users in each
region relative to the contrast set. This processing
is very costly and time-consuming, and relies crit-
ically on knowing the lexical alternatives a priori.
For example, it would require a priori knowledge
of the fact that people in different regions of the US
use pop and soda to refer to the same type of drink,
and a posteriori analysis of the empirical geograph-
ical distribution of the different terms. Language,
particularly in social media and among younger
speakers, is evolving so quickly, in ways that can
be measured over large-scale data samples such
as Twitter, that we ideally want to be able to infer
such contrast sets dynamically. The first step in
automatically collecting dialect words is to find
terms that are disproportionately distributed in dif-
ferent locations. The two predominant approaches
to this problem are model-based (Eisenstein et al.,
2010; Ahmed et al., 2013; Eisenstein, 2015) and
through the use of statistical metrics (Monroe et al.,
2008; Cook et al., 2014). Model-based approaches
use a topic model, e.g., to extract region-specific
topics, and from this, predict the probability of see-
ing a word given a geographical region (Eisenstein
et al., 2010). However, there are scalability issues,
limiting the utility of such models.
In this paper, we propose a neural network archi-
tecture that learns a mixture of Gaussian distribu-
tions as its activation function, and predicts both
locations from word-based inputs (geolocation),
and words from location-based inputs (lexical di-
alectology).
3 Model
3.1 Bivariate Gaussian Distribution
A bivariate Gaussian distribution is a probability
distribution over 2d space (in our case, a
latitude/longitude coordinate pair). The probability
density function is given by:

\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{2\pi} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right\}
\]

where µ is the 2-dimensional mean vector, the matrix

\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\]

is the covariance matrix, and |Σ| is its determinant.
σ1 and σ2 are the standard deviations of the two
dimensions, and ρ12 is the correlation between them.
x is a latitude/longitude coordinate whose probability
density we are seeking to predict.
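As a minimal sketch of the density above, the following NumPy function builds Σ from σ1, σ2 and ρ12 and evaluates the formula directly. The function name and argument layout are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def bivariate_gaussian_pdf(x, mu, sigma1, sigma2, rho):
    """Density N(x | mu, Sigma) of a bivariate Gaussian.

    x has shape (..., 2); mu has shape (2,). sigma1 and sigma2 are the
    standard deviations and rho is the correlation, with -1 < rho < 1.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Covariance matrix built from standard deviations and correlation.
    cov = np.array([[sigma1 ** 2, rho * sigma1 * sigma2],
                    [rho * sigma1 * sigma2, sigma2 ** 2]])
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every point.
    quad = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# At the mean of a unit, uncorrelated Gaussian the density is 1 / (2 * pi).
peak = bivariate_gaussian_pdf([40.7, -74.0], [40.7, -74.0], 1.0, 1.0, 0.0)
```

Inverting the 2 × 2 covariance explicitly keeps the sketch close to the formula; a closed form in σ1, σ2 and ρ12 would avoid the inverse.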
3.2 Mixtures of Gaussians
A mixture of Gaussians is a probabilistic model to
represent subpopulations within a global population
in the form of a weighted sum of K Gaussian
distributions, where a higher weight on a component
Gaussian indicates stronger association with
that component. The probability density function of
a Gaussian mixture model is given by:

\[
P(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
\]

where \sum_{k=1}^{K} \pi_k = 1, and the number of
components K is a hyper-parameter.
3.3 Mixture Density Network (MDN)
A mixture density network (“MDN”: Bishop
(1994)) is a latent variable model where the
conditional probability p(y|x) is modelled as a mixture
of K Gaussians, where the mixing coefficients
π and the parameters of the Gaussian distributions µ
and Σ are computed as a function of the input using a
neural network:

\[
P(y \mid x) = \sum_{k=1}^{K} \pi_k(x) \mathcal{N}\big(y \mid \mu_k(x), \Sigma_k(x)\big)
\]
In the bivariate case of latitude/longitude, the
number of parameters of each Gaussian is 6
(πk(x), µ1k(x), µ2k(x), ρk(x), σ1k(x), σ2k(x)),
which are learnt in the output layer of a regular
neural network as a function of input x. The output
size of the network for K components would
be 6 × K. The output of an MDN for N samples
(e.g. where N is the mini-batch size) is an
N × 6K matrix which is then sliced and reshaped
into (N × 2 × K), (N × 2 × K), (N × 1 × K)
and (N × 1 × K) matrices, providing the model
parameters µ, σ, ρ and π. Each parameter type
has its own constraints: σ (the standard deviation)
should be positive, ρ (the correlation) should be in
the interval [−1, 1] and π should be positive and
sum to one, as a probability distribution. To enforce
these constraints, the following transformations are
often applied to each parameter set:

\[
\sigma = \mathrm{SoftPlus}(\sigma') = \log(\exp(\sigma') + 1) \in (0, +\infty)
\]
\[
\pi = \mathrm{SoftMax}(\pi')
\]
\[
\rho = \mathrm{SoftSign}(\rho') = \frac{\rho'}{1 + |\rho'|} \in [-1, 1]
\]
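The slicing and range transformations can be sketched in NumPy as follows. The order in which µ, σ, ρ and π are laid out within the 6K output columns is an assumption of this sketch (the text does not specify it), and the function names are illustrative.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(a, 0.0)          # log(exp(a) + 1), numerically stable

def softsign(a):
    return a / (1.0 + np.abs(a))

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_mdn_output(raw, K):
    """Slice an (N, 6K) network output into constrained mixture parameters."""
    N = raw.shape[0]
    assert raw.shape == (N, 6 * K)
    # Assumed column layout: [mu (2K) | sigma (2K) | rho (K) | pi (K)].
    mu = raw[:, :2 * K].reshape(N, 2, K)
    sigma = softplus(raw[:, 2 * K:4 * K]).reshape(N, 2, K)   # positive
    rho = softsign(raw[:, 4 * K:5 * K]).reshape(N, 1, K)     # within (-1, 1)
    pi = softmax(raw[:, 5 * K:], axis=1).reshape(N, 1, K)    # sums to one over K
    return mu, sigma, rho, pi

rng = np.random.default_rng(0)
mu, sigma, rho, pi = split_mdn_output(rng.normal(size=(4, 6 * 3)), K=3)
```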
As an alternative, it’s possible to use transforma-
tions like exp for σ and tanh for ρ. After applying
the transformations to enforce the range constraints,
the negative log likelihood loss of each sample x
given a 2d coordinate label y is computed as:
\[
\mathcal{L}(y \mid x) = -\log\left\{ \sum_{k=1}^{K} \pi_k(x) \mathcal{N}\big(y \mid \mu_k(x), \Sigma_k(x)\big) \right\}
\]
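A hedged sketch of this loss in NumPy is given below. It works in log space and uses a log-sum-exp over components, which is a common numerical-stability choice rather than something the text prescribes; the parameter shapes follow the (N × 2 × K) and (N × 1 × K) layout described above, and the function name is hypothetical.

```python
import numpy as np

def mdn_nll(y, mu, sigma, rho, pi):
    """Per-sample negative log likelihood of a bivariate Gaussian mixture.

    y: (N, 2) gold coordinates; mu, sigma: (N, 2, K);
    rho, pi: (N, 1, K), already transformed to their valid ranges.
    """
    rho, pi = rho[:, 0, :], pi[:, 0, :]                   # (N, K)
    d = (y[:, :, None] - mu) / sigma                      # standardised offsets, (N, 2, K)
    one_minus_rho2 = 1.0 - rho ** 2
    z = d[:, 0] ** 2 - 2.0 * rho * d[:, 0] * d[:, 1] + d[:, 1] ** 2
    log_norm = (np.log(2.0 * np.pi) + np.log(sigma[:, 0]) + np.log(sigma[:, 1])
                + 0.5 * np.log(one_minus_rho2))
    log_comp = np.log(pi) - z / (2.0 * one_minus_rho2) - log_norm   # (N, K)
    # Log-sum-exp over components avoids underflow for far-away components.
    m = log_comp.max(axis=1, keepdims=True)
    return -(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1)))

# One sample, one unit Gaussian centred on the gold point: loss = log(2 * pi).
y = np.array([[40.7, -74.0]])
loss = mdn_nll(y, y[:, :, None], np.ones((1, 2, 1)), np.zeros((1, 1, 1)), np.ones((1, 1, 1)))
```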
To predict a location for an unseen input, the
output of the network is reshaped into a mixture
of Gaussians, and the mean µk of one of the K
components is chosen as the prediction. The selection
criterion is either the strongest component (the one
with the highest π), or the component mean that
maximises the overall mixture probability:

\[
\max_{\mu_i \in \{\mu_1 \ldots \mu_K\}} \sum_{k=1}^{K} \pi_k \mathcal{N}(\mu_i \mid \mu_k, \Sigma_k)
\]

For further details on selection criteria, see
Bishop (1994).
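The two selection criteria can be compared with a small NumPy sketch for a single sample. The function names, the per-sample shapes ((K, 2) for µ and σ, (K,) for ρ and π) and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def pairwise_density(mu, sigma, rho):
    """D[i, k] = N(mu_i | mu_k, Sigma_k) for one sample's K components."""
    d = (mu[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (K, K, 2)
    one_minus_rho2 = 1.0 - rho ** 2                              # (K,)
    z = d[..., 0] ** 2 - 2.0 * rho * d[..., 0] * d[..., 1] + d[..., 1] ** 2
    norm = 2.0 * np.pi * sigma[:, 0] * sigma[:, 1] * np.sqrt(one_minus_rho2)
    return np.exp(-z / (2.0 * one_minus_rho2)) / norm

def predict(pi, mu, sigma, rho, criterion="density"):
    """Return the mean of the selected component as the predicted coordinate."""
    if criterion == "pi":
        best = int(np.argmax(pi))                                # strongest component
    else:
        # Mixture density evaluated at each component mean.
        best = int(np.argmax(pairwise_density(mu, sigma, rho) @ pi))
    return mu[best]

# Toy mixture: a heavy but very diffuse component and a lighter, tight one.
pi = np.array([0.6, 0.4])
mu = np.array([[40.0, -100.0], [34.0, -118.0]])
sigma = np.array([[10.0, 10.0], [0.1, 0.1]])
rho = np.zeros(2)
by_weight = predict(pi, mu, sigma, rho, criterion="pi")
by_density = predict(pi, mu, sigma, rho, criterion="density")
```

In this toy case the two criteria disagree: the highest-π rule picks the diffuse component, while the density rule picks the tight one, because its peak density is much higher.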
3.4 Mixture Density Network with Shared
Parameters (MDN-SHARED)
In the original MDN model proposed by Bishop
(1994), the parameters of the mixture model are
separate functions of input, which is appropriate
when the inputs and outputs directly relate to each
other, but in the case of geolocation or lexical di-
alectology, the relationship between inputs and out-
puts is not so obvious. As a result, it might be a
difficult task for the model to learn all the param-
eters of each sample correctly. Instead of using
the output to predict all the parameters, we share
µ and Σ among all samples as parameters of the
output layer, and only use the input to predict π, the
mixture probabilities, using a SoftMax layer. We
initialise µ by applying K-means clustering to the
training coordinates and setting each value of µ to
the centroids of the K clusters; we initialise Σ ran-
domly between 0 and 10. We use the original cost
function to update the weight matrices, biases and
the global shared parameters of the mixture model
through backpropagation. Prediction is performed
in the same way as for MDN.
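The initialisation of the shared parameters might look like the following sketch. It uses a plain Lloyd's K-means with deterministic farthest-point seeding, which is a stand-in chosen here for reproducibility; the text does not say which K-means variant was used. Treating "Σ randomly between 0 and 10" as per-dimension standard deviations drawn uniformly from that range, and starting with zero correlation, are further assumptions.

```python
import numpy as np

def kmeans_centroids(coords, K, n_iter=20):
    """Plain Lloyd's K-means with deterministic farthest-point seeding."""
    centroids = [coords[0]]
    for _ in range(1, K):
        # Squared distance of every point to its nearest chosen centroid.
        dist = np.min([np.sum((coords - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(coords[np.argmax(dist)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        d2 = ((coords[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):        # keep the old centroid if a cluster is empty
                centroids[k] = coords[labels == k].mean(axis=0)
    return centroids

def init_shared_parameters(coords, K, seed=0):
    rng = np.random.default_rng(seed)
    mu = kmeans_centroids(coords, K)               # (K, 2) component means
    sigma = rng.uniform(0.0, 10.0, size=(K, 2))    # assumed: spreads drawn from [0, 10)
    rho = np.zeros(K)                              # assumed: start uncorrelated
    return mu, sigma, rho

# Toy training coordinates: four points near NYC and four near LA.
coords = np.array([[40.0, -74.0], [41.0, -74.0], [40.0, -73.0], [41.0, -73.0],
                   [34.0, -118.0], [35.0, -118.0], [34.0, -117.0], [35.0, -117.0]])
mu, sigma, rho = init_shared_parameters(coords, K=2)
```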
3.5 Continuous Representation of Location
Gaussian mixtures are usually used as the output
layer in neural networks (as in MDN) for inverse
regression problems. We extend their application
by using them as an input representation when
the input is a multidimensional continuous vari-
able. In problems such as lexical dialectology, the
input is real-valued 2d coordinates, and the goal
is to predict dialect words from a given location.
Small differences in latitude/longitude may result
in big shifts in language use (e.g. in regions such
as Switzerland or Gibraltar). One way to model
this is to discretise the input space (similar to the
discretisation of the output space in classification),
with the significant downside that the model is not
able to learn/fine-tune regions in a data-driven way.
A better solution is to use a K component Gaussian
mixture representation of location, where µ and Σ
are shared among all samples, and the output of the
layer is the probability of input in each of the mix-
ture components. Note that in this representation,
there is no need for π parameters as we just need
to represent the association of an input location to
K regions, which will then be used as input to the
next layer of a neural network and used to predict
the targets. We use this continuous representation
of location to predict dialect words from location
input.
4 Experiments
We apply the two described MDN models on two
widely-used geotagged Twitter datasets for geolo-
cation, and compare the results with state-of-the-art
classification and regression baselines. Also, we
use the mixture of Gaussian representation of loca-
tion to predict dialect terms from coordinates.
4.1 Data
In our experiments, we use two existing Twitter
user geolocation datasets: (1) GEOTEXT (Eisenstein
et al., 2010), and (2) TWITTER-US (Roller
et al., 2012). Each dataset has fixed training,
development and test partitions, and a user is represented
by the concatenation of their tweets, and labelled
with the latitude/longitude of the first collected
geotagged tweet (footnote 2). GEOTEXT and TWITTER-US cover
the continental US with 9k and 449k users,
respectively (footnote 3).

Figure 1: (a) The lexical dialectology model (predict
text given location) using a Gaussian representation
layer. Layers, from input to output: location
coordinates; Gaussian layer with K Gaussian components;
tanh hidden layer; output layer of term probabilities.
(b) MDN-SHARED geolocation model (predict location
given text) where the mixture weights π are predicted
for each sample, and µ and Σ are parameters of the
output layer, shared between all samples. Layers, from
input to output: text BoW; tanh hidden layer; output
layer of mixing coefficients π1 . . . πK.
DARE is a dialect-term dataset derived from the
Dictionary of American Regional English (Cassidy,
1985) by Rahimi et al. (to appear). DARE consists
of dialect regions, terms and the meaning of each
term (footnote 4). It represents the aggregation of a number
of dialectal surveys over different regions of the
U.S., to identify shared dialect regions. Because
the dialect regions in DARE maps are not machine
readable, populous cities within each dialect region
are manually extracted and associated with their
dialect region terms. The dataset is made up of
around 4.3k dialect terms across 99 U.S. dialect
regions.
Footnote 2: This geolocation representation is naive, but was made by the creators of the original datasets and has been used by others. It has been preserved in this work for comparability with the results of others, despite misgivings about whether this is a faithful representation of the location for a given user.
Footnote 3: The datasets can be obtained from https://github.com/utcompling/textgrounder.
Footnote 4: http://www.daredictionary.com/
4.2 Geolocation
We use a 3-layer neural network as shown in
Figure 1b, where the input is the l2 normalised
bag-of-words model of a given user with stop words,
@-mentions and words with document frequency less
than 10 removed. The input is fed to a hidden layer
with tanh nonlinearity, which in turn produces the
output of the network (with no nonlinearity applied). The
output is the collection of Gaussian mixture parameters
(µ, Σ, π) from MDN. For prediction, the µk of
the mixture component which has the highest probability
within the mixture is selected.
In the case of MDN-SHARED, the output is only
π, a vector with size K, but the output layer contains
extra global parameters µ and Σ (σlat, σlon, ρ)
which are shared between all the samples. The
negative log likelihood objective is optimised using
Adam (Kingma and Ba, 2014) and early stopping is
used to prevent overfitting. The hidden layer is subject
to drop-out and elastic net regularisation (with
equal l1 and l2 shares). As our baseline, we used
a multilayer perceptron regressor with the same
input and hidden architecture but with a 2d output
with linear activation that predicts the location
from text input. The regularisation and drop-out
rate, hidden layer size and the number of Gaussian
components K (for MDN and MDN-SHARED) are
tuned over the development set of each dataset, as
shown in Table 1.
We evaluate the predictions of the geolocation
models based on three measures (following Cheng
et al. (2010) and Eisenstein et al. (2010)):
1. the classification accuracy within a 161km
(= 100 mile) radius of the actual location
(“Acc@161”); i.e., if the predicted location
is within 161km of the actual location, it is
considered to be correct
2. the mean error (“Mean”) between the pre-
dicted location and the actual location of the
user, in kilometres
3. the median error (“Median”) between the pre-
dicted location and the actual location of the
user, in kilometres
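The three measures can be computed along the following lines. The use of the haversine great-circle distance with a mean Earth radius of 6371 km is an assumption of this sketch; the text does not state which distance formula was used.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def geolocation_metrics(pred, gold):
    """pred, gold: (N, 2) arrays of (latitude, longitude) in degrees."""
    err = haversine_km(pred[:, 0], pred[:, 1], gold[:, 0], gold[:, 1])
    return {"Acc@161": float(np.mean(err <= 161.0)),
            "Mean": float(np.mean(err)),
            "Median": float(np.median(err))}

# Two users located in NYC: one predicted exactly, one predicted in LA.
gold = np.array([[40.7128, -74.0060], [40.7128, -74.0060]])
pred = np.array([[40.7128, -74.0060], [34.0522, -118.2437]])
metrics = geolocation_metrics(pred, gold)
```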
4.3 Lexical Dialectology
To predict dialect words from location, we use a
4-layer neural network as shown in Figure 1a. The
input is a latitude/longitude coordinate; the first
hidden layer is a Gaussian mixture with K components,
which has µ and Σ as its parameters and
produces a probability for each component as its
activation; the second hidden layer, with
tanh nonlinearity, captures the association between
different Gaussians; and the output is a SoftMax
layer which results in a probability distribution over
the vocabulary. For a user label, we use an l1 normalised
bag-of-words representation of the user’s text
content, with binary tf and idf for term-weighting.
The model should learn to predict the probability
distribution over the vocabulary and so be capable
of predicting dialect words with a higher probability.
It also learns regions (parameters of K Gaussians)
that represent dialect regions.
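A forward pass through this architecture can be sketched as below, with randomly initialised toy weights. The layer sizes, parameter names and initial values are all hypothetical, and training (backpropagation through µ and Σ) is omitted.

```python
import numpy as np

def gaussian_layer(coords, mu, sigma, rho):
    """Density of each input point under each of K shared Gaussians: (N, K)."""
    d = (coords[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, K, 2)
    one_minus_rho2 = 1.0 - rho ** 2
    z = d[..., 0] ** 2 - 2.0 * rho * d[..., 0] * d[..., 1] + d[..., 1] ** 2
    norm = 2.0 * np.pi * sigma[:, 0] * sigma[:, 1] * np.sqrt(one_minus_rho2)
    return np.exp(-z / (2.0 * one_minus_rho2)) / norm

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(coords, params):
    """Coordinates -> Gaussian layer -> tanh hidden layer -> word distribution."""
    g = gaussian_layer(coords, params["mu"], params["sigma"], params["rho"])
    h = np.tanh(g @ params["W1"] + params["b1"])
    return softmax(h @ params["W2"] + params["b2"])   # (N, V)

rng = np.random.default_rng(0)
K, H, V = 4, 8, 10   # toy sizes: Gaussians, hidden units, vocabulary
params = {
    "mu": rng.uniform([25.0, -125.0], [49.0, -67.0], size=(K, 2)),
    "sigma": rng.uniform(1.0, 5.0, size=(K, 2)),
    "rho": np.zeros(K),
    "W1": rng.normal(scale=0.1, size=(K, H)), "b1": np.zeros(H),
    "W2": rng.normal(scale=0.1, size=(H, V)), "b2": np.zeros(V),
}
probs = forward(np.array([[40.7, -74.0], [34.1, -118.2]]), params)
```

Note that the Gaussian layer has no mixing weights π: each input point is simply described by its density under the K shared components.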
We evaluate the lexical dialectology model
(MDN-layer) using perplexity of the predicted
unigram distribution, and compare it with a base-
line where the Gaussian mixture layer is replaced
with a tanh hidden layer (tanh-layer). We also
retrieve words given points within a region from
the DAREDS dataset, and measure recall with
respect to relevant dialect terms from DAREDS.
To do that, we randomly sample P = 10000 lat-
itude/longitude points from the training set and
predict the corresponding word distribution. To
come up with a ranking over words given region r
as query, we use the following measure:
\[
\mathrm{score}(w_i \mid r) = \frac{1}{N} \sum_{p_j \in r} \log P(w_i \mid p_j) - \frac{1}{P} \sum_{j=1}^{P} \log P(w_i \mid p_j)
\]
where N equals the number of points (out of 10000)
inside the query dialect region r and P equals the
total number of points (here 10000). For exam-
ple, if we are querying dialect terms from dialect
region South (r), N is the number of randomly se-
lected points that fall within the constituent states
of South. score(wi|r) measures the (log) probabil-
ity ratio of a word wi inside region r compared to
its global score: if a word is local to region r, the
ratio will be higher. We use this measure to create
a ranking over the vocabulary from which we mea-
sure precision and recall at k given gold-standard
dialect terms in DAREDS.
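The ranking measure reduces to a difference of two column means over a matrix of log probabilities, as in the sketch below. The toy vocabulary and probabilities are invented for illustration, and the function names are hypothetical.

```python
import numpy as np

def region_scores(log_probs, in_region):
    """score(w_i | r) for every word.

    log_probs: (P, V) array of log P(w_i | p_j) for P sampled points;
    in_region: boolean mask of length P marking the N points inside region r.
    """
    in_region = np.asarray(in_region, dtype=bool)
    return log_probs[in_region].mean(axis=0) - log_probs.mean(axis=0)

def rank_words(log_probs, in_region, vocab):
    """Vocabulary sorted from most to least region-specific."""
    scores = region_scores(log_probs, in_region)
    return [vocab[i] for i in np.argsort(-scores)]

# Made-up predictions for four points and two words; the first two points lie in r.
vocab = ["yall", "hella"]
log_probs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]))
in_region = [True, True, False, False]
scores = region_scores(log_probs, in_region)
ranking = rank_words(log_probs, in_region, vocab)
```

A region with no sampled points (N = 0) would make the first mean undefined, so such regions need to be skipped or resampled.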
5 Results
5.1 Geolocation
The performance of Regression, MDN and
MDN-SHARED, along with several state-of-the-art
classification models, is shown in Table 2. The
MDN and MDN-SHARED models clearly outper-
form Regression, and achieve competitive or
slightly worse results than the classification mod-
els but provide uncertainty over the whole out-
put space. The geographical distribution of er-
ror for MDN-SHARED over the development set
of TWITTER-US is shown in Figure 3, indicating
larger errors in the Midwest and particularly in the
Pacific Northwest (e.g. Oregon).
5.2 Dialectology
The perplexity of the lexical dialectology
model using the Gaussian mixture representation
(MDN-layer) is 840 for the 54k vocabulary
of the TWITTER-US dataset, 1% lower than a
similar network architecture with a tanh hidden
layer (tanh-layer), which is not a significant
improvement. We also evaluated the
model using recall at k (Figure 4): it is again
competitive with the tanh-layer model, but with
the advantage of learning dialect regions
simultaneously. Because
the DARE dialect terms are not used frequently
in Twitter, many of the words are not covered in
our dataset, despite its size. However, our model
is able to retrieve dialect terms that are distinctly
associated with regions. The top dialect words
for the regions New York, Louisiana, Illinois and
Pennsylvania are shown in Table 3, and include
named entities, dialect words and local hashtags.
We also visualised the learned Gaussians of the
dialectology model in Figure 2, which as expected
shows several smaller regions (Gaussians with
lower σ) in densely populated areas and larger
regions in less populated areas. It is interesting to
see that the shape of the learned Gaussians matches
natural borders such as coastal regions.
We also visualised the log probability of dialect
terms hella (an intensifier mainly used in North-
ern California) and yall (means “you all”, used in
Southern U.S.) resulting from the Gaussian rep-
resentation model. As shown in Figure 5, the
heatmap matches the expected regions.
6 Conclusion
We proposed a continuous representation of loca-
tion using mixture of Gaussians and applied it to
geotagged Twitter data in two different settings: (1)
geolocation of social media users, and (2) lexical
dialectology. We used MDN (Bishop, 1994) in a
multilayer neural network as a geolocation model
                                GEOTEXT                              TWITTER-US
                                regul.  dropout  hidden     K        regul.  dropout  hidden     K
Baseline (Regression)           0       0        (100, 50)  —        10^-5   0        (100, 50)  —
Proposed method (MDN)           0       0.5      100        100      10^-5   0        300        100
Proposed method (MDN-SHARED)    0       0        100        300      0       0        900        900
Table 1: Hyper-parameter settings of the geolocation models tuned over development set of each dataset.
K is the number of Gaussian components in MDN and MDN-SHARED. Regression has a tanh hidden
layer instead of the Gaussian layer. “—” means the parameter is not applicable to the model.

                                        GEOTEXT                    TWITTER-US
                                        Acc@161  Mean  Median      Acc@161  Mean  Median
Baseline (Regression)                   4        951   733         9        746   557
Proposed method (MDN)                   24       983   505         29       696   281
Proposed method (MDN-SHARED)            39       865   412         42       655   216
CLASSIFICATION METHODS
(Rahimi et al., 2015) (LR)              38       880   397         50       686   159
(Wing and Baldridge, 2014) (uniform)    —        —     —           49       703   170
(Wing and Baldridge, 2014) (k-d tree)   —        —     —           48       686   191
(Melo and Martins, 2015)                —        —     —           —        702   208
(Cha et al., 2015)                      —        581   425         —        —     —
(Liu and Inkpen, 2015)                  —        —     —           —        733   377
Table 2: Geolocation results over GEOTEXT and TWITTER-US datasets based on Regression, MDN
and MDN-SHARED methods. Mean and Median errors are in kilometres. The results are also compared to
state-of-the-art classification methods. “—” signifies that no results were reported for the given metric or dataset.

Figure 2: Learned Gaussian representation of input locations over TWITTER-US in the lexical dialectology
model. The number of Gaussian components, K, is set to 100. The red points are µk and the contours are
drawn at p = 0.01.
and showed that it outperforms regression mod-
els by a large margin. There is also very recent
work (Iso et al., 2017) in tweet-level geolocation
that shows the effectiveness of MDN.
We modified MDN by sharing the parameters of
the Gaussian mixtures in MDN-SHARED and improved
upon MDN, achieving competitive results
with state-of-the-art classification models. We also
applied the Gaussian mixture representation to pre-
dict dialect words from location, and showed that it
Figure 3: The distribution of geolocation error
for MDN-SHARED over the development set of
TWITTER-US. (Map of U.S. states; colour scale:
error in km, from 0 to 4800.)

Figure 4: Recall at k% of 54k vocabulary for
retrieving DAREDS dialect words given points
within a dialect region. (x-axis: k%, from 1% to 20%;
y-axis: Recall@k%, from 0 to 30; series: tanh-layer
and MDN-layer.)

Figure 5: Log probabilities of terms: (a) hella and
(b) yall in continental U.S. Panel (a): hella (an
intensifier), mostly used in Northern California, also
the name of a company in Illinois. Panel (b): yall
(means you all), mainly used in Southern U.S.
is competitive with simple tanh activation in terms
of both perplexity of the predicted unigram model
and recall at k for retrieving DARE dialect
words from location input. Furthermore, we showed
that the learned Gaussian mixtures have interesting
properties such as covering high population density
regions (e.g. NYC and LA) with a larger number
of small Gaussians, and a smaller number of larger
New York      Louisiana     Illinois    Pennsylvania
flatbush      kmsl          metra       cdfu
lirr          jtfo          osco        jawn
reade         wassam        halsted     ard
rivington     kmfsl         kedzie      erked
mta           ndc           lbvs        cthu
nostrand      bookoo        damen       septa
stuyvesant    #icantdeal    niu         prussia
pathmark      #drove        orland      drawlin
bleecker      slangs        cermak      youngbull
bowery        daq           uic         prolli
macdougal     #gramfam      oms         #ttm
broome        gtf           xsport      dickeating
driggs        metairie      cta         ctfu
Table 3: Top terms retrieved from the lexical dialectology
model given lat/lon training points within a
state, ranked by the score defined in Section 4.3.
Gaussians in low density areas (e.g. the Midwest).
Although we applied the mixture of Gaussians
to location data, it can be used in other settings
where the input or output is drawn from a continuous
multivariate distribution. For example, it can be
applied to predict financial risk (Wang and Hua,
2014) and sentiment (Joshi et al., 2010) given text.
We showed that when a global structure exists (e.g.
population centres, in the case of geolocation) it
is better to share the global parameters of the mixture
model to improve generalisation. In this work,
we used the bivariate Gaussian distribution in the
MDN’s mixture, leaving the use of other distributions
which might better suit the geolocation task
for future research.
Acknowledgments
We thank the anonymous reviewers for their valu-
able feedback. This work was supported in part by
the Australian Research Council.
References
Amr Ahmed, Liangjie Hong, and Alexander J Smola.2013. Hierarchical geographical modeling of userlocations from social media posts. In Proceedingsof the 22nd International Conference on World WideWeb (WWW 2013), pages 25–36, Rio de Janeiro,Brazil.
Aris Anagnostopoulos, Fabio Petroni, and MaraSorella. 2016. Targeted interest-driven advertisingin cities using twitter. In Proceedings of the 10th In-ternational Conference on Weblogs and Social Me-
dia (ICWSM 2016), pages 527–530, Cologne, Ger-many.
Zahra Ashktorab, Christopher Brown, Manojit Nandi,and Aron Culotta. 2014. Tweedr: Mining Twit-ter to inform disaster response. In Proceedingsof The 11th International Conference on Informa-tion Systems for Crisis Response and Management(ISCRAM 2014), pages 354–358, University Park,USA.
Christopher Bishop. 1994. Mixture density networks.Technical report, Aston University.
Zsolt Bitvai and Trevor Cohn. 2015. Predicting peer-to-peer loan rates using bayesian non-linear regres-sion. In Proceedings of the Twenty-Ninth AAAI Con-ference on Artificial Intelligence (AAAI-15), pages2203–2209, Austin, USA.
Frederic Gomes Cassidy. 1985. Dictionary of Amer-ican Regional English. Belknap Press of HarvardUniversity Press.
Miriam Cha, Youngjune Gwon, and H.T. Kung. 2015.Twitter geolocation and regional classification viasparse coding. In Proceedings of the 9th Inter-national Conference on Weblogs and Social Media(ICWSM 2015), pages 582–585, Oxford, UK.
Zhiyuan Cheng, James Caverlee, and Kyumin Lee.2010. You are where you tweet: a content-based ap-proach to geo-locating Twitter users. In Proceedingsof the 19th ACM International Conference Infor-mation and Knowledge Management (CIKM 2010),pages 759–768, Toronto, Canada.
Paul Cook, Bo Han, and Timothy Baldwin. 2014. Statistical methods for identifying local dialectal terms from GPS-tagged documents. Dictionaries: Journal of the Dictionary Society of North America, 35(35):248–271.
Gabriel Doyle. 2014. Mapping dialectal variation by querying social media. In Proceedings of the 14th Conference of the EACL (EACL 2014), pages 98–106, Gothenburg, Sweden.
Mark Dredze, Manuel García-Herranz, Alex Rutherford, and Gideon Mann. 2016. Twitter as a source of global mobility patterns for social good. In ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, USA.
Jacob Eisenstein. 2015. Written dialect variation in online social media. In Charles Boberg, John Nerbonne, and Dom Watt, editors, Handbook of Dialectology. Wiley.
Jacob Eisenstein, Brendan O’Connor, Noah A Smith, and Eric P Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 1277–1287, Boston, USA.
Noora Al Emadi, Sofiane Abbar, Javier Borge-Holthoefer, Francisco Guzman, and Fabrizio Sebastiani. 2017. QT2S: A system for monitoring road traffic via fine grounding of tweets. In Proceedings of the 11th International Conference on Weblogs and Social Media (ICWSM 2017), pages 456–459, Montreal, Canada.
Bruno Goncalves and David Sanchez. 2014. Crowdsourcing dialect characterization through Twitter. PloS One, 9(11).
Bo Han, Paul Cook, and Timothy Baldwin. 2012. Geolocation prediction in social media data by finding location indicative words. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1045–1062, Mumbai, India.
Bo Han, Paul Cook, and Timothy Baldwin. 2014. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451–500.
Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H. Chi. 2011. Tweets from Justin Bieber’s heart: the dynamics of the location field in user profiles. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 237–246, Vancouver, Canada.
Shen-Shyang Ho, Mike Lieberman, Pu Wang, and Hanan Samet. 2012. Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system. In Proceedings of the First ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems, pages 25–32, Redondo Beach, USA.
Yuan Huang, Diansheng Guo, Alice Kasakoff, and Jack Grieve. 2015. Understanding US regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59:244–255.
Hayate Iso, Shoko Wakamiya, and Eiji Aramaki. 2017. Density estimation for geolocation via convolutional mixture density network. arXiv preprint arXiv:1705.02750.
Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proceedings of The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2010), pages 293–296, Los Angeles, USA. Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Sheila Kinsella, Vanessa Murdock, and Neil O’Hare. 2011. “I’m eating a sandwich in Glasgow”: Modeling locations with tweets. In Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents, pages 61–68, Glasgow, UK.
William Labov, Sharon Ash, and Charles Boberg. 2005. The Atlas of North American English: Phonetics, Phonology and Sound Change. Walter de Gruyter.
Ji Liu and Diana Inkpen. 2015. Estimating user location in social media with stacked denoising auto-encoders. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2015), pages 201–210, Denver, USA.
Graham McNeill, Jonathan Bright, and Scott A Hale. 2016. Estimating local commuting patterns from geolocated Twitter data. arXiv preprint arXiv:1612.01785.
Fernando Melo and Bruno Martins. 2015. Geocoding textual documents through the usage of hierarchical classifiers. In Proceedings of the 9th Workshop on Geographic Information Retrieval (GIR 2015), Paris, France.
Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.
John Nerbonne, Peter Kleiweg, Wilbert Heeringa, and Franz Manni. 2008. Projecting dialect distances to geography: Bootstrap clustering vs. noisy clustering. In Proceedings of the 31st Annual Meeting of the German Classification Society, pages 647–654.
Dong Nguyen and Jacob Eisenstein. to appear. A kernel independence test for geographical language variation. Computational Linguistics.
Reid Priedhorsky, Aron Culotta, and Sara Y. Del Valle. 2014. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 1523–1536, Baltimore, USA.
Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. to appear. A neural model for user geolocation and lexical dialectology. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada.
Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. 2015. Exploiting text and network context for geolocation of social media users. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2015), Denver, USA.
Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CONLL 2012), pages 1500–1510, Jeju, South Korea.
Pavel Serdyukov, Vanessa Murdock, and Roelof Van Zwol. 2009. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 484–491, Boston, USA.
Shoko Wakamiya, Yukiko Kawai, and Eiji Aramaki. 2016. After the boom no one tweets: Microblog-based influenza detection incorporating indirect information. In Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and Theory, pages 17–25, Jeju, South Korea.
William Yang Wang and Zhenhao Hua. 2014. A semiparametric Gaussian copula regression model for predicting financial risks from earnings calls. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1155–1165, Baltimore, USA.
Benjamin P Wing and Jason Baldridge. 2011. Simple supervised document geolocation with geodesic grids. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (ACL-HLT 2011), pages 955–964, Portland, USA.
Benjamin P Wing and Jason Baldridge. 2014. Hierarchical discriminative classification for text-based geolocation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 336–348, Doha, Qatar.