TRANSCRIPT
Exponential Family Embeddings
Maja Rudolph ([email protected]), with Francisco Ruiz and David Blei
Columbia University
September 12, 2017
Exponential Family Embeddings
• class of conditionally specified models
• goal: learn distributed representations of objects
(0, 0, …, 1, 0, …, 0, 0) −→ (1.2, −0.5, …, −1.7, 4.3)
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Nature 323 (1986), p. 9.
Geoffrey E Hinton. “Learning distributed representations of concepts”. In: Proceedings of the eighth annual conference of the cognitive science society. Vol. 1. Amherst, MA. 1986, p. 12.
Maja Rudolph et al. “Exponential Family Embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 478–486.
Exponential Family Embeddings
• goal: learn distributed representations of objects
• objects: words in text, neurons in neuroscience data, or items in a collaborative filtering task
                   single neuron held out           25% of neurons held out
Model              K = 10          K = 100          K = 10          K = 100
FA                 0.261 ± 0.004   0.251 ± 0.004    0.261 ± 0.004   0.252 ± 0.004
G-EMB (c = 10)     0.230 ± 0.003   0.230 ± 0.003    0.242 ± 0.004   0.242 ± 0.004
G-EMB (c = 50)     0.226 ± 0.003   0.222 ± 0.003    0.233 ± 0.003   0.230 ± 0.003
NG-EMB (c = 10)    0.238 ± 0.004   0.233 ± 0.003    0.258 ± 0.004   0.244 ± 0.004

Table 2: Analysis of neural data: mean squared error and standard errors of neural activity (on the test set) for different models. Both EF-EMB models significantly outperform FA; G-EMB is more accurate than NG-EMB.

Figure 1: Top view of the zebrafish brain, with blue circles at the location of the individual neurons. We zoom in on 3 neurons and their 50 nearest neighbors (small blue dots), visualizing the “synaptic weights” learned by a G-EMB model (K = 100). The edge color encodes the inner product of the neural embedding vector and the context vectors, ρn⊤αm, for each neighbor m. Positive values are green, negative values are red, and the transparency is proportional to the magnitude. With these weights we can form hypotheses about how nearby neurons are connected.

the lagged activity conditional on the simultaneous lags of surrounding neurons. We studied context sizes c ∈ {10, 50} and latent dimension K ∈ {10, 100}.

Models. We compare EF-EMB to probabilistic factor analysis (FA), fitting K-dimensional factors for each neuron and K-dimensional factor loadings for each time frame. In FA, each entry of the data matrix is a Gaussian with mean equal to the inner product of the corresponding factor and factor loading.

Evaluation. We train each model on the first 95% of the time frames and hold out the last 5% for testing. With the test set, we use two types of evaluation. (1) Leave one out: For each neuron xi in the test set, we use the measurements of the other neurons to form predictions. For FA this means the other neurons are used to recover the factor loadings; for EF-EMB this means the other neurons are used to construct the context. (2) Leave 25% out: We randomly split the neurons into 4 folds. Each neuron is predicted using the three sets of neurons that are out of its fold. (This is a more difficult task.) Note in EF-EMB, the missing data might change the size of the context of some neurons. See Table 5 in Appendix B for the choice of hyperparameters.
Fitted Poisson Embeddings – Similarity Queries
Maruchan ramen soup      Yoplait strawberry y.      Mountain Dew soda        Dean Foods 1% milk
Maruchan chicken ramen   Yoplait vanilla yogurt     Pepsi wild cherry soda   Dean Foods 2% milk
Maruchan ramen, ls.      Yoplait cherry yogurt      Dr Pepper soda           Dean Foods fat free milk
Kemps chocolate milk     Yoplait blueberry yogurt   Martin’s potato chips    Danone fat free yoghurt
[Figure: t-SNE projection of the learned item embeddings; axes: t-SNE component 1 and t-SNE component 2.]
• representations learned from shopping data (counts not text)
Exponential Family Embeddings
• encapsulates the main ideas of neural language models:
• each observation is modeled conditionally on a context
• the conditional distributions come from the exponential family
• each object has two embeddings: an embedding vector ρ and the context vector α
Yoshua Bengio et al. “A neural probabilistic language model”. In: Journal of Machine Learning Research 3.Feb (2003), pp. 1137–1155.
Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Neural Information Processing Systems. 2013, pp. 3111–3119.
Jeffrey Pennington et al. “Glove: Global Vectors for Word Representation.” In: Conference on Empirical Methods in Natural Language Processing. Vol. 14. 2014, pp. 1532–1543.
Exponential Family Embeddings
• use exponential families for the conditional of each data point,
xi |xci ∼ ExpFam(ηi(xci), t(xi))
• the natural parameter combines the embedding and context vectors,
ηi(xci) = fi( ρ[i]⊤ ∑j∈ci α[j] xj )
• the exponential family embedding (EF-EMB) has latent variables for each index: an embedding ρ[i] and a context vector α[i]
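As a concrete sketch of the natural parameter above (illustrative values only, not the authors’ code; the identity link is assumed for fi):

```python
import numpy as np

# EF-EMB natural parameter for one data point i:
#   eta_i = f_i( rho[i]^T  sum_{j in c_i} alpha[j] * x_j )
rng = np.random.default_rng(0)
V, K = 5, 3                      # number of objects, embedding dimension
rho = rng.normal(size=(V, K))    # embedding vectors, one per object
alpha = rng.normal(size=(V, K))  # context vectors, one per object

i = 0                            # target index
context = [1, 2, 4]              # indices j in the context c_i
x = rng.poisson(2.0, size=V)     # observed values x_j (e.g. counts)

# combine the context: sum_j alpha[j] * x_j, then take the inner
# product with the target's embedding vector rho[i]
context_sum = sum(alpha[j] * x[j] for j in context)   # shape (K,)
eta_i = rho[i] @ context_sum                          # scalar natural parameter
```

Different exponential families then interpret this scalar as their natural parameter (log rate for Poisson, log odds for Bernoulli, and so on).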
Pseudo-likelihood
• combine these ingredients in a “pseudo-likelihood”
L(ρ, α) = ∑i=1…n ( ηi⊤ t(xi) − a(ηi) ) + log p(ρ) + log p(α).
• fit with stochastic optimization; exponential families simplify the gradients.
• (Stochastic gradients give justification to NN ideas like “negative sampling.”)
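A minimal sketch of the pseudo-likelihood sum for the Gaussian case (unit variance, priors omitted; the all-other-positions context here is a toy stand-in for ci):

```python
import numpy as np

# For a unit-variance Gaussian, t(x) = x and a(eta) = eta^2 / 2, so each
# term of the pseudo-likelihood is  eta_i * x_i - eta_i^2 / 2.
rng = np.random.default_rng(1)
n, K = 8, 4
x = rng.normal(size=n)            # one observation per position
rho = rng.normal(size=(n, K))     # embedding vectors
alpha = rng.normal(size=(n, K))   # context vectors

def eta(i):
    # toy context: every other position (a stand-in for c_i)
    ctx = [j for j in range(n) if j != i]
    return rho[i] @ sum(alpha[j] * x[j] for j in ctx)

# sum of conditional log-likelihood terms (up to constants)
pseudo_ll = sum(eta(i) * x[i] - 0.5 * eta(i) ** 2 for i in range(n))
```

In practice this sum is subsampled, giving unbiased stochastic gradients for ρ and α.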
Barry C Arnold, Enrique Castillo, Jose Maria Sarabia, et al. “Conditionally specified distributions: an introduction”. In: Statistical Science 16.3 (2001), pp. 249–274.
Exponential Family Embeddings
In summary, an EF-EMB has 3 ingredients (context, conditional distribution, parameterization)
• Multinomial embedding for text (similar to CBOW)
• Poisson embeddings for shopping data (or movie ratings data)
• Bernoulli embeddings for text (related to word2vec)
Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: ICLR Workshop Proceedings. arXiv:1301.3781 (2013).
Multinomial embeddings for text
• observations xi : one-hot vectors
• context ci : index of words before and after
• each word is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK
• Categorical distribution on xi |xci
• natural parameter (log probability):
ηiv = ρv⊤ ∑j∈ci, w∈V αw xjw
Poisson embeddings for movie ratings and shopping data
• observations: counts
• context: other movies same user rated, other items same user purchased
• each item is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK
• Poisson distribution on xui |xci
• natural parameter (log rate):
ηui = ρi⊤ ∑j∈ci αj xuj
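A toy sketch of this Poisson conditional for one (user, item) count (all basket contents and parameter values below are illustrative, not the fitted model):

```python
import numpy as np
from math import lgamma

# Poisson embedding: the natural parameter is the log rate,
#   eta_ui = rho_i^T sum_{j in c_i} alpha_j * x_uj,  rate = exp(eta_ui).
rng = np.random.default_rng(2)
n_items, K = 6, 3
rho = rng.normal(scale=0.1, size=(n_items, K))
alpha = rng.normal(scale=0.1, size=(n_items, K))

basket = {1: 2, 3: 1, 5: 4}    # other items j in the context: counts x_uj
i, x_ui = 0, 2                 # target item and its observed count

eta_ui = rho[i] @ sum(alpha[j] * cnt for j, cnt in basket.items())
rate = np.exp(eta_ui)          # Poisson mean for item i in this basket

# Poisson log-likelihood of the observed count under this rate
log_lik = x_ui * eta_ui - rate - lgamma(x_ui + 1)
```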
Results – Poisson Embeddings – Better Held-out Log-Likelihood
• Movie Ratings: MovieLens-100k.
• Market Baskets: IRI dataset. Over 100,000 baskets of 8000 distinct items
                 Market Baskets          Movie Ratings
Model            K = 20     K = 100      K = 20     K = 100
Poisson Emb.     −7.11      −6.95        −5.69      −5.73
Poisson Fact.    −7.74      −7.63        −5.80      −5.86
Poisson PCA      −8.31      −11.01       −5.92      −7.50
P. Gopalan, J. Hofman, and D. M. Blei. “Scalable recommendation with hierarchical Poisson factorization”. In: Uncertainty in Artificial Intelligence. 2015.
Michael Collins, Sanjoy Dasgupta, and Robert E Schapire. “A generalization of principal components analysis to the exponential family”. In: Neural Information Processing Systems. 2001, pp. 617–624.
Bernoulli embeddings
• breaks up the one-hot constraint of the word indicators in text
• instead of a Categorical with its expensive normalization, use conditional Bernoullis to model the individual entries of the data matrix
• with biased SGD (modeling the ones, subsampling the zeros) we get an objective that closely resembles word2vec
p(xiv | xci) = Bernoulli(piv)
piv = σ( ρv⊤ ∑j∈ci, w∈V αw xjw )
• rest of the talk: extensions of Bernoulli embeddings
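The ones-plus-subsampled-zeros objective above can be sketched as follows (toy corpus and uniform negative sampling are assumptions for illustration):

```python
import numpy as np

# Bernoulli embedding objective for one position: model the observed 1
# (the word that occurred) plus a subsample of 0s (words that did not),
# which closely resembles word2vec with negative sampling.
rng = np.random.default_rng(3)
V, K = 10, 4
rho = rng.normal(scale=0.1, size=(V, K))    # embedding vectors
alpha = rng.normal(scale=0.1, size=(V, K))  # context vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

target, context_words = 2, [1, 5]        # word v at position i, its context
ctx = alpha[context_words].sum(axis=0)   # sum of context vectors

# positive term: log p(x_iv = 1); negative terms: subsampled zeros
pos = np.log(sigmoid(rho[target] @ ctx))
negatives = rng.choice([w for w in range(V) if w != target], size=3, replace=False)
neg = sum(np.log(sigmoid(-(rho[w] @ ctx))) for w in negatives)
objective = pos + neg
```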
Why Embeddings?
• as input features in downstream NLP tasks
• as output layer in deep models for word prediction tasks
• document classification
Ronan Collobert et al. “Natural language processing (almost) from scratch”. In: Journal of Machine Learning Research 12.Aug (2011), pp. 2493–2537.
Jason Weston et al. “Deep learning via semi-supervised embedding”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
Matt Taddy. “Document classification by inversion of distributed language representations”. In: arXiv preprint arXiv:1504.07295 (2015).
Why Embeddings?
• This talk: as a descriptive statistic of text for computational social science
US Congressional Record (1858 - 2009)
The Meaning of Words Changes - Computer
computer

1858            1986
computer        computer
draftsman       software
draftsmen       computers
copyist         copyright
photographer    technological
computers       innovation
copyists        mechanical
janitor         hardware
accountant      technologies
bookkeeper      vehicles
Maja Rudolph and David Blei. “Dynamic Embeddings for Language Evolution”. In: arXiv:1703.08052 (2017).
Dynamic Bernoulli Embeddings
• Divide the corpus into time slices t = 1, …, T
• E.g., divide speeches from 1858–2009 into 76 time slices, 2 years each
• Model semantic change using a Gaussian random walk on the embedding vectors ρv(t)
• Fit objective using stochastic gradients
Dynamic Bernoulli Embeddings
• ρv(0) ∼ N(0, (1/λρ(0)) I)
• ρv(t) ∼ N(ρv(t−1), (1/λρ) I)
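A sketch of sampling from this Gaussian random-walk prior (the precision values here are toy choices, not the paper’s hyperparameters):

```python
import numpy as np

# rho_v^(0) ~ N(0, (1/lam0) I);  rho_v^(t) ~ N(rho_v^(t-1), (1/lam) I)
rng = np.random.default_rng(4)
T, K = 5, 3
lam0, lam = 1.0, 100.0             # illustrative precisions

rho = np.zeros((T + 1, K))
rho[0] = rng.normal(0.0, 1.0 / np.sqrt(lam0), size=K)
for t in range(1, T + 1):
    # each slice drifts from the previous one by a small Gaussian step
    rho[t] = rho[t - 1] + rng.normal(0.0, 1.0 / np.sqrt(lam), size=K)

# when lam is large, consecutive slices stay close together
step_sizes = np.linalg.norm(np.diff(rho, axis=0), axis=1)
```

The prior couples consecutive time slices, which is what lets rare words share statistical strength across the whole time range.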
Results - Held-out Likelihood
Senate speeches
                    context size 2      context size 8
s-emb               −2.409 ± 0.001      −2.286 ± 0.001
t-emb               −2.444 ± 0.001      −2.458 ± 0.001
d-emb [this work]   −2.340 ± 0.001      −2.282 ± 0.001
N ≈ 14M, V = 25k, K = 100
William L Hamilton, Jure Leskovec, and Dan Jurafsky. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”. In: arXiv preprint arXiv:1605.09096 (2016).
Dynamic Embeddings — U.S. Senate Speeches (1858 - 2009)
Grouped data
• What if the data is grouped differently?
• How do we share statistical strength when we cannot exploit dynamics?
Structured Embedding Models for Grouped Data
• Goal: uncover how word usage differs between different groups
                  data     embeddings of   grouped by             size
ArXiv abstracts   text     15k terms       19 subject areas       15M words
Senate speeches   text     15k terms       83 home state/party    20M words
Shopping data     counts   5.5k items      12 months              0.5M trips
Maja Rudolph, Francisco Ruiz, and David Blei. “Structured Embedding Models for Grouped Data”. In: Advances in Neural Information Processing Systems. 2017.
Hierarchical Embedding Model
[Graphical model: data X(s) for each group s; context vectors αv shared across groups; group-specific embeddings ρv(s) tied to a global embedding ρv(0); plates over V terms and S groups.]

ρv(s) ∼ N(ρv(0), σρ² I)
Amortized Embedding Model
[Graphical model: same structure as the hierarchical model, but the group-specific embeddings are deterministic functions of the global embedding.]

ρv(s) = fs(ρv(0))
Neural Network maps Global Embeddings to Group SpecificEmbeddings
[Diagram: a neural network with group-specific weights W1(s), W2(s) maps the input word vector ρv to the output group-specific word vector ρv(s).]

• We compare feed-forward NNs and ResNet architectures.
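A sketch of both amortization variants (the weight names W1, W2, the dimensions, and the tanh nonlinearity are illustrative assumptions, not the paper’s exact architecture):

```python
import numpy as np

# Amortization: group-specific embeddings are *constructed* from a
# shared global embedding, rho_v^(s) = f_s(rho_v^(0)), instead of
# being stored separately per group.
rng = np.random.default_rng(5)
K, H, S = 8, 16, 3                       # embed dim, hidden dim, groups
rho0 = rng.normal(size=K)                # global embedding of one word

# one small network per group s
W1 = rng.normal(scale=0.1, size=(S, H, K))
W2 = rng.normal(scale=0.1, size=(S, K, H))

def f_feedforward(s, r):
    return W2[s] @ np.tanh(W1[s] @ r)      # plain feed-forward map

def f_resnet(s, r):
    return r + W2[s] @ np.tanh(W1[s] @ r)  # ResNet-style: residual on rho0

rho_s = np.stack([f_feedforward(s, rho0) for s in range(S)])
rho_s_res = np.stack([f_resnet(s, rho0) for s in range(S)])
```

The residual variant keeps each group embedding close to the global one by construction, which mirrors the hierarchical prior’s shrinkage.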
Kaiming He et al. “Deep residual learning for image recognition”. In: arXiv preprint arXiv:1512.03385 (2015).
Results
Model                    ArXiv papers       Senate speeches    Shopping data
Global emb               −2.176 ± 0.005     −2.239 ± 0.002     −0.772 ± 0.000
Separated emb            −2.500 ± 0.012     −2.915 ± 0.004     −0.807 ± 0.002
s-emb                    −2.287 ± 0.007     −2.645 ± 0.002     −0.770 ± 0.001
s-emb (hierarchical)     −2.170 ± 0.003     −2.217 ± 0.001     −0.767 ± 0.000
s-emb (amortiz+feedf)    −2.153 ± 0.004     −2.484 ± 0.002     −0.774 ± 0.000
s-emb (amortiz+resnet)   −2.120 ± 0.004     −2.249 ± 0.002     −0.762 ± 0.000
Amortized Embedding of Intelligence
Which Words Does a Group Use Most Differently?
Amortized embeddings uncover which words are used most differently byRepublican Senators (red) and Democratic Senators (blue) from different states.
Summary
• Exponential family embeddings
  • conditionally specified models
  • learn distributed representations
  • Bernoulli embeddings for text
• Structured embeddings
  • dynamics
  • hierarchical modeling
  • amortization
Discussion: How do these methods relate to embeddings 2.0?
• context vectors are global (slow learning?)
• embeddings specific to each group
• amortization: embeddings are constructed, not retrieved
• predefined partitioning of the data:
  • determines the groups
  • determines the number of representations
  • determines when which representation needs to be accessed
• a smarter system should be able to learn group structure
• can CLS theory help us design such models?
Contact info: Maja Rudolph ([email protected])
Multinomial embeddings for text
• observations xi : one-hot vectors
• context ci : index of words before and after
• each word is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK
xi |xci ∼ Categorical(ηi)
ηiv = ρv⊤ ∑j∈ci α⊤ xj
Exponential family
• Exponential family with natural parameter η, sufficient statistic t(x), and log-normalizer a(η).

x ∼ ExpFam(η, t(x)), with density p(x) = h(x) exp{η⊤ t(x) − a(η)}

• e.g. Gaussian for reals, Poisson for counts, categorical for categorical, Bernoulli for binary…
• nice properties, derive algorithm once
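As a quick numerical check of this form (a standard fact, verified here for the Bernoulli with η = log(p/(1−p)), t(x) = x, a(η) = log(1 + e^η), h(x) = 1):

```python
import numpy as np

# exp{eta * x - a(eta)} should recover p^x (1-p)^(1-x) for x in {0, 1}
p = 0.3
eta = np.log(p / (1 - p))          # natural parameter (log odds)
a = np.log1p(np.exp(eta))          # log-normalizer

for x in (0, 1):
    expfam = np.exp(eta * x - a)   # exponential-family density
    direct = p**x * (1 - p) ** (1 - x)
    assert np.isclose(expfam, direct)
```

This is why one derivation of gradients and algorithms covers all the embedding models in the talk: only η, t, and a change.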
Dynamic Bernoulli Embeddings
• xiv | xci ∼ Bern(piv)
• ηiv = log( piv / (1 − piv) )
• ηiv = ρv⊤ ( ∑j∈ci ∑v′ αv′ xjv′ )
• αv ∼ N(0, (1/λα) I)
• ρv(0) ∼ N(0, (1/λρ(0)) I)
• ρv(t) ∼ N(ρv(t−1), (1/λρ) I)
Dynamic Embeddings — ACM abstracts (1951 - 2014)
values

1858               2000
values             values
fluctuations       sacred
value              inalienable
currencies         unique
fluctuation        preserving
depreciation       exemplified
fluctuating        principles
purchasing power   philanthropy
fluctuate          virtues
basis              historical
fine

1858           2004
fine           fine
luxurious      punished
finest         penitentiaries
coarse         imprisonment
beautiful      misdemeanor
imprisonment   punishable
finer          offense
lighter        guilty
weaves         convictions
spun           penitentiary
data (ACM)

1961            1969           1991           2011           2014
data            data           data           data           data
directories     repositories   voluminous     raw data       data streams
files           voluminous     raw data       voluminous     voluminous
bibliographic   lineage        repositories   data sources   raw data
formatted       metadata       data streams   data streams   warehouses
retrieval       snapshots      data sources   dws            dws
publishing      data streams   volumes        repositories   repositories
archival        raw data       dws            warehouses     data sources
archives        cleansing      dsms           marts          data mining
manuscripts     data mining    data access    volumes        marts
Detecting Words with Largest Drift
drift(v) = ‖ρv(T) − ρv(0)‖
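The drift statistic is straightforward to compute from fitted trajectories; a toy sketch (random stand-in embeddings, not the fitted model):

```python
import numpy as np

# drift(v) = || rho_v^(T) - rho_v^(0) ||, one value per word, then rank
rng = np.random.default_rng(6)
V, K = 4, 5
rho_first = rng.normal(size=(V, K))    # rho^(0): embeddings at t = 0
rho_last = rng.normal(size=(V, K))     # rho^(T): embeddings at t = T

drift = np.linalg.norm(rho_last - rho_first, axis=1)
ranked = np.argsort(-drift)            # word indices, largest drift first
```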
words with largest drift
iraq          3.09     coin              2.39
tax cuts      2.84     social security   2.38
health care   2.62     fine              2.38
energy        2.55     signal            2.38
medicare      2.55     program           2.36
discipline    2.44     moves             2.35
text          2.41     credit            2.34
values        2.40     unemployment      2.34
unemployment

1858           1940           2000
unemployment   unemployment   unemployment
unemployed     unemployed     jobless
depression     depression     rate
acute          alleviating    depression
deplorable     destitution    forecasts
alleviating    acute          crate
destitution    reemployment   upward
urban          deplorable     lag
employment     employment     economists
distressing    distress       predict