TRANSCRIPT
Exponential Family Embeddings
Maja Rudolph ([email protected]), with Francisco Ruiz and David Blei
Columbia University
September 12, 2017
Exponential Family Embeddings
• class of conditionally specified models
• goal: learn distributed representations of objects
(0, 0, …, 1, 0, …, 0, 0) −→ (1.2, −0.5, …, −1.7, 4.3)
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Nature 323 (1986), p. 9.
Geoffrey E Hinton. “Learning distributed representations of concepts”. In: Proceedings of the eighth annual conference of the cognitive science society. Vol. 1. Amherst, MA. 1986, p. 12.
Maja Rudolph et al. “Exponential Family Embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 478–486.
Exponential Family Embeddings
• goal: learn distributed representations of objects
• objects: words in text, neurons in neuroscience data, or items in a collaborative filtering task
                   single neuron held out           25% of neurons held out
Model              K = 10          K = 100          K = 10          K = 100
FA                 0.261 ± 0.004   0.251 ± 0.004    0.261 ± 0.004   0.252 ± 0.004
G-EMB (c = 10)     0.230 ± 0.003   0.230 ± 0.003    0.242 ± 0.004   0.242 ± 0.004
G-EMB (c = 50)     0.226 ± 0.003   0.222 ± 0.003    0.233 ± 0.003   0.230 ± 0.003
NG-EMB (c = 10)    0.238 ± 0.004   0.233 ± 0.003    0.258 ± 0.004   0.244 ± 0.004

Table 2: Analysis of neural data: mean squared error and standard errors of neural activity (on the test set) for different models. Both EF-EMB models significantly outperform FA; G-EMB is more accurate than NG-EMB.

Figure 1: Top view of the zebrafish brain, with blue circles at the location of the individual neurons. We zoom in on 3 neurons and their 50 nearest neighbors (small blue dots), visualizing the “synaptic weights” learned by a G-EMB model (K = 100). The edge color encodes the inner product of the neural embedding vector and the context vectors, ρn⊤αm, for each neighbor m. Positive values are green, negative values are red, and the transparency is proportional to the magnitude. With these weights we can form hypotheses about how nearby neurons are connected.

the lagged activity conditional on the simultaneous lags of surrounding neurons. We studied context sizes c ∈ {10, 50} and latent dimension K ∈ {10, 100}.

Models. We compare EF-EMB to probabilistic factor analysis (FA), fitting K-dimensional factors for each neuron and K-dimensional factor loadings for each time frame. In FA, each entry of the data matrix is a Gaussian with mean equal to the inner product of the corresponding factor and factor loading.

Evaluation. We train each model on the first 95% of the time frames and hold out the last 5% for testing. With the test set, we use two types of evaluation. (1) Leave one out: For each neuron xi in the test set, we use the measurements of the other neurons to form predictions. For FA this means the other neurons are used to recover the factor loadings; for EF-EMB this means the other neurons are used to construct the context. (2) Leave 25% out: We randomly split the neurons into 4 folds. Each neuron is predicted using the three sets of neurons that are out of its fold. (This is a more difficult task.) Note in EF-EMB, the missing data might change the size of the context of some neurons. See Table 5 in Appendix B for the choice of hyperparameters.
Fitted Poisson Embeddings – Similarity Queries
Maruchan ramen soup      Yoplait strawberry y.      Mountain Dew soda        Dean Foods 1% milk
Maruchan chicken ramen   Yoplait vanilla yogurt     Pepsi wild cherry soda   Dean Foods 2% milk
Maruchan ramen, ls.      Yoplait cherry yogurt      Dr Pepper soda           Dean Foods fat free milk
Kemps chocolate milk     Yoplait blueberry yogurt   Martin’s potato chips    Danone fat free yoghurt
[Figure: t-SNE projection of the learned item embeddings; axes: t-SNE component 1 and t-SNE component 2.]
• representations learned from shopping data (counts not text)
Exponential Family Embeddings
• encapsulates the main ideas of neural language models:
• each observation is modeled conditionally on a context
• the conditional distributions come from the exponential family
• each object has two embeddings: an embedding vector ρ and the context vector α
Yoshua Bengio et al. “A neural probabilistic language model”. In: Journal of Machine Learning Research 3.Feb (2003), pp. 1137–1155.
Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Neural Information Processing Systems. 2013, pp. 3111–3119.
Jeffrey Pennington et al. “Glove: Global Vectors for Word Representation.” In: Conference on Empirical Methods in Natural Language Processing. Vol. 14. 2014, pp. 1532–1543.
Exponential Family Embeddings
• use exponential families for the conditional of each data point,
xi |xci ∼ ExpFam(ηi(xci), t(xi))
• the natural parameter combines the embedding and context vectors,
ηi(xci) = fi( ρ[i]⊤ ∑j∈ci α[j] xj )
• the exponential family embedding (EF-EMB) has latent variables for each index: an embedding ρ[i] and a context vector α[i]
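As a concrete sketch of the natural parameter above (illustrative values only, not the authors’ code; the identity link is assumed for fi):

```python
import numpy as np

# EF-EMB natural parameter for one data point i:
#   eta_i = f_i( rho[i]^T  sum_{j in c_i} alpha[j] * x_j )
rng = np.random.default_rng(0)
V, K = 5, 3                      # number of objects, embedding dimension
rho = rng.normal(size=(V, K))    # embedding vectors, one per object
alpha = rng.normal(size=(V, K))  # context vectors, one per object

i = 0                            # target index
context = [1, 2, 4]              # indices j in the context c_i
x = rng.poisson(2.0, size=V)     # observed values x_j (e.g. counts)

# combine the context: sum_j alpha[j] * x_j, then take the inner
# product with the target's embedding vector rho[i]
context_sum = sum(alpha[j] * x[j] for j in context)   # shape (K,)
eta_i = rho[i] @ context_sum                          # scalar natural parameter
```

Different exponential families then interpret this scalar as their natural parameter (log rate for Poisson, log odds for Bernoulli, and so on).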
Pseudo-likelihood
• combine these ingredients in a “pseudo-likelihood”
L(ρ, α) = ∑i=1…n ( ηi⊤ t(xi) − a(ηi) ) + log p(ρ) + log p(α).
• fit with stochastic optimization; exponential families simplify the gradients.
• (Stochastic gradients give justification to NN ideas like “negative sampling.”)
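A minimal sketch of the pseudo-likelihood sum for the Gaussian case (unit variance, priors omitted; the all-other-positions context here is a toy stand-in for ci):

```python
import numpy as np

# For a unit-variance Gaussian, t(x) = x and a(eta) = eta^2 / 2, so each
# term of the pseudo-likelihood is  eta_i * x_i - eta_i^2 / 2.
rng = np.random.default_rng(1)
n, K = 8, 4
x = rng.normal(size=n)            # one observation per position
rho = rng.normal(size=(n, K))     # embedding vectors
alpha = rng.normal(size=(n, K))   # context vectors

def eta(i):
    # toy context: every other position (a stand-in for c_i)
    ctx = [j for j in range(n) if j != i]
    return rho[i] @ sum(alpha[j] * x[j] for j in ctx)

# sum of conditional log-likelihood terms (up to constants)
pseudo_ll = sum(eta(i) * x[i] - 0.5 * eta(i) ** 2 for i in range(n))
```

In practice this sum is subsampled, giving unbiased stochastic gradients for ρ and α.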
Barry C Arnold, Enrique Castillo, Jose Maria Sarabia, et al. “Conditionally specified distributions: an introduction”. In: Statistical Science 16.3 (2001), pp. 249–274.
Exponential Family Embeddings
In summary, an EF-EMB has 3 ingredients (context, conditional distribution, parameterization)
• Multinomial embedding for text (similar to CBOW)
• Poisson embeddings for shopping data (or movie ratings data)
• Bernoulli embeddings for text (related to word2vec)
Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: ICLR Workshop Proceedings. arXiv:1301.3781 (2013).
Multinomial embeddings for text
• observations xi : one-hot vectors
• context ci : index of words before and after
• each word is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK
• Categorical distribution on xi |xci
• natural parameter (log probability):
ηiv = ρv⊤ ∑j∈ci, w∈V αw xjw
Poisson embeddings for movie ratings and shopping data
• observations: counts
• context: other movies same user rated, other items same user purchased
• each item is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK
• Poisson distribution on xui |xci
• natural parameter (log rate):
ηui = ρi⊤ ∑j∈ci αj xuj
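A toy sketch of this Poisson conditional for one (user, item) count (all basket contents and parameter values below are illustrative, not the fitted model):

```python
import numpy as np
from math import lgamma

# Poisson embedding: the natural parameter is the log rate,
#   eta_ui = rho_i^T sum_{j in c_i} alpha_j * x_uj,  rate = exp(eta_ui).
rng = np.random.default_rng(2)
n_items, K = 6, 3
rho = rng.normal(scale=0.1, size=(n_items, K))
alpha = rng.normal(scale=0.1, size=(n_items, K))

basket = {1: 2, 3: 1, 5: 4}    # other items j in the context: counts x_uj
i, x_ui = 0, 2                 # target item and its observed count

eta_ui = rho[i] @ sum(alpha[j] * cnt for j, cnt in basket.items())
rate = np.exp(eta_ui)          # Poisson mean for item i in this basket

# Poisson log-likelihood of the observed count under this rate
log_lik = x_ui * eta_ui - rate - lgamma(x_ui + 1)
```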
Results – Poisson Embeddings – Better Held-out Log-Likelihood
• Movie Ratings: MovieLens-100k.
• Market Baskets: IRI dataset. Over 100,000 baskets of 8000 distinct items
                 Market Baskets          Movie Ratings
Model            K = 20     K = 100      K = 20     K = 100
Poisson Emb.     −7.11      −6.95        −5.69      −5.73
Poisson Fact.    −7.74      −7.63        −5.80      −5.86
Poisson PCA      −8.31      −11.01       −5.92      −7.50
P. Gopalan, J. Hofman, and D. M. Blei. “Scalable recommendation with hierarchical Poisson factorization”. In: Uncertainty in Artificial Intelligence. 2015.
Michael Collins, Sanjoy Dasgupta, and Robert E Schapire. “A generalization of principal components analysis to the exponential family”. In: Neural Information Processing Systems. 2001, pp. 617–624.
Bernoulli embeddings
• breaks up the one-hot constraint of the word indicators in text
• instead of a Categorical with its expensive normalization, use conditional Bernoullis to model the individual entries of the data matrix
• with biased SGD (modeling the ones, subsampling the zeros) we get an objective that closely resembles word2vec
p(xiv | xci) = Bernoulli(piv)
piv = σ( ρv⊤ ∑j∈ci, w∈V αw xjw )
• rest of the talk: extensions of Bernoulli embeddings
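The ones-plus-subsampled-zeros objective above can be sketched as follows (toy corpus and uniform negative sampling are assumptions for illustration):

```python
import numpy as np

# Bernoulli embedding objective for one position: model the observed 1
# (the word that occurred) plus a subsample of 0s (words that did not),
# which closely resembles word2vec with negative sampling.
rng = np.random.default_rng(3)
V, K = 10, 4
rho = rng.normal(scale=0.1, size=(V, K))    # embedding vectors
alpha = rng.normal(scale=0.1, size=(V, K))  # context vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

target, context_words = 2, [1, 5]        # word v at position i, its context
ctx = alpha[context_words].sum(axis=0)   # sum of context vectors

# positive term: log p(x_iv = 1); negative terms: subsampled zeros
pos = np.log(sigmoid(rho[target] @ ctx))
negatives = rng.choice([w for w in range(V) if w != target], size=3, replace=False)
neg = sum(np.log(sigmoid(-(rho[w] @ ctx))) for w in negatives)
objective = pos + neg
```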
Why Embeddings?
• as input features in downstream NLP tasks
• as output layer in deep models for word prediction tasks
• document classification
Ronan Collobert et al. “Natural language processing (almost) from scratch”. In: Journal of Machine Learning Research 12.Aug (2011), pp. 2493–2537.
Jason Weston et al. “Deep learning via semi-supervised embedding”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
Matt Taddy. “Document classification by inversion of distributed language representations”. In: arXiv preprint arXiv:1504.07295 (2015).
Why Embeddings?
• This talk: as a descriptive statistic of text for computational social science
US Congressional Record (1858 - 2009)
The Meaning of Words Changes - Computer
computer

1858            1986
computer        computer
draftsman       software
draftsmen       computers
copyist         copyright
photographer    technological
computers       innovation
copyists        mechanical
janitor         hardware
accountant      technologies
bookkeeper      vehicles
Maja Rudolph and David Blei. “Dynamic Embeddings for Language Evolution”. In: arXiv:1703.08052 (2017).
Dynamic Bernoulli Embeddings
• Divide the corpus into time slices t = 1, …, T
• E.g., divide speeches from 1858–2009 into 76 time slices, 2 years each
• Model semantic change using a Gaussian random walk on the embedding vectors ρv(t)
• Fit objective using stochastic gradients
Dynamic Bernoulli Embeddings
• ρv(0) ∼ N(0, (1/λρ(0)) I)
• ρv(t) ∼ N(ρv(t−1), (1/λρ) I)
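A sketch of sampling from this Gaussian random-walk prior (the precision values here are toy choices, not the paper’s hyperparameters):

```python
import numpy as np

# rho_v^(0) ~ N(0, (1/lam0) I);  rho_v^(t) ~ N(rho_v^(t-1), (1/lam) I)
rng = np.random.default_rng(4)
T, K = 5, 3
lam0, lam = 1.0, 100.0             # illustrative precisions

rho = np.zeros((T + 1, K))
rho[0] = rng.normal(0.0, 1.0 / np.sqrt(lam0), size=K)
for t in range(1, T + 1):
    # each slice drifts from the previous one by a small Gaussian step
    rho[t] = rho[t - 1] + rng.normal(0.0, 1.0 / np.sqrt(lam), size=K)

# when lam is large, consecutive slices stay close together
step_sizes = np.linalg.norm(np.diff(rho, axis=0), axis=1)
```

The prior couples consecutive time slices, which is what lets rare words share statistical strength across the whole time range.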
Results - Held-out Likelihood
Senate speeches
                    context size 2      context size 8
s-emb               −2.409 ± 0.001      −2.286 ± 0.001
t-emb               −2.444 ± 0.001      −2.458 ± 0.001
d-emb [this work]   −2.340 ± 0.001      −2.282 ± 0.001
N ≈ 14M, V = 25k, K = 100
William L Hamilton, Jure Leskovec, and Dan Jurafsky. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”. In: arXiv preprint arXiv:1605.09096 (2016).
Dynamic Embeddings — U.S. Senate Speeches (1858 - 2009)
Grouped data
• What if the data is grouped differently?
• How do we share statistical strength when we cannot exploit dynamics?
Structured Embedding Models for Grouped Data
• Goal: uncover how word usage differs between different groups
                  data     embeddings of   grouped by             size
ArXiv abstracts   text     15k terms       19 subject areas       15M words
Senate speeches   text     15k terms       83 home state/party    20M words
Shopping data     counts   5.5k items      12 months              0.5M trips
Maja Rudolph, Francisco Ruiz, and David Blei. “Structured Embedding Models for Grouped Data”. In: Advances in Neural Information Processing Systems. 2017.
Hierarchical Embedding Model
[Graphical model: data X(s) for each group s; context vectors αv shared across groups; group-specific embeddings ρv(s) tied to a global embedding ρv(0); plates over V terms and S groups.]

ρv(s) ∼ N(ρv(0), σρ² I)
Amortized Embedding Model
[Graphical model: same structure as the hierarchical model, but the group-specific embeddings are deterministic functions of the global embedding.]

ρv(s) = fs(ρv(0))
Neural Network maps Global Embeddings to Group SpecificEmbeddings
[Diagram: a neural network with group-specific weights W1(s), W2(s) maps the input word vector ρv to the output group-specific word vector ρv(s).]

• We compare feed-forward NNs and ResNet architectures.
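A sketch of both amortization variants (the weight names W1, W2, the dimensions, and the tanh nonlinearity are illustrative assumptions, not the paper’s exact architecture):

```python
import numpy as np

# Amortization: group-specific embeddings are *constructed* from a
# shared global embedding, rho_v^(s) = f_s(rho_v^(0)), instead of
# being stored separately per group.
rng = np.random.default_rng(5)
K, H, S = 8, 16, 3                       # embed dim, hidden dim, groups
rho0 = rng.normal(size=K)                # global embedding of one word

# one small network per group s
W1 = rng.normal(scale=0.1, size=(S, H, K))
W2 = rng.normal(scale=0.1, size=(S, K, H))

def f_feedforward(s, r):
    return W2[s] @ np.tanh(W1[s] @ r)      # plain feed-forward map

def f_resnet(s, r):
    return r + W2[s] @ np.tanh(W1[s] @ r)  # ResNet-style: residual on rho0

rho_s = np.stack([f_feedforward(s, rho0) for s in range(S)])
rho_s_res = np.stack([f_resnet(s, rho0) for s in range(S)])
```

The residual variant keeps each group embedding close to the global one by construction, which mirrors the hierarchical prior’s shrinkage.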
Kaiming He et al. “Deep residual learning for image recognition”. In: arXiv preprint arXiv:1512.03385 (2015).
Results
Model                    ArXiv papers       Senate speeches    Shopping data
Global emb               −2.176 ± 0.005     −2.239 ± 0.002     −0.772 ± 0.000
Separated emb            −2.500 ± 0.012     −2.915 ± 0.004     −0.807 ± 0.002
s-emb                    −2.287 ± 0.007     −2.645 ± 0.002     −0.770 ± 0.001
s-emb (hierarchical)     −2.170 ± 0.003     −2.217 ± 0.001     −0.767 ± 0.000
s-emb (amortiz+feedf)    −2.153 ± 0.004     −2.484 ± 0.002     −0.774 ± 0.000
s-emb (amortiz+resnet)   −2.120 ± 0.004     −2.249 ± 0.002     −0.762 ± 0.000
Amortized Embedding of Intelligence
Which Words Does a Group Use Most Differently?
Amortized embeddings uncover which words are used most differently byRepublican Senators (red) and Democratic Senators (blue) from different states.
Summary
• Exponential family embeddings
  • conditionally specified models
  • learn distributed representations
  • Bernoulli embeddings for text
• Structured embeddings
  • dynamics
  • hierarchical modeling
  • amortization
Discussion: How do these methods relate to embeddings 2.0?
• context vectors are global (slow learning?)
• embeddings specific to each group
• amortization: embeddings are constructed, not retrieved
• predefined partitioning of the data:
  • determines the groups
  • determines the number of representations
  • determines when which representation needs to be accessed
• a smarter system should be able to learn group structure
• can CLS theory help us design such models?
Contact info: Maja Rudolph ([email protected])
Multinomial embeddings for text
• observations xi : one-hot vectors
• context ci : index of words before and after
• each word is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK
xi |xci ∼ Categorical(ηi)
ηiv = ρv⊤ ∑j∈ci α⊤ xj
Exponential family
• Exponential family with natural parameter η, sufficient statistic t(x), and log-normalizer a(η).

x ∼ ExpFam(η, t(x)), with density p(x) = h(x) exp{η⊤ t(x) − a(η)}

• e.g. Gaussian for reals, Poisson for counts, categorical for categorical, Bernoulli for binary…
• nice properties, derive algorithm once
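As a quick numerical check of this form (a standard fact, verified here for the Bernoulli with η = log(p/(1−p)), t(x) = x, a(η) = log(1 + e^η), h(x) = 1):

```python
import numpy as np

# exp{eta * x - a(eta)} should recover p^x (1-p)^(1-x) for x in {0, 1}
p = 0.3
eta = np.log(p / (1 - p))          # natural parameter (log odds)
a = np.log1p(np.exp(eta))          # log-normalizer

for x in (0, 1):
    expfam = np.exp(eta * x - a)   # exponential-family density
    direct = p**x * (1 - p) ** (1 - x)
    assert np.isclose(expfam, direct)
```

This is why one derivation of gradients and algorithms covers all the embedding models in the talk: only η, t, and a change.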
Dynamic Bernoulli Embeddings
• xiv | xci ∼ Bern(piv)
• ηiv = log( piv / (1 − piv) )
• ηiv = ρv⊤ ( ∑j∈ci ∑v′ αv′ xjv′ )
• αv ∼ N(0, (1/λα) I)
• ρv(0) ∼ N(0, (1/λρ(0)) I)
• ρv(t) ∼ N(ρv(t−1), (1/λρ) I)
Dynamic Embeddings — ACM abstracts (1951 - 2014)
values

1858               2000
values             values
fluctuations       sacred
value              inalienable
currencies         unique
fluctuation        preserving
depreciation       exemplified
fluctuating        principles
purchasing power   philanthropy
fluctuate          virtues
basis              historical
fine

1858           2004
fine           fine
luxurious      punished
finest         penitentiaries
coarse         imprisonment
beautiful      misdemeanor
imprisonment   punishable
finer          offense
lighter        guilty
weaves         convictions
spun           penitentiary
data (ACM)

1961            1969           1991           2011           2014
data            data           data           data           data
directories     repositories   voluminous     raw data       data streams
files           voluminous     raw data       voluminous     voluminous
bibliographic   lineage        repositories   data sources   raw data
formatted       metadata       data streams   data streams   warehouses
retrieval       snapshots      data sources   dws            dws
publishing      data streams   volumes        repositories   repositories
archival        raw data       dws            warehouses     data sources
archives        cleansing      dsms           marts          data mining
manuscripts     data mining    data access    volumes        marts
Detecting Words with Largest Drift
drift(v) = ‖ρv(T) − ρv(0)‖
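The drift statistic is straightforward to compute from fitted trajectories; a toy sketch (random stand-in embeddings, not the fitted model):

```python
import numpy as np

# drift(v) = || rho_v^(T) - rho_v^(0) ||, one value per word, then rank
rng = np.random.default_rng(6)
V, K = 4, 5
rho_first = rng.normal(size=(V, K))    # rho^(0): embeddings at t = 0
rho_last = rng.normal(size=(V, K))     # rho^(T): embeddings at t = T

drift = np.linalg.norm(rho_last - rho_first, axis=1)
ranked = np.argsort(-drift)            # word indices, largest drift first
```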
words with largest drift
iraq          3.09     coin              2.39
tax cuts      2.84     social security   2.38
health care   2.62     fine              2.38
energy        2.55     signal            2.38
medicare      2.55     program           2.36
discipline    2.44     moves             2.35
text          2.41     credit            2.34
values        2.40     unemployment      2.34
unemployment

1858           1940           2000
unemployment   unemployment   unemployment
unemployed     unemployed     jobless
depression     depression     rate
acute          alleviating    depression
deplorable     destitution    forecasts
alleviating    acute          crate
destitution    reemployment   upward
urban          deplorable     lag
employment     employment     economists
distressing    distress       predict