
Page 1: Word embeddings as a service -  PyData NYC 2015

Word embeddings as a service

François Scharffe, @lechatpito (http://www.twitter.com/lechatpito), 3Top (http://www.3top.com)

PyData NYC 2015

Page 2: Word embeddings as a service -  PyData NYC 2015

Outline of the talk

What is 3Top?
What are word embeddings?
How to implement a simple recommendation system for 3Top categories?

Page 3: Word embeddings as a service -  PyData NYC 2015

Rank Anything, Rank Everything

3Top is a ranking and recommendation platform
Rankings convey more information than star ratings

Who cares about 3 stars or less? I just want the best stuff
I'd rather trust my friends than read through reviews
If I have more than 3 items to rank, I can probably use a more precise category

Not yet launched, but the site is up

Let's take a look at http://www.3top.com (http://www.3top.com)

Page 4: Word embeddings as a service -  PyData NYC 2015
Page 5: Word embeddings as a service -  PyData NYC 2015

Places

http://www.3top.com/category/1138/gyms-for-students-near-lower-east-side

Page 6: Word embeddings as a service -  PyData NYC 2015

Movies

http://www.3top.com/category/142/movies-about-wall-street

Page 7: Word embeddings as a service -  PyData NYC 2015

Anything really

http://www.3top.com/category/765/foods-named-after-people

Page 8: Word embeddings as a service -  PyData NYC 2015

Data & knowledge engineering at 3Top

Building a solid data engineering architecture before launching the site.

Natural language processing pipeline
Parsing categories
Currently using a parser we developed
About to switch to spaCy (http://spacy.io) from Matthew Honnibal. It's great, check it out!
Detecting named entities, locations
A large knowledge graph backed by an ontology
An itemization pipeline: matching free text items to entities in the knowledge graph

Page 9: Word embeddings as a service -  PyData NYC 2015
Page 10: Word embeddings as a service -  PyData NYC 2015

Category recommendation

How are we going to build a simple recommendation system without having any significant number of users, categories, or rankings?

Note the impressive figures:

Number of Users: 316
Number of Rankings: 2123
Number of Categories: 1316

Feel free to add a few rankings:

Wow! ;)

http://www.3top.com

Page 11: Word embeddings as a service -  PyData NYC 2015

Word embeddings?

Who hasn't heard about word2vec?

Word embeddings represent words in a high-dimensional space in such a way that words appearing in the same context are close in that space.

The dimensionality of the space is not that high, typically a few hundred dimensions.
Word embeddings are a language modeling method, more precisely a distributed vector representation of words.

Page 12: Word embeddings as a service -  PyData NYC 2015

Compared to bag of words:

Dimensionality is low, and constant with respect to the vocabulary size
Depending on the training algorithm, partially learned models give partially good results

Compared to topic modeling:

Better granularity, the base element is a word
Phrase vectors can also be learned

Page 13: Word embeddings as a service -  PyData NYC 2015

What are word embedding models good at?

Modeling similarity between words

Algebraic operations on word vectors

sim(tomato, beefsteak) < sim(apple, tomato) < sim(pear, apple)

v(Paris) - v(France) ≈ v(Berlin) - v(Germany)

Page 14: Word embeddings as a service -  PyData NYC 2015

Examples

The examples here use a GloVe model (100d, 400k vocab, trained on Wikipedia and Gigaword (news articles)).

In [3]: from gensim.models import Word2Vec
        model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")

In [4]: model.most_similar("python", topn=10)

Out[4]: [(u'monty', 0.6886237859725952), (u'php', 0.586538553237915), (u'perl', 0.5784406661987305), (u'cleese', 0.5446674823760986), (u'flipper', 0.5112984776496887), (u'ruby', 0.5066927671432495), (u'spamalot', 0.505638837814331), (u'javascript', 0.5030568838119507), (u'reticulated', 0.4983375668525696), (u'monkey', 0.49764129519462585)]

Page 15: Word embeddings as a service -  PyData NYC 2015

In [5]: model.most_similar_cosmul(positive=["python", "programming"], topn=5)

Out[5]: [(u'perl', 0.5658619999885559), (u'scripting', 0.559501588344574), (u'scripts', 0.5469149351119995), (u'php', 0.5461974740028381), (u'language', 0.5350533127784729)]

In [6]: model.most_similar_cosmul(positive=["python", "venomous"], topn=5)

Out[6]: [(u'scorpion', 0.5413044095039368), (u'snakes', 0.5263831615447998), (u'snake', 0.5222328901290894), (u'spider', 0.5214570164680481), (u'marsupial', 0.517005205154419)]

Page 16: Word embeddings as a service -  PyData NYC 2015

The classical example:

v(king) - v(man) + v(woman) -> v(queen)

In [7]: model.most_similar_cosmul(positive=["king", "woman"], negative=["man"])

Out[7]: [(u'queen', 0.8964556455612183), (u'monarch', 0.8495977520942688), (u'throne', 0.8447030782699585), (u'princess', 0.8371668457984924), (u'elizabeth', 0.835679292678833), (u'daughter', 0.8348594903945923), (u'prince', 0.8230059742927551), (u'mother', 0.8154449462890625), (u'margaret', 0.8147734999656677), (u'father', 0.8100854158401489)]

Page 17: Word embeddings as a service -  PyData NYC 2015

Training a model

Very easy once you have a clean corpus
Great tools in Python

Tutorial on training a model using Gensim: http://rare-technologies.com/word2vec-tutorial/

Radim Řehůřek gave a talk last year at PyData Berlin about optimizations in Cython: https://www.youtube.com/watch?v=vU4TlwZzTfU

For GloVe: https://github.com/maciejkula/glove-python
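As a rough illustration of how simple the training step is with Gensim, here is a minimal sketch. It assumes a plain-text corpus file named corpus.txt with one sentence per line; the file name and hyperparameters are illustrative, not the values used in the talk.

In [ ]: # Minimal training sketch (illustrative, not the talk's setup)
        from gensim.models import Word2Vec
        from gensim.models.word2vec import LineSentence

        sentences = LineSentence("corpus.txt")   # streams the corpus from disk
        model = Word2Vec(sentences,
                         size=300,       # number of dimensions
                         window=5,       # context window
                         min_count=5,    # drop rare words
                         workers=8)      # parallel training threads
        model.save("my_model.w2v")
        print(model.most_similar("python", topn=5))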

Page 18: Word embeddings as a service -  PyData NYC 2015

Gensim word2vec implementation specifics:

Training time: ~8 hours on an 8-core/8-thread machine to learn 600 dimensions on a 1.9B-word corpus
Memory requirements depend on the vocabulary size and on the number of dimensions:

3 matrices * 4 bytes (float) * |dimensions| * |vocabulary|

The GloVe implementation in Python (https://github.com/maciejkula/glove-python/) takes half the time but has quadratic memory usage. Check the pull requests for memory optimizations.
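To make the formula concrete, here is a quick back-of-the-envelope estimate. The 2M-word vocabulary used in the example is an illustrative figure, not one from the talk.

In [ ]: # Memory estimate from the formula above:
        # 3 matrices * 4 bytes (float32) * |dimensions| * |vocabulary|
        def w2v_memory_gb(dimensions, vocabulary):
            return 3 * 4 * dimensions * vocabulary / 1e9

        # e.g. 600 dimensions, 2M-word vocabulary (illustrative figures)
        print(w2v_memory_gb(600, 2000000))  # ~14.4 GB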

Page 19: Word embeddings as a service -  PyData NYC 2015

A good thing to know: a bigger training set does improve the quality of the model, even for specialized tasks.

As a consequence, you probably want to use a huge corpus. Good models are available.

Building your own model can be useful when you want to find out about the properties of your corpus, or when you want to compare different corpora, for example the evolution of language in a newspaper across different periods of time.

Page 20: Word embeddings as a service -  PyData NYC 2015

Finding a model

From: https://github.com/3Top/word2vec-api/

Model file - Number of dimensions

Google News (GoogleNews-vectors-negative300.bin.gz) - 300
Freebase IDs (https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) - 1000
Freebase names (https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing) - 1000
Wikipedia+Gigaword 5 (http://nlp.stanford.edu/data/glove.6B.zip) - 50/100/200/300
Common Crawl 42B (http://nlp.stanford.edu/data/glove.42B.300d.zip) - 300
Common Crawl 840B (http://nlp.stanford.edu/data/glove.840B.300d.zip) - 300
Twitter (2B Tweets) (http://www-nlp.stanford.edu/data/glove.twitter.27B.zip) - 25/50/100/200
Wikipedia dependency (http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2) - 300

Page 21: Word embeddings as a service -  PyData NYC 2015

DBPedia vectors (https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent) - 1000

Page 22: Word embeddings as a service -  PyData NYC 2015

Building a recommendation engine for 3Top categories

By combining word vectors, we build category vectors.

In [8]: def build_category_vector(category):
            pass  # Get the postags
            vector = []
            for tag in postags:
                if tag.tagged in ['NN', 'NNS', 'JJ', 'NNP', 'NNPS', 'NNDBN', 'VBG', 'CD']:  # Only keep meaningful words
                    try:
                        v = word2vec(tag.tagValue)  # Get the word vector
                        if v.any():
                            vector.append(v)
                    except:
                        logger.debug("Word not found in corpus: %s" % tag.tagValue)
                        tagset.add(tag.tagValue)
            if vector:
                return matutils.unitvec(np.array(vector).mean(axis=0))  # Average the vectors
            else:
                return np.empty(300)

We store those vectors in a category space, and at page load time compute the most similar categories for a given category.
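As a rough sketch of how that lookup could work (this is not 3Top's actual code): since build_category_vector returns unit-normalized vectors, cosine similarity reduces to a dot product over the stored vectors. The category_vectors dict and the function name below are hypothetical.

In [ ]: # Hypothetical sketch of a most-similar lookup over unit-normalized
        # category vectors; `category_vectors` is an {category_id: unit vector}
        # dict, not 3Top's actual data structure.
        import numpy as np

        def most_similar_categories_sketch(query_vector, category_vectors, n=5):
            ids = list(category_vectors.keys())
            matrix = np.array([category_vectors[i] for i in ids])  # one row per category
            scores = matrix.dot(query_vector)                      # cosine sim (unit vectors)
            best = np.argsort(scores)[::-1][:n]
            return [(ids[i], float(scores[i])) for i in best]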

Page 23: Word embeddings as a service -  PyData NYC 2015

Now let us look at the similarity method.

sim(c1, c2) = v(c1) · v(c2)

In [21]: cs = CategorySimilarity()
         # print(Category.objects.all().count())
         category = Category.objects.get(category=u"Blue-collar beers that come in a can")
         _ = [print(c) for c in cs.most_similar_categories(category, n=5)]

DEBUG Category space size (as found in the cache): 1125

Belgian Trappist Beers
Belgian Beer Cafe in NYC
Dark and Stormy Cocktail in NYC
Brands of Ginger Beer
Pink Drinks

Page 24: Word embeddings as a service -  PyData NYC 2015

In [10]: category = Category.objects.get(category=u"Italian Restaurants in NYC.")
         _ = [print(c) for c in cs.most_similar_categories(category, n=5)]

Italian Restaurants in NY
Restaurants in Nyc
NYC Mexican Restaurants
Romanian Restaurants in NYC
Thai Restaurants in NYC

In [11]: category = Category.objects.get(category=u"Coen Brothers Movies")
         _ = [print(c) for c in cs.most_similar_categories(category, n=10)]

Quentin Tarantino Movies
Martin Scorsese Films.
Movies Starring Creepy Children
Tim Burton Movies
Movies Starring Sean Penn
Pixar Movies
Godfather Movies
Berlin Indie Movie Theaters
Kubrick Movies
Harry Potter Movies

Page 25: Word embeddings as a service -  PyData NYC 2015

Our recommendation system uses the Common Crawl 42B words, 300 dimensions model trained with GloVe.

It takes around 6 GB in memory ... and this is a problem:

We run a Django server and 8 Celery workers on an EC2 T2 Micro... That would be a lot of memory for that poor instance.

Page 26: Word embeddings as a service -  PyData NYC 2015

A word embedding service

We separate the word embedding model out as a service
Simple Flask server with a few primitives:

curl http://127.0.0.1:5000/word2vec/similarity?w1=Python&w2=Java
curl http://127.0.0.1:5000/word2vec/n_similarity?ws1=Python&ws1=programming&ws2=Java
curl http://127.0.0.1:5000/word2vec/model?word=Python
curl http://127.0.0.1:5000/word2vec/most_similar?positive=king&positive=queen&negative=man

Easy to set up:
python word2vec-api --model path/to/the/model [--host host --port 1234]

Get it at https://github.com/3Top/word2vec-api
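The real server lives in the word2vec-api repository above; the snippet below is only a simplified sketch of the idea (a Flask app wrapping a loaded Gensim model, exposing two of the primitives), with an illustrative model path, not the actual implementation.

In [ ]: # Simplified sketch of the word2vec-api idea (see the repository above
        # for the real implementation): a Flask app wrapping a Gensim model.
        from flask import Flask, request, jsonify
        from gensim.models import Word2Vec

        app = Flask(__name__)
        model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")  # path is illustrative

        @app.route("/word2vec/similarity")
        def similarity():
            # cast to float so Flask can serialize the numpy scalar
            return jsonify(similarity=float(model.similarity(request.args["w1"], request.args["w2"])))

        @app.route("/word2vec/most_similar")
        def most_similar():
            positive = request.args.getlist("positive")
            negative = request.args.getlist("negative")
            return jsonify(results=model.most_similar(positive=positive, negative=negative, topn=10))

        if __name__ == "__main__":
            app.run(host="127.0.0.1", port=5000)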

Page 27: Word embeddings as a service -  PyData NYC 2015
Page 28: Word embeddings as a service -  PyData NYC 2015

Caching the vector space

As the number of categories increases, we do not want to rebuild category vectors and hit the database every time a recommendation is needed (every page access). The size of the category vectors is actually significant:

In [12]: print(np.fromstring(category.vector).nbytes)

2400

In [13]: print(u"... and for {} categories the space becomes large (~{}MB)".format(len(cs.category_space.syn0), cs.category_space.syn0.nbytes/1000000))

... and for 1125 categories the space becomes large (~2MB)

Page 29: Word embeddings as a service -  PyData NYC 2015

We store vectors with the category object in MySQL, using a base64 encoding of the numpy object. Let's look at it:

In [14]: print(category._vector[:1000] + "...")

oNl7j9l8hr/2FoHhEbSyv+GALkJ6JVU/rxm5pueouL80aF72QLiivzr0z1WKaZM/i3M5FvqQmD+hDmIs7fitv+lL6cLFSbM/lSwwkBxv0D/sA+FgTxmhP5+lJZPGVLE/q8pBj07ukT9OceKjxl2jv4s4cA7RJIc/JxVUF8afnr/RQcXciUyCP+M4N3mbtrS/Ngwo85uUor/+4vqargCyP7YnHbTSv3W/MHjrh6iHsr/1vkzSmI+yP+bsS7E0B5u/JJ5iJAUNoT//xo0IJ3inP5/BwCfgWZ2/Q2r8q9Fuir/KdAOAr0OQPwzGTnUXU5i/9uQD77+xuD/1QKbEaDWjvyqfSePd7XE/3RLqJXiOrz8ZyEDICd2UP2beFLiqPZy/rIb+8sFgqr+ILyc3/5yoP5pL25IahpQ/4WpgCeuNZ7/ley+Tl9ygP+knz2odUHo/yBSdc5+Klj+GLgrafftvP2yiB76KBY2/z0RqB+1ri7+THdXBVVKvPzwZ2X+2HaA/oOThsHeidL/O7w8+bummv8Z8XCqeYas/WzQpioG6qr+JaauGrie2P7+8NmNBN5o/0Ji6XFJMpj/xAtoHvg9PPyr3OOBXVaG/M2aCbN8dv79pANKgDzNrPy4XXBNVi4S/WBuYYjWZlD8T/W3jLbOJPy3xHNTzarQ/MoOWx7aZtz/RDMwbryievwA5kQgazaO/3Ep0jVo1rD+ns3oJ3iWUv4TlEPcAnJy/dHNcwygjnr/cMGYNKPbCP5E06afPWa6/mUHAC+8mjj+NwgyjQFB5v6ffLvduuai/kBntVvsdpb8Yg3KzY/qev9r5son3VJg/h06aD0/IuD8NMHm7jGViv7o8zQzPd5U/esP4Ax6BrL8TOZuX+qGpP1WHNPzdQH0/7HXRMAqXmr9G8pkwjbenv3RvQppal7i/E5jWmLXSp792VpPxJeOjPyEKhEhl324/1E00QnHdvr9Mg0Fohd+cP6UAj0X5R5g/2umwTF42...

Page 30: Word embeddings as a service -  PyData NYC 2015

In [15]: # a property method takes care of the decoding
         def get_vector(self):
             return base64.b64decode(self._vector)

         def set_vector(self, value):
             encoded = base64.b64encode(value)
             self._vector = encoded

         vector = property(get_vector, set_vector)

In [16]: np.fromstring(category.vector)[:100]

Out[16]: array([-0.01098032, -0.07306015, 0.00129067, -0.09632728, -0.03656199, 0.01895729, 0.02399054, -0.05853978, 0.07534443, 0.25678171, 0.03339623, 0.06769982, 0.01751063, -0.03782483, 0.01130069, -0.02990636, 0.00893505, -0.08091137, -0.03629005, 0.07032291, -0.00530989, -0.07238248, 0.07250362, -0.02639468, 0.03330246, 0.04583857, -0.02866316, -0.01290668, 0.0158832 , -0.02375447, 0.09646225, -0.03751686, 0.00437724, 0.06163383, 0.02037444, -0.02757899, -0.05151945, 0.04807279, 0.02004282, -0.00287529, 0.03293298, 0.00642406, 0.02201318, 0.0039041 , -0.01417073, -0.01338945, 0.06117504, 0.03147669, -0.00503775, -0.04474968, 0.05347914, -0.05220418, 0.086543 , 0.02560141, 0.04355104, 0.00094792, -0.03385424, -0.12154957, 0.00332025, -0.01003138, 0.02011569, 0.01254879, 0.07975696, 0.09218924, -0.02945207, -0.03867418, 0.05509456, -0.0196757 , -0.02793886, -0.029431 , 0.1481371 , -0.05927895, 0.0147227 , -0.00618005, -0.04828975, -0.04124437, -0.03025203, 0.02376162, 0.09680647, -0.00224569, 0.02096485, -0.05567259, 0.05006393, 0.00714194, -0.0259668 , -0.04632226, -0.09605948, -0.04652946, 0.03884238, 0.00376863, -0.12056644, 0.02819642, 0.02371206, 0.08286085, 0.08104846, -0.03060514, -0.0313298 , -0.00715603, -0.05278924, 0.0031662 ])

Page 31: Word embeddings as a service -  PyData NYC 2015

In order to avoid issuing a few thousand SQL queries every time a page is loaded, we use Memcache to store the category space.
As the space is larger than 1 MB, we store each vector with its own key (the category Id). They share a common key prefix.
We directly store the numpy vectors through the Gensim API.
A separate key is used for the vocabulary indexes.

In [17]: def set_space_cache(space):
             sim.set(VOC, space.vocab)
             sim.set(IDX, space.index2word)
             sim.set_many({"{0}-{1}".format(VEC, i): space.syn0[i] for i in range(len(space.vocab))})

This also allows adding a category vector to the space without having to rebuild it, simply by stacking its vector in the cache and updating the cached space indexes.

In [18]: def add_last_vector_to_space_cache(space):
             sim.set(VOC, space.vocab)
             sim.set(IDX, space.index2word)
             sim.set("{}-{}".format(VEC, len(space.vocab)-1), space.syn0[-1])

Page 32: Word embeddings as a service -  PyData NYC 2015

Updates

Each process gets its own copy of the vector space.
Whenever a category is added, the space is updated in cache.
Django signals are used to tell other processes to reload the space from cache (see the sketch below).
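The talk does not show this wiring; the snippet below is only one possible way to hook category creation with a Django post_save signal and a version key that other workers check before serving recommendations. The names SPACE_VERSION, cs (a CategorySimilarity instance), add_category_to_space, and the import path are assumptions, not 3Top's actual code.

In [ ]: # Hypothetical illustration only: update the cached space when a category
        # is created, and bump a version key so other workers reload it.
        from django.core.cache import cache
        from django.db.models.signals import post_save
        from django.dispatch import receiver

        from categories.models import Category  # import path assumed

        SPACE_VERSION = "category-space-version"  # hypothetical cache key

        @receiver(post_save, sender=Category)
        def on_category_saved(sender, instance, created, **kwargs):
            if created:
                cs.add_category_to_space(instance)  # stack the new vector into the cached space
                cache.set(SPACE_VERSION, cache.get(SPACE_VERSION, 0) + 1)

        # Each worker compares its local copy against SPACE_VERSION before serving
        # recommendations and reloads the space from Memcache when it has changed.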

Page 33: Word embeddings as a service -  PyData NYC 2015

Work in progress

We are about to add a few 100k generated categories
The category space will become large in memory: 8 workers * 2.4 kB * 100,000 categories = 1.9 GB
Including entity vectors would improve results for names, places, etc.
Training on a specialized corpus of categories scraped all over the web
Train a phrase2vec model on these categories

Page 34: Word embeddings as a service -  PyData NYC 2015

Resources

Tutorials & Applications

Page 35: Word embeddings as a service -  PyData NYC 2015

Instagram: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji

Word embeddings and RNNs: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Word2vec gensim tutorial: http://rare-technologies.com/word2vec-tutorial/

Clothing style search: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

In digital humanities: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html

In digital humanities, application to gender studies: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html

Document classification on Yelp reviews: http://nbviewer.ipython.org/github/taddylab/deepir/blob/master/w2v-inversion.ipynb

Page 36: Word embeddings as a service -  PyData NYC 2015


Resources

Academic Papers

Le, Quoc V., and Tomas Mikolov. "Distributed representations of sentences and documents." arXiv preprint arXiv:1405.4053 (2014).
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." (2014).
Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).

Page 37: Word embeddings as a service -  PyData NYC 2015

In [19]: Thank you !

File "<ipython-input-19-f087ca1d6988>", line 1 Thank you ! ̂SyntaxError: invalid syntax