hacking human language (pydata london)

Hacking Human Language Hendrik Heuer London

Upload: hendrik

Post on 03-Aug-2015


Hacking Human Language

Hendrik Heuer

London

– Hacker Ethics

“Access to computers, and anything which might teach you something about the way the world works, should be unlimited and total. Always yield to the Hands-On Imperative!”

Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.

Agenda

• Computational Social Science

• Natural Language Processing

• Word Vector Representations

• Comparing different Wikipedia revisions

• Random Indexing

• word2vec patent

About me

Hej, I’m Hendrik.

Computational Social Science

Computational Social Science Digital Humanities

• combines computer science & social sciences

• makes new research possible, e.g. the analysis of massive social networks and content of millions of books

immersion.media.mit.edu

D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI: 10.1145/2184319.2184336

Massive-scale automated analysis of news-content

• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)

• automatically annotated into 15 topic areas

• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances

I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI: 10.1080/21670811.2012.714928

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”

Male-to-Female Ratio
Named Entity Recognition

“Gender bias in sports coverage (...) females only account for between only 7 and 25 per cent of coverage”

Machine Learning
• scikit-learn

Text Processing / Topic Modeling
• gensim
• Natural Language Toolkit
• spaCy
• word2vec

Visualization
• d3.js
• Google Chart API
• Highcharts

Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…

>>> import nltk
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway', 'NNP'), ('.', '.')]

NN*  Noun
VB*  Verb
JJ*  Adjective
RB*  Adverb
DT   Determiner
IN   Preposition
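The tagged output above can feed a simple style measure such as the share of adjectives. The following is a minimal sketch only: the function name `adjective_ratio` is hypothetical, it assumes Penn Treebank tags as produced by `nltk.pos_tag`, and it covers only the adjective-counting half of the linguistic subjectivity measure mentioned earlier (the SentiWordNet part is left out).

```python
def adjective_ratio(tagged_words):
    """Return the fraction of word tokens tagged as adjectives (JJ, JJR, JJS)."""
    # Skip punctuation: Penn Treebank punctuation tags do not start with a letter.
    word_tags = [tag for _, tag in tagged_words if tag[0].isalpha()]
    if not word_tags:
        return 0.0
    adjectives = sum(1 for tag in word_tags if tag.startswith("JJ"))
    return adjectives / len(word_tags)


# Tagged sentence copied from the slide above.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]

ratio = adjective_ratio(tagged)
print(f"{ratio:.3f}")
```

Here the sentence has 13 word tokens and one adjective ("same"), so the ratio is 1/13.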

Named Entity Recognition
Identifying people, organizations, locations…

>>> import nltk
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the', 'DT'), Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])

ORGANIZATION  Georgia-Pacific Corp., WHO
PERSON        Eddy Bonte, President Obama
LOCATION      Murray River, Mount Everest
DATE          June, 2008-06-29
TIME          two fifty a m, 1:30 p.m.
MONEY         GBP 10.40
PERCENT       twenty pct, 18.75 %
FACILITY      Washington Monument, Stonehenge
GPE           South East Asia, Midlothian (geo-political entity)

Sentiment Analysis
Tell if a sentence is positive or negative

Stanford Core NLP Tools

Word Vector Representations

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.


Quoted after Socher

Vectors are directions in space


Quoted after Socher

word2vec
Representing a word with a vector

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

word2vec
Vectors can encode relationships

[Figure: word vectors for MAN, WOMAN, UNCLE, AUNT, KING and QUEEN]

[Figure: word vectors for KING, KINGS, QUEEN and QUEENS]

word2vec
Analogy puzzles

England is to London as Germany is to ?
598.7ms [["Berlin",0.563393235206604],["Dusseldorf",0.5625754594802856],["Munich",0.5460122227668762],["Budapest",0.5285829901695251],["Düsseldorf",0.5266501903533936]]

England is to Cameron as Germany is to ?
556.8ms [["Merkel",0.5016422867774963],["Schroeder",0.49941977858543396],["Klaus",0.4981233477592468],["Schröder",0.4947296977043152],["Peer_Nils",0.492642343044281]]

fast is to fastest as slow is to ?
806.2ms [["slowest",0.7025301456451416],["slower",0.6236234307289124],["slowed",0.5842559337615967],["slowing",0.5462259650230408],["quickest",0.5290436744689941]]

wake is to woken as be is to ?
929.9ms [["been",0.41698968410491943],["tobe",0.40402814745903015],["are",0.3866569399833679],["being",0.3746173679828644],["notbe",0.36837878823280334]]

Scotland is to haggis as Germany is to ?
793.5ms [["Currywurst",0.5284685492515564],["schnitzel",0.5208959579467773],["wursts",0.5166285037994385],["sauerkraut",0.512742817401886],["stollen",0.5095855593681335]]

communism is to Karl_Marx as capitalism is to ?
544.7ms [["Capitalism",0.5884973406791687],["capitalist",0.5700926184654236],["Friedrich_Hayek",0.5352163314819336],["Milton_Friedman",0.5348755121231079],["John_Maynard_Keynes",0.5335651636123657]]

Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
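The analogy queries above come down to vector arithmetic followed by a nearest-neighbour search. The sketch below shows the idea on hand-made three-dimensional toy vectors. These numbers are invented for illustration and are not trained embeddings, the helper names `cosine` and `analogy` are hypothetical, and real libraries such as gensim differ in details like normalisation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c, topn=3):
    """Answer 'a is to b as c is to ?' by ranking words near vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [(word, cosine(target, vec))
                  for word, vec in vectors.items() if word not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, made up for this example: (royalty, gender, family).
toy = {
    "man":   [0.0,  0.9, 0.0],
    "woman": [0.0, -0.9, 0.0],
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "uncle": [0.0,  0.8, 0.9],
    "aunt":  [0.0, -0.8, 0.9],
}

result = analogy(toy, "man", "woman", "king")
print(result[0][0])
```

With these toy values the top-ranked word for "man is to woman as king is to ?" is "queen".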

Sweden
Most similar words

Harvard
Most similar words

Word vector representations in Python

Link: https://radimrehurek.com/gensim/models/word2vec.html

Link: https://honnibal.github.io/spaCy/

spaCy
Dependency-Based Word representations by Levy and Goldberg

Gensim
word2vec by Mikolov et al

5 words context window

2 words context window

word2vec
Training word vectors with generator

Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
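A sketch of what streaming a corpus for training might look like. The slide's own code is not preserved in this transcript, so the class name `SentenceStream`, the one-sentence-per-line file layout and the lowercase whitespace tokenisation are all assumptions. The key point is that training makes several passes over the data, so gensim expects a restartable iterable rather than a one-shot generator.

```python
import os
import tempfile


class SentenceStream:
    """Stream tokenised sentences from a text file, one sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip empty lines
                    yield tokens


# Small temporary corpus, made up for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Sweden had the same king as Denmark\n\nNorway had the same king too\n")
    corpus_path = tmp.name

sentences = SentenceStream(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass works because __iter__ reopens the file

# With gensim installed, the stream could be handed to the model, for example
# (parameter names as in gensim 4.x; older releases use `size` instead of `vector_size`):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1)

os.remove(corpus_path)
```

Keeping only one line in memory at a time is what makes this approach suitable for corpora that do not fit into RAM.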

Applications

Machine Translation

T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available: http://arxiv.org/abs/1309.4168

Comparing Wikipedia revisions

1. Find Word Representations: word2vec
2. Dimensionality Reduction: t-SNE
3. Visualisation: JSON

Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/

linguistics

Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Link: https://github.com/mbostock/d3/wiki/Gallery
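The last step of the pipeline above, handing 2-D coordinates to a browser visualisation as JSON, could look roughly like this. It is a sketch under assumptions: the record layout (`word`, `x`, `y`) and the function name `to_json_points` are hypothetical, and the coordinates are made-up placeholders. In practice they would come from step 2, for instance from `sklearn.manifold.TSNE(n_components=2).fit_transform(...)` applied to the word vectors.

```python
import json


def to_json_points(words, coordinates):
    """Pair each word with its 2-D coordinates and serialise the result as JSON."""
    points = [{"word": word, "x": float(x), "y": float(y)}
              for word, (x, y) in zip(words, coordinates)]
    return json.dumps(points, ensure_ascii=False)


words = ["king", "queen", "throne"]
# Placeholder coordinates, invented for this example (not real t-SNE output).
coordinates = [(0.5, 1.25), (0.75, 1.0), (-2.0, 3.5)]

payload = to_json_points(words, coordinates)
print(payload)
```

A flat list of records like this is easy to bind to a scatter plot on the JavaScript side.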

Comparing Wikipedia revisions
Game of Thrones

Bird’s eye view
Game of Thrones

Bird’s eye view, intersection set
Game of Thrones

Characters in 2013 and 2015
Game of Thrones

Bird’s eye view
United States

Bird’s eye view, intersection set
United States

Bird’s eye view, 2015
United States

Bird’s eye view, 2013
United States
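Assuming that "intersection set" in the slides above refers to the words that occur in both revisions of an article, the comparison can be sketched with plain set operations. The function name `compare_vocabularies` and the two example sentences are made up for illustration and are not taken from Wikipedia.

```python
def compare_vocabularies(old_tokens, new_tokens):
    """Split the vocabularies of two revisions into shared and revision-specific words."""
    old_vocab, new_vocab = set(old_tokens), set(new_tokens)
    return {
        "shared": sorted(old_vocab & new_vocab),    # the intersection set
        "only_old": sorted(old_vocab - new_vocab),  # words dropped in the newer revision
        "only_new": sorted(new_vocab - old_vocab),  # words added in the newer revision
    }


# Invented example revisions.
revision_2013 = "the king rules the north".split()
revision_2015 = "the queen rules the south".split()

result = compare_vocabularies(revision_2013, revision_2015)
print(result["shared"])
```

Restricting a plot to the shared words makes two revisions directly comparable, while the two difference sets show what was added or removed.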

word2vec patent

• “Pretty much every time Google has engaged in patent infringement litigation, it has been against someone who has brought an infringement suit against them first. (…) it keeps inventions they are using out of the hands of patent trolls”

• “Idiotic. I'm surprised they didn't just patent matrix algebra”

• “fuck software patents”

https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/

Google’s word2vec patent
Reactions from the community

• Omer Levy:
  • The novelty claim in this patent is somewhat bogus
  • word2vec is doing more or less what the NLP research community has been doing for the past 25 years
  • much of the improvement in performance stems from preprocessing "hacks" and hyperparameter settings
  • word2vec is a brilliantly efficient implementation of decade-old ideas

https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/

Google’s word2vec patent
What does it change for NLP practitioners?

• “Likely nothing. It's probably one of the thousands of overly broad "defensive" patents held by companies.”

• “Didn't it have an Apache open source license before-hand?”

Map Reduce & Hadoop

Random Indexing

• all information in vectors
• each word has a hash key
  • n-dimensional vector
  • most dimensions are 0
  • for a small number k, randomly distributed -1 or +1 values
• the dimension of the vectors is much smaller than the number of contexts

Random Indexing
Incremental word space model

• Every time you see a word w_i, add the hash key of the words in the context window v_{i-3}, …, v_{i+3} to the word’s context vector v_i

• After a number of occurrences, the context vector holds information about a word’s distribution

• dimensionality reduction, computationally less costly than methods like PCA

[Figure labels: hash key, context vector]
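The update rule described above can be sketched in a few lines of pure Python. This is a toy illustration under assumptions, not Gavagai's or any library's implementation: the function names, the dimension of 20, the 4 non-zero entries and the example sentence are all chosen for the example. Seeding the random generator with the word itself stands in for the "hash key" idea, so that a word always maps to the same sparse vector.

```python
import random
from collections import defaultdict


def make_index_vector(word, dim=20, nonzero=4):
    """Sparse 'hash key': mostly zeros, with a few randomly placed +1 / -1 entries."""
    rng = random.Random(word)  # seeded by the word, so the key is reproducible
    vector = [0] * dim
    for position in rng.sample(range(dim), nonzero):
        vector[position] = rng.choice((-1, 1))
    return vector


def random_indexing(tokens, window=3, dim=20, nonzero=4):
    """Accumulate, for every word, the index vectors of its neighbours in the window."""
    context_vectors = defaultdict(lambda: [0] * dim)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not its own context
            key = make_index_vector(tokens[j], dim, nonzero)
            context = context_vectors[word]
            for d in range(dim):
                context[d] += key[d]
    return dict(context_vectors)


# Invented example text.
tokens = "the king rules the land and the queen rules the land".split()
vectors = random_indexing(tokens)
print(vectors["king"])
```

Because each occurrence only adds a few small vectors, the model can be updated incrementally as new text arrives, without a separate dimensionality reduction step.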

Gavagai Living Lexicon


https://en.wikipedia.org/wiki/Athens


More than words…

doc2vec

Image Captioning

Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n, http://cs231n.stanford.edu/syllabus.html


word2vec
Vectors can encode relationships

[Figure: word vectors for KING, KINGS, QUEEN and QUEENS]

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781


Image Question Answering

H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, ‘Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering’, CoRR, vol. abs/1505.05612, 2015 [Online]. Available: http://arxiv.org/abs/1505.05612

Hacking Human Language

Hendrik Heuer

London

[email protected]
http://hen-drik.de
@hen_drik

Thanks to Andrii, Jussi & Roelof

Slides: https://tinyurl.com/pydata-language

word2vec
How it is trained

predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i

predict the surrounding words
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
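The two training setups above correspond to the CBOW and skip-gram architectures in the Mikolov et al. paper. The sketch below only builds the (input, output) training examples for a window of two words on each side. It does not train a network, and the function names are hypothetical.

```python
def context_window(tokens, i, window=2):
    """Return the words up to `window` positions left and right of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the current word is the output."""
    return [(context_window(tokens, i, window), word)
            for i, word in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each surrounding word is an output."""
    return [(word, context_word)
            for i, word in enumerate(tokens)
            for context_word in context_window(tokens, i, window)]


tokens = "you shall know a word".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)

print(cbow[2])       # the example centred on "know"
print(skipgram[:2])  # the first two pairs, centred on "you"
```

CBOW yields one example per position, while skip-gram yields one pair per (word, neighbour) combination, so it produces more training examples from the same text.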