
NLP Document and Sequence Models

Computational models of how natural languages work

These are sometimes called Language Models or sometimes Grammars

Three main types (among many others):
1. Document models, or “topic” models
2. Sequence models: Markov models, HMMs, others
3. Context-free grammar models

Computational models of how natural languages work

Most of the models I will show you are
- Probabilistic models
- Graphical models
- Generative models
In other words, they are essentially Bayes Nets.

In addition, many (but not all) are
- Latent variable models
This means that some variables in the model are not observed in data and must be inferred (like the hidden states in an HMM).

Topic Models

Topic models are models that
• can predict the likelihood of a document
• using latent topic variables

Let x1, …, xN be the words in a document.

P(x1, …, xN) is the likelihood of a document.

Topic models compute this (conceptually) by computing

P(x1, …, xN) = ∏i Σzi P(zi) P(xi | zi)

where each zi is a latent variable that represents the “topic” of word xi.
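As a minimal illustration (not from the slides), here is a Python sketch of that computation for a toy model; the two topics, the tiny vocabulary, and the probability tables are made up for the example:

import math

# Hypothetical toy model: two topics over a tiny vocabulary.
p_topic = {"theater": 0.5, "sports": 0.5}             # P(z)
p_word_given_topic = {                                 # P(x | z)
    "theater": {"play": 0.4, "stage": 0.4, "game": 0.2},
    "sports":  {"play": 0.3, "stage": 0.1, "game": 0.6},
}

def document_log_likelihood(words):
    # log P(x1, ..., xN) = sum_i log( sum_zi P(zi) * P(xi | zi) )
    total = 0.0
    for w in words:
        word_prob = sum(p_topic[z] * p_word_given_topic[z].get(w, 0.0)
                        for z in p_topic)
        total += math.log(word_prob)
    return total

print(document_log_likelihood(["play", "game", "game"]))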

Three documents with the word “play” (numbers in brackets are topic assignments)

Document 1: A Play[082] is written[082] to be performed[082] on a stage[082] before a live[093] audience[082] or before motion[270] picture[004] or television[004] cameras[004] (for later[054] viewing[004] by large[202] audiences[082]). A Play[082] is written[082] because playwrights[082] have something ...

Document 2: He was listening[077] to music[077] coming[009] from a passing[043] riverboat. The music[077] had already captured[006] his heart[157] as well as his ear[119]. It was jazz[077]. Bix Beiderbecke had already had music[077] lessons[077]. He wanted[268] to play[077] the cornet. And he wanted[268] to play[077] jazz[077] ...

Document 3: Jim[296] plays[166] the game[166]. Jim[296] likes[081] the game[166] for one. The game[166] book[254] helps[081] Jim[296]. Don[180] comes[040] into the house[038]. Don[180] and Jim[296] read[254] the game[166] book[254]. The boys[020] see a game[166] for two. The two boys[020] play[166] the game[166] ...

Example: topics from an educational corpus (TASA)

• 37K docs, 26K words
• 1700 topics, e.g.:

• PRINTING, PAPER, PRINT, PRINTED, TYPE, PROCESS, INK, PRESS, IMAGE, PRINTER, PRINTS, PRINTERS, COPY, COPIES, FORM, OFFSET, GRAPHIC, SURFACE, PRODUCED, CHARACTERS
• PLAY, PLAYS, STAGE, AUDIENCE, THEATER, ACTORS, DRAMA, SHAKESPEARE, ACTOR, THEATRE, PLAYWRIGHT, PERFORMANCE, DRAMATIC, COSTUMES, COMEDY, TRAGEDY, CHARACTERS, SCENES, OPERA, PERFORMED
• TEAM, GAME, BASKETBALL, PLAYERS, PLAYER, PLAY, PLAYING, SOCCER, PLAYED, BALL, TEAMS, BASKET, FOOTBALL, SCORE, COURT, GAMES, TRY, COACH, GYM, SHOT
• JUDGE, TRIAL, COURT, CASE, JURY, ACCUSED, GUILTY, DEFENDANT, JUSTICE, EVIDENCE, WITNESSES, CRIME, LAWYER, WITNESS, ATTORNEY, HEARING, INNOCENT, DEFENSE, CHARGE, CRIMINAL
• HYPOTHESIS, EXPERIMENT, SCIENTIFIC, OBSERVATIONS, SCIENTISTS, EXPERIMENTS, SCIENTIST, EXPERIMENTAL, TEST, METHOD, HYPOTHESES, TESTED, EVIDENCE, BASED, OBSERVATION, SCIENCE, FACTS, DATA, RESULTS, EXPLANATION
• STUDY, TEST, STUDYING, HOMEWORK, NEED, CLASS, MATH, TRY, TEACHER, WRITE, PLAN, ARITHMETIC, ASSIGNMENT, PLACE, STUDIED, CAREFULLY, DECIDE, IMPORTANT, NOTEBOOK, REVIEW

Polysemy

(The same six topics as above illustrate polysemy: the word PLAY appears in both the drama topic and the sports topic, COURT in both the sports and law topics, TEST in both the science and school topics, and CHARACTERS in both the printing and drama topics.)

Topic Modeling Techniques

The two most common techniques are:
• Probabilistic Latent Semantic Analysis (pLSA)
• Latent Dirichlet Allocation (LDA)

Commonly-used software packages:
• Mallet, a Java toolkit for various NLP-related tasks such as document classification; it includes a widely-used implementation of LDA.
• Stanford Topic Modeling Toolbox
• A list of implementations for various topic modeling techniques

The LDA Model

(Graphical-model diagram: in each document, latent topic variables z1, …, z4 generate the observed words w1, …, w4; the topic-word parameters β are shared across all documents.)

For each document,
• Choose θ ~ Dirichlet(α)
• For each of the N words wn:
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.
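A minimal Python sketch of this generative process (the vocabulary, number of topics, and the values of α and β below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

vocab = ["play", "stage", "game", "team", "court", "jury"]   # toy vocabulary
num_topics = 2
alpha = np.full(num_topics, 0.5)      # Dirichlet prior over topic proportions
beta = np.array([                      # p(word | topic), one row per topic
    [0.40, 0.30, 0.10, 0.05, 0.10, 0.05],   # a "theater"-flavored topic
    [0.05, 0.05, 0.30, 0.30, 0.20, 0.10],   # a "sports"-flavored topic
])

def generate_document(num_words):
    theta = rng.dirichlet(alpha)               # choose theta ~ Dirichlet(alpha)
    words = []
    for _ in range(num_words):
        z = rng.choice(num_topics, p=theta)    # choose topic z_n ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=beta[z])  # choose word w_n from p(w | z_n, beta)
        words.append(vocab[w])
    return words

print(generate_document(8))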

Applications

• Visualization and exploration of a corpus
• Track news stories as they evolve
• Pre-processing of a corpus for document classification tasks

(slide from tutorial by David Blei, KDD 2011)

Microsoft’s Twahpic System

http://twahpic.cloudapp.net/S4.aspx

LDA/pLSA for Text Classification

• Topic models are easy to incorporate into text classification (a sketch of this pipeline follows below):
1. Train a topic model using a big corpus
2. Decode the topic model (find the best topic/cluster for each word) on a training set
3. Train a classifier using the topic/cluster as a feature
4. On a test document, first decode the topic model, then make a prediction with the classifier
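One way this pipeline might look, sketched with scikit-learn; the corpus and labels are placeholders, and document-level topic proportions are used as features rather than per-word topic assignments:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Placeholder corpus and labels, just to show the shape of the pipeline.
train_docs = ["the actors performed the play on stage",
              "the team won the basketball game",
              "the jury heard the evidence in court",
              "the players scored in the soccer game"]
train_labels = ["arts", "sports", "law", "sports"]
test_docs = ["the defendant faced the judge"]

# 1. Train a topic model on the (ideally much larger) corpus.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)

# 2. Decode each training document into topic proportions.
train_topics = lda.fit_transform(train_counts)

# 3. Train a classifier that uses the topics as features.
clf = LogisticRegression(max_iter=1000).fit(train_topics, train_labels)

# 4. Decode the test document, then predict with the classifier.
test_topics = lda.transform(vectorizer.transform(test_docs))
print(clf.predict(test_topics))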

Why use a topic model for classification?

• Topic models help handle polysemy and synonymy
– The count for a topic in a document can be much more informative than the count of individual words belonging to that topic.
• Topic models help combat data sparsity
– You can control the number of topics
– At a reasonable choice for this number, you’ll observe the topics many times in training data (unlike individual words, which may be very sparse)

Synonymy and Polysemy (example from Lillian Lee)

Document 1: auto, engine, bonnet, tyres, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize

Synonymy: the documents appear dissimilar if you compare words, but are related.

Polysemy: the documents appear similar if you compare words, but are not truly related.


Sequence Models

Sequence models are models that
• can predict the likelihood of a sequence of text (e.g., a sentence),
• sometimes using latent state variables.

Let x1, …, xN be the words in a sentence.

P(x1, …, xN) is the likelihood of the sentence.

We’ll look at two types of generative sequence models:
• N-gram models, which are slightly fancy versions of Markov models
• Hidden Markov Models, which you’ve seen before

What’s a sequence model for?
• Speech recognition:
– Often the acoustic model will be confused between several possible words for a given sound
– Speech recognition systems choose between these possibilities by selecting the one with the highest probability, according to a sequence model
• Machine translation:
– Often, the translation model will be confused between several possible translations of a given phrase
– The system chooses between these possibilities by selecting the one with the highest probability, according to a sequence model

Many other applications:
• Handwriting recognition
• Spelling correction
• Optical character recognition
• …

Example Language Model Application

Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).

Straightforward model:

P(text | sound)

But this can be hard to train effectively.

Example Language Model Application

Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).

Traditional solution: Bayes’ Rule

P(text | sound) = P(sound | text) P(text) / P(sound)

• P(sound): ignore; it doesn’t matter for picking a good text
• P(sound | text): acoustic model (easier to train)
• P(text): language model (a sequence model)
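As a toy illustration (all numbers below are made up), the recognizer scores each candidate transcription by P(sound | text) x P(text) and keeps the best one:

# Hypothetical candidate transcriptions for one acoustic input, with
# made-up acoustic-model and language-model probabilities.
candidates = {
    "recognize speech":   {"p_sound_given_text": 0.40, "p_text": 0.0100},
    "wreck a nice beach": {"p_sound_given_text": 0.45, "p_text": 0.0001},
}

def score(text):
    c = candidates[text]
    # P(sound) is the same for every candidate, so it can be ignored.
    return c["p_sound_given_text"] * c["p_text"]

print(max(candidates, key=score))   # the language model favors "recognize speech"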


N-gram Models

P(w1, …, wT) = P(w1) x P(w2 | w1) x P(w3 | w2, w1) x P(w4 | w3, w2, w1) x … x P(wT | wT-1, …, w1)

N-gram models make the assumption that the next word depends only on the previous N-1 words.

For example, a 2-gram model (usually called bigram):

P(w1, …, wT) = P(w1) x P(w2 | w1) x P(w3 | w2) x P(w4 | w3) x … x P(wT | wT-1)

Notice: this is a Markov model, where the states are words.
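A minimal bigram-model sketch in Python, using a toy corpus and maximum-likelihood estimates with no smoothing (a real model would need smoothing for unseen bigrams):

from collections import Counter, defaultdict
import math

# Toy training corpus; <s> and </s> mark sentence boundaries.
sentences = [["<s>", "the", "boys", "play", "the", "game", "</s>"],
             ["<s>", "the", "game", "is", "fun", "</s>"],
             ["<s>", "the", "boys", "read", "the", "book", "</s>"]]

bigram_counts = defaultdict(Counter)
for sent in sentences:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[prev][word] += 1

def p(word, prev):
    # Maximum-likelihood estimate of P(word | prev).
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

def sentence_log_prob(sent):
    # log P(w1, ..., wT) = sum over t of log P(w_t | w_{t-1})
    return sum(math.log(p(w, prev)) for prev, w in zip(sent, sent[1:]))

print(sentence_log_prob(["<s>", "the", "boys", "play", "the", "book", "</s>"]))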

N-gram Tools

• IRSTLM, a C++ n-gram toolkit often used for speech recognition
• Berkeleylm, another n-gram toolkit, in Java
• Moses, a machine translation toolkit
• CMU Sphinx, an open-source speech recognition toolkit

Some HMM Tools

• jHMM, a Java implementation that’s relatively easy to use

• MPI implementation of HMMs, for training large HMMs in a distributed environment

(HMM Demo)

Conditional Random Fields

CRFs, like HMMs, are also latent-variable sequence models.

However, CRFs are discriminative instead of generative:

They cannot tell you P(x1, …, xN, z1, …, zN),

and they cannot tell you P(x1, …, xN).

But they are often better than HMMs at predicting P(z1, …, zN | x1, …, xN).

For this reason, they are often used as sequence labelers.
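As one possible sketch (not from the slides), here is a tiny CRF part-of-speech tagger using the third-party sklearn-crfsuite package; the training data and feature functions are toy placeholders:

import sklearn_crfsuite

# Toy POS-tagging data: each item is (tokens, per-token labels).
train_sents = [(["the", "boys", "play"], ["DET", "NOUN", "VERB"]),
               (["the", "game", "ends"], ["DET", "NOUN", "VERB"])]

def token_features(sent, i):
    # Simple per-token feature dictionary; real taggers use far richer features.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "prev_word": sent[i - 1].lower() if i > 0 else "<s>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

X_train = [[token_features(tokens, i) for i in range(len(tokens))]
           for tokens, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

test = ["the", "boys", "read"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))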

Example Sequence Labeling Task

Slide from Jenny Rose Finkel (ACL 2005)

Other common sequence-labeling tasks

• Tokenization/segmentation

Labels:   S S N N N N N N N N N N N S S N S N N S S N N N N N S S N N N S S
Sentence: “ T o k e n i z a t i o n   i s n ’ t   a l w a y s   e a s y . ”

Chinese word segmentation: Pi-Chuan Chang, Michel Galley and Chris Manning, 2008.

Other common sequence-labeling tasks

• Part-of-speech tagging

Tokens: “  Tokenization  is  n’t  always  easy  .  ”
Tags:   “  N             V   Adv  Adv     Adj   .  ”

• Relation extraction

Drug/O giant/O Pfizer/B-arg Inc./I-arg has/O reached/B-rel an/I-rel agreement/I-rel to/I-rel buy/I-rel the/O private/O biotechnology/O firm/O Rinat/B-arg Neuroscience/I-arg Corp./I-arg ,/O the/O companies/O announced/O Thursday/O

buy(Pfizer Inc., Rinat Neuroscience Corp.)

ReVerb demo

http://reverb.cs.washington.edu/

Some CRF implementations, tools

• http://teamcohen.github.io/MinorThird/, a package of a variety of NLP tools, including CRFs

• http://www.chokkan.org/software/crfsuite/, a CRF toolkit

• http://nlp.stanford.edu/software/, a variety of NLP tools, including several CRF models