Post on 21-Dec-2015
Latent Semantic Analysis
Probabilistic Topic Models
& Associative Memory
The Psychological Problem
How do we learn semantic structure? Covariation between words and the contexts they
appear in (e.g. LSA)
How do we represent semantic structure? Semantic Spaces (e.g. LSA) Probabilistic Topics
Latent Semantic Analysis (Landauer & Dumais, 1997)
word-document counts → SVD → high-dimensional space
[Figure: RIVER, STREAM, MONEY, BANK plotted as points in the resulting semantic space]
Each word is a single point in semantic space. Similarity is measured by the cosine of the angle between word vectors.
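The LSA pipeline sketched above (counts → SVD → word vectors → cosine) can be illustrated in a few lines. This is a minimal sketch, not the TASA-scale model; the toy word-document counts are made up for illustration:

```python
import numpy as np

# Toy word-document count matrix: rows = words, columns = documents.
# The counts are made up for illustration.
words = ["money", "bank", "loan", "river", "stream"]
X = np.array([
    [4, 3, 0],   # money
    [3, 4, 2],   # bank
    [2, 3, 0],   # loan
    [0, 1, 4],   # river
    [0, 0, 3],   # stream
], dtype=float)

# SVD projects each word into a low-dimensional semantic space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                         # number of retained dimensions
word_vecs = U[:, :k] * s[:k]  # each word is a single point in k-dim space

def cosine(a, b):
    """Similarity = cosine of the angle between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i = dict(zip(words, range(len(words))))
print(cosine(word_vecs[i["money"]], word_vecs[i["loan"]]))   # high
print(cosine(word_vecs[i["money"]], word_vecs[i["river"]]))  # low
```

Words that co-occur across documents ("money", "loan") end up pointing in similar directions; words that do not ("money", "river") do not.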
Critical Assumptions of Semantic Spaces (e.g. LSA)
Psychological distance should obey three axioms:
- Minimality: d(a,b) ≥ d(a,a) = 0
- Symmetry: d(a,b) = d(b,a)
- Triangle inequality: d(a,b) + d(b,c) ≥ d(a,c)
For conceptual relations, violations of distance axioms often found
Similarities can often be asymmetric
“North-Korea” is more similar to “China” than vice versa
“Pomegranate” is more similar to “Apple” than vice versa
Violations of the triangle inequality: if A is close to B and B is close to C, Euclidean distance forces A to be close to C (AC ≤ AB + BC)
Triangle Inequality in Semantic Spaces might not always hold
Example: w1 = SOCCER, w2 = PLAY, w3 = THEATER. PLAY is similar to both SOCCER and THEATER, yet SOCCER and THEATER are unrelated. Cosine similarity still imposes the bound

cos(w1,w3) ≥ cos(w1,w2)cos(w2,w3) − sin(w1,w2)sin(w2,w3)

so if w1 is similar to w2 and w2 is similar to w3, then w1 and w3 cannot be too dissimilar.
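The cosine bound follows from the fact that the angle between vectors is itself a metric (so θ13 ≤ θ12 + θ23, and cosine is decreasing on [0, π]). A quick numerical sanity check with random vectors (dimensions and sample count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def angle(a, b):
    """Angle between two vectors, in radians."""
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

# Angles obey the triangle inequality, so for any three vectors:
#   cos(w1,w3) >= cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3)
for _ in range(1000):
    w1, w2, w3 = rng.normal(size=(3, 5))
    t12, t23, t13 = angle(w1, w2), angle(w2, w3), angle(w1, w3)
    lhs = np.cos(t13)
    rhs = np.cos(t12) * np.cos(t23) - np.sin(t12) * np.sin(t23)
    assert lhs >= rhs - 1e-9
```

No random configuration violates the bound, which is exactly why a spatial representation cannot reproduce human similarity data where the triangle inequality fails.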
Nearest neighbor problem (Tversky & Hutchinson, 1986)
• In similarity data, “Fruit” is the nearest neighbor of 18 out of 20 fruit words
• In 2D solution, “Fruit” can be nearest neighbor of at most 5 items
• High-dimensional solutions might solve this but these are less appealing
Probabilistic Topic Models
A probabilistic version of LSA: no spatial constraints.
Originated in the domain of statistics & machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
Extracts topics from large collections of text
Topics are interpretable, unlike the arbitrary dimensions of LSA
DATA: corpus of text → word counts for each document
TOPIC MODEL: find parameters that “reconstruct” the data
The model is generative
Probabilistic Topic Models
Each document is a probability distribution over topics (distribution over topics = gist)
Each topic is a probability distribution over words
Document generation as a probabilistic process
[Figure: generative process — a per-document topics mixture generates a topic for each word slot, and each topic generates a word]
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from that topic
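The three steps above can be sketched directly. A minimal sketch; the two toy topics and the Dirichlet parameter are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy topics: each topic is a probability distribution over words.
vocab = ["money", "loan", "bank", "river", "stream"]
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # TOPIC 1: finance sense of "bank"
    [0.0, 0.0, 0.3, 0.4, 0.3],   # TOPIC 2: river sense of "bank"
])

def generate_document(n_words, alpha=1.0):
    # 1. For each document, choose a mixture of topics.
    mixture = rng.dirichlet(alpha * np.ones(len(topics)))
    doc = []
    for _ in range(n_words):
        # 2. For every word slot, sample a topic from the mixture.
        z = rng.choice(len(topics), p=mixture)
        # 3. Sample a word from that topic.
        w = rng.choice(len(vocab), p=topics[z])
        doc.append((vocab[w], z + 1))
    return mixture, doc

mixture, doc = generate_document(10)
print(mixture)                                      # the document's gist
print(" ".join(f"{w}{z}" for w, z in doc))          # words with assignments
```

The printed document has the same form as the bank1/river2 examples on the next slides: each token carries the topic that generated it.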
[Figure: the two mixture components shown as bags of words — TOPIC 1: money, loan, bank; TOPIC 2: river, stream, bank]
Example

DOCUMENT 1 (mixture weights: topic 1 = .8, topic 2 = .2): money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

DOCUMENT 2 (mixture weights: topic 1 = .3, topic 2 = .7): river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

(superscripts give topic assignments; the topics are the mixture components, the per-document topic proportions are the mixture weights)

Bayesian approach: use priors — mixture weights ~ Dirichlet(α), mixture components ~ Dirichlet(β)
Inverting (“fitting”) the model

Given only the observed words, the topic assignments are unknown:

DOCUMENT 1: money? bank? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? money? stream? bank? money? bank? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? bank? money? stream?

DOCUMENT 2: river? stream? bank? stream? bank? money? loan? river? stream? loan? bank? river? bank? bank? stream? river? loan? bank? stream? bank? money? loan? river? stream? bank? stream? bank? money? river? stream? loan? bank? river? bank? money? bank? stream? river? bank? stream? bank? money?

Infer the mixture components (TOPIC 1, TOPIC 2 = ?) and the per-document mixture weights (= ?).
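Fitting is typically done with collapsed Gibbs sampling (Griffiths & Steyvers, 2004): repeatedly resample each token's topic assignment given all the others. A minimal sketch on a toy two-document corpus; the corpus, hyperparameters, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "loan", "bank", "river", "stream"]
docs = [  # toy corpus as lists of word indices
    [0, 1, 2, 0, 2, 1, 0, 2],   # money/loan/bank document
    [3, 4, 2, 3, 2, 4, 3, 2],   # river/stream/bank document
]
T, W = 2, len(vocab)      # number of topics, vocabulary size
alpha, beta = 1.0, 0.1    # Dirichlet hyperparameters

# Random initial topic assignment for every word token.
z = [[int(rng.integers(T)) for _ in d] for d in docs]
ndt = np.zeros((len(docs), T))   # document-topic counts
ntw = np.zeros((T, W))           # topic-word counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndt[d, z[d][i]] += 1
        ntw[z[d][i], w] += 1

for _ in range(200):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]          # remove this token from the counts
            ndt[d, t] -= 1; ntw[t, w] -= 1
            # P(z=t | rest) ∝ (ndt+alpha) * (ntw+beta)/(row sum + W*beta)
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (ntw.sum(axis=1) + W * beta)
            t = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t          # add it back with its new topic
            ndt[d, t] += 1; ntw[t, w] += 1

# Point estimates of mixture components and mixture weights.
phi = (ntw + beta) / (ntw.sum(axis=1, keepdims=True) + W * beta)
theta = (ndt + alpha) / (ndt.sum(axis=1, keepdims=True) + T * alpha)
print(phi.round(2))    # topics: distributions over words
print(theta.round(2))  # per-document topic mixtures
```

On this toy corpus the sampler separates a money/loan/bank topic from a river/stream/bank topic, with "bank" shared between them.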
Application to corpus data
TASA corpus: a representative sample of text from first grade to college.
26,000+ word types (stop words removed), 37,000+ documents, 6,000,000+ word tokens
Example: topics from an educational corpus (TASA)
37K docs, 26K words, 1700 topics, e.g.:
- PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
- PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
- TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
- JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
- HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
- STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Polysemy
(the same TASA topics, with PLAY appearing with high probability in both the theater topic and the sports topic — one word, different senses under different topics)
Three documents with the word “play” (numbers & colors indicate topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...

He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077 ...

Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 Jim296. Don180 comes040 into the house038. Don180 and Jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166 ...
No Problem of Triangle Inequality
[Figure: SOCCER and MAGNETIC FIELD associated with different topics (TOPIC 1, TOPIC 2)]
Topic structure easily explains violations of the triangle inequality
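A toy check of this claim, using the conditional probability P(w2 | w1) = Σ_z P(w2 | z) P(z | w1) as the similarity. The topic matrix values and uniform prior are made up for illustration:

```python
import numpy as np

vocab = ["play", "soccer", "theater"]
# Two made-up topics: PLAY is probable under both senses,
# SOCCER and THEATER each under only one.
phi = np.array([
    [0.5, 0.5, 0.0],   # topic 1: sports
    [0.5, 0.0, 0.5],   # topic 2: drama
])
pz = np.array([0.5, 0.5])   # assumed uniform P(z)

def sim(w1, w2):
    """P(w2 | w1) = sum_z P(w2 | z) P(z | w1)."""
    post = phi[:, vocab.index(w1)] * pz   # ∝ P(w1 | z) P(z)
    post = post / post.sum()              # P(z | w1)
    return float(post @ phi[:, vocab.index(w2)])

# PLAY is strongly associated with both SOCCER and THEATER...
print(sim("soccer", "play"), sim("theater", "play"))
# ...but SOCCER and THEATER share no topic, so their association is zero:
print(sim("soccer", "theater"))
```

Unlike a spatial representation, nothing forces SOCCER and THEATER to be similar just because both are similar to PLAY.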
Applications
Enron email data: 500,000 emails, 5,000 authors, 1999–2002
Enron topics (top words):
- TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
- GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
- ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
- FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
- POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
- STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU

[Figure: topic prevalence over time, 2000–2003, for two authors (PERSON1, PERSON2); timeline marker at May 22, 2000 — start of the California energy crisis]
Applying Model to Psychological Data
Network of Word Associations
[Figure: association network among BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER (association norms from Nelson et al., 1998)]

Explaining structure with topics
[Figure: the same network with words grouped by topic — topic 1 linking BASEBALL, BAT, BALL, GAME, PLAY; topic 2 linking PLAY, STAGE, THEATER]
Modeling Word Association
Word association modeled as prediction
Given that a single word is observed, what other words might occur next?
Under a single-topic assumption:

P(w_{n+1} | w) = Σ_z P(w_{n+1} | z) P(z | w)

where w is the observed cue.
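Computing the prediction P(w_{n+1} | w) = Σ_z P(w_{n+1} | z) P(z | w) only requires the fitted topic-word distributions. A minimal sketch; the two toy topics and uniform topic prior are assumptions for illustration:

```python
import numpy as np

vocab = ["money", "loan", "bank", "river", "stream"]
# P(w | z) for two toy topics (each row sums to 1).
phi = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # topic 1: finance
    [0.0, 0.0, 0.3, 0.4, 0.3],   # topic 2: rivers
])
pz = np.array([0.5, 0.5])        # P(z), assumed uniform

def predict(cue):
    """P(w' | cue) = sum_z P(w' | z) P(z | cue)."""
    w = vocab.index(cue)
    post = phi[:, w] * pz         # ∝ P(cue | z) P(z)
    post = post / post.sum()      # P(z | cue)
    return post @ phi             # predictive distribution over all words

p = predict("money")              # "money" implicates only the finance topic
print(dict(zip(vocab, p.round(3))))
```

The cue "money" predicts "loan" and "bank" strongly and "river" not at all, because the inferred topic posterior concentrates on the finance topic.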
Observed associates for the cue “play”, with model predictions:

HUMANS           TOPICS (T=500)    LSA (cosine)          (count)
FUN      .141    BALL      .041    KICKBALL    .558    GAME      42
BALL     .134    GAME      .039    VOLLEYBALL  .519    BALL      33
GAME     .074    CHILDREN  .019    GAMES       .492    CHILDREN  30
WORK     .067    ROLE      .014    COSTUMES    .478    SCHOOL    27
GROUND   .060    GAMES     .014    DRAMA       .469    ROLE      25
MATE     .027    MUSIC     .009    ROLE        .465    WANT      24
CHILD    .020    BASEBALL  .009    PLAYWRIGHT  .464    GAMES     23
ENJOY    .020    HIT       .008    FUN         .454    MOTHER    23
WIN      .020    FUN       .008    ACTOR       .448    THINGS    21
ACTOR    .013    TEAM      .008    REHEARSALS  .445    MUSIC     21
FIGHT    .013    IMPORTANT .006    GAME        .445    HELP      20
HORSE    .013    BAT       .006    ACTORS      .439    FUN       19
KID      .013    RUN       .006    CHECKERS    .431    READ      18
MUSIC    .013    STAGE     .005    MOLIERE     .429    DON       18
Median rank of the first associate
[Figure: bar chart of median rank (y-axis 5–40) for best LSA cosine, best LSA inner product, and topic models with 300, 500, 700, 900, 1100, 1300, 1500, and 1700 topics]
Recall: example study list
STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
FALSE RECALL: “Sleep” 61%
Recall as a reconstructive process
Reconstruct study list based on the stored “gist”
The gist can be represented by a distribution over topics
Under a single-topic assumption:

P(w_{n+1} | w) = Σ_z P(w_{n+1} | z) P(z | w)

where w is the list of studied words.
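The same machinery models recall: infer the gist P(z | w) from the whole study list, then score each candidate word by P(w_{n+1} | w) = Σ_z P(w_{n+1} | z) P(z | w). A minimal sketch with made-up toy topics showing how an unstudied word can dominate the predictions (false recall):

```python
import numpy as np

vocab = ["bed", "rest", "tired", "dream", "sleep", "river"]
# P(w | z) for two toy topics (each row sums to 1); "sleep" is never
# studied but has high probability under the sleep topic.
phi = np.array([
    [0.2, 0.15, 0.15, 0.1, 0.4, 0.0],   # topic 1: sleep
    [0.1, 0.1, 0.0, 0.0, 0.0, 0.8],     # topic 2: rivers
])
pz = np.array([0.5, 0.5])               # assumed uniform prior P(z)

def gist(study_list):
    """P(z | w) ∝ P(z) * prod_i P(w_i | z), single-topic assumption."""
    post = pz.copy()
    for w in study_list:
        post = post * phi[:, vocab.index(w)]
    return post / post.sum()

def recall_probs(study_list):
    """P(w_{n+1} | w) = sum_z P(w_{n+1} | z) P(z | w)."""
    return gist(study_list) @ phi

p = recall_probs(["bed", "rest", "tired", "dream"])
# The unstudied word "sleep" gets the highest predicted probability.
print(dict(zip(vocab, p.round(3))))
```

Because the study list pins the gist on the sleep topic, the topic's most probable word ("sleep") is predicted even though it was never presented — the model's account of the 61% false recall.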
Predictions for the “Sleep” list
[Figure: retrieval probability P(w_{n+1} | w), x-axis 0–0.2, for the studied words (BED, REST, TIRED, AWAKE, WAKE, NAP, DREAM, YAWN, DROWSY, BLANKET, SNORE, SLUMBER, PEACE, DOZE) and the top 8 extra-list words (SLEEP, NIGHT, ASLEEP, MORNING, HOURS, SLEEPY, EYES, AWAKENED); SLEEP receives the highest extra-list probability]