topic models for semantic memory and information extraction mark steyvers department of cognitive...

72
Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine oint work with: om Griffiths, UC Berkeley adhraic Smyth, UC Irvine ave Newman, UC Irvine haitanya Chemudugunta, UC Irvine

Post on 20-Jan-2016

225 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Topic Models for Semantic Memory and Information Extraction

Mark SteyversDepartment of Cognitive Sciences

University of California, Irvine

Joint work with:Tom Griffiths, UC BerkeleyPadhraic Smyth, UC IrvineDave Newman, UC IrvineChaitanya Chemudugunta, UC Irvine

Page 2: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Human MemoryInformation

Retrieval

Finding relevant memories

Extracting meaning

Finding relevant documents

Extracting content

Page 3: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Semantic Representations

• What are suitable representations?

• How can it be learned from experience?

Page 4: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

MONEY

• Structured representations • Encoding of propositions• Hand coded examples• Not inferred from data

CASH

BILL

LOAN

MONEY

BANK

RIVER STREAM

Two approaches to semantic representation

Semantic networkse.g. Collins & Quillian 1969

Semantic Spacese.g. Landauer & Dumais, 1997

LOANCASH

RIVER

BANK

BILL

• Relatively unstructured representations• Can be learned from data

Page 5: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Overview

I Probabilistic Topic Models

II Associative Memory

III Episodic Memory

IV Applications

V Conclusion

Page 6: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Probabilistic Topic Models

• Extract topics from large text collections

unsupervised

Bayesian statistical techniques

• Topics provide quick summary of content / gist

Page 7: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

What are topics?

• A topic represents a probability distribution over words

– Related words get high probability in same topic

• Example topics extracted from psychology grant applications:

brainfmri

imagingfunctional

mrisubjects

magneticresonance

neuroimagingstructural

schizophreniapatientsdeficits

schizophrenicpsychosissubjects

psychoticdysfunction

abnormalitiesclinical

memoryworking

memoriestasks

retrievalencodingcognitive

processingrecognition

performance

diseasead

alzheimerdiabetes

cardiovascularinsulin

vascularblood

clinicalindividuals

Probability distribution over words. Most likely words listed at the top

Page 8: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Model Input

• Matrix of counts: number of times words occur in documents

• Note:

– word order is lost: “bag of words” approach

– Some function words are deleted: “the”, “a”, “in”

documents

wor

ds

1

16

0

FOOD

6190ITALIAN

2012PASTA

3034PIZZA

Doc3 … Doc2Doc1

Page 9: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Model Assumptions

• Each topic is a probability distribution over words

• Each document is modeled as a mixture of topics

Page 10: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Generative Model for Documents

TOPICS MIXTURE

TOPIC TOPIC

WORD WORD

...

...

1. for each document, choosea mixture of topics

2. sample a topic [1..T] from the mixture

3. sample a word from the topic

(“LDA” by Blei, Ng, & Jordan, 2002; “pLSI” by Hoffman, 1999; Griffith & Steyvers, 2002 & 2004)(“LDA” by Blei, Ng, & Jordan, 2002; “pLSI” by Hoffman, 1999; Griffith & Steyvers, 2002 & 2004)

Page 11: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Document = mixture of topics

brainfmri

imagingfunctional

mrisubjects

magneticresonance

neuroimagingstructural

schizophreniapatientsdeficits

schizophrenicpsychosissubjects

psychoticdysfunction

abnormalitiesclinical

memoryworking

memoriestasks

retrievalencodingcognitive

processingrecognition

performance

diseasead

alzheimerdiabetes

cardiovascularinsulin

vascularblood

clinicalindividuals

Document

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

80%20%

Document

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

100%

Page 12: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

The Generative Model

• Conditional probability of word in document:

1

| | |T

j

P w d P w z j P z j d

word probability in topic j

probability of topic jin document

Page 13: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Inverting the generative model

• Generative model gives procedure to obtain corpus from

topics and mixing proportions

• Inverting the model involves extracting topics and mixing

proportions per document from corpus

Page 14: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Inverting the generative model

• Estimate topic assignments

– Each occurrence of a word is assigned to one topic [ 1..T ]

• Large state space

– With T=300 topics and 6,000,000 words, the size of the

discrete state space is (300)6,000,000

• Need efficient sampling techniques

– Markov Chain Monte Carlo (MCMC) with Gibbs sampling

Page 15: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

INPUT: word-document counts (word order is irrelevant)

OUTPUT:topic assignments to each word P( zi )likely words in each topic P( w | z )likely topics in each document (“gist”) P( z | d )

Summary

Page 16: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

THEORYSCIENTISTS

EXPERIMENTOBSERVATIONS

SCIENTIFICEXPERIMENTSHYPOTHESIS

EXPLAINSCIENTISTOBSERVED

EXPLANATIONBASED

OBSERVATIONIDEA

EVIDENCETHEORIESBELIEVED

DISCOVEREDOBSERVE

FACTS

SPACEEARTHMOON

PLANETROCKET

MARSORBIT

ASTRONAUTSFIRST

SPACECRAFTJUPITER

SATELLITESATELLITES

ATMOSPHERESPACESHIPSURFACE

SCIENTISTSASTRONAUT

SATURNMILES

ARTPAINT

ARTISTPAINTINGPAINTEDARTISTSMUSEUM

WORKPAINTINGS

STYLEPICTURES

WORKSOWN

SCULPTUREPAINTER

ARTSBEAUTIFUL

DESIGNSPORTRAITPAINTERS

STUDENTSTEACHERSTUDENT

TEACHERSTEACHING

CLASSCLASSROOM

SCHOOLLEARNING

PUPILSCONTENT

INSTRUCTIONTAUGHTGROUPGRADE

SHOULDGRADESCLASSES

PUPILGIVEN

BRAINNERVESENSE

SENSESARE

NERVOUSNERVES

BODYSMELLTASTETOUCH

MESSAGESIMPULSES

CORDORGANSSPINALFIBERS

SENSORYPAIN

IS

CURRENTELECTRICITY

ELECTRICCIRCUIT

ISELECTRICAL

VOLTAGEFLOW

BATTERYWIRE

WIRESSWITCH

CONNECTEDELECTRONSRESISTANCE

POWERCONDUCTORS

CIRCUITSTUBE

NEGATIVE

A selection from 500 topics [P(w|z = j)]

Page 17: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

FIELDMAGNETICMAGNET

WIRENEEDLE

CURRENTCOIL

POLESIRON

COMPASSLINESCORE

ELECTRICDIRECTION

FORCEMAGNETS

BEMAGNETISM

POLEINDUCED

SCIENCESTUDY

SCIENTISTSSCIENTIFIC

KNOWLEDGEWORK

RESEARCHCHEMISTRY

TECHNOLOGYMANY

MATHEMATICSBIOLOGY

FIELDPHYSICS

LABORATORYSTUDIESWORLD

SCIENTISTSTUDYINGSCIENCES

BALLGAMETEAM

FOOTBALLBASEBALLPLAYERS

PLAYFIELD

PLAYERBASKETBALL

COACHPLAYEDPLAYING

HITTENNISTEAMSGAMESSPORTS

BATTERRY

JOBWORKJOBS

CAREEREXPERIENCE

EMPLOYMENTOPPORTUNITIES

WORKINGTRAINING

SKILLSCAREERS

POSITIONSFIND

POSITIONFIELD

OCCUPATIONSREQUIRE

OPPORTUNITYEARNABLE

Words can have high probability in multiple topics

Page 18: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Disambiguation

• Give example for field

Page 19: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Topics versus LSA

• Latent Semantic Analysis (LSI/LSA)– Projects words into a K-dimensional hidden space– Less interpretable– Not generalizable– Not as accurate

MONEY

LOANCASH

RIVER

BANK

BILL

(high dimensional space)

Page 20: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Modeling Word Association

Page 21: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Word Association(norms from Nelson et al. 1998)

CUE: PLANET

Page 22: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

people

EARTH STARS SPACE

SUN MARS

UNIVERSE SATURN GALAXY

associate number

12345678

Word Association(norms from Nelson et al. 1998)

CUE: PLANET

(vocabulary = 5000+ words)

Page 23: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Topic model for Word Association

• Association as a problem of prediction:

– Given that a single word is observed, predict what other

words might occur in that context

• Under a single topic assumption:

2 1 2 1| | |z

P w w P w z P z w

Response Cue

z

wzPzwPwwP )|()|()|( 1212

Page 24: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

people

EARTH STARS SPACE

SUN MARS

UNIVERSE SATURN GALAXY

model

STARS STAR SUN

EARTH SPACE

SKY PLANET

UNIVERSE

associate number

12345678

Word Association(norms from Nelson et al. 1998)

CUE: PLANET

First associate “EARTH” is in the set of 8 associates (from the model)

Page 25: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

P( set contains first associate )

100

101

102

103

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1P

(se

t co

nta

ins

first

ass

oci

ate

)

Set size

LSATOPICS

LSA

TOPICS

Page 26: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Why would LSA perform worse?

• Cosine similarity measure imposes unnecessary

constraints on representations, e.g.

– Symmetry

– Triangle inequality

Page 27: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Violation of triangle inequality

Can find associations that violate this:

A

B C

AC AB + BC

SOCCER FIELD MAGNETIC

Page 28: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

No Triangle Inequality with Topics

SOCCER

MAGNETICFIELD

TOPIC 1 TOPIC 2

Topic structure easily explains violations of triangle inequality

Page 29: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Modeling Episodic Memory

Page 30: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Semantic Isolation Effect

Study this list:PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH

HAMMER,PEAS,

CARROTS,...

Page 31: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Semantic Isolation Effect

• Verbal explanations:

– Attention, surprise, distinctiveness

• Our approach: memory system is trading off two encoding

resources

– Storing specific words (e.g. “hammer”)

– Storing general theme of list (e.g. “vegetables”)

Page 32: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Computational Problem

• How to tradeoff specificity and generality?

– Remembering detail and gist

Special word topic model

Page 33: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Special word topic (SW) model

• Each word can be generated via one route

– Topics

– Special words distribution (unique to a document)

• Conditional prob. of a word under a document:

| ( 0 | ) ( | )

( 1 | ) ( | )

topics

special

P w d P x d P w d

P x d P w d

Page 34: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Study wordsPEAS

CARROTS BEANS

SPINACH LETTUCE HAMMER

TOMATOES CORN

CABBAGE SQUASH

Retrieval Probability

0.00 0.01 0.02

HAMMER

BEANS

CORN

PEAS

SPINACH

CABBAGE

LETTUCE

CARROTS

SQUASH

TOMATOES

Special Word Probability

0.00 0.05 0.10 0.15

HAMMER

PEAS

SPINACH

CABBAGE

CARROTS

LETTUCE

SQUASH

BEANS

TOMATOES

CORN

Switch Probability

0.0 0.5 1.0

Verbatim

Topic

Topic Probability

0.0 0.1 0.2

VEGETABLES

FURNITURE

TOOLS

ENCODING RETRIEVAL

Special

Page 35: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Hunt & Lamb (2001 exp. 1)

OUTLIER LIST

PEAS

CARROTS

BEANS

SPINACH

LETTUCE

HAMMER

TOMATOES

CORN

CABBAGE

SQUASH

CONTROL LIST

SAW

SCREW

CHISEL

DRILL

SANDPAPER

HAMMER

NAILS

BENCH

RULER

ANVIL

outlier list pure listP

rob.

of

Rec

all

0.0

0.2

0.4

0.6

0.8

1.0TargetBackground

DATA

Page 36: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Model Predictions

outlier list pure list

Pro

b. o

f R

ecal

l

0.0

0.2

0.4

0.6

0.8

1.0TargetBackground

DATA PREDICTED

outlier list pure list

Nee

d P

roba

bilit

y0.00

0.01

0.02

0.03

0.04

0.05TargetBackground

Page 37: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

False memory effects

MADFEARHATE

SMOOTHNAVYHEAT

SALADTUNE

COURTSCANDYPALACEPLUSHTOOTHBLIND

WINTER

Number of Associates

3 6 9

MADFEARHATERAGE

TEMPERFURY

SALADTUNE

COURTSCANDYPALACEPLUSHTOOTHBLIND

WINTER

MADFEARHATERAGE

TEMPERFURY

WRATHHAPPYFIGHT

CANDYPALACEPLUSHTOOTHBLIND

WINTER

(lure = ANGER)

Number of Associates Studied

3 6 9 12 15

Pro

b. o

f R

ecal

l

0.0

0.2

0.4

0.6

0.8

1.0Studied itemsNonstudied (lure)

DATA

Robinson & Roediger (1997)

Number of Associates Studied

3 6 9 12 15

Pro

b. o

f R

etri

eval

0.00

0.01

0.02

0.03Studied associatesNonstudied (lure)

PREDICTED

Page 38: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Relation to Dual Process Models

• Gist/Verbatim distinction (e.g. Reyna, Brainerd, 1995)

– Maps onto topics and special words

• Our approach specifies both encoding and retrieval

representations and processes

– Routes are not independent

– Model explains performance for actual word lists

Page 39: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Applications I

Page 40: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Topics provide quick summary of content

• What is in this corpus?

• What is in this document?

• What are the topical trends over time?

• Who writes on this topic?

Page 41: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

330,000 articles

2000-2002

Analyzing the New York Times

Page 42: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …

Extracted Named Entities

• Used standard algorithms to extract named entities:

- People- Places- Organizations

Page 43: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Standard Topic Model with Entities

team 0.028 tour 0.039 holiday 0.071 award 0.026play 0.015 rider 0.029 gift 0.050 film 0.020game 0.013 riding 0.017 toy 0.023 actor 0.020season 0.012 bike 0.016 season 0.019 nomination 0.019final 0.011 team 0.016 doll 0.014 movie 0.015games 0.011 stage 0.014 tree 0.011 actress 0.011point 0.011 race 0.013 present 0.008 won 0.011series 0.011 won 0.012 giving 0.008 director 0.010player 0.010 bicycle 0.010 special 0.007 nominated 0.010coach 0.009 road 0.009 shopping 0.007 supporting 0.010playoff 0.009 hour 0.009 family 0.007 winner 0.008championship 0.007 scooter 0.008 celebration 0.007 picture 0.008playing 0.006 mountain 0.008 card 0.007 performance 0.007win 0.006 place 0.008 tradition 0.006 nominees 0.007LAKERS 0.062 LANCE-ARMSTRONG 0.021 CHRISTMAS 0.058 OSCAR 0.035SHAQUILLE-O-NEAL0.028 FRANCE 0.011 THANKSGIVING 0.018 ACADEMY 0.020KOBE-BRYANT 0.028 JAN-ULLRICH 0.003 SANTA-CLAUS 0.009 HOLLYWOOD 0.009PHIL-JACKSON 0.019 LANCE 0.003 BARBIE 0.004 DENZEL-WASHINGTON 0.006NBA 0.013 U-S-POSTAL-SERVICE 0.002 HANUKKAH 0.003 JULIA-ROBERT 0.005SACRAMENTO 0.007 MARCO-PANTANI 0.002 MATTEL 0.003 RUSSELL-CROWE 0.005RICK-FOX 0.007 PARIS 0.002 GRINCH 0.003 TOM-HANK 0.005PORTLAND 0.006 ALPS 0.002 HALLMARK 0.002 STEVEN-SODERBERGH 0.004ROBERT-HORRY 0.006 PYRENEES 0.001 EASTER 0.002 ERIN-BROCKOVICH 0.003DEREK-FISHER 0.006 SPAIN 0.001 HASBRO 0.002 KEVIN-SPACEY 0.003

Basketball Holidays OscarsTour de France

Page 44: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

computer 0.069 play 0.030technology 0.026 show 0.029system 0.015 stage 0.022digital 0.014 theater 0.022chip 0.013 director 0.017software 0.013 production 0.017machine 0.011 performance 0.016devices 0.010 dance 0.014machines 0.010 audience 0.014video 0.009 festival 0.013Companies 1.000 Theater 0.960

Music 0.040

IBM 0.074 BROADWAY 0.119 BACH 0.035APPLE 0.061 NEW_YORK 0.044 BEETHOVEN 0.026INTEL 0.059 SHAKESPEARE 0.029 LOUIS_ARMSTRONG 0.019MICROSOFT 0.053 THEATER 0.022 MOZART 0.019COMPAQ 0.041 LONDON 0.019 CARNEGIE_HALL 0.017SONY 0.029 GUINNESS 0.018 LATIN 0.017DELL 0.019 TONY 0.016HP 0.018 LINCOLN_CTR 0.015

ArtsComputers

MusicCompanies Theatre

Page 45: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

50

100

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

5

10

15

Topic Trends

Tour-de-France

Anthrax

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

10

20

30Quarterly Earnings

Proportion of words assigned to topic for that

time slice

Page 46: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Example of Extracted Entity-Topic Network

Muslim_Militance

Mid_East_Conflict

Palestinian_Territories

Pakistan_Indian_War

FBI_Investigation

Detainees

Mid_East_Peace

US_Military

Religion

Terrorist_Attacks

Afghanistan_War

AL_QAEDA

HAMID_KARZAIMOHAMMED

MOHAMMED_ATTA

NORTHERN_ALLIANCE

BIN_LADEN

TALIBAN

ZAWAHIRI

YASSER_ARAFAT

EHUD_BARAK

ARIEL_SHARON

HAMAS

AL_HAZMI

KING_HUSSEIN

Page 47: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

Test article with entities

removed

Page 48: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

fitch goldman-sachs lehman-brother moody morgan-stanley new-york-

stock-exchange standard-and-poor tyco tyco-international wall-street

worldco

Test article with entities

removed

Actual missing entities

Page 49: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

fitch goldman-sachs lehman-brother moody morgan-stanley new-york-

stock-exchange standard-and-poor tyco tyco-international wall-street

worldco

wall-street new-york nasdaq securities-exchange-commission sec merrill-

lynch new-york-stock-exchange goldman-sachs standard-and-poor

Test article with entities

removed

Actual missing entities

Predicted entities given observed words (matches

in blue)

Page 50: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Applications II

Page 51: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Faculty Browser

• System collects PDF files from websites

– UCI

– UCSD

• Applies topic model on text extracted from PDFs

• Display faculty research with topics

• Demo:

http://yarra.calit2.uci.edu/calit2/

Page 52: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work
Page 53: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

one topic

most prolific researchers for this topic

Page 54: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

topics this researcher works on

other researchers with similar

topical interests

one researcher

Page 55: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Conclusions

Page 56: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Human MemoryInformation

Retrieval

Finding relevant memories

Finding relevant documents

Page 57: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Software

Public-domain MATLAB toolbox for topic modeling on the Web:

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Page 58: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Hidden Markov Topic Model

Page 59: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Hidden Markov Topics Model

• Syntactic dependencies short range dependencies

• Semantic dependencies long-range

z z z z

w w w w

s s s s

Semantic state: generate words from topic model

Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)

Page 60: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

THE 0.6 A 0.3MANY 0.1

OF 0.6 FOR 0.3BETWEEN 0.1

0.9

0.1

0.2

0.8

0.7

0.3

Transition between semantic state and syntactic states

Page 61: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

THE ………………………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 62: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

THE LOVE……………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 63: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

THE LOVE OF………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 64: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

THE LOVE OF RESEARCH ……

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 65: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

FOODFOODSBODY

NUTRIENTSDIETFAT

SUGARENERGY

MILKEATINGFRUITS

VEGETABLESWEIGHT

FATSNEEDS

CARBOHYDRATESVITAMINSCALORIESPROTEIN

MINERALS

MAPNORTHEARTHSOUTHPOLEMAPS

EQUATORWESTLINESEAST

AUSTRALIAGLOBEPOLES

HEMISPHERELATITUDE

PLACESLAND

WORLDCOMPASS

CONTINENTS

DOCTORPATIENTHEALTH

HOSPITALMEDICAL

CAREPATIENTS

NURSEDOCTORSMEDICINENURSING

TREATMENTNURSES

PHYSICIANHOSPITALS

DRSICK

ASSISTANTEMERGENCY

PRACTICE

BOOKBOOKS

READINGINFORMATION

LIBRARYREPORT

PAGETITLE

SUBJECTPAGESGUIDE

WORDSMATERIALARTICLE

ARTICLESWORDFACTS

AUTHORREFERENCE

NOTE

GOLDIRON

SILVERCOPPERMETAL

METALSSTEELCLAYLEADADAM

OREALUMINUM

MINERALMINE

STONEMINERALS

POTMININGMINERS

TIN

BEHAVIORSELF

INDIVIDUALPERSONALITY

RESPONSESOCIAL

EMOTIONALLEARNINGFEELINGS

PSYCHOLOGISTSINDIVIDUALS

PSYCHOLOGICALEXPERIENCES

ENVIRONMENTHUMAN

RESPONSESBEHAVIORSATTITUDES

PSYCHOLOGYPERSON

CELLSCELL

ORGANISMSALGAE

BACTERIAMICROSCOPEMEMBRANEORGANISM

FOODLIVINGFUNGIMOLD

MATERIALSNUCLEUSCELLED

STRUCTURESMATERIAL

STRUCTUREGREENMOLDS

Semantic topics

PLANTSPLANT

LEAVESSEEDSSOIL

ROOTSFLOWERS

WATERFOOD

GREENSEED

STEMSFLOWER

STEMLEAF

ANIMALSROOT

POLLENGROWING

GROW

Page 66: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

GOODSMALL

NEWIMPORTANT

GREATLITTLELARGE

*BIG

LONGHIGH

DIFFERENTSPECIAL

OLDSTRONGYOUNG

COMMONWHITESINGLE

CERTAIN

THEHIS

THEIRYOURHERITSMYOURTHIS

THESEA

ANTHATNEW

THOSEEACH

MRANYMRSALL

MORESUCHLESS

MUCHKNOWN

JUSTBETTERRATHER

GREATERHIGHERLARGERLONGERFASTER

EXACTLYSMALLER

SOMETHINGBIGGERFEWERLOWER

ALMOST

ONAT

INTOFROMWITH

THROUGHOVER

AROUNDAGAINSTACROSS

UPONTOWARDUNDERALONGNEAR

BEHINDOFF

ABOVEDOWN

BEFORE

SAIDASKED

THOUGHTTOLDSAYS

MEANSCALLEDCRIED

SHOWSANSWERED

TELLSREPLIED

SHOUTEDEXPLAINEDLAUGHED

MEANTWROTE

SHOWEDBELIEVED

WHISPERED

ONESOMEMANYTWOEACHALL

MOSTANY

THREETHIS

EVERYSEVERAL

FOURFIVEBOTHTENSIX

MUCHTWENTY

EIGHT

HEYOU

THEYI

SHEWEIT

PEOPLEEVERYONE

OTHERSSCIENTISTSSOMEONE

WHONOBODY

ONESOMETHING

ANYONEEVERYBODY

SOMETHEN

Syntactic classes

BEMAKE

GETHAVE

GOTAKE

DOFINDUSESEE

HELPKEEPGIVELOOKCOMEWORKMOVELIVEEAT

BECOME

Page 67: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

MODELALGORITHM

SYSTEMCASE

PROBLEMNETWORKMETHOD

APPROACHPAPER

PROCESS

ISWASHAS

BECOMESDENOTES

BEINGREMAINS

REPRESENTSEXISTSSEEMS

SEESHOWNOTE

CONSIDERASSUMEPRESENT

NEEDPROPOSEDESCRIBESUGGEST

USEDTRAINED

OBTAINEDDESCRIBED

GIVENFOUND

PRESENTEDDEFINED

GENERATEDSHOWN

INWITHFORON

FROMAT

USINGINTOOVER

WITHIN

HOWEVERALSOTHENTHUS

THEREFOREFIRSTHERENOW

HENCEFINALLY

#*IXTN-CFP

EXPERTSEXPERTGATING

HMEARCHITECTURE

MIXTURELEARNINGMIXTURESFUNCTION

GATE

DATAGAUSSIANMIXTURE

LIKELIHOODPOSTERIOR

PRIORDISTRIBUTION

EMBAYESIAN

PARAMETERS

STATEPOLICYVALUE

FUNCTIONACTION

REINFORCEMENTLEARNINGCLASSESOPTIMAL

*

MEMBRANESYNAPTIC

CELL*

CURRENTDENDRITICPOTENTIAL

NEURONCONDUCTANCE

CHANNELS

IMAGEIMAGESOBJECT

OBJECTSFEATURE

RECOGNITIONVIEWS

#PIXEL

VISUAL

KERNELSUPPORTVECTOR

SVMKERNELS

#SPACE

FUNCTIONMACHINES

SET

NETWORKNEURAL

NETWORKSOUPUTINPUT

TRAININGINPUTS

WEIGHTS#

OUTPUTS

NIPS Semantics

NIPS Syntax

Page 68: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Random sentence generation

LANGUAGE:[S] RESEARCHERS GIVE THE SPEECH[S] THE SOUND FEEL NO LISTENERS[S] WHICH WAS TO BE MEANING[S] HER VOCABULARIES STOPPED WORDS[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Page 69: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Nested Chinese Restaurant Process

Page 70: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Topic Hierarchies

• In regular topic model, no relations

between topicstopic 1

topic 2 topic 3

topic 4 topic 5 topic 6 topic 7

• Nested Chinese Restaurant Process

– Blei, Griffiths, Jordan, Tenenbaum (2004)

– Learn hierarchical structure, as well as

topics within structure

Page 71: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Example: Psych Review Abstracts

RESPONSESTIMULUS

REINFORCEMENTRECOGNITION

STIMULIRECALLCHOICE

CONDITIONING

SPEECHREADINGWORDS

MOVEMENTMOTORVISUALWORD

SEMANTIC

ACTIONSOCIALSELF

EXPERIENCEEMOTION

GOALSEMOTIONALTHINKING

GROUPIQ

INTELLIGENCESOCIAL

RATIONALINDIVIDUAL

GROUPSMEMBERS

SEXEMOTIONS

GENDEREMOTIONSTRESSWOMENHEALTH

HANDEDNESS

REASONINGATTITUDE

CONSISTENCYSITUATIONALINFERENCEJUDGMENT

PROBABILITIESSTATISTICAL

IMAGECOLOR

MONOCULARLIGHTNESS

GIBSONSUBMOVEMENTORIENTATIONHOLOGRAPHIC

CONDITIONINSTRESS

EMOTIONALBEHAVIORAL

FEARSTIMULATIONTOLERANCERESPONSES

AMODEL

MEMORYFOR

MODELSTASK

INFORMATIONRESULTSACCOUNT

SELFSOCIAL

PSYCHOLOGYRESEARCH

RISKSTRATEGIES

INTERPERSONALPERSONALITY

SAMPLING

MOTIONVISUAL

SURFACEBINOCULAR

RIVALRYCONTOUR

DIRECTIONCONTOURSSURFACES

DRUGFOODBRAIN

AROUSALACTIVATIONAFFECTIVEHUNGER

EXTINCTIONPAIN

THEOF

ANDTOINAIS

Page 72: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work

Generative Process

RESPONSESTIMULUS

REINFORCEMENTRECOGNITION

STIMULIRECALLCHOICE

CONDITIONING

SPEECHREADINGWORDS

MOVEMENTMOTORVISUALWORD

SEMANTIC

ACTIONSOCIALSELF

EXPERIENCEEMOTION

GOALSEMOTIONALTHINKING

GROUPIQ

INTELLIGENCESOCIAL

RATIONALINDIVIDUAL

GROUPSMEMBERS

SEXEMOTIONS

GENDEREMOTIONSTRESSWOMENHEALTH

HANDEDNESS

REASONINGATTITUDE

CONSISTENCYSITUATIONALINFERENCEJUDGMENT

PROBABILITIESSTATISTICAL

IMAGECOLOR

MONOCULARLIGHTNESS

GIBSONSUBMOVEMENTORIENTATIONHOLOGRAPHIC

CONDITIONINSTRESS

EMOTIONALBEHAVIORAL

FEARSTIMULATIONTOLERANCERESPONSES

AMODEL

MEMORYFOR

MODELSTASK

INFORMATIONRESULTSACCOUNT

SELFSOCIAL

PSYCHOLOGYRESEARCH

RISKSTRATEGIES

INTERPERSONALPERSONALITY

SAMPLING

MOTIONVISUAL

SURFACEBINOCULAR

RIVALRYCONTOUR

DIRECTIONCONTOURSSURFACES

DRUGFOODBRAIN

AROUSALACTIVATIONAFFECTIVEHUNGER

EXTINCTIONPAIN

THEOF

ANDTOINAIS