

Page 1: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction

Information Extraction, Social Network Analysis, Structured Topic Models & Influence Mapping

Andrew McCallum
mccallum@cs.umass.edu

Information Extraction & Synthesis Laboratory
Department of Computer Science
University of Massachusetts

Joint work with Aron Culotta, Charles Sutton, Wei Li, Chris Pal, Pallika Kanani, Gideon Mann, Natasha Mohanty, Xuerui Wang.

Page 2:

Goals

• Quickly understand and analyze the contents of a large volume of text + other data
  – browse topics
  – navigate connections
  – discover & see patterns

• Assess data source to determine relevance
• Browse data newly acquired from the field
• Navigate your own data
• Discover structure and patterns
• Assess impact and influence

Slide annotations: Collaborative opportunity assessment; Let analysts drive discovery process; Inducing organizational structure (caret insertion: "unfamiliar, inter-agency"); Rapid ingest; Map flow of ideas

Page 3:

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]

Generative Process:

For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w

Example:
  θ: 70% Iraq war, 30% US election
  z: Iraq war
  w: “bombing”
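The generative process on this slide can be sketched in a few lines of NumPy. This is an illustration only: the function name, the four-word vocabulary, the topic-word probabilities and the Dirichlet parameter are all made-up assumptions, not values from the talk.

```python
import numpy as np

def generate_lda_corpus(n_docs, doc_len, topic_word, alpha, seed=0):
    """Sample documents from the LDA generative process.

    topic_word: (T, V) array; row k is topic k's distribution over words.
    alpha: length-T Dirichlet parameter for the per-document topic mixture.
    """
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = topic_word.shape
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)  # per-document distribution over topics
        topics, words = [], []
        for _ in range(doc_len):
            z = int(rng.choice(n_topics, p=theta))            # sample a topic
            w = int(rng.choice(vocab_size, p=topic_word[z]))  # sample a word from it
            topics.append(z)
            words.append(w)
        docs.append((theta, topics, words))
    return docs

# Hypothetical toy setup echoing the slide's two example topics.
vocab = ["bombing", "troops", "ballot", "campaign"]
topic_word = np.array([[0.6, 0.4, 0.0, 0.0],   # "Iraq war"
                       [0.0, 0.0, 0.5, 0.5]])  # "US election"
docs = generate_lda_corpus(3, 10, topic_word, alpha=[0.5, 0.5])
```

Inference runs this process in reverse: given only the words, it recovers plausible values of θ and z.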

Page 4:

Example topics induced from a large collection of text [Tenenbaum et al]

STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL

MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE

WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN

FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE


Page 6:

Social Network in an Email Dataset

Page 7:

Author-Recipient-Topic SNA model [McCallum, Corrada, Wang, 2005]

Topic choice depends on:
- author
- recipient
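A rough sketch of what "topic choice depends on author and recipient" can look like as a generative step. The slide does not spell out the sampling procedure, so the per-word uniform choice among a message's recipients is an assumption here, and all names and toy distributions are hypothetical.

```python
import numpy as np

def generate_art_message(author, recipients, n_words, theta, phi, rng):
    """Sample the tokens of one message.

    theta: dict mapping (author, recipient) -> distribution over topics.
    phi:   (T, V) array; row k is topic k's distribution over words.
    """
    tokens = []
    for _ in range(n_words):
        # Assumption: each word picks one of the message's recipients uniformly.
        x = recipients[rng.integers(len(recipients))]
        # The topic is drawn from the distribution of this author-recipient pair.
        z = int(rng.choice(phi.shape[0], p=theta[(author, x)]))
        w = int(rng.choice(phi.shape[1], p=phi[z]))
        tokens.append((x, z, w))
    return tokens

# Hypothetical toy parameters: alice discusses topic 0 with bob, topic 1 with carol.
phi = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
theta = {("alice", "bob"): np.array([1.0, 0.0]),
         ("alice", "carol"): np.array([0.0, 1.0])}
rng = np.random.default_rng(0)
tokens = generate_art_message("alice", ["bob", "carol"], 20, theta, phi, rng)
```

Because the topic distribution is indexed by the pair rather than by the author alone, the same author can have different topical profiles toward different recipients.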

Page 8:

Enron Email Corpus

• 250k email messages
• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas
[email protected]

Page 9:

Topics, and prominent senders / receivers discovered by ART [McCallum et al 2005]

Topic names assigned by hand

Page 10:

Topics, and prominent senders / receivers discovered by ART

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

Page 11:

Comparing Role Discovery

connection strength (A,B) =

Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
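The slide does not say how two people's distributions are compared. One common choice, used here purely as an assumption, is the Jensen-Shannon divergence: 0 for identical distributions, 1 (in bits) for distributions with no overlap. The example vectors are invented.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-person distributions (over recipients or over topics).
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
person_c = [0.0, 0.1, 0.9]
similar = js_divergence(person_a, person_b)    # small value: similar roles
different = js_divergence(person_a, person_c)  # larger value: different roles
```

The three methods would differ only in which distribution is passed in for each person.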

Page 12:

Comparing Role Discovery: Tracy Geaconne and Dan McCarty

Traditional SNA: Similar roles
Author-Topic: Different roles
ART: Different roles

Geaconne = “Secretary”
McCarty = “Vice President”

Page 13:

Comparing Role Discovery: Lynn Blair and Kimberly Watson

Traditional SNA: Different roles
Author-Topic: Very different
ART: Very similar

Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

Page 14:

ART: Roles but not Groups

Enron TransWestern Division

Traditional SNA: Block structured
Author-Topic: Not
ART: Not

Page 15:

Two Relations with Different Attributes

Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking

Academic Admiration
  Acad(A, B)  Acad(C, B)
  Acad(A, D)  Acad(C, D)
  Acad(B, E)  Acad(D, E)
  Acad(B, F)  Acad(D, F)
  Acad(E, A)  Acad(F, A)
  Acad(E, C)  Acad(F, C)
  Matrix ordered A C B D E F, with groups G1 = {A, C}, G2 = {B, D}, G3 = {E, F}

Social Admiration
  Soci(A, B)  Soci(A, D)  Soci(A, F)
  Soci(B, A)  Soci(B, C)  Soci(B, E)
  Soci(C, B)  Soci(C, D)  Soci(C, F)
  Soci(D, A)  Soci(D, C)  Soci(D, E)
  Soci(E, B)  Soci(E, D)  Soci(E, F)
  Soci(F, A)  Soci(F, C)  Soci(F, E)
  Matrix ordered A C E B D F, with groups G1 = {A, C, E}, G2 = {B, D, F}
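The point of the example, that the same six students fall into different groups under the two relations, can be checked directly from the listed pairs. The helper names below are illustrative; the pairs and groupings are the ones on the slide.

```python
import numpy as np

students = "ABCDEF"
idx = {s: i for i, s in enumerate(students)}

def adjacency(pairs):
    """Build a 6x6 0/1 matrix; entry [a, b] = 1 when a admires b."""
    m = np.zeros((6, 6), dtype=int)
    for a, b in pairs:
        m[idx[a], idx[b]] = 1
    return m

acad = adjacency(["AB", "CB", "AD", "CD", "BE", "DE",
                  "BF", "DF", "EA", "FA", "EC", "FC"])
soci = adjacency(["AB", "AD", "AF", "BA", "BC", "BE",
                  "CB", "CD", "CF", "DA", "DC", "DE",
                  "EB", "ED", "EF", "FA", "FC", "FE"])

def block_density(m, groups):
    """Fraction of present relations from each group to each group."""
    out = np.zeros((len(groups), len(groups)))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            rows = [idx[s] for s in gi]
            cols = [idx[s] for s in gj]
            out[i, j] = m[np.ix_(rows, cols)].mean()
    return out

acad_blocks = block_density(acad, ["AC", "BD", "EF"])  # three groups
soci_blocks = block_density(soci, ["ACE", "BDF"])      # two groups
```

Under the slide's groupings every block is either completely full or completely empty, which is the block structure a group model tries to discover.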

Page 16:

The Group-Topic Model: Discovering Groups and Topics Simultaneously

[Figure: graphical model (plate diagram). Recoverable symbols: w, t, v, g, φ, η, β, γ, α; plate sizes involving B, T, S and G; distributions: Uniform, Dirichlet, Multinomial, Beta, Binomial.]

Page 17:

Inference and Estimation

Gibbs Sampling:
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast

We assume the relationship is symmetric.

Page 18:

Dataset #1: U.S. Senate

• 16 years of voting records in the US Senate (1989 – 2005)

• a Senator may respond Yea or Nay to a resolution

• 3423 resolutions with text attributes (index terms)

• 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms

Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……

Page 19:

Topics Discovered (U.S. Senate)

Mixture of Unigrams

Education     Energy      Military Misc.   Economic
education     energy      government       federal
school        power       military         labor
aid           water       foreign          insurance
children      nuclear     tax              aid
drug          gas         congress         tax
students      petrol      aid              business
elementary    research    law              employee
prevention    pollution   policy           care

Group-Topic Model

Education + Domestic   Foreign        Economic    Social Security + Medicare
education              foreign        labor       social
school                 trade          insurance   security
federal                chemicals      tax         insurance
aid                    tariff         congress    medical
government             congress       income      care
tax                    drugs          minimum     medicare
energy                 communicable   wage        disability
research               diseases       business    assistance

Page 20:

Groups Discovered (US Senate)

Groups from topic Education + Domestic

Page 21:

Senators Who Change Coalition the Most, Dependent on Topic

e.g. Senator Shelby (D-AL) votes
- with the Republicans on Economic
- with the Democrats on Education + Domestic
- with a small group of maverick Republicans on Social Security + Medicare

Page 22:

Dataset #2: The UN General Assembly

• Voting records of the UN General Assembly (1990 - 2003)

• A country may choose to vote Yes, No or Abstain

• 931 resolutions with text attributes (titles)

• 192 countries in total

• Also experiments later with resolutions from 1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting

The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:

In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Page 23:

Topics Discovered (UN)

Mixture of Unigrams

Everything Nuclear   Human Rights   Security in Middle East
nuclear              rights         occupied
weapons              human          israel
use                  palestine      syria
implementation       situation      security
countries            israel         calls

Group-Topic Model

Nuclear Non-proliferation   Nuclear Arms Race   Human Rights
nuclear                     nuclear             rights
states                      arms                human
united                      prevention          palestine
weapons                     race                occupied
nations                     space               israel

Page 24:

Groups Discovered (UN)

The countries listed for each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.

Page 25:

Groups and Topics, Trends over Time (UN)

Page 26:

Models that combine text analysis with other structured data:

people, senders, receivers, organizations, votes, time, locations, materials, ...

I call these...

Structured Topic Models

Page 27:

Improve Basic Infrastructure of Topic Models

• Incorporate time

• Finer-grained, more interpretable topics by representing topic correlations

• Discover relevant phrases

• Map influence and impact

Page 28:

Groups and Topics, Trends over Time (UN)

Page 29:

Want to Model Trends over Time

• Pattern appears only briefly
  – Capture its statistics in focused way
  – Don’t confuse it with patterns elsewhere in time

• Is prevalence of topic growing or waning?

• How do roles, groups, influence shift over time?

Page 30:

Topics over Time (TOT) [Wang, McCallum, KDD 2006]

[Figure: graphical model (plate diagram). Recoverable labels:
  α: Dirichlet prior
  multinomial over topics
  z: topic index
  w: word
  t: time stamp
  per topic (plates of size T): Multinomial over words, Beta over time
  further priors labelled on the figure: Dirichlet prior, Uniform prior (symbols β, γ)
  plates: N_d, D]
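A small sketch of the "Beta over time" part of the diagram: each topic owns a Beta distribution over normalized timestamps, so a token is generated as a (topic, word, time stamp) triple. Fitting the Beta parameters by the method of moments is an assumption made here for simplicity, and all names and toy values are hypothetical.

```python
import numpy as np

def fit_beta_moments(timestamps):
    """Method-of-moments estimate of Beta(a, b) from timestamps scaled to (0, 1)."""
    t = np.asarray(timestamps, dtype=float)
    mean, var = t.mean(), t.var()
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def sample_token(theta, phi, beta_params, rng):
    """Draw one (topic, word, timestamp) triple for a document with mixture theta."""
    z = int(rng.choice(len(theta), p=theta))     # topic index
    w = int(rng.choice(phi.shape[1], p=phi[z]))  # word from the topic's multinomial
    a, b = beta_params[z]
    t = float(rng.beta(a, b))                    # time stamp from the topic's Beta
    return z, w, t

rng = np.random.default_rng(0)
# Toy check: recover the parameters of a topic concentrated early in the timeline.
a_hat, b_hat = fit_beta_moments(rng.beta(2.0, 5.0, size=20000))

phi = np.array([[0.9, 0.1], [0.2, 0.8]])
beta_params = [(2.0, 5.0), (8.0, 2.0)]  # topic 0 early, topic 1 late
z, w, t = sample_token(np.array([0.5, 0.5]), phi, beta_params, rng)
```

A sharply peaked Beta lets a topic that appears only briefly be localized in time instead of being blurred across the whole collection.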

Page 31:

State of the Union Address

208 Addresses delivered between January 8, 1790 and January 29, 2002.

To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopword removal was applied.

• 17156 ‘documents’

• 21534 words

• 669,425 tokens

Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.

1910
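The preprocessing described on this slide (split into paragraphs, drop one-line paragraphs, remove stopwords) might look roughly like the following. The tokenizer and the tiny stopword list are placeholders, not the ones used for the reported counts.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "by", "be", "it"}

def addresses_to_documents(address_text):
    """Turn one address into a list of token lists, one per kept paragraph."""
    documents = []
    for paragraph in re.split(r"\n\s*\n", address_text):
        lines = [ln for ln in paragraph.splitlines() if ln.strip()]
        if len(lines) <= 1:  # one-line paragraphs are excluded
            continue
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", paragraph.lower())
        documents.append([tok for tok in tokens if tok not in STOPWORDS])
    return documents

sample = ("Fellow citizens:\n\n"
          "Our scheme of taxation consists\nof a tariff or duty.\n\n"
          "Thank you.")
docs = addresses_to_documents(sample)
```

Only the middle paragraph survives here, since the greeting and the closing are one-line paragraphs.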

Page 32:

Comparing TOT against LDA

Page 33:

TOT versus LDA on my email

Page 34:

Topic Distributions Conditioned on Time in NIPS conference papers

[Figure: x-axis: time; y-axis: topic mass (in vertical height)]

Page 35:

Discovering Group Structure Trends over Time

Group Model without Time / Group Model with Time

[Figure: two graphical models. Recoverable labels: group id; observed relation; per group-pair binomial over relation absent / present; multinomial distribution over groups; time stamp; per group beta over time; plate G]

Page 36:

Improve Basic Infrastructure of Topic Models

• Incorporate time

• Finer-grained, more interpretable topics by representing topic correlations

• Discover relevant phrases

• Map influence and impact

Page 37:

Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]

[Figure: LDA plate diagram. α; θ: topic distribution; z: topic; w: word; β: per-topic multinomial over words; plates n and N]

LDA 20, “images, motion, eyes”:
visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location

LDA 100, “motion” (+ some generic):
motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

Page 38:

Pachinko Machine

Page 39:

Pachinko Allocation Model (PAM) [Li, McCallum, 2006]

[Figure: model structure, not the graphical model. A DAG with root α11; interior nodes α21, α22; α31, α32, α33; α41, α42, α43, α44, α45; leaves word1 ... word8]

Model structure: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children; words at the leaves.

For each document: Sample a multinomial from each Dirichlet.

For each word in this document: Starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf.

Like a Polya tree, but DAG shaped, with arbitrary number of children.

Thanks to Michael Jordan for suggesting the name.
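The two sampling steps described on this slide translate almost directly into code. The small DAG, the node names and the symmetric Dirichlet parameters below are invented for illustration; they are not the structure drawn on the slide.

```python
import numpy as np

def sample_pam_document(dag, alphas, root, n_words, rng):
    """Sample the words of one document from a Pachinko-style DAG.

    dag:    interior node -> list of children (leaves are words, absent from dag).
    alphas: interior node -> Dirichlet parameter over that node's children.
    """
    # For each document: sample a multinomial from each interior node's Dirichlet.
    thetas = {node: rng.dirichlet(alphas[node]) for node in dag}
    words = []
    for _ in range(n_words):
        node = root
        while node in dag:  # walk from the root down to a leaf
            children = dag[node]
            node = children[rng.choice(len(children), p=thetas[node])]
        words.append(node)  # the leaf reached is the generated word
    return words

# Hypothetical structure: root -> super-topics -> sub-topics -> words.
dag = {
    "root": ["s1", "s2"],
    "s1": ["t1", "t2"],
    "s2": ["t2", "t3"],   # t2 is shared, so this is a DAG and not a tree
    "t1": ["word1", "word2"],
    "t2": ["word2", "word3"],
    "t3": ["word3", "word4"],
}
alphas = {node: np.ones(len(children)) for node, children in dag.items()}
rng = np.random.default_rng(0)
doc = sample_pam_document(dag, alphas, "root", 15, rng)
```

With a single interior layer directly above the words, the same routine reduces to the LDA generative process.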

Page 40:

Pachinko Allocation Model [Li, McCallum, 2006]

[Figure: model structure, not the graphical model. Same DAG: root α11; α21, α22; α31, α32, α33; α41 ... α45; leaves word1 ... word8]

Annotations on the levels of the DAG:
- Distributions over words (like “LDA topics”)
- Distributions over topics; mixtures, representing topic correlations
- Distributions over distributions over topics...

Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).

Page 41:

Pachinko Allocation Model [Li, McCallum, 2006]

[Figure: model structure, not the graphical model. Same DAG: root α11; α21, α22; α31, α32, α33; α41 ... α45; leaves word1 ... word8]

Estimate all these Dirichlets from data.

Estimate model structure from data (number of nodes, and connectivity).

Page 42:

Pachinko Allocation Special Cases

Latent Dirichlet Allocation

[Figure: model structure with root α11; a single layer of interior nodes α21, α22, α23, α24, α25; leaves word1 ... word8]

Page 43:

Inference – Gibbs Sampling

[Figure: graphical model with α2, θ2, α3, θ3, z2, z3, w, β; plates n, N, T’. z2 and z3 are jointly sampled.]

P(z_{w2} = t_k, z_{w3} = t_p \mid D, \mathbf{z}_{-w}, \alpha, \beta) \propto \frac{n_{1k}^{(d)} + \alpha_{1k}}{n_{1}^{(d)} + \sum_{k'} \alpha_{1k'}} \times \frac{n_{kp}^{(d)} + \alpha_{kp}}{n_{k}^{(d)} + \sum_{p'} \alpha_{kp'}} \times \frac{n_{pw} + \beta_{w}}{n_{p} + \sum_{m} \beta_{m}}

The three factors correspond to P(t_k), P(t_p | t_k) and P(w | t_p).

Dirichlet parameters α are estimated with moment matching.

Page 44:

Example Topics

LDA 100, “motion” (some generic):
motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

LDA 20, “images, motion, eyes”:
visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location

PAM 100, “motion”:
motion, video, surface, surfaces, figure, scene, camera, noisy, sequence, activation, generated, analytical, pixels, measurements, assigne, advance, lated, shown, closed, perceptual

PAM 100, “eyes”:
eye, head, vor, vestibulo, oculomotor, vestibular, vary, reflex, vi, pan, rapid, semicircular, canals, responds, streams, cholinergic, rotation, topographically, detectors, ning

PAM 100, “images”:
image, digit, faces, pixel, surface, interpolation, scene, people, viewing, neighboring, sensors, patches, manifold, dataset, magnitude, transparency, rich, dynamical, amounts, tor

Page 45:

Blind Topic Evaluation

• Randomly select 25 similar pairs of topics generated from PAM and LDA

• 5 people

• Each asked to “select the topic in each pair that you find more semantically coherent.”

Topic counts

              LDA   PAM
5 votes        0     5
>= 4 votes     3     8
>= 3 votes     9    16

Page 46:

Examples

PAM (5 votes)   LDA (0 votes)
control         control
systems         systems
robot           based
adaptive        adaptive
environment     direct
goal            con
state           controller
controller      change

Page 47:

Examples

PAM (4 votes)   LDA (1 vote)
motion          image
image           motion
detection       images
images          multiple
scene           local
vision          generated
texture         noisy
segmentation    optical

Page 48:

Examples

PAM (4 votes)   LDA (1 vote)
algorithm       algorithm
learning        algorithms
algorithms      gradient
gradient        convergence
convergence     stochastic
function        line
stochastic      descent
weight          converge

PAM (1 vote)    LDA (4 votes)
signals         signal
source          signals
separation      single
eeg             time
sources         low
blind           source
single          temporal
event           processing

Page 49:

Topic Correlations

Page 50:

Likelihood Comparison

• Varying number of topics

PAM supports ~5x more topics than LDA

Page 51:

Improve Basic Infrastructure of Topic Models

• Incorporate time

• Finer-grained, more interpretable topics by representing topic correlations

• Discover relevant phrases

• Map influence and impact

Page 52:

Topic Interpretability

LDA:
algorithms, algorithm, genetic, problems, efficient

Topical N-grams:
genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Page 53:

Topical N-gram Model [Wang, McCallum 2005]

See also: [Steyvers, Griffiths, Newman, Smyth 2005]

[Figure: graphical model. z1 ... z4: topic; y1 ... y4: uni- / bi-gram status; w1 ... w4: words; other symbols: α, β, γ1, γ2; plates: T, D, W, TW]

Page 54:

Features of Topical N-Grams model

• Easily trained by Gibbs sampling
  – Can run efficiently on millions of words

• Topic-specific phrase discovery
  – “white house” has special meaning as a phrase in the politics topic,
  – ... but not in the real estate topic.

Page 55:

Topic Comparison

LDA:
learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2):
reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1):
policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Page 56:

Topic Comparison

LDA:
word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models

Topical N-grams (2):
speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent

Topical N-grams (1):
speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid

Page 57:

Improve Basic Infrastructure of Topic Models

• Incorporate time

• Finer-grained, more interpretable topics by representing topic correlations

• Discover relevant phrases

• Map influence and impact


Page 59:

Previous Systems

[Figure: entity-relation diagram. Entity: Research Paper; relation: Cites]

Page 60:

More Entities and Relations

[Figure: entity-relation diagram. Labels: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise]

Page 67:

Topical Transfer

Citation counts from one topic to another.

Map “producers and consumers”
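Counting citations from one topic to another needs only a citation list and a topic label per paper. The sketch below assumes, as a simplification, that each paper carries a single topic label; the paper identifiers and labels are made up.

```python
from collections import Counter

def topical_transfer(citations, paper_topic):
    """Count cross-topic citations.

    citations:   iterable of (citing_paper, cited_paper) pairs.
    paper_topic: dict mapping paper id -> topic label (one label per paper).
    Returns a Counter keyed by (producer_topic, consumer_topic).
    """
    counts = Counter()
    for citing, cited in citations:
        producer = paper_topic[cited]   # the topic whose idea is being used
        consumer = paper_topic[citing]  # the topic using it
        if producer != consumer:        # only transfers between different topics
            counts[(producer, consumer)] += 1
    return counts

# Hypothetical toy data.
paper_topic = {"p1": "Digital Libraries", "p2": "Web Pages",
               "p3": "Web Pages", "p4": "Digital Libraries"}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p1", "p3")]
transfer = topical_transfer(citations, paper_topic)
```

Row sums of the resulting matrix indicate how much a topic produces for others, and column sums how much it consumes.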

Page 68:

Topical Bibliometric Impact Measures

• Topical Citation Counts

• Topical Impact Factors

• Topical Longevity

• Topical Precedence

• Topical Diversity

• Topical Transfer

[Mann, Mimno, McCallum, 2006]

Page 69:

Topical Transfer

Transfer from Digital Libraries to other topics

Other topic       Cit’s   Paper Title
Web Pages         31      Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision   14      On being ‘Undigital’ with digital cameras: extending the dynamic...
Video             12      Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs            12      Trawling the Web for Emerging Cyber-Communities
Web Pages         11      WebBase: a repository of Web pages

Page 70:

Topical Diversity

Papers that had the most influence across many other fields...

Page 71: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction

Topical Diversity

Entropy of the topic distribution among papers that cite this paper (this topic).

High Diversity

Low Diversity
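The entropy definition above can be sketched in a few lines. This assumes one topic label per citing paper and uses base-2 logarithms; both choices, and the function name, are assumptions made for illustration.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in bits) of the topic distribution among citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Citations spread evenly over four topics: high diversity.
high = topical_diversity(["Vision", "NLP", "IR", "Graphs"])
# All citations from a single topic: zero diversity.
low = topical_diversity(["NLP", "NLP", "NLP", "NLP"])
```

With k equally represented citing topics the entropy is log2(k), so the measure grows with both the number of citing topics and how evenly citations are spread among them.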


Topical Bibliometric Impact Measures

• Topical Citation Counts

• Topical Impact Factors

• Topical Longevity

• Topical Precedence

• Topical Diversity

• Topical Transfer

[Mann, Mimno, McCallum, 2006]


Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Speech Recognition:

Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)

Spectrographic study of vowel reduction, B. Lindblom (1963)

Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)

Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)

Automatic Recognition of Speakers from Their Voices, B. Atal (1976)


Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Information Retrieval:

On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)

Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)

Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

New experiments in relevance feedback, Ide (1971)

Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
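As a rough sketch, the precedence query stated above is a filter followed by a sort on year. The record fields (`topic`, `year`, `citations`), the cutoff `k`, and the toy records are hypothetical.

```python
def topical_precedence(papers, topic, n, k=5):
    """Return the k earliest papers in `topic` that received more than n citations."""
    eligible = [p for p in papers if p["topic"] == topic and p["citations"] > n]
    return sorted(eligible, key=lambda p: p["year"])[:k]

# Toy records, invented for illustration only.
papers = [
    {"title": "A", "topic": "IR", "year": 1971, "citations": 40},
    {"title": "B", "topic": "IR", "year": 1960, "citations": 25},
    {"title": "C", "topic": "IR", "year": 1955, "citations": 3},
    {"title": "D", "topic": "Speech", "year": 1953, "citations": 50},
]
earliest = [p["title"] for p in topical_precedence(papers, "IR", n=10)]
```

Paper C is the oldest IR paper in the toy data but is excluded because it does not clear the citation threshold, which is what keeps obscure early papers out of the list.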


Topical Transfer Through Time

• Can we predict which research topics will be “hot” at the ICML conference next year?

• ...based on
  – the hot topics in “neighboring” venues last year
  – learned “neighborhood” distances for venue pairs


How do Ideas Progress Through Social Networks?

Hypothetical Example: “AdaBoost”

[Figure: venue nodes COLT, ICML, ACL (NLP), ICCV (Vision), SIGIR (Info. Retrieval)]


Topic Prediction Models

Static Model

Transfer Model

Linear Regression and Ridge Regression Used for Coefficient Training.
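The model equations from this slide did not survive extraction, so the following is only a generic sketch of ridge-regression coefficient training, not the talk's Static or Transfer model. It assumes each training row holds one topic's share at three neighboring venues in the previous year and the target is its share at the venue of interest; the data are synthetic and every name is hypothetical.

```python
import random

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X^T X + alpha I)^{-1} X^T y."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(gram, xty)

# Synthetic data: the target depends on venues 0 and 1 but not on venue 2.
random.seed(0)
true_w = [0.6, 0.3, 0.0]
X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [sum(w * x for w, x in zip(true_w, row)) + random.gauss(0, 0.01) for row in X]

w_ols = ridge_fit(X, y, alpha=0.0)    # alpha = 0 reduces to ordinary linear regression
w_ridge = ridge_fit(X, y, alpha=1.0)  # alpha > 0 shrinks the coefficients toward zero
```

The penalty `alpha` trades a small bias for lower variance, which matters most when the number of venues used as predictors is large relative to the number of training examples.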


Preliminary Results

Transfer Model with Ridge Regression is a good predictor.

[Figure: mean squared prediction error (smaller is better) vs. # venues used for prediction; curve labeled Transfer Model]


Toward More Detailed, Structured Data

Leveraging Text in Social Network Analysis

Structured Topic Models

[Diagram labels]

• Document collection

• IE: Segment, Classify, Associate, Cluster (extract structured data about entities, relations, events)

• Database

• Data Mining: discover patterns (entity types, links / relations, events)

• Actionable knowledge: Prediction, Outlier detection, Decision support


Joint Inference

[Diagram labels]

• Document collection

• Spider

• Filter

• IE: Segment, Classify, Associate, Cluster

• Database

• Data Mining: discover patterns (entity types, links / relations, events)

• Actionable knowledge: Prediction, Outlier detection, Decision support

• Uncertainty Info

• Emerging Patterns


Solution: Unified Model

[Diagram labels]

• Document collection

• Spider

• Filter

• IE: Segment, Classify, Associate, Cluster

• Probabilistic Model

• Data Mining: discover patterns (entity types, links / relations, events)

• Actionable knowledge: Prediction, Outlier detection, Decision support

Discriminatively-trained undirected graphical models

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Complex Inference and Learning

Just what we researchers like to sink our teeth into!


(Linear Chain) Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence

$$p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t) \quad \text{where} \quad \Phi(y_t, y_{t-1}, x, t) = \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \right)$$

[Figure: finite state model and graphical model; FSM states y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, . . . above observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, . . .]

input seq:    said    Jones    a       Microsoft   VP      …
output seq:   OTHER   PERSON   OTHER   ORG         TITLE   …

Wide-spread interest, positive experimental results in many applications.

Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04],…
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]
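A toy sketch of the linear-chain CRF probability, computing the normalizer Z_x with the forward recursion and checking that the probabilities of all label sequences sum to one. The two indicator features ("cap" for a capitalized token, "trans" for a label transition) and all weights are invented for illustration; a real extractor would use many more features and learned weights.

```python
import math
from itertools import product

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def log_potential(y_t, y_prev, x, t, weights):
    """log Phi = sum_k lambda_k * f_k(y_t, y_prev, x, t) for two toy indicator features."""
    score = weights.get(("cap", y_t), 0.0) if x[t][0].isupper() else 0.0
    score += weights.get(("trans", y_prev, y_t), 0.0)
    return score

def sequence_score(y, x, weights):
    """Unnormalized log-score of a label sequence (y_prev is None at t = 0)."""
    return sum(log_potential(y[t], y[t - 1] if t else None, x, t, weights)
               for t in range(len(x)))

def log_partition(x, weights):
    """log Z_x computed with the forward recursion over positions."""
    alpha = {y: log_potential(y, None, x, 0, weights) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {
            y: math.log(sum(math.exp(alpha[yp] + log_potential(y, yp, x, t, weights))
                            for yp in LABELS))
            for y in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))

def prob(y, x, weights):
    """p(y | x) = exp(score(y, x)) / Z_x."""
    return math.exp(sequence_score(y, x, weights) - log_partition(x, weights))

x = ["said", "Jones", "a", "Microsoft", "VP"]
weights = {
    ("cap", "PERSON"): 1.5, ("cap", "ORG"): 1.2, ("cap", "TITLE"): 0.8,
    ("trans", "ORG", "TITLE"): 1.0, ("trans", "OTHER", "PERSON"): 0.5,
}
p = prob(["OTHER", "PERSON", "OTHER", "ORG", "TITLE"], x, weights)
# Sanity check: probabilities over all 4^5 label sequences should sum to 1.
total = sum(prob(list(y), x, weights) for y in product(LABELS, repeat=len(x)))
```

The forward recursion costs time proportional to the sequence length times the square of the number of labels, whereas the brute-force sum used for the sanity check grows exponentially with sequence length.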


Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.

              Milk Cows and Production of Milk and Milkfat:
                         United States, 1993-95
--------------------------------------------------------------------------------
      :            :          Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :      Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk  : Milkfat  : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---      Percent        Million Pounds
      :
 1993 :   9,589       15,704     575         3.66        150,582   5,514.4
 1994 :   9,500       16,175     592         3.66        153,664   5,623.7
 1995 :   9,461       16,451     602         3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

CRF Labels:

• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)

Features:

• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

100+ documents from www.fedstats.gov

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
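A rough sketch of how a few of the per-line features listed above might be computed. The feature names and the simple whitespace-alignment stand-in are assumptions for illustration, not the feature definitions used in the cited work.

```python
import re

def line_features(line, prev_line=""):
    """Compute a few per-line layout features for one line of text."""
    n = max(len(line), 1)  # avoid division by zero on empty lines
    return {
        "pct_digit": sum(ch.isdigit() for ch in line) / n,
        "pct_alpha": sum(ch.isalpha() for ch in line) / n,
        "indented": line.startswith((" ", "\t")),
        "five_plus_spaces": re.search(r" {5,}", line) is not None,
        # Crude stand-in for alignment: columns that hold a space in both lines.
        "aligned_space_cols": sum(a == " " == b for a, b in zip(line, prev_line)),
    }

row = "1993 :   9,589       15,704     575         3.66        150,582   5,514.4"
prose = "Marketings include whole milk sold to plants and dealers"
f_row, f_prose = line_features(row), line_features(prose)
```

A data row scores high on digits and long space runs while prose scores high on alphabetic characters, which is the kind of contrast a sequence model can use to separate table lines from surrounding text.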


Table Extraction Experimental Results

Line labels, percent correct

HMM                65 %
Stateless MaxEnt   85 %
CRF                95 %

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]


IE from Research Papers

[McCallum et al ‘99]


IE from Research Papers

Method                             Field-level F1   Reference
Hidden Markov Models (HMMs)        75.6             [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7             [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9             [Peng, McCallum, 2004]

40% reduction in error (CRFs relative to SVMs)


Named Entity Recognition

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels:   Examples:
PER       Yayuk Basuki; Innocent Butare
ORG       3M; KDP; Cleveland
LOC       Cleveland; Nirmal Hriday; The Oval
MISC      Java; Basque; 1,000 Lakes Rally


Named Entity Extraction Results

Method                       F1
HMMs (BBN's IdentiFinder)    73%
CRFs                         90%

[McCallum & Li, 2003, CoNLL]


MALLET: Machine Learning for LanguagE Toolkit

• ~80k lines of Java

• Based on experience with previous toolkits
  – Rainbow, WhizBang, GATE, Weka.

• Document classification, information extraction, clustering, co-reference, cross-document co-reference, POS tagging, shallow parsing, relational classification, sequence alignment, structured topic models, social network analysis with text.

• Infrastructure for pipelining feature extraction and processing steps.

• Many ML basics in common, convenient framework:
  – naïve Bayes, MaxEnt, Boosting, SVMs; Dirichlets, Conjugate Gradient

• Advanced ML algorithms:
  – Conditional Random Fields, BFGS, Expectation Propagation, …

• Unlike other general toolkits (e.g. Weka) MALLET scales to millions of features, millions of training examples, as needed for NLP.

• Now being used in many universities & companies all over the world:
  – MIT, CMU, UPenn, Berkeley, UTexas, Purdue, Oregon State, UWash, UMass, Google, Yahoo, BAE.
  – Also in UK, Germany, France.


Semi-Supervised Learning

• Labeled data is expensive
  – Especially for sequence modeling tasks
  – POS tagging, word segmentation, NER

• Unlabeled data is abundant
  – The Web
  – Newswire
  – Other internal reports
  – etc.


HMM-LDA Model

• Distinguish between semantic words and syntactic words

[Griffiths, et al. 2004]


Experiments

• Dataset
  – Wall Street Journal (WSJ) collection labeled with part-of-speech tags. The corpus contains 2312 documents, 38665 unique words and 1.2M word tokens.

• 50 topics and 40 syntactic classes

• Gibbs sampling
  – 40 samples with a lag of 100 iterations between them and an initial burn-in period of 4000 iterations.
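To make the sampling schedule concrete, here is a small sketch of which Gibbs iterations would be kept under one reading of the numbers above. It assumes the first sample is taken one full lag after burn-in ends; the slide does not say this, so treat the exact offsets as an assumption.

```python
def sample_iterations(burn_in=4000, n_samples=40, lag=100):
    """Iteration numbers at which samples are kept (first sample one lag after burn-in)."""
    return [burn_in + lag * i for i in range(1, n_samples + 1)]

kept = sample_iterations()
total_iterations = kept[-1]  # the chain runs this many iterations in total
```

Under this reading the chain runs for 8000 iterations in total. Spacing the kept samples by a lag reduces the correlation between consecutive samples.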


Sample Syntactic Clusters

make     0.0279   of    0.7448   way        0.0172   last     0.0767
sell     0.0210   in    0.0828   agreement  0.0140   first    0.0740
buy      0.0174   for   0.0355   price      0.0136   next     0.0479
take     0.0164   from  0.0239   time       0.0121   york     0.0433
get      0.0157   and   0.0238   bid        0.0103   third    0.0424
do       0.0155   to    0.0185   effort     0.0100   past     0.0368
pay      0.0152   ;     0.0096   position   0.0098   this     0.0361
go       0.0113   with  0.0073   meeting    0.0098   dow      0.0295
give     0.0104   that  0.0055   offer      0.0093   federal  0.0288
provide  0.0086   or    0.0039   day        0.0092   fiscal   0.0262

Table 1: Sample syntactic word clusters, each column displays the top 10 words in one cluster and their probabilities


Sample Semantic Clusters

bank        0.0918   computer    0.0610   jaguar   0.0824   ad           0.0314
loans       0.0327   computers   0.0301   ford     0.0641   advertising  0.0298
banks       0.0291   ibm         0.0280   gm       0.0353   agency       0.0268
loan        0.0289   data        0.0200   shares   0.0249   brand        0.0181
thrift      0.0264   machines    0.0191   auto     0.0172   ads          0.0177
assets      0.0235   technology  0.0182   express  0.0144   saatchi      0.0162
savings     0.0220   software    0.0176   maker    0.0136   brands       0.0142
federal     0.0179   digital     0.0173   car      0.0134   account      0.0120
regulators  0.0146   systems     0.0169   share    0.0128   industry     0.0106
debt        0.0142   business    0.0151   saab     0.0116   clients      0.0105

Table 2: Sample semantic word clusters, each column displays the top 10 words in one cluster and their probabilities


POS Tagging

• Features
  – Word unigrams and bigrams
  – Spelling features
  – Word suffixes
  – Cluster features
    • HMM-LDA: the most likely class assignment for each word over all the samples
    • HC: bit string prefixes of lengths 8, 12, 16 and 20

• CRFs
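The two kinds of cluster features listed above could be derived roughly as follows. This sketch assumes that each word's class assignments across Gibbs samples are available as a list and that each word has a hierarchical-cluster bit string; the function names, feature-string format, and toy inputs are hypothetical.

```python
from collections import Counter

def hmm_lda_feature(word_class_samples):
    """Most frequent class assigned to a word across the Gibbs samples."""
    return Counter(word_class_samples).most_common(1)[0][0]

def hc_prefix_features(bit_string, lengths=(8, 12, 16, 20)):
    """Prefixes of a word's hierarchical-cluster bit string at the listed lengths."""
    return [f"hc{n}={bit_string[:n]}" for n in lengths]

# Toy inputs, invented for illustration only.
cls = hmm_lda_feature([7, 7, 12, 7, 3])
prefixes = hc_prefix_features("1011001110001011010111")
```

Shorter prefixes correspond to coarser clusters higher in the hierarchy, so emitting several prefix lengths lets the tagger back off from fine to coarse word classes.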


Evaluation Results

(a) 10k labeled words, OOV rate = 24.46%

Error(%)   No Clusters   Hierarchical     HMM-LDA
Overall    10.04         9.46 (5.78)      8.56 (14.74)
OOV        22.32         21.56 (3.40)     18.49 (17.16)

(b) 30k labeled words, OOV rate = 15.31%

Error(%)   No Clusters   Hierarchical     HMM-LDA
Overall    6.08          5.85 (3.78)      5.40 (11.18)
OOV        17.34         17.35 (-0.00)    15.01 (13.44)

(c) 50k labeled words, OOV rate = 12.49%

Error(%)   No Clusters   Hierarchical     HMM-LDA
Overall    5.34          5.12 (4.12)      4.78 (10.30)
OOV        16.36         16.21 (0.92)     14.45 (11.67)

18% reduction in error
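The parenthesized figures in the results appear to be relative error reductions, in percent, against the No Clusters baseline. That reading is an inference from the arithmetic rather than something the slide states; the sketch below checks it for the two HMM-LDA entries with 10k labeled words.

```python
def relative_reduction(baseline, new):
    """Percent reduction in error relative to the baseline error."""
    return 100.0 * (baseline - new) / baseline

# 10k labeled words, HMM-LDA vs. No Clusters.
overall_10k = round(relative_reduction(10.04, 8.56), 2)
oov_10k = round(relative_reduction(22.32, 18.49), 2)
```

Both values match the slide, and the larger reduction on out-of-vocabulary words is consistent with cluster features helping most where the tagger has never seen the word with a label.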


Desired Future Work

• Add more “structured data types” to topic models.

• Leverage Pachinko Allocation to learn topic hierarchies and topic correlations in time.

• New type of topic model
  – fast enough to work on streaming data
  – more naturally combines many data modalities (add more “structured data types” together)
  – topics defined by both positive and negative features

• Use structured topic models to help predict influence and impact.

• Extremely low-supervision training of information extractors. Discover interesting entity/relation classes.


End of Talk