Information Extraction, Social Network Analysis
Structured Topic Models & Influence Mapping
Andrew McCallum
Information Extraction & Synthesis Laboratory
Department of Computer Science
University of Massachusetts
Joint work with Aron Culotta, Charles Sutton, Wei Li, Chris Pal, Pallika Kanani, Gideon
Mann, Natasha Mohanty, Xuerui Wang.
Goals
• Quickly understand and analyze contents of large volume of text + other data
  – browse topics
  – navigate connections
  – discover & see patterns
• Assess data source to determine relevance
• Browse data newly acquired from the field
• Navigate your own data
• Discover structure and patterns
• Assess impact and influence
Collaborative opportunity assessment
Let analysts drive discovery process
Inducing unfamiliar, inter-agency organizational structure
Rapid ingest
Map flow of ideas
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example:
  70% Iraq war, 30% US election
  Iraq war
  “bombing”
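The generative process on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the four-word vocabulary, the two topic-word distributions in `phi`, and the function name `generate_document` are all assumptions made for the example. The fixed `theta` mirrors the slide's "70% Iraq war, 30% US election"; in full LDA it would be drawn from a Dirichlet, e.g. `rng.dirichlet(alpha)`.

```python
import numpy as np

def generate_document(theta, phi, vocab, n_words, rng):
    """Generate one document from a per-document topic distribution theta."""
    topics, words = [], []
    for _ in range(n_words):
        z = int(rng.choice(len(theta), p=theta))   # sample a topic, z
        w = int(rng.choice(len(vocab), p=phi[z]))  # sample a word from the topic, w
        topics.append(z)
        words.append(vocab[w])
    return topics, words

rng = np.random.default_rng(0)
vocab = ["bombing", "troops", "ballot", "campaign"]  # hypothetical toy vocabulary
phi = np.array([[0.5, 0.5, 0.0, 0.0],   # topic 0, read as "Iraq war"
                [0.0, 0.0, 0.5, 0.5]])  # topic 1, read as "US election"
theta = np.array([0.7, 0.3])            # 70% Iraq war, 30% US election
topics, words = generate_document(theta, phi, vocab, 10, rng)
print(list(zip(topics, words)))
```

Inference runs this process in reverse: given only the words, recover `theta` and `phi`.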
Example topics induced from a large collection of text
[Tenenbaum et al]

STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL

MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE

WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN

FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
Social Network in an Email Dataset
Author-Recipient-Topic SNA model
Topic choice depends on:
- author
- recipient (r)
[McCallum, Corrada, Wang, 2005]
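A small sketch may help show what "topic choice depends on author and recipient" means generatively. It assumes the usual reading of the Author-Recipient-Topic model: for each word, one recipient is picked uniformly from the message's recipients, and the topic is drawn from a distribution indexed by the (author, recipient) pair. The array shapes, the toy sizes, and the name `generate_message` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def generate_message(author, recipients, theta, phi, n_words, rng):
    """Generate (recipient, topic, word) triples for one message.

    theta[a, r] is a distribution over topics for the author-recipient pair (a, r);
    phi[z] is a distribution over the vocabulary for topic z.
    """
    tokens = []
    for _ in range(n_words):
        r = int(rng.choice(recipients))                      # pick one recipient uniformly
        z = int(rng.choice(theta.shape[2], p=theta[author, r]))  # topic depends on (author, recipient)
        w = int(rng.choice(phi.shape[1], p=phi[z]))          # word depends on the topic only
        tokens.append((r, z, w))
    return tokens

rng = np.random.default_rng(1)
n_people, n_topics, n_vocab = 3, 2, 5                        # hypothetical toy sizes
theta = rng.dirichlet(np.ones(n_topics), size=(n_people, n_people))
phi = rng.dirichlet(np.ones(n_vocab), size=n_topics)
tokens = generate_message(author=0, recipients=[1, 2], theta=theta, phi=phi, n_words=8, rng=rng)
print(tokens)
```

Because `theta` is indexed by the pair rather than by the author alone, the same author can write about different topics to different people, which is what the role comparisons later in the talk rely on.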
Enron Email Corpus
• 250k email messages
• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas
[email protected]
Topics, and prominent senders / receivers discovered by ART [McCallum et al 2005]
Topic names, by hand

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
connection strength (A,B) =
  Traditional SNA: distribution over recipients
  Author-Topic: distribution over authored topics
  ART: distribution over authored topics

Comparing Role Discovery: Tracy Geaconne, Dan McCarty
  Traditional SNA    Author-Topic       ART
  Similar roles      Different roles    Different roles
Geaconne = “Secretary”
McCarty = “Vice President”

Comparing Role Discovery: Lynn Blair, Kimberly Watson
  Traditional SNA    Author-Topic       ART
  Different roles    Very different     Very similar
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

ART: Roles but not Groups
Enron TransWestern Division
  Traditional SNA    Author-Topic       ART
  Block structured   Not                Not
Two Relations with Different Attributes

Student Roster
Adams, Bennett, Carter, Davis, Edwards, Frederking

Academic Admiration
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Groups (academic):  A   C   B   D   E   F
                    G1  G1  G2  G2  G3  G3

Social Admiration
Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)
Groups (social):    A   C   E   B   D   F
                    G1  G1  G1  G2  G2  G2
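The point of this example is that the same six students fall into different groups depending on which relation is examined. The sketch below checks that claim directly on the slide's data. The definition of "block structured" used here (every ordered pair of groups is either fully connected or fully unconnected) and the function name `is_block_structured` are assumptions chosen for illustration.

```python
from itertools import product

# Relations copied from the slide, as (admirer, admired) pairs.
academic = {("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"), ("B", "E"), ("D", "E"),
            ("B", "F"), ("D", "F"), ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")}
social = {("A", "B"), ("A", "D"), ("A", "F"), ("B", "A"), ("B", "C"), ("B", "E"),
          ("C", "B"), ("C", "D"), ("C", "F"), ("D", "A"), ("D", "C"), ("D", "E"),
          ("E", "B"), ("E", "D"), ("E", "F"), ("F", "A"), ("F", "C"), ("F", "E")}

def is_block_structured(relation, groups):
    """True if, for every ordered pair of groups, links are all present or all absent."""
    for g, h in product(groups, repeat=2):
        present = [(a, b) in relation for a in g for b in h if a != b]
        if present and any(present) != all(present):
            return False
    return True

three_groups = ["AC", "BD", "EF"]   # grouping shown for Academic Admiration
two_groups = ["ACE", "BDF"]         # grouping shown for Social Admiration

print(is_block_structured(academic, three_groups))  # academic relation, its own grouping
print(is_block_structured(academic, two_groups))    # academic relation, social grouping
print(is_block_structured(social, two_groups))      # social relation, its own grouping
print(is_block_structured(social, three_groups))    # social relation, academic grouping
```

Each relation is block structured only under its own grouping, which motivates a model in which group membership depends on the attribute (topic) of the relation.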
The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Figure: graphical model of the Group-Topic model. Symbols shown include w, t, φ, η, β, γ, α; distributions named in the diagram: Uniform, Dirichlet, Multinomial, Beta, Binomial.]

Inference and Estimation
Gibbs Sampling:
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……
Topics Discovered (U.S. Senate)

Mixture of Unigrams
Education     Energy      Military Misc.   Economic
education     energy      government       federal
school        power       military         labor
aid           water       foreign          insurance
children      nuclear     tax              aid
drug          gas         congress         tax
students      petrol      aid              business
elementary    research    law              employee
prevention    pollution   policy           care

Group-Topic Model
Education + Domestic   Foreign        Economic    Social Security + Medicare
education              foreign        labor       social
school                 trade          insurance   security
federal                chemicals      tax         insurance
aid                    tariff         congress    medical
government             congress       income      care
tax                    drugs          minimum     medicare
energy                 communicable   wage        disability
research               diseases       business    assistance
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators Who Change Coalition the most Dependent on Topic
e.g. Senator Shelby (D-AL) votes
- with the Republicans on Economic
- with the Democrats on Education + Domestic
- with a small group of maverick Republicans on Social Security + Medicare
Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)

Mixture of Unigrams
Everything Nuclear   Human Rights   Security in Middle East
nuclear              rights         occupied
weapons              human          israel
use                  palestine      syria
implementation       situation      security
countries            israel         calls

Group-Topic Model
Nuclear Non-proliferation   Nuclear Arms Race   Human Rights
nuclear                     nuclear             rights
states                      arms                human
united                      prevention          palestine
weapons                     race                occupied
nations                     space               israel
Groups Discovered (UN)
The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
Groups and Topics, Trends over Time (UN)
Models that combine text analysis with other structured data:
people, senders, receivers, organizations, votes, time, locations, materials, ...
I call these...
Structured Topic Models
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Want to Model Trends over Time
• Pattern appears only briefly
  – Capture its statistics in focused way
  – Don’t confuse it with patterns elsewhere in time
• Is prevalence of topic growing or waning?
• How do roles, groups, influence shift over time?
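Topics over Time (TOT)
[Wang, McCallum, KDD 2006]
[Figure: plate diagram of the TOT model. Labels: Dirichlet prior (α); multinomial over topics; topic index (z); word (w); time stamp (t); per-topic Multinomial over words; per-topic Beta over time; Dirichlet prior (β); Uniform prior; plates N_d, D, T.]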
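The labels in the TOT diagram suggest the following generative story: each token gets a topic, then both a word and a time stamp are drawn from that topic. The sketch below is a hedged illustration of that reading; the toy sizes, the Beta parameters in `psi`, and the function name `generate_tot_document` are assumptions, and time stamps are taken to be normalized to the interval (0, 1).

```python
import numpy as np

def generate_tot_document(alpha, phi, psi, n_tokens, rng):
    """Generate (topic, word, time stamp) triples for one document.

    alpha: Dirichlet prior over topics; phi[z]: multinomial over words for topic z;
    psi[z] = (a, b): parameters of topic z's Beta distribution over normalized time.
    """
    theta = rng.dirichlet(alpha)                       # per-document multinomial over topics
    tokens = []
    for _ in range(n_tokens):
        z = int(rng.choice(len(alpha), p=theta))       # topic index
        w = int(rng.choice(phi.shape[1], p=phi[z]))    # word from the topic's multinomial
        t = float(rng.beta(psi[z, 0], psi[z, 1]))      # time stamp from the topic's Beta
        tokens.append((z, w, t))
    return tokens

rng = np.random.default_rng(2)
alpha = np.ones(2)
phi = rng.dirichlet(np.ones(6), size=2)
psi = np.array([[2.0, 8.0],    # hypothetical topic concentrated early in the timeline
                [8.0, 2.0]])   # hypothetical topic concentrated late in the timeline
tokens = generate_tot_document(alpha, phi, psi, 12, rng)
print(tokens[:3])
```

A Beta distribution can be peaked, rising, or falling over the interval, which is how a topic can be localized to a brief period without being confused with patterns elsewhere in time.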
State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop-word removal was applied.
• 17156 ‘documents’
• 21534 words
• 669,425 tokens
Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.
1910
Comparing TOT against LDA
TOT versus LDA on my email
Topic Distributions Conditioned on Time in NIPS conference papers
[Figure: horizontal axis: time; topic mass shown as vertical height.]
Discovering Group Structure Trends over Time
Group Model without Time
Group Model with Time
[Figure: graphical models of both variants. Labels: group id; observed relation; per group-pair binomial over relation absent / present; multinomial distribution over groups; time-stamp; per group beta over time; plate G.]
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Latent Dirichlet Allocation
[Blei, Ng, Jordan, 2003]
[Figure: plate diagram. α, θ (topic distribution), z (topic), w (word), φ (per-topic multinomial over words), β; plates N, n, T.]

LDA 100, “motion” (+ some generic):
motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

LDA 20, “images, motion, eyes”:
visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location
Pachinko Machine
Pachinko Allocation Model (PAM)
[Li, McCallum, 2006]
[Figure: model structure, not the graphical model. A DAG with root α11; interior nodes α21, α22; α31, α32, α33; α41, α42, α43, α44, α45; leaves word1 ... word8.]
Model structure: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children and words at leaves
For each document: Sample a multinomial from each Dirichlet
For each word in this document: Starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf.
Like a Polya tree, but DAG shaped, with arbitrary number of children.
Thanks to Michael Jordan for suggesting the name
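The sampling procedure described on this slide (one multinomial per Dirichlet per document, then a root-to-leaf walk per word) can be sketched as follows. The toy DAG, its node names, and the uniform Dirichlet parameters are hypothetical; they only illustrate the mechanics and are much smaller than the structure drawn on the slide.

```python
import numpy as np

# Hypothetical toy DAG: each interior node lists its children; anything not a key is a leaf word.
children = {
    "root": ["s1", "s2"],
    "s1": ["t1", "t2"],
    "s2": ["t2", "t3"],            # t2 has two parents, so this is a DAG rather than a tree
    "t1": ["word1", "word2"],
    "t2": ["word3", "word4"],
    "t3": ["word5", "word6"],
}

def generate_pam_document(children, alpha, n_words, rng):
    """Sample one document by repeated root-to-leaf walks through the DAG."""
    # For each document: sample a multinomial from each interior node's Dirichlet.
    multinomials = {node: rng.dirichlet(alpha[node]) for node in children}
    words = []
    for _ in range(n_words):
        node = "root"
        while node in children:    # keep descending until a leaf (a word) is reached
            i = int(rng.choice(len(children[node]), p=multinomials[node]))
            node = children[node][i]
        words.append(node)
    return words

rng = np.random.default_rng(3)
alpha = {node: np.ones(len(kids)) for node, kids in children.items()}
doc = generate_pam_document(children, alpha, 10, rng)
print(doc)
```

Interior nodes near the leaves behave like LDA topics, while nodes higher up mix over topics, which is where topic correlations are represented.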
Pachinko Allocation Model
[Li, McCallum, 2006]
[Figure: model structure, not the graphical model; same DAG as on the previous slide, annotated by level.]
Leaves' parents: distributions over words (like “LDA topics”)
Next level up: distributions over topics; mixtures, representing topic correlations
Higher levels: distributions over distributions over topics...
Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).
Pachinko Allocation Model
[Li, McCallum, 2006]
[Figure: model structure, not the graphical model; same DAG as on the previous slides.]
Estimate all these Dirichlets from data.
Estimate model structure from data (number of nodes, and connectivity).
Pachinko Allocation Special Cases: Latent Dirichlet Allocation
[Figure: a root α11 over a single layer of topics α21, α22, α23, α24, α25, over leaves word1 ... word8.]
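Inference – Gibbs Sampling
[Figure: plate diagram of the four-level model with α2, θ2, z2, α3, θ3, z3, w, φ, β and plates N, n, T’, T; z2 and z3 are jointly sampled.]

$$
P(z_{w2}=t_k,\; z_{w3}=t_p \mid D, \mathbf{z}_{-w}, \alpha, \beta) \;\propto\;
\underbrace{\frac{n_{1k}^{(d)}+\alpha_{1k}}{n_{1}^{(d)}+\sum_{k'}\alpha_{1k'}}}_{P(t_k)}
\times
\underbrace{\frac{n_{kp}^{(d)}+\alpha_{kp}}{n_{k}^{(d)}+\sum_{p'}\alpha_{kp'}}}_{P(t_p \mid t_k)}
\times
\underbrace{\frac{n_{pw}+\beta_{w}}{n_{p}+\sum_{m}\beta_{m}}}_{P(w \mid t_p)}
$$

Dirichlet parameters α are estimated with moment matching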
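The joint sampling distribution for z2 and z3 is a product of three smoothed count ratios, labelled P(t_k), P(t_p | t_k) and P(w | t_p) on the slide. A minimal sketch of that computation is given below, assuming the counts already exclude the token being resampled. The argument names and toy dimensions are assumptions made for the example.

```python
import numpy as np

def joint_topic_conditional(n_d_1k, n_d_kp, n_pw, n_p, alpha_1, alpha_kp, beta_w, beta_sum):
    """Normalized distribution over (super-topic k, sub-topic p) for one token.

    n_d_1k[k]     : times super-topic k was chosen in this document
    n_d_kp[k, p]  : times sub-topic p was chosen under super-topic k in this document
    n_pw[p], n_p[p]: corpus counts of this word in sub-topic p, and of all words in p
    """
    p_super = (n_d_1k + alpha_1) / (n_d_1k.sum() + alpha_1.sum())              # P(t_k)
    p_sub = (n_d_kp + alpha_kp) / (n_d_kp.sum(axis=1, keepdims=True)
                                   + alpha_kp.sum(axis=1, keepdims=True))      # P(t_p | t_k)
    p_word = (n_pw + beta_w) / (n_p + beta_sum)                                # P(w | t_p)
    joint = p_super[:, None] * p_sub * p_word[None, :]
    return joint / joint.sum()

# Toy case: 2 super-topics, 3 sub-topics, no counts yet, symmetric priors.
probs = joint_topic_conditional(
    n_d_1k=np.zeros(2), n_d_kp=np.zeros((2, 3)), n_pw=np.zeros(3), n_p=np.zeros(3),
    alpha_1=np.ones(2), alpha_kp=np.ones((2, 3)), beta_w=0.1, beta_sum=1.0)
print(probs)
```

With no counts and symmetric priors every (k, p) pair is equally likely; as counts accumulate, the three factors pull the sample toward super-topics and sub-topics already used in the document and toward sub-topics that already favour the word.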
Example Topics

LDA 100, “motion” (some generic):
motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

LDA 20, “images, motion, eyes”:
visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location

PAM 100, “motion”:
motion, video, surface, surfaces, figure, scene, camera, noisy, sequence, activation, generated, analytical, pixels, measurements, assigne, advance, lated, shown, closed, perceptual

PAM 100, “eyes”:
eye, head, vor, vestibulo, oculomotor, vestibular, vary, reflex, vi, pan, rapid, semicircular, canals, responds, streams, cholinergic, rotation, topographically, detectors, ning

PAM 100, “images”:
image, digit, faces, pixel, surface, interpolation, scene, people, viewing, neighboring, sensors, patches, manifold, dataset, magnitude, transparency, rich, dynamical, amounts, tor
Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked to “select the topic in each pair that you find more semantically coherent.”

Topic counts    LDA   PAM
5 votes          0     5
>= 4 votes       3     8
>= 3 votes       9    16
Examples
PAM (5 votes)   LDA (0 votes)
control         control
systems         systems
robot           based
adaptive        adaptive
environment     direct
goal            con
state           controller
controller      change
Examples
PAM (4 votes)   LDA (1 vote)
motion          image
image           motion
detection       images
images          multiple
scene           local
vision          generated
texture         noisy
segmentation    optical
Examples
PAM (4 votes)   LDA (1 vote)
algorithm       algorithm
learning        algorithms
algorithms      gradient
gradient        convergence
convergence     stochastic
function        line
stochastic      descent
weight          converge

PAM (1 vote)    LDA (4 votes)
signals         signal
source          signals
separation      single
eeg             time
sources         low
blind           source
single          temporal
event           processing
Topic Correlations
Likelihood Comparison
• Varying number of topics
PAM supports ~5x more topics than LDA
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Topic Interpretability
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
Topical N-gram Model
[Wang, McCallum 2005]
See also: [Steyvers, Griffiths, Newman, Smyth 2005]
[Figure: graphical model over a word sequence. At each position i = 1 ... 4: z_i = topic, y_i = uni- / bi-gram status, w_i = words; plates D and T; hyperparameters include α, β, γ1, γ2.]
Features of Topical N-Grams model
• Easily trained by Gibbs sampling
  – Can run efficiently on millions of words
• Topic-specific phrase discovery
  – “white house” has special meaning as a phrase in the politics topic,
  – ... but not in the real estate topic.
Topic Comparison

LDA:
learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2):
reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1):
policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
Topic Comparison

LDA:
word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models

Topical N-grams (2):
speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent

Topical N-grams (1):
speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Previous Systems
Research Paper, Cites

More Entities and Relations
Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise
Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”
Topical Bibliometric Impact Measures
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
[Mann, Mimno, McCallum, 2006]
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic Cit’s Paper Title
Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision 14 On being ‘Undigital’ with digital cameras: extending the dynamic...
Video 12 Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs 12 Trawling the Web for Emerging Cyber-Communities
Web Pages 11 WebBase: a repository of Web pages
Topical Diversity
Papers that had the most influence across many other fields...
Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic).
High Diversity
Low Diversity
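Topical Diversity is defined above as an entropy, so a short worked computation may make the high/low contrast concrete. The sketch assumes each citing paper has been assigned a single topic label and measures entropy in bits; both choices, and the function name `topical_diversity`, are simplifying assumptions rather than details given in the slides.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Shannon entropy (in bits) of the topic labels of the papers citing one paper."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical citing-paper topics for two cited papers.
high = topical_diversity(["Web Pages", "Computer Vision", "Video", "Graphs"])
low = topical_diversity(["Web Pages", "Web Pages", "Web Pages", "Web Pages"])
print(high, low)
```

A paper cited evenly from four topics scores 2 bits, while a paper cited from a single topic scores 0, matching the intuition of influence "across many other fields".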
Topical Bibliometric Impact Measures
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
[Mann, Mimno, McCallum, 2006]
Topical Precedence
Within a topic, what are the earliest papers that received more than n citations? (“Early-ness”)

Speech Recognition:
Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
Spectrographic study of vowel reduction, B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
Topical Precedence
Within a topic, what are the earliest papers that received more than n citations? (“Early-ness”)

Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
Topical Transfer Through Time
• Can we predict which research topics will be “hot” at the ICML conference next year?
• ...based on
  – the hot topics in “neighboring” venues last year
  – learned “neighborhood” distances for venue pairs
How do Ideas Progress Through Social Networks?
Hypothetical Example: “ADA Boost”
Venues: COLT, ICML, ACL (NLP), ICCV (Vision), SIGIR (Info. Retrieval)
Topic Prediction Models
Static Model
Transfer Model
Linear Regression and Ridge Regression used for coefficient training.
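The slide names ridge regression for coefficient training but the model equations did not survive in the transcript. The sketch below shows only the ridge step, under one plausible reading of the transfer model: next year's topic proportion at a target venue is predicted as a weighted sum of last year's proportions at neighboring venues. The synthetic data, the weights in `w_true`, and the function name `fit_ridge` are all assumptions for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(4)
# Rows: (topic, year) examples; columns: topic proportion at 3 hypothetical neighboring venues last year.
X = rng.random((20, 3))
w_true = np.array([0.5, 0.3, 0.2])     # hypothetical venue-to-venue transfer weights
y = X @ w_true                          # topic proportion at the target venue this year

w_ols = fit_ridge(X, y, lam=0.0)        # lam = 0 reduces to ordinary linear regression
w_ridge = fit_ridge(X, y, lam=1.0)      # lam > 0 shrinks the coefficients
print(w_ols, w_ridge)
```

The penalty matters when many venues are used as predictors and few years of data are available, since unpenalized coefficients can then fit noise.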
Preliminary Results
[Figure: Mean Squared Prediction Error (smaller is better) versus # Venues used for prediction; one curve is labelled Transfer Model.]
Transfer Model with Ridge Regression is a good Predictor
Toward More Detailed, Structured Data
Leveraging Text in Social Network Analysis
[Figure: processing pipeline with these stages]
- Document collection
- IE (Segment, Classify, Associate, Cluster): extract structured data about entities, relations, events
- Database
- Data Mining: discover patterns (entity types, links / relations, events)
- Actionable knowledge: Prediction, Outlier detection, Decision support
Structured Topic Models
[Figure: processing pipeline with these stages]
- Spider
- Filter
- Document collection
- IE (Segment, Classify, Associate, Cluster)
- Database
- Data Mining: discover patterns (entity types, links / relations, events)
- Actionable knowledge: Prediction, Outlier detection, Decision support
Exchanged between IE and Data Mining: Uncertainty Info, Emerging Patterns
Joint Inference
[Figure: the same pipeline, with the Database replaced by a Probabilistic Model]
- Spider
- Filter
- Document collection
- IE (Segment, Classify, Associate, Cluster)
- Probabilistic Model
- Data Mining: discover patterns (entity types, links / relations, events)
- Actionable knowledge: Prediction, Outlier detection, Decision support
Solution: Unified Model
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Discriminatively-trained undirected graphical models
Complex Inference and Learning
Just what we researchers like to sink our teeth into!
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence
[Figure: finite state model and graphical model. FSM states y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... above observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, ...]
output seq: OTHER PERSON OTHER ORG TITLE …
input seq:  said Jones a Microsoft VP …

$$
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \prod_{t} \Phi(y_t, y_{t-1}, \mathbf{x}, t)
\quad\text{where}\quad
\Phi(y_t, y_{t-1}, \mathbf{x}, t) = \exp\Big( \sum_{k} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big)
$$

Wide-spread interest, positive experimental results in many applications.
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04], …
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]
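To make the CRF definition concrete, the sketch below evaluates p(y | x) for a tiny chain. It assumes the log-potentials log Φ(y_t, y_{t-1}, x, t), i.e. the weighted feature sums Σ_k λ_k f_k, have already been computed and stored in an array `log_phi[t, y_prev, y]`; the array layout, the use of label 0 as a dummy start state, and the random values are illustrative assumptions. The normalizer Z_x is computed with the forward algorithm.

```python
import itertools
import numpy as np

def sequence_score(log_phi, y):
    """Sum of log-potentials along label sequence y (label 0 acts as a dummy start state)."""
    prev, total = 0, 0.0
    for t, label in enumerate(y):
        total += log_phi[t, prev, label]
        prev = label
    return total

def log_partition(log_phi):
    """log Z_x computed with the forward algorithm in log space."""
    a = log_phi[0, 0]                                    # scores of each label at position 0
    for t in range(1, log_phi.shape[0]):
        a = np.logaddexp.reduce(a[:, None] + log_phi[t], axis=0)
    return np.logaddexp.reduce(a)

def conditional_prob(log_phi, y):
    """p(y | x) = exp(score(y)) / Z_x."""
    return float(np.exp(sequence_score(log_phi, y) - log_partition(log_phi)))

rng = np.random.default_rng(5)
n_positions, n_labels = 3, 2
log_phi = rng.normal(size=(n_positions, n_labels, n_labels))   # hypothetical log-potentials
all_sequences = list(itertools.product(range(n_labels), repeat=n_positions))
total = sum(conditional_prob(log_phi, y) for y in all_sequences)
print(total)   # the probabilities of all label sequences should sum to 1
```

The forward recursion costs time linear in sequence length, whereas the brute-force sum used here only as a check grows exponentially.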
Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.

             Milk Cows and Production of Milk and Milkfat:
                        United States, 1993-95
--------------------------------------------------------------------------------
      :            :          Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :      Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk  :  Milkfat : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---       Percent       Million Pounds
      :
 1993 :    9,589     15,704     575          3.66        150,582   5,514.4
 1994 :    9,500     16,175     592          3.66        153,664   5,623.7
 1995 :    9,461     16,451     602          3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov

Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
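The per-line features listed for table extraction are simple to compute. The sketch below covers four of them; the dictionary keys, the function name `line_features`, and the exact definitions (for example, counting a leading space or tab as "indented") are assumptions, and the alignment and conjunction features are omitted.

```python
import re

def line_features(line):
    """Compute a few layout features for one line of a plain-text report."""
    n = max(len(line), 1)                                        # avoid division by zero on empty lines
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,         # percentage of digit chars
        "pct_alpha": sum(c.isalpha() for c in line) / n,         # percentage of alpha chars
        "indented": line.startswith((" ", "\t")),                # line begins with whitespace
        "has_5_spaces": re.search(r" {5,}", line) is not None,   # contains 5+ consecutive spaces
    }

data_row = " 1993 :    9,589     15,704     575"
prose = "Cash receipts from marketings of milk during 1995"
print(line_features(data_row))
print(line_features(prose))
```

Data rows tend to be digit-heavy with long runs of spaces, while prose lines are alpha-heavy, which is the kind of signal a line-labelling CRF can combine across neighboring lines.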
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Line labels, percent correct
HMM                65 %
Stateless MaxEnt   85 %
CRF                95 %
IE from Research Papers
[McCallum et al ‘99]

IE from Research Papers: Field-level F1
Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
40% reduction in error (SVMs to CRFs)
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels:   Examples:
PER       Yayuk Basuki, Innocent Butare
ORG       3M, KDP, Cleveland
LOC       Cleveland, Nirmal Hriday, The Oval
MISC      Java, Basque, 1,000 Lakes Rally
Named Entity Extraction Results
Method F1
HMMs BBN's Identifinder 73%
CRFs 90%
[McCallum & Li, 2003, CoNLL]
MALLET: Machine Learning for LanguagE Toolkit
• ~80k lines of Java
• Based on experience with previous toolkits
  – Rainbow, WhizBang, GATE, Weka.
• Document classification, information extraction, clustering, co-reference, cross-document co-reference, POS tagging, shallow parsing, relational classification, sequence alignment, structured topic models, social network analysis with text.
• Infrastructure for pipelining feature extraction and processing steps.
• Many ML basics in common, convenient framework:
  – naïve Bayes, MaxEnt, Boosting, SVMs; Dirichlets, Conjugate Gradient
• Advanced ML algorithms:
  – Conditional Random Fields, BFGS, Expectation Propagation, …
• Unlike other general toolkits (e.g. Weka) MALLET scales to millions of features, millions of training examples, as needed for NLP.
• Now being used in many universities & companies all over the world:
  – MIT, CMU, UPenn, Berkeley, UTexas, Purdue, Oregon State, UWash, UMass, Google, Yahoo, BAE.
  – Also in UK, Germany, France.
Semi-Supervised Learning
• Labeled data is expensive
  – Especially for sequence modeling tasks
  – POS tagging, word segmentation, NER
• Unlabeled data is abundant
  – The Web
  – Newswire
  – Other internal reports
  – etc.
HMM-LDA Model
• Distinguish between semantic words and syntactic words
[Griffiths, et al. 2004]
Experiments
• Dataset
  – Wall Street Journal (WSJ) collection labeled with part-of-speech tags. The corpus contains 2312 documents, 38665 unique words and 1.2M word tokens.
• 50 topics and 40 syntactic classes
• Gibbs sampling
  – 40 samples with a lag of 100 iterations between them and an initial burn-in period of 4000 iterations.
Sample Syntactic Clusters
make     0.0279   of    0.7448   way        0.0172   last     0.0767
sell     0.0210   in    0.0828   agreement  0.0140   first    0.0740
buy      0.0174   for   0.0355   price      0.0136   next     0.0479
take     0.0164   from  0.0239   time       0.0121   york     0.0433
get      0.0157   and   0.0238   bid        0.0103   third    0.0424
do       0.0155   to    0.0185   effort     0.0100   past     0.0368
pay      0.0152   ;     0.0096   position   0.0098   this     0.0361
go       0.0113   with  0.0073   meeting    0.0098   dow      0.0295
give     0.0104   that  0.0055   offer      0.0093   federal  0.0288
provide  0.0086   or    0.0039   day        0.0092   fiscal   0.0262

Table 1: Sample syntactic word clusters. Each column displays the top 10 words in one cluster and their probabilities.
Sample Semantic Clusters
bank        0.0918   computer    0.0610   jaguar   0.0824   ad           0.0314
loans       0.0327   computers   0.0301   ford     0.0641   advertising  0.0298
banks       0.0291   ibm         0.0280   gm       0.0353   agency       0.0268
loan        0.0289   data        0.0200   shares   0.0249   brand        0.0181
thrift      0.0264   machines    0.0191   auto     0.0172   ads          0.0177
assets      0.0235   technology  0.0182   express  0.0144   saatchi      0.0162
savings     0.0220   software    0.0176   maker    0.0136   brands       0.0142
federal     0.0179   digital     0.0173   car      0.0134   account      0.0120
regulators  0.0146   systems     0.0169   share    0.0128   industry     0.0106
debt        0.0142   business    0.0151   saab     0.0116   clients      0.0105

Table 2: Sample semantic word clusters. Each column displays the top 10 words in one cluster and their probabilities.
POS Tagging
• Features
  – Word unigrams and bigrams
  – Spelling features
  – Word suffixes
  – Cluster features
    • HMM-LDA: the most likely class assignment for each word over all the samples
    • HC: bit string prefixes of lengths 8, 12, 16 and 20
• CRFs
Evaluation Results

(a) 10k labeled words, OOV rate = 24.46%
Error(%)   No Clusters   Hierarchical    HMM-LDA
Overall    10.04         9.46 (5.78)     8.56 (14.74)
OOV        22.32         21.56 (3.40)    18.49 (17.16)

(b) 30k labeled words, OOV rate = 15.31%
Error(%)   No Clusters   Hierarchical    HMM-LDA
Overall    6.08          5.85 (3.78)     5.40 (11.18)
OOV        17.34         17.35 (-0.00)   15.01 (13.44)

(c) 50k labeled words, OOV rate = 12.49%
Error(%)   No Clusters   Hierarchical    HMM-LDA
Overall    5.34          5.12 (4.12)     4.78 (10.30)
OOV        16.36         16.21 (0.92)    14.45 (11.67)

18% reduction in error
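The parenthesized numbers in these tables appear to be relative error reductions with respect to the "No Clusters" baseline. That reading is an inference from the arithmetic rather than something the slide states; the short check below reproduces the setting (a) "Overall" entries under that assumption. The function name is hypothetical.

```python
def relative_error_reduction(baseline_error, new_error):
    """Percentage by which new_error improves on baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Setting (a), "Overall" row: baseline 10.04, Hierarchical 9.46, HMM-LDA 8.56.
hierarchical = round(relative_error_reduction(10.04, 9.46), 2)
hmm_lda = round(relative_error_reduction(10.04, 8.56), 2)
print(hierarchical, hmm_lda)
```

Reporting relative rather than absolute reduction makes results comparable across the three training-set sizes, where the baseline error itself differs.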
Desired Future Work
• Add more “structured data types” to topic models.
• Leverage Pachinko Allocation to learn topic hierarchies and topic correlations in time.
• New type of topic model
  – fast enough to work on streaming data
  – more naturally combines many data modalities (add more “structured data types” together)
  – topics defined by both positive and negative features
• Use structured topic models to help predict influence and impact.
• Extremely low-supervision training of information extractors. Discover interesting entity/relation classes.
End of Talk