Graphical models for structure extraction and information integration
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita


Information Extraction (IE) & Integration

The Extraction task: Given
– E: a set of structured elements
– S: an unstructured source
extract all instances of E from S.

• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications

The Integration task: Also given
– a database of existing inter-linked entities,
resolve which extracted entities already exist, and insert appropriate links and entities.

• Classical Named Entity Recognition
 – Extract person, location, and organization names

“According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms.”

IE from free-format text

Several applications:
– News tracking: monitor events
– Bio-informatics: protein and gene names from publications
– Customer care: part numbers and problem descriptions from emails in help centers

Text segmentation

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
→ Author | Year | Title | Journal | Volume | Page

4089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122
→ House number | Building | Road | City | State | Zip

Information Extraction on the web

Personal Information Systems:
– Automatically add a BibTeX entry for a paper I download
– Integrate a resume arriving in email with the candidates database

[Diagram: Email, Web, and Files sources feeding linked stores of People, Papers, Projects, Emails, and Resumes]

History of approaches

• Manually-developed sets of scripts
 – Tedious; lots and lots of special cases
 – Need continuous refinement as new cases arise
 – Ad hoc ways of combining a varied set of clues
 – Example: wrappers; OK for regular tasks
• Learning-based approaches (lots!)
 – Rule-based (Whisk, Rapier, etc.), 1980s: brittle
 – Statistical
  • Generative: HMMs, 1990s: intuitive but not too flexible
  • Conditional (flexible feature set), 2000s

Basic chain model for extraction

t:         1  | 2      | 3  | 4        | 5    | 6       | 7     | 8      | 9
x:         My | review | of | Fermat's | last | theorem | by    | S.     | Singh
y (y1…y9): Other | Other | Other | Title | Title | Title | Other | Author | Author

Independent model: each label yi is predicted separately from x.

Features

• The word as-is
• Orthographic word properties
 – Capitalized? Digit? Ends-with-dot?
• Part of speech
 – Noun?
• Match in a dictionary
 – Appears in a dictionary of people names?
 – Appears in a list of stop-words?
• Fire these for each label, and for
 – the token itself,
 – W tokens to the left or right, or
 – a concatenation of tokens
(see the sketch below)
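A minimal sketch of how such features might be fired for one token; the dictionaries and feature names here are illustrative stand-ins, not the talk's actual feature set:

```python
# Illustrative token-level feature firing for a chain model.
# PEOPLE_NAMES and STOP_WORDS are toy stand-in dictionaries.
PEOPLE_NAMES = {"singh", "fagin", "ullman"}
STOP_WORDS = {"of", "by", "the", "a"}

def token_features(tokens, i, window=1):
    """Fire orthographic and dictionary features for token i and for its
    neighbors up to `window` positions away. In a CRF, each of these is
    further conjoined with the candidate label(s)."""
    feats = []
    for d in range(-window, window + 1):
        j = i + d
        if not 0 <= j < len(tokens):
            continue
        w = tokens[j]
        feats.append(f"word@{d}={w.lower()}")       # the word as-is
        feats.append(f"cap@{d}={w[0].isupper()}")   # capitalized?
        feats.append(f"digit@{d}={w.isdigit()}")    # digit?
        feats.append(f"dot@{d}={w.endswith('.')}")  # ends with dot?
        feats.append(f"person@{d}={w.lower() in PEOPLE_NAMES}")
        feats.append(f"stop@{d}={w.lower() in STOP_WORDS}")
    return feats

tokens = "My review of Fermat's last theorem by S. Singh".split()
print(token_features(tokens, 8))  # features around the token "Singh"
```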

Basic chain model for extraction

t:         1  | 2      | 3  | 4        | 5    | 6       | 7     | 8      | 9
x:         My | review | of | Fermat's | last | theorem | by    | S.     | Singh
y (y1…y9): Other | Other | Other | Title | Title | Title | Other | Author | Author

Global conditional model over Pr(y1, y2, …, y9 | x).

Outline

Graphical models for Extraction:
– Chain models: basic extraction (word-level)
– Associative Markov networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)

Graphical models for Extraction + Integration:
– Segmentation models: matching with entity databases
– Constrained models: integrating to multiple tables

Undirected Graphical models

• Joint probability distribution of multiple variables expressed compactly as a graph

[Graph over y1 … y5; y3's neighbors are y2 and y4]

• Discrete variables over a finite set of labels, e.g. {Author, Title, Other}
• y3 is directly dependent on y4
• y3 is independent of y1 and y5 given y2 and y4

The joint probability distribution

Pr(y1, …, y5) = (1/Z) ∏_c ψ_c(y_c), the product running over the cliques c of the graph

– ψ_c: potential function over the variables y_c of clique c
– Z: normalizing constant (partition function)

Conditional Random Fields (CRFs)   (Lafferty et al., ICML 2001)

Model the probability of a set of labels given the observed variables x:

Pr(y | x) = (1/Z(x)) ∏_c ψ_c(y_c, x)

Form of the potentials:

ψ_c(y_c, x) = exp( Σ_k λ_k f_k(y_c, x) )

– f_k: numeric feature computed from the labels y_c and the observation x
– λ_k: model parameter

(A toy numeric illustration follows.)
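A toy illustration (mine, not the talk's) of the equations above: potentials are exponentiated weighted feature sums over chain cliques, and Pr(y|x) is obtained by normalizing, here by brute force purely to make the definition concrete. All feature and weight names are invented:

```python
# Brute-force Pr(y|x) for a tiny chain CRF, directly from the definitions.
from itertools import product
from math import exp

LABELS = ["Other", "Title", "Author"]

def clique_features(y_prev, y_cur, x, i):
    """f_k(y_{i-1}, y_i, x, i): indicator features for one chain clique."""
    return {
        f"word={x[i].lower()}|label={y_cur}": 1.0,
        f"cap={x[i][0].isupper()}|label={y_cur}": 1.0,
        f"trans={y_prev}->{y_cur}": 1.0,
    }

def score(y, x, w):
    """Sum over positions of sum_k lambda_k * f_k (log of the potential product)."""
    s = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        for k, v in clique_features(y_prev, y[i], x, i).items():
            s += w.get(k, 0.0) * v
    return s

def prob(y, x, w):
    """Pr(y|x) = exp(score(y, x)) / Z(x), with Z(x) summed by brute force."""
    Z = sum(exp(score(yp, x, w)) for yp in product(LABELS, repeat=len(x)))
    return exp(score(y, x, w)) / Z

x = ["S.", "Singh"]
w = {"cap=True|label=Author": 2.0, "trans=Author->Author": 1.0}
print(prob(("Author", "Author"), x, w))  # the dominant labeling
```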

Inference on graphical models

• Probability of an assignment of variables
• Most likely assignment of variables
• Marginal probability of a subset of variables

Done naively, each of these requires summing or maximizing over exponentially many terms.

Message passing

• Efficient two-pass dynamic programming algorithm for graphs without cycles
 – Viterbi is a special case for chains (sketched below)
• Cyclic graphs
 – Approximate answer after convergence, or
 – Transform cliques to nodes in a junction tree
• Alternatives to message passing
 – Exploit the structure of the potentials to design special algorithms (two examples in this talk)
 – Upper bound using one or more trees
 – MCMC sampling
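A minimal Viterbi sketch for chains, assuming a generic `log_potential(i, y_prev, y)` interface (my notation, not from the talk):

```python
# Viterbi: most likely label sequence on a chain via two-pass dynamic
# programming (forward max + backtrack).
def viterbi(n, labels, log_potential):
    # best[i][y] = best log-score of a labeling of positions 0..i ending in y
    best = [{y: log_potential(0, None, y) for y in labels}]
    back = []
    for i in range(1, n):
        scores, ptrs = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[-1][yp] + log_potential(i, yp, y))
            scores[y] = best[-1][prev] + log_potential(i, prev, y)
            ptrs[y] = prev
        best.append(scores)
        back.append(ptrs)
    y = max(labels, key=lambda yl: best[-1][yl])  # best final label
    path = [y]
    for ptrs in reversed(back):                   # backtrack
        y = ptrs[y]
        path.append(y)
    return list(reversed(path))

# Toy usage: potentials that favor keeping the same label.
print(viterbi(3, ["Other", "Title", "Author"],
              lambda i, yp, y: 1.0 if yp == y else 0.0))
```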

Outline

Graphical models for Extraction:
– Chain models: basic extraction (word-level)
– Associative Markov networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)

Graphical models for Extraction + Integration:
– Segmentation models: matching with entity databases
– Constrained models: integrating to multiple tables

Long range dependencies

• Extraction with repeated names (Bunescu et al., 2004)
• Assume only word-level matches

[Dependency graph: chain y1 … y8 with extra edges joining repeated words, e.g. “nitric oxide synthase eNos … with … synthase interaction eNOS”]

• Approximate message passing
• Sample results (Bunescu et al., ACL 2004)
 – Protein names from Medline abstracts: F1 65% → 68%
 – Person names, organization names, etc. from news articles: F1 80% → 82%

Associative Markov Networks

• Consider a simpler graph
 – Binary labels
 – Only associative edges: higher potential when the same label is given to both endpoints

[Graph: chain y1 … y8 with additional associative edges]

• Exact inference in polynomial time via mincut (Greig et al., 1989), as sketched below
• Multi-class, metric labeling: approximation algorithm with guarantees (Kleinberg & Tardos, 1999)
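A sketch of the Greig et al. (1989) reduction for the binary associative case, using networkx min-cut. The unary/pairwise energies and all names here are illustrative assumptions, not the talk's implementation:

```python
# Exact MAP for a binary associative Markov network via s-t min-cut.
# Energies are negative log-potentials: lower is better.
import networkx as nx

def mincut_map(unary, pairwise):
    """unary[i] = (cost of label 0, cost of label 1) for node i.
    pairwise[(i, j)] = w >= 0: penalty if i and j take different labels
    (associative: agreement is never penalized)."""
    G = nx.DiGraph()
    for i, (cost0, cost1) in unary.items():
        G.add_edge("s", i, capacity=cost1)  # cut iff y_i = 1
        G.add_edge(i, "t", capacity=cost0)  # cut iff y_i = 0
    for (i, j), w in pairwise.items():
        G.add_edge(i, j, capacity=w)        # cut iff y_i = 0, y_j = 1
        G.add_edge(j, i, capacity=w)        # cut iff y_i = 1, y_j = 0
    _, (source_side, _) = nx.minimum_cut(G, "s", "t")
    return {i: (0 if i in source_side else 1) for i in unary}

# Toy chain y1 - y2 - y3: evidence pulls y1 toward 1 and y3 toward 0;
# the associative edges decide the ambiguous middle node.
unary = {1: (3.0, 0.5), 2: (1.2, 1.0), 3: (0.5, 3.0)}
pairwise = {(1, 2): 1.0, (2, 3): 1.0}
print(mincut_map(unary, pairwise))  # -> {1: 1, 2: 1, 3: 0}
```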

Factorial CRFs: multiple linked chains

• Several synchronized, inter-dependent tasks
 – POS tagging, noun-phrase chunking, entity extraction
• Cascading propagates errors
• Joint models avoid this

[Two coupled chains over “i saw mr. ray canning at the market”: a POS chain w1 … w8 and an IE chain y1 … y8, with links between corresponding nodes]

Inference with multiple chains

• The graph has cycles; exact inference is in general intractable
• Two alternatives
 – Approximate message passing (belief propagation)
 – Upper bound marginals (piecewise training): treat each edge potential as an independent training instance
• Results (F1) on noun phrase + POS, combined vs. staged models
 – Piecewise training: 88%, and faster
 – Belief propagation: 86%

(Sutton et al., ICML 2004; McCallum et al., EMNLP/HLT 2005)

Outline

Graphical models for Extraction:
– Chain models: basic extraction (word-level)
– Associative Markov networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)

Graphical models for Extraction + Integration:
– Segmentation models: matching with entity databases
– Constrained models: integrating to multiple tables

Conventional Extraction Research

[Pipeline: labeled unstructured text → training → model; the model maps unstructured texts 1, 2, 3 to entities, each extracted in isolation]

Data integration

[Pipeline: labeled unstructured text plus a linked entity database → training → model; the model maps unstructured texts 1, 2, 3 to entities integrated with the existing data]

Goals of integration

• Exploit the database to improve extraction
 – The entity might already exist in the database
• Integrate extracted entities: resolve whether each entity is already in the database
 – If existing, create links
 – If not, create a new entry

Database: normalized, stores noisy variants; variant rows link to canonical entries. Three top-level entities (Articles, Journals, Authors) plus a Writes link table.

Articles:
Id | Title            | Year | Journal | Canonical
2  | Update Semantics | 1983 | 10      |

Journals:
Id | Name                 | Canonical
10 | ACM TODS             |
17 | AI                   | 17
16 | ACM Trans. Databases | 10

Writes:
Article | Author
2       | 11
2       | 2
2       | 3

Authors:
Id | Name           | Canonical
11 | M Y Vardi      |
2  | J. Ullman      | 4
3  | Ron Fagin      | 3
4  | Jeffrey Ullman | 4

R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see

t: 1  | 2     | 3   | 4  | 5       | 6      | 7         | 8
x: R. | Fagin | and | J. | Helpern | Belief | Awareness | Reasoning
y: Author | Author | Other | Author | Author | Title | Title | Title

Features describe the single word “Fagin”.


Segmentation models (Semi-CRFs)

x: R. Fagin       | and       | J. Helpern    | Belief Awareness Reasoning
   (l1=1, u1=2)   | (l2=u2=3) | (l3=4, u3=5)  | (l4=6, u4=8)
y: Author         | Other     | Author        | Title

Features describe the whole segment from l to u, e.g. the similarity of the segment to the authors column in the database (see the sketch below).
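As referenced above, a sketch of a segment-level feature that no word-level model can express. The Jaccard similarity and the toy authors column are my illustrative choices:

```python
# A segment-level feature: similarity of a whole candidate segment to the
# Authors column of the database.
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

AUTHORS_COLUMN = ["Ron Fagin", "Jeffrey Ullman", "M Y Vardi"]  # toy DB column

def segment_features(tokens, l, u, label):
    """Features for the segment tokens[l..u] (inclusive, 0-based) under a
    candidate label; fires a database-similarity feature for Author segments."""
    seg = " ".join(tokens[l:u + 1])
    feats = {f"length={u - l + 1}|label={label}": 1.0}
    if label == "Author":
        feats["max-sim-to-authors"] = max(jaccard(seg, a) for a in AUTHORS_COLUMN)
    return feats

tokens = "R. Fagin and J. Helpern Belief Awareness Reasoning".split()
print(segment_features(tokens, 0, 1, "Author"))  # the segment "R. Fagin"
```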

Graphical models for segmentation

• The graph has many cycles
 – clique size = maximum segment length
• Two kinds of potentials
 – Transition potentials: only across adjacent nodes
 – Segment potentials: require all positions in a segment to take the same label
• Exact inference is still possible in time linear in sequence length × maximum segment length (Cohen & Sarawagi, 2004); see the recurrence below

[Graph: y1 … y8 with edges spanning up to the maximum segment length]
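One way to see the claimed running time is the semi-Markov Viterbi recurrence (my notation): let $V(i, y)$ be the best score of any segmentation of $x_1 \dots x_i$ whose last segment ends at $i$ with label $y$, and $L$ the maximum segment length. Then

$$V(i, y) = \max_{1 \le d \le L} \; \max_{y'} \Big[ V(i-d,\, y') + \psi\big((i-d+1,\, i),\, y,\, y',\, x\big) \Big]$$

Filling the $n\,|\mathcal{Y}|$ table entries costs $O(n\,L\,|\mathcal{Y}|^2)$, i.e. linear in sequence length times maximum segment length.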

Effect of database on extraction performance

Dataset     | Field      | L    | L+DB | %Δ
PersonalBib | author     | 75.7 | 79.5 | 4.9
PersonalBib | journal    | 33.9 | 50.3 | 48.6
PersonalBib | title      | 61.0 | 70.3 | 15.1
Address     | city_name  | 72.4 | 76.7 | 6.0
Address     | state_name | 13.9 | 33.2 | 138.5
Address     | zipcode    | 91.6 | 94.3 | 3.0

L = only labeled structured data
L + DB = adds similarity to database entities and other DB features

(from Mansuri et al., ICDE 2006)

The database after extraction + integration of the citation below (new rows: article 7, authors 8 and 9):

Articles:
Id | Title                        | Year | Journal | Canonical
2  | Update Semantics             | 1983 | 10      |
7  | Belief, awareness, reasoning | 1988 | 17      |

Journals:
Id | Name                 | Canonical
10 | ACM TODS             |
17 | AI                   | 17
16 | ACM Trans. Databases | 10

Writes:
Article | Author
2       | 11
2       | 2
2       | 3
7       | 8
7       | 9

Authors:
Id | Name           | Canonical
11 | M Y Vardi      |
2  | J. Ullman      | 4
3  | Ron Fagin      | 3
4  | Jeffrey Ullman | 4
8  | R Fagin        | 3
9  | J Helpern      | 9

R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see

Extraction → Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 1988
Integration → match with existing linked entities while respecting all constraints

Now consider a new string:

CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI

Only extraction:
Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 2000
→ Year mismatch! The database records the AI article with year 1988:

Id | Title                        | Year | Journal | Canonical
2  | Update Semantics             | 1983 | 10      |
7  | Belief, awareness, reasoning | 1988 | 17      |

Combined extraction + integration:
Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning in AI | Journal: CACM | Year: 2000

Combined extraction + matching

• Convert the predicted label to a pair y = (a, r)
 – r: the id of the matching entity in the database
 – r = 0 means none-of-the-above, i.e. a new entry

x: CACM.     | 2000      | Fagin     | Belief Awareness Reasoning In AI
   (l1=u1=1) | (l2=u2=2) | (l3=u3=3) | (l4=4, u4=8)
a: Journal   | Year      | Author    | Title
r: 0         | 7         | 3         | 7

Constraints exist on the ids that can be assigned to two segments.

Constrained models

• Training
 – Ignore the constraints, or use max-margin methods that require only MAP estimates
• Application
 – Formulate as a constrained integer programming problem (expensive), or
 – Use a general A*-style search to find the most likely constrained assignment (sketched below)
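A generic best-first sketch of the A* alternative; the scores, candidate ids, and the constraint check are made-up stand-ins, and the talk's actual formulation may differ. Because the heuristic is an upper bound on the score of the still-unassigned segments, the first complete assignment popped is the most likely constrained one:

```python
# A*-style search for the most likely constrained assignment of entity ids
# to extracted segments.
import heapq

def astar_assign(candidates, consistent):
    """candidates[i]: list of (entity_id, local_score) for segment i.
    consistent(ids): does the partial assignment satisfy all constraints?"""
    n = len(candidates)
    # best_rest[i] = upper bound on the score obtainable from segments i..n-1
    best_rest = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(s for _, s in candidates[i])
    frontier = [(-best_rest[0], 0.0, ())]  # (-(g + h), g, partial assignment)
    while frontier:
        _, g, ids = heapq.heappop(frontier)
        i = len(ids)
        if i == n:
            return ids, g                  # admissible h => first pop is optimal
        for eid, s in candidates[i]:
            new = ids + (eid,)
            if consistent(new):
                heapq.heappush(frontier, (-(g + s + best_rest[i + 1]), g + s, new))
    return None

# Toy: an author segment (id 3 or new) and a title segment (id 7 or new);
# pretend a constraint forbids pairing matched author 3 with a brand-new title.
candidates = [[(3, 2.0), (0, 1.5)], [(7, 2.5), (0, 1.0)]]
conflicting = {(3, 0)}
consistent = lambda ids: not (len(ids) == 2 and ids in conflicting)
print(astar_assign(candidates, consistent))  # -> ((3, 7), 4.5)
```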

Full integration performance

Dataset     | Field      | L    | L+DB | %Δ
PersonalBib | author     | 70.8 | 74.0 | 4.5
PersonalBib | journal    | 29.6 | 45.5 | 53.6
PersonalBib | title      | 51.6 | 65.0 | 25.9
Address     | city_name  | 70.1 | 74.6 | 6.4
Address     | state_name | 9.0  | 28.3 | 213.8
Address     | pincode    | 87.8 | 90.7 | 3.3

• L = conventional extraction + matching
• L + DB = the technology presented here
• Much higher accuracies are possible with more training data

(from Mansuri et al., ICDE 2006)

What next in data integration?

• Lots to be done in building large-scale, viable data integration systems
• Online collective inference
 – Cannot freeze the database
 – Cannot batch too many inferences
 – Need theoretically sound, practical alternatives to exact, batch inference
• Performance of integration (Chandel et al., ICDE 2006)
• Other operations
 – Data standardization
 – Schema management

Probabilistic Querying Systems

• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible
• So: create uncertainty-aware storage and querying engines. Two enablers:
 – Probabilistic database querying engines over generic uncertainty models
 – Conditional graphical models produce well-calibrated probabilities

Probabilities in CRFs are well-calibrated

Plot the probability assigned to a segmentation against the probability that it is correct: e.g. segmentations given probability 0.5 should be correct 50% of the time. (A small operational sketch follows.)

[Plots: Cora citations and Cora headers, both close to the ideal diagonal]
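A small sketch of what well-calibrated means operationally, on synthetic data (my reconstruction of the check behind such plots, not the Cora experiment itself): bin predictions by model probability and compare each bin's mean predicted probability to its empirical accuracy:

```python
# Calibration check: within each probability bin, the mean predicted
# probability should match the fraction of predictions that were correct.
import random

random.seed(0)
preds = []
for _ in range(10000):
    p = random.random()
    preds.append((p, random.random() < p))  # correct with probability p

BINS = 10
for b in range(BINS):
    lo, hi = b / BINS, (b + 1) / BINS
    in_bin = [(p, ok) for p, ok in preds if lo <= p < hi]
    if in_bin:
        mean_p = sum(p for p, _ in in_bin) / len(in_bin)
        acc = sum(ok for _, ok in in_bin) / len(in_bin)
        print(f"[{lo:.1f}, {hi:.1f})  predicted {mean_p:.2f}  correct {acc:.2f}")
```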

Uncertainty in integration systems

[Diagram: unstructured text → model → k alternative entity sets with probabilities p1, p2, …, pk, stored in a probabilistic database system; very uncertain outputs are fed back as additional training data. Open question: other, more compact models of the uncertainty?]

Example queries over the probabilistic database:
– Select the conference name of article RJ03:
  IEEE Intl. Conf. on Data Mining (0.8) | Conf. on Data Mining (0.2)
– Find the most cited author:
  D Johnson, 16000 (0.6) | J Ullman, 13000 (0.4)

In summary

• Data integration provides scope for several interesting learning problems
• Probabilistic graphical models provide a robust, unified mechanism for exploiting a wide variety of clues and dependencies
• Lots of open research challenges remain in making graphical models work in practical settings