Table Extraction using Conditional Random Fields

David Pinto, Andrew McCallum, Xing Wei, Bruce Croft
University of Massachusetts Amherst

Building on previous joint work with John Lafferty and Fernando Pereira


Documents convey meaning by…

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Stream of Words Words + Formatting & Layout


Different modalities of “Grammar”

Prepositions Formatting & Layout


Most complex use of layout: The Table


Tables have a long history

Old (circa 2700 BC) New (2003)


Simple

Total Returns -- Year Ended December 31, 2001

Money Market Fund            + 3.9%
All America Fund             -17.4%
Equity Index Fund            -12.2%
Mid-Cap Equity Index Fund    - 1.1%
Bond Fund                    + 8.7%
Short-Term Bond Fund         + 7.4%
Mid-Term Bond Fund           +10.4%
Composite Fund               -11.0%
Aggressive Equity Fund       -10.6%


Complex

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :        Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :      Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk   : Milkfat : Milk Produced :  Milk   : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head    --- Pounds ---       Percent        Million Pounds
      :
 1993 :   9,589      15,704     575         3.66        150,582   5,514.4
 1994 :   9,500      16,175     592         3.66        153,664   5,623.7
 1995 :   9,461      16,451     602         3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Source: www.FedStats.gov


Table Information Extraction

Text (input to TableIE):

Total Returns -- Year Ended December 31, 2001

Money Market Fund            + 3.9%
All America Fund             -17.4%
Equity Index Fund            -12.2%
Mid-Cap Equity Index Fund    - 1.1%
Bond Fund                    + 8.7%
Short-Term Bond Fund         + 7.4%
Mid-Term Bond Fund           +10.4%
Composite Fund               -11.0%
Aggressive Equity Fund       -10.6%

Database (TableIE output):

FUND NAME        TOTAL % RETURN 2001
Money Market          +3.9
All America          -17.4
Equity Index         -12.2
...

or

Short document (TableIE output):

Total returns, year ended December 31, 2001, Aggressive Equity Fund, -10.6%
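To make the text-to-database step concrete for the Simple table only, here is a minimal Python sketch. It is a hand-written regular expression, not the CRF approach of this talk, and the names `ROW`, `parse_returns`, `TEXT` and `RECORDS` are hypothetical. It assumes every data row ends in a signed percentage, which holds for this example but not for tables in general.

```python
import re

# Hypothetical row pattern: a fund name, a sign, then a percentage.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<sign>[+-])\s*(?P<value>\d+(?:\.\d+)?)%$")


def parse_returns(lines):
    """Turn 'name  +/- value%' lines into (fund_name, percent_return) records."""
    records = []
    for line in lines:
        match = ROW.match(line.strip())
        if match is None:
            continue  # title or other non-data line
        value = float(match.group("value"))
        if match.group("sign") == "-":
            value = -value
        name = match.group("name")
        if name.endswith(" Fund"):
            name = name[: -len(" Fund")]  # the database column drops "Fund"
        records.append((name, value))
    return records


TEXT = [
    "Total Returns -- Year Ended December 31, 2001",
    "Money Market Fund            + 3.9%",
    "All America Fund             -17.4%",
    "Mid-Cap Equity Index Fund    - 1.1%",
]
RECORDS = parse_returns(TEXT)
for fund, pct in RECORDS:
    print(f"{fund:<22}{pct:+.1f}")
```

A pattern like this breaks as soon as the layout changes, which is the motivation for learning the row and cell structure instead.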

Automated Table Extraction needed for Question Answering

Question: How much snow fell in the greatest single snow storm?

Answer: 4800 mm

e.g. [Pinto, et al 2002]

Automated Table Extraction needed for Data Mining

ENRON GLOBAL POWER & PIPELINES L.L.C. CONSOLIDATED BALANCE SHEETS (IN THOUSANDS, EXCEPT SHARE AMOUNTS)

                                         SEPTEMBER 30,   DECEMBER 31,
                                              1997            1996
                                         -------------   ------------
                                          (UNAUDITED)
ASSETS
Current Assets
  Cash and cash equivalents                 $ 54,262        $ 24,582
  Accounts receivable                          8,473           6,301
  Dividends receivable                         7,189              --
  Current portion of notes receivable          1,470           1,394
  Other current assets                           336             404
                                            --------        --------
    Total Current Assets                      71,730          32,681
                                            --------        --------
Investments in to Unconsolidated Subsidiaries 286,340        298,530
Notes Receivable                              16,059          12,111
                                            --------        --------
    Total Assets                            $374,408        $343,843
                                            ========        ========
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
  Accounts payable                          $ 13,461        $ 11,277
  Accrued taxes                                1,910           1,488
  Current portion of note payable                 --          36,583
                                            --------        --------
    Total Current Liabilities                 15,371          49,348
                                            --------        --------
Deferred Income Taxes                            525           4,301
Commitments and Contingencies (Note 9)

Much of the data in SEC reports is contained in tables.

Would like to mine these reports for suspicious behavior, and to better understand what is normal.

Sub-problems of Table Information Extraction

1. Locate the table

2. Identify the row positions and types

3. Identify the column positions and types

4. Segment the table into individual cells

5. Label the cells as data or label

6. Associate data cells with their appropriate headers

... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.

        Principal Vegetables for Fresh Market: Area Planted and Harvested
                     by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
               :         Area Planted          :        Area Harvested
     Crop      :----------------------------------------------------------------
               :   1997   :   1998   :   1999  :   1997   :   1998   :   1999
--------------------------------------------------------------------------------
               :                             Acres
               :
Artichokes 2/  :   9,300      9,700      9,800     9,300      9,700      9,800
Asparagus 2/   :  79,530     77,730     79,590    74,030     74,430     75,890
Beans, Snap    :  90,260     94,700     98,700    82,660     87,800     90,600
Brussels       :
 Sprouts 2/    :   3,200      3,200      3,200     3,200      3,200      3,200
Cabbage        :  77,950     79,680     79,570    75,230     76,280     74,850

WEIGHTS and MEASURES

The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...


Row types annotated on the sample table:

Title
Super Header
Column Header
Data Row
Sub Header
Section Header Row
Section Data Row
Separator


Header

Data


Treat as sequence labeling problem

Assign each line one of 12 labels:

Non Table
Title
Super Header
Table Header
Sub Header
Section Header
Data Row
Section Data Row
Table Footnote
Table Caption
Blank
Separator


Hidden Markov Models

[Figure: an HMM drawn two ways. Finite state model: transitions between states, each emitting one of the observations o1 o2 o3 o4 o5 o6 o7 o8. Graphical model: a chain of states ... S_{t-1}, S_t, S_{t+1} ..., each state S_t emitting its observation O_t.]

Generates: State sequence, Observation sequence

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

Parameters: for all states S = {s1, s2, …}
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: Maximize probability of training observations (w/ prior)

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
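The joint probability of an HMM can be evaluated directly from its three parameter tables. The sketch below assumes a toy two-state model ("Non-Table", "Data Row") with a two-symbol alphabet ("alpha", "digits"); all probabilities are invented for illustration, and the start distribution plays the role of the first transition term.

```python
# Toy HMM parameters (invented numbers, for illustration only).
START = {"Non-Table": 0.8, "Data Row": 0.2}
TRANS = {
    "Non-Table": {"Non-Table": 0.7, "Data Row": 0.3},
    "Data Row": {"Non-Table": 0.2, "Data Row": 0.8},
}
EMIT = {
    "Non-Table": {"alpha": 0.9, "digits": 0.1},
    "Data Row": {"alpha": 0.2, "digits": 0.8},
}


def hmm_joint(states, observations):
    """P(s, o) = prod_t P(s_t | s_{t-1}) * P(o_t | s_t)."""
    prob = START[states[0]] * EMIT[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= TRANS[states[t - 1]][states[t]]
        prob *= EMIT[states[t]][observations[t]]
    return prob


JOINT = hmm_joint(
    ["Non-Table", "Data Row", "Data Row"],
    ["alpha", "digits", "digits"],
)
print(JOINT)
```

Each line contributes exactly one emission factor here, which is why a single atomic symbol per line fits the model naturally and a bundle of overlapping layout features does not.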

Table Row Labeling with Hidden Markov Models

Given a sequence of text lines (the sample document shown earlier) and a trained HMM with states such as Table Title, Table Header, Data Row and Non-Table:

Find the most likely state sequence (Viterbi): $s^* = \arg\max_s P(s, o)$

…then any line said to be generated by the designated “Table Title” state is extracted as part of the title.
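The Viterbi step can be sketched in a few lines of Python. This is a generic textbook implementation, not the authors' code; the two-state model and its probabilities are invented, and each line is reduced to a single hypothetical symbol ("alpha" or "digits").

```python
STATES = ["Non-Table", "Data Row"]
START = {"Non-Table": 0.8, "Data Row": 0.2}
TRANS = {
    "Non-Table": {"Non-Table": 0.7, "Data Row": 0.3},
    "Data Row": {"Non-Table": 0.2, "Data Row": 0.8},
}
EMIT = {
    "Non-Table": {"alpha": 0.9, "digits": 0.1},
    "Data Row": {"alpha": 0.2, "digits": 0.8},
}


def viterbi(observations):
    """Return (best state path, its joint probability P(s, o))."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: START[s] * EMIT[s][observations[0]] for s in STATES}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda r: best[t - 1][r] * TRANS[r][s])
            best[t][s] = best[t - 1][prev] * TRANS[prev][s] * EMIT[s][observations[t]]
            back[t][s] = prev
    last = max(STATES, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])  # follow back-pointers to the start
    path.reverse()
    return path, best[-1][last]


PATH, PROB = viterbi(["alpha", "alpha", "digits", "digits"])
print(PATH, PROB)
```

Lines assigned to a given state are then extracted as that part of the table, exactly as described for the “Table Title” state.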

What Observational Features?

Usually, P(o|s) is a multinomial over a one-dimensional alphabet of atomic features (such as words), but here we care about words plus many aspects of the layout.

Example features of text lines:

• Is indented
• Is indented by more than 4 spaces
• Is centered
• Contains more than 3 separate multi-space regions
• Has an interior region with more spaces than the indentation
• Whitespace in this line aligns vertically with whitespace in the previous line
• Contains mostly digits
• Contains mostly alphabetics
• Contains all “ASCII-graphics” characters
• Contains some “ASCII-graphics”
• Contains month names, years, or other strings associated with headers
• Contains more than 4 consecutive periods
• Next line contains all “ASCII-graphics”
• This line contains mostly alphabetics and contains more than 3 separate multi-space regions.
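A few of these line features can be computed as below. This is a sketch under assumptions: "mostly" is taken to mean more than half of the non-space characters, the set of “ASCII-graphics” characters in `GRAPHICS` is a guess, and the function name `line_features` is hypothetical.

```python
import re

GRAPHICS = set("-=+|:*_")  # assumed "ASCII-graphics" characters


def line_features(line):
    """Compute a handful of binary layout features for one text line."""
    indent = len(line) - len(line.lstrip(" "))
    non_space = [c for c in line if not c.isspace()]
    digits = sum(c.isdigit() for c in non_space)
    alphas = sum(c.isalpha() for c in non_space)
    # Interior runs of two or more spaces between non-space characters.
    gaps = re.findall(r"(?<=\S) {2,}(?=\S)", line)
    return {
        "is_indented": indent > 0,
        "indented_more_than_4": indent > 4,
        "more_than_3_multispace_regions": len(gaps) > 3,
        "mostly_digits": bool(non_space) and digits > len(non_space) / 2,
        "mostly_alphabetics": bool(non_space) and alphas > len(non_space) / 2,
        "all_ascii_graphics": bool(non_space) and all(c in GRAPHICS for c in non_space),
        "more_than_4_consecutive_periods": "....." in line,
    }


DATA_ROW = "Artichokes 2/  :   9,300      9,700      9,800     9,300      9,700      9,800"
SEPARATOR = "-" * 80
TITLE = "     Principal Vegetables for Fresh Market"

for sample in (DATA_ROW, SEPARATOR, TITLE):
    active = [name for name, on in line_features(sample).items() if on]
    print(active)
```

Note that several of these features fire together on the same line (for example "mostly digits" and "more than 3 multi-space regions" on a data row), which is the non-independence the next slide discusses.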

Problems with Rich Representation and a Generative Model

• These features are not independent:
  – Overlapping features
  – Multiple levels of granularity (words, characters)
  – Multiple modalities (words, formatting, layout)
  – Observations from past and future

• HMMs are generative models of the text: P(s, o)

• Generative models do not easily handle these non-independent features. Two choices:
  – Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
  – Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
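The over-counting effect is easy to see numerically. In the sketch below, one piece of evidence (say, "the line contains digits") is represented by several perfectly correlated features, and a naïve-Bayes-style model multiplies them as if they were independent. The probabilities 0.8 and 0.4 and the uniform prior are invented for illustration.

```python
def posterior_data_row(copies, p_given_data=0.8, p_given_non=0.4, prior=0.5):
    """Posterior P(Data Row | evidence) when one piece of evidence is
    counted `copies` times under a (wrong) independence assumption."""
    like_data = prior * p_given_data ** copies
    like_non = (1 - prior) * p_given_non ** copies
    return like_data / (like_data + like_non)


ONCE = posterior_data_row(1)    # evidence counted once: 2/3
THRICE = posterior_data_row(3)  # same evidence counted three times: 8/9
print(round(ONCE, 3), round(THRICE, 3))
```

The evidence has not changed, yet the model becomes far more confident; along a Viterbi path these inflated scores compound line after line.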

Conditional Sequence Models

• We would prefer a conditional model: P(s|o) instead of P(s,o):
  – Can examine features, but not responsible for generating them.
  – Don’t have to explicitly model their dependencies.
  – Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

Conditional Finite State Sequence Models

From HMMs to MEMMs to CRFs

$s = s_1, s_2, \ldots, s_n$, $o = o_1, o_2, \ldots, o_n$

[Figure: three graphical models over states ... S_{t-1}, S_t, S_{t+1} ... and observations ... O_{t-1}, O_t, O_{t+1} ...: HMM, MEMM and CRF.]

HMM (a special case of MEMMs and CRFs):

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

MEMM [McCallum, Freitag & Pereira, 2000]:

$$P(s \mid o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}, o_t) = \prod_{t=1}^{|o|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\left( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \right)$$

CRF [Lafferty, McCallum, Pereira 2001]:

$$P(s \mid o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\left( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \right)$$


Conditional Random Fields (CRFs)

[Figure: linear-chain graphical model. States S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} form a chain, each connected to the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.]

Markov on s, conditional dependency on o.

$$P(s \mid o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\left( \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right)$$

Hammersley-Clifford theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.

Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

Set parameters by maximum likelihood, using an optimization method on L.

[Lafferty, McCallum, Pereira 2001]
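The linear-chain form above can be checked on a toy example. The sketch below assumes two states, hand-picked weights and a hypothetical `score` function standing in for the weighted feature sum; it computes log Z_o with the forward recursion in O(|o| |S|^2) time and compares it with brute-force enumeration of all state sequences. It illustrates inference only, not training.

```python
import itertools
import math

STATES = ["Non-Table", "Data Row"]


def score(prev, cur, obs, t):
    """Toy stand-in for sum_k lambda_k f_k(s_t, s_{t-1}, o, t); weights are invented."""
    total = 0.0
    if prev == cur:
        total += 1.0  # labels tend to persist from line to line
    if cur == "Data Row" and obs[t]["mostly_digits"]:
        total += 2.0
    if cur == "Non-Table" and not obs[t]["mostly_digits"]:
        total += 1.5
    # A feature may inspect any observation, here the following line.
    if cur == "Data Row" and t + 1 < len(obs) and obs[t + 1]["mostly_digits"]:
        total += 0.5
    return total


def path_score(path, obs):
    return sum(
        score(path[t - 1] if t > 0 else None, path[t], obs, t)
        for t in range(len(obs))
    )


def log_partition(obs):
    """Forward algorithm in log space: log Z_o."""
    alpha = {s: score(None, s, obs, 0) for s in STATES}
    for t in range(1, len(obs)):
        alpha = {
            s: math.log(sum(math.exp(alpha[r] + score(r, s, obs, t)) for r in STATES))
            for s in STATES
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))


def conditional_prob(path, obs):
    """P(s | o) = exp(score of path) / Z_o."""
    return math.exp(path_score(path, obs) - log_partition(obs))


OBS = [{"mostly_digits": False}, {"mostly_digits": True}, {"mostly_digits": True}]
ALL_PATHS = list(itertools.product(STATES, repeat=len(OBS)))
BRUTE_LOG_Z = math.log(sum(math.exp(path_score(p, OBS)) for p in ALL_PATHS))
print(log_partition(OBS), BRUTE_LOG_Z)
```

Because the model only scores state sequences given the observations, the look-ahead feature costs nothing extra: nothing has to generate the next line.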


General CRFs vs. HMMs

• More general and expressive modeling technique

• Comparable computational efficiency for inference

• Features may be arbitrary functions of any or all observations

• Parameters need not fully specify generation of observations; require less training data

• Easy to incorporate domain knowledge


Experimental Results

• 114 plain ASCII documents obtained from FedStats.gov.

• Train on 52 documents, test on 62.

• Line labels:
  1. Title
  2. Super Header
  3. Table Header
  4. Separator
  5. Sub Header
  6. Data Row
  7. Table Caption
  8. Section Header
  9. Section Data Row
  10. Blank
  11. Table Footnote
  12. Non Table

... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.

        Principal Vegetables for Fresh Market: Area Planted and Harvested
                     by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
               :         Area Planted          :        Area Harvested
     Crop      :----------------------------------------------------------------
               :   1997   :   1998   :   1999  :   1997   :   1998   :   1999
--------------------------------------------------------------------------------
               :                             Acres
               :
Artichokes 2/  :   9,300      9,700      9,800     9,300      9,700      9,800
Asparagus 2/   :  79,530     77,730     79,590    74,030     74,430     75,890
Beans, Snap    :  90,260     94,700     98,700    82,660     87,800     90,600
Brussels       :
 Sprouts 2/    :   3,200      3,200      3,200     3,200      3,200      3,200
Cabbage        :  77,950     79,680     79,570    75,230     76,280     74,850

/2 Includes processing total for dual usage crops

WEIGHTS and MEASURES

The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...

Features

• 4 spaces in a row
• 5+ spaces in a row
• 4+ space indentation
• 1 space indentation
• 2+ regions of 2+ spaces
• 3+ regions of 2+ spaces
• all spaces
• contains alphabetics
• contains digits
• contains separator characters (-+!=:*)
• contains 4+ periods
• contains common “header” words, such as months, years, etc.
• percentage of white space
• percentage of alphabetics
• percentage of digits
• percentage of separator chars
• percentage of “header” words
• All these features in time-shifted conjunctions {-1, 0}, {0, 1}, {1, 2}.
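One plausible reading of the time-shifted conjunctions is sketched below: for each line t and each offset pair (a, b), every binary feature active on line t+a is paired with every feature active on line t+b. The slide does not spell out the exact construction, so the pairing rule, the feature-name format and the function name `add_conjunctions` are all assumptions.

```python
OFFSET_PAIRS = [(-1, 0), (0, 1), (1, 2)]


def add_conjunctions(line_feats):
    """line_feats: one set of active binary feature names per line.
    Returns, per line, the original features plus conjunction features."""
    n = len(line_feats)
    out = []
    for t in range(n):
        feats = set(line_feats[t])
        for a, b in OFFSET_PAIRS:
            if not (0 <= t + a < n and 0 <= t + b < n):
                continue  # the window falls off the document edge
            for fa in line_feats[t + a]:
                for fb in line_feats[t + b]:
                    feats.add(f"{fa}@{a}&{fb}@{b}")
        out.append(feats)
    return out


LINES = [{"alpha"}, {"separator"}, {"digits"}]
EXPANDED = add_conjunctions(LINES)
for t, feats in enumerate(EXPANDED):
    print(t, sorted(feats))
```

Under this reading, a feature such as `alpha@-1&separator@0` lets the model recognise "a separator line directly below a text line", a pattern typical of header rows.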


Table Extraction Experimental Results

                             Line labels,       Table segments,
                             percent correct    F1
HMM                               65 %               64 %
Stateless MaxEnt                  85 %                -
CRF w/out conjunctions            81 %               71 %
CRF, binary features              93 %               91 %
CRF, continuous features          95 %               92 %

Error reduction, HMM vs. CRF with continuous features: 85% on line labels, 77% on table segments.
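The two error figures can be reproduced arithmetically, assuming they compare the HMM scores (65%, 64%) with the best CRF scores (95%, 92%) and are truncated to whole percentages. The function name `error_reduction` is hypothetical.

```python
def error_reduction(baseline_pct, improved_pct):
    """Relative reduction in error, in percent, when a score rises
    from baseline_pct to improved_pct (both on a 0-100 scale)."""
    baseline_error = 100 - baseline_pct
    improved_error = 100 - improved_pct
    return 100 * (baseline_error - improved_error) / baseline_error


LINE_LABELS = error_reduction(65, 95)     # error falls from 35 to 5
TABLE_SEGMENTS = error_reduction(64, 92)  # error falls from 36 to 8
print(round(LINE_LABELS, 1), round(TABLE_SEGMENTS, 1))
```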


Per-label Results

Label               Recall   Precision
Non Table             98        95
Separator             90        94
Title                 54        90
Super Header          65        91
Table Header          46        34
Sub Header            92        62
Section Header        44        70
Data Row              86        91
Section Data Row      55        68
Table Footnote        69        90
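Recall and precision can be combined into a single per-label F1 score (their harmonic mean) to see which labels are hardest. The F1 values computed below are derived from the table and are not reported on the slide.

```python
# (recall, precision) per label, copied from the table above.
RESULTS = {
    "Non Table": (98, 95),
    "Separator": (90, 94),
    "Title": (54, 90),
    "Super Header": (65, 91),
    "Table Header": (46, 34),
    "Sub Header": (92, 62),
    "Section Header": (44, 70),
    "Data Row": (86, 91),
    "Section Data Row": (55, 68),
    "Table Footnote": (69, 90),
}


def f1(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


F1 = {label: f1(r, p) for label, (r, p) in RESULTS.items()}
for label, value in sorted(F1.items(), key=lambda item: item[1]):
    print(f"{label:<18}{value:5.1f}")
```

Sorted this way, the header-type labels cluster at the bottom, while Non Table, Separator and Data Row are labeled most reliably.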

Current Work

Infer rows, columns, cell boundaries and types all at once!

A random variable per character, connected in a grid of dependencies.

Similar to Markov Random Fields as used in computer vision, but conditionally trained.

Exact inference with loops is intractable. We use recent methods of approximate inference [Wainwright et al 02, 03].

Page 32

CRF Related Work

• Maximum entropy for language tasks
  – Language modeling [Rosenfeld ’94], [Chen & Rosenfeld ’99]
  – POS tagging, conditioning on previous state [Ratnaparkhi ’98]
  – Segmentation [Beeferman, Berger, Lafferty ’99]

• Other Conditional Markov Models
  – Sequence of Winnow classifiers [Roth ’98]
  – Gradient descent on state path [LeCun et al ’98]
  – Maximum Entropy Markov Models [McCallum, Freitag, Pereira 2000], used by [Klein, Smarr, Nguyen, Manning 2003], …
  – Maximum Margin sequence models [Taskar et al 2003], [Altun et al 2003], [Joachims 2003]
  – Feature induction for CRFs [McCallum 2003]

• Training methods
  – Limited Memory Quasi-Newton [Malouf 2002], [Sha & Pereira 2002]
  – Voted Perceptron [Collins 2002]
  – Adaptive Over-relaxed Bound Optimization [Roweis 2003]

Page 33

Table Extraction Related Work

• Matthew Hurst [1999, 2000, 2002]
  – Defined many of the issues of table modeling
  – Used a naïve-Bayes-like model of table layout

• Ng, Lim, Koh [1999]
  – Serially find table, segment rows and columns using stateless C4.5 and neural network classifiers.

• Pyreddy & Croft [1997], Pinto et al [2002]
  – Heuristically find tables, cells and their associations; use for question answering.

• “Wrapper Learning” for extraction from consistently formatted Web pages also uses language and formatting
  – e.g. Stephen Soderland, Nick Kushmerick, Dayne Freitag, William Cohen, Ion Muslea, …

Page 34

Summary

• In many documents, meaning is conveyed not only in the stream of words, but also in layout.

• Conditional Random Fields combine the benefits of finite-state context with robustness to non-independent language+layout features.

• Variants of CRFs will bring even finer-grained and more tightly integrated decision-making capabilities.

Page 35

End of talk

Page 36

MEMM & CRF Related Work

• Maximum entropy for language tasks:
  – Language modeling [Rosenfeld ’94, Chen & Rosenfeld ’99]
  – Part-of-speech tagging [Ratnaparkhi ’98]
  – Segmentation [Beeferman, Berger & Lafferty ’99]
  – Named entity recognition “MENE” [Borthwick, Grishman, … ’98]

• HMMs for similar language tasks
  – Part-of-speech tagging [Kupiec ’92]
  – Named entity recognition [Bikel et al ’99]
  – Other Information Extraction [Leek ’97], [Freitag & McCallum ’99]

• Serial Generative/Discriminative Approaches
  – Speech recognition [Schwartz & Austin ’93]
  – Reranking Parses [Collins ’00]

• Other conditional Markov models
  – Non-probabilistic local decision models [Brill ’95], [Roth ’98]
  – Gradient-descent on state path [LeCun et al ’98]
  – Markov Processes on Curves (MPCs) [Saul & Rahim ’99]
  – Voted Perceptron-trained FSMs [Collins ’02]

Page 37

Voted Perceptron Sequence Models

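[Collins 2002]

Given training data: $\{\langle \mathbf{o}, \mathbf{s} \rangle^{(i)}\}$

Initialize parameters to zero: $\forall k \;\; \lambda_k = 0$

Iterate to convergence:

  for all training instances, $i$:

    $\mathbf{s}_{\mathrm{Viterbi}} = \arg\max_{\mathbf{s}} \exp\Big( \sum_t \sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big)$

    $\forall k: \;\; \lambda_k \leftarrow \lambda_k + C_k(\mathbf{s}^{(i)}, \mathbf{o}^{(i)}) - C_k(\mathbf{s}_{\mathrm{Viterbi}}, \mathbf{o}^{(i)})$

    (The difference of counts is analogous to the gradient for this one training instance.)

where $C_k(\mathbf{s}, \mathbf{o}) = \sum_t f_k(s_{t-1}, s_t, \mathbf{o}, t)$ as before.

Like CRFs with stochastic gradient ascent and a Viterbi approximation.

Avoids calculating the partition function (normalizer), $Z_o$, but uses gradient ascent, not a 2nd-order or conjugate gradient method.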
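The training loop on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the talk: the feature set (label transitions plus word/label pairs), the `START` symbol, the toy data and every function name are hypothetical, and the voting (or averaging) of parameter vectors that gives the method its name is left out.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sequence label


def features(prev, cur, obs, t):
    """Active binary features f_k(s_{t-1}, s_t, o, t); an illustrative choice."""
    return [("trans", prev, cur), ("emit", cur, obs[t])]


def counts(labels, obs):
    """C_k(s, o): how often each feature fires along a label path."""
    c = Counter()
    prev = START
    for t, cur in enumerate(labels):
        c.update(features(prev, cur, obs, t))
        prev = cur
    return c


def viterbi(obs, label_set, weights):
    """Highest-scoring label path under the current weights."""
    if not obs:
        return []

    def local(prev, cur, t):
        return sum(weights.get(f, 0.0) for f in features(prev, cur, obs, t))

    # best[s] = (score, path) of the best path so far that ends in label s
    best = {s: (local(START, s, 0), [s]) for s in label_set}
    for t in range(1, len(obs)):
        best = {
            s: max(
                ((score + local(p, s, t), path + [s])
                 for p, (score, path) in best.items()),
                key=lambda pair: pair[0],
            )
            for s in label_set
        }
    return max(best.values(), key=lambda pair: pair[0])[1]


def train(data, label_set, epochs=10):
    """Perceptron updates: add gold feature counts, subtract Viterbi counts."""
    weights = defaultdict(float)  # all parameters start at zero
    for _ in range(epochs):
        mistakes = 0
        for obs, gold in data:
            guess = viterbi(obs, label_set, weights)
            if guess != list(gold):  # identical paths would give a zero update
                mistakes += 1
                for f, n in counts(gold, obs).items():
                    weights[f] += n
                for f, n in counts(guess, obs).items():
                    weights[f] -= n
        if mistakes == 0:  # converged on the training data
            break
    return weights


# Made-up toy data, for illustration only.
LABELS = ["D", "N", "V"]
TOY_DATA = [
    (["the", "dog", "barks"], ["D", "N", "V"]),
    (["the", "cat", "sleeps"], ["D", "N", "V"]),
]

if __name__ == "__main__":
    w = train(TOY_DATA, LABELS)
    print(viterbi(["the", "cat", "barks"], LABELS, w))
```

Note that the update needs only the single best path from Viterbi, never a sum over all paths, which is how the method avoids the partition function.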

Page 38

Part-of-speech Tagging

The  asbestos  fiber  ,  crocidolite  ,  is   unusually  resilient  once
DT   NN        NN     ,  NN           ,  VBZ  RB         JJ         IN

it   enters  the  lungs  ,  with  even  brief  exposures  to  it   causing
PRP  VBZ     DT   NNS    ,  IN    RB    JJ     NNS        TO  PRP  VBG

symptoms  that  show  up  decades  later  ,  researchers  said  .
NNS       WDT   VBP   RP  NNS      JJ     ,  NNS          VBD   .

45 tags, 1M words training data, Penn Treebank

                                  Using spelling features*
       Error    OOV error      Error    Δ err    OOV error    Δ err
HMM    5.69%    45.99%
CRF    5.55%    48.05%         4.27%    -24%     23.76%       -50%

* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

[Pereira 2001 personal comm.]
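The overlapping spelling features in the footnote are easy to write down as a feature extractor. The sketch below is one possible reading, not the code used for these results: the function name and feature strings are hypothetical, and "begins with #" is interpreted as "begins with a digit", which is an assumption.

```python
# Suffixes listed in the slide's footnote.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")


def spelling_features(word):
    """Return the set of active features for one word.

    Features overlap on purpose: "nation" fires both -ion and -tion.
    A conditionally trained model can use such non-independent features.
    """
    feats = {"word=" + word}
    if word[:1].isupper():
        feats.add("capitalized")
    if word[:1].isdigit():  # assumed meaning of "begins with #"
        feats.add("begins_with_number")
    if "-" in word:
        feats.add("contains_hyphen")
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix):
            feats.add("ends_in_-" + suffix)
    return feats


if __name__ == "__main__":
    for w in ["crocidolite", "exposures", "causing", "unusually"]:
        print(w, sorted(spelling_features(w)))
```

Because these features depend only on the spelling of a word, they still fire for words never seen in training, which is consistent with the large drop in out-of-vocabulary error in the table.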