Table Extraction using Conditional Random Fields
David Pinto
Andrew McCallum
Xing Wei
Bruce Croft
University of Massachusetts Amherst
Building on previous joint work with John Lafferty and Fernando Pereira
Documents convey meaning by…
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
Stream of Words Words + Formatting & Layout
Different modalities of “Grammar”
Prepositions Formatting & Layout
Most complex use of layout: The Table
Tables have a long history
Old (circa 2700 BC) New (2003)
Simple
Total Returns -- Year Ended December 31, 2001
Money Market Fund            + 3.9%
All America Fund             -17.4%
Equity Index Fund            -12.2%
Mid-Cap Equity Index Fund    - 1.1%
Bond Fund                    + 8.7%
Short-Term Bond Fund         + 7.4%
Mid-Term Bond Fund           +10.4%
Composite Fund               -11.0%
Aggressive Equity Fund       -10.6%
Complex
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :        Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :      Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk   : Milkfat : Milk Produced :  Milk   : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---       Percent       Million Pounds
      :
 1993 :    9,589     15,704      575         3.66        150,582   5,514.4
 1994 :    9,500     16,175      592         3.66        153,664   5,623.7
 1995 :    9,461     16,451      602         3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Source: www.FedStats.gov
Table Information Extraction
Text:

Total Returns -- Year Ended December 31, 2001

Money Market Fund            + 3.9%
All America Fund             -17.4%
Equity Index Fund            -12.2%
Mid-Cap Equity Index Fund    - 1.1%
Bond Fund                    + 8.7%
Short-Term Bond Fund         + 7.4%
Mid-Term Bond Fund           +10.4%
Composite Fund               -11.0%
Aggressive Equity Fund       -10.6%

TableIE converts this text into a database:

FUND NAME        TOTAL % RETURN 2001
Money Market     +3.9
All America      -17.4
Equity Index     -12.2
...

or into a short document:

Total returns, year ended December 31, 2001, Aggressive Equity Fund, -10.6%
Automated Table Extraction needed for Question Answering
Question: How much snow fell in the greatest single snow storm?
Answer: 4800 mm
e.g. [Pinto, et al 2002]
Automated Table Extraction needed for Data Mining
ENRON GLOBAL POWER & PIPELINES L.L.C. CONSOLIDATED BALANCE SHEETS (IN THOUSANDS, EXCEPT SHARE AMOUNTS)
                                          SEPTEMBER 30,   DECEMBER 31,
                                              1997            1996
                                          -------------   ------------
                                           (UNAUDITED)
ASSETS
Current Assets
  Cash and cash equivalents                 $ 54,262        $ 24,582
  Accounts receivable                          8,473           6,301
  Dividends receivable                         7,189              --
  Current portion of notes receivable          1,470           1,394
  Other current assets                           336             404
                                            --------        --------
    Total Current Assets                      71,730          32,681
                                            --------        --------
Investments in to Unconsolidated Subsidiaries 286,340        298,530
Notes Receivable                              16,059          12,111
                                            --------        --------
    Total Assets                            $374,408        $343,843
                                            ========        ========
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
  Accounts payable                          $ 13,461        $ 11,277
  Accrued taxes                                1,910           1,488
  Current portion of note payable                 --          36,583
                                            --------        --------
    Total Current Liabilities                 15,371          49,348
                                            --------        --------
Deferred Income Taxes                            525           4,301
Commitments and Contingencies (Note 9)
Much of the data in SEC reports is contained in tables.
Would like to mine these reports for suspicious behavior, and to better understand what is normal.
Sub-problems of Table Information Extraction
1. Locate the table
2. Identify the row positions and types
3. Identify the column positions and types
4. Segment the table into individual cells
5. Label the cells as data or label
6. Associate data cells with their appropriate headers
... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Beans, Snap : 90,260 94,700 98,700 82,660 87,800 90,600
Brussels :
Sprouts 2/ : 3,200 3,200 3,200 3,200 3,200 3,200
Cabbage : 77,950 79,680 79,570 75,230 76,280 74,850
WEIGHTS and MEASURES
The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...
Row types: Title, Super Header, Column Header, Data Row, Sub Header, Section Header Row, Section Data Row, Separator
Header
Data
Treat as sequence labeling problem
Assign each line one of 12 labels:
Non Table, Title, Super Header, Table Header, Sub Header, Section Header, Data Row, Section Data Row, Table Footnote, Table Caption, Blank, Separator
Hidden Markov Models
[Figure: an HMM drawn as a finite state model and as a graphical model, with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}.]
Parameters, for all states S = {s1, s2, …}:
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t)
Training: Maximize probability of training observations (w/ prior)

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
[Figure: the HMM generates a state sequence through its transitions and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8 through its emissions.]
Observations are usually a multinomial over an atomic, fixed alphabet.
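As a minimal sketch of the HMM joint probability, the following computes log P(s, o) for a short labeled sequence. The two states, the two observation symbols, and every probability are made-up values for illustration, and the start probability plays the role of P(s_1 | s_0).

```python
import math

# Hypothetical two-state HMM; all probabilities are illustrative assumptions.
START = {"NonTable": 0.8, "DataRow": 0.2}
TRANS = {
    "NonTable": {"NonTable": 0.7, "DataRow": 0.3},
    "DataRow": {"NonTable": 0.2, "DataRow": 0.8},
}
EMIT = {
    "NonTable": {"mostly_alpha": 0.9, "mostly_digits": 0.1},
    "DataRow": {"mostly_alpha": 0.2, "mostly_digits": 0.8},
}


def joint_log_prob(states, observations):
    """log P(s, o): start term, then one transition and one emission per step."""
    if not states or len(states) != len(observations):
        raise ValueError("states and observations must be non-empty and equal length")
    logp = math.log(START[states[0]]) + math.log(EMIT[states[0]][observations[0]])
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][obs])
    return logp


if __name__ == "__main__":
    lp = joint_log_prob(["NonTable", "DataRow"], ["mostly_alpha", "mostly_digits"])
    print(math.exp(lp))
```

Working in log space avoids underflow when the sequence of lines is long.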
Table Row Labeling with Hidden Markov Models
Given a sequence of text lines and a trained HMM with states such as Table Title, Table Header, Data Row, and Non-Table:
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Find the most likely state sequence (Viterbi): $s^{*} = \arg\max_{s} P(s, o)$
…then any line said to be generated by the designated “Table Title” state is extracted as part of the title.
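The Viterbi step can be sketched as follows. This is an illustrative implementation under assumed inputs: the three states, the three observation symbols ("prose", "centered", "digits") and all probabilities are hypothetical, and each line is reduced to a single atomic symbol as a plain HMM requires.

```python
import math

STATES = ["NonTable", "Title", "DataRow"]
START = {"NonTable": 0.8, "Title": 0.1, "DataRow": 0.1}
TRANS = {
    "NonTable": {"NonTable": 0.7, "Title": 0.2, "DataRow": 0.1},
    "Title": {"NonTable": 0.1, "Title": 0.3, "DataRow": 0.6},
    "DataRow": {"NonTable": 0.2, "Title": 0.05, "DataRow": 0.75},
}
EMIT = {
    "NonTable": {"prose": 0.8, "centered": 0.15, "digits": 0.05},
    "Title": {"prose": 0.2, "centered": 0.7, "digits": 0.1},
    "DataRow": {"prose": 0.05, "centered": 0.05, "digits": 0.9},
}


def viterbi(observations, states, start, trans, emit):
    """Return the state sequence that maximizes P(s, o) under the HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start[s]) + math.log(emit[s][observations[0]]) for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans[p][cur]))
            scores[cur] = best[-1][prev] + math.log(trans[prev][cur]) + math.log(emit[cur][obs])
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["prose", "centered", "digits", "digits"], STATES, START, TRANS, EMIT))
```

Lines whose decoded state is "Title" would then be extracted as the table title.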
What Observational Features?
Usually, P(o|s) is a multinomial over a one-dimensional alphabet of atomic features (such as words), but here we care about words plus many aspects of the layout.
Example features of text lines:
• Is indented
• Is indented by more than 4 spaces
• Is centered
• Contains more than 3 separate multi-space regions
• Has an interior region with more spaces than the indentation
• Whitespace in this line aligns vertically with whitespace in the previous line
• Contains mostly digits
• Contains mostly alphabetics
• Contains all “ASCII-graphics” characters
• Contains some “ASCII-graphics”
• Contains month names, years, or other strings associated with headers
• Contains more than 4 consecutive periods
• Next line contains all “ASCII-graphics”
• This line contains mostly alphabetics and contains more than 3 separate multi-space regions.
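A few of these layout features can be sketched in code. The function below is an illustration only: the feature names, the 0.5 threshold for "mostly", and the set of characters treated as "ASCII-graphics" are assumptions, and the alignment, centering, and header-word features are left out.

```python
import re

# Assumed set of "ASCII-graphics" characters; the talk does not enumerate them here.
ASCII_GRAPHICS = set("-=+|:*_#")


def line_features(line):
    """Return a dict of binary layout features for one line of plain text."""
    stripped = line.strip()
    indent = len(line) - len(line.lstrip(" "))
    non_space = [c for c in stripped if not c.isspace()]
    n = len(non_space) or 1  # avoid division by zero on blank lines
    digit_frac = sum(c.isdigit() for c in non_space) / n
    alpha_frac = sum(c.isalpha() for c in non_space) / n
    graphics = [c in ASCII_GRAPHICS for c in non_space]
    gaps = re.findall(r" {2,}", stripped)  # interior runs of two or more spaces
    return {
        "is_indented": indent > 0,
        "indented_more_than_4": indent > 4,
        "more_than_3_multispace_regions": len(gaps) > 3,
        "interior_gap_exceeds_indent": any(len(g) > indent for g in gaps),
        "mostly_digits": digit_frac > 0.5,
        "mostly_alpha": alpha_frac > 0.5,
        "all_ascii_graphics": bool(non_space) and all(graphics),
        "some_ascii_graphics": any(graphics),
        "more_than_4_consecutive_periods": "....." in line,
    }


if __name__ == "__main__":
    for name, value in line_features("Cabbage   :   77,950   79,680   79,570   75,230").items():
        print(f"{name:<34}{value}")
```

Several of these features fire together on the same line, which is the overlap that the following slides discuss.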
Problems with Rich Representation and a Generative Model
• These features are not independent:
– Overlapping features
– Multiple levels of granularity (words, characters)
– Multiple modalities (words, formatting, layout)
– Observations from past and future
• HMMs are generative models of the text: $P(s, o)$
• Generative models do not easily handle these non-independent features. Two choices:
– Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
– Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
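The over-counting effect can be illustrated with a small calculation. Suppose several strongly correlated features (for example "is indented" and "is indented by more than 4 spaces") all carry essentially the same evidence. The sketch below treats them as exact copies of one feature that are multiplied in as if independent; the prior and the likelihood ratio are made-up numbers.

```python
def posterior_table(prior_table, likelihood_ratio, copies):
    """P(table | evidence) when one piece of evidence is multiplied in `copies` times.

    likelihood_ratio is P(feature | table) / P(feature | not table).
    A correct model would use copies=1 however many redundant features fire.
    """
    odds = prior_table / (1.0 - prior_table) * likelihood_ratio ** copies
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for copies in (1, 2, 3):
        print(copies, round(posterior_table(0.5, 3.0, copies), 4))
```

With a likelihood ratio of 3, counting the evidence once gives a posterior of 0.75, while counting it three times pushes the posterior above 0.96, so the model becomes overconfident on redundant evidence.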
Conditional Sequence Models
• We would prefer a conditional model: P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
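From HMMs to MEMMs to CRFs
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum, Pereira 2001]

$$s = s_1, s_2, \ldots, s_n \qquad o = o_1, o_2, \ldots, o_n$$

[Figure: graphical models of the HMM, MEMM, and CRF over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}.]

HMM (a special case of MEMMs and CRFs):

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

MEMM:

$$P(s \mid o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}, o_t) = \prod_{t=1}^{|o|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$

CRF:

$$P(s \mid o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$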
Conditional Random Fields (CRFs)
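[Figure: linear chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, each connected to the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.]
Markov on s, conditional dependency on o.

$$P(s \mid o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\Big( \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \Big)$$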
Hammersley-Clifford theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
Set parameters by maximum likelihood, using an optimization method on L.
[Lafferty, McCallum, Pereira 2001]
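The dynamic-programming claim can be illustrated with a forward pass that computes the normalizer Z_o of a linear-chain CRF in O(|o| |S|^2) time. This is a sketch under assumptions: `toy_score` stands in for the weighted feature sum inside the exponential, and its features and weights are hypothetical rather than those used in the experiments.

```python
import itertools
import math

LABELS = ["NonTable", "DataRow"]


def toy_score(prev, cur, obs, t):
    """Hypothetical weighted feature sum for position t; prev is None at t = 0."""
    digits = sum(c.isdigit() for c in obs[t])
    total = 0.0
    if cur == "DataRow" and digits > 3:
        total += 2.0  # assumed weight: digit-heavy lines look like data rows
    if cur == "NonTable" and digits == 0:
        total += 1.5  # assumed weight: digit-free lines look like prose
    if prev == cur:
        total += 0.5  # assumed weight: labels tend to persist across lines
    return total


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(states, obs, score):
    """Unnormalized log-score of one full labeling."""
    total, prev = 0.0, None
    for t, cur in enumerate(states):
        total += score(prev, cur, obs, t)
        prev = cur
    return total


def log_partition(obs, labels, score):
    """log Z_o via the forward algorithm: |o| steps, |S|^2 work per step."""
    alpha = {s: score(None, s, obs, 0) for s in labels}
    for t in range(1, len(obs)):
        alpha = {
            cur: logsumexp([alpha[prev] + score(prev, cur, obs, t) for prev in labels])
            for cur in labels
        }
    return logsumexp(list(alpha.values()))


def conditional_log_prob(states, obs, labels, score):
    """log P(s | o) = score(s, o) - log Z_o."""
    return sequence_score(states, obs, score) - log_partition(obs, labels, score)


if __name__ == "__main__":
    lines = ["Total production rose.", "Cabbage : 77,950 79,680", "Asparagus : 79,530 77,730"]
    for s in itertools.product(LABELS, repeat=len(lines)):
        print(s, round(math.exp(conditional_log_prob(list(s), lines, LABELS, toy_score)), 4))
```

The brute-force enumeration in the usage example is only a check that the probabilities sum to one; the forward pass itself never enumerates labelings.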
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; require less training data
• Easy to incorporate domain knowledge
Experimental Results
• 114 plain ASCII documents obtained from FedStats.gov.
• Train on 52 documents, test on 62.
• Line labels: Title, Super Header, Table Header, Separator, Sub Header, Data Row, Table Caption, Section Header, Section Data Row, Blank, Table Footnote, Non Table
... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Beans, Snap : 90,260 94,700 98,700 82,660 87,800 90,600
Brussels :
Sprouts 2/ : 3,200 3,200 3,200 3,200 3,200 3,200
Cabbage : 77,950 79,680 79,570 75,230 76,280 74,850
/2 Includes processing total for dual usage crops
WEIGHTS and MEASURES
The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...
Features
• 4 spaces in a row
• 5+ spaces in a row
• 4+ space indentation
• 1 space indentation
• 2+ regions of 2+ spaces
• 3+ regions of 2+ spaces
• all spaces
• contains alphabetics
• contains digits
• contains separator characters (-+!=:*)
• contains 4+ periods
• contains common “header” words, such as months, years, etc.
• percentage of white space
• percentage of alphabetics
• percentage of digits
• percentage of separator chars
• percentage of “header” words
• All these features in time-shifted conjunctions {-1, 0}, {0, 1}, {1, 2}.
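One plausible reading of the time-shifted conjunctions is that, for each line t, every feature at offset a is paired with every feature at offset b for the offset pairs (-1, 0), (0, 1), and (1, 2). The sketch below follows that reading; the naming scheme for conjoined features and the handling of document boundaries are assumptions, not details given in the talk.

```python
def conjoin(per_line_features, offsets=((-1, 0), (0, 1), (1, 2))):
    """Add pairwise conjunctions of binary features at shifted line positions.

    per_line_features: list of sets of feature names, one set per text line.
    Returns a new list of sets; offset pairs that fall outside the document are skipped.
    """
    n = len(per_line_features)
    out = []
    for t in range(n):
        feats = set(per_line_features[t])  # keep the original features
        for a, b in offsets:
            if 0 <= t + a < n and 0 <= t + b < n:
                for fa in per_line_features[t + a]:
                    for fb in per_line_features[t + b]:
                        feats.add(f"{fa}@{a}&{fb}@{b}")
        out.append(feats)
    return out


if __name__ == "__main__":
    lines = [{"contains_alpha"}, {"contains_separator"}, {"contains_digits"}]
    for t, feats in enumerate(conjoin(lines)):
        print(t, sorted(feats))
```

A conjunction such as "separator on the previous line and digits on this line" lets the model capture context that no single-line feature expresses.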
Table Extraction Experimental Results
Line labels, percent correct
Table segments, F1
95 % 92 %
65 % 64 %
error = 85%
error = 77%
85 % -
HMM
Stateless MaxEnt
CRF w/out conjunctions
CRF continuous features
81 % 71 %
93 % 91 %
CRF binary features
Per-label Results
Label              Recall   Precision
Non Table            98        95
Separator            90        94
Title                54        90
Super Header         65        91
Table Header         46        34
Sub Header           92        62
Section Header       44        70
Data Row             86        91
Section Data Row     55        68
Table Footnote       69        90
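Recall and precision are often combined into F1, their harmonic mean. The talk reports F1 only for table segments, so the per-label F1 values computed below are an illustration derived from the recall and precision figures above, not results stated in the talk.

```python
# (recall %, precision %) per line label, copied from the per-label results.
PER_LABEL = {
    "Non Table": (98, 95),
    "Separator": (90, 94),
    "Title": (54, 90),
    "Super Header": (65, 91),
    "Table Header": (46, 34),
    "Sub Header": (92, 62),
    "Section Header": (44, 70),
    "Data Row": (86, 91),
    "Section Data Row": (55, 68),
    "Table Footnote": (69, 90),
}


def f1(recall, precision):
    """Harmonic mean of recall and precision; 0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    for label, (recall, precision) in PER_LABEL.items():
        print(f"{label:<18}{f1(recall, precision):5.1f}")
```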
Current Work
Infer rows, columns, cell boundaries and types all at once!
A random variable per character, connected in a grid of dependencies.
Similar to Markov Random Fields as used in computer vision, but conditionally trained.
Exact inference with loops is intractable. We use recent methods of approximate inference. [Wainwright et al 02, 03]
CRF Related Work
• Maximum entropy for language tasks
– Language modeling [Rosenfeld ’94], [Chen & Rosenfeld ’99]
– POS tagging, conditioning on previous state [Ratnaparkhi ’98]
– Segmentation [Beeferman, Berger, Lafferty ’99]
• Other Conditional Markov Models
– Sequence of Winnow classifiers [Roth ’98]
– Gradient descent on state path [LeCun et al ’98]
– Maximum Entropy Markov Models [McCallum, Freitag, Pereira 2000], used by [Klein, Smarr, Nguyen, Manning 2003], …
– Maximum Margin sequence models [Taskar et al 2003], [Altun et al 2003], [Joachims 2003]
– Feature induction for CRFs [McCallum 2003]
• Training methods
– Limited Memory Quasi-Newton [Malouf 2002], [Sha & Pereira 2002]
– Voted Perceptron [Collins 2002]
– Adaptive Over-relaxed Bound Optimization [Roweis 2003]
Table Extraction Related Work
• Matthew Hurst [1999, 2000, 2002]
– Defined many of the issues of table modeling
– Used a naïve-Bayes-like model of table layout
• Ng, Lim, Koo [1999]
– Serially find table, segment rows and columns using stateless C4.5 and neural network classifiers.
• Pyreddy & Croft [1997], Pinto et al [2002]
– Heuristically find tables, cells and their associations; use for question answering.
• “Wrapper Learning” for extraction from consistently formatted Web pages also uses language and formatting
– e.g. Stephen Soderland, Nick Kushmerick, Dayne Freitag, William Cohen, Ion Muslea, …
Summary
• In many documents, meaning is conveyed not only in the stream of words, but in layout.
• Conditional Random Fields combine the benefits of finite-state context, and robustness to non-independent language+layout features.
• Variants of CRFs will bring even finer-grained and more tightly integrated decision-making capabilities.
End of talk
MEMM & CRF Related Work
• Maximum entropy for language tasks:
– Language modeling [Rosenfeld ’94, Chen & Rosenfeld ’99]
– Part-of-speech tagging [Ratnaparkhi ’98]
– Segmentation [Beeferman, Berger & Lafferty ’99]
– Named entity recognition “MENE” [Borthwick, Grishman, … ’98]
• HMMs for similar language tasks
– Part of speech tagging [Kupiec ’92]
– Named entity recognition [Bikel et al ’99]
– Other Information Extraction [Leek ’97], [Freitag & McCallum ’99]
• Serial Generative/Discriminative Approaches
– Speech recognition [Schwartz & Austin ’93]
– Reranking Parses [Collins ’00]
• Other conditional Markov models
– Non-probabilistic local decision models [Brill ’95], [Roth ’98]
– Gradient-descent on state path [LeCun et al ’98]
– Markov Processes on Curves (MPCs) [Saul & Rahim ’99]
– Voted Perceptron-trained FSMs [Collins ’02]
Voted Perceptron Sequence Models
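Given training data: $\{\langle o, s \rangle^{(i)}\}$
Initialize parameters to zero: $\forall k \;\; \lambda_k = 0$
Iterate to convergence:
  for all training instances, $i$:
    $s_{\text{Viterbi}} = \arg\max_{s} \exp\Big( \sum_t \sum_k \lambda_k f_k(s_t, s_{t-1}, o^{(i)}, t) \Big)$
    $\forall k: \;\; \lambda_k = \lambda_k + C_k(s^{(i)}, o^{(i)}) - C_k(s_{\text{Viterbi}}, o^{(i)})$
    where $C_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t)$ as before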
[Collins 2002]
Like CRFs with stochastic gradient ascent and a Viterbi approximation.
Avoids calculating the partition function (normalizer), Zo, but uses gradient ascent, not a 2nd-order or conjugate gradient method.
The parameter update is analogous to the gradient for this one training instance.
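The perceptron update can be sketched as follows. This is the plain structured perceptron loop, without the voting or averaging step that gives the voted perceptron its name, and the feature function is a hypothetical stand-in with two feature templates.

```python
from collections import Counter


def features(prev, cur, obs, t):
    """Hypothetical feature function: names of the features active at position t."""
    kind = "digits" if sum(c.isdigit() for c in obs[t]) > 3 else "text"
    return [f"trans:{prev}>{cur}", f"obs:{kind}|{cur}"]


def counts(states, obs):
    """C_k(s, o): how often each feature k fires along the labeling s."""
    c, prev = Counter(), None
    for t, cur in enumerate(states):
        c.update(features(prev, cur, obs, t))
        prev = cur
    return c


def viterbi(obs, labels, weights):
    """Highest-scoring labeling under the current weights."""
    best = {s: (sum(weights[f] for f in features(None, s, obs, 0)), [s]) for s in labels}
    for t in range(1, len(obs)):
        new = {}
        for cur in labels:
            new[cur] = max(
                (
                    (score + sum(weights[f] for f in features(prev, cur, obs, t)), path + [cur])
                    for prev, (score, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = new
    return max(best.values(), key=lambda item: item[0])[1]


def train(data, labels, epochs=10):
    """Perceptron training: add gold feature counts, subtract predicted ones."""
    weights = Counter()  # every lambda_k starts at zero
    for _ in range(epochs):
        mistakes = 0
        for obs, gold in data:
            guess = viterbi(obs, labels, weights)
            if guess != gold:
                mistakes += 1
                weights.update(counts(gold, obs))
                weights.subtract(counts(guess, obs))
        if mistakes == 0:  # converged on the training data
            break
    return weights


if __name__ == "__main__":
    labels = ["NonTable", "DataRow"]
    data = [(["Intro text here", "Cabbage 77,950 79,680", "Asparagus 79,530 77,730"],
             ["NonTable", "DataRow", "DataRow"])]
    w = train(data, labels)
    print(viterbi(data[0][0], labels, w))
```

No normalizer is ever computed; each update needs only one Viterbi pass over the training instance.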
Part-of-speech Tagging
The  asbestos  fiber  ,  crocidolite  ,  is   unusually  resilient  once
DT   NN        NN     ,  NN           ,  VBZ  RB         JJ         IN

it   enters  the  lungs  ,  with  even  brief  exposures  to  it   causing
PRP  VBZ     DT   NNS    ,  IN    RB    JJ     NNS        TO  PRP  VBG

symptoms  that  show  up  decades  later  ,  researchers  said  .
NNS       WDT   VBP   RP  NNS      JJ     ,  NNS          VBD   .
45 tags, 1M words training data, Penn Treebank
                                    Using spelling features*
       error    oov error           error    Δerr    oov error    Δerr
HMM    5.69%    45.99%
CRF    5.55%    48.05%              4.27%    -24%    23.76%       -50%
* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
[Pereira 2001 personal comm.]