A Survey of Web Information Extraction Systems

DESCRIPTION
A Survey of Web Information Extraction Systems. Chia-Hui Chang, National Central University, Sep. 22, 2005. Introduction: abundant information on the Web; static Web pages; searchable databases (the Deep Web); information integration; information for life, e.g. shopping agents, travel agents.

TRANSCRIPT
A Survey of Web Information Extraction Systems
Chia-Hui Chang
National Central University
Sep. 22, 2005
Introduction
• Abundant information on the Web
  – Static Web pages
  – Searchable databases: the Deep Web
• Information Integration
  – Information for life
    • e.g. shopping agents, travel agents
  – Data for research purposes
    • e.g. bioinformatics, auction economy
Introduction (Cont.)
• Information Extraction (IE)
  – identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form
• An IE task is defined by its input and output
An IE Task

Web Data Extraction
IE Systems
• Wrappers
  – Programs that perform the task of IE are referred to as extractors or wrappers.
• Wrapper Induction
  – Wrapper induction (WI) systems are software tools designed to generate wrappers.
Various IE Surveys
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell
Related Work: Time
• MUC Approaches
  – AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
• Post-MUC Approaches
  – WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]
Related Work: Automation Degree
• Hsu and Dung [1998]
  – hand-crafted wrappers using general programming languages,
  – specially designed programming languages or tools,
  – heuristic-based wrappers, and
  – WI approaches
Related Work: Automation Degree (Cont.)
• Chang and Kuo [2003]
  – systems that need programmers,
  – systems that need annotation examples,
  – annotation-free systems, and
  – semi-supervised systems
Related Work: Input and Extraction Rules
• Muslea [1999]
  – IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints.
  – The second class is wrapper induction systems, which rely on delimiter-based rules.
  – The third class also performs IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.
Related Work: Extraction Rules
• Kushmerick [2003]
  – Finite-state tools (regular expressions)
  – Relational learning tools (logic rules)
Related Work: Techniques
• Laender [2002]
  – Languages for wrapper development
  – HTML-aware tools
  – NLP-based tools
  – Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
  – Modeling-based tools
  – Ontology-based tools
• New criteria:
  – degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness
Related Work: Output Targets
• Sarawagi [VLDB 2002]
  – Record-level
  – Page-level
  – Site-level
Related Work: Usability
• Kuhlins and Tredwell [2002]
  – Commercial
  – Noncommercial
Three Dimensions
• Task Domain
  – Input (unstructured, semi-structured)
  – Output targets (record-level, page-level, site-level)
• Automation Degree
  – Programmer-involved, learning-based, or annotation-free approaches
• Techniques
  – Regular expression rules vs. Prolog-like logic rules
  – Deterministic finite-state transducers vs. probabilistic hidden Markov models
Task Domain: Input

Task Domain: Output
• Missing attributes
• Multi-valued attributes
• Multiple permutations
• Nested data objects
• Various templates for an attribute
• Common templates for various attributes
• Untokenized attributes
Classification by Automation Degree
• Manually
  – TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
  – WIEN, STALKER, SoftMealy
• Semi-supervised
  – IEPAD, OLERA
• Unsupervised
  – DeLa, RoadRunner, EXALG
Automation Degree
• Page-fetching support
• Annotation requirement
• Output support
• API support
Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used
A Survey of Contemporary IE Systems
• Manually-constructed IE tools
  – Programmer-aided
• Supervised IE systems
  – Label-based
• Semi-supervised IE systems
• Unsupervised IE systems
  – Annotation-free
[Figure: architecture of a wrapper induction system — through a GUI, the user supplies training Web pages (labeled in the supervised setting, unlabeled in the semi-supervised and unsupervised settings); the WI system generates a wrapper, which is applied to test pages to produce the extracted data.]
Manually-constructed IE Systems
• TSIMMIS [Hammer, et al., 1997]
• Minerva [Crescenzi, 1998]
• WebOQL [Arocena and Mendelzon, 1998]
• W4F [Sahuguet and Azavant, 2001]
• XWrap [Liu, et al., 2000]
A Running Example
TSIMMIS
• Each command is of the form [variables, source, pattern], where
  – source specifies the input text to be considered,
  – pattern specifies how to find the text of interest within the source, and
  – variables are a list of variables that hold the extracted results.
• Note:
  – # means "save in the variable"
  – * means "discard"
[ ["root", "get('pe1.html')", "#"],
  ["Book", "root", "*<body>#</body>"],
  ["BookName", "Book", "*</b>#<b>"],
  ["Reviews", "Book", "*<ol>#</ol>"],
  ["_Reviewer", "split(Reviews, '<li>')", "#"],
  ["Reviewer", "_Reviewer[0:0]", "#"],
  ["ReviewerName, Rating, Text", "Reviewer",
   "*</b>#<b>*</b>#<b>*</b>#*"] ]

Extracted result:
root complex {
  book_name string "Databases"
  reviews complex {
    Reviewer_Name string John
    Rating int 7
    Text string ...
  }
}
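The pattern semantics above ('*' discards, '#' captures) can be sketched as a small interpreter. This is a simplified illustration, not the actual TSIMMIS engine; the helper name `extract` and the sample page string are our own.

```python
import re

def extract(source, pattern):
    """Interpret a TSIMMIS-style pattern over 'source': literal text must
    match, '*' skips (discards) text, '#' captures text into a variable.
    Simplified sketch only, not the real TSIMMIS extractor."""
    regex = ""
    for piece in re.split(r"([*#])", pattern):
        if piece == "*":
            regex += ".*?"        # discard
        elif piece == "#":
            regex += "(.*?)"      # save in the variable
        else:
            regex += re.escape(piece)
    m = re.search(regex, source, re.DOTALL)
    return list(m.groups()) if m else []

page = "<html><body><b>Book Name</b>Databases<b>Reviews</b></body></html>"
book = extract(page, "*<body>#</body>")[0]   # like command 2 of the spec file
name = extract(book, "*</b>#<b>")[0]         # like command 3: the book name
print(name)  # → Databases
```

Note how command 3 runs on the *output* of command 2: each command's source names a previously bound variable, so the spec file forms a pipeline.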
Minerva
• The grammar used by Minerva is defined in an EBNF style

Page Book_Reviews
$Book_Reviews: <html><body> $Book </body></html>;
$Book : <b>Book Name</b> $bName <b>Reviews</b>
        [<ol> ( <li><b>Reviewer Name</b> $rName
                <b>Rating</b> $rate <b>Text</b> $text $TP )* </ol>];
$bName : *(?<b>);
$rName : *(?<b>);
$rate  : *(?<b>);
$text  : *(?</li>);
$TP : { $bName, $rName $rate $text }
END
WebOQL
Select [Z!'.Text]
From x in browse("pe2.html")', y in x', Z in y'
Where x.Tag = "ol" and Z.Text = "Reviewer Name"

[Figure: the hypertree for the running example — a Body node (Text: Book Name ...) whose children are <b>Book Name</b>, the text "Databases", <b>Reviews</b>, and an OL node; each LI child of the OL contains <b>Reviewer Name</b>, "John", <b>Rating</b>, "7", <b>Text</b>, and the review text.]
W4F
• WYSIWYG support
• Java toolkit
• Extraction rules
  – HTML parse tree (DOM object)
    • e.g. html.body.ol[0].li[*].pcdata[0].txt
  – Regular expressions to address finer pieces of information
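The path-addressing idea behind rules like html.body.ol[0].li[*].pcdata[0].txt can be sketched over a toy tree. The `follow` helper and the dict-based tree are hypothetical stand-ins, not W4F's actual HEL language or its Java DOM implementation.

```python
# A toy tree loosely mirroring the running example's structure.
tree = {
    "tag": "html",
    "children": [
        {"tag": "body", "children": [
            {"tag": "ol", "children": [
                {"tag": "li", "text": "John 7 ..."},
                {"tag": "li", "text": "Jane 6 ..."},
            ]},
        ]},
    ],
}

def follow(node, path):
    """Evaluate a simplified W4F-style path such as 'body.ol[0].li[*]':
    '[i]' selects the i-th child with that tag, '[*]' fans out to all of
    them, and no index means the first match. Illustrative sketch only."""
    nodes = [node]
    for step in path.split("."):
        name, _, idx = step.partition("[")
        next_nodes = []
        for n in nodes:
            matches = [c for c in n.get("children", []) if c["tag"] == name]
            if not idx:                     # no index: first match
                next_nodes += matches[:1]
            elif idx.startswith("*"):       # [*]: all matches
                next_nodes += matches
            else:                           # [i]: the i-th match
                i = int(idx.rstrip("]"))
                if i < len(matches):
                    next_nodes.append(matches[i])
        nodes = next_nodes
    return nodes

print([n["text"] for n in follow(tree, "body.ol[0].li[*]")])  # both <li> texts
```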
Supervised IE Systems
• SRV [Freitag, 1998]
• RAPIER [Califf and Mooney, 1998]
• WIEN [Kushmerick, 1997]
• WHISK [Soderland, 1999]
• NoDoSE [Adelberg, 1998]
• SoftMealy [Hsu and Dung, 1998]
• STALKER [Muslea, 1999]
• DEByE [Laender, 2002b]
SRV
• Single-slot information extraction
• Top-down (general-to-specific) relational learning algorithm
  – Positive examples
  – Negative examples
• Learning algorithm works like FOIL
  – Token-oriented features
  – Logic rules

Rating extraction rule:
  Length(=1), Every(numeric true), Every(in_list true).
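Applying such a learned rule amounts to checking each feature predicate against a candidate text fragment. In this sketch, the fragment is a list of tokens; the concrete value list behind in_list is an illustrative assumption (SRV learns it from the training data), and `rating_rule` is our own helper name.

```python
# Hypothetical stand-in for SRV's learned in_list feature values.
RATING_VALUES = {str(i) for i in range(1, 11)}

def rating_rule(fragment):
    """Check the slide's SRV rule against a candidate fragment:
    Length(=1), Every(numeric true), Every(in_list true)."""
    return (len(fragment) == 1
            and all(tok.isdigit() for tok in fragment)
            and all(tok in RATING_VALUES for tok in fragment))

print(rating_rule(["7"]), rating_rule(["Databases"]), rating_rule(["7", "2"]))
# → True False False
```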
RAPIER
• Field-level (single-slot) data extraction
• Bottom-up (specific-to-general) learning
• The extraction rules consist of 3 parts:
  – Pre-filler
  – Slot-filler
  – Post-filler

Book Title extraction rule:
  Pre-filler    Slot-filler      Post-filler
  word: Book    Length = 2       word = <b>
  word: Name    Tag: [nn, nns]   word: </b>
WIEN
• LR Wrapper
  – ('Reviewer name </b>', '<b>', 'Rating </b>', '<b>', 'Text </b>', '</li>')
• HLRT Wrapper (Head LR Tail)
• OCLR Wrapper (Open-Close LR)
• HOCLRT Wrapper
• N-LR Wrapper (Nested LR)
• N-HLRT Wrapper (Nested HLRT)
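The LR wrapper's execution semantics can be sketched directly: for each attribute, scan to its left delimiter and capture up to its right delimiter, repeating until the page is exhausted. WIEN's actual contribution is *learning* the delimiter tuple from labeled pages; the `lr_extract` helper and the sample page string here are our own.

```python
def lr_extract(page, delimiters):
    """Execute a WIEN-style LR wrapper over 'page'. 'delimiters' is a list
    of (left, right) pairs, one per attribute. Sketch of execution only."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            i = page.find(left, pos)
            if i < 0:
                return records          # no further record: done
            i += len(left)
            j = page.find(right, i)
            if j < 0:
                return records
            record.append(page[i:j].strip())
            pos = j
        records.append(tuple(record))

page = ("<li><b>Reviewer Name </b>John<b> Rating </b>7<b> Text </b>...</li>"
        "<li><b>Reviewer Name </b>Jane<b> Rating </b>6<b> Text </b>...</li>")
wrapper = [("Reviewer Name </b>", "<b>"),     # the slide's delimiter tuple,
           ("Rating </b>", "<b>"),            # grouped into (left, right)
           ("Text </b>", "</li>")]            # pairs per attribute
print(lr_extract(page, wrapper))
```

Running it yields one tuple per review record, which is exactly why LR wrappers fail on missing attributes and permuted attribute orders: the delimiter sequence is fixed.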
WHISK
• Top-down (general-to-specific) learning
• Example
  – To generate 3-slot book reviews, it starts with the empty rule "*(*)*(*)*(*)*"
  – Each pair of parentheses indicates a phrase to be extracted
  – The phrase in the first pair of parentheses is bound to variable $1, the second to $2, etc.
  – The extraction logic is similar to WIEN's LR wrapper.

Pattern:: * 'Reviewer Name </b>' (Person) '<b>' * (Digit) '<b>Text</b>' (*) '</li>'
Output:: BookReview {Name $1} {Rating $2} {Comment $3}
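The WHISK pattern above maps closely onto a regular expression: literals become literal matches, '*' becomes a lazy skip, and each parenthesized slot becomes a capturing group. This regex rendering and the sample page are illustrative; the real system represents and learns rules in its own pattern language.

```python
import re

# Regex rendering of the slide's WHISK rule; (Person) and (Digit) are
# approximated by \w+ and \d+ here, an assumption for illustration.
whisk_rule = re.compile(
    r"Reviewer Name </b>(?P<name>\w+)<b>"    # (Person)  -> $1
    r".*?(?P<rating>\d+)"                    # * (Digit) -> $2
    r"<b>Text</b>(?P<comment>.*?)</li>",     # (*)       -> $3
    re.DOTALL)

page = "<li><b>Reviewer Name </b>John<b> Rating </b>7<b>Text</b>Great book</li>"
m = whisk_rule.search(page)
print(m.group("name"), m.group("rating"), m.group("comment"))
# → John 7 Great book
```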
NoDoSE
• Assumes the order of attributes within a record is fixed
• The user interacts with the system to decompose the input
• For the running example:
  – a book title (an attribute of type string) and
  – a list of Reviewers
    • RName (string), Rate (integer), and Text (string)
SoftMealy
• Finite-state transducer
• Contextual rules

[Figure: a finite transducer over states b, N, R, T, e (begin, Name, Rating, Text, end); separator rules such as s<b,N> and s<,R> trigger the transitions, which emit the next token into the corresponding attribute.]

s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
s<,R>R ::= Spc(-) Num(-)
s<R,>L ::= Num(-)
s<R,>R ::= NL(-) HTML(<b>)
STALKER
• Embedded Category Tree
• Multi-pass SoftMealy

Extraction rule for List(Reviewer): SkipTo(<ol>) SkipTo(</ol>)
Iteration rule for List(Reviewer): SkipTo(<li>) SkipTo(</li>)
Extraction rule for Rating: SkipTo(Rating </b>) SkipTo(<b>)

[Figure: the embedded category tree for the running example — the whole document contains Name and List(Reviewer); each Reviewer contains Name, Rate, and Text.]
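Applying a chain of SkipTo landmarks is simple to sketch: advance just past each landmark in turn. This shows rule *application* only; STALKER's algorithm learns which landmarks to use from labeled examples, and the `skip_to` helper and sample page are our own.

```python
def skip_to(page, landmarks, pos=0):
    """Apply a STALKER-style SkipTo(...) chain: advance just past each
    landmark in turn; return the final position, or -1 if one is missing."""
    for landmark in landmarks:
        i = page.find(landmark, pos)
        if i < 0:
            return -1
        pos = i + len(landmark)
    return pos

page = "<li><b>Reviewer Name </b>John<b> Rating </b>7<b> Text </b>...</li>"
# Extraction rule for Rating: SkipTo(Rating </b>) locates the start of the
# value; scanning forward to the next <b> bounds its end.
start = skip_to(page, ["Rating </b>"])
end = page.find("<b>", start)
print(page[start:end])  # → 7
```

The list and iteration rules compose the same way: one SkipTo chain isolates the whole list, and the iteration rule is re-applied inside it to peel off one record at a time.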
DEByE
• Bottom-up extraction strategy
• Comparison
  – DEByE: the user marks only atomic (attribute) values to assemble nested tables
  – NoDoSE: the user decomposes the whole document in a top-down fashion
Semi-supervised Approaches
• IEPAD [Chang and Lui, 2001]
• OLERA [Chang and Kuo, 2003]
• Thresher [Hogue, 2005]
IEPAD
• Encoding of the input page
• Multiple-record pages
  – Pattern mining by PAT tree
• Multiple string alignment
• For the running example:
  – <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
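The encoding step can be sketched in a few lines: keep HTML tags as-is and collapse every run of text into the single symbol T, producing exactly the abstract string shown above. IEPAD then mines repeated patterns over this string with a PAT tree and aligns the occurrences; the `encode` helper here is our own simplified version.

```python
import re

def encode(page):
    """IEPAD-style abstraction sketch: keep HTML tags, collapse every text
    run into the symbol 'T'. Whitespace-only runs are dropped."""
    out = []
    for token in re.split(r"(<[^>]+>)", page):
        if not token.strip():
            continue
        out.append(token if token.startswith("<") else "T")
    return "".join(out)

record = "<li><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>great</li>"
print(encode(record))  # → <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
```

Because every review record encodes to the same abstract string, the record boundary shows up as a maximal repeat in the encoded page, which is what the PAT tree discovers.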
OLERA
• Online extraction rule analysis
  – Enclosing
  – Drill-down / roll-up
  – Attribute assignment
Thresher
• Works similarly to OLERA
• Applies tree alignment instead of string alignment
Unsupervised Approaches
• RoadRunner [Crescenzi, 2001]
• DeLa [Wang, 2002; 2003]
• EXALG [Arasu and Garcia-Molina, 2003]
• DEPTA [Zhai, et al., 2005]
RoadRunner
• Input: multiple pages generated from the same template
• Matches two input pages at a time: the current wrapper is parsed against a sample page, and each mismatch generalizes the wrapper
  – String mismatches (e.g. "Databases" vs. "Data mining", "John" vs. "Jeff", "7" vs. "2") become #PCDATA data fields
  – Tag mismatches (e.g. an extra <LI>...</LI> before </OL> in the sample page) become repeated subexpressions, located by a terminal search for the matching tag

Wrapper (initially, from the first page):
<html><body><b>Book Name</b>Databases<b>Reviews</b>
<OL><LI><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>...</LI></OL>
</body></html>

Sample page (second page, with two reviews):
<html><body><b>Book Name</b>Data mining<b>Reviews</b>
<OL><LI><b>Reviewer Name</b>Jeff<b>Rating</b>2<b>Text</b>...</LI>
<LI><b>Reviewer Name</b>Jane<b>Rating</b>6<b>Text</b>...</LI></OL>
</body></html>

Wrapper after solving mismatches:
<html><body><b>Book Name</b>#PCDATA<b>Reviews</b>
<OL> ( <LI><b>Reviewer Name</b>#PCDATA<b>Rating</b>#PCDATA<b>Text</b>#PCDATA</LI> )+ </OL>
</body></html>
DeLa
• Similar to IEPAD
  – Works on a single input page
• Handles nested data structures
• Example
  – <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>
  – <P><A>T</A>T</P><P><A>T</A>T</P>
  – Generalized pattern: (<P>(<A>T</A>)*T</P>)*
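The generalized pattern can be tested directly as a Python regular expression: each record is <P>...</P> containing zero or more nested <A>T</A> items followed by a trailing text token T. The encoded page strings are taken from the example above.

```python
import re

# DeLa's generalized pattern from the example, as a regex over the
# encoded token string.
pattern = re.compile(r"(?:<P>(?:<A>T</A>)*T</P>)*")

encoded_pages = [
    "<P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>",  # records with 2 and 1 items
    "<P><A>T</A>T</P><P><A>T</A>T</P>",          # records with 1 item each
]
print([bool(pattern.fullmatch(p)) for p in encoded_pages])  # → [True, True]
print(bool(pattern.fullmatch("<P><A>T</A></P>")))           # → False (no trailing T)
```

The nesting in the pattern mirrors the nesting in the data: the inner star captures the repeated <A>T</A> sub-records, which is the structure IEPAD's flat alignment cannot express.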
EXALG
• Input: multiple pages with the same template
• Techniques:
  – Differentiating token roles
  – Equivalence classes (ECs) form a template
    • Tokens with the same occurrence vector
DEPTA
• Identify data regions
  – Allow mismatches between data records
• Identify data records
  – Data records may not be contiguous
• Identify data items
  – By partial tree alignment
Comparison
• How do we differentiate template tokens from data tokens?
  – DeLa and DEPTA assume HTML tags are template tokens while all others are data tokens
  – IEPAD and OLERA leave the problem to users
• How is the information from multiple pages applied?
  – DeLa and DEPTA conduct the mining on a single page
  – RoadRunner and EXALG do the analysis on multiple pages
Comparison (Cont.)
• Technique improvements
  – From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
  – From full alignment (IEPAD) to partial alignment (DEPTA)
Task Domain Comparison
• Page type
  – Structured, semi-structured, or free-text Web pages
• Non-HTML support
• Extraction level
  – Field-level, record-level, page-level
Task Domain Comparison (Cont.)
• Extraction target variation
  – Missing attributes, multi-valued attributes, multi-order attribute permutations
• Template variation
• Untokenized attributes
Columns follow the criteria on the Task Domain slides: NHS = non-HTML support; MA/MVA = missing / multi-valued attributes; MOA = multi-order attribute permutations; VF = various templates for an attribute; CT = common templates for various attributes; UTA = untokenized attributes.

Tools        Page Type  NHS  Extraction Level  MA/MVA  MOA      Nested      VF    CT   UTA
Manual
  Minerva    Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
  TSIMMIS    Semi-S     Yes  Record Level      Yes     No       Yes         Disj  No   No
  WebOQL     Semi-S     No   Record Level      Yes     Yes      Yes         Disj  No   No
  W4F        Temp       No   Record Level      Yes     Yes      Yes         SP    No   Yes
  XWRAP      Temp       No   Record Level      Yes     No       Yes         SP    No   Yes
Supervised
  RAPIER     Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
  SRV        Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
  WHISK      Free       Yes  Record Level      Yes     Yes      No          Disj  Yes  No
  NoDoSE     Semi-S     Yes  Page/Record       Yes     Limited  Yes         No    No   No
  DEByE      Semi-S     Yes  Record Level      Yes     Yes      Yes         Disj  No   No
  WIEN       Semi-S     Yes  Record Level      No      No       No          No    No   No
  STALKER    Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
  SoftMealy  Semi-S     Yes  Record Level      Yes     Limited  Multi-Pass  Disj  No   Yes
Semi-Supervised
  IEPAD      Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
  OLERA      Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
Un-Supervised
  DeLa       Temp       No   Record Level      Yes     Limited  Yes         Both  No   No
  RoadRunner Temp       No   Page Level        Yes     No       Yes         SP    No   No
  EXALG      Temp       Yes  Page Level        Yes     No       Yes         Both  No   No
  DEPTA      Temp       No   Record Level      Yes     No       Limited     Disj  No   No
Technique-based Comparison
• Scan pass
  – Single pass vs. multiple passes
• Extraction rule type
  – Regular expressions vs. logic rules
• Features used
  – DOM tree information, POS tags, etc.
• Learning algorithm
  – Machine learning vs. pattern mining
• Tokenization schemes
Tools        Scan Pass  Rule Type     Features Used             Learning Algorithm                         Tokenization Scheme
Minerva      Single     Regular exp.  HTML tags/Literal words   None                                       Manually
TSIMMIS      Single     Regular exp.  HTML tags/Literal words   None                                       Manually
WebOQL       Single     Regular exp.  Hypertree                 None                                       Manually
W4F          Single     Regular exp.  DOM tree path addressing  None                                       Tag Level
XWRAP        Single     Context-free  DOM tree                  None                                       Tag Level
RAPIER       Multiple   Logic rules   Syntactic/Semantic        ILP (bottom-up)                            Word Level
SRV          Multiple   Logic rules   Syntactic/Semantic        ILP (top-down)                             Word Level
WHISK        Single     Regular exp.  Syntactic/Semantic        Set covering (top-down)                    Word Level
NoDoSE       Single     Regular exp.  HTML tags/Literal words   Data modeling                              Word Level
DEByE        Single     Regular exp.  HTML tags/Literal words   Data modeling                              Word Level
WIEN         Single     Regular exp.  HTML tags/Literal words   Ad hoc (bottom-up)                         Word Level
STALKER      Multiple   Regular exp.  HTML tags/Literal words   Ad hoc (bottom-up)                         Word Level
SoftMealy    Both       Regular exp.  HTML tags/Literal words   Ad hoc (bottom-up)                         Word Level
IEPAD        Single     Regular exp.  HTML tags                 Pattern mining, string alignment           Multi-Level
OLERA        Single     Regular exp.  HTML tags                 String alignment                           Multi-Level
RoadRunner   Single     Regular exp.  HTML tags                 String alignment                           Tag Level
EXALG        Single     Regular exp.  HTML tags/Literal words   Equivalence classes, role differentiation  Word Level
DeLa         Single     Regular exp.  HTML tags                 Pattern mining                             Tag Level
Tools        GUI Support  Page-Fetching Support  Output Support  Training Examples  API Support
Minerva      No           No                     XML             No                 Yes
TSIMMIS      No           No                     Text            No                 Yes
WebOQL       No           No                     Text            No                 Yes
W4F          Yes          Yes                    XML             Labeled            Yes
XWRAP        Yes          Yes                    XML             Labeled            Yes
RAPIER       No           No                     Text            Labeled            No
SRV          No           No                     Text            Labeled            No
WHISK        No           No                     Text            Labeled            No
NoDoSE       Yes          No                     XML, OEM        Labeled            Yes
DEByE        Yes          Yes                    XML, SQL DB     Labeled            Yes
WIEN         Yes          No                     Text            Labeled            Yes
STALKER      Yes          No                     Text            Labeled            Yes
SoftMealy    Yes          Yes                    XML, SQL DB     Labeled            Yes
IEPAD        Yes          No                     Text            Unlabeled          No
OLERA        Yes          No                     XML             Unlabeled          No
RoadRunner   No           Yes                    XML             Unlabeled          Yes
EXALG        No           No                     Text            Unlabeled          No
DeLa         No           Yes                    Text            Unlabeled          Yes
Conclusion
• Criteria for evaluating IE systems from the task domain
• Comparison of IE systems by automation degree
• The use of various techniques in IE systems
Future Work
• Page fetching
  – XWrap, W4F, WNDL
• Schema mapping
  – Full information
  – Partial information
• Query interface integration
References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, "A Survey of Web Information Extraction Systems."