A Survey of WEB Information Extraction Systems

Chia-Hui Chang
National Central University
Sep. 22, 2005

Page 1: A Survey of WEB Information Extraction Systems

A Survey of WEB Information Extraction Systems

Chia-Hui Chang
National Central University
Sep. 22, 2005

Page 2: A Survey of WEB Information Extraction Systems

Introduction
• Abundant information on the Web
  – Static Web pages
  – Searchable databases: Deep Web
• Information Integration
  – Information for life
    • e.g. shopping agents, travel agents
  – Data for research purposes
    • e.g. bioinformatics, auction economy

Page 3: A Survey of WEB Information Extraction Systems

Introduction (Cont.)
• Information Extraction (IE)
  – identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form
• An IE task is defined by its input and output

Page 4: A Survey of WEB Information Extraction Systems

An IE Task

Page 5: A Survey of WEB Information Extraction Systems

Web Data Extraction

[Figure; recoverable labels: Data, Record, Data Record]

Page 6: A Survey of WEB Information Extraction Systems

IE Systems
• Wrappers
  – Programs that perform the task of IE are referred to as extractors or wrappers.
• Wrapper Induction
  – IE systems are software tools that are designed to generate wrappers.

Page 7: A Survey of WEB Information Extraction Systems

Various IE Surveys
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell

Page 8: A Survey of WEB Information Extraction Systems

Related Work: Time
• MUC Approaches
  – AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
• Post-MUC Approaches
  – WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]

Page 9: A Survey of WEB Information Extraction Systems

Related Work: Automation Degree
• Hsu and Dung [1998]
  – hand-crafted wrappers using general programming languages
  – specially designed programming languages or tools
  – heuristic-based wrappers, and
  – WI approaches

Page 10: A Survey of WEB Information Extraction Systems

Related Work: Automation Degree
• Chang and Kuo [2003]
  – systems that need programmers,
  – systems that need annotation examples,
  – annotation-free systems and
  – semi-supervised systems

Page 11: A Survey of WEB Information Extraction Systems

Related Work: Input and Extraction Rules
• Muslea [1999]
  – The first class performs IE from free text, using extraction patterns that are mainly based on syntactic/semantic constraints.
  – The second class is wrapper induction systems, which rely on delimiter-based rules.
  – The third class also performs IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.

Page 12: A Survey of WEB Information Extraction Systems

Related Work: Extraction Rules
• Kushmerick [2003]
  – Finite-state tools (regular expressions)
  – Relational learning tools (logic rules)

Page 13: A Survey of WEB Information Extraction Systems

Related Work: Techniques
• Laender [2002]
  – Languages for wrapper development
  – HTML-aware tools
  – NLP-based tools
  – Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
  – Modeling-based tools
  – Ontology-based tools
• New Criteria:
  – degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.

Page 14: A Survey of WEB Information Extraction Systems

Related Work: Output Targets
• Sarawagi [VLDB 2002]
  – Record-level
  – Page-level
  – Site-level

Page 15: A Survey of WEB Information Extraction Systems

Related Work: Usability
• Kuhlins and Tredwell [2002]
  – Commercial
  – Noncommercial

Page 16: A Survey of WEB Information Extraction Systems

Three Dimensions
• Task Domain
  – Input (unstructured, semi-structured)
  – Output targets (record-level, page-level, site-level)
• Automation Degree
  – Programmer-involved, learning-based or annotation-free approaches
• Techniques
  – Regular expression rules vs. Prolog-like logic rules
  – Deterministic finite-state transducers vs. probabilistic hidden Markov models

Page 17: A Survey of WEB Information Extraction Systems

Task Domain: Input

Page 18: A Survey of WEB Information Extraction Systems

Task Domain: Output
• Missing Attributes
• Multi-valued Attributes
• Multiple Permutations
• Nested Data Objects
• Various Templates for an attribute
• Common Templates for various attributes
• Untokenized Attributes

Page 19: A Survey of WEB Information Extraction Systems

Classification by Automation Degree
• Manual
  – TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
  – WIEN, Stalker, Softmealy
• Semi-supervised
  – IEPAD, OLERA
• Unsupervised
  – DeLa, RoadRunner, EXALG

Page 20: A Survey of WEB Information Extraction Systems

Automation Degree
• Page-fetching Support
• Annotation Requirement
• Output Support
• API Support

Page 21: A Survey of WEB Information Extraction Systems

Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used

Page 22: A Survey of WEB Information Extraction Systems

A Survey of Contemporary IE Systems
• Manually-constructed IE tools
  – Programmer-aided
• Supervised IE systems
  – Label-based
• Semi-supervised IE systems
• Unsupervised IE systems
  – Annotation-free

Page 23: A Survey of WEB Information Extraction Systems

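[Figure: how users interact with a wrapper induction system at each automation degree. Recoverable labels: Manual, Supervised, Semi-Supervised, Un-Supervised, User, GUI, Un-Labeled Training Web Pages, Wrapper Induction System, Wrapper, Test Page, Extracted data]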

Page 24: A Survey of WEB Information Extraction Systems

Manually-constructed IE Systems
• TSIMMIS [Hammer, et al., 1997]
• Minerva [Crescenzi, 1998]
• WebOQL [Arocena and Mendelzon, 1998]
• W4F [Sahuguet and Azavant, 2001]
• XWrap [Liu, et al., 2000]

Page 25: A Survey of WEB Information Extraction Systems

A Running Example

Page 26: A Survey of WEB Information Extraction Systems

TSIMMIS
• Each command is of the form [variables, source, pattern], where
  – source specifies the input text to be considered,
  – pattern specifies how to find the text of interest within the source, and
  – variables is a list of variables that hold the extracted results.
• Note:
  – # means “save in the variable”
  – * means “discard”

Specification file:

  [ [ "root", "get('pe1.html')", "#" ],
    [ "Book", "root", "*<body>#</body>" ],
    [ "BookName", "Book", "*</b>#<b>" ],
    [ "Reviews", "Book", "*<ol>#</ol>" ],
    [ "_Reviewer", "split(Reviews, '<li>')", "#" ],
    [ "Reviewer", "_Reviewer[0:0]", "#" ],
    [ "ReviewerName, Rating, Text", "Reviewer",
      "*</b>#<b>*</b>#<b>*</b>#*" ] ]

Extracted output:

  root complex {
    book_name string "Databases"
    reviews complex {
      Reviewer_Name string John
      Rating int 7
      Text string …
    }
  }
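The `*` (discard) and `#` (save) markers map naturally onto regular expressions. The following is a minimal sketch of that reading, not TSIMMIS itself: the function names `compile_pattern` and `extract` and the sample HTML are assumptions made for illustration, and it only covers patterns that do not end in `#*`.

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Translate a '*' (discard) / '#' (save) pattern into a regex.

    Hypothetical helper: wildcards match lazily, except a wildcard in the
    final position, which consumes the rest of the source.
    """
    parts = []
    last = len(pattern) - 1
    for i, ch in enumerate(pattern):
        wildcard = ".*" if i == last else ".*?"
        if ch == "*":
            parts.append(wildcard)              # discard: no capture group
        elif ch == "#":
            parts.append("(" + wildcard + ")")  # save: capture group
        else:
            parts.append(re.escape(ch))         # literal delimiter text
    return re.compile("".join(parts), re.DOTALL)

def extract(source: str, pattern: str):
    """Return the saved pieces, or None if the pattern does not match."""
    match = compile_pattern(pattern).match(source)
    return list(match.groups()) if match else None

# Assumed sample page, loosely following the running example.
page = "<html><body><b>Book Name</b> Databases <b>Reviews</b></body></html>"
book = extract(page, "*<body>#</body>")[0]
book_name = extract(book, "*</b>#<b>")[0].strip()
print(book_name)
```

Each command in the specification file then amounts to one `extract` call whose source is the variable bound by an earlier command.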

Page 27: A Survey of WEB Information Extraction Systems

Minerva
• The grammar used by Minerva is defined in an EBNF style

  Page Book_Reviews
  $Book_Reviews: <html><body> $Book </body></html>;
  $Book: <b>Book Name</b> $bName <b>Reviews</b>
         [<ol> ( <li><b>Reviewer Name</b> $rName <b>Rating</b> $rate <b>Text</b> $text $TP )* </ol>];
  $bName: *(?<b>);
  $rName: *(?<b>);
  $rate: *(?<b>);
  $text: *(?</li>);
  $TP: { $bName, $rName $rate $text }
  END

Page 28: A Survey of WEB Information Extraction Systems

WebOQL

  Select [ Z!’.Text ]
  From x in browse(“pe2.html”)’, y in x’, Z in y’
  Where x.Tag = “ol” and Z.Text = “Reviewer Name”

[Figure: hypertree of the running example. Node contents recoverable from the slide:]

  Tag: Body, Source: <Body>…</Body>, Text: Book Name …
    Tag: <b>, Source: <b>Book Name</b>, Text: Book Name
    Tag: NOTAG, Source: Databases, Text: Databases
    Tag: <b>, Source: <b>Reviews</b>, Text: Reviews
    Tag: OL, Source: <ol>…</ol>, Text: Reviewer Name …
      Tag: LI, Source: <li>…</li>, Text: Reviewer Name …
        Tag: <b>, Source: <b>Reviewer Name</b>, Text: Reviewer Name
        Tag: NOTAG, Source: John, Text: John
        Tag: <b>, Source: <b>Rating</b>, Text: Rating
        Tag: NOTAG, Source: 7, Text: 7
        Tag: <b>, Source: <b>Text</b>, Text: Text
        Tag: NOTAG, Source: …, Text: …

Page 29: A Survey of WEB Information Extraction Systems

W4F
• WYSIWYG support
• Java toolkit
• Extraction rule
  – HTML parse tree (DOM object)
    • e.g. html.body.ol[0].li[*].pcdata[0].txt
  – Regular expression to address finer pieces of information

Page 30: A Survey of WEB Information Extraction Systems

Supervised IE Systems
• SRV [Freitag, 1998]
• Rapier [Califf and Mooney, 1998]
• WIEN [Kushmerick, 1997]
• WHISK [Soderland, 1999]
• NoDoSE [Adelberg, 1998]
• Softmealy [Hsu and Dung, 1998]
• Stalker [Muslea, 1999]
• DEByE [Laender, 2002b]

Page 31: A Survey of WEB Information Extraction Systems

SRV
• Single-slot information extraction
• Top-down (general to specific) relational learning algorithm
  – Positive examples
  – Negative examples
• Learning algorithm works like FOIL
  – Token-oriented features
  – Logic rule

Rating extraction rule :-
  Length(=1),
  Every(numeric true),
  Every(in_list true).
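To make the Rating rule concrete, the sketch below applies it as a predicate over candidate token fragments. It only illustrates how such a learned rule could be evaluated, not how SRV learns it; the token representation (a dict with `text` and `in_list` keys) and the function names are assumptions.

```python
from typing import Callable, Dict, List

Token = Dict[str, object]  # assumed shape: {"text": str, "in_list": bool}

def numeric(token: Token) -> bool:
    """Token-oriented feature: the token consists of digits only."""
    return str(token["text"]).isdigit()

def in_list(token: Token) -> bool:
    """Token-oriented feature: the token occurs inside an HTML list."""
    return bool(token["in_list"])

def rating_rule(fragment: List[Token]) -> bool:
    """Length(=1), Every(numeric true), Every(in_list true)."""
    return (
        len(fragment) == 1
        and all(numeric(t) for t in fragment)
        and all(in_list(t) for t in fragment)
    )

def apply_rule(tokens: List[Token],
               rule: Callable[[List[Token]], bool],
               max_len: int = 3) -> List[str]:
    """Test every fragment of up to max_len tokens and keep those the rule accepts."""
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            fragment = tokens[start:end]
            if rule(fragment):
                hits.append(" ".join(str(t["text"]) for t in fragment))
    return hits

tokens = [
    {"text": "Rating", "in_list": True},
    {"text": "7", "in_list": True},
    {"text": "2005", "in_list": False},  # numeric, but outside the list
]
print(apply_rule(tokens, rating_rule))
```

Because the rule fills a single slot, each attribute of the running example would need its own rule of this kind.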

Page 32: A Survey of WEB Information Extraction Systems

Rapier
• Field-level (single-slot) data extraction
• Bottom-up (specific to general)
• The extraction rules consist of 3 parts:
  – Pre-filler
  – Slot-filler
  – Post-filler

Book Title extraction rule :-

  Pre-filler      Slot-filler       Post-filler
  word: Book      Length=2          word: <b>
  word: Name      Tag: [nn, nns]
  word: </b>

Page 33: A Survey of WEB Information Extraction Systems

WIEN
• LR Wrapper
  – (‘Reviewer name </b>’, ‘<b>’, ‘Rating </b>’, ‘<b>’, ‘Text </b>’, ‘</li>’)
• HLRT Wrapper (Head LR Tail)
• OCLR Wrapper (Open-Close LR)
• HOCLRT Wrapper
• N-LR Wrapper (Nested LR)
• N-HLRT Wrapper (Nested HLRT)
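An LR wrapper is just a list of left/right delimiter pairs, one pair per attribute, applied repeatedly until the first left delimiter can no longer be found. The sketch below shows one way to execute such a wrapper; `run_lr_wrapper` and the sample page are assumptions for illustration, and the delimiters are written with the capitalization used in the sample page.

```python
from typing import List, Sequence, Tuple

def run_lr_wrapper(page: str,
                   delimiters: Sequence[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Execute an LR wrapper: for each (left, right) pair, skip to the left
    delimiter, then extract everything up to the right delimiter."""
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records          # incomplete record: stop here
            record.append(page[start:end].strip())
            pos = end
        records.append(tuple(record))

# Assumed sample page in the style of the running example.
page = (
    "<ol>"
    "<li><b>Reviewer Name </b> Jeff <b>Rating </b> 2 <b>Text </b> ok </li>"
    "<li><b>Reviewer Name </b> Jane <b>Rating </b> 6 <b>Text </b> good </li>"
    "</ol>"
)
lr = [("Reviewer Name </b>", "<b>"), ("Rating </b>", "<b>"), ("Text </b>", "</li>")]
print(run_lr_wrapper(page, lr))
```

The HLRT and OCLR variants add head/tail or open/close delimiters that restrict where this loop is allowed to run.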

Page 34: A Survey of WEB Information Extraction Systems

WHISK
• Top-down (general to specific) learning
• Example
  – To generate 3-slot book reviews, it starts with the empty rule “*(*)*(*)*(*)*”
  – Each pair of parentheses indicates a phrase to be extracted
  – The phrase in the first pair of parentheses is bound to variable $1, the second to $2, etc.
  – The extraction logic is similar to the LR wrapper for WIEN.

Pattern:: * ‘Reviewer Name </b>’ (Person) ‘<b>’ * (Digit) ‘<b>Text</b>’ (*) ‘</li>’
Output:: BookReview {Name $1} {Rating $2} {Comment $3}
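The learned pattern above can be read as a regular expression with three capture groups bound to $1, $2 and $3. The sketch below is a hand translation under stated assumptions: the semantic class Person is approximated by a capitalized word, Digit by `\d+`, and the sample page is invented for illustration.

```python
import re
from typing import Dict, List

# Hand translation of the pattern; the leading '*' is implied by searching.
REVIEW_PATTERN = re.compile(
    r"Reviewer Name </b>\s*([A-Z][a-z]+)\s*<b>"  # 'Reviewer Name </b>' (Person) '<b>'
    r".*?(\d+)\s*"                               # * (Digit)
    r"<b>Text</b>(.*?)</li>",                    # '<b>Text</b>' (*) '</li>'
    re.DOTALL,
)

def extract_reviews(page: str) -> List[Dict[str, object]]:
    """Fill the BookReview output template once per match."""
    return [
        {
            "Name": m.group(1),              # $1
            "Rating": int(m.group(2)),       # $2
            "Comment": m.group(3).strip(),   # $3
        }
        for m in REVIEW_PATTERN.finditer(page)
    ]

# Assumed sample page.
page = (
    "<ol>"
    "<li><b>Reviewer Name </b> John <b>Rating </b> 7 <b>Text</b> Nice read </li>"
    "<li><b>Reviewer Name </b> Jane <b>Rating </b> 6 <b>Text</b> Too long </li>"
    "</ol>"
)
print(extract_reviews(page))
```

Unlike the LR wrapper, one pattern extracts all three slots of a record at once, which is why WHISK is classed as a multi-slot system.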

Page 35: A Survey of WEB Information Extraction Systems

NoDoSE
• Assumes the order of attributes within a record to be fixed
• The user interacts with the system to decompose the input.
• For the running example
  – a book title (an attribute of type string) and
  – a list of Reviewer
    • RName (string), Rate (integer), and Text (string).

Page 36: A Survey of WEB Information Extraction Systems

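Softmealy
• Finite-state transducer
• Contextual rules

[Figure: finite-state transducer. State labels as extracted: b, e, N, N, R, R, T. Transition labels recoverable from the slide:]

  s<b,N> / “N=” + next_token
  s<,R> / “R=” + next_token
  s<,T> / “T=” + next_token
  ? / next_token (three transitions)
  ? / ε (three transitions)
  s<N, > / ε
  s<R, e> / ε
  s<T, e> / ε

Contextual rules:

  s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
  s<,R>R ::= Spc(-) Num(-)
  s<R,>L ::= Num(-)
  s<R,>R ::= NL(-) HTML(<b>)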

Page 37: A Survey of WEB Information Extraction Systems

Stalker
• Embedded Category Tree
• Multipass Softmealy

Embedded category tree:

  Whole document
    Name
    List(Reviewer)
      Name
      Rate
      Text

Rules:

  Extraction rule for List(Reviewer): SkipTo(<ol>) SkipTo(</ol>)
  Iteration rule for List(Reviewer): SkipTo(<li>) SkipTo(</li>)
  Extraction rule for Rating: SkipTo(Rating </b>) SkipTo(<b>)
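The three rules can be chained to walk down the embedded category tree: extract the list, iterate over its items, then extract Rating from each item. The sketch below assumes that the first SkipTo of an extraction rule scans forward from the start of the parent and the second scans backward from its end; the function names and the sample page are hypothetical.

```python
from typing import List, Optional

def extract(parent: str, start_landmark: str, end_landmark: str) -> Optional[str]:
    """Extraction rule: SkipTo(start) forward from the beginning of the parent,
    SkipTo(end) backward from its end (an assumption of this sketch)."""
    start = parent.find(start_landmark)
    if start == -1:
        return None
    start += len(start_landmark)
    end = parent.rfind(end_landmark)
    if end < start:
        return None
    return parent[start:end]

def iterate(parent: str, start_landmark: str, end_landmark: str) -> List[str]:
    """Iteration rule: apply the SkipTo pair repeatedly, left to right."""
    items, pos = [], 0
    while True:
        start = parent.find(start_landmark, pos)
        if start == -1:
            return items
        start += len(start_landmark)
        end = parent.find(end_landmark, start)
        if end == -1:
            return items
        items.append(parent[start:end])
        pos = end + len(end_landmark)

# Assumed sample page in the style of the running example.
page = (
    "<b>Book Name</b> Databases <b>Reviews</b><ol>"
    "<li><b>Reviewer Name</b> John <b>Rating </b> 7 <b>Text </b> fine</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating </b> 6 <b>Text </b> good</li>"
    "</ol>"
)
reviewer_list = extract(page, "<ol>", "</ol>")
reviewers = iterate(reviewer_list, "<li>", "</li>")
ratings = [extract(r, "Rating </b>", "<b>").strip() for r in reviewers]
print(ratings)
```

Each level of the tree is processed in its own pass over the text extracted by its parent, which is the sense in which the approach is multipass.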

Page 38: A Survey of WEB Information Extraction Systems

DEByE
• Bottom-up extraction strategy
• Comparison
  – DEByE: the user marks only atomic (attribute) values to assemble nested tables
  – NoDoSE: the user decomposes the whole document in a top-down fashion

Page 39: A Survey of WEB Information Extraction Systems

Semi-supervised Approaches
• IEPAD [Chang and Lui, 2001]
• OLERA [Chang and Kuo, 2003]
• Thresher [Hogue, 2005]

Page 40: A Survey of WEB Information Extraction Systems

IEPAD
• Encoding of the input page
• Multiple-record pages
  – Pattern mining by PAT tree
• Multiple string alignment
• For the running example
  – <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
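The encoding step and the idea of mining a repeated pattern can be illustrated on a small page. The sketch below encodes every text string as `T` and then looks for the longest block of tokens that is immediately repeated. Note that this brute-force search is a stand-in chosen for brevity, not the PAT-tree mining IEPAD uses; the function names and the sample page are assumptions.

```python
import re
from typing import List

def encode(html: str) -> List[str]:
    """Encode a page as a token string: tags are kept, text becomes 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag)
        elif text.strip():
            tokens.append("T")
    return tokens

def longest_adjacent_repeat(tokens: List[str]) -> List[str]:
    """Return the longest token block that is immediately followed by a copy
    of itself (brute force; a stand-in for PAT-tree pattern mining)."""
    n = len(tokens)
    for length in range(n // 2, 0, -1):
        for start in range(n - 2 * length + 1):
            if tokens[start:start + length] == tokens[start + length:start + 2 * length]:
                return tokens[start:start + length]
    return []

# Assumed sample page with two review records.
record = "<li><b>Reviewer Name</b> {} <b>Rating</b> {} <b>Text</b> {}</li>"
page = "<ol>" + record.format("Jeff", 2, "ok") + record.format("Jane", 6, "good") + "</ol>"
pattern = "".join(longest_adjacent_repeat(encode(page)))
print(pattern)
```

On this input the discovered pattern is the same one the slide gives for the running example; records with missing attributes are where multiple string alignment becomes necessary.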

Page 41: A Survey of WEB Information Extraction Systems

OLERA
• Online extraction rule analysis
  – Enclosing
  – Drill-down / Roll-up
  – Attribute Assignment

Page 42: A Survey of WEB Information Extraction Systems

Thresher
• Works similarly to OLERA
• Applies tree alignment instead of string alignment

Page 43: A Survey of WEB Information Extraction Systems

Unsupervised Approaches
• Roadrunner [Crescenzi, 2001]
• DeLa [Wang, 2002; 2003]
• EXALG [Arasu and Garcia-Molina, 2003]
• DEPTA [Zhai, et al., 2005]

Page 44: A Survey of WEB Information Extraction Systems

Roadrunner
• Input: multiple pages with the same template
• Match two input pages at one time

Wrapper (initially):

  <html><body>
  <b> Book Name </b> Databases
  <b> Reviews </b>
  <OL>
  <LI>
  <b> Reviewer Name </b> John
  <b> Rating </b> 7
  <b>Text </b> …
  </LI>
  </OL>
  </body></html>

Sample page:

  <html><body>
  <b> Book Name </b> Data mining
  <b> Reviews </b>
  <OL>
  <LI>
  <b> Reviewer Name </b> Jeff
  <b> Rating </b> 2
  <b>Text </b> …
  </LI>
  <LI>
  <b> Reviewer Name </b> Jane
  <b> Rating </b> 6
  <b>Text </b> …
  </LI>
  </OL>
  </body></html>

[Figure labels between the two listings: parsing; String mismatch (four times); tag mismatch; Terminal search match]

Wrapper after solving mismatch:

  <html><body>
  <b> Book Name </b> #PCDATA
  <b> Reviews </b>
  <OL>
  ( <LI><b> Reviewer Name </b> #PCDATA <b> Rating </b> #PCDATA <b> Text </b> #PCDATA </LI> )+
  </OL>
  </body></html>
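String mismatches are the simpler of the two cases: when the wrapper and the sample disagree on a text token, that position is generalized to `#PCDATA`. The sketch below handles only this case, for two token sequences of equal length; tag mismatches, which lead to iterators such as `( … )+` or to optionals, are deliberately left out. The function names are hypothetical.

```python
from typing import List

def is_tag(token: str) -> bool:
    """Treat anything of the form <...> as an HTML tag token."""
    return token.startswith("<") and token.endswith(">")

def solve_string_mismatches(wrapper: List[str], sample: List[str]) -> List[str]:
    """Generalize the wrapper against a sample with the same tag structure.

    Simplified sketch: equal tokens are kept, differing text tokens become
    '#PCDATA', and any tag mismatch is reported instead of being solved.
    """
    if len(wrapper) != len(sample):
        raise ValueError("different lengths: a tag mismatch would have to be solved")
    generalized = []
    for w, s in zip(wrapper, sample):
        if w == s:
            generalized.append(w)
        elif not is_tag(w) and not is_tag(s):
            generalized.append("#PCDATA")   # string mismatch: a data field
        else:
            raise ValueError(f"tag mismatch between {w!r} and {s!r}")
    return generalized

wrapper = ["<b>", "Book Name", "</b>", "Databases", "<b>", "Reviews", "</b>"]
sample = ["<b>", "Book Name", "</b>", "Data mining", "<b>", "Reviews", "</b>"]
print(solve_string_mismatches(wrapper, sample))
```

Text that agrees on both pages, such as "Book Name", stays in the wrapper as template text, while the differing book titles collapse into one data field.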

Page 45: A Survey of WEB Information Extraction Systems

DeLa
• Similar to IEPAD
  – Works for one input page
• Handles nested data structures
• Example
  – <P><A>T</A><A>T</A> T</P><P><A>T</A>T</P>
  – <P><A>T</A>T</P><P><A>T</A>T</P>
  – (<P>(<A>T</A>)*T</P>)*

Page 46: A Survey of WEB Information Extraction Systems

EXALG
• Input: multiple pages with the same template
• Techniques:
  – Differentiating token roles
  – Equivalence classes (EC) form a template
    • Tokens with the same occurrence vector
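An occurrence vector records how often a token appears on each input page, and tokens with identical vectors fall into the same equivalence class. The sketch below computes these classes for pre-tokenized pages; it omits role differentiation and the further filtering EXALG applies, and the function name and sample pages are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def equivalence_classes(pages: List[List[str]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group tokens by occurrence vector (count of the token on each page)."""
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(list)
    for token in sorted(vocabulary):
        vector = tuple(c[token] for c in counts)
        classes[vector].append(token)
    return dict(classes)

# Assumed sample: two tokenized pages generated from the same template.
pages = [
    ["<html>", "<li>", "a", "</li>", "<li>", "b", "</li>", "</html>"],
    ["<html>", "<li>", "c", "</li>", "</html>"],
]
for vector, tokens in equivalence_classes(pages).items():
    print(vector, tokens)
```

Here the page-level tags share the vector (1, 1) and the list-item tags share (2, 1), so they form two candidate template classes, whereas the data tokens appear on only one page each.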

Page 47: A Survey of WEB Information Extraction Systems

DEPTA
• Identify data regions
  – Allow mismatches between data records
• Identify data records
  – Data records may not be contiguous
• Identify data items
  – By partial tree alignment

Page 48: A Survey of WEB Information Extraction Systems

Comparison
• How do we differentiate template tokens from data tokens?
  – DeLa and DEPTA assume HTML tags are template tokens while all other tokens are data
  – IEPAD and OLERA leave the problem to users
• How is the information from multiple pages applied?
  – DeLa and DEPTA mine a single page
  – Roadrunner and EXALG analyze multiple pages

Page 49: A Survey of WEB Information Extraction Systems

Comparison (Cont.)
• Technique improvements
  – From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
  – From full alignment (IEPAD) to partial alignment (DEPTA)

Page 50: A Survey of WEB Information Extraction Systems

Task Domain Comparison
• Page type
  – structured, semi-structured or free-text Web pages
• Non-HTML support
• Extraction level
  – Field-level, record-level, page-level

Page 51: A Survey of WEB Information Extraction Systems

Task Domain Comparison (Cont.)
• Extraction target variation
  – Missing attributes, multi-valued attributes, multi-order attribute permutation
• Template variation
• Untokenized attributes

Page 52: A Survey of WEB Information Extraction Systems

Task domain comparison table. NHS: non-HTML support. Extraction targets variation: MA/MVA (missing / multi-valued attributes), MOA (multi-order attributes), Nested. Template variation: VF, CT. UTA: untokenized attributes.

Tools       Page Type  NHS  Extraction Level  MA/MVA  MOA      Nested      VF    CT   UTA

Manual
Minerva     Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
TSIMMIS     Semi-S     Yes  Record Level      Yes     No       Yes         Disj  No   No
WebOQL      Semi-S     No   Record Level      Yes     Yes      Yes         Disj  No   No
W4F         Temp       No   Record Level      Yes     Yes      Yes         SP    No   Yes
XWRAP       Temp       No   Record Level      Yes     No       Yes         SP    No   Yes

Supervised
RAPIER      Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
SRV         Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
WHISK       Free       Yes  Record Level      Yes     Yes      No          Disj  Yes  No
NoDoSE      Semi-S     Yes  Page/Record       Yes     Limited  Yes         No    No   No
DEByE       Semi-S     Yes  Record Level      Yes     Yes      Yes         Disj  No   No
WIEN        Semi-S     Yes  Record Level      No      No       No          No    No   No
STALKER     Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
SoftMealy   Semi-S     Yes  Record Level      Yes     Limited  Multi Pass  Disj  No   Yes

Semi-Supervised
IEPAD       Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
OLERA       Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes

Un-Supervised
DeLa        Temp       No   Record Level      Yes     Limited  Yes         Both  No   No
RoadRunner  Temp       No   Page Level        Yes     No       Yes         SP    No   No
EXALG       Temp       Yes  Page Level        Yes     No       Yes         Both  No   No
DEPTA       Temp       No   Record Level      Yes     No       Limited     Disj  No   No

Page 53: A Survey of WEB Information Extraction Systems

Technique-based Comparison
• Scan pass
  – Single pass vs. multiple pass
• Extraction rule type
  – Regular expression vs. logic rules
• Features used
  – DOM tree information, POS tags, etc.
• Learning algorithm
  – Machine learning vs. pattern mining
• Tokenization schemes

Page 54: A Survey of WEB Information Extraction Systems

Tools       Scan Pass  Extraction Rule Type  Features Used             Learning Algorithm                          Tokenization Schemes

Minerva     Single     Regular exp.          HTML tags/Literal words   None                                        Manually
TSIMMIS     Single     Regular exp.          HTML tags/Literal words   None                                        Manually
WebOQL      Single     Regular exp.          Hypertree                 None                                        Manually
W4F         Single     Regular exp.          DOM tree path addressing  None                                        Tag Level
XWRAP       Single     Context-Free          DOM tree                  None                                        Tag Level
RAPIER      Multiple   Logic rules           Syntactic/Semantic        ILP (bottom-up)                             Word Level
SRV         Multiple   Logic rules           Syntactic/Semantic        ILP (top-down)                              Word Level
WHISK       Single     Regular exp.          Syntactic/Semantic        Set covering (top-down)                     Word Level
NoDoSE      Single     Regular exp.          HTML tags/Literal words   Data Modeling                               Word Level
DEByE       Single     Regular exp.          HTML tags/Literal words   Data Modeling                               Word Level
WIEN        Single     Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                          Word Level
STALKER     Multiple   Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                          Word Level
SoftMealy   Both       Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                          Word Level
IEPAD       Single     Regular exp.          HTML tags                 Pattern Mining, String Alignment            Multi-Level
OLERA       Single     Regular exp.          HTML tags                 String Alignment                            Multi-Level
RoadRunner  Single     Regular exp.          HTML tags                 String Alignment                            Tag Level
EXALG       Single     Regular exp.          HTML tags/Literal words   Equivalent Class and Role Differentiation   Word Level
DeLa        Single     Regular exp.          HTML tags                 Pattern Mining                              Tag Level

Page 55: A Survey of WEB Information Extraction Systems

Tools       GUI Support  Page-Fetching Support  Output Support  Training Examples  API Support

Minerva     No           No                     XML             No                 Yes
TSIMMIS     No           No                     Text            No                 Yes
WebOQL      No           No                     Text            No                 Yes
W4F         Yes          Yes                    XML             Labeled            Yes
XWRAP       Yes          Yes                    XML             Labeled            Yes
RAPIER      No           No                     Text            Labeled            No
SRV         No           No                     Text            Labeled            No
WHISK       No           No                     Text            Labeled            No
NoDoSE      Yes          No                     XML, OEM        Labeled            Yes
DEByE       Yes          Yes                    XML, SQL DB     Labeled            Yes
WIEN        Yes          No                     Text            Labeled            Yes
STALKER     Yes          No                     Text            Labeled            Yes
SoftMealy   Yes          Yes                    XML, SQL DB     Labeled            Yes
IEPAD       Yes          No                     Text            Unlabeled          No
OLERA       Yes          No                     XML             Unlabeled          No
RoadRunner  No           Yes                    XML             Unlabeled          Yes
EXALG       No           No                     Text            Unlabeled          No
DeLa        No           Yes                    Text            Unlabeled          Yes

Page 56: A Survey of WEB Information Extraction Systems

Conclusion
• Criteria for evaluating IE systems from the task domain
• Comparison of IE systems by degree of automation
• The use of various techniques in IE systems

Page 57: A Survey of WEB Information Extraction Systems

Future Work
• Page Fetching
  – XWrap, W4F, WNDL
• Schema Mapping
  – Full information
  – Partial information
• Query Interface Integration

Page 58: A Survey of WEB Information Extraction Systems

References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, A survey of Web Information Extraction Systems.