A Survey of Web Information Extraction Systems

DESCRIPTION
A Survey of Web Information Extraction Systems. Chia-Hui Chang, National Central University, Sep. 22, 2005. Introduction: abundant information on the Web; static Web pages; searchable databases (the Deep Web); information integration; information for life, e.g. shopping agents, travel agents.

TRANSCRIPT
A Survey of Web Information Extraction Systems
Chia-Hui Chang
National Central University
Sep. 22, 2005
Introduction
• Abundant information on the Web
  – Static Web pages
  – Searchable databases: the Deep Web
• Information Integration
  – Information for life
    • e.g. shopping agents, travel agents
  – Data for research purposes
    • e.g. bioinformatics, auction economy
Introduction (Cont.)
• Information Extraction (IE)
  – identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form
• An IE task is defined by its input and output
An IE Task

Web Data Extraction
IE Systems
• Wrappers
  – Programs that perform the task of IE are referred to as extractors or wrappers.
• Wrapper Induction
  – Wrapper induction (WI) systems are software tools designed to generate wrappers.
Various IE Surveys
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell
Related Work: Time
• MUC Approaches
  – AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
• Post-MUC Approaches
  – WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]
Related Work: Automation Degree
• Hsu and Dung [1998]
  – hand-crafted wrappers using general programming languages,
  – specially designed programming languages or tools,
  – heuristic-based wrappers, and
  – WI approaches
Related Work: Automation Degree (Cont.)
• Chang and Kuo [2003]
  – systems that need programmers,
  – systems that need annotation examples,
  – annotation-free systems, and
  – semi-supervised systems
Related Work: Input and Extraction Rules
• Muslea [1999]
  – IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints.
  – The second class is wrapper induction systems, which rely on delimiter-based rules.
  – The third class also performs IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.
Related Work: Extraction Rules
• Kushmerick [2003]
  – Finite-state tools (regular expressions)
  – Relational learning tools (logic rules)
Related Work: Techniques
• Laender [2002]
  – Languages for wrapper development
  – HTML-aware tools
  – NLP-based tools
  – Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
  – Modeling-based tools
  – Ontology-based tools
• New criteria:
  – degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness
Related Work: Output Targets
• Sarawagi [VLDB 2002]
  – Record-level
  – Page-level
  – Site-level
Related Work: Usability
• Kuhlins and Tredwell [2002]
  – Commercial
  – Noncommercial
Three Dimensions
• Task Domain
  – Input (unstructured, semi-structured)
  – Output targets (record-level, page-level, site-level)
• Automation Degree
  – Programmer-involved, learning-based, or annotation-free approaches
• Techniques
  – Regular expression rules vs. Prolog-like logic rules
  – Deterministic finite-state transducers vs. probabilistic hidden Markov models
Task Domain: Input

Task Domain: Output
• Missing attributes
• Multi-valued attributes
• Multiple permutations
• Nested data objects
• Various templates for an attribute
• Common templates for various attributes
• Untokenized attributes
Classification by Automation Degree
• Manually
  – TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
  – WIEN, STALKER, SoftMealy
• Semi-supervised
  – IEPAD, OLERA
• Unsupervised
  – DeLa, RoadRunner, EXALG
Automation Degree
• Page-fetching support
• Annotation requirement
• Output support
• API support
Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used
A Survey of Contemporary IE Systems
• Manually-constructed IE tools
  – Programmer-aided
• Supervised IE systems
  – Label-based
• Semi-supervised IE systems
• Unsupervised IE systems
  – Annotation-free
[Figure: architecture of a wrapper induction system — through a GUI, the user supplies training Web pages (labeled in the supervised setting, unlabeled in the semi-supervised and unsupervised settings); the WI system generates a wrapper, which is applied to test pages to produce the extracted data.]
Manually-constructed IE Systems
• TSIMMIS [Hammer, et al., 1997]
• Minerva [Crescenzi, 1998]
• WebOQL [Arocena and Mendelzon, 1998]
• W4F [Sahuguet and Azavant, 2001]
• XWrap [Liu, et al., 2000]
A Running Example
TSIMMIS
• Each command is of the form [variables, source, pattern], where
  – source specifies the input text to be considered,
  – pattern specifies how to find the text of interest within the source, and
  – variables are a list of variables that hold the extracted results.
• Note:
  – # means "save in the variable"
  – * means "discard"
[ ["root", "get('pe1.html')", "#"],
  ["Book", "root", "*<body>#</body>"],
  ["BookName", "Book", "*</b>#<b>"],
  ["Reviews", "Book", "*<ol>#</ol>"],
  ["_Reviewer", "split(Reviews, '<li>')", "#"],
  ["Reviewer", "_Reviewer[0:0]", "#"],
  ["ReviewerName, Rating, Text", "Reviewer",
   "*</b>#<b>*</b>#<b>*</b>#*"] ]

Extracted result:
root complex {
  book_name string "Databases"
  reviews complex {
    Reviewer_Name string John
    Rating int 7
    Text string ...
  }
}
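The pattern semantics above ('*' discards, '#' captures) can be sketched as a small interpreter. This is a simplified illustration, not the actual TSIMMIS engine; the helper name `extract` and the sample page string are our own.

```python
import re

def extract(source, pattern):
    """Interpret a TSIMMIS-style pattern over 'source': literal text must
    match, '*' skips (discards) text, '#' captures text into a variable.
    Simplified sketch only, not the real TSIMMIS extractor."""
    regex = ""
    for piece in re.split(r"([*#])", pattern):
        if piece == "*":
            regex += ".*?"        # discard
        elif piece == "#":
            regex += "(.*?)"      # save in the variable
        else:
            regex += re.escape(piece)
    m = re.search(regex, source, re.DOTALL)
    return list(m.groups()) if m else []

page = "<html><body><b>Book Name</b>Databases<b>Reviews</b></body></html>"
book = extract(page, "*<body>#</body>")[0]   # like command 2 of the spec file
name = extract(book, "*</b>#<b>")[0]         # like command 3: the book name
print(name)  # → Databases
```

Note how command 3 runs on the *output* of command 2: each command's source names a previously bound variable, so the spec file forms a pipeline.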
Minerva
• The grammar used by Minerva is defined in an EBNF style

Page Book_Reviews
$Book_Reviews: <html><body> $Book </body></html>;
$Book : <b>Book Name</b> $bName <b>Reviews</b>
        [<ol> ( <li><b>Reviewer Name</b> $rName
                <b>Rating</b> $rate <b>Text</b> $text $TP )* </ol>];
$bName : *(?<b>);
$rName : *(?<b>);
$rate  : *(?<b>);
$text  : *(?</li>);
$TP : { $bName, $rName $rate $text }
END
WebOQL
Select [Z!'.Text]
From x in browse("pe2.html")', y in x', Z in y'
Where x.Tag = "ol" and Z.Text = "Reviewer Name"

[Figure: the hypertree for the running example — a Body node (Text: Book Name ...) whose children are <b>Book Name</b>, the text "Databases", <b>Reviews</b>, and an OL node; each LI child of the OL contains <b>Reviewer Name</b>, "John", <b>Rating</b>, "7", <b>Text</b>, and the review text.]
W4F
• WYSIWYG support
• Java toolkit
• Extraction rules
  – HTML parse tree (DOM object)
    • e.g. html.body.ol[0].li[*].pcdata[0].txt
  – Regular expressions to address finer pieces of information
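The path-addressing idea behind rules like html.body.ol[0].li[*].pcdata[0].txt can be sketched over a toy tree. The `follow` helper and the dict-based tree are hypothetical stand-ins, not W4F's actual HEL language or its Java DOM implementation.

```python
# A toy tree loosely mirroring the running example's structure.
tree = {
    "tag": "html",
    "children": [
        {"tag": "body", "children": [
            {"tag": "ol", "children": [
                {"tag": "li", "text": "John 7 ..."},
                {"tag": "li", "text": "Jane 6 ..."},
            ]},
        ]},
    ],
}

def follow(node, path):
    """Evaluate a simplified W4F-style path such as 'body.ol[0].li[*]':
    '[i]' selects the i-th child with that tag, '[*]' fans out to all of
    them, and no index means the first match. Illustrative sketch only."""
    nodes = [node]
    for step in path.split("."):
        name, _, idx = step.partition("[")
        next_nodes = []
        for n in nodes:
            matches = [c for c in n.get("children", []) if c["tag"] == name]
            if not idx:                     # no index: first match
                next_nodes += matches[:1]
            elif idx.startswith("*"):       # [*]: all matches
                next_nodes += matches
            else:                           # [i]: the i-th match
                i = int(idx.rstrip("]"))
                if i < len(matches):
                    next_nodes.append(matches[i])
        nodes = next_nodes
    return nodes

print([n["text"] for n in follow(tree, "body.ol[0].li[*]")])  # both <li> texts
```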
Supervised IE Systems
• SRV [Freitag, 1998]
• RAPIER [Califf and Mooney, 1998]
• WIEN [Kushmerick, 1997]
• WHISK [Soderland, 1999]
• NoDoSE [Adelberg, 1998]
• SoftMealy [Hsu and Dung, 1998]
• STALKER [Muslea, 1999]
• DEByE [Laender, 2002b]
SRV
• Single-slot information extraction
• Top-down (general-to-specific) relational learning algorithm
  – Positive examples
  – Negative examples
• Learning algorithm works like FOIL
  – Token-oriented features
  – Logic rules

Rating extraction rule:
  Length(=1), Every(numeric true), Every(in_list true).
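Applying such a learned rule amounts to checking each feature predicate against a candidate text fragment. In this sketch, the fragment is a list of tokens; the concrete value list behind in_list is an illustrative assumption (SRV learns it from the training data), and `rating_rule` is our own helper name.

```python
# Hypothetical stand-in for SRV's learned in_list feature values.
RATING_VALUES = {str(i) for i in range(1, 11)}

def rating_rule(fragment):
    """Check the slide's SRV rule against a candidate fragment:
    Length(=1), Every(numeric true), Every(in_list true)."""
    return (len(fragment) == 1
            and all(tok.isdigit() for tok in fragment)
            and all(tok in RATING_VALUES for tok in fragment))

print(rating_rule(["7"]), rating_rule(["Databases"]), rating_rule(["7", "2"]))
# → True False False
```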
RAPIER
• Field-level (single-slot) data extraction
• Bottom-up (specific-to-general) learning
• The extraction rules consist of 3 parts:
  – Pre-filler
  – Slot-filler
  – Post-filler

Book Title extraction rule:
  Pre-filler    Slot-filler      Post-filler
  word: Book    Length = 2       word = <b>
  word: Name    Tag: [nn, nns]   word: </b>
WIEN
• LR Wrapper
  – ('Reviewer name </b>', '<b>', 'Rating </b>', '<b>', 'Text </b>', '</li>')
• HLRT Wrapper (Head LR Tail)
• OCLR Wrapper (Open-Close LR)
• HOCLRT Wrapper
• N-LR Wrapper (Nested LR)
• N-HLRT Wrapper (Nested HLRT)
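The LR wrapper's execution semantics can be sketched directly: for each attribute, scan to its left delimiter and capture up to its right delimiter, repeating until the page is exhausted. WIEN's actual contribution is *learning* the delimiter tuple from labeled pages; the `lr_extract` helper and the sample page string here are our own.

```python
def lr_extract(page, delimiters):
    """Execute a WIEN-style LR wrapper over 'page'. 'delimiters' is a list
    of (left, right) pairs, one per attribute. Sketch of execution only."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            i = page.find(left, pos)
            if i < 0:
                return records          # no further record: done
            i += len(left)
            j = page.find(right, i)
            if j < 0:
                return records
            record.append(page[i:j].strip())
            pos = j
        records.append(tuple(record))

page = ("<li><b>Reviewer Name </b>John<b> Rating </b>7<b> Text </b>...</li>"
        "<li><b>Reviewer Name </b>Jane<b> Rating </b>6<b> Text </b>...</li>")
wrapper = [("Reviewer Name </b>", "<b>"),     # the slide's delimiter tuple,
           ("Rating </b>", "<b>"),            # grouped into (left, right)
           ("Text </b>", "</li>")]            # pairs per attribute
print(lr_extract(page, wrapper))
```

Running it yields one tuple per review record, which is exactly why LR wrappers fail on missing attributes and permuted attribute orders: the delimiter sequence is fixed.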
WHISK
• Top-down (general-to-specific) learning
• Example
  – To generate 3-slot book reviews, it starts with the empty rule "*(*)*(*)*(*)*"
  – Each pair of parentheses indicates a phrase to be extracted
  – The phrase in the first pair of parentheses is bound to variable $1, the second to $2, etc.
  – The extraction logic is similar to WIEN's LR wrapper.

Pattern:: * 'Reviewer Name </b>' (Person) '<b>' * (Digit) '<b>Text</b>' (*) '</li>'
Output:: BookReview {Name $1} {Rating $2} {Comment $3}
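The WHISK pattern above maps closely onto a regular expression: literals become literal matches, '*' becomes a lazy skip, and each parenthesized slot becomes a capturing group. This regex rendering and the sample page are illustrative; the real system represents and learns rules in its own pattern language.

```python
import re

# Regex rendering of the slide's WHISK rule; (Person) and (Digit) are
# approximated by \w+ and \d+ here, an assumption for illustration.
whisk_rule = re.compile(
    r"Reviewer Name </b>(?P<name>\w+)<b>"    # (Person)  -> $1
    r".*?(?P<rating>\d+)"                    # * (Digit) -> $2
    r"<b>Text</b>(?P<comment>.*?)</li>",     # (*)       -> $3
    re.DOTALL)

page = "<li><b>Reviewer Name </b>John<b> Rating </b>7<b>Text</b>Great book</li>"
m = whisk_rule.search(page)
print(m.group("name"), m.group("rating"), m.group("comment"))
# → John 7 Great book
```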
NoDoSE
• Assumes the order of attributes within a record is fixed
• The user interacts with the system to decompose the input
• For the running example:
  – a book title (an attribute of type string) and
  – a list of Reviewers
    • RName (string), Rate (integer), and Text (string)
SoftMealy
• Finite-state transducer
• Contextual rules

[Figure: a finite transducer over states b, N, R, T, e (begin, Name, Rating, Text, end); separator rules such as s<b,N> and s<,R> trigger the transitions, which emit the next token into the corresponding attribute.]

s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
s<,R>R ::= Spc(-) Num(-)
s<R,>L ::= Num(-)
s<R,>R ::= NL(-) HTML(<b>)
STALKER
• Embedded Category Tree
• Multi-pass SoftMealy

Extraction rule for List(Reviewer): SkipTo(<ol>) SkipTo(</ol>)
Iteration rule for List(Reviewer): SkipTo(<li>) SkipTo(</li>)
Extraction rule for Rating: SkipTo(Rating </b>) SkipTo(<b>)

[Figure: the embedded category tree for the running example — the whole document contains Name and List(Reviewer); each Reviewer contains Name, Rate, and Text.]
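Applying a chain of SkipTo landmarks is simple to sketch: advance just past each landmark in turn. This shows rule *application* only; STALKER's algorithm learns which landmarks to use from labeled examples, and the `skip_to` helper and sample page are our own.

```python
def skip_to(page, landmarks, pos=0):
    """Apply a STALKER-style SkipTo(...) chain: advance just past each
    landmark in turn; return the final position, or -1 if one is missing."""
    for landmark in landmarks:
        i = page.find(landmark, pos)
        if i < 0:
            return -1
        pos = i + len(landmark)
    return pos

page = "<li><b>Reviewer Name </b>John<b> Rating </b>7<b> Text </b>...</li>"
# Extraction rule for Rating: SkipTo(Rating </b>) locates the start of the
# value; scanning forward to the next <b> bounds its end.
start = skip_to(page, ["Rating </b>"])
end = page.find("<b>", start)
print(page[start:end])  # → 7
```

The list and iteration rules compose the same way: one SkipTo chain isolates the whole list, and the iteration rule is re-applied inside it to peel off one record at a time.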
DEByE
• Bottom-up extraction strategy
• Comparison
  – DEByE: the user marks only atomic (attribute) values to assemble nested tables
  – NoDoSE: the user decomposes the whole document in a top-down fashion
Semi-supervised Approaches
• IEPAD [Chang and Lui, 2001]
• OLERA [Chang and Kuo, 2003]
• Thresher [Hogue, 2005]
IEPAD
• Encoding of the input page
• Multiple-record pages
  – Pattern mining by PAT tree
• Multiple string alignment
• For the running example:
  – <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
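The encoding step can be sketched in a few lines: keep HTML tags as-is and collapse every run of text into the single symbol T, producing exactly the abstract string shown above. IEPAD then mines repeated patterns over this string with a PAT tree and aligns the occurrences; the `encode` helper here is our own simplified version.

```python
import re

def encode(page):
    """IEPAD-style abstraction sketch: keep HTML tags, collapse every text
    run into the symbol 'T'. Whitespace-only runs are dropped."""
    out = []
    for token in re.split(r"(<[^>]+>)", page):
        if not token.strip():
            continue
        out.append(token if token.startswith("<") else "T")
    return "".join(out)

record = "<li><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>great</li>"
print(encode(record))  # → <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
```

Because every review record encodes to the same abstract string, the record boundary shows up as a maximal repeat in the encoded page, which is what the PAT tree discovers.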
OLERA
• Online extraction rule analysis
  – Enclosing
  – Drill-down / roll-up
  – Attribute assignment
Thresher
• Works similarly to OLERA
• Applies tree alignment instead of string alignment
Unsupervised Approaches
• RoadRunner [Crescenzi, 2001]
• DeLa [Wang, 2002; 2003]
• EXALG [Arasu and Garcia-Molina, 2003]
• DEPTA [Zhai, et al., 2005]
RoadRunner
• Input: multiple pages generated from the same template
• Matches two input pages at a time: the current wrapper is parsed against a sample page, and each mismatch generalizes the wrapper
  – String mismatches (e.g. "Databases" vs. "Data mining", "John" vs. "Jeff", "7" vs. "2") become #PCDATA data fields
  – Tag mismatches (e.g. an extra <LI>...</LI> before </OL> in the sample page) become repeated subexpressions, located by a terminal search for the matching tag

Wrapper (initially, from the first page):
<html><body><b>Book Name</b>Databases<b>Reviews</b>
<OL><LI><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>...</LI></OL>
</body></html>

Sample page (second page, with two reviews):
<html><body><b>Book Name</b>Data mining<b>Reviews</b>
<OL><LI><b>Reviewer Name</b>Jeff<b>Rating</b>2<b>Text</b>...</LI>
<LI><b>Reviewer Name</b>Jane<b>Rating</b>6<b>Text</b>...</LI></OL>
</body></html>

Wrapper after solving mismatches:
<html><body><b>Book Name</b>#PCDATA<b>Reviews</b>
<OL> ( <LI><b>Reviewer Name</b>#PCDATA<b>Rating</b>#PCDATA<b>Text</b>#PCDATA</LI> )+ </OL>
</body></html>
DeLa
• Similar to IEPAD
  – Works on a single input page
• Handles nested data structures
• Example
  – <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>
  – <P><A>T</A>T</P><P><A>T</A>T</P>
  – Generalized pattern: (<P>(<A>T</A>)*T</P>)*
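The generalized pattern can be tested directly as a Python regular expression: each record is <P>...</P> containing zero or more nested <A>T</A> items followed by a trailing text token T. The encoded page strings are taken from the example above.

```python
import re

# DeLa's generalized pattern from the example, as a regex over the
# encoded token string.
pattern = re.compile(r"(?:<P>(?:<A>T</A>)*T</P>)*")

encoded_pages = [
    "<P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>",  # records with 2 and 1 items
    "<P><A>T</A>T</P><P><A>T</A>T</P>",          # records with 1 item each
]
print([bool(pattern.fullmatch(p)) for p in encoded_pages])  # → [True, True]
print(bool(pattern.fullmatch("<P><A>T</A></P>")))           # → False (no trailing T)
```

The nesting in the pattern mirrors the nesting in the data: the inner star captures the repeated <A>T</A> sub-records, which is the structure IEPAD's flat alignment cannot express.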
EXALG
• Input: multiple pages with the same template
• Techniques:
  – Differentiating token roles
  – Equivalence classes (ECs) form a template
    • Tokens with the same occurrence vector
DEPTA
• Identify data regions
  – Allow mismatches between data records
• Identify data records
  – Data records may not be contiguous
• Identify data items
  – By partial tree alignment
Comparison
• How do we differentiate template tokens from data tokens?
  – DeLa and DEPTA assume HTML tags are template tokens while all others are data tokens
  – IEPAD and OLERA leave the problem to users
• How is the information from multiple pages applied?
  – DeLa and DEPTA conduct the mining on a single page
  – RoadRunner and EXALG do the analysis on multiple pages
Comparison (Cont.)
• Technique improvements
  – From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
  – From full alignment (IEPAD) to partial alignment (DEPTA)
Task Domain Comparison
• Page type
  – Structured, semi-structured, or free-text Web pages
• Non-HTML support
• Extraction level
  – Field-level, record-level, page-level
Task Domain Comparison (Cont.)
• Extraction target variation
  – Missing attributes, multi-valued attributes, multi-order attribute permutations
• Template variation
• Untokenized attributes
Columns follow the criteria on the Task Domain slides: NHS = non-HTML support; MA/MVA = missing / multi-valued attributes; MOA = multi-order attribute permutations; VF = various templates for an attribute; CT = common templates for various attributes; UTA = untokenized attributes.

Tools        Page Type  NHS  Extraction Level  MA/MVA  MOA      Nested      VF    CT   UTA
Manual
  Minerva    Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
  TSIMMIS    Semi-S     Yes  Record Level      Yes     No       Yes         Disj  No   No
  WebOQL     Semi-S     No   Record Level      Yes     Yes      Yes         Disj  No   No
  W4F        Temp       No   Record Level      Yes     Yes      Yes         SP    No   Yes
  XWRAP      Temp       No   Record Level      Yes     No       Yes         SP    No   Yes
Supervised
  RAPIER     Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
  SRV        Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
  WHISK      Free       Yes  Record Level      Yes     Yes      No          Disj  Yes  No
  NoDoSE     Semi-S     Yes  Page/Record       Yes     Limited  Yes         No    No   No
  DEByE      Semi-S     Yes  Record Level      Yes     Yes      Yes         Disj  No   No
  WIEN       Semi-S     Yes  Record Level      No      No       No          No    No   No
  STALKER    Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
  SoftMealy  Semi-S     Yes  Record Level      Yes     Limited  Multi-Pass  Disj  No   Yes
Semi-Supervised
  IEPAD      Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
  OLERA      Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
Un-Supervised
  DeLa       Temp       No   Record Level      Yes     Limited  Yes         Both  No   No
  RoadRunner Temp       No   Page Level        Yes     No       Yes         SP    No   No
  EXALG      Temp       Yes  Page Level        Yes     No       Yes         Both  No   No
  DEPTA      Temp       No   Record Level      Yes     No       Limited     Disj  No   No
Technique-based Comparison
• Scan pass
  – Single pass vs. multiple passes
• Extraction rule type
  – Regular expressions vs. logic rules
• Features used
  – DOM tree information, POS tags, etc.
• Learning algorithm
  – Machine learning vs. pattern mining
• Tokenization schemes
Tools        Scan Pass  Rule Type     Features Used             Learning Algorithm                         Tokenization Scheme
Minerva      Single     Regular exp.  HTML tags/Literal words   None                                       Manually
TSIMMIS      Single     Regular exp.  HTML tags/Literal words   None                                       Manually
WebOQL       Single     Regular exp.  Hypertree                 None                                       Manually
W4F          Single     Regular exp.  DOM tree path addressing  None                                       Tag Level
XWRAP        Single     Context-free  DOM tree                  None                                       Tag Level
RAPIER       Multiple   Logic rules   Syntactic/Semantic        ILP (bottom-up)                            Word Level
SRV          Multiple   Logic rules   Syntactic/Semantic        ILP (top-down)                             Word Level
WHISK        Single     Regular exp.  Syntactic/Semantic        Set covering (top-down)                    Word Level
NoDoSE       Single     Regular exp.  HTML tags/Literal words   Data modeling                              Word Level
DEByE        Single     Regular exp.  HTML tags/Literal words   Data modeling                              Word Level
WIEN         Single     Regular exp.  HTML tags/Literal words   Ad hoc (bottom-up)                         Word Level
STALKER      Multiple   Regular exp.  HTML tags/Literal words   Ad hoc (bottom-up)                         Word Level
SoftMealy    Both       Regular exp.  HTML tags/Literal words   Ad hoc (bottom-up)                         Word Level
IEPAD        Single     Regular exp.  HTML tags                 Pattern mining, string alignment           Multi-Level
OLERA        Single     Regular exp.  HTML tags                 String alignment                           Multi-Level
RoadRunner   Single     Regular exp.  HTML tags                 String alignment                           Tag Level
EXALG        Single     Regular exp.  HTML tags/Literal words   Equivalence classes, role differentiation  Word Level
DeLa         Single     Regular exp.  HTML tags                 Pattern mining                             Tag Level
Tools        GUI Support  Page-Fetching Support  Output Support  Training Examples  API Support
Minerva      No           No                     XML             No                 Yes
TSIMMIS      No           No                     Text            No                 Yes
WebOQL       No           No                     Text            No                 Yes
W4F          Yes          Yes                    XML             Labeled            Yes
XWRAP        Yes          Yes                    XML             Labeled            Yes
RAPIER       No           No                     Text            Labeled            No
SRV          No           No                     Text            Labeled            No
WHISK        No           No                     Text            Labeled            No
NoDoSE       Yes          No                     XML, OEM        Labeled            Yes
DEByE        Yes          Yes                    XML, SQL DB     Labeled            Yes
WIEN         Yes          No                     Text            Labeled            Yes
STALKER      Yes          No                     Text            Labeled            Yes
SoftMealy    Yes          Yes                    XML, SQL DB     Labeled            Yes
IEPAD        Yes          No                     Text            Unlabeled          No
OLERA        Yes          No                     XML             Unlabeled          No
RoadRunner   No           Yes                    XML             Unlabeled          Yes
EXALG        No           No                     Text            Unlabeled          No
DeLa         No           Yes                    Text            Unlabeled          Yes
Conclusion
• Criteria for evaluating IE systems from the task domain
• Comparison of IE systems by automation degree
• The use of various techniques in IE systems
Future Work
• Page fetching
  – XWrap, W4F, WNDL
• Schema mapping
  – Full information
  – Partial information
• Query interface integration
References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, "A Survey of Web Information Extraction Systems."