Web Information Extraction: From supervised learning to unsupervised learning
Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering,
National Central University, Taiwan
June 16, 2006
(Talk given at the Greater China Database Summit)
Motivation
- Abundant information on the Web
  - Static Web pages
  - Searchable databases: Deep Web
- The need for Web information integration
  - Comparison shopping agents
  - Citeseer for finding research papers
Web Information Integration
- We need programs/tools for
  - Web page crawling/collection
  - Information extraction (IE)
  - Schema matching
- Definition of IE
  - Identify relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form
  - An IE task is defined by its input and output
IE or WI Systems
- Wrappers
  - Programs that perform the task of IE are referred to as extractors or wrappers.
- Wrapper induction (WI) systems
  - WI (or IE) systems are software tools designed to generate wrappers.
Related Work: Various IE Surveys
- Time scale
  - MUC vs. post-MUC, or pre-Web vs. Web
- Automation degree
  - Programmer involved or general user involved [HD98]
- Task domain
  - Input [Muslea99, Sarawagi02]
  - Formatting degree, complexity, single slot/record [Cohen04]
- Techniques [Laender02]
  - Extraction rules [Muslea99, Kushmerick03]
Three Dimensions
- Task domain
  - Input (unstructured, semi-structured)
  - Output targets (record-level, page-level, site-level)
- Automation degree
  - Programmer-involved, annotation-based, or annotation-free approaches
- Techniques
  - Learning algorithm: specific-to-general or general-to-specific
  - Rule type: regular expression rules vs. logic rules
  - Deterministic finite-state transducers vs. probabilistic hidden Markov models
Task Domain: Input
Information Extraction From Free Texts
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
IE output:
    NAME              TITLE    ORGANIZATION
    Bill Gates        CEO      Microsoft
    Bill Veghte       VP       Microsoft
    Richard Stallman  founder  Free Soft..

Excerpted from Cohen & McCallum's talk.
IE from Semi-structured Documents
Posting from a newsgroup (ungrammatical snippets):

    Telecommunications, SOLARIS Systems Administrator, 38-44K, Immediate need

    Leading telecommunications firm in need of an energetic individual to fill the following position in the Atlanta office:

    SOLARIS SYSTEM ADMINISTRATOR
    Salary: 38-44K with full benefits
    Location: Atlanta Georgia, no relocation assistance provided

IE output:
    Title: SOLARIS Systems Administrator
    Salary: 38-44K
    State: Georgia
    City: Atlanta
    Platform: SOLARIS
    Area: telecommunications

Excerpted from Cohen & McCallum's talk.
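The slot-filling task on this posting can be sketched with a few hand-written rules. This is only an illustration: the patterns in `SLOT_PATTERNS` and the function `fill_slots` are hypothetical and are not taken from any of the systems surveyed in the talk.

```python
import re

# Posting text from the slide; the rules below are hypothetical,
# hand-written patterns used purely for illustration.
POSTING = """Telecommunications, SOLARIS Systems Administrator, 38-44K, Immediate need

Leading telecommunications firm in need of an energetic individual to fill the
following position in the Atlanta office:

SOLARIS SYSTEM ADMINISTRATOR
Salary: 38-44K with full benefits
Location: Atlanta Georgia, no relocation assistance provided
"""

# One regular expression per slot; group 1 holds the slot value.
SLOT_PATTERNS = {
    "Salary": r"Salary:\s*(\d+-\d+K)",
    "City": r"Location:\s*(\w+)\s+\w+",
    "State": r"Location:\s*\w+\s+(\w+)",
    "Platform": r"^(\w+) SYSTEM ADMINISTRATOR",
}


def fill_slots(text, patterns=SLOT_PATTERNS):
    """Return a record mapping each slot to its first match (or None)."""
    record = {}
    for slot, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        record[slot] = match.group(1) if match else None
    return record


record = fill_slots(POSTING)
```

Rules like these are brittle: they work only because this posting happens to use "Salary:" and "Location:" labels, which is the motivation for learning extraction rules instead of writing them by hand.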
IE from Semi-structured Documents
Richer formatting (non-template)
Excerpted from Cohen & McCallum's talk.
IE from Nearly-structured Documents
Template pages (example: Amazon.com book pages)
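Information Extraction from Template Pages
- Encoding: a CGI program fills a template T with values x from the database to produce the output pages
- Decoding: a reverse engineering of the encoding, recovering the data from the output pages
[Figure: Database -> CGI (T, x), using Template (T) -> Output pages]
Excerpted from EXALG.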
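The encoding/decoding view can be made concrete with a toy example. This is a minimal sketch under a strong assumption: the template T is already known, whereas the unsupervised systems discussed later must infer it from the pages. `TEMPLATE`, `encode`, and `decode` are hypothetical names introduced here.

```python
import re

# Hypothetical template T; "{name}" marks a slot filled from the database.
TEMPLATE = "<html><body><b>Book Name</b>{title}<b>Rating</b>{rating}</body></html>"


def encode(template, values):
    """Encoding: fill template T with one database tuple x."""
    return template.format(**values)


def decode(template, page):
    """Decoding: recover x from an output page, assuming T is known."""
    parts = re.split(r"\{(\w+)\}", template)
    # parts alternates literal text (even index) and slot names (odd index).
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.*?)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, page, flags=re.DOTALL)
    return match.groupdict() if match else None


x = {"title": "Databases", "rating": "7"}
page = encode(TEMPLATE, x)
recovered = decode(TEMPLATE, page)
```

Decoding here is a round trip: `decode(TEMPLATE, encode(TEMPLATE, x))` gives back `x`. The hard part in practice is that only the output pages are observed.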
Task Domain: Output
Input text: Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

- Slot-level ("named entity" extraction)
    Person: Jack Welch
    Person: Jeffrey Immelt
    Location: Connecticut
- Record-level
    Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt
- Site-level extraction
    Conference: KDD 2006
    Location: Philadelphia, PA
    Date: 20-23, Aug
    Accepted Paper: ………
    Authors: ………
The task can be difficult because of ...
- Missing attributes
- Multi-valued attributes
- Multiple permutations
- Nested data objects
- Various templates for an attribute
- Common templates for various attributes
- Untokenized attributes
Technologies
- Tokenization schemes
- Learning algorithms
- Features used
- Scan passes
- Extraction rule types
Classification by Automation Degree
Manual
Supervised
Semi-supervised
Unsupervised
An Example IE Task
Manually-constructed IE Systems
- TSIMMIS [Hammer, et al., 1997]
- Minerva [Crescenzi, 1998]
- WebOQL [Arocena and Mendelzon, 1998]
- W4F [Sahuguet and Azavant, 2001]
- XWrap [Liu, et al., 2000]
- …
TSIMMIS
Each command is of the form [variables, source, pattern], where
- source specifies the input text to be considered,
- pattern specifies how to find the text of interest within the source, and
- variables is a list of variables that hold the extracted results.
Notation: # means "save in the variable"; * means "discard".

(a) Specification file:
    [ [ "root", "get('pe1.html')", "#" ],
      [ "Book", "root", "*<body>#</body>" ],
      [ "BookName", "Book", "*</b>#<b>" ],
      [ "Reviews", "Book", "*<ol>#</ol>" ],
      [ "_Reviewer", "split(Reviews, '<li>')", "#" ],
      [ "Reviewer", "_Reviewer[0:0]", "#" ],
      [ "ReviewerName, Rating, Text", "Reviewer",
        "*</b>#<b>*</b>#<b>*</b>#*" ] ]

(b) Extracted output:
    root complex {
      book_name string "Databases"
      reviews complex {
        Reviewer_Name string John
        Rating int 7
        Text string …
      }
    }
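To make the `#`/`*` notation concrete, the sketch below translates such a pattern into a regular expression. It is a rough approximation and not TSIMMIS itself: `compile_pattern` and `extract` are hypothetical helpers, the sample page is made up to resemble the running example, and the last pattern ends in `#</li>` rather than the `#*` used on the slide.

```python
import re


def compile_pattern(pattern):
    """Translate a pattern with '*' (discard) and '#' (save) into a regex."""
    tokens = [t for t in re.split(r"([*#])", pattern) if t]
    parts = []
    for i, tok in enumerate(tokens):
        last = i == len(tokens) - 1
        if tok == "*":
            parts.append(".*" if last else ".*?")
        elif tok == "#":
            # A trailing '#' keeps everything up to the end of the source.
            parts.append("(.*)" if last else "(.*?)")
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts), re.DOTALL)


def extract(source, pattern):
    """Return the list of saved ('#') substrings, or [] if no match."""
    match = compile_pattern(pattern).match(source)
    return list(match.groups()) if match else []


# Hypothetical page resembling the running example.
page = (
    "<html><body><b>Book Name</b> Databases <b>Reviews</b>"
    "<ol><li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> ...</li></ol>"
    "</body></html>"
)

(book,) = extract(page, "*<body>#</body>")
(book_name,) = extract(book, "*</b>#<b>")
(reviews,) = extract(book, "*<ol>#</ol>")
reviewer = reviews.split("<li>")[1]
name, rating, text = extract(reviewer, "*</b>#<b>*</b>#<b>*</b>#</li>")
```

Each command consumes the result of an earlier one, which mirrors how the specification file threads `root`, `Book`, and `Reviews` through successive patterns.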
Minerva
The grammar used by Minerva is defined in an EBNF style.

    Page Book_Reviews
    $Book_Reviews : <html><body> $Book </body></html>;
    $Book : <b>Book Name</b> $bName <b>Reviews</b>
            [<ol> ( <li><b>Reviewer Name</b> $rName
                    <b>Rating</b> $rate
                    <b>Text</b> $text $TP )* </ol>];
    $bName : *(?<b>);
    $rName : *(?<b>);
    $rate : *(?<b>);
    $text : *(?</li>);
    $TP : { $bName, $rName $rate $text }
    END
WebOQL

    Select [ Z!'.Text ]
    From x in browse("pe2.html")', y in x', Z in y'
    Where x.Tag = "ol" and Z.Text = "Reviewer Name"

Hypertree for the page (indentation shows nesting):
    Tag: Body, Source: <Body>…</Body>, Text: Book Name …
      Tag: <b>, Source: <b>Book Name</b>, Text: Book Name
      Tag: NOTAG, Source: Databases, Text: Databases
      Tag: <b>, Source: <b>Reviews</b>, Text: Reviews
      Tag: OL, Source: <ol>…</ol>, Text: Reviewer Name …
        Tag: LI, Source: <li>…</li>, Text: Reviewer Name …
          Tag: <b>, Source: <b>Reviewer Name</b>, Text: Reviewer Name
          Tag: NOTAG, Source: John, Text: John
          Tag: <b>, Source: <b>Rating</b>, Text: Rating
          Tag: NOTAG, Source: 7, Text: 7
          Tag: <b>, Source: <b>Text</b>, Text: Text
          Tag: NOTAG, Source: …, Text: …
Supervised IE Systems
- SRV [Freitag, 1998]
- Rapier [Califf and Mooney, 1998]
- WIEN [Kushmerick, 1997]
- WHISK [Soderland, 1999]
- NoDoSE [Adelberg, 1998]
- SoftMealy [Hsu and Dung, 1998]
- Stalker [Muslea, 1999]
- DEByE [Laender, 2002b]
- …
WIEN
- Fixed frameworks for extractors
  - LR wrapper
  - HLRT wrapper (Head LR Tail)
  - OCLR wrapper (Open-Close LR)
  - HOCLRT wrapper
  - N-LR wrapper (Nested LR)
  - N-HLRT wrapper (Nested HLRT)
- Delimiter-based rules
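The simplest of these frameworks, the LR wrapper, can be sketched as a loop over (left, right) delimiter pairs, one pair per attribute. This is an illustrative sketch of executing an already-learned LR wrapper, not WIEN's induction algorithm; `exec_lr`, the page, and the delimiters are hypothetical.

```python
def exec_lr(page, delimiters):
    """Run an LR wrapper: delimiters is a list of (left, right) pairs,
    one per attribute; returns the list of extracted tuples."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end  # continue scanning after the extracted value
        records.append(tuple(record))


# Hypothetical page with two attributes per record.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
rows = exec_lr(page, [("<b>", "</b>"), ("<i>", "</i>")])
```

The richer classes refine this loop: HLRT adds head and tail delimiters so that matches outside the data region are skipped, and OCLR adds open/close delimiters around each record.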
Stalker
Embedded Category Tree

(a) Embedded category tree:
    Whole document
      Name
      List(Reviewer)
        Name
        Rate
        Text

(b) Extraction rules:
    Extraction rule for List(Reviewer): SkipTo(<ol>) SkipTo(</ol>)
    Iteration rule for List(Reviewer): SkipTo(<li>) SkipTo(</li>)
    Extraction rule for Rating: SkipTo(Rating </b>) SkipTo(<b>)
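The three rules above can be approximated with plain substring search. This sketch makes two assumptions that should be checked against the STALKER papers: the first SkipTo of an extraction rule scans forward from the start of the parent, and the second scans backward from its end. Landmarks are matched as raw substrings rather than token sequences, and all function names are hypothetical.

```python
def skip_to(text, landmarks, backward=False):
    """Consume text up to each landmark in turn; return the position
    reached, or None if some landmark is missing."""
    if backward:
        pos = len(text)
        for landmark in landmarks:
            pos = text.rfind(landmark, 0, pos)
            if pos == -1:
                return None
        return pos
    pos = 0
    for landmark in landmarks:
        idx = text.find(landmark, pos)
        if idx == -1:
            return None
        pos = idx + len(landmark)
    return pos


def extract(text, start_rule, end_rule):
    """Apply a forward start rule and a backward end rule."""
    start = skip_to(text, start_rule)
    end = skip_to(text, end_rule, backward=True)
    if start is None or end is None or start > end:
        return None
    return text[start:end]


def iterate(text, start_landmark, end_landmark):
    """Iteration rule: repeatedly split a list into its items."""
    items, pos = [], 0
    while True:
        start = text.find(start_landmark, pos)
        if start == -1:
            return items
        start += len(start_landmark)
        end = text.find(end_landmark, start)
        if end == -1:
            return items
        items.append(text[start:end])
        pos = end + len(end_landmark)


# Hypothetical page resembling the running example.
page = (
    "<html><body><b>Book Name</b> Databases <b>Reviews</b><ol>"
    "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine</li>"
    "</ol></body></html>"
)
reviewer_list = extract(page, ["<ol>"], ["</ol>"])
reviewers = iterate(reviewer_list, "<li>", "</li>")
ratings = [extract(r, ["Rating</b>"], ["<b>"]).strip() for r in reviewers]
```

The extraction proceeds top-down along the embedded category tree: first the list, then each reviewer, then the rating inside each reviewer.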
SoftMealy
- Finite-state transducer
- Contextual rules instead of delimiter-based rules

[Figure: finite-state transducer over the states b, N, R, T, and e. Transitions are labelled s<b,N>/"N="+next_token, s<,R>/"R="+next_token, s<,T>/"T="+next_token, ?/next_token, ?/ε, s<N,>/ε, s<R,e>/ε, and s<T,e>/ε.]

Contextual rules for the separators around Rating (R):
    s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
    s<,R>R ::= Spc(-) Num(-)
    s<R,>L ::= Num(-)
    s<R,>R ::= NL(-) HTML(<b>)
DEByE
- Bottom-up extraction strategy
- Comparison
  - DEByE: the user marks only atomic (attribute) values to assemble nested tables
  - Stalker: the user decomposes the whole document in a top-down fashion
Semi-supervised Approaches
- IEPAD [Chang and Lui, 2001]
- OLERA [Chang and Kuo, 2003]
- Thresher [Hogue, 2005]
IEPAD
- Encoding of the input page
- Pattern discovery (mining)
  - Pattern (string) mining by PAT tree/suffix tree
- Multiple string alignment

Encoding of the running example pe4 (T stands for a text token):
    <html><body><b>T</b>T<b>T</b><ol>
    <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
    <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
    <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
    </ol></body></html>

Multiple string alignment:
    <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
    <li><b>T</b>T<b>T</b>T<b>T </li>
    <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
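The encode-then-mine idea can be sketched as follows. Note the substitution: IEPAD mines repeats with a PAT tree, while this sketch uses a brute-force search for the tandem repeat that covers the most tokens, which is only practical for small pages. `encode` and `best_tandem_repeat` are hypothetical names, and the page is made up to match the encoding shown above.

```python
import re


def encode(html):
    """Abstract a page into tokens: tags are kept, text runs become 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag)
        elif text.strip():
            tokens.append("T")
    return tokens


def best_tandem_repeat(tokens):
    """Brute-force stand-in for PAT-tree mining: return (pattern, repeats)
    for the consecutive repeat that covers the most tokens."""
    best = (0, 0, 0)  # (covered tokens, start, period)
    n = len(tokens)
    for period in range(1, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            unit = tokens[start:start + period]
            reps = 1
            while tokens[start + reps * period:start + (reps + 1) * period] == unit:
                reps += 1
            if reps >= 2 and reps * period > best[0]:
                best = (reps * period, start, period)
    covered, start, period = best
    if period == 0:
        return [], 0
    return tokens[start:start + period], covered // period


# Hypothetical page with three reviews, mirroring the running example.
review = "<li><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>Good</li>"
page = (
    "<html><body><b>Book Name</b>Databases<b>Reviews</b><ol>"
    + review * 3
    + "</ol></body></html>"
)
pattern, repeats = best_tandem_repeat(encode(page))
```

Because text is abstracted to `T`, records with different content still encode to the same token string, which is what lets a repeat finder discover the record pattern without labelled examples.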
Unsupervised Approaches
- RoadRunner [Crescenzi, 2001]
- DeLa [Wang, 2002; 2003]
- EXALG [Arasu and Garcia-Molina, 2003]
- DEPTA [Zhai, et al., 2005]
DeLa
- Similar to IEPAD
  - Works on one input page
  - Handles nested data structures
- Example
    <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>
    <P><A>T</A>T</P><P><A>T</A>T</P>
  is generalized to the nested pattern
    (<P>(<A>T</A>)*T</P>)*
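A quick check that a nested pattern of this shape (written with a closing `</P>`) covers the example string, using Python's `re` module. Only the matching step is shown; how DeLa induces the pattern is not reproduced here.

```python
import re

# Nested pattern from the example: a list of <P> records, each holding
# a nested list of <A> items followed by one text token.
NESTED_PATTERN = re.compile(r"(<P>(<A>T</A>)*T</P>)*")

# Encoded example string with whitespace removed.
encoded = (
    "<P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>"
    "<P><A>T</A>T</P><P><A>T</A>T</P>"
)

matches_whole_string = NESTED_PATTERN.fullmatch(encoded) is not None

# Split into records, then count the nested <A> items in each one.
records = re.findall(r"<P>(?:<A>T</A>)*T</P>", encoded)
links_per_record = [record.count("<A>") for record in records]
```

The inner `*` is what distinguishes this from a flat pattern: the first record has two `<A>` items while the others have one, yet all four are instances of the same rule.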
DEPTA [Liu, et al., KDD03, WWW05]
- Identify data regions by mining tree patterns
  - Allow mismatches between data records
- Identify data records
  - Data records may not be contiguous
- Identify data items by partial tree alignment
RoadRunner
- Input: multiple pages with the same template
- Matches two input pages at a time

Wrapper (initially the first page):
    <html><body>
    <b> Book Name </b> Databases
    <b> Reviews </b>
    <OL>
      <LI> <b> Reviewer Name </b> John <b> Rating </b> 7 <b> Text </b> … </LI>
    </OL>
    </body></html>

Sample page (parsed against the wrapper):
    <html><body>
    <b> Book Name </b> Data mining
    <b> Reviews </b>
    <OL>
      <LI> <b> Reviewer Name </b> Jeff <b> Rating </b> 2 <b> Text </b> … </LI>
      <LI> <b> Reviewer Name </b> Jane <b> Rating </b> 6 <b> Text </b> … </LI>
    </OL>
    </body></html>

Mismatches found during parsing:
- String mismatches (Databases vs. Data mining, John vs. Jeff, 7 vs. 2, and the review text) are generalized to #PCDATA.
- The tag mismatch (</OL> in the wrapper vs. the second <LI> in the sample) is resolved by terminal search match, which identifies the repeated <LI> … </LI> block.

Wrapper after solving mismatches:
    <html><body><b> Book Name </b> #PCDATA <b> Reviews </b>
    <OL> ( <LI><b> Reviewer Name </b> #PCDATA <b> Rating </b> #PCDATA <b> Text </b> #PCDATA </LI> )+ </OL>
    </body></html>
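The string-mismatch half of this procedure can be sketched in a few lines. This is deliberately partial: it assumes the two pages have the same tag structure, so it raises an error on a tag mismatch instead of discovering iterators or optionals as RoadRunner does. `tokenize`, `generalize`, and both pages are hypothetical.

```python
import re


def tokenize(html):
    """Split a page into tag tokens and text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def generalize(wrapper, sample):
    """Resolve string mismatches only: differing text becomes #PCDATA.
    Tag mismatches (iterators, optionals) are not handled in this sketch."""
    if len(wrapper) != len(sample):
        raise ValueError("structures differ; tag mismatch handling needed")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif w.startswith("<") or s.startswith("<"):
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
        else:
            result.append("#PCDATA")
    return result


# Hypothetical pages with one review each, so only strings differ.
wrapper_page = (
    "<html><body><b>Book Name</b>Databases<b>Reviews</b><OL>"
    "<LI><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>Good</LI>"
    "</OL></body></html>"
)
sample_page = (
    "<html><body><b>Book Name</b>Data mining<b>Reviews</b><OL>"
    "<LI><b>Reviewer Name</b>Jeff<b>Rating</b>2<b>Text</b>Fine</LI>"
    "</OL></body></html>"
)
wrapper = generalize(tokenize(wrapper_page), tokenize(sample_page))
```

Feeding the generalized wrapper back in with a third page keeps the `#PCDATA` fields in place, since a `#PCDATA` token never equals page text and is not a tag.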
EXALG
- Input: multiple pages with the same template
- Techniques:
  - Differentiating token roles
  - Equivalence classes (ECs) form a template
    - An EC is a set of tokens with the same occurrence vector, e.g. <1,1,1,1>: {<html>, <body>, </body>, </html>}
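Occurrence vectors and the equivalence classes they induce can be computed directly. This sketch covers only the grouping step: EXALG also differentiates token roles and checks ordering and nesting before an EC is accepted, none of which is shown. `make_page`, `tokenize`, and `equivalence_classes` are hypothetical helpers.

```python
import re
from collections import defaultdict


def tokenize(page):
    """Split a page into tag tokens and text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def equivalence_classes(pages):
    """Group tokens by occurrence vector (count of the token in each page)."""
    tokenized = [tokenize(page) for page in pages]
    vocabulary = {token for tokens in tokenized for token in tokens}
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(tokens.count(token) for tokens in tokenized)
        classes[vector].add(token)
    return dict(classes)


def make_page(title, reviewers):
    """Hypothetical template used to generate the input pages."""
    items = "".join(f"<li><b>Reviewer Name</b>{r}</li>" for r in reviewers)
    return f"<html><body><b>Book Name</b>{title}<ol>{items}</ol></body></html>"


pages = [
    make_page("Databases", ["John"]),
    make_page("Data mining", ["Jeff", "Jane"]),
    make_page("Algorithms", ["Ann"]),
    make_page("Compilers", ["Bob", "Cat", "Dan"]),
]
classes = equivalence_classes(pages)
```

Tokens that occur exactly once per page land in the <1,1,1,1> class, while the tokens of the repeated review block share the vector <1,2,1,3>. Data values such as book titles end up in small classes that occur in only one page.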
On the use of techniques
- From supervised to unsupervised approaches
- From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
- From full alignment (IEPAD) to partial alignment (DEPTA)
Task domain comparison
- Page type
  - Structured, semi-structured, or free-text Web pages
- Non-HTML support
- Extraction level
  - Field-level, record-level, page-level
- Extraction target variation
  - Missing attributes, multi-valued attributes, multi-order attribute permutation
- Template variation
- Untokenized attributes
Task domain comparison of IE tools
Column groups: Extraction Targets Variation = MA/MVA, MOA, Nested; Template Variation = VF, CT.
Abbreviations: NHS = non-HTML support; MA/MVA = missing/multi-valued attributes; MOA = multi-order attributes; VF = various formats (templates) for an attribute; CT = common templates for various attributes; UTA = untokenized attributes.

Class            Tools       Page Type  NHS  Extraction Level  MA/MVA  MOA      Nested      VF    CT   UTA
Manual           Minerva     Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
Manual           TSIMMIS     Semi-S     Yes  Record Level      Yes     No       Yes         Disj  No   No
Manual           WebOQL      Semi-S     No   Record Level      Yes     Yes      Yes         Disj  No   No
Manual           W4F         Temp       No   Record Level      Yes     Yes      Yes         SP    No   Yes
Manual           XWRAP       Temp       No   Record Level      Yes     No       Yes         SP    No   Yes
Supervised       RAPIER      Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
Supervised       SRV         Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
Supervised       WHISK       Free       Yes  Record Level      Yes     Yes      No          Disj  Yes  No
Supervised       NoDoSE      Semi-S     Yes  Page/Record       Yes     Limited  Yes         No    No   No
Supervised       DEByE       Semi-S     Yes  Record Level      Yes     Yes      Yes         Disj  No   No
Supervised       WIEN        Semi-S     Yes  Record Level      No      No       No          No    No   No
Supervised       STALKER     Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
Supervised       SoftMealy   Semi-S     Yes  Record Level      Yes     Limited  Multi Pass  Disj  No   Yes
Semi-supervised  IEPAD       Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
Semi-supervised  OLERA       Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
Unsupervised     DeLa        Temp       No   Record Level      Yes     Limited  Yes         Both  No   No
Unsupervised     RoadRunner  Temp       No   Page Level        Yes     No       Yes         SP    No   No
Unsupervised     EXALG       Temp       Yes  Page Level        Yes     No       Yes         Both  No   No
Unsupervised     DEPTA       Temp       No   Record Level      Yes     No       Limited     Disj  No   No
Technique-based comparison
- Scan pass
  - Single pass vs. multiple passes
- Extraction rule type
  - Regular expressions vs. logic rules
- Features used
  - DOM tree information, POS tags, etc.
- Learning algorithm
  - Machine learning vs. pattern mining
- Tokenization schemes
Technique-based comparison of IE tools

Tools       Scan Pass  Extraction Rule Type  Features Used             Learning Algorithm                             Tokenization Schemes
Minerva     Single     Regular exp.          HTML tags/Literal words   None                                           Manually
TSIMMIS     Single     Regular exp.          HTML tags/Literal words   None                                           Manually
WebOQL      Single     Regular exp.          Hypertree                 None                                           Manually
W4F         Single     Regular exp.          DOM tree path addressing  None                                           Tag Level
XWRAP       Single     Context-Free          DOM tree                  None                                           Tag Level
RAPIER      Multiple   Logic rules           Syntactic/Semantic        ILP (bottom-up)                                Word Level
SRV         Multiple   Logic rules           Syntactic/Semantic        ILP (top-down)                                 Word Level
WHISK       Single     Regular exp.          Syntactic/Semantic        Set covering (top-down)                        Word Level
NoDoSE      Single     Regular exp.          HTML tags/Literal words   Data Modeling                                  Word Level
DEByE       Single     Regular exp.          HTML tags/Literal words   Data Modeling                                  Word Level
WIEN        Single     Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                             Word Level
STALKER     Multiple   Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                             Word Level
SoftMealy   Both       Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                             Word Level
IEPAD       Single     Regular exp.          HTML tags                 Pattern Mining, String Alignment               Multi-Level
OLERA       Single     Regular exp.          HTML tags                 String Alignment                               Multi-Level
DeLa        Single     Regular exp.          HTML tags                 Pattern Mining                                 Tag Level
DEPTA       Single     Tag Tree              HTML tags                 Tree Mining/Alignment                          Tag Level
RoadRunner  Single     Regular exp.          HTML tags                 String Alignment                               Tag Level
EXALG       Single     Regular exp.          HTML tags/Literal words   Equivalence Class and Role Differentiation     Word Level
Conclusion and Future Work
- Content of this talk
  - Compare IE systems by task domain
  - Classify IE systems by automation degree
  - Discuss the various techniques used in IE systems
- Future work
  - Combining equivalence classes (EC) with tree patterns
Thank You!