
Web Information Extraction: From supervised learning to unsupervised learning

Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering,
National Central University, Taiwan
June 16, 2006

(Talk given at the Greater China Database Summit)


Motivation
- Abundant information on the Web
  - Static Web pages
  - Searchable databases: Deep Web
- The need for Web information integration
  - Comparison shopping agents
  - Citeseer for finding research papers


Web Information Integration
- We need programs/tools for
  - Web page crawling/collection
  - Information extraction (IE)
  - Schema matching
- Definition of IE
  - Identify relevant information from documents, pulling information from a variety of sources and aggregating it into a homogeneous form
  - An IE task is defined by its input and output


IE or WI Systems
- Wrappers
  - Programs that perform the task of IE are referred to as extractors or wrappers.
- Wrapper induction (WI) systems
  - WI systems, also called IE systems, are software tools designed to generate wrappers.


Related Work: Various IE Surveys
- Time scale
  - MUC vs. post-MUC, or pre-Web vs. Web
- Automation degree
  - Programmer involved or general user involved [HD98]
- Task domain
  - Input [Muslea99, Sarawagi02]
  - Formatting degree, complexity, single slot/record [Cohen04]
- Techniques [Laender02]
  - Extraction rules [Muslea99, Kushmerick03]


Three Dimensions
- Task domain
  - Input (unstructured, semi-structured)
  - Output targets (record-level, page-level, site-level)
- Automation degree
  - Programmer-involved, annotation-based, or annotation-free approaches
- Techniques
  - Learning algorithm: specific-to-general or general-to-specific
  - Rule type: regular expression rules vs. logic rules
  - Deterministic finite-state transducers vs. probabilistic hidden Markov models


Task Domain: Input


Information Extraction From Free Texts

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Excerpted from Cohen & McCallum's talk.


IE from Semi-structured Documents

Ungrammatical snippets: posting from a newsgroup

Telecommunications, SOLARIS Systems Administrator, 38-44K, Immediate need

Leading telecommunications firm in need of an energetic individual to fill the following position in the Atlanta office:

SOLARIS SYSTEM ADMINISTRATOR
Salary: 38-44K with full benefits
Location: Atlanta Georgia, no relocation assistance provided

IE output:

Title: SOLARIS Systems Administrator
Salary: 38-44K
State: Georgia
City: Atlanta
Platform: SOLARIS
Area: telecommunications

Excerpted from Cohen & McCallum's talk.


IE from Semi-structured Documents

Richer formatting (non-template)

Excerpted from Cohen & McCallum's talk.


IE from Nearly-structured Documents

Template pages: Amazon.com book pages


Information Extraction from Template Pages
- Encoding: Database + Template (T) -> CGI (T, x) -> Output pages
- Decoding: reverse engineering of the encoding

Excerpted from EXALG.


Task Domain: Output

Input text: Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Slot-level ("named entity" extraction)
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Record-level
  Relation: Succession
  Company: General Electric
  Title: CEO
  Out: Jack Welch
  In: Jeffrey Immelt

Site-level extraction
  Conference: KDD 2006
  Location: Philadelphia, PA
  Date: 20-23, Aug
  Accepted Paper: ………
  Authors: ………


The task can be difficult because …
- Missing attributes
- Multi-valued attributes
- Multiple permutations
- Nested data objects
- Various templates for an attribute
- Common templates for various attributes
- Untokenized attributes


Technologies
- Tokenization schemes
- Learning algorithms
- Features used
- Scan passes
- Extraction rule types


Classification by Automation Degree

Manual

Supervised

Semi-supervised

Unsupervised


An Example IE Task


Manually-constructed IE Systems
- TSIMMIS [Hammer, et al., 1997]
- Minerva [Crescenzi, 1998]
- WebOQL [Arocena and Mendelzon, 1998]
- W4F [Sahuguet and Azavant, 2001]
- XWrap [Liu, et al., 2000]
- …


TSIMMIS

Each command is of the form [variables, source, pattern], where
- source specifies the input text to be considered,
- pattern specifies how to find the text of interest within the source, and
- variables is a list of variables that hold the extracted results.

Notation: # means "save in the variable"; * means "discard".

(a) Specification file

  [ [ "root", "get('pe1.html')", "#" ],
    [ "Book", "root", "*<body>#</body>" ],
    [ "BookName", "Book", "*</b>#<b>" ],
    [ "Reviews", "Book", "*<ol>#</ol>" ],
    [ "_Reviewer", "split(Reviews, '<li>')", "#" ],
    [ "Reviewer", "_Reviewer[0:0]", "#" ],
    [ "ReviewerName, Rating, Text", "Reviewer",
      "*</b>#<b>*</b>#<b>*</b>#*" ] ]

(b) Extracted output

  root complex {
    book_name string "Databases"
    reviews complex {
      Reviewer_Name string John
      Rating int 7
      Text string …
    }
  }
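
To make the [variables, source, pattern] form concrete, here is a minimal Python sketch that interprets such patterns with regular expressions. It is an illustration under assumptions, not the TSIMMIS implementation: the helper names `compile_pattern` and `run_command`, the sample page, and the decision to let a trailing wildcard run to the end of the source are all hypothetical. The last pattern ends in `#</li>` rather than the slide's `#*` to keep the wildcard semantics unambiguous.

```python
import re

def compile_pattern(pattern):
    """Translate a TSIMMIS-style pattern into a regular expression.

    '*' discards text, '#' saves text, everything else is matched literally.
    """
    parts = []
    for piece in re.split(r"([*#])", pattern):
        if piece == "*":
            parts.append(".*?")        # discard: shortest match, not captured
        elif piece == "#":
            parts.append("(.*?)")      # save: shortest match, captured
        elif piece:
            parts.append(re.escape(piece))
    if pattern.endswith(("*", "#")):
        parts.append(r"\Z")            # assumption: a trailing wildcard runs to the end
    return re.compile("".join(parts), re.DOTALL)

def run_command(variables, source_text, pattern):
    """Apply one command to already-resolved source text; return the bindings."""
    names = [name.strip() for name in variables.split(",")]
    match = compile_pattern(pattern).match(source_text)
    if match is None:
        return {}
    return dict(zip(names, (value.strip() for value in match.groups())))

# Hypothetical page in the spirit of the running book-review example.
PAGE = (
    "<html><body><b>Book Name</b> Databases <b>Reviews</b>"
    "<ol><li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good</li></ol>"
    "</body></html>"
)

env = {"root": PAGE}
env.update(run_command("Book", env["root"], "*<body>#</body>"))
env.update(run_command("BookName", env["Book"], "*</b>#<b>"))
env.update(run_command("Reviews", env["Book"], "*<ol>#</ol>"))
first_review = env["Reviews"].split("<li>")[1]   # stands in for split(Reviews, '<li>')
env.update(run_command("ReviewerName, Rating, Text", first_review,
                       "*</b>#<b>*</b>#<b>*</b>#</li>"))
```

Each command only sees the text bound by an earlier command, which is how the specification walks from the whole page down to individual attributes.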


Minerva
- The grammar used by Minerva is defined in an EBNF style

  Page Book_Reviews
  $Book_Reviews: <html><body> $Book </body></html>;
  $Book: <b>Book Name</b> $bName <b>Reviews</b>
         [<ol> ( <li><b>Reviewer Name</b> $rName <b>Rating</b> $rate <b>Text</b> $text $TP )* </ol>];
  $bName: *(?<b>);
  $rName: *(?<b>);
  $rate: *(?<b>);
  $text: *(?</li>);
  $TP: { $bName, $rName $rate $text }
  END


WebOQL

  Select [ Z!'.Text ]
  From x in browse("pe2.html")', y in x', Z in y'
  Where x.Tag = "ol" and Z.Text = "Reviewer Name"

Hypertree for the page (each node has Tag, Source, and Text):

  Tag: Body, Source: <Body>…</Body>, Text: Book Name …
    Tag: <b>, Source: <b>Book Name</b>, Text: Book Name
    Tag: NOTAG, Source: Databases, Text: Databases
    Tag: <b>, Source: <b>Reviews</b>, Text: Reviews
    Tag: OL, Source: <ol>…</ol>, Text: Reviewer Name …
      Tag: LI, Source: <li>…</li>, Text: Reviewer Name …
        Tag: <b>, Source: <b>Reviewer Name</b>, Text: Reviewer Name
        Tag: NOTAG, Source: John, Text: John
        Tag: <b>, Source: <b>Rating</b>, Text: Rating
        Tag: NOTAG, Source: 7, Text: 7
        Tag: <b>, Source: <b>Text</b>, Text: Text
        Tag: NOTAG, Source: …, Text: …


Supervised IE Systems
- SRV [Freitag, 1998]
- Rapier [Califf and Mooney, 1998]
- WIEN [Kushmerick, 1997]
- WHISK [Soderland, 1999]
- NoDoSE [Adelberg, 1998]
- SoftMealy [Hsu and Dung, 1998]
- Stalker [Muslea, 1999]
- DEByE [Laender, 2002b]
- …


WIEN
- Fixed frameworks for extractors
  - LR Wrapper
  - HLRT Wrapper (Head LR Tail)
  - OCLR Wrapper (Open-Close LR)
  - HOCLRT Wrapper
  - N-LR Wrapper (Nested LR)
  - N-HLRT Wrapper (Nested HLRT)
- Delimiter-based rules
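
A delimiter-based LR wrapper can be sketched in a few lines of Python. This is a hedged illustration of the idea (one left/right delimiter pair per attribute, applied repeatedly across the page), not WIEN's induction algorithm, which learns the delimiters from labeled pages. The function name `lr_extract`, the delimiter strings, and the sample page are assumptions made for the example.

```python
def lr_extract(page, delimiters):
    """Run an LR-style wrapper: one (left, right) delimiter pair per attribute."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records           # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end].strip())
            pos = end                    # leave the right delimiter for the next left one
        records.append(tuple(record))

# Hypothetical page with two reviews and hand-written delimiters.
PAGE = (
    "<ol>"
    "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine</li>"
    "</ol>"
)
DELIMITERS = [
    ("Reviewer Name</b>", "<b>"),
    ("Rating</b>", "<b>"),
    ("Text</b>", "</li>"),
]
reviews = lr_extract(PAGE, DELIMITERS)
```

The HLRT and OCLR variants listed above add head/tail or open/close delimiters around this loop so that text outside the record area is skipped.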


Stalker
- Embedded Category Tree

(a) Embedded category tree

  Whole document
    Name
    List(Reviewer)
      Name
      Rate
      Text

(b) Rules

  Extraction rule for List(Reviewer): SkipTo(<ol>) SkipTo(</ol>)
  Iteration rule for List(Reviewer): SkipTo(<li>) SkipTo(</li>)
  Extraction rule for Rating: SkipTo(Rating </b>) SkipTo(<b>)
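
The SkipTo rules can be read as "scan forward past these landmarks to find the start, scan backward past those landmarks to find the end". The sketch below follows that reading in Python. It is a simplification under assumptions: landmarks are plain substrings rather than token sequences with wildcards, the iteration rule scans forward only, and the names `skip_to`, `skip_back_to`, `extract`, and `iterate` as well as the sample page are hypothetical.

```python
def skip_to(text, landmarks):
    """Scan forward; return the index just after the last landmark, or None."""
    pos = 0
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos

def skip_back_to(text, landmarks):
    """Scan backward from the end; return the index where the last landmark starts."""
    pos = len(text)
    for landmark in landmarks:
        found = text.rfind(landmark, 0, pos)
        if found == -1:
            return None
        pos = found
    return pos

def extract(text, start_rule, end_rule):
    """Apply a start rule (forward) and an end rule (backward) to one parent text."""
    start, end = skip_to(text, start_rule), skip_back_to(text, end_rule)
    if start is None or end is None or start > end:
        return None
    return text[start:end]

def iterate(text, start_landmark, end_landmark):
    """Apply an iteration rule repeatedly to split a list into its items."""
    items, pos = [], 0
    while True:
        start = text.find(start_landmark, pos)
        if start == -1:
            return items
        start += len(start_landmark)
        end = text.find(end_landmark, start)
        if end == -1:
            return items
        items.append(text[start:end])
        pos = end + len(end_landmark)

PAGE = (
    "<html><body><b>Book Name</b> Databases <b>Reviews</b><ol>"
    "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine</li>"
    "</ol></body></html>"
)
reviewer_list = extract(PAGE, ["<ol>"], ["</ol>"])          # List(Reviewer)
reviewers = iterate(reviewer_list, "<li>", "</li>")          # one item per reviewer
ratings = [extract(item, ["Rating</b>"], ["<b>"]).strip() for item in reviewers]
```

Because each rule is applied to the text of its parent node in the embedded category tree, the Rating rule only has to be unambiguous within a single reviewer, not within the whole page.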


SoftMealy
- Finite-state transducer
- Contextual rules instead of delimiter-based rules

[Figure: finite-state transducer over the states b, N, R, T, and e. Transition labels include s<b,N> / "N=" + next_token; s<,R> / "R=" + next_token; s<,T> / "T=" + next_token; ? / next_token; ? / ε; s<N,> / ε; s<R,e> / ε; s<T,e> / ε.]

Contextual rules for the separators around Rating:

  s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
  s<,R>R ::= Spc(-) Num(-)
  s<R,>L ::= Num(-)
  s<R,>R ::= NL(-) HTML(<b>)
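
A contextual rule describes a separator (an invisible boundary between two tokens) by the token classes to its left and right. As a rough approximation, the four rules for Rating can be mimicked with regular-expression lookbehind and lookahead. This is only a sketch under assumptions: real SoftMealy works on typed tokens and learns the rules from examples, and the names `ENTER_RATING`, `LEAVE_RATING`, and `extract_rating` are hypothetical.

```python
import re

# s<,R>: left context <b>Rating</b>, right context a space followed by a number.
ENTER_RATING = re.compile(r"(?<=<b>Rating</b>)(?= \d)")
# s<R,>: left context a number, right context a newline followed by <b>.
LEAVE_RATING = re.compile(r"(?<=\d)(?=\n<b>)")

def extract_rating(text):
    """Return the text between the entering and leaving separators, or None."""
    start = ENTER_RATING.search(text)
    if start is None:
        return None
    end = LEAVE_RATING.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()

rating = extract_rating("<b>Reviewer Name</b> John\n<b>Rating</b> 7\n<b>Text</b> Good")
```

Both patterns match zero-width positions, which mirrors the idea that a separator sits between tokens rather than consuming a delimiter.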


DEByE
- Bottom-up extraction strategy
- Comparison
  - DEByE: the user marks only atomic (attribute) values to assemble nested tables
  - Stalker: the user decomposes the whole document in a top-down fashion


Semi-supervised Approaches
- IEPAD [Chang and Lui, 2001]
- OLERA [Chang and Kuo, 2003]
- Thresher [Hogue, 2005]


IEPAD
- Encoding of the input page
- Pattern discovery (mining)
  - Pattern (string) mining by PAT tree/suffix tree
- For the running example pe4, the encoded page is

  <html><body><b>T</b>T<b>T</b><ol>
  <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
  <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
  <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
  </ol></body></html>

- Multiple string alignment

  <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
  <li><b>T</b>T<b>T</b>T<b>T </li>
  <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
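
The encoding and mining steps can be sketched as follows. The encoder keeps tags and replaces every text run with the token T; the miner then looks for the longest token sequence that repeats often enough. Note the substitution: this sketch uses brute-force n-gram counting in place of the PAT tree, which is fine for a toy page but not for real ones. The names `encode` and `longest_repeat`, the sample page, and the `min_count` threshold are assumptions for illustration.

```python
import re
from collections import Counter

def encode(html):
    """Encode a page as a token list: tags are kept, each text run becomes 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("T")
    return tokens

def longest_repeat(tokens, min_count):
    """Return the longest token sequences that occur at least min_count times."""
    for length in range(len(tokens) - 1, 0, -1):
        counts = Counter(
            tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)
        )
        hits = [pattern for pattern, count in counts.items() if count >= min_count]
        if hits:
            return hits
    return []

def review(name, rating, text):
    return (f"<li><b>Reviewer Name</b>{name}<b>Rating</b>{rating}"
            f"<b>Text</b>{text}</li>")

PAGE = (
    "<html><body><b>Book Name</b>Databases<b>Reviews</b><ol>"
    + review("John", 7, "Good") + review("Jane", 6, "Fine") + review("Jeff", 2, "Poor")
    + "</ol></body></html>"
)
patterns = longest_repeat(encode(PAGE), min_count=3)
record_pattern = "".join(patterns[0])
```

On this page the only sequence that occurs three times at maximal length is the record pattern itself, which is the kind of candidate IEPAD would then present to the user and refine by multiple string alignment.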


Unsupervised Approaches
- RoadRunner [Crescenzi, 2001]
- DeLa [Wang, 2002; 2003]
- EXALG [Arasu and Garcia-Molina, 2003]
- DEPTA [Zhai, et al., 2005]


DeLa
- Similar to IEPAD
  - Works for one input page
- Handles nested data structures
- Example

  Encoded input: <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P><P><A>T</A>T</P><P><A>T</A>T</P>
  Discovered pattern: (<P>(<A>T</A>)*T</P>)*
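
The nested pattern can be checked directly with Python's `re` module. This sketch only verifies that the generalized expression covers the encoded example and shows how the inner repetition varies per record; it does not reproduce how DeLa discovers the pattern. The pattern is written with the closing `</P>` tag, and the variable names are assumptions.

```python
import re

# Encoded example: four <P> records, the first with two <A> links, the rest with one.
ENCODED = (
    "<P><A>T</A><A>T</A>T</P>"
    "<P><A>T</A>T</P>"
    "<P><A>T</A>T</P>"
    "<P><A>T</A>T</P>"
)

# Outer star: repeated records. Inner star: repeated links inside one record.
NESTED = re.compile(r"(?:<P>(?:<A>T</A>)*T</P>)*")
covers_whole_page = NESTED.fullmatch(ENCODED) is not None

# Split into records and count the inner repetitions of each one.
records = re.findall(r"<P>((?:<A>T</A>)*)T</P>", ENCODED)
links_per_record = [record.count("<A>") for record in records]
```

A flat pattern without the inner star would need one alternative per link count, which is why handling nesting matters for pages like this.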


DEPTA [Liu, et al., KDD03, WWW05]
- Identify data regions by mining tree patterns
  - Allows mismatches between data records
- Identify data records
  - Data records may not be contiguous
- Identify data items by partial tree alignment


RoadRunner
- Input: multiple pages with the same template
- Match two input pages at one time

Sample page

  01: <html><body>
  02: <b>
  03: Book Name
  04: </b>
  05: Data mining
  06: <b>
  07: Reviews
  08: </b>
  09: <OL>
  10: <LI>
  11: <b> Reviewer Name </b>
  12: Jeff
  13: <b> Rating </b>
  14: 2
  15: <b>Text </b>
  16: …
  17: </LI>
  18: <LI>
  19: <b> Reviewer Name </b>
  20: Jane
  21: <b> Rating </b>
  22: 6
  23: <b>Text </b>
  24: …
  25: </LI>
  26: </OL>
  27: </body></html>

Wrapper (initially)

  01: <html><body>
  02: <b>
  03: Book Name
  04: </b>
  05: Databases
  06: <b>
  07: Reviews
  08: </b>
  09: <OL>
  10: <LI>
  11: <b> Reviewer Name </b>
  12: John
  13: <b> Rating </b>
  14: 7
  15: <b>Text </b>
  16: …
  17: </LI>
  18: </OL>
  19: </body></html>

[Figure labels: parsing; string mismatch (four occurrences); tag mismatch; terminal search match]

Wrapper after solving mismatches

  <html><body><b> Book Name </b>#PCDATA<b> Reviews </b><OL> ( <LI><b> Reviewer Name </b> #PCDATA <b> Rating </b> #PCDATA <b> Text </b> #PCDATA </LI> )+</OL></body></html>
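
The matching step can be sketched in Python for the simple case shown on this slide. String mismatches are generalized to #PCDATA fields, and a tag mismatch is explained as one extra repetition in the sample that ends with the same terminal tag as the block before it. This is a deliberately narrow sketch, not RoadRunner's algorithm: it handles neither optionals, nor backtracking, nor more than one extra repetition, and the names `tokenize`, `merge`, and `generalize` plus the page text are assumptions.

```python
import re

PCDATA = "#PCDATA"

def tokenize(html):
    """Split a page into tag tokens and text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]

def is_tag(token):
    return token.startswith("<")

def merge(wrapper_token, sample_token):
    """Return the generalized token, or None on a tag mismatch."""
    if wrapper_token == sample_token:
        return wrapper_token
    if not is_tag(wrapper_token) and not is_tag(sample_token):
        return PCDATA                       # string mismatch becomes a field
    return None

def generalize(wrapper, sample):
    out, i, j = [], 0, 0
    while i < len(wrapper) and j < len(sample):
        merged = merge(wrapper[i], sample[j])
        if merged is not None:
            out.append(merged)
            i, j = i + 1, j + 1
            continue
        if j == 0:
            raise ValueError("pages differ from the first token")
        # Tag mismatch: look for a repeated block ending in the terminal tag.
        terminal = sample[j - 1]
        end = sample.index(terminal, j)     # ValueError if there is no such block
        size = end - j + 1
        if len(out) < size:
            raise ValueError("mismatch is not a repetition")
        square = [merge(w, s) for w, s in zip(out[-size:], sample[j:end + 1])]
        if None in square:
            raise ValueError("mismatch is not a repetition")
        out[-size:] = ["("] + square + [")+"]
        j = end + 1                         # skip the extra repetition in the sample
    if i != len(wrapper) or j != len(sample):
        raise ValueError("pages do not align")
    return out

def page(title, reviews):
    items = "".join(
        f"<LI><b>Reviewer Name</b>{n}<b>Rating</b>{r}<b>Text</b>{t}</LI>"
        for n, r, t in reviews
    )
    return (f"<html><body><b>Book Name</b>{title}<b>Reviews</b>"
            f"<OL>{items}</OL></body></html>")

wrapper = tokenize(page("Databases", [("John", 7, "Good")]))
sample = tokenize(page("Data mining", [("Jeff", 2, "Fair"), ("Jane", 6, "Fine")]))
result = " ".join(generalize(wrapper, sample))
```

Starting from the first page as the initial wrapper, the result has the same shape as the wrapper after solving mismatches shown above: fields where the strings differ and a ( … )+ iterator around the review block.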


EXALG
- Input: multiple pages with the same template
- Techniques:
  - Differentiating token roles
  - Equivalence classes (EC) form a template
    - Tokens with the same occurrence vector
    - e.g. <1,1,1,1>: {<html>, <body>, </body>, </html>}
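
Occurrence vectors are easy to compute: count how often each token appears on each input page, then group tokens whose count vectors are identical. The Python sketch below does only that grouping; EXALG additionally differentiates token roles by context and filters out small or infrequent classes, which is not shown. The helper names and the four generated pages are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(page):
    """Split a page into tags and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)

def equivalence_classes(pages):
    """Group tokens by occurrence vector (one count per input page)."""
    counts = [Counter(tokenize(page)) for page in pages]
    classes = defaultdict(set)
    for token in set().union(*counts):
        vector = tuple(count[token] for count in counts)
        classes[vector].add(token)
    return dict(classes)

def make_page(index, n_reviews):
    """Hypothetical template page with a variable number of reviews."""
    items = "".join(f"<li>review{index}_{k}</li>" for k in range(n_reviews))
    return f"<html><body><b>Book</b> title{index} <ol>{items}</ol></body></html>"

PAGES = [make_page(i, n) for i, n in enumerate((1, 2, 1, 3))]
classes = equivalence_classes(PAGES)
page_level = classes[(1, 1, 1, 1)]       # tokens that occur exactly once on every page
record_level = classes[(1, 2, 1, 3)]     # tokens that occur once per review
```

Tokens with the vector <1,1,1,1> belong to the page-level template, while tokens whose counts track the number of reviews belong to the repeated record template; data values end up in small classes of their own.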


On the Use of Techniques
- From supervised to unsupervised approaches
- From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
- From full alignment (IEPAD) to partial alignment (DEPTA)


Task Domain Comparison
- Page type
  - Structured, semi-structured, or free-text Web pages
- Non-HTML support
- Extraction level
  - Field-level, record-level, page-level
- Extraction target variation
  - Missing attributes, multi-valued attributes, multi-order attribute permutation
- Template variation
- Untokenized attributes


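Group           | Tool       | Page Type | NHS | Extraction Level | MA/MVA | MOA     | Nested     | VF   | CT  | UTA
Manual          | Minerva    | Semi-S    | Yes | Record Level     | Yes    | Yes     | Yes        | Both | No  | Yes
Manual          | TSIMMIS    | Semi-S    | Yes | Record Level     | Yes    | No      | Yes        | Disj | No  | No
Manual          | WebOQL     | Semi-S    | No  | Record Level     | Yes    | Yes     | Yes        | Disj | No  | No
Manual          | W4F        | Temp      | No  | Record Level     | Yes    | Yes     | Yes        | SP   | No  | Yes
Manual          | XWRAP      | Temp      | No  | Record Level     | Yes    | No      | Yes        | SP   | No  | Yes
Supervised      | RAPIER     | Free      | Yes | Field Level      | Yes    | --      | --         | Disj | Yes | No
Supervised      | SRV        | Free      | Yes | Field Level      | Yes    | --      | --         | Disj | Yes | No
Supervised      | WHISK      | Free      | Yes | Record Level     | Yes    | Yes     | No         | Disj | Yes | No
Supervised      | NoDoSE     | Semi-S    | Yes | Page/Record      | Yes    | Limited | Yes        | No   | No  | No
Supervised      | DEByE      | Semi-S    | Yes | Record Level     | Yes    | Yes     | Yes        | Disj | No  | No
Supervised      | WIEN       | Semi-S    | Yes | Record Level     | No     | No      | No         | No   | No  | No
Supervised      | STALKER    | Semi-S    | Yes | Record Level     | Yes    | Yes     | Yes        | Both | No  | Yes
Supervised      | SoftMealy  | Semi-S    | Yes | Record Level     | Yes    | Limited | Multi Pass | Disj | No  | Yes
Semi-supervised | IEPAD      | Temp      | No  | Record Level     | Yes    | Limited | Limited    | Both | No  | Yes
Semi-supervised | OLERA      | Temp      | No  | Record Level     | Yes    | Limited | Limited    | Both | No  | Yes
Unsupervised    | DeLa       | Temp      | No  | Record Level     | Yes    | Limited | Yes        | Both | No  | No
Unsupervised    | RoadRunner | Temp      | No  | Page Level       | Yes    | No      | Yes        | SP   | No  | No
Unsupervised    | EXALG      | Temp      | Yes | Page Level       | Yes    | No      | Yes        | Both | No  | No
Unsupervised    | DEPTA      | Temp      | No  | Record Level     | Yes    | No      | Limited    | Disj | No  | No

Column groups: MA/MVA, MOA, and Nested fall under Extraction Targets Variation; VF and CT fall under Template Variation.
Abbreviations: NHS = non-HTML support; MA/MVA = missing/multi-valued attributes; MOA = multi-order attributes; UTA = untokenized attributes; Semi-S = semi-structured; Temp = template; Free = free text.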


Technique-based Comparison
- Scan pass
  - Single pass vs. multiple pass
- Extraction rule type
  - Regular expression vs. logic rules
- Features used
  - DOM tree information, POS tags, etc.
- Learning algorithm
  - Machine learning vs. pattern mining
- Tokenization schemes


Tool       | Scan Pass | Extraction Rule Type | Features Used            | Learning Algorithm                          | Tokenization Scheme
Minerva    | Single    | Regular exp.         | HTML tags/Literal words  | None                                        | Manually
TSIMMIS    | Single    | Regular exp.         | HTML tags/Literal words  | None                                        | Manually
WebOQL     | Single    | Regular exp.         | Hypertree                | None                                        | Manually
W4F        | Single    | Regular exp.         | DOM tree path addressing | None                                        | Tag Level
XWRAP      | Single    | Context-Free         | DOM tree                 | None                                        | Tag Level
RAPIER     | Multiple  | Logic rules          | Syntactic/Semantic       | ILP (bottom-up)                             | Word Level
SRV        | Multiple  | Logic rules          | Syntactic/Semantic       | ILP (top-down)                              | Word Level
WHISK      | Single    | Regular exp.         | Syntactic/Semantic       | Set covering (top-down)                     | Word Level
NoDoSE     | Single    | Regular exp.         | HTML tags/Literal words  | Data Modeling                               | Word Level
DEByE      | Single    | Regular exp.         | HTML tags/Literal words  | Data Modeling                               | Word Level
WIEN       | Single    | Regular exp.         | HTML tags/Literal words  | Ad-hoc (bottom-up)                          | Word Level
STALKER    | Multiple  | Regular exp.         | HTML tags/Literal words  | Ad-hoc (bottom-up)                          | Word Level
SoftMealy  | Both      | Regular exp.         | HTML tags/Literal words  | Ad-hoc (bottom-up)                          | Word Level
IEPAD      | Single    | Regular exp.         | HTML tags                | Pattern Mining, String Alignment            | Multi-Level
OLERA      | Single    | Regular exp.         | HTML tags                | String Alignment                            | Multi-Level
DeLa       | Single    | Regular exp.         | HTML tags                | Pattern Mining                              | Tag Level
DEPTA      | Single    | Tag Tree             | HTML tags                | Tree Mining/Alignment                       | Tag Level
RoadRunner | Single    | Regular exp.         | HTML tags                | String Alignment                            | Tag Level
EXALG      | Single    | Regular exp.         | HTML tags/Literal words  | Equivalence Class and Role Differentiation  | Word Level


Conclusion and Future Work
- Content of this talk
  - Compare IE systems by task domain
  - Classify IE systems by automation degree
  - Discuss various techniques used in IE systems
- Future work
  - The combination of EC with tree patterns


Thank You!