
Web Information Extraction: From supervised learning to unsupervised learning

Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering,
National Central University, Taiwan

June 16, 2006

(Talk given at the Greater China Database Summit)

Motivation
- Abundant information on the Web
  - Static Web pages
  - Searchable databases: Deep Web
- The need for Web information integration
  - Comparison shopping agents
  - CiteSeer for finding research papers

Web Information Integration
- We need programs/tools for
  - Web page crawling/collection
  - Information extraction (IE)
  - Schema matching
- Definition of IE
  - Identify relevant information from documents, pulling information from a variety of sources and aggregating it into a homogeneous form
  - An IE task is defined by its input and output

IE or WI Systems
- Wrappers
  - Programs that perform the task of IE are referred to as extractors or wrappers.
- Wrapper induction (WI) systems
  - IE systems are software tools that are designed to generate wrappers.

Related Work: Various IE Surveys
- Time scale
  - MUC vs. post-MUC, or pre-Web vs. Web
- Automation degree
  - Programmer involved or general user involved [HD98]
- Task domain
  - Input [Muslea99, Sarawagi02]
  - Formatting degree, complexity, single slot/record [Cohen04]
- Techniques [Laender02]
  - Extraction rules [Muslea99, Kushmerick03]

Three Dimensions
- Task domain
  - Input (unstructured, semi-structured)
  - Output targets (record-level, page-level, site-level)
- Automation degree
  - Programmer-involved, annotation-based, or annotation-free approaches
- Techniques
  - Learning algorithm: specific-to-general or general-to-specific
  - Rule type: regular expression rules vs. logic rules
  - Deterministic finite-state transducers vs. probabilistic hidden Markov models


Task Domain: Input

Information Extraction From Free Texts

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE (extracted output):

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Excerpted from Cohen & McCallum's talk.

IE from Semi-structured Documents

Posting from a newsgroup (ungrammatical snippets):

Telecommunications, SOLARIS Systems Administrator, 38-44K, Immediate need

Leading telecommunications firm in need of an energetic individual to fill the following position in the Atlanta office:

SOLARIS SYSTEM ADMINISTRATOR
Salary: 38-44K with full benefits
Location: Atlanta Georgia, no relocation assistance provided

IE (extracted output):

Title: SOLARIS Systems Administrator
Salary: 38-44K
State: Georgia
City: Atlanta
Platform: SOLARIS
Area: telecommunications
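
A minimal sketch of slot filling on the posting above, assuming hand-written regular expressions for three of the slots. The rule set `RULES` and the helper `fill_slots` are hypothetical names for illustration; a wrapper induction system would learn such rules rather than have them written by hand.

```python
import re

POSTING = """SOLARIS SYSTEM ADMINISTRATOR
Salary: 38-44K with full benefits
Location: Atlanta Georgia, no relocation assistance provided"""

# Hypothetical hand-written rules: each regex captures one slot value.
RULES = {
    "Salary": re.compile(r"Salary:\s*(\d+-\d+K)"),
    "City": re.compile(r"Location:\s*(\w+)"),
    "State": re.compile(r"Location:\s*\w+\s+(\w+)"),
}

def fill_slots(text, rules):
    """Return a dict of slot name -> first matching value; unmatched slots are omitted."""
    slots = {}
    for name, rule in rules.items():
        match = rule.search(text)
        if match:
            slots[name] = match.group(1)
    return slots

slots = fill_slots(POSTING, RULES)
```

Slots such as Title or Area have no explicit label in the posting, which is one reason semi-structured input is harder than template pages.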


IE from Semi-structured Documents
- Richer formatting (non-template)

IE from Nearly-structured Documents
- Template pages, e.g. Amazon.com book pages

Information Extraction from Template Pages
- Encoding: Database + Template (T) -> CGI (T, x) -> Output pages
- Decoding: a reverse engineering of the encoding

(Figure excerpted from EXALG.)
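
The encoding/decoding view can be sketched as follows, assuming a toy template with named slots. `TEMPLATE`, `encode`, and `decode` are illustrative names, and this `decode` is given the template; the actual IE problem is harder because the template must be inferred from the output pages alone.

```python
import re

# Hypothetical template T: literal HTML with named slots in braces.
TEMPLATE = "<html><body><b>Book Name</b> {title} <b>Rating</b> {rating}</body></html>"

def encode(template, record):
    """Encoding: fill the template slots with one database record x."""
    return template.format(**record)

def decode(template, page):
    """Decoding: turn the template into a regex and read the slot values back."""
    parts = re.split(r"\{(\w+)\}", template)  # alternates literal text and slot names
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal template text
        else:
            pattern += f"(?P<{part}>.*?)"     # one capture group per slot
    match = re.fullmatch(pattern, page)
    return match.groupdict() if match else None

record = {"title": "Databases", "rating": "7"}
page = encode(TEMPLATE, record)
recovered = decode(TEMPLATE, page)
```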

Task Domain: Output

Example text: Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Slot-level ("named entity" extraction)
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Record-level
  Relation: Succession
  Company: General Electric
  Title: CEO
  Out: Jack Welch
  In: Jeffrey Immelt

Site-level extraction
  Conference: KDD 2006
  Location: Philadelphia, PA
  Date: 20-23, Aug
  Accepted Paper: ………
  Authors: ………

The task can be difficult because …
- Missing attributes
- Multi-valued attributes
- Multiple permutations
- Nested data objects
- Various templates for an attribute
- Common templates for various attributes
- Untokenized attributes

Technologies
- Tokenization schemes
- Learning algorithms
- Features used
- Scan passes
- Extraction rule types


Classification by Automation Degree

Manual

Supervised

Semi-supervised

Unsupervised

An Example IE Task

Manually-constructed IE Systems
- TSIMMIS [Hammer, et al., 1997]
- Minerva [Crescenzi, 1998]
- WebOQL [Arocena and Mendelzon, 1998]
- W4F [Sahuguet and Azavant, 2001]
- XWrap [Liu, et al., 2000]
- …

TSIMMIS

Each command is of the form [variables, source, pattern], where
- source specifies the input text to be considered,
- pattern specifies how to find the text of interest within the source, and
- variables are a list of variables that hold the extracted results.

Notation:
- # means "save in the variable"
- * means "discard"

(a) Command file:

    [ [ "root", "get('pe1.html')", "#" ],
      [ "Book", "root", "*<body>#</body>" ],
      [ "BookName", "Book", "*#" ],
      [ "Reviews", "Book", "*<ol>#</ol>" ],
      [ "_Reviewer", "split(Reviews, '<li>')", "#" ],
      [ "Reviewer", "_Reviewer[0:0]", "#" ],
      [ "ReviewerName, Rating, Text", "Reviewer",
        "*#*#*#*" ] ]

(b) Extracted output:

    root complex {
      book_name string "Databases"
      reviews complex {
        Reviewer_Name string John
        Rating int 7
        Text string …
      }
    }
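
A rough sketch of how a [variables, source, pattern] command could be evaluated, assuming the pattern is translated into a regular expression. This is not TSIMMIS's actual implementation: `pattern_to_regex` and `run_command` are hypothetical helpers, the source is passed in as already-fetched text, and matching is anchored at the start of the source only.

```python
import re

def pattern_to_regex(pattern):
    """Translate a '#'/'*' pattern into a regex: '#' saves, '*' discards."""
    tokens = [t for t in re.split(r"([#*])", pattern) if t]
    pieces = []
    for i, token in enumerate(tokens):
        last = i == len(tokens) - 1
        if token == "#":
            # Save: lazy in the middle, greedy when nothing follows.
            pieces.append("(.*)" if last else "(.*?)")
        elif token == "*":
            pieces.append(".*" if last else ".*?")   # discard
        else:
            pieces.append(re.escape(token))          # literal text
    return re.compile("".join(pieces), re.DOTALL)

def run_command(variables, source_text, pattern):
    """Apply one command; return a dict of variable name -> extracted text."""
    match = pattern_to_regex(pattern).match(source_text)
    if match is None:
        return {}
    names = [name.strip() for name in variables.split(",")]
    return dict(zip(names, match.groups()))

page = "<html><body><b>Book Name</b> Databases <ol><li>John</li></ol></body></html>"
root = run_command("root", page, "#")
book = run_command("Book", root["root"], "*<body>#</body>")
reviews = run_command("Reviews", book["Book"], "*<ol>#</ol>")
```

Chaining the commands mirrors the command file: each command's source is a variable filled by an earlier one.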

Minerva
- The grammar used by Minerva is defined in an EBNF style

    Page Book_Reviews
    $Book_Reviews: <html><body> $Book </body></html>;
    $Book : Book Name $bName Reviews [<ol> ( <li> Reviewer Name $rName Rating $rate Text $text $TP )* </ol>];
    $bName : *(?);
    $rName : *(?);
    $rate : *(?);
    $text : *(?</li>);
    $TP : { $bName, $rName $rate $text }
    END

WebOQL

    Select [ Z!'.Text ]
    From x in browse("pe2.html")', y in x', Z in y'
    Where x.Tag = "ol" and Z.Text = "Reviewer Name"

Hypertree for the page (each node: Tag, Source, Text):

    Tag: Body, Source: <Body>…</Body>, Text: Book Name …
      Tag: , Source: Book Name, Text: Book Name
      Tag: NOTAG, Source: Databases, Text: Databases
      Tag: , Source: Reviews, Text: Reviews
      Tag: OL, Source: <ol>…</ol>, Text: Reviewer Name …
        Tag: LI, Source: <li>…</li>, Text: Reviewer Name …
          Tag: , Source: Reviewer Name, Text: Reviewer Name
          Tag: NOTAG, Source: John, Text: John
          Tag: , Source: Rating, Text: Rating
          Tag: NOTAG, Source: 7, Text: 7
          Tag: , Source: Text, Text: Text
          Tag: NOTAG, Source: …, Text: …

Supervised IE Systems
- SRV [Freitag, 1998]
- Rapier [Califf and Mooney, 1998]
- WIEN [Kushmerick, 1997]
- WHISK [Soderland, 1999]
- NoDoSE [Adelberg, 1998]
- SoftMealy [Hsu and Dung, 1998]
- Stalker [Muslea, 1999]
- DEByE [Laender, 2002b]
- …

WIEN
- Fixed frameworks for extractors
  - LR wrapper
  - HLRT wrapper (Head LR Tail)
  - OCLR wrapper (Open-Close LR)
  - HOCLRT wrapper
  - N-LR wrapper (Nested LR)
  - N-HLRT wrapper (Nested HLRT)
- Delimiter-based rules
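
As an illustration of delimiter-based rules, here is a minimal sketch of executing an LR wrapper, assuming one (left, right) delimiter pair per attribute. The function name `exec_lr` and the sample page are made up for this example; learning the delimiters from labeled pages is the induction step and is not shown.

```python
def exec_lr(page, delimiters):
    """Run an LR wrapper: delimiters is a list of (left, right) pairs, one per attribute."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records            # no further left delimiter: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records            # incomplete record is dropped
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

page = "<li><b>John</b> <i>7</i></li><li><b>Jane</b> <i>6</i></li>"
records = exec_lr(page, [("<b>", "</b>"), ("<i>", "</i>")])
```

The HLRT and OCLR variants add head/tail or open/close delimiters to skip page regions where the plain LR delimiters would match spuriously.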

Stalker
- Embedded catalog tree

    Whole document
      Name
      List(Reviewer)
        Name
        Rate
        Text

- Rules

    Extraction rule for List(Reviewer): SkipTo(<ol>) SkipTo(</ol>)
    Iteration rule for List(Reviewer): SkipTo(<li>) SkipTo(</li>)
    Extraction rule for Rating: SkipTo(Rating) SkipTo()
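
A simplified sketch of how SkipTo-style rules could be applied, assuming single-landmark rules: the extraction rule locates the start landmark from the left and the end landmark from the right, and the iteration rule is applied repeatedly to split a list into items. `extract` and `iterate` are hypothetical helper names, and Stalker's real rules also support multi-landmark sequences and disjunctions, which are omitted here.

```python
def extract(text, start_landmark, end_landmark):
    """Start rule scans forward to start_landmark; end rule scans backward to end_landmark."""
    begin = text.find(start_landmark)
    end = text.rfind(end_landmark)
    if begin == -1 or end == -1 or end < begin + len(start_landmark):
        return None
    return text[begin + len(start_landmark):end]

def iterate(text, start_landmark, end_landmark):
    """Apply an iteration rule repeatedly to split a list into its items."""
    items, pos = [], 0
    while True:
        begin = text.find(start_landmark, pos)
        if begin == -1:
            break
        begin += len(start_landmark)
        end = text.find(end_landmark, begin)
        if end == -1:
            break
        items.append(text[begin:end])
        pos = end + len(end_landmark)
    return items

page = "<html><body>Databases<ol><li>John 7</li><li>Jane 6</li></ol></body></html>"
reviewer_list = extract(page, "<ol>", "</ol>")       # List(Reviewer)
reviewers = iterate(reviewer_list, "<li>", "</li>")  # individual Reviewer items
```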

SoftMealy
- Finite-state transducer
- Contextual rules instead of delimiter-based rules

[Figure: finite-state transducer over states b, N, R, T, e. Transition labels: s<b,N>/"N="+next_token, s<,R>/"R="+next_token, s<,T>/"T="+next_token, ?/next_token, ?/ε, s<N, >/ε, s<R, e>/ε, s<T,e>/ε]

Contextual rules:

    s<,R>L ::= HTML() C1Alph(Rating) HTML()
    s<,R>R ::= Spc(-) Num(-)
    s<R,>L ::= Num(-)
    s<R,>R ::= NL(-) HTML()

DEByE
- Bottom-up extraction strategy
- Comparison
  - DEByE: the user marks only atomic (attribute) values to assemble nested tables
  - Stalker: the user decomposes the whole document in a top-down fashion

Semi-supervised Approaches
- IEPAD [Chang and Lui, 2001]
- OLERA [Chang and Kuo, 2003]
- Thresher [Hogue, 2005]

IEPAD
- Encoding of the input page
- Pattern discovery (mining)
  - Pattern (string) mining by PAT tree/suffix tree
  - For the running example pe4:

        <html><body>TTT<ol><li>TTTTTT</li><li>TTTTTT</li><li>TTTTTT</li></ol></body></html>

- Multiple string alignment

        <li>TTTTTT</li>
        <li>TTTTT </li>
        <li>TTTTTT</li>
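
The encoding and pattern discovery steps can be sketched as below. This is an assumption-laden toy: text runs are encoded as "T" and tags are kept, and a naive quadratic search for back-to-back repeats stands in for the PAT tree/suffix tree that IEPAD uses. `encode` and `find_repeat` are illustrative names, and the alignment step for inexact repeats is not covered.

```python
import re

def encode(page):
    """Encode a page as a token list: tags are kept, text runs become 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", page):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("T")
    return tokens

def find_repeat(tokens, min_count=2):
    """Return (pattern, count) for the back-to-back repeat covering the most tokens."""
    n = len(tokens)
    best = (0, [], 0)  # (tokens covered, pattern, repetition count)
    for length in range(1, n // min_count + 1):
        for start in range(n - length * min_count + 1):
            pattern = tokens[start:start + length]
            count = 1
            while tokens[start + count * length:start + (count + 1) * length] == pattern:
                count += 1
            if count >= min_count and count * length > best[0]:
                best = (count * length, pattern, count)
    return best[1], best[2]

page = "<html><body><ol><li>John</li><li>Jane</li><li>Jeff</li></ol></body></html>"
pattern, count = find_repeat(encode(page))
```

The discovered pattern marks the record boundary; each occurrence of it in the page corresponds to one extracted record.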

Unsupervised Approaches
- RoadRunner [Crescenzi, 2001]
- DeLa [Wang, 2002; 2003]
- EXALG [Arasu and Garcia-Molina, 2003]
- DEPTA [Zhai, et al., 2005]

DeLa
- Similar to IEPAD
  - Works for one input page
- Handles nested data structures
- Example

        <A>T</A><A>T</A> T<A>T</A>T <A>T</A>T<A>T</A>T
        ((<A>T</A>)*T)*
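
A quick check of the nested pattern, assuming the spaces in the example string are only layout. The sketch uses Python's `re` module to confirm that the generalized expression covers the encoded example; how DeLa derives the expression is not shown, and `covers` is a hypothetical helper name.

```python
import re

# The generalized nested pattern from the example, written with non-capturing groups.
NESTED = re.compile(r"(?:(?:<A>T</A>)*T)*")

def covers(encoded):
    """True if the whole encoded string is generated by the nested pattern."""
    return NESTED.fullmatch(encoded.replace(" ", "")) is not None

example = "<A>T</A><A>T</A> T<A>T</A>T <A>T</A>T<A>T</A>T"
ok = covers(example)
bad = covers("<A>T</A>")  # an inner repeat with no closing T
```

The inner star captures the repeated `<A>T</A>` items and the outer star captures the repeated records that contain them, which a flat (non-nested) repeat pattern cannot express.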

DEPTA [Liu, et al., KDD03, WWW05]
- Identify data regions by mining tree patterns
  - Allow mismatches between data records
- Identify data records
  - Data records may not be contiguous
- Identify data items by partial tree alignment

RoadRunner
- Input: multiple pages with the same template
- Match two input pages at one time

Wrapper (initially):

    <html><body>
    Book Name
    Databases
    Reviews
    <OL>
    <LI> Reviewer Name John Rating 7 Text … </LI>
    </OL>
    </body></html>

Sample page:

    <html><body>
    Book Name
    Data mining
    Reviews
    <OL>
    <LI> Reviewer Name Jeff Rating 2 Text … </LI>
    <LI> Reviewer Name Jane Rating 6 Text … </LI>
    </OL>
    </body></html>

Parsing the sample page against the wrapper:
- String mismatches: Databases / Data mining, John / Jeff, 7 / 2, … / …
- Tag mismatch: </OL> in the wrapper vs. the second <LI> in the sample
- Terminal search match

Wrapper after solving mismatches:

    <html><body> Book Name #PCDATA Reviews <OL> ( <LI> Reviewer Name #PCDATA Rating #PCDATA Text #PCDATA </LI> )+ </OL></body></html>
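
A deliberately partial sketch of the matching step, assuming both pages have the same tag structure: string mismatches are generalized to #PCDATA, while tag mismatches (which RoadRunner resolves by discovering iterators and optionals) are only detected and reported. `tokenize` and `generalize_strings` are hypothetical names, and the sample pages are simplified.

```python
import re

def tokenize(page):
    """Split a page into tag tokens and text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]

def generalize_strings(wrapper, sample):
    """Replace string mismatches with #PCDATA; tag mismatches are not handled here."""
    if len(wrapper) != len(sample):
        raise ValueError("token counts differ: iterator/optional discovery needed")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif w.startswith("<") or s.startswith("<"):
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
        else:
            result.append("#PCDATA")  # string mismatch becomes a data field
    return result

wrapper = tokenize("<html><body><b>Book Name</b>Databases<b>Rating</b>7</body></html>")
sample = tokenize("<html><body><b>Book Name</b>Data mining<b>Rating</b>2</body></html>")
generalized = generalize_strings(wrapper, sample)
```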

EXALG
- Input: multiple pages with the same template
- Techniques:
  - Differentiating token roles
  - Equivalence classes (EC) form a template
    - Tokens with the same occurrence vector, e.g. <1,1,1,1>: {<html>, <body>, </body>, </html>}
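
Occurrence vectors and equivalence classes can be illustrated with a small sketch, assuming a simple tokenizer that splits tags and whitespace-separated words. `tokenize` and `equivalence_classes` are illustrative names; EXALG additionally filters classes by size and support and differentiates token roles by context, none of which is attempted here.

```python
import re
from collections import Counter, defaultdict

def tokenize(page):
    """Split a page into tag tokens and word tokens."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)

def equivalence_classes(pages):
    """Group tokens by occurrence vector (count of the token in each page)."""
    counts = [Counter(tokenize(p)) for p in pages]
    vocabulary = set().union(*counts)
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        classes[vector].add(token)
    return dict(classes)

pages = [
    "<html><body><li>John</li></body></html>",
    "<html><body><li>Jane</li><li>Jeff</li></body></html>",
]
classes = equivalence_classes(pages)
```

Here the tokens with vector (1, 1) appear exactly once per page and so are likely template tokens, while `<li>` and `</li>` share the vector (1, 2) and point to a repeated record.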

On the Use of Techniques
- From supervised to unsupervised approaches
- From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
- From full alignment (IEPAD) to partial alignment (DEPTA)

Task Domain Comparison
- Page type
  - Structured, semi-structured, or free-text Web pages
- Non-HTML support
- Extraction level
  - Field-level, record-level, page-level
- Extraction target variation
  - Missing attributes, multi-valued attributes, multi-order attribute permutation
- Template variation
- Untokenized attributes

Column groups: Extraction Targets Variation = MA/MVA, MOA, Nested; Template Variation = VF, CT.
Abbreviations: NHS = non-HTML support; MA/MVA = missing/multi-valued attributes; MOA = multi-order attributes; VF = various templates for an attribute; CT = common templates for various attributes; UTA = untokenized attributes.

Tools       Page Type  NHS  Extraction Level  MA/MVA  MOA      Nested      VF    CT   UTA

Manual
Minerva     Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
TSIMMIS     Semi-S     Yes  Record Level      Yes     No       Yes         Disj  No   No
WebOQL      Semi-S     No   Record Level      Yes     Yes      Yes         Disj  No   No
W4F         Temp       No   Record Level      Yes     Yes      Yes         SP    No   Yes
XWRAP       Temp       No   Record Level      Yes     No       Yes         SP    No   Yes

Supervised
RAPIER      Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
SRV         Free       Yes  Field Level       Yes     --       --          Disj  Yes  No
WHISK       Free       Yes  Record Level      Yes     Yes      No          Disj  Yes  No
NoDoSE      Semi-S     Yes  Page/Record       Yes     Limited  Yes         No    No   No
DEByE       Semi-S     Yes  Record Level      Yes     Yes      Yes         Disj  No   No
WIEN        Semi-S     Yes  Record Level      No      No       No          No    No   No
STALKER     Semi-S     Yes  Record Level      Yes     Yes      Yes         Both  No   Yes
SoftMealy   Semi-S     Yes  Record Level      Yes     Limited  Multi Pass  Disj  No   Yes

Semi-Supervised
IEPAD       Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes
OLERA       Temp       No   Record Level      Yes     Limited  Limited     Both  No   Yes

Un-Supervised
DeLa        Temp       No   Record Level      Yes     Limited  Yes         Both  No   No
RoadRunner  Temp       No   Page Level        Yes     No       Yes         SP    No   No
EXALG       Temp       Yes  Page Level        Yes     No       Yes         Both  No   No
DEPTA       Temp       No   Record Level      Yes     No       Limited     Disj  No   No

Technique-based Comparison
- Scan pass
  - Single pass vs. multiple pass
- Extraction rule type
  - Regular expression vs. logic rules
- Features used
  - DOM tree information, POS tags, etc.
- Learning algorithm
  - Machine learning vs. pattern mining
- Tokenization schemes

Tools       Scan Pass  Extraction Rule Type  Features Used             Learning Algorithm                          Tokenization Schemes

Minerva     Single     Regular exp.          HTML tags/Literal words   None                                        Manually
TSIMMIS     Single     Regular exp.          HTML tags/Literal words   None                                        Manually
WebOQL      Single     Regular exp.          Hypertree                 None                                        Manually
W4F         Single     Regular exp.          DOM tree path addressing  None                                        Tag Level
XWRAP       Single     Context-Free          DOM tree                  None                                        Tag Level
RAPIER      Multiple   Logic rules           Syntactic/Semantic        ILP (bottom-up)                             Word Level
SRV         Multiple   Logic rules           Syntactic/Semantic        ILP (top-down)                              Word Level
WHISK       Single     Regular exp.          Syntactic/Semantic        Set covering (top-down)                     Word Level
NoDoSE      Single     Regular exp.          HTML tags/Literal words   Data Modeling                               Word Level
DEByE       Single     Regular exp.          HTML tags/Literal words   Data Modeling                               Word Level
WIEN        Single     Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                          Word Level
STALKER     Multiple   Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                          Word Level
SoftMealy   Both       Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                          Word Level
IEPAD       Single     Regular exp.          HTML tags                 Pattern Mining, String Alignment            Multi-Level
OLERA       Single     Regular exp.          HTML tags                 String Alignment                            Multi-Level
DeLa        Single     Regular exp.          HTML tags                 Pattern Mining                              Tag Level
DEPTA       Single     Tag Tree              HTML tags                 Tree Mining/Alignment                       Tag Level
RoadRunner  Single     Regular exp.          HTML tags                 String Alignment                            Tag Level
EXALG       Single     Regular exp.          HTML tags/Literal words   Equivalence Class and Role Differentiation  Word Level

Conclusion and Future Work
- Content of this talk
  - Compare IE systems from the task domain
  - Classify IE systems from the automation degree
  - Discuss various techniques used in IE systems
- Future work
  - The combination of EC with tree patterns

Thank You!