
ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi Giansalvatore Mecca Paolo Merialdo VLDB 2001


ROADRUNNER: Towards Automatic Data Extraction

from Large Web Sites

Valter Crescenzi

Giansalvatore Mecca

Paolo Merialdo

VLDB 2001

Overview
– Automatically generates a wrapper from large, structured Web pages
– Supports nested structures
– Efficient approach to large, complex pages with regular structures

Approach
– Given a set of example pages, generate a union-free regular expression (UFRE)
– Find the least upper bound in the RE lattice to generate a wrapper
– This reduces to finding the least upper bound of two UFREs
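As a rough illustration of what a union-free wrapper looks like in practice, the sketch below turns a small nested expression into a Python regular expression. The tuple encoding (`"opt"`, `"plus"`), the function name `to_regex`, and the sample wrapper with its `p.gif` image are assumptions made for this example only; they are not taken from the slides. Only concatenation, optionals, and iterators appear, never `|`.

```python
import re

PCDATA = "#PCDATA"

def to_regex(expr):
    """Translate a union-free expression into a Python regex string.

    expr is a list whose items are tag strings, PCDATA, or a pair
    ("opt", body) / ("plus", body) with body another such list.
    """
    parts = []
    for item in expr:
        if item == PCDATA:
            parts.append(r"([^<]+)")          # a text field between tags
        elif isinstance(item, str):
            parts.append(re.escape(item))     # a literal tag
        else:
            op, body = item
            suffix = {"opt": "?", "plus": "+"}[op]
            parts.append(r"(?:\s*" + to_regex(body) + ")" + suffix)
    return r"\s*".join(parts)

# Hypothetical wrapper: a name, an optional image, and a non-empty list.
WRAPPER = [
    "<html>", "<b>", PCDATA, "</b>",
    ("opt", ["<img src='p.gif'/>"]),
    "<ul>", ("plus", ["<li>", PCDATA, "</li>"]), "</ul>",
    "</html>",
]

pattern = re.compile(to_regex(WRAPPER))
page = "<html><b>John Smith</b><ul><li>A</li><li>B</li></ul></html>"
match = pattern.fullmatch(page)
print(match.group(1))  # John Smith
```

Python's `re` keeps only the last value captured inside a repeated group, so this sketch checks whether a page conforms to the wrapper and reads top-level fields; extracting every item of an iterator would need a separate pass.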

Matching/Mismatching
– Start with the first page and create a RE that defines the wrapper
– Match each successive sample against the wrapper
– Mismatches result in generalizations of the regular expression
– Types of mismatches
  – String mismatches
  – Tag mismatches
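The two mismatch types can be told apart once both pages are split into tag and text tokens. The following is a minimal sketch, assuming a simple regex tokenizer and a wrapper that is still a plain token list (no optionals or iterators yet); the helper names and the "Paul Jones" sample page are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping blank text."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]

def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def first_mismatch(wrapper, sample):
    """Return (index, kind) for the first differing token pair, else None.

    kind is "string" when both tokens are text, otherwise "tag".
    Only the common prefix of the two lists is compared.
    """
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w != s:
            kind = "string" if not is_tag(w) and not is_tag(s) else "tag"
            return i, kind
    return None

wrapper = tokenize("<html><b>John Smith</b><ul></ul></html>")
sample = tokenize("<html><b>Paul Jones</b><img src='p.gif'/><ul></ul></html>")
print(first_mismatch(wrapper, sample))  # (2, 'string')
```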

Example Pages

Example


String mismatches are used to discover fields of the documents

Wrapper is generated by replacing “John Smith” with #PCDATA
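A minimal sketch of this step, assuming the wrapper and the sample have already been tokenized into equal-length lists that differ only in text tokens. The function name and the second name, "Paul Jones", are illustrative assumptions.

```python
PCDATA = "#PCDATA"

def generalize_strings(wrapper, sample):
    """Replace text tokens that differ between the two lists with #PCDATA."""
    if len(wrapper) != len(sample):
        raise ValueError("this sketch only handles equal-length token lists")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)                 # identical token: keep as is
        elif not w.startswith("<") and not s.startswith("<"):
            result.append(PCDATA)            # string mismatch: a data field
        else:
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return result

print(generalize_strings(["<b>", "John Smith", "</b>"],
                         ["<b>", "Paul Jones", "</b>"]))
# ['<b>', '#PCDATA', '</b>']
```

A position that is already `#PCDATA` in the wrapper stays `#PCDATA` when matched against further samples, since it is treated as text.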

Example (Cont.)


Tag Mismatches: Discovering Optionals
– First check to see if mismatch is caused by an iterator
– If not, could be an optional field in wrapper or sample
– Cross search used to determine possible optionals
– Image field determined to be optional
  – (<img src=…/>)?
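The cross search can be sketched as two forward lookups from the mismatch position: one for the wrapper's token in the sample, one for the sample's token in the wrapper. This is a simplified illustration; the function names, the return convention, and the `p.gif` image are assumptions, and the sketch takes the first lookup that succeeds rather than exploring both alternatives.

```python
def find_optional(wrapper, i, sample, j):
    """Cross search at a tag mismatch between wrapper[i] and sample[j].

    Returns (side, tokens): side is "sample" if the skipped tokens occur
    only in the sample, "wrapper" if only in the wrapper, and None if
    neither lookup succeeds.
    """
    try:
        # Does the wrapper's token reappear later in the sample?
        k = sample.index(wrapper[i], j + 1)
        return "sample", sample[j:k]
    except ValueError:
        pass
    try:
        # Does the sample's token reappear later in the wrapper?
        k = wrapper.index(sample[j], i + 1)
        return "wrapper", wrapper[i:k]
    except ValueError:
        return None, []

def as_optional(tokens):
    """Render the skipped tokens as an optional sub-expression."""
    return "(" + "".join(tokens) + ")?"

wrapper = ["</b>", "<ul>", "<li>"]
sample = ["</b>", "<img src='p.gif'/>", "<ul>", "<li>"]
side, tokens = find_optional(wrapper, 1, sample, 1)
print(side, as_optional(tokens))  # sample (<img src='p.gif'/>)?
```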


Example (Cont.)


Tag Mismatches: Discovering Iterators
– Assume mismatch is caused by repeated elements in a list
– Match possible squares against earlier squares
– Generalize the wrapper by finding all contiguous repeated occurrences
  – (<li><i>Title:</i>#PCDATA</li>)+
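The square search and the collapse into an iterator can be sketched on a single token list. This is a simplification under stated assumptions: the function names and the two book titles are hypothetical, and the sketch looks for the square within one page rather than across wrapper and sample as the slides describe.

```python
PCDATA = "#PCDATA"

def is_tag(tok):
    return tok.startswith("<")

def same_shape(a, b):
    """Blocks match if tags are identical and text lines up with text."""
    return len(a) == len(b) and all(
        x == y for x, y in zip(a, b) if is_tag(x) or is_tag(y)
    )

def find_square(tokens, end):
    """Return the candidate square ending at tokens[end], or None.

    Searches backward for an earlier occurrence of the terminal tag and
    accepts the block in between if it matches the block just before it.
    """
    for start in range(end - 1, -1, -1):
        size = end - start
        begin = start + 1 - size
        if tokens[start] == tokens[end] and begin >= 0:
            square = tokens[start + 1:end + 1]
            if same_shape(square, tokens[begin:start + 1]):
                return square
    return None

def collapse(tokens, square):
    """Replace each contiguous run of square-shaped blocks with (...)+."""
    n = len(square)
    if n == 0:
        return list(tokens)
    out, i = [], 0
    while i < len(tokens):
        blocks = []
        while same_shape(tokens[i:i + n], square):
            blocks.append(tokens[i:i + n])
            i += n
        if blocks:
            # Keep tokens shared by every occurrence; the rest become fields.
            merged = [c[0] if len(set(c)) == 1 else PCDATA for c in zip(*blocks)]
            out.append("(" + "".join(merged) + ")+")
        else:
            out.append(tokens[i])
            i += 1
    return out

page = ["<ul>",
        "<li>", "<i>", "Title:", "</i>", "Data on the Web", "</li>",
        "<li>", "<i>", "Title:", "</i>", "XML at Work", "</li>",
        "</ul>"]
square = find_square(page, 12)       # index of the second </li>
print(collapse(page, square))
# ['<ul>', '(<li><i>Title:</i>#PCDATA</li>)+', '</ul>']
```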

Extracted Result

Recursive Example

Complexity

Discussion
– Assumptions
  – Pages are well-structured
  – Want to extract at the level of entire fields
  – Structure can be modeled without disjunctions
– Search space for explaining mismatches is huge
  – Uses a number of heuristics to prune space
    – Limited backtracking
    – Limit on number of choices to explore
    – Patterns cannot be delimited by optionals
  – Will result in pruning possible wrappers

Experimental Result

Comparison with Other Works

Columns: Name, Structure, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient

WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
ROADRUNNER: X X X X X
BYU Onto: X X ? X X X X X X

X means the information extraction system has the capability; X* means the system has the ability as long as the training corpus can accommodate the required training data; ? means the system has the ability to some degree; * means that the extraction pattern itself doesn't show the ability, but the overall system has the capability.