
Image & Pervasive Access Lab, CNRS UMI 2955 - Singapore, www.ipal.cnrs.fr

Focused Crawling and Extraction by Example

Agus Sanjaya (1), Talel Abdessalem (1), Stéphane Bressan (2)

(1) Mines Télécom ParisTech, (2) NUS

Motivation

• Focused crawling identifies Web pages relevant to a selected topic

• Automatic information extraction derives structured data from unstructured or semi-structured Web pages

• Set expansion uses the content of Web pages to automatically complete a given small set of examples

Research Objectives

• Integrate and extend approaches for focused crawling, automatic information extraction and set expansion

• Design and implement a practical system able to extract structured data from the web and deep web from queries consisting of a small set of examples

• Design a ranking mechanism based on the analysis of the heterogeneous graph of objects (websites, web pages, wrappers, data elements, and concepts in ontologies)

The system will also be able to rank websites, web pages and ontology elements

• We extract structured data from web pages automatically by first inferring the underlying structure (wrapper)

• We use unsupervised systems for generating wrappers: RoadRunner [3], ExAlg [2]

• Both of these systems exploit only the structure of web pages, without prior knowledge of the content

• ObjectRunner [1] incorporates a structured object definition (SOD) as a way for the user to describe the targeted data
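To make the idea of inferring a wrapper from page structure concrete, here is a minimal sketch. It is not the RoadRunner or ExAlg algorithm: it only aligns two pages with Python's difflib, treats the tokens they share as the template and the tokens that differ as data fields. The function names, the #FIELD placeholder and the two sample pages are assumptions made for illustration.

```python
import difflib
import re

FIELD = "#FIELD"  # hypothetical placeholder marking a data slot in the template


def tokenize(html):
    """Split a page into tags and the text chunks between them."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def infer_template(page_a, page_b):
    """Align two pages; shared tokens form the template, the rest are fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    template, fields_a, fields_b = [], [], []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[a1:a2])
        else:
            # Tokens that differ between the pages are assumed to be data.
            template.append(FIELD)
            fields_a.append(" ".join(a[a1:a2]))
            fields_b.append(" ".join(b[b1:b2]))
    return template, fields_a, fields_b


# Two made-up pages assumed to come from the same source.
page_1 = "<li><b>Dune</b> by <i>Frank Herbert</i></li>"
page_2 = "<li><b>Emma</b> by <i>Jane Austen</i></li>"

template, record_1, record_2 = infer_template(page_1, page_2)
print(template)
print(record_1, record_2)
```

Like the unsupervised systems above, the sketch uses no knowledge of the content. It cannot tell which field is a title and which is an author, which is the gap a structured object definition is meant to fill.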

[Figure: Automatic Information Extraction. Architecture diagram with the following labeled components: user, SOD, type recognizer, Web corpus, Yago ontology, same-domain sources S1 ... Sn, page pre-processing, annotation (unlabeled Web pages to labeled Web pages), wrapper generation (all pages or sample pages), extracted data, de-duplication, enrichment, and a query interface over structured data and text.]

Set Expansion Problem

• Extract elements of a particular semantic class from a given data source

• For example: given the seeds {Barack Obama, George Bush, Bill Clinton}, extract more elements of the same semantic class (US Presidents) from the web

• Three-step framework:

– Fetch relevant documents: find occurrences of the seeds in the collection of documents

– Construct patterns and extract candidates: use regular expressions to construct patterns, then extract candidates with these patterns

– Rank candidates: apply a ranking mechanism to the candidates, usually a variation of PageRank [4]

• SEAL [5] generates a wrapper for each page and extracts candidates from that same page

• A wrapper is defined as a pair: the maximally long common left context and the maximally long common right context shared by the seed occurrences on the page

• It uses a heterogeneous graph model in which nodes are entities (seeds, documents, wrappers, mentions) and edges are the relations between them

• It performs a lazy walk on the graph to measure the similarity between two nodes (similar to PageRank)

• The rank of any entity in the graph can be calculated from this similarity weight
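A lazy walk on such a graph can be sketched as follows. This is a minimal illustration, not SEAL's implementation: the toy graph, the node names, the stay probability of 0.5 and the number of steps are all assumptions. At each step the walker stays on its node with probability `stay`, and otherwise moves to a neighbor chosen uniformly; the probability of ending on a node serves as its score.

```python
from collections import defaultdict


def lazy_walk(edges, start_nodes, stay=0.5, steps=10):
    """Return the node distribution after `steps` lazy-walk steps."""
    neighbors = defaultdict(set)
    for u, v in edges:  # edges are treated as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)

    # The walk starts uniformly distributed over the start nodes (the seeds).
    prob = {node: 1.0 / len(start_nodes) for node in start_nodes}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in prob.items():
            nbrs = neighbors.get(node, ())
            if not nbrs:
                nxt[node] += p  # isolated node: the walker cannot move
                continue
            nxt[node] += stay * p
            share = (1.0 - stay) * p / len(nbrs)
            for nb in nbrs:
                nxt[nb] += share
        prob = dict(nxt)
    return prob


# Hypothetical graph: seeds occur in documents, documents yield wrappers,
# wrappers extract mentions.
edges = [
    ("Yves Rocher", "doc1"), ("L'Oreal", "doc1"),
    ("Yves Rocher", "doc2"), ("L'Oreal", "doc2"),
    ("doc1", "wrapper1"), ("doc2", "wrapper2"),
    ("wrapper1", "Maybelline"), ("wrapper1", "Garnier"),
    ("wrapper2", "Maybelline"),
]
scores = lazy_walk(edges, ["Yves Rocher", "L'Oreal"])
for mention in ("Maybelline", "Garnier"):
    print(mention, round(scores[mention], 4))
```

In this toy graph the mention extracted by both wrappers receives a higher score than the mention extracted by only one, which is the behavior the ranking relies on.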

SEAL

Bibliography

[1] Abdessalem, T., Cautis, B., & Derouiche, N. (2010). ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data. 36th International Conference on Very Large Data Bases (VLDB). Singapore.

[2] Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. SIGMOD International Conference on Management of Data (pp. 337-348). San Diego: ACM.

[3] Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large Web sites. 27th International Conference on Very Large Data Bases (VLDB) (pp. 109-118). Rome.

[4] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999, November). The PageRank citation ranking: Bringing order to the Web. Technical report.

[5] Wang, R. C., & Cohen, W. W. (2007). Language-independent set expansion of named entities using the Web. International Conference on Data Mining (pp. 342-350). IEEE.

Example

• Result from SEAL using the examples “Yves Rocher” and “L’Oreal” as seeds for a search of cosmetic brand names: “Maybelline”, “Biotherm”, “Guerlain”, “Clinique”, “Garnier”, “Lancome”, “Boscia”, “Dior”, etc.

Team Web & Data Science

Agus Sanjaya is a PhD student in the EDITE doctoral school (edite-de-paris.fr) under Professor Talel Abdessalem’s supervision. He targets the following conferences and journals for his work: WWW, CIKM, ICDE, TKDE and VLDBJ.
