Image & Pervasive Access Lab, CNRS UMI 2955 - Singapore, www.ipal.cnrs.fr
Focused Crawling and Extraction by Example
Agus Sanjaya (1), Talel Abdessalem (1), Stéphane Bressan (2)
(1) Mines Télécom ParisTech, (2) NUS
Motivation
• Focused crawling identifies Web pages relevant to a selected topic
• Automatic information extraction extracts structured data from unstructured or semi-structured Web pages
• Set expansion uses the content of Web pages to automatically complete a given small set of examples
Research Objectives
• Integrate and extend approaches for focused crawling, automatic information extraction and set expansion
• Design and implement a practical system able to extract structured data from the web and deep web from queries consisting of a small set of examples
• Design a ranking mechanism based on the analysis of the heterogeneous graph of objects (websites, web pages, wrappers, data elements, and concepts in ontologies)
The system will also be able to rank websites, web pages and ontology elements
• We extract structured data from web pages automatically by first inferring the underlying structure (wrapper)
• We use unsupervised systems for generating wrappers: RoadRunner [3], ExAlg [2]
• Both of these systems exploit only the structure of web pages, without prior knowledge of the content
• ObjectRunner [1] incorporates a structured object definition as a way for the user to describe the targeted data
[Figure: system architecture. Recoverable labels: User, SOD, Type Recognizer, Web Corpus, Yago Ontology, same-domain Sources S1 ... Sn, Page pre-processing, Annotation (Unlabeled Web Pages to Labeled Web Pages), Wrapper Generation (all pages, sample pages), Extracted data, De-duplication, Enrich, Query Interface over Structured Data and text]
Automatic Information Extraction
Set Expansion Problem
• Extract elements of a particular semantic class from a given data source
• For example: given the seeds {Barack Obama, George Bush, Bill Clinton}, extract more elements of the same semantic class (US Presidents) from the web
• Three-step framework:
– Fetch relevant documents: find occurrences of the seeds in the collection of documents
– Construct patterns and extract candidates: use regular expressions to construct patterns, then extract the candidates matching these patterns
– Rank candidates: apply a ranking mechanism to the candidates, usually a variation of PageRank [4]
• SEAL [5] generates wrappers for each page and extracts candidates from that same page
• A wrapper is defined as a pair of a maximally long common left context and a maximally long common right context
• It uses a heterogeneous graph model in which nodes are entities (seeds, documents, wrappers, mentions) and edges are the relations between nodes
• It performs a lazy walk on the graph to measure the similarity between two nodes (similar to PageRank)
• The rank of any entity in the graph can be calculated from this similarity weight
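The wrapper definition above (a pair of maximally long common left and right contexts) can be illustrated with a minimal Python sketch. This is a simplification, not SEAL's actual algorithm: it builds a single character-level wrapper from all seed occurrences on one page, and the page content, the names `build_wrapper` and `extract_candidates`, and the constants `PAGE` and `SEEDS` are hypothetical choices made for illustration.

```python
import re
from os.path import commonprefix  # character-wise longest common prefix of strings

# Hypothetical toy page; the brand names are taken from the poster's example.
PAGE = (
    "<ul>\n"
    "<li><b>Yves Rocher</b></li>\n"
    "<li><b>Garnier</b></li>\n"
    "<li><b>Dior</b></li>\n"
    "<li><b>L'Oreal</b></li>\n"
    "</ul>"
)
SEEDS = ["Yves Rocher", "L'Oreal"]


def build_wrapper(page, seeds):
    """Return (left, right): the longest contexts shared by all seed occurrences."""
    lefts, rights = [], []
    for seed in seeds:
        for match in re.finditer(re.escape(seed), page):
            lefts.append(page[:match.start()])
            rights.append(page[match.end():])
    if not lefts:
        return None  # no seed occurs on this page
    # Longest common suffix of the left sides = common prefix of the reversed strings.
    left = commonprefix([text[::-1] for text in lefts])[::-1]
    right = commonprefix(rights)
    if not left or not right:
        return None  # assumption: a usable wrapper needs both contexts
    return left, right


def extract_candidates(page, wrapper, seeds):
    """Extract strings bracketed by the wrapper, excluding the seeds themselves."""
    left, right = wrapper
    # Lookarounds keep the contexts unconsumed, so adjacent items can share them.
    pattern = "(?<=" + re.escape(left) + ")(.+?)(?=" + re.escape(right) + ")"
    found = []
    for text in re.findall(pattern, page):
        if text not in seeds and text not in found:
            found.append(text)
    return found


wrapper = build_wrapper(PAGE, SEEDS)
candidates = extract_candidates(PAGE, wrapper, SEEDS)
print(wrapper, candidates)
```

On this toy page the two non-seed list items are returned as candidates. Because the contexts are maximally long, a wrapper of this kind is specific to the page it was built from, which is consistent with extracting candidates from the same page only.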
SEAL
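The graph-based ranking described above can be sketched as a lazy walk over a small heterogeneous graph. This is a sketch under stated assumptions, not SEAL's exact formulation: edges are treated as undirected with uniform transition probabilities, the stay probability (0.5) and the number of steps (10) are arbitrary, and the toy graph, node labels and the names `lazy_walk` and `rank_mentions` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy graph: seeds occur in documents, documents yield wrappers,
# wrappers extract mentions. Node labels carry their type as a prefix.
EDGES = [
    ("seed:Yves Rocher", "doc:1"),
    ("seed:Yves Rocher", "doc:2"),
    ("seed:L'Oreal", "doc:1"),
    ("seed:L'Oreal", "doc:2"),
    ("doc:1", "wrapper:1"),
    ("doc:2", "wrapper:2"),
    ("wrapper:1", "mention:Garnier"),
    ("wrapper:2", "mention:Garnier"),
    ("wrapper:2", "mention:Dior"),
]
SEED_NODES = ["seed:Yves Rocher", "seed:L'Oreal"]


def lazy_walk(edges, start_nodes, steps=10, stay=0.5):
    """Probability of being at each node after `steps` lazy-walk steps.

    At every step the walker stays put with probability `stay`; otherwise it
    moves to a neighbor chosen uniformly at random (an assumption of this sketch).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    prob = {node: 0.0 for node in neighbors}
    for node in start_nodes:
        prob[node] = 1.0 / len(start_nodes)  # start uniformly on the seeds
    for _ in range(steps):
        nxt = {node: stay * p for node, p in prob.items()}
        for node, p in prob.items():
            share = (1.0 - stay) * p / len(neighbors[node])
            for other in neighbors[node]:
                nxt[other] += share
        prob = nxt
    return prob


def rank_mentions(scores):
    """Mention nodes ordered by decreasing walk probability."""
    mentions = [node for node in scores if node.startswith("mention:")]
    return sorted(mentions, key=lambda node: -scores[node])


scores = lazy_walk(EDGES, SEED_NODES)
print(rank_mentions(scores))
```

In this toy graph the mention reached through both wrappers ranks above the one reached through a single wrapper, which matches the intuition that candidates supported by more evidence paths from the seeds should score higher.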
Bibliography
[1] Abdessalem, T., Cautis, B., & Derouiche, N. (2010). ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data. 36th International Conference on Very Large Data Bases (VLDB). Singapore.
[2] Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. SIGMOD International Conference on Management of Data (pp. 337-348). San Diego: ACM.
[3] Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: towards automatic data extraction from large Web sites. 27th International Conference on Very Large Data Bases (VLDB) (pp. 109-118). Rome.
[4] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999, November). The PageRank citation ranking: Bringing order to the web. Tech. Report.
[5] Wang, R. C., & Cohen, W. W. (2007). Language-independent set expansion of named entities using the Web. International Conference on Data Mining (pp. 342-350). IEEE.
Example
• Result from SEAL using the examples “Yves Rocher” and “L’Oreal” as seeds for a search of cosmetic brand names: “Maybelline”, “Biotherm”, “Guerlain”, “Clinique”, “Garnier”, “Lancome”, “Boscia”, “Dior”, etc.
Team Web & Data Science
Agus Sanjaya is a PhD student in the EDITE doctoral school (edite-de-paris.fr) under Professor Talel Abdessalem’s supervision. He targets the following conferences and journals for his work: WWW, CIKM, ICDE, TKDE and VLDBJ.