
DIADEM: Domain-centric, Intelligent, Automated Data Extraction Methodology∗

Tim Furche, Georg Gottlob, Giovanni Grasso, Ömer Gunes, Xiaonan Guo, Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, Cheng Wang

Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD
[email protected]

ABSTRACT

Search engines are the sinews of the web. These sinews have become strained, however: Where the web’s function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers—never to be sure a better offer isn’t just around the corner. What search engines are missing is an understanding of the objects and their attributes published on websites.

Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy. The price is that for this task we need to provide DIADEM with extensive knowledge about the ontology and phenomenology of the domain, i.e., about the entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we show with a first prototype of DIADEM that, in contrast to the alchemists, DIADEM has developed a viable formula.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: On-line Information Services—Web-based services

General Terms

Languages, Experimentation

Keywords

data extraction, deep web, knowledge

∗The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858, http://diadem-project.info/.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WWW’12, April 16–20, 2012, Lyon, France.
Copyright 2012 ACM XXX ...$10.00.

Figure 1: Exploration: Form

1. INTRODUCTION

Search is failing. In our society, search engines play a crucial role as information brokers. They allow us to find web pages and, by extension, businesses, articles, or information about products wherever they are published. Visibility on search engines has become mission critical for most companies and persons. However, for searches where the answer is not just a single web page, search engines are failing: What is the best apartment for my needs? Who offers the best price for this product in my area? Such object queries have become an exercise in frustration: we, the users, need to manually sift through, compare, and rank the offers from dozens of websites returned by a search engine. At the same time, businesses have to turn to stopgap measures, such as aggregators, that provide customers with search facilities in a specific domain (vertical search). This endangers the just emerging universal information market, “the great equalizer” of the Internet economy: visibility depends on deals with aggregators, the barrier to entry is raised, and the market fragments. Rather than further stopgap measures, the “publish on your website” model that has made the web such a success and so resilient against control by a few (government agents or companies) must be extended to object search: Without additional technological burdens on publishers, users should be able to search and query objects such as properties or laptops based on attributes such as price, location, or brand. Increasing the burden on publishers is not an option, as it further widens the digital divide between those that can afford the necessary expertise and those that cannot.

DIADEM, funded by an ERC advanced investigator grant, is developing a system that aims to provide this bridge: Without human supervision, it finds, navigates, and analyses websites of a specific domain and extracts all contained objects using highly efficient, scalable, automatically generated wrappers. The analysis is parameterized with domain knowledge that DIADEM uses to replace the human annotators of traditional wrapper induction systems and to refine and verify the generated wrappers. This domain knowledge describes the ontology as well as the phenomenology of the domain: what the entities and their relations are, and how they occur on websites. The latter states, e.g., that real estate properties include a location and a price and that these are displayed prominently.
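
To make the flavour of this knowledge concrete, it could be written as rules of roughly the following shape (a minimal sketch; the notation and all predicate names are ours, for illustration only, and not DIADEM’s actual rule set):

  % Ontology: every real estate property has a price and a location.
  property(X) -> exists P: hasPrice(X, P)
  property(X) -> exists L: hasLocation(X, L)
  % Phenomenology: a prominently rendered text node matching a
  % currency pattern is a candidate price attribute.
  textNode(N), matchesCurrencyPattern(N), renderedProminently(N) -> candidatePrice(N)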

Figures 1 and 2 show examples (screenshots from the current prototype) of the kind of analysis required by DIADEM for fully automated data extraction: Web sites need to be explored to locate relevant data, here real estate properties. This includes in particular forms as in Figure 1, which DIADEM automatically classifies such that each form field is associated with a type from the domain ontology, such as a “minimum price” field in a “price range” segment. These form models are then used to fill the form for further analysis, but also to generate exhaustive queries for the later extraction of all relevant data. Figure 2 illustrates a (partial) result of the analysis performed on a result page, i.e., a page containing relevant objects of the domain: DIADEM identifies the objects, their boundaries on the page, as well as their attributes. Objects and attributes are typed according to the domain ontology and verified against the domain constraints. In Figure 2, e.g., we identify real estate properties (for sale) together with price, location, legal status, number of bedrooms and bathrooms, pictures, etc. The identified objects are generalised into a wrapper that can extract such objects from any page following the same template without further analysis.

Figure 2: Identification: Result Page

In the first year and a half, DIADEM has progressed beyond our expectations: The current prototype is already able to generate wrappers for most UK real estate and used car websites with higher accuracy than existing wrapper induction systems. In this demonstration, we outline the DIADEM approach and its first results, and describe the live demonstration of the current prototype.

2. DIADEM—THE METHODOLOGY

DIADEM is fundamentally different from previous approaches. The integration of state-of-the-art technology with reasoning using high-level expert knowledge at the scale envisaged by this project has not yet been attempted and has a chance to become the cornerstone of next generation web data extraction technology.

Standard wrapping approaches [16, 6] are limited by their dependence on templates for web pages or sites and by their lack of understanding of the extracted information. They cannot be generalised, and their dependence on human interaction prevents scaling to the size of the web. Several approaches that move towards fully automatic or generic wrapping of the existing World Wide Web have been proposed and are currently under active development. These approaches often try to iteratively learn new instances from known patterns and new patterns from known instances [5, 18], or to find generalised patterns on web pages such as record boundaries. Although current web harvesting and automated information extraction systems show significant progress, they still lack the combined recall and precision necessary to allow for very robust queries. We intend to build upon these approaches, but add deep knowledge to the extraction process and combine them in a new manner into our knowledge-based approach; we propose to use them as low-level fact finders, whose output facts may be revised in case more accurate information arises from other sources.

[Figure 3: DIADEM Overview. The analysis stage (Exploration: OPAL, MALACHITE; Identification: AMBER, Oxtractor; Reasoning: GLUE, Datalog±, ontology and phenomenology) generates wrappers that the extraction stage (OXPath, OxLatin) executes in the cloud.]

DIADEM focuses on a dual approach: at the low level of web pattern recognition, we use machine learning complemented by linguistic analysis and basic ontological annotation. Several of these approaches, in addition to newly developed methods, are implemented as lower level building blocks (fact finders) to extract as much knowledge as possible and integrate it into a common knowledge base. At a higher level, we use goal-directed domain-specific rules for reasoning on top of all generated lower-level knowledge, e.g., in order to identify the main input mask and the main extraction data structure, and to finalize the navigation process.

DIADEM approaches data extraction as a two-stage process: In the first stage, we analyse a small fraction of a web site to generate a wrapper that is then executed in the second stage to extract all the relevant data on the site at high speed and low cost. Figure 3 gives an overview of the high-level architecture of DIADEM. On the left, we show the analysis stage; on the right, the execution stage.

(1) Sampling analysis: In the first stage, a sample of the web pages of a site is used to fully automatically generate wrappers (i.e., extraction programs). The analysis is based on domain knowledge that describes the objects of interest (the “ontology”) and how they appear on the web (the “phenomenology”). The result of the analysis is a wrapper program, i.e., a specification of how to extract all the data from the website without further analysis. Conceptually, the analysis is divided into two major phases, though these are closely interwoven in practice:

(i) Exploration: DIADEM automatically explores a site to locate relevant objects. The major challenge here is web forms: DIADEM needs to understand such forms well enough to fill them for sampling, but also to generate exhaustive queries for the extraction stage such that all the relevant data is extracted (see [1]). DIADEM’s form understanding engine OPAL [9] uses a phenomenology of forms in the domain to classify the form fields (a sketch of such a form model follows after this list). The exploration phase is supported by the page and block classification in MALACHITE, where we identify, e.g., next links in paginated results, navigation menus, and irrelevant data such as advertisements. We further cluster pages by structural and visual similarity to guide the exploration strategy and to avoid analysing many similar pages. Since such pages follow a common template, the analysis of one or two pages from a cluster usually suffices to generate a high-confidence wrapper.

(ii) Identification: The exploration unearths those web pages that contain actual objects. But DIADEM still needs to identify the precise boundaries of these objects as well as their attributes. To that end, DIADEM’s result page analysis AMBER [10] analyses the repeated structure within and among pages. It exploits the domain knowledge to distinguish noise from relevant data and is thus far more robust than existing data extraction approaches. AMBER is complemented by Oxtractor, which analyses the free-text descriptions. It benefits in this task from contextual knowledge in the form of attributes already identified by AMBER and from background knowledge in the ontology.

[Figure 4: Exploration and Execution. (a) OPAL: precision, recall, and F-score on UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436); (b) OXPath Comparative: evaluation time and memory versus number of pages for OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, and Chickenfoot; (c) OXPath Scaling: memory, visited pages, and extracted results versus time.]
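
As an illustration of the form models produced in phase (i), OPAL’s output for a form such as that of Figure 1 can be thought of as typed facts over the domain ontology, derived by rules of roughly this shape (a sketch; all predicate and constant names are invented for illustration):

  % Facts contributed by the low-level fact finders:
  formField(f1)   formSegment(s1)   containedIn(f1, s1)
  labelledBy(f1, "Min Price")   labelledBy(s1, "Price Range")
  % Phenomenological classification rules:
  formSegment(S), labelledBy(S, L), priceRangeTerm(L) -> segmentType(S, priceRange)
  formField(F), containedIn(F, S), segmentType(S, priceRange),
      labelledBy(F, L), priceTerm(L) -> fieldType(F, minimumPrice)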

(2) Large-scale extraction: The wrapper generated by the analysis stage can be executed independently and repeatedly. We have developed a new wrapper language, called OXPath [12], the first wrapper language for large scale, repeated (or even continuous) data extraction. OXPath is powerful enough to express nearly any extraction task, yet as a careful extension of XPath it maintains XPath’s low data and combined complexity. In fact, it is so efficient that page retrieval and rendering time by far dominate the execution. For large scale execution, the aim is thus to minimize page rendering and retrieval by storing pages that are possibly needed for further processing. At the same time, memory should be independent of the number of pages visited, as otherwise large-scale or continuous extraction tasks become impossible. With OXPath we manage to obtain all these characteristics, as shown in Section 3. For a more detailed description of DIADEM’s stages, see [11].
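
To give a flavour of the language, the following OXPath-style sketch fills a search form, pages through the results with a Kleene-starred click action, and marks records and attributes for extraction (the site URL, field positions, and class names are hypothetical):

  doc("http://example-estates.co.uk")
    /descendant::field()[1]/{"Oxford"}
    /following::field()[2]/{click /}
    /(//a[.="Next"]/{click /})*
    //div[@class="listing"]:<property>
      [.//span[@class="price"]:<price=string(.)>]
      [.//span[@class="location"]:<location=string(.)>]

Here {"Oxford"} types into a form field, {click /} is a click action, the starred step iterates over all result pages, and :<property> and :<price=...> are extraction markers that assemble nested records; all plain XPath steps retain their usual semantics.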

DIADEM’s analysis uses a knowledge-driven approach based on a domain ontology and phenomenology. To that end, most of the analysis is implemented in logical rules on top of a thin layer of fact finders. For the reasoning in DIADEM, we are currently developing a reasoning language targeted at highly dynamic, modular, expressive reasoning on top of a live browser. This language, called GLUE, builds on Datalog± [4, 15, 2, 3, 17], DIADEM’s ontological query language. Datalog± allows us to reason on top of fairly complex ontologies, including value invention and equality dependencies, with little performance loss over basic Datalog (see [7] for the current state of Datalog engines). We are also working on probabilistic extensions [13, 14] of Datalog± for use in DIADEM.
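
Value invention and equality dependencies of the following shape arise naturally in this setting (a sketch in standard Datalog± notation; the predicate names are ours, not GLUE’s):

  % TGD with value invention: every record on a result page represents
  % some, possibly not yet identified, domain object.
  record(R) -> exists O: representsObject(R, O), property(O)
  % Equality-generating dependency: each record represents one object.
  representsObject(R, O1), representsObject(R, O2) -> O1 = O2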

3. FIRST RESULTS

To give an impression of the achievements of DIADEM in its first year, we briefly summarise results on three of its components: its form understanding system, OPAL; its result page analysis, AMBER; and the OXPath extraction language.

Figures 4a and 5 report on the quality of form understanding and result page analysis in DIADEM’s first prototype. Figure 4a [9] shows that OPAL is able to identify about 99% of all form fields in the UK real estate and used car domains correctly. We also show the results on the ICQ and Tel-8 form benchmarks, where OPAL achieves > 96% accuracy (in contrast, recent approaches achieve at best 92% [8]). The latter results are obtained without the use of domain knowledge; with domain knowledge we could easily achieve close to 99% accuracy in these cases as well. Figure 5 [10] shows the results for data area, record, and attribute identification on result pages for AMBER in the UK real estate domain. We report each attribute separately. AMBER achieves on average 98% accuracy for all these tasks, with a tendency to perform worse on attributes that occur less frequently (such as the number of reception rooms).

[Figure 5: Identification: in UK real estate. AMBER’s per-attribute precision and recall on 150 UK real-estate sites: data area, records, price, details-page link, location, legal status, postcode, and bedrooms at over 98% precision and recall; property type, receptions, and bathrooms at 98% precision and 94% recall.]

Figures 4b and 4c summarise the results for OXPath: It easily outperforms existing data extraction systems, often by a wide margin. Its high performance execution leaves page retrieval and rendering to dominate execution time (> 85%) and thus makes avoiding page rendering imperative. We minimize page rendering by buffering any page that may still be needed in further processing, yet manage to keep memory consumption constant in nearly all cases, as evidenced by Figure 4c, where we show OXPath’s memory use over a 12h extraction task extracting millions of records from hundreds of thousands of pages. For more details see [12].

4. DIADEM DEMONSTRATION

Mirroring the DIADEM approach itself, the demonstration is split into two parts. The first part illustrates its analysis using the current DIADEM prototype. The second part illustrates wrapper execution with OXPath using a combination of web demos and visual wrapper development tools. To give the audience a better understanding of wrappers and the challenges in automatically identifying them, we start the demonstration with the execution.

The demonstration starts with Visual OXPath, a visual IDE for OXPath, shown in Figure 6b and illustrated further in the screencast at diadem-project.info/oxpath/visual. With this UI, we demonstrate how a human would construct a wrapper supported by a visual interface and how a supervised wrapper generator can generalise a wrapper from just a few examples. We provide a number of standard examples, but are also open to suggestions from the audience. Once the wrapper is created, we submit it and let it run in the background while we continue with the rest of the demonstration.

[Figure 6: Demonstration UIs. (a) OPAL; (b) Visual OXPath.]

In the second part of the demonstration, we focus on the DIADEM prototype. A screencast of the demo is available from diadem-project.info/screencast/diadem-01-april-2011.mp4. In the demonstration, we let the prototype run on several pages from the UK real estate and used car domains. The prototype runs in interactive mode, i.e., it visualizes its deliberations and conclusions and prompts the user before continuing to a new page, so that the user can inspect the results on the current page. With this, we can demonstrate first how DIADEM analyses forms and explores web sites based on these analyses. In particular, we show how DIADEM deals with form constraints such as mandatory fields, multi-stage forms, and other advanced navigation structures. The demo fully automatically explores the web site until it finds result pages, for which it performs a selective analysis sufficient to generate the intended wrapper. As before, the demonstration discusses a number of such result pages and shows in the interactive prototype how DIADEM extracts objects from these result pages as well as identifies further exploration steps.

We conclude the demonstration with a brief outlook on applications of DIADEM technology beyond data extraction: The screencast at diadem-project.info/opal shows how we use DIADEM’s form understanding to provide advanced assistive form filling: where current assistive technologies are limited to simple keyword matching or to filling already encountered forms, DIADEM’s technology allows us to fill any form of a domain with values from a master form with high precision.

5. REFERENCES

[1] M. Benedikt, G. Gottlob, and P. Senellart. Determining relevance of accesses at runtime. In PODS, 2011.
[2] A. Calì, G. Gottlob, and A. Pieris. New expressive languages for ontological query answering. In AAAI, 2011.
[3] A. Calì, G. Gottlob, and A. Pieris. Querying conceptual schemata with expressive equality constraints. In ER, 2011.
[4] A. Calì, G. Gottlob, and A. Pieris. Query answering under non-guarded rules in Datalog±. In RR, 2010.
[5] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5), 2004.
[6] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. In VLDB, 2011.
[7] O. de Moor, G. Gottlob, T. Furche, and A. Sellers, eds. Datalog Reloaded: Revised Selected Papers, LNCS, 2011.
[8] E. C. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. In VLDB, 2009.
[9] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. Real understanding of real estate forms. In WIMS, 2011.
[10] T. Furche, G. Gottlob, G. Grasso, G. Orsi, C. Schallhart, and C. Wang. Little knowledge rules the web: Domain-centric result page extraction. In RR, 2011.
[11] T. Furche, G. Gottlob, X. Guo, C. Schallhart, A. Sellers, and C. Wang. How the minotaur turned into Ariadne: Ontologies in web data extraction. In ICWE (keynote), 2011.
[12] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers. OXPath: A language for scalable, memory-efficient data extraction from web applications. In VLDB, 2011.
[13] G. Gottlob, T. Lukasiewicz, and G. I. Simari. Answering threshold queries in probabilistic Datalog+/- ontologies. In SUM, 2011.
[14] G. Gottlob, T. Lukasiewicz, and G. I. Simari. Conjunctive query answering in probabilistic Datalog+/- ontologies. In RR, 2011.
[15] G. Gottlob, G. Orsi, and A. Pieris. Ontological queries: Rewriting and optimization. In ICDE, 2011.
[16] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. AI, 118, 2000.
[17] G. Orsi and A. Pieris. Optimizing query answering under ontological constraints. In VLDB, 2011.
[18] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner: Open information extraction on the web. In NAACL, 2007.