extraction and integration of data from semi-structured

9
acceptedfor publication at the International Conference on Industrial Applications of Prolog (INAP '97) Extraction and Integration of Data from Semi-structured Documents into Business Applications Ph. Bonnet & S. Bressan Sloan WP# 3979 CISL WP# 97-12 September 1997 The Sloan School of Management Massachusetts Institute of Technology Cambridge, MA 02142

Upload: others

Post on 25-Oct-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Extraction and Integration of Data from Semi-structured

acceptedfor publication at the International Conference on Industrial Applications ofProlog (INAP '97)

Extraction and Integration of Data fromSemi-structured Documents into

Business Applications

Ph. Bonnet & S. Bressan

Sloan WP# 3979 CISL WP# 97-12September 1997

The Sloan School of ManagementMassachusetts Institute of Technology

Cambridge, MA 02142

Page 2: Extraction and Integration of Data from Semi-structured

Extraction and Integration of Data from Semi-structuredDocuments into Business Applications

Ph. BonnetGIE Dyade, Inria Rh6ne-Alpes

655 Avenue de l'Europe38330, Monbonnot Saint Martin, France

Philippe. Bonnet~dyade. fr

1 Introduction

Collecting useful information along the informationAutobahn is a fun window shopping activity whichrapidly becomes frustrating. As the technology of-fers a broader logical connectivity, enables schemesfor secure transactions, and offers more guaranteesof quality, validity, and reliability, it remains diffi-cult to manage the potentially available data, and tointegrate them into applications.

Can I write a program which would assist me inplanning my next business trip by retrieving rele-vant data about hotel prices, weather forecast, planeticket reservations? How can I manage my invest-ment portfolio without gathering and copying every-day, by hand, stock prices and analysts recommen-dations into spreadsheets?

On the one hand users can browse or surf theWorld Wide Web (under the name of which we lib-erally include all electronic information sources andprotocols, e.g. WAIS, FTP, etc). Data is thenembedded into HyperText Markup Language doc-uments, Postscript, LaTeX, Rich Text Format, orPDF documents. These documents are meant forviewer applications, i.e application whose task is onlyto present them to end-users in a readable, possiblynicely laid out format. On the other hand, ElectronicData Interchange networks, which allow the integra-tion of data into complex applications, remain spe-cific to narrow application domains, operating sys-tems, and most often proprietary applications.

Wiederhold, in [Wie92], proposed a reference ar-chitecture, the mediation architecture, for the de-sign of heterogeneous information systems integra-tion networks. As the reference for the IntelligentIntegration of Information (Is) DARPA program, ithas been adopted by most American research groupsand became a de facto standard for the research com-munity. As several other projects (e.g. the SIMSproject at ISI [AK92], the TSIMMIS project at Stan-ford [GM95], The Information Manifold project at

S. BressanMIT Sloan School of Management

50 memorial drive, E53-320Cambridge, MA, 02139, USA

stephQcontext .mit. edu

AT&T [LSK95], the Garlic project at IBM [PGH96],the knowledge Broker project at Xerox [ABP96], orthe Infomaster project at Stanford [GKD97]), theDisco project at GIE Dyade [TRV96, TAB+97] andthe COIN project at MIT [GBMS97, BGF+97] im-plement this architecture.

The mediation architecture categorizes the media-tion tasks into three groups of processes: the facilita-tors, the mediators, and the wrappers. The facilita-tors are application programming and user interfacesfacilitating the seamless integration of the mediationnetwork into end-users environments. The mediatorsprovide the advanced integration services: query andretrieval over multiple sources, schematic federation,semantic integration, as well as the necessary meta-data management. The mediator assumes physicalconnectivity and a first level of logical connectivityto the remote information sources. This initial con-nectivity is provided by wrappers. At the physicallevel a wrapper constitutes a gateway to a remoteinformation source, being an on-line database, a pro-gram, a document repository, or a Web site. At thelogical level, a wrapper masks the heterogeneity ofan information source; it provides a mapping, suitedto the mediation network, of the data representation,the data model and the query language.

Although the World Wide Web has departed fromthe initial raw distributed hypertext paradigm to be-come a generic medium for all kinds of informationservices including online databases and programs,most Web services, with the exception of Java andActiveX based interfaces, provide information in theform of documents. The Web-wrapping challenge isto extract structured data from these documents. Infact, for large number of documents, it is possible toidentify certain patterns in the layout of the pre-sentation which reflect the structure of the data thedocument presents. These patterns can be used toenable the automatic identification and extractionof data. It is for instance the case when data isorganized in Excel spreadsheet tables, HTML item-

1

Page 3: Extraction and Integration of Data from Semi-structured

--- .. , ......... ...

Figure 1: Sample Web Page

ized lists, or Word structured paragraphs. Further-more, as documents are updated or as new docu-ments are added, a relative perennity of the patternscan be assumed. We will call such documents semi-structured documents [Suc97]. Of course, it is aparadox to try and extract data from a document,which, most likely, was generated from the contentof a structured database in the first place. However,such a database, when it exists, is generally not avail-able.

We focus in this paper on the pattern matchingtechniques that can be used to implement genericwrappers for semi-structured documents; we discusshow Prolog can support these techniques.

2 Generic Wrappers

Let us consider the example document rendered onfigure 1. The document is an HTML page containinga list of hotels together with rates, addresses, anddescriptions of facilities. It is one example of thedocuments served by a Web hotel guide for the cityof Amsterdam. This Web service has been wrappedwith the COIN Web wrappers, and is now accessibleas a relational database exporting the relation "h-ams", which contains attributes such as Name, thename of the hotel, Single, the rate for a single room,or Double, the rate for a double room. A wealthyuser can now look for hotels whose rates for singlerooms are higher than 250 Dutch Guilder. Table 1shows an example of a query in SQL and its results asprocessed and presented by the COIN Web wrappers.

Web wrappers can either be specific programs de-

select Name, Single from h-ams where Single>250;as of Tue Aug 26 12:34:12 1997; cols=2,rows= 7

Name SingleAmstel InterContinental 725Amsterdam Renaissance 332De l'Europe 475Golden Tulip Barbizon Centre 375Grand Hotel Krasnapolsky 350Hilton Amsterdam 360Holiday Inn Crown Plaza 345

Table 1: Query Results (as returnedwrappers Web query interface)

by the COIN

signed and implemented for interfacing a specific setof documents, or generic components intended to bereused for a variety of sets of documents. For spe-cific wrappers, the patterns to be identified are gen-erally defined once and for all, and the data extrac-tion procedure is an optimal implementation of thematching of these patterns. For generic wrappers,the patterns are parameters in a specification. Thespecification is a declaration of what data need to beextracted, how it can be extracted, and how it shouldbe re-structured and presented.

Figure 2 shows the architecture of the Disco[ABLS97] and COIN [BL97] generic Web wrappers.Queries are submitted to the wrapper interface.They are formulated in the wrapper query language(Object-relational algebra for Disco, SQL for COIN).The query is compiled, together with the relevantspecifications (the specifications of the documentsneeded to answer the query). The planner/optimizergenerates an execution plan. The plan is executed bythe executioner. The plan describes what documentsneed to be retrieved by the network access compo-nent, which patterns need to be matched from indi-vidual documents by the pattern matching compo-nent, and how individual data need to be combinedand presented.

3 Pattern MIatching

3.1 Example

The basic wrapping problem is the definition of pat-terns which can be used to automatically and effi-ciently identify and extract data from the documents.The Web service we consider in our running exampleserves HTML pages. An excerpt of the source codefor the HTML document 1 is presented in figure 3. Alist of hotels with rates, addresses, and descriptionsof facilities is presented on a single page in an HTMLtable.

A simple piece of data to extract is, for instance,

2

.P. e'dit '' %ii ,ladnafl Optt b ., r h .. · ·i

.......,,,,,, ' .. ...... ,,,, ........ .. .

_ _._..._ ....... .... _... .._ .", iG ,,u"..'_.:.:_ ':.":..'.' .:.' .."'7.: " _;:.'. . _,:.:.'.i .7 ....- : ._.'..:'':'.._ ... ..... _. ".....'..',,._".: "_. :. '. .......... :".- .'_

.... ,,· .,,,,,' ..... '·'"" - ...... ",~ ' , ,,,ta. '-. · . ... ,..' "', ,' "...,'., ' -..'.·:.... '. '.,.. '· '','., ',.,,,o'l"', ,,,,''',·,'...'.,..,....' " ''..· ' ...... ".,--. ... ' '. ... '. ''..... "-. ... - .... .* .''- .. ..... .. 'i'- ,'. . t' ;. . . ". '.., ". '. . ' * * ' .*.*. .,.., ".-

*'. .. 'tt& l... , ,'' * ' ..............'. ~..? ',~.,,,.~~~~,? .-." '-. "-..' ' ..- .' .'.....'. '.. " '..-- .: , ' '... ' . ' '..'. '..'.. . 'v. " -. ,'. '.,7' '- '- "'.

., '..-. '. '... '. "·' '.. . ml~.s<W Ad t' l,0t 05 .~eem '. *,,v,,'; .--- ,'m Ot * i- -s '*- -. ..-

:'.. "'--.-. odditin'",,3 ... .. '..· ,,i ...... ......... ............... ": '-. ··-..' ... :................ '.:. '-.. *" .- :. · ."''~;. ~:,,''. , ,ii ,(t.?,, hx r. , , ~'u. .... : '.-..." ... *. '...' ... '. ". ' " ',' -.. '. ... ."o' "p . .. d ". .. . .

d.$. .d.. . . .*'.. .' . ......: v' " ., ,'..l , X -. " ., ''" "fa': 3(~' .,'.... " : '" " ' '.. .:' ' '.' "'" ·

,, '7., · , .. ". .........~,....... ................. ..... .... .. ' ...-' * ' .' ' . '.. i . ' '.. '. .*,'. . ', ,, . ..~· ' d+·o2zz43.. .,~~ , . ,, -.... ...~bU~

. . :S dl 2 . .. a000.' "' . '' ' ;· ·· u ^hnlnxco:tL"~~i yc~. _ ,, ' * ' . .:.. .. ' ' ··'. "~hl J,,0................... . '.. .. -. s ..ir~P

'' ..'. "'' "'. 1ob ' ~" .,' ' ' . "' "'- ..' '," '", ' "' ̀'.. ' .' . , '. ': ''' _ __._rLm~~9; I

�0. ...__ ------ --------- ....................... R -

1:

i~~i~....... L . [.7,7..,...,...·..,.,

Page 4: Extraction and Integration of Data from Semi-structured

Query

Interface

QueryCompiler/Interpreter

Plnnr \

Planner/

OptimizerCmIler

Specification

Results

4

Compiler/ Interpreter

Web Documents

Specifications

Figure 2: Generic Web Wrapper Architecture

the name of the hotels in the list. Taking advantageof the text formatting, namely of the HTML tags,we decide to rely on the fact that hotel names arealways between <STRONG> and </STRONG> tags. Wealso make sure that any text in such a block is a hotelname. As it is the case, a simple regular expressionsmatches the hotel names and tags and stores themin variables, called back references.

For instance the following is a Perl program thatreturns the list of hotel names as specified above ($dis a string that contains the source code of the HTMLdocument):

<1> Qr = ($d = /<STRONG>(.*?)<\/STRONG>/sg);

<2> foreach $e (r)

<3>{print $e; print "\n";}

The string is matched with the regular expressionon line <1> 2 . For each succesfully matched substring,the portion corresponding to a part of the regular ex-pression in parenthesis is stored in a variable, called aback reference. Several matches are obtained by theoption s and g on line <1> indicating that the stringis to be treated as a single sequence (as opposed to

'We use the Perl regular expression syntax unless otherwisespecified

2n*+,, and "*?" are, respectiveley, the greedy and lazy iter-ation quantifiers. For instance, ".*?" at the begining or at theend of a regular expression matches nothing and is thereforeuseless. For instance, ".*" at the beginning or at the end of aregular expressions matches as much as possible, and thereforelimits the number of matches to at most one.

raItwtI IVlULL;IAIles

Network Access

<TR><TD WIDTH="450" COLSPAN="8">

<IMG WIDTH="444" HEIGHT="3" vspace="15"

ALT="line" SRC="streep. gif"><BR>

<FONT SIZE="3">

<STRONG>Condette</STRONG>

</FONT><BR>

<IMG vspace="2" SRC="dot_clear.gif"><BR>

</TD></TR>

<TR><TD WIDTH="25"><BR></TD>

<TD WIDTH="425" COLSPAN="7">

<FONT SIZE="2">Dijsselhofplantsoen 7<BR>

1077 BJ Amsterdam<BR>

telephone: +31 (0)20-6642121<BR>

fax: +31 (0)20-6799356<BR>

98 rooms<BR>

price single room: FL 225,00/395,00<BR>

price double room: FL 275,00/495,00<BR>

Figure 3: Sample Web Page (Excerpt of the HTMLSource)

a sequence of lines), and that matching should beiterated, respectively. The list of back references isstored in the array Qr. The loop (line <2>) prints thevalues in the cells of the array (line <3>). A Similarprogram can be written in Prolog, for instance, witha Definite Clause Grammar (DCG) (taking its inputfrom a list of characters):

hotel_name(Name) --> anything(_),["<", "S", "T", "R", "0", "N", "G", ">"],anything(Name),

anything(t]) -- > [].

anything( [X L] -- > X], anything(L).

The use of pattern matching as opposed to a com-plete parsing of the document introduce a form ofrobustness in our specification. Indeed, other partsof the document, which can be free text, will havelittle impact on the extraction as long as they do notcontain a matching string. Therefore they can varywithout compromising the behaviour of the wrapper.

In order to extract additional data about the indi-vidual hotels, for instance, the name, together withthe (low and high) rates for single and double rooms,we need to use a more complex pattern with threeback references. In Perl, the pattern could look like:

<STRONG>(.*?)<\/STRONG>.*?\

single room: FL (.*?)/(.*?)<BR>\n\

price double room: FL (.*?)/(.*?)<BR>

As we use more complex expressions, relying onmore elements in the structure of the document toanchor our patterns, the robustness of the specifica-tion decreases.

Regular expressions seem to be a good candidatefor a pattern specification language. However, as

3

Executioner

~^t+^ hA·,renh:~r

!

D.

s

Page 5: Extraction and Integration of Data from Semi-structured

we shall see, we need to understand their expres-sive power to decide wether they fit our needs ina sufficient number of situations. It must be clear,for instance, that a regular expression cannot under-stand the notion of nested blocks. In addition, weshould not be confused by the fact that <STRONG>,</STRONG> is an HTML block. Indeed, only a partic-ular control of the backtracking in the pattern match-ing of our Perl program (part of it is expressed by the? after the *) will prevent the matching of substringsof the form <STRONG>. <\/STRONG>. <\/STRONG>. Aregular expression language with negation (regularsets are closed under difference) could be used to ex-clude the <\/STRONG> sequence from the innermostmatch.

In summary, the pattern description languageneeds to allow the synthetic (concise and intuitive)and declarative expression of the patterns. It needsto be expressive enough to cover a satisfying class ofpatterns. Together with its robustness these qualita-tive properties will allow to handle documents whosecontent vary and will ease the maintenance of wrap-pers. Wrappers are most likely to be used as just-on-time services for the extraction of data. The ex-traction process needs to be efficient and avoid thepitfall of combinatorial complexity of the general pat-tern matching problem. As we shall see, this issue ismainly related to the control of backtracking.

Several tools and languages provide support forparsing and pattern matching3 . Several unix com-mands such as grep, or Is, and many text ed-itors, such as ed, or emacs support find- orreplace-functions based on regular expressions pat-tern matching. SNOBOL44 , Perl, or Python providea native support for regular expressions and pat-tern matching. Regular expression pattern match-ing tools are announced for JavaScript. C (withits compiler-compiler tools Lex, Flex, Yacc, and Bi-son, or regular expression libraries such as the gnuregex),and Java (with JLex, Jacc, or regular expres-sion libraries) also provide the necessary support forstandard parsing and pattern matching. However,these tools do not offer the necessary flexibility inthe control of the backtracking and are not as easilycustomizable as Prolog DCG. In the next sections wewill study pattern matching from the point of viewof its implementation with Prolog DCG.

3.2 Back to the Basics

A formal language [HU69] L is a set of words,i.e. sequences of letters or tokens, from an alpha-

3 A practical survey of regular expression pattern matchingin standard tools and languages can be found in [Fri97]

4 SNOBOL was the first language to provide regular expres-sion pattern macthing and backtracking [GPI71]!

bet A (the empty sequence is noted e). A lan-guage described in extension, by the enumeration ofits elements, is of little practical use. Instead, lan-guages can be described by generators or recognizers.A generator, for instance a grammar, is a devicewhich describes the constructions of the words of thelanguage. A recognizer is a device which, given aword of the language, answers the boolean questionwether the words belongs to the language. In ad-dition, a transducer is a device which translates aword of the language into a structure (e.g. compilersare transducers).

If a string s is a sequence of elements of the al-phabet A, string pattern matching can be formallydefined by the recognition of word w of a formal lan-guage L as a prefix of the string. If. stands for thelanguage of words composed of a single element of A,L* for the language of sequences composed by con-catenation of any number of words of L, and L1 L2 forthe language of sequences composed by the concate-nation of any pair of words of L1 and L2, respectively,then the matching of a pattern L is the recognitionof the language . * L as a prefix of the string s.

A grammar, or generative grammar, is mainly a setof productions rules (grammar rules) whose left andright-end side are composed of non-terminal sym-bols (variables) and terminals (letters or tokens).Chomsky has proposed a four layer hierarchy of lan-guages corresponding to four syntactically caracteri-zable sets of grammars: the type 0 languages (almostanything), the context sensitive languages (e.g. SwissGerman, cf. [GM89], or abncn), the context freelanguages (e.g. block languages, most programminglanguages), and the regular languages (language gen-erated by regular expressions).

There are three main categories of recognizers: au-tomata and machines, transition network, and defi-nite clause grammars. In all categories we find a oneto one mapping between types of recognizers, lan-guage classes in the Chomsky hierarchy, and gram-mar types. The recognizers as automatic devices ordecision procedure give a framework to study notonly how to implement parsers for a language butalso for studying the inherent complexity of a lan-guage and therefore the cost of its parsing. DCGis a powerful formalism: DCG have the syntax ofgenerative grammars but they correspond to Prologprograms which can serve as both generators and rec-ognizers. In addition, the adjunction of argumentsto the grammar terms allows the DCG to implementtransducers.

A simple way to implement a generic patternmatcher for any language L using DCG is given in[O 'K90]:

match(Pattern) --> Pattern I (_], match(Pattern)).

4

Page 6: Extraction and Integration of Data from Semi-structured

The pattern Pattern is to be replaced by the ac-tual root of the DCG recognizing the language L.

Unfortunately, the problem of matching a patternis combinatorial by nature when its output is all thematching substrings and their respective positions.For regular expressions it is quadratic. For instance,matching the regular expression a+ in a string of n

a can be done in about (2-+) different ways. If thequestion compels a boolean answer: "does there ex-ist a match?", the complexity can be dramaticallyreduced by only looking for the first match. This ishowever not satisfactory if the goal of the matchingis to return back references. A usual compromise isto look for matching substrings which do not over-lap: if s is a sequence composed of two strings sland s2, s = S1s2, and s is a word of the language. * L, the matching is recursively iterated on s2. Thisstandard behaviour of string pattern matching algo-rithm corresponds to the following Prolog program(which becomes slightly more complicated if we addprovision for the management of back references):

matches(P) --> match(P),! ,result(P).

result(P) --> .

result(P) --> matches(P).

In regular expressions, there are two other sourcesof non determinism: the alternation ( I ) and the iter-ation *. The non determinism of these constructs isabsolutely clear when the expressions are translatedinto DCG rules; It correspond to backtracking op-portunities. For instance, an alternation of the forma b becomes:

pattern -- > [a] I [b].

An iteration of the form a*, resp. a*?, becomesthe following DCG greedy_star, resp. lazystar:

greedy_star --> ([a], greedystar) i [].

lazystar --> [] I ([], lazystar).

There are three main techniques to reduce the non-determinism of a DCG: indexing, elimination of thee-transitions, and state combination. Indexing con-sists in preventing the DCG from backtracking onthe reading of input. For instance, the DCG for a I bcan be rewritten:

pattern --> [X], pattern(X).

pattern(a) --> [].

pattern(b) --> [].

e-transitions do not read any token from the in-put. They only indicate possible continuations. Forinstance, in the DCG given in the next section for theautomaton on figure 4, states state2 and state5 canbe merged. Finally, by combining states which havethe same input one can reduce the non-determinism,and eventually eliminate it for regular languages.The drawback is the possible combinatorial growthof the number of states as new states are createdfrom the power set of the original set of states.

It is possible to use DCG to implement efficientlygenerators, recogni2ers, and transducers. Regularexpressions can be optimized and compiled into ef-ficient DCG. We shall use these devices for imple-menting pattern matching tools in Prolog.

4 Prototyping Generic Wrap-pers In Prolog

The Disco and the COIN projects are sharing aProlog-based prototyping platform for the experi-mentation and testing of Web wrappers. The plat-form is implemented in ECLiPSe Prolog. The Webwrapper prototyping environment follows the gen-eral wrapper architecture illustrated in figure 2.The query compiler, the planner optimizer and therelational operators of the executioner have beenadapted from the COIN mediator components. Anoverview of the design and Prolog implementation ofthese components is given in [BFM+97]. The net-work communication layer of the executioner, neces-sary to access documents over the World Wide Webis build from the authors ECLiPSe HTTP library[BB96].

Of interest in this paper are the libraries devel-oped for the compilation and execution of pattern de-scription language. These libraries implement stan-dard algorithms for the construction, optimization,and simulation of regular expressions. The regularexpressions are transformed into Finite State Au-tomata, the automata are optimized and compiledinto DCG. The DCG can be combined with non reg-ular parsers.

A Finite State Automaton is encoded as a DCG.For instance, the automaton of figure 4, represent-ing the regular expression a(alab)*a, correspondsto (e.g.) the following DCG:

stateO -- > [a], statel.statel -- > state3 I state5 I ([a], state2).state2 --> state5.

state3 --> [a], state4.

state4 --> [b],state5.state5 --> statel I ([a], state6).

state6 --> [].

A procedure compiles the regular expression intothe DCG representation. Various procedures can thebe applied to optimize the automaton: eliminationof the e-transitions, elimination of redundant or in-accessible states, determinization. Finally, the au-tomaton is given a name and is compiled as a DCG.

Most importantly, we allow the token classes usedin the regular expressions to correspond to otherDCG. This simple mechanism enables the combina-tion of regular and non regular parsers into regular

5

Page 7: Extraction and Integration of Data from Semi-structured

JI

Oa

Figure 4: Finite State Automaton

expressions. It also enables the re-use of existing tok-enizers (e.g. a tokenizer for HTML) by modularizingthe code.

Finally, although we have concentrated on stringpattern matching, we would like to remark thata simple modification of the input stream ofthe DCG allows to manipulate Directed AcyclicGraphs (DAGs). Let us assume, for instance, thatthe DAG composed of the two paths abcd andaefd is represented by the nested list structure[a, £[b,c] , [e, f]] ,d]. We only need to triviallymodify the 'C'/3 procedure in order for the DCGto parse paths in the DAG. Such a simple extensioncan be very useful for Web documents as argued in[PK95]. The nested structure can not only representitemized lists or tables but also acyclic networks ofhypertext documents linked by their hyperlinks.

5 Commercial and PrototypeWrappers

In this section we present five different Web wrap-pers. They are representative of three different ap-proaches. EdgarScan is a specialized wrapper de-signed, programmed and tuned for specific docu-ments of a specific application. The COIN and theDisco Web wrappers are generic wrappers. Theycompromise with the expressive power of the speci-fication language to protect its declarativeness andminimize the programming task. TSIMMIS, andOnDisplay combine the declarative specificationswith programming hooks to create a wrapper devel-opment environment. In fact many more projects(e.g [CBC97, KNNM96]) are developing wrapper de-velopment environemnts as a set of libraries for re-trieving, parsing, and extracting data from docu-ments. The set of ECLiPSe tools we have developedfor the prototyping of wrappers fall into the lattercategory.

5.1 EdgarScan

EdgarScan [Fer97] is an application developped atthe Price Waterhouse Technology Centre for the ef-ficient and accurate automatic parsing of financial

statements. In the United States, the Securities andExchange Commission (SEC) has made compulsaryfor US corporation to file a variety of financial doc-uments. This documents are stored and publiclyavailable in the EDGAR database. Guidelines forwhat information is to be presented are stricter thanguidelines for how information should be presented.The disparity of accounting practices and the speci-ficity of each situation has prevented a structurationthat can easily be used to automatically process thecontent of the filings. In these filings, accounting ta-bles are structured summary of the main figures ofa company. They are of highest interest for the fi-nancial analysts. Even'in those tables, most data areaccompanied by footnotes in plain english comment-ing on the definition of individual items, columns orrows, and necessary for a correct interpretation ofthe data. EdgarScan is specialized in the parsingof accounting tables. It extracts (using a C regularexpression pattern matching library) and reconcile(using Prolog-based expert system) the data in thetables.

5.2 Centerstage

Centerstage [Ond97] is a commercial wrapper devel-opment and deployment tool kit. It operates as aPlug-in of the Internet Explorer. The kit not onlycomprises tools for the wrapper definition but alsofacilities to export data in a variety of desktop appli-cations such as Excel, Word, etc. Additional serverand driver components allow to interface the result-ing data with Web or ODBC front-end applications.The wrapper definition and testing environment isbased on a rich point-and-click interface: the userselect the data to be extracted on a example Web-page and composes a pattern "by example". Thepattern is then compiled into a wrapper agent, i.e. aJavaScript program which is able to extract the data.The agent program can be further edited and refinedby the user. Our tests have shown that the patternmatching technique used by Centerstage is similar tothe a la Perl regular expression pattern matching.

5.3 TSIMMIS

The Data model of the TSIMMIS mediation net-work [GMIV95] is the Object Exchange Model (OEM).OEM is a complex object model which allow the flex-ible manipulation of complex data without the con-straint of a rigid schema. The TSIMMIS wrappers(in [Suc97]) extract OEM objects from Web docu-ments. The specification consist in a sequence ofcommands of the form [variables, source, pattern],which indicate the construction of an OEM objectas the results of the matching of the pattern on the

6

Page 8: Extraction and Integration of Data from Semi-structured

source string. The variables can be used as sourcesfor other commands, implicitely defining a nested ob-ject. The source can be the result of any primitive orfunction of the underlying language (Python) whichproduces a string (e.g. http-get, split). Interest-ingly, the commands are similar to (regular) gram-mar rules. The programmer can sometimes choosebetween regular expressions pattern matching andrule programming depending on the structure shewants to confer to the OEM object. Unfortunately,the semantics of the command language is not clearlydefined (e.g. w.r.t backtracking). The semantics ofthe regular expression pattern matching is the one ofthe implementation language Python.

5.4 COIN

The COIN web wrappers [BL97] provide a relationalinterface to Web services. Web and ODBC front-ends are available and used in COIN mediation pro-totype [BGF+97].

The relational schema chosen by the wrapper de-signer to be the interface of a particular service is de-fined in terms of individual relations correspondingto the data extracted from a set of documents de-fined by the pattern of an HTTP message (method,URL, object-body). The semantics of the relationexported is defined under the Universal Relation con-cept: attributes of the same name are the same.When a query (in SQL) incomes, the wrapper createsand executes a plan accessing the necessary pages.A relation can be composed of both static and dy-namic documents (e.g. document generated by cgi-programs and accessible through forms) in a verynatural way (by automatically "filing the forms" withdata from the query or extracted from other pages),making the COIN wrappers uniquely powerful.

To each relation exported by a given site, corre-spond a specification file. The file contains for eachset of documents patterns described in a simple reg-ular expression language. The COIN pattern defini-tion language is currently limited to plain regular ex-pressions, however because it is simple, it allows thevery rapid development and maintainance of simplebut practical wrappers.

The COIN wrappers have been prototyped in Pro-log and Perl, they are currently ported to Java.

5.5 Disco

Disco wrappers, called adapters, provide an object-relational interface to programs, services and docu-ments [ABLS97]. They are used in the Disco me-diation prototype [TAB+97]. The extraction of in-formation from a semi structured document is per-formed in the pattern matching component of the

Disco adapter. A stand-alone prototype, InfoExtrac-tor (in [Suc97), wag developed to validate and refinethe Disco information extraction approach: a docu-ment abstract model is used in which the documentis hierarchically partitioned into regions which con-tain concepts of interest.

For each document, the configuration describes anabstract model of the document, i.e the hierarchicalpartitioning of the document into regions, conceptsand subconcepts. To a concept is associated a regularexpression and a set of weighted keywords. If the setof keywords is not empty, the match is accepted onlyif a function of the weights is above a given threshold.This feature provides a unique combination of pat-tern matching and information retrieval techniquesand increases the robustness of the wrappers. Theoutput of the pattern matching component is a treeof concepts and subconcepts.

The Disco wrappers have been prototyped in Pro-log and Perl, they are now implemented in Java.

6 Conclusion

We believe in and we will contribute to the de-velopment of a network of Web wrappers offeringa structured access to semi-structured Web docu-ments. The application domains which can benefitby the just-on-time availability of data are numer-ous. As an illustration of the potential, the directo-ries and catalogs registered in the Yahoo thesaurusalready contain large amounts of useful informationfor professional and individuals waiting to be auto-matically integrated into decision support processes.

Although in many cases, ad-hoc solutions will needto be programmed, we believe that a trade-off can befound for the definition of a generic tool for the easydevelopment of Web wrappers. The generic Webwrapper architecture we have described relies on apattern description language and on pattern match-ing techniques. Prolog and Logic programming ingeneral, as platforms for the implementation of sym-bolic manipulation programs, seemed to be a perfectmatch. Indeed, ECLiPSe proved to be a very goodtool for the rapid prototyping and experimentationof Web wrappers. However, to reach the efficiency re-quirements of industrial strength code, designers andprogrammers are needed who have high programingskills [O'K90]. This is for instance the case for theefficient implementation of the parsing and patternmatching algorithms. At this level of fine tuning,the suitability of Prolog versus C or even Java isquestionned. Furthermore, some limitations of mostexisting Prologs, such as the lack of support for con-current programming, is a serious drawback for theimplementation of network applications. Neverthe-

7

Page 9: Extraction and Integration of Data from Semi-structured

less, as a rapid prototyping tool, Prolog brings a com-petitive advantage in a situation where developmentcycles are constantly shrinking due to the pressure ofcompetition in a global distribution infrastructure.

References[ABLS97] R. Amouroux, Ph. Bonnet, M. Lopez, and

D. Smith. The design and construction ofadapters for information mediation. (Sub-mitted), 1997.

[ABP96] J.M. Andreoli, U. Borghoff, and R. Pareshi.The constraint-based knowledge brokermodel: Semantics, implementation, andanalysis. J. of Symbolic Computation, 1996.

[AK92] Y. Arens and C. Knobloch. Planning andreformulating queries for semantically mod-elled multidatabase. In Proc. of the Intl.Conf. on Information and Knowledge Man-agement, 1992.

[BB96] S. Bressan and Ph. Bonnet. The ECLiPSehttp library. In Proc. of the Intl. Conf. onIndutrial Applications of Prolog, 1996.

[BFM+97] S. Bressan, K. Fynn, S. Madnick, T. Pena,and M. Siegel. Overview of the prolog imple-mentation of the context interchange media-tor. In Proc. of the Intl. Conf. on PracticalApplications of Prolog, 1997.

[BGF+97] S. Bressan, C. Goh, K. Fynn, M. Jakobisiak,K. Hussein, H. Kon, T. Lee, S. Madnick,T. Pena, J. Qu, A. Shum, and M. Siegel.The context interchange mediator prototype.In Proc. of the ACM SIGMOD Intl. Conf.on the Management of Data, 1997. (demotrack).

[BL97] S. Bressan and T. Lee. Information brokeringon the world wide web. In Proc. of the Web-net World Conference, 1997. (To be pub-lished).

[CBC97] B. Chidlovskii, U. Borghoff, and P.-Y.Chevalier. Towards sophisticated wrappingof web-based information repositories. InProceedings of the 5th International RIA 0Conference, Montreal, Canada, 1997.

[Fer97] D. Ferguson. Parsing financial statements ef-ficiently and accurately using c and prolog.In Proc. of the Intl. Conf. on Practical Ap-plications of Prolog, 1997.

[Fri97] J. Friedl. Mastering Regular Expressions.O'Reilly and Associates, Inc., 1997.

[GBMS97] C. Goh, S. Bressan, S. Madnick, andM. Siegel. Context interchange: New fea-tures and formalisms for the intelligent in-tegration of information. Technical ReportCISL WP97-03, MIT Sloan School of Man-agement, 1997.

[GKD97] 'M. Genesereth, A. Keller, and O. Duschka.Infomaster: An information integration sys-tem. In Proc. of the ACM SIGMOD Intl.Conf. on the Management of Data, 1997.(demo track).

[GM89] G. Gazdar and C. Mellish. Natural LanguageProcessing in Prolog. Addison Wesley, 1989.

[GM95] H. Garcia-Molina. The TSIMMIS approachto mediation: Data models and languages. InProc. of the Conf. on Next Generation Infor-mation Technologies and Systems, 1995.

[GPI71] R. Griswold, J. Poage, and Plonsky. I. TheSNOBOL4 programming Language. PrenticeHall, 1971.

[HU69] J. Hopcroft and J. Ullman. Formal languagesand their relation to automata. Addison-Wesley, 1969.

[KNNM96] Y. Kitamura, H. Nakanishi, T. Nozaki, andT. Miura. Metaviewer and metacommander:Applying www tools to genome informatics.In 7th Workshop on Genome Informatics,1996.

[LSK95] A. Levy, D. Srivastava, and T. Kirk. Datamodel and query evaluation in global infor-mation systems. Journal of Intelligent Infor-mation Systems, 1995.

[O'K90] R. O'Keefe. The Craft of Prolog. MIT Press,1990.

[Ond97] Ondisplay. centerstage.http://www.ondisplay.com, 1997.

[PGH96] Y. Papakonstantinou, A. Gupta, andL. Haas. Capabilities-based query rewritingin mediator systems. In Proc. of the Intl.Conf. on Parallel and Distributed Informa-tion Systems, 1996.

[PK95] K. Park and D. Kim. String matching inhypertext. In Proc. of the Symp. on Combi-natorial Pattern Matching, 1995.

[Suc97] D. Suciu, editor. Proceedings of theWorkshop on Management of Semi-structured Data, Tucson, Arizona, 1997.http: / /www. research.att.com/ suciu/workshop-papers.html.

[TAB+97] A. Tomasic, R. Amouroux, Ph. Bonnet,O. I(apistskaia, H. Naacke, and L. Raschid.The distributed information search compo-nent (disco) and the world wide web. In Proc.of the A CM SIGMOD Intl. Conf. on theManagement of Data, 1997. (demo track).

[TRV96] A. Tomasic, L. Raschid, and P. Valduriez.Scaling heterogeneous database and the de-sign of Dsco. In Proc. of the 16th Intl. Conf.on Distributed Computing Systems, 1996.

[Wie92j G. Wiederhold. Mediation in the architec-ture of future information systems. Com-puter, 23(3), 1992.

8