Framework for annotation and composition of web space analysis tools

Vojtěch Svátek
Department of Information and Knowledge Engineering

University of Economics, Prague, Czech Republic

V. Svátek, KEG 29.9.2005

Precursor KEG talks related to the Rainbow project

• 6 November 2002: “Projekt Rainbow” (Svátek)
– overview of tools developed in the project to date
• 10 March 2005: “Machine Learning and the Semantic Web” (Labský)
– focus on information extraction using a statistical technique
• Current topic
– focus on integration of different techniques and tools


Related publications (see http://rainbow.vse.cz)

• Svátek et al., ISM 2002
• Labský and Svátek, DATESO 2003
• Svátek and Vacura, WWW 2003 (Poster Track)
• Svátek, Labský and Vacura, EKAW 2004
• Svátek, ten Teije and Vacura, Znalosti 2005
• Svátek and Vacura, RAWS 2005


Agenda

• Overview of web space analysis landscape (5’)
• TODD annotation framework and Rainbow collection of web space ontologies (15’)
• Rainbow collection of problem solving methods and re-description of real applications (10’)
• Parametric design as starting point for automated composition of classification services (10’)
• Simulation of service composition and execution (15’)
• Rainbow and the semantic web service cycle (5’)
• Future work (5’)


Web space analysis communities

• Classification of documents (or images)
• Information extraction from text
• Web graph analysis
• Web document or image retrieval
• Clustering of documents or other web objects
• Discovery of associations among web pages...
• Natural language understanding on the web
• ... and many others


Web space as subject of analysis

• Brings unprecedented heterogeneity to the art of data analysis
– free text, structured tables and lists, hyperlink topologies, images, meta-data, URL conventions...
– different data types and representations potentially provide complementary as well as supplementary information
• Most web analysis methods reduce this heterogeneity in one or more aspects
– this inevitably leads to information loss
• Is it possible to analyse the web ‘in virgin state’?
– only solution: combine multiple methods based on different principles


Method combination: technically as web service composition...

• Building sophisticated monolithic applications would be impractical; they would be hard to understand and difficult to maintain...
• A better solution seems to be to implement individual tools as web services and to combine them using state-of-the-art web service composition techniques


TODD framework for service annotation

• Four dimensions of web space analysis methods/tools:
– Abstract task accomplished by the tool
– Type/class and identity of web objects that appear on I/O
– Type/representation of underlying data (also called ‘web view’)
– Problem domain
• Presumably covers everything that can be said about an arbitrary method/tool
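The four dimensions can be read as a record attached to each tool. The following is a minimal Python sketch of such a record and of a lookup over it; the class name `ToddAnnotation`, the helper `matches` and the two sample tools are hypothetical illustrations, not part of Rainbow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToddAnnotation:
    """One tool described along the four TODD dimensions."""
    task: str         # abstract task, e.g. "classification"
    input_type: str   # 'object' dimension: type/class of the input object
    output_type: str  # 'object' dimension: type/class of the output object
    data: str         # 'data' dimension (web view), e.g. "html"
    domain: str       # problem domain, e.g. "general"


def matches(annotation, task, data=None, domain=None):
    """True if the annotation fits the task and, when given, the view and domain."""
    return (annotation.task == task
            and (data is None or annotation.data == data)
            and (domain is None or annotation.domain == domain))


# Hypothetical tool descriptions, for illustration only.
tools = [
    ToddAnnotation("classification", "Document", "ProductCatalogue", "html", "bicycles"),
    ToddAnnotation("retrieval", "Document", "Link", "topology", "general"),
]

html_classifiers = [t for t in tools if matches(t, "classification", data="html")]
```
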


‘Task’ dimension

• Classification
• Retrieval
• Extraction
• Clustering
• Association discovery (and other inductive tasks)


Task: Classification

• Classify: assign a web object to one of a set of predefined classes
– table look-up (lowest level of task hierarchy)
– use information from related objects: adjacent, sub- and super-objects etc. (higher levels)
• Examples:
– a page is a hub, product catalogue, form-based page…
– a link is upward, outward, dictionary...
– an HTML structure is a menu, product catalogue item...


Task: Retrieval

• Retrieve: find (locations of) objects satisfying given conditions
– syntactical retrieval: find objects of a certain type, in a certain relation to a given object (lowest level of task hierarchy)
– semantic retrieval: objects also have to belong to certain classes (higher levels)
– direct retrieval
– index-based retrieval
• Examples:
– find all pages belonging to the same website as a given start-up page
– find all outward links in a page
– find all phrases describing the company history in a page
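As an illustration of the second example, the sketch below retrieves the outward links of a page using only the Python standard library. It assumes that "outward" means "pointing to a different host than the page itself"; that reading, and the names `LinkCollector` and `outward_links`, are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outward_links(page_url, html):
    """Return absolute http(s) links whose host differs from the page's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    result = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != host:
            result.append(absolute)
    return result


sample = '<a href="/about">About</a> <a href="http://other.example/x">Partner</a>'
links = outward_links("http://shop.example/index.html", sample)
```
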


Task: Extraction

• Extract: access some (textual) content within a web object
– ‘dump’ the whole content of a web object (lowest level of task hierarchy)
– extract specific information from a larger object (higher levels), by either explicit or implicit decomposition into sub-objects
• Examples:
– extract the sentence starting at position <XPointerExp>
– extract the name of the company from the homepage
– extract the values of ‘keywords’ in META tags
– extract the codes and prices of products from a catalogue
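The META-tag example is simple enough to sketch directly. The following Python fragment uses the standard-library `html.parser`; the names `MetaKeywords` and `extract_keywords` are hypothetical, and splitting the `content` attribute on commas is an assumption about how the keywords are written.

```python
from html.parser import HTMLParser


class MetaKeywords(HTMLParser):
    """Collects the values listed in <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Assumption: keywords are comma-separated.
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


def extract_keywords(html):
    parser = MetaKeywords()
    parser.feed(html)
    return parser.keywords


page = '<html><head><meta name="Keywords" content="bicycle, frames ,wheels"></head></html>'
keywords = extract_keywords(page)
```
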


‘Object’ dimension

• Identity of objects: variables that allow available outputs to be bound to required inputs and vice versa
• Types of objects: defined in the Upper Web Ontology (UWO)
– are assumed to be known a priori for a given object
• (‘Semantic’) Classes of objects: defined in more specific ontologies subordinated to UWO


Upper Web Ontology


‘Data’ dimension

• Captures the representation of the web space that is used by the given tool as input data
– text in sentences; HTML code; URL strings; images; explicit metadata; link topology...
• Correlated but not identical with the ‘object’ dimension (the same object can be represented in different ways!)
• Also called ‘web view’; each view can define its own taxonomy of object classes
• An FCA-based method for integration of view-specific taxonomies has been developed (Labský & Svátek, 2003)


Fragments of ‘view’-specific ontologies (HTML vs. links)

[Figure: two taxonomies rooted at Document. Link-topology view: THub, TLocalHub, TRemoteHub, TLeaf. HTML view: HProductCatalogue, HImageGallery, HReferences, HAboutCompany.]

Merging class taxonomies

[Figure: the two view-specific taxonomies merged into a single taxonomy rooted at Document, with nodes ComplexCatalogue, GraphicallCatalogue, THub, TRemoteHub, TLocalHub, HReferences, TLeaf, HProductCatalogue, HImageGallery, HAboutCompany.]


‘Domain’ dimension

• Generic web space analysis tools
• Tools specialised in web pornography
• Tools specialised in company sites offering any products/services
• Tools specialised in company sites offering bicycle products


Knowledge modelling and PSMs

• Original concept of ‘knowledge-level model’ formulated by Newell (1982)
– captures the conceptual nature of a knowledge-based system independent of implementation and data representation
• From 1985 on: models of various AI reasoning tasks as well as methods to solve them
– called PSMs, for Problem Solving Methods
• Collected into libraries (e.g. in CommonKADS)
– dichotomy: system analysis (classification, diagnosis, assessment, monitoring...) vs. system synthesis (design, configuration, planning, scheduling...)


Tentative PSMs for web analysis

• Classification task
– Look-up based Classification
– Compact Classification
– Structural Classification
• Extraction task
– Overall Extraction
– Compact Extraction
– Structural Extraction
• Retrieval task
– Direct Retrieval
– Index-based Retrieval


CommonKADS inference model of Direct Retrieval

[Figure: inference diagram. Inferences: specify, classify, evaluate. Roles: Structural constraints, Object, Class, Class constraints, Result, Class definitions.]


CommonKADS inference model of Index-based Retrieval

[Figure: inference diagram. Inferences: operationalise, retrieve. Roles: Class constraints, Content constraints, Object, Index structure, Structural constraints, Class definitions.]


Describing web space analysis applications with PSMs

• Relevant for applications exploiting multiple views of the web
– two varieties of Rainbow-based architecture for acquisition of bicycle product (and associated) information
– multi-way recognition of web pornography
– three projects by other groups
• First in ad hoc, Prolog-like pseudo-code
– to demonstrate adequacy of the TODD framework and the PSMs for pre-existing applications
– not meant to be operational


Example description: name collection application by Armadillo (Ciravegna 2003)

ExtS(DC, DocCollection, _, CSDept, [names]) :-
    RetD(P1, Phrase, text, General, [P1 part-of DC, PotentPName(P1)]),
    % named entity recognition for person names
    ClaC(P1, Phrase, text, General, [PName,@other]),
    % use of public search tools over papers and homepages
    RetI(P2, Phrase, freq, Biblio, [P1 part-of P2, PaperCitation(P2)]),
    RetI(D, Document, freq, General, [P1 part-of D, D part-of DC, PHomepage(D)]),
    RetD(DF1, DocFragment, freq, General, [Heading(DF1), DF1 part-of D, P1 part-of DF1]),
    ExtO(P1, Phrase, text, General, [names]),
    % co-occurrence-based extraction
    RetD(DF2, DocFragment, html, General, [ListItem(DF2), DF2 part-of DC, P1 part-of DF2]),
    RetD(DF3, DocFragment, html, General, [ListItem(DF3), (DF3 below DF2; DF2 below DF3)]),
    ExtS(DF3, DocFragment, text, General, [names]),
    RetD(DF4, DocFragment, html, General, [TableField(DF4), DF4 part-of DC, P1 part-of DF4]),
    RetD(Q, DocFragment, html, General, [TableField(DF5), (DF5 below DF4; DF4 below DF5)]),
    % extraction from links
    RetD(DF5, DocFragment, html, General, [IntraSiteLinkElement(DF5), DF5 part-of DC]),
    ...

% extraction of potential person names from document fragments
ExtS(DF, DocFragment, text, General, [names]) :-
    RetD(P, Phrase, text, General, [DF contains P, PotentialPersonName(P)]),
    ExtO(P, Phrase, text, General, [names]).


Towards generation of service compositions

• Next step: attempt to generate the control code for a composed service automatically, and then execute it over a collection of available tools!
• Would in fact be web service composition if used in a real environment and with real services...
• So far only simulated experiments; application: pornography recognition (cf. PhD thesis by Vacura)


Web service composition

• Most popular approaches to WS composition (also called configuration, choreography etc.)
– manual composition in workflow-inspired languages such as BPEL4WS: actually “programming in the large”, popular with industries
– fully automated (‘semantic’) composition based on pre-/post-condition reasoning, e.g. OWL-S: actually AI planning, popular with academics
• Here: middle-way approach based on automated filling (and possibly folding/unfolding) of templates
• Also sort of ‘semantic’ but less ambitious


Parametric design (PD)

• A reasoning task well-examined by the knowledge modelling community
• Setting values to a set of parameters in a template, while considering constraints and preferences
• Classical problem solving method (PSM) for PD: Propose - Critique - Modify (PCM)
– Propose an initial configuration
– Verify the required properties (if satisfied then Stop)
– (Critique:) Analyse reasons for failure in the Verify step
– (Modify:) Change the values, then return to the Verify step
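The PCM cycle above can be sketched as a short control loop. This is a minimal Python illustration, not the broker used in the experiments: the function name `propose_critique_modify`, the `max_rounds` cut-off and the toy reviewer-capacity problem are all assumptions made for the example.

```python
def propose_critique_modify(propose, verify, critique, modify, max_rounds=10):
    """Generic Propose - Critique - Modify loop over a configuration."""
    config = propose()                    # Propose an initial configuration
    for _ in range(max_rounds):
        if verify(config):                # Verify: stop when properties hold
            return config
        reasons = critique(config)        # Critique: analyse the failure
        config = modify(config, reasons)  # Modify: change parameter values
    return None                           # no admissible configuration found


# Toy problem (hypothetical): smallest number of reviewers that can cover
# 10 papers when each reviewer takes at most 3.
PAPERS, PER_REVIEWER = 10, 3

result = propose_critique_modify(
    propose=lambda: {"reviewers": 1},
    verify=lambda c: c["reviewers"] * PER_REVIEWER >= PAPERS,
    critique=lambda c: {"uncovered": PAPERS - c["reviewers"] * PER_REVIEWER},
    modify=lambda c, reasons: {"reviewers": c["reviewers"] + 1},
)
```

In this toy run the critique result is computed but the modify step ignores it; a real broker would use the failure analysis to decide which parameter to change.
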


PD as model of WS composition

• If a suitable template can be identified, WS composition will merely amount to filling in (concrete or abstract) services as “values” for “parameters”, in the PCM cycle
• The filling can be carried out based on either:
– generic pre-/post-condition reasoning over complete functional descriptions of individual services
– a dedicated broker equipped with a method-specific knowledge base (“PCM knowledge”)
• The latter option was examined by ten Teije (2004): experiments with using the classification PSM as the template to be filled in


Filling in a classification template

• A large proportion of composite web services to date indeed have the nature of classification (e.g. credit assignment applications)
• Although multiple PSMs for classification have been identified in the literature, they can be combined into a single template (Motta & Lu 2000)
• Testing domain: assignment of reviewers to conference papers, according to paper topics
– PCM cycle carried out in several iterations
– changes in criteria fulfilment recorded during the updates of the template (re-setting of parameter values)


Classification template

[Figure: classification template. Data: Observations, Knowledge, LegalObservations, ScoredObservations, AggregatedScores, CandidateSolutions, Solutions. Steps (parameters): Check, MicroMatch, Aggregate, Admissibility, Selection.]


Examples of “broker” knowledge

• Propose knowledge for the Admissibility parameter: if many {feature,value} pairs are irrelevant then do not use strong-coverage
• Modify knowledge for the Admissibility parameter: if the solution set has to be increased (reduced) in size, then the value for the Admissibility parameter has to be moved down (up) in the following partial ordering:

weak-coverage < strong-coverage < strong-explanative
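The Modify rule can be sketched as a move along the ordering. The Python fragment below assumes the three values are listed from lowest to highest and treats them as a simple chain; the function name `modify_admissibility` and the `"increase"`/`"reduce"` arguments are hypothetical.

```python
# Assumption: values listed from lowest to highest in the ordering.
ADMISSIBILITY_ORDER = ["weak-coverage", "strong-coverage", "strong-explanative"]


def modify_admissibility(current, solution_set_must):
    """Move the Admissibility value down to enlarge the solution set,
    up to shrink it; stay put at either end of the ordering."""
    index = ADMISSIBILITY_ORDER.index(current)
    if solution_set_must == "increase":
        index = max(index - 1, 0)
    elif solution_set_must == "reduce":
        index = min(index + 1, len(ADMISSIBILITY_ORDER) - 1)
    else:
        raise ValueError("solution_set_must is 'increase' or 'reduce'")
    return ADMISSIBILITY_ORDER[index]


relaxed = modify_admissibility("strong-coverage", "increase")
tightened = modify_admissibility("strong-coverage", "reduce")
```
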


Web analysis services’ specifics

• Large number of diverse tools potentially available, so manual programming would be cumbersome
• Can be tested in a real environment with low or zero cost (unlike e.g. business IS or medical applications), so experiments with automated composition might be relatively ambitious
• Abstract descriptions of analysis services can (in addition to task characteristics) explicitly include characteristics related to analysed data (e.g. those enforced by mark-up languages such as HTML)


Service combination in Rainbow

• Currently, the only option used for building more complex applications is conventional programming
– control routines for the bicycle application, by O. Šváb (in Java), which call individual web services and integrate the results
• A more flexible solution is needed; ideally, it should be capable of including an unforeseen component (with appropriate semantic description). Use of PSMs and ontologies?


Can the parametric design scenario be applied here?

• Web analysis PSMs can be viewed as templates
• Filling unforeseen services into slots rather than just choosing among the known values of parameters!
• To capture the connectivity and multiple views over the web space, the object-feature-value view (from traditional knowledge engineering) does not suffice: an object-relation-object view is more appropriate!
• Structural classification/extraction is thus recursive, i.e. slots could be replaced with further templates!


Possible adaptations of PD

• Simple: pre-set template versions for degrees of recursion and combinations with non-recursive versions

• Advanced: in addition to setting values for attributes, the Propose and Modify steps could also fold/unfold slots to/from templates


Some tentative broker knowledge

• Templates with a lower number of distinct objects and non-recursive templates should be preferred
• Look-up classification should be preferred to compact classification
• Default partial ordering of data types with respect to object classification, for Document object: frequency > URL > topology free text > metadata
• URL-based or topology-based classification should never be used alone
• Default partial ordering of types of relations (@rel): part-of > is-part > adjacent
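Two of these rules are concrete enough to sketch in Python. The fragment below ranks relation types by the stated default ordering and rejects configurations that rely on a URL-based or topology-based classifier alone. It assumes "alone" means "as the only data type used in the configuration"; that reading and the function names are assumptions.

```python
# Default ordering of relation types, most preferred first (from the slide).
RELATION_PREFERENCE = ["part-of", "is-part", "adjacent"]

# Data types that should never be the sole basis of a classification.
NEVER_ALONE = {"URL", "topology"}


def rank_relations(relations):
    """Sort relation types so that the most preferred comes first."""
    return sorted(relations, key=RELATION_PREFERENCE.index)


def data_types_admissible(data_types):
    """Reject an empty configuration and one whose single data type
    is URL or topology (assumed reading of 'never used alone')."""
    used = set(data_types)
    if not used:
        return False
    return not (len(used) == 1 and used <= NEVER_ALONE)


ranked = rank_relations(["adjacent", "part-of"])
```
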


Limits of the current Prolog simulation

• Data: Prolog facts with high level of abstraction, instead of real data
• Only the classification task covered; only binary classification (with certainty factor)
• Only six service ‘mock-ups’ implemented so far
– though writing a new one is a matter of 20-30 minutes
• Only the initial ‘Propose’ step implemented
• Multiple fixed templates rather than un/folding
• Broker knowledge not yet implemented
– just blind search with checking ‘service signatures’


Example of ‘data’

page(p36).                  % image page with 1 picture
url_of(u36,p36).
url_terms(u36,[teen,sex]).  % terms in URL
part(p36,s3).
linkto(p31,p36).
textprop(p36,0.0).          % proportion of text on page
part(f361,p36).
html_frag(f361).            % fragment of HTML code
part(i3611,f361).
image(i3611).
body_color(i3611,0.4).      % proportion of body color


Classification template example

templ(sc1,
      s(cla,0,0,Tp1,Tp2),                          % template header
      [s(cla,0,0,Tp3,Tp4)],                        % template body with one slot
      [subclasseq(Tp3,Tp1),subclasseq(Tp4,Tp2)]).  % template constraints

• cla: task type is classification
• 0,0: input object same as output object
• subclasseq(Tp3,Tp1): type of input object of ‘lower-level’ service at most as general as type of input object of ‘higher-level’ service
• Simplest template, with one slot only
• More complex ones have to deal e.g. with aggregation or transformation of certainty factors
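Filling the single slot of this template by blind search over service signatures can be sketched as follows. The Python fragment mirrors the two `subclasseq` constraints; the toy class hierarchy `SUBCLASS_OF`, the service table and the function `fill_sc1` are hypothetical stand-ins for the Prolog simulation, and only the service names and type names are taken from the slides.

```python
# Hypothetical toy ontology: child class -> parent class.
SUBCLASS_OF = {
    "pornoContentPage": "document",
    "document": "web_object",
}

# Service signatures: identifier -> (task, input type, output type).
SERVICES = {
    "cla_por_html": ("cla", "document", "pornoContentPage"),
    "cla_por_url": ("cla", "document", "pornoContentPage"),
    "ret_localhub": ("ret", "doc_coll", "localhub"),
}


def subclasseq(sub, sup):
    """True if `sub` equals `sup` or is one of its (transitive) subclasses."""
    while sub is not None:
        if sub == sup:
            return True
        sub = SUBCLASS_OF.get(sub)
    return False


def fill_sc1(task, in_type, out_type):
    """Services that fit the single slot of sc1 for the requested header:
    same task, and slot types at most as general as the header types."""
    return sorted(
        sid for sid, (t, tp3, tp4) in SERVICES.items()
        if t == task and subclasseq(tp3, in_type) and subclasseq(tp4, out_type)
    )


candidates = fill_sc1("cla", "document", "pornoContentPage")
```
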


Classification template example

templ(sc5,
      s(cla,0,0,Tp1,Tp2),
      [s(cla,0,0,Tp3,Tp4),
       s(ret,0,1,Tp5,Tp6),
       s(cla,1,1,Tp7,Tp8),
       s(tsf,ref(3,1),0,Tp8,Tp4),
       s(agr,[ref(1,0),ref(4,0)],0,Tp4,Tp4)],
      [subclasseq(Tp3,Tp1), subclasseq(Tp5,Tp1), rel(part,Tp6,Tp5),
       subclasseq(Tp6,Tp7), subclasseq(Tp4,Tp2)]).


Service description example

meta(
    cla_por_html,                       % service identifier
    s(cla,document,pornoContentPage),   % task, input object type, output object type (class)
    url,                                % data type/representation
    pornography,                        % problem domain
    4).                                 % time cost


Sample simulation run

?- propose(cla, doc_coll, porno_coll).

Number of solutions: 2

Template: sc3
Configuration:
s(ret, 0, 1, doc_coll, localhub, ret_localhub)
s(cla, 1, 1, document, pornoContentPage, cla_por_html)
s(tsf, ref(2, 1), 0, pornoContentPage, porno_coll, tsf_porno1)
Time cost: 15

Template: sc3
Configuration:
s(ret, 0, 1, doc_coll, localhub, ret_localhub)
s(cla, 1, 1, document, pornoContentPage, cla_por_url)
s(tsf, ref(2, 1), 0, pornoContentPage, porno_coll, tsf_porno1)
Time cost: 13
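The run lists both solutions with their time cost but does not choose between them. A broker could, for instance, pick the cheapest one; the Python sketch below does that over the two configurations shown above. Selecting by minimum time cost is an assumption, not a described feature of the simulation.

```python
# The two configurations reported by the sample run.
configurations = [
    {"template": "sc3",
     "services": ["ret_localhub", "cla_por_html", "tsf_porno1"],
     "time_cost": 15},
    {"template": "sc3",
     "services": ["ret_localhub", "cla_por_url", "tsf_porno1"],
     "time_cost": 13},
]

# Assumed selection policy: prefer the configuration with the lowest time cost.
cheapest = min(configurations, key=lambda c: c["time_cost"])
```
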


Coverage of semantic web service cycle

• Service annotation with semantic description
– Here: TODD framework and ontologies
• Service discovery in open and heterogeneous space
– Here: not addressed (we rely on a single annotation model and centralised ontology), hence this is not an ‘upper semantic web’ application!
• Service composition (‘choreography’)
– Here: main focus; template-based (PSM) approach
• Composed service execution (‘orchestration’)
– Here: extremely simplified


Ongoing and future work

• Implementation of broker knowledge base
• Further elaboration of prototype broker beyond initial template filling: ‘Critique’ and ‘Modify’ phases?
– Capture the possible structure of templates (initial proposal as well as modification) with a grammar?
– Iterative template refinement with verification on data
• Enrichment of the collection of analysis components (by Rainbow team as well as third parties)
• Implementation of full-fledged broker


Thanks for your attention

Questions?
