Framework for annotation and composition of web space analysis tools
Vojtěch Svátek
Department of Information and Knowledge Engineering
University of Economics, Prague, Czech Republic
V. Svátek, KEG 29.9.2005
Precursor KEG talks related to the Rainbow project
• 6 November 2002: “Projekt Rainbow” (Svátek)
– overview of tools developed in the project to date
• 10 March 2005: “Machine Learning and the Semantic Web” (Labský)
– focus on information extraction using a statistical technique
• Current topic
– focus on integration of different techniques and tools
Related publications (see http://rainbow.vse.cz)
• Svátek et al., ISM 2002
• Labský and Svátek, DATESO 2003
• Svátek and Vacura, WWW 2003 (Poster Track)
• Svátek, Labský and Vacura, EKAW 2004
• Svátek, ten Teije and Vacura, Znalosti 2005
• Svátek and Vacura, RAWS 2005
Agenda
• Overview of web space analysis landscape (5’)
• TODD annotation framework and Rainbow collection of web space ontologies (15’)
• Rainbow collection of problem solving methods and re-description of real applications (10’)
• Parametric design as starting point for automated composition of classification services (10’)
• Simulation of service composition and execution (15’)
• Rainbow and the semantic web service cycle (5’)
• Future work (5’)
Web space analysis communities
• Classification of documents (or images)
• Information extraction from text
• Web graph analysis
• Web document or image retrieval
• Clustering of documents or other web objects
• Discovery of associations among web pages...
• Natural language understanding on the web
• ... and many others
Web space as subject of analysis
• Brings unprecedented heterogeneity to the art of data analysis
– free text, structured tables and lists, hyperlink topologies, images, meta-data, URL conventions...
– different data types and representations potentially provide complementary as well as supplementary information
• Most web analysis methods reduce this heterogeneity in one or more aspects
– this inevitably leads to information loss
• Is it possible to analyse the web ‘in virgin state’?
– only solution: combining multiple methods based on different principles
Method combination: technically as web service composition...
• Building sophisticated monolithic applications would be impractical; they would be hard to understand and difficult to maintain...
• A better solution seems to be to implement individual tools as web services and to combine them using state-of-the-art web service composition techniques
TODD framework for service annotation
• Four dimensions (Task, Object, Data, Domain) of web space analysis methods/tools:
– Abstract task accomplished by the tool
– Type/class and identity of web objects that appear on I/O
– Type/representation of underlying data (also called ‘web view’)
– Problem domain
• Presumably covers everything that can be said about an arbitrary method/tool
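A minimal sketch of how one tool's annotation along the four dimensions might be held as a record. It is written in Python purely for illustration; the class name `ToddAnnotation`, its field names and the sample values are assumptions made here, not part of the Rainbow implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToddAnnotation:
    """One tool described along the four TODD dimensions (hypothetical record)."""
    task: str         # abstract task, e.g. "classification"
    input_type: str   # type/class of the web object on input
    output_type: str  # type/class of the web object on output
    data: str         # data representation ('web view'), e.g. "html"
    domain: str       # problem domain, e.g. "bicycles"


def same_task_and_domain(a: ToddAnnotation, b: ToddAnnotation) -> bool:
    """Two tools are rough alternatives if they solve the same task in the same domain."""
    return a.task == b.task and a.domain == b.domain


# Illustrative values only: two classifiers that differ in the data they read.
html_tool = ToddAnnotation("classification", "Document", "ProductCatalogue", "html", "bicycles")
url_tool = ToddAnnotation("classification", "Document", "ProductCatalogue", "url", "bicycles")
```

Keeping the data dimension separate from the object dimension lets two tools over the same object type be told apart by the web view they rely on.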
‘Task’ dimension
• Classification
• Retrieval
• Extraction
• Clustering
• Association discovery (and other inductive tasks)
Task: Classification
• Classify: assign a web object to one of the predefined classes
– table look-up (lowest level of task hierarchy)
– use information from related objects: adjacent, sub- and super-objects etc. (higher levels)
• Examples:
– a page is a hub, product catalogue, form-based page…
– a link is upward, outward, dictionary...
– an HTML structure is a menu, product catalogue item...
Task: Retrieval
• Retrieve: find (locations of) objects satisfying given conditions
– syntactical retrieval: find objects of a certain type, in a certain relation to a given object (lowest level of task hierarchy)
– semantic retrieval: objects also have to belong to certain classes (higher levels)
– direct retrieval
– index-based retrieval
• Examples:
– find all pages belonging to the same website as a given start-up page
– find all outward links in a page
– find all phrases describing the company history, in a page
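The example "find all outward links in a page" can be sketched as direct retrieval (no index) in Python. This is only an illustration: the function name `outward_links` is hypothetical, and "outward" is assumed here to mean an http(s) link whose host differs from the host of the page itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href values of all <a> elements in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outward_links(page_url, html):
    """Return absolute URLs of links that leave the host of page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    result = []
    for href in collector.hrefs:
        absolute = urlparse(urljoin(page_url, href))
        # Assumption: only http(s) links to another host count as outward.
        if absolute.scheme in ("http", "https") and absolute.netloc != own_host:
            result.append(absolute.geturl())
    return result
```

The structural condition (an `a` element that is part of the page) corresponds to syntactical retrieval; the host test plays the role of the class condition added by semantic retrieval.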
Task: Extraction
• Extract: access some (textual) content within a web object
– ‘dump’ the whole content of a web object (lowest level of task hierarchy)
– extract specific information from a larger object (higher levels), by either explicit or implicit decomposition into sub-objects
• Examples:
– extract the sentence starting at position <XPointerExp>
– extract the name of the company from the homepage
– extract the values of ‘keywords’ in META tags
– extract the codes and prices of products from a catalogue
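As an illustration of the third example, the values of ‘keywords’ in META tags can be extracted with Python's standard HTML parser. The function name `extract_meta_keywords` and the comma-splitting convention are assumptions of this sketch, not a description of any Rainbow tool.

```python
from html.parser import HTMLParser


class _KeywordsExtractor(HTMLParser):
    """Gathers comma-separated values from <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Split on commas and drop empty entries.
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())


def extract_meta_keywords(html):
    """Return the list of keywords declared in the page's META tags."""
    extractor = _KeywordsExtractor()
    extractor.feed(html)
    return extractor.keywords
```

This is extraction of specific information from a larger object: the META element is found by implicit decomposition of the page, and only its content is returned.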
‘Object’ dimension
• Identity of objects: variables that allow available outputs to be bound to required inputs and vice versa
• Types of objects: defined in the Upper Web Ontology
– are assumed to be known a priori for a given object
• (‘Semantic’) Classes of objects: defined in more specific ontologies subordinated to UWO
Upper Web Ontology
‘Data’ dimension
• Captures the representation of the web space that is used by the given tool as input data
– text in sentences; HTML code; URL strings; images; explicit metadata; link topology...
• Correlated but not identical with the ‘object’ dimension (the same object can be represented in different ways!)
• Also called ‘web view’; each view can define its own taxonomy of object classes
• An FCA-based method for integrating view-specific taxonomies has been developed (Labský & Svátek, 2003)
Fragments of ‘view’-specific ontologies (HTML vs. links)
[Figure: two class taxonomies rooted at Document. Link-topology view: THub (with TLocalHub and TRemoteHub) and TLeaf. HTML view: HProductCatalogue, HImageGallery, HReferences and HAboutCompany.]
Merging class taxonomies
[Figure: the link-topology taxonomy (Document, THub, TLocalHub, TRemoteHub, TLeaf) and the HTML taxonomy (Document, HProductCatalogue, HImageGallery, HReferences, HAboutCompany) merged into a single hierarchy under Document, which adds the classes ComplexCatalogue and GraphicalCatalogue to the original ones.]
‘Domain’ dimension
• Generic web space analysis tools
• Tools specialised in web pornography
• Tools specialised in company sites offering any products/services
• Tools specialised in company sites offering bicycle products
Knowledge modelling and PSMs
• Original concept of ‘knowledge-level model’ formulated by Newell (1982)
– capture the conceptual nature of a knowledge-based system independent of implementation and data representation
• From 1985 on: models of various AI reasoning tasks as well as methods to solve them
– called PSMs, for Problem Solving Methods
• Collected into libraries (e.g. in CommonKADS)
– dichotomy: system analysis (classification, diagnosis, assessment, monitoring...) vs. system synthesis (design, configuration, planning, scheduling...)
Tentative PSMs for web analysis
• Classification task
– Look-up based Classification
– Compact Classification
– Structural Classification
• Extraction task
– Overall Extraction
– Compact Extraction
– Structural Extraction
• Retrieval task
– Direct Retrieval
– Index-based Retrieval
CommonKADS inference model of Direct Retrieval
[Figure: inference diagram with the inferences specify, classify and evaluate, and the knowledge roles Structural constraints, Object, Class, Class constraints, Class definitions and Result.]
CommonKADS inference model of Index-based Retrieval
[Figure: inference diagram with the inferences operationalise and retrieve, and the knowledge roles Class constraints, Content constraints, Structural constraints, Class definitions, Index structure and Object.]
Describing web space analysis applications with PSMs
• Relevant for those exploiting multiple views of the web
– two varieties of Rainbow-based architecture for acquisition of bicycle product (and associated) information
– multi-way recognition of web pornography
– three projects by other groups
• First in ad hoc, Prolog-like pseudo-code
– to demonstrate adequacy of the TODD framework and the PSMs for pre-existing applications
– not meant to be operational
Example description: name collection application by Armadillo (Ciravegna 2003)
ExtS(DC, DocCollection, _, CSDept, [names]) :-
    RetD(P1, Phrase, text, General, [P1 part-of DC, PotentPName(P1)]),
    % named entity recognition for person names
    ClaC(P1, Phrase, text, General, [PName,@other]),
    % use of public search tools over papers and homepages
    RetI(P2, Phrase, freq, Biblio, [P1 part-of P2, PaperCitation(P2)]),
    RetI(D, Document, freq, General, [P1 part-of D, D part-of DC, PHomepage(D)]),
    RetD(DF1, DocFragment, freq, General, [Heading(DF1), DF1 part-of D, P1 part-of DF1]),
    ExtO(P1, Phrase, text, General, [names]),
    % co-occurrence-based extraction
    RetD(DF2, DocFragment, html, General, [ListItem(DF2), DF2 part-of DC, P1 part-of DF2]),
    RetD(DF3, DocFragment, html, General, [ListItem(DF3), (DF3 below DF2; DF2 below DF3)]),
    ExtS(DF3, DocFragment, text, General, [names]),
    RetD(DF4, DocFragment, html, General, [TableField(DF4), DF4 part-of DC, P1 part-of DF4]),
    RetD(DF5, DocFragment, html, General, [TableField(DF5), (DF5 below DF4; DF4 below DF5)]),
    ExtS(DF5, DocFragment, text, General, [names]),
    % extraction from links
    RetD(DF5, DocFragment, html, General, [IntraSiteLinkElement(DF5), DF5 part-of DC]),
    ExtS(DF5, DocFragment, text, General, [names]),
    ...
% extraction of potential person names from document fragments
ExtS(DF, DocFragment, text, General, [names]) :-
    RetD(P, Phrase, text, General, [DF contains P, PotentialPersonName(P)]),
    ExtO(P, Phrase, text, General, [names]).
Towards generation of service compositions
• Next step: attempt to generate the control code for the composed service automatically, and then execute it over a collection of available tools!
• Would in fact be web service composition if used in a real environment and with real services...
• So far only simulated experiments; application: pornography recognition (cf. PhD thesis by Vacura)
Web service composition
• Most popular approaches to WS composition (also called configuration, choreography etc.)
– manual composition in workflow-inspired languages such as BPEL4WS: actually “programming in the large”, popular with industry
– fully automated (‘semantic’) composition based on pre-/post-condition reasoning, e.g. OWL-S: actually AI planning, popular with academics
• Here: middle-way approach based on automated filling (and possibly folding/unfolding) of templates
• Also sort of ‘semantic’ but less ambitious
Parametric design (PD)
• A reasoning task well-examined by the knowledge modelling community
• Setting values to a set of parameters in a template, while considering constraints and preferences
• Classical problem solving method (PSM) for PD: Propose - Critique - Modify (PCM)
– Propose an initial configuration
– Verify the required properties (if satisfied then Stop)
– (Critique:) Analyse reasons for failure in the Verify step
– (Modify:) Change the values, then return to the Verify step
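The four PCM steps above can be sketched as a generic loop. This is an illustrative Python sketch, not the broker discussed in the talk: the function names, the `max_rounds` limit and the toy threshold problem (integer scores, a single `threshold` parameter) are all assumptions made for the example.

```python
def propose_critique_modify(propose, verify, critique, modify, max_rounds=10):
    """Generic Propose-Critique-Modify loop; returns a verified configuration or None."""
    config = propose()
    for _ in range(max_rounds):
        violations = verify(config)
        if not violations:          # required properties satisfied: stop
            return config
        diagnosis = critique(config, violations)
        config = modify(config, diagnosis)
    return None                     # gave up without a verified configuration


# Toy problem: choose a threshold so that at least REQUIRED scores are accepted.
SCORES = [95, 80, 70, 60]
REQUIRED = 3


def propose():
    return {"threshold": 90}


def verify(config):
    accepted = sum(1 for score in SCORES if score >= config["threshold"])
    return [] if accepted >= REQUIRED else ["too_few_accepted"]


def critique(config, violations):
    return "threshold_too_high" if "too_few_accepted" in violations else "unknown"


def modify(config, diagnosis):
    if diagnosis == "threshold_too_high":
        return {"threshold": config["threshold"] - 10}
    return config


result = propose_critique_modify(propose, verify, critique, modify)
```

Passing the four steps in as functions keeps the loop independent of any particular template, which is where method-specific knowledge would plug in.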
PD as model of WS composition
• If a suitable template can be identified, WS composition will merely amount to filling in (concrete or abstract) services as “values” for “parameters”, in the PCM cycle
• The filling can be carried out based on either:
– generic pre-/post-condition reasoning over complete functional descriptions of individual services
– a dedicated broker equipped with a method-specific knowledge base (“PCM knowledge”)
• The latter option was examined by ten Teije (2004): experiments with using the classification PSM as the template to be filled in
Filling in a classification template
• A large proportion of composite web services to date indeed have the nature of classification (e.g. credit assignment applications)
• Although multiple PSMs for classification have been identified in the literature, they can be combined into a single template (Motta & Lu 2000)
• Testing domain: assignment of reviewers to conference papers, according to paper topics
– PCM cycle carried out in several iterations
– changes in criteria fulfilment recorded during the updates of the template (re-setting of parameter values)
Classification template
[Figure: classification template with the data roles Observations, Knowledge, LegalObservations, ScoredObservations, AggregatedScores, CandidateSolutions and Solutions, and the steps Check, MicroMatch, Aggregate, Admissibility and Selection.]
Examples of “broker” knowledge
• Propose knowledge for the Admissibility parameter: if many {feature,value} pairs are irrelevant then do not use strong-coverage
• Modify knowledge for the Admissibility parameter: if the solution set has to be increased (reduced) in size, then the value for the Admissibility parameter has to be moved down (up) in the following partial ordering:
weak-coverage < strong-coverage < strong-explanative
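The Modify rule for the Admissibility parameter can be written down as a small function. This Python sketch assumes that the three values are listed from lowest to highest (the separators did not survive in the slide text); the function name and the `"increase"`/`"reduce"` flags are hypothetical.

```python
# Assumed order, lowest (most permissive) first.
ADMISSIBILITY_ORDER = ["weak-coverage", "strong-coverage", "strong-explanative"]


def modify_admissibility(current, solution_set_must):
    """Move the Admissibility value down to enlarge the solution set, up to shrink it."""
    index = ADMISSIBILITY_ORDER.index(current)
    if solution_set_must == "increase":
        index = max(index - 1, 0)                              # move down
    elif solution_set_must == "reduce":
        index = min(index + 1, len(ADMISSIBILITY_ORDER) - 1)   # move up
    else:
        raise ValueError("solution_set_must is 'increase' or 'reduce'")
    return ADMISSIBILITY_ORDER[index]
```

At either end of the ordering the value stays where it is, so a broker would need another parameter to modify once Admissibility is exhausted.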
Web analysis services’ specifics
• Large number of diverse tools potentially available → manual programming would be cumbersome
• Can be tested in a real environment with low or zero cost (unlike e.g. business IS or medical applications) → experiments with automated composition might be relatively ambitious
• Abstract descriptions of analysis services can (in addition to task characteristics) explicitly include characteristics related to analysed data (e.g. those enforced by mark-up languages such as HTML)
Service combination in Rainbow
• Currently, the only option used for building more complex applications is conventional programming
– control routines for the bicycle application, by O. Šváb (in Java), which call individual web services and integrate the results
• Some more flexible solution is needed; ideally, it should be capable of including an unforeseen component (with appropriate semantic description) → use of PSMs and ontologies?
Can the parametric design scenario be applied here?
• Web analysis PSMs can be viewed as templates
• Filling unforeseen services into slots rather than just choosing among the known values of parameters!
• To capture the connectivity and multiple views over the web space, the object-feature-value view (from traditional knowledge engineering) does not suffice: an object-relation-object view is more appropriate!
• Structural classification/extraction is thus recursive, i.e. slots could be replaced with further templates!
Possible adaptations of PD
• Simple: pre-set template versions for degrees of recursion and combinations with non-recursive versions
• Advanced: in addition to setting values for attributes, the Propose and Modify steps could also fold/unfold slots to/from templates
Some tentative broker knowledge
• Templates with a lower number of distinct objects and non-recursive templates should be preferred
• Look-up classification should be preferred to compact classification
• Default partial ordering of data types with respect to object classification, for the Document object: frequency > URL > topology, free text > metadata
• URL-based or topology-based classification should never be used alone
• Default partial ordering of types of relations (@rel): part-of > is-part > adjacent
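Some of these preferences can be sketched as a filter plus a sort key. The Python below is only an illustration: the `Candidate` record, the lexicographic priority among the preference rules, and the reading of "used alone" as "the only data type in the configuration" are assumptions, since the slides do not say how the rules are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A filled template considered by the broker (hypothetical record)."""
    name: str
    distinct_objects: int
    recursive: bool
    classification_psm: str   # "look-up" or "compact"
    data_types: tuple         # data representations used by the chosen services


def is_allowed(candidate):
    """URL-based or topology-based classification should never be used alone."""
    return candidate.data_types not in (("url",), ("topology",))


def preference_key(candidate):
    """Smaller key = more preferred (assumed lexicographic priority)."""
    return (
        candidate.recursive,                        # non-recursive first
        candidate.distinct_objects,                 # fewer distinct objects first
        candidate.classification_psm != "look-up",  # look-up before compact
    )


def rank(candidates):
    """Drop disallowed candidates and order the rest by preference."""
    return sorted((c for c in candidates if is_allowed(c)), key=preference_key)
```

The two partial orderings on data types and relation types are left out on purpose, because a partial order cannot be turned into a sort key without deciding how to treat incomparable values.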
Limits of the current Prolog simulation
• Data: Prolog facts with high level of abstraction, instead of real data
• Only the classification task covered; only binary classification (with certainty factor)
• Only six service ‘mock-ups’ implemented so far
– though writing a new one is a matter of 20-30 minutes
• Only the initial ‘Propose’ step implemented
• Multiple fixed templates rather than un/folding
• Broker knowledge not yet implemented
– just blind search with checking ‘service signatures’
Example of ‘data’
page(p36).                  % image page with 1 picture
url_of(u36,p36).
url_terms(u36,[teen,sex]).  % terms in URL
part(p36,s3).
linkto(p31,p36).
textprop(p36,0.0).          % proportion of text on page
part(f361,p36).
html_frag(f361).            % fragment of HTML code
part(i3611,f361).
image(i3611).
body_color(i3611,0.4).      % proportion of body color
Classification template example
templ(sc1,s(cla,0,0,Tp1,Tp2),                      % template header
      [s(cla,0,0,Tp3,Tp4)],                        % template body with one slot
      [subclasseq(Tp3,Tp1),subclasseq(Tp4,Tp2)]).  % template constraints
• Simplest, with one slot only
• More complex ones have to deal e.g. with aggregation or transformation of certainty factors
Callouts on the code:
– cla: task type: classification
– 0,0: input object same as output object
– subclasseq(Tp3,Tp1): type of input object of ‘lower-level’ service at most as general as type of input object of ‘higher-level’ service
Classification template example
templ(sc5,s(cla,0,0,Tp1,Tp2),
      [s(cla,0,0,Tp3,Tp4),
       s(ret,0,1,Tp5,Tp6),
       s(cla,1,1,Tp7,Tp8),
       s(tsf,ref(3,1),0,Tp8,Tp4),
       s(agr,[ref(1,0),ref(4,0)],0,Tp4,Tp4)],
      [subclasseq(Tp3,Tp1), subclasseq(Tp5,Tp1), rel(part,Tp6,Tp5), subclasseq(Tp6,Tp7), subclasseq(Tp4,Tp2)]).
Service description example
meta(
    cla_por_html,                      % service identifier
    s(cla,document,pornoContentPage),  % task, input object type, output object type (class)
    url,                               % data type/representation
    pornography,                       % problem domain
    4).                                % time cost
Sample simulation run
?- propose(cla, doc_coll, porno_coll).
Number of solutions: 2
Template: sc3
Configuration:
s(ret, 0, 1, doc_coll, localhub, ret_localhub)
s(cla, 1, 1, document, pornoContentPage, cla_por_html)
s(tsf, ref(2, 1), 0, pornoContentPage, porno_coll, tsf_porno1)
Time cost: 15
Template: sc3
Configuration:
s(ret, 0, 1, doc_coll, localhub, ret_localhub)
s(cla, 1, 1, document, pornoContentPage, cla_por_url)
s(tsf, ref(2, 1), 0, pornoContentPage, porno_coll, tsf_porno1)
Time cost: 13
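The blind search behind such a run can be sketched in Python as follows. This is not the Prolog simulation itself: the template is simplified to a linear chain of tasks, the one-entry taxonomy is hypothetical, and the individual time costs are invented so that the totals match the run above (only the cost 4 of `cla_por_html` appears in the service description slide).

```python
from itertools import product

# Hypothetical fragment of the type taxonomy: child -> parent.
PARENT = {"localhub": "document"}


def subclasseq(sub, sup):
    """True if sub equals sup or is a descendant of it in PARENT."""
    while sub is not None:
        if sub == sup:
            return True
        sub = PARENT.get(sub)
    return False


# (service id, task, input type, output type, time cost); costs are assumptions.
SERVICES = [
    ("ret_localhub", "ret", "doc_coll", "localhub", 5),
    ("cla_por_html", "cla", "document", "pornoContentPage", 4),
    ("cla_por_url", "cla", "document", "pornoContentPage", 2),
    ("tsf_porno1", "tsf", "pornoContentPage", "porno_coll", 6),
]

# Simplified stand-in for a three-slot template: retrieve, classify, transform.
TEMPLATE = ["ret", "cla", "tsf"]


def chain_is_valid(combo, input_type, output_type):
    """Check service signatures: each input must accept the previous output."""
    current = input_type
    for _, _, in_type, out_type, _ in combo:
        if not subclasseq(current, in_type):
            return False
        current = out_type
    return subclasseq(current, output_type)


def propose(template, input_type, output_type):
    """Try every combination of services per slot and keep the valid chains."""
    per_slot = [[s for s in SERVICES if s[1] == task] for task in template]
    solutions = []
    for combo in product(*per_slot):
        if chain_is_valid(combo, input_type, output_type):
            solutions.append(([s[0] for s in combo], sum(s[4] for s in combo)))
    return solutions


solutions = propose(TEMPLATE, "doc_coll", "porno_coll")
```

With broker knowledge in place, the exhaustive `product` loop would be replaced by guided choices, and the time cost could serve as one of the preferences.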
Coverage of semantic web service cycle
• Service annotation with semantic description
– Here: TODD framework and ontologies
• Service discovery in open and heterogeneous space
– Here: not addressed (we rely on a single annotation model and centralised ontology), hence this is not an ‘upper semantic web’ application!
• Service composition (‘choreography’)
– Here: main focus; template-based (PSM) approach
• Composed service execution (‘orchestration’)
– Here: extremely simplified
Ongoing and future work
• Implementation of broker knowledge base
• Further elaboration of prototype broker beyond initial template filling: ‘Critique’ and ‘Modify’ phases?
– Capture the possible structure of templates (initial proposal as well as modification) with a grammar?
– Iterative template refinement with verification on data
• Enrichment of the collection of analysis components (by Rainbow team as well as third party)
• Implementation of full-fledged broker
Thanks for your attention
Questions?