Declaratively Producing Data Mash-ups
Sudarshan Murthy 1, David Maier 2
1 Applied Research, Wipro Technologies
2 Department of Computer Science, Portland State University
http://www.sixml.org
Apr 19, 2023 Declaratively Producing Data Mash-ups 2
Mash-ups
• Web applications that combine information from multiple sources [Wikipedia]
  – A mash-up does not need to be a web app
• Data that includes or transcludes content from multiple sources
• In either case, a source is likely only a fragment
• This work is about data mash-ups
  – In this talk, a mash-up is an XML document
Portland State University Campus Map
• 45 markers, 53 landmarks
  – Marker: Balloon on map
  – Landmark: Building, department, …
• Information from 188 fragments in 58 web pages
• Fragments selected manually
http://sparce.cs.pdx.edu/cmap/
Portland Metro Food Markets
• 154 markers, 154 landmarks
• 154 fragments harvested programmatically from 4 MS Word documents
• Developed for the Oregon Department of Agriculture
http://sparce.cs.pdx.edu/Declaratively Producing Data Mash-ups/oda-1.1/
An HTML Review Report
Problem Areas
• Development
  – Getting data from heterogeneous fragments
  – Might use a DBMS, yet code operators such as sort, join, and aggregate for external data
• Execution
  – When to get external data, how much to get?
• Design: Expressing that
  – A part comes from an external fragment
  – A part is data (such as page number) which cannot be “selected” in the source
Outline
• Introduction
• The conceptual approach
• Sixml: Condensed mash-ups
• Sixml DOM: Reconstituted mash-ups
• Sixml Navigator: Formatted mash-ups
• Evaluation
• Summary
• Discussion
Superimposed Information (SI)
• SI is new data and structure overlaid on existing base information
• Mark: A reference to an external fragment
• Benefits
  – Multiple, simultaneous organizations
  – Make new connections among base fragments
  – Preserve context
[Figure: a superimposed layer sits above a base layer of information sources 1 to n; marks link the superimposed layer to fragments in the base layer.]
Heterogeneous sources: Word, Excel, PDF, HTML, …
The Mash-up Production Process
• Collect and Classify: Collect marks, add new data and structure
  – Produces the condensed mash-up
• Extract and Combine: Extract data from marks and combine with added data
  – Produces the reconstituted mash-up
• Transform: Format reconstituted data for display and other purposes
  – Produces the formatted mash-up
[Figure: the three stages in sequence, each drawing on docs, DBMS, and services.]
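The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Mark` class, the functions `condense`, `reconstitute`, and `format_mashup`, and the in-memory `SOURCES` dictionary are hypothetical stand-ins for real documents and services, and a mark is assumed to be a character range.

```python
from dataclasses import dataclass

# Hypothetical in-memory "base layer": source name -> full text.
SOURCES = {"paper.txt": "The system provides a uniform API for marks."}


@dataclass(frozen=True)
class Mark:
    """A reference to a fragment of a source (assumed: a character range)."""
    source: str
    start: int
    end: int


def condense(comment: str, mark: Mark) -> dict:
    """Stage 1: collect a mark and add new data; no fragment text is stored."""
    return {"comment": comment, "mark": mark}


def reconstitute(condensed: dict) -> dict:
    """Stage 2: extract the fragment the mark points to and combine it."""
    m = condensed["mark"]
    excerpt = SOURCES[m.source][m.start:m.end]
    return {**condensed, "excerpt": excerpt}


def format_mashup(reconstituted: dict) -> str:
    """Stage 3: format the reconstituted data for display."""
    return f'{reconstituted["comment"]}: "{reconstituted["excerpt"]}"'


condensed = condense("Contradicts prior work", Mark("paper.txt", 11, 33))
formatted = format_mashup(reconstitute(condensed))
print(formatted)
```

The condensed form holds only the reference, so the excerpt is always read from the source at reconstitution time.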
SI, Bi-level Information, Mash-ups
• A condensed mash-up is SI
  – Links mash-up parts to external fragments
  – Relates to mash-up design: Sixml
• A reconstituted mash-up and a formatted mash-up are both bi-level information
  – SI plus reconstituted parts
  – Relates to runtime mash-up manipulation and execution: Sixml DOM and Sixml Navigator
Sixml
• A mash-up specification language
  – SI represented as XML; Sixml is XML
• A condensed mash-up is encoded as a Sixml document
• A mark association is encoded as an XML element of a type we define
  – Associate marks with six kinds of content
  – Validated using standard schema constructs
  – Uniform and comprehensible serialization
Sixml Mark Associations

<Comment excerpt="">
  Contradicts prior work
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

• By default, the text excerpt is assigned at run time, but it is possible to declare that the value should be something other than the excerpt
• Mark association names shown here are the same as the type names, but custom names are possible (with both static and dynamic typing)
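A consumer of a Sixml document first has to find the mark associations on an element. The sketch below does that with Python's standard `xml.etree.ElementTree`. The `mark_associations` helper and the `KINDS` table are hypothetical, and reading TMark, AMark, and EMark as marks on text, an attribute, and an element is an assumption drawn from how the slide uses them. The elided descriptor content is written as `...` so the sample parses.

```python
import xml.etree.ElementTree as ET

SIXML_NS = "http://schema.sixml.org"  # namespace URI used on the slide

DOC = """\
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>...</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>...</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>...</sixml:Descriptor>
  </sixml:EMark>
</Comment>
"""

# Assumed reading of the three element types shown on the slide.
KINDS = {"TMark": "text", "AMark": "attribute", "EMark": "element"}


def mark_associations(element):
    """Return (type name, kind of target, target attribute or None) per mark."""
    found = []
    for child in element:
        if not child.tag.startswith("{" + SIXML_NS + "}"):
            continue  # ordinary child element, not a mark association
        local = child.tag.split("}", 1)[1]
        if local in KINDS:
            found.append((local, KINDS[local], child.get("target")))
    return found


root = ET.fromstring(DOC)
marks = mark_associations(root)
print(marks)
```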
Sixml Mark Descriptors

<Comment excerpt="" xmlns:sixml="…" xmlns:xsi="…">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor xsi:type="sixml:XPointer">
      <pointer>http://www.w3.org/#element(/1/2)</pointer>
    </sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor xsi:type="sixml:SPARCE">
      <Agent>OfficeAgents.MSWord</Agent>
      <Doc location="c:\abc.doc" />
      <Subdoc startChar="45" endChar="53" />
    </sixml:Descriptor>
  </sixml:EMark>
</Comment>
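A descriptor's `xsi:type` tells the runtime how to interpret it. The following sketch dispatches on that attribute in Python. It is not the Sixml implementation: the `HANDLERS` registry and the `describe_*` functions are hypothetical, the `xsi` prefix is assumed to be bound to the standard XML Schema instance namespace (the slide elides the URI), and the type name is compared as a raw prefixed string rather than resolved as a QName.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"  # assumed binding of xsi
SIXML = "http://schema.sixml.org"

DESCRIPTOR = f"""\
<sixml:Descriptor xmlns:sixml="{SIXML}" xmlns:xsi="{XSI}"
                  xsi:type="sixml:SPARCE">
  <Agent>OfficeAgents.MSWord</Agent>
  <Doc location="c:\\abc.doc" />
  <Subdoc startChar="45" endChar="53" />
</sixml:Descriptor>
"""


def describe_sparce(d):
    """Read the fields of a SPARCE-style descriptor as shown on the slide."""
    sub = d.find("Subdoc")
    return {
        "agent": d.findtext("Agent"),
        "doc": d.find("Doc").get("location"),
        "range": (int(sub.get("startChar")), int(sub.get("endChar"))),
    }


def describe_xpointer(d):
    """Read the single pointer field of an XPointer-style descriptor."""
    return {"pointer": d.findtext("pointer")}


# Hypothetical registry: one handler per descriptor type.
# Simplification: keys are prefixed names, so they depend on the prefix used.
HANDLERS = {"sixml:SPARCE": describe_sparce, "sixml:XPointer": describe_xpointer}


def describe(descriptor):
    """Dispatch on the xsi:type attribute of a descriptor element."""
    type_name = descriptor.get("{" + XSI + "}type")
    handler = HANDLERS.get(type_name)
    if handler is None:
        raise ValueError(f"unsupported descriptor type: {type_name!r}")
    return handler(descriptor)


info = describe(ET.fromstring(DESCRIPTOR))
print(info)
```

Keeping one handler per type means a new kind of source only needs a new registry entry.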
Sixml DOM
• Extends W3C XML DOM to easily manipulate Sixml documents
  – Using DOM can be tedious and inefficient
• Automatic and lazy reconstitution
  – Detects mark associations and interprets attributes such as sixml:valueSource
  – Developer uses only the DOM interface
• Access to descriptors and “context” of external fragments
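Lazy reconstitution can be pictured as a wrapper that fetches an external fragment only when a marked value is first read, then caches it. The Python sketch below illustrates that idea only. `LazyElement`, `get_attribute`, and `fake_resolver` are hypothetical names, not part of Sixml DOM, and the handling of `sixml:valueSource` is left out.

```python
class LazyElement:
    """Element whose marked attribute values are fetched on first read."""

    def __init__(self, attributes, attribute_marks, resolver):
        self._attributes = dict(attributes)   # stored (condensed) values
        self._marks = dict(attribute_marks)   # attribute name -> descriptor
        self._resolver = resolver             # descriptor -> excerpt text
        self._cache = {}

    def get_attribute(self, name):
        """Same call a DOM client would make; reconstitution is hidden."""
        if name in self._marks:
            if name not in self._cache:
                # First access: retrieve the fragment and remember it.
                self._cache[name] = self._resolver(self._marks[name])
            return self._cache[name]
        return self._attributes.get(name)


calls = []


def fake_resolver(descriptor):
    """Hypothetical stand-in for retrieving a fragment from a base document."""
    calls.append(descriptor)
    return "provides a uniform API"


comment = LazyElement({"excerpt": ""}, {"excerpt": "descriptor-1"}, fake_resolver)
calls_before_read = len(calls)       # nothing is fetched at construction
first = comment.get_attribute("excerpt")
second = comment.get_attribute("excerpt")
print(first, len(calls))
```

The second read is served from the cache, so the resolver runs once per marked value.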
Run-time Representation
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>
[Figure: DOM tree for the Sixml document above, with nodes Comment, @excerpt, TMark, AMark, @target, @valueSource, and EMark, and a Descriptor under each mark association.]
[Figure: the corresponding tree after reconstitution, in which @excerpt holds the retrieved text (“provides…”) and each mark association also has a Context.]
Generating a Sixml DOM Tree
• A mark association is “attached” to its target, but is not a child
  – The DOM interface suffices to access the reconstituted mash-up
• Descriptor is not a child
• Value reconstituted
[Figure: Sixml DOM tree for the Comment element, annotated with the callouts above.]
Context Information
• Information retrieved from the context of an external fragment
• An xsi:type-specific implementation determines (statically or dynamically) what is in context

<sixml:Context>
  <Content>
    <Text>provide ... system</Text>
  </Content>
  <Presentation>
    <FontName>Times New Roman</FontName>
    <FontSize>11</FontSize>
  </Presentation>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>
Programming with Sixml DOM
1. procedure WriteComment(SixmlElement c)
2.   XmlElement ctxt = c.markAssociations[0].Context
3.   XmlNode page = ctxt.getElementsByTagName("Page")[0]
4.   Writeln("Page: ", page.firstChild.nodeValue)
5.   Writeln("Excerpt: ", c.getAttribute("excerpt"))
6.   Writeln("Comment: ", c.firstChild.nodeValue)
• Only Lines 1 and 2 use the Sixml DOM interface
• Lines 2–4 get page number; Line 5 the reconstituted excerpt; and Line 6 the comment text
Sixml Navigator
• Alternative to the traditional path navigator
• Extends XDM so that Sixml documents can be declaratively queried using existing languages and query processors
  – Also applies to XPath 1.0 and XSLT 1.0
• Performs automatic and lazy reconstitution
XDM Extensions
• Allow child elements for any kind of node with which a mark may be associated
• Make a mark association a child of its target node
• Represent a mark descriptor and context as children of a mark association
• These extensions allow reuse of existing query languages and processors
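To make the extensions concrete, the sketch below builds a small tree in Python in which attribute and text nodes may have children, each mark association hangs under its target, and each mark association carries Descriptor and Context children. The `Node` class and the `extend` and `mark_node` functions are hypothetical, written only to illustrate the data model. They are not the Sixml Navigator, which, as stated elsewhere in the talk, does not construct such trees.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

SIXML = "{http://schema.sixml.org}"

DOC = """\
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>...</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>...</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>...</sixml:Descriptor>
  </sixml:EMark>
</Comment>
"""


@dataclass
class Node:
    kind: str                 # "element", "attribute", or "text"
    name: str
    value: str = ""
    children: list = field(default_factory=list)


def mark_node(name):
    """A mark association with Descriptor and Context as its children."""
    return Node("element", name, children=[Node("element", "Descriptor"),
                                           Node("element", "Context")])


def extend(elem):
    """Build an extended tree in which each mark is a child of its target."""
    node = Node("element", elem.tag)
    attrs = {k: Node("attribute", k, v) for k, v in elem.attrib.items()}
    node.children.extend(attrs.values())
    for child in elem:
        if child.tag == SIXML + "AMark":
            # Attribute target: hang the mark under the attribute node.
            attrs[child.get("target")].children.append(mark_node("AMark"))
        elif child.tag == SIXML + "TMark":
            # Text target: a text node that has the mark as its child.
            text = Node("text", "#text", (child.text or "").strip())
            text.children.append(mark_node("TMark"))
            node.children.append(text)
        elif child.tag == SIXML + "EMark":
            # Element target: the mark is a child of the element itself.
            node.children.append(mark_node("EMark"))
        else:
            node.children.append(extend(child))
    return node


tree = extend(ET.fromstring(DOC))
print([(c.kind, c.name) for c in tree.children])
```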
An Extended-XDM Tree
[Figure: extended-XDM tree for the Comment element. The nodes shown are Comment, @excerpt, the text “Contradicts…”, TMark, AMark (with @target and @valueSource), and EMark; each mark association has a Descriptor and a Context.]
Queries over Bi-level Information
• With Comment as current node, get the comment text
    ./text()
• Get excerpt of commented region
    ./@excerpt
• Get page number of commented region
    ./sixml:EMark/sixml:Context/Placement/Page

<sixml:Context>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>
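The same three lookups can be tried in Python on a hand-written document that stands in for what a navigator would expose after reconstitution. This is a sketch under assumptions: the excerpt value is made up, and because `xml.etree.ElementTree` supports only a subset of XPath, `./text()` and `./@excerpt` are replaced by the `.text` attribute and `.get()`. Only the third query uses a path expression.

```python
import xml.etree.ElementTree as ET

NS = {"sixml": "http://schema.sixml.org"}

# Hand-written stand-in for a reconstituted mash-up (excerpt value invented).
RECONSTITUTED = """\
<Comment excerpt="provides a uniform API" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:EMark>
    <sixml:Context>
      <Placement>
        <Page>3</Page>
      </Placement>
    </sixml:Context>
  </sixml:EMark>
</Comment>
"""

comment = ET.fromstring(RECONSTITUTED)

text = comment.text.strip()        # plays the role of ./text()
excerpt = comment.get("excerpt")   # plays the role of ./@excerpt
page = comment.findtext(
    "./sixml:EMark/sixml:Context/Placement/Page", namespaces=NS
)
print(text, excerpt, page)
```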
Implementation and Usage
• Element types for Sixml mark associations defined in XML Schema
• Sixml DOM and Sixml Navigator in C# on the .NET Framework
  – Sixml DOM implemented by extending DOM and by revising DOM
  – Three implementations of Sixml DOM: 2 extensions (MS and Mono), 1 revision (Mono)
• Sixml, Sixml DOM, and Sixml Navigator used in mash-ups for several applications
Experimental Data
• 8 mash-ups
  – 4 each from 2 apps; different scale factors
  – File size: 200 KB to 26.1 MB
  – #Docs referenced: 18 to 426
  – #Mark associations: 1.9K to over 311K
• 3 traditional XML documents
  – File size: 484 KB to 113.7 MB
  – Tree depth: 4, 8, 16
Evaluation Summary
• Sixml DOM
  – Saves time over DOM when accessing mark associations
  – When accessing SI, savings decrease as the amount of SI increases
  – It is better to use DOM to access large traditional XML documents
• Sixml Navigator
  – Saves time over traditional navigator for both mark associations and SI
Summary
• A mash-up has three forms: condensed, reconstituted, and formatted
• Sixml, Sixml DOM, and Sixml Navigator support the three forms, respectively
• Sixml makes it easier to specify mash-ups; Sixml DOM and Navigator provide a more efficient means of manipulating mash-ups
• The XML Schema instance documents and the source code are on www.sixml.org
Our Mash-up Framework
[Figure: framework components, grouped as on the slide: XSLT and XQuery processors, XPath processor, and client application; Sixml, Sixml DOM, and Sixml Navigator; SPARCE, Bulk Accessor, and Cloaker.]
• SPARCE: Reference and retrieve fragments of arbitrary types
• Bulk Accessor: Efficiently retrieve large number of fragments
• Cloaker: Hide data to improve query expression and execution
Bi-level Query Processors
• Sixml Navigator uses Sixml DOM internally: Does not construct extended-XDM trees
• Existing query processors use the Sixml Navigator instead of using the traditional navigator
[Figure: class diagram. Classes and members shown: BulkAccessor; XMLContextTransformer with transform(contextInfo); SixmlNavigator with scope; XSLTProcessor with apply(styleSheet); XPathNavigator with moveToRoot(), moveToFirstChild(), moveToNextSibling(), moveToPreviousSibling(), and moveToParent(); XPathEvaluator with evaluate(expression); SixmlNode. Association labels: Produces (0..1 to *), Embeds (1 to *), Source (* to *), Uses (1 to *), Node, and Evaluation Context.]
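The cursor-style interface named in the diagram (moveToRoot, moveToFirstChild, and so on) can be illustrated with a small Python class over `xml.etree.ElementTree`. This is a generic sketch of a navigator, not the .NET XPathNavigator or the Sixml Navigator. The `TreeNavigator` class is hypothetical, and each move is assumed to return `False` and leave the cursor in place when it cannot be made.

```python
import xml.etree.ElementTree as ET


class TreeNavigator:
    """Cursor over an ElementTree, with the move operations from the diagram."""

    def __init__(self, root):
        self._root = root
        # ElementTree has no parent pointers, so build a child -> parent map.
        self._parent = {c: p for p in root.iter() for c in p}
        self.current = root

    def _move(self, target):
        if target is None:
            return False          # stay in place when the move is impossible
        self.current = target
        return True

    def _sibling(self, offset):
        parent = self._parent.get(self.current)
        if parent is None:
            return None           # the root has no siblings
        siblings = list(parent)
        i = siblings.index(self.current) + offset
        return siblings[i] if 0 <= i < len(siblings) else None

    def move_to_root(self):
        return self._move(self._root)

    def move_to_first_child(self):
        return self._move(next(iter(self.current), None))

    def move_to_next_sibling(self):
        return self._move(self._sibling(+1))

    def move_to_previous_sibling(self):
        return self._move(self._sibling(-1))

    def move_to_parent(self):
        return self._move(self._parent.get(self.current))


nav = TreeNavigator(ET.fromstring("<a><b/><c><d/></c></a>"))
nav.move_to_first_child()            # a -> b
nav.move_to_next_sibling()           # b -> c
nav.move_to_first_child()            # c -> d
moved = nav.move_to_next_sibling()   # d has no next sibling
print(nav.current.tag, moved)
```

A query processor written against such a cursor never sees how the nodes are stored, which is what lets a different navigator be swapped in underneath it.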
Mark Creation
[Figure: mark creation, involving the Superimposed Application, Mark Manager, Clipboard, Superimposed Info, and Descriptors Repository.]

<Mark ID="M4">
  <Agent>AcrobatAgents.PDFAgent</Agent>
  <Class>AcrobatPDFTextMark</Class>
  <Address>2|395|439</Address>
  …
  <ContainerID>D6</ContainerID>
</Mark>
Activation and Context Retrieval
[Figure: activation and context retrieval, involving the Superimposed Application, Mark Manager, Context Manager, Superimposed Info, Base Application, Descriptors Repository, and Base Info.]

<Mark ID="M4">
  <Agent>AcrobatAgents.PDFAgent</Agent>
  <Class>AcrobatPDFTextMark</Class>
  <Address>2|395|439</Address>
  …
  <ContainerID>D6</ContainerID>
</Mark>
About Context
[Figure: context of a PDF mark and of a PowerPoint mark.]
• Context information is modeled as a hierarchical property set
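One way to picture a hierarchical property set is as a mapping from property paths to values. The Python sketch below flattens the context example from the Context Information slide into such a mapping. The `flatten` function and the `Group/Property` path convention are assumptions made for illustration, not part of Sixml or SPARCE.

```python
import xml.etree.ElementTree as ET

CONTEXT = """\
<sixml:Context xmlns:sixml="http://schema.sixml.org">
  <Content>
    <Text>provide ... system</Text>
  </Content>
  <Presentation>
    <FontName>Times New Roman</FontName>
    <FontSize>11</FontSize>
  </Presentation>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>
"""


def flatten(element, prefix=""):
    """Map each leaf element to a 'Group/Property' path and its text value."""
    properties = {}
    for child in element:
        path = prefix + child.tag
        if len(child):                      # has sub-properties: recurse
            properties.update(flatten(child, path + "/"))
        else:
            properties[path] = (child.text or "").strip()
    return properties


props = flatten(ET.fromstring(CONTEXT))
print(props)
```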