Handling markup overlaps using OWL

Many applications handle XML documents in which multiple overlapping hierarchies are necessary, and they rely on a number of workarounds to force overlaps into the single hierarchy of an XML format. Although these workarounds are transparent to users, they are very difficult for applications that read these formats to handle. This paper proposes an approach to document markup based on Semantic Web technologies. Our model allows the same expressiveness as XML and any other hierarchical meta-markup language and, rather than requiring complex workarounds, allows the explicit expression of overlapping structures in such a way that searching and manipulating these structures does not require any specific tool or language. By simply using mainstream technologies such as OWL and SPARQL, our model – called EARMARK (Extremely Annotational RDF Markup) – can perform rather sophisticated tasks with no special tricks.
http://creativecommons.org/licenses/by-sa/3.0
Angelo Di Iorio
Silvio Peroni
Fabio Vitali
Summary
• Overlapping markup in everyday life
• EARMARK: an OWL-based meta-markup language
• Conclusions and future works
Overlapping markup... wait, what?
• A definition: overlapping markup “describes cases where some markup structures do not nest neatly into others”
  DeRose, S. (2004). Markup Overlap: A Review and a Horse. In Proceedings of Extreme Markup Languages 2004. Montreal, Canada.
  <body>
    <p>Some <em>very</p>
    <p>interesting</em> text</p>
  </body>
• Different techniques to embed overlap in XML hierarchies, for instance:
  ✦ milestones – expressed through empty elements to mark the boundaries of the content
    <body>
      <p>Some <em start="id1"/>very</p>
      <p>interesting<em end="id1"/> text</p>
    </body>
  ✦ fragmentation – expressed by two non-overlapping elements linked through id-idref pairs
    <body>
      <p>Some <em id="em1" next="em2">very</em></p>
      <p><em id="em2">interesting</em> text</p>
    </body>
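The definition above can be made concrete by treating each element as a span of character offsets. The following is a minimal sketch, not part of the original slides: the text string, the offsets, and the function name `relation` are assumptions chosen to mirror the `<p>`/`<em>` example.

```python
# Text content of the <body> example, with a single space between paragraphs
# (an assumption made for this sketch).
TEXT = "Some very interesting text"

# (start, end) character offsets, end exclusive.
P1 = (0, 9)    # "Some very"
P2 = (10, 26)  # "interesting text"
EM = (5, 21)   # "very interesting"


def relation(a, b):
    """Classify two spans as 'disjoint', 'nested' or 'overlap'."""
    (a0, a1), (b0, b1) = a, b
    if a1 <= b0 or b1 <= a0:
        return "disjoint"
    if (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1):
        return "nested"
    # The spans share some text, but neither contains the other:
    # this is the case a single XML hierarchy cannot express directly.
    return "overlap"


if __name__ == "__main__":
    print(relation(EM, P1))  # em crosses the end of the first paragraph
    print(relation(EM, P2))  # em crosses the start of the second paragraph
    print(relation(P1, P2))  # the two paragraphs do not share text
```

Only the "overlap" case needs milestones or fragmentation in XML; "disjoint" and "nested" spans fit a tree as they are.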
Overlapping everywhere
• Where we can find it: word processor formats + change tracking (e.g., ODT)
What the document is:
<office:text>
  <text:changed-region text:id="S1">
    <text:insertion>
      <office:change-info>
        <dc:creator>John Smith</dc:creator>
        <dc:date>2009-10-27T18:45:00</dc:date>
      </office:change-info>
    </text:insertion>
  </text:changed-region>
  <text:p>
    The beginning and <text:change-start text:change-id="S1"/>
  </text:p>
  <text:p>
    also<text:change-end text:change-id="S1"/>the end.
  </text:p>
</office:text>
What the document represents:
  before: office:text > text:p > "The beginning and the end."
  change: 2009-10-27T18:45:00, inserted by John Smith
  after: office:text > text:p, text:p (with "also" inserted)
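To see why milestone-based change tracking is awkward for applications, here is a minimal sketch that recovers the text lying between a `text:change-start` and its `text:change-end`, even though the two milestones sit in different paragraphs. It uses only Python's standard `xml.etree.ElementTree`; the function name `text_inside_change` and the trimmed-down sample document are assumptions made for illustration, and the sketch only looks at `text:p` elements that are direct children of `office:text`.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
P = "{%s}p" % TEXT_NS
START = "{%s}change-start" % TEXT_NS
END = "{%s}change-end" % TEXT_NS
CHANGE_ID = "{%s}change-id" % TEXT_NS

# Trimmed-down version of the slide's document (the changed-region is omitted).
ODT = """<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:p>The beginning and <text:change-start text:change-id="S1"/></text:p>
  <text:p>also<text:change-end text:change-id="S1"/>the end.</text:p>
</office:text>"""


def text_inside_change(root, change_id):
    """Collect the text between the start and end milestones of one change."""
    pieces = []
    inside = False

    def walk(el):
        nonlocal inside
        # Milestones toggle the state; they carry no text of their own.
        if el.get(CHANGE_ID) == change_id:
            if el.tag == START:
                inside = True
            elif el.tag == END:
                inside = False
        if inside and el.text:
            pieces.append(el.text)
        for child in el:
            walk(child)
            if inside and child.tail:
                pieces.append(child.tail)

    # The state must survive across paragraph boundaries, which is exactly
    # what makes the milestone workaround hard to query.
    for paragraph in root.findall(P):
        walk(paragraph)
    return "".join(pieces)


if __name__ == "__main__":
    print(text_inside_change(ET.fromstring(ODT), "S1"))
```

The inserted fragment is never the content of a single element, so the code has to carry state from one paragraph to the next.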
• EARMARK is a vocabulary that defines a meta-markup language by means of OWL ontologies – http://www.essepuntato.it/2008/12/earmark
• It is more expressive than XML
• Three disjoint base classes:
  ✦ Docuverse – it represents the textual content of a document
    Subclasses: StringDocuverse, URIDocuverse
  ✦ Range – it describes any text lying between two locations
    Subclasses: PointerRange, XPathRange, XPathPointerRange
  ✦ MarkupItem – a collection of individuals belonging to the classes MarkupItem and Range
    Subclasses: Element, Attribute, Comment

                  XML                    EARMARK
  Data structure  Tree                   DAG
  Overlapping     Only by using tricks   Of course, it is a feature here
  Semantics       What?                  Yes, it is OWL!
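An example
@prefix earmark: <http://www.essepuntato.it/2008/12/earmark#> .
@prefix c: <http://swan.mindinformatics.org/ontologies/1.2/collections/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://www.example.com/> .

# The docuverse: the raw text of the document
:aDoc a earmark:StringDocuverse ;
    earmark:hasContent "The beginning and the end."^^xsd:string .

# A range over the docuverse, from location 14 to location 26
:r2 a earmark:PointerRange ;
    earmark:refersTo :aDoc ;
    earmark:begins "14"^^xsd:nonNegativeInteger ;
    earmark:ends "26"^^xsd:nonNegativeInteger .

# A text:p element whose content is the list (:r1, :r2); :r1 is not shown on the slide
:aMarkupItem a earmark:Element ;
    earmark:hasGeneralIdentifier "p" ;
    earmark:hasNamespace "urn:oasis:names:tc:opendocument:xmlns:text:1.0" ;
    c:firstItem :item1 ;
    c:lastItem :item2 .
:item1 c:itemContent :r1 ; c:nextItem :item2 .
:item2 c:itemContent :r2 .

# A semantic assertion on a markup item: it was inserted by John Smith
:p2 a :Insertion ;
    dc:creator "John Smith" ;
    dc:date "2009-10-27T18:45:00"^^xsd:dateTime .

[Figure: the docuverse content "The beginning and the end." and the inserted string "also", shared by the office:text / text:p hierarchy before the change and the office:text / text:p, text:p hierarchy after it; the insertion is marked "inserted by John Smith".]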
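The relationship between docuverses, ranges and markup items can also be sketched outside RDF. The following is a plain-Python illustration, not the EARMARK library: the class and method names are assumptions, locations are assumed to be 0-based offsets with an exclusive end, and `r1` (which the slide references but does not define) is a hypothetical range covering the first 14 characters.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StringDocuverse:
    """The raw text of a document, with no markup at all."""
    content: str


@dataclass(frozen=True)
class PointerRange:
    """The text lying between two locations of a docuverse."""
    refers_to: StringDocuverse
    begins: int
    ends: int

    def text(self) -> str:
        return self.refers_to.content[self.begins:self.ends]


@dataclass
class Element:
    """A markup item: an ordered list of ranges and/or other markup items."""
    general_identifier: str
    namespace: str
    items: list = field(default_factory=list)

    def text(self) -> str:
        return "".join(item.text() for item in self.items)


TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

a_doc = StringDocuverse("The beginning and the end.")
r1 = PointerRange(a_doc, 0, 14)   # hypothetical: not defined on the slide
r2 = PointerRange(a_doc, 14, 26)  # same locations as :r2 on the slide
p = Element("p", TEXT_NS, [r1, r2])

# Hypothetical second element that reuses r2. The range now has two parents,
# which a tree cannot represent but a DAG can.
other = Element("span", "http://www.example.com/ns", [r2])

if __name__ == "__main__":
    print(repr(r2.text()))
    print(repr(p.text()))
    print(repr(other.text()))
```

Because ranges point into the docuverse instead of wrapping it, two markup items can cover the same characters without either one having to contain the other.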
EARMARK Data Structure
• It is an API and a Java library that makes it easy to create and modify EARMARK documents within Java applications
• Open Source project: http://earmark.sourceforge.net
EARMARKDocument ed = new EARMARKDocument(new URI("http://www.example.com"));
Docuverse aDoc = ed.createStringDocuverse("The beginning and the end.");
[...]
Range aRange = ed.createPointerRange(aDoc, 14, 26);
[...]
Element aMarkupItem = ed.createElement("p",
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0", Collection.Type.List);
ed.appendChild(anotherMarkupItem);
[...]
Semantic Web technologies as added value
• Because every EARMARK document is expressed as the ABox of an ontology, we can use Semantic Web technologies:
  ✦ to manipulate documents
  ✦ to query them
  ✦ to infer new assertions
  ✦ to check some integrity constraints on document structure and on content semantics
• In EARMARK, those technologies can be very helpful in solving issues that are difficult, or even impossible, to solve with XML tools
• An example: “get all the text fragments inserted by John Smith”
  ✦ XPath
    for $id in //@text:id[../text:insertion//(dc:creator[. = 'John Smith'] |
            @office:chg-author[. = 'John Smith'])]
    return //text:p//text()[
        (preceding-sibling::text:change-start[1][@text:change-id = $id] and
         following-sibling::text:change-end[1][@text:change-id = $id]) or
        ancestor::text:changed-region/@text:id = $id]
  ✦ SPARQL
    SELECT ?r WHERE {
        ?r a earmark:Range , :Insertion ;
           dc:creator "John Smith" .
    }
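The SPARQL query is short because it only asks for resources that satisfy three triple patterns. A minimal sketch of that matching logic in plain Python follows; it is not a SPARQL engine and performs no OWL inference, and the triples, the identifiers `:r1` and `:r2`, and the function name `subjects_matching` are all hypothetical data made up for illustration.

```python
# A tiny in-memory graph of (subject, predicate, object) triples.
# Hypothetical data: ranges are typed directly as earmark:Range here because
# this sketch does no subclass reasoning.
TRIPLES = {
    (":r1", "rdf:type", "earmark:Range"),
    (":r2", "rdf:type", "earmark:Range"),
    (":r2", "rdf:type", ":Insertion"),
    (":r2", "dc:creator", "John Smith"),
}


def subjects_matching(triples, patterns):
    """Return the subjects that have every (predicate, object) pair in patterns."""
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if all((s, p, o) in triples for p, o in patterns)
    )


# Equivalent of: ?r a earmark:Range , :Insertion ; dc:creator "John Smith" .
QUERY = [
    ("rdf:type", "earmark:Range"),
    ("rdf:type", ":Insertion"),
    ("dc:creator", "John Smith"),
]

if __name__ == "__main__":
    print(subjects_matching(TRIPLES, QUERY))
```

Unlike the XPath version, nothing here depends on where the change boundaries sit in a hierarchy: authorship is a property of the range itself.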
Conclusions and future works
• We presented EARMARK, a new meta-markup language defined by means of OWL ontologies, which can express very complex markup documents
• We applied it to a real-case scenario (the ODT format with change tracking), showing how it lets us handle, manipulate and query complex documents more easily than XML does
• Future works on this topic include:
  ✦ Rocco and Fretta, two ongoing projects that transform XML documents (with overlapping markup specified by using tricks) into EARMARK documents, and vice versa
  ✦ a formalism to explicitly specify the semantics of markup and of textual content
  ✦ a word processor for defining EARMARK documents in a very simple way, with the possibility of adding any kind of semantic assertion to any entity of the document (both markup items and textual content)
Late time example: A more complex ODT document...
<office:text>
  <text:changed-region text:id="S2">
    <text:deletion>
      <office:change-info>
        <dc:creator>Silvio Peroni</dc:creator>
        <dc:date>2009-10-27T18:45:00</dc:date>
      </office:change-info>
      <text:p>.</text:p>
    </text:deletion>
    <text:insertion>
      <office:change-info office:chg-author="Angelo Di Iorio"
          office:chg-date-time="2009-10-27T18:42:00"/>
    </text:insertion>
  </text:changed-region>
  <text:changed-region text:id="A2">
    <text:insertion>
      <office:change-info>
        <dc:creator>Angelo Di Iorio</dc:creator>
        <dc:date>2009-10-27T18:42:00</dc:date>
      </office:change-info>
    </text:insertion>
  </text:changed-region>
  [...]
  <text:p>This is one paragraph<text:change-start text:change-id="S1"/>;
    actually, it was!<text:change-end text:change-id="S1"/>
    <text:change text:change-id="S2"/>
    <text:change-start text:change-id="A2"/></text:p>
  <text:p><text:change-end text:change-id="A2"/>
    <text:change text:change-id="A3"/><text:change-start text:change-id="A4"/>S
    <text:change-end text:change-id="A4"/>plit in two.</text:p>
</office:text>
... and its representation in EARMARK
[Figure: EARMARK graph of the ODT document above, laid out along a TIME axis in four layers: docuverses, ranges, markup items, assertions.
 Docuverses: "This is one paragraph that will be split in two.", "; actually, it was!", ".", "S"
 Ranges: r1, r2, r3, r4, r5, r6
 Markup items: text and p elements
 Assertions:
   a text:insertion ; dc:creator "Silvio Peroni" ; dc:date "2009-10-27T18:45:00"
   a text:deletion ; dc:creator "Silvio Peroni" ; dc:date "2009-10-27T18:45:00"
   a text:insertion ; dc:creator "Angelo Di Iorio" ; dc:date "2009-10-27T18:42:00"
   a text:deletion ; dc:creator "Angelo Di Iorio" ; dc:date "2009-10-27T18:42:00"
 Legend: begin location, end location, string in the range, docuverse content]