Handling markup overlaps using OWL

Angelo Di Iorio, Silvio Peroni, Fabio Vitali
License: http://creativecommons.org/licenses/by-sa/3.0

Upload: silvio-peroni

Post on 04-Jun-2015


DESCRIPTION

A lot of applications handle XML documents where multiple overlapping hierarchies are necessary and make use of a number of workarounds to force overlaps into the single hierarchy of an XML format. Although these workarounds are transparent to the users, they are very difficult to handle by applications reading into these formats. This paper proposes an approach to document markup based on Semantic Web technologies. Our model allows the same expressiveness as XML and any other hierarchical meta-markup language, and, rather than requiring complex workarounds, allows the explicit expression of overlapping structures in such a way that search and manipulation of these structures does not require any specific tool or language. By simply using mainstream technologies such as OWL and SPARQL, our model – called EARMARK (Extremely Annotational RDF Markup) – can perform rather sophisticated tasks with no special tricks.

TRANSCRIPT

http://creativecommons.org/licenses/by-sa/3.0

Handling markup overlaps using OWL

Angelo Di Iorio, Silvio Peroni, Fabio Vitali

Summary

• Overlapping markup in everyday life

• EARMARK: an OWL-based meta-markup language

• Conclusions and future works

Overlapping markup... wait, what?

• A definition: overlapping markup "describes cases where some markup structures do not nest neatly into others"
  DeRose, S. (2004). Markup Overlap: A Review and a Horse. In Proceedings of Extreme Markup Languages 2004. Montreal, Canada.

    <body>
      <p>Some <em>very</p>
      <p>interesting</em> text</p>
    </body>

• Different techniques to embed overlap in XML hierarchies, for instance:

  ✦ milestones – expressed through empty elements that mark the boundaries of the content

    <body>
      <p>Some <em start="id1"/>very</p>
      <p>interesting<em end="id1"/> text</p>
    </body>

  ✦ fragmentation – expressed by two non-overlapping elements linked through id-idref pairs

    <body>
      <p>Some <em id="em1" next="em2">very</em></p>
      <p><em id="em2">interesting</em> text</p>
    </body>
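One way to make "do not nest neatly" concrete is to view each element as a character range over the text and test whether two ranges intersect without one containing the other. The following is a minimal Python sketch under that assumption; the function name `overlaps`, the flattened text, and the offsets are illustrative and do not come from the slides.

```python
def overlaps(a, b):
    """Return True if half-open ranges a and b intersect
    but neither one fully contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    intersect = a_start < b_end and b_start < a_end
    a_inside_b = b_start <= a_start and a_end <= b_end
    b_inside_a = a_start <= b_start and b_end <= a_end
    return intersect and not (a_inside_b or b_inside_a)


# Flattened text of the <body> example (assumed: one space between paragraphs).
text = "Some very interesting text"

p1 = (0, 9)    # "Some very"
p2 = (10, 26)  # "interesting text"
em = (5, 21)   # "very interesting"

print(overlaps(p1, em))  # the emphasis starts inside the first paragraph
print(overlaps(p2, em))  # and ends inside the second one
print(overlaps(p1, p2))  # the two paragraphs are simply adjacent
```

Milestones and fragmentation are two ways of hiding ranges like `em` inside a single tree; the ranges themselves are what an application ultimately needs to recover.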

Overlapping everywhere

• Where we can find it: word processor formats + change tracking (e.g., ODT)

What the document is:

    <office:text>
      <text:changed-region text:id="S1">
        <text:insertion>
          <office:change-info>
            <dc:creator>John Smith</dc:creator>
            <dc:date>2009-10-27T18:45:00</dc:date>
          </office:change-info>
        </text:insertion>
      </text:changed-region>
      <text:p>
        The beginning and <text:change-start text:change-id="S1"/>
      </text:p>
      <text:p>
        also<text:change-end text:change-id="S1"/>the end.
      </text:p>
    </office:text>

What the document represents:

    before:  office:text > text:p
             "The beginning and the end."
    after:   office:text > text:p, text:p
             "also", inserted by John Smith (2009-10-27T18:45:00)
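To illustrate why milestone-based change tracking is awkward for applications, here is a hedged Python sketch that recovers the inserted text from the ODT fragment above using only the standard library. The helper names (`walk`, `inserted_text`, `change_author`) are hypothetical, the fragment is written without indentation to keep whitespace out of the result, and the sketch only collects text (it does not capture the paragraph break that the insertion also spans).

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}
TEXT = "{%s}" % NS["text"]

ODT = (
    '<office:text xmlns:office="%(office)s" xmlns:text="%(text)s" xmlns:dc="%(dc)s">'
    '<text:changed-region text:id="S1"><text:insertion><office:change-info>'
    '<dc:creator>John Smith</dc:creator><dc:date>2009-10-27T18:45:00</dc:date>'
    '</office:change-info></text:insertion></text:changed-region>'
    '<text:p>The beginning and <text:change-start text:change-id="S1"/></text:p>'
    '<text:p>also<text:change-end text:change-id="S1"/>the end.</text:p>'
    '</office:text>'
) % NS


def walk(elem):
    """Yield ("start", element) and ("text", string) events in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def inserted_text(root, change_id):
    """Collect the text lying between the change-start and change-end milestones."""
    inside = False
    pieces = []
    for kind, value in walk(root):
        if kind == "start":
            if value.get(TEXT + "change-id") == change_id:
                if value.tag == TEXT + "change-start":
                    inside = True
                elif value.tag == TEXT + "change-end":
                    inside = False
        elif inside:
            pieces.append(value)
    return "".join(pieces)


def change_author(root, change_id):
    """Look up dc:creator in the changed-region that carries the given id."""
    for region in root.iter(TEXT + "changed-region"):
        if region.get(TEXT + "id") == change_id:
            creator = region.find(".//dc:creator", NS)
            return creator.text if creator is not None else None
    return None


root = ET.fromstring(ODT)
print(inserted_text(root, "S1"), "/", change_author(root, "S1"))
```

Even for this tiny fragment, the application has to re-join a milestone pair that crosses a paragraph boundary with metadata stored elsewhere in the tree.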

• EARMARK is a vocabulary that defines a meta-markup language by means of OWL ontologies – http://www.essepuntato.it/2008/12/earmark

• It is more expressive than XML

• Three disjoint base classes:

  ✦ Docuverse – it represents the textual content of a document
    Subclasses: StringDocuverse, URIDocuverse

  ✦ Range – it describes any text lying between two locations
    Subclasses: PointerRange, XPathRange, XPathPointerRange

  ✦ MarkupItem – a collection of individuals belonging to the classes MarkupItem and Range
    Subclasses: Element, Attribute, Comment
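As a rough mental model of these three classes, the sketch below mirrors them with plain Python dataclasses. It is not the EARMARK ontology or its Java API: the field names, the half-open offset convention, and the `ins` element are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class StringDocuverse:
    """Holds the textual content that ranges point into."""
    content: str


@dataclass(frozen=True)
class PointerRange:
    """The text lying between two locations of a docuverse (assumed half-open)."""
    refers_to: StringDocuverse
    begins: int
    ends: int

    def text(self) -> str:
        return self.refers_to.content[self.begins:self.ends]


@dataclass
class Element:
    """A markup item: an ordered collection of markup items and ranges."""
    general_identifier: str
    children: List[Union["Element", PointerRange]] = field(default_factory=list)

    def text(self) -> str:
        return "".join(child.text() for child in self.children)


doc = StringDocuverse("The beginning and the end.")
first = PointerRange(doc, 0, 14)
second = PointerRange(doc, 14, 26)

paragraph = Element("p", [first, second])
insertion = Element("ins", [second])  # hypothetical second parent of the same range

print(paragraph.text())
print(insertion.text())
```

Because `second` is referenced by both `paragraph` and `insertion`, the structure is a graph rather than a tree, which is the property that lets overlap be stated directly.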

                  XML                    EARMARK
Data structure    Tree                   DAG
Overlapping       Only by using tricks   Of course, it is a feature here
Semantics         What?                  Yes, it is OWL!

An example

The beginning and the end.

    @prefix earmark: <http://www.essepuntato.it/2008/12/earmark#> .
    @prefix c: <http://swan.mindinformatics.org/ontologies/1.2/collections/> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix : <http://www.example.com/> .

    :aDoc a earmark:StringDocuverse ;
        earmark:hasContent "The beginning and the end."^^xsd:string .

    :r2 a earmark:PointerRange ;
        earmark:refersTo :aDoc ;
        earmark:begins "14"^^xsd:nonNegativeInteger ;
        earmark:ends "26"^^xsd:nonNegativeInteger .

    :aMarkupItem a earmark:Element ;
        earmark:hasGeneralIdentifier "p" ;
        earmark:hasNamespace "urn:oasis:names:tc:opendocument:xmlns:text:1.0" ;
        c:firstItem :item1 ;
        c:lastItem :item2 .

    :item1 c:itemContent :r1 ;
        c:nextItem :item2 .

    :item2 c:itemContent :r2 .

    :p2 a Insertion ;
        dc:creator "John Smith" ;
        dc:date "2009-10-27T18:45:00"^^xsd:dateTime .

(Diagram: the "before" and "after" hierarchies of office:text and text:p over this text, with "also" inserted by John Smith.)
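The statements above form a small graph: ranges point into a docuverse, and the element orders its content through a linked list (`c:firstItem`, `c:nextItem`, `c:itemContent`). The Python sketch below walks that structure using plain tuples instead of an RDF library. The definition of `:r1` (offsets 0 to 14) is an assumption, since the slide references `:r1` without showing it, and offsets are treated as half-open.

```python
# (subject, predicate, object) statements mirroring the Turtle example.
triples = {
    (":aDoc", "earmark:hasContent", "The beginning and the end."),
    # Assumed definition of :r1; the slide does not show it.
    (":r1", "earmark:refersTo", ":aDoc"),
    (":r1", "earmark:begins", 0),
    (":r1", "earmark:ends", 14),
    (":r2", "earmark:refersTo", ":aDoc"),
    (":r2", "earmark:begins", 14),
    (":r2", "earmark:ends", 26),
    (":aMarkupItem", "c:firstItem", ":item1"),
    (":aMarkupItem", "c:lastItem", ":item2"),
    (":item1", "c:itemContent", ":r1"),
    (":item1", "c:nextItem", ":item2"),
    (":item2", "c:itemContent", ":r2"),
}


def value(subject, predicate):
    """Return the single object stated for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def range_text(range_id):
    """Resolve a pointer range to the substring of its docuverse."""
    content = value(value(range_id, "earmark:refersTo"), "earmark:hasContent")
    return content[value(range_id, "earmark:begins"):value(range_id, "earmark:ends")]


def item_contents(markup_item):
    """Follow c:firstItem / c:nextItem and return the contents in order."""
    contents = []
    item = value(markup_item, "c:firstItem")
    while item is not None:
        contents.append(value(item, "c:itemContent"))
        item = value(item, "c:nextItem")
    return contents


ordered = item_contents(":aMarkupItem")
print(ordered)
print("".join(range_text(r) for r in ordered))
```

In practice an RDF store and SPARQL would do this traversal; the point of the sketch is only that the order of an element's content is itself expressed as data.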

EARMARK Data Structure

• It is an API and a Java library that makes it easy to create and modify EARMARK documents within Java applications

• Open Source project: http://earmark.sourceforge.net

    EARMARKDocument ed = new EARMARKDocument(new URI("http://www.example.com"));

    Docuverse aDoc = ed.createStringDocuverse("The beginning and the end.");

    [...]

    Range aRange = ed.createPointerRange(aDoc, 14, 26);

    [...]

    Element aMarkupItem = ed.createElement(
        "p", "urn:oasis:names:tc:opendocument:xmlns:text:1.0", Collection.Type.List);

    ed.appendChild(anotherMarkupItem);

[...]

Semantic Web technologies as added value

• Because every EARMARK document is expressed as a proper ABox of an ontology, we can use Semantic Web technologies:

  ✦ to manipulate documents
  ✦ to query them
  ✦ to infer new assertions
  ✦ to check some integrity constraints on document structure and on content semantics

• In EARMARK, those technologies help with problems that are difficult, or even impossible, to solve with XML tools

• An example: "get all the text fragments inserted by John Smith"

  ✦ XPath

    for $id in //@text:id[../text:insertion//(dc:creator[. = 'John Smith'] |
                          @office:chg-author[. = 'John Smith'])]
    return //text:p//text()[
        (preceding-sibling::text:change-start[1][@text:change-id = $id] and
         following-sibling::text:change-end[1][@text:change-id = $id])
        or ancestor::text:changed-region/@text:id = $id]

  ✦ SPARQL

    SELECT ?r WHERE {
        ?r a earmark:Range , Insertion ;
           dc:creator "John Smith" .
    }
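The SPARQL query is short because it is just a conjunction of three triple patterns on the same subject. The Python sketch below evaluates the same conjunction over a handful of in-memory statements. It is a stand-in for a SPARQL engine, not one, and the resources `:r1` to `:r3` and the creator "Jane Doe" are made-up test data.

```python
RDF_TYPE = "rdf:type"

# Hypothetical statements about three ranges.
triples = [
    (":r1", RDF_TYPE, "earmark:Range"),
    (":r2", RDF_TYPE, "earmark:Range"),
    (":r2", RDF_TYPE, "Insertion"),
    (":r2", "dc:creator", "John Smith"),
    (":r3", RDF_TYPE, "earmark:Range"),
    (":r3", RDF_TYPE, "Insertion"),
    (":r3", "dc:creator", "Jane Doe"),
]


def select_inserted_by(statements, creator):
    """Return subjects typed as both earmark:Range and Insertion
    whose dc:creator is the given name (the three patterns of the query)."""
    facts = set(statements)
    subjects = {s for s, _, _ in facts}
    return sorted(
        s
        for s in subjects
        if (s, RDF_TYPE, "earmark:Range") in facts
        and (s, RDF_TYPE, "Insertion") in facts
        and (s, "dc:creator", creator) in facts
    )


print(select_inserted_by(triples, "John Smith"))
```

No document-order navigation is needed here, in contrast to the XPath version, because authorship is asserted directly on the resource being selected.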

Conclusions and future works

• We presented a new meta-markup language called EARMARK, defined by means of OWL ontologies, that makes it possible to build very complex markup documents

• We applied it to a real-case scenario (the ODT format with change tracking), showing how it lets us handle, manipulate and query complex documents more easily than XML does

• Future works on this topic include:

  ✦ Rocco and Fretta, two ongoing projects that allow transformations from XML documents (with overlapping markup specified by using tricks) to EARMARK documents, and vice versa

  ✦ a formalism to explicitly specify the semantics of markup and of textual content

  ✦ a word processor that allows users to define EARMARK documents in a very simple way, with the possibility of adding any kind of semantic assertion to any entity of the document (both markup items and textual content)

Thanks for your attention

I think it’s time for questions :-)

Late time example: A more complex ODT document...

    <office:text>
      <text:changed-region text:id="S2">
        <text:deletion>
          <office:change-info>
            <dc:creator>Silvio Peroni</dc:creator>
            <dc:date>2009-10-27T18:45:00</dc:date>
          </office:change-info>
          <text:p>.</text:p>
        </text:deletion>
        <text:insertion>
          <office:change-info office:chg-author="Angelo Di Iorio"
              office:chg-date-time="2009-10-27T18:42:00"/>
        </text:insertion>
      </text:changed-region>
      <text:changed-region text:id="A2">
        <text:insertion>
          <office:change-info>
            <dc:creator>Angelo Di Iorio</dc:creator>
            <dc:date>2009-10-27T18:42:00</dc:date>
          </office:change-info>
        </text:insertion>
      </text:changed-region>
      [...]
      <text:p>This is one paragraph<text:change-start text:change-id="S1"/>; actually, it was!<text:change-end text:change-id="S1"/>
        <text:change text:change-id="S2"/>
        <text:change-start text:change-id="A2"/></text:p>
      <text:p><text:change-end text:change-id="A2"/>
        <text:change text:change-id="A3"/><text:change-start text:change-id="A4"/>S
        <text:change-end text:change-id="A4"/>plit in two.</text:p>
    </office:text>

... and its representation in EARMARK

[Figure: the ODT document above drawn as EARMARK layers (docuverses, ranges, markup items, assertions) along a TIME axis. Docuverse contents: "This is one paragraph that will be split in two.", "; actually, it was!", "." and "S". Ranges: r1 to r6. Markup items: text and p elements. Assertions: a text:insertion and a text:deletion, each with dc:creator "Silvio Peroni" and dc:date "2009-10-27T18:45:00"; a text:insertion and a text:deletion, each with dc:creator "Angelo Di Iorio" and dc:date "2009-10-27T18:42:00". Legend: begin location, end location, string in the range, docuverse content.]