CODA – CATCHPlus Open Document Annotation

Hennie Brugman

OAC II Project Review meeting

Chicago – July 26-27, 2012

Annotation context

• Audiovisual

– ASR, language, gesture, oral history

• Text – Semantic annotation

• Music – lyrics, music notation

• Linguistic Annotation – named entities

• Image annotation

• Programs: CATCH, CATCHPlus, CLARIN

CODA main use cases

• Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)

– Line strip and word zone annotations

– ML: search in manuscript images

– Add Named Entity annotations

• Sailing Letters (Nicoline van der Sijs/Meertens + consortium, Lambert Schomaker)

– Support manual annotation

– Line strip detection service

Line annotation tools (catchplus)

<txt>godefroit</txt>
<id>navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho</id>
<user>mceunen</user>
<time>Wed Jan 26 16:37:01 2011</time>
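The identifier in the record above packs several key=value fields into one string. As a rough sketch of how a converter might pull those fields out, the hypothetical helper `parse_line_id` below collects every `key=value` pair. What the fields mean (for example, that `y1`/`y2` bound the line strip and `x`, `y`, `w`, `h` describe the word zone) is only a guess from their names, and the helper assumes values never contain a hyphen.

```python
import re


def parse_line_id(identifier: str) -> dict:
    """Collect the key=value fields embedded in a line/zone identifier.

    Hypothetical helper: field semantics are not defined in the slides.
    Values made only of digits become ints; everything else stays a string.
    """
    fields = {}
    for key, value in re.findall(r"(\w+)=([^-\s]+)", identifier):
        fields[key] = int(value) if value.isdigit() else value
    return fields


example_id = (
    "navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN"
    "-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho"
)
fields = parse_line_id(example_id)
```

Parts without an equals sign, such as `line-026` and `zone-HUMAN`, are deliberately left alone here; splitting them would require knowing the naming scheme.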

OAC representation

[Figure: OAC graph in two columns, ImageAnnotation and TextAnnotations. Node labels: imageScan.jpg, linestrip.jpg, Canvas1, ia:1, ia:2, ib:0, page:0, zone:2, line:1, ct:1, ct:2, cb:1, cb:2. Edge labels: hasBody, hasTarget, constrains. Inline text (cnt:chars): “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”]

Named Entity

OAC representation – Named Entities

[Figure: OAC graph in three columns, ImageAnnotation, TextAnnotations and EntityAnnotation. Node labels: imageScan.jpg, Canvas1, ia:1, ta:0, ta:1, ta:2, ea:1, ib:0, ib:1, ct:1, ct:2, ct:3, ct:4, cb:1, cb:2. Edge labels: hasBody, hasTarget, constrains. Inline text (cnt:chars): “Dit is een beschrijving van Den Haag. En dit is een tweede zin.” Entity body (cnt:chars): “location”]

! Annotation of annotations?

! Annotation of segments of inline text?

InlineTextConstraint:

<rdf:Description rdf:about="urn:uuid:533624bb-d565-40ba-a14a-2e95c19c20df">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/ConstrainedTarget"/>
  <constrains xmlns="http://www.openannotation.org/ns/"
      rdf:resource="http://oas.dev.seecr.nl:8000/resolve/urn%3Auuid%3Ad8741024-18bf-40a8-a648-2cd5ebb9acfd"/>
  <constrainedBy xmlns="http://www.openannotation.org/ns/"
      rdf:resource="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208"/>
</rdf:Description>

<rdf:Description rdf:about="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/Constraint"/>
  <rdf:type rdf:resource="http://www.catchplus.nl/annotation/InlineTextConstraint"/>
  <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
  <chars xmlns="http://www.w3.org/2008/content#">"&lt;textsegment offset="279" range="2"/&gt;"</chars>
  <characterEncoding xmlns="http://www.w3.org/2008/content#">UTF-8</characterEncoding>
</rdf:Description>
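To illustrate how a client might resolve such a constraint, the sketch below applies a `textsegment` element to a piece of inline text. It assumes that `offset` is a zero-based character index and `range` a length in characters; the slides do not spell this out. The helper name `apply_text_constraint` and the values 28 and 8 are made up for the example (the slide itself uses offset 279 and range 2 against a text that is not shown).

```python
import xml.etree.ElementTree as ET


def apply_text_constraint(inline_text: str, constraint_xml: str) -> str:
    """Return the slice of inline_text selected by a <textsegment/> element.

    Assumption: offset is zero-based and range counts characters.
    """
    segment = ET.fromstring(constraint_xml)
    offset = int(segment.get("offset"))
    length = int(segment.get("range"))
    return inline_text[offset:offset + length]


# Example sentence taken from the OAC representation slides.
text = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
selected = apply_text_constraint(text, '<textsegment offset="28" range="8"/>')
```

With these illustrative values the selected segment is the place name in the first sentence, which is the kind of span a named-entity annotation would target.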

KdK-2-OAC conversion

• Implicit line and page text

• Word and line order

• Text offsets and ranges

• Spatial information

• Identifiers and ‘annotatability’

• Redundant text for searchability

! Need for explicit representation of Sequence?

! Search on text of ConstrainedTarget/Body?
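The step from word order to explicit text and offsets can be sketched as follows. This is only an illustration under stated assumptions: words arrive already sorted in reading order, they are joined with a single space, and `build_line_text` is a hypothetical name, not part of the CATCHPlus converter.

```python
def build_line_text(words, separator=" "):
    """Join ordered words into one line of text and record where each lands.

    Returns the line text plus one {"word", "offset", "range"} entry per word,
    with zero-based character offsets into the returned text.
    """
    segments = []
    cursor = 0
    for index, word in enumerate(words):
        if index > 0:
            cursor += len(separator)  # account for the separator before this word
        segments.append({"word": word, "offset": cursor, "range": len(word)})
        cursor += len(word)
    return separator.join(words), segments


line_text, segments = build_line_text(["Dit", "is", "een", "beschrijving"])
```

Each entry carries the same offset and range pair that an inline text constraint would need, so the implicit line text and its word segments become explicit in one pass.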

KdK2OAC conclusions

• Bidirectional mapping is possible

• Compatible with SharedCanvas model

• OAC + Canvas links everything together

• Implicit information made explicit

• Supports alternative text segmentations

• OAC representation is extremely verbose

! For many annotation tasks OA may be overkill

Open Annotation Service (OAS)

• Upload annotation RDF using SRU/Update

• Inlines external text and XML Bodies and authors

• Indexes OA and DC properties

• Assigns resolvable http URIs and resolves those

• Implementation: RDF store in combination with Solr, production-quality software components (Meresco)

• Built-in OAI-PMH data provider and harvester for ‘annotation sets’

• Query: SRU/CQL, SPARQL, OAI-PMH

• Simple management dashboard (authentication and authorization, collection management, harvesting)

• Easy installation and Open Source

! Model does not support Annotation “sets”
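As an illustration of the SRU/CQL query route, the sketch below builds a standard SRU searchRetrieve URL. The endpoint `http://example.org/sru`, the SRU version and the index name `dc.creator` are placeholders chosen for the example; the slides do not give the actual OAS endpoint path or its index names.

```python
from urllib.parse import urlencode


def build_sru_query(base_url: str, cql: str, max_records: int = 10) -> str:
    """Build an SRU searchRetrieve request URL for a CQL query.

    Uses the standard SRU parameters; the server's supported version and
    index names are assumptions here.
    """
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and index name, for illustration only.
url = build_sru_query("http://example.org/sru", 'dc.creator = "mceunen"')
```

The returned URL can be fetched with any HTTP client; the response would be an SRU XML document wrapping the matching annotation records.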

OAS: issues

• Annotation publication

• Searchability: ‘harvest and index’

• Text search on external bodies

• Annotation boundaries

• ‘Bypassing’ oac:constrains

! In RDF, what are the boundaries of an annotation?

Entity Recognition service

[Diagram: pipeline with the components service, frog and converter. Labels: “URL or text”, “OAS resolve”, source_text, FoLiA_document, “URL or ID”, “entity annotations”.]

‘frog’ and FoLiA

• ‘Frog’ tool generates FoLiA XML document with

– Segmentation of text in paragraphs, sentences and words (tokens) – XML hierarchy

– Part of speech, lemma, morphology, chunking, dependency structure and named entities

• Mix of inline and standoff annotation

– ‘Frog’ does not keep track of character offsets

– Explicit ordering: numbering system in ids

• Trained for Dutch

• Widely used for Dutch corpora

• Made available by: ILK @ Tilburg University

FoLiA-2-OAC conversion

• Reconstruct character offsets after tokenization

• Operates on inline text as published by OAS

• Construct and add entity text from tokens + sequence (the+hague != hague+the)

• Two approaches

1. Minimal: extract entity annotations and tokens, and convert to OAC

2. Maximal: full conversion to OAC
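Reconstructing character offsets after tokenization can be sketched as a left-to-right scan of the source text. This is a minimal illustration, not the converter's actual code: it assumes the tokenizer leaves token characters unchanged (no normalization), and the names `align_tokens` and `entity_span` are hypothetical.

```python
def align_tokens(text, tokens):
    """Find each token in text, scanning left to right.

    Returns (token, offset, length) triples with zero-based offsets.
    Raises ValueError if a token cannot be found after the previous one.
    """
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        spans.append((token, start, len(token)))
        cursor = start + len(token)
    return spans


def entity_span(spans, first, last):
    """Offset and range covering tokens first..last (inclusive), in order."""
    start = spans[first][1]
    end = spans[last][1] + spans[last][2]
    return start, end - start


text = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
tokens = ["Dit", "is", "een", "beschrijving", "van", "Den", "Haag", ".",
          "En", "dit", "is", "een", "tweede", "zin", "."]
spans = align_tokens(text, tokens)
offset, length = entity_span(spans, 5, 6)
entity_text = text[offset:offset + length]
```

Taking the entity text from the source span, rather than re-joining tokens, keeps the original token sequence and spacing intact, which is the point of the the+hague versus hague+the remark.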

Linguistic Annotation

! Mix-in domain semantics as subtypes/subproperties?

! Maximal OA mapping or embed linguistic standards?

! Layers, hierarchies (syntax) and Documents

! Sequence (e.g. entities, morpheme breakup)

Synchronized viewing client demo

• Demo/screenshot

Summary of OA issues

! Annotation of annotations?

! Annotation of segments of inline text?

! Need for explicit representation of Sequence?

! Search on ConstrainedTarget/Body?

! For many annotation tasks OA may be overkill

! Model does not support Annotation sets

! In RDF, what are the boundaries of an annotation?

Future work

• Finalize and integrate software (with web services)

• Upgrade to new OA spec (incl OAS)

• Line strip detection web service

• Possible applications

– AV annotation in CATCHPlus

– Nederlab

Questions?
