Semantic annotation for knowledge management: Requirements and a survey of the state of the art

Web Semantics: Science, Services and Agents on the World Wide Web xxx (2005) xxx–xxx

Victoria Uren a,∗, Philipp Cimiano b, José Iria c, Siegfried Handschuh d, Maria Vargas-Vera a, Enrico Motta a, Fabio Ciravegna c

a Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes MK7 6AA, UK
b Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany
c Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK
d FZI Forschungszentrum Informatik, Haid-und-Neu-Strasse 10-14, 76131 Karlsruhe, Germany

Received 24 January 2005; accepted 18 October 2005

Abstract

While much of a company’s knowledge can be found in text repositories, current content management systems have limited capabilities for structuring and interpreting documents. In the emerging Semantic Web, search, interpretation and aggregation can be addressed by ontology-based semantic mark-up. In this paper, we examine semantic annotation, identify a number of requirements, and review the current generation of semantic annotation systems. This analysis shows that, while there is still some way to go before semantic annotation tools will be able to address fully all the knowledge management needs, research in the area is active and making good progress.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Semantic annotation; Knowledge management; Automation

1. Introduction

Does Semantic Web technology matter for knowledge management (KM)? We believe that it does, because KM often centers on documents and the business processes that build on them. Documents provide a rich resource describing what an organization knows and account for 80–85% of the information stored by many companies. Indeed, for some professions documents are effectively the product they sell. Examples of these “product” documents include contracts, consultancy reports and consumer surveys. KM systems for handling this unstructured material are a large and growing sector of the software industry. IDC expects that content management and retrieval software spending will outpace the overall software market by 2007. They estimate the market at $6.46 billion in 2004 and $9.72 billion by 2006 [1].

∗ Corresponding author. Tel.: +44 1908 858516; fax: +44 1908 653169.
E-mail addresses: [email protected] (V. Uren), [email protected] (P. Cimiano), [email protected] (J. Iria), [email protected] (S. Handschuh), [email protected] (M. Vargas-Vera), [email protected] (E. Motta), [email protected] (F. Ciravegna).

The Semantic Web envisages technologies which can make possible the generation of the kind of “intelligent” documents imagined 10 years ago [2]. We define an intelligent document as a document which “knows about” its own content in order that automated processes can “know what to do” with it. Knowledge about documents has traditionally been managed through the use of metadata, which can concern the world around the document, e.g. the author, and often at least part of the content, e.g. keywords. The Semantic Web proposes annotating document content using semantic information from domain ontologies [3]. The result is Web pages with machine interpretable mark-up that provide the source material with which agents and Semantic Web services operate. The goal is to create annotations with well-defined semantics, however those semantics may be defined. The Semantic Web relies on a model-theoretic definition of meaning, but other types of semantics could be thought of [4]. In any case, for the sake of interoperability, a well-defined semantics is a must to ensure that annotator and annotation consumer actually share meaning. A key contribution of the Semantic Web is therefore to provide a set of worldwide standards. These open the possibility of operating with heterogeneous resources by providing a bridge of common syntax, methods, etc.

1570-8268/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.websem.2005.10.002 WEBSEM-71; No. of Pages 15



Fig. 1. Example of a document with semantic annotation from KMi’s semantic website.

Semantic Web annotations go beyond familiar textual annotations about the content of the documents, such as “clause seven of this contract has been deleted because. . .” or “the test results need to go in here”. This kind of informal annotation is common in word processor applications and is intended primarily for use by document creators. Semantic annotation formally identifies concepts and relations between concepts in documents, and is intended primarily for use by machines. For example, a semantic annotation might relate “Paris” in a text to an ontology which both identifies it as the abstract concept “City” and links it to the instance “France” of the abstract concept “Country”, thus removing any ambiguity about which “Paris” it refers to.
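The disambiguation described above can be sketched as subject–predicate–object triples. This is only an illustration: all URIs, predicate names and the tiny in-memory triple list below are invented, standing in for a real ontology language and RDF store.

```python
# Illustrative sketch of a semantic annotation as RDF-style triples.
# All URIs and predicate names are hypothetical; a real system would use
# OWL/RDF and a triple store rather than plain Python tuples.

ONT = "http://example.org/ontology#"   # hypothetical domain ontology
DOC = "http://example.org/doc42#"      # hypothetical annotated document

triples = [
    (DOC + "mention1", "refersTo", ONT + "Paris"),    # text span -> ontology instance
    (ONT + "Paris", "instanceOf", ONT + "City"),      # concept assignment
    (ONT + "Paris", "locatedIn", ONT + "France"),     # relation to another instance
    (ONT + "France", "instanceOf", ONT + "Country"),
]

def objects(subject, predicate):
    """Return all objects of triples matching (subject, predicate, _)."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# The mention of "Paris" is now unambiguous: it is the City located in France.
print(objects(ONT + "Paris", "locatedIn"))  # ['http://example.org/ontology#France']
```

Because the mention points at an ontology instance rather than a bare string, any consumer that follows the `locatedIn` link can tell this "Paris" from, say, Paris, Texas.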

Semantic Web annotation brings benefits of two kinds over these systems: enhanced information retrieval and improved interoperability. Information retrieval is improved by the ability to perform searches which exploit the ontology to make inferences about data from heterogeneous resources [5]. For example, consider the semantic mark-up shown in Fig. 1, which is taken from the Semantic Web site of the Knowledge Media Institute.¹

¹ KMi, The Open University, Semantic website, http://semanticweb.kmi.open.ac.uk/.

The semantic annotations in the example identify people, organizations, and projects, which are mentioned in a web news story, as well as including traditional metadata, such as the author’s name and date of publication. Since these statements are integrated with a large departmental ontology, we can then support queries like “give me all the stories which talk about projects on the Semantic Web”. A query agent will exploit the semantic annotations to map stories to projects and then use information from the departmental project database to identify only projects related to the Semantic Web area. Ontology-based semantic annotations also allow us to resolve anomalies in searches, e.g. if a document collection were annotated using a geographical ontology, it would become easy to distinguish “Niger” the country from “Niger” the river in searches, because they would be annotated with references to different concepts in the ontology. Interoperability is particularly important for organizations which have large legacy databases, often in different proprietary formats that do not easily interact. In these circumstances, annotations based on a common ontology can provide a common framework for the integration of information from heterogeneous sources.
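The query scenario above, joining story annotations against a departmental project database, can be sketched as follows. The story identifiers, predicate names and database fields are all invented for illustration.

```python
# Sketch of the query "give me all the stories which talk about projects
# on the Semantic Web". Annotations and the departmental project database
# are mocked up with invented names.

story_annotations = [
    ("story1", "mentionsProject", "AKT"),
    ("story1", "author", "V. Uren"),
    ("story2", "mentionsProject", "MIAKT"),
    ("story3", "mentionsProject", "AKT"),
]

project_db = {
    "AKT": {"area": "Semantic Web"},
    "MIAKT": {"area": "Medical Informatics"},
}

def stories_about(area):
    """Map stories to projects via annotations, then filter by project area."""
    return sorted({
        story
        for story, pred, proj in story_annotations
        if pred == "mentionsProject"
        and project_db.get(proj, {}).get("area") == area
    })

print(stories_about("Semantic Web"))  # ['story1', 'story3']
```

The point of the sketch is the join: neither the stories nor the database alone can answer the query; the annotations supply the mapping between them.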

Fig. 2. The role of annotation in document centric KM. Annotations provide interoperability between different kinds of documents and support enhanced search services. Annotation tools draw on knowledge workers’ domain knowledge and automatic analysis. Ontologies evolve to fit changing needs.

As a motivating example of what can be achieved once documents are given semantic mark-up, consider the Medical Imaging and Advanced Knowledge Technologies (MIAKT) project.² MIAKT has developed problem solving environments for use in the medical domain. Specifically, it is tackling triple assessment in symptomatic focal breast disease, which involves the interpretation of three different kinds of scan data by groups of medical professionals. In MIAKT the annotations make the knowledge contained in unstructured sources (medical images such as X-rays) available in a structured form, allowing both accurate and focused retrieval and knowledge sharing for a given patient’s case. Moreover, the annotations can be used to provide automated services. For example, they can be processed using natural language generation software to automatically draft textual reports about the patient, the diagnostic information that is available and assessments made about the data by the medical team, a task which usually consumes doctors’ valuable time [6].
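The report-drafting idea can be sketched as a template filled from structured annotations. This is not MIAKT's actual pipeline: the record fields and the template are invented, and real natural language generation is far more sophisticated than string formatting.

```python
# Minimal sketch of drafting a textual report from structured annotations,
# in the spirit of the MIAKT example. Field names and the template are
# invented; real NLG systems go well beyond template filling.

case_annotations = {
    "case": "Case 017",
    "modalities": ["X-ray", "ultrasound", "MRI"],
    "assessment": "benign lesion, follow-up in 6 months",
}

TEMPLATE = (
    "{case}: triple assessment performed using {mods}. "
    "Team assessment: {assessment}."
)

def draft_report(ann):
    """Fill the report template from a case's annotations."""
    return TEMPLATE.format(
        case=ann["case"],
        mods=", ".join(ann["modalities"]),
        assessment=ann["assessment"],
    )

print(draft_report(case_annotations))
```

Even this crude version shows why the annotations must be structured: the generator consumes fields, not free text.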

Furthermore, this example is by no means an isolated case. The use of knowledge embodied in annotations is being investigated in domains as diverse as scientific knowledge [7], radio and television news [8], genomics [9], making web pages accessible to visually impaired people [10], employment information [11], online shopping [12] and the description of cultural artifacts in museums [13].

An intelligent, document centric KM process of the type we propose must handle three classes of data: ontologies, documents and annotations. As illustrated in Fig. 2, these need to be supported by new kinds of KM tools. Semantic search tools are needed to connect and exploit the information in annotations and documents. Ontology maintenance tools must support users in maintaining and evolving knowledge models to meet changing needs. Finally, tools are needed to facilitate the annotation of documents, which can detect changes in an ontology related to existing annotations. Annotation tools will, in their turn, need to give feedback to the ontology maintenance process when necessary. Strong coupling is needed between these systems to cope with the re-versioning and reuse of documents, the evolution of the ontologies used to describe them and a range of different users who may require different views on the data or have different access rights.

Annotation is, potentially, an additional burden in this model of KM. Human annotators are prone to error, and non-trivial annotations usually require domain expertise, diverting technical staff from other tasks. Also, without maintenance, annotations can easily become obsolete. Therefore, unless annotation can be done cost-effectively, the commercial future for the technology is limited. In this paper, we review the systems that currently exist to support the mark-up of documents and determine how well they fit the requirements of KM. Taking the document centric perspective described above, we have identified seven requirements for semantic annotation systems, which we use to assess the capabilities of existing annotation systems. We would like to emphasize that we are exclusively concerned with technical requirements and not with ‘soft’ requirements related to annotation. Social and psychological aspects are crucial for motivating people to annotate information. However, a discussion of these aspects is, though important, out of the scope of the paper as well as our competencies. For a detailed discussion of soft aspects related to the use of annotations for knowledge management, the reader is referred to Ref. [14].

² Medical Imaging and Advanced Knowledge Technologies (MIAKT) project, http://www.aktors.org/miakt/, accessed on 22 July 2004.

Our pool of ‘systems’ includes two Semantic Web annotation frameworks, which could be implemented differently by different tools, as well as the current generation of manual and automatic tools for semantic mark-up. In a fast developing field, such as this one, it is impossible to complete a comprehensive survey as new tools and new versions of existing tools are emerging constantly. We have tried to survey the general-purpose annotation tools that have some aspect of automation as completely as possible but have had to be more selective with examples of manual tools. We conclude that, while there is still some way to go before semantic annotation tools will be able to address fully all the knowledge management needs identified here, research in the area is very active and constant progress is being made. Semantic annotation tools suitable for large-scale knowledge management can be expected sooner rather than later.

2. Requirements

The document centric model of KM illustrated in Fig. 2 has led us to formulate seven requirements for semantic annotation systems. These overlap to some extent with the requirements set out by Handschuh et al. [15], but there are also differences. For example, we do not concern ourselves with issues such as efficiency and proper reference, although we acknowledge that these are important. Instead, we have considered four viewpoints on the task: the ontologies, the documents, the annotations that link ontologies to documents, and the users of the systems. Each viewpoint suggests one or more requirements, each of which normally brings together several associated needs. For instance, the ontology viewpoint suggests the need for tools to support multiple, evolving ontologies and the document viewpoint suggests the need to support the reuse and versioning of documents.

2.1. Requirement 1—standard formats

Using standard formats is preferred, wherever possible, because the investment in marking up resources is considerable and standardization builds in future proofing, because new tools, services, etc., which were not envisaged when the original semantic annotation was performed, may be developed. Compliance with standards also frees companies from the constraints of proprietary formats when choosing knowledge management software. These advantages can be applied to systems in general. For annotation systems in particular, standards can provide a bridging mechanism that allows heterogeneous resources to be accessed simultaneously and collaborating users and organizations to share annotations. It is the activity of the W3C in developing and promoting international standards for the Semantic Web that has convinced us that this route is worth following in knowledge management. Two types of standards are required: standards for describing ontologies, such as the Web Ontology Language OWL [16], and standards for annotations, such as the W3C’s RDF annotation schema [17].

2.2. Requirement 2—user centered/collaborative design

Annotation can potentially become a bottleneck if it is done by knowledge workers with many demands on their time. Since few organizations have the capacity to employ professional annotators, it is crucial to provide knowledge workers with easy-to-use interfaces that simplify the annotation process and place it in the context of their everyday work. A good approach would be a single point of entry interface, so that the environment in which users annotate documents is integrated with the one in which they create, read, share and edit them. System design also needs to facilitate collaboration between users, which is a key facet of knowledge work, with experts from different fields contributing to and reusing intelligent documents. We have already identified standard formats as a prerequisite for sharing annotations. Other issues for collaboration include implementing systems to control what to share with whom. For example, in a medical context, physicians might share all information about patients among themselves but only share anonymized information with planners. This brings us to issues related to trust, provenance and access rights. An intranet provides a more controlled environment for tracing the provenance of annotations than the wild Web, but access policies are a critical issue to organizations, which are invariably concerned with confidentiality issues for client and staff data.
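The medical sharing rule above (physicians see full patient annotations, planners see only anonymized ones) could be realized as a simple role-based view filter. The field names and anonymization rule below are invented for illustration; a real system would enforce policies at the annotation store.

```python
# Sketch of role-based views over annotations: full access for physicians,
# anonymized access for planners, default deny otherwise. Field names and
# the anonymization rule are invented for illustration.

ANNOTATIONS = [
    {"patient": "J. Smith", "finding": "benign lesion", "ward": "Ward A"},
    {"patient": "M. Jones", "finding": "requires biopsy", "ward": "Ward B"},
]

def view_for(role):
    """Return the annotations a given role is allowed to see."""
    if role == "physician":
        return ANNOTATIONS                      # full access
    if role == "planner":
        return [                                # strip identifying fields
            {k: v for k, v in ann.items() if k != "patient"}
            for ann in ANNOTATIONS
        ]
    return []                                   # default deny

print(len(view_for("planner")))  # 2
```

The design point is that the same annotation store serves both roles; only the view differs.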

2.3. Requirement 3—ontology support (multiple ontologies and evolution)

In addition to supporting appropriate ontology formats, annotation tools need to be able to support multiple ontologies. For example, in a medical context, there may be one ontology for general metadata about a patient and other technical ontologies that deal with diagnosis and treatment. Either the ontologies must be merged or annotations must explicitly declare which ontology they refer to. In addition, systems will have to cope with changes made to ontologies over time, such as incorporating new classes or modifying existing ones. In this case, the problem is ensuring consistency between ontologies and annotations with respect to ontology changes. Multiple ontologies and evolving ontologies have been discussed elsewhere in the context of KM, e.g. [18]. Some of the important issues for the design of an annotation environment are to determine how changes should be reflected in the knowledge base of annotated documents and whether changes to ontologies create conflicts with existing annotations. There are also design implications for ontology support as knowledge workers may require facilities to help them explore and edit the ontologies they are using.

2.4. Requirement 4—support of heterogeneous document formats

Semantic Web standards for annotation tend to assume that the documents being annotated are in web-native formats such as HTML and XML. For example, the Annotea approach to locating an annotation at a particular point in a document uses XPointers. This approach will have limited usefulness for KM.

Page 5: Semantic annotation for knowledge management: Requirements ...staff · structuring and interpreting documents. In the emerging Semantic Web, search, interpretation and aggregation

V. Uren et al. / Web Semantics: Science, Services and Agents on the World Wide Web xxx (2005) xxx–xxx 5

Documents will be in many different formats including word processor files, spreadsheets, graphics files and complex mixtures of different formats. This presents a technical challenge rather than a research challenge, but dealing with multiple document formats is a prerequisite for integrating annotation into existing work practices.
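One common way to meet this prerequisite is a format-dispatch step that normalizes every document to plain text before annotation. The sketch below uses stub extractors and invented file names; real extractors would wrap format-specific libraries.

```python
# Sketch of format dispatch for an annotation pipeline: route each document
# to a format-specific text extractor. The extractors here are toy stubs;
# real ones would wrap libraries for word processor files, spreadsheets, etc.

def extract_html(data):
    # Toy extractor: strip paragraph tags only.
    return data.replace("<p>", "").replace("</p>", "")

def extract_txt(data):
    return data  # plain text passes through unchanged

EXTRACTORS = {".html": extract_html, ".txt": extract_txt}

def to_text(filename, data):
    """Choose an extractor by file extension; fail loudly on unknown formats."""
    ext = filename[filename.rfind("."):].lower()
    try:
        return EXTRACTORS[ext](data)
    except KeyError:
        raise ValueError(f"unsupported document format: {ext}")

print(to_text("report.html", "<p>Quarterly results</p>"))  # Quarterly results
```

Keeping the extractor table separate from the annotation logic means new formats can be added without touching the annotator itself.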

2.5. Requirement 5—document evolution (document and annotation consistency)

Ontologies change sometimes, but some documents change many times. An example is W3C’s specification documents, which go through multiple revisions. Requirement 3 concerns the problem of keeping ontologies and annotations consistent. This requirement concerns consistency from a textual point of view, i.e. maintaining correct pointers from the annotations to the surface representation in the text. What should happen to the annotations on a document when it is revised poses both technical and application specific questions. If the anchor for an annotation in a shared document is removed during editing, should the current document author be informed, so that they can re-anchor or delete the annotation, or is the original author of the annotation the only person with the right to do this? Is it even desirable, in general, to transfer annotations to a new version of a document, or do versions of annotations need to be maintained in parallel with document versions? For example, if a contract were prepared for a new client, annotations that referred to a legal ontology could be retained, but annotations which referred to previous clients could be removed. How can this selective transfer of annotations be achieved? Annotation environments need to help knowledge workers maintain appropriate annotations as documents change.

2.6. Requirement 6—annotation storage

The Semantic Web model assumes that annotations will be stored separately from the original document, whereas the “word processor” model assumes that comments are stored as an integral part of the document, which can be viewed or not as the reader prefers. The Semantic Web model, which decouples content and semantics, works particularly well for the Web environment, in which the authors of annotations do not necessarily have any control over the documents they are annotating. In a KM environment, however, many annotators are more familiar with the document-centric, word processor model. They argue that, as they have control of documents, storing annotations as a part of those documents is preferable and helps them to keep annotations consistent with new document versions. We will consider both these storage models for annotations in KM.

2.7. Requirement 7—automation

Another aspect of easing the knowledge acquisition bottleneck is the provision of facilities for automatic mark-up of document collections to facilitate the economical annotation of large document collections. To achieve this, the integration of knowledge extraction technologies into the annotation environment is vital. These can automatically identify entities that are instances of a particular class and relations between the classes. Once again, HCI implications are important so that automated tools can be used effectively by knowledge workers without expertise in, say, natural language processing methods.

3. Annotation frameworks

Having specified the requirements, we now look at general frameworks for annotation, which could be implemented differently by different tools. We discuss two frameworks for annotation in the Semantic Web: the W3C annotation project Annotea [19], and CREAM [20], an annotation framework being developed at the University of Karlsruhe.

Annotea [19,21] is a W3C project which specifies infrastructure for annotation of Web documents, with emphasis on the collaborative use of annotations. The use of open standards is a very important principle for all the work of W3C to promote interoperability and extensibility. The main format for Annotea is RDF, and the kinds of documents that can be annotated are limited to HTML or XML-based documents. This is restrictive for KM, as much commercial data is in other formats. However, it provides in XPointer a method for locating annotations within a document. XPointer is a W3C recommendation for identifying fragments of URI resources. So long as the component of a document to which an XPointer refers is retained, the location of the associated annotation will be robust to changes in the detail of the document, but if large-scale revisions are made, annotations can easily come adrift from their anchor points. The Annotea approach concentrates on a semi-formal style of annotation, in which annotations are free text statements about documents. These statements must have metadata (author, creation time, etc.) and may be typed according to user-defined RDF schemata of arbitrary complexity. In this respect, Annotea is not quite as formal as would be ideal for the creation of intelligent documents. The storage model proposed is a mixed one, with annotations being stored as RDF held either on local machines or on public RDF servers. The Annotea framework has been instantiated in a number of tools including Amaya, Annozilla and Vannotea (see Section 4.1).

The CREAM framework [15] looks at the context in which annotations could be made and used as well as the format of the annotations themselves. It specifies components required by an annotation system including the annotation interface, automatic support for annotators, document management system and annotation inference server. Like Annotea, CREAM subscribes to W3C standard formats, with annotations made in RDF or OWL and XPointers used to locate annotations in text, which restricts it to web-native formats such as XML and HTML. Unlike Annotea, the authors of CREAM have considered the possibility of annotating the deep web. This involves annotating the databases from which deep web pages are generated so that the annotations are generated automatically with the pages. As databases hold much of the legacy data in companies, this is a substantial addition. It is supported by a storage model that allows users to choose whether they want to store annotations separately on a server or embedded in a web page. This assumes more user control of the document and recognizes that users may prefer to store annotations with the source material. The CREAM framework allows for relational metadata, defined as “annotations which contain relationship instances”. Relational metadata is essential for constructing knowledge bases which can be used to provide semantic services. Examples of tools based on the CREAM framework are S-CREAM and M-OntoMat-Annotizer (see Section 4.1).
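The distinction between plain instance annotations and relational metadata in the CREAM sense can be sketched as follows. All identifiers are invented for illustration.

```python
# Sketch contrasting simple instance annotation with "relational metadata",
# i.e. annotations which contain relationship instances. All identifiers
# are invented for illustration.

# Simple metadata: each annotation tags a text span as an instance of a class.
simple_metadata = [
    ("span1", "instanceOf", "Person"),
    ("span2", "instanceOf", "Project"),
]

# Relational metadata: annotations asserting relationships between annotated
# instances. Only these support relational queries over the knowledge base.
relational_metadata = [
    ("person1", "worksOn", "project1"),
    ("person2", "worksOn", "project2"),
]

who_works_on = [s for s, p, o in relational_metadata
                if p == "worksOn" and o == "project1"]
print(who_works_on)  # ['person1']
```

With only the class tags, a system can find "mentions of Persons"; the relationship instances are what let it answer "who works on project1".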

4. Semantic annotation tools

Having examined frameworks for annotation, which could be implemented in different ways, we now turn our attention to specific tools which can produce semantic annotations, i.e. annotations that reference an ontology. These are a first generation of tools which meet some of the requirements outlined above but which need further development to make a fully integrated annotation environment. Table 1 provides a summary and below we describe each system briefly.

4.1. Manual annotation

The most basic annotation tools allow users to manually create annotations. They have a great deal in common with purely textual annotation tools but provide some support for ontologies. There are several such programs which produce Annotea RDF mark-up. For example, the W3C Web browser and editor Amaya [22] can mark-up Web documents in XML or HTML. The user can make annotations in the same tool they use for browsing and for editing text, making Amaya a good example of a single point of access environment. It has facilities for manual annotation of web pages but does not contain any features to support automatic annotation. The Annozilla³ browser aims to make all Amaya annotations readable in the Mozilla browser and to shadow Amaya developments. Teknowledge⁴ produces a similar plug-in for Internet Explorer.

The Mangrove system is another example of manual but user-friendly annotation [23]. The aim of the system was to “entice” users into marking up their HTML by using the data created in a number of semantic services, such as a departmental who’s who and a calendar of events. The annotation tool itself is a straightforward GUI that allows users to associate a selection of tags to text that they highlight. Mangrove has recently been integrated with a semantic email service [24], which supports the initiation of semantic email processes, such as meeting scheduling, via text forms.

Multimedia annotation is the next phase of development for annotation, expanding the range of file types that can be marked up into images, video and audio. Vannotea [25] has been developed by the University of Brisbane for adding metadata to MPEG-2 (video), JPEG2000 (image) and Direct 3D (mesh) files, with the mesh being used to define regions of images. It is of particular interest from the view point of distributed knowledge management because it has been designed to allow input from distributed users. This has, for example, allowed it to be deployed to annotate cultural artifacts in a collaborative annotation exercise involving both museum curators and indigenous groups [13].

Some manual annotation tools have been developed to provide more sophisticated user support and a degree of semi-automatic or automatic annotation facilities. The OntoMat Annotizer is a tool for making annotations which is built on the principles of the CREAM framework. It has a Web browser to display the page which is being annotated and provides some reasonably user friendly functions for manual annotation, such as drag and drop creation of instances and the ability to mark-up pages while they are being created. OntoMat has been extended to include support for semi-automatic annotation. The first of these extensions was S-CREAM [15], which uses an information extraction (IE) system (Amilcare [26]). The user annotates and the system learns how to reproduce the user annotation, to be able to suggest annotations for new documents. OntoMat also incorporates methods for deep annotation [27], i.e. annotation for Web pages that are generated from databases. Other research in the CREAM family focuses on extending annotation to multimedia formats. M-OntoMat-Annotizer [28] supports manual annotation of image and video data by indexers with little multimedia experience by automatic extraction of low level features that describe objects in the content. A commercial version of OntoMat, called OntoAnnotate,⁵ is available from Ontoprise.

The Mindswap lab at the University of Maryland has developed annotation systems for both Simple HTML Ontology Extensions (SHOE) and RDF. SHOE Knowledge Annotator [29] was an early system which allowed users to mark-up HTML pages in SHOE guided by ontologies available locally or via a URL. Users were assisted by being prompted for inputs. Unusually, the SHOE Knowledge Annotator did not have a browser to display Web pages, which could only be viewed as source code. Running SHOE [29] took a step towards automated mark-up by assisting users to build wrappers for Web pages that specify how to extract entities from lists and other pages with regular formats. Mindswap is continuing to develop a range of Semantic Web tools [30]. A recent addition of relevance to this survey is

³ Annozilla annotator (http://annozilla.mozdev.org/index.html, accessed on 3 August 2004).
⁴ Teknowledge Annotation Applications (http://mr.teknowledge.com/DAML, accessed on 3 August 2004).

he RDF annotatorSMORE6 which allows mark-up of imagend emails as well as HTML and text.

A tool with similar characteristics to SMORE is theOpenntology Forge (OOF)[31]. OOF is seen by its creators atational Institute of Informatics, Japan, as an ontology editoupports annotation, taking it a step further towards an integnvironment to handle documents, ontologies and annota

The COHSE Annotator[32] produces annotations that aompatible with Annotea standards, although the annota

5 OntoAnnotate (http://www.ontoprise.de/products/ontoannotateaccessed o0 November 2004).6 SMORE: Semantic Markup, Ontology and RDF Editor (http://www.indswap.org/∼aditkal/editor.shtmlaccessed on 28 July 2004).


Table 1. Comparison of annotation tools for requirements 1–6

| Annotation tool | Standard formats | User-centred design | Ontology support | Document formats | Document evolution | Annotation storage |
| Amaya | RDF(S), XLink, XPointer | Web browser & editor | Annotation server | HTML, XHTML and XML | XPointer | Local or annotation server |
| Mangrove | RDF | Graphical annotation tool | — | HTML, email | — | RDF database (Jena) |
| Vannotea | XML | Collaboration support | — | MPEG-2, JPEG2000, Direct3D | — | Annotation server |
| OntoMat | DAML+OIL, OWL, SQL, XPointer | Drag & drop, create & annotate | OntoBroker annotation inference server | HTML, Deep Web | XPointer, pattern matching | Annotation server, embedded in webpage, separate file |
| M-OntoMat-Annotizer | XML, RDF(S) | Automatic extraction of visual descriptors | DOLCE | MPEG-7 | — | Annotation server |
| SHOE Knowledge Annotator | SHOE | Prompting | Ontology server | HTML | — | Embedded in Web page |
| SMORE | RDF(S) | Web browser & editor | Ontology server and ontology editing | HTML, text, email and images | — | — |
| Open Ontology Forge | RDF(S), XML, XLink, XPointer, Dublin Core | Web browser + drag & drop, create & annotate | Local, editable ontologies | HTML, text, images (SVG) | XPointer | Local RDF or XML file |
| COHSE annotator | DAML+OIL | Plug in for Mozilla & IE | Ontology server | HTML (via DOM) | XPointer | Annotation server, DLS |
| Lixto | — | Wrappers | — | — | — | — |
| MnM | RDF(S), DAML+OIL, OCML | Web browser | Ontology server | HTML, text | Stores annotated page | Embedded in Web page |
| Melita | RDF(S), DAML+OIL | Control of intrusiveness of IE | Local, editable ontologies | HTML, text | Regular expressions | — |
| Parmenides | XML (CAS) | — | Clustering to suggest additions | — | — | — |
| Armadillo | RDF(S) | — | — | HTML | — | RDF triple store |
| KnowItAll | — | — | — | HTML | — | — |
| SmartWeb | RDF, RDF(S), OWL | — | — | — | — | RDF knowledge base |
| PANKOW | — | — | — | HTML | — | CREAM |
| AeroSWARM (AeroDAML) | OWL | Web service | Local ontologies | HTML | — | — |
| SemTag | RDF(S) | — | — | HTML | — | Label bureau (PICS) |
| KIM | RDF(S), OWL | Various plug-in front ends | KIMO | HTML | — | RDF(S) knowledge base |
| Rainbow Project | RDF | WSDL/SOAP | Shared upper level ontology | HTML | — | AmphorA XHTML database, RDF repository (Sesame) |
| h-TechSight | DAML+OIL, RDF | KM Portal | Ontology editor, dynamics metrics | HTML | — | Tagged HTML web server |
| WiCKOffice | Microsoft Smart Documents | Office applications, support for form filling | — | Microsoft Office | — | Annotation server (3store) |
| AktivDoc | HTML, RDF | Integrated editing environment | — | HTML | — | RDF triple store |
| SemanticWord | DAML+OIL | Microsoft Word GUIs | — | Word | Mark-up tied to text regions | — |
| Magpie | HTML | Web browser plug in | OCML | HTML | — | None, real time |
| Thresher | RDF | Web browser (Thresher) | Ontology personalization | HTML | — | None, real time |

The comparison for requirement 7 (automation) is given in Table 2.


Table 2. Comparison of annotation tools for requirement 7 (automation)

| Annotation tool | Automation | Type of analysis for automation | Learning in automation |
| Amaya | No | — | — |
| Mangrove | No | — | — |
| Vannotea | No | — | — |
| OntoMat | Yes | PANKOW, Amilcare | Supervised learning |
| M-OntoMat-Annotizer | Yes | Extraction of spatial descriptors | Genetic algorithm |
| SHOE knowledge annotator | Yes | Running SHOE (wrappers) | No |
| SMORE | Yes | Screen scraper | No |
| Open Ontology Forge | Yes | String matching | No |
| COHSE annotator | Yes | Ontology string matching | No |
| Lixto | Yes | Wrappers | No |
| MnM | Yes | POS tagging, Named Entity Recognition | Supervised learning |
| Melita | Yes | String matching, POS tagging, Named Entity Recognition | Supervised learning |
| Parmenides | Yes | Text mining with constraints | Unsupervised learning |
| Armadillo | Yes | String matching, POS tagging, Named Entity Recognition | Unsupervised learning |
| KnowItAll | Yes | String matching, Hearst patterns | Unsupervised learning |
| SmartWeb | Yes | Shallow linguistic parsing | Unsupervised learning |
| PANKOW | Yes | Hearst patterns | Unsupervised learning |
| AeroSWARM (AeroDAML) | Yes | AeroText | No |
| SemTag | Yes | Seeker, similarity, TBD | Unsupervised learning |
| KIM | Yes | String matching, POS tagging, Named Entity Recognition | No |
| Rainbow project | Yes | Hidden Markov models, bit-map classification | Supervised learning |
| h-TechSight | Yes | Shallow linguistic analysis (POS tagging, Named Entity Recognition) | No |
| WiCKOffice | Yes | Named Entity Recognition | No |
| AktiveDoc | Yes | String matching, POS tagging, Named Entity Recognition | Supervised and unsupervised learning |
| SemanticWord | Yes | AeroDAML | No |
| Magpie | Yes | String-matching, Named Entity Recognition | No |
| Thresher | Yes | Screen scraping, wrappers | Supervised learning |

are conceived as hyperlinks stored using the Distributed Links Service [33]. In this scenario, automatically applied hyperlinks are acceptable but only a word-matching service that highlights ontology terms in the text has been implemented so far. The annotator is provided as a plug-in suitable for use in Mozilla or Internet Explorer, giving the user a choice of working environment. The COHSE architecture has been used to support a number of domain applications, including the generation of semantic annotation for visually impaired users [10] and enriching a Java tutorial site [34].
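A word-matching service of this kind reduces to looking ontology terms up in the text. The sketch below is not COHSE's actual implementation; the lexicon entries, the `highlight` function and the bracketed markers are all invented for illustration:

```python
import re

# Toy ontology lexicon mapping surface terms to concept labels (invented entries).
LEXICON = {"Paris": "City", "Niger": "River"}

def highlight(text, lexicon=LEXICON):
    """Wrap each known ontology term occurring in the text in [term|Concept] markers."""
    for term, concept in lexicon.items():
        text = re.sub(rf"\b{re.escape(term)}\b", f"[{term}|{concept}]", text)
    return text

marked = highlight("Paris lies on the Seine, not the Niger.")
# marked == "[Paris|City] lies on the Seine, not the [Niger|River]."
```

A real service would render the markers as hyperlinks into the ontology rather than plain brackets.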

4.2. Automatic annotation

In this group, we consider both annotation tools that include automation components which provide suggestions for annotations, but still require intervention by knowledge workers, and tools which acquire annotations automatically on a large scale. Some are still limited to usage by specialists while others are suitable for knowledge workers. Automated systems intended to support knowledge workers take into account user interface design issues related to minimizing intrusiveness while maximizing accuracy.

Automation can generally be regarded as falling into three categories. The most basic kind use rules or wrappers written by hand that try to capture known patterns for the annotations. Then there are two kinds of systems that learn how to annotate. Supervised systems learn from sample annotations marked up by the user. A problem with these methods is that picking enough good examples is a non-trivial and error-prone task. In order to tackle this problem unsupervised systems employ a variety of strategies to learn how to annotate without user supervision, but their accuracy is still limited. A summary of the automation aspects of the systems reviewed here is given in Table 2.

Lixto is a web information extraction system which allows wrappers to be defined for converting unstructured resources into structured ones. The tool allows users to create wrappers interactively and visually by selecting relevant pieces of information [35]. It was originally developed at the Technical University of Vienna by Gottlob and colleagues and is now distributed by the spin-off Lixto Software GmbH.7

MnM was designed to mark-up training data for IE tools rather than as an annotation tool per se [36]. This means that it stores marked up documents as tagged versions of the original, rather than the RDF formats used by the Semantic Web community. It has reasonable user support, with an HTML browser to display the document and ontology browser features. A strength of MnM is that it provides open APIs to link to ontology servers and for integrating information extraction tools, making it flexible about the formats and methods it uses.

Melita [37] is a user driven automated semantic annotation tool which makes two main strategies available to the user. On the one hand, it provides an underlying adaptive information

7 Lixto Software GmbH (http://www.lixto.com, accessed on 16th September 2005).


extraction system (Amilcare) that learns how to annotate the documents by generalizing on the user annotations. Annotation is therefore a process that starts by requiring full user annotation at early stages, but ends in having the user merely verify the correctness of suggestions made by the system. On the other hand, it provides facilities for rule writing (based on regular expressions) to allow sophisticated users to define their own rules. In Melita, documents are not selected randomly for annotation, but rather selected automatically based on the expected usefulness, to the IE system, of annotating the document. The Amilcare IE system has been incorporated in K@, a legal KM system with RDF based semantic capabilities produced by Quinary [38].
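The interplay just described, where user annotations feed a learner that then proposes annotations for the user to verify, can be caricatured in a few lines. This is only a sketch with invented names; Melita's real learner, Amilcare, induces generalized extraction rules rather than memorizing surface strings:

```python
# Sketch of a suggest-and-verify annotation loop (invented class, not Melita's API).
class SuggestingAnnotator:
    def __init__(self):
        self.known_phrases = set()

    def user_annotates(self, phrase):
        # Early stage: the user supplies full annotations.
        self.known_phrases.add(phrase)

    def suggest(self, document):
        # Later stage: the system proposes annotations for the user to verify.
        return sorted(p for p in self.known_phrases if p in document)

annotator = SuggestingAnnotator()
annotator.user_annotates("Open University")
suggestions = annotator.suggest("She works at the Open University in Milton Keynes.")
# suggestions == ["Open University"]
```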

CAFETIERE is a rule-based system for generating XML annotations developed as part of the Parmenides project [39], which has been used, for example, to annotate the GENIA biomedical corpus [9]. Text mining techniques supplemented with slot based constraints are used to suggest annotations to analysts [40]. The Parmenides project also experimented with a clustering approach to suggest concepts and relations to extend ontologies [41].

Armadillo is a system for unsupervised creation of knowledge bases from large repositories (e.g. the Web) as well as document annotation [42]. It uses the redundancy of the information in repositories to bootstrap learning from a handful of seed examples selected by the user. Seeds are searched in the repository. Then Adaptive IE is used to generalize over the examples and find new facts. Confirmation by several sources (e.g. documents) is then required to check the quality of the newly acquired data. After confirmation, a new round of learning can be initiated. This bootstrapping process can be repeated until the user is satisfied with the quality of the learned information. Armadillo uses a number of techniques, from keyword based searches, to adaptive IE to information integration.

KnowItAll [43] automates extraction of large knowledge bases of facts from the Web in a similar fashion to Armadillo. The most notable difference is the way the system assesses the plausibility of candidate extractions. This is done using the pointwise mutual information (PMI) measure rather than weighing multiple evidence from domain-specific oracles. The PMI measure is roughly the ratio between the number of search engine hits obtained by querying with the discriminator phrase (e.g. "Liege is a city") by the number of hits obtained by querying with the extracted fact (e.g. "Liege"). Also, KnowItAll does not require any set of initial seeds. Besides the baseline system, the authors have provided three extensions to the system (pattern learning, subclass extraction and list extraction) which are shown to improve overall performance.

The SmartWeb project is also investigating unsupervised approaches for RDF knowledge base population [44]. Their approach resolves the issue of not having pre-existing mark-up to learn from by using class and subclass names from the ontology to construct examples. The context of these examples is then learnt. In this way, instances can be identified which have similar contexts, but which may use different terminology to the ontology. SmartWeb is aimed at broadband mobile access and plans an initial demonstrator for the 2006 football world cup.

Another approach to learning annotations which exploits the sheer size of the Web is Pattern-based Annotation through Knowledge On the Web (PANKOW) [45]. PANKOW uses a range of relatively rare, but informative, syntactic patterns to mark-up candidate phrases in Web pages without having to manually produce an initial set of marked-up Web pages and go through a supervised learning step.

AeroSWARM8 is an automatic tool for annotation using OWL ontologies based on the DAML annotator AeroDAML [46]. This has both a client server version and a Web enabled demonstrator in which the user enters a URI and the system automatically returns a file of annotations on another web page. To view this in context the user would have to save the RDF to an annotation server and view the results in an annotation friendly browser such as Amaya.

SemTag is another example of a tool which focuses only on automatic mark-up [47]. It is based on IBM's text analysis platform Seeker and uses similarity functions to recognize entities which occur in contexts similar to marked up examples. The key problem of large-scale automatic mark-up is identified as ambiguity, e.g. identical strings, such as "Niger" which can refer to different things, a river or a country. A Taxonomy Based Disambiguation (TBD) algorithm is proposed to tackle this problem. SemTag is proposed as a bootstrapping solution to get a semantically tagged collection off the ground. It is intended as a tool for specialists rather than one for knowledge workers.

KIM [48,49] uses information extraction techniques to build a large knowledge base of annotations. The annotations in KIM are metadata in the form of named entities (people, places, etc.) which are defined in the KIMO ontology and identified mainly from reference to extremely large gazetteers. This is restrictive, and it would be a significant research challenge to extend the KIM methodology to domain specific ontologies. However, named entities are a class of metadata with broad usage. For example, in the Rich News application KIM has been used to help annotate television and radio news by exploiting the fact that Web news stories on the same topics are often published in parallel [8]. The KIM platform is well placed to showcase the kinds of retrieval and data analysis services that can be provided over large knowledge bases of annotations. For example, the KIM server is able to use a variety of plug in front ends, including one for Microsoft's Internet Explorer, a Web UI that provides different semantic search services, and a graph viewer for exploring the connections between entities. The development of KIM is set to continue in collaboration with DERI Galway and the GATE research team under the banner of SWAN9, a Semantic Web Annotator.

The Rainbow project, based at the University of Economics, Prague, is taking a web-mining led approach to automating annotation. Rainbow is in fact a family of independent applications which share a common web-service front end and upper level ontology [50]. The applications include text mining from

8 AeroSWARM project page (http://ubot.lockheedmartin.com/ubot/hotdaml/aeroswarm.html, accessed on 2 August 2004).
9 http://www.deri.ie/projects/swan/ (accessed August 2005).


product catalogues as well as more general pattern matching applications such as pornography recognition in bit-map image files. The generated RDF is stored in Sesame databases for semantic retrieval [12].
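Returning to KnowItAll's plausibility test described above: the PMI measure reduces to a ratio of hit counts. A minimal sketch, with made-up counts standing in for real search engine queries (the numbers and function name are ours, not KnowItAll's):

```python
# Hypothetical hit counts standing in for search engine queries (made-up numbers).
hits = {"Liege": 1_000_000, "Liege is a city": 12_000}

def pmi_score(fact, discriminator, hits=hits):
    """Ratio of discriminator-phrase hits to bare-fact hits; higher ratios
    suggest more plausible candidate extractions."""
    return hits[discriminator] / hits[fact]

score = pmi_score("Liege", "Liege is a city")
# score == 0.012
```

In practice such scores are compared across many discriminator phrases and thresholded, rather than read as probabilities.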

A traditional approach to information extraction is used by the h-TechSight Knowledge Management Platform, in which the GATE rule-based IE system is used to feed a semantic portal [51]. This work is of particular interest because the automatically generated annotations are monitored to produce metrics describing the "dynamics" of concepts and instances which can be fed back to end users [11]. It is envisaged that dynamics data will be used to inform the manual evolution of ontologies.
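The "such as" patterns that PANKOW and KnowItAll count as evidence can be approximated with a regular expression. The sketch below is purely illustrative (the pattern, function name and example sentence are ours); real systems issue Web-scale queries and weigh the counts of many patterns:

```python
import re

# A toy "NP such as NP" Hearst pattern, yielding (instance, class) candidates.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def isa_candidates(text):
    """Return (instance, class) pairs suggested by the pattern."""
    return [(inst, cls) for cls, inst in SUCH_AS.findall(text)]

pairs = isa_candidates("He photographed cities such as Paris and rivers such as Niger.")
# pairs == [("Paris", "cities"), ("Niger", "rivers")]
```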

4.3. Integrated annotation environments

We emphasized in the requirements the need for single point of entry systems that incorporate the annotation process with knowledge workers' everyday tasks. In this section, we review systems that are aimed at integrating annotation into standard tools and making annotation simultaneous to writing.

WiCKOffice [52] explores this approach. It demonstrates how writing within a knowledge aware environment has useful support possibilities, such as automatic assistance for form filling using data extracted from knowledge bases.

AktiveDoc [53] enables annotation of documents at three levels: ontology based content annotation, free text statements and on-demand document enrichment. Support is provided during both editing and reading. Semi-automatic annotation of content is provided via Adaptive Information Extraction from text (using Amilcare). As AktiveDoc is designed for knowledge reuse, it is able to monitor editing actions and to provide automatic suggestions about relevant content. Support is not limited to filling forms and other pre-determined structures (as in WiCKOffice), but it is extended to free text as well. This enables timely reuse of existing knowledge when available. Armadillo supports searches of relevant knowledge in large repositories; annotations in the document are used as context for searches. Annotations are saved in a separate database; levels of confidentiality are associated to annotations so to ensure confidentiality of knowledge when necessary.

AeroDAML can provide automation within authoring environments. For example, the SemanticWord annotator [54], which provides GUI based tools to help analysts annotate Microsoft Word documents with DAML ontologies as they write. A commercial annotation system for Microsoft Office applications called OntoOffice10 is available from Ontoprise.

10 OntoOffice tutorial (http://www.ontoprise.de/documents/tutorialontooffice.pdf, accessed on 30 November 2004).

4.4. On-demand annotation

In this section, we describe two systems which are not strictly annotation tools. Instead, they produce annotation-like services on demand for users browsing un-annotated resources. In this way, they fill a niche for resources which it is either impossible to annotate, such as external web pages, documents which change rapidly, or those which might be annotated but with an unsuitable ontology.

Magpie [55], for example, operates from within a web browser and does "real-time" annotation of web resources by highlighting text strings related to an ontology of the user's choice. Appropriate web services can be linked to highlighted strings. While the annotation of documents is automatic, Magpie currently has the disadvantage that subject specific parts of the lexicons of text strings for each ontology have to be produced manually (common named entities such as people's names and organizations can be highlighted with a Named Entity Recognition plug-in called ESpotter). Work on automating lexicon generation is in progress.

The Thresher system is similar to Magpie in that it uses wrappers to generate RDF on the fly as users browse deep web resources [56]. As with Magpie, the user can access semantic services for recognized objects. Writing wrappers is a complex task which Thresher tackles by providing facilities for non-technical users to mark-up examples of a particular class. These are then used to induce wrappers automatically. Because Thresher is part of the Haystack semantic browser [57] users can also personalize the ontologies they use.

5. Automation

Automation is a particularly important requirement because it is needed to ease the knowledge acquisition bottleneck, particularly for annotating large collections of legacy documents. The kinds of support provided for annotating text can be classified into four kinds, wrappers, IE systems incorporating supervised learning, IE systems that use some unsupervised machine learning, and natural language processing systems. Many of the systems we reviewed had one or more of these kinds of automatic support for annotators (see Table 2 for a summary).

The most common form of support in the current generation of tools is wrappers as originally developed by Kushmerick et al. [58], which exploit the structure of Web pages to identify nuggets of information for mark-up. Wrappers and rules are most useful when there are very regular patterns in the documents, such as standard tables of data. They require skill on the part of the user. Ciravegna et al. [37] give as an example of a typical user editable pattern for finding times of events in their Melita system:

\d:\d\d\W+(AM|PM|am|pm)

That this pattern or "regular expression" is intended to extract time expressions would be clear to most programmers and all information extraction specialists (it means a digit followed by symbol ":" then 2 digits, a word and either AM or PM in capital letters or lowercase letters). The average knowledge worker, however, would certainly need support in deciphering the symbols and would probably prefer it to be translated into some form of natural language template that "looks like" the text it represents.
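For programmers, a time pattern of this kind can be exercised directly. This sketch uses Python's `re` module; the variable names and the example sentence are ours:

```python
import re

# A Melita-style time pattern: a digit, ":", two digits, one or more
# non-word characters, then AM/PM in upper or lower case.
TIME = re.compile(r"\d:\d\d\W+(AM|PM|am|pm)")

match = TIME.search("The seminar starts at 4:30 pm in room 12.")
# match.group(0) == "4:30 pm"
```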


The Thresher system, which allows non-technical users to generate wrappers automatically from examples, is a good example of how progress is being made towards a more user-centric approach to wrapper generation.

Supervised IE systems (e.g. Amilcare, used by S-CREAM, MnM and Melita) learn how to recognize the objects that require annotation by learning from a collection of previously annotated documents. This usually requires the mark-up of a considerable collection of documents. The MnM system, for example, was built to investigate how this task could be facilitated for domain experts. Merely marking a number of documents is not sufficient; the items marked need to be good examples of the kinds of contexts in which the items are found. Finding the right mix of exemplar documents is a tougher challenge for non IE experts than the time-consuming work of marking up a sample of documents. Melita addressed this problem by suggesting the best mix of documents for annotation. Unsupervised systems, like Armadillo, are starting to tackle these challenges by exploiting unsupervised learning techniques. PANKOW (used in OntoMat), for example, demonstrates how the distribution of certain patterns on the Web can be used as evidence in order to approximate the formal annotation of entities in Web pages by a principle of 'annotation by maximal (syntactic) evidence'. For example, the number of times the phrase "cities such as Paris" occurs on web pages, would supply one piece of evidence that Paris is a city, which would be considered in the light of counts of other patterns containing "Paris". The successor C-PANKOW extends PANKOW by taking into account the local context of the web page the entity to be annotated appears in [59].

Users of automatic annotation systems need to be aware of their limitations. Broadly speaking these are missing annotations (known technically as low recall) and incorrect annotations (known as low precision), and they trade off against each other. However, for organizations with large collections of legacy data in particular, imperfect annotation may be preferable to no annotation.

Additional issues for IE in KM are discussed by Ciravegna [60]. Cimiano et al. [61] identify an additional problem, relation extraction, that we need to address here. This is critical to the mark-up of ontological information and the creation of intelligent documents. Most IE systems can recognize concept instances and values, but they are not able to establish explicit relations between entities. For this reason, if a document contains more than one instance of a concept, the system will not be able to allocate the correct properties to the correct instance because it is unable to differentiate among them. A typical example is a home page with several names and phone numbers. The IE system would not be able to match phone numbers to persons. The problem of relation detection is under active investigation in the information extraction community, e.g. through the ACE exercises,11 and progress on this issue can be expected in the next few years.

11 ACE exercises (http://www.itl.nist.gov/iad/894.01/tests/ace/, accessed on 30 November 2004).

6. Requirements revisited

In the above survey of annotation tools, we have examined a selection of tools that allow manual mark-up, plus a number of annotators supporting semi-automatic and automatic mark-up. Next, we look again at the seven requirements of annotation tools for KM to see how the tools measure up to them, where progress is being made, and where there are still challenges to be met.

6.1. Requirement 1—standard formats

We identified standardization of the format of annotations as essential to build in future proofing and compatibility of data with the widest possible range of systems. The survey shows that the W3C standards, particularly Annotea, are becoming dominant in this area. Systems like CAFETIERE, which uses its own XML based annotation scheme, are rare. This requirement has been fulfilled, although the standards may need to be augmented to tackle inadequacies in the existing standards (see discussion of requirement 5).

6.2. Requirement 2—user centered/collaborative design

Our ideal semantic annotation system would use a single point of entry approach in which annotation functionality, including access to maintain the underlying ontologies, would be seamlessly integrated with other tools routinely used by knowledge workers to author and read documents. This does not yet exist although there are signs of a trend towards integrated authoring environments, such as WickOffice and AktiveDoc. The most common home environment of the tools we have seen is a Web browser, a natural result of the fact that most of them were designed for the Semantic Web. Even for KM, this has the advantage of being a very familiar technology. The downside is that it both focuses development on native Web formats like HTML and XML and tends to divorce the annotation process from the process of document creation. More attention needs to be paid to build in or plug-in semantic annotation facilities in commonly used packages to encourage knowledge workers to view annotation as part of the authoring process not an afterthought, and also to supporting annotation in collaborative environments, as for example in Vannotea. Most of the tools reviewed in this survey did not address issues of provenance or access rights. Concerning the specification of access policies, standard methods to restrict access to databases or the file system are available. Offering this kind of support for trust, provenance and access policies concerning annotations is an important issue which needs to be addressed to make Semantic Web annotations a viable knowledge management tool.

6.3. Requirement 3—ontology support (multiple ontologies and evolution)

Annotation tools have adapted rapidly to recent changes in ontology standards for the Web, with many of the more recent tools already supporting OWL. However, support for


doing anything more complex than searching and navigating an ontology browser is the exception. Ontology maintenance, which directly affects the maintenance of annotations, is poorly supported, or not supported at all, by the current generation of tools. This perhaps reflects the intended user groups; with the assumption being that knowledge workers will use existing ontologies rather than editing or creating them. However there are signs that annotation systems are giving users more control of ontologies. Melita allows users to split a concept and then view all the instances that have been created for the old concept and reassign them. The COHSE architecture includes a component for maintaining the ontology but this does not appear to be available from the annotator. The Open Ontology Forge supports the creation of new classes from a root class. Much more is still required. A genuinely integrated semantic annotation environment should give the user automatic support for ontology maintenance, for example, using text mining methods to suggest new classes as they emerge in documents and spotting inconsistencies between new and existing annotations. h-TechSight has made a start in this direction by monitoring the dynamics of instances and concepts to assist end-users in manual ontology evolution. Parmenides has gone rather further and experimented with clustering methods to suggest ontology changes. However there is still a long way to go and we believe that ontology maintenance represents a significant research challenge.

6.4. Requirement 4—support of heterogeneous document formats

Satisfying this requirement is a prerequisite for producing integrated annotation environments and our survey suggests that the range of document types that can be handled is expanding, though few individual systems handle many different formats. Most of the annotation tools we looked at supported only HTML and XML. WiCKOffice and OntoOffice provided annotation for word processor files. Mangrove and SMORE provided facilities for handling emails. Open Ontology Forge, SMORE, Vannotea and M-OntoMat-Annotizer provided means to annotate images and image regions. The Rich News application of KIM provided an interesting example of how an area where automated expertise exists (text based IE) can be used to support the semantic annotation of more difficult (audio visual) media. This kind of cross media approach is likely to prove fertile ground for the development of environments that take a more integrated approach to handling heterogeneous formats.

6.5. Requirement 5—document evolution (document and annotation consistency)

We have observed that keeping annotations synchronized with changes to documents is challenging and this is one area in which the current annotation standards are inadequate. The Annotea approach, adopted by many of the tools, stores annotations separately from the document and uses XPointer to locate them in the document. There are strong arguments in favour of separate storage of annotations and documents, some of which we will discuss in requirement 6, but the problem with the XPointer approach is that connections are one way from annotations to documents and, therefore, too easily broken by edits at the document end. An environment in which documents and annotations are stored separately, but closely coordinated, is required. A number of practical fixes have been implemented in OntoMat, including the ability to search for similar documents that have already been annotated, and a proposal to use pattern matching to help relocate annotations in suitable places in the new document. However, these are ways of coping with the problem. For KM applications a coordinated approach is needed to tackle the issues of versioning annotations as documents evolve. These include determining who has permission to edit annotations, at which points in the document life cycle it is appropriate to update the annotations, what automatic interventions are possible to reduce the burden on users, etc. Our survey did not discover any concerted work on these lines.

6.6. Requirement 6—annotation storage options

In the Semantic Web, documents and their annotations are stored separately. This is unavoidable since documents and annotations are likely to be owned by different people or organizations and stored in different places. A variety of approaches to separate storage were seen in the tools we examined. The Annotea approach calls for RDF servers. Web storage technologies that have been used are RDF triplestores (Armadillo and AktiveDoc), Label Bureaus (SemTag) and DLS (COHSE).

In an organizational setting, with greater control of documents, an alternative model is to store annotations directly in the document. This is familiar to knowledge workers from the current text comment facilities for word processors, spreadsheets, etc. We have also seen it used, for example, in SemanticWord and MnM. This approach is appealing, not just because of its familiarity, but because users believe it avoids the problem of keeping annotations and documents consistent (in fact it just places all responsibility for consistency on the human editors of the document). However, separate storage of annotations has advantages for KM. The resulting decoupling of semantics and content facilitates document reuse because it is possible to set up rules which control and automate which kinds of annotations are transferred to new documents and which are not. It allows information from heterogeneous resources to be queried centrally as a knowledge base. It also makes it easy to produce different views of a document for users with different roles in an organization or different access rights, thus facilitating knowledge sharing and collaboration. We therefore argue that separate storage is the better model, even when extra overheads are required to maintain links between a document and its annotations.

6.7. Requirement 7—automation

Automation is vital to ease the knowledge acquisition bottleneck, as discussed above. Many of the systems we examined had some kind of automatic and semi-automatic support for annotation. Most of these handled just text, using mainly wrappers, IE and natural language processing, although there are some systems, notably M-OntoMat-Annotizer and parts of the Rainbow Project, looking to automate the handling of other media. Language technologies present usability challenges when deployed for knowledge workers since most are research tools or designed for use by specialists. A first step in this direction is Melita, where attention has been paid to finding ways to enable a seamless user interaction with the underlying IE system. In addition to the usability challenges there are also research challenges, among which we have highlighted the extraction of relations as important for semantic annotation.
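To make the separate-storage model concrete, the following is a minimal sketch of an Annotea-style annotation held apart from the document it describes. The property names follow the W3C Annotea annotation namespace, but the URLs and the XPointer expression are hypothetical. The `a:context` XPointer is exactly the one-way link, discussed under requirement 5, that breaks when the paragraph it targets is edited or moved:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <a:Annotation>
    <!-- one-way pointers from the annotation to the document -->
    <a:annotates rdf:resource="http://example.org/docs/report.html"/>
    <a:context>http://example.org/docs/report.html#xpointer(/html/body/p[2])</a:context>
    <dc:creator>A. Knowledge-Worker</dc:creator>
    <!-- the annotation body lives on a separate annotation server -->
    <a:body rdf:resource="http://annotations.example.org/body/42"/>
  </a:Annotation>
</rdf:RDF>
```

Because nothing in `report.html` points back at this RDF, an edit that moves or deletes the second paragraph silently invalidates the `a:context` pointer, which is why the coordinated document-annotation environments argued for above are needed.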

7. Summing up

Documents are central to KM, but intelligent documents, created by semantic annotation, would bring the advantages of semantic search and interoperability. These benefits, however, come at the cost of increased authoring effort. We have, therefore, argued that integrated systems are needed which support users in dealing with the documents, the ontologies and the annotations that link documents to ontologies within familiar document authoring environments. These systems need automation to support annotation, automation to support ontology maintenance, and automation to help maintain the consistency of documents, ontologies and annotations. They also need to support collaborative working. In this area we have identified the lack of access controls as a limitation of the Semantic Web approach for knowledge management.

The two Semantic Web frameworks Annotea and CREAM favour different aspects of annotation activity. Annotea, with its emphasis on collaboration, has influenced the development of a number of excellent systems with good user interfaces that are well suited to distributed knowledge sharing. CREAM, with its greater emphasis on the deep Web and the annotation of legacy resources, has pushed the development of annotation systems more aimed towards corporate KM.

Our review of existing annotation systems indicates that research is very active and there are many systems which provide some of the requirements, but that fully integrated environments are still some way off. WiCKOffice and AktiveDoc present exemplars of what integrated authoring environments might look like, but are currently limited with respect to their degree of automation and range of documents covered. Technical challenges to development include supporting multi-media document formats, particularly automating the annotation of media other than text, addressing issues of trust, provenance and access rights, and resolving the problems of storage. Problems around keeping annotations consistent with evolving documents present a considerable challenge, particularly in combination with evolving ontologies, and the field needs to draw on ontology maintenance research to tackle this challenge.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Adaptive Information Extraction for Knowledge Management (Dot.Kom) project, the 6th Framework EU IST project “aceMedia”, the DARPA DAML project “OntoAgents” (01IN901C0) and the SmartWeb project, funded by the German Ministry of Research. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. Dot.Kom is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001034038.


Victoria Uren is a Research Fellow at the Knowledge Media Institute (Open University). Her research concerns the application of natural language technologies for knowledge acquisition and management. She has worked on several projects within KMi including ScholOnto, AKT and Dot.Kom.

Philipp Cimiano is a researcher at the Institute AIFB (University of Karlsruhe). He is currently working in the German BMBF project SmartWeb and co-organizing an ontology learning challenge within the PASCAL framework. He has worked on several projects including the Dot.Kom and Genome Information Extraction (GenIE) projects. His main interests include natural language processing and understanding as well as knowledge acquisition from texts.

Jose Iria is a Research Associate in the Department of Computer Science at the University of Sheffield. He has worked on several projects including Dot.Kom and Abraxas. His research interests include the application of Machine Learning techniques to Information Extraction for automated extraction of entities and relations from text.

Siegfried Handschuh is a researcher at the FZI (Research Center for Information Technology) Karlsruhe. His current research interests include annotations in the Semantic Web and knowledge acquisition. He has chaired several workshops on semantic annotation and is co-editor of the book “Annotation for the Semantic Web”. He is currently working at FZI for the Ontoprise team on the HALO 2 project. Previously he worked at the Institute AIFB (University of Karlsruhe) in the OntoAgents project in the DARPA DAML program.

Maria Vargas-Vera is a researcher at the Knowledge Media Institute (Open University) currently working on the AKT project. She is interested in the use of ontologies in Natural Language Processing (NLP), in particular ontology-based Information Extraction, Question Answering, Automated Assessment of Student Essays and Semi-Automatic Construction of Ontologies from text.

Enrico Motta is Professor in Knowledge Technologies and Director of the Knowledge Media Institute (Open University). His current research focuses primarily on the integration of semantic, web and language technologies to support knowledge acquisition and management and problem solving.

Fabio Ciravegna is Professor of Language and Knowledge Technologies at the University of Sheffield. He coordinates the Web Intelligence Technologies Lab, part of the Natural Language Processing group. His current research focuses primarily on Language Technologies for Knowledge Management and in particular on techniques for acquisition, sharing and reuse of information and knowledge in Semantic Web oriented environments.