NLP Interchange Format


Post on 24-Feb-2016


Page 1: NLP Interchange Format

www.sti-innsbruck.at © Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at

NLP Interchange Format

José M. García

Page 2: NLP Interchange Format


Outline

• What is NIF?
• Design requirements
• URI schemes
• NIF ontologies
• Use cases
• Relationship with ELRA
• Roadmap for NIF 2.0
• Conclusions

Page 3: NLP Interchange Format


What is NIF?

• Natural Language Processing Interchange Format

• NIF is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

• Building blocks
– URI scheme for identifying elements in texts
– Ontology for describing common NLP terms

• Created and maintained by the AKSW group at the University of Leipzig as part of the LOD2 EU project.

• Community project: http://persistence.uni-leipzig.org/nlp2rdf/

Page 4: NLP Interchange Format


NIF design requirements

• Compatibility with RDF
• Coverage
• Structural Interoperability
• Conceptual Interoperability
• Granularity
• Provenance and Confidence
• Simplicity
• Scalability

Page 5: NLP Interchange Format


URI schemes

• Text needs to be referenceable by URIs

• With URI references, text can be used as a resource in RDF statements

• NIF distinguishes:
– Documents
– Text of the document
– Substrings of the text

• A URI scheme is an algorithm to create IDs for text and substrings

• URI elements
– Document URI
– Separator
– Character indices
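The three URI elements listed above can be combined mechanically. The following is a minimal sketch, not NIF reference code: the function name nif_uri, the default separator "#char=" and the example.org document URI are assumptions chosen for illustration.

```python
def nif_uri(document_uri: str, begin: int, end: int, separator: str = "#char=") -> str:
    """Build an identifier from document URI + separator + character indices.

    `separator` defaults to an RFC 5147 style "#char=" fragment (an assumption
    for this sketch); other URI schemes would plug in a different separator.
    """
    if begin < 0 or end < begin:
        raise ValueError("indices must satisfy 0 <= begin <= end")
    return f"{document_uri}{separator}{begin},{end}"


# Hypothetical document and text, used only for illustration.
document = "http://example.org/doc.txt"
text = "Linked Data is about using the Web."

whole_text_uri = nif_uri(document, 0, len(text))  # the text of the document
substring_uri = nif_uri(document, 0, 11)          # the substring "Linked Data"

print(whole_text_uri)
print(substring_uri)
```

Because the identifier is derived only from the document URI and the offsets, two tools that process the same text independently arrive at the same URI for the same substring.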

Page 6: NLP Interchange Format


RFC 5147

• The canonical URI scheme for NIF is based on RFC 5147

• RFC 5147 standardizes fragment identifiers for the text/plain media type

http://www.w3.org/DesignIssues/LinkedData.html


http://www.w3.org/DesignIssues/LinkedData.html#char=0,26610


http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218
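To make the "#char=begin,end" form concrete, here is a small sketch that splits such a URI back into document URI and character range and applies the range to a string. It handles only the two-number range form shown on the slide; RFC 5147 also defines single positions, line-based fragments and integrity checks, which this sketch ignores. The helper names and the sample sentence are assumptions for illustration.

```python
import re
from urllib.parse import urldefrag

# Only the "char=<begin>,<end>" form is handled in this sketch.
_CHAR_RANGE = re.compile(r"^char=(\d+),(\d+)$")


def parse_char_fragment(uri: str) -> tuple[str, int, int]:
    """Split a URI into (document URI, begin index, end index)."""
    base, fragment = urldefrag(uri)
    match = _CHAR_RANGE.match(fragment)
    if match is None:
        raise ValueError(f"not a char=begin,end fragment: {fragment!r}")
    begin, end = int(match.group(1)), int(match.group(2))
    if end < begin:
        raise ValueError("end index precedes begin index")
    return base, begin, end


def select(text: str, begin: int, end: int) -> str:
    """Return the characters between positions begin and end.

    Positions count the gaps between characters starting at 0, which is
    the same convention Python slicing uses.
    """
    return text[begin:end]


base, begin, end = parse_char_fragment(
    "http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218"
)
print(base, begin, end, end - begin)

# Hypothetical sample text; the real document body is not reproduced here.
sample = "Linked Data is about using the Web."
print(select(sample, 0, 11))
```

In the last URL above, the range 1206,1218 therefore addresses a 12-character substring of the document text, while 0,26610 addresses the text as a whole.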

Page 9: NLP Interchange Format


NIF Core Ontology

• Classes and properties to describe the relations between
– Documents
– Text
– Substrings
– Corresponding URI schemes
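As an illustration of how such classes and properties tie a substring to its text, the sketch below emits plain (subject, predicate, object) tuples. The slide does not list term names; nif:Context, nif:String, nif:isString, nif:referenceContext, nif:beginIndex, nif:endIndex and nif:anchorOf are taken from the published NIF 2.0 Core ontology and should be treated as assumptions to check against the ontology version in use. The function name and the example.org document are hypothetical.

```python
# Namespace and term names assumed from the published NIF 2.0 Core ontology.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_substring(document_uri: str, text: str, begin: int, end: int) -> list[tuple]:
    """Return triples linking a substring to the text it was taken from.

    Objects are either URIs (str) or literal values (str/int); a real RDF
    library would keep the two apart, this sketch does not.
    """
    context = f"{document_uri}#char=0,{len(text)}"
    substring = f"{document_uri}#char={begin},{end}"
    return [
        (context, RDF_TYPE, NIF + "Context"),
        (context, NIF + "isString", text),
        (substring, RDF_TYPE, NIF + "String"),
        (substring, NIF + "referenceContext", context),
        (substring, NIF + "beginIndex", begin),
        (substring, NIF + "endIndex", end),
        (substring, NIF + "anchorOf", text[begin:end]),
    ]


# Hypothetical document and text, for illustration only.
sample_text = "Linked Data is about using the Web."
triples = describe_substring("http://example.org/doc.txt", sample_text, 0, 11)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```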

Page 10: NLP Interchange Format


NIF Core Ontology

• Additional classes and properties (unstable/testing)

– More URI schemes

– Text structure (words, sentences, paragraphs…)

– Part of Speech (POS)

– Annotations with Stanbol

– Confidence

Page 11: NLP Interchange Format


Workflows, Modularity and Extensibility of NIF

• Workflows for NLP integration
– Normalization
– Tokenization
– Merge RDF annotations
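The merge step can be pictured as a set union: when two tools identify the same substring with the same URI, their statements about it simply accumulate on one subject. The sketch below shows this with plain tuples; the two tool outputs and the example.org vocabulary (posTag, entityType) are invented for illustration and are not NIF terms.

```python
def merge_annotations(*outputs):
    """Union the triples produced by several tools; duplicates collapse."""
    merged = set()
    for triples in outputs:
        merged.update(triples)
    return merged


def group_by_subject(triples):
    """Collect (predicate, object) pairs per subject URI, in sorted order."""
    grouped = {}
    for subject, predicate, obj in sorted(triples):
        grouped.setdefault(subject, []).append((predicate, obj))
    return grouped


# Both hypothetical tools refer to the same substring via the same URI.
token = "http://example.org/doc.txt#char=0,11"
tagger_output = [(token, "http://example.org/vocab#posTag", "NNP")]
entity_output = [(token, "http://example.org/vocab#entityType", "Concept")]

merged = merge_annotations(tagger_output, entity_output)
for subject, statements in group_by_subject(merged).items():
    print(subject, statements)
```

The design point is that no alignment step is needed: agreement on the URI scheme is what makes the outputs of independent tools directly combinable.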

Page 12: NLP Interchange Format


Workflows, Modularity and Extensibility of NIF

• NIF ontology logical modules
– Terminological model
– Inference model
– Validation model

• Vocabulary modules
– FISE
– ITS
– OLiA
– NERD
– …

Page 13: NLP Interchange Format


Workflows, Modularity and Extensibility of NIF

• Granularity profiles

Page 14: NLP Interchange Format


ITS Use Case

• The Internationalization Tag Set 2.0 is a W3C working draft that is becoming a Recommendation.

• ITS standardizes HTML and XML attributes which can be used to annotate nodes with processing information for language service providers (i18n, l10n)

• ITS 2.0 RDF ontology was developed using NIF, including a round-trip conversion algorithm from ITS to NIF.

• NIF is expected to receive wide adoption by translation & language service providers

• ITS 2.0 RDF ontology provides properties which can be used to provide best practices for NLP annotations.

Page 15: NLP Interchange Format


OLiA Use Case

• The Ontologies of Linguistic Annotation provide stable identifiers for morpho-syntactical annotation tag sets, so that NLP tools can use these IDs for better interoperability.

• OLiA provides Annotation Models and a Reference Model, comprising more than 110 OWL ontologies for over 34 tag sets in 69 languages.

• Features
– Documentation
– Flexible Granularity
– Language Independence

• NIF provides two properties
– nif:oliaIndividual (links a nif:String to an OLiA Annotation Model)
– nif:oliaCategory (links to the Reference Model)

Page 16: NLP Interchange Format


RDFaCE Use Case

• RDFa Content Editor is a rich text editor that supports WYSIWYM authoring including various views of the semantically enriched textual content.

• It combines results of different NLP APIs for automatic content annotation
– Heterogeneous API access, URI generation and output data structure
– Solution: server-side proxy, hard-coded input and connection of each API.

• NIF simplified the integration, adding an interoperability layer

Page 17: NLP Interchange Format


What is ELRA?

• European Language Resources Association

• http://www.elra.info

• Effort to make Language Resources (LR) available for language engineering and to evaluate language engineering technologies.

• LR marketplace

• Related organizations
– ELDA (ELRA’s operational body)
– LREC conferences

Page 18: NLP Interchange Format


What is ELRA?

Page 19: NLP Interchange Format


Relationship with NIF

• Different objectives

• Written LRs (especially corpora) can be annotated with NIF for further interoperability and integration with NLP tools

• ADVANTAGE: Large test data collection to evaluate NLP tools

• DISADVANTAGE: Cost of LR (though there are free ones)

Page 20: NLP Interchange Format


Roadmap for NIF 2.0

• Release of NIF 1.0
– DONE (Nov 2009)

• Release of NIF 2.0 Draft
– CURRENT effort on solving pending issues
– Adoption in ITS 2.0 W3C (soon-to-be) Recommendation
– NIF-Core ontology is becoming stable
– RLOG - an RDF Logging Ontology
– NIF Validator software available

• Release of NIF 2.0 Core

• Release of NIF 2.0 Extensions
– ITS ontology, PROV ontology, Lemon Ontology, NERD, UIMA, MARL opinion ontology…

Page 21: NLP Interchange Format


Conclusions

• NIF allows NLP tools to be integrated using Linked Data

• Ongoing effort

• Many adopters and supporters
– LOD2 EU project
– Several W3C working groups
– Named Entity Recognition and Disambiguation (NERD)
– Ontologies of Linguistic Annotation (OLiA)
– …

• 27 different implementations and use cases
– Some available at http://persistence.uni-leipzig.org/nlp2rdf/

Page 22: NLP Interchange Format

www.sti-innsbruck.at © Copyright 2012 STI INNSBRUCK www.sti-innsbruck.at

Thanks for your attention

Questions?


Page 23: NLP Interchange Format


References

1. http://persistence.uni-leipzig.org/nlp2rdf/

2. Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. Integrating NLP using Linked Data. In: 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia.