Page 1: Linguistic Linked Open Data: What’s in for (Deep) Machine Translation? Christian Chiarcos chiarcos@informatik.uni-frankfurt.de DeepMT, Sep 4 th, 2015,

Linguistic Linked Open Data: What’s in for (Deep) Machine Translation?

Christian Chiarcos
chiarcos@informatik.uni-frankfurt.de

DeepMT, Sep 4th, 2015, Prague

Linked Open Data

Basic Concepts

Linked Open Data (LOD)

Plenty of resources, linked with each other ;)

(LOD cloud diagram, Aug 2014)

Linked Open Data (LOD)

• However, LOD pertains not so much to a resource (or a bundle of resources) as to a philosophy

• Best practices for publishing data on the web
  – Goals
    • reusability
    • accessibility
    • transparent and explicit semantics
      – esp. for links

Linked (Open) Data, informally

• use URIs as names for things (1)
  – links to external URIs allow us to retrieve more information from these sites
• if they can be resolved via HTTP (2)
• and provide information via SPARQL/RDF* (3)
• and they include links to other URIs (4)
⇒ then, this is Linked Data

http://www.w3.org/DesignIssues/LinkedData.html

Linked Open Data: The 5 star plan

From Tables …

PHOnetics Information Base and LExicon (PHOIBLE) Moran, S. 2012. Using Linked Data to Create a Typological Knowledge Base. In Chiarcos, C., Nordhoff, S., and Hellmann, S. (eds), Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg.

From Tables to RDF …

1. Decompose tables into triples, i.e.,
   – entity, attribute, value, or equivalently
   – Subject, Property, Object

   Subject (primary key)    Property („Relation“)    Object
   tha                      hasSegment               u:

   We chose “hasSegment” for the property corresponding to column “glyph”

2. Multiple triples constitute a graph

3. A graph can aggregate triples from other sources, as well
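The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key column name "language" and the helper `table_to_triples` are made up for this sketch, while the values ("tha", "khm", "u:", "hasSegment", the lexvo.org URI) are the ones that appear on the slides.

```python
# Illustrative rows: "tha" and "khm" with the segment "u:" follow the slides;
# the column name "language" is an assumption made for this sketch.
ROWS = [
    {"language": "tha", "glyph": "u:"},
    {"language": "khm", "glyph": "u:"},
]

# Column-to-property mapping; "hasSegment" for column "glyph" is the choice
# stated on the slide.
COLUMN_TO_PROPERTY = {"glyph": "hasSegment"}


def table_to_triples(rows, key_column, column_to_property):
    """Step 1: turn each non-key cell into a (subject, property, object) triple."""
    triples = []
    for row in rows:
        subject = row[key_column]  # the primary key becomes the subject
        for column, value in row.items():
            if column == key_column:
                continue
            prop = column_to_property.get(column, column)
            triples.append((subject, prop, value))
    return triples


# Step 2: a set of triples is a graph.
graph = set(table_to_triples(ROWS, "language", COLUMN_TO_PROPERTY))

# Step 3: triples from another source can simply be added to the same graph.
other_source = {("khm", "sameAs", "http://lexvo.org/id/iso639-3/khm")}
graph |= other_source
```

Because a graph is just a set of triples, aggregation is a set union; no schema migration is needed when a new source contributes new properties.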

From Tables to RDF …

Graphs can be represented in other ways, but RDF allows us to

1. Provide explicit semantics (RDF Schema, Ontology)
2. Check consistency and infer implicit information
3. Merge (not only syntactically, but semantically)
4. Query
5. Link (enrich with external data)

Technologies: RDFS, OWL (semantics, consistency, inference); URIs & SPARQL (linking, querying)

Uniform Resource Identifiers (URIs)

● Agree on a common vocabulary and names for entities
● URIs provide globally unique identifiers

“hasSegment” (a plain string, ambiguous)

vs.

<http://mlode.nlp2rdf.org/resource/phoible/hasSegment> (a URI)

vs.

@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/> .
... phoible:hasSegment ... (the same URI, abbreviated with a prefix)
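The relation between the prefixed form and the full URI is plain string concatenation. A minimal sketch, assuming a hypothetical helper named `expand` and a prefix table that contains only the phoible namespace shown above:

```python
# Prefix table; the phoible namespace is the one shown on the slide.
PREFIXES = {
    "phoible": "http://mlode.nlp2rdf.org/resource/phoible/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'phoible:hasSegment' to a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        # A bare string like "hasSegment" carries no namespace, so it cannot
        # be resolved to a globally unique identifier.
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local


full_uri = expand("phoible:hasSegment")
```

The bare string "hasSegment" is rejected on purpose: without a namespace there is nothing that distinguishes it from another dataset's "hasSegment".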

Turtle

• Simple triple notation

Subject-URI Property-URI Object-URI .
Subject-URI Property-URI "Literal value" .

e.g., phoible:khm phoible:hasSegment "u:" .
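To make the notation concrete, here is a small sketch that writes triples in this simple one-statement-per-line form. The classes `Literal` and `URI` and the function `to_turtle` are hypothetical names for this illustration, not part of any RDF library, and only the most basic escaping is handled.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A literal value, written in double quotes."""
    text: str


class URI(NamedTuple):
    """A full URI, written in angle brackets."""
    value: str


def render(term):
    """Render one term of a triple in the simple notation shown above."""
    if isinstance(term, Literal):
        escaped = term.text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    if isinstance(term, URI):
        return f"<{term.value}>"
    return term  # assumed to be a prefixed name such as phoible:khm


def to_turtle(prefixes, triples):
    """Emit prefix declarations followed by one 'S P O .' line per triple."""
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in prefixes.items()]
    lines += [f"{render(s)} {render(p)} {render(o)} ." for s, p, o in triples]
    return "\n".join(lines)


doc = to_turtle(
    {"phoible": "http://mlode.nlp2rdf.org/resource/phoible/"},
    [("phoible:khm", "phoible:hasSegment", Literal("u:"))],
)
```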

SPARQL

Merge data and query it using the W3C standard SPARQL (SPARQL Protocol and RDF Query Language)

“the SQL of the Semantic Web”

PREFIX phoible: <http://mlode.nlp2rdf.org/resource/phoible/>

SELECT DISTINCT ?language
WHERE {
  ?language phoible:hasSegment ?segment .
  ?segment phoible:hasFeature phoible:delayed_release .
}
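What the query above asks for can be illustrated with a toy matcher: each triple pattern is unified with every triple, and variable bindings are carried from one pattern to the next (a join). This is a teaching sketch and not how a SPARQL engine is implemented; the segment identifiers and the feature "phoible:syllabic" in the sample data are invented for the example.

```python
def match(pattern, triple, binding):
    """Unify one triple pattern with one triple under an existing binding."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, triples):
    """Evaluate a list of triple patterns as a conjunction (join)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


# Hypothetical sample data; only the property names follow the slide.
TRIPLES = [
    ("phoible:khm", "phoible:hasSegment", "seg:ts"),
    ("seg:ts", "phoible:hasFeature", "phoible:delayed_release"),
    ("phoible:tha", "phoible:hasSegment", "seg:u"),
    ("seg:u", "phoible:hasFeature", "phoible:syllabic"),
]

PATTERNS = [
    ("?language", "phoible:hasSegment", "?segment"),
    ("?segment", "phoible:hasFeature", "phoible:delayed_release"),
]

# SELECT DISTINCT ?language corresponds to projecting and de-duplicating.
languages = sorted({b["?language"] for b in query(PATTERNS, TRIPLES)})
```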

From Tables to RDF to Linked Data

• use URIs as names for things (1)
  – links to external URIs allow us to retrieve more information from these sites
• if they can be resolved via HTTP (2)
• and provide information via SPARQL/RDF* (3)
• and they include links to other URIs (4)
⇒ then, this is Linked Data

Turtle notation:

@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
phoible:khm phoible:hasSegment "u:" .
phoible:khm owl:sameAs <http://lexvo.org/id/iso639-3/khm> .

http://www.w3.org/DesignIssues/LinkedData.html

Linked Open Data: The 5 star plan

Linked Open Data (LOD, Aug 2014)

Linguistic Linked Open Data

Linguistic Linked Open Data (LLOD)

Linguistic Motivations
A Brief History

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges
  – different formats, schemes
  – distributed
  – dispersed metadata collections

⇒ require common specifications to represent, share, access and register language resources

• by that time, this had long been recognized as a problem; hence, several proposals to address it

• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for
  – terminology (GOLD, ISOcat, OLiA) (Farrar & Langendoen 2003, Ide & Wright 2004, Schmidt et al. 2006)
  – integrating typological databases (TDS) (Saulwick et al. 2005, Dimitriades et al. 2010)
  – modelling and querying multi-layer corpora (Cassidy 2010, Chiarcos et al. 2008, Rehm et al. 2008)
  – NLP pipelines (Buyko et al. 2008, Ribieira et al. 2012, Hellmann et al. 2013)
  – interfacing corpus and dictionary data (Burchardt et al. 2008, Mazziotta et al. 2010)

• lexical resources long provided by the SW (Gangemi et al. 2003, Buitelaar et al. 2006)

• But these activities were not coordinated
  – and in particular, RDF was used, but resources were rarely linked to other resources in the web of data
  ⇒ Interest in RDF, barely any links
  ⇒ need for establishing communication channels

⇒ Community Building
⇒ Linguistic Linked Open Data (LLOD) & LLOD cloud

OKFN Open Linguistics Working Group (OWLG)

• founded in Oct 2010 in Berlin, Germany
  – Working group of the Open Knowledge Foundation
• open network of individuals interested in
  – linguistic resources and/or
  – their publication under open licenses
• multi-disciplinary
  – NLP/CL, typology/language documentation, SW, …
• infrastructure
  – mailing list, web site/blog, wiki
  – http://linguistics.okfn.org

OWLG activities (http://linguistics.okfn.org)

– promoting open linguistic resources
  ⇒ raising awareness, collecting metadata (datahub.io)
– facilitating wide-range community activities
  • workshops, mailing list, publications
    – Linked Data in Linguistics (LDL)
    – Multilingual Linked Open Data for Enterprises (MLODE)
    – Linked Data in Linguistic Typology (LDLT)
  • facilitating exchange between and among more specialized community groups
    – W3C OntoLex, BP-MLOD, LD4LT, ...
    – ACL SIGs (SIGLEX, SIGANN), ...
    – DGfS, MPI-EVA, ...

LLOD cloud

• a collection of linguistic resources
  – published under open licenses
  – as linked data
  – developed and maintained in a decentralized way
  – metadata at http://datahub.io
    => cloud diagram
  – developed as a community effort in the context of the Open Linguistics Working Group of the Open Knowledge Foundation

Building the LLOD Cloud

2011: a sketch on a table napkin
Mar 2012: Chiarcos et al. (2012), LDL book
Sep 2012: MLODE hackathon to produce first diagram from original (meta) data
2012-2014: more data, more rigid quality constraints; emergence of related W3C Community Groups
Aug 2014: top-level category in the LOD diagram

Workshop series
– Linked Data in Linguistics (LDL, annually)
– Multilingual Linked Open Data for Enterprises (MLODE, bi-annually)
– Linked Data in Linguistic Typology (LDLT, 2013)

linguistic-lod.org

Recent developments

• 9th International Conference on Language Resources and Evaluation (LREC-2014)
  – „the new hot topic in our community“ (Nicoletta Calzolari, Pres. ELRA)

• Selected LLOD events in the last 3 months
  – 4th Multilingual Semantic Web (Portoroz, Slovenia, June 2015)
  – 1st Summer Datathon on LLOD (Cercedilla, Spain, June 2015)
  – EUROLAN-2015 Summer School on “Linguistic Linked Open Data” (Sibiu, Romania, July 2015)
  – LLOD-LSA Workshop at the Summer Institute of the Linguistic Society of America (Chicago, IL, July 2015)
  – 4th Linked Data in Linguistics (Beijing, PRC, July 2015)
  – Ontology session at ESSLLI-2015 (Barcelona, Spain, Aug 2015)
  – 2nd Workshop on NLP&LOD (Hissar, Bulgaria, Sep 2015)

Linked Open Data for Linguists

Possible applications

Linked Open Data for Linguists

• Linked Data
  – rules of best practice for publishing data on the web
    • protocols and standards
    • links between data sets

⇒ improved access to distributed resources
⇒ improved (re-)usability of language resources
⇒ improved visibility of language resources

⇒ Information integration
  – Structural interoperability
    • comparable formats and protocols to access data
    ⇒ use the same query language for different data sets
  – Conceptual interoperability
    • develop and (re-)use shared vocabularies for equivalent concepts
    ⇒ the same query on different data sets
  – Federation
    • data published on the web
      – with a query interface (SPARQL end point)
    ⇒ use a single query to query different datasets (SPARQL 1.1)

Achievable with any graph-based data model

The “killer application”, e.g., for annotated corpora

Conceptual Interoperability: Monolingual

• When language resources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)

          Susanne   Penn
  The     AT        DT
  Fulton  NP1s      NNP
  County  NNL1cb    NNP
  Grand   JJ        NNP
  Jury    NN1c      NNP
  said    VVDv      VBD
  Friday  NPD1      NNP

  Susanne: 395 tags (word classes, morphological features, syntactic features, lexical classes)
  Penn: 57 tags (word classes, number and degree)

• Integrating both resources allows us to
  – apply more wide-scale statistical analyses
  – increase training data for supervised POS tagging
  – increase test data for unsupervised POS tagging

Conceptual Interoperability: Multilingual

• with interoperable POS tags used across different languages, …
  – we can apply the same unlexicalized NLP tools (e.g., parsers, cf. McDonald et al. 2013)
  – we can perform comparative corpus studies
  – we simplify multilingual annotation projection

Conceptual Interoperability

• Multiple terminology repositories exist
  – available over the web
  – RDF representation

• Are linked with each other
  – for language IDs: Glottolog & lexvo.org
  – for lexical senses: WordNets (ILI)
  – for grammatical categories & features: GOLD, ISOcat, OLiA

Linked Terminologies

[Figure: Ontologies of Linguistic Annotation (OLiA), with the following layers]

– Language Resources: Dictionaries, Corpora, NLP Tools
– (resource-specific) Annotation Models: EAGLES, MULTEXT/East, STTS, TIGER, TüBa-D/Z, Connexor, Penn, Brown, Susanne, etc. (language labels in the figure: English; German; 11 European languages; 15 (mostly) Eastern European languages)
– OLiA Reference Model
– External Reference Models (Terminology Repositories): GOLD, ISOcat (morpho-syntax), OntoTag (morpho-syntax), TDS ontology

Conceptual Interoperability

The
  Penn DT:         Determiner, PronounOrDeterminer
  Susanne AT:      DefiniteArticle, Article, Determiner, PronounOrDeterminer
Fulton
  Penn NNP:        ProperNoun, Noun, hasNumber.Singular
  Susanne NP1s:    Surname, ProperNoun, Noun, hasNumber.Singular
County
  Penn NNP:        ProperNoun, Noun, hasNumber.Singular
  Susanne NNL1cb:  TopographicalNoun, ProperNoun, Noun, hasNumber.Singular
Grand
  Penn NNP:        ProperNoun, Noun, hasNumber.Singular
  Susanne JJ:      Adjective, hasDegree.Positive
Jury
  Penn NNP:        ProperNoun, Noun, hasNumber.Singular
  Susanne NN1c:    CommonNoun, Noun, hasNumber.Singular
said
  Penn VBD:        (MainVerb StrictAuxiliaryVerb), hasTense.Past [sic!]
  Susanne VVDv:    MainVerb, hasTense.Past
Friday
  Penn NNP:        ProperNoun, Noun, hasNumber.Singular
  Susanne NPD1:    TemporalNoun, ProperNoun, Noun, hasNumber.Singular

atomic statements mostly identical, just a few more from Susanne
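Once both tagsets are mapped to the same concepts, comparing two annotations of a token reduces to set operations. The sketch below uses the tag-to-concept statements from the slide; treating them as flat Python sets, and the names `compare` and `report`, are assumptions of this illustration rather than the way the ontologies are actually queried.

```python
PENN = {
    "DT": {"Determiner", "PronounOrDeterminer"},
    "NNP": {"ProperNoun", "Noun", "hasNumber.Singular"},
    # The slide gives "(MainVerb StrictAuxiliaryVerb) hasTense.Past" for VBD;
    # the alternative between the two verb classes is left out of this sketch.
    "VBD": {"hasTense.Past"},
}

SUSANNE = {
    "AT": {"DefiniteArticle", "Article", "Determiner", "PronounOrDeterminer"},
    "NP1s": {"Surname", "ProperNoun", "Noun", "hasNumber.Singular"},
    "NNL1cb": {"TopographicalNoun", "ProperNoun", "Noun", "hasNumber.Singular"},
    "JJ": {"Adjective", "hasDegree.Positive"},
    "NN1c": {"CommonNoun", "Noun", "hasNumber.Singular"},
    "VVDv": {"MainVerb", "hasTense.Past"},
    "NPD1": {"TemporalNoun", "ProperNoun", "Noun", "hasNumber.Singular"},
}

# (token, Penn tag, Susanne tag), as on the slide.
SENTENCE = [
    ("The", "DT", "AT"),
    ("Fulton", "NNP", "NP1s"),
    ("County", "NNP", "NNL1cb"),
    ("Grand", "NNP", "JJ"),
    ("Jury", "NNP", "NN1c"),
    ("said", "VBD", "VVDv"),
    ("Friday", "NNP", "NPD1"),
]


def compare(penn_tag, susanne_tag):
    """Split the statements of two tags into shared and tagset-specific ones."""
    p, s = PENN[penn_tag], SUSANNE[susanne_tag]
    return {"shared": p & s, "penn_only": p - s, "susanne_only": s - p}


report = {token: compare(p, s) for token, p, s in SENTENCE}
```

For most tokens the Penn statements are a subset of the Susanne statements; "Grand" is the exception, where the two annotations share no statement at all.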

What’s in for (Deep) MT?

Entity Linking: A Special Track for Proper Names
Re-Using Lexical Resources
Addressing Lexical Gaps
Bootstrapping Dictionaries
Improved Deep Analysis

Deep MT (© Jan Hajic, yesterday)

Entity Linking: A Special Track for Proper Names

Translating Proper Names

• Normally not directly translated, but maintained
  – Differences in inflection
  – Different writing systems (Cyrillic vs. Latin vs. Arabic, etc.)
  – SMT: make sure your Language Model doesn’t override the Translation Model!

Translating Proper Names

• Make sure your Language Model doesn’t override the Translation Model!

  My grandfather's grandfather came to Germany in 1905.
  – "Il nonno di mio nonno arrivò in Canada nel 1905." (google translate, Jul 2010)
  – "Nonno di mio nonno è venuto in Germania nel 1905." (google translate, Sep 2012)

  "Recentemente", conferma Maria Serena Balestracci, "mi ha telefonato un signore da Bologna, che aveva sentito parlare del libro alla radio
  – "... a gentleman from London ..." (google translate, Oct 2010)
  – "... a gentleman from Bologna ..." (google translate, Sep 2012)

These errors stopped after Google bought Freebase

Page 67: Linguistic Linked Open Data: What’s in for (Deep) Machine Translation? Christian Chiarcos chiarcos@informatik.uni-frankfurt.de DeepMT, Sep 4 th, 2015,

67

Trivial Entity Linking

• Named Entity Recognition -> treat NEs in a special way


• Named Entity Recognition
• Entity Linking -> Link with an ontology, which may provide multilingual labels
  – E.g., Entity Linking via DBpedia Spotlight
  – Follow DBpedia-JRCNames linking
  – Use multilingual label from JRCNames instead of translating yourself
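
The steps above can be sketched in a few lines. This is only a minimal illustration, not the API of DBpedia Spotlight or JRC-Names: the `LINKS` and `LABELS` tables, the `example.org` URIs and the function name `translate_names` are all made-up stand-ins for an entity linker's output and a multilingual label store.

```python
# Toy stand-in for an entity linker: surface form -> entity URI (hypothetical data).
LINKS = {
    "Germania": "http://example.org/entity/Germany",
    "Bologna": "http://example.org/entity/Bologna",
}

# Toy stand-in for a multilingual label store: entity URI -> {language: label}.
LABELS = {
    "http://example.org/entity/Germany": {"it": "Germania", "en": "Germany", "de": "Deutschland"},
    "http://example.org/entity/Bologna": {"it": "Bologna", "en": "Bologna"},
}


def translate_names(tokens, target_lang):
    """Replace linked names by their target-language label; leave other tokens alone."""
    out = []
    for token in tokens:
        uri = LINKS.get(token)
        label = LABELS.get(uri, {}).get(target_lang) if uri else None
        # Fall back to the source token if the entity is unknown or has no label.
        out.append(label if label is not None else token)
    return out


print(translate_names(["un", "signore", "da", "Bologna"], "en"))
print(translate_names(["in", "Germania"], "de"))
```

The point of the design is that the name never passes through the translation or language model: it is looked up, not translated.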


Problems with inflecting languages

• Just knowing about (one) possible label in another language doesn’t help much if you need to inflect a name – or, any other string label from a knowledge base


• Listing all forms helps with entity linking, but not with machine translation

=> We need linguistic LOD (LLOD)
Systematic inclusion of grammatical information
=> LLOD vocabularies and conventions
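
To make the point concrete, the sketch below stores several forms of one entry together with grammatical features and selects the form that a target sentence requires. The dictionary layout, the feature names (`case`, `number`) and the function `select_form` are illustrative assumptions, loosely modelled on what LLOD vocabularies can express; they are not the lemon or lexinfo data model itself.

```python
# Hypothetical entry: one name with several inflected forms and their features.
ENTRY = {
    "lemma": "Praha",
    "language": "cs",
    "forms": [
        {"written": "Praha", "case": "nominative", "number": "singular"},
        {"written": "Prahy", "case": "genitive", "number": "singular"},
        {"written": "Praze", "case": "locative", "number": "singular"},
    ],
}


def select_form(entry, **features):
    """Return the first written form whose features all match, or None."""
    for form in entry["forms"]:
        if all(form.get(name) == value for name, value in features.items()):
            return form["written"]
    return None


print(select_form(ENTRY, case="locative", number="singular"))
```

A bare list of strings ("Praha", "Prahy", "Praze") would be enough to recognise the name, but only the feature annotations allow a generator to pick the right one.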


(Re-) Using lexical resources


Lemon: Lexicon Model for Ontologies

• Developed by the W3C Ontology-Lexica Community Group (OntoLex)


• Provides a data model for adding linguistic information to ontologies

• Widely used within the LLOD cloud
  – Also by colleagues not participating in OntoLex
    • E.g., PanLex (Long Now Foundation)
  – “Abused” for any kind of lexical resource
    • Even beyond the original ontology lexicalization use case


lemon Core


lemon Sample (Moran and Brümmer 2013)


Open World Assumption

• Unless explicitly stated, information is per se incomplete
  – Additional information can be expressed
  – E.g., using linguistic categories and features from terminology repositories
=> Grammatical information can be described in a reusable way

Recommended vocabularies: lexinfo, OLiA, GOLD


Vocabularies for Lexical-Conceptual Resources

• lemon provides data structures, but
  – for content and metadata, it relies on external vocabularies
• Interoperability depends on a bundle of vocabularies
  – WordNet, DBpedia, any ontology (lexical senses)
  – lexvo (language identifiers)
  – glottolog (languoid identifiers from linguistic typology)
  – PHOIBLE (phoneme inventories and phonological structures)
  – lexinfo (grammatical features for lexical resources)
  – OLiA (annotations)
  – ISOcat (resource metadata)
  – GOLD (grammatical concepts)


Providing (lexical) resources in accordance with these vocabularies improves their reusability

This effect can be extended to NLP tools


Addressing lexical gaps


• Subsumption inference can partially compensate for the lack of lexical resources/coverage
=> If no counterpart for the target language is found, try hypernyms


A gaggle of photographers …

• Idea:
  – translate something that doesn’t exist in the target language
• Assume you know that it refers to the English WordNet term gaggle-n. What would it be in German?
  – http://wordnet-rdf.princeton.edu/ provides multilingual labels

• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n wn:translation ?gaggle_n_de FILTER (lang(?gaggle_n_de) = "de")

Basque, Finnish, Japanese, no German -> check hypernym

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle_n_de FILTER (lang(?gaggle_n_de) = "de")

Port. branco, Span. banda, … no German -> check indirect hypernyms

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle_n_de FILTER (lang(?gaggle_n_de) = "de")

French groupe, Cat. grup, Gal. grupo, … still no German -> check an external resource, say lemonUby (which we know to contain some German)

wn31:gaggle-n …/wn:synset_member/owl:sameAs ?gaggleUby FILTER regex(str(?gaggleUby), "http://lemon-model.net/lexica/uby/wn/.*") .
?gaggleUby … FILTER (lang(?gaggle_n_de) = "de")

Traverse different resources (in the same end point) according to their structure, until something is found …


Possible because the structure of these resources is lemon-conformant


For resources outside the current end point, another SERVICE can be addressed -> federation


Quite slow, but can be used to pre-compile word lists with generalisation
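
The fallback strategy behind these queries can be written out over a toy in-memory graph. This is only a sketch: the synset names, the hypernym links and the label table are invented for illustration (in particular the German label, which is added so that the search terminates) and are not taken from WordNet RDF or lemonUby. A real implementation would issue SPARQL property-path queries instead of dictionary lookups.

```python
# Hypothetical toy data: synset -> direct hypernym (None at the root).
HYPERNYM = {
    "gaggle": "flock",
    "flock": "group",
    "group": None,
}

# Hypothetical labels: synset -> {language: label}.
LABELS = {
    "gaggle": {"en": "gaggle"},
    "flock": {"en": "flock", "es": "banda"},
    "group": {"en": "group", "fr": "groupe", "de": "Gruppe"},
}


def translate_with_fallback(synset, lang):
    """Walk up the hypernym chain until a label in `lang` is found.

    Returns (label, steps), where steps counts the generalisations made,
    or (None, steps) if the chain is exhausted.
    """
    steps = 0
    seen = set()
    while synset is not None and synset not in seen:
        seen.add(synset)  # guard against cycles in noisy data
        label = LABELS.get(synset, {}).get(lang)
        if label is not None:
            return label, steps
        synset = HYPERNYM.get(synset)
        steps += 1
    return None, steps


print(translate_with_fallback("gaggle", "de"))
print(translate_with_fallback("gaggle", "es"))
```

Returning the number of generalisation steps lets a precompiled word list record how far each entry has drifted from the original sense.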


Bootstrapping dictionaries


By transitivity
• Goal: Translate from Czech to Farsi
• No dictionary, but
  – Czech-English, English-Farsi
• Quite noisy, though, hence check multiple paths
  – Czech-Russian, Russian-Farsi
  – …
-> Limit to forms with high confidence (by the number of paths, alternatives, etc.)
• Slow, again, but can be used for precompiling
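
A minimal sketch of this transitive bootstrapping, assuming each bilingual dictionary is available as a mapping from a word to a set of translations. The function names `bootstrap` and `confident`, the `min_paths` threshold and all dictionary entries are hypothetical; the Russian and Farsi entries are placeholder strings rather than real words.

```python
from collections import Counter

# Hypothetical toy dictionaries (word -> set of translations).
CS_EN = {"zámek": {"castle", "lock"}}
EN_FA = {"castle": {"fa_castle"}, "lock": {"fa_lock"}}
CS_RU = {"zámek": {"ru_castle"}}
RU_FA = {"ru_castle": {"fa_castle"}}


def bootstrap(paths):
    """Count how many pivot words support each (source, target) pair."""
    support = Counter()
    for src_to_pivot, pivot_to_tgt in paths:
        for src, pivots in src_to_pivot.items():
            for pivot in pivots:
                for tgt in pivot_to_tgt.get(pivot, ()):
                    support[(src, tgt)] += 1
    return support


def confident(support, min_paths=2):
    """Keep only pairs that are supported by at least `min_paths` pivot paths."""
    return sorted(pair for pair, n in support.items() if n >= min_paths)


support = bootstrap([(CS_EN, EN_FA), (CS_RU, RU_FA)])
print(support)
print(confident(support))
```

Counting supporting paths is just one possible confidence signal; the number of competing alternatives per source word could be used in the same way.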


Improved Deep Analysis


• Re-using externally provided tools (and data)
  – Structural Interoperability
    • The output of NLP tools can be represented in RDF
      – NLP Interchange Format (NIF, nlp2rdf.org)
        » If only one layer of analysis is considered
        » For more complicated annotations and actual corpora, additional means are necessary, cf. POWLA (purl.org/powla)
  – Conceptual Interoperability
    • Represent and integrate the output of NLP tools with reference to LLOD repositories, e.g., the Ontologies of Linguistic Annotation (OLiA)


Comparing and combining heterogeneous linguistic analyses

diese nicht neue Erkenntnis
this not new insight
'this well-known insight'

Connexor*: PRON Dem FEM SG NOM
RFTagger**: PRO.Dem.Attr.-3.Acc.Sg.Fem

* P. Tapanainen and T. Järvinen. 1997. A nonprojective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64–71, Washington, DC, April 1997.

** H. Schmid and F. Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, August 2008.


Connexor: PRON Dem FEM SG NOM
rdf:type(olia:PronounOrDeterminer)
rdf:type(olia:Pronoun)
rdf:type(olia:DemonstrativePronoun)
olia:hasNumber(olia:Singular)
olia:hasGender(olia:Feminine)
olia:hasCase(olia:Nominative)

RFTagger: PRO.Dem.Attr.-3.Acc.Sg.Fem
rdf:type(olia:PronounOrDeterminer)
rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativeDeterminer)
olia:hasNumber(olia:Singular)
olia:hasGender(olia:Feminine)
olia:hasCase(olia:Accusative)

OLiA Reference Model descriptions


confidence ranking (simple voting)

predicted by both tools:
rdf:type(olia:PronounOrDeterminer)
olia:hasNumber(olia:Singular)
olia:hasGender(olia:Feminine)

predicted by one tool:
rdf:type(olia:Pronoun)
rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativePronoun)
rdf:type(olia:DemonstrativeDeterminer)
olia:hasCase(olia:Accusative)
olia:hasCase(olia:Nominative)
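
The voting step can be sketched as follows, using the two description sets from the example. Representing descriptions as plain strings and the function name `rank_by_votes` are assumptions made for this illustration; the slides do not prescribe an implementation.

```python
from collections import Counter

CONNEXOR = [
    "rdf:type(olia:PronounOrDeterminer)",
    "rdf:type(olia:Pronoun)",
    "rdf:type(olia:DemonstrativePronoun)",
    "olia:hasNumber(olia:Singular)",
    "olia:hasGender(olia:Feminine)",
    "olia:hasCase(olia:Nominative)",
]

RFTAGGER = [
    "rdf:type(olia:PronounOrDeterminer)",
    "rdf:type(olia:Determiner)",
    "rdf:type(olia:DemonstrativeDeterminer)",
    "olia:hasNumber(olia:Singular)",
    "olia:hasGender(olia:Feminine)",
    "olia:hasCase(olia:Accusative)",
]


def rank_by_votes(*tool_outputs):
    """Return (description, votes) pairs sorted by decreasing number of votes."""
    votes = Counter()
    for output in tool_outputs:
        # dict.fromkeys removes duplicates but keeps order: one vote per tool.
        for description in dict.fromkeys(output):
            votes[description] += 1
    # sorted() is stable, so equally ranked descriptions keep first-seen order.
    return sorted(votes.items(), key=lambda item: -item[1])


for description, n in rank_by_votes(CONNEXOR, RFTAGGER):
    print(n, description)
```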


disambiguation: create the maximal consistent set S of descriptions

1. S is empty
2. process descriptions with decreasing confidence
   a) if the current description is consistent with all descriptions in S, then add it to S
   b) if not, skip it
   c) iterate until all descriptions are processed


identify incompatible annotations

check consistency conditions in the ontology


[Figure: excerpt of the OLiA class hierarchy]

olia_top:MorphosyntacticCategory
  olia:PronounOrDeterminer (is-a olia_top:MorphosyntacticCategory)
    olia:Pronoun (is-a olia:PronounOrDeterminer)
      olia:DemonstrativePronoun (is-a olia:Pronoun)
    olia:Determiner (is-a olia:PronounOrDeterminer)
      olia:DemonstrativeDeterminer (is-a olia:Determiner)

siblings are inconsistent: rdf:type(olia:Pronoun) vs. rdf:type(olia:Determiner)
cousins are inconsistent: rdf:type(olia:DemonstrativePronoun) vs. rdf:type(olia:DemonstrativeDeterminer)
aunts/nieces, etc. are inconsistent: rdf:type(olia:Determiner) vs. rdf:type(olia:DemonstrativePronoun)

A is consistent with B iff A ⊆ B or B ⊆ A


=> from every equally-ranked pair of inconsistent descriptions: first come, first served (simple voting with random tie resolution)
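
The disambiguation procedure can be sketched for class descriptions as follows. This is a simplified illustration under stated assumptions: the `PARENT` table covers only the classes from the example and assumes single inheritance, feature descriptions such as olia:hasCase are left out, and the function names are hypothetical rather than taken from an existing implementation.

```python
# Toy excerpt of the class hierarchy: class -> parent (single inheritance assumed).
PARENT = {
    "PronounOrDeterminer": "MorphosyntacticCategory",
    "Pronoun": "PronounOrDeterminer",
    "Determiner": "PronounOrDeterminer",
    "DemonstrativePronoun": "Pronoun",
    "DemonstrativeDeterminer": "Determiner",
}


def ancestors(cls):
    """Return cls together with all of its superclasses."""
    result = {cls}
    while cls in PARENT:
        cls = PARENT[cls]
        result.add(cls)
    return result


def consistent(a, b):
    """Two classes are consistent iff one of them subsumes the other."""
    return a in ancestors(b) or b in ancestors(a)


def disambiguate(ranked):
    """Greedily build a consistent set from classes sorted by decreasing confidence."""
    selected = []
    for cls in ranked:
        # Among equally ranked, mutually inconsistent classes the earlier one wins.
        if all(consistent(cls, other) for other in selected):
            selected.append(cls)
    return selected


ranked = [
    "PronounOrDeterminer",      # predicted by both tools
    "Determiner",               # predicted by one tool each from here on
    "Pronoun",
    "DemonstrativePronoun",
    "DemonstrativeDeterminer",
]
print(disambiguate(ranked))
```

Because ties are resolved by position, the order among equally ranked descriptions decides the outcome; shuffling them before the call would give the random tie resolution mentioned above.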


Experiments

• we know that ensemble combination improves accuracy
  – if so, we should observe an increase of accuracy for (at least some) combinations of tools
  – but accuracy may be the wrong criterion
    • inaccurate if the target annotation is less rich than one of the source annotations
=> measure recall, not accuracy
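
One plausible way to operationalise this, assumed here for illustration rather than taken from the evaluation itself, is to count how many of the descriptions implied by the reference annotation are also predicted. Additional, more fine-grained predictions are then not penalised. The function name `recall` and the example sets are hypothetical.

```python
def recall(predicted, reference):
    """Fraction of reference descriptions that also occur among the predictions."""
    reference = set(reference)
    if not reference:
        return 1.0  # nothing to recover
    return len(reference & set(predicted)) / len(reference)


# The prediction is richer than the reference: the extra detail costs nothing.
print(recall(
    {"PronounOrDeterminer", "Determiner", "DemonstrativeDeterminer", "Singular"},
    {"PronounOrDeterminer", "Determiner"},
))

# One of two reference descriptions is missed.
print(recall(
    {"PronounOrDeterminer", "Pronoun"},
    {"PronounOrDeterminer", "Determiner"},
))
```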


Experiments (Chiarcos 2010)

• German newspaper corpora
• 10,000 tokens from each of the following newspaper corpora
  – NEGRA (Skut et al. 1998)
  – TIGER (Brants et al. 2002)
  – Potsdam Commentary Corpus (PCC, Stede 2004)
• TIGER/NEGRA-style target annotation


[Figure: evaluation pipeline. The tokenized plain text of a corpus file is analyzed by RFTagger, TreeTagger, StanfordTagger, StanfordParser, BerkeleyParser, Morphisto (morphology) and Connexor. Each output is mapped through its annotation model (RFTagger, STTS, Morphisto or Connexor annotation model) to a set of OLiA reference model descriptions, from which the maximal consistent description is built. The reference annotation in TIGER format is mapped through the TIGER annotation model to a reference description, and the two are compared.]


Experiments
Morphosyntax: Example

Diese nicht neue Erkenntnis

=> PronounOrDeterminer & Determiner & DemonstrativeDeterminer


Experiments
Morphosyntax: Recall

* StanfordTagger was trained on NEGRA


• continuous increase of (avg.) recall
• combination of 5-6 tools outperforms best-performing single tool
  – except StanfordTagger on NEGRA
    • trained on NEGRA

=> table for individual combinations


Experiments
Morphosyntax: Results

• best-performing combinations (NEGRA)
  1. Stanford Tagger (98.97% recall)
  2. -“- + Stanford Parser (98.71% recall)
  3. -“- + TreeTagger (99.00% recall)
  4. Stanford Tagger + Stanford Parser + Morphisto + RFTagger (98.87% recall)
  5. Stanford Tagger + Stanford Parser + TreeTagger + RFTagger + Connexor (98.29% recall)
=> marginal decrease of performance of best-performing tools


• worst-performing combinations (NEGRA) (Berkeley Parser excluded)
  1. Morphisto (70.06% recall)
  2. -“- + Connexor (86.05% recall)
  3. -“- + TreeTagger (91.90% recall)
  4. -“- + RFTagger (94.29% recall)
  5. -“- + StanfordTagger (96.10% recall)
=> rapid increase of performance for worst-performing tools


Findings

• result is a consistent set of ontological descriptions
  – no loss of detail when trained/evaluated against a target annotation
    • with different granularity
    => can be evaluated against corpora with different target annotation
  – natural handling of different granularities
    • hierarchical structures
  – string-based representation can be generated


More Experiments

Comparable results for German morphology (Chiarcos 2010, 3 tools)

Similar results for German dependency/edge labels (Chiarcos, unpublished, 5 tools, labels only)

Different use case, similar methodology: Pareja et al. (2010), Spanish particle se


Even more experiments

• Instead of combining existing tools, we can also train tools directly on ontological representations of annotations
  – Even if these originate from different annotations
    • With ontology-based pruning, these yield ontologically consistent descriptions
  – Chiarcos & Sukhareva (NLP&LOD2, next week)
    • Trained a neural network, encoded and decoded with OLiA representations
    • Increased granularity (depth of analysis), stable accuracy
  – Replicate for discourse parsing
    • Based on Chiarcos (2014)


Summary

• LLOD
  – provides resources
  – facilitates interoperability
    • data, tools, annotations, lexical resources
  – facilitates access/integration of heterogeneous/distributed information
• is a community effort
  – depends on your input


Want to stay/get involved ?

• Join our discussions / meetings
  – present your resources, interests, questions, etc.
• Open Linguistics Working Group
  – http://linguistics.okfn.org/
    • mailing list, telcos, meetings and events
  – also consider the relevant W3C CGs, e.g.,
    • OntoLex => lexical-conceptual resources
    • BP-MLOD => best practice guidelines
    • LD4LT => NLP applications


Thanks a lot !
