Linguistic Linked Open Data: What’s in for (Deep) Machine Translation?
Christian Chiarcos
DeepMT, Sep 4th, 2015, Prague
Linked Open Data
Basic Concepts
Linked Open Data (LOD)
Plenty of resources, linked with each other ;) (Aug 2014)
• However, LOD pertains not so much to a resource (or a bundle of resources), but to a philosophy
• Best practices for publishing data on the web
– Goals
• reusability
• accessibility
• transparent and explicit semantics
– esp. for links
Linked (Open) Data, informally
• use URIs as names for things (1)
– links to external URIs allow us to retrieve more information from these sites
• if they can be resolved via HTTP (2)
• and provide information via SPARQL/RDF* (3)
• and they include links to other URIs (4)
=> then, this is Linked Data
http://www.w3.org/DesignIssues/LinkedData.html
Linked Open Data: The 5 star plan
From Tables …
PHOnetics Information Base and LExicon (PHOIBLE) Moran, S. 2012. Using Linked Data to Create a Typological Knowledge Base. In Chiarcos, C., Nordhoff, S., and Hellmann, S. (eds), Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg.
From Tables to RDF …
1. Decompose tables into triples, i.e.,
– entity, attribute, value, resp.
– Subject, Property („Relation“), Object
– the subject is the primary key of the row
Example: row “tha”, column “glyph”, value “u:” becomes the triple tha hasSegment "u:"
We chose “hasSegment” for the property corresponding to column “glyph”
2. Multiple triples constitute a graph
3. A graph can aggregate triples from other sources, as well
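The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two table rows, the column name "language", and the plain-string property names are made up for the example and are not taken from PHOIBLE itself.

```python
# Toy table in the spirit of the PHOIBLE example; the rows are illustrative, not real data.
TABLE = [
    {"language": "tha", "glyph": "u:"},
    {"language": "khm", "glyph": "u:"},
]

# Assumed mapping from column names to property names
# (the slides map the column "glyph" to the property "hasSegment").
COLUMN_TO_PROPERTY = {"glyph": "hasSegment"}


def table_to_triples(rows, key_column, column_to_property):
    """Step 1: decompose each row into (subject, property, object) triples."""
    triples = set()
    for row in rows:
        subject = row[key_column]  # the primary key becomes the subject
        for column, value in row.items():
            if column == key_column:
                continue
            prop = column_to_property.get(column, column)
            triples.add((subject, prop, value))
    return triples


# Step 2: the set of all triples is the graph.
graph = table_to_triples(TABLE, "language", COLUMN_TO_PROPERTY)

# Step 3: aggregating triples from another source is a set union.
other_source = {("khm", "sameAs", "http://lexvo.org/id/iso639-3/khm")}
merged = graph | other_source
```

Representing the graph as a set of tuples keeps merging trivial; a real application would use full URIs and an RDF library instead of bare strings.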
From Tables to RDF …
Graphs can be represented in other ways, but RDF allows us to
1. Provide explicit semantics (RDF Schema, Ontology)
2. Check consistency and infer implicit information
3. Merge (not only syntactically, but semantically)
4. Query
5. Link (enrich with external data)
Enabling technologies: RDFS, OWL; URIs & SPARQL
Uniform Resource Identifiers (URIs)
● Agree on a common vocabulary and names for entities
● URIs provide globally unique identifiers
“hasSegment” (a string, ambiguous)
vs.
<http://mlode.nlp2rdf.org/resource/phoible/hasSegment> (a URI)
vs.
@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/> .
... phoible:hasSegment ... (the same URI, abbreviated with a prefix)
Turtle
• Simple triple notation
Subject-URI Property-URI Object-URI .
Subject-URI Property-URI “Literal value” .
e.g., phoible:khm phoible:hasSegment "u:" .
SPARQL
Merge data and query it using the W3C standard SPARQL (SPARQL Protocol and Query Language)
“the SQL of the Semantic Web”
SELECT DISTINCT ?language
WHERE {
  ?language phoible:hasSegment ?segment .
  ?segment phoible:hasFeature phoible:delayed_release .
}
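To make the meaning of the two triple patterns concrete, here is a minimal Python sketch of the join the query performs. The triples, including the language and segment identifiers, are hypothetical and exist only for this example; a real setup would send the SPARQL query to an endpoint instead.

```python
# Hypothetical triples; identifiers and feature assignments are made up for illustration.
TRIPLES = {
    ("phoible:lang_a", "phoible:hasSegment", "phoible:seg_ts"),
    ("phoible:lang_a", "phoible:hasSegment", "phoible:seg_u"),
    ("phoible:lang_b", "phoible:hasSegment", "phoible:seg_u"),
    ("phoible:seg_ts", "phoible:hasFeature", "phoible:delayed_release"),
}


def languages_with_feature(triples, feature):
    """Emulate: ?language hasSegment ?segment . ?segment hasFeature <feature> ."""
    # Second pattern: all segments that carry the feature.
    segments = {s for (s, p, o) in triples
                if p == "phoible:hasFeature" and o == feature}
    # First pattern, joined on ?segment; the set gives SELECT DISTINCT semantics.
    languages = {s for (s, p, o) in triples
                 if p == "phoible:hasSegment" and o in segments}
    return sorted(languages)


result = languages_with_feature(TRIPLES, "phoible:delayed_release")
```

The shared variable ?segment is what turns two independent patterns into a join; in the sketch that role is played by the intermediate set `segments`.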
From Tables to RDF to Linked Data
• use URIs as names for things (1)
– links to external URIs allow us to retrieve more information from these sites
• if they can be resolved via HTTP (2)
• and provide information via SPARQL/RDF* (3)
• and they include links to other URIs (4)
=> then, this is Linked Data
http://www.w3.org/DesignIssues/LinkedData.html
Turtle notation:
@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
phoible:khm phoible:hasSegment "u:" .
phoible:khm owl:sameAs <http://lexvo.org/id/iso639-3/khm> .
Linked Open Data (LOD, Aug 2014)
Linguistic Linked Open Data (LLOD)
Linguistic Motivations
A Brief History
Language Resources, 2010 AD
• used in natural language processing, scientific research, language documentation, ...
• accessibility challenges
– different formats, schemes
– distributed
– dispersed metadata collections
=> require common specifications to represent, share, access and register language resources
=> Linguistic Linked Open Data (LLOD) & LLOD cloud
• by that time, this had long been recognized as a problem, hence several proposals to address it
• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for
– terminology (GOLD, ISOcat, OLiA) (Farrar & Langendoen 2003, Ide & Wright 2004, Schmidt et al. 2006)
– integrating typological databases (TDS) (Saulwick et al. 2005, Dimitriadis et al. 2010)
– modelling and querying multi-layer corpora (Cassidy 2010, Chiarcos et al. 2008, Rehm et al. 2008)
– NLP pipelines (Buyko et al. 2008, Ribieira et al. 2012, Hellmann et al. 2013)
– interfacing corpus and dictionary data (Burchardt et al. 2008, Mazziotta et al. 2010)
• lexical resources had long been provided by the Semantic Web (SW) community (Gangemi et al. 2003, Buitelaar et al. 2006)
• But these activities were not coordinated
– and in particular, RDF was used, but resources were rarely linked to other resources in the web of data
=> Interest in RDF, barely any links
=> need for establishing communication channels
Community Building
OKFN Open Linguistics Working Group (OWLG)
• founded in Oct 2010 in Berlin, Germany
– Working group of the Open Knowledge Foundation
• open network of individuals interested in
– linguistic resources and/or
– their publication under open licenses
• multi-disciplinary
– NLP/CL, typology/language documentation, SW, …
• infrastructure
– mailing list, web site/blog, wiki
– http://linguistics.okfn.org
OWLG activities (http://linguistics.okfn.org)
– promoting open linguistic resources
=> raising awareness, collecting metadata (datahub.io)
– facilitating wide-range community activities
• workshops, mailing list, publications
– Linked Data in Linguistics (LDL)
– Multilingual Linked Open Data for Enterprises (MLODE)
– Linked Data in Linguistic Typology (LDLT)
• facilitating exchange between and among more specialized community groups
– W3C OntoLex, BP-MLOD, LD4LT, ...
– ACL SIGs (SIGLEX, SIGANN), ...
– DGfS, MPI-EVA, ...
LLOD cloud
• a collection of linguistic resources
– published under open licenses
– as linked data
– developed and maintained in a decentralized fashion
– metadata at http://datahub.io
=> cloud diagram
– developed as a community effort in the context of the Open Linguistics Working Group of the Open Knowledge Foundation
Building the LLOD Cloud (linguistic-lod.org)
2011: a sketch on a table napkin
Mar 2012: Chiarcos et al. (2012), LDL book
Sep 2012: MLODE hackathon to produce the first diagram from original (meta) data
2012-2014: more data, more rigid quality constraints; emergence of related W3C Community Groups
Aug 2014: top-level category in the LOD diagram
Workshop series
Linked Data in Linguistics (LDL, annually)
Multilingual Linked Open Data for Enterprises (MLODE, bi-annually)
Linked Data in Linguistic Typology (LDLT, 2013)
Recent developments
• 9th International Conference on Language Resources and Evaluation (LREC-2014)
– „the new hot topic in our community“ (Nicoletta Calzolari, Pres. ELRA)
• Selected LLOD events in the last 3 months
– 4th Multilingual Semantic Web (Portoroz, Slovenia, June 2015)
– 1st Summer Datathon on LLOD (Cercedilla, Spain, June 2015)
– EUROLAN-2015 Summer School on “Linguistic Linked Open Data” (Sibiu, Romania, July 2015)
– LLOD-LSA Workshop at the Summer Institute of the Linguistic Society of America (Chicago, IL, July 2015)
– 4th Linked Data in Linguistics (Beijing, PRC, July 2015)
– Ontology session at ESSLLI-2015 (Barcelona, Spain, Aug 2015)
– 2nd Workshop on NLP&LOD (Hissar, Bulgaria, Sep 2015)
Linked Open Data for Linguists
Possible applications
• Linked Data
– rules of best practice for publishing data on the web
• protocols and standards
• links between data sets
=> improved access to distributed resources
=> improved (re-)usability of language resources
=> improved visibility of language resources
=> Information integration
– Structural interoperability
• comparable formats and protocols to access data
=> use the same query language for different data sets
– Conceptual interoperability
• develop and (re-)use shared vocabularies for equivalent concepts
=> the same query on different data sets
– Federation
• data published on the web
– with a query interface (SPARQL end point)
=> use a single query to query different datasets (SPARQL 1.1)
Achievable with any graph-based data model
The “killer application”, e.g., for annotated corpora
Conceptual Interoperability: Monolingual
• When language resources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)
Word     Susanne   Penn
The      AT        DT
Fulton   NP1s      NNP
County   NNL1cb    NNP
Grand    JJ        NNP
Jury     NN1c      NNP
said     VVDv      VBD
Friday   NPD1      NNP
Susanne: 395 tags (word classes, morphological features, syntactic features, lexical classes)
Penn: 57 tags (word classes, number and degree)
• Integrating both resources allows us to
– apply more wide-scale statistical analyses
– increase training data for supervised POS tagging
– increase test data for unsupervised POS tagging
Conceptual Interoperability: Multilingual
• with interoperable POS tags used across different languages, …
– we can apply the same unlexicalized NLP tools (e.g., parsers, cf. McDonald et al. 2013)
– we can perform comparative corpus studies
– we simplify multilingual annotation projection
Conceptual Interoperability
• Multiple terminology repositories exist
– available over the web
– RDF representation
• Are linked with each other
– for language IDs: Glottolog & lexvo.org
– for lexical senses: WordNets (ILI)
– for grammatical categories & features: GOLD, ISOcat, OLiA
Linked Terminologies
[Figure: Ontologies of Linguistic Annotation (OLiA). Three layers are shown. Language resources (dictionaries, corpora, NLP tools) are described by (resource-specific) Annotation Models (EAGLES, MULTEXT/East, STTS, TIGER, Connexor, TüBa-D/Z, Penn, Brown, Susanne, etc.; covering English, German, 11 European languages, and 15 (mostly) Eastern European languages). These connect through the OLiA Reference Model to External Reference Models (terminology repositories): GOLD, ISOcat (morpho-syntax), OntoTag (morpho-syntax), TDS ontology.]
Conceptual Interoperability
Penn
The      DT       Determiner, PronounOrDeterminer
Fulton   NNP      ProperNoun, Noun, hasNumber.Singular
County   NNP      ProperNoun, Noun, hasNumber.Singular
Grand    NNP      ProperNoun, Noun, hasNumber.Singular
Jury     NNP      ProperNoun, Noun, hasNumber.Singular
said     VBD      (MainVerb or StrictAuxiliaryVerb), hasTense.Past [sic!]
Friday   NNP      ProperNoun, Noun, hasNumber.Singular
Susanne
The      AT       DefiniteArticle, Article, Determiner, PronounOrDeterminer
Fulton   NP1s     Surname, ProperNoun, Noun, hasNumber.Singular
County   NNL1cb   TopographicalNoun, ProperNoun, Noun, hasNumber.Singular
Grand    JJ       Adjective, hasDegree.Positive
Jury     NN1c     CommonNoun, Noun, hasNumber.Singular
said     VVDv     MainVerb, hasTense.Past
Friday   NPD1     TemporalNoun, ProperNoun, Noun, hasNumber.Singular
The atomic statements are mostly identical, with just a few more from Susanne
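Once both tag sets are expressed as sets of atomic statements, comparing them reduces to set operations. The following Python sketch shows this for two words from the example sentence; the flat string encoding of the statements is a simplification assumed for illustration, not the OLiA representation itself.

```python
# Atomic statements per word, copied from the two annotation columns of the example.
PENN = {
    "Fulton": {"ProperNoun", "Noun", "hasNumber.Singular"},
    "Jury": {"ProperNoun", "Noun", "hasNumber.Singular"},
}
SUSANNE = {
    "Fulton": {"Surname", "ProperNoun", "Noun", "hasNumber.Singular"},
    "Jury": {"CommonNoun", "Noun", "hasNumber.Singular"},
}


def compare(word):
    """Return (shared, only in Penn, only in Susanne) for one word."""
    penn, susanne = PENN[word], SUSANNE[word]
    return penn & susanne, penn - susanne, susanne - penn


shared, only_penn, only_susanne = compare("Fulton")
```

For "Fulton" the two descriptions agree except that Susanne adds `Surname`; for "Jury" they disagree on `ProperNoun` vs. `CommonNoun` but still share the noun and number statements.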
What’s in for (Deep) MT?
Entity Linking: A Special Track for Proper Names
Re-Using Lexical Resources
Addressing Lexical Gaps
Bootstrapping Dictionaries
Improved Deep Analysis
Deep MT (© Jan Hajic, yesterday)
Entity Linking: A Special Track for Proper Names
Translating Proper Names
• Normally not directly translated, but maintained
– Differences in inflection
– Different writing systems (Cyrillic vs. Latin vs. Arabic, etc.)
– SMT: make sure your Language Model doesn’t override the Translation Model!
• Example: My grandfather's grandfather came to Germany in 1905.
– "Il nonno di mio nonno arrivò in Canada nel 1905." (Google Translate, Jul 2010)
– "Nonno di mio nonno è venuto in Germania nel 1905." (Google Translate, Sep 2012)
• Example: "Recentemente", conferma Maria Serena Balestracci, "mi ha telefonato un signore da Bologna, che aveva sentito parlare del libro alla radio
– "... a gentleman from London ..." (Google Translate, Oct 2010)
– "... a gentleman from Bologna ..." (Google Translate, Sep 2012)
• These errors stopped after Google bought Freebase
Trivial Entity Linking
• Named Entity Recognition -> treat NEs in a special way
• Entity Linking -> Link with an ontology, which may provide multilingual labels
– E.g., Entity Linking via DBpedia Spotlight
– Follow DBpedia-JRCNames linking
– Use multilingual label from JRCNames instead of translating yourself
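The idea of replacing a linked name by a label from the knowledge base can be sketched as follows. This is a toy illustration only: the two dictionaries are hand-written stand-ins for the output of an entity linker and for a multilingual label store, and the function name is hypothetical; nothing here calls DBpedia Spotlight or JRC-Names.

```python
# Hand-written stand-in for a multilingual label store: entity URI -> labels by language.
LABELS = {
    "http://dbpedia.org/resource/Germany": {"en": "Germany", "it": "Germania"},
    "http://dbpedia.org/resource/Bologna": {"en": "Bologna", "it": "Bologna"},
}

# Hand-written stand-in for entity linker output: surface form -> entity URI.
LINKS = {
    "Germany": "http://dbpedia.org/resource/Germany",
    "Bologna": "http://dbpedia.org/resource/Bologna",
}


def translate_names(tokens, links, labels, target_lang):
    """Replace linked tokens by their target-language label; leave the rest untouched."""
    output = []
    for token in tokens:
        uri = links.get(token)
        label = labels.get(uri, {}).get(target_lang) if uri else None
        output.append(label if label is not None else token)
    return output


translated = translate_names(["came", "to", "Germany"], LINKS, LABELS, "it")
```

Only the name is rewritten here; the remaining tokens would still go through the regular translation pipeline, which then has no opportunity to replace the name with a more frequent one.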
Problems with inflecting languages
• Just knowing about (one) possible label in another language doesn’t help much if you need to inflect a name
– or any other string label from a knowledge base
• Listing all forms helps with entity linking, but not with machine translation
=> We need linguistic LOD (LLOD)
Systematic inclusion of grammatical information
=> LLOD vocabularies and conventions
(Re-) Using lexical resources
Lemon: Lexicon Model for Ontologies
• Developed by the W3C Ontology-Lexica Community Group (OntoLex)
• Provides a data model for adding linguistic information to ontologies
• Widely used within the LLOD cloud
– Also by colleagues not participating in OntoLex
• E.g., PanLex (Long Now Foundation)
– “Abused” for any kind of lexical resource
• Even beyond the original ontology lexicalization use case
lemon Core
lemon Sample (Moran and Brümmer 2013)
Open World Assumption
• Unless explicitly stated, information is per se incomplete
– Additional information can be expressed
– E.g., using linguistic categories and features from terminology repositories
=> Grammatical information can be described in a reusable way
Recommended vocabularies: lexinfo, OLiA, GOLD
Vocabularies for Lexical-Conceptual Resources
• lemon provides data structures, but
– for content and metadata, it relies on external vocabularies
• Interoperability depends on a bundle of vocabularies
– WordNet, DBpedia, any ontology (lexical senses)
– lexvo (language identifiers)
– glottolog (languoid identifiers from linguistic typology)
– PHOIBLE (phoneme inventories and phonological structures)
– lexinfo (grammatical features for lexical resources)
– OLiA (annotations)
– ISOcat (resource metadata)
– GOLD (grammatical concepts)
Providing (lexical) resources in accordance with these vocabularies improves their reusability
This effect can be extended to NLP tools
Addressing lexical gaps
• Subsumption inference can partially compensate for the lack of lexical resources/coverage
=> If no counterpart for the target language is found, try hypernyms
A gaggle of photographers …
• Idea:
– translate something that doesn’t exist in the target language
• Assume you know that it refers to the English WordNet term gaggle-n. What would it be in German?
– http://wordnet-rdf.princeton.edu/ provides multilingual labels
• http://wordnet-rdf.princeton.edu/wn31/gaggle-n
wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = "de")
Basque, Finnish, Japanese, no German -> check hypernym
wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = "de")
Port. branco, Span. banda, … no German -> check indirect hypernyms
wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = "de")
French groupe, Cat. grup, Gal. grupo, … still no German -> check an external resource, say lemonUby (which we know to contain some German)
wn31:gaggle-n …/wn:synset_member/owl:sameAs ?gaggleUby FILTER regex(str(?gaggleUby), "http://lemon-model.net/lexica/uby/wn/.*") .
?gaggleUby … FILTER (lang(?gaggle-n-de) = "de")
• Traverse different resources (in the same end point) according to their structure, until something is found …
• Possible because the structure of these resources is lemon-conformant
• For resources outside the current end point, another SERVICE can be addressed -> federation
• Quite slow, though, but can be used to pre-compile word lists with generalisation
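The fallback strategy above (direct translation, then direct hypernym, then indirect hypernyms) amounts to a breadth-first search up the hypernym hierarchy. The Python sketch below illustrates it on a toy lexicon; the hypernym chain and the translation entries are assumptions made for the example and are not query results from WordNet RDF or lemonUby.

```python
from collections import deque

# Toy hypernym hierarchy and translations; illustrative only, not taken from WordNet.
HYPERNYMS = {"gaggle": ["flock"], "flock": ["group"], "group": []}
TRANSLATIONS = {
    "flock": {"es": "banda"},
    "group": {"fr": "groupe", "de": "Gruppe"},
}


def translate_with_fallback(term, lang, hypernyms, translations):
    """Return (label, matched term, hypernym distance) for the nearest term with a label."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        current, depth = queue.popleft()
        label = translations.get(current, {}).get(lang)
        if label is not None:
            return label, current, depth
        for parent in hypernyms.get(current, []):
            if parent not in seen:  # guard against cycles and repeated visits
                seen.add(parent)
                queue.append((parent, depth + 1))
    return None  # nothing found anywhere in the hierarchy


german = translate_with_fallback("gaggle", "de", HYPERNYMS, TRANSLATIONS)
```

Returning the distance along with the label lets a caller decide how much generalisation is acceptable, which matters when such lists are pre-compiled offline.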
Bootstrapping dictionaries
By transitivity
• Goal: Translate from Czech to Farsi
• No dictionary, but
– Czech-English, English-Farsi
• Quite noisy, though, hence check multiple paths
– Czech-Russian, Russian-Farsi
– …
-> Limit to forms with high confidence (by the number of paths, alternatives, etc.)
• Slow, again, but can be used for precompiling
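A minimal Python sketch of translation by transitivity with path counting follows. The word identifiers are placeholders (no real Czech, English, Russian or Farsi vocabulary is used), and counting the number of supporting pivot languages is just one possible confidence measure among those the slide hints at.

```python
from collections import defaultdict

# Each pivot path is a pair of dictionaries: source -> pivot words, pivot -> target words.
# All entries are placeholders for illustration.
VIA_ENGLISH = (
    {"cs:w1": {"en:a", "en:b"}},
    {"en:a": {"fa:x"}, "en:b": {"fa:y"}},
)
VIA_RUSSIAN = (
    {"cs:w1": {"ru:a"}},
    {"ru:a": {"fa:x"}},
)


def pivot_candidates(word, pivot_paths):
    """Count, for each target candidate, how many pivot languages lead to it."""
    counts = defaultdict(int)
    for source_to_pivot, pivot_to_target in pivot_paths:
        reached = set()
        for pivot_word in source_to_pivot.get(word, ()):
            reached |= set(pivot_to_target.get(pivot_word, ()))
        for target in reached:
            counts[target] += 1
    return dict(counts)


def confident_translations(word, pivot_paths, min_paths=2):
    """Keep only candidates supported by at least `min_paths` pivot languages."""
    candidates = pivot_candidates(word, pivot_paths)
    return sorted(t for t, n in candidates.items() if n >= min_paths)


candidates = pivot_candidates("cs:w1", [VIA_ENGLISH, VIA_RUSSIAN])
best = confident_translations("cs:w1", [VIA_ENGLISH, VIA_RUSSIAN])
```

Here the noisy candidate reached only through one English sense is filtered out, while the candidate confirmed by both pivots survives.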
Improved Deep Analysis
• Re-using externally provided tools (and data)
– Structural Interoperability
• The output of NLP tools can be represented in RDF
– NLP Interchange Format (NIF, nlp2rdf.org)
» If only one layer of analysis is considered
» For more complicated annotations and actual corpora, additional means are necessary, cf. POWLA (purl.org/powla)
– Conceptual Interoperability
• Represent and integrate the output of NLP tools with reference to LLOD repositories, e.g., the Ontologies of Linguistic Annotation (OLiA)
Comparing and combining heterogeneous linguistic analyses
diese nicht neue Erkenntnis
this not new insight
‘this well-known insight’
Connexor*: PRON Dem FEM SG NOM
RFTagger**: PRO.Dem.Attr.-3.Acc.Sg.Fem
* P. Tapanainen and T. Järvinen. 1997. A nonprojective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64–71, Washington, DC, April 1997.
** H. Schmid and F. Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, August 2008.
OLiA Reference Model descriptions
Connexor (PRON Dem FEM SG NOM):
rdf:type(olia:PronounOrDeterminer)
rdf:type(olia:Pronoun)
rdf:type(olia:DemonstrativePronoun)
olia:hasNumber(olia:Singular)
olia:hasGender(olia:Feminine)
olia:hasCase(olia:Nominative)
RFTagger (PRO.Dem.Attr.-3.Acc.Sg.Fem):
rdf:type(olia:PronounOrDeterminer)
rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativeDeterminer)
olia:hasNumber(olia:Singular)
olia:hasGender(olia:Feminine)
olia:hasCase(olia:Accusative)
100
Comparing and combining heterogeneous linguistic analyses
rdf:type(olia:PronounOrDeterminer)rdf:type(olia:Pronoun)rdf:type(olia:DemonstrativePronoun)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)olia:hasCase(olia:Nominative)
rdf:type(olia:PronounOrDeterminer)rdf:type(olia:Determiner)rdf:type(olia:DemonstrativeDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)olia:hasCase(olia:Accusative)
OLiA Reference Model descriptions
confidence ranking(simple voting)
rdf:type(olia:PronounOrDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)
rdf:type(olia:Pronoun)rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)
olia:hasCase(olia:Accusative)olia:hasCase(olia:Nominative)
predicted by both tools
predicted by one tool
101
Comparing and combining heterogeneous linguistic analyses
rdf:type(olia:PronounOrDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)
rdf:type(olia:Pronoun)rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)
olia:hasCase(olia:Accusative)olia:hasCase(olia:Nominative)
disambiguation: create the maximal consistent set S of descriptions
1. S is empty
2. process descriptions with decreasing confidence
a) if the current description is consistent with all descriptions in S, then add it to S
b) if not, skip it
c) iterate until all descriptions are processed
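The steps above amount to a greedy selection, which might look like this in Python. It is a minimal sketch under stated assumptions: the function name maximal_consistent_set, the toy CONFLICTS table and the shortened concept names are hypothetical and stand in for the ontology-based consistency check.

```python
def maximal_consistent_set(ranked, consistent):
    """Greedy selection: walk through descriptions in order of decreasing
    confidence and keep each one that is consistent with everything kept
    so far. `consistent` is any symmetric two-argument predicate."""
    selected = []  # step 1: S is empty
    for description in ranked:  # step 2: decreasing confidence
        if all(consistent(description, kept) for kept in selected):
            selected.append(description)  # step a)
        # step b): otherwise skip the description
    return selected  # step c): all descriptions processed


# Toy consistency predicate (assumption): an explicit table of conflicts.
CONFLICTS = {
    frozenset({"Pronoun", "Determiner"}),
    frozenset({"DemonstrativePronoun", "DemonstrativeDeterminer"}),
    frozenset({"Determiner", "DemonstrativePronoun"}),
    frozenset({"Pronoun", "DemonstrativeDeterminer"}),
}


def consistent(a, b):
    return frozenset({a, b}) not in CONFLICTS


ranked = [
    "PronounOrDeterminer",      # predicted by both tools
    "Determiner",               # the rest: predicted by one tool
    "Pronoun",
    "DemonstrativeDeterminer",
    "DemonstrativePronoun",
]
result = maximal_consistent_set(ranked, consistent)
print(result)
```

With this input order, "Determiner" is seen before "Pronoun", so the pronoun readings are skipped; a different order among equally ranked descriptions would flip the outcome.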
confidence ranking (simple voting)
predicted by both tools
predicted by one tool
102
Comparing and combining heterogeneous linguistic analyses
rdf:type(olia:PronounOrDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine)
rdf:type(olia:Pronoun), rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer)
olia:hasCase(olia:Accusative), olia:hasCase(olia:Nominative)
disambiguation: create the maximal consistent set S of descriptions
1. S is empty
2. process descriptions with decreasing confidence
a) if the current description is consistent with all descriptions in S, then add it to S
b) if not, skip it
c) iterate until all descriptions are processed
identify incompatible annotations
check consistency conditions in the ontology
103
Comparing and combining heterogeneous linguistic analyses
[Figure: excerpt of the OLiA concept hierarchy]
olia:PronounOrDeterminer is-a olia_top:MorphosyntacticCategory
olia:Pronoun is-a olia:PronounOrDeterminer
olia:Determiner is-a olia:PronounOrDeterminer
olia:DemonstrativePronoun is-a olia:Pronoun
olia:DemonstrativeDeterminer is-a olia:Determiner
siblings are inconsistent: rdf:type(olia:Pronoun), rdf:type(olia:Determiner)
cousins are inconsistent: rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer)
aunts/nieces, etc. are inconsistent: rdf:type(olia:Determiner), rdf:type(olia:DemonstrativePronoun)
A is consistent with B iff A ⊑ B or B ⊑ A
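The consistency condition can be illustrated with a small subsumption check over the hierarchy excerpt shown on this slide. This is a hedged sketch: the PARENT table encodes only the is-a links from the figure and assumes a single parent per concept, which is a simplification, and the helper names are hypothetical.

```python
# is-a links from the slide; one parent per concept is an assumption
# made for this sketch only.
PARENT = {
    "olia:PronounOrDeterminer": "olia_top:MorphosyntacticCategory",
    "olia:Pronoun": "olia:PronounOrDeterminer",
    "olia:Determiner": "olia:PronounOrDeterminer",
    "olia:DemonstrativePronoun": "olia:Pronoun",
    "olia:DemonstrativeDeterminer": "olia:Determiner",
}


def ancestors_or_self(concept):
    """Return the concept followed by all its ancestors, bottom-up."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def subsumed_by(a, b):
    """True if a is b or a (transitive) sub-concept of b."""
    return b in ancestors_or_self(a)


def consistent(a, b):
    """A is consistent with B iff one of them subsumes the other."""
    return subsumed_by(a, b) or subsumed_by(b, a)


# siblings, cousins and aunt/niece pairs come out as inconsistent
print(consistent("olia:Pronoun", "olia:Determiner"))
print(consistent("olia:DemonstrativePronoun", "olia:DemonstrativeDeterminer"))
print(consistent("olia:Determiner", "olia:DemonstrativePronoun"))
# a concept and one of its ancestors are consistent
print(consistent("olia:DemonstrativeDeterminer", "olia:PronounOrDeterminer"))
```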
104
Comparing and combining heterogeneous linguistic analyses
rdf:type(olia:PronounOrDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine)
rdf:type(olia:Pronoun), rdf:type(olia:Determiner)
rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer)
olia:hasCase(olia:Accusative), olia:hasCase(olia:Nominative)
disambiguation: create the maximal consistent set S of descriptions
consistency
1. S is empty
2. process descriptions with decreasing confidence
a) if the current description is consistent with all descriptions in S, then add it to S
b) if not, skip it
c) iterate until all descriptions are processed
⇒ from every equally-ranked pair of inconsistent descriptions: first come, first served (simple voting with random tie resolution)
105
Experiments
• we know that ensemble combination improves accuracy
– if so, we should observe an increase of accuracy for (at least some) combinations of tools
– but accuracy may be the wrong criterion
• inaccurate if the target annotation is less rich than one of the source annotations
⇒ measure recall, not accuracy
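The slides do not spell out how recall is computed. One plausible reading, given here only as an assumption, is the share of reference descriptions that also occur in the predicted set. The sketch below uses hypothetical toy data to show why a richer prediction is not penalized by recall although it fails an exact-match (accuracy-style) comparison.

```python
def recall(predicted, reference):
    """Share of reference descriptions found in the prediction.
    This definition is an assumption made for illustration."""
    if not reference:
        return 1.0
    return len(predicted & reference) / len(reference)


# Toy data: the target annotation is coarser than the prediction.
reference = {
    "rdf:type(olia:PronounOrDeterminer)",
    "rdf:type(olia:Determiner)",
}
predicted = {
    "rdf:type(olia:PronounOrDeterminer)",
    "rdf:type(olia:Determiner)",
    "rdf:type(olia:DemonstrativeDeterminer)",
    "olia:hasNumber(olia:Singular)",
}

exact_match = predicted == reference  # accuracy-style comparison
score = recall(predicted, reference)
print(exact_match, score)
```

Here the extra detail in the prediction makes the exact match fail, while every reference description is still recovered.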
106
Experiments (Chiarcos 2010)
• German newspaper corpora
• 10,000 tokens from each of the following newspaper corpora
– NEGRA (Skut et al. 1998)
– TIGER (Brants et al. 2002)
– Potsdam Commentary Corpus (PCC, Stede 2004)
• TIGER/NEGRA-style target annotation
107
[Figure: evaluation workflow]
input: corpus file with reference annotation in TIGER format; plain text, tokenized
tools: RFTagger, TreeTagger, StanfordTagger, StanfordParser, BerkeleyParser, Morphisto (morphology), Connexor
annotation models: RFTagger annotation model, STTS annotation model (×4), Morphisto annotation model, Connexor annotation model, TIGER annotation model
output: set of OLiA reference model descriptions; maximal consistent description; comparison with reference description
108
Experiments Morphosyntax: Example
Diese nicht neue Erkenntnis
⇒ PronounOrDeterminer & Determiner & DemonstrativeDeterminer
109
Experiments Morphosyntax: Recall
* StanfordTagger was trained on NEGRA
110
Experiments Morphosyntax: Recall
• continuous increase of (avg.) recall
• combination of 5-6 tools outperforms best-performing single tool
– except StanfordTagger on NEGRA
• trained on NEGRA
⇒ table for individual combinations
111
Experiments Morphosyntax: Results
• best-performing combinations (NEGRA)
1. Stanford Tagger (98.97% recall)
2. -“- + Stanford Parser (98.71% recall)
3. -“- + TreeTagger (99.00% recall)
4. Stanford Tagger + Stanford Parser + Morphisto + RFTagger (98.87% recall)
5. Stanford Tagger + Stanford Parser + TreeTagger + RFTagger + Connexor (98.29% recall)
⇒ marginal decrease of performance of best-performing tools
112
Experiments Morphosyntax: Results
• worst-performing combinations (NEGRA) (Berkeley Parser excluded)
1. Morphisto (70.06% recall)
2. -“- + Connexor (86.05% recall)
3. -“- + TreeTagger (91.90% recall)
4. -“- + RFTagger (94.29% recall)
5. -“- + StanfordTagger (96.10% recall)
⇒ rapid increase of performance for worst-performing tools
113
Findings
• result is a consistent set of ontological descriptions
– no loss of detail when trained/evaluated against a target annotation
• with different granularity
⇒ can be evaluated against corpora with different target annotation
– natural handling of different granularities
• hierarchical structures
– string-based representation can be generated
114
More Experiments
Comparable results for German morphology (Chiarcos 2010, 3 tools)
Similar results for German dependency/edge labels (Chiarcos, unpublished, 5 tools, labels only)
Different use case, similar methodology: Pareja et al. (2010), Spanish particle se
115
Even more experiments
• Instead of combining existing tools, we can also train tools directly on ontological representations of annotations
– Even if these originate from different annotations
• With ontology-based pruning, these yield ontologically consistent descriptions
– Chiarcos & Sukhareva (NLP&LOD2, next week)
• Trained a neural network, encoded and decoded with OLiA representations
• Increased granularity (depth of analysis), stable accuracy
– Replicate for discourse parsing
• Based on Chiarcos (2014)
116
Summary
• LLOD
– provides resources
– facilitates interoperability
• data, tools, annotations, lexical resources
– facilitates access/integration of heterogeneous/distributed information
• is a community effort
– depends on your input
117
Want to stay/get involved?
• Join our discussions / meetings
– present your resources, interests, questions, etc.
• Open Linguistics Working Group
– http://linguistics.okfn.org/
• mailing list, telcos, meetings and events
– also consider the relevant W3C CGs, e.g.,
• OntoLex => lexical-conceptual resources
• BP-MLOD => best practice guidelines
• LD4LT => NLP applications
118
Thanks a lot !