From monolithic XML for print/web to lean XML for data:
realising linked data for dictionaries
Matt Kohl & Sandro Cirulli, Language Technologists
Oxford University Press (OUP)
7 June 2014
Introduction: Oxford University Press
- World-renowned dictionary publisher
- Licensing partner for lexical data
Introduction: Shifts in Publishing
- New trends & demands
- Emerging technologies & markets
- Importance of well-structured, semantically-rich data
- Speed!
Data Modelling: Our Current Dictionary Data Models
- Print-oriented: designed to capture dictionary layout
- Monolithic: one enormous document
- Permissive: continually loosened to accommodate new texts
Can't give us the flexibility we need
Data Modelling: Requirements
A new approach should:
- Represent language concepts, not layouts
- Enable data reusability for different products & services
- Allow only one, clear way to model any given lexical item
Data Modelling: The New Lexical Schema
Data Conversion: Moving Data into the Lexical Schema
Conversion Framework Requirements
- Scalability: convert 40+ data-sets
- Standardization: harmonize variation inside the data-sets
- Modularity: enable customization, slotting in & out of QA, etc.
Data Conversion: Tools
- XProc
- XSpec
- Schematron & XML Schema
- Jenkins CI
- Agile methodology
Data Conversion: Simplified XProc pipeline
print-focused XML + xml:lang = "es"
-> XSL transformations
-> Schematron validation
-> enhanced XML
-> XSL transformations
-> XML Schema validation
-> Schematron validation
-> Lexical Data
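The modularity of the pipeline above can be illustrated with a minimal Python sketch. It is not the authors' XProc code: the step functions, the `entry` wrapper element, and the rule that every `se2` needs a `num` attribute are hypothetical stand-ins for the XSL transformation and Schematron validation stages. The point is only that each stage takes a document and returns a document, so stages can be slotted in and out of the list.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_language(root, lang="es"):
    """Stamp the source document with its language (the '+ xml:lang' step)."""
    root.set(XML_LANG, lang)
    return root


def check_numbered_senses(root):
    """Hypothetical Schematron-style assertion: every se2 carries a num."""
    missing = [el for el in root.iter("se2") if not el.get("num")]
    if missing:
        raise ValueError(f"{len(missing)} se2 element(s) lack a num attribute")
    return root


def run_pipeline(root, steps):
    """Apply each step in order; a validation step raises to stop the run."""
    for step in steps:
        root = step(root)
    return root


# 'entry' is an assumed wrapper element, used here only for illustration.
source = ET.fromstring(
    '<entry><se2 num="2"><df>Persona que tiene mal caracter</df></se2></entry>'
)
result = run_pipeline(source, [add_language, check_numbered_senses])
print(result.get(XML_LANG))
```

Dropping `check_numbered_senses` from the list skips that check without touching the other steps, which is the kind of slotting in and out of QA the requirements describe.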
Data Conversion: Build Workflow
Check code in SVN
-> Jenkins build
   - Ant script
   - XSpec unit tests
   - XProc pipeline
-> Linguistic QA
-> Passes?
   No: Update code (back to Check code in SVN)
   Yes: Archive artefacts -> Tag release in SVN via Jenkins
Results & Discussion: Source data
A sense of 'malauva' from a monolingual Spanish dictionary
<ACEPCIO ACEP="2">
  <AREA-GEO>Esp</AREA-GEO>
  <NIVELL>coloquial</NIVELL>
  <SIGNIFICAT>Persona que tiene mal caracter o mala intencion.</SIGNIFICAT>
  <SINONIM>malaleche.</SINONIM>
</ACEPCIO>
Results & Discussion: OUP XML
Print-focused DTD
<se2 num="2">
  <lg><ge>Esp</ge><reg>coloquial</reg></lg>
  <msDict type="core">
    <df>Persona que tiene mal caracter o mala intencion.</df>
    <syn>malaleche</syn>
  </msDict>
</se2>
New Lexical XSD
<sense register="informal" region="ES">
  <definitions>
    <definition>
      <text>Persona que tiene mal caracter o mala intencion</text>
    </definition>
  </definitions>
  <synonyms>
    <synonym>malaleche</synonym>
  </synonyms>
</sense>
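To make the mapping between the two models concrete, here is a minimal Python sketch of the conversion for this one sense. The actual conversion is done with XSL transformations; the lookup tables and the function name `convert_sense` below are assumptions made for illustration, covering only the values that appear in the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables: the real conversion rules are not given in the slides.
REGISTER_MAP = {"coloquial": "informal"}
REGION_MAP = {"Esp": "ES"}


def convert_sense(se2):
    """Build a <sense> element from a print-focused <se2> element."""
    sense = ET.Element("sense")
    sense.set("register", REGISTER_MAP[se2.findtext("lg/reg")])
    sense.set("region", REGION_MAP[se2.findtext("lg/ge")])

    definitions = ET.SubElement(sense, "definitions")
    definition = ET.SubElement(definitions, "definition")
    # Normalise whitespace and drop the print-style trailing full stop.
    text = " ".join(se2.findtext("msDict/df").split()).rstrip(".")
    ET.SubElement(definition, "text").text = text

    synonyms = ET.SubElement(sense, "synonyms")
    for syn in se2.iterfind("msDict/syn"):
        ET.SubElement(synonyms, "synonym").text = syn.text.strip()
    return sense


se2 = ET.fromstring(
    '<se2 num="2"><lg><ge>Esp</ge><reg>coloquial</reg></lg>'
    '<msDict type="core"><df>Persona que tiene mal caracter o mala intencion.</df>'
    "<syn>malaleche</syn></msDict></se2>"
)
print(ET.tostring(convert_sense(se2), encoding="unicode"))
```

Layout-oriented labels (`Esp`, `coloquial`) become controlled attribute values, and print punctuation is stripped, so the output describes the sense rather than how it is typeset.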
Next Steps: Scale It Up
- Consolidate data in an XML database
- Build an RDF layer on top of the XML database
- Leverage the Semantic Web to enhance our data
Next Steps: Prototype RDF/XML
<Sense rdf:about="sense:es_noun_malauva_se_2">
  <isDescribedBy rdf:resource="definition:es_noun_malauva_se_2_def_1"/>
  <hasRegister rdf:resource="register:informal"/>
  <hasRegion rdf:resource="region:ES"/>
  <hasSynonym rdf:resource="lemma:a5e644"/>
</Sense>
<StandardDefinition rdf:about="definition:es_noun_malauva_se_2_def_1">
  <rdfs:label xml:lang="es">Persona que tiene mal caracter o mala intencion</rdfs:label>
</StandardDefinition>
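The prototype above can be read as a set of subject, predicate, object statements. The following Python sketch derives such statements from a `sense` element in the new lexical model. It is an illustration under assumptions, not the authors' RDF layer: the function name, the tuple representation, and the way identifiers are passed in are all hypothetical, since the slides do not say how identifiers such as `lemma:a5e644` are minted.

```python
import xml.etree.ElementTree as ET


def sense_to_triples(sense, sense_id, synonym_ids):
    """Return (subject, predicate, object) tuples for one <sense> element.

    sense_id and synonym_ids are supplied by the caller (assumption): the
    slides do not describe how these identifiers are generated.
    """
    subject = f"sense:{sense_id}"
    triples = [
        (subject, "rdf:type", "Sense"),
        (subject, "hasRegister", f"register:{sense.get('register')}"),
        (subject, "hasRegion", f"region:{sense.get('region')}"),
    ]
    definitions = sense.iterfind("definitions/definition")
    for number, definition in enumerate(definitions, start=1):
        definition_id = f"definition:{sense_id}_def_{number}"
        triples.append((subject, "isDescribedBy", definition_id))
        triples.append((definition_id, "rdf:type", "StandardDefinition"))
        triples.append((definition_id, "rdfs:label", definition.findtext("text")))
    for synonym in sense.iterfind("synonyms/synonym"):
        triples.append((subject, "hasSynonym", synonym_ids[synonym.text]))
    return triples


sense = ET.fromstring(
    '<sense register="informal" region="ES">'
    "<definitions><definition><text>Persona que tiene mal caracter o mala intencion"
    "</text></definition></definitions>"
    "<synonyms><synonym>malaleche</synonym></synonyms></sense>"
)
triples = sense_to_triples(sense, "es_noun_malauva_se_2", {"malaleche": "lemma:a5e644"})
for triple in triples:
    print(triple)
```

A production RDF layer would more likely use an RDF library and full IRIs; plain tuples are used here only to keep the sketch self-contained.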
RDF Data Extraction: Musical terms in English & Spanish
[Figure: graph of musical terms extracted from the RDF data. Labelled nodes: choir, chant, air, hook, strain, chorus, chorale, ensemble, song, tune, aria. Linked terms: coral, hook, choral, coro, conjunto, canción, melodía, aria, estribillo, tono, aire, salmodia.]
Inference mechanism
- word sense X hasAntonym word sense Y
- word sense Y hasSynonym word sense Z
- word sense X hasAntonym word sense Z (inferred)
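One reading of the diagram is the rule "if X hasAntonym Y and Y hasSynonym Z, then X hasAntonym Z". The Python sketch below applies that rule once over a set of triples. The rule itself is an interpretation of the slide, and the function name and tuple representation are assumptions. In practice such rules would typically be expressed in the RDF store or a reasoner rather than in application code.

```python
def infer_antonyms(triples):
    """Return new antonym triples implied by the rule
    X hasAntonym Y, Y hasSynonym Z  =>  X hasAntonym Z.

    The rule is applied in a single pass, not repeated to a fixed point.
    """
    known = set(triples)

    # Index synonyms by subject for quick lookup.
    synonyms = {}
    for subject, predicate, obj in known:
        if predicate == "hasSynonym":
            synonyms.setdefault(subject, set()).add(obj)

    inferred = set()
    for subject, predicate, obj in known:
        if predicate == "hasAntonym":
            for synonym in synonyms.get(obj, ()):
                candidate = (subject, "hasAntonym", synonym)
                if candidate not in known:
                    inferred.add(candidate)
    return inferred


facts = {
    ("word sense X", "hasAntonym", "word sense Y"),
    ("word sense Y", "hasSynonym", "word sense Z"),
}
print(infer_antonyms(facts))
```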
Summary
- Overall project requirements
  - Moving from products to platforms and services
  - Supporting current business needs while innovating
  - Adapting in nimble ways to fast-changing market requirements
  - Focusing on time and cost efficiency
- Data model
  - Content-driven
  - Machine-interpretable
  - Modular
  - Evolvable/adaptable
- Conversion process
  - Highly automated
  - Modular
  - Scalable
Thank you for your attention! Any questions?
Matt Kohl: [email protected]
Sandro Cirulli: [email protected]