From monolithic XML for print/web to lean XML for data:
realising linked data for dictionaries
Matt Kohl & Sandro Cirulli
Language Technologists
Oxford University Press (OUP)
7 June 2014
Introduction: Oxford University Press
- World-renowned dictionary publisher
- Licensing partner for lexical data
Introduction: Shifts in Publishing
- New trends & demands
- Emerging technologies & markets
- Importance of well-structured, semantically-rich data
- Speed!
Data Modelling: Our Current Dictionary Data Models
- Print-oriented: designed to capture dictionary layout
- Monolithic: one enormous document
- Permissive: continually loosened to accommodate new texts
Can’t give us the flexibility we need
Data Modelling: Requirements
A new approach should:
- Represent language concepts, not layouts
- Enable data reusability for different products & services
- Allow only one, clear way to model any given lexical item
Data Modelling: The New Lexical Schema
Data Conversion: Moving Data into the Lexical Schema
Conversion Framework Requirements
- Scalability: convert 40+ data-sets
- Standardization: harmonize variation inside the data-sets
- Modularity: enable customization, slotting in & out of QA, etc.
Data Conversion: Tools
- XProc
- XSpec
- Schematron & XML Schema
- Jenkins CI
- Agile methodology
Data Conversion: Simplified XProc pipeline
print-focused XML + xml:lang="es" -> XSL transformations -> Schematron validation -> enhanced XML
enhanced XML -> XSL transformations -> XML Schema validation -> Schematron validation -> Lexical Data
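The staged flow above can be sketched in Python as a list of steps, each of which either transforms the document or validates it and raises on failure. This only illustrates the chaining idea and is not the XProc pipeline itself: the element names (`entry`, `hw`, `headword`), the single check, and every function name are hypothetical stand-ins for the real XSL, Schematron and XML Schema steps.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# A step takes a document and returns it (possibly modified), or raises.
Step = Callable[[ET.Element], ET.Element]

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_language(lang: str) -> Step:
    """Return a step that stamps xml:lang on the root element."""
    def step(doc: ET.Element) -> ET.Element:
        doc.set(XML_LANG, lang)
        return doc
    return step


def rename(old: str, new: str) -> Step:
    """Stand-in for an XSL transformation: rename every `old` element."""
    def step(doc: ET.Element) -> ET.Element:
        for el in doc.iter(old):
            el.tag = new
        return doc
    return step


def require(path: str, message: str) -> Step:
    """Stand-in for a validation step: fail unless `path` matches."""
    def step(doc: ET.Element) -> ET.Element:
        if doc.find(path) is None:
            raise ValueError(message)
        return doc
    return step


def run_pipeline(doc: ET.Element, steps: List[Step]) -> ET.Element:
    """Run the steps in order; the first failing validation stops the run."""
    for step in steps:
        doc = step(doc)
    return doc


steps = [
    add_language("es"),                               # input + xml:lang
    rename("hw", "headword"),                         # transformation
    require(".//headword", "entry has no headword"),  # validation
]
result = run_pipeline(ET.fromstring("<entry><hw>malauva</hw></entry>"), steps)
```

Because each step has the same signature, a QA check can be slotted in or out by editing the list, which is the modularity requirement in miniature.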
Data Conversion: Build Workflow
Check code in SVN -> Jenkins build -> Linguistic QA -> Passes?
Jenkins build:
- Ant script
- XSpec unit tests
- XProc pipeline
Passes? No -> Update code -> back to Check code in SVN
Passes? Yes -> Archive artefacts -> Tag release in SVN via Jenkins
Results & Discussion: Source data
A sense of 'malauva' from a monolingual Spanish dictionary
<ACEPCIO ACEP="2">
  <AREA-GEO>Esp</AREA-GEO>
  <NIVELL>coloquial</NIVELL>
  <SIGNIFICAT>Persona que tiene mal caracter o mala intencion.</SIGNIFICAT>
  <SINONIM>malaleche.</SINONIM>
</ACEPCIO>
Results & Discussion: OUP XML
Print-focused DTD
<se2 num="2">
  <lg><ge>Esp</ge><reg>coloquial</reg></lg>
  <msDict type="core">
    <df>Persona que tiene mal caracter o mala intencion.</df>
    <syn>malaleche</syn>
  </msDict>
</se2>
New Lexical XSD
<sense register="informal" region="ES">
  <definitions>
    <definition>
      <text>Persona que tiene mal caracter o mala intencion</text>
    </definition>
  </definitions>
  <synonyms>
    <synonym>malaleche</synonym>
  </synonyms>
</sense>
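A minimal sketch, in Python rather than XSLT, of the mapping from the print-focused markup to the new schema shown above. The lookup tables hold only the two values visible in this example (`coloquial` to `informal`, `Esp` to `ES`); the function name and table names are hypothetical, and the actual conversion is done with XSL transformations.

```python
import xml.etree.ElementTree as ET

# Assumed lookup tables: only the values visible on the slide.
REGISTER = {"coloquial": "informal"}
REGION = {"Esp": "ES"}


def convert_sense(se2: ET.Element) -> ET.Element:
    """Map a print-focused <se2> element to a <sense> element."""
    sense = ET.Element("sense")

    # Labels that were inline text in print become controlled attributes.
    reg = se2.findtext("lg/reg")
    geo = se2.findtext("lg/ge")
    if reg is not None:
        sense.set("register", REGISTER[reg.strip()])
    if geo is not None:
        sense.set("region", REGION[geo.strip()])

    definitions = ET.SubElement(sense, "definitions")
    for df in se2.iter("df"):
        definition = ET.SubElement(definitions, "definition")
        text = ET.SubElement(definition, "text")
        # Normalise whitespace and drop the print-style trailing full stop.
        text.text = " ".join((df.text or "").split()).rstrip(".")

    syns = list(se2.iter("syn"))
    if syns:
        synonyms = ET.SubElement(sense, "synonyms")
        for syn in syns:
            ET.SubElement(synonyms, "synonym").text = (syn.text or "").strip()
    return sense


source = ET.fromstring(
    '<se2 num="2"><lg><ge>Esp</ge><reg>coloquial</reg></lg>'
    '<msDict type="core"><df>Persona que tiene mal caracter o mala '
    'intencion.</df><syn>malaleche</syn></msDict></se2>'
)
sense = convert_sense(source)
print(ET.tostring(sense, encoding="unicode"))
```

An unknown register or region raises `KeyError` here rather than passing through silently, in keeping with the goal of allowing only one way to model a given item.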
Next Steps: Scale It Up
- Consolidate data in an XML database
- Build an RDF layer on top of the XML database
- Leverage Semantic Web to enhance our data
Next Steps: Prototype RDF/XML
<Sense rdf:about="sense:es_noun_malauva_se_2">
  <isDescribedBy rdf:resource="definition:es_noun_malauva_se_2_def_1"/>
  <hasRegister rdf:resource="register:informal"/>
  <hasRegion rdf:resource="region:ES"/>
  <hasSynonym rdf:resource="lemma:a5e644"/>
</Sense>
<StandardDefinition rdf:about="definition:es_noun_malauva_se_2_def_1">
  <rdfs:label xml:lang="es">Persona que tiene mal caracter o mala intencion</rdfs:label>
</StandardDefinition>
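One way to picture the RDF layer is as a function that reads a `<sense>` element and emits subject-predicate-object triples matching the prototype above. The sketch below is an assumption-laden illustration, not the project's implementation: the function name, the tuple representation, and the `lemma_ids` lookup are hypothetical, and treating `a5e644` as the lemma identifier of `malaleche` is inferred from the slide rather than stated on it.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def sense_to_triples(sense: ET.Element, sense_id: str, lang: str,
                     lemma_ids: Dict[str, str]) -> List[Triple]:
    """Derive triples from a <sense> element of the new lexical schema."""
    subject = f"sense:{sense_id}"
    triples: List[Triple] = [(subject, "rdf:type", "Sense")]

    if sense.get("register"):
        triples.append((subject, "hasRegister", f"register:{sense.get('register')}"))
    if sense.get("region"):
        triples.append((subject, "hasRegion", f"region:{sense.get('region')}"))

    # Each definition becomes its own resource, numbered from 1.
    texts = sense.iterfind("definitions/definition/text")
    for n, text in enumerate(texts, start=1):
        definition = f"definition:{sense_id}_def_{n}"
        label = " ".join((text.text or "").split())
        triples.append((subject, "isDescribedBy", definition))
        triples.append((definition, "rdf:type", "StandardDefinition"))
        triples.append((definition, "rdfs:label", f'"{label}"@{lang}'))

    # Synonyms point at lemma resources via an assumed lookup table.
    for syn in sense.iterfind("synonyms/synonym"):
        lemma = lemma_ids[(syn.text or "").strip()]
        triples.append((subject, "hasSynonym", f"lemma:{lemma}"))
    return triples


sense = ET.fromstring(
    '<sense register="informal" region="ES"><definitions><definition>'
    '<text>Persona que tiene mal caracter o mala intencion</text>'
    '</definition></definitions><synonyms><synonym>malaleche</synonym>'
    '</synonyms></sense>'
)
triples = sense_to_triples(sense, "es_noun_malauva_se_2", "es",
                           {"malaleche": "a5e644"})  # assumed mapping
```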
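RDF Data extraction: Musical terms in English & Spanish
[Figure: graph linking musical terms across the two languages. Node labels: choir, chant, air, hook, strain, chorus, chorale, ensemble, song, tune, aria; coral, hook, choral, coro, conjunto, canción, melodía, aria, estribillo, tono, aire, salmodia. The links between nodes are not recoverable from the extracted text.]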
Inference mechanism
word sense X -hasAntonym-> word sense Y -hasSynonym-> word sense Z
word sense X -hasAntonym-> word sense Z (inferred)
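Reading the diagram as the rule "if X hasAntonym Y and Y hasSynonym Z, then X hasAntonym Z" (an interpretation of the slide, which does not spell the rule out), the inference can be sketched as a small fixed-point loop over triples. The function name and the tuple representation are hypothetical; a real RDF layer would more likely express this as a rule or query in the triple store.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def infer_antonyms(triples: Set[Triple]) -> Set[Triple]:
    """Apply the antonym-of-a-synonym rule until nothing new appears."""
    known = set(triples)
    while True:
        new = {
            (x, "hasAntonym", z)
            for (x, p1, y) in known if p1 == "hasAntonym"
            for (y2, p2, z) in known if p2 == "hasSynonym" and y2 == y
            if x != z  # never make a sense its own antonym
        }
        if new <= known:  # fixed point: the rule adds nothing further
            return known
        known |= new


facts = {
    ("X", "hasAntonym", "Y"),
    ("Y", "hasSynonym", "Z"),
}
inferred = infer_antonyms(facts)
```

The loop repeats because a newly inferred antonym can itself feed the rule when the synonym chain is longer than one link.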
Summary
- Overall project requirements
  - Moving from products to platforms and services
  - Supporting current business needs while innovating
  - Adapting in nimble ways to fast changing market requirements
  - Focusing on time and cost efficiency
- Data model
  - Content driven
  - Machine interpretable
  - Modular
  - Evolvable/adaptable
- Conversion process
  - Highly automated
  - Modular
  - Scalable