
From monolithic XML for print/web to lean XML for data: realising linked data for dictionaries

Matt Kohl & Sandro Cirulli
Language Technologists
Oxford University Press (OUP)

7 June 2014

Introduction: Oxford University Press

- World-renowned dictionary publisher
- Licensing partner for lexical data

Introduction: Shifts in Publishing

- New trends & demands
- Emerging technologies & markets
- Importance of well-structured, semantically-rich data
- Speed!

Data Modelling: Our Current Dictionary Data Models

- Print-oriented: designed to capture dictionary layout
- Monolithic: one enormous document
- Permissive: continually loosened to accommodate new texts

Can't give us the flexibility we need

Data Modelling: Requirements

A new approach should:

- Represent language concepts, not layouts
- Enable data reusability for different products & services
- Allow only one, clear way to model any given lexical item

Data Modelling: The New Lexical Schema

Data Conversion: Moving Data into the Lexical Schema

Conversion Framework Requirements

- Scalability: convert 40+ data-sets
- Standardization: harmonize variation inside the data-sets
- Modularity: enable customization, slotting in & out of QA, etc.

Data Conversion: Tools

- XProc
- XSpec
- Schematron & XML Schema
- Jenkins CI
- Agile methodology

Data Conversion: Simplified XProc pipeline

print-focused XML + xml:lang = "es"
-> XSL transformations
-> Schematron validation
-> enhanced XML
-> XSL transformations
-> XML Schema validation
-> Schematron validation
-> Lexical Data
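The flow on this slide can be approximated outside XProc as a chain of steps with validation gates. What follows is a minimal Python sketch using only the standard library. The function names, the `<e>`/`<entry>` element names, and the single rule are hypothetical stand-ins for the XSLT, Schematron, and XML Schema steps, not the pipeline actually used at OUP.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tag_language(root, lang):
    """Stamp the data-set language on the root (the '+ xml:lang' step)."""
    root.set(XML_LANG, lang)
    return root


def check(root, rules):
    """Schematron-like gate: each rule is (message, predicate); fail on the first violation."""
    for message, predicate in rules:
        if not predicate(root):
            raise ValueError(message)
    return root


def run_pipeline(root, lang, steps):
    """Apply the language tag, then run each step in order, passing the tree along."""
    tree = tag_language(root, lang)
    for step in steps:
        tree = step(tree)
    return tree


# Hypothetical stand-in for one transformation step.
def rename_entries(root):
    for node in root.iter("e"):
        node.tag = "entry"
    return root


# Hypothetical stand-in for one validation rule.
has_entries = [("no <entry> elements found", lambda r: r.find("entry") is not None)]

source = ET.fromstring("<dict><e>malauva</e></dict>")
result = run_pipeline(source, "es", [rename_entries, lambda r: check(r, has_entries)])
```

Because each step takes a tree and returns a tree, a validation gate or an extra transformation can be slotted in or out by editing the `steps` list, which mirrors the modularity requirement stated earlier.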


Data Conversion: Build Workflow

Check code in SVN -> Jenkins build -> Linguistic QA -> Passes?

Jenkins build:
- Ant script
- XSpec unit tests
- XProc pipeline

Passes?
- No: Update code, then check code in SVN again
- Yes: Archive artefacts, tag release in SVN (via Jenkins)


Results & Discussion: Source data

A sense of 'malauva' from a monolingual Spanish dictionary

<ACEPCIO ACEP="2">
  <AREA-GEO>Esp</AREA-GEO>
  <NIVELL>coloquial</NIVELL>
  <SIGNIFICAT>Persona que tiene mal caracter o mala intencion.</SIGNIFICAT>
  <SINONIM>malaleche.</SINONIM>
</ACEPCIO>

Results & Discussion: OUP XML

Print-focused DTD

<se2 num="2">
  <lg>
    <ge>Esp</ge>
    <reg>coloquial</reg>
  </lg>
  <msDict type="core">
    <df>Persona que tiene mal caracter o mala intencion.</df>
    <syn>malaleche</syn>
  </msDict>
</se2>

New Lexical XSD

<sense register="informal" region="ES">
  <definitions>
    <definition>
      <text>Persona que tiene mal caracter o mala intencion</text>
    </definition>
  </definitions>
  <synonyms>
    <synonym>malaleche</synonym>
  </synonyms>
</sense>
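The mapping between the two XML shapes on this slide can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the slides describe XSL transformations inside an XProc pipeline, not Python, and the two lookup tables below are inferred from this single example (`coloquial` to `informal`, `Esp` to `ES`). The function name `convert_sense` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Value mappings inferred from the single example on the slide; a real
# conversion would need complete, data-set-specific tables.
REGISTER = {"coloquial": "informal"}
REGION = {"Esp": "ES"}


def convert_sense(se2):
    """Turn a print-focused <se2> element into a <sense> element."""
    sense = ET.Element("sense")
    reg = se2.findtext("lg/reg")
    geo = se2.findtext("lg/ge")
    if reg in REGISTER:
        sense.set("register", REGISTER[reg])
    if geo in REGION:
        sense.set("region", REGION[geo])

    definitions = ET.SubElement(sense, "definitions")
    for df in se2.iterfind("msDict/df"):
        definition = ET.SubElement(definitions, "definition")
        text = ET.SubElement(definition, "text")
        # Normalise whitespace and drop the print-style trailing full stop.
        text.text = " ".join((df.text or "").split()).rstrip(".")

    synonyms = ET.SubElement(sense, "synonyms")
    for syn in se2.iterfind("msDict/syn"):
        ET.SubElement(synonyms, "synonym").text = (syn.text or "").strip()
    return sense


source = """<se2 num="2">
  <lg><ge>Esp</ge><reg>coloquial</reg></lg>
  <msDict type="core">
    <df>Persona que tiene mal caracter o mala intencion.</df>
    <syn>malaleche</syn>
  </msDict>
</se2>"""

sense = convert_sense(ET.fromstring(source))
print(ET.tostring(sense, encoding="unicode"))
```

Register and region move from child elements to attributes with controlled values, and the print-oriented punctuation is dropped, which is the shift from layout to language concepts that the new schema aims for.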

Next steps: Scale It Up

- Consolidate data in an XML database
- Build an RDF layer on top of the XML database
- Leverage Semantic Web to enhance our data

Next Steps: Prototype RDF/XML

<Sense rdf:about="sense:es_noun_malauva_se_2">
  <isDescribedBy rdf:resource="definition:es_noun_malauva_se_2_def_1"/>
  <hasRegister rdf:resource="register:informal"/>
  <hasRegion rdf:resource="region:ES"/>
  <hasSynonym rdf:resource="lemma:a5e644"/>
</Sense>
<StandardDefinition rdf:about="definition:es_noun_malauva_se_2_def_1">
  <rdfs:label xml:lang="es">Persona que tiene mal caracter o mala intencion</rdfs:label>
</StandardDefinition>
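As a rough illustration of how such statements might be derived from the lean XML, the Python sketch below emits subject/predicate/object triples for one `<sense>` element. The identifier pattern (`sense:{lang}_{pos}_{lemma}_se_{n}`) is inferred from this single prototype. The function `sense_to_triples` and the `lemma_ids` lookup are hypothetical; the slides do not show how lemma identifiers such as `lemma:a5e644` are minted, so they are passed in rather than computed.

```python
import xml.etree.ElementTree as ET


def sense_to_triples(sense, lang, pos, lemma, number, lemma_ids):
    """Return (subject, predicate, object) triples for one <sense> element.

    lemma_ids maps a synonym headword to its lemma identifier (assumed to
    be supplied by some other component).
    """
    subject = f"sense:{lang}_{pos}_{lemma}_se_{number}"
    triples = []
    texts = sense.iterfind("definitions/definition/text")
    for i, text in enumerate(texts, start=1):
        definition = f"definition:{lang}_{pos}_{lemma}_se_{number}_def_{i}"
        triples.append((subject, "isDescribedBy", definition))
        triples.append((definition, "rdfs:label", (text.text or "").strip()))
    if sense.get("register"):
        triples.append((subject, "hasRegister", "register:" + sense.get("register")))
    if sense.get("region"):
        triples.append((subject, "hasRegion", "region:" + sense.get("region")))
    for syn in sense.iterfind("synonyms/synonym"):
        triples.append((subject, "hasSynonym", lemma_ids[syn.text]))
    return triples


lean = ET.fromstring(
    '<sense register="informal" region="ES">'
    "<definitions><definition>"
    "<text>Persona que tiene mal caracter o mala intencion</text>"
    "</definition></definitions>"
    "<synonyms><synonym>malaleche</synonym></synonyms>"
    "</sense>"
)
triples = sense_to_triples(
    lean, "es", "noun", "malauva", 2, {"malaleche": "lemma:a5e644"}
)
for triple in triples:
    print(triple)
```

The sketch leaves out the language tag on the label and the RDF/XML serialisation itself; an RDF library would normally handle both.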

RDF Data extraction: Musical terms in English & Spanish

[Figure: graph linking English terms (choir, chant, air, hook, strain, chorus, chorale, ensemble, song, tune, aria) and Spanish terms (coral, hook, choral, coro, conjunto, canción, melodía, aria, estribillo, tono, aire, salmodia)]

Inference mechanism

word sense X --hasAntonym--> word sense Y --hasSynonym--> word sense Z

Inferred: word sense X --hasAntonym--> word sense Z
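Reading the diagram as the rule "if X hasAntonym Y and Y hasSynonym Z, then X hasAntonym Z", the inference can be sketched in a few lines of Python. This is an illustrative reading of the slide, not the reasoner used in the project. The function name `infer_antonyms` is hypothetical, relations are treated as directed pairs, and the rule is applied in a single pass rather than to a fixpoint.

```python
def infer_antonyms(antonyms, synonyms):
    """Return antonym pairs implied by the rule
    hasAntonym(x, y) and hasSynonym(y, z) => hasAntonym(x, z).

    Both arguments are sets of (subject, object) pairs; pairs already
    present in `antonyms` are not reported again.
    """
    inferred = set()
    for x, y in antonyms:
        for y2, z in synonyms:
            if y == y2 and x != z and (x, z) not in antonyms:
                inferred.add((x, z))
    return inferred


antonyms = {("X", "Y")}
synonyms = {("Y", "Z")}
print(infer_antonyms(antonyms, synonyms))  # {('X', 'Z')}
```

In an RDF setting the same rule would more likely be expressed declaratively, for example as a property chain or a query, and evaluated by the triple store.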

Summary

- Overall project requirements
  - Moving from products to platforms and services
  - Supporting current business needs while innovating
  - Adapting in nimble ways to fast changing market requirements
  - Focusing on time and cost efficiency
- Data model
  - Content driven
  - Machine interpretable
  - Modular
  - Evolvable/adaptable
- Conversion process
  - Highly automated
  - Modular
  - Scalable

Thank you for your attention! Any questions?

Matt Kohl: [email protected]
Sandro Cirulli: [email protected]