Encoding Information for Interchange: St Malo, 1998

Encoding Information for Interchange

An introduction to the TEI

Lou Burnard

Humanities Computing Unit

Oxford University

The problem

• SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs

• But to use it, you need a formal specification (aka document type definition, or DTD)

• Where do you get one from?

• How do you choose?

Some answers

• Roll your own
  – from scratch
  – within an existing framework

• Take what’s on offer

• Use the TEI architecture

The Text Encoding Initiative

Origins and Goals

Modular Architecture

Customization

Where did the TEI come from?

• From the humanities research community
  – librarians and cybernauts
  – linguists, historians, lexicographers...

• Sponsors
  – ACH: Association for Computers and the Humanities
  – ACL: Association for Computational Linguistics
  – ALLC: Association for Literary and Linguistic Computing

• Funders
  – U.S. National Endowment for the Humanities
  – Mellon Foundation
  – Commission of European Communities DG XIII
  – Social Sciences and Humanities Research Council of Canada

… and where is it going?

• Continued work in new application areas
  – manuscript description
  – physical description
  – non-SGML data
  – XML conformance

• Continued take-up

• Need for new infrastructure

• Corrected reprint of P3 due summer 1998

a user-driven codification of existing best practice

Goals of the TEI

• better interchange and integration of data

• support for all texts, in all languages, from all periods

• guidance for the perplexed: what to encode

• assistance for the specialist: how to encode any information of interest

... but no software

TEI Deliverables

• coherent set of recommendations for text encoding

• comprising several distinct SGML tagsets

• based on existing practice

• documented in a reference manual

• tutorials for general and specialised audiences

The TEI modus operandi...

• identify significant particularities independent of notation or realisation

• avoid controversy, over-delicacy, inadequacy

• seek generalizable solutions, acceptable to a consensus

... and some consequences

• focus on content, not presentation

• descriptive, not prescriptive

• Occam's razor

• modular, extensible dtd

• highly general in application, needs customization for particular areas

Who uses TEI?

• see http://www-tei.uic/orgs/tei/app/

• digital librarians and archivists
  – LC, HTI, UVA, CETH, OTA...

• Language Engineering projects
  – EAGLES, BNC, MULTEXT, Parole, Silfide

• academic researchers
  – Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...

Designing your DTD

• How can a single mark-up scheme handle a large variety of requirements?
  – all texts are alike
  – every text is different

• Learn from the database designers
  – one construct, many views
  – each view a selection from the whole

or is there a better way?

How many dtds might you need?

• one (the Corporate or WKWBFY approach)

• none (the Anarchic or NWEUMP approach)

• as many as it takes (the Mixed Economy or WNSA approach)

a single main DTD with many faces (a British DTD)

The TEI solution: modularization

• a (very) large number of element and attribute definitions

• organised as tagsets (core, base, additional, or auxiliary)

• grouped into classes

Combining Tag Sets

• And how does one combine tagsets? The how-many-dtds problem is back.
  – all tag sets, all the time (the table d'hôte model)
  – a few pre-selected combinations (the combination plate model)
  – in completely unconstrained abandon (the smorgasbord model)
  – one from column A, two from column B (the Chinese menu model)

The Chicago Pizza Model

<!ENTITY % base    "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(pepperoni | mushrooms | sausage | pepper | anchovies | ...)" >
<!ELEMENT pizza - - (%base;, (tomatoSauce & cheese), %topping;*) >

To build a view of the TEI dtd, take...

• the core tagsets

• the base of your choice

• the toppings of your choice

<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose 'INCLUDE' >
<!ENTITY % TEI.analysis 'INCLUDE' >
]>
<TEI.2>
.....
</TEI.2>

… trim to fit ...

• user extension files

• rename elements

• undefine elements to be redefined* or removed

<!ENTITY % TEI.extensions.ent SYSTEM 'myMods.ent' >

<!ENTITY % n.p 'para' >
<!ENTITY % seg 'IGNORE' >

* see later

… and cook thoroughly

• ‘compile’ the dtd to remove all parameterization

• easier to use for some software

• better project management

• see http://firth.natcorp.ox.ac.uk/~tei/pizza.html

• don’t forget the documentation!

TEI base tagsets

• exactly one must be selected

• defines basic structural components

• currently defined:
  – prose, verse, drama
  – transcribed speech
  – dictionaries
  – terminological databases

• mixtures of bases require special treatment

TEI additional tagsets

• sets of elements for specialised application areas

• can be mixed and matched ad lib

• currently provided:
  – linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism; names and dates; graphs and trees; figures and tables; language corpora...

How does this work ?

• Main dtd consists of marked sections, each (potentially) containing one tagset

• By default, all tagsets are IGNOREd

<![ %TEI.tagset; [
<!-- declarations for tagset here -->
]]>

<!ENTITY % TEI.tagset "INCLUDE">
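
The switch works because of two rules shared by SGML and XML: the DOCTYPE declaration subset is read before the external DTD, and when a parameter entity is declared more than once, only the first declaration counts. A minimal sketch of the idea, using hypothetical names (`main.dtd`, `MY.tagset`, `doc`, `extra`) rather than the actual TEI sources:

```sgml
<!-- main.dtd (hypothetical): supplies the default, then the switched section -->
<!ENTITY % MY.tagset "IGNORE" >
<!ELEMENT doc - - (#PCDATA | extra)* >
<![ %MY.tagset; [
<!ELEMENT extra - - (#PCDATA) >
]]>
```

```sgml
<!-- document instance: the subset is read first, so this INCLUDE wins -->
<!DOCTYPE doc SYSTEM "main.dtd" [
<!ENTITY % MY.tagset "INCLUDE" >
]>
<doc>some text <extra>and an extra</extra></doc>
```

Without the declaration in the subset, the default in `main.dtd` applies, the marked section is skipped, and `extra` is never declared.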

How does this work? (contd)

• Tagsets contain element and attlist declarations, each also enclosed by a marked section

• By default all elements are INCLUDEd

<![ %element; [
<!ELEMENT %n.element; - - (#PCDATA) >
<!ATTLIST %n.element; %a.global; >
]]>

<!ENTITY % element "IGNORE">

How does this work? (contd)

• Element names (GIs) are always referred to indirectly, so that they may be renamed

<!ELEMENT %n.elem1; - - (%n.elem2;+) >

<!ENTITY % n.elem1 "elem1">
<!ENTITY % n.elem2 "foo">
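
To see what the indirection buys, it may help to write out what the parser sees once the name entities are expanded. The sketch below assumes the two name entities declared on this slide; the sample instance in the comment is illustrative only.

```sgml
<!-- after expansion of %n.elem1; and %n.elem2; -->
<!ELEMENT elem1 - - (foo+) >

<!-- a conforming fragment therefore uses the new name:
     <elem1><foo>...</foo><foo>...</foo></elem1>
-->
```

Every content model that mentions `%n.elem2;` picks up the new name in the same way, so a rename needs only the one entity declaration.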

Element Classes

• Model classes
  – elements which share syntactic properties (i.e. occur in same position)

• Attribute classes
  – elements which share attributes

• Class membership can be inherited

• Another way of doing architectural forms

Some TEI model classes

• divn: structural elements like divisions
  <div>, <div1>, <div2>, <lg>, <lg1>...

• divtop: elements which can appear at the start of a divn element
  <head>, <epigraph>, <byLine>...

• chunk: paragraph-like elements
  <sp>, <p>, <lg>, <l>...

• phrase: elements which appear within chunks
  <hi>, <foreign>, <date>, <q>...

Some TEI semantic classes

• data: phrases likely to be normalised or processed non-textually
  <date>, <time>, <name>...

• biblpart: specialised components of bibliographic descriptions
  <author>, <title>, <editor>...

• demographic: descriptive features of participants in a language interaction
  <birth>, <socecStatus>, <occupation>...

Some TEI attribute classes

• global: attributes which are available to every element
  n, lang, id, TEIform

• linking: attributes for elements which have linking semantics
  targType, targOrder, evaluate

The class system in action

• Simplifying documentation and understanding of the DTD

• Parameterizing content models
  – different for different bases

• Simplifying customization
  – class membership is unaffected
  – adding new elements to an existing class

Parameterized content models

• “Components”, for example:
  – a dictionary is composed of entries
  – a play is composed of speeches
  – a novel is composed of paragraphs

• in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently

How does this work? (contd)

• the component class has different members in different bases

<![ %TEI.prose; [
<!ENTITY % m.component "p | list | note" >
]]>
<![ %TEI.dictionaries; [
<!ENTITY % m.component "entry" >
]]>
<!ENTITY % component.seq "(%m.component;)+" >
<!ELEMENT div - - (head?, (%component.seq;), div*) >
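
Written out, the effect of the base selection on the content model of `div` might look as follows. This is a hand expansion of the declarations on this slide, given for illustration; it is not text taken from the TEI DTD files, and the two declarations are alternatives, never present in the same DTD.

```sgml
<!-- with TEI.prose set to INCLUDE -->
<!ELEMENT div - - (head?, ((p | list | note)+), div*) >

<!-- with TEI.dictionaries set to INCLUDE instead -->
<!ELEMENT div - - (head?, ((entry)+), div*) >
```

The declaration of `div` itself is written only once; only the members of the component class change.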

Customization...

• Removing an element involves
  – undeclaring it
  – (NB: ISO 8879 permits references to undefined elements, though not all vendors know this)

• Adding a new element involves
  – determining its class
  – defining it
  – adding it to that class

Customization (contd)

• Modification of an element implies removal followed by addition

• Class membership should be unaffected

<!-- in TEI.extensions.ent -->
<!ENTITY % p "IGNORE">

<!-- in TEI.extensions.dtd -->
<!ELEMENT %n.p; - - (#PCDATA)>

How does this work? (contd)

• Each model class is defined as a parameter entity

• Reference to class members is always indirect

• Membership extensible (by a kludge)

<!ENTITY % x.class "" >
<!ENTITY % m.class "%x.class; name1 | name2 | name3 ..." >

<!ELEMENT %n.element; - - (%m.class;)+ >
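
The kludge is that a redefined x. entity must end with a vertical bar, so that it can be spliced in front of the default members. Because the first declaration of an entity is the one that counts, a declaration read from the extensions file pre-empts the empty default. A sketch with hypothetical element names (`myNew1`, `myNew2`):

```sgml
<!-- extensions file, read first: note the trailing "|" -->
<!ENTITY % x.class "myNew1 | myNew2 |" >

<!-- main DTD: this empty default is ignored, x.class is already declared -->
<!ENTITY % x.class "" >
<!ENTITY % m.class "%x.class; name1 | name2 | name3" >

<!-- m.class now expands to: myNew1 | myNew2 | name1 | name2 | name3 -->
```

Leaving out the trailing bar would run `myNew2` and `name1` together without a connector, which is why the approach counts as a kludge rather than a clean extension mechanism.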

An example: the Lampeter corpus

• Requirements
  – light presentational tagging
  – structural markup for access
  – demographic information about text production
  – small number of tags to ease data capture and validation

• Implementation
  – tagsets: prose base, and tags from four additional sets
  – some extensions, many exclusions

The Lampeter corpus DTD subset

<!DOCTYPE TEICORPUS.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">
<!-- more declarations here -->
]>

The Lampeter corpus extensions.ent

<!ENTITY % analytic 'IGNORE' >
<!ENTITY % biblStruct 'IGNORE' >
<!-- hic desunt multa -->
<!ENTITY % supplied 'IGNORE' >

<!ENTITY % x.phrase "it|ro|sc|su|bo|go|">
<!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">
<!ENTITY % x.demographic "socecstatusPat|biogNote|">
<!ENTITY % x.globincl "gap|">

The Lampeter corpus extensions.dtd

<!ELEMENT (it|ro|sc|su|bo|go) - - (%phrase.seq;) >
<!ELEMENT (persName|printer|pubFormat|bookSeller|biogNote|socecstatusPat) - - (%phrase.seq;) >

NB: This is a provisional version only! (no attlists, no documentation…)

Summary

• Designing a successful DTD involves careful, conscious, controlled theft

• Modularize the task

• A class system helps identify
  – what is true of all documents
  – what is true of some documents

• Modifiability can be compatible with standardization