
XML and Linguistic Annotation

Chris Brew, Ohio State University
(credit to Marc Moens, Henry Thompson, David McKelvie, all Language Technology Group, University of Edinburgh)

http://www.ling.ohio-state.edu/~cbrew
Ohio State University

Copyright 2000 Chris Brew


XML and Linguistic Annotation, Summer School, July 2000

XML topics

• What is XML?
  • HTML, XML and SGML
  • Wider context of XML
• Data Description
  • DTDs, Schemas
• Query Languages
  • XML Query, XQL, Quilt, LORE, LT QUERY
• Style Languages
  • CSS, XSL


What is XML?

It is a markup language used for annotating text
• is concerned with logical structure
  • to identify sections, titles, section headers, chapters, paragraphs, …
• is not concerned with appearance
  • you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
  • you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'

Derived from SGML.


Why is XML a big deal?

• It is a W3C standard
• It is vendor-independent, platform-independent, application-independent, …
  • unlike Word documents, RTF documents, PDF documents, Postscript documents, …
• It is human readable
  • ditto (for most values of 'human')
• The Web interchange format


Who is in charge of XML?

• XML is a W3C Recommendation
• The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organizations. Membership costs serious money and confers voting rights. The procedures are complex, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.
• The recommendation was written by the W3C's XML Working Group.


XML as a career move?

• Most of the big computer and entertainment companies believe XML is the solution. Exactly what was the problem?
  • Presenting a parts database over the Internet
  • Running an on-line job market (flipdog.com)
  • Usually not corpus creation.
• Scholars win and lose
  • SGML was a minority interest where we had serious influence on what facilities were used
  • XML is mainstream. We're the minority now.
  • This year's .coms are busily hiring people who understand ontologies, NLP and web technology.


Does it live up to the hype?

Of course not, but…
• The basic idea is simple: labeled brackets. Lisp showed the power of this idea in knowledge representation.
  • Knowledge representation is inherently hard. Lisp made it easier to state the problem, but it wasn't itself the solution.
  • XML won't solve your knowledge representation problems either, but it will let you state them.
• Labeled brackets++
  • Labeled brackets, but designed for information exchange, with sophisticated input (and political pressures) from many interest groups.


Does it live up to the hype?

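Yes. XML and allied standards (XSLT, XML Query) give us a framework for data interchange.

[Figure: data-interchange pipeline in three stages labeled Data, Transformation and End Users. Weather Reports pass as XML through an XSL transformation, whose XML output serves a Browser, a Day Planner and a Weather Model.]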


Transformation

End users will differ in which parts of the weather reports they need, so the middle stage is the crux.
• One XML format defines the available data
• Transformations map this format into what is needed by the different applications, leaving out bits that they don't need.
• One common transformation is to HTML, for browsers. (easy)
• Another is to printed paper, for efficient random access. (difficult, because our quality expectations are so high)
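The transformation stage can be sketched in a few lines of Python. This is only an illustration of the idea, not XSLT itself: the report format, the element names and the function `to_html` are all invented for the example, and the standard-library `xml.etree.ElementTree` module stands in for a real stylesheet processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather report; element and attribute names are invented for illustration.
REPORT = """<report station="Columbus">
  <temperature unit="C">21</temperature>
  <wind unit="km/h">15</wind>
  <pressure unit="hPa">1012</pressure>
</report>"""

def to_html(report_xml, wanted):
    """Build an HTML list that keeps only the measurements named in `wanted`."""
    report = ET.fromstring(report_xml)
    ul = ET.Element("ul")
    for child in report:
        if child.tag in wanted:
            li = ET.SubElement(ul, "li")
            li.text = f"{child.tag}: {child.text} {child.get('unit')}"
    return ET.tostring(ul, encoding="unicode")

# A browser view that leaves out the pressure reading.
browser_view = to_html(REPORT, {"temperature", "wind"})
print(browser_view)
```

A different end user would call the same function with a different `wanted` set; the single XML source format stays untouched.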


Representing knowledge in text

• Unformatted text
• Formatted text
• Structured Markup


Unformatted text

United Kingdom
Geography
Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France
Map references: Europe, Standard Time Zones of the World
Area:
total area: 244,820 km2
land area: 241,590 km2
comparative area: slightly smaller than Oregon
note: includes Rockall and Shetland Islands
Land boundaries: total 360 km, Ireland 360 km
Coastline: 12,429 km


Formatted text

United Kingdom
  Geography
    Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France
    Map references: Europe, Standard Time Zones of the World
    Area:
      total area: 244,820 km2
      land area: 241,590 km2
      comparative area: slightly smaller than Oregon
        >> note: includes Rockall and Shetland Islands
    Land boundaries: total 360 km, Ireland 360 km
    Coastline: 12,429 km


XML marked up text

<chapter>
<title>United Kingdom</title>
<section>
<title>Geography</title>
<featlist>
  <feat name='Location'>Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France</feat>
  <feat name='Map references'>Europe, Standard Time Zones of the World</feat>
  <feat name='Area'>
    <featlist>
      <feat name='total area'>244,820 km2</feat>
      <feat name='land area'>241,590 km2</feat>
      <feat name='comparative area'>slightly smaller than Oregon
        <addendum>note: includes Rockall and Shetland Islands</addendum>
      </feat>
    </featlist>
  </feat>
  <feat name='Land boundaries'>total 360 km, Ireland 360 km</feat>
</featlist>
</section>
</chapter>
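Once the structure is explicit, a program can use it. The sketch below is an assumption-laden illustration rather than part of the original slides: it takes a shortened, well-formed copy of the markup above and uses Python's standard `xml.etree.ElementTree` to flatten the nested feature lists. The helper name `leaf_features` and the " / " path separator are invented for the example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the slide's markup, made well-formed for the parser.
CHAPTER = """<chapter><title>United Kingdom</title>
<section><title>Geography</title>
<featlist>
  <feat name='Area'>
    <featlist>
      <feat name='total area'>244,820 km2</feat>
      <feat name='land area'>241,590 km2</feat>
    </featlist>
  </feat>
  <feat name='Land boundaries'>total 360 km, Ireland 360 km</feat>
</featlist>
</section></chapter>"""

def leaf_features(featlist, prefix=""):
    """Yield (path, value) pairs for every <feat> that has no nested <featlist>."""
    for feat in featlist.findall("feat"):
        path = prefix + feat.get("name")
        nested = feat.find("featlist")
        if nested is None:
            yield path, feat.text.strip()
        else:
            yield from leaf_features(nested, path + " / ")

root = ET.fromstring(CHAPTER)
features = dict(leaf_features(root.find("section/featlist")))
for path, value in features.items():
    print(path, "=", value)
```

Doing the same with the unformatted or the formatted version would mean guessing the nesting from line breaks and indentation.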


The syntax...

But aren't all those angle brackets still terribly cumbersome and complicated?
Yes. Simpler relative only to SGML. But...

There are tools that allow you to add XML annotation without the need to know XML

There are tools that allow you to search XML annotation without the need to know XML

XML is no more complex than other annotation schemes

If you roll your own scheme, you’ll have to write (and maintain) the tools.

If you use XML, part or all of your tool set will be provided by the mainstream computer industry.


RTF Format

{\rtf1\ansi
\pard\plain\s1\fs36\ppscheme-3\lang2057 {\f1\lang1033 Formatted text\par }
\pard\plain\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs32\ppscheme-6\lang1033 United Kingdom}{\f1\fs20\lang1033 }{\f1\fs16\lang1033 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs24\lang1033 Geography}{\f1\fs12\lang1033 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Location: Western Europe, bordering on the North Atlantic Ocean \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 and the North Sea, between Ireland and France\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Map references: Europe, Standard Time Zones of the World \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Area: total area: 244,820 km2 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 land area: 241,590 km2 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 comparative area: slightly smaller than Oregon\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 >> note: includes}{\f1\fs20\lang1033 Rockall}{\f1\fs20\lang1033 and Shetland Islands\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Land boundaries: total 360 km, Ireland 360 km \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Coastline: 12,429 km\par}}


XHTML is a use of XML

• HTML is derived from SGML, but as an application, not a subset
• SGML/XML let you define new types of document
• HTML only gives you a language to write document instances
  • Hard-wired to a particular tag set (often with proprietary extensions, e.g. frames)
  • Hard-wired to a particular typographic format, with limited style-sheets
• XHTML is to XML as HTML is to SGML


SGML/XML for computational linguists

What is XML?

SGML Lite
• Simpler to write
• Simpler to parse

HTML Heavy
• New user-definable tags
• Not (just) about browsing
• Data interchange
• Heavily legislated syntax


What is XML?

XML is just labeled brackets. You get elements with a start tag, some content, and an end tag.

<memo>
<sender>Marc Moens</sender>
<recipient>Henry, David</recipient>
<status>confidential</status>
<subject>GGP Contract</subject>
<message>The GGP contract is ready for signature. Please sign the contract as well as the NDA.</message>
</memo>


XML is SGML made simple

SGML is labeled brackets too. You get elements with an optional start tag, some content, and an optional end tag.

<memo>
<sender>Marc Moens
<recipient>Henry, David
<status>confidential</status>
<subject>GGP Contract
<message>The GGP contract is ready for signature.
</memo>
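The practical difference shows up as soon as a parser is involved. As a rough illustration (not from the original slides), the sketch below feeds a fully tagged memo and an SGML-style minimized memo to Python's standard XML parser. The helper `parses_as_xml` is a name made up for the example, and both memos are shortened versions of the ones shown.

```python
import xml.etree.ElementTree as ET

XML_MEMO = """<memo>
<sender>Marc Moens</sender>
<recipient>Henry, David</recipient>
<status>confidential</status>
</memo>"""

# SGML-style minimization: end tags for <sender> and <recipient> are omitted.
SGML_STYLE_MEMO = """<memo>
<sender>Marc Moens
<recipient>Henry, David
<status>confidential</status>
</memo>"""

def parses_as_xml(text):
    """Return True if `text` is accepted by an XML parser, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(XML_MEMO))         # the fully tagged memo
print(parses_as_xml(SGML_STYLE_MEMO))  # the minimized memo
```

An SGML parser could recover the missing end tags, but only by consulting the DTD; the XML parser needs no DTD because nothing is left implicit.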


XML Basics

Document Type Definition (DTD)
• Describes what can (and can't) be in a particular type of document
• E.g. a memo DTD might specify that every memo has:
  • sender (name)
  • recipients (list of names)
  • date (default: today)
  • subject
  • message
  • status (confidential or unrestricted)

Document Instance
• Identifies the document type and contains the marked-up text
• E.g. a memo document instance:
  • refers to the memo DTD
  • contains text marked up in conformance with that DTD


XML and document structure

XML is used to make the structure of documents
• explicit
• machine readable

Document content:

SGML Tags

Marc Moens

This is the first paragraph. It has some text.

This is the second paragraph with some more text.


XML markup

<article status='draft'>
  <header>
    <title>XML tags</title>
    <author>Marc Moens</author>
  </header>
  <body>
    <para>This is the first paragraph. It has some text.</para>
    <para>This is the second paragraph with some more text <emph>and</emph> an embedded element.</para>
  </body>
</article>

Elements:
• start tags, e.g. <author>
• content, e.g. Marc Moens
• end tags, e.g. </author>

Elements mark up text to indicate structure and function of text (as opposed to appearance)

tag name = element type
Elements can have attributes

Elements and attributes are defined in the Document Type Definition
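To make the vocabulary concrete, here is a minimal sketch (an illustration added for this write-up, using Python's standard `xml.etree.ElementTree`) that reads the article above and prints element types, attributes and content.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<article status='draft'>
  <header><title>XML tags</title><author>Marc Moens</author></header>
  <body>
    <para>This is the first paragraph. It has some text.</para>
    <para>This is the second paragraph with some more text <emph>and</emph> an embedded element.</para>
  </body>
</article>"""

root = ET.fromstring(ARTICLE)

# The tag name is the element type; attributes arrive as a dictionary.
print(root.tag, root.attrib)

# iter() visits every element in document order.
element_types = [element.tag for element in root.iter()]
print(element_types)

# itertext() gathers text from an element and everything nested inside it.
second_para = "".join(root.findall("body/para")[1].itertext())
print(second_para)
```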


XML markup: for structure and function

He shouted: 'Come here now, Mr Banks.'

<sentence>He <verb>shouted</verb>: <quote><verb mood='imperative'>Come</verb> here <emphasis>now</emphasis>, <person><title>Mr</title> <name>Banks</name></person></quote></sentence>

Encodes structure information to support rendering as well as data handling

Data handling, e.g.
• search for all quotes inside sentences but not in footnotes;
• search for every mention of someone called Banks without finding the Banks of Scotland
[Use an XML-aware query tool]

Rendering, e.g.
• emphasis should be bold underline;
• quotes should be in italics
[Use a stylesheet]
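The two data-handling searches can be approximated with the small XPath subset that Python's standard `xml.etree.ElementTree` supports. This is a sketch under stated assumptions: the `<text>` wrapper, the second sentence and the `<footnote>` are invented test material, and a dedicated XML query tool would offer far more than this.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sentences: one names a person, the other an institution.
TEXT = """<text>
<sentence>He <verb>shouted</verb>: <quote>Come here, <person><title>Mr</title> <name>Banks</name></person></quote></sentence>
<sentence>The Banks of Scotland issue their own notes.</sentence>
<footnote><quote>Banks are closed on Sunday.</quote></footnote>
</text>"""

root = ET.fromstring(TEXT)

# Mentions of a person called Banks: match the markup, not the string.
people = [name.text for name in root.findall(".//person/name") if name.text == "Banks"]

# Quotes inside sentences; the quote in the footnote is not reached by this path.
quotes = root.findall(".//sentence//quote")

# A plain string search cannot tell the three uses of the word apart.
string_hits = TEXT.count("Banks")

print(len(people), len(quotes), string_hits)
```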


XML: Relevance for Linguists

• Simplify and standardize appeal to context
  • E.g. build a tokenizer which specifically works for headlines of newspaper articles: we need to be able to tell the tokenizer where the headline starts and ends
• Annotate text with interesting linguistic information
  • E.g. use XML tags to record the results of a tokenizer or part of speech tagger. Or a human annotator
• Allow sharing of results between research efforts
  • without having to write a new parser every time you get new material from somewhere


XML: Relevance for Linguists (example)

cat text | lttok -q '.*/P' -m W | ltpos -q '.*/W' -m C

Use the tokeniser lttok on all paragraphs <P> in the text and mark the resulting words as <W> elements.
Then run the part of speech tagger ltpos over the text and pos tag all the <W> elements, putting the result in attribute C.

<W C='VBD'>said</W> <W C='DET'>the</W> <W C='NN'>director</W> <W C='IN'>of</W> <W C='NNP'>Russian</W> <W C='NNP'>Bear</W> <W C='NNP'>Ltd.</W> <W C='.'>.</W>
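Because the tagger output is ordinary XML, any XML-aware program can read it without a custom parser. The sketch below is an illustration only: it assumes the attribute values are quoted and wraps the words in a `<P>` element so the fragment has a single root, then reads the tags back with Python's standard library.

```python
import xml.etree.ElementTree as ET

# The tagger output from the slide, wrapped in a <P> so that it has a single root.
TAGGED = """<P><W C='VBD'>said</W> <W C='DET'>the</W> <W C='NN'>director</W>
<W C='IN'>of</W> <W C='NNP'>Russian</W> <W C='NNP'>Bear</W>
<W C='NNP'>Ltd.</W> <W C='.'>.</W></P>"""

paragraph = ET.fromstring(TAGGED)

# Each <W> element carries the token as text and the tag in attribute C.
pairs = [(w.text, w.get("C")) for w in paragraph.iter("W")]
proper_nouns = [word for word, tag in pairs if tag == "NNP"]

print(pairs)
print(proper_nouns)
```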


Associated Standards

• XSLT: transforming documents
• XML Query: find bits of documents
• XML Schema: use element syntax for DTDs
• Namespaces: ensure that
    <art:draw><cube/><cube/></art:draw>
  and
    <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw>
  both get processed correctly.
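A namespace-aware parser keeps the two kinds of `draw` apart by expanding each prefix to a URI. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `<doc>` wrapper and both namespace URIs are made up for the example, since the slide gives only the prefixes.

```python
import xml.etree.ElementTree as ET

# The namespace URIs are made up for this sketch; only the prefixes come from the slide.
DOC = """<doc xmlns:art="http://example.org/art" xmlns:soccer="http://example.org/soccer">
  <art:draw><cube/><cube/></art:draw>
  <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw>
</doc>"""

root = ET.fromstring(DOC)
ns = {"art": "http://example.org/art", "soccer": "http://example.org/soccer"}

# The parser expands each prefix to its URI, so the two draw elements have different names.
art_draws = root.findall("art:draw", ns)
soccer_draws = root.findall("soccer:draw", ns)

print(art_draws[0].tag)
teams = [team.get("name") for team in soccer_draws[0].findall("team")]
print(teams)
```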


Infrastructure standards

• XPath: referring to parts of documents
• XPointer: pointing at documents and parts of documents
• DOM: uniform programmer's interface to document trees (abstracts away from some details)
• SAX: stream-based document interface (essential for big documents)
• Information Set
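The contrast between DOM and SAX is easiest to see on one small task. The sketch below, added as an illustration, counts paragraphs twice with Python's standard library: once by building a DOM tree with `xml.dom.minidom`, once by reacting to SAX events. The document and the class name `ParaCounter` are invented for the example.

```python
import xml.dom.minidom
import xml.sax

DOC = "<article><para>one</para><para>two</para><para>three</para></article>"

# DOM: the whole document becomes a tree in memory that can be navigated freely.
dom = xml.dom.minidom.parseString(DOC)
dom_count = len(dom.getElementsByTagName("para"))

class ParaCounter(xml.sax.ContentHandler):
    """SAX: the parser reports events one at a time; only a counter is kept."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "para":
            self.count += 1

handler = ParaCounter()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_count, handler.count)
```

The SAX handler never holds more than the current event, which is why the stream-based interface matters for big documents.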


XML in detail

• Well-formedness and validity
• DTDs
• XML tools
• XSLT
• XML Query


Well-formed and Valid documents

Well-formed XML
• Each start tag has an end tag
• XML content is rooted in a single "document element"
• Valid encoding declaration

Valid
• Well-formed
• All elements mentioned in DTD
• All entities defined
• All parent-child relations as described in DTD
• All attributes used as described in DTD
• All element IDs unique


Why well-formedness?

• A simpler standard for documents to meet
  • Can be determined without reference to a DTD
  • Simplifies the parser
  • Retains the "standalone" property of HTML, which was a big win.
• Non-validating XML systems can thus still be conformant, provided they check well-formedness
• If you have a DTD (or a Schema) you can do more refined processing.
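Python's standard `xml.etree.ElementTree` is one example of a non-validating system, so it can illustrate the distinction. In the sketch below (an illustration with invented documents), the first document is well-formed but violates its own DTD and is accepted; the second has two top-level elements and is rejected.

```python
import xml.etree.ElementTree as ET

# Well-formed, but not valid: the internal DTD subset allows only <para> inside <article>.
DOC = """<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
]>
<article><note>not declared in the DTD</note></article>"""

# ElementTree does not validate, so the undeclared <note> is accepted.
root = ET.fromstring(DOC)
print(root[0].tag)

# Two top-level elements break well-formedness, and any XML parser must reject them.
try:
    ET.fromstring("<para>one</para><para>two</para>")
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False
print(two_roots_ok)
```

Catching the undeclared `<note>` requires a validating parser that actually applies the DTD.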


DTDs

Document Type Definitions: the grammar of a document family
• Elements
• Attributes & values
• Entities & parameter entities
• Comments


DTD: Elements

Elements are used to structure a document.
Element types are declared in the DTD:

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ELEMENT section (title, para+) >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>


DTD: Attribute declarations

Attributes specify properties of elements.
The attributes which may appear on elements of a given type are also declared in the DTD.

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>


DTD: Entity declarations

Entities provide short names for commonly used strings, and are also declared in the DTD.

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ENTITY ltg "Language Technology Group">
]>


DTD: IDs

IDs are rigid designators for particular elements in the document. They are declared using type ID.

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ENTITY ltg "Language Technology Group">
]>

Potentially, IDs allow processors to provide fast random access to parts of documents.

IDs must be unique. Checking might be onerous.


XML tools

• XML Parser
• LT XML Toolkit
• XSLT: xt and Saxon


XML Parser

• probably the most important single bit of XML software
• uses the DTD to check if a document instance is valid


Example: >> cat memo.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
</para>
</article>


Add correct output

Example: >> xmlnorm -V memo.xml

Entity reference has been replaced with entity text by parser
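The xmlnorm output itself is not reproduced in these slides. As a stand-in, the sketch below shows the same entity substitution using Python's standard `xml.etree.ElementTree` on a shortened version of memo.xml; it illustrates what an XML parser does with the reference and says nothing about xmlnorm's actual output format.

```python
import xml.etree.ElementTree as ET

MEMO = """<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>Here is a reference to the &ltg; entity.</para>
</article>"""

# The parser reads the internal DTD subset and substitutes the entity text.
article = ET.fromstring(MEMO)
para_text = article.find("para").text
print(para_text)
```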


Exercise

• Practice using xmlnorm to check your documents
• Add some new entities to the memo.
• Experience some of xmlnorm's error messages
• Begin to think about DTD design
• Practice using Web browsers to look at XML files
• Get a glimpse of what XSL is about


DTD: Comments

<!DOCTYPE article [
<!-- Just a simple example DTD -->
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ENTITY ltg 'Language Technology Group'>
]>


Element type declaration details

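<!ELEMENT chapter (title, section+) >

• <!ELEMENT: keyword
• chapter: element type
  • starts with a letter (a-z)
  • may contain hyphens, numbers, stops
  • case sensitive in XML (SGML names were not, by default)
  • one element type per declaration in XML (SGML allowed more than one)
• (title, section+): content model
  • an unambiguous regular expression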


Element types: Content model

<!ELEMENT article (title, section+) >

+   at least one, possibly more
?   optional
*   zero or more
,   all occur, in that order
|   exclusive or

<!ELEMENT header ( ( (title, subtitle?), (author, affil)+ ), (date | status)? ) >

XML eradicated SGML's neat & connector (all occur, any order)
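Since a content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is a hand translation of the header model above into Python's `re` syntax. The encoding (each child name followed by one space) and the function name `matches_model` are choices made for this example, not how a validating parser is actually implemented.

```python
import re

# Hand translation of the header content model into a regular expression over
# child element names, each name followed by a single space.
#   ( ( (title, subtitle?), (author, affil)+ ), (date | status)? )
HEADER_MODEL = re.compile(r"title (subtitle )?(author affil )+((date|status) )?")

def matches_model(child_names):
    """Check a sequence of child element names against the header content model."""
    encoded = "".join(name + " " for name in child_names)
    return HEADER_MODEL.fullmatch(encoded) is not None

print(matches_model(["title", "author", "affil", "date"]))
print(matches_model(["title", "subtitle", "author", "affil", "author", "affil"]))
print(matches_model(["author", "affil", "title"]))
print(matches_model(["title", "author", "affil", "date", "status"]))
```

The last two sequences fail: one puts the title in the wrong position, the other supplies both date and status where the model allows at most one.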


Element types: Content model options

<!ELEMENT graphic EMPTY >

EMPTY
• no content
• no separate end tag: in XML written as an empty-element tag, e.g. <graphic/>
• point semantics: attributes may specialise

(#PCDATA)
• text only

ANY
• no constraint: sub-elements and/or text

(#PCDATA|emph)*
• 'mixed content'


Element grammar

• Since the content model is a regular expression, the markup grammar is context free
• Except for one thing: the ANY keyword
• Note that any realistic application interprets the markup tree. The interpretation could be anything. All bets are off…


Example: >> nsgmls exa2a.sgm

nsgmls:exa2a.sgm:7:42:E: element "PI" undefined
nsgmls:exa2a.sgm:8:24:E: general entity "T." not defined and no default entity
(ARTICLE
(PARA
-Here is some text with an inequality: a
(PI
-2\nand an abbreviation: AT
)PI
)PARA
)ARTICLE

• <pi/ interpreted as start tag
• &T. interpreted as entity reference; not defined, so gone from output
• No C to confirm validity.


Escaping special characters

There are several ways around the problem of introducing XML's meta-syntax characters into documents
• Use numeric character references
    AT&#38;T
• Use CDATA marked sections
    <![CDATA[<this> is data &not markup]]>
• XML provides built-in definitions for amp, lt, gt, quot and apos
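All three escape routes can be tried out with Python's standard library. The sketch below is an added illustration: the sample sentence is modelled on the nsgmls example, and `escape` from `xml.sax.saxutils` is used to produce the built-in entity references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "an inequality: a<pi/2 and an abbreviation: AT&T."

# escape() replaces &, < and > with the built-in entity references.
safe = escape(raw)
print(safe)

# After escaping, the text survives a round trip through the parser unchanged.
para = ET.fromstring("<para>" + safe + "</para>")
print(para.text == raw)

# A numeric character reference is resolved by the parser in the same way.
numeric = ET.fromstring("<para>AT&#38;T</para>")
print(numeric.text)

# A CDATA marked section protects the same characters without escaping each one.
cdata = ET.fromstring("<para><![CDATA[<this> is data &not markup]]></para>")
print(cdata.text)
```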


Example: >> nsgmls exa2b.sgm

(ARTICLE
(PARA
-Here is some text with an inequality: a<pi/2\nand an abbreviation: AT&T.
)PARA
)ARTICLE
C


DTD: Comments

<!-- Comments added here -->

the double hyphens delimit the comment

<!ELEMENT article (title, section+)>


DTD: Attributes

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>


DTD Attribute declarations: syntax

<!ATTLIST article artno CDATA #IMPLIED >

• <!ATTLIST: keyword
• article: element type
• artno: attribute name
• CDATA: attribute type
• #IMPLIED: default type, one of
  • #REQUIRED
  • #IMPLIED (= optional)
  • #FIXED


Attribute Value types (contd)

<!ATTLIST article artno CDATA #IMPLIED >

CDATA     valid SGML characters
ENTITY    declared entity name
ID        unique name
IDREF     reference to a unique name


Cross-references

<!DOCTYPE article [
<!ELEMENT article (section+)>
<!ATTLIST section secid ID #IMPLIED>
<!ELEMENT section (#PCDATA | xref)*>
<!ELEMENT xref EMPTY>
<!ATTLIST xref xrefid IDREF #REQUIRED>
]>
<article>
<section secid='s1'>Here is some text.</section>
<section>In section <xref xrefid='s1'/> we showed you how to create cross-references.</section>
</article>


IDs and IDREFs

In a valid SGML/XML document
• IDs are unique
• IDREFs are discharged (every IDREF matches some ID)

Applications may interpret IDREF/ID connections

Links from elsewhere may target IDs
• cf. HTML 'name' attribute as the target for #....
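A validating parser performs both checks from the DTD. Where only a non-validating parser is at hand, the same two conditions can be sketched by hand, as below. This is an illustration only: the attribute names `secid` and `xrefid` are passed in explicitly because Python's standard `xml.etree.ElementTree` does not read attribute types from the DTD, and the third section with its dangling reference is invented test data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article>
<section secid='s1'>Here is some text.</section>
<section>In section <xref xrefid='s1'/> we showed you how to create cross-references.</section>
<section>This points nowhere: <xref xrefid='s9'/>.</section>
</article>"""

def check_ids(root, id_attr="secid", ref_attr="xrefid"):
    """Return (duplicate IDs, IDREFs that match no ID) for one document."""
    id_counts = Counter(e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    dangling = sorted(e.get(ref_attr) for e in root.iter()
                      if e.get(ref_attr) is not None and e.get(ref_attr) not in id_counts)
    return duplicates, dangling

duplicates, dangling = check_ids(ET.fromstring(DOC))
print(duplicates, dangling)
```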


Attribute value types: list

Type          Meaning                                              Example
CDATA         valid SGML characters                                author='Robin Hood'
ENTITY/IES    declared entity name(s)                              figs='pict2 pict7'
ID            unique name                                          id='foo37'
IDREF(S)      reference(s) to an ID                                refid='foo2 foo37'
NMTOKEN(S)    name(s) without the initial-character restriction    code='96-mm01 98-a'
NOTATION      data content notation                                encoding='eps'


Enumerated attribute values

Attribute values can also be constrained to be one of a finite set of allowed values

<!ATTLIST section status (draft|alpha|beta|final) 'draft' >

<section status='alpha'>
<section status='final'>
<section>                   (status defaults to 'draft')
<section status='gamma'>    Not valid


Elements vs Attributes

<!ELEMENT date (day, month, year)>
<!ELEMENT day (#PCDATA)>

• Content is unconstrained
• Order will be enforced

vs

<!ELEMENT date EMPTY>
<!ATTLIST date day   NUMBER #REQUIRED
               month NUMBER #REQUIRED
               year  NUMBER #REQUIRED>

• Content is constrained (NUMBER is an SGML attribute type; XML DTDs have no numeric type)
• Order is unconstrained

DTD: Entities

<!DOCTYPE article [
<!ELEMENT article (#PCDATA)>
<!ENTITY ltg 'Language Technology Group'>
]>
<article>The &ltg; carries out application-oriented research in
language engineering. The &ltg; is based within
the HCRC.</article>

Each occurrence of &ltg; in the text is replaced by
Language Technology Group
during parsing.

Entities can be nested:

<!ENTITY hcl 'HCRC &ltg;'>
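Entity replacement can be observed with any parser that reads the internal DTD subset. The sketch below uses Python's standard `xml.etree.ElementTree`, which expands internal general entities while parsing; the sample sentence is shortened and the use of `&hcl;` in the text is added for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE article [
<!ELEMENT article (#PCDATA)>
<!ENTITY ltg 'Language Technology Group'>
<!ENTITY hcl 'HCRC &ltg;'>
]>
<article>The &ltg; is based within the HCRC (&hcl;).</article>"""

# The parser substitutes the entity text, including the nested reference
# inside hcl, before the application sees the character data.
article = ET.fromstring(doc)
print(article.text)
```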

DTD: Parameter Entities

Like general entities, except that they are used within the DTD itself:

<!ENTITY % section '(title?, para+)'>

Each time the parser finds %section; in the DTD, it replaces it with (title?, para+):

<!ENTITY % section '(title?, para+)'>
<!ELEMENT article (title, %section;+)>
<!ELEMENT subsect (%section;+)>
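The substitution described above is purely textual, which a few lines of Python can imitate. This is a deliberately simplified sketch, not how a conforming parser is implemented: it handles only single-level, quoted declarations, and the function name `expand_parameter_entities` is hypothetical.

```python
import re

def expand_parameter_entities(dtd_text):
    """Textually expand %name; references using <!ENTITY % name '...'> declarations."""
    decl = re.compile(r"<!ENTITY\s+%\s+(\w+)\s+(['\"])(.*?)\2\s*>")
    # Collect the declared replacement texts, then drop the declarations.
    entities = {m.group(1): m.group(3) for m in decl.finditer(dtd_text)}
    body = decl.sub("", dtd_text)
    # Replace each reference; unknown names are left untouched.
    return re.sub(r"%(\w+);", lambda m: entities.get(m.group(1), m.group(0)), body).strip()

dtd = "<!ENTITY % section '(title?, para+)'>\n<!ELEMENT subsect (%section;+)>"
expanded = expand_parameter_entities(dtd)
print(expanded)
```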

DTD

That's almost all there is to it. For more detail, see the XML standard,
which, as Michael Kay puts it, is like tax legislation.

DTD syntax differs from element syntax, which makes DTDs harder to learn and use. XML Schema, by contrast, is written in element syntax.

Also, DTDs were designed to be used by document designers, not for distributed data interchange:
XML can use a DTD, but doesn't assume one.
Composite documents entail composite DTDs, but these don't exist.
Namespace prefixes add extra complexity.

XSL Transformations

Content from one document.

Style from another

Structure

barts_stylish_memo.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="memo.xsl"?>
<!DOCTYPE article [
<!ELEMENT article (title,(para|credit)+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
<!ENTITY author "Bart Simpson">
<!ENTITY techie "Lisa Simpson">
<!ENTITY parents "Marge and Homer">
<!ENTITY school "M&amp;M University">
]>
<article>
<title>Bart's Ph.D Thesis</title>
<para> by &author;: &school;</para>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
Please may I stop now?</para>
<credit>
&techie; of &school; for slick XML authoring.
</credit>
<credit>
&parents; for unfailing support.
</credit>
</article>

memo.xsl

IE5 attempts to display the style in visual form, without any content.

Germ of a good idea here.

Source of memo.xsl

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
  <head><title><xsl:value-of select="//title"/></title></head>
  <body BGCOLOR='#FFFFCC'>
    <h1><xsl:value-of select="//title"/></h1>
    <xsl:for-each select="//para">
      <p><xsl:value-of/></p>
    </xsl:for-each>
    <hr/>
    <p><i> Thanks to: </i><br/>
    <xsl:for-each select="//credit">
      &#160; <xsl:value-of/><br/>
    </xsl:for-each>
    <hr/></p>
  </body>
</html>
</xsl:template>
</xsl:stylesheet>

Fill in the blanks

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
  <head><title>•••</title></head>
  <body BGCOLOR='#FFFFCC'>
    <h1>•••</h1>
    <xsl:for-each select="//para">
      <p>•••</p>
    </xsl:for-each>
    <hr/>
    <p><i> Thanks to: </i><br/>
    <xsl:for-each select="//credit">
      &#160; ••• <br/>
    </xsl:for-each>
    <hr/></p>
  </body>
</html>
</xsl:template>
</xsl:stylesheet>

XSLT gives you tools for sending part of a document to one place and part to another.

The simplest use is pure fill-in-the-blanks. Anybody who uses HTML, PHP and so on will be comfortable with this use of XSLT.

If necessary, it is a Turing-complete programming language. It gives you the rope if you need it.
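The fill-in-the-blanks idea does not depend on XSLT itself. As a rough illustration, the same template can be filled from Python with the standard library; this is a swapped-in technique (plain string formatting over an ElementTree parse), not an XSLT processor, and the shortened memo and the function name `fill_in` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

memo = ET.fromstring(
    "<article><title>Bart's Ph.D Thesis</title>"
    "<para>A very short article.</para>"
    "<credit>Lisa Simpson for slick XML authoring.</credit>"
    "<credit>Marge and Homer for unfailing support.</credit></article>"
)

def fill_in(article):
    """Fill the blanks of the memo template with content from the article."""
    text = lambda e: escape(e.text or "", quote=False)
    title = escape(article.findtext(".//title", default=""), quote=False)
    # One <p> per para and one line per credit, as in the for-each loops.
    paras = "".join(f"<p>{text(p)}</p>" for p in article.iter("para"))
    credits = "".join(f"&#160; {text(c)}<br/>" for c in article.iter("credit"))
    return (f"<html><head><title>{title}</title></head>"
            f"<body BGCOLOR='#FFFFCC'><h1>{title}</h1>{paras}<hr/>"
            f"<p><i> Thanks to: </i><br/>{credits}<hr/></p></body></html>")

html_page = fill_in(memo)
print(html_page)
```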

Fill in the blanks

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
  <head><title> <xsl:value-of select="//title"/> </title></head>
  <body BGCOLOR='#FFFFCC'>
    <h1> <xsl:value-of select="//title"/> </h1>
    <xsl:for-each select="//para">
      <p> <xsl:value-of/> </p>
    </xsl:for-each>
    <hr/>
    <p><i> Thanks to: </i><br/>
    <xsl:for-each select="//credit">
      &#160; <xsl:value-of/> <br/>
    </xsl:for-each>
    <hr/></p>
  </body>
</html>
</xsl:template>
</xsl:stylesheet>

XSLT standards

Microsoft's implementation in IE5 is non-standard (they put it out well before the standard existed). They are moving to conformance.

James Clark's xt and Michael Kay's Saxon are much more complete and conformant.

W3C eats its own lunch: the HTML versions of the XML standard are generated with XSL.

In practice, the current best options are:
Static data: pre-generate HTML from XML at publication time
Dynamic data: use Saxon or xt as Java Servlets

Generating HTML

HTML is generated by running Saxon on poem.xml and poem.xsl

saxon poem.xml poem.xsl > poem.html

Using IE5 to view poem.xml

<poem>
<author>Rupert Brooke</author>
<date>1912</date>
<title>Song</title>
<stanza>
<line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line>
<line>And the hawthorn quickens with buds of green</line>
<line>And my heart with buds of pain.</line>
</stanza>
<stanza>
<line>My heart all Winter lay so numb,</line>
<line>The earth so dead and frore,</line>
<line>That I never thought the Spring would come again</line>
<line>Or my heart wake any more.</line>
</stanza>
<stanza>
<line>But Winter's broken and earth has woken,</line>
<line>And the small birds cry again;</line>
<line>And the hawthorn hedge puts forth its buds,</line>
<line>And my heart puts forth its pain.</line>
</stanza>
</poem>

poem.xsl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:template match="poem">
<html>
<head>
<title><xsl:value-of select="title"/></title>
</head>
<body>
<xsl:apply-templates select="title"/>
<xsl:apply-templates select="author"/>
<xsl:apply-templates select="stanza"/>
<xsl:apply-templates select="date"/>
</body>
</html>
</xsl:template>

<xsl:template match="title">
<div align="center"><h1><xsl:value-of select="."/></h1></div>
</xsl:template>

<xsl:template match="author">
<div align="center"><h2>By <xsl:value-of select="."/></h2></div>
</xsl:template>

<xsl:template match="stanza">
<p><xsl:apply-templates select="line"/></p>
</xsl:template>

<xsl:template match="line">
<xsl:if test="position() mod 2 = 0">&#160;&#160;</xsl:if>
<xsl:value-of select="."/><br/>
</xsl:template>

<xsl:template match="date">
<p><i><xsl:value-of select="."/></i></p>
</xsl:template>

</xsl:stylesheet>

The namespace declaration is different (standard-conforming) for Saxon.

+ XSLT language is different.
+ Saxon and XT are really easy to install.
- IE5 has millions of current users
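The only conditional logic in poem.xsl is the line template, which indents every second line of a stanza. For readers more at home in a scripting language, here is a rough Python equivalent of the stanza and line templates only; it is a sketch using the standard library, not a replacement for an XSLT processor, and the function name `stanza_to_html` is hypothetical.

```python
import xml.etree.ElementTree as ET

def stanza_to_html(stanza):
    """Render a stanza as a <p>, indenting even-numbered lines."""
    parts = []
    # position() in XSLT is 1-based within the selected node list.
    for position, line in enumerate(stanza.findall("line"), start=1):
        indent = "&#160;&#160;" if position % 2 == 0 else ""
        parts.append(f"{indent}{line.text}<br/>")
    return "<p>" + "".join(parts) + "</p>"

stanza = ET.fromstring(
    "<stanza>"
    "<line>And suddenly the wind comes soft,</line>"
    "<line>And Spring is here again;</line>"
    "</stanza>"
)
rendered = stanza_to_html(stanza)
print(rendered)
```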

“Problems” with XML

Uses complex and weird terminology
  Yes. But so does the ANSI C standard. So do most fields…

Not convenient for specifying graphs (as opposed to trees)
  This is a point about graphs, not XML. Unification grammar notations get unwieldy too.

Not as convenient as plain text
  True for some tasks, but the extra structure of XML lets you do things that you wouldn't even try with plain text.

XML tools for Unix

Simple equivalents of UN*X tools are available (for free) to do simple SGML processing

We'll introduce them using examples, and give details at the end

sggrep

LT XML program for searching for structure and text in XML files:

sggrep -q query -s subquery -t regexp in.xml

Options:
-d DTD   Specify a DTD explicitly. File is an XML file
-r       Attribute values in queries are regular expressions.
-v       Invert sense of sub-query+regexp.
Other options

LT XML query language

Two-dimensional regular expressions

First dimension is over tree paths, based on a file path analogy:
DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs

Second dimension is regular expressions over the text content of leaf nodes

Select Ss containing Ws whose text is it's or its:
-q S -s './W' -t "^(it's|its)$"

Full UTZOO (Henry Spencer) regular expression support

Influential, slightly dated now.
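The two dimensions (a structural path, then a regular expression over leaf text) can be imitated with Python's standard library. This sketch approximates the query `-q S -s './W' -t "^(it's|its)$"`; it matches S at any depth, strips surrounding whitespace before matching, and is not the LT XML implementation. The function name `select` and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def select(root, query_tag, sub_tag, pattern):
    """Return query_tag elements having a direct sub_tag child whose text matches."""
    regex = re.compile(pattern)
    return [
        elem for elem in root.iter(query_tag)                 # first dimension: structure
        if any(regex.search((w.text or "").strip())           # second dimension: text
               for w in elem.findall(sub_tag))
    ]

text = ET.fromstring(
    "<TEXT>"
    "<S><W>its </W><W>tail </W></S>"
    "<S><W>bits </W></S>"
    "</TEXT>"
)
hits = select(text, "S", "W", r"^(it's|its)$")
print(len(hits))
```

Because the pattern is anchored with ^ and $, the word "bits" does not match.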

sggrep: examples of use

sggrep -q ".*/P/S" -s "./W[TAG=NN]"
  find all S elements occurring inside a P element at any depth which immediately contain a W element with attribute TAG="NN".

sggrep -q ".*/P/S/W[TAG=NN]"
  find those W elements themselves

sggrep -q ".*/S/W[0]" -t "^[a-z]"
  find all sentence-initial words starting with a lower-case letter.

sgmltrans

converts XML into different formats.

sgmltrans -r rulefile file.nsg > file.txt

sample rule file:

.*/W          matches W
""            what to print at start tag
"/$TAG\n"     what to print at end tag: value of TAG attribute
.*/W/#        matches text inside W
" " --> ""    text replacement: eliminate space if any
.*/S          matches S
""            start tag: nothing
"\n"          end tag: make each S on separate line
.*            matches other markup

sgmltrans: example of use

The previous rule file would turn this input:

<?xml version='1.0'?>
<TEST><P><S>
<W TAG='A'>The </W>
<W TAG='B'>cat </W>
<W>sat </W>
<C>.</C></S>
<S>
<W TAG='A'>on </W>
<W TAG='B'>the </W>
<W>mat </W>
<C>.</C>
</S></P></TEST>

into this output:

The/A
cat/B
sat/
on/A
the/B
mat/
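The word rules of that rule file amount to "print the text of each W without spaces, then a slash and the TAG value". A minimal Python sketch of just those two rules follows; it ignores the S and catch-all rules, is not sgmltrans itself, and the function name `words_with_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

def words_with_tags(root):
    """Return one 'word/TAG' string per W element, in document order."""
    return [
        # " " --> "" removes spaces; a missing TAG attribute prints as empty.
        f"{(w.text or '').replace(' ', '')}/{w.get('TAG', '')}"
        for w in root.iter("W")
    ]

test = ET.fromstring(
    "<TEST><P><S>"
    "<W TAG='A'>The </W><W TAG='B'>cat </W><W>sat </W><C>.</C></S>"
    "<S><W TAG='A'>on </W><W TAG='B'>the </W><W>mat </W><C>.</C></S>"
    "</P></TEST>"
)
lines = words_with_tags(test)
print("\n".join(lines))
```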

sgrpg: SGML report generator

Program for making more complex queries of normalised SGML and for transforming SGML. Provides nested subqueries and sequencing.

Usage:
sgrpg query sub-query regexp out-fmt oargs < file.nsg > file.txt
sgrpg -f pat-file < file.nsg > file.txt

This now looks like a design study for XSLT and XML Query.

It has one advantage: it was designed (from the outset) for big documents.

The British National Corpus

2 gigabytes of contemporary English
Marked up to word level with part of speech tags

Extract data:

zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"

gives all singular nouns in a part of the corpus, e.g.

<W TYPE=NN1>part </W>
<W TYPE=NN1>meeting </W>
<W TYPE=NN1>while </W>
<W TYPE=NN1>funeral</W>
<W TYPE=NN1>loss</W>
<W TYPE=NN1>meeting</W>
<W TYPE=NN1>time </W>

The BNC: an example (2)

zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
-t "^[Rr]ight$"

gives sentences containing non-adjectival uses of the word 'right', e.g.

<S N=092>
<W TYPE=ITJ>Yes </W>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was</W>
<C TYPE=PUN>, </C>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was </W>
<W TYPE=AV0>right</W>
. . .
</S>

The BNC: an example (3)

Format the output into a more readable form:

zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
sgmltrans -r test.rule

Yes/ITJ that/DT0 was/VBD , that/DT0 was/VBD right/AV0 erm/UNC there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ much/AV0 you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD n't/XX0 there/EX0 ?

He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC he/PNP says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB me/PNP see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ at/PRP the/AT0 menu/NN1 and/CJC said/VVD right/AV0 , he/PNP said/VVD .

An extended example: Noun Compounds

Noun compounds in the British National Corpus

What is a noun compound? Too hard.
Simple approximation? A sequence of tags matching NN. . .
BNC uses a version of the Brown tags, where NN0, NN1, . . . are all variants of Noun

A pipeline of SGML-aware tools will do the job:
sgrpg | sggrep [ | . . .]

Use sgrpg to wrap such tag sequences in <G> ... </G>.
Use sggrep to filter the output.
Use further tools to tabulate, format, etc.
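The approximation itself (a run of adjacent words whose tags match NN) is easy to state in code. The sketch below groups such runs with `itertools.groupby`; it stands in for the grouping step only and is not sgrpg. The minimum length of two, the function name `noun_groups` and the sample sentence are assumptions made for illustration.

```python
import re
from itertools import groupby

def noun_groups(tagged, min_len=2):
    """Return runs of adjacent (word, tag) pairs whose tag contains NN."""
    groups = []
    # Unanchored match, so ambiguous tags such as AJ0-NN1 also count.
    for is_noun, run in groupby(tagged, key=lambda wt: re.search("NN", wt[1]) is not None):
        run = list(run)
        if is_noun and len(run) >= min_len:
            groups.append(run)
    return groups

tagged = [
    ("the", "AT0"), ("Local", "AJ0-NN1"), ("government", "NN0"),
    ("districts", "NN2"), ("met", "VVD"), ("time", "NN1"),
]
groups = noun_groups(tagged)
print(groups)
```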

An extended example: The pipe

Step by step through the pipe:

sgrpg -r -f np-pat.xml | ...
  Group the sequences
  -r  use regexp matching
  -f  script file

... sggrep -d groups.xml -q '.*/G'
  Extract the sequences
  -d  DTD
  -q  query (selects groups)

Result:
<G><W TYPE='AJ0-NN1'>Local</W>
<W TYPE='NN0'>government</W>
<W TYPE='NN2'>districts</W></G>
. . .

An extended example: filtering

Find all words with unresolved tags, e.g. AJ0-NN1 (use regexp matching, which is unanchored by default):
...| sggrep -r -q './W[TYPE="-"]' | ...

Find all words in second position:
...| sggrep -q './W[1]' | ...

Find all words with unresolved tags in second position:
...| sggrep -r -q './W[1 TYPE="-"]' | ...

An extended example: counting

Count all words in second position:
...| sggrep -q './W[1]' | sgcount

Count all words with unresolved tags in second position:
...| sggrep -r -q './W[1 TYPE="-"]' | sgcount

Results:
all 2nd place W                        23283
2nd place W with unresolved tag         5066

So roughly 22% (5066 / 23283) of the second-position words carry an unresolved tag.

An extended example: long compounds

Long compounds including 'government'

Use a subquery to select <G>...</G>s with 'government':
sggrep -q G -s './W' -t government

Next step, discard short ones:
sggrep -q G -s './W[2]'

Then sgmltrans for a neater format.

Results:
official/AJ0-NN1 government/NN0 report/NN1-VB
Local/AJ0-NN1 government/NN0 districts/NN2
...

An extended example: left context

select for 'government' in 2nd place
  . . . | sggrep -q G -s './W[1]' -t government |

pull words from first place
  sggrep -q './W[0]' |

remove markup
  textonly |

use UN*X for the rest
  sort | uniq -c | sort -nr | head -4

      6 French
      5 German
      4 interim
      4 Chinese
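The final `sort | uniq -c | sort -nr | head -4` stage is a frequency count followed by a top-N cut. In Python the same tabulation can be sketched with `collections.Counter`; the word list here is invented purely to exercise the function and does not come from the corpus, and the name `top_left_context` is hypothetical.

```python
from collections import Counter

def top_left_context(words, n=4):
    """Count the words and return the n most frequent as (word, count) pairs."""
    return Counter(words).most_common(n)

# Made-up input standing in for the output of textonly.
words = ["French", "German", "French", "interim", "French", "German"]
table = top_left_context(words, n=2)
for word, count in table:
    print(f"{count:7d} {word}")
```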

British International Corpus?

We are more francophone than we think! Longest 'noun-phrase' in 10% of BNC is:

serai/NN1 mentionn&eacute;/NN1 dans/NN2 le/NN1 rapport/NN1-VB qui/NN1 te/NN1 sera/NN1 remis/NN1

No disgrace that the part-of-speech tagger gave up here. Tools can't be better than their input allows

XML Conclusions

XML is the wave of the future
  Both Microsoft and Netscape have endorsed it
  Both Mozilla and IE5 have XML support built in
  Very good free software is available
  Microsoft seem to be serious about standards compliance

The W3C have made it clear that all subsequent W3C standards for web distribution of information will be based on XML (cf. SMIL, SVG and RDF)

Issues
  XSLT efficiency - space and time.

To read

Robin Cover's SGML/XML Web Page
http://www.sil.org/sgml/sgml.html
  includes many pointers to SGML tutorials, overviews, publications

The Whirlwind Guide to SGML & XML Tools and Vendors
http://www.infotek.no/sgmltool/guide.htm

The XML FAQ
http://www.ucc.ie/xml/
  An excellent introduction to XML with pointers to useful resources for newcomers to the standard

SGML/XML for Linguistics

2.1 Programs for querying/modifying SGML
  an example
  what is needed
  available tools

2.2 SGML marked-up corpora
  some existing resources

2.3 Related developments
  SSTML
  SGML for X-waves

An example

You want to build a system that performs a particular language engineering (LE) task.

You have a corpus of texts for:
  analysis (detecting textual regularities)
  system training
  system testing

Use XML
  Why?
  How?

Why use XML?

Use the structure of the text to fine-tune certain tools
  e.g. build a tokeniser which specifically works for headlines of newspaper articles

Annotate text with linguistic information
  e.g. use SGML tags to record the results of a tokeniser or part of speech tagger, so that other tools can make use of this information

Ensure that others (and you two years from now :-) will have easy access to your results
  No special-purpose parser required
  Simple retrieval and tabulation with existing free tools
  DTD provides some self-documentation

What is needed to use XML?

XML is text. Therefore:
  you can use any UNIX text manipulation program, e.g. grep, sed, awk, perl, etc.

XML is annotated text. Therefore:
  Needed: versions of these tools that are XML-aware

What is needed to use XML?

SGML reflects the hierarchical structure of a text.

You want to be able to tell tools to operate on a particular part of the SGML-annotated text, for example:
all WORD elements with attribute POS set to JJ (i.e. all adjectives)
  occurring within the first PARAGRAPH of the main BODY of an ARTICLE; or
  occurring within the HEADLINE of an ARTICLE

Needed: a query language over XML structures
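To make the requirement concrete, here is how the two example queries could be written with the limited XPath subset in Python's standard `xml.etree.ElementTree`. This is only an illustration of structural querying in general, not the LT XML query language; the tiny ARTICLE document is invented for the example.

```python
import xml.etree.ElementTree as ET

article = ET.fromstring(
    "<ARTICLE>"
    "<HEADLINE><WORD POS='JJ'>Big</WORD><WORD POS='NN'>news</WORD></HEADLINE>"
    "<BODY>"
    "<PARAGRAPH><WORD POS='JJ'>old</WORD><WORD POS='NN'>mill</WORD></PARAGRAPH>"
    "<PARAGRAPH><WORD POS='JJ'>late</WORD></PARAGRAPH>"
    "</BODY>"
    "</ARTICLE>"
)

# Adjectives within the first PARAGRAPH of the BODY.
first_para = article.findall("./BODY/PARAGRAPH[1]//WORD[@POS='JJ']")
# Adjectives within the HEADLINE.
headline = article.findall("./HEADLINE//WORD[@POS='JJ']")

print([w.text for w in first_para], [w.text for w in headline])
```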

What is needed to use XML?

XML-aware versions of text processing tools
Query language

In fact sggrep is just a simple wrapper round our query language.

Our query language and interface are designed to work with big files, so they do not read the whole document into memory unless absolutely necessary. Most competitors do read the whole document into memory.
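The difference between streaming and whole-document processing can be shown with Python's standard `iterparse`, which delivers elements one at a time so that each can be discarded after use. This sketch illustrates the general streaming idea only; it says nothing about how LT XML is implemented, and the function name `count_tag` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def count_tag(stream, tag):
    """Count elements named tag while keeping memory use small."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the element's text and children once it has been processed.
        elem.clear()
    return count

sample = io.BytesIO(b"<S><W>The </W><W>cat </W><C>.</C></S>")
n_words = count_tag(sample, "W")
print(n_words)
```

For a corpus file the stream would be an open file object rather than the in-memory sample used here.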

XML tools: the LT XML library

sggrep is part of an SGML toolset called LT XML
Developed by the Language Technology Group (Edinburgh)
see: http://www.ltg.ed.ac.uk/software

XML library with:
  Command-line tools
  Application Programming Interface (API)

Available for WIN32, UN*X (and Mac)

LT XML processes XML or nSGML
  nSGML now looks like a design study for XML

LT XML: Command-line tools

sggrep - retrieving context sensitive data
sgmltrans - transforming information
sgrpg - more complex queries/reformatting
textonly - strips out SGML markup
sgcount - counts SGML tags
knit - resolves XML-link links
others

LT NSL: APIs

LT NSL Application Program Interfaces: procedure calls to help you write your own programs to process nSGML
  C language API
  Python language API

C API for specialised access

Write your own programs to read/write SGML/XML
LT XML provides a rich API
Both event and tree views of the document stream

The distribution includes two heavily commented example programs.

Python language API for LT XML

Experimental integration of the LT XML API into Python (a free, portable, object-oriented scripting language)

Uses the Tk portable widget library for the graphical UI
Reflects the document stream as Python objects

Specialised XML editors

Using the Python API we have written a number of specialised processors:

A WYSIWYG XML instance editor (XED)
Several specialised annotation tools, e.g. PoS correctors, span coders
  Limited set of operations
  Preserve validity
  Hide structure from the user

Dataflow in LT NSL programs

[Diagram: dataflow in LT NSL programs. Labels recovered from the figure: mknsg, unknit, nSGML, NSL stream API, parser, C(++) program, DDB file; input files file1.sgm, file2.sgm, ...]

The Edinburgh MapTask Corpus

Contents
  128 task-oriented spontaneous Scottish dialogues
  small corpus, but very dense and detailed SGML markup.

Availability:
  Transcripts and digitized speech on 8 CD-ROMs: http://www.elsnet.org/resources.html or from the LDC

What is its markup like?
  (early) TEI-compliant
  Turns, pointers into the speech, identification of non-words.
  Word-level transcripts with timing markup available soon via the Internet

HCRC Maptask: an example

mknsg q1ec1.turns.sgm | sggrep -q ".*/W[TAG=at]"

<W START=2.9644 DUR=0.0725 UTT=1 TAG=at>a</W>

<W START=17.1410 DUR=0.1779 UTT=3 TAG=at>an</W>

<W START=18.6693 DUR=0.0791 UTT=3 TAG=at>the</W>

Parsed HCRC Maptask : an example

mknsg q1ec1.g.syn.sgm | sggrep -q ".*/NP" | sgmltrans -r mt.rule

<NP>we </NP>

<NP>a caravan park </NP>

<NP>we </NP>

<NP>we </NP>

<NP><NP><NP>an old mill </NP></NP><PP>on <NP>the right hand side </NP></PP></NP>

<NP><NP>an old mill </NP><PP>on <NP>the right </NP></PP></NP>

<NP>you </NP>

...

The MLCC corpus

Contents
  Financial newspaper texts: Dutch, English, French, German, Italian, Spanish
  Parallel texts:
    The Journal of the European Commission, Written Questions (1993).
    Corpus of European Parliamentary debates (1993-1994).
    (languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish)

Markup

Available from ELRA: http://www.inpg.fr/ELRA/catalog.html

The MLCC Corpus: an example

zcat exp.joc006.93.en.01.tei.gz | \
mknsg | \
sggrep -q ".*/DIV4[TYPE=Q]/HEAD"

<HEAD>Subject: The staffing in the Commission of the European Communities</HEAD>

<HEAD>Subject: Supplies of military equipment to Iraq</HEAD>

<HEAD>Subject: Commission plans to liberalize the postal sector and to abolish the State monopoly</HEAD>

<HEAD>Subject: New industries in Attika</HEAD>...

The same example for French

zcat exp.joc006.93.fr.01.tei.gz | \
mknsg | \
sggrep ".*/DIV4[TYPE=Q]/HEAD" ""

<HEAD>Objet: Organigramme de la Commission</HEAD>

<HEAD>Objet: Livraisons de matériel militaire à l'Irak</HEAD>

<HEAD>Objet: Projets de la Commission visant à libéraliser et à abolir le monopole d'État dans le secteur des postes</HEAD>

<HEAD>Objet: Nouvelles industries en Attique</HEAD>

Corresponds to the English data: suitable input for multilingual alignment experiments.

The Text Encoding Initiative (TEI)

The TEI is a large and well documented DTD for textual markup.
  Use it if you can
  Now has an XML version

Large and comprehensive hardcopy documentation available
  http://www.uic.edu/orgs/tei/
  DTDs available there as well

The Linguistic Data Consortium

LDC - based in Pennsylvania, USA
Distributes text corpora
See: http://www.ldc.upenn.edu/

SGML corpora include:

The European Language Newspaper Text corpus
  French (100 million words), German (90 million words) and Portuguese (15 million words). SGML.

TIPSTER Information Retrieval Text Research Collection
  3 gigabytes. SGML-like. Various English texts.

United Nations Parallel Text Corpus (English, French, Spanish)
  Fully-compliant SGML, 2.5 gigabytes


Tutorials

XML: far too many to mention

XSL:

XSL specification http://www.w3.org/Style/XSL

Robin Cover's guide http://www.oasis-open.org/cover/xsl.html


Resources

LT-XML http://www.ltg.ed.ac.uk/software/xml/index.html

Full-text search: Witten, Moffat and Bell's Managing Gigabytes, http://www.cs.mu.OZ.AU/mg/


Corpus Tools

Stuttgart Corpus Workbench: http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench

Birmingham Qwick: http://www-clg.bham.ac.uk/QWICK/

The MATE Workbench: http://www.cogsci.ed.ac.uk/~dmck/MateCode

NB. Prototype


Bibliography

McKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural Language Processing, Computers and the Humanities, 31(5): 367-388, 1997

Sinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for Corpus Exploration, Computers and the Humanities, 31(3): 229-255, 1998

References on McKelvie's MATE workbench page: http://www.cogsci.ed.ac.uk/~dmck/MateCode

Welty and Ide: Using the right tools: enhancing retrieval from marked-up documents, Computers and the Humanities, 33(10): 59-84, 1999

Alignment graphs (and much else): Steven Bird's Linguistic Annotation Page, http://www.ldc.upenn.edu/annotation/


Annotation topics

Item annotations: words, parts-of-speech, lemmas

Simple annotations (one data stream): boundaries, spans, partitions

Complex annotations (multiple data streams): sequences, graphs, overlaps

Data models for annotation access: streams, trees, graphs, databases

Human factors in annotation: writing instructions, measuring and improving reliability


XML topics

Data formats: HTML, XML and SGML

Data description formalisms: DTDs, XML Schema

Style languages: XSLT

Query languages: Annotation Graphs, XML Query, XQL, Quilt, LORE


Exercises

On average, these exercises should take about one hour to complete. Try not to spend longer.

Create an XML document: create a very simple memo.

Simple annotation: disambiguate parts-of-speech. Compare your results with those made by a partner.

Style: create an XML DTD and an XSL style sheet for displaying POS-tagged text in a browser.
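For the step of comparing your part-of-speech tags with a partner's, one common way to put a number on the comparison is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not name a specific measure, so treat this as one possible choice rather than the prescribed one. The function name and the tag sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators' tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag sequences of equal length")
    n = len(tags_a)
    # Proportion of tokens on which the two annotators chose the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Agreement expected if each annotator tagged at random
    # according to their own tag frequencies.
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical tag throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical tags for the same four tokens from two annotators.
    mine = ["DT", "NN", "VB", "NN"]
    partner = ["DT", "NN", "NN", "NN"]
    print(round(cohen_kappa(mine, partner), 3))
```

Here the annotators agree on three of four tokens (0.75), but because `NN` is frequent for both, the chance-corrected figure is noticeably lower than the raw agreement.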


Exercises

More complex annotation: syntactic annotation in Penn Treebank style. As before, compare results.

Search: exercise XML search tools on the newly annotated texts.
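To try XML search tools on syntactic annotation, the bracketed trees first have to become XML. The sketch below shows one possible conversion, assuming every bracket carries a label, as in `(NP (DT the) (NN cat))`. The `node` element and `cat` attribute are hypothetical names chosen for this example, not a standard encoding.

```python
import re
import xml.etree.ElementTree as ET


def parse_bracketed(text):
    """Convert one labelled bracketing into a tree of <node cat="..."> elements."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        node = ET.Element("node", {"cat": tokens[pos]})  # category label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())  # nested constituent
            else:
                node.text = tokens[pos]  # the word under a part-of-speech label
                pos += 1
        pos += 1  # consume ')'
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the closing bracket")
    return tree


if __name__ == "__main__":
    tree = parse_bracketed("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
    print(ET.tostring(tree, encoding="unicode"))
```

The category goes in an attribute rather than in the element name because some treebank labels (for example `PRP$` or `-NONE-`) are not legal XML names. With this encoding, a query for every noun phrase becomes a search for `node` elements whose `cat` is `NP`.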


Projects

These are open-ended projects hard enough to merit write-up in a research paper. I’d willingly supervise these.

Design a DTD and an XSL stylesheet for tree bank style syntactic annotations. Implement a convenient interface allowing these annotations to be edited over the Web.

Investigate the corpus search tools provided at the LDC web-site. What do they do? Could they and should they use XML/XSL technology for the same purpose? (Easiest if your institution has an LDC membership).


Projects (contd)

Critical review of the Talkbank tools (www.talkbank.org)

Design an XML query language that works well with very big documents

What sort of annotation structure for dialog? (cf. MATE)

Design an optimizing compiler for XSLT (cf. Sun’s very recent XSL compiler)

Does XSLT support language modeling and statistical computation? (If you put XSLT and Splus into a closed box and shake vigorously, what emerges?)


In Summary

Phew! </xmlstuff>