Generic XML Programs

Chris Brew, Ohio State University
http://www.ling.ohio-state.edu/~cbrew

Generic XML Programs, Ohio, Jan 2003

Questions

• What is needed for linguistic annotation?
  • Tools and techniques
  • Patterns of collaboration
  • Things to avoid doing
• (How) can XML help?
  • What is easy/hard to represent?
  • Which standards are useful right now?
• (How) can we write reliable XML software?
  • Patterns for typical annotation tasks
  • Simple ways of exploiting DTDs

What is XML?

• It is a markup language used for annotating text
• is concerned with logical structure
  • to identify sections, titles, section headers, chapters, paragraphs, …
• is not concerned with appearance
  • you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
  • you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'
• Derived from SGML
• Designed (in some measure) by linguists and scholars.

What is XML?

• It is a means for describing and presenting data (usually on the web).
• Most of the big computer and entertainment companies believe XML is the solution.
• But exactly what was the problem?
  • Presenting a parts database over the Internet
  • Running an on-line job market
  • Usually not corpus creation
• XML is mainstream. We’re the minority now.

Does XML live up to the hype?

• Of course not, but…
• The basic idea is simple labeled brackets. Lisp showed the power of this idea in knowledge representation.
• Knowledge representation is inherently hard. Lisp made it easier to state the problem, but it wasn’t itself the solution. XML won’t solve your knowledge representation problems either, but it will let you state them and explain them to your friends.
• Labeled brackets++
  • Labeled brackets, but designed for information exchange, with sophisticated input (and political pressures) from many interest groups.

Does XML live up to the hype?

• Yes. XML and allied standards (XSLT, XML Query) give us a framework for data interchange.

[Figure: Data (Weather Reports, Weather Model) → XML → Transformation (XSL) → XML → End Users (Browser, Day Planner)]

Linguistic annotation

• The real task of linguistic annotation is to add information to (recordings of) naturally occurring communicative behaviour.

• Annotation is hard enough without having to worry about buggy tools. Many annotators are data-centric and intolerant of ‘interface junk’.

• Many users are working linguists who simply want to find examples and/or numbers to put into the next paper. Limited tolerance for complex and inaccessible data formats.


Multi-level Annotation

• Corpora are often annotated at different levels of detail, by heterogeneous teams, over a period of decades.

• Our view/understanding of the data will change.
• A useful strategy is a layered approach, defining a core set of distinctions which is augmented by optional levels of detail
  • EAGLES defines cross-language annotation standards this way.
  • Optional levels may be for particular languages or particular applications

Encoding for multiple levels

• XML has good capabilities.
  • Links between documents
  • Transformation tools
  • Display facilities
  • Well-defined properties
• We need agreed standards for
  • Levels of annotation
    • Some automatic (POS tagging)
    • Some require humans (co-reference etc.)
  • Checking procedures
  • Annotation criteria

Architectures for multiple levels

• Must support
  • Range of annotation types
  • Multiple versions, alternatives
  • Different human languages
  • Different media and modalities
    • (video, audio, diagrams)
  • Complex links between documents, parts of documents, external data sources.
• XML has some support for all of these, especially given the use of stand-off annotation.
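Stand-off annotation keeps each layer in its own document and points back at the base text instead of wrapping it. A minimal Python sketch of the idea, assuming a hypothetical base text whose words carry `id` attributes and a hypothetical separate layer whose `<pos>` elements refer to them through a `ref` attribute (these element and attribute names are illustrative, not taken from any standard):

```python
import xml.etree.ElementTree as ET

# Hypothetical base text: words are identified, not annotated.
BASE = """<text>
  <w id="w1">John</w> <w id="w2">detests</w> <w id="w3">anchovies</w>
</text>"""

# Hypothetical stand-off layer: points at the words by id.
POS_LAYER = """<annotations layer="pos">
  <pos ref="w1" tag="NP0"/>
  <pos ref="w2" tag="VVZ"/>
  <pos ref="w3" tag="NN2"/>
</annotations>"""

def merge_layer(base_xml, layer_xml):
    """Join a stand-off layer to the base text by id, modifying neither."""
    words = {w.get("id"): w.text for w in ET.fromstring(base_xml).iter("w")}
    return [(words[a.get("ref")], a.get("tag"))
            for a in ET.fromstring(layer_xml).iter("pos")]

print(merge_layer(BASE, POS_LAYER))
```

Because the layer only refers to the base text, a second team could add another layer later without touching either existing file.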


Data models

• TIPSTER model (annotated spans)
• GATE (focus on re-use of processes rather than data)
• ATLAS annotation graph formalism
  • Networks of nodes, many of which are ordered by the time relation before/after. Nodes may overlap, not overlap, or the annotation may be indeterminate. Gives a lot of flexibility for doing e.g. phonology.
  • Granularity is critical: must be able to refer to smallest objects of interest.
  • Verbose, but annotation graph queries can be optimized, and/or indexed.
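A rough Python sketch of the annotation graph idea, under simplifying assumptions: nodes carry an optional time offset, labelled arcs connect them, and an overlap query returns nothing for arcs whose end points are not anchored in time. The class and method names are invented for illustration and are not the ATLAS API.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationGraph:
    times: dict = field(default_factory=dict)  # node id -> time offset, or None
    arcs: list = field(default_factory=list)   # (start node, end node, tier, label)

    def add_node(self, node, time=None):
        self.times[node] = time

    def add_arc(self, start, end, tier, label):
        self.arcs.append((start, end, tier, label))

    def overlapping(self, tier_a, tier_b):
        """Pairs of labels from two tiers whose anchored time spans overlap."""
        pairs = []
        for s1, e1, t1, l1 in self.arcs:
            for s2, e2, t2, l2 in self.arcs:
                if t1 != tier_a or t2 != tier_b:
                    continue
                span = (self.times[s1], self.times[e1],
                        self.times[s2], self.times[e2])
                if None in span:
                    continue  # indeterminate: an unanchored node gives no answer
                if span[0] < span[3] and span[2] < span[1]:
                    pairs.append((l1, l2))
        return pairs

g = AnnotationGraph()
for node, t in [(0, 0.0), (1, 0.3), (2, 0.5), (3, 0.9)]:
    g.add_node(node, t)
g.add_arc(0, 2, "word", "right")
g.add_arc(0, 1, "phone", "r")
g.add_arc(1, 3, "phone", "ay")
print(g.overlapping("word", "phone"))
```

The nested loop is the "verbose" part; a real system would index arcs by tier and time rather than scan them all.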


Tool architecture

• MULTEXT/LT XML pioneered the use of standoff annotation and adapted the Unix tool architecture for SGML and XML.

• GATE implements the Tipster architecture. Focus on tool composition.

• ATLAS is similar, but based on annotation graphs.

Tool architecture: consensus

• High-level (computer) language independent API with a three layer architecture
  • Low-level physical details (database, text-files, proprietary system)
  • Logical view of the data (XML has several good candidates)
  • High-level, SQL-like query interface, designed to be usable by programmers and non-programmers alike.
• Visual interface for annotation
• (Optional) shortcuts for efficiency. Provide an event-based view on hierarchical structure, until such time as the query language is efficient

Tool support

• Still need to embed tools into a system
• May wish to compress our data files.
  • XML compresses very well (Liefke and Suciu)
  • Searching compressed data is feasible
• May wish to index data files, as has been done with Stuttgart’s CQP
  • Relies on data having fairly simple structure

Non-traditional data

• Documents with diagrams, including engineering drawings.

• Books which overlay text and illustration.
• Manuscripts where the physical details of calligraphy matter.
• Interlinked texts.
• Phonetic databases, word-lists
• Personal mailboxes and the like

No obvious time line in some of these. Challenges to indexing.

Software infrastructure

• XML+XSLT
  • One way of providing views on corpora
  • Very good XML tools exist
  • But they aren’t specialised for language
• Annotation graphs
  • Advantage: customizable to our concerns
  • Disadvantage: someone has to do the customization.
  • Challenge: efficient, intuitive, expressive query languages (Bird, Buneman and Tan)


Transcriber (Barras et al, ‘00)

• Designed for manual annotation of large speech files.

• Now has annotation graph engine at its heart.

• Uses XML to define communication between modules

• Motivation is graceful handling of multi-channel audio or video

• Make the end-product more customizable, by building in less of the data description, and putting more into data files

• Retain simple, crisp user interface


Clan


Talkbank project

• Very diverse user group (ethologists, child language, CA)

• Serious effort to standardize tools
• Strongly committed to open source products
• http://www.talkbank.org

Semi-structured data

• The standard assumption in the database community is that when we have a body of data we know its structure.

• This is simply not true on the Web. We typically have data which have some structure but some irregularities.

Person: (name: "Chris Brew")

Person: (name: (first: "Jamie", last: "Brew"))

Person: (name: (first: "Matthew", last: "Brew", initial: "R"))
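Code that consumes such data has to cope with each shape it meets rather than rely on a fixed schema. A small Python sketch, assuming the three records above have been loaded as dictionaries (the function name `full_name` is hypothetical):

```python
def full_name(person):
    """Return a display name whether `name` is a plain string or a nested record."""
    name = person["name"]
    if isinstance(name, str):
        return name
    # Nested form: any of the parts may be missing.
    parts = [name.get("first"), name.get("initial"), name.get("last")]
    return " ".join(p for p in parts if p)

people = [
    {"name": "Chris Brew"},
    {"name": {"first": "Jamie", "last": "Brew"}},
    {"name": {"first": "Matthew", "last": "Brew", "initial": "R"}},
]
print([full_name(p) for p in people])
```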


XML is semi-structured data

• XML is allowed not to have detailed document type information (though it may have).

• Some XML applications need to be generic, in the sense that they are not limited to any particular DTD: browsers, editors, tree diff…

• Others make assumptions about the class of documents they will process, but do not fully specify a DTD.

• Others are tied to many details of specific DTDs.
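As an illustration of the first kind of program, here is a Python sketch of a fully generic tool: it reports which elements and attributes occur in any well-formed document and assumes no DTD at all. It uses the standard `xml.etree.ElementTree` module rather than LT XML, and the function name `element_census` is made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_census(xml_text):
    """Count element names and attribute names without assuming any DTD."""
    elements, attributes = Counter(), Counter()
    for elem in ET.fromstring(xml_text).iter():
        elements[elem.tag] += 1
        attributes.update(elem.attrib.keys())
    return elements, attributes

elements, attributes = element_census(
    "<S N='092'><W TYPE='ITJ'>Yes</W><W TYPE='DT0'>that</W><C TYPE='PUN'>,</C></S>")
print(elements.most_common())
print(attributes.most_common())
```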


Types of annotation

• A taxonomy of different sorts of annotation which are needed for various forms of linguistic data.


Item annotations

• Words, Parts-of-speech, lemmas

Each item receives one annotation on each of several levels, and is related to others primarily by contiguity.

Sample tool: Stuttgart CQP

Sample query: word = "right" & pos != "j.*"

Simple annotations

• Boundaries, Spans, Partitions
• Boundaries
  • Correspond to EMPTY XML elements
  • Single click inserts boundary
  • Resulting span partitions the time line of the input file.
• Discontiguous spans.
  • Click and drag selects a span of the document
  • Inserts start tag and end tag
• Attributes
  • Subsequent click on start tag (or empty tag) brings up menu of attributes

DTD relations

• Such annotation tools rely on relations between DTDs to give meaningful user actions

• Ensure syntactic consistency of input, allowing annotator to focus on meaning of annotation.


Stylesheets mediate relations

• Editors and their operations relate to the data which they process.

• Visual presentations relate to the structure of the data which they process

• Frequency counted wordlists are related to the original form of the corpus.

• Many processes of analysis and summarisation are well expressed as relations between document types.

• Stylesheets, in some form, are a natural choice to mediate such relations, making them customizable.


XSL Transformations

Content from one document, style from another. [Diagram; one label reads "Structure".]

barts_stylish_memo.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="memo.xsl"?>
<!DOCTYPE article [
<!ELEMENT article (title,(para|credit)+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
<!ENTITY author "Bart Simpson">
<!ENTITY techie "Lisa Simpson">
<!ENTITY parents "Marge and Homer">
<!ENTITY school "M&amp;M University">
]>
<article>
<title>Bart's Ph.D Thesis</title>
<para> by &author;: &school;</para>
<para>This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
Please may I stop now?</para>
<credit>&techie; of &school; for slick XML authoring.
</credit>
<credit>&parents; for unfailing support.
</credit>
</article>

memo.xsl

IE5 attempts to display the style in visual form, without any content.

Not standard, but very reasonable.


Source of memo.xsl

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
  <head><title><xsl:value-of select="//title"/></title></head>
  <body BGCOLOR='#FFFFCC'>
    <h1><xsl:value-of select="//title"/></h1>
    <xsl:for-each select="//para">
      <p><xsl:value-of/></p>
    </xsl:for-each>
    <hr/>
    <p><i> Thanks to: </i><br/>
    <xsl:for-each select="//credit">
      &#160; <xsl:value-of/><br/>
    </xsl:for-each>
    <hr/></p>
  </body>
</html>
</xsl:template>
</xsl:stylesheet>

Fill in the blanks

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
  <head><title>•••</title></head>
  <body BGCOLOR='#FFFFCC'>
    <h1>•••</h1>
    <xsl:for-each select="//para">
      <p>•••</p>
    </xsl:for-each>
    <hr/>
    <p><i> Thanks to: </i><br/>
    <xsl:for-each select="//credit">
      &#160; ••• <br/>
    </xsl:for-each>
    <hr/></p>
  </body>
</html>
</xsl:template>
</xsl:stylesheet>

XSLT gives you tools for sending part of document to one place, part to another.

Simplest use is pure fill in the blanks. Anybody who uses HTML, PHP and so on will be comfortable with this use of XSLT

If necessary, it is a Turing-complete programming language. It gives you the rope if you need it.
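For readers who have not used XSLT, the fill-in-the-blanks idea can be sketched in Python. This is an analogy, not a replacement for the stylesheet: it assumes a memo with `title`, `para` and `credit` elements and no entity references, and the names `PAGE` and `fill_in` are invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape
from string import Template

# The "blanks" are the $placeholders; everything else is fixed HTML.
PAGE = Template("<html><head><title>$title</title></head>"
                "<body><h1>$title</h1>$paras<hr/>"
                "<p><i>Thanks to:</i><br/>$credits</p></body></html>")

def _text(elem):
    """Text content of an element, escaped for inclusion in HTML."""
    return escape("".join(elem.itertext()).strip(), quote=False)

def fill_in(memo_xml):
    """Pull title, paragraphs and credits out of a memo and drop them into PAGE."""
    article = ET.fromstring(memo_xml)
    return PAGE.substitute(
        title=_text(article.find("title")),
        paras="".join(f"<p>{_text(p)}</p>" for p in article.iter("para")),
        credits="".join(f"{_text(c)}<br/>" for c in article.iter("credit")),
    )

memo = ("<article><title>Bart's Ph.D Thesis</title>"
        "<para>A very short article.</para>"
        "<credit>Lisa Simpson for slick XML authoring.</credit></article>")
print(fill_in(memo))
```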


XSLT standards

• Microsoft’s implementation in IE5 is now standard (used not to be, they put it out well before the standard existed).

• James Clark’s xt and Michael Kay’s Saxon are complete and highly conformant. xt is highly optimized, Saxon simpler and easier to work with

• W3C eats its own lunch. The HTML versions of the XML standard are generated with XSL

• In practice, current best options are
  • Static data: Pre-generate HTML from XML at publication time
  • Dynamic data: Use Saxon or xt as Java Servlets

Generating HTML

HTML is generated by running Saxon on poem.xml and poem.xsl

saxon poem.xml poem.xsl > poem.html


Using IE5 to view poem.xml

<poem>
<author>Rupert Brooke</author>
<date>1912</date>
<title>Song</title>
<stanza>
<line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line>
<line>And the hawthorn quickens with buds of green</line>
<line>And my heart with buds of pain.</line>
</stanza>
<stanza>
<line>My heart all Winter lay so numb,</line>
<line>The earth so dead and frore,</line>
<line>That I never thought the Spring would come again</line>
<line>Or my heart wake any more.</line>
</stanza>
<stanza>
<line>But Winter's broken and earth has woken,</line>
<line>And the small birds cry again;</line>
<line>And the hawthorn hedge puts forth its buds,</line>
<line>And my heart puts forth its pain.</line>
</stanza>
</poem>

poem.xsl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:template match="poem"><html><head>

<title><xsl:value-of select="title"/></title></head><body>

<xsl:apply-templates select="title"/><xsl:apply-templates select="author"/><xsl:apply-templates select="stanza"/><xsl:apply-templates select="date"/>

</body></html></xsl:template>


<xsl:template match="title">

<div align="center"><h1><xsl:value-of select="."/></h1></div>

</xsl:template>

<xsl:template match="author">

<div align="center"><h2>By <xsl:value-of select="."/></h2></div>

</xsl:template>

<xsl:template match="stanza">

<p><xsl:apply-templates select="line"/></p>

</xsl:template>


<xsl:template match="line">

<xsl:if test="position() mod 2 = 0">&#160;&#160;</xsl:if>

<xsl:value-of select="."/><br/>

</xsl:template>

<xsl:template match="date">

<p><i><xsl:value-of select="."/></i></p>

</xsl:template>

</xsl:stylesheet>

Computation model of XSL is structural recursion, allowing considerable flexibility in transforming documents. Implemented via queries.
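Structural recursion can be shown outside XSLT as well. The Python sketch below mirrors the template rules of poem.xsl with one branch per element type, each recursing into the children it selects. It is an analogy under simplifying assumptions (the centering `div`s are dropped, and the function name `render` is invented), not how an XSLT processor is implemented.

```python
import xml.etree.ElementTree as ET
from html import escape

def render(node, position=1):
    """Structural recursion: one rule per element type, recursing into children."""
    if node.tag == "poem":
        # Same selection order as the apply-templates calls in poem.xsl.
        order = ["title", "author", "stanza", "date"]
        body = "".join(render(c) for tag in order for c in node.findall(tag))
        return f"<html><body>{body}</body></html>"
    if node.tag == "stanza":
        lines = node.findall("line")
        inner = "".join(render(line, i) for i, line in enumerate(lines, 1))
        return f"<p>{inner}</p>"
    text = escape(node.text or "", quote=False)
    if node.tag == "title":
        return f"<h1>{text}</h1>"
    if node.tag == "author":
        return f"<h2>By {text}</h2>"
    if node.tag == "line":
        # Even-numbered lines are indented, as in the position() test.
        indent = "&#160;&#160;" if position % 2 == 0 else ""
        return f"{indent}{text}<br/>"
    if node.tag == "date":
        return f"<p><i>{text}</i></p>"
    return text  # default rule: unknown elements contribute only their text

poem = ET.fromstring(
    "<poem><author>Rupert Brooke</author><date>1912</date><title>Song</title>"
    "<stanza><line>And suddenly the wind comes soft,</line>"
    "<line>And Spring is here again;</line></stanza></poem>")
print(render(poem))
```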


XML tools for Unix

• Simple equivalents of UN*X tools are available (for free) to do simple SGML processing

• We'll introduce them using examples, and give details at the end


sggrep

• LT XML program for searching for structure and text in XML files
  • sggrep -q query -s subquery -t regexp in.xml
• Options
  • -d DTD: Specify a DTD explicitly. File is an XML file
  • -r : Attribute values in queries are regular expressions.
  • -v : Invert sense of sub-query+regexp.
  • Other options

LT XML query language

• Two-dimensional regular expressions
• First dimension is over tree paths
  • Based on file path analogy: DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs
• Second dimension is regular expressions over text content of leaf nodes
  • Select Ss containing Ws whose text is it's or its:
    -q S -s './W' -t "^(it's|its)$"
• Full UTZOO (Henry Spencer) regular expression support
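The two dimensions can be imitated with the Python standard library: a path expression picks out elements, and a regular expression is then applied to the text of the sub-elements. This sketch uses ElementTree's limited XPath subset and Python's `re`, so the path syntax differs from LT XML's; the function `select` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def select(root, path, subpath, pattern):
    """Elements on `path` with some `subpath` element whose text matches `pattern`."""
    regex = re.compile(pattern)
    return [item for item in root.findall(path)
            if any(regex.search((sub.text or "").strip())
                   for sub in item.findall(subpath))]

doc = ET.fromstring(
    "<DIV><S N='1'><W>its</W><W>tail</W></S>"
    "<S N='2'><W>bits</W></S>"
    "<S N='3'><W>it's</W><W>late</W></S></DIV>")
hits = select(doc, "./S", "./W", r"^(it's|its)$")
print([s.get("N") for s in hits])
```

Sentence 2 is not selected because the anchored pattern rejects "bits".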


sggrep: examples of use

• sggrep -q ".*/P/S" -s "./W[TAG=NN]"
  • find all S elements occurring inside a P element at any depth which immediately contain a W element with attribute TAG="NN".
• sggrep -q ".*/P/S/W[TAG=NN]"
  • find those W elements themselves
• sggrep -q ".*/S/W[0]" -t "^[a-z]"
  • find all sentence-initial words starting with a lower case letter.

LT XML and annotation

• Based on the Unix pipeline idea
  • Reads from standard input (usually)
  • Writes to standard output (usually)
  • Each tool does a single, fairly simple task
  • Controlled by command line flags
• In standard Unix filters, unit of transaction is character or line
• In LT XML, unit of transaction is
  • Either: An XML Bit (start tag, end tag, content fragment)
  • Or: An XML Item (balanced start tag, end tag, contents)
  • Or: mixed bits and items

xmlnorm

• xmlnorm is very simple to write using the LT XML API. Read all bits from file and print them back out in order.

• Many options for character set conversion and such.

// open input and output files
// optionally set output encoding
while ((bit := GetNextBit(in)) != NULL) {
    PrintBit(bit, out);
}

xmlchange

• xmlchange is just like xmlnorm, but with an additional line which changes the content of some of the bits before printing

// open input and output files
// optionally set output encoding
while ((bit := GetNextBit(in)) != NULL) {
    PerhapsChangeBit(&bit);
    PrintBit(bit, out);
}
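The same bit-at-a-time pattern can be tried out with Python's SAX support, where each event (start tag, end tag, text) plays roughly the role of a bit. This is an analogy using `xml.sax`, not the LT XML Python interface; the class `UppercaseWords` and the choice of upper-casing `<W>` content are illustrative only.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class UppercaseWords(XMLFilterBase):
    """Pass every event through, upper-casing text inside <W> elements."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._in_w = False

    def startElement(self, name, attrs):
        if name == "W":
            self._in_w = True
        super().startElement(name, attrs)

    def endElement(self, name):
        if name == "W":
            self._in_w = False
        super().endElement(name)

    def characters(self, content):
        # The "perhaps change" step: only text inside <W> is altered.
        super().characters(content.upper() if self._in_w else content)

def change_bits(xml_text):
    """Read events from xml_text, perhaps change them, and print them back out."""
    out = io.StringIO()
    filt = UppercaseWords(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()

print(change_bits("<S><W>hello</W> there</S>"))
```

As with xmlchange, nothing is held in memory beyond the current event, so the size of the input does not matter.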


xmlitems

• xmlitems: read bits until an interesting start bit turns up, then read the whole Item into memory before printing it.

// open input and output files
// optionally set output encoding
while ((bit := GetNextBit(in)) != NULL) {
    if (bit.type == NSL_start && interesting(bit)) {
        item = ItemParse(bit, in);
        ProcessItem(&item);
        PrintItem(item, out);
    }
}
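A comparable item-at-a-time loop can be written with `xml.etree.ElementTree.iterparse`, which hands over each element once its end tag has been read. This is a sketch of the pattern in standard Python, not the LT XML API; `iter_items` is a hypothetical helper and assumes the interesting elements are not nested inside one another.

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source, tag):
    """Yield each complete `tag` element once its end tag has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # drop the item's contents once the caller is done

doc = (b"<DOC><S><W TYPE='DT0'>that</W><W TYPE='AV0'>right</W></S>"
       b"<S><W TYPE='ITJ'>Yes</W></S></DOC>")
sentences = [[(w.text, w.get("TYPE")) for w in s.iter("W")]
             for s in iter_items(io.BytesIO(doc), "S")]
print(sentences)
```

Only one sentence's contents are held at a time, which is the point of working with items rather than whole documents.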


The Query Interface

• Query language packages up ItemParse mechanism returning only the items selected by the query

• Programs can receive the query as an argument or read it from some other file

// open input and output files
// optionally set output encoding
qu = ParseQuery(quString);
while ((item := GetNextQueryItem(in, qu, out)) != NULL) {
    PerhapsChangeItem(&item);
    PrintItem(item, out);
}

Queries or Bits?

• If you use the query interface, well-formed output is nearly automatic; with bits you have to be sure to balance every start tag with an end tag.

• If the items are huge, reading them into memory may be unacceptable. In that case, use the ItemParse pattern above.

Other LT XML facilities

• C structures representing document types, attribute information, element types
• Unique element names, so that (in C) you can say
    if(element.type == target) {…}
  rather than
    if(strcmp(element.type,target) == 0) {…}
• Support for memory management.
• Example programs
• 100+ pages of detailed documentation

Python interface

• Closely mirrors C interface, but safer and much easier to work with. Trades ultimate efficiency for convenience.

• Exposes DTD information, making possible applications like Thompson’s xed and xsv

• Meshes well with Python’s new Unicode support and excellent GUI facilities.

When not to use LT XML

• If input and output DTDs are very different, the LT XML filter model is unnatural. For that you want XSLT or similar.

• LT XML is stream oriented, so not especially natural for random access (if you need that, consider using a database).

• If someone else is providing your tools and they already do a good job.


What is XML Link

• Just as XML itself simplified SGML while extending HTML,
• XML-link simplifies HyTime while extending HTML
• XML-link provides mechanisms for
  • Describing links with link elements
  • Identifying links and link ends by type and role
  • Locating link ends with a powerful locator syntax
  • Incorporating link elements in-line or out-of-line
  • Specifying default behaviours

Simple XML-link example

• This is a simple reconstruction of HTML's A element, specifying a two-ended link in-line with one implicit and one explicit locator

<refr XML-LINK="SIMPLE" HREF="http://www.w3.org/">The W3C</refr>

• On the next slide is a richer example, specifying a two-ended link out-of-line with two explicit locators


More complex link example

<connect XML-LINK='EXTENDED'>
  <dutch XML-LINK='LOCATOR'
         HREF='http://www.klm.nl/About/Nederlands/default.htm'/>
  <english XML-LINK='LOCATOR'
           HREF='http://www.klm.nl/About/default.htm'/>
  This is a good example of hand-crafted home-page translation pairing.
</connect>

Using XML/XSL

• Ian Hughson and Henry Thompson built a prototype MT system using XSLT as the transformation language.

• MUC systems often benefit from being able to use queries to plug together subtly different processing pipelines for different parts of the document.

• Annotation pipelines can still benefit from this kind of process description.

• Links can be richer than in Web browsers, allowing structures more complex than trees.


Conclusions

• The LT XML strategy is to write tools which are parameterisable by user queries. These are similar in spirit to Unix sed and grep

• When transformations get complex, this gets unwieldy, but essentially the same idea is present in XSLT, which packages up collections of queries into programs.

• Long term, we might want something cleaner than XSLT.


In Summary

• Phew! </xmlstuff>


TreeStyle (Brew, 1999)

• Corpus Access - after Cutting et al.
• Tree analysis - analysis policy
• Node styling - style sheet
• Visualization - non-portable

Generic function dramatises trade off between demands of data and capabilities of medium.

(DISPLAY-OBJECT <tree-node> <visual>)


Corpus access

• Interface: (item <corpus> <tree-index>)
  • Just like Common Lisp elt. Also map-corpus and similar.
• Implementation
  • In-memory-corpus: simple adapter for Common Lisp sequence types.
  • wsj-corpus: simple disk-based indices to the files of the Wall Street Journal part of Penn Treebank. The indexing is crucial to usability.

Analysis

1 ("s" 2 ("np" 3 "John") 4 ("vp" 5 ("v" 6 "detests") 7 ("np" 8 "anchovies")))

1 -> (:head (:headword "detests"))
2 -> (:complement (:headword "John"))
3 -> (:terminal)
4 -> (:head (:headword "detests"))
5 -> (:head (:headword "detests"))
6 -> (:terminal)
7 -> (:complement (:headword "anchovies"))
8 -> (:terminal)

TreeStyle transforms corpus trees into CLOS objects which make it easy to traverse the parent/child hierarchy. They also give access to node labels.

This makes it easy to state simple heuristics over local trees, allowing encoding of (e.g.) Collins’ heuristics for finding heads and complements in the Penn Treebank.

This is labelled data; the information is present in the treebank from the outset. We just make it vivid.

Information is added to an eq hash table indexed by tree nodes.

Styling

1 ("s" 2 ("np" 3 "John") 4 ("vp" 5 ("v" 6 "detests") 7 ("np" 8 "anchovies")))

1 -> (:red :head (:headword "detests"))
2 -> (:purple :complement (:headword "John"))
3 -> (:italic :grey :terminal)
4 -> (:red :head (:headword "detests"))
5 -> (:red :head (:headword "detests"))
6 -> (:italic :grey :terminal)
7 -> (:purple :complement (:headword "anchovies"))
8 -> (:italic :grey :terminal)

Style policy is stated in the same framework as analysis policy.

One defines an analysis policy by specializing the generic function policy-for-node to a given corpus type.

One defines a style policy by specializing the generic function style-for-node to a given corpus type.

Analysis and style are non-destructive, optional and separate from visualization proper. The analysis policy and the style policy together play a role similar to that of the stylesheet in XSL or CSS.


Case study

Apply Collins’ heuristics to the Penn Treebank.

Colour really helps bring out the information encoded by Penn’s annotators.

• Red heads
• Purple complements
• Blue modifiers

Co-ordination and gapping are the tricky problems, as always.

Case study

Collins gives a generative statistical model for Penn trees. His parser uses the treebank as grammar, and performs very well. We’d like the same thing in a categorial framework. What is involved?

Thought: effectively ignore modifiers (in categorial terms, make them X/X or X\X). How will this pan out?


Case study

Not categorial enough, needs to be binary branching.

Left-factor, as in Charniak, Goldwater and Johnson (leaves probabilities unchanged).

Is this reasonable? Not always. Don’t like “But funds” as a constituent. Uncategorial.

Want to try “head factoring”. Work in progress.


The British National Corpus

• 2 gigabytes of contemporary English
• Marked up to word level with part of speech tags
• Extract data:
  • zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"
  • gives all singular nouns in a part of the corpus, e.g.

<W TYPE=NN1>part </W>
<W TYPE=NN1>meeting </W>
<W TYPE=NN1>while </W>
<W TYPE=NN1>funeral</W>
<W TYPE=NN1>loss</W>
<W TYPE=NN1>meeting</W>
<W TYPE=NN1>time </W>

The BNC: an example (2)

zcat medium.xml.gz | \
  sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
  -t "^[Rr]ight$"

gives sentences containing non-adjectival uses of the word 'right', e.g.

<S N=092>
<W TYPE=ITJ>Yes </W>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was</W>
<C TYPE=PUN>, </C>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was </W>
<W TYPE=AV0>right</W>
. . .
</S>

The BNC: an example (3)

Format the output into a more readable form:

zcat medium.xml.gz | \
  sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
  sgmltrans -r test.rule

Yes/ITJ that/DT0 was/VBD , that/DT0 was/VBD right/AV0 erm/UNC there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ much/AV0 you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD n't/XX0 there/EX0 ?

He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC he/PNP says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB me/PNP see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ at/PRP the/AT0 menu/NN1 and/CJC said/VVD right/AV0 , he/PNP said/VVD .

An extended example: Noun Compounds

• Noun compounds in British National Corpus
• What is a noun compound?
  • Too hard.
  • Simple approximation? Sequence of tags matching NN...
  • BNC uses a version of the Brown tags, where NN0, NN1, . . . are all variants of Noun
• A pipeline of SGML-aware tools will do the job
  • sgrpg | sggrep [ | . . .]
  • Use sgrpg to wrap such tag sequences in <G> ... </G>.
  • Use sggrep to filter the output.
  • Use further tools to tabulate, format, etc.

An extended example: The pipe

• Step by step through the pipe
• sgrpg -r -f np-pat.xml | ...
  • Group the sequences
  • -r use regexp matching
  • -f script file
• ... sggrep -d groups.xml -q '.*/G'
  • extract the sequences
  • -d DTD
  • -q query (selects groups)
• Result:

<G><W TYPE='AJ0-NN1'>Local</W>
<W TYPE='NN0'>government</W>
<W TYPE='NN2'>districts</W></G>
. . .
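The grouping step can be imitated in Python to see what the approximation does. This sketch is not sgrpg: it assumes the corpus has already been reduced to (word, tag) pairs, treats any tag containing "NN" as a noun tag (unanchored matching), and introduces a `min_len` parameter of its own to drop single nouns.

```python
import re
from itertools import groupby

NOUN_TAG = re.compile(r"NN")  # unanchored, so AJ0-NN1 also counts

def noun_groups(tagged, min_len=2):
    """Return runs of consecutive (word, tag) pairs whose tag matches NOUN_TAG."""
    groups = []
    for is_noun, run in groupby(tagged, key=lambda wt: bool(NOUN_TAG.search(wt[1]))):
        run = list(run)
        if is_noun and len(run) >= min_len:
            groups.append(run)
    return groups

tagged = [("Local", "AJ0-NN1"), ("government", "NN0"), ("districts", "NN2"),
          ("are", "VBB"), ("large", "AJ0"), ("areas", "NN2")]
for group in noun_groups(tagged):
    print(" ".join(f"{word}/{tag}" for word, tag in group))
```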


An extended example: filtering

• Find all words with unresolved tags, e.g. AJ0-NN1
  • use regexp matching, which is unanchored by default
  • ...| sggrep -r -q './W[TYPE="-"]' | ...
• Find all words in second position
  • ...| sggrep -q './W[1]' | ...
• Find all words with unresolved tags in second position
  • ...| sggrep -r -q './W[1 TYPE="-"]' | ...

An extended example: counting

• Count all words in second position
  • ...| sggrep -q './W[1]' | sgcount
• Count all words with unresolved tags in second position
  • ...| sggrep -r -q './W[1 TYPE="-"]' | sgcount
• Results:
  • all 2nd place W: 23283
  • 2nd place W with unresolved tag: 5066

An extended example: long compounds

• Long compounds including 'government'
• Use subquery to select <G>...</G>s with 'government':
  • sggrep -q G -s './W' -t government
• Next step, discard short ones:
  • sggrep -q G -s './W[2]'
• Then sgmltrans for neater format
• Results:
  • official/AJ0-NN1 government/NN0 report/NN1-VB
  • Local/AJ0-NN1 government/NN0 districts/NN2
  • ...

An extended example: left context

• select for 'government' in 2nd place
  • . . . | sggrep -q G -s './W[1]' -t government |
• pull words from first place
  • sggrep -q './W[0]' |
• remove markup
  • textonly |
• use UN*X for the rest
  • sort | uniq -c | sort -nr | head -4
  • 6 French
  • 5 German
  • 4 interim
  • 4 Chinese
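The tail of the pipe (sort | uniq -c | sort -nr | head) is a frequency count, which can be sketched in Python with `collections.Counter`. The compounds below are made-up sample data, not BNC output, and the sketch matches the head word by exact equality where sggrep would use a regular expression.

```python
from collections import Counter

def top_left_neighbours(compounds, head="government", n=4):
    """Most frequent first words of compounds whose second word is `head`."""
    firsts = [words[0] for words in compounds
              if len(words) > 1 and words[1] == head]
    return Counter(firsts).most_common(n)

# Invented sample compounds, each a list of words.
compounds = [
    ["French", "government", "policy"],
    ["German", "government"],
    ["French", "government"],
    ["local", "government"],
    ["government", "report"],
]
print(top_left_neighbours(compounds))
```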


British International Corpus?

• We are more francophone than we think!
• Longest 'noun-phrase' in 10% of BNC is:
  • serai/NN1 mentionn&eacute;/NN1 dans/NN2 le/NN1 rapport/NN1-VB qui/NN1 te/NN1 sera/NN1 remis/NN1

• No disgrace that the part-of-speech tagger gave up here.

• Tools can't be better than their input allows


Human factors in annotation

• Writing instructions
  • Start by writing a DTD.
  • Define an explicit step-by-step process for making decisions.
  • You’ll still need a body of case-law to ensure reliability. So you’ll need to revise as you go.
  • If you have a group of five people, use a star model in which each of four communicates with a central coordinator, revising the draft instructions as you go.
• Measuring reliability
  • Use standard measures of inter-rater reliability from social psychology, including the kappa statistic
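For two annotators, Cohen's kappa is (observed agreement minus chance agreement) divided by (1 minus chance agreement), where chance agreement is estimated from each annotator's own label frequencies. A minimal Python sketch, assuming both annotators labelled the same items in the same order (the function name and the sample labels are illustrative):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both pick category c independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa comes out at 0.5.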


Can we predict queries?

• Corpora don’t change much.
  • Easy searches can be handled with flat indices into tagged data.
  • Complicated searches are rare enough that it might be OK to do them by linear search of the corpora
• Queries do change
  • Things like the historian’s application in Welty and Ide warrant expressive search, but won’t cause a revolution.
  • A single paper correlating gaze, gesture and audio might spawn many imitators. Expectations from corpus tools might shift radically.

Tutorials

• XML: far too many to mention
  • The XML revolution: technologies for the future Web
    • http://www.brics.dk/~amoeller/XML
• XSL:
  • XSL specification
    • http://www.w3.org/Style/XSL
  • Robin Cover's guide
    • http://www.oasis-open.org/cover/xsl.html

Resources

• LT-XML
  • http://www.ltg.ed.ac.uk/software/xml/index.html
• Full-text search
  • Witten, Moffat and Bell's Managing Gigabytes
  • http://www.cs.mu.OZ.AU/mg/

Corpus Tools

• Stuttgart Corpus Workbench
  • http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench
• Birmingham Qwick
  • http://www-clg.bham.ac.uk/QWICK/
• The MATE Workbench
  • http://www.cogsci.ed.ac.uk/~dmck/MateCode
  • NB. Prototype

The Linguistic Data Consortium

• LDC - based in Pennsylvania USA
  • Distributes text corpora
  • See: http://www.ldc.upenn.edu/
• SGML Corpora include:
  • The European Language Newspaper Text corpus
    • French (100 million words), German (90 million words) and Portuguese (15 million words). SGML.
  • TIPSTER Information Retrieval Text Research Collection
    • 3 gigabytes. SGML-like. Various English texts.
  • United Nations Parallel Text Corpus (English, French, Spanish)
    • Fully-compliant SGML, 2.5 gigabytes

Bibliography

• McKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural Language Processing. Computers and the Humanities, 31(5): 367-388, 1997.

• Sinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for Corpus Exploration. Computers and the Humanities, 31(3): 229-255, 1998.

• References on McKelvie's MATE workbench page: http://www.cogsci.ed.ac.uk/~dmck/MateCode

• Welty and Ide: Using the right tools: enhancing retrieval from marked-up documents. Computers and the Humanities, 33(10): 59-84, 1999.

• Alignment graphs (and much else): Steven Bird's Linguistic Annotation Page, http://www.ldc.upenn.edu/annotation/

“Problems” with XML

• Uses complex and weird terminology
  • Yes. But so does the ANSI C standard. So do most fields…
• Not convenient for specifying graphs (as opposed to trees)
  • This is a point about graphs, not XML. Unification grammar notations get unwieldy too.
• Not as convenient as plain text
  • True for some tasks, but the extra structure of XML lets you do things that you wouldn’t even try with plain text.

XML Conclusions

• XML is the wave of the future
  • Both Microsoft and Netscape have endorsed it
  • Both Mozilla and IE5 have XML support built-in
  • Very good free software is available
  • Microsoft seem to be serious about standards compliance
• The W3C have made it clear that all subsequent W3C standards for web distribution of information will be based on XML (cf. SMIL, SVG and RDF)
• Issues
  • XSLT efficiency - space and time.