Guy Aston, Ylva Berglund Prytz, & Lou Burnard,
http://www.natcorp.oucs.ox.ac.uk
Exploring BNC-XML with Xaira
What is the BNC?
a snapshot of British English, taken at the end of the 20th century
100 million words in approx 4000 different text samples, both spoken (10%) and written (90%)
synchronic (1990-4), sampled, general purpose corpus
available under licence; latest edition is BNC-XML (13 mar 2007)
Distinctive features of the BNC
non-opportunistic design
standardized markup system
structural annotation
word class annotation
contextual information
general availability
...in these respects, the BNC remains distinctive, twenty years on!
What's new in BNC-XML?
No systematic proofing, re-editing, or re-parsing...
Same as BNC World:
texts (minus duplicates)
POS tagging (but extended)
Additions:
simpler POS codes
lemmata
Improvements:
duplications, categorizations, segmentations...
coded descriptions
BNC-XML regroups texts using additional classification criteria
[Charts of the regrouped text categories, counted in sentences and in words: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, OtherSpoken]
<wtext type="NONAC">
<div level="1" n="1" type="leaflet">
<head type="MAIN">
<s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w> <w c5="DTQ" hw="what" pos="PRON">WHAT</w> <w c5="VBZ" hw="be" pos="VERB">IS</w> <w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c> </s>
</head>
<p>
<s n="2"><hi rend="bo"> <w c5="NN1" hw="aids" pos="SUBST">AIDS</w> <c c5="PUL">(</c><w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w> <w c5="AJ0" hw="immune" pos="ADJ">Immune</w> <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w> <w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w><c c5="PUR">)</c></hi>
 <w c5="VBZ" hw="be" pos="VERB">is</w> <w c5="AT0" hw="a" pos="ART">a</w> <w c5="NN1" hw="condition" pos="SUBST">condition</w> <w c5="VVN" hw="cause" pos="VERB">caused</w> <w c5="PRP" hw="by" pos="PREP">by</w> <w c5="AT0" hw="a" pos="ART">a</w> <w c5="NN1" hw="virus" pos="SUBST">virus</w> <w c5="VVN" hw="call" pos="VERB">called</w> <w c5="NP0" hw="hiv" pos="SUBST">HIV</w>
 <c c5="PUL">(</c> <w c5="AJ0-NN1" hw="human" pos="ADJ">Human</w> <w c5="NN1" hw="immuno" pos="SUBST">Immuno</w> <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w> <w c5="NN1" hw="virus" pos="SUBST">Virus</w><c c5="PUR">)</c><c c5="PUN">.</c> </s>
…
</p>
…
</div>
</wtext>
What is the markup for?
It makes it possible for you to
distinguish aids=SUBST from aids=VERB
distinguish occurrences in writing from ones in speech
distinguish occurrences in headings from ones in paragraphs
identify contextual units like sentences and paragraphs
FACTSHEET WHAT IS AIDS?AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
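A minimal sketch of how the markup supports these distinctions, using Python's standard xml.etree.ElementTree on an abridged copy of the sample above. The function name find_headword and the abridgement are illustrative assumptions, not part of the BNC or Xaira; Xaira itself does this through its own query interface.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's sample (sentence 2 shortened for illustration).
SAMPLE = """<wtext type="NONAC"><div level="1" n="1" type="leaflet">
<head type="MAIN"><s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w> <w c5="DTQ" hw="what" pos="PRON">WHAT</w> <w c5="VBZ" hw="be" pos="VERB">IS</w> <w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s></head>
<p><s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS</w> <w c5="VBZ" hw="be" pos="VERB">is</w> <w c5="AT0" hw="a" pos="ART">a</w> <w c5="NN1" hw="condition" pos="SUBST">condition</w><c c5="PUN">.</c></s></p>
</div></wtext>"""


def find_headword(root, hw, pos):
    """Return (container, sentence number, word form) for each <w> whose
    hw and pos attributes match. Hypothetical helper for illustration."""
    hits = []
    for container in ("head", "p"):          # heading vs paragraph
        for unit in root.iter(container):
            for s in unit.iter("s"):         # sentence units
                for w in s.iter("w"):
                    if w.get("hw") == hw and w.get("pos") == pos:
                        hits.append((container, s.get("n"), (w.text or "").strip()))
    return hits


root = ET.fromstring(SAMPLE)
# The root element name separates written (wtext) from spoken (stext) texts.
print("written" if root.tag == "wtext" else "spoken")
print(find_headword(root, "aids", "SUBST"))  # noun uses, with their location
print(find_headword(root, "aids", "VERB"))   # verb uses: none in this sample
```

Without the pos attribute, the two searches could not be told apart; without head, p and s, the hits could not be located in a heading or a paragraph.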
Has English moved on since the BNC?
types of text: e-mail, web pages / blogs, SMS, personal letters
topics: globalization, internet, Elvis, Word Perfect
how comparable is the Web?
Out of date?
The composition (and date) of any corpus affects inferences drawn from it
There aren't many alternatives
Web-as-corpus: 85% of written texts aren't on the web - and spoken texts?
Results from monitor corpora are non-replicable
Copyright permissions are unrepeatable
Quantitative and qualitative comparative evaluations of BNC coverage are needed but “it's surprising how much is there”
What can you do with it?
The BNC is a problematizing resource...
complements (and corrects) intuition
increases learner autonomy
critiques the myth of the native speaker
... for teacher and learner alike
XML makes it more accessible by non-specialist software (eg A0S in web browser)
You can use XAIRA to ...
find sample sentences: cloze tests
check what the textbook says: grammar vs usage
(dis)confirm intuitions
find sample specialist texts
make serendipitous discoveries
Finding sample sentences
some phrases that take the gerund: there's no point ...; how / what about ...
generatable phrases: [comparative] and [comparative]
sentence structures: [s-initial interjection]
(Dis)confirming intuition
about choices: have a problem + infinitive or gerund? do you make or take decisions?
about vocabulary: which nouns collocate with hard?
about grammar: I would be grateful if you [modal]?
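The vocabulary question can be approximated outside Xaira by counting the nouns that directly follow hard in POS-tagged text. The sketch below is an assumption-laden illustration: the function right_collocates and the hand-tagged token list are hypothetical, and raw counts are only a starting point, not a collocation measure.

```python
from collections import Counter


def right_collocates(tagged, node, pos, window=1):
    """Count words carrying `pos` within `window` tokens to the right of
    `node`. `tagged` is a list of (word, pos) pairs. Hypothetical helper."""
    counts = Counter()
    for i, (word, _) in enumerate(tagged):
        if word.lower() != node:
            continue
        for other, other_pos in tagged[i + 1 : i + 1 + window]:
            if other_pos == pos:
                counts[other.lower()] += 1
    return counts


# Invented, hand-tagged tokens using the simplified pos values of BNC-XML.
TAGGED = [
    ("It", "PRON"), ("was", "VERB"), ("hard", "ADJ"), ("work", "SUBST"), (".", "STOP"),
    ("Hard", "ADJ"), ("work", "SUBST"), ("and", "CONJ"),
    ("hard", "ADJ"), ("times", "SUBST"), (".", "STOP"),
]

for noun, freq in right_collocates(TAGGED, "hard", "SUBST").most_common():
    print(noun, freq)
```

On real corpus data the counts would need comparing with each noun's overall frequency before calling anything a collocate.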
Finding specialised texts
The BNC has an extraordinary range: travel agent brochures, weather reports, formal invitations, advertising, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, leaflets, personal diaries...
The problem is finding it: use WLD principle
For learners...
The same as teachers
Pointers to follow in the quest for idiomaticity:
collocations
colligations
semantic preferences
semantic prosodies/pragmatic associations
associations with particular genres/domains
Can learners use the BNC “autonomously”?
The ins and outs of autonomous use
Learners may need warning to...
focus on patterns which recur, without necessarily trying to explain all the data
avoid overgeneralisation
... and encouragement to
be curious
browse the context
investigate exceptions
What are ins and outs?
(and are they the same as ups and downs?)
50 occurrences, sort left 2
colligation: (all) the ins and outs of
semantic preference: know/learn/understand/keep up with/get to grips with/get down to/forget; explain/teach/guide through/give/look at
semantic prosody: difficulty(?)
analysis - mainly spoken conversation, but numbers too small for reliable inference
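A rough sketch of a left-sorted concordance, which is what makes patterns such as "(all) the ins and outs of" visible. It assumes that "sort left 2" means ordering hits by the two words to the left of the node, nearest word first; the functions kwic and sort_left and the example tokens are hypothetical, not Xaira's implementation.

```python
def kwic(tokens, phrase, width=3):
    """Return (left, node, right) token lists for every occurrence of `phrase`."""
    n = len(phrase)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i : i + n]] == phrase:
            left = tokens[max(0, i - width) : i]
            right = tokens[i + n : i + n + width]
            lines.append((left, tokens[i : i + n], right))
    return lines


def sort_left(lines, k=2):
    """Sort concordance lines by the k words left of the node, nearest first."""
    return sorted(lines, key=lambda line: [w.lower() for w in reversed(line[0][-k:])])


# Invented example text, not BNC data.
TOKENS = ("she knows the ins and outs of it . "
          "he explained all the ins and outs of the job").split()

for left, node, right in sort_left(kwic(TOKENS, ["ins", "and", "outs"])):
    print(" ".join(left).rjust(22), "|", " ".join(node), "|", " ".join(right))
```

Sorting this way groups lines that share the same left context, so recurrent verbs and determiners line up one above the other.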
Exploring idioms
make a point; the point is; point out
have a point; high point; point to
in point of fact; starting point; no point in
point of view; at X point; what's the point
to the point; see/get/grasp the point
Example: idioms with point
Exploring features of speech
PS6NR >: [laugh] he's not a millionaire yet.
PS6NM >: No so perhaps not, mm. Oh perhaps, perhaps he, perhaps he has the knowledge but has difficulty in er navigating his way to the betting shop to to do anything about it.
PS6NR >: [laugh]
PS6NM >: Anyway erm
PS6NR >: Right I've ... results see this is
PS6NM >: Mm.
PS6NR >: this is really what I'm [ ... ]
PS6NM >: Yeah.
PS6NR >: comparison of subjects within groups and between groups I thought that's
PS6NM >: Yeah, mm.
PS6NR >: like a typical [ ... ]
Examples: spoken discourse markers and back channels
Exploring productivity of affixes
How many adjectives can you think of ending in -ish? babyish, bearish, .... wankish, whorish, yobbish
How many nouns starting with anti-? How about verbs?
Creative writing
Paul Auster: City of Glass
It was the wrong number that started it, the telephone ringing three times in the dead of the night, and the voice on the other end asking for someone he was not.
Examples: story beginnings
Ian McEwan: Saturday
Everyone agrees, airliners look different these days, predatory and doomed.
Where can I get one?
BNC XML: http://www.natcorp.ox.ac.uk
now available on DVD
standalone single user licence or institutional licence
discounted price till end June
XAIRA
Delivered free with the BNC (and also available free from http://xaira.sf.net)
Usable with any XML corpus
Usable/ish on any platform