Corpus Mark-up
UoL Summer Institute in Corpus Linguistics
Matthew Brook O’Donnell
Aims
• Introduce the concepts of corpus mark-up and annotation
• Consider why we would want to add extra non-textual information to corpus texts
• Use a POS tagger and tagged text
What is Corpus Annotation?
• ‘the practice of adding interpretative linguistic information to a corpus’ (Leech 2005)
– interpretative
– linguistic
– results in -> value-added corpus
Terminology
• Corpus Markup
– processing/formatting information
– metadata/text classifications
– structural representation
• Tagging
– (usually) inline addition of category to word(s)
• Parsing
– higher-level, multiword units (constituents)
– chunking/shallow vs. full syntactical parsing
– needn’t just be syntactical analysis
• XML – eXtensible Markup Language
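As a rough illustration of how XML can carry the layers listed above (metadata, structural representation, inline tagging), the sketch below builds a tiny document with Python's standard `xml.etree.ElementTree`. The element and attribute names (`text`, `header`, `s`, `w`, `pos`) are assumptions made up for this example, not a standard corpus schema.

```python
import xml.etree.ElementTree as ET

def build_marked_up_text(title, tagged_tokens):
    """Wrap (word, tag) pairs in a minimal, hypothetical XML structure."""
    text = ET.Element("text")
    # Metadata layer: information about the text, not part of the text itself.
    header = ET.SubElement(text, "header")
    ET.SubElement(header, "title").text = title
    # Structural layer: one numbered sentence element.
    sentence = ET.SubElement(text, "s", n="1")
    # Tagging layer: each word carries its category as an attribute.
    for word, tag in tagged_tokens:
        w = ET.SubElement(sentence, "w", pos=tag)
        w.text = word
    return text

tokens = [("Corpus", "NN1"), ("annotation", "NN1"), ("is", "VBZ")]
doc = build_marked_up_text("Sample text", tokens)
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)
```

Keeping the tag in an attribute rather than fused to the word means the plain text can still be recovered by reading only the element content.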
Why Annotate?
1. Manual examination of corpus
2. Automatic analysis of corpus
3. Reusability of annotations
4. Multi-functionality
5. Objective record of analysis
6. Annotation process is corpus analysis
Leech 2005
O’Donnell 1999
McEnery 2003
Types of Corpus Annotation
• Part-of-speech (POS)
• Lemmatization
• Syntactical (parsing)
• Semantic (domain classifications)
• Coreference (Discourse)
• Pragmatic (Speech acts – dialogue)
• Stylistic
• Research specific (ad hoc)
POS Tagging: CLAWS C5
Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF adding_VVG interpretative_AJ0 linguistic_AJ0 information_NN1 to_PRP a_AT0 corpus_NN1 ._.
NN1 singular noun
AJ0 adjective (unmarked)
VBZ -s form of the verb "BE"
PRF the preposition OF
VVG -ing form of lexical verb
AT0 article
POS Tagging: CLAWS C7
Corpus_NN1 annotation_NN1 is_VBZ the_AT practice_NN1 of_IO adding_VVG interpretative_JJ linguistic_JJ information_NN1 to_II a_AT1 corpus_NN1 ._.
http://www.comp.lancs.ac.uk/ucrel/claws/trial.html
POS Tagging: POSTagger
Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN adding/VBG interpretative/JJ linguistic/JJ information/NN to/TO a/DT corpus/NN ./.
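The tagged outputs on these slides attach the tag inline, either as word_TAG (CLAWS) or as word/TAG (the POSTagger output, which appears to use Penn Treebank-style tags). A minimal sketch of reading either form into (word, tag) pairs is given below. The function name `parse_tagged` is hypothetical, and the sample strings are the first six tokens of the slide examples.

```python
from collections import Counter

def parse_tagged(text, sep):
    """Split 'word<sep>TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so '._.' gives ('.', '.').
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs

c5_text = "Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF"
slash_text = "Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN"

c5_pairs = parse_tagged(c5_text, "_")
slash_pairs = parse_tagged(slash_text, "/")

# Same words, different tag inventories side by side.
for (word, c5_tag), (_, slash_tag) in zip(c5_pairs, slash_pairs):
    print(f"{word:<12}{c5_tag:<6}{slash_tag}")

# Once the tags are separated from the words they can be counted directly.
c5_counts = Counter(tag for _, tag in c5_pairs)
print(c5_counts["NN1"])
```

Splitting on the last separator is a deliberate choice: it keeps words that themselves contain the separator character intact.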
Parsing: Chunking
[NP (NN Corpus) (NN annotation) ]
(VBZ is)
[NP (DT the) (NN practice) ]
(IN of)
(VBG adding)
[NP (JJ interpretative) (JJ linguistic) (NN information) ]
[PP (TO to) [NP (DT a) (NN corpus) ] ]
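A shallow chunker of this kind can be approximated with a simple rule over the tag sequence. The sketch below groups tokens into noun-phrase chunks of the shape "optional determiner, any adjectives, one or more nouns". The rule and the function names are assumptions for illustration only; unlike the slide, it does not build the PP chunk.

```python
def chunk_noun_phrases(pairs):
    """Group (word, tag) pairs into NP chunks of the shape DT? JJ* NN+."""
    chunks = []
    i = 0
    while i < len(pairs):
        j = i
        if pairs[j][1] == "DT":                      # optional determiner
            j += 1
        while j < len(pairs) and pairs[j][1] == "JJ":    # any adjectives
            j += 1
        noun_start = j
        while j < len(pairs) and pairs[j][1].startswith("NN"):  # nouns
            j += 1
        if j > noun_start:
            chunks.append(("NP", pairs[i:j]))
            i = j
        else:
            # No noun found: emit the current token unchunked and move on.
            chunks.append((None, [pairs[i]]))
            i += 1
    return chunks

def render(chunks):
    """Print chunks in the bracketed style used on the slide."""
    parts = []
    for label, members in chunks:
        inner = " ".join(f"({tag} {word})" for word, tag in members)
        parts.append(f"[{label} {inner} ]" if label else inner)
    return " ".join(parts)

tagged = ("Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN "
          "adding/VBG interpretative/JJ linguistic/JJ information/NN "
          "to/TO a/DT corpus/NN ./.")
pairs = [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]
chunks = chunk_noun_phrases(pairs)
noun_phrases = [" ".join(w for w, _ in members)
                for label, members in chunks if label == "NP"]
print(render(chunks))
```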
Parsing
(S
  (NP Corpus annotation)
  (VP is
    (NP
      (NP the practice)
      (PP of
        (S
          (VP adding
            (NP interpretative linguistic information)
            (PP to (NP a corpus)))))))
  .)
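A bracketed parse like the one above is straightforward to read back into a nested data structure. The following is a minimal sketch of a recursive reader; the representation (a list whose first item is the constituent label) and the helper names are assumptions chosen for the example.

```python
import re

def parse_brackets(text):
    """Parse '(LABEL child child ...)' into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        node = [tokens[pos]]            # constituent label, e.g. 'NP'
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())   # nested constituent
            else:
                node.append(tokens[pos])    # a word (leaf)
                pos += 1
        pos += 1                        # consume ')'
        return node

    return parse_node()

def leaves(node):
    """Collect the words of a tree in left-to-right order."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

def depth(node):
    """Number of nested constituents on the deepest path."""
    child_depths = [depth(c) for c in node[1:] if isinstance(c, list)]
    return 1 + max(child_depths, default=0)

tree_text = """(S (NP Corpus annotation) (VP is (NP (NP the practice)
  (PP of (S (VP adding (NP interpretative linguistic information)
  (PP to (NP a corpus))))))) .)"""
tree = parse_brackets(tree_text)
print(" ".join(leaves(tree)))
print(depth(tree))
```

Reading the leaves back out recovers the original sentence, which is one way to check that the annotation has not altered the text itself.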
Semantic Annotation
• Each word given code from thesaurus-style dictionary
• Also called Word Sense Tagging
• Examples
– UCREL Semantic Analysis System [http://www.comp.lancs.ac.uk/ucrel/usas/]
– WordNet [http://wordnet.princeton.edu/]
Semantic Annotation
• The noun move has 5 senses (first 5 from tagged texts)
• 1. (377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
• 2. (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
• 3. (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
• 4. (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
• 5. (5) move -- ((game) a player's turn to take some action permitted by the rules of the game)
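Reading the bracketed numbers in the listing above as counts of occurrences in sense-tagged texts (an interpretation, suggested by "from tagged texts"), a sense inventory can be held as a small table and queried. The sketch below picks the most frequent sense, a simple baseline for word sense tagging. The glosses are shortened and the function name is hypothetical.

```python
# Counts copied from the listing above: sense number -> (count, short gloss).
noun_move_senses = {
    1: (377, "the act of deciding to do something"),
    2: (70, "the act of changing your residence or place of business"),
    3: (57, "a change of position without a change of location"),
    4: (30, "the act of changing location from one place to another"),
    5: (5, "a player's turn in a game"),
}

def most_frequent_sense(senses):
    """Return the sense number with the highest tagged-text count."""
    return max(senses, key=lambda number: senses[number][0])

total = sum(count for count, _ in noun_move_senses.values())
best = most_frequent_sense(noun_move_senses)
share = noun_move_senses[best][0] / total
print(f"sense {best}: {share:.1%} of {total} tagged occurrences")
```

A tagger that always chose sense 1 for the noun would be right for roughly seven in ten of these tagged occurrences, which is why the remaining senses need contextual evidence.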
Semantic Annotation
• The verb move has 16 senses (first 13 from tagged texts)
• 1. (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
• 2. (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
• 3. (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
• 4. (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
Tools
• XML
• Annotation Editors
– GATE
• WordSmith
The ‘Great Annotation Debate’
• Leech et al. ‘annotation = value added’
• Sinclair ‘annotation = perilous activity’
• Scott ‘beware of the POS prison!’
Sinclair on the perils of corpus annotation
• ‘The interspersing of tags in a language text is a perilous activity, because the text thereby loses integrity…’
‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)
Sinclair on the perils of corpus annotation
• ‘…one cosy consequence of using tagged text is that the description which produces the tags in the first place is not challenged – it is protected. The corpus data can only be observed through the tags; that is to say, anything the tags are not sensitive to will be missed’
‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)
Sinclair on the perils of corpus annotation
• ‘In corpus-driven linguistics you do not use pre-tagged text, but you process the raw text directly and then patterns of this uncontaminated text are able to be observed.’
‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)
Hunston – annotation as ‘double-edged sword’
• ‘…the categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which in turn tends to limit, not the kind of question that can be asked, but the kind of question that usually is asked.’
(Hunston 2002: 93)
Hunston – annotation as ‘double-edged sword’
• ‘Most of the work that is done using annotated corpora uses categories that have been developed in pre-corpus days, such as nominal clauses, anaphoric reference… Phenomena such as frames or semantic prosody… tend to have been identified from plain text corpora and word-based studies.’
(Hunston 2002: 93)
Corpus-based approach
[Diagram: DATA (plain corpus) -> ANALYSIS: categorization (annotate corpus: POS, parsing, semantic, reference) -> annotated corpus -> CORPUS METHODS -> ANALYSIS: generalization -> RESULTS]
Corpus-driven approach
[Diagram: DATA (plain corpus) -> CORPUS METHODS -> ANALYSIS: generalization & categorization -> RESULTS]
Problem for both CB & CD Approach
• Serial/Sequential process
– CB analysis before (annotation) and after processing
– CD analysis only after processing (so no need for annotation)
• Empirical process is cyclic
– analysis feeds back into process and around again… and again…
So what if….
• Hunston - ‘Most of the work that is done using annotated corpora uses categories that have been developed in pre-corpus days….’ (Hunston 2002: 93)
• we annotate categories that have come out of corpus analysis instead of/as well as traditional categories?
New uses for corpus annotation
• Cyclic investigation process
1. KWIC/Frequency list/Collocates etc.
2. Annotate results
3. Goto 1
• How should we annotate:
– collocates
– lexical items
– semantic associations/prosodies
– local textual functions
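The cyclic process above can be sketched in a few lines: produce concordance lines, write the analyst's category back into the text, then run the next pass over the annotated text. Everything here is an assumption for illustration: the function names, the inline word_LABEL convention, the category label DECISION, and the sample sentence (its first half is a WordNet example quoted earlier in these slides, the rest is made up).

```python
def kwic(tokens, node, span=3):
    """Step 1: return (left, node, right) context windows for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def annotate(tokens, node, label):
    """Step 2: write a category that came out of the analysis back inline."""
    return [f"{tok}_{label}" if tok.lower() == node else tok for tok in tokens]

text = "his first move was to hire a lawyer and his next move was to wait"
tokens = text.split()

concordance = kwic(tokens, "move")
for left, node, right in concordance:
    print(" ".join(left).rjust(20), f"[{node}]", " ".join(right))

annotated = annotate(tokens, "move", "DECISION")

# Step 3 ("Goto 1"): the annotated text is the input to the next pass,
# where the new category can itself be searched for.
hits = [tok for tok in annotated if tok.endswith("_DECISION")]
print(hits)
```

The point of the design is that the category is added after looking at the concordance, not fixed in advance, so each pass can refine or replace the labels from the pass before.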
References
Leech, G.
2005 ‘Adding Linguistic Annotation’, in M. Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice (Oxford: Oxbow Books), pp. 17-29 [http://ahds.ac.uk/linguistic-corpora/]
Hunston, S.
2002 Corpora in Applied Linguistics (Cambridge: Cambridge University Press)
McEnery, A.
2003 ‘Corpus Linguistics’, in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics (Oxford: Oxford University Press), pp. 448-463
O’Donnell, M.B.
1999 ‘The Use of Annotated Corpora for New Testament Discourse Analysis: A Survey of Current Practice and Future Prospects’, in S.E. Porter and J.T. Reed (eds.), Discourse Analysis and the New Testament: Results and Applications (Sheffield: Sheffield Academic Press), pp. 71-117
Sinclair, J.
2004 Trust the Text: Language, Corpus and Discourse (London: Routledge)