
HeadTalk, HandTalk and the corpus: towards a framework for multi-modal, multi-media corpus development

Dawn Knight,1 David Evans,1 Ronald Carter1 and Svenja Adolphs1

Abstract

In this paper, we address a number of key methodological challenges and concerns faced by linguists in the development of a new generation of corpora: the multi-modal, multi-media corpus – that which combines video, audio and textual records of naturally occurring discourse. We contextualise these issues according to a research project which is currently developing such a corpus: the ESRC-funded Understanding New Digital Records for e-Social Science (DReSS) project based at the University of Nottingham.2

This paper primarily explores the questions of the functionality of the corpus, identifying the problems we faced in making multi-modal corpora ‘usable’ for further research. We focus on the need for new methods for categorising and marking up multiple streams of data, using, as examples, the coding of head nods and hand gestures. We also consider the challenges faced when integrating and representing the data in a functional corpus tool, to allow for further synthesis and analysis.

Here, we also underline some of the ethical challenges faced in the development of this tool, exploring the issues faced both in the collection of data and in the future distribution of video corpora to the wider research community.

1. Introduction

Language corpora provide the linguist with the apparatus to explore the ‘actual patterns of [language] use’ (Biber et al., 1998: 4), and help to generate empirically based, objective analyses that are ‘focused not simply on providing a formal description of language but on describing the use

1 School of English Studies, University Park, University of Nottingham, Nottingham, NG7 2RD, United Kingdom.

Correspondence to: Dawn Knight, e-mail: [email protected]
2 For further information, results and publications related to the project, please refer to the main DReSS website, at: http://web.mac.com/andy.crabtree/NCeSS_Digital_Records_Node/Welcome.html

DOI: 10.3366/E1749503209000203
Corpora Vol. 4 (1): 1–32


of language as a communicative tool’ (Meyer, 2002: 5). In spite of these strengths, most current corpora have a fundamental limitation in that they represent all features of communication through the same mode – that of a textual record. In the case of written texts this affects the communicative use of extra-textual features such as tables, graphs and images. In the case of spoken corpora it means that significant visual signals transmitted between interlocutors during face-to-face interactions, which support and supplement the verbal content of the communication, are lost. Even when this information is available to transcribers ‘the reflexivity of gesture, movement and setting is difficult to express in a transcript’ (Saferstein, 2004: 213). Thus there are few spoken corpora that provide the linguist with tools to explore the non-verbal, gestural components of interaction in any great detail. Since ‘we speak with our vocal organs, but we converse with our whole body’ (Abercrombie, 1963: 55), it is appropriate to call for a new generation of corpora that will enable the linguist to gain a more comprehensive view of the characteristics of language ‘beyond the text’.

The Understanding Digital Records for e-Social Science (DReSS) project at the University of Nottingham is a three-year e-Social Science research initiative funded by the Economic and Social Research Council (ESRC). As part of this interdisciplinary project, researchers in applied linguistics are working in collaboration with computer scientists to develop a multi-modal corpus of spoken interaction: the Nottingham Multi-Modal Corpus (NMMC). Alongside the development of the corpus itself, new interfaces and tools for representing and interrogating the multi-modal corpus are being explored. The NMMC presents ‘data’ in three different modes: as textual, spoken (audio) and video records of naturally occurring interactions, aligning them within a functional, searchable corpus interface. Access to lexical, prosodic and gestural features of spoken discourse allows for the analysis of their interaction in the creation of real, everyday communication.

This paper reports on some of the key methodological challenges and issues faced in the development of the NMMC. Drawing on examples of head nods and hand gestures, and on some of their communicative functions in selected conversations, we explore how the corpus data can be analysed both quantitatively and qualitatively. We focus in particular on issues of recording, transcribing, coding and marking-up, alongside discussion of applications and the physical representation of the data (see also Knight et al., 2006; Knight, 2006; Lapadat and Lindsay, 1999; Psathas and Anderson, 1990; and Baldry and Thibault, 2001).

2. Gesture-in-talk

Discourse comprises not only the vocalised characteristics of talk and accompanying meaning, but also ‘kinesic behaviour’ (the study of which is known as kinesics – a term coined by Birdwhistell, 1952), more commonly


known as non-verbal behaviour (NVB, see Richmond et al., 1991). This includes, for example, certain hand and arm movements (Beattie and Shovelton, 1999; Rimé and Schiaratura, 1991; and Thompson and Massaro, 1986), gaze (Cerrato and Skhiri, 2003; and Griffin and Bock, 2000), body movement, head nods and facial expressions. Thus, as with some verbal aspects of talk, gestures communicate pragmatic and semantic information – information that is sometimes distinct from the verbal aspects of talk (McNeill, 1979; see also McNeill et al., 1994: 225). Gestures add an extra dimension to discourse contexts, complementing what is said with concurrent, co-expressive signs that can be semantically aligned to abstract or concrete objects and notions expressed in spoken language.

It is important to note that whilst the umbrella terms ‘kinesics’ and ‘NVB’ refer to all forms of bodily movement, the specific focus of this study is only on gestures which occur in communicative contexts (that is, those which exist in spoken discourse). Originally, such communicative gestures were treated under the term ‘expressive movement’ (see Davis, 1979: 54), although these are currently more commonly regarded as being examples of non-verbal communication (NVC).

NVC comprises individual gestures or sequences of more discrete and structured gestural episodes which communicate messages between individuals involved in a conversation. So, ‘if another person interprets the behaviour as a message and attributes meaning to that message’ (Richmond et al., 1991: 7), the behaviour is best defined as NVC rather than NVB (also see McNeill, 1985: 254; Kendon, 1972, 1980, 1997; Schegloff, 1984; McNeill, 1992; McClave, 2000; and Nobe, 1996). Individual gestures and sequences of NVC can adopt a range of functions in discourse and subsequently ‘can do almost as many things’ as spoken language (Streeck, 1993: 297; see also Chawla and Krauss, 1994: 580).

NVC allows speakers to express extra information about abstract concepts, opinions and emotions that may not ‘readily be expressed in words’ (Argyle, 1988: 141; see also Wilcox, 2004: 422). These can be used pragmatically to emphasise or repeat a point that is being made, or semantically to complement it or to contradict it. Aside from semantic and pragmatic functions, Noller (1984: 7) suggests that specific forms of gesticulation exist, primarily, to maintain the ‘relationship or common’ part of a message in NVC. Thus, NVC is a tool used by interlocutors to sustain relationships and to manage and structure discourse.

It is important to note that, since gesticulation is less salient and quantifiable than spoken language (although, as with head nods, certain motions are an exception to this rule), some forms of NVC ‘have invariant meanings, others [only] have a probability of meaning something’ (Argyle, 1988: 6). Consequently, the classification of gesticulation has traditionally focussed on defining the specific meaning function of a gesture, on ‘how gestures communicate’, in accordance with the enaction of kinesic sequences (Bavelas, 1994: 201).


As Argyle notes, gestures are, of course, an example of ‘graded semiosis’, being both discrete and denotative as well as fuzzy and connotative in meaning. Whilst we acknowledge throughout that this is an important aspect of theory – relevant to multi-modal corpus analysis and a particular challenge to corpus linguistic methods, especially in relation to the coding of data – it is not a debate to which we contribute in this paper.

3. Contextualising multi-modal analysis

The most frequently explored gestures are the sequences of spontaneous iconic hand movements made by speakers in conversation (as featured in studies by Ekman and Friesen, 1969; Kendon, 1972, 1980, 1982; Argyle, 1988; McNeill, 1985, 1992; Chawla and Krauss, 1994; and Beattie and Shovelton, 1999). Such studies have been of particular relevance to research into the relationship between gaze and gesture, in particular in classroom interaction. They explore, for example, the impact of listener gaze on the frequency of generation, and the general interactive functions and nature of such movements (see Streeck, 1993, 1994; Griffin, 2004; Kita, 2000, 2003; Olsher, 2004; and Goodwin, 2000).

This research is relevant to the sort of questions that can be investigated using the NMMC, although previous studies have usually dealt with small samples of data, adopting a Conversation Analysis (CA, see Markee, 2000, for related literature) rather than a Corpus Linguistic (CL) based methodological approach to the investigation of the data. Therefore, the development of large-scale collections of similar academic datasets (beyond the NMMC, and including multi-party discussions in a classroom or lecture context) will extend the potential for research into behavioural, gestural and linguistic features of academic and learning discourse (although the utility of the multi-modal corpus approaches outlined here is not restricted to this discourse context alone).

Current multi-modal corpus studies tend to concentrate on the following (taken from Gu, 2006: 132):

(1) Multimodal and multimedia studies of discourse
(2) Speech engineering and corpus annotation

Examples of the first type include work by, amongst others, Martinec (1998, 2001), Scollon (1998), Kress (2001), Kress and van Leeuwen (1996), and Gripsrud (2002). Examples of research working towards the second type include, for example, Allwood et al. (2001), Hill (2000), Leech et al. (1995), Gibbon et al. (1997) and Taylor et al. (2000). Few studies concentrate on both of these concerns, since different expertise is required for each of these strands: whilst the first is usually undertaken by those ‘in the social sciences’ who ‘are interested in human beings’, the second is usually the concern of


computational linguists and computer scientists who are concerned primarily with ‘how to improve human-computer interaction’ (Gu, 2006: 132).

The concern of the NMMC project is to align these two strands closely in order to explore how the second can help to inform the first, with the development of new methods for the construction, annotation and analysis of multi-modal corpus tools. This necessitates an interdisciplinary approach, as is provided in the DReSS project, in which a truly collaborative social science environment allows researchers from different fields to come together in order to generate a new system for language exploration. To approach the compilation of this corpus systematically, a multiple-perspective methodology has been used, focussing on three individual phases:

• RECORD: data sources and collection methods.
• CODE: detect, define, encode head nods.
• (RE)PRESENT: (re)presentation of data within an interactive, user-friendly interface.

4. Recording multi-modal data

During this phase, naturally occurring speech is typically captured using a video/audio recorder in order to be transcribed for use in the corpus. As for the general design of the corpus, it is important to note that it is ‘impossible to present a set of invariant rules about data collection because choices have to be made in light of the investigators’ goals’ (Cameron, 2001: 29; see also O’Connell and Kowal, 1999). The decisions made concerning the recording phase are specific to the research aims, yet it is nevertheless important to ensure, as Reppen and Simpson (2002: 93) highlight, that ‘the text collection process for building a corpus . . . [is] principled’.

In this regard, although basic assumptions of quality control, representativeness and balance that exist for current corpora are relevant to the construction of multi-modal corpora, other conventions for recording need to be adapted and reassessed accordingly. As part of the DReSS project the following principal aims were met when recording data:

(1) To record multiple modes of communication in natural contexts;
(2) To record the individual sequences of body movements of each of the two speakers in an interaction, whilst allowing synchronised videos to be analysed so that co-ordinated movement (i.e., across both speakers) can be examined; and,
(3) To obtain recordings that can be replayed and annotated by other researchers.

It is essential that multi-modal corpus data are both ‘suitable and rich enough in the information required for in-depth linguistic enquiry, and of a high



enough quality’ (Knight and Adolphs, 2006) to be used and re-used in the corpus database. Thus, when recording the data for this study, the participants were seated face-to-face, with the cameras placed face-on and in view of each participant, and with a high-quality microphone placed between them. This allowed for the recording of close-up images of each individual, which were synchronised using Adobe Premiere, so that during analysis both participants can be observed at the same time. In order to ensure the close-up quality of the recordings, the participants were encouraged to remain seated.

The discourse context in which speakers operated was dissertation supervision sessions in a university English department. Although these data are contextually homogeneous, conditions inevitably vary slightly from one set of speakers to another and no attempt was made to produce replicable experimental conditions. This was done in an attempt to ensure that all the data collected were as natural and authentic as possible. Multi-party conversations were not recorded, and whilst monologic lecture data do also form a substantial part of the corpus, the emphasis in this paper is on conversational dyads.

The hardware used for recording includes stationary digital video cameras (DV, allowing the data to be digitised for subsequent MPEG compression), and digital microphones, arranged as shown in Figure 1 (see also Knight and Adolphs, 2006):

Figure 1: Recording set-up and equipment for multi-modal corpus development

It is difficult to promote real, naturalistic talk in research settings (the concept of natural data is itself difficult to define) and, in this case especially, speakers can feel uneasy about being recorded due to the obtrusive nature of video cameras. (It is not ethical, of course, to ‘hide cameras’ without the knowledge of the participants.) Furthermore, as Thompson (2004) points out, a video recording ‘presents a view of the event that in most cases cannot capture the views of the participants in the event themselves’: it provides a partial account of the conversation which should be considered when analyses are made. Despite the fact that there are no researchers or bystanders physically present throughout the recording of our data, the presence of the camera alone may give rise to the ‘observer’s paradox’ (Labov, 1972), potentially affecting the behaviour and speech of the participants.


The fact that all data were collected from an academic setting means that it is important to question the extent to which specific behaviours observed in the data are context- or genre-dependent. Specific sequences of talk or gesture may be finely attuned to the particular conversation, between those particular participants, in that specific academic setting. Behaviours may or may not be transferable beyond such a limited context, and further data samples will require analysis in the future in order to test this. Compounding this problem is the fact that many paralinguistic and gesticulatory characteristics may be attributed to an individual’s behavioural or personality traits, although it is difficult to specify which ones until a more extensive cross-section of data has been examined.

Despite these shortcomings, as Argyle highlights, this does not mean that it is necessary to ‘abandon recordings’ in such a ‘laboratory’-type setting. It is possible to obtain natural responses from participants, especially if the recordings take place in relaxed, familiar settings (Argyle, 1988: 11), such as those used here. The fact that each recording lasts forty-five to sixty minutes also means speakers may become more at ease around the recording equipment, with the hope of promoting talk that is as natural as possible. So although, as with other corpora, the recording of natural language events is necessarily selective and incomplete, the use of video data as part of spoken corpora is likely to make the record of individual interactions more complete.

Apart from the process of recording the actual interaction between the two speakers, it is also important to collect and document further information about the event itself. As with mono-modal corpora, it is important to keep a detailed record of as many aspects of the discourse event as possible. The importance of metadata when it comes to using and sharing corpora means that consideration should be given to a system of storing and extending such information. Burnard (2004) uses the term ‘metadata’ as an umbrella term to include editorial, analytic, descriptive and administrative categories. For the purposes of developing a metadata system for the NMMC, the descriptive and analytic categories were embedded mainly within the user interface which has been developed as part of this project. Editorial and administrative categories are included in header fields for individual files. These include information about the transcription system that has been used, the recording techniques and equipment, information about the speakers and the context of the interaction.
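
By way of illustration only, a minimal sketch of how such editorial and administrative header fields might be stored for an individual recording is given below; the field names and values are hypothetical assumptions for the example and do not reproduce the NMMC’s actual header specification.

```python
# A minimal, hypothetical sketch of editorial/administrative header fields
# for one recording; field names and values are illustrative only and do
# not reproduce the NMMC's actual header specification.
import json

header = {
    "file_id": "supervision_01",          # hypothetical identifier
    "transcription_system": "CANCODE conventions",
    "recording_equipment": ["stationary DV camera x2", "digital microphone"],
    "duration_minutes": 50,
    "context": "dissertation supervision, university English department",
    "speakers": [
        {"id": "S1", "role": "supervisor"},
        {"id": "S2", "role": "student"},
    ],
}

# Serialise the header so it can be stored alongside the video, audio and
# transcript files and re-used by other tools.
with open("supervision_01.header.json", "w") as f:
    json.dump(header, f, indent=2)
```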

5. A note on ethics

It is important to note that traditional approaches to corpus development emphasise the desirability of anonymity when developing records of discourse situations. These are designed to conceal not only the identities of the particular speakers involved (unless explicit consent to the contrary has been given), but also those of third parties who may be mentioned in the


talk without being present. To achieve this, names and other details which may make the identity of a particular participant or referent obvious are modified or completely omitted (see Du Bois, 2004; and Du Bois et al., 1992, 1993). Whilst this is a fairly straightforward procedure for text-based corpora, multi-modal corpora pose a multitude of ethical challenges which need to be considered a high priority for future research in this area.

First, audio files exist as an audio fingerprint of a speaker, making it easy to associate a real-life speaker with their recording. Of course, it would be possible to protect the identity of speakers using actors’ voices. Such a procedure would, however, forfeit the authenticity of the data, compromising the spontaneity and ‘naturalness’ of the talk. Regardless of how accomplished the actor is, it is unlikely that every acoustic or prosodic feature, for example, can be adequately replicated. A similar question is, therefore, logically posed with the use of video data: do we need to think about anonymity in terms of protecting the actual visual image of the participants involved? It is near impossible, and technically impractical, to attempt to anonymise video data while still being able to extract the salient features that are the focus of the analysis. For example, pixellating faces or using shadow representations of heads and bodies can blur distinctions between gestures and language forms, rendering the data unusable for certain lines of linguistic enquiry.

Perhaps the wider question to ask is whether we need to think of anonymity in such a controlled way at all. If participants have provided written permission to be recorded, logically, they are providing consent for their voice or image to be used. So it may be unrealistic to make an effort to conceal these features when looking to create a database of real-life interaction. These are components of the participant’s identity, and with their alteration or omission the data become far from real.

The matter of protecting the identity of third parties remains an ethical quandary, along with the issue of re-using and sharing contextually-sensitive video data. The issues are especially acute when tools are shared or are developed to be web-enabled as part of a multi-modal corpus resource. These issues need to be addressed further in consultation with end users, informants, researchers and ethics advisors.

6. Transcribing data

‘Transcription is an integral process in the qualitative analysis of language data and is widely employed in basic and applied research across a number of disciplines and in professional practice fields’ (Lapadat and Lindsay, 1999: 64). Ochs (1979: 44) proposes that it is at the point of transcription that language technically becomes data – when it becomes a document of a written or graphic format that represents something else. Ochs (1979: 44) emphasises that transcription can best be depicted as a research method, ‘a theoretical process reflecting theoretical goals and definitions’, which


involves the researcher deciding what to record and to transcribe in their data according to the theoretical aims and objectives upheld by the study (a point also emphasised by Lapadat and Lindsay, 1999; Roberts, 2006: 4; Cameron, 2001: 39; and Cook, 1995). To this extent, it is believed that ‘no matter how fine grained, [a transcript] can never be complete’ (Kendon, 1982: 478–9) because the choices made by the transcriber about the format of the transcription, what linguistic information (such as phonetic or prosodic characteristics) is included and what conventions are used, make the transcript to some extent partial. A transcript may be viewed as being ‘both interpretative and constructive’ (Lapadat and Lindsay, 1999: 77; see also O’Connell and Kowal, 1999: 104), providing a window into communicative events from the perspective imposed by the person(s) responsible for the transcription.

To some degree, this problem of partiality is unavoidable since, under current CL approaches, spoken discourse (especially when dealing with large quantities of data) becomes ‘usable’ only through transcription, since this allows for quick and easy searches (using corpus software) and subsequent analysis (see Knight, 2006; and Knight and Adolphs, 2008). This usability would be difficult to sustain if spoken corpora were presented as real-time, raw audio files alone. Early efforts to overcome the partiality of transcription have been made particularly in the area of CA methodologies (see ten Have, 2007; and Markee, 2000).

As with all stages of development, there is no strictly ‘standard’ approach that is used to transcribe talk (Cameron, 2001: 43), as ‘there is little agreement among researchers about standardisation of [transcription] conventions’ (Lapadat and Lindsay, 1999: 65) in current corpus development. Consequently, a ‘need to converge towards a standard, or (to weaken the concept) towards agreed guidelines is becoming a matter of urgency’ (Leech et al., 1995: 5). This is because consistency across corpora would allow the data to be transferable – re-usable across each database – and so may save a large amount of time and resources (assuming that copyright and other restrictions were not in place). Since, at present, ‘re-use is a rare phenomenon’ in language research (Dybkjær and Ole Bernsen, 2004: 2), it is important that future corpus development seeks to provide a utility for re-usability; this will enhance the quantity and, perhaps, quality of corpus data available for linguistic research. In light of this, the basic conventions used for transcribing the multi-modal data used in this study (and the NMMC, see below for more information) have been taken from the CANCODE3 corpus in order to maintain some consistency across the corpora available at the University of Nottingham.

3 CANCODE stands for Cambridge and Nottingham Corpus of Discourse in English. This corpus has been built as part of a collaborative project between the University of Nottingham and Cambridge University Press, with whom sole copyright resides. CANCODE comprises five million words of (mainly casual) conversation recorded in different contexts across the British Isles.


The CANCODE conventions are designed to present conversational data ‘in a way that is faithful to the spontaneity and informality of the talk, but is also easily accessible to readers not familiar with the conversational literature or phonological/prosodic features’ (a general requirement of transcription, outlined by Eggins and Slade, 1997: 1–2), which is a critical methodological requirement for a corpus tool for general use. It is generally accepted that the inclusion of, for example, International Phonetic Alphabet (IPA, see Laver, 1994; and Canepari, 2005) features would make the transcript difficult to read and would be too specific a focus, thus standing in opposition to the fundamental usability aims of the corpus. Since audio files are to be integrated into the corpus, phonetic characteristics can be explored in more detail with software add-ons that complement the corpus.

It is important to note that, over time, it would be beneficial to redevelop this basic approach in order to realise a more integrated system for transcription – one that exists between the different modes, incorporating ‘criteria that show how different resources contextualise each other’ (Baldry and Thibault, 2001: 88). Adding to this, Baldry and Thibault (2001: 90) emphasise that it is only with this integrated system of both transcription and subsequent analysis that ‘we are effecting a transition from MM transcription to MM corpus’.

There exists a wide range of computer software that allows researchers to view, play, transcribe and code up video and audio data, starting to address some of the issues posed by the transcription phase. They allow for accurate and consistent transcription across separate modes of data, which can then be integrated as required. (Such tools include, for example, Transtool, Tractor, TraSA and SyncTool; see Allwood et al., 2001, for more information.)

These tools allow researchers to transcribe data, to annotate transcripts and/or devise coding schemes, or to integrate visually the different modes of data for subsequent analysis. Other tools, such as MultiTool and Transana, integrate these features. MultiTool, for example, allows the researcher to ‘simultaneously display the video and relative orthographic transcription of dialogues so that the operator can easily observe when gestures are produced together with speech’ (Cerrato and Skhiri, 2003: 5). Transana has been used to transcribe the data used within the DReSS project in order to allow the analyst to time-stamp discourse features, and, when specific words and phrases are explored, to enable the video and audio records of such incidents to be aligned. Further reasons for choosing this tool are explored in Brundell and Knight (2005).

That such tools are easy to obtain and use means they can be universally applied and integrated into the multi-modal corpus software that was developed specifically for this purpose. The Digital Replay System4

4 The Digital Replay System software is available to download for free, at: http://www.mrl.nott.ac.uk/research/projects/dress/software/DRS/Home.html. Further information on the system, including publications, user guides and demonstrations, can also be found on this website.


(DRS, Greenhalgh et al., 2007) allows researchers to include their own data and synchronise this with specific encoding or transcription codes as required. The software also allows researchers to edit transcripts and data already included within the corpora to make them more relevant to their own requirements.

7. Coding and mark-up

Coding can be described as a stage which involves ‘the assignment of events to stipulated symbolic categories’ and, as an extension of this process, mark-up constitutes the ‘specific technical meaning, involving the addition of typographic or structural information to a document’ (Bird and Liberman, 2001: 2). The coding of corpora thus encapsulates the stage at which the qualitative data become quantitative, as specific ‘items relevant to the variables being quantified’ are marked up for future analyses (Scholfield, 1995: 46). Whilst a corpus the size of the NMMC, which currently contains 250,000 words, is largely unsuitable for the purposes of quantitative analysis – at least as far as lexical and grammatical categories are concerned – the repetitive nature of individual gestures, such as head nods, means that we have to reassess the value of smaller corpora for meaningful quantitative exploration.

Current corpora are manually or automatically marked up and tagged according to a large range of discourse features, such as information on speakers (demographic), context (extra-linguistic information), part-of-speech (POS – a form of grammatical tagging), prosodic (marking stress in spoken corpora) or phonetic (marking speech sounds) features, or a combination of these (for more information, see McEnery and Xiao, 2004; and Leech, 2004). Annotated corpora are sometimes described as being ‘tagged’, wherein every single token in the given corpus has been assigned a grammatical word-class label. Not all corpora are tagged, coded or marked up; it is possible to have both annotated and un-annotated corpora, but it is easier to search annotated computerised corpora (Knight and Adolphs, 2006).

Early standards for the mark-up of corpora, known as SGML (Standard Generalised Mark-up Language), which has since been succeeded by XML (Extensible Markup Language), were developed in the 1980s when the electronic-corpora ‘revolution’ was just beginning. SGML has traditionally been used for marking up line breaks and paragraph boundaries, and provides basic structure standards for transcription and mark-up. It also defines the standards for the typeface, page layout used and so on, in order to allow such coded corpora to be re-used and transferred across different corpora. SGML has been used as the foundation for current movements within the linguistic discipline towards the development of a corpus encoding standard (CES) to cater for all corpora. CES aims to provide ‘encoding conventions for linguistic corpora designed to be optimally suited for use in language


engineering and to serve as a widely accepted set of encoding standards for corpus-based work’ (Ide, 1998: 1).

Despite these movements towards common conventions, there are no specific required standards for corpus annotation, since not all corpora use CES. Consequently, the actual codification of a corpus (that is, ‘the actual symbolic representations used’ (Leech, 2004)), and the level of detail used in annotating it, is dependent on the purpose and basic aims of the corpus (i.e., they are ‘hypothesis-driven’, see Rayson, 2003: 1). To this extent, ‘there is no purely objective, mechanistic way of deciding what label or labels should be applied to a given linguistic phenomenon’ (Leech, 1997: 2).

The advent of gestural data in MM corpora poses many problems for the coding and mark-up of corpora. This is because, at present, systems such as SGML, CES and even those conventions used in CANCODE do not have provisions for marking up discourse ‘beyond the text’ in any great detail. Codification needs to be readdressed in the construction of multi-modal corpora. In conjunction with this, Gu (2006: 134) suggests that the use of a CL approach on multi-modal texts involves two basic tasks, namely ‘segmenting the MM text into various units as appropriate’ and ‘annotating the units’. These processes are the crux of the coding and mark-up phases of multi-modal corpus development.

Coding schemes that are equipped for both gesture and speech tend only to look at each mode in turn, as Baldry and Thibault (2006: 148) emphasise:

In spite of the important advances made in the past 30 or so years in the development of linguistic corpora and related techniques of analysis, a central and unexamined theoretical problem remains, namely that the methods adapted for collecting and coding texts isolate the linguistic semiotic from the other semiotic modalities with which language interacts. . . . [In] other words, linguistic corpora as so far conceived remain intra-semiotic in orientation. . . . [By] contrast MM corpora are, by definition, inter-semiotic in their analytical procedures and theoretical orientations.

It would be more useful, therefore, to create an alternative manual coding system that is transferable across the different data streams in the multi-modal corpus in order to allow us to connect directly the pragmatic and semantic properties of the two modes and to enable cross-referencing between the two. This would make it easier to search for patterns in concordances of the data and to explore the interplay between language and gesture in the generation of meaning.
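
As a rough illustration only (the field names and codes below are assumptions for the example, not the NMMC’s actual scheme), a cross-modal annotation of this kind might store time-aligned units from each stream so that a concordance hit in one stream can retrieve the co-occurring unit in the other:

```python
# A hypothetical sketch of a cross-modal annotation record: each unit is
# time-stamped so that verbal and gestural codes can be cross-referenced.
# The field names and codes are illustrative, not the NMMC scheme itself.
from dataclasses import dataclass

@dataclass
class Unit:
    stream: str      # "verbal" or "gestural"
    start: float     # seconds from the start of the recording
    end: float
    text: str        # transcript token(s) or gesture label
    code: str        # functional code assigned by the analyst

units = [
    Unit("verbal",   12.4, 12.6, "yeah",     "BC-CONTINUER"),
    Unit("gestural", 12.3, 12.9, "head nod", "NOD-TYPE-1"),
]

def co_occurring(target, units):
    """Return units from the other stream whose time span overlaps the target."""
    return [u for u in units
            if u.stream != target.stream
            and u.start < target.end
            and u.end > target.start]

print(co_occurring(units[0], units))
```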

It is important to note that many schemes do exist to represent the basic semiotic relationship between verbalisations and gesture (early coding schemes of this nature are provided by Efron, 1941; and Ekman and Friesen, 1968, 1969). These mark up the occasions where gestures


co-occur (or not) with speech, and state whether the basic discoursal function of the gestures and speech ‘overlap’, are ‘disjunct’, or if the concurrent verbalisation or gesture is more ‘specific’ than the other sign at a given moment (for more details see Evans et al., 2001: 316). These schemes may be a useful starting point for labelling information in each mode, which can be further supplemented to cater for the semantic properties of individual features.

An extensive coding scheme that deals with defining a range of gestures based on sequences of kinesic movements (that occur during speech) has been drawn up by Frey et al. (1983). Other more detailed kinesic coding schemes exist which attempt to define more explicitly the specific action, size, shape and relative position of movements throughout gesticulation (see Holler and Beattie, 2002, 2003, 2004: 85; McNeill, 1992). However, these schemes are limited in their utility for marking up the specific functionality of such sequences of gesticulation, and in their explicit relationship to spoken discourse.

Other coding schemes that are equipped for dealing with both gesture and speech (a variety of schemes are discussed at length by Church and Goldin-Meadow, 1986; and Bavelas, 1994) tend to be designed primarily to model sign language and facial expressions specifically (as with FACS (Facial Action Coding System), which can also be used for determining mouth movements in speech, as is common in HCI studies, see Ekman and Friesen, 1978). Examples of such coding schemes are HamNoSys (the Hamburg Notation System, see Prillwitz et al., 1989), the MPI (Max Planck Institute) GesturePhone (which transcribes signs as speech), as well as the MPI Movement Phase Coding Scheme, which is designed to code gestures and signs which co-occur with talk (Kita et al., 1997). From these examples, the scheme that is closest to the requirements of MM corpora is the MPI Movement Phase Coding Scheme. This is described as ‘a syntagmatic rule system for movement phases that applies to both co-speech gestures and signs’ (Knudsen et al., 2002). It is a scheme that was developed at the MPI in order to allow for the referencing of video files. Annotations made with this scheme can be conducted using another MPI tool, MediaTagger, and are input into the EUDICO software for further analysis and the representation of data.

It is important to note that current schemes which do classify the verbal and the visual only tend to deal with the typological features of MM talk. An example of this is given by Cerrato (2004: 26; see also the ‘binary coding scheme for iconic gestures’ of Holler and Beattie, 2002), who marks up a range of HH and HCI conversations according, primarily, to whether each unit is a word (marked as W), phrase (marked as P), sentence (marked as S) or gesture (marked as G). Cerrato (2004: 26) also starts to mark up processes of feedback within the conversation, distinguishing situations where feedback is ‘given’ (marked with Giv) from those situations where feedback is elicited (marked with Eli).


Although this scheme fails to incorporate a more detailed semantic mark-up of the verbal and non-verbal elements, such a coding scheme may be of use as the basis for constructing guidelines for the development of a more in-depth and appropriate scheme that looks at other linguistic elements of conversation beyond feedback.

Despite this, it has been widely recognised that it is beneficial to create ‘International Standards for Language Engineering’ (known as the ISLE project, see Dybkjær and Ole Bernsen, 2004: 1). These standards are labelled as NIMMs in the ISLE project: Natural Interaction and MM Annotation Schemes. ISLE is based on the notion that there is a need for a ‘coding scheme of a general purpose’ to be constructed to deal with the ‘cross-level and cross modality coding’ of naturally occurring language data (Dybkjær and Ole Bernsen, 2004: 2–3; also see Wittenburg et al., 2000). Such set standards may, therefore, be of potential use in the development of multi-modal corpora.

Since gesticulations are so complex and variable in nature, it would appear difficult to create a comprehensive scheme for annotating every feature of gesture-in-use. Depending on the focus of the research, gestures may also be seen to have different semantic or discursive functions in the discourse, thus making it difficult to mark up each respective function.

8. Coding the NMMC

Despite practical constraints, we have attempted to encode the NMMC data in a way that will ‘allow for the maximum usability and reusability of encoded texts’ (Ide, 1998: 1). In order to explore the ‘interaction of language and gesture-in-use for the generation of meaning in discourse’, which has been the central aim for linguistic analyses using the NMMC, we initially focussed on classifying movement features and linguistic features independently. Preliminary linguistic analyses and classifications of each stream of data were undertaken, and findings cross-compared in order to determine patterns that may occur both within and across each data stream. This process is outlined below:

• Defining and classifying linguistic (and paralinguistic) behaviour.
• Defining and classifying non-verbal behaviour.
• Combining the (para)linguistic and non-verbal, and highlighting the potential for exploring patterns and relationships between the two.

Head-nod behaviour was initially selected as the focus for these explorations because, although nods are variable in terms of their length, and the complexity or intensity of the nod movement (the difference between the peak and trough), they exist as one of the most salient forms of gesticulation. The most complex types of head nods are structurally similar to the least complex nods.


This structural similarity means that head nods are ideal for use as a basis for developing and training our approach to multi-modal CL analysis, since they are more manageable than other forms of gesticulation.

Furthermore, certain forms of head nods have a direct relationship with the linguistic phenomenon of backchannelling behaviour. The questions we are aiming to ask of head-nod behaviour in our ongoing research are:

• To what extent do head nods relate to language forms used to signal active listenership (i.e., backchannelling behaviour, see O’Keeffe and Adolphs, 2008)?
• How do head nods relate to functions of backchannels? This involves comparing the discourse functions of specific occurrences of backchannel behaviour to determine whether patterns of form and function exist:
  (a) Within each individual stream of data
  (b) Across the verbal and non-verbal data streams

The mark-up of verbal realisations of active listenership is relatively straightforward. Once the conversation is transcribed we can identify those verbalisations and minimal responses which are likely candidates and study them within the co-text and context of the discourse. Using a concordancer, specific forms of active listenership, or backchannels (Yngve, 1970), can be automatically located with relative ease, although the process of codification has to be carried out manually. Once these forms have been extracted they can be coded according to their discourse function (see O’Keeffe and Adolphs, 2008, for possible coding schemes for backchannel items).
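
A minimal sketch of this kind of automatic location step is given below; the candidate forms and the transcript format are illustrative assumptions rather than the NMMC’s actual conventions, and the functional coding of each hit would still be carried out by hand.

```python
import re

# Hypothetical candidate backchannel forms; a real study would take these
# from the literature (e.g., O'Keeffe and Adolphs, 2008) and from the data.
CANDIDATES = ["yeah", "mm", "mmm", "right", "okay", "uh huh"]

# Illustrative, made-up transcript turns: (speaker, utterance).
transcript = [
    ("S1", "so I restructured the second chapter around the pilot study"),
    ("S2", "mm"),
    ("S1", "and moved the methodology forward"),
    ("S2", "yeah right"),
]

pattern = re.compile(r"\b(" + "|".join(re.escape(c) for c in CANDIDATES) + r")\b")

# Locate candidate backchannels; each hit is then coded manually for its
# discourse function within its co-text.
for turn_no, (speaker, text) in enumerate(transcript):
    for match in pattern.finditer(text):
        print(turn_no, speaker, match.group(1))
```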

The next step in the process outlined above is the coding of non-verbal behaviour. Non-verbal, gestural signals are less discrete than verbal forms: they are also more difficult to detect and extract from data and to define in terms of their functionality. Existing coding schemes for the mark-up and labelling of gestures look to code either the kinesic properties of a particular movement or the functional properties of an already identified movement. In order to establish the different types of head nods, we need to develop a coding scheme that is relevant to our research question and appropriate to our data source. This requires the manual identification and isolation of all nods of the head, by extracting all instances where the basic up-and-down motion sequence of the head occurs (in any order) and affixing boundaries to the gesture in the relevant places.

As this procedure relies on the judgments of human coders, it is not error-free. How these processes can be refined with the use of computer vision technologies is currently being explored. Yet it is important to note that communication itself is not always error-free. A speaker’s intent (conscious or non-conscious) to communicate with a verbal message and/or its gestural counterpart does not always equate to successful comprehension by the listener. There are also situations in discourse


when intentional or unintentional ‘mismatches’ occur between the semantic information expressed through the use of gestures and concurrent spoken messages; these have a direct effect on the message communicated (Goldin-Meadow, 1999: 426; also explored in McNeill et al., 1994: 231; and Church and Goldin-Meadow, 1986).

Brown (1986) explores this idea further in the examination of dyadic discourse situations where intentional mismatches of emotions and opinions are expressed using language and gesticulation. Brown (1986: 498) notes that in such instances of mismatching, observers actually perceive the gesticulations used to ‘be a more reliable source [of semantic information] than the verbal content’, since it is deemed that gesticulations are less consciously controlled by the speaker, which makes them seem to be a more accurate expression of what is really meant by the speaker. Whilst it is difficult to create error-free prescriptions for manual coding, it is similarly difficult to guarantee that accurate interpretations of behaviours are made in everyday talk. This problem is compounded when exploring other elements of paralanguage, because certain idiosyncratic behaviours can be interpreted in many different ways. Indeed, the ways in which these can be detected, encoded and observed are questions that should be addressed in future studies using multi-modal data.

When developing a gesture coding scheme for head nods, a number of key issues emerge that are largely related to the significance we assign to different properties of the gesture. For example, when viewing the data we decided that the duration of a head nod was significant in terms of its function in the ongoing discourse. Yet, when coding head nods, further decisions have to be made as to whether a number of small head nods in short succession should count as part of the same nod, or whether they should be coded as several small nods. These kinds of issues become crucial in the development of categories for possible coding schemes, and they highlight the need for an integrated approach to the study of human interaction.
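
One way of making such a decision operational – offered purely as an illustrative sketch, with an assumed gap threshold rather than a value taken from the NMMC scheme – is to merge annotated nods whose boundaries fall within a short interval of one another:

```python
# Illustrative sketch: merge individually annotated nods into one nod
# whenever the gap between them is below an assumed threshold (0.3 s here;
# the threshold is an assumption, not a value from the NMMC coding scheme).
def merge_nods(nods, max_gap=0.3):
    """nods: list of (start, end) tuples in seconds, sorted by start time."""
    merged = []
    for start, end in nods:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)   # extend the previous nod
        else:
            merged.append((start, end))
    return merged

print(merge_nods([(1.0, 1.2), (1.3, 1.5), (3.0, 3.2)]))
# -> [(1.0, 1.5), (3.0, 3.2)]
```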

We would argue that it is both necessary and desirable for some of the decisions about the gestural coding scheme to be informed, at least in part, by a consideration of the communicative event in its entirety, including both verbal and visual components. It is important to acknowledge that this procedure introduces an analytical bias into the process of developing a coding scheme which may be avoided if we base our codes solely on the mechanical information of an automated tracker. Both approaches have to be evaluated alongside each other in order to determine which one is more relevant and useful for linguistic analysis.

Following a close examination of the data in order to develop a gestural coding scheme for head nods, we decided on a broad initial set of five different types:

Type 1: Small nods with a short duration.
Type 2: Small multiple nods with a longer duration than Type 1.
Type 3: Intense nods with a short duration.


Type 4: Intense and multiple nods with a longer duration than Type 3.
Type 5: Multiple nods that are a combination of Types 1 and 2, with a longer duration than Types 1 and 3.

The intensity of nods is defined in terms of the amplitude of the head movement – the physical size of the movement in the head-up or head-down motion. Therefore, nods which appear to exact a more physically extreme motion are likely to be classified as Types 3 or 4 (or 5), whilst nods with slight movement in the up or down direction are classified as Types 1 or 2 (or 5).
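
To show how the five types could be operationalised in practice, the following is a hedged sketch only: the numeric thresholds, the per-movement amplitude representation and the exact mapping of combinations to Type 5 are assumptions made for the example, not the decisions taken in the NMMC coding scheme.

```python
# Illustrative sketch only: one possible way of operationalising the five
# nod types from duration, per-movement amplitude and the number of
# component movements. Thresholds (0.5 s; 0.2 as a normalised amplitude)
# and the mapping itself are assumptions, not the NMMC scheme.
def classify_nod(duration_s, amplitudes,
                 short_duration=0.5, intense_amplitude=0.2):
    """amplitudes: one (normalised) amplitude per component up-down movement."""
    intense = [a >= intense_amplitude for a in amplitudes]
    multiple = len(amplitudes) > 1

    if multiple and any(intense) and not all(intense):
        return "Type 5"                       # mixed small/intense multiple nods
    if all(intense):
        return "Type 4" if multiple else "Type 3"
    return "Type 2" if multiple and duration_s > short_duration else "Type 1"

print(classify_nod(0.4, [0.1]))               # -> Type 1
print(classify_nod(1.2, [0.3, 0.25, 0.28]))   # -> Type 4
```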

It is important to note that more detailed information on the specific up-and-down movements of head nods (and other gestures) may be generated using, for example, eye- and head-tracking software. An example of such a system is provided by Kapoor and Picard (2001), who present a system for the automatic sensing, extraction and analysis of head-nod patterns, based on videos of human-to-computer interaction (HCI). Participants are presented with various stimuli on screen and are expected to respond to them as ‘naturally’ as possible. (Similar systems monitor eye movements and basic facial expressions: see Griffin and Bock, 2000; Kawato and Ohya, 2000; and Tian et al., 2000.) The positions of pupils in two consecutive frames of the video are compared and analysed statistically to classify head movements (see Avilés-Arriaga and Sucar, 2002: 244).
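
In outline, and only as a hedged sketch of the general frame-differencing idea described above (not of Kapoor and Picard’s actual system), such detection amounts to looking for alternating up-and-down displacement of a tracked facial point across consecutive frames; the amplitude threshold is an assumption for illustration.

```python
# Sketch of the frame-differencing idea: given the vertical position of a
# tracked facial point (e.g., a pupil) in each frame, flag a candidate nod
# when the direction of vertical movement reverses with sufficient
# amplitude. The threshold is an assumption for illustration.
def detect_nod(y_positions, min_amplitude=3.0):
    """y_positions: vertical pixel coordinate of the tracked point per frame."""
    deltas = [b - a for a, b in zip(y_positions, y_positions[1:])]
    direction_changes = 0
    for prev, curr in zip(deltas, deltas[1:]):
        if prev * curr < 0 and abs(prev) + abs(curr) >= min_amplitude:
            direction_changes += 1
    # at least one clear down-up (or up-down) reversal counts as a nod here
    return direction_changes >= 1

print(detect_nod([100, 104, 107, 103, 99, 102]))  # -> True
```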

Cerrato and Skhiri (2003) present an adapted version of this system, using a video camera and infrared detectors in a one-to-one human conversation context rather than in an HCI context. Here, head movements are detected and monitored using thirteen sticky markers, positioned on the body of the participant who is speaking, which reflect infrared light from the eye trackers to detect movements in the eyes (Cerrato and Skhiri, 2003: 2; also see similar studies conducted by Hadar, 1997: 419; and Hällgren and Lyberg, 1998). The shortcoming of this method is that the use of the body markers is problematic, since their presence on the body of the participant can be regarded as invasive, and this may affect the performance of the participant in the conversation.

The next step is to combine the verbal and the visual coding schemes in our analysis of the different functions of verbal backchannels and head nods. A key question when applying both coding schemes is whether a multi-modal approach will allow us to gain a better understanding of the relationship between the different modes. Whether or not the verbal replaces or supports the gestural stream in terms of discourse function requires a close examination of a relatively large data set, as well as further iterations of the development of categories of the coding scheme. In a further step, it will then be possible to develop an integrated coding scheme which includes both verbal and gestural properties, so that the discourse function we may assign to a particular incident of active listenership will be a combination of the type of head nod and the verbalisation (or lack of it) that co-occurs with the nod.


This concern for mapping verbal discourse-function codes onto non-verbal gesture-in-talk is a specific focus of the NMMC project for the future.
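
To illustrate what such an integrated code could look like – a hedged sketch only, in which the labels and the temporal-overlap rule are assumptions rather than the project’s eventual scheme – a combined label might simply pair the nod type with whatever verbal backchannel, if any, overlaps it in time:

```python
# Hypothetical sketch of an integrated active-listenership code that pairs
# a nod type with any temporally overlapping verbal backchannel. The codes
# and record structure are illustrative assumptions.
def integrated_code(nod, backchannels):
    """nod: dict with 'start', 'end', 'type'; backchannels: list of dicts
    with 'start', 'end', 'form'."""
    overlapping = [b["form"] for b in backchannels
                   if b["start"] < nod["end"] and b["end"] > nod["start"]]
    verbal = "+".join(overlapping) if overlapping else "NO-VERBAL"
    return f"{nod['type']}/{verbal}"

nod = {"start": 12.3, "end": 12.9, "type": "Type 1"}
backchannels = [{"start": 12.4, "end": 12.6, "form": "yeah"}]
print(integrated_code(nod, backchannels))   # -> "Type 1/yeah"
```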

9. From HeadTalk to HandTalk

Having looked at head nods, the DReSS project also considered the role of hand/arm gestures in face-to-face spoken communication. The problems associated with identifying and classifying head nods are multiplied when considering hand gestures, even when gesture is only considered in the form of hand and arm movement. A head nod can be said to exist due to movement along one axis. Hands and arms can move much more freely, both in conjunction with and independently of each other. Hands also perform several other practical functions during talk, including scratching, adjusting clothing and hair, writing, reaching for and moving objects, etc. Unlike head nods, hand gestures are already the subject of a substantial body of research into how they support and supplement spoken utterances. Much of this existing work on gesture (see McNeill, 1992) has attempted to represent the full scope of this complexity by employing teams of transcribers who manually code large amounts of video data and then cross-check one another’s work.

We focus initially on tackling features of iconic hand gestures, since such features are widely explored in linguistic, psychological and behavioural research. In line with this, we are currently working towards investigating the following:

• Are there specific gesture sequences that are associated with discourse markers that function to manage the talk (e.g., ‘right’, ‘so’, ‘anyway’), as opposed to more interpersonal ones (e.g., ‘in a sense’, ‘I guess’)?
• Are any relationships seen to exist between specific hand movement types (according to pragmatic categories) and various forms of discourse markers that either manage the conversation, or function in a more interpersonal way?

Researchers start out by identifying everything they consider to be a gesture (delivered by the speaker they are observing) and then adding these to the orthographic transcript, using square brackets around the words where a gesture takes place and putting the words where the stroke of the gesture occurs in bold type. The gesture is then located in space as illustrated here:

From this point the gestures can be coded according to classification systems for gesture type, form and meaning.

For this project it was decided to take a more ‘bottom-up’ approach and to work with a very simple set of gestures to which additional layers of complexity could be added at a later stage. To simplify the types and forms



of the gestures for the study, we decided to focus on the movement of arms only and not to look at hand shape. We also wanted to adapt the existing gesture region model shown in Figure 2. The preliminary examinations of the data from academic supervisions we have collected suggest that simplifying this model by dividing the area in front of the speaker where the gestures are ‘played out’ along vertical axes would generate interesting results, as illustrated in Figure 3.

Figure 2: Division of gesture space (regions and labels are all based on McNeill, 1992: 378)

Figure 3 also illustrates the way in which key linguistic data clusters are then subjected to mark-up by means of video tracking based on computer vision technologies and to further analysis which, manually and in semi-automated fashion, marries the verbal and the visual components of the video data. (Note that the blue circles indicate programmed tracking of movements.) The relevant tracking program, Cvision, is an interactive program which allows users to apply the visual tracking algorithm developed within the HeadTalk project to selected targets in an input video clip.5
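
The Cvision algorithm itself is not reproduced here, but the general idea of tracking a selected target through a clip and recording a time-stamped trajectory can be illustrated with a naive template-matching loop (using the OpenCV library); this is only a sketch under the assumption that the target remains visually similar from frame to frame, not the tracking method used in the project.

```python
# Naive illustration of tracking a user-selected target through a video clip.
# This is NOT the HeadTalk/Cvision algorithm; it simply re-locates a small
# template image in each frame and records a time-stamped (x, y) trajectory.
import cv2

def track_target(video_path, template):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back to 25 fps if unknown
    tmpl = cv2.cvtColor(template, cv2.COLOR_BGR2GRAY)
    trajectory = []                                  # (time in seconds, x, y) of best match
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        scores = cv2.matchTemplate(gray, tmpl, cv2.TM_CCOEFF_NORMED)
        _, _, _, (x, y) = cv2.minMaxLoc(scores)      # location of the best match in this frame
        trajectory.append((frame_no / fps, x, y))
        frame_no += 1
    cap.release()
    return trajectory
```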

We chose the vertical axes because, during supervisions, the speakers spend a good deal of time comparing and contrasting different ideas.

5 For further explanation of this technology, see: http://www.ncess.ac.uk/research/sgp/headtalk/


Figure 3: Computer image tracking applied to video

These ideas often appear to exist in metaphorical compartments in front of the speaker. When these ideas are compared or contrasted the speaker will often move one or both hands along a horizontal axis to support the verbal element of the communication. Dividing the horizontal plane with vertical lines allows us to track this movement and to link it to the talk. A four-point coding scheme was constructed, as follows (an illustrative sketch of deriving these codes from tracking output is given after the list):

(1) Left hand moves to the left.
(2) Left hand moves to the right.
(3) Right hand moves to the left.
(4) Right hand moves to the right.
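
As noted above, a minimal sketch of how such codes might be derived automatically from tracking output is given below; the input format (a time-ordered list of x-coordinates for one hand) and the displacement threshold are assumptions made for illustration.

```python
# Illustrative sketch: derive the four-point movement codes from tracked hand
# positions. Input: time-ordered x-coordinates (pixels) for one hand, where x
# increases towards the right of the frame. The 20-pixel threshold is arbitrary.

CODES = {
    ("left", "left"): 1,    # (1) left hand moves to the left
    ("left", "right"): 2,   # (2) left hand moves to the right
    ("right", "left"): 3,   # (3) right hand moves to the left
    ("right", "right"): 4,  # (4) right hand moves to the right
}

def movement_codes(hand, xs, threshold=20):
    """Return the codes triggered by successive x-displacements of one hand
    ('left' or 'right'); small displacements below the threshold are ignored."""
    codes = []
    for previous, current in zip(xs, xs[1:]):
        if current - previous <= -threshold:
            codes.append(CODES[(hand, "left")])
        elif current - previous >= threshold:
            codes.append(CODES[(hand, "right")])
    return codes

print(movement_codes("right", [310, 312, 340, 385, 380]))  # -> [4, 4]
```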

This initial coding scheme focusses purely on the kinesic properties of the arm movements. Additional or more complex schemes will allow us to make further remarks about the way in which the gesture supports and/or supplements the verbal part of the speaker's message. To this extent, gestures have initially been classified in a top-down fashion, with the analyst working to define, first, the specific form of gesture, before proceeding to establish the linguistic function (pragmatic category) of such. Using this four-point coding scheme the shape and direction of hand movements, and whether one or both of the hands are moving at any one point, are determined.


These movements are then classified using the typological descriptions of gestures that are available in gesture research.

The aim is to establish whether the movement features of the gesture best attribute it to being Iconic, Metaphoric, Beat-like, Cohesive or Deictic in nature (see McNeill, 1985, 1992; similar paradigms are seen in McNeill et al., 1994: 224; Richmond et al., 1991: 57; and Kendon, 1994). In cases where gestures co-occur specifically with speaker verbalisations (rather than with recipient gesticulations), we are working with a separate classification system in a bottom-up manner, exploring first the discursive function of co-occurring text before looking in more detail at specific tokens and phrases (which are separately encoded; see Knight and Adolphs, 2008, for details). As a final measure, this information is combined in order to explore more closely those specific words or phrases that are likely to co-occur with specific gestures throughout the gesture phase (for complete gesture sequence, see Kendon, 1987: 77) and at the stroke (the most visible or emphatic part of the gesture, see McNeill, 1992: 12).
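
This combining step can be pictured as an interval-overlap query between time-stamped words and gesture annotations. The sketch below is our own minimal illustration and assumes that both streams carry start and end times in seconds; the example timings are invented.

```python
# Illustrative sketch: find words whose timing overlaps a gesture phase,
# and flag those that also co-occur with the stroke. All timings are in
# seconds and the example data are invented.

def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def words_in_gesture(words, gesture):
    """words: list of (token, start, end); gesture: dict with phase and stroke times."""
    hits = []
    for token, start, end in words:
        if overlaps(start, end, gesture["phase_start"], gesture["phase_end"]):
            at_stroke = overlaps(start, end, gesture["stroke_start"], gesture["stroke_end"])
            hits.append((token, at_stroke))
    return hits

words = [("on", 73.1, 73.3), ("the", 73.3, 73.4), ("one", 73.4, 73.7), ("hand", 73.7, 74.1)]
gesture = {"phase_start": 73.2, "phase_end": 74.8, "stroke_start": 73.6, "stroke_end": 74.0}
print(words_in_gesture(words, gesture))
# [('on', False), ('the', False), ('one', True), ('hand', True)]
```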

10. Applications and representation of data

The final concern of the corpus development is how the multiple streams of coded data are physically represented in a re-usable operational interface format. With a multi-media corpus tool it is difficult to exhibit all features of the talk simultaneously. If all characteristics of specific instances where a token, phrase or coded gesture (in the video) occurs in talk are displayed, the corpus software tool would have to involve multiple windows of data including concordance viewers, text viewers, audio and video windows. This may make the corpus software impractical to use, since it is difficult for a human analyst to 'read' any large quantity of multiple tracks of such data simultaneously (i.e., we are unable to watch closely more than one video at once), as current corpora allow with text.

Following Carter and Adolphs (2005) and Adolphs and Carter (2007), we have explored two different potential representation formats. The first one represents the different modes of data side by side within the same interface, as shown in Figure 4.

In this example, a basic layout of seven concordance lines is used. Although the audio and video data would be streamed next to the text, they will only be active when the researcher selects the specific audio or video file. This method minimises the potential problems with computer capabilities, whilst maintaining the usability of an interface that can be used easily to cross-reference data across multiple concordances and the different modes.

The process of selecting a section of text or a search token, and retrieving the exact point in the data at which it occurs, is not, in itself, straightforward.


Figure 4: Proposal for the 'paradigmatic representation' of data in multi-modal corpora

Figure 5: Proposals for the 'syntagmatic representation' of data in multi-modal corpora

The fact that gestures are not discrete units, in the way that words and utterances are, means that it is difficult to align the different modes exactly according to the time at which different actions or words occur.

Another way to present the data is to align entire transcripts with streams of audio and visual data, as illustrated in Figure 5, to allow for a more syntagmatic representation of the data (taken from Carter and Adolphs, 2005).

The syntagmatic representation is similar to current corpora in that it presents data in a 'textured' way, with windows of specific and relevant information being integrated and layered behind main frames that display the key search features, in a similar way to textual concordances. It is this second method of representation that has been adopted for the DRS software. DRS allows us to encode MM datasets and, by using a novel concordancing facility, enables searches of data to be undertaken. The DRS concordancer is depicted in Figure 6 (see French et al., 2006; this tool is free to download6).

6 At: http://www.mrl.nott.ac.uk/research/projects/dress/software/DRS/Home.html


Figure 6: A screenshot of the concordance tool in use within the DRS software interface

In Figure 6, the second concordance line of the search term 'yeah' has been selected (shown on the right-hand side of the DRS interface); the corresponding video clip in which this utterance is spoken is displayed to the left-hand side of the interface, and the utterance can be played on the audio track (shown at the bottom of the screen).

Using this syntagmatic concordance viewer, the analyst can search a large database of multi-modal data utilising specific types, phrases, patterns of language or gesture codes as a 'search term'. Once the results are presented in a concordance view, the analyst may jump directly to the temporal location of each occurrence within the associated video or audio clip, as well as to the full transcript.
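
Stripped of the DRS interface, the basic mechanics of such a search can be sketched as a query over time-aligned transcript segments that returns both a concordance line and the time offset at which to cue the video or audio; the segment format and example data below are assumptions for illustration, not the DRS file format.

```python
# Illustrative sketch of a multi-modal concordance search. Each transcript
# segment carries a start time (seconds) so a hit can cue the video/audio.
# The segment format and example data are invented.

def concordance(segments, search_term, context=30):
    """segments: list of (start_time, speaker, text). Returns
    (start_time, speaker, keyword-in-context line) for every hit."""
    hits = []
    term = search_term.lower()
    for start_time, speaker, text in segments:
        lowered = text.lower()
        pos = lowered.find(term)
        while pos != -1:
            left = text[max(0, pos - context):pos]
            right = text[pos + len(term):pos + len(term) + context]
            hits.append((start_time, speaker, f"{left}[{search_term}]{right}"))
            pos = lowered.find(term, pos + len(term))
    return hits

segments = [
    (12.4, "S2", "yeah yeah I see what you mean"),
    (48.9, "S1", "so the argument in chapter two"),
]
for time_offset, speaker, line in concordance(segments, "yeah"):
    print(f"{time_offset:7.1f}s {speaker}: {line}")  # seek the video player to time_offset
```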

It is important to note that other software tools which support the annotation, visualisation and analysis of MM data (in a similar way to DRS) do already exist. As already seen, Transana supports the transcription of multi-modal data, whilst The Observer7 and I-Observe (Badre et al., 1995: 101–13) present interfaces for coding and time-stamping.

7 See: http://www.noldus.com


Other tools include Mangold International's INTERACT8 and the Diver Project, developed at Stanford University (Pea et al., 2004: 54–61), both of which support the process of coding and annotation of videos respectively, and provide some visualisation tools to support analysis of the coded data.

More comprehensive tools include ANVIL, developed in 2001 by Michael Kipp at the University of the Saarland, which was designed as a video annotation tool specifically for the purpose of analysing multimodal corpora (Kipp, 2001: 1367–70), and also ELAN, developed at the MPI (Brugman and Russel, 2004: 2065–8). Both of these tools provide a platform for the annotation of video data, primarily in the field of linguistic research. They support the annotation of data using a series of different tracks (known as 'tiers' in ELAN), to allow the analyst to make several simultaneous annotations, which can be applied to a single piece of media. In light of previous discussions, the integration of such facilities proves to be one of the key concerns in multi-modal corpus development and use.
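
The tier-based organisation used by ANVIL and ELAN can be pictured as several independent, time-stamped annotation tracks laid over a single media file. The sketch below is a generic illustration of that data model only; the class, tier names and example annotations are our own and do not reproduce either tool's actual format.

```python
# Illustrative sketch of tier-based annotation: several independent tracks of
# time-stamped labels applied to a single media file. Tier names and example
# annotations are invented.
from collections import defaultdict

class AnnotatedMedia:
    def __init__(self, media_file):
        self.media_file = media_file
        self.tiers = defaultdict(list)   # tier name -> list of (start, end, label)

    def annotate(self, tier, start, end, label):
        self.tiers[tier].append((start, end, label))

    def at(self, time_point):
        """All annotations, on every tier, that span the given time point."""
        return {tier: [label for start, end, label in items if start <= time_point <= end]
                for tier, items in self.tiers.items()}

session = AnnotatedMedia("supervision_01.mpg")
session.annotate("transcript", 12.4, 14.0, "yeah yeah I see what you mean")
session.annotate("head_nod", 12.5, 13.1, "multiple")
session.annotate("hand_gesture", 12.6, 13.4, "code 4")
print(session.at(12.8))   # annotations from all three tiers active at 12.8 seconds
```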

Unlike any of these others, DRS has the innovative concordancing tool integrated directly into the environment. This allows users to search not only for text-based records of language-in-use, but also for the existence of gesture (by searching specific gesture codes) within and across the individual records of academic supervisory sessions; and it provides an account of concurrent language at the point of the gesticulation. Thus we are presented with a better utility for the exploration of a range of lexical, prosodic and gestural features of conversation, and for investigations of how such features interact in real, everyday speech. This novel concordancing application makes the DRS more finely attuned to the needs of corpus linguists than its contemporaries.

11. Summary

This paper has reported on some first steps in the processes of developing a multi-modal corpus and corpus analysis tool, focussing on a wide range of issues and methodological considerations that need to be addressed from the recording of data through to its representation and use as part of a multi-modal corpus. It also provides an exploration of the problems and shortcomings of the research processes, and of the techniques used at each phase of development, as well as a commentary on how these findings may provide a useful research rubric and design platform for multi-modal corpus developments in the future.

8 See: http://www.mangold-international.com


Acknowledgements

The research on which this article is based is funded by the UK Economic and Social Research Council (ESRC), e-Social Science Research Node DReSS,9 and the ESRC e-Social Science small grants project HeadTalk (grant number: RES-149–25–1016).

References

Abercrombie, D. 1963. 'Conversation and spoken prose', English Language Teaching XVIII (1), pp. 10–16.

Adolphs, S. and R. Carter. 2007. 'Beyond the word: new challenges in analysing corpora of spoken English', European Journal of English Studies 11 (2), pp. 133–46.

Allwood, J., L. Grönqvist, E. Ahlsén and M. Gunnarssan. 2001. 'Annotations and tools for an activity based spoken language corpus', Proceedings of the Second SIGdial Workshop on Discourse and Dialogue 16, pp. 1–10. Morristown, New Jersey: Association for Computational Linguistics.

Argyle, M. 1988. Bodily Communication. (Second edition.) London: Methuen.

Avilés-Arriaga, H.H. and L.E. Sucar. 2002. 'Dynamic Bayesian networks for visual recognition of dynamic gestures', Journal of Intelligent and Fuzzy Systems 12 (3–4), pp. 243–50.

Badre, A., M. Guzdial, S. Hudson and P. Santos. 1995. 'A user interface evaluation using synchronized video, visualizations and event trace data', Software Quality Journal 4 (2), pp. 101–13.

Baldry, A. and P.J. Thibault. 2001. 'Towards multimodal corpora' in G. Aston and L. Burnard (eds) Corpora in the Description and Teaching of English: Papers from the Fifth ESSE Conference, pp. 87–102. Bologna: Cooperativa Libraria Universitaria Editrice Bologna.

Baldry, A. and P.J. Thibault. 2006. Multimodal Transcription and Text Analysis: A Multimedia Toolkit and Coursebook. London: Equinox.

Bavelas, J.B. 1994. 'Gestures as part of speech: methodological implications', Research on Language and Social Interaction 27 (3), pp. 201–21.

Beattie, G. and H. Shovelton. 1999. 'Mapping the range of information contained in the iconic hand gestures that accompany speech', Journal of Language and Social Psychology 18, pp. 438–63.

9 See: www.ncess.ac.uk/nodes/digitalrecord


Biber, D., S. Conrad and R. Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Bird, S. and M. Liberman. 2001. 'A formal framework for linguistic annotation', Speech Communication 33 (1–2), pp. 23–60.

Birdwhistell, R.L. 1952. Introduction to Kinesics: An Annotated System for the Analysis of Body Motion and Gesture. Louisville, Kentucky: University of Louisville.

Brown, R. 1986. Social Psychology. (Second edition.) New York: Free Press.

Brugman, H. and A. Russel. 2004. 'Annotating multi-media / multi-modal resources with ELAN', LREC 2004, pp. 2065–8. Lisbon, Portugal.

Brundell, P. and D. Knight. 2005. 'Current research and tools to support data intensive analysis for digital records in e-social science', Unpublished NCeSS project report. University of Nottingham.

Burnard, L. 2004. 'Developing linguistic corpora: metadata for corpus work' in M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice, pp. 30–46. Oxford: Oxbow Books.

Cameron, D. 2001. Working with Spoken Discourse. London: Sage.

Canepari, L. 2005. A Handbook of Phonetics. Munich: Lincom.

Carter, R. and S. Adolphs. 2005. 'Developing multimodal linguistic corpora', Unpublished NCeSS Node Meeting Presentation. University of Nottingham.

Cerrato, L. 2004. 'A coding scheme for the annotation of feedback phenomenon in everyday speech', LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces, pp. 25–8. Lisbon, Portugal.

Cerrato, L. and M. Skhiri. 2003. 'Analysis and measurement of head movements signalling feedback in face-to-face human dialogues', Proceedings of AVSP 2003, pp. 251–6. St Jorioz, France.

Chawla, P. and R.M. Krauss. 1994. 'Gesture and speech in spontaneous and rehearsed narratives', Journal of Experimental Social Psychology 30, pp. 580–601.

Church, R.B. and S. Goldin-Meadow. 1986. 'The mismatch between gesture and speech as an index of transitional knowledge', Cognition 23 (1), pp. 43–71.

Cook, G. 1995. 'Theoretical issues: transcribing the untranscribable' in G. Leech, G. Myers and J. Thomas (eds) Spoken English on Computer: Transcription, Mark-up and Applications, pp. 35–53. London: Longman.

Davis, M. 1979. 'The state of the art: past and present trends in body movement research' in A. Wolfgang (ed.) Nonverbal Behaviour: Applications and Cultural Implications, pp. 51–66. London: Academic Press.


Du Bois, J.W. 2004. Representing Discourse. MS, University of California, Santa Barbara.

Du Bois, J.W., S. Cumming, S. Schuetze-Coburn and D. Paolino. 1992. 'Discourse transcription', Santa Barbara Papers in Linguistics 4.

Du Bois, J.W., S. Schuetze-Coburn, S. Cumming and D. Paolino. 1993. 'Outline of discourse transcription' in J.A. Edwards and M.A. Lampert (eds) Talking Data: Transcription and Coding in Discourse Research, pp. 45–89. Hillsdale, New Jersey: Erlbaum.

Dybkjær, L. and N. Ole Bernsen. 2004. 'Recommendations for natural interactivity and multimodal annotation schemes', Proceedings of the LREC 2004 Workshop on Multimodal Corpora, pp. 5–8. Lisbon, Portugal.

Efron, D. 1972 [1941]. Gesture, Race and Culture. The Hague: Mouton and Co.

Eggins, S. and D. Slade. 1997. Analysing Casual Conversation. London: Cassell.

Ekman, P. and W.V. Friesen. 1968. 'Nonverbal behavior in psychotherapy research' in J. Shlien (ed.) Research in Psychotherapy, Volume III, pp. 179–216. Washington, DC: American Psychological Association.

Ekman, P. and W.V. Friesen. 1969. 'The repertoire of non-verbal behavior: categories, origins, usage and coding', Semiotica 1 (1), pp. 49–98.

Ekman, P. and W.V. Friesen. 1978. The Facial Action Coding System. San Francisco, California: Consulting Psychologists Press Inc.

Evans, J.L., M.W. Alibali and N.M. McNeill. 2001. 'Divergence of verbal expression and embodied knowledge: evidence from speech and gesture in children with specific language impairment', Language and Cognitive Processes 16 (2–3), pp. 309–31.

French, A., C. Greenhalgh, A. Crabtree, M. Wright, P. Brundell, A. Hampshire and T. Rodden. 2006. 'Software replay tools for time-based social science data', Proceedings of the Second Annual International e-Social Science Conference. Manchester, England: University of Manchester.

Frey, S., H.P. Hirsbrunner, A. Florin, W. Daw and R. Crawford. 1983. 'A unified approach to the investigation of nonverbal and verbal behaviour in communication research' in W. Doise and S. Moscovici (eds) Current Issues in European Social Psychology, pp. 143–99. Cambridge: Cambridge University Press.

Gibbon, D., I. Mertins and R.K. Moore (eds). 1997. Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter.

Goldin-Meadow, S. 1999. 'The role of gesture in communication and thinking', Trends in Cognitive Sciences 3 (11), pp. 419–29.


Goodwin, C. 2000. 'Action and embodiment within situated human interaction', Journal of Pragmatics 32 (10), pp. 1489–522.

Greenhalgh, C., A. French, P. Tennant, J. Humble and A. Crabtree. 2007. 'From ReplayTool to Digital Replay System', Online Proceedings of the Third International Conference on e-Social Science, ESRC / NSF, 7–9 October. Ann Arbor, Michigan.

Griffin, Z.M. 2004. 'The eyes are right when the mouth is wrong', Psychological Science 15 (12), pp. 814–21.

Griffin, Z.M. and K. Bock. 2000. 'What the eyes say about speaking', Psychological Science 11 (4), pp. 274–9.

Gripsrud, J. 2002. Understanding Media Culture. London: Arnold.

Gu, Y. 2006. 'Multimodal text analysis: a corpus linguistic approach to situated discourse', Text and Talk 26 (2), pp. 127–67.

Hadar, U. 1997. 'Interpreting at the surface', Clinical Studies 3, pp. 83–104.

Hällgren, A. and B. Lyberg. 1998. 'Visual speech synthesis with concatenative speech', Proceedings of International Conference on Auditory-Visual Speech Processing (AVSP'98), pp. 181–3. Terrigal, Australia.

Hill, D.R. 2000. 'Give us the tools: a personal view of multimodal computer–human dialogue' in M.M. Taylor, F. Ne'el and G.G. Bouwhuis (eds) The Structure of Multimodal Dialogue II, pp. 25–62. Amsterdam: John Benjamins.

Holler, J. and G.W. Beattie. 2002. 'A micro-analytic investigation of how iconic gestures and speech represent core semantic features in talk', Semiotica 142 (1–4), pp. 31–69.

Holler, J. and G.W. Beattie. 2003. 'How iconic gestures and speech interact in the representation of meaning: are both aspects really integral to the process?', Semiotica 146 (1–4), pp. 81–116.

Holler, J. and G.W. Beattie. 2004. 'The interaction of iconic gesture and speech', Fifth International Gesture Workshop. Genova, Italy. Selected Revised Papers. Heidelberg: Springer Verlag.

Ide, N. 1998. 'Corpus encoding standard: SGML guidelines for encoding linguistic corpora', First International Language Resources and Evaluation (LREC) Conference. Granada, Spain.

Kapoor, A. and R.W. Picard. 2001. 'A real-time head nod and shake detector', ACM International Conference Proceedings Series, pp. 1–5.

Kawato, S. and J. Ohya. 2000. 'Real-time detection of nodding and head shaking by directly detecting and tracking the "between eyes"', Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, pp. 40–5.


Kendon, A. 1972. 'Some relationships between body motion and speech' in A. Seigman and B. Pope (eds) Studies in Dyadic Communication, pp. 177–216. Elmsford, New York: Pergamon Press.

Kendon, A. 1980. 'Gesticulation and speech: two aspects of the process of utterance' in M.R. Key (ed.) The Relationship of Verbal and Nonverbal Communication, pp. 207–27. The Hague; New York: Mouton.

Kendon, A. 1982. 'The organisation of behaviour in face-to-face interaction: observations on the development of a methodology' in K.R. Scherer and P. Ekman (eds) Handbook of Methods in Nonverbal Behaviour Research, pp. 440–505. Cambridge: Cambridge University Press.

Kendon, A. 1987. 'On gesture: its complementary relationship with speech' in A.W. Siegman and S. Feldstein (eds) Nonverbal Behavior and Communication, pp. 65–97. London: Lawrence Erlbaum Associates.

Kendon, A. 1994. 'Do gestures communicate? A review', Research on Language and Social Interaction 27 (3), pp. 175–200.

Kendon, A. 1997. 'Gesture', Annual Review of Anthropology 26, pp. 109–28.

Kipp, M. 2001. 'Anvil – a generic annotation tool for multimodal dialogue', Proceedings of the Seventh European Conference on Speech Communication and Technology (Eurospeech), pp. 1367–70. Aalborg, Denmark.

Kita, S. 2000. 'How representational gesture helps speaking' in D. McNeill (ed.) Gesture and Language. Cambridge: Cambridge University Press.

Kita, S. (ed.). 2003. Pointing: Where Language, Culture and Cognition Meet. Mahwah, New Jersey: Lawrence Erlbaum.

Kita, S., I. van Gijn and H. van der Hulst. 1997. 'Movement phase in signs and co-speech gestures and their transcriptions by human coders', Gesture Workshop, pp. 23–35.

Knight, D. 2006. 'Corpora: the next generation', Part of the AHRC-funded online Introduction to Corpus Investigative Techniques, The University of Birmingham. Accessed 26 February 2006, at: http://www.humcorp.bham.ac.uk/

Knight, D. and S. Adolphs. 2006. Text, Talk and Corpus Analysis. University of Nottingham Academic Module MA.

Knight, D. and S. Adolphs. 2008. 'Multi-modal corpus pragmatics: the case of active listenership' in J. Romeo (ed.) Corpus and Pragmatics, pp. 175–90. Berlin and New York: Mouton de Gruyter.

Knight, D., S. Bayoumi, S. Mills, A. Crabtree, S. Adolphs, T. Pridmore and R. Carter. 2006. 'Beyond the text: construction and analysis of multi-modal linguistic corpora', Proceedings of the Second International Conference on e-Social Science, 28–30 June. Manchester.


Knudsen, M.W., J.C. Martin, L. Dybkjær, M.J.M.N. Ayuso, N.O. Bernsen, J. Carletta, S. Kita, U. Heid, J. Llisterri, C. Pelachaud, I. Poggi, N. Reithinger, G. van ElsWijk and P. Wittenburg. 2002. 'Survey of multimodal annotation schemes and best practice', ISLE Deliverable D9.1, 2002.

Kress, G.R. 2001. Multimodal Discourse: The Modes and Media of Contemporary Communication. London: Arnold.

Kress, G.R. and T. van Leeuwen. 1996. Reading Images. London: Routledge.

Labov, W. 1972. Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press.

Lapadat, J.C. and A.C. Lindsay. 1999. 'Transcription in research and practice: from standardisation of technique to interpretative positioning', Qualitative Inquiry 5 (1), pp. 64–86.

Laver, J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Leech, G. 1997. 'Introducing corpus annotation' in R. Garside, G. Leech and T. McEnery (eds) Corpus Annotation: Linguistic Information from Computer Text Corpora, pp. 1–18. London: Longman.

Leech, G. 2004. 'Adding linguistic annotation' in M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice, pp. 1–16. Oxford: Oxbow Books. Accessed 26 February 2006, at: http://ahds.ac.uk/linguistic-corpora/

Leech, G., G. Myers and J. Thomas (eds). 1995. Spoken English on Computer: Transcription, Mark-up and Application. London: Longman.

Markee, N. 2000. Conversation Analysis. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Martinec, R. 1998. 'Cohesion in action', Semiotica 120 (1/2), pp. 161–80.

Martinec, R. 2001. 'Interpersonal resources in action', Semiotica 135 (1/4), pp. 117–45.

McClave, E.Z. 2000. 'Linguistic functions of head movements in the context of speech', Journal of Pragmatics 32 (7), pp. 855–78.

McEnery, A. and Z. Xiao. 2004. 'The Lancaster Corpus of Mandarin Chinese: a corpus for monolingual and contrastive language study' in M. Lino, M. Xavier, F. Ferreire, R. Costa and R. Silva (eds) Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC) 2004, pp. 1175–8. Lisbon, Portugal.

McNeill, D. 1979. The Conceptual Basis of Language. Hillsdale, New Jersey: Erlbaum.

McNeill, D. 1985. 'So you think gestures are nonverbal?', Psychological Review 92 (3), pp. 350–71.


McNeill, D. 1992. Hand and Mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press.

McNeill, D., J. Cassell and K.E. McCullough. 1994. 'Communicative effects of speech-mismatched gestures', Research on Language and Social Interaction 27 (3), pp. 223–37.

Meyer, C.F. 2002. English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press.

Nobe, S. 1996. 'Cognitive rhythms, gestures and acoustic aspects of speech'. PhD thesis, University of Chicago.

Noller, P. 1984. Nonverbal Communication and Marital Interaction. Oxford: Pergamon Press.

O'Connell, D.C. and S. Kowal. 1999. 'Transcription and the issue of standardisation', Journal of Psycholinguistic Research 28 (2), pp. 103–20.

Ochs, E. 1979. 'Transcription as theory' in E. Ochs and B.B. Schieffelin (eds) Developmental Pragmatics. New York: Academic Press.

O'Keeffe, A. and S. Adolphs. 2008. 'Response tokens in British and Irish discourse: corpus, context and variational pragmatics' in A. Barron and K. Schneider (eds) Variational Pragmatics. Amsterdam: John Benjamins.

Olsher, D. 2004. 'Talk and gesture: the embodied completion of sequential actions in spoken interaction' in R. Wagner and J. Gardner (eds) Second Language Conversations, pp. 221–45. London: Continuum.

Pea, R., M. Mills, J. Rosen, K. Dauber, W. Effelsberg and E. Hoffert. 2004. 'The diver project: interactive digital video repurposing', IEEE MultiMedia 11 (1), pp. 54–61.

Prillwitz, S., R. Leven, H. Zienert, T. Hanke and J. Henning. 1989. HamNoSys. Version 2.0, Hamburg Notation System for Sign Language – An Introductory Guide. Hamburg: Signum.

Psathas, G. and T. Anderson. 1990. 'The "practices" of transcription in conversation analysis', Semiotica 78 (1/2), pp. 75–99.

Rayson, P. 2003. 'Matrix: a statistical method and software tool for linguistic analysis through corpus comparison'. PhD thesis, Lancaster University.

Reppen, R. and R. Simpson. 2002. 'Corpus linguistics' in N. Schmitt (ed.) An Introduction to Applied Linguistics, pp. 92–111. London: Arnold.

Richmond, V.P., J.C. McCroskey and S.K. Payne. 1991. Nonverbal Behaviour in Interpersonal Relations. New Jersey: Prentice Hall.

Rimé, B. and L. Schiarartura. 1991. 'Gesture and speech' in R.S. Feldman and D. Rimé (eds) Fundamentals of Nonverbal Behaviour, pp. 239–84. New York: Cambridge University Press.


Roberts, C. 2006. 'Part one: issues in transcribing spoken discourse', from the academic course qualitative research methods and transcription, Kings College London. Available online at: http://www.kcl.ac.uk/content/1/c6/01/81/04/part1.pdf

Saferstein, B. 2004. 'Digital technology and methodological adaptation: text on video as a resource for analytical reflexivity', Journal of Applied Linguistics 1 (2), pp. 197–223.

Schegloff, E.A. 1984. 'On some gestures' relation to talk' in J.M. Atkinson and E.J. Heritage (eds) Structures of Social Action: Studies in Conversation Analysis, pp. 266–96. Cambridge: Cambridge University Press.

Scholfield, P. 1995. Quantifying Language. Clevedon: Multilingual Matters Ltd.

Scollon, R. 1998. Mediated Discourse as Social Interaction. London: Longman.

Streeck, J. 1993. 'Gesture as communication 1: its coordination with gaze and speech', Communication Monographs 60, pp. 275–99.

Streeck, J. 1994. 'Gestures as communication 2: the audience as co-author', Research on Language and Social Interaction 27 (3), pp. 239–67.

Taylor, M.M., F. Ne'el and G.G. Bouwhuis (eds). 2000. The Structure of Multimodal Dialogue II. Amsterdam: John Benjamins.

ten Have, P. 2007. Doing Conversational Analysis: A Practical Guide. (Second edition.) London: Sage.

Thompson, L.A. and D.W. Massaro. 1986. 'Evaluation and integration of speech and pointing gestures during referential understanding', Journal of Experimental Child Psychology 42 (1), pp. 144–68.

Thompson, P. 2004. 'Spoken language corpora' in M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice, pp. 59–70. Oxford: Oxbow Books.

Tian, Y., T. Kanade and J.F. Cohn. 2000. 'Dual-state parametric eye tracking', Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition.

Wilcox, S. 2004. 'Language from gesture', Behavioral and Brain Sciences 27 (4), pp. 524–5.

Wittenburg, P., D. Broeder and B. Sloman. 2000. 'Meta-description for language resources', EAGLES/ISLE White Paper. Accessed 2 October 2006, at: http://www.mpi.nl/world/ISLE/documents/papers/white_paper_11.pdf

Yngve, V. 1970. 'On getting a word in edgewise', Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, pp. 567–77. Chicago: Chicago Linguistic Society.