
Multi-modal corpus design, construction and use

David Evans, Dawn Knight, Ronald Carter and Svenja Adolphs

BAAL 2007, 6-8th September 2007, The University of Edinburgh

Introducing the Digital Record Project:

• 3-year research initiative, funded by the Economic and Social Research Council (ESRC)

• Part of an e-Social Science ‘Node’ based at The University of Nottingham

• Interdisciplinary project, involving staff from Psychology, Applied Linguistics and Computer Science

• Aim: to develop a multi-modal corpus of spoken interaction, the Nottingham Multi-Modal Corpus (NMMC)

The Nottingham Multi-Modal Corpus (NMMC):

Corpus data:

• 250,000 words in total:

    • 125,000 words of 1-party data

    • 125,000 words of 2-party data

• Data in three different modes: textual, audio and video

Corpus tool-bench:

• Develop a reusable corpus tool (with appropriate linguistic software)

• Search lexical, prosodic and gestural features of spoken discourse

Key Methodological Issues:

1) Data collection and collation: Capturing, transcribing and aligning, and adding gesture to transcription

2) Tracking, defining and coding gestures of interest: Using specifically developed software to track and automatically encode gestures according to a pre-defined kinesic coding scheme

3) Representing the data in an easy-to-use interface for further analysis: Constructing an intelligent corpus database and associated software (including a text/gesture concordancer)

1a) Capturing data

Naturalistic data v. Usable video image

1b) Transcribing and aligning data

• All data is transcribed using CANCODE transcription conventions.

• Data is also time-stamped using Transana, linking the textual and audio streams:

¤<139851><$1> But if it's if it's utterly irrelevant then you're alright.
¤<143459><$2> Right.
¤<143793><$1> Do you see what I mean cos cos you're not there's no interfering factor then.
¤<147602><$2> Yeah so s=
¤<148138><$1> Erm so that sounds like it's okay.
¤<150144><$2> Okay.
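The time-stamped excerpt above can be read programmatically. The following is a minimal sketch, not the project's own tooling: it assumes each turn has the shape `¤<time><$speaker> text`, and that the number in the time code is a millisecond offset into the recording (an assumption; the slides do not state the unit). The names `Turn` and `parse_turns` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Turn(NamedTuple):
    start_ms: int   # time code, assumed to be milliseconds from the start
    speaker: str    # speaker number taken from the <$n> tag
    text: str       # transcribed words of the turn


# A turn is a time code, a speaker tag, then text up to the next time code.
TURN_RE = re.compile(r"¤<(\d+)><\$(\d+)>\s*(.*?)(?=¤<\d+>|\Z)", re.DOTALL)


def parse_turns(transcript: str) -> List[Turn]:
    """Split a time-stamped transcript into (start, speaker, text) turns."""
    turns = []
    for match in TURN_RE.finditer(transcript):
        text = " ".join(match.group(3).split())  # collapse line breaks
        turns.append(Turn(int(match.group(1)), match.group(2), text))
    return turns


SAMPLE = (
    "¤<139851><$1> But if it's if it's utterly irrelevant then you're alright.\n"
    "¤<143459><$2> Right.¤<143793><$1> Do you see what I mean cos cos\n"
    "you're not there's no interfering factor then.\n"
    "¤<147602><$2> Yeah so s=¤<148138><$1> Erm so that sounds like it's okay."
    "¤<150144><$2> Okay.\n"
)

turns = parse_turns(SAMPLE)
for turn in turns:
    print(turn.start_ms, turn.speaker, turn.text)
```

Because each turn carries its own start time, the text can be looked up from a position in the audio or video stream and vice versa, which is the kind of linkage the time-stamping is meant to provide.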

1c) Adding gesture to transcription

• Do you see what I mean cos cos you're not there's [no interfering factor] then.

onset, stroke, retraction

2a) Defining gestures of interest for codification


Figure 2: Division of the gesture space for transcription purposes. (From McNeill, 1992: 378)

2b) Coding gestures of interest

Figure 3: Computer image tracking applied to video

We have developed a 4-point coding scheme for hand movement:

1) Left hand moves to the left

2) Left hand moves to the right

3) Right hand moves to the left

4) Right hand moves to the right
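The four codes above can be illustrated with a short sketch. It is not the tracker's actual implementation: it assumes the tracker reports one horizontal position per hand per video frame, that x increases towards the right of the image, and that movements smaller than a threshold are left uncoded. The names `code_movement`, `code_track` and the `threshold` parameter are hypothetical, and the slides do not say whether "left" means the speaker's or the viewer's left.

```python
from typing import List, Optional, Sequence

# (hand, direction) -> code number, following the 4-point scheme above.
CODES = {
    ("left", "left"): 1,
    ("left", "right"): 2,
    ("right", "left"): 3,
    ("right", "right"): 4,
}


def code_movement(hand: str, x_prev: float, x_curr: float,
                  threshold: float = 5.0) -> Optional[int]:
    """Return the code for one frame-to-frame movement, or None if the
    hand moved less than `threshold` units (assumed noise margin)."""
    dx = x_curr - x_prev
    if abs(dx) < threshold:
        return None
    direction = "right" if dx > 0 else "left"  # assumes x grows rightwards
    return CODES[(hand, direction)]


def code_track(hand: str, xs: Sequence[float],
               threshold: float = 5.0) -> List[Optional[int]]:
    """Code every consecutive pair of positions in one hand's track."""
    return [code_movement(hand, a, b, threshold) for a, b in zip(xs, xs[1:])]


left_hand_codes = code_track("left", [100, 90, 92, 110])
print(left_hand_codes)  # [1, None, 2]
```

A threshold of this kind is one simple way to keep small tracking jitter from being coded as gesture; the value used here is arbitrary.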

2c) Turning raw data to corpus data

Figure 4: An Excel output generated by the tracker

3a) Requirements for MM corpus representation

Figure 5: Defining ‘3rd Generation’ corpus software (key: 2nd Generation, 3rd Generation)

3b) Current shortcomings of corpus software

• Current tools tend to focus either on the management of data or upon the processes of coding and annotating previously collected data (examples include Transana, Anvil, NITE XML Workbench, ELAN)


• There does not appear to be a tool available to support the integration of these individual processes, supporting the research process from:

The ‘Record’ Phase

Organising Records

Analysing Data

Defining and Coding Data

3c) Introducing DRS: Basic user information

• The DRS (formerly ReplayTool) enables the replay and annotation of large quantities of time-based datasets.

• It allows for the simultaneous synchronized replay of multiple data sources including videos, system log files, spatial data.

• In addition to the actual replay and annotation of such data sets, the DRS will also enable the user to perform tasks with their data files that aid the organisation of their data sets.

3d) DRS: A real-time demo

Demonstrating the basic corpus tool-bench interface, for the representation and replay of individual sets of encoded data, and the concordance tool that has been developed as part of the tool to enable detailed linguistic enquiry:

http://www.mrl.nott.ac.uk/research/projects/dress/software/DRS/webstart/drs.jnlp

http://www.mrl.nott.ac.uk/research/projects/dress/software/DRS/replaytool.zip
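How the DRS concordance tool works internally is not described in these slides. As a general illustration of what a text concordancer does, the sketch below produces keyword-in-context (KWIC) lines for a search word. The function name `kwic` and the `width` parameter are hypothetical, and matching is done on whitespace-separated tokens, ignoring case.

```python
from typing import List, Tuple


def kwic(tokens: List[str], node: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, node word, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


utterance = ("Do you see what I mean cos cos you're not "
             "there's no interfering factor then")
lines = kwic(utterance.split(), "cos", width=2)
for left, word, right in lines:
    print(f"{left:>12} | {word} | {right}")
```

A multi-modal concordancer would additionally return the time code of each hit so that the matching stretch of audio and video can be replayed alongside the text.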

3f) Outlining ethical issues and concerns

• Defining ‘consent’

• Anonymisation in textual, audio and video data: the limitations of pixellisation

• Re-use and distribution problems

Contacts:

David Evans: [email protected]

Dawn Knight: [email protected]

Ronald Carter: [email protected]

Svenja Adolphs: [email protected]
