
Institute for Language and Speech Processing (ILSP) / ATHENA Research and Innovation Centre

www.ilsp.gr

Sign language technology and the promise it holds for corpus linguistics

Eleni Efthimiou
eleni_e@ilsp.gr

Assistive Technology Group
Sign Language Technologies Team (SLT)

www.ilsp.gr/assistive.html

During the 1st SLCN Workshop we dealt with the question:

Is every set of language data a corpus?

4th SLCN Workshop, 3-4 December 2010, Berlin

Language Corpus Definition

“A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language”

(Source: Sinclair, http://www.ilc.cnr.it/EAGLES)


The definition of computer corpus in the same document proves crucial for our discussion:

“A computer corpus is a corpus which is encoded in a standardised and homogenous way for open-ended retrieval tasks…”


Corpus classification by Atkins et al. (1991):

“a corpus is a well defined subset [of language] that is designed following specific requirements to serve specific purposes”,

most crucially the demand for knowledge management, either in the form of information retrieval or in the form of automatic categorisation and text dispatching according to thematic category.


Three types of corpora:

- Text corpora
- Speech corpora
- Sign Language corpora


Natural Language Corpora

- Should contain all possible instances (vocabulary and grammar phenomena) of a language required for the fulfillment of the corpus design purposes.

- Particular issues:
  - adequate coverage
  - adequate data quantities
  - efficient data classification
  - iterative evaluation techniques


Why do we need all this?

Because we develop corpora in order to exploit them in:

- theoretical linguistics research

- human language technologies (HLTs) and tools


Text corpora of oral languages

- Comprise many millions of words
- Are available on the web
- Have NLP tools especially developed to make them exploitable
- Provide input to various NLP applications


Text corpora are:

- Tagged
- Lemmatised
- Indexed
- Assigned metadata (although metadata remain an open issue: Meta-Net)
- SEARCHABLE
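As a minimal sketch of why these properties make a text corpus searchable, the following Python snippet builds an inverted index over lemmas for a tiny hand-annotated sample and uses it to produce keyword-in-context lines. The sample sentences, the tag labels, and the function names `build_lemma_index` and `concordance` are illustrative assumptions, not part of any particular corpus toolkit.

```python
from collections import defaultdict

# Toy corpus: each token is (surface form, lemma, part-of-speech tag).
# The sentences and tags are made up for illustration.
CORPUS = [
    [("The", "the", "DET"), ("signers", "signer", "NOUN"), ("met", "meet", "VERB")],
    [("A", "a", "DET"), ("signer", "signer", "NOUN"), ("meets", "meet", "VERB"),
     ("friends", "friend", "NOUN")],
]

def build_lemma_index(corpus):
    """Map each lemma to the (sentence, token) positions where it occurs."""
    index = defaultdict(list)
    for s, sentence in enumerate(corpus):
        for t, (_form, lemma, _tag) in enumerate(sentence):
            index[lemma].append((s, t))
    return index

def concordance(corpus, index, lemma, window=1):
    """Return keyword-in-context lines for every occurrence of a lemma."""
    lines = []
    for s, t in index.get(lemma, []):
        forms = [form for form, _lemma, _tag in corpus[s]]
        left = " ".join(forms[max(0, t - window):t])
        right = " ".join(forms[t + 1:t + 1 + window])
        lines.append(f"{left} [{forms[t]}] {right}".strip())
    return lines

INDEX = build_lemma_index(CORPUS)
print(concordance(CORPUS, INDEX, "meet"))
```

Because the lookup goes through the lemma, the query "meet" finds both "met" and "meets"; this is the kind of content-level retrieval that lemmatisation and indexing make possible.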


Search in Sign Language Corpora is restricted to metadata ONLY

- Storage medium
- (Detailed) genre of narration
- (Detailed) topic
- Signer personal data
- Date of capturing
- Other corpus-external info


=> Information is restricted to:

- Elements peripheral to the linguistic content
- No cues about the use of the language
- No statistics possible as to frequency of (co)occurrences in signing utterance
- No concordances
- No morpho-phonological variations
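To illustrate the limitation, here is a hedged Python sketch of metadata-only search. The record fields loosely mirror the metadata categories named for sign language corpora, but the field names, the sample records, and the `search_metadata` helper are hypothetical. The video itself appears only as an opaque file reference, so no query can reach the signs inside it.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SLRecording:
    """One corpus item: descriptive metadata plus an opaque video reference."""
    video_file: str        # content is only referenced, never searched
    storage_medium: str
    genre: str
    topic: str
    signer_id: str
    captured_on: date

# Hypothetical records, invented for illustration.
RECORDINGS = [
    SLRecording("rec001.mp4", "disk", "narrative", "travel", "S01", date(2009, 5, 12)),
    SLRecording("rec002.mp4", "tape", "dialogue", "school", "S02", date(2010, 1, 20)),
    SLRecording("rec003.mp4", "disk", "narrative", "school", "S01", date(2010, 3, 3)),
]

def search_metadata(recordings, **criteria):
    """Return recordings whose metadata fields equal all given criteria."""
    return [
        rec for rec in recordings
        if all(getattr(rec, field) == value for field, value in criteria.items())
    ]

hits = search_metadata(RECORDINGS, genre="narrative", signer_id="S01")
print([rec.video_file for rec in hits])
```

A query such as "all narratives by signer S01" works, whereas "all utterances containing a given sign" cannot even be expressed over these records.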


Oral language corpora in NLP domain

- Statistical (linguistic knowledge blind) processing
- Tag labeling
- Linguistic knowledge rich processing
- Hybrid approaches

What about SL corpora in NLP domain?


Sign language technologies open the way to true exploitation of SL corpora

- Manual annotation supported by automatic annotation tools
- Automatic segmentation of sign/phrase boundaries
- Labeling
- Tagging

based on Sign Recognition technologies


- Closer-to-natural Sign Synthesis
- More accurate formal representation of SLs

because of the increase in the volume of available data

via

- Search directly in the content of SL video
- Retrieval of linguistic information


Tag/label based chunking exploiting, e.g.:

- Detention (D) detector
- Posture (P) detector
- Transition (T) trajectory shape detector
- Transition ballistic trajectory detector
- Limb orientation detector
- Detection of elbow locations


- Detection of hand locations
- Contact detector
- Symmetry detector
- Head pose orientation detector
- Eyebrow detector
- Shoulder detector
- Segmentation
- Signer recognition
- Movement features
- Face tracker
- Signing space calibration
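As a rough sketch of tag/label based chunking, suppose that detectors of this kind have already produced one label per video frame: D (detention), P (posture) or T (transition). The per-frame labels below are invented, and `chunk_labels` is a hypothetical helper. It only merges runs of identical labels into chunks and does not implement any of the detectors themselves.

```python
from itertools import groupby

def chunk_labels(frame_labels):
    """Merge consecutive identical frame labels into (label, start, end) chunks.

    `start` is inclusive and `end` is exclusive, both in frame indices.
    """
    chunks = []
    position = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        chunks.append((label, position, position + length))
        position += length
    return chunks

# Hypothetical detector output for ten frames.
FRAMES = list("PPTTTDDTTP")
print(chunk_labels(FRAMES))
```

Chunks of this form could then serve as candidate units for labeling and tagging, with the frame indices pointing back into the video.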


Iterative algorithm for corpus creation valid also in the case of SL corpora

[Flowchart]

Initial set of sentences -> Coverage calculation -> Coverage attained? (checked against the target diphone coverage)

- NO: Add new sentences -> Expanded Corpus -> back to Coverage calculation
- YES: Complete Corpus

Manual Fine-Tuning
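The loop in this flowchart can be sketched in Python as follows. This is a minimal sketch under stated assumptions: adjacent character pairs stand in for diphones (for a sign language corpus the unit inventory would be different), new sentences are picked greedily by how many missing units they add, and the names `units`, `covered_units`, `coverage` and `build_corpus` are hypothetical. The manual fine-tuning step is not modelled.

```python
def units(sentence):
    """Stand-in for diphone extraction: adjacent character pairs, spaces removed."""
    text = sentence.replace(" ", "")
    return {text[i:i + 2] for i in range(len(text) - 1)}

def covered_units(corpus):
    """All units that occur in at least one sentence of the corpus."""
    covered = set()
    for sentence in corpus:
        covered |= units(sentence)
    return covered

def coverage(corpus, target_units):
    """Fraction of the (non-empty) target unit set found in the corpus."""
    return len(covered_units(corpus) & target_units) / len(target_units)

def build_corpus(initial, candidates, target_units, threshold=1.0):
    """Add candidate sentences until the coverage threshold is attained."""
    corpus = list(initial)
    pool = list(candidates)
    while coverage(corpus, target_units) < threshold and pool:
        missing = target_units - covered_units(corpus)
        # Greedy choice: the candidate contributing the most missing units.
        best = max(pool, key=lambda s: len(units(s) & missing))
        if not units(best) & missing:
            break  # no remaining candidate improves coverage
        corpus.append(best)
        pool.remove(best)
    return corpus

TARGET = {"ab", "bc", "cd", "de"}
result = build_corpus(["ab"], ["bcd", "xyz", "de"], TARGET)
print(result, coverage(result, TARGET))
```

The greedy selection is one simple design choice among several; it keeps the corpus small by skipping sentences, such as "xyz" here, that add nothing towards the target.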


Thank you for your attention!
