Page 1

Institute for Language and Speech Processing (ILSP) / ATHENA Research and Innovation Centre
www.ilsp.gr

Sign language technology and the promise it holds for corpus linguistics

Eleni Efthimiou
[email protected]

Assistive Technology Group
Sign Language Technologies Team (SLT)
www.ilsp.gr/assistive.html


Page 2

During the 1st SLCN Workshop we dealt with the question:

Is every set of language data a corpus?

Page 3

4th SLCN Workshop, 3-4 December 2010, Berlin

Language Corpus Definition

“A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language”

(Source: Sinclair, http://www.ilc.cnr.it/EAGLES)

Page 4

The definition of computer corpus in the same document proves crucial for our discussion:

“A computer corpus is a corpus which is encoded in a standardised and homogenous way for open-ended retrieval tasks…”

Page 5

Corpus classification by Atkins et al. (1991):

“a corpus is a well defined subset [of language] that is designed following specific requirements to serve specific purposes”,

most crucially the demand for knowledge management, either in the form of information retrieval or in the form of automatic categorisation and text dispatching according to thematic category.

Page 6

Three types of corpora:

- Text corpora
- Speech corpora
- Sign Language corpora

Page 7

Natural Language Corpora

- Should contain all possible instances (vocabulary and grammar phenomena) of a language required for the fulfillment of the corpus design purposes.
- Particular issues:
  - adequate coverage
  - adequate data quantities
  - efficient data classification
  - iterative evaluation techniques

Page 8

Why do we need all this?

Because we develop corpora in order to exploit them in:

- theoretical linguistics research

- human language technologies (HLTs) and tools

Page 9

Text corpora of oral languages

- Comprise many millions of words
- Are available on the web
- Have NLP tools especially developed to make them exploitable
- Provide input to various NLP applications

Page 10

Text corpora are:

- Tagged
- Lemmatised
- Indexed
- Assigned metadata (although metadata remain an open issue: Meta-Net)
- SEARCHABLE
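What "tagged, lemmatised, indexed, searchable" buys in practice can be illustrated with a minimal sketch. The toy sentences, lemmas, tags and the function names `build_index` and `search` are all assumptions made for illustration; a real text corpus would obtain its tags and lemmas from dedicated NLP tools.

```python
from collections import defaultdict

# Hypothetical toy corpus: each token is (word form, lemma, part-of-speech tag).
# Lemmas and tags are hand-assigned here purely for illustration.
corpus = [
    [("The", "the", "DET"), ("signers", "signer", "NOUN"), ("met", "meet", "VERB")],
    [("A", "a", "DET"), ("signer", "signer", "NOUN"), ("meets", "meet", "VERB"),
     ("them", "they", "PRON")],
]

def build_index(sentences):
    """Map each lemma to the (sentence, token) positions where it occurs."""
    index = defaultdict(list)
    for s, sentence in enumerate(sentences):
        for t, (_form, lemma, _tag) in enumerate(sentence):
            index[lemma].append((s, t))
    return index

def search(sentences, index, lemma, tag=None):
    """Return the word forms of `lemma`, optionally restricted to one tag."""
    hits = []
    for s, t in index.get(lemma, []):
        form, _lemma, token_tag = sentences[s][t]
        if tag is None or token_tag == tag:
            hits.append(form)
    return hits

index = build_index(corpus)
print(search(corpus, index, "meet", tag="VERB"))
```

Because the index is keyed on lemmas, one query retrieves every inflected form, which is exactly the kind of content-level search the following slides report as missing for sign language video.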

Page 14

Search in Sign Language Corpora is restricted to metadata ONLY

- Storage Medium
- (Detailed) Genre of narration
- (Detailed) Topic
- Signer personal data
- Date of capturing
- Other corpus external info

Page 15

=> Information is restricted to:

- Elements peripheral to the linguistic content
- No cues about the use of the language
- No statistics possible as to frequency of (co)occurrences in signing utterances
- No concordances
- No morpho-phonological variations
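For contrast, here is a rough sketch of what becomes computable once utterances carry content-level annotation. It assumes each signed utterance has already been annotated as a sequence of glosses; the gloss data and the functions `concordance` and `cooccurrences` are hypothetical and serve only to illustrate the statistics listed above.

```python
from collections import Counter

# Hypothetical gloss annotations, one list per signed utterance.
utterances = [
    ["INDEX-1", "HOUSE", "BUY"],
    ["HOUSE", "BIG", "INDEX-3", "BUY"],
    ["INDEX-1", "BOOK", "BUY"],
]

def concordance(utts, target, width=1):
    """Keyword-in-context lines: (left context, target, right context)."""
    lines = []
    for utt in utts:
        for i, gloss in enumerate(utt):
            if gloss == target:
                left = utt[max(0, i - width):i]
                right = utt[i + 1:i + 1 + width]
                lines.append((left, gloss, right))
    return lines

def cooccurrences(utts):
    """Count unordered pairs of distinct glosses sharing an utterance."""
    counts = Counter()
    for utt in utts:
        glosses = sorted(set(utt))
        for i, a in enumerate(glosses):
            for b in glosses[i + 1:]:
                counts[(a, b)] += 1
    return counts

for line in concordance(utterances, "HOUSE"):
    print(line)
print(cooccurrences(utterances)[("BUY", "HOUSE")])
```

None of this is possible when only storage medium, genre, topic, signer data and capture date are searchable, since those fields say nothing about which signs occur together.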

Page 17

Oral language corpora in NLP domain

- Statistical (linguistic knowledge blind) processing
- Tag labeling
- Linguistic knowledge rich processing
- Hybrid approaches

What about SL corpora in NLP domain?

Page 18

Sign language technologies open the way to true exploitation of SL corpora

- Manual annotation supported by automatic annotation tools
- Automatic segmentation of sign/phrase boundaries
- Labeling
- Tagging

based on Sign Recognition technologies

Page 19

- Closer to natural Sign Synthesis
- More accurate formal representation of SLs

because of the increase of the volume of available data

via

- Search directly in the content of SL video
- Retrieval of linguistic information

Page 20

Tag/label based chunking exploiting e.g.

- Detention (D) detector
- Posture (P) detector
- Transition (T) trajectory shape detector
- Transition ballistic trajectory detector
- Limb orientation detector
- Detection of elbow locations

Page 21

- Detection of hand locations
- Contact detector
- Symmetry detector
- Head Pose orientation detector
- Eyebrow detector
- Shoulder detector
- Segmentation
- Signer recognition
- Movement features
- Face Tracker
- Signing Space calibration

Page 22

Iterative algorithm for corpus creation valid also in the case of SL corpora

[Flowchart. Boxes: Initial set of sentences; Target diphone coverage; Coverage calculation; Coverage attained? (NO: Add new sentences, Expanded Corpus; YES: Manual Fine-Tuning, Complete Corpus)]
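The loop in the flowchart can be approximated by a greedy selection, sketched below under several assumptions: sentences are plain sequences of symbols, a "unit" is any adjacent pair of symbols (standing in for diphones, or for whatever units an SL corpus would target), the target set is non-empty, and new sentences come from a fixed candidate pool. The names `units`, `coverage` and `build_corpus` are hypothetical, and the manual fine-tuning step is left out because it is not algorithmic.

```python
def units(sentence):
    """Adjacent symbol pairs of a sentence (a stand-in for diphones)."""
    return set(zip(sentence, sentence[1:]))

def covered_units(corpus):
    """All units occurring anywhere in the corpus."""
    return set().union(*(units(s) for s in corpus))

def coverage(corpus, target):
    """Fraction of the target units that the corpus contains."""
    return len(covered_units(corpus) & target) / len(target)

def build_corpus(initial, candidates, target, threshold=1.0):
    """Add candidate sentences until the target coverage is attained."""
    corpus = list(initial)
    pool = list(candidates)
    while coverage(corpus, target) < threshold and pool:
        missing = target - covered_units(corpus)
        # Greedy choice: the candidate contributing most missing units.
        best = max(pool, key=lambda s: len(units(s) & missing))
        if not units(best) & missing:
            break  # no remaining candidate adds a missing unit
        corpus.append(best)
        pool.remove(best)
    return corpus

# Toy run with strings as sentences and characters as symbols.
target = {("a", "b"), ("b", "c"), ("c", "d")}
result = build_corpus(["ab"], ["xy", "bcd", "bc"], target)
print(result, coverage(result, target))
```

The greedy choice is only one possible way to "add new sentences"; the slide does not say how candidates are selected, so this part in particular should be read as an illustration rather than the presented method.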

Page 23

Thank you for your attention!