Institute for Language and Speech Processing (ILSP) / ATHENA Research and Innovation Centre
www.ilsp.gr
Sign language technology and the promise it holds for corpus linguistics
Eleni Efthimiou
eleni_e@ilsp.gr
Assistive Technology Group
Sign Language Technologies Team (SLT)
www.ilsp.gr/assistive.html
During the 1st SLCN Workshop we dealt with the question:
Is every set of language data a corpus?
4th SLCN Workshop, 3-4 December 2010, Berlin
Language Corpus Definition
“A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language”
Source: Sinclair, http://www.ilc.cnr.it/EAGLES
The definition of computer corpus in the same document proves crucial for our discussion:
“A computer corpus is a corpus which is encoded in a standardised and homogenous way for open-ended retrieval tasks…”
Corpus classification by Atkins et al. (1991): “a corpus is a well defined subset [of language] that is designed following specific requirements to serve specific purposes”,
most crucially the demand for knowledge management, either in the form of information retrieval or in the form of automatic categorisation and text dispatching according to thematic category.
Three types of corpora:
- Text corpora
- Speech corpora
- Sign Language corpora
Natural Language Corpora
- Should contain all possible instances (vocabulary and grammar phenomena) of a language required for the fulfillment of the corpus design purposes.
- Particular issues:
  - adequate coverage
  - adequate data quantities
  - efficient data classification
  - iterative evaluation techniques
Why do we need all this?
Because we develop corpora in order to exploit them in:
- theoretical linguistics research
- human language technologies (HLTs) and tools
Text corpora of oral languages
- Comprise many millions of words
- Are available on the web
- Have NLP tools especially developed to make them exploitable
- Provide input to various NLP applications
Text corpora are:
- Tagged
- Lemmatised
- Indexed
- Assigned metadata (although metadata remain an open issue: Meta-Net)
- SEARCHABLE
Search in Sign Language Corpora is restricted to metadata ONLY
- Storage medium
- (Detailed) genre of narration
- (Detailed) topic
- Signer personal data
- Date of capturing
- Other corpus-external info
=> Information is restricted to:
- Elements peripheral to the linguistic content
- No cues about the use of the language
- No statistics possible as to frequency of (co)occurrences in signing utterances
- No concordances
- No morpho-phonological variations
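To make the contrast concrete, the following is a minimal sketch of the content-level queries that become possible once a corpus carries a searchable annotation tier. It assumes a hypothetical gloss tier (one list of sign glosses per utterance); the glosses and function names are illustrative only and do not come from any particular corpus or tool.

```python
from collections import Counter

# Hypothetical gloss tier: one list of sign glosses per annotated utterance.
utterances = [
    ["INDEX-1", "HOUSE", "BUY"],
    ["HOUSE", "BIG", "INDEX-3"],
    ["INDEX-1", "HOUSE", "SELL"],
]

def gloss_frequencies(utts):
    """Count how often each gloss occurs across all utterances."""
    return Counter(g for u in utts for g in u)

def cooccurrences(utts):
    """Count adjacent gloss pairs (bigrams) within each utterance."""
    return Counter(pair for u in utts for pair in zip(u, u[1:]))

def concordance(utts, target, width=1):
    """Return (left context, target, right context) for every occurrence."""
    hits = []
    for u in utts:
        for i, g in enumerate(u):
            if g == target:
                left = u[max(0, i - width):i]
                right = u[i + 1:i + 1 + width]
                hits.append((left, g, right))
    return hits
```

None of these queries can be answered from metadata such as genre, topic or signer data alone; they need access to the linguistic content itself.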
Oral language corpora in NLP domain
- Statistical (linguistic knowledge blind) processing
- Tag labeling
- Linguistic knowledge rich processing
- Hybrid approaches
What about SL corpora in NLP domain?
Sign language technologies open the way to true exploitation of SL corpora
- Manual annotation supported by automatic annotation tools
- Automatic segmentation of sign/phrase boundaries
- Labeling
- Tagging
based on Sign Recognition technologies
- Closer-to-natural Sign Synthesis
- More accurate formal representation of SLs
because of the increase in the volume of available data
via
- Search directly in the content of SL video
- Retrieval of linguistic information
Tag/label based chunking exploiting e.g.
- Detention (D) detector
- Posture (P) detector
- Transition (T) trajectory shape detector
- Transition ballistic trajectory detector
- Limb orientation detector
- Detection of elbow locations
- Detection of hand locations
- Contact detector
- Symmetry detector
- Head Pose orientation detector
- Eyebrow detector
- Shoulder detector
- Segmentation
- Signer recognition
- Movement features
- Face Tracker
- Signing Space calibration
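As a rough illustration of label-based chunking, the sketch below groups per-frame labels into time-stamped chunks. It assumes, purely for illustration, that the posture, detention and transition detectors have already been combined into one label per video frame (P, D or T) and that the frame rate is known; the function name and the sample labels are hypothetical.

```python
from itertools import groupby

def chunk_frames(labels, fps=25.0):
    """Group runs of identical frame labels into (label, start_s, end_s) chunks."""
    chunks = []
    frame = 0
    for label, run in groupby(labels):
        length = len(list(run))
        chunks.append((label, frame / fps, (frame + length) / fps))
        frame += length
    return chunks

# Hypothetical detector output: P = posture, D = detention, T = transition.
frame_labels = list("PPPTTTTDDDDDTTPP")
for label, start, end in chunk_frames(frame_labels):
    print(f"{label}: {start:.2f}s - {end:.2f}s")
```

Chunks of this kind could then serve as candidate segments for manual annotation or for labeling and tagging; a real system would also have to handle noisy and conflicting detector outputs, which this sketch ignores.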
Iterative algorithm for corpus creation valid also in the case of SL corpora
[Flowchart. Nodes: Initial set of sentences; Coverage calculation; Target diphone coverage; Coverage attained? (NO / YES); Add new sentences; Expanded Corpus; Manual Fine-Tuning; Complete Corpus]
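One plausible reading of the flowchart labels above is the loop sketched below: start from an initial set of sentences, calculate coverage against a target inventory, and keep adding sentences until the target coverage is attained. The greedy choice of the next sentence, the `units` extractor (adjacent symbol pairs standing in for diphones) and all names are assumptions made for illustration; the slides do not specify them. `target_units` is assumed to be non-empty.

```python
def units(sentence):
    """Hypothetical unit extractor: adjacent symbol pairs stand in for diphones."""
    symbols = sentence.split()
    return set(zip(symbols, symbols[1:]))

def build_corpus(initial, candidates, target_units, target_coverage=1.0):
    """Grow a corpus until it covers the requested share of the target units."""
    corpus = list(initial)
    pool = list(candidates)
    # Coverage calculation for the initial set of sentences.
    covered = set().union(*(units(s) for s in corpus)) & target_units
    while len(covered) / len(target_units) < target_coverage and pool:
        # Assumed strategy: add the sentence contributing most missing units.
        best = max(pool, key=lambda s: len((units(s) & target_units) - covered))
        gain = (units(best) & target_units) - covered
        if not gain:
            break  # no candidate helps any more; leave the rest to manual work
        corpus.append(best)
        pool.remove(best)
        covered |= gain
    return corpus, len(covered) / len(target_units)
```

For a sign language corpus the target inventory would not be diphones but whatever units the design purposes call for, for example lexical items or grammar phenomena; the loop itself stays the same.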
Thank you for your attention!