NLP for Low-Resource or Endangered Languages and Cross-Lingual Transfer
Antonios Anastasopoulos
MTMA, May 31, 2019
There are about 7,000 languages in the world.
According to UNESCO, 43% of the world's languages are endangered or vulnerable.
[Map of language vitality from The Endangered Languages Project, shown progressively: at risk, endangered, critically endangered, dormant]
Language Documentation
1. Collect (record) data
2. Transcribe and translate
3. Perform analysis
4. Elicit further paradigms
5. Prepare a grammar
Steven Bird
Making an Audio Rosetta Stone
My work
Develop methods that will automate and speed up the language documentation process:
Alignment (segmentation)
Transcription
Translation
Analysis
Leveraging Translations for Speech Transcription in Low-Resource Settings
Antonios Anastasopoulos and David Chiang. Interspeech 2018
Speech Transcription using Translations
multi-source models: encoder-decoder model
[Diagram: input x1…xN → encoder → hidden states h1…hN → attention → contexts c1…cM → decoder states s1…sM → softmax → P(y1…yM); the slide shows one such model per input source]
multi-source models: results
[Bar chart: character error rate (0–80) on Ainu (2k), Mboshi (5k), and Spanish (17k), comparing the speech translation baseline against ensemble, multisource, multisource+tied, and multisource+shared systems]
multi-source models: coupled ensemble [Dutt et al. 2017]
[Diagram: two parallel encoder–attention–decoder branches, one per input source, feeding a single shared softmax that produces P(y1…yM)]
multi-source models: results
[Bar chart: character error rate on Ainu (2k), Mboshi (5k), and Spanish (17k), adding the coupled-ensemble results to the baseline comparison]
multi-source models: multisource
[Diagram: two encoders, one per input source; a separate attention over each encoder's states feeds a single shared decoder and softmax producing P(y1…yM)]
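Schematically, the multisource step computes one attention context per encoder and combines them for the shared decoder. A rough numpy sketch (not the paper's actual implementation; dimensions, names, and the concatenation choice are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(query, states):
    """Dot-product attention: weight encoder states by similarity to the query."""
    scores = states @ query          # one score per source position, shape (N,)
    weights = softmax(scores)        # attention distribution over positions
    return weights @ states          # weighted sum -> context vector, shape (d,)

def multisource_context(query, speech_states, text_states):
    """One attention per source; the two contexts are combined (here by
    concatenation) before being fed to the shared decoder."""
    c1 = attention_context(query, speech_states)
    c2 = attention_context(query, text_states)
    return np.concatenate([c1, c2])

rng = np.random.default_rng(0)
query = rng.normal(size=4)                             # decoder state (toy dim)
context = multisource_context(query,
                              rng.normal(size=(7, 4)),  # 7 speech frames
                              rng.normal(size=(5, 4)))  # 5 translation tokens
print(context.shape)  # (8,)
```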
multi-source models: results
[Bar chart: character error rate on Ainu (2k), Mboshi (5k), and Spanish (17k), adding the multisource results to the comparison]
multi-source models: attention parameter sharing
[Diagram: the same two-encoder multisource architecture; "Standard" keeps separate attention parameters per source, while "Shared" ties the attention parameters across the two sources]
multi-source models: results
[Bar chart: character error rate on Ainu (2k), Mboshi (5k), and Spanish (17k), comparing speech translation, ensemble, multisource, and multisource+shared systems]
Tied Multitask Models for Speech Transcription and Translation
Antonios Anastasopoulos and David Chiang. NAACL 2018.
Speech Transcription and Translation
multi-task models: pivot
[Diagram: two encoder-decoder models in cascade: the first maps speech to a transcription, which the second then translates]
multi-task models: end-to-end
[Diagram: a single attentional encoder-decoder mapping speech x1…xN directly to target-language text, with no intermediate transcription]
An Attentional Model for Speech Translation without Transcription. Long Duong, Antonios Anastasopoulos, Trevor Cohn, Steven Bird, and David Chiang. NAACL 2016.
multi-task models: simple multitask
[Diagram: one shared speech encoder; two separate attention + decoder + softmax branches, one producing the transcription P(y¹1…y¹M1) and the other the translation P(y²1…y²M2)]
multi-task models: results
[Bar charts: translation character BLEU for English (from Ainu), French (from Mboshi), and English (from Spanish), and transcription character error rate for Ainu, Mboshi, and Spanish, comparing single-task, multitask, and cascade systems]
multi-task models: triangle
[Diagram: a shared speech encoder; a first decoder produces the transcription, and a second decoder attends over both the encoder states and the first decoder's states to produce the translation]
multi-task models: results
[Bar charts: translation character BLEU and transcription character error rate, adding the triangle model's results to the comparison]
multi-task models: transitivity regularizer
[Diagram: the same triangle architecture as above]
R_trans = −λ_trans ‖A¹² A¹ − A²‖²₂
If A attends over B…
and B attends over C…
this should be similar to A attending directly over C.
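A minimal numpy sketch of such a penalty: disagreement between the composed attention (translation over transcription, composed with transcription over speech) and the direct translation-over-speech attention. The `lam` weight and matrix values are illustrative:

```python
import numpy as np

def transitivity_penalty(A12, A1, A2, lam=0.2):
    """Squared Frobenius-norm penalty between the composed attention
    A12 @ A1 and the direct attention A2; lam stands in for the
    lambda_trans regularizer weight (illustrative value)."""
    diff = A12 @ A1 - A2
    return lam * np.sum(diff ** 2)

A1 = np.array([[0.7, 0.3], [0.2, 0.8]])    # transcription-over-speech attention
A12 = np.array([[0.5, 0.5], [0.9, 0.1]])   # translation-over-transcription attention

# If the direct attention equals the composition exactly, the penalty is zero.
print(transitivity_penalty(A12, A1, A12 @ A1))  # 0.0
```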
multi-task models: results
[Bar charts: translation character BLEU and transcription character error rate, adding the triangle+trans (transitivity-regularized) results to the comparison]
We can improve translation and transcription accuracy by jointly performing the two tasks.
Translation can be further improved by using intermediate representations and transitivity.
Other (relevant and ongoing) work
Build a tool for linguists that uses ML in its back end to aid annotation (work with Graham Neubig).
Data Augmentation, Cross-Lingual Transfer,
and other nice things
Using related languages for MT
"Generalized Data Augmentation for Low-Resource Translation"
Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. ACL 2019.
Transfer for MT
Typical scenario: continued training
figure from "Transfer Learning across Low-Resource, Related Languages for NMT". Nguyen and Chiang, 2018.
Machine Translation
The current best approach is a semi-supervised one:
• Back-translation of target-side monolingual data
What if we don’t have tons of monolingual data for a language?
Does the quality of the back-translated data matter?
Generalized Back-Translation
For low-resource languages, there may exist a related high-resource one, e.g.:
1. Azerbaijani (Turkish)
2. Belarusian (Russian)
3. Galician (Portuguese)
4. Slovak (Czech)
We should use them!
Generalized Back-Translation
Typical:
only use [1] for data augmentation
OR
add [b] to [c] and train.
But HRL to LRL might be easier!
From HRL to LRL
Assuming a parallel HRL-LRL dataset is probably too much to ask.
If the languages are related enough:
1. Get monolingual embeddings
2. Align the embedding space [Lample et al, 2018]
3. Learn a dictionary
4. Word substitution in HRL to create pseudo-LRL
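Step 4 can be sketched as a plain dictionary substitution over the HRL side; the dictionary entries and words below are made up for illustration (in practice the dictionary is induced from the aligned embedding spaces):

```python
def make_pseudo_lrl(hrl_sentence, induced_dict):
    """Substitute each HRL word with its induced LRL translation where one
    exists, leaving out-of-dictionary words unchanged: a crude pseudo-LRL
    sentence usable for data augmentation."""
    return " ".join(induced_dict.get(w, w) for w in hrl_sentence.lower().split())

# toy induced dictionary (purely illustrative token pairs)
toy_dict = {"casa": "kasa", "grande": "qrande"}
print(make_pseudo_lrl("casa grande nueva", toy_dict))  # kasa qrande nueva
```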
From ENG to LRL through HRL
An ENG-to-LRL system would be bad (duh!)
An ENG-to-HRL system would be better…
…and HRL-to-LRL might be easy-ish (because the languages are related)
Results
Takeaways
Translating from HRL to LRL:
• it is better to use word substitution than plain NMT or standard unsupervised MT (cf. lines 5,6 vs. 7,8,9)
Pivoting from ENG through HRL also brings improvements, but not as large.
The best of both worlds works best (line 12):
• more ENG data, plus LRL data of as high quality as possible
Using Related Languages for Morphological Inflection
Inflection task and SIGMORPHON
SIGMORPHON challenge: 100 language pairs (43 test languages)
Previous Work
Concatenate the tags and the lemma, feed them to a single encoder-decoder
• Issue: tags and lemma characters are inherently different (in order and function)
Half the task is identifying the stem/root and copying its characters, so other work focuses on copying:
• an explicit copy mechanism, or
• hard monotonic attention, or
• learning to output the string-transduction steps
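The first of these, an explicit copy mechanism, can be sketched as mixing two distributions: with some gate probability generate from the vocabulary, otherwise copy the source character the model attends to. All names and values below are illustrative, not any specific paper's formulation:

```python
import numpy as np

def copy_mix(p_gen, attn, src_ids, gate, vocab_size):
    """Final output distribution of a copy mechanism (sketch): mix a
    generation distribution over the vocabulary with a copy distribution
    obtained by scattering attention weights onto the source symbols."""
    p_copy = np.zeros(vocab_size)
    for weight, idx in zip(attn, src_ids):
        p_copy[idx] += weight            # attention mass -> source character id
    return gate * p_gen + (1.0 - gate) * p_copy

vocab_size = 5
p_gen = np.full(vocab_size, 0.2)         # uniform generation distribution (toy)
attn = np.array([0.6, 0.4])              # attention over a 2-character source
src_ids = [3, 1]                         # vocabulary ids of those characters
p = copy_mix(p_gen, attn, src_ids, gate=0.5, vocab_size=vocab_size)
print(round(p.sum(), 6))  # 1.0
```

Because both input distributions sum to one, the mixture is itself a valid distribution for any gate in [0, 1].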
Augmentation approach: hallucinating data
Most low-resource languages have just 50 or 100 examples.
You can hallucinate more data:
[Aligned example: b a i l a r ↔ b a i l a b a, with the shared stem characters aligned]
replacing the stem characters (shown in red on the slide) with random characters
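A minimal sketch of the hallucination step, assuming the shared stem has already been identified (in practice it would come from character alignments); the function name and alphabet are illustrative:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def hallucinate(lemma, form, stem, seed=None):
    """Replace the shared stem with a random same-length string in both the
    lemma and the inflected form, keeping the affixes (i.e. the inflection
    pattern) intact: a new, hallucinated training pair."""
    rng = random.Random(seed)
    fake_stem = "".join(rng.choice(ALPHABET) for _ in stem)
    return lemma.replace(stem, fake_stem, 1), form.replace(stem, fake_stem, 1)

# 'bailar' -> 'bailaba' with stem 'bail' (the slide's example)
new_lemma, new_form = hallucinate("bailar", "bailaba", stem="bail", seed=0)
```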
Results on transfer from single language
If languages are genetically distant, transfer does NOT help.
A shared alphabet is crucial: see Kurmanji-Sorani.
But we can do better by transferring from multiple (related) languages, e.g.:
Interpreting the model
Takeaways
1. Monolingual data hallucination can take you a long way…
2. …and it's preferable to cross-lingual transfer from distant languages
3. If the languages are close enough, both data hallucination and cross-lingual transfer should help
4. The closer the languages, the larger the improvements
Main issues:
• Data hallucination is language-agnostic; a more informed sampling could probably do better
• Different alphabets really hurt performance (Dutch-Yiddish, Kurmanji-Sorani). We need to find either an a priori mapping between the two, or map them to a common space (IPA?)
What language should you use for cross-lingual transfer?
"Choosing Transfer Languages for Cross-Lingual Learning"
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. ACL 2019.
Setting
Cross-lingual transfer on 4 tasks:
1. MT: 54x54 pairs (X-Eng TED)
2. POS-tagging: 60x26
3. Entity Linking: 53x9
4. DEP parsing: 30x30
Learning to Rank
For each language pair, extract features:
1. dataset dependent:
• dataset size
• type-token ratio
• word/subword overlap
2. dataset-independent:
• typological features (from URIEL)
[plug: check out the lang2vec python library, now with pre-computed distances!]
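The dataset-dependent features above can be sketched as simple corpus statistics. This is an illustrative variant; the exact feature definitions in the paper and in langrank may differ:

```python
def ranking_features(task_corpus, transfer_corpus):
    """Dataset-dependent features for scoring a candidate transfer language
    against a target task language (sizes, type-token ratios, word overlap)."""
    t_toks = task_corpus.split()
    x_toks = transfer_corpus.split()
    t_types, x_types = set(t_toks), set(x_toks)
    return {
        "transfer_size": len(x_toks),
        "size_ratio": len(x_toks) / max(len(t_toks), 1),
        "task_ttr": len(t_types) / max(len(t_toks), 1),      # type-token ratio
        "transfer_ttr": len(x_types) / max(len(x_toks), 1),
        # Jaccard word overlap between the two vocabularies
        "word_overlap": len(t_types & x_types) / max(len(t_types | x_types), 1),
    }

feats = ranking_features("o gato preto", "el gato negro duerme")
print(feats["word_overlap"])  # 1/6
```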
Learning to Rank
For each test language and a list of potential transfer languages (each pair represented by the features), train a model to rank the candidate languages.
Model: tree-based LambdaRank (good in limited feature/data settings)
[ Available as a python package too: https://github.com/neulab/langrank ]
Other Cool Things
Language Technology for Language Documentation and Revitalization
Hackathon-type workshop at CMU, Aug 12-16, 2019
▪ Language community members
▪ Documentary linguists
▪ Computational linguists
▪ Computer scientists and developers
Example projects:
▪ Building and using tools for rapid dictionary creation
▪ Building and using tools for development of speech recognition systems
▪ Building and using tools to analyze the syntax of a language, and extract example sentences for educational materials
▪ Creating a plugin that incorporates language technology into language documentation software such as ELAN/Praat