![Page 1: 1 Disentangling from Babylonian Confusion – Unsupervised Language Identification Chris Biemann, Sven Teresniak University of Leipzig, Germany Cicling-05,](https://reader035.vdocuments.site/reader035/viewer/2022062502/570491c41a28ab14218da34c/html5/thumbnails/1.jpg)
Disentangling from Babylonian Confusion – Unsupervised Language Identification
Chris Biemann, Sven Teresniak
University of Leipzig, Germany
CICLing-05, Mexico City
February 18, 2005
Outline
1. Review: Supervised Language Identification
2. Co-occurrence graphs
• Co-occurrences
• Visualizing co-occurrences
3. Chinese Whispers Algorithm
• Finding words of the same language
4. Sorting text by language
• Evaluation and limitations
Review: Supervised Language Identification
• Needs training
• Operates on letter n-grams or common words as features
• Works almost error-free for texts of 500 letters or more
Drawbacks:
• Does not work for previously unknown languages
• Danger of misclassifying instead of reporting "unknown"
Example: http://odur.let.rug.nl/~vannoord/TextCat/Demo
• "xx xxx x xxx …" classified as Nepali
• "öö ö öö ööö …" classified as Persian
Co-occurrence Statistics
• Co-occurrence: occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors, window...)
• Significant co-occurrences reflect relations between words.
• Significance measure (log-likelihood): $\mathrm{sig}(A,B) = x - k \log x + \log k!$, with $n$ the number of sentences and $x = \frac{ab}{n}$.
• In the following, sentence-based co-occurrence statistics are used.
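The significance measure can be evaluated directly. The sketch below is only an illustration: the slide does not define a, b and k, so it assumes that a and b are the numbers of sentences containing A and B, and that k is the number of sentences containing both. The function name `significance` is hypothetical.

```python
import math


def significance(a: int, b: int, k: int, n: int) -> float:
    """sig(A,B) = x - k*log(x) + log(k!), with x = a*b/n.

    Assumed meaning of the symbols (not stated on the slide):
    a, b = number of sentences containing A and B respectively,
    k    = number of sentences containing both,
    n    = total number of sentences.
    """
    x = a * b / n
    # math.lgamma(k + 1) equals log(k!) and stays finite for large k.
    return x - k * math.log(x) + math.lgamma(k + 1)


# Two words with 10 sentences each that share 5 of 100 sentences.
example = significance(a=10, b=10, k=5, n=100)
print(round(example, 3))  # 5.787
```

Using `math.lgamma` instead of `math.factorial` keeps the factorial term in log space, which matters once k reaches the hundreds.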
Co-occurrence Graphs
• The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with
  V: Vertices = Words
  E: Edges (v1, v2, s) with v1, v2 words, s significance value.
• The co-occurrence graph is
  - weighted
  - undirected
• Small-world property
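A graph of this kind can be held as a nested dictionary. The following is a minimal sketch under stated assumptions: sentences are already tokenized, the weight s uses the sentence-based significance with a, b and k read as sentence counts, and the cut-off `min_sig` is a hypothetical parameter that does not come from the slides.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_graph(sentences, min_sig=1.0):
    """Build {word: {neighbour: significance}} from tokenized sentences.

    min_sig is an illustrative threshold, not a value from the talk.
    """
    n = len(sentences)
    word_freq = Counter()  # number of sentences containing each word
    pair_freq = Counter()  # number of sentences containing both words
    for tokens in sentences:
        types = sorted(set(tokens))  # count each word once per sentence
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))

    graph = defaultdict(dict)
    for (v1, v2), k in pair_freq.items():
        x = word_freq[v1] * word_freq[v2] / n
        s = x - k * math.log(x) + math.lgamma(k + 1)
        if s >= min_sig:
            # Undirected: store the weighted edge in both directions.
            graph[v1][v2] = s
            graph[v2][v1] = s
    return dict(graph)


sentences = [["the", "cat"], ["the", "dog"], ["le", "chat"], ["le", "chien"]]
graph = cooccurrence_graph(sentences)
print(sorted(graph["the"]))  # ['cat', 'dog']
```

Even in this toy corpus the English and French words end up in separate connected regions, which is the property the clustering step relies on.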
Chinese Whispers - Motivation
• (Small-world) graphs consist of regions with a high clustering coefficient and hubs that connect those regions
• The nodes in a cluster region should all be assigned the same label
• Every node gets a label and whispers it to its neighbouring nodes. A node changes to a label if most of its neighbours whisper this label, or it invents a new one
• Assuming that strongly connected words are semantically close, meaningful clusters should emerge
Chinese Whispers Algorithm
Assign different labels to every node in the graph;
For iteration i from 1 to total_iterations {
  mutation_rate = 1/(i^2);
  For each word w in the graph {
    new_label of w = highest ranked label in neighbourhood of w;
    with probability mutation_rate: new_label of w = new class label;
  }
  labels = new_labels;
}
• Graph clustering algorithm
• Linear time in the number of nodes
• Random mutation can be omitted but showed better results for small graphs
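The pseudocode above can be turned into a small runnable sketch. This is an illustration and not the authors' implementation. It assumes a graph stored as `{node: {neighbour: weight}}` with symmetric entries, it reads "highest ranked" as the largest sum of edge weights, and it breaks ties in favour of the smaller label. The `mutate` and `seed` parameters are hypothetical additions. Because all labels are replaced at once after each pass, as in the pseudocode, some graphs (a single isolated edge, for instance) can keep swapping labels instead of settling.

```python
import random
from collections import defaultdict


def chinese_whispers(graph, total_iterations=20, mutate=True, seed=0):
    """Label the nodes of an undirected weighted graph; returns {node: label}."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    # Assign a different label to every node.
    labels = {node: i for i, node in enumerate(nodes)}
    next_label = len(labels)

    for i in range(1, total_iterations + 1):
        mutation_rate = 1 / (i ** 2)
        new_labels = {}
        for w in nodes:
            # Sum the edge weights per label in the neighbourhood of w.
            sums = defaultdict(float)
            for neighbour, weight in graph[w].items():
                sums[labels[neighbour]] += weight
            if sums:
                # Largest sum wins; ties go to the smaller label (assumption).
                new_labels[w] = min(sums, key=lambda lab: (-sums[lab], lab))
            else:
                new_labels[w] = labels[w]  # isolated node keeps its label
            if mutate and rng.random() < mutation_rate:
                new_labels[w] = next_label  # invent a new class label
                next_label += 1
        labels = new_labels  # all nodes switch at the same time
    return labels


# Two triangles joined by one weak edge (c-d).
toy = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
result = chinese_whispers(toy, total_iterations=5, mutate=False)
print(result)
```

With mutation switched off the toy run is deterministic: each triangle ends up with one shared label and the weak edge does not merge them.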
Assigning New Labels
• Node A changes label from L1 to L3: Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
• Other strategies result in different kinds of partitioning
  - threshold for share
  - weighting by node degrees
[Figure: node A (L1 -> L3) and its neighbours B (L4), C (L3), D (L2), E (L3), with edge weights 5, 8, 6 and 3]
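The label sums for node A can be recomputed in a few lines. This is a sketch only: the extracted figure does not say which of the two L3 neighbours carries weight 6 and which carries weight 3, so that assignment is an assumption (it does not affect the sums). The helper name `label_sums` is hypothetical.

```python
from collections import defaultdict

# (label, edge weight) of A's four neighbours; which L3 neighbour has
# weight 6 and which has weight 3 is an assumption.
neighbours_of_a = [("L4", 8), ("L3", 3), ("L2", 5), ("L3", 6)]


def label_sums(neighbours):
    """Sum the edge weights per label in a node's neighbourhood."""
    sums = defaultdict(int)
    for label, weight in neighbours:
        sums[label] += weight
    return dict(sums)


sums = label_sums(neighbours_of_a)
new_label = max(sums, key=sums.get)
print(sums)       # {'L4': 8, 'L3': 9, 'L2': 5}
print(new_label)  # L3
```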
Chinese Whispers on 7 Languages
Assigning languages to sentences
• Use a word-based language identification tool
• The largest clusters form word lists for the different languages
• A sentence is assigned a cluster label if
  - it contains at least 2 words from the cluster and
  - no other cluster contributes more words
Questions for Evaluation:
• Up to what number of languages is this possible?
• How strongly can the corpus be biased?
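The assignment rule for sentences can be sketched as follows. The function name, the cluster ids `c1` and `c2`, and the tiny word lists are hypothetical. The slide does not say what happens when two clusters contribute equally many words; this sketch leaves such sentences unclassified, which is an assumption.

```python
from collections import Counter


def assign_cluster(tokens, clusters, min_hits=2):
    """Return the id of the cluster that wins the sentence, or None."""
    hits = Counter()
    for token in tokens:
        for cluster_id, words in clusters.items():
            if token.lower() in words:
                hits[cluster_id] += 1
    ranked = hits.most_common(2)
    if not ranked or ranked[0][1] < min_hits:
        return None  # fewer than 2 words from any cluster
    if len(ranked) == 2 and ranked[1][1] == ranked[0][1]:
        return None  # tie between clusters (assumed to stay unclassified)
    return ranked[0][0]


# Illustrative word lists; real lists come from the largest clusters.
clusters = {
    "c1": {"all", "you", "need", "is", "love"},
    "c2": {"die", "mit", "der", "und"},
}
print(assign_cluster("Die Beatles mit All you need is love".split(), clusters))  # c1
print(assign_cluster(["Inter", "Mailand", "Ronaldo"], clusters))  # None
```

The first call shows how a mixed-language sentence can follow the quoted English words rather than the German frame around them.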
Evaluation: Mix of 7 languages
• Languages used: Dutch, Estonian, English, French, German, Icelandic and Italian
• At least 100 sentences per language are necessary for consistent clusters
[Figure: Precision, Recall and F-value for 7-lingual corpora; x-axis: # of sentences per language (100 to 100000), y-axis: P/R/F (0.96 to 1)]
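For reference, precision, recall and F-value per language can be computed as in the sketch below. It uses the standard definitions and assumes that cluster labels have already been mapped to language names and that unclassified sentences are marked with `None`. The slides do not state how the reported figures were computed, so this is not necessarily the authors' evaluation procedure.

```python
def precision_recall_f(gold, predicted, language):
    """Standard precision, recall and F-value for one language."""
    assigned = sum(1 for p in predicted if p == language)
    relevant = sum(1 for g in gold if g == language)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p == language)
    precision = correct / assigned if assigned else 0.0
    recall = correct / relevant if relevant else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Toy data: one English sentence stays unclassified (None),
# one German sentence is wrongly assigned to English.
gold = ["en", "en", "en", "de", "de"]
predicted = ["en", "en", None, "en", "de"]
print(precision_recall_f(gold, predicted, "en"))
```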
Common mistakes
• Unclassified:
  - mostly enumerations of sports teams
  - very short sentences, e.g. headlines
  - legal act ciphers in the Estonian case, e.g. 10.12.96 jõust.01.01.97 - RT I 1996 , 89 , 1590
• Misclassified: mixed-language sentences, like (assigned label first)
  French: Frönsku orðin "cinéma vérité" þýða "kvikmyndasannleikur"
  English: Die Beatles mit "All you need is love".
Evaluation: Bilingual biased
• Language pairs used: English-Estonian, French-Italian, Dutch-German
• 1st language varied between 100 and 10,000 sentences, 2nd language 100,000 sentences
• A factor of up to 200 does not result in deterioration
• Above factor 200, the 1st-language cluster is not distinguishable in size from 2nd-language 'noise'
[Figure: English noise in a 100K sentence Estonian corpus; x-axis: # English sentences (100 to 10000), y-axis: Precision/Recall (0.995 to 1); series: P Estonian, P English, R English]
[Figure: French noise in a 100K sentence Italian corpus; x-axis: # French sentences (100 to 1000), y-axis: P/R (0.925 to 1); series: P Italian, R Italian, P French, R French, P total, R total]
Conclusion
• Unsupervised Language Identification is possible
• It cannot name the languages; it only sorts text by language
• It works for previously undescribed languages, even for dialects
• Accuracy on sentences (here about 120 characters) is comparable to supervised approaches
• When classifying documents, there should be virtually no errors
• Time-linear graph-clustering algorithm
Questions?
THANK YOU!
Small Cooccurring Worlds
Assumed structure of co-occurrence graphs: scale-free small worlds
• short path length between nodes
• high clustering coefficient
• power-law distribution of node degrees
• power-law distribution of component sizes
Node degree: number of (outgoing) edges
Component: connected set of nodes
Power-law distributions are easy to plot.
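Both quantities defined above are easy to collect before plotting. The following sketch assumes the graph is stored as `{node: {neighbour: weight}}` with symmetric entries; the function names are hypothetical. It only gathers the counts and leaves the plot itself to the reader.

```python
from collections import Counter, deque


def degree_distribution(graph):
    """Map node degree -> number of nodes with that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def component_sizes(graph):
    """Sizes of the connected components, largest first (breadth-first search)."""
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


small = {
    "a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1},
    "d": {"e": 1}, "e": {"d": 1},
    "f": {},
}
print(degree_distribution(small))
print(component_sizes(small))  # [3, 2, 1]
```

Plotting the counts on log-log axes is the usual check: a power law appears as an approximately straight line.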
Strategies for Adopting a Colour
A node changes its colour to a new colour from its neighbourhood if that colour
(1) occurs with the highest significance sum (top)
(2) occurs with the highest significance sum weighted by node degree (a - linear, b - logarithmic) (dist)
(3) occurs with the highest significance sum and its share lies above a certain threshold (vote <x>)
[Figure: node A (L1) and its neighbours B (L4), C (L3), D (L2), E (L3), with edge weights 5, 8, 6, 3 and node degrees deg=1, deg=2, deg=3, deg=5, deg=4]
Example: colouring A
(1): Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
(2a): wSum(L2)=5; wSum(L4)=4; wSum(L3)=2.2
(2b): wSum(L4)=7.28; wSum(L3)=5.51; wSum(L2)=3.46
(3): nSum(L3)=0.409; nSum(L4)=0.363; nSum(L2)=0.227
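Strategies (2a) and (3) can be recomputed from the example values. This is a sketch under assumptions: the pairing of edge weights with node degrees is inferred from the arithmetic of the example (weight 5 with degree 1, 8 with 2, 6 with 5, 3 with 3) and is not stated explicitly on the slide. The function names are hypothetical. The logarithmic variant (2b) is left out because its exact formula is not given.

```python
from collections import defaultdict

# (label, edge weight, node degree) of A's neighbours; the weight/degree
# pairing is inferred from the example values and is an assumption.
neighbours_of_a = [
    ("L2", 5, 1),
    ("L4", 8, 2),
    ("L3", 6, 5),
    ("L3", 3, 3),
]


def dist_linear(neighbours):
    """Strategy (2a): weight sums with each edge divided by the neighbour's degree."""
    sums = defaultdict(float)
    for label, weight, degree in neighbours:
        sums[label] += weight / degree
    return dict(sums)


def vote(neighbours, threshold):
    """Strategy (3): strongest label, but only if its share exceeds threshold."""
    sums = defaultdict(float)
    for label, weight, _degree in neighbours:
        sums[label] += weight
    if not sums:
        return None
    total = sum(sums.values())
    best = max(sums, key=sums.get)
    return best if sums[best] / total > threshold else None


weighted = dist_linear(neighbours_of_a)
print(max(weighted, key=weighted.get))  # L2
print(vote(neighbours_of_a, 0.4))       # L3
print(vote(neighbours_of_a, 0.5))       # None
```

The example illustrates why the strategies partition differently: dividing by degree favours the low-degree neighbour D (L2), while the share threshold keeps A at its old colour unless one label clearly dominates.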
7 Clusters – 7 languages
68701:(3792) [...] a-t-elle, a-t-il, a-t-on, aanval, abandonné, abattu, abattus, aborder, abords, abouti, absolu, absolue, acceptent, accepter, accepté, accessibles, accession, accord, accords, accordé, accordée, accusation, accusations, accuse, accusé, accusée, acheter, achevé, achevée, acte, actes, actifs, action, actionnaires, actions, actions-suicides, activement, activiste, activistes, activités, actuelle, actuellement, adeptes, adjoint, admettre, administratif, admis, [...]
80266:(3616) [...] a, abandoned, able, ablösen, aboard, abortion, abortions, about, above, abroad, absence, absolute, absolutely, accepted, accessible, accident, accidents, acclaim, according, accounting, accused, accusing, acid, acidic, acknowledged, acquire, acquistare, acre, acres, across, act, acting, active, activist, activists, acts, actually, added, addicts, adding, additional, address, administration, administrator, admitted, adopt, adopted, adoption, adults, advance,
68952:(3312) [...] abbandonato, abbastanza, abbia, abbiamo, abbiano, abile, abitante, abitanti, abitazioni, abruzzesi, accade, accaduto, accenno, accertare, accertato, accesso, accoglienza, accolto, accordi, accordo, accorta, accorti, accusa, acquisito, ad, addetti, addirittura, addosso, adesso, adottata, aereo, affari, affermato, affetto, affidare, affidato, affiliati, affiliato, affonda, affrontare, affronteranno, agenti, agenzie, agevolare, aggiunge, aggiungere, aggiunto, agli [...]
75760:(3249) [...] af, afar, afgreiðslutíma, afl, afla, aflaheimilda, aflaheimildum, aflann, aflaverðmætið, afli, aflinn, aflýst, afnema, afnotagjalda, afnotagjaldið, afnotagjöld, afnotagjöldin, afnotagjöldum, afrek, afráðið, afstýra, aftur, afurðaverðs, afurðum, aka, al-Qaeda, al-Zawahri, ala, aldar, aldrei, aldri, aldur, allan, allar, allir, allra, allri, alls, allt, alltaf, alltof, allur, almannafjölmiðla, almannamiðla, almenna, almennt, altari, alveg, andvirði, annan, annar, annarra [...]
81089:(2894) [...] an, aandacht, aandachtsgebied, aangehouden, aangekeken, aangenomen, aangepakt, aangesloten, aangevuld, aangewezen, aangezien, aanleiding, aanmerking, aanpak, aanslag, aansluiting, aantal, aantrekkelijk, aanvankelijk, aanvragen, aanwezige, aanwezigheid, aanwijsbaar, aanwijzingen, aanzien, aarde, aardige, abortuspil, abortuswetgeving, acceptabel, achtduizend, achter, achtergrond, achterhalen, achterover, actie, actieve, actuele [...]
68872:(2791) [...] ab, abend, aber, abermals, abgebaut, abgelaufenen, abgeschlossen, abgeschlossenen, abschneiden, abzuwarten, acht, achtzehn, achtziger, afghanischen, akzeptieren, allein, allem, allen, allenfalls, aller, allerdings, allgemein, allgemeinen, alt, alte, alten, alter, am, amerikanische, amerikanischen, amerikanischer, anderen, anderer, anerkennen, angedroht, angegeben, angehende, angekündigt, angenommen, angesichts, angestellt, angetastet [...]
72602:(2247) [...] aadress, aadressi, aadressil, aadressina, aasta, aastaaruande, aastabilansi, aastabilanss, aastaks, aastal, aastas, aastat, aastate, aeg, aegumistähtaeg, aga, agressiooni, ainuaktsionäri, ainult, ainuõigus, ajaks, ajal, ajast, ajutiselt, akt, aktid, aktide, aktsia, aktsiad, aktsiaid, aktsiakapital, aktsiakapitali, aktsiakapitalist, aktsiaraamatusse, aktsiaselts, aktsiaseltsi, aktsiaseltsiga, aktsiaseltsil, aktsiaseltsile, aktsiat, aktsiate, aktsiatega [...]
60154:(195) [...] afferma, créée, dimostrano, dom, domicilio, dovranno, escl, esclusi, escluso, feriale, festivi, festivo, fóru, gleðilegs, gratuita, gravidanza, incarico, intero, inv, io, jr, jäetud, jõust, liðna, lõige, lõiked, nere, næstir, pagamento, punktides, sab, saper, scenario, servono, sindacale, sjálfsm, socc, soccorso, spiegato, spilurum, techniques, ventanni, warnte, zuständige [...]
[...]
84023:(3) Inter, Mailand, Ronaldo