Disentangling from Babylonian Confusion – Unsupervised Language Identification
Chris Biemann, Sven Teresniak, University of Leipzig, Germany
Cicling-05, Mexico City, February 18, 2005


Page 1:

Disentangling from Babylonian Confusion –
Unsupervised Language Identification

Chris Biemann, Sven Teresniak
University of Leipzig, Germany

Cicling-05, Mexico City
February 18, 2005

Page 2:

Outline

1. Review: Supervised Language Identification
2. Co-occurrence graphs
• Co-occurrences
• Visualizing co-occurrences
3. Chinese Whispers Algorithm
• Finding words of the same language
4. Sorting text by language
• Evaluation and limitations

Page 3:

Review: Supervised Language Identification

• Needs training
• Operates on letter n-grams or common words as features
• Works almost error-free for texts of 500 letters or more

Drawbacks:
• Does not work for previously unknown languages
• Danger of misclassifying instead of reporting "unknown"

Example: http://odur.let.rug.nl/~vannoord/TextCat/Demo
• "xx xxx x xxx …" classified as Nepali
• "öö ö öö ööö …" classified as Persian
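The misclassification drawback can be illustrated with a toy letter-trigram classifier. This is a minimal sketch, not TextCat's actual ranking method: the overlap score, the function names and the two tiny training texts are all assumptions made for illustration.

```python
from collections import Counter

def trigram_profile(text):
    """Count letter trigrams of a lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def classify(text, profiles):
    """Return the language whose profile shares the most trigrams with text."""
    sample = trigram_profile(text)

    def overlap(profile):
        # A Counter returns 0 for trigrams it has never seen.
        return sum(min(count, profile[gram]) for gram, count in sample.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Hypothetical, far too small training texts (assumption for illustration).
profiles = {
    "en": trigram_profile("the cat and the dog are in the house"),
    "de": trigram_profile("der hund und die katze sind in dem haus"),
}

known = classify("the dog", profiles)
# No trigram matches either profile, yet a language is still returned.
unknown = classify("xx xxx x xxx", profiles)
```

Because the classifier always returns the best-scoring trained language, input that matches nothing is still assigned a language rather than reported as unknown.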

Page 4:

Co-occurrence Statistics

• Co-occurrence: occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors, window...)
• Significant co-occurrences reflect relations between words.
• Significance measure (log-likelihood):

$$\mathrm{sig}(A,B) = x - k \log x + \log k!, \quad \text{with } n \text{ the number of sentences and } x = \frac{ab}{n}$$

• In the following, sentence-based co-occurrence statistics are used.
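The significance measure can be evaluated directly. The sketch below assumes, since the slide does not spell it out, that a and b are the numbers of sentences containing A and B, and that k is the number of sentences containing both; the function name and the example counts are hypothetical.

```python
import math

def cooccurrence_significance(a, b, k, n):
    """sig(A,B) = x - k*log(x) + log(k!) with x = a*b/n."""
    x = a * b / n
    # math.lgamma(k + 1) equals log(k!) without building the factorial.
    return x - k * math.log(x) + math.lgamma(k + 1)

# Hypothetical counts: each word occurs in 100 of 10000 sentences (x = 1).
sig_often = cooccurrence_significance(100, 100, 10, 10000)   # 10 joint sentences
sig_rarely = cooccurrence_significance(100, 100, 2, 10000)   # 2 joint sentences
```

With these counts, the pair seen together in 10 sentences scores far higher than the pair seen together in only 2.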

Page 5:

Co-occurrence Graphs

• The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with
  V: vertices = words
  E: edges (v1, v2, s) with v1, v2 words and s the significance value
• The co-occurrence graph is
  - weighted
  - undirected
• It has the small-world property
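A weighted, undirected co-occurrence graph of this kind could be built from tokenized sentences roughly as follows. This is a sketch under assumptions: the adjacency-dict representation, the `min_sig` threshold and the requirement that a pair co-occurs more often than x = ab/n are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences, min_sig=1.0):
    """Return {word: {neighbour: significance}} for sentence co-occurrences."""
    n = len(sentences)
    word_freq = Counter()   # number of sentences containing each word
    pair_freq = Counter()   # number of sentences containing both words
    for sentence in sentences:
        words = sorted(set(sentence))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))

    graph = {}
    for (w1, w2), k in pair_freq.items():
        x = word_freq[w1] * word_freq[w2] / n
        sig = x - k * math.log(x) + math.lgamma(k + 1)
        # Assumption: keep only pairs seen more often than x and above min_sig.
        if k > x and sig >= min_sig:
            graph.setdefault(w1, {})[w2] = sig   # undirected: store both ways
            graph.setdefault(w2, {})[w1] = sig
    return graph

# Hypothetical two-language toy corpus.
example_sentences = [["the", "cat"], ["the", "dog"],
                     ["der", "hund"], ["der", "mann"]]
example_graph = cooccurrence_graph(example_sentences)
```

In the toy corpus no sentence mixes languages, so the resulting graph has no edge between an English and a German word.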

Page 6:

Chinese Whispers - Motivation

• (Small-world) graphs consist of regions with a high clustering coefficient and hubs that connect those regions
• The nodes in a cluster region should all be assigned the same label
• Every node gets a label and whispers it to its neighbouring nodes. A node changes to a label if most of its neighbours whisper this label, or it invents a new one
• Assuming that strongly connected words are semantically close, meaningful clusters should emerge

Page 7:

Chinese Whispers Algorithm

Assign different labels to every node in the graph;
For iteration i from 1 to total_iterations {
    mutation_rate = 1/(i^2);
    For each word w in the graph {
        new_label of w = highest ranked label in neighbourhood of w;
        with probability mutation_rate: new_label of w = new class label;
    }
    labels = new_labels;
}

• Graph clustering algorithm
• Linear time in the number of nodes
• Random mutation can be omitted but showed better results for small graphs
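The pseudocode above can be sketched in Python as follows. Several details are assumptions rather than part of the slides: the graph is a symmetric adjacency dict `{word: {neighbour: weight}}`, "highest ranked" is read as the largest weight sum, ties go to the larger label id, and the `mutate` and `seed` parameters are hypothetical conveniences.

```python
import random

def chinese_whispers(graph, total_iterations=20, mutate=True, seed=0):
    """Cluster a weighted graph; returns {node: label}."""
    rng = random.Random(seed)
    labels = {node: i for i, node in enumerate(graph)}  # all labels differ
    next_label = len(labels)
    for i in range(1, total_iterations + 1):
        mutation_rate = 1.0 / (i ** 2)
        new_labels = {}
        for node, neighbours in graph.items():
            scores = {}
            for neighbour, weight in neighbours.items():
                label = labels[neighbour]
                scores[label] = scores.get(label, 0.0) + weight
            if scores:
                # Highest weight sum wins; ties go to the larger label id.
                new_labels[node] = max(scores.items(),
                                       key=lambda kv: (kv[1], kv[0]))[0]
            else:
                new_labels[node] = labels[node]  # isolated node keeps its label
            if mutate and rng.random() < mutation_rate:
                new_labels[node] = next_label    # invent a new class label
                next_label += 1
        labels = new_labels  # synchronous update, as in the pseudocode
    return labels

# Hypothetical toy graph: two triangles joined by one weak edge.
toy_graph = {
    "the": {"of": 1.0, "and": 1.0},
    "of":  {"the": 1.0, "and": 1.0},
    "and": {"the": 1.0, "of": 1.0, "und": 0.1},
    "der": {"und": 1.0, "die": 1.0},
    "und": {"der": 1.0, "die": 1.0, "and": 0.1},
    "die": {"der": 1.0, "und": 1.0},
}
toy_labels = chinese_whispers(toy_graph, total_iterations=5, mutate=False)
```

On this toy graph, with mutation switched off, the labels stabilize after two iterations with one label per triangle. Synchronous updates can oscillate on some graphs, so a fixed iteration count is used instead of a convergence test.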

Page 8:

Assigning New Labels

• Node A changes label from L1 to L3: Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
• Other strategies result in different kinds of partitioning
  - threshold for share
  - weighting by node degrees

[Figure: node A (L1 -> L3) with neighbours B (L4), C (L3), D (L2) and E (L3); edge weights 5, 8, 6 and 3]

Page 9:

Chinese Whispers on 7 Languages

Page 10:

Chinese Whispers on 7 Languages

Page 11:

Assigning languages to sentences

• Use a word-based language identification tool
• The largest clusters form word lists for the different languages
• A sentence is assigned a cluster label if
  - it contains at least 2 words from the cluster and
  - it does not contain more words from another cluster
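The assignment rule can be sketched as a small function. The word list below is hypothetical, distinct words are counted once per sentence, and leaving ties between two clusters unassigned is an assumption; the slide does not say how ties are handled.

```python
from collections import Counter

def assign_sentence(tokens, word_to_cluster, min_hits=2):
    """Return the cluster label for a tokenized sentence, or None."""
    hits = Counter(word_to_cluster[t] for t in set(tokens) if t in word_to_cluster)
    ranked = hits.most_common(2)
    if not ranked or ranked[0][1] < min_hits:
        return None  # fewer than min_hits words from any cluster
    if len(ranked) == 2 and ranked[1][1] == ranked[0][1]:
        return None  # assumption: a tie between clusters stays unclassified
    return ranked[0][0]

# Hypothetical word lists taken from two clusters.
word_lists = {"the": "en", "is": "en", "love": "en",
              "der": "de", "ist": "de", "mit": "de"}

plain = assign_sentence(["the", "cat", "is", "here"], word_lists)
short = assign_sentence(["der", "cat"], word_lists)
mixed = assign_sentence(
    ["die", "beatles", "mit", "all", "you", "need", "is", "love"], word_lists)
```

With these word lists the mixed German and English sentence receives the English label, because it contains two listed English words and only one listed German word.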

Questions for Evaluation:
• Up to what number of languages is this possible?
• How strongly can the corpus be biased?

Page 12:

Evaluation: Mix of 7 languages

• Languages used: Dutch, Estonian, English, French, German, Icelandic and Italian
• At least 100 sentences per language are necessary for consistent clusters

[Figure: Precision, Recall and F-value for 7-lingual corpora. x-axis: # of sentences per language (100 to 100000); y-axis: P/R/F (0.96 to 1); series: Precision, Recall, F-value]

Page 13:

Common mistakes

• Unclassified:
  - mostly enumerations of sports teams
  - very short sentences, e.g. headlines
  - legal act reference codes in the Estonian case, e.g. 10.12.96 jõust.01.01.97 - RT I 1996 , 89 , 1590
• Misclassified: mixed-language sentences, such as
  - classified as French: Frönsku orðin "cinéma vérité" þýða "kvikmyndasannleikur"
  - classified as English: Die Beatles mit "All you need is love".

Page 14:

Evaluation: Bilingual biased

• Language pairs used: English-Estonian, French-Italian, Dutch-German
• 1st language varied between 100 and 10,000 sentences; 2nd language fixed at 100,000 sentences
• A bias factor of up to 200 does not result in deterioration
• Above factor 200, the 1st-language cluster is not distinguishable in size from 2nd-language "noise"

[Figure: English noise in a 100K sentence Estonian Corpus. x-axis: # English sentences (100 to 10000); y-axis: Precision/Recall (0.995 to 1); series: P Estonian, P English, R English]

[Figure: French noise in a 100K sentence Italian Corpus. x-axis: # French sentences (100 to 1000); y-axis: P/R (0.925 to 1); series: P Italian, R Italian, P French, R French, P total, R total]

Page 15:

Conclusion

• Unsupervised language identification is possible
• It does not name the languages; it only sorts text by language
• It works for previously undescribed languages, even for dialects
• Accuracy on sentences (here about 120 characters) is comparable to supervised approaches
• When classifying documents, there should be virtually no errors
• Time-linear graph-clustering algorithm

Page 16:

Questions?

THANK YOU!

Page 17:

Small Cooccurring Worlds

Assumed structure of co-occurrence graphs: scale-free small worlds
• short path length between nodes
• high clustering coefficient
• power-law distribution of node degrees
• power-law distribution of component sizes

Node degree: number of (outgoing) edges
Component: connected set of nodes

Power-law distributions are easy to plot.

Page 18:

Strategies for adopting a colour

A node changes its colour to a new colour from its neighbourhood if that colour
(1) occurs with the highest significance sum (top)
(2) occurs with the highest significance sum weighted by node degree (a - linear, b - logarithmic) (dist)
(3) occurs with the highest significance sum and its share lies above a certain threshold (vote <x>)

[Figure: node A (L1) with neighbours B (L4), C (L3), D (L2) and E (L3); edge weights 5, 8, 6 and 3; node degrees 1, 2, 3, 5 and 4]

Example: colouring A
(1): Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
(2a): wSum(L2)=5; wSum(L4)=4; wSum(L3)=2.2
(2b): wSum(L4)=7.28; wSum(L3)=5.51; wSum(L2)=3.46
(3): nSum(L3)=0.409; nSum(L4)=0.363; nSum(L2)=0.227
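Strategies (1), (2a) and (3) can be checked against the example numbers with a short sketch. The mapping of edge weights and degrees to individual neighbours is an assumption inferred from the sums (weight 8 with degree 2 for the L4 node, weight 5 with degree 1 for the L2 node, weights 6 and 3 with degrees 5 and 3 for the two L3 nodes); which L3 node carries which weight cannot be recovered from the transcript. Function and parameter names are hypothetical, and the logarithmic variant (2b) is left out because its exact weighting is not stated.

```python
from collections import defaultdict

def label_scores(neighbours, strategy="top"):
    """Score labels from (label, edge_weight, node_degree) triples."""
    sums = defaultdict(float)
    for label, weight, degree in neighbours:
        # "dist" (linear variant) divides each weight by the neighbour's degree.
        sums[label] += weight / degree if strategy == "dist" else weight
    if strategy == "vote":
        total = sum(sums.values())
        return {label: value / total for label, value in sums.items()}
    return dict(sums)

def choose_label(neighbours, strategy="top", threshold=0.0):
    """Return the winning label, or None if no label qualifies."""
    scores = label_scores(neighbours, strategy)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if strategy == "vote" and scores[best] < threshold:
        return None  # share of the strongest label is below the threshold
    return best

# Neighbours of A; the weight and degree assignment is an assumption (see above).
neighbours_of_a = [("L4", 8, 2), ("L3", 6, 5), ("L3", 3, 3), ("L2", 5, 1)]

top_scores = label_scores(neighbours_of_a, "top")
dist_scores = label_scores(neighbours_of_a, "dist")
vote_scores = label_scores(neighbours_of_a, "vote")
```

Under this assignment the sketch reproduces the sums of (1), (2a) and (3): plain sums favour L3, linear degree weighting favours L2, and L3 holds a share of 9/22 of the total weight.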

Page 19:

7 Clusters – 7 languages

68701:(3792) [...] a-t-elle, a-t-il, a-t-on, aanval, abandonné, abattu, abattus, aborder, abords, abouti, absolu, absolue, acceptent, accepter, accepté, accessibles, accession, accord, accords, accordé, accordée, accusation, accusations, accuse, accusé, accusée, acheter, achevé, achevée, acte, actes, actifs, action, actionnaires, actions, actions-suicides, activement, activiste, activistes, activités, actuelle, actuellement, adeptes, adjoint, admettre, administratif, admis, [...]

80266:(3616) [...] a, abandoned, able, ablösen, aboard, abortion, abortions, about, above, abroad, absence, absolute, absolutely, accepted, accessible, accident, accidents, acclaim, according, accounting, accused, accusing, acid, acidic, acknowledged, acquire, acquistare, acre, acres, across, act, acting, active, activist, activists, acts, actually, added, addicts, adding, additional, address, administration, administrator, admitted, adopt, adopted, adoption, adults, advance,

68952:(3312) [...] abbandonato, abbastanza, abbia, abbiamo, abbiano, abile, abitante, abitanti, abitazioni, abruzzesi, accade, accaduto, accenno, accertare, accertato, accesso, accoglienza, accolto, accordi, accordo, accorta, accorti, accusa, acquisito, ad, addetti, addirittura, addosso, adesso, adottata, aereo, affari, affermato, affetto, affidare, affidato, affiliati, affiliato, affonda, affrontare, affronteranno, agenti, agenzie, agevolare, aggiunge, aggiungere, aggiunto, agli [...]

75760:(3249) [...] af, afar, afgreiðslutíma, afl, afla, aflaheimilda, aflaheimildum, aflann, aflaverðmætið, afli, aflinn, aflýst, afnema, afnotagjalda, afnotagjaldið, afnotagjöld, afnotagjöldin, afnotagjöldum, afrek, afráðið, afstýra, aftur, afurðaverðs, afurðum, aka, al-Qaeda, al-Zawahri, ala, aldar, aldrei, aldri, aldur, allan, allar, allir, allra, allri, alls, allt, alltaf, alltof, allur, almannafjölmiðla, almannamiðla, almenna, almennt, altari, alveg, andvirði, annan, annar, annarra [...]

81089:(2894) [...] an, aandacht, aandachtsgebied, aangehouden, aangekeken, aangenomen, aangepakt, aangesloten, aangevuld, aangewezen, aangezien, aanleiding, aanmerking, aanpak, aanslag, aansluiting, aantal, aantrekkelijk, aanvankelijk, aanvragen, aanwezige, aanwezigheid, aanwijsbaar, aanwijzingen, aanzien, aarde, aardige, abortuspil, abortuswetgeving, acceptabel, achtduizend, achter, achtergrond, achterhalen, achterover, actie, actieve, actuele [...]

68872:(2791) [...] ab, abend, aber, abermals, abgebaut, abgelaufenen, abgeschlossen, abgeschlossenen, abschneiden, abzuwarten, acht, achtzehn, achtziger, afghanischen, akzeptieren, allein, allem, allen, allenfalls, aller, allerdings, allgemein, allgemeinen, alt, alte, alten, alter, am, amerikanische, amerikanischen, amerikanischer, anderen, anderer, anerkennen, angedroht, angegeben, angehende, angekündigt, angenommen, angesichts, angestellt, angetastet [...]

72602:(2247) [...] aadress, aadressi, aadressil, aadressina, aasta, aastaaruande, aastabilansi, aastabilanss, aastaks, aastal, aastas, aastat, aastate, aeg, aegumistähtaeg, aga, agressiooni, ainuaktsionäri, ainult, ainuõigus, ajaks, ajal, ajast, ajutiselt, akt, aktid, aktide, aktsia, aktsiad, aktsiaid, aktsiakapital, aktsiakapitali, aktsiakapitalist, aktsiaraamatusse, aktsiaselts, aktsiaseltsi, aktsiaseltsiga, aktsiaseltsil, aktsiaseltsile, aktsiat, aktsiate, aktsiatega [...]

60154:(195) [...] afferma, créée, dimostrano, dom, domicilio, dovranno, escl, esclusi, escluso, feriale, festivi, festivo, fóru, gleðilegs, gratuita, gravidanza, incarico, intero, inv, io, jr, jäetud, jõust, liðna, lõige, lõiked, nere, næstir, pagamento, punktides, sab, saper, scenario, servono, sindacale, sjálfsm, socc, soccorso, spiegato, spilurum, techniques, ventanni, warnte, zuständige [...]

[...]

84023:(3) Inter, Mailand, Ronaldo