dialect distances projecting dialect distances to geography...• fix noise parameter, e.g. a = 0.5...
TRANSCRIPT
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Projecting Dialect Distances to GeographyBootstrapping Clustering vs. Clustering with Noise
John Nerbonne1 Peter Kleiweg1 Franz Manni2
1Alfa-informaticaUniversity of Groningen
2Musée de l’HommeParis
Data Analysis, Machine Learning, and Applications:31st Meeting, Gesellschaft für Klassifikation,
Freiburg, 9 March 2007
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Background
Pronunciation Distance
• obtained by a modified edit distance• roughly equally to the minimal number of basic changes
needed to obtain one phonetic transcription fromanother
• results for Dutch, Norwegian, German, Bulgarian,Sardinian, Gabon Bantu, ...
• validated wrt dialect speakers’ judgements (r = 0.7)
k O r s tk œ s t @
1 1 1 3
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Introduction
Dialect/Language Groups/Families—Language varieties (sometimes) organized in groups.Even elsewhere (e.g., continua, islands), we still examinegroupings to compare new results to older scholarship,which assumes groups.
• organized hierarchically• Kaiserstuhl is South Badish, is Badish, is Alemannic, is
Southern German, is Western Germanic, ...
• therefore we apply HIERARCHICAL CLUSTERING (notk-means, etc.)
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Stability, borders, certainty
Problems• Stability
• slight input differences can change cluster resultsmassively
• Hard and soft borders• Certainty
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Two Bulgarian Datasets(r = 0.97)
Stability is a real problem!
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Bootstrapping Clustering
• assume n (linguistic) distance matrices, M1≤i≤n, e.g.one matrix/word
• choose clustering technique, e.g. WPGMA• repeat, e.g. 100 times
• select m ≤ n matrices, allowing replacement• option 1: use repeated selection as weight (Mucha &
Haimerl, GfKl 2006)• option 2: ignore repetition
• cluster sum of matrices obtaining dendrogram,recording groups
• “composite matrix” M ′ ⇐ mean cophenetic distances• collect groups that appear a majority of times into a
“composite dendrogram”• (new!) project dendrogram borders to map, reflecting
cophenetic distance in darkness
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Composite Dendrograms
BonnKöln 100
Iversheim56
AachenWinterspelt
55
Odenspiel
56
LohraWittelsberg 58
Allna100
HerbornseelbachOffdilln 100
99
DexbachNiederasphe 100
Rosenthal58
Frohnhausen
100
74
AltenbergSchraden 54
BockelwitzSchmannewitz 97
Linz60
GrünlichtenbergRoßwein 100
69
Lampertswalde
72
JonsdorfRammenau 88
Gersdorf72
65
AltlandsbergLippen 100
Groß Jamno100
Pretzsch
100
Neu Schadow
93
GerbstedtLandgrafroda 100
53
BorstendorfGornsdorf 100
Theuma96
Mockern
55
CursdorfOsterfeld
Wehrsdorf
56
BillingsbachZellingen 66
Altentrüdingen97
BempflingenIggingen 80
Schömberg100
BurgriedenOberhomberg
53
BruchHermeskeil 100
KruftSiebenbach 100
Mastershausen56
57
Hartenfels
56
BüdesheimEisenbach 73
Niedernhausen61
Vielbrunn
56
Lohrhaupten
83
EschelbronnPfaffenrot 83
Niederauerbach85
56
EnsheimMaxweiler
53
EbertshausenExdorf 100
TannWeyhers 100Helmers
100100
EichenhofenHermannsreuth 100
PeterskirchenSchachach 60
Gelting92
LangenbruckOberviehbach 59
PielenhofenTreffelstein 100
Ulbering
67
Hartenstein
60
KemmernOttowind 100
Schauenstein100
Weidenbach
71
Nürnberg
65
63
Oberau
62
Klafferstraß
70
Pöttmes
7875
MaibrunnRamsau 93
79
EinöllenUngstein 59
HorheimSeelbach 62
Endenburg−Lehnacker52
EngelsbachSchellroda 100
HönebachRinggau−Röhrda 84
Unterellen
63
Mörshausen
60
GroßwechsungenWieda 99
Groß Ballhausen86
100
Orferode
99
HöchstädtIgling 70
Wildpoldsried96
SchnepfenbachVolkershausen 71
ClausthalKleinbottwar
ObermaiselsteinOberwürzbach
83
AhrbergenWasbüttel 100Brelingen
76
AlberslohHaddorf 100
Lippramsdorf61
BrockhausenEngter 100
60
HohenkörbenWüllen 63
77
AltwarpBreddin
Klein Rossau60
GrünowVietmannsdorf 94
Falkenthal79
99
MirowSchönbeck 99
98
BenninWentorf 91
Groß MohrdorfWolgast 91
Hagen64
Kirch Kogel
69
GresenhorstHerzfeld 97
Jürgenshagen
68
Verchen
68
59
AstfeldFreden 74
GottsbürenOsterhagen 96
71
AtzendorfHundisburg 100
Götz94
JacobsdorfReetz 61
62
Ruhlsdorf
81
Benzingerode
100
JeverWangerooge 57
Barßel81
BremscheidHerdecke 60
HerrentrupReelkirchen 100
HesselteichValdorf 100
9256
DreekeHerßum 66
GroßenwieheSchwabstedt 100
Holmkjer100
Wasbek
65
HammahOiste 52
JesteburgKuhstedt 94
StöckenWarpe 100Adorf
BardenflethDiekhusen
EbstorfEversen
HohwachtHuddestorf
JeetzelOhrdorf
Osterbruch
88
LeuthWemb 100
83
100
Composite dendrograms shows groups which appear inmore than 50% of the repeated (bootstrapped) clusterings.
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Cophenetic distance
Mean cophenetic distances M ′ obtained from bootstrappingM1≤i≤n:
• Apply (classical) multi-dimensional scaling to M ′,obtaining 3-dimensional solutions
• remaining stress ≈ 10%• correlation with original M ′ very high, r = 0.9
• Interpret dimensions as red, green, blue intensities
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
AhrbergenWasbüttel 72
AlberslohHaddorf 98
GrünowVietmannsdorf 63
AltwarpBreddin
FalkenthalKlein Rossau
82
MirowSchönbeck 81
71
AtzendorfHundisburg 88
Götz52
BenzingerodeJacobsdorf
ReetzRuhlsdorf
92
BarßelJever
Wangerooge70
BenninWentorf 64
BrockhausenEngter 72
DiekhusenOsterbruch 54
DreekeHerßum 71
GresenhorstHerzfeld 69
Groß MohrdorfWolgast 66
GroßenwieheSchwabstedt 96
Holmkjer89
HerrentrupReelkirchen 100
HesselteichValdorf 88
52
HohenkörbenWüllen 59
JesteburgKuhstedt 77
StöckenWarpe 88Adorf
AstfeldBardenfleth
BrelingenBremscheid
EbstorfEversenFreden
GottsbürenHagen
HammahHerdeckeHohwacht
HuddestorfJeetzel
JürgenshagenKirch Kogel
LippramsdorfOhrdorf
OisteOsterhagen
VerchenWasbek
59
AllnaLohra
Wittelsberg99
HerbornseelbachOffdilln 86
59
BillingsbachZellingen 52
Altentrüdingen83
AltlandsbergLippen 78
Groß Jamno88
Pretzsch
87
Neu Schadow
68
BempflingenIggingen 56
Schömberg99
Burgrieden
64
BockelwitzSchmannewitz 71
GrünlichtenbergRoßwein 68
LampertswaldeLinz
51
BonnKöln 100
BorstendorfGornsdorf 88
Theuma68
BruchHermeskeil 98
BüdesheimEisenbach 59
CursdorfOsterfeld 54
DexbachNiederasphe 100Frohnhausen
54
Rosenthal
90
EbertshausenExdorf 100
TannWeyhers 100Helmers
9893
EichenhofenHermannsreuth 99
EngelsbachSchellroda 90
GroßwechsungenWieda 89
Groß Ballhausen54
HönebachRinggau−Röhrda 66
MörshausenUnterellen
79
Orferode
65
EschelbronnPfaffenrot 56
GeltingPeterskirchen
Schachach52
GerbstedtLandgrafroda 93
JonsdorfRammenau 90
Gersdorf76
HorheimSeelbach 59
Oberhomberg52
HöchstädtIgling
Wildpoldsried78
KemmernOttowind 100
Schauenstein89
KruftSiebenbach 93
PielenhofenTreffelstein 82
LangenbruckOberviehbach
Ulbering
81
LeuthWemb 92
MaibrunnRamsau 58
SchnepfenbachVolkershausen 68
AachenAltenbergClausthal
EinöllenEndenburg−Lehnacker
EnsheimHartenfels
HartensteinIversheim
KlafferstraßKleinbottwarLohrhaupten
MastershausenMaxweiler
MockernNiederauerbachNiedernhausen
NürnbergOberau
ObermaiselsteinOberwürzbach
OdenspielPöttmes
SchradenUngsteinVielbrunn
WehrsdorfWeidenbach
Winterspelt
100
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Composite Cluster Maps
• instability of clustering countermanded bybootstrapping (or by iteration with noise)
• darkness of borders reflects certainty• “hard” and “soft”borders
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Clustering with Noise
• assume a (linguistic) distance matrix, M• choose clustering technique, e.g. WPGMA• fix noise parameter, e.g. a = 0.5σ, where σ is standard
deviation among the varietal distances• repeat, e.g. 100 times
• add uniform noise r to M, choosing randomly from0 ≤ r ≤ a
• cluster, noting cophenetic distances obtained• collect mean cophenetic distances in “composite
matrix” M ′
• project mean cophenetic distances from M ′ to map• collect groups that appear a majority of times into a
“composite dendrogram”
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
BonnKöln 100
Iversheim56
AachenWinterspelt
55
Odenspiel
56
LohraWittelsberg 58
Allna100
HerbornseelbachOffdilln 100
99
DexbachNiederasphe 100
Rosenthal58
Frohnhausen
100
74
AltenbergSchraden 54
BockelwitzSchmannewitz 97
Linz60
GrünlichtenbergRoßwein 100
69
Lampertswalde
72
JonsdorfRammenau 88
Gersdorf72
65
AltlandsbergLippen 100
Groß Jamno100
Pretzsch
100
Neu Schadow
93
GerbstedtLandgrafroda 100
53
BorstendorfGornsdorf 100
Theuma96
Mockern
55
CursdorfOsterfeld
Wehrsdorf
56
BillingsbachZellingen 66
Altentrüdingen97
BempflingenIggingen 80
Schömberg100
BurgriedenOberhomberg
53
BruchHermeskeil 100
KruftSiebenbach 100
Mastershausen56
57
Hartenfels
56
BüdesheimEisenbach 73
Niedernhausen61
Vielbrunn
56
Lohrhaupten
83
EschelbronnPfaffenrot 83
Niederauerbach85
56
EnsheimMaxweiler
53
EbertshausenExdorf 100
TannWeyhers 100Helmers
100100
EichenhofenHermannsreuth 100
PeterskirchenSchachach 60
Gelting92
LangenbruckOberviehbach 59
PielenhofenTreffelstein 100
Ulbering
67
Hartenstein
60
KemmernOttowind 100
Schauenstein100
Weidenbach
71
Nürnberg
65
63
Oberau
62
Klafferstraß
70
Pöttmes
7875
MaibrunnRamsau 93
79
EinöllenUngstein 59
HorheimSeelbach 62
Endenburg−Lehnacker52
EngelsbachSchellroda 100
HönebachRinggau−Röhrda 84
Unterellen
63
Mörshausen
60
GroßwechsungenWieda 99
Groß Ballhausen86
100
Orferode
99
HöchstädtIgling 70
Wildpoldsried96
SchnepfenbachVolkershausen 71
ClausthalKleinbottwar
ObermaiselsteinOberwürzbach
83
AhrbergenWasbüttel 100Brelingen
76
AlberslohHaddorf 100
Lippramsdorf61
BrockhausenEngter 100
60
HohenkörbenWüllen 63
77
AltwarpBreddin
Klein Rossau60
GrünowVietmannsdorf 94
Falkenthal79
99
MirowSchönbeck 99
98
BenninWentorf 91
Groß MohrdorfWolgast 91
Hagen64
Kirch Kogel
69
GresenhorstHerzfeld 97
Jürgenshagen
68
Verchen
68
59
AstfeldFreden 74
GottsbürenOsterhagen 96
71
AtzendorfHundisburg 100
Götz94
JacobsdorfReetz 61
62
Ruhlsdorf
81
Benzingerode
100
JeverWangerooge 57
Barßel81
BremscheidHerdecke 60
HerrentrupReelkirchen 100
HesselteichValdorf 100
9256
DreekeHerßum 66
GroßenwieheSchwabstedt 100
Holmkjer100
Wasbek
65
HammahOiste 52
JesteburgKuhstedt 94
StöckenWarpe 100Adorf
BardenflethDiekhusen
EbstorfEversen
HohwachtHuddestorf
JeetzelOhrdorf
Osterbruch
88
LeuthWemb 100
83
100
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
BonnKöln 100
Iversheim56
AachenWinterspelt
55
Odenspiel
56
LohraWittelsberg 58
Allna100
HerbornseelbachOffdilln 100
99
DexbachNiederasphe 100
Rosenthal58
Frohnhausen
100
74
AltenbergSchraden 54
BockelwitzSchmannewitz 97
Linz60
GrünlichtenbergRoßwein 100
69
Lampertswalde
72
JonsdorfRammenau 88
Gersdorf72
65
AltlandsbergLippen 100
Groß Jamno100
Pretzsch
100
Neu Schadow
93
GerbstedtLandgrafroda 100
53
BorstendorfGornsdorf 100
Theuma96
Mockern
55
CursdorfOsterfeld
Wehrsdorf
56
BillingsbachZellingen 66
Altentrüdingen97
BempflingenIggingen 80
Schömberg100
BurgriedenOberhomberg
53
BruchHermeskeil 100
KruftSiebenbach 100
Mastershausen56
57
Hartenfels
56
BüdesheimEisenbach 73
Niedernhausen61
Vielbrunn
56
Lohrhaupten
83
EschelbronnPfaffenrot 83
Niederauerbach85
56
EnsheimMaxweiler
53
EbertshausenExdorf 100
TannWeyhers 100Helmers
100100
EichenhofenHermannsreuth 100
PeterskirchenSchachach 60
Gelting92
LangenbruckOberviehbach 59
PielenhofenTreffelstein 100
Ulbering
67
Hartenstein
60
KemmernOttowind 100
Schauenstein100
Weidenbach
71
Nürnberg
65
63
Oberau
62
Klafferstraß
70
Pöttmes
7875
MaibrunnRamsau 93
79
EinöllenUngstein 59
HorheimSeelbach 62
Endenburg−Lehnacker52
EngelsbachSchellroda 100
HönebachRinggau−Röhrda 84
Unterellen
63
Mörshausen
60
GroßwechsungenWieda 99
Groß Ballhausen86
100
Orferode
99
HöchstädtIgling 70
Wildpoldsried96
SchnepfenbachVolkershausen 71
ClausthalKleinbottwar
ObermaiselsteinOberwürzbach
83
AhrbergenWasbüttel 100Brelingen
76
AlberslohHaddorf 100
Lippramsdorf61
BrockhausenEngter 100
60
HohenkörbenWüllen 63
77
AltwarpBreddin
Klein Rossau60
GrünowVietmannsdorf 94
Falkenthal79
99
MirowSchönbeck 99
98
BenninWentorf 91
Groß MohrdorfWolgast 91
Hagen64
Kirch Kogel
69
GresenhorstHerzfeld 97
Jürgenshagen
68
Verchen
68
59
AstfeldFreden 74
GottsbürenOsterhagen 96
71
AtzendorfHundisburg 100
Götz94
JacobsdorfReetz 61
62
Ruhlsdorf
81
Benzingerode
100
JeverWangerooge 57
Barßel81
BremscheidHerdecke 60
HerrentrupReelkirchen 100
HesselteichValdorf 100
9256
DreekeHerßum 66
GroßenwieheSchwabstedt 100
Holmkjer100
Wasbek
65
HammahOiste 52
JesteburgKuhstedt 94
StöckenWarpe 100Adorf
BardenflethDiekhusen
EbstorfEversen
HohwachtHuddestorf
JeetzelOhrdorf
Osterbruch
88
LeuthWemb 100
83
100
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Correlationr = 0.997
clustering with noise bootstrap clustering
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Different Clustering Algorithms
Weighted Ave. (WPGMA) Unweighted Ave.(UPGMA)
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Clustering Technique
Weighted Ave. Unweighted Group Ave.
ProjectingDialect
Distances
Nerbonne,Kleiweg,Manni
Introduction
Motivation
Bootstrapping
Clusteringwith Noise
Comparison
Conclusion
Conclusions
• Stability obtained—both through bootstrapping andthrough iteration with random small amounts of noise
• Noise-adding procedure needs a noise parameter,bootstrapping number of submatrices to use.
• Noise-adding procedure applicable to single matrices,bootstrapping requires that many be present.
• Choise of clustering technique still importantBut see: http://www.let.rug.nl/~kleiweg/kaarten/MDS-clusters.html