Projecting Dialect Distances to Geography
Bootstrapping Clustering vs. Clustering with Noise

John Nerbonne (1), Peter Kleiweg (1), Franz Manni (2)
(1) Alfa-informatica, University of Groningen
(2) Musée de l’Homme, Paris

Data Analysis, Machine Learning, and Applications: 31st Meeting, Gesellschaft für Klassifikation, Freiburg, 9 March 2007

Outline: Introduction, Motivation, Bootstrapping, Clustering with Noise, Comparison, Conclusion

Background

Pronunciation Distance

• obtained by a modified edit distance
  • roughly equal to the minimal number of basic changes needed to obtain one phonetic transcription from another
• results for Dutch, Norwegian, German, Bulgarian, Sardinian, Gabon Bantu, ...
• validated with respect to dialect speakers’ judgements (r = 0.7)

Example alignment: "k O r s t" vs. "k œ s t @", with operation costs 1 + 1 + 1 = 3
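The idea of counting basic changes between two transcriptions can be illustrated with a minimal sketch. It assumes plain Levenshtein distance with unit costs for insertion, deletion and substitution; the slides speak of a modified edit distance whose details are not given here, so the function name and cost parameters below are illustrative only.

```python
def edit_distance(source, target, indel_cost=1, subst_cost=1):
    """Return the minimal total cost of turning `source` into `target`.

    Assumption: unit costs for every operation (plain Levenshtein),
    not the modified costs mentioned on the slide.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # table[i][j] = cost of transforming source[:i] into target[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel_cost
    for j in range(1, cols):
        table[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + indel_cost,                       # delete a segment
                table[i][j - 1] + indel_cost,                       # insert a segment
                table[i - 1][j - 1] + (0 if same else subst_cost),  # match or substitute
            )
    return table[-1][-1]


# The two transcriptions from the slide, one list item per segment.
first = ["k", "O", "r", "s", "t"]
second = ["k", "œ", "s", "t", "@"]
distance = edit_distance(first, second)
print(distance)  # 3
```

With unit costs the cheapest alignment uses one substitution (O to œ), one deletion (r) and one insertion (@), which agrees with the total of 3 shown on the slide.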

Introduction

Dialect/Language Groups/Families: Language varieties (sometimes) organized in groups. Even elsewhere (e.g., continua, islands), we still examine groupings to compare new results to older scholarship, which assumes groups.

• organized hierarchically
  • Kaiserstuhl is South Badish, is Badish, is Alemannic, is Southern German, is Western Germanic, ...
• therefore we apply HIERARCHICAL CLUSTERING (not k-means, etc.)

Stability, borders, certainty

Problems
• Stability
  • slight input differences can change cluster results massively
• Hard and soft borders
• Certainty

Two Bulgarian Datasets (r = 0.97)

Stability is a real problem!

Bootstrapping Clustering

• assume n (linguistic) distance matrices, M_i (1 ≤ i ≤ n), e.g. one matrix/word
• choose clustering technique, e.g. WPGMA
• repeat, e.g. 100 times
  • select m ≤ n matrices, allowing replacement
    • option 1: use repeated selection as weight (Mucha & Haimerl, GfKl 2006)
    • option 2: ignore repetition
  • cluster sum of matrices obtaining dendrogram, recording groups
• “composite matrix” M′ ⇐ mean cophenetic distances
• collect groups that appear a majority of times into a “composite dendrogram”
• (new!) project dendrogram borders to map, reflecting cophenetic distance in darkness
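The bootstrap loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a small hand-written WPGMA, takes m = n, follows option 1 (a matrix drawn twice is summed twice), and runs on three invented toy matrices over four sites. The names `wpgma` and `bootstrap_clustering` are hypothetical.

```python
import random


def wpgma(dist):
    """WPGMA on a symmetric distance matrix given as a list of lists.

    Returns the cophenetic matrix and the groups (frozensets of site
    indices) created by the successive merges.
    """
    n = len(dist)
    members = {i: frozenset([i]) for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    coph = [[0.0] * n for _ in range(n)]
    groups, next_id = [], n
    while len(members) > 1:
        (a, b), height = min(d.items(), key=lambda item: item[1])
        # Sites joined at this merge get the merge height as cophenetic distance.
        for i in members[a]:
            for j in members[b]:
                coph[i][j] = coph[j][i] = height
        merged = members[a] | members[b]
        groups.append(merged)
        # WPGMA: distance to the new cluster is the plain mean of the
        # distances to its two children, regardless of their sizes.
        new = {(c, next_id): (d[min(a, c), max(a, c)] + d[min(b, c), max(b, c)]) / 2
               for c in members if c not in (a, b)}
        d = {key: value for key, value in d.items() if a not in key and b not in key}
        d.update(new)
        del members[a], members[b]
        members[next_id] = merged
        next_id += 1
    return coph, groups


def bootstrap_clustering(matrices, runs=100, seed=0):
    """Return mean cophenetic distances and how often each group appeared."""
    rng = random.Random(seed)
    n = len(matrices[0])
    mean_coph = [[0.0] * n for _ in range(n)]
    counts = {}
    for _ in range(runs):
        # Draw as many matrices as there are, with replacement (assumes m = n).
        sample = [rng.choice(matrices) for _ in matrices]
        summed = [[sum(m[i][j] for m in sample) for j in range(n)] for i in range(n)]
        coph, groups = wpgma(summed)
        for i in range(n):
            for j in range(n):
                mean_coph[i][j] += coph[i][j] / runs
        for group in groups:
            counts[group] = counts.get(group, 0) + 1
    return mean_coph, counts


# Invented toy data: sites 0 and 1 are close, sites 2 and 3 are close.
m1 = [[0, 1, 5, 6], [1, 0, 5, 6], [5, 5, 0, 2], [6, 6, 2, 0]]
m2 = [[0, 2, 6, 5], [2, 0, 6, 5], [6, 6, 0, 1], [5, 5, 1, 0]]
m3 = [[0, 1, 4, 5], [1, 0, 5, 4], [4, 5, 0, 2], [5, 4, 2, 0]]
mean_coph, counts = bootstrap_clustering([m1, m2, m3])
```

Here `counts` is the raw material for a composite dendrogram and `mean_coph` plays the role of the composite matrix M′. Option 2 would correspond to summing `set`-deduplicated samples instead.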

Composite Dendrograms

[Figure: composite dendrogram of dialect sites (Bonn, Köln, Iversheim, Aachen, ..., Leuth, Wemb); numeric node labels range from 52 to 100.]

Composite dendrograms show groups which appear in more than 50% of the repeated (bootstrapped) clusterings.
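The majority rule behind a composite dendrogram can be sketched in a few lines. The sketch assumes each clustering run is available as a collection of groups (frozensets of site labels); the data below are invented placeholders with generic labels, not results from the talk, and `majority_groups` is a hypothetical helper name.

```python
from collections import Counter


def majority_groups(runs, threshold=0.5):
    """Map each group seen in more than `threshold` of the runs to its support in percent."""
    # set(run) guards against a group being listed twice within one run.
    counts = Counter(group for run in runs for group in set(run))
    return {group: 100 * count / len(runs)
            for group, count in counts.items()
            if count / len(runs) > threshold}


# Invented example: four runs over the hypothetical sites A, B, C, D.
runs = [
    [frozenset({"A", "B"}), frozenset({"A", "B", "C"})],
    [frozenset({"A", "B"}), frozenset({"B", "C"})],
    [frozenset({"A", "B"}), frozenset({"A", "B", "C"})],
    [frozenset({"A", "B"}), frozenset({"C", "D"})],
]
support = majority_groups(runs)
print(support)
```

In this toy input only the group {A, B} survives: it appears in all four runs, whereas {A, B, C} appears in exactly half of them and therefore does not exceed the 50% threshold.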

Cophenetic distance

Mean cophenetic distances M′ obtained from bootstrapping M_i (1 ≤ i ≤ n):

• Apply (classical) multi-dimensional scaling to M′, obtaining 3-dimensional solutions
  • remaining stress ≈ 10%
  • correlation with original M′ very high, r = 0.9
• Interpret dimensions as red, green, blue intensities
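A minimal sketch of the scaling-to-colour step, assuming NumPy is available and using the textbook (Torgerson) form of classical MDS via double centering. The min-max rescaling of each dimension to the range 0 to 1 is an assumption about how coordinates become colour intensities; `classical_mds` and `to_rgb` are hypothetical names, and the input is an invented Euclidean toy matrix rather than a real M′.

```python
import numpy as np


def classical_mds(dist, dims=3):
    """Classical (Torgerson) MDS: coordinates whose distances approximate `dist`."""
    squared = np.asarray(dist, dtype=float) ** 2
    n = squared.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ squared @ centering      # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]           # keep the `dims` largest
    scale = np.sqrt(np.clip(eigvals[order], 0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale


def to_rgb(coords):
    """Rescale each dimension to [0, 1] so it can serve as a colour intensity."""
    low, high = coords.min(axis=0), coords.max(axis=0)
    span = np.where(high > low, high - low, 1.0)       # avoid division by zero
    return (coords - low) / span


# Invented toy input: distances between five points in 3-dimensional space.
points = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 2, 3]], dtype=float)
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

coords = classical_mds(dist)
colors = to_rgb(coords)   # one (red, green, blue) triple per site
```

Because the toy distances are exactly Euclidean in three dimensions, the three retained dimensions reproduce them without loss; for real dialect data some stress remains, as the slide notes.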

[Figure: composite dendrogram of dialect sites (Ahrbergen, Wasbüttel, Albersloh, Haddorf, ..., Winterspelt); numeric node labels range from 51 to 100.]

Composite Cluster Maps

• instability of clustering countered by bootstrapping (or by iteration with noise)
• darkness of borders reflects certainty
  • “hard” and “soft” borders

Clustering with Noise

• assume a (linguistic) distance matrix, M
• choose clustering technique, e.g. WPGMA
• fix noise parameter, e.g. a = 0.5σ, where σ is standard deviation among the varietal distances
• repeat, e.g. 100 times
  • add uniform noise r to M, choosing randomly from 0 ≤ r ≤ a
  • cluster, noting cophenetic distances obtained
• collect mean cophenetic distances in “composite matrix” M′
• project mean cophenetic distances from M′ to map
• collect groups that appear a majority of times into a “composite dendrogram”
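The noise procedure can be sketched in the same spirit. The assumptions are spelled out because the slide leaves them open: σ is taken as the population standard deviation over the upper triangle of M, noise is drawn independently for each pair of sites and applied symmetrically, and clustering uses a compact hand-written WPGMA. The names `wpgma_cophenetic` and `noisy_clustering` are hypothetical, and the matrix is invented.

```python
import random
import statistics


def wpgma_cophenetic(dist):
    """Cluster with WPGMA and return the cophenetic distance matrix."""
    n = len(dist)
    members = {i: frozenset([i]) for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(members) > 1:
        (a, b), height = min(d.items(), key=lambda item: item[1])
        for i in members[a]:
            for j in members[b]:
                coph[i][j] = coph[j][i] = height
        # WPGMA update: plain mean of the distances to the two merged clusters.
        new = {(c, next_id): (d[min(a, c), max(a, c)] + d[min(b, c), max(b, c)]) / 2
               for c in members if c not in (a, b)}
        d = {key: value for key, value in d.items() if a not in key and b not in key}
        d.update(new)
        members[next_id] = members.pop(a) | members.pop(b)
        next_id += 1
    return coph


def noisy_clustering(dist, runs=100, factor=0.5, seed=0):
    """Mean cophenetic distances over `runs` clusterings of noise-perturbed copies."""
    rng = random.Random(seed)
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sigma = statistics.pstdev([dist[i][j] for i, j in pairs])  # assumed: population SD
    ceiling = factor * sigma                                   # noise parameter a
    mean_coph = [[0.0] * n for _ in range(n)]
    for _ in range(runs):
        noisy = [row[:] for row in dist]
        for i, j in pairs:
            # Uniform noise r with 0 <= r <= a, kept symmetric.
            noisy[i][j] = noisy[j][i] = dist[i][j] + rng.uniform(0, ceiling)
        coph = wpgma_cophenetic(noisy)
        for i, j in pairs:
            mean_coph[i][j] += coph[i][j] / runs
            mean_coph[j][i] = mean_coph[i][j]
    return mean_coph


# Invented toy matrix over four sites.
matrix = [[0, 1, 5, 6], [1, 0, 5, 6], [5, 5, 0, 2], [6, 6, 2, 0]]
composite = noisy_clustering(matrix)
```

Unlike the bootstrap, this needs only the single matrix M, at the price of choosing the factor in a = 0.5σ.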

[Figure: composite dendrogram of dialect sites with numeric node labels (52 to 100); the extracted labels are identical to those on the "Composite Dendrograms" slide.]

Correlation r = 0.997

Panels: clustering with noise | bootstrap clustering
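A correlation of this kind can be computed as Pearson's r over the off-diagonal cells of two distance matrices. The sketch below assumes that the comparison is between the two composite matrices, which the slide does not state explicitly; the matrices are invented toy values and the helper names are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the cells above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented toy matrices standing in for the two composite matrices.
from_noise = [[0, 1, 5, 6], [1, 0, 5, 6], [5, 5, 0, 2], [6, 6, 2, 0]]
from_bootstrap = [[0, 1.2, 5.1, 5.9], [1.2, 0, 5.0, 6.2], [5.1, 5.0, 0, 2.1], [5.9, 6.2, 2.1, 0]]

r = pearson_r(upper_triangle(from_noise), upper_triangle(from_bootstrap))
print(round(r, 3))
```

Only the upper triangle is used because the matrices are symmetric and the diagonal is zero by definition, so including those cells would inflate the agreement.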

Different Clustering Algorithms

Panels: Weighted Ave. (WPGMA) | Unweighted Ave. (UPGMA)

Clustering Technique

Panels: Weighted Ave. | Unweighted Group Ave.

Conclusions

• Stability obtained, both through bootstrapping and through iteration with random small amounts of noise
• Noise-adding procedure needs a noise parameter, bootstrapping the number of submatrices to use.
• Noise-adding procedure applicable to single matrices, bootstrapping requires that many be present.
• Choice of clustering technique still important
  But see: http://www.let.rug.nl/~kleiweg/kaarten/MDS-clusters.html