dialect o metrics

NLP Seminar, January 2010 Yonatan Belinkov

Page 1: Dialect o Metrics

NLP Seminar, January 2010

Yonatan Belinkov

Page 2: Dialect o Metrics

Outline

• Definitions
• Urban dialectology
• Dialect geography
• Dialectometry
• Measuring the Diffusion of Linguistic Change, John Nerbonne

Page 3: Dialect o Metrics

What is a dialect?

• “A dialect is a subdivision of a particular language.”
• E.g. the Parisian dialect of French, the Bavarian dialect of German, etc.
• “A language is a collection of mutually intelligible dialects.”
• But the Scandinavian languages are mutually intelligible, while (some) German dialects are mutually unintelligible.
• Mutual intelligibility is not always symmetric (Danes understand Norwegians better than the other way around).

Page 4: Dialect o Metrics

What is a dialect (cont.)

• Thus, “language” is not a purely linguistic term.
• It is influenced by other factors: political, geographical, historical, sociological and cultural.
• “A language is a dialect with an army and navy”
  • Max Weinreich: אַ שפּראַך איז אַ דיאַלעקט מיט אַן אַרמיי און פֿלאָט

Page 5: Dialect o Metrics

What is a dialect (cont.)

• The difficulty of distinguishing between dialect and language calls for more technical definitions.
• “A variety is any particular kind of language which is considered as a single entity.”
• “A dialect is a variety of language which is grammatically and lexically different from a similar variety.”

Page 6: Dialect o Metrics

Urban Dialectology

• Traditional dialectology concentrated on regional or geographical dialects and dialect continua.
• However, other factors also play an important role in the way one speaks.
• In the 1960s scholars began describing linguistic varieties by other criteria: social status, education, ethnic/religious affiliation, age, gender, etc.
• Example: communal dialects in Baghdad (Blanc 1964)

Page 7: Dialect o Metrics

Urban dialectology (cont.)

• Socio-dialects are not discrete; they form a social dialect continuum.
• Jamaican Creole, “It’s my book”:
  • its mai buk
  • iz mai buk
  • iz mi buk
  • a mi buk dat
  • a fi mi buk dat

Page 8: Dialect o Metrics

Dialect geography

• Geographically, dialects form continua.
• West Romance dialect continuum:
  • While the standard varieties of French, Spanish, Catalan and Portuguese are not mutually intelligible,
  • the rural dialects form a continuum in which neighboring speakers easily understand each other.
• Arabic dialect continuum:
  • Arabic dialects share the same standard language.
  • Neighboring speakers communicate easily,
  • but remote dialects are mutually unintelligible.

Page 9: Dialect o Metrics

Dialect geography - history

• First significant dialect survey: Wenker, 1877-1887
• Sent a list of (short) sentences in standard German to schoolmasters.
• Example: Im Winter fliegen die trocknen Blätter durch die Luft herum (“In winter the dry leaves fly around in the air”).
• Received transcriptions of the sentences into local dialects.
• 45,000 questionnaires from all over Germany.
• Published the first linguistic atlases (Sprachatlas)

Page 10: Dialect o Metrics

Dialect geography - history (cont.)

• Field work gradually replaced postal questionnaires.
• In 1896-1900, Edmond Edmont interviewed 700 informants around the French countryside.
• His data were incorporated in Gilliéron’s French survey, which was published between 1902 and 1910.
• Subsequent atlases were published for: Italy and southern Switzerland (1931-1940), US and Canada (1939-1943, 1949, 1953, 1961, 1973-1976, 1981-1992, 1994), England (1962-1978), Ḥōrān (1940-1946), Egypt (1985), Syria (1997)…

Page 11: Dialect o Metrics

Dialect geography - methodology

• Much of the methodology is shared with other branches of linguistics:
  • Recording data (phonetics)
  • Analyzing data (theoretical linguistics, sociolinguistics, historical linguistics)
• Some methods are unique or especially important in dialect geography:
  • Devising questionnaires
  • Building linguistic maps
  • Selecting informants

Page 12: Dialect o Metrics

Questionnaires

• Using questionnaires ensures comparability of data gathered by different fieldworkers in varying conditions.
• Questions can be direct (“what do you call a cup?”) or, preferably, indirect (“what is this?”).
• Questionnaires are organized by semantic field (weather, social activities, etc.) so that informants focus on the subject matter and not on the form of their answers.
• Since tape recording became available, it has been easier to engage informants in casual, informal conversation.

Page 13: Dialect o Metrics

Linguistic maps

• Linguistic maps can be display maps, simply showing the data on a map, or interpretive maps, showing the distribution of predominant variants from region to region.
• Example: “What do you call that small, four-legged, long-tailed creature, blackish on top, it darts about in ponds?”
• Contrast the display map with the interpretive map (in the following slides).

Page 14: Dialect o Metrics

Newt

Page 16: Dialect o Metrics

Newt Display Map

Page 17: Dialect o Metrics

Newt Interpretive Map

Page 19: Dialect o Metrics

Informants

• Historically, most surveys focused on nonmobile, older, rural males.
• The motivation for this homogeneous background is that the informants’ speech should reflect the authentic speech of the area in which they live.
• Fewer studies recorded more heterogeneous speakers (young, educated, female, etc.).

Page 20: Dialect o Metrics

Dialectometry

• The variable as a structural unit.
• Dialects may differ quantitatively with regard to variables.
• Ex.: simplification of final consonant clusters (pos’card for postcard, han’ful for handful).
  • Subject to linguistic constraints such as environment (before consonant/vowel/pause).
  • But also to non- or extra-linguistic factors such as style or class.
  • Varying frequencies in different dialects.

Page 21: Dialect o Metrics

Dialectometry (cont.)

• The term was coined by Séguy (1973).
• Séguy published the linguistic atlas of Gascony in the 1950s and 1960s.
• The first 5 volumes were within the framework of the Gilliéron tradition.
• But Séguy looked for a more objective way to reveal the dialect regions of Gascony.
• He managed to do so in the 6th volume, published in 1973.

Page 22: Dialect o Metrics

Dialectometry (cont.)

• Basic idea: devise a dissimilarity measure based on the survey data.
• Algorithm:
  • Compare responses from every pair of neighboring sites.
  • Count the number of items on which the neighbors disagreed.
  • Calculate the percentage of disagreement.
• This gives the linguistic distance between two dialects.
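The counting procedure on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `seguy_distance` and the toy responses are invented here and are not taken from Séguy's atlas.

```python
def seguy_distance(responses_a, responses_b):
    """Percentage of survey items on which two sites gave different responses."""
    if len(responses_a) != len(responses_b):
        raise ValueError("both sites must answer the same items")
    if not responses_a:
        raise ValueError("at least one item is required")
    # Count the items on which the two neighbors disagree.
    disagreements = sum(1 for a, b in zip(responses_a, responses_b) if a != b)
    return 100.0 * disagreements / len(responses_a)


# Toy responses (invented for illustration) from two neighboring sites.
site_1 = ["newt", "ask", "evet", "ask"]
site_2 = ["newt", "evet", "evet", "askel"]
distance = seguy_distance(site_1, site_2)  # the sites differ on 2 of 4 items
```

Repeating this for every pair of neighboring sites gives the numbers that would be written on the map between the sites.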

Page 23: Dialect o Metrics

Dialectometry (cont.)

• Refinements:
  • Calculate the percentage of disagreement separately for each type of item (lexical, phonological, syntactic, etc.).
  • Linguistic distance is the mean of the percentages over all types.
• Map of linguistic distances in southwest Gascony.
• What can be inferred from the map?
  • Northwestern group with low linguistic distance (10-15%).
  • Site 693 is connected to similar neighbors on 3 sides (11-19%) and to less similar neighbors to the east (22-28%).
  • Possible explanation: the departmental boundary of Hautes-Pyrénées.

Page 24: Dialect o Metrics

Southwest Gascony

Page 26: Dialect o Metrics

Multidimensional scaling

• Séguy’s maps retain geographic distance and represent linguistic distance as a number.
• In multidimensional scaling (MDS), linguistic distance is displayed spatially.
• We place the data in a matrix:
  • Rows are variables, columns are informants.
  • Entries are binary.
• We need to assign a vector to each informant.
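As a rough sketch of how such a binary matrix can be turned into pairwise dissimilarities between informants, the snippet below uses the fraction of variables on which two informants differ. That choice of measure, the function name `informant_distances` and the toy matrix are assumptions made for illustration; the slides do not say which measure was used.

```python
import numpy as np


def informant_distances(data):
    """Pairwise fraction of variables on which two informants differ.

    `data` has one row per variable and one column per informant,
    with binary (0/1) entries.
    """
    data = np.asarray(data)
    columns = data.T  # one row per informant
    # Broadcasting compares every informant with every other informant.
    differs = columns[:, None, :] != columns[None, :, :]
    return differs.mean(axis=2)


# Toy matrix (invented): 4 variables (rows) by 3 informants (columns).
toy = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])
dist = informant_distances(toy)
```

The result is a symmetric matrix with zeros on the diagonal, which is the kind of input MDS expects.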

Page 27: Dialect o Metrics

MDS (cont.)

• In generalized MDS, given:
  • $k$ objects,
  • a dissimilarity measure $d$,
  • a natural number $N$.
• Calculate $d_{ij}$, the distance between items $i$ and $j$.
• Build a $k \times k$ matrix $A$ where $(A)_{ij} = d_{ij}$.
• Find $k$ vectors $x_1, \dots, x_k \in \mathbb{R}^N$ such that
  • $\|x_i - x_j\| \approx d_{ij}$ for all $i, j$.
• If $N = 2$ or $N = 3$ we can plot the vectors.
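One standard way to find such vectors is classical (Torgerson) MDS, which double-centers the squared distances and keeps the leading eigenvectors. The sketch below shows that particular variant with NumPy; it is not necessarily the procedure used in the studies discussed here, and the function name `classical_mds` and the unit-square check are illustrative.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Embed k objects in n_dims dimensions from a k x k distance matrix."""
    dist = np.asarray(dist, dtype=float)
    k = dist.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(k) - np.ones((k, k)) / k
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)


# Sanity check on the corners of a unit square, whose distances are exactly Euclidean.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
d = np.linalg.norm(square[:, None] - square[None, :], axis=2)
coords = classical_mds(d, n_dims=2)
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
```

For truly Euclidean input the recovered distances match the originals up to rotation and reflection of the point cloud; for linguistic dissimilarities the match is only approximate.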

Page 28: Dialect o Metrics

MDS - example

• Davis & McDavid (1950) described the transition zone in northwestern Ohio.
  • 5 towns: Perrysburg, Defiance, Ottawa, Van Wert and Upper Sandusky (see the map in the following slides).
  • 10 informants, 2 from each town.
  • 56 variables; most have variants from two adjacent dialect regions, Northern and Midland, from which immigrants arrived in the area (see the table in the following slides).
• Davis & McDavid could not “give convincing reasons for the restriction of some items and the spreading of others”.

Page 29: Dialect o Metrics

Northwestern Ohio map

Page 31: Dialect o Metrics

Northwestern Ohio table

Page 33: Dialect o Metrics

MDS – example (cont.)

• Two years later, Reed & Spicer (1952) performed a statistical analysis of covariance on the same data.
• They showed that the speech of informants who lived closer to each other was more similar than the speech of informants who lived far from each other.
• Reed & Spicer were ahead of their time in the quantitative approach they took.

Page 34: Dialect o Metrics

MDS – example (cont.)

• Chambers (in Chambers & Trudgill, 1998) used correspondence analysis with the same data, and arrived at the following figure.
• Interpretation:
  • 3 clusters in different quadrants: P1 and P2; V1, V2, US1, US2 and O2; D1, D2 and O1.
  • The 1st cluster tends to choose Northern variants.
  • The 2nd cluster tends to choose Midland variants.
  • The 3rd cluster shows a mixed pattern of choices.
• These observations correlate with the geographic map.

Page 35: Dialect o Metrics

Northwestern Ohio MDS

Page 37: Dialect o Metrics

Goebl

• After Séguy’s breakthrough in dialectometry, Goebl (1982, 1984; taken from Nerbonne & Kretzschmar 2003) extended and developed new methods for measuring dialect differences.
• Recall that Séguy’s measure counted differences in responses to questionnaires in pairs of sites.
• Goebl explored measures that give more weight to less frequent words.
• He also studied the level of coherence between a given site and other sites to discover whether it is an island or a transition area.

Page 38: Dialect o Metrics

More recent work

• Kessler (1995) was the first to use (weighted) Levenshtein distance as a linguistic distance.
  • Calculated Levenshtein distances between phonetic strings of Irish Gaelic words.
  • Used 12 phonetic features (nasality, rounding, length, etc.), with values between 0 and 1, to describe phones; the distance between two phones is the average difference between feature values.
  • Applied clustering techniques to the calculated distances.
  • Obtained dialect boundaries which correspond to provincial boundaries.
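A feature-weighted Levenshtein distance of this kind can be sketched as follows. The substitution cost is the average absolute difference between feature values, as described above. The three-feature table, the phone inventory and the fixed insertion/deletion cost of 1.0 are hypothetical choices made for illustration, not Kessler's actual settings.

```python
def phone_distance(p, q, features):
    """Average absolute difference between the feature values of two phones."""
    fp, fq = features[p], features[q]
    return sum(abs(a - b) for a, b in zip(fp, fq)) / len(fp)


def weighted_levenshtein(s, t, features, indel_cost=1.0):
    """Levenshtein distance over phone sequences with feature-based substitution costs."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel_cost
    for j in range(1, cols):
        table[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = table[i - 1][j - 1] + phone_distance(s[i - 1], t[j - 1], features)
            delete = table[i - 1][j] + indel_cost
            insert = table[i][j - 1] + indel_cost
            table[i][j] = min(substitute, delete, insert)
    return table[-1][-1]


# Hypothetical feature table: (nasality, rounding, length), each between 0 and 1.
TOY_FEATURES = {
    "a": (0.0, 0.0, 0.0),
    "o": (0.0, 1.0, 0.0),
    "n": (1.0, 0.0, 0.0),
}

# "na" vs. "no": one substitution of a by o, which differ in rounding only.
sub_only = weighted_levenshtein(["n", "a"], ["n", "o"], TOY_FEATURES)
```

Identical phones cost nothing to substitute, so the measure reduces to plain Levenshtein distance when all phones are maximally different from one another.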

Page 39: Dialect o Metrics

More recent work (cont.)

• Heeringa & Nerbonne (2002) studied dialect areas and dialect continua using Levenshtein distance.
• They calculated Levenshtein distances between all pairs of 27 Dutch dialects which lie on a straight line.
• On the one hand, they used regression to account for linguistic distance by geographic distance, thus validating the continuum concept.
• On the other, they used clustering to detect dialect areas.
• Finally, MDS showed interrelations between dialects.

Page 40: Dialect o Metrics

More recent work (cont.)

• A number of refinements and alternatives to Levenshtein distance have been suggested (surveyed in Nerbonne & Kretzschmar 2003).
  • Kondrak notes that prefixes and suffixes tend to get deleted, and explores local alignments (or distances) between strings.
  • Heeringa & Gooskens attempt to measure pronunciation differences in acoustic recordings instead of phonetic transcriptions.
  • Nerbonne & Kleiweg deal with related but non-identical question responses (clears up, clears, clearing up).

Page 41: Dialect o Metrics

Visualization

• Early dialectologists already presented their findings visually (in various linguistic maps).
• Computers enable us to visualize data in more vivid, telling ways.
• Nerbonne (2005) presents Dutch dialects, their distances and inner groups, using several visualizations (pp. 18, 20-23, 25, 26).

Page 42: Dialect o Metrics

Measuring the Diffusion of Linguistic Change

John Nerbonne, 2009

Page 43: Dialect o Metrics

Models of diffusion

• What is linguistic diffusion?
  • Sociolinguistic vs. spatial diffusion
• The wave model: innovations spread outwards in waves.
• The skipping stone model: innovations leap discontinuously between centers of influence.
  • Innovations spread locally in waves around each center.
  • Centers of influence are usually larger cities or towns.

Page 44: Dialect o Metrics

The gravity model

• Developed by Peter Trudgill (1974).
• Geographic distance and population size predict the chance of communication and thus the degree of diffusion.
• As in physical gravity, the most influential site (= body) is the nearest, largest (= most massive) one.
• Influence is inversely proportional to the square of the distance between sites, and proportional to the product of the population sizes:
  • $I_{ij} = s \cdot \dfrac{P_i P_j}{(d_{ij})^2}$
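The formula can be evaluated directly. In the sketch below the factor s is treated as a plain constant defaulting to 1.0, since the slide does not define it; the function name and the toy populations and distances are likewise assumptions made for illustration.

```python
def gravity_influence(pop_i, pop_j, distance, s=1.0):
    """Mutual influence of sites i and j under the gravity formula on this slide."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return s * pop_i * pop_j / distance ** 2


# Toy values: populations in thousands, distances in kilometres.
near = gravity_influence(100, 50, 10)
far = gravity_influence(100, 50, 20)  # doubling the distance quarters the influence
```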

Page 45: Dialect o Metrics

Séguy’s Curve

• Séguy measured lexical, or linguistic, distance and compared it to geographic distance.
• He found that lexical distance is a sub-linear function of geographic distance (square root of logarithm).

Page 46: Dialect o Metrics

Dialectometric view of gravity

• Why use dialectometric methods in this case?
  • Avoid an arbitrary choice of which features to focus on.
  • Quantify influence to arrive at a more general perspective.
• Several studies have attempted to test the validity of the gravity model using dialectometry (references in Nerbonne 2009).

Page 47: Dialect o Metrics

Dialectometric view of gravity (cont.)

• Nerbonne & Heeringa (2007) derived linguistic (Levenshtein) distances from 52 towns in the Netherlands.
  • Interestingly, Levenshtein operation costs were derived from comparing spectrograms.
• Since in the gravity model influence correlates inversely with geographic distance,
  • they stipulated that, according to the gravity model, linguistic distance should correlate directly with geographic distance.
• Indeed, they found a direct correlation, but sub-linear and not quadratic (as predicted by the gravity model).
• They also found no effect of population size on linguistic distance.

Page 48: Dialect o Metrics

Dialectometric view of gravity (cont.)

• Heeringa (2007) included more Dutch data.
• He too found a sub-linear connection between linguistic distance and geographic distance.
• However, he also found that population size contributes to the linguistic distance, as in the gravity model.
• Alewijnse et al. (2007) found a sub-linear, logarithmic correlation between linguistic and geographic distance in Bantu data collected in Gabon.
• Prokić (2007) and Nerbonne & Siedle (2005) arrived at similar results with Bulgarian and German data, respectively.

Page 49: Dialect o Metrics

Dialectometric view of gravity (cont.)

• The same correlation was found in other studies, in the US, the Netherlands (again) and Norway.
• In the above studies geography accounted for 16-37% of the linguistic variation.
• Note that in all of the above, linguistic distance is narrowed down to phonetic distance.
• Spruit (2006) measured syntactic distance and found a linear correlation with geographic distance.

Page 50: Dialect o Metrics

Individual vs. Aggregate Differences

• Dialectometry measures the influence of geography on aggregate, cumulative variation.
• Sociolinguistics, on the other hand, focuses on the diffusion of single items (words, sounds).
• What is the relation between the diffusion of individual items and aggregate diffusion?
• Simulating the diffusion of individual items could save the time and effort it would take a researcher to examine the distributions of many individual items.

Page 51: Dialect o Metrics

Simulating Diffusion

• Create several thousand sites.
• Sites are at different distances from a single reference site.
• Each site is represented by a 100-dimensional binary vector.
• Each dimension symbolizes a linguistic feature.
• Each dimension is a binary variable: “0” means that the site is the same as the reference site with respect to that feature; “1” means that it is different.

Page 52: Dialect o Metrics

Simulating Diffusion (cont.)

• The simulation comprises two views, linear and quadratic, corresponding to Séguy’s curve and the gravity model.
• In both cases, a random change is created n times in each site, where n depends on the site’s distance from the reference site.
• In the linear view, n depends linearly on the distance.
• In the quadratic view, n depends on the square of the distance.

Page 53: Dialect o Metrics

Simulating Diffusion (cont.)

• Creating the random change:
  • Randomly select a dimension i in the 100-dimensional vector.
  • Draw a random number x between 0 and 1.
  • If x > 0.5, set dimension i to 1; else, set it to 0.
• The aggregate distance of a vector from the reference site is the sum of all its elements.
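The simulation described on the last three slides can be sketched roughly as follows. Several details are not given in the slides and are assumptions here: each site starts out identical to the reference site, distances are drawn uniformly, and n is obtained by truncating `scale * distance` (or its square) to an integer. The function names are illustrative as well.

```python
import random

N_FEATURES = 100


def simulate_site(distance, view, rng, scale=1.0):
    """Return the aggregate distance of one simulated site from the reference site."""
    if view == "linear":
        n_changes = int(scale * distance)
    elif view == "quadratic":
        n_changes = int(scale * distance ** 2)
    else:
        raise ValueError("view must be 'linear' or 'quadratic'")
    site = [0] * N_FEATURES  # assumption: the site starts identical to the reference
    for _ in range(n_changes):
        i = rng.randrange(N_FEATURES)  # randomly select a dimension
        x = rng.random()               # random number between 0 and 1
        site[i] = 1 if x > 0.5 else 0
    return sum(site)  # aggregate distance: the sum of all elements


def run_simulation(n_sites=2000, max_distance=50.0, view="linear", seed=0):
    """Simulate n_sites sites and return (distance, aggregate distance) pairs."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sites):
        distance = rng.uniform(0.0, max_distance)
        results.append((distance, simulate_site(distance, view, rng)))
    return results


linear_results = run_simulation(n_sites=200, view="linear")
quadratic_results = run_simulation(n_sites=200, view="quadratic")
```

Because a change can reset a feature to 0 as well as set it to 1, the aggregate distance levels off as n grows, which is one way a sub-linear curve can arise.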

Page 54: Dialect o Metrics

Results

• The following figure shows the results of two simulations: one in which the chance of change depends linearly on distance and one in which it depends quadratically on distance.
• In both cases a logarithmic regression line is shown; this is the typical sub-linear Séguy curve.
• The results imply that geography has a linear effect on the likelihood of the diffusion of an individual item.
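A logarithmic regression line of the kind mentioned above can be fitted by ordinary least squares on the logarithm of the geographic distance. The sketch below uses `numpy.polyfit`; the function name `fit_log_regression` and the noise-free toy values are invented for illustration and are not the paper's data.

```python
import numpy as np


def fit_log_regression(geo, ling):
    """Fit ling = intercept + slope * ln(geo); return slope, intercept and R squared."""
    geo = np.asarray(geo, dtype=float)
    ling = np.asarray(ling, dtype=float)
    slope, intercept = np.polyfit(np.log(geo), ling, deg=1)
    predicted = intercept + slope * np.log(geo)
    ss_res = np.sum((ling - predicted) ** 2)
    ss_tot = np.sum((ling - ling.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic, noise-free toy values generated from a known logarithmic curve.
geo = np.array([1.0, 2.0, 5.0, 10.0, 50.0])
ling = 2.0 + 3.0 * np.log(geo)
slope, intercept, r_squared = fit_log_regression(geo, ling)
```

With real data the R squared value corresponds to the share of linguistic variation accounted for by geography.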

Page 55: Dialect o Metrics

Results (cont.)

Page 56: Dialect o Metrics

Results (cont.)

• However, applying local regression gives a similar logarithmic curve in the linear case, but reveals a different curve in the quadratic case.
• This suggests that quadratic influence also contributes to aggregate diffusion, as predicted by the gravity model.
• The following figure shows the results after applying local regression.

Page 57: Dialect o Metrics

Results (cont.)

Page 58: Dialect o Metrics

Conclusions

• Several points need to be investigated in further simulations, e.g.:
  • Restriction of changes to binary choices.
  • Limiting influence to only one center.
• Further studies are required to test diffusion with individual items.
• Still, it was shown that models of diffusion can be effectively tested quantitatively.
• There is a (sub-)linear correlation between linguistic distance and geographic distance.

Page 59: Dialect o Metrics

References

• Chambers, J. & Trudgill, P. (1998). Dialectology. 2nd ed. Cambridge: Cambridge University Press.
• Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der Numerischen Taxonomie im Bereich der Dialektgeographie. Wien: Österreichischen Akademie der Wissenschaften.
• Goebl, H. (1984). Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. 3 Vol. Tübingen: Max Niemeyer.
• Heeringa, W. & Nerbonne, J. (2002). Dialect Areas and Dialect Continua. In Language Variation and Change 13, 375-398.

Page 60: Dialect o Metrics

References (cont.)

• Kessler, B. (1995). Computational dialectology in Irish Gaelic. In Proceedings of the seventh conference of the European chapter of the Association for Computational Linguistics (pp. 60–66). San Francisco, CA: Morgan Kaufmann Publishers.
• Nerbonne, J. (2005). Dialectology: Aggregate Dialectal Variation. Presentation at the LSA Linguistic Institute, Harvard and MIT. http://www.let.rug.nl/nerbonne/teach/dialectology/
• Nerbonne, J. & Kretzschmar, W. (2003). Introducing Computational Methods in Dialectometry. In Computational Methods in Dialectometry. Special issue of Computers and the Humanities, 37(3), 245-255.

Page 61: Dialect o Metrics

References (cont.)

• Nerbonne, J. (2009). Measuring the Diffusion of Linguistic Change. To appear in Philosophical Transactions of the Royal Society B: Biological Sciences, ca. 2010, special issue with selection of papers from "Cultural and Linguistic Diversity", conference held at AHRC Centre for Evolution of Cultural Diversity, London, Dec. 9-13, 2008.