MUSEUM NATIONAL D’HISTOIRE NATURELLE
École Doctorale Sciences de la Nature et de l’Homme – ED 227
Year 2017
THESIS
Submitted for the degree of
DOCTEUR DU MUSEUM NATIONAL D’HISTOIRE NATURELLE
Specialty: GENETICS AND LINGUISTICS OF HUMAN POPULATIONS
Presented and publicly defended by
Valentin Thouzeau
on 28 November 2017
INFÉRER L’HISTOIRE DES POPULATIONS HUMAINES À PARTIR DES DIVERSITÉS GÉNÉTIQUES ET LINGUISTIQUES
(Inferring the history of human populations from genetic and linguistic diversity)
Supervised by: Mr Frédéric Austerlitz, Research Director, and Mr Paul Verdu, Research Scientist
JURY:
Ms Porcher, Emmanuelle          Professor, MNHN                          Chair
Mr Austerlitz, Frédéric         Research Director, CNRS                  Thesis Supervisor
Mr Verdu, Paul                  Research Scientist, CNRS                 Thesis Supervisor
Mr Estoup, Arnaud               Research Director, INRA                  Reviewer
Ms Kandler, Anne                Senior Scientist, Max Planck Institute   Reviewer
Ms Barberousse, Anouk           Professor, Université Lille 1            Examiner
Ms Barkat-Defradas, Mélissa     Research Scientist, CNRS                 Examiner
Résumé
Les inférences historiques sont des méthodes statistiques permettant de
reconstruire les événements passés à partir de données actuelles. Les inférences en
parallèle des histoires génétiques et linguistiques ont récemment profité d’avancées
méthodologiques permettant de mieux comprendre la co-évolution entre gènes et
langues. Néanmoins, les événements historiques complexes affectant la diversité
linguistique sont encore très peu étudiés. Cette thèse a été centrée sur l’articulation
entre inférences génétiques et linguistiques, à partir d’une pratique interdisciplinaire
et de l’application de méthodes statistiques de calcul Bayésien approché. Ce cadre a
permis de prendre en compte des événements complexes de migration, d’hybridation,
ou de changement de taille des populations, aussi bien pour l’histoire génétique que
pour l’histoire linguistique. Des données génétiques et linguistiques issues de
plusieurs populations d’Asie Centrale ont ainsi été analysées, montrant que l’histoire
des populations génétiques peut parfois différer de l’histoire des variétés linguistiques
parlées par ces populations. Un cadre permettant de prendre en compte la diversité
linguistique interindividuelle a ensuite été développé et appliqué à un ensemble de
locuteurs tadjiks d’Asie Centrale. Une interface entre génétique et linguistique centrée
sur les individus a ensuite été formalisée à partir d’un travail théorique, dont les
possibilités méthodologiques ont été confirmées a priori par le calcul Bayésien
approché, ouvrant un nouveau champ d’investigation dans l’étude de la co-évolution
entre génétique et linguistique. Enfin, un protocole d’échantillonnage linguistique
appliqué à un ensemble de locuteurs des Îles du Cap Vert a été construit afin de
permettre une intégration entre le travail théorique et le travail de terrain. L’ensemble
de ce travail formalise une linguistique des populations couplée à la génétique des
populations humaines, et fournit les outils méthodologiques permettant de
reconstruire l’histoire de la co-évolution entre génétique et linguistique.
Abstract
Historical inference refers to statistical methods that reconstruct past events
from present-day data. Parallel inferences of genetic and linguistic histories have
recently benefited from methodological advances that give a better understanding of the
coevolution of genes and languages. Nevertheless, the complex historical
events affecting linguistic diversity remain understudied. This thesis focuses on
the articulation between genetic and linguistic inference, building on
interdisciplinary practice and on the statistical methods of approximate
Bayesian computation. This framework made it possible to account for complex events
of migration, admixture, or change in population size, in both genetic and linguistic
history. Genetic and linguistic data sampled from several populations in Central Asia
were analyzed, showing that the history of genetic populations can differ from the
history of the linguistic varieties spoken by these populations. A framework that
accounts for within-population linguistic diversity was then developed and
applied to a group of Tajik speakers from Central Asia. An interface between genetics
and linguistics centered on individuals was then formalized through
theoretical work, and its methodological potential was confirmed a priori by
approximate Bayesian computation, opening a new field of investigation in the study
of the coevolution of genes and languages. Finally, a linguistic sampling
protocol applied to a group of speakers from the Cape Verde Islands was designed to
integrate theoretical work with fieldwork. Taken together, this work
formalizes a population linguistics coupled with human population genetics, and
provides the methodological tools needed to reconstruct the history of genetic and
linguistic coevolution.
Acknowledgements
One might think that a doctoral thesis is a rather solitary undertaking. In
my case, it was only with the help of those I was fortunate enough to work alongside
during these three years that I was able to complete this work. These few
lines cannot suffice to express all my gratitude to them.
First of all, I wish to express my deep thanks to my thesis supervisors,
Frédéric Austerlitz and Paul Verdu, for their day-to-day availability, their
trust in the face of my eccentricities, and their support in times of doubt. Thank you to
Frédéric for passing on to me his theoretical rigour, and thank you to Paul for
passing on to me the passion for fieldwork!
I would like to warmly thank Anne Kandler, Anouk Barberousse,
Arnaud Estoup, Emmanuelle Porcher, and Mélissa Barkat-Defradas for agreeing
to take on the task of evaluating a thesis shaped by multiple scientific influences. Thanks
also to Etienne Danchin, Michael Blum, and Sylvie Le Bomin for agreeing
to sit on my thesis committee and for guiding me through my first
hesitant steps.
I thank Philippe Endicott for giving me the opportunity to collaborate with
scientists on the other side of the world, as well as Russell Gray and Quentin Atkinson
for their kind welcome at the University of Auckland. I thank Ethan Jewett,
Marlyse Baptista and Sergio da Costa for their help and energy in the field
during our mission to Cape Verde, as well as all the participants who
agreed to go along with our curious requests.
Thank you to all my colleagues at the Musée de l’Homme for all the
diversity they bring, both scientific and extra-scientific. Thank you to Bérénice
Alard for her constantly renewed energy, Christophe Costes for his boundless
curiosity, Goki Ly for the yo-yo lessons, and Nina Marchi for our very rich
debates. Thank you to Bruno Toupance, Céline Bon, Evelyne Heyer, Flora Jay, Laure
Ségurel, Marie-Françoise Rombi, Philippe Mennecier, Pierre Darlu, Priscille
Touraille, Raphaëlle Chaix, Romain Laurent, and Samuel Pavard for their invaluable help
and the conviviality they bring every day. Thank you to Marie-Claude Kergoat for
sharing my unhealthy craving for "true knowledge". Thank you to Franz Manni for
his valuable scientific and sartorial advice. Thank you to Antonin Affholder for
his seriousness, curiosity, and energy in carrying out his internship, which I had the
pleasure of co-supervising. I must also thank the Prehistorians, Archaeologists,
Ethnographers, Anthropologists, Ethnomusicologists, Primatologists, Linguists,
Jurists, and Those-Who-Do-Not-Fit-In-Any-Box, researchers who showed
me the richness of the world in so many different ways. Thanks also to
Florence Loiseau and Taouès Lahrem for their kind administrative
support.
It is with emotion that I wish to address very special thanks to
Frank Alvarez-Peyrere, for his teaching in the epistemology of
interdisciplinarity, for his ever attentive listening, and for giving me
a glimpse of the full richness of the hidden dimensions of the human being.
I very warmly thank Mathieu Tiret, a researcher whose quick
mind will never cease to impress me, a colleague always ready to offer me
unconditional help, a long-time flatmate of boundless tolerance, but,
above all, an irreplaceable friend.
I thank my parents, my family, and my friends from Savoie and elsewhere, whose
constant support gave me all the strength to carry on with my work. Finally,
thank you to Manon Potin, without whom no truth would be worth
seeking.
Contents
Résumé
Abstract
Acknowledgements
Contents
List of figures
List of tables

Foreword
    Communication, linguistics and epistemology
    Discourse, utterance, meaning
    Analytical perspective and anthropological perspective
    Conventions of disciplinary languages
    Implicit theoretical presuppositions
    Towards interdisciplinary communication

Introduction
    1. Construction and observation of the object
        1.1. Observation of genetic diversity
        1.2. Observation of linguistic diversity
        1.3. Philological positioning
    2. Description of diversity and historical inference
        2.1. Description of genetic diversity
        2.2. Description of linguistic diversity
        2.3. Inference of the history underlying genetic diversity
        2.4. Inference of the history underlying linguistic diversity
    3. Studies of genetic and linguistic coevolution
    4. Towards a joint analytical framework for genetic and linguistic diversity
        4.1. Inference of the history of genetic populations and linguistic varieties
        4.2. Exploration of a "population linguistics"
        4.3. Construction of an interface between population linguistics and population genetics
        4.4. Sampling and analysis of linguistic data from the Cape Verde Islands

Chapter I – Genetic and linguistic histories in Central Asia inferred using Approximate Bayesian Computation
    Abstract
    1. Introduction
    2. Material
        2.1. Genetic data
        2.2. Linguistic data
    3. Methods
        3.1. Genetic and Linguistic Dissimilarities among Populations
        3.2. Approximate Bayesian Computation (ABC)
    4. Results
        4.1. Central Asian linguistic and genetic structures
        4.2. Model selection and parameter estimations for the UZA population
        4.3. Model selection and parameters estimation for the TJY population
    5. Discussion
        5.1. Two different linguistic and genetic historical admixture for the Soj-Mahalla Uzbek-speakers
        5.2. Stronger genetic than linguistic isolation in the Tadjikistan Yagnob speakers
        5.3. Conclusions and Perspectives

Chapter II – Inferring linguistic transmission between generations at the scale of individuals
    1. Introduction
    2. Models
        2.1. Production of utterances
        2.2. Four models of acquisition of a new language
        2.3. Historical scenario
    3. Materials
    4. Analyses
        4.1. Simulations
        4.2. Summary statistics
        4.3. Model selection
        4.4. Parameters estimation
    5. Results
        5.1. Model selection
        5.2. Parameter estimation
    6. Discussion

Chapter III – Building a formalised interface between population genetics and population linguistics
    Introduction
    Part 1 – Formalising genetic and linguistic coevolution
        1. A formalisation of biological evolution
        2. A formalisation of linguistic evolution
        3. Modalities of linguistic communications at the scale of the individuals
        4. Coupling the reproductive and the communication networks
    Part 2 – Inferring genetic and linguistic histories
        1. Modelling
            1.1. Sampling
            1.2. Genetic model
            1.3. Linguistic model
            1.4. Parameters
            1.5. Summary statistics
            1.6. Simulations and model selection
        2. Results
            2.1. Should the individuals be considered as copiers, probabilistic copiers, or Bayesian learners?
            2.2. Are the mutation rates different between the linguistic classes?
            2.3. Are the sizes of the genetic and the linguistic populations different?
            2.4. Do the sampled individuals belong to genetically and/or linguistically differentiated populations?
            2.5. What is the tree topology of three populations?
    Discussion

Chapter IV – Sampling and describing linguistic data from Cape Verdean Kriolu
    Introduction
    Data sampling
    Descriptive analyses
    Discussion

Conclusion
Bibliography
APPENDIX: Supplementary informations on the Approximate Bayesian Computation procedures
    1. Linguistic Model
    2. Prior distributions for the linguistic model parameters
    3. Genetic model
    4. Prior distributions for the genetic model parameters
    5. Summary Statistics
    6. Scenarios selection using random forest (RF)
    7. Parameters estimation using neural networks (NN)
    8. Cross-validation and posterior probabilities in the UZA case
    9. Cross-validation and posterior probabilities in the TJY case
List of figures
Figure I.1 – Geographical distribution of the 21 populations and linguistic varieties under study
Figure I.2 – Five competing scenarios for the origin of the UZA population
Figure I.3 – Neighbour-joining trees based on (a) the linguistic distances matrix and (b) the pairwise FST matrix
Figure I.4 – ABC Analyses for the UZA population
Figure II.1 – Four models of linguistic transmission between generations
Figure II.2 – Historical scenario
Figure II.3 – Geographical distribution of the 10 sampled units under study
Figure II.4 – Confusion matrices from the out-of-bag cross-validation analysis of the four models
Figure III.1 – Structure of the reproduction relationship in human species
Figure III.2 – Classic representation of the setting up of the reproduction network
Figure III.3 – Alternative representation of the reproductive network
Figure III.4 – Structure of the linguistic communication relationship in human
Figure III.5 – Representation of the linguistic communication network
Figure III.6 – Representation of the setting up of the reproduction network as well as the linguistic communication network
Figure III.7 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 2
Figure III.8 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 3
Figure III.9 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 4
Figure III.10 – Description of the three models of individual grammars
Figure III.11 – Description of the three scenarios of different size of the genetic and the linguistic population
Figure III.12 – Description of the four scenarios of genetic and/or linguistic population differentiation
Figure III.13 – Description of the three scenarios of historic topologies
Figure IV.1 – Geographical distribution of the 19 sampling localities under study in Cape Verde
Figure IV.2 – MCA of the 84 individuals sampled in Cape Verde
Figure IV.3 – Representation of the weight of the 50 words in the MCA
Figure IV.4 – Neighbour-joining trees based on the linguistic distances matrix
Figure S1 – Models of linguistic evolution
Figure S2 – Models of genetic evolution
Figure S3 – Two competing scenarios of linguistic and genetic origin of the Yagnob speaking population
Figure S4 – Pairwise FST matrix (a) and linguistic distances matrix (b)
Figure S5 – Neighbour-joining trees based on the pairwise (δμ)² matrix
Figure S6 – Principal Component Analysis of the Manhattan distances
Figure S7 – Principal Component Analysis of the pairwise FST distance matrix
Figure S8 – An example of computation of the linguistic summary statistics, over three linguistic varieties and six meanings
Figure S9 – PCA performed in the UZA case over one observed summary statistics set
Figure S10 – PCA performed in the TJY case over one observed summary statistics set
Figure S11 – Analysis of the TJY population history
Figure S12 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the UZA population (admixture model)
Figure S13 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the UZA population (admixture model)
Figure S14 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (isolation model)
Figure S15 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (non-isolation model)
Figure S16 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the TJY population (isolation model)
Figure S17 – Density of the distribution of the posterior probabilities
List of tables
Table II.1 – Summary of the prior distributions of the parameters for the four models
Table II.2 – Proportion of votes for the four models of linguistic evolution, and the posterior probability of the Social model
Table II.3 – Summary of the posterior distributions of the parameters, assuming a Sexual2 scenario
Table II.4 – Summary of the posterior distributions of the parameters, assuming a Social scenario
Table III.1 – Cross-validation results aiming at assessing a priori distinctions between three models of individual grammars
Table III.2 – Cross-validation results aiming at assessing a priori distinctions between two models of the mutation of the linguistic variants
Table III.3 – Cross-validation results aiming at assessing a priori distinctions between three models described Figure III.11
Table III.4 – Cross-validation results aiming at assessing a priori distinctions between four scenarios described Figure III.12
Table III.5 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described Figure III.13
Table III.6 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described Figure III.13
Table III.7 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described Figure III.13
Table IV.1 – List of meanings extracted from the Swadesh list
Table S1 – Information table for the 21 studied Central Asian populations
Table S2 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E)
Table S3 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E)
Table S4 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of isolation of the TJY population (scenario 2)
Table S5 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of isolation of the TJY population (scenario 1)
Table S6 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of no-isolation of the TJY population (scenario 2)
Table S7 – Linguistic cross-validation of the UZA case
Table S8 – Genetic cross-validation of the UZA case
Table S9 – Linguistic cross-validation of the TJY case
Table S10 – Genetic cross-validation of the TJY case
Table S11 – Quantiles of the distributions of the posterior probabilities computed over the triplets supporting the most probable scenario
Foreword
This thesis is entitled "Inférer l'histoire des populations humaines à partir
des diversités génétiques et linguistiques" ("Inferring the history of human populations
from genetic and linguistic diversity"). How is this title understood by a
geneticist or by a linguist? Is it possible to propose a title that removes all
ambiguity of meaning for one discipline as well as the other? As for the
content of this work, must it meet all the requirements of the different
disciplines, when these requirements sometimes seem highly contradictory?
These are among the questions that inevitably arise,
sometimes abruptly, in the course of work at the boundary of several disciplines. A
choice must then be made: either anchor oneself in a single discipline and collaborate
with other disciplines through exchanges of services, or take on
these complex issues in an attempt to build a common interdisciplinary
framework. The second option was chosen for this work. This foreword
presents the reasons that led me to this choice, as well as the way in which it was
put into practice over the course of my thesis.
The aim is to give an account of the reasoning constructed to
consider the diversity of concepts stemming from both genetics and linguistics.
The third disciplinary field drawn on in this work is epistemology,
sometimes called philosophy of science. This field makes it possible to
understand how the diversity of reasoning is constructed. It also makes it possible,
I hope, to apply the greatest possible intellectual rigour. I have
chosen to set out from the start the presuppositions underlying the whole of this work,
so that the reader may understand it as well as possible. Such
intellectual disclosure leads to showing what will probably be identified as
limits of the reasoning. Yet is a work of clarification anything other
than a bringing to light of the limits of an object?
In this text I will try to clarify the problems of scientific communication in order to address the questions specific to interdisciplinarity. I thereby hope to lay the foundations for a reflection that will allow me to articulate the concepts of the different disciplines involved, so as to respect the differences of each discipline and to engage in an effective, respectful and productive dialogue.
Communication, linguistics and epistemology
One of the first difficulties to emerge in interdisciplinary work concerns mutual understanding between disciplines (Holbrook, 2013). What makes sense to a geneticist does not necessarily make sense to a linguist, and vice versa. Effective communication between disciplines is always a process under construction and is never fully achieved.
I have chosen to address the question of interdisciplinary communication using a variety of tools offered by different schools of linguistics and epistemology. These disciplines are particularly well suited to grasping the articulation between questions of knowledge and questions of communication, which lies at the heart of the problems of interdisciplinarity. They are very demanding disciplines when brought to bear on the problems detailed here, because they take on a recursive character: they are sciences that talk about sciences. Their discourse therefore applies, among other things, to themselves. This back-and-forth movement makes it possible to build an analysis of the problem while taking the step back needed to evaluate the tools we use to analyse it. We invite the reader to be aware of the levels that this recursion implies, by trying to understand the logic of this text with the tools provided.
Discourse, statement, meaning
Scientific output constructs discourses (Maingueneau, 1979) through articles, thesis manuscripts, seminars, conferences, books, lectures, talks to the general public, informal discussions between colleagues, and so on. A first difficulty of interdisciplinary work is to try to understand the diversity of discourses of the disciplinary fields involved, and how these fields might propose a common discourse.
For modern linguistics and the philosophy of science, the minimal meaning-bearing element within a discourse is the statement. The following sentence, taken from the abstract of this thesis, is an example of a statement: "Genetic and linguistic data from several Central Asian populations were thus analysed, showing that the history of genetic populations can sometimes differ from the history of the linguistic varieties spoken by these populations." How is the meaning of this statement constructed? Two approaches, which are not a priori incompatible (Nguyên-Duy and Luckerhoff, 2006), can be envisaged: the analytic approach and the anthropological approach, each of which delimits its own line of answer.
Analytic perspective and anthropological perspective
An analytic way of understanding how the meaning of a scientific statement is constructed is to refer to the conditions under which that statement is true (Davidson, 1967). Conceiving of statements in this way makes it possible to determine whether they are meaningless, whether two statements have equivalent meaning, whether a statement is necessarily true, and so on. This theoretical framework makes it possible to clarify statements so that their meaning is as unambiguous as possible.
The wish to understand the meaning of statements without ambiguity is a project pursued by analytic philosophy, heir to the logical empiricism of the first half of the 20th century. This project attempts to clarify language, and scientific language in particular. The aim is to avoid imprecision and to bring out the meaning of statements as rigorously as possible with the help of formal logic. This view rests on an aim of logical rigour and on confidence in the capacities of language. It has notably been considered as a way of formalising interdisciplinary communication through the translation of statements from one disciplinary language into another (Davidson, 1973). This work then served as the basis for a theory of the integration of disciplinary languages, with the aim of generating a shared understanding of a single scientific object (Klein, 2013). However, the practice of interdisciplinarity, and the communication difficulties observed to accompany it, have led to considering alternatives to the integration of disciplinary languages alone (Holbrook, 2013).
Understanding how the meaning of a statement is constructed from an anthropological perspective consists in analysing the implicit presuppositions and the context in which the statement is produced. On this view, the meaning of a statement is tied to each communication situation and to each speaker (Calame, 1986). Meaning is no longer inscribed in the statement itself: it is constructed by the addressee according to the context (Preyer and Peters, 2005). This theoretical framework has important consequences for how communication is conceived: a discourse does not offer itself up to direct understanding; it must always be interpreted by an addressee. Thus, an anthropology of language practices (Bornand and Leguy, 2013) is particularly concerned with the conditions that determine how statements are interpreted in context.
Conventions of disciplinary languages
In the remainder of this foreword, we will look in more detail at the analytic and anthropological perspectives, in order to determine the conditions for effective interdisciplinary communication.
For the founder of modern linguistics, Ferdinand de Saussure, the possibility of communicating is ensured by the linguistic conventions shared by the different speakers (Saussure, 1916). In other words, two speakers can communicate effectively only if they share a whole set of language conventions. These conventions make it possible to attach a signifier (the French word "population", for example) to a signified (the idea of population, with what it may imply in population genetics). But this association is not obvious a priori; on the contrary, it is arbitrary. Speakers learn it gradually as they learn a particular language. Most often, language conventions "go without saying", more or less explicitly and more or less consciously, in conversation.
Several closely related signifieds may exist for a single signifier within a language; this is called polysemy. For example, the French word "population" can refer, among other things, to the idea of a group of individuals belonging to the same species and present in the same habitat, or to the idea of a group of individuals of the same species that reproduce preferentially with one another (Debouzie, 1999). This polysemy is a major source of problems of mutual understanding between speakers of the same language: one may have the illusion of talking about the same thing, only to realise in the end that our respective conceptions differ in a number of ways. As regards the subdisciplines of genetics, the concept of "gene", central though it is, is highly polysemous (Gayon, 2004). A gene can be taken to be, among other things, a coding sequence, or a coding sequence associated with a regulatory sequence, or an element that is transmitted hereditarily and has a causal influence on the phenotype, or other things besides. This strong polysemy seems to empty the term of its conceptual content, but J. Gayon notes that:
"Nevertheless, the term 'gene' persists. Several explanations are possible. First, there is a pragmatic reason. Scientists, like all human beings, need words to communicate with one another. For this purpose, approximate terms are often more useful than terms defined with perfect precision. Terms that are too precise restrict the space of communication. The term 'gene', with its present ambiguity, plays an important role in this respect: it allows scientists from different disciplines (biochemists, molecular biologists, population geneticists, specialists in medical genetics, etc.) to understand one another with a reasonable degree of approximation. Moreover, scientists also need to place themselves within traditions of thought and to engage in dialogue with their teachers and predecessors. Since 1900, a number of concepts of the gene have succeeded one another, and they overlap only in part. Their descriptive content, that is, the classes of observables to which they refer, coincides only partially. There is no dictionary that would allow the various concepts of the gene to be translated into one another in a general way and without equivocation. No linguistic dictionary can ever do that, for that matter. However, on a case-by-case basis, it is possible to translate statements that use different concepts of the gene into one another."
Polysemy can also affect disciplines that are a priori far apart. The statement "there is a phenomenon of drift" can mean something different to a population geneticist and to a linguist. For a geneticist, the process of drift corresponds rather to the change in the frequencies of a set of objects (alleles, for example) through the effect of chance. For a linguist, the process of drift corresponds rather to the change of several objects (different languages, for example) in a particular common direction, under structural constraints. Although these two concepts of "drift" share similarities, they are not equivalent. Even so, nothing distinguishes them a priori except the context in which they are used (Kasavin, 2009). This context, combined with the language conventions of the speakers who encounter the term, is what allows them to interpret it correctly. A geneticist reading this statement in a linguistics publication, for example, and unfamiliar with that discipline's conventions regarding the term, might interpret the statement according to their own presuppositions and arrive at a misinterpretation of what the author of the article means. Likewise, a linguist unfamiliar with the conventions of genetics who reads this statement in a genetics publication runs the risk of a misinterpretation.
The sharing of a larger or smaller set of language conventions by a group of individuals constitutes a speech community (Labov, 1972). A speech community implicitly adopts a set of rules, often unconscious, which behave in conversation like the rules of a game (Wittgenstein, 1953). The members of a community can thus communicate with one another without having to keep specifying the meaning of the statements they produce.
The statements of disciplinary languages therefore do not come with the rules for decoding them, even when all the terms appear to be rigorously defined¹. The conventions must be learned little by little by speakers within communities so that they can understand the meaning of statements. Universities allow students to learn the language conventions of their disciplinary languages. Definitions, examples, readings of articles, exercises, projects, discussions and assessments are all communication situations that produce statements. They allow speakers to infer the community's language conventions little by little. An established researcher has a fine command of the language conventions of their discipline when they are able to communicate effectively with their peers.
It should be added straight away that the conventions of scientific communication are not constructed independently of other language conventions. On the contrary, scientific discourses are rooted in common sense, in a wider cultural and social context (Bonfils, 1990; Delamotte, 2004). We are then faced with a mixing of genres: disciplinary languages rely in part on the common sense shared more widely by a culture, while following certain rules that are specific to them.
Implicit theoretical presuppositions
Scientific disciplines have the particularity of having to account for the objects of the world that they set out to study. Here we distinguish between the objects of the world, which are external to scientists, and the objects of the disciplines, which are constructed in order to account for the objects of the world. The knowledge-building activity of scientific disciplines can be seen as an attempt to describe the world as rigorously as possible. The many sources of linguistic uncertainty mentioned above therefore weigh heavily on the whole scientific field, hence the importance for scientific communities of specifying the terms they use as precisely as possible and of clarifying the statements they produce. It is by constructing objects of study ("genes", "populations", "mutations", "selection pressures") that disciplines try to come as close as possible to the objects of the world. Although these linguistic constructions are sometimes very vague, scientists can rely on their presuppositions and on the context in which these terms are used to infer the meaning of the statements in which they appear.
1. Indeed, one solution might be to define all the terms, so as to generate the meaning of statements from the more elementary components that are words. However, definitions are possible paraphrases that call on other terms, which are themselves very often polysemous. These terms are then themselves defined, by definitions that refer to yet other definitions, and so on in a circular fashion (Amiel, 2010). A dictionary is thus built as a set of words whose definitions refer to other words, which, within the network of interlocking definitions, necessarily end up looping into a closed system. Definitions are therefore useful, but only relatively so, since their usefulness depends on what is implicitly taken for granted in language. They thus provide a means of access to meaning by supplying a chosen example of a statement, one that tries to carry the broadest possible meaning.
Thus, the meaning of scientific statements depends largely on the knowledge and beliefs shared by speakers. For example, a geneticist's use and understanding of the word "epigenetic" depends directly on their own knowledge. This knowledge is made up of hypotheses, approximations and theories that often go unsaid. The set of knowledge and beliefs shared by a speech community forms what can be called a linguistic common ground ("common ground" in Resnick et al., 1991).
When disciplinary languages produce statements about the world, they rely on implicit theoretical presuppositions that correspond to different views of the world². Computational historical linguistics (Bowern and Atkinson, 2012; Gray and Atkinson, 2002; Gray et al., 2009), by applying phylogenetic methods to linguistic data, tacitly deploys a whole field of particular hypotheses about the nature of linguistic objects. The elements considered (phonemes, words, syntax, etc.) are seen as structurally independent of one another (like genes) and as directly comparable with those of other languages. This postulate, among others, is not problematic in itself, because it reflects the way in which computational historical linguistics constructs its object. It is foundational for this discipline, because it is partly what allows it to produce a relevant discourse on the objects of the world that it sets out to study, namely the history of the languages made up of these independent elements.
2. The reader may be puzzled by such a view of science. It is as if disciplinary communities constructed imaginary worlds independent of one another, and as if the members of these communities talked among themselves as in a shared fiction. This would be to forget that the objects of the world impose themselves on us forcefully: we do not construct the objects of our disciplines independently of the objects of the world. Our disciplinary objects are constructed with the aim of actually accounting for a particular facet of the objects of the world, which impose themselves on us in practice. The population geneticist whose aim is to infer history constructs the concepts of "locus" and "gene" in order to account in practice for the history and structure of populations. The molecular geneticist constructs the concept of "gene" in order to account in practice for the physico-chemical properties of DNA and the microscopic mechanisms that affect it.
Disciplinary languages thus rest on different sets of theoretical presuppositions, each approaching its objects from a particular angle. A hypothesis of "random mutation" may seem strange to a structuralist linguist, for whom linguistic changes occur according to very precise rules. The notion of "ethnic group" (Balazs, 1993; Huang et al., 2015) attached to groups of individuals in population genetics may seem strange to an ethnologist, for whom a central question is how identities are constructed in relation to the other rather than from the properties of the individuals themselves. But whatever the discipline, each constructs its concepts in order to account, pertinently, for an object of the world. A large part of the difficulty of the interdisciplinary project lies in the boundaries between the objects that each discipline constructs, and in how they relate to the boundaries of the objects of the world.
Within this framework, how can a dialogue between disciplines be opened? How, moreover, can concepts be made to communicate despite these obstacles, which are often invisible at first sight?
Towards interdisciplinary communication
A first avenue for trying to produce rigorous interdisciplinary work follows from all the points made so far about disciplinary languages and knowledge. Given the influence that differences in the conventions of disciplinary languages have on communication, bringing them to light is of prime necessity. Interdisciplinary work thus involves clarifying as far as possible the discourses and what they construct, both explicitly and implicitly, so that concepts can be articulated as rigorously as possible. Clarifying the discourse makes it possible to specify the theoretical basis on which knowledge is deployed in each discipline involved. Omitting this step risks leading into one of the many traps of language. Nevertheless, bringing disciplinary presuppositions to light is a perilous task, because making the limits of a scientific field explicit can prove uncomfortable for the practitioners of that field.
Moreover, since disciplinary discourses take on their meaning through varied language conventions whose theoretical presuppositions are sometimes disjoint, each disciplinary field involved has to be learned anew, as a second mother tongue, by practitioners of interdisciplinarity (MacIntyre, 1988). This learning is an essential prerequisite for the interdisciplinary approach. Indeed, the risk of a partial mastery of a discipline is that its theoretical foundations are mutilated when the disciplines involved are brought face to face. There is a great risk of imposing the presuppositions of one discipline on another through ignorance of their differences. It is out of a concern for responsibility towards what constitutes the disciplines that learning them de novo proves necessary.
Finally, as opposed to recognising the differences on which disciplinary languages are founded, it can be tempting to subscribe to the project of making them uniform, in order to have a single language. However, this uniformisation implies a normalisation of the different possible relationships to the objects of the world studied by the different scientific disciplines. Faced with this ideal of uniform knowledge, it seems to me that interdisciplinarity requires, on the contrary, acknowledging the constitutive differences between disciplines, which are irreducible to one another. The project must then take these constitutive differences into account, with an aim of unity (and not of uniformity), as F. Alvarez-Péreyre (2003) puts it:
"The point is that, in the relationship between the particular and the universal, as the human and social sciences strive to treat it, two paths present themselves. One consists in putting objects in perspective. It rests on two implicit assumptions. On the one hand, the conviction that these objects are readily comparable. On the other hand, the conviction (the illusion, against all daily evidence to the contrary) that scientific discourses are closely related, unambiguous and non-conflictual.
The other path that presents itself consists in supposing that objects have no naturalness and that scientific discourses are the expression of incessant constructions. Making room for the interdisciplinary requirement also means trying to work in this direction."
To this end, it seems to me that the interdisciplinary enterprise benefits from drawing on the diversity of work arising from the analytic and anthropological perspectives³. We have seen that these perspectives offer very useful tools for dealing with the difficulties caused by the diversity of scientific practices and discourses. Recognising the constructed character of the sciences and of their discourses, in its various dimensions, thus leads us to put the project of a uniform language into perspective, at least in the context of interdisciplinary work.
Interdisciplinarity is therefore a process in motion, a practice, and not a state of affairs. The rest of this thesis was gradually designed with this in mind, seeking to clarify disciplinary boundaries as far as possible, to bring to light as far as possible the linguistic and epistemological presuppositions, and to question what is taken for granted in scientific practice. This process demands work that is sometimes laborious, because it often goes unheard with respect to the expectations of each discipline. Added to this difficulty is an institutional rejection of such a practice of interdisciplinarity, as Bühlera et al. (2012) indicate:
"For most institutions, interdisciplinarity is conceivable only if it does not call the foundations of the disciplines into question, or better, if it consolidates the disciplines in place: '[the] practice of transdisciplinarity requires, on the contrary, the constant strengthening of the "hard core" [of the different disciplines]' (CNRS, 2002, p. 13)."
Might avoiding any questioning of what the disciplines take for granted itself be an assumption deeply rooted in our disciplines?
3. We have drawn on several disciplines throughout this foreword in order to build an interdisciplinary framework that we hope is rigorous. Yet the differences between these disciplinary fields bring us back to the questions of interdisciplinarity detailed above: how best to articulate these meta-disciplines with one another, when each rests on presuppositions that are at first sight incompatible with the others? A certain vertigo may set in for anyone not used to applying reasoning to itself. We will note that the recursive nature of our work, mentioned at the beginning of this foreword, is for us the key to a possible analysis of the analysis, opening the way to an interdisciplinarity of interdisciplinarity. This certainly goes beyond the scope of this thesis.
Introduction
The languages we speak and the genes we carry are the legacy of a rich and complex history. No two people speak in exactly the same way, just as no two people have exactly identical genomes. This wide genetic and linguistic diversity is partly the result of the history of human populations across the world. How can we glimpse this history by studying only the human diversity present today?
Reasoning by historical inference is a way of studying past events and processes. It is through the observation and analysis of contemporary data that inference allows us to reconstruct the history of human populations. This type of reasoning, which is based on induction (Nagel, 1961; Sagaut, 2008), requires proposing a series of possible scenarios or models⁴ and evaluating them in the light of real data. The scenario or scenarios judged most likely to have given rise to the observed data are considered relevant representatives of past events. One then speaks of the refutation (or "falsification") of the least relevant scenarios. This type of reasoning requires a common theoretical basis, together with several competing ramifications, so as to arbitrate between them and eliminate the least plausible (Lakatos, 1976). Scenarios that are not refuted are retained until other competitors enter the fray.
Over the history of science, observation of present-day diversity in a wide variety of traits has made it possible to infer the history of human populations. Linguistic diversity was notably used as early as the mid-19th century by Schleicher to infer the hypothetical origin of the Indo-European peoples (Schleicher, 1853), and by Čelakovský in the study of the Slavic languages (Čelakovský, 1853). Morphological diversity was also used very early by Haeckel, with the aim of determining the place of humanity within the evolutionary history of living species (Haeckel, 1874). Genetic diversity, for its part, was brought in later, from the second half of the 20th century (Cavalli-Sforza et al., 1964), and has been used more and more following joint progress in genome sequencing and computing (Reich et al., 2009). Historical inferences using genetic data have, for example, provided decisive arguments concerning the dispersal of Homo sapiens out of Africa (Cann, 2001). The use of linguistic diversity, benefiting from developments in phylogenetic methods, has recently made it possible to reconstruct the history of the settlement of the Austronesian islands and to indicate their probably Taiwanese origin (Gray et al., 2009). Other kinds of cultural diversity have been used more recently to make historical inferences, likewise drawing on methods developed in phylogenetics. Material-culture datasets describing the characteristics of Polynesian canoes thus made it possible to establish a probable Fijian origin for the populations that colonised these islands (Rogers et al., 2009). The study of the diversity of folktale structures made it possible to establish the relatedness of two major groups of variants of "Little Red Riding Hood" across the world (Tehrani, 2013). More recently still, the study of musical diversity established the significantly vertical character (from one generation to the next) of the transmission of musical traits in Gabon (Le Bomin et al., 2016).
4. The word "model" recurs many times in scientific discourse, including in this thesis. We invite the reader to be wary of this word, because it is widely used yet highly polysemous. For a historical overview of the different possible meanings of the word, see for example Suppes (1961). We will favour the word "model" to designate a set of hypotheses (as in a "mutational model", for example), and the word "scenario" to designate a set of specifically historical hypotheses (as in a "scenario of demographic growth", for example).
The use and development of mathematical methods for making inferences have benefited greatly from the continuous growth of computing capacity (Beaumont and Rannala, 2004). In population genetics in particular, modern sequencing techniques generate large volumes of data that the power of present-day computers is able to process. The human genome comprises about three billion base pairs, and the sequencing of complete genomes is now accessible via modern "next-generation sequencing" techniques. These data can provide relevant information for historical reconstruction, but taking them into account requires large storage and computing capacities.
The aggregated contemporary knowledge about human populations suggests that their histories are very often made up of complex events, beyond a simple filiation between populations: increases or decreases in size, bottlenecks, migration or admixture events, and so on. The development of statistical methods of historical inference, benefiting from the growth in computing capacity mentioned above, makes it possible to handle increasingly complex scenarios (Robinson et al., 2014). Methods based on a large number of computer simulations have become possible, allowing large datasets to be evaluated under arbitrarily complex scenarios. They are thus increasingly used to make historical inferences in population genetics, by means of approximate Bayesian computation (ABC; Beaumont et al., 2002; Tavaré et al., 1997).
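As a rough illustration of the simulation-based logic behind ABC, the sketch below implements its simplest rejection variant in Python on a toy problem. Everything in it is an assumption made for illustration and is not taken from this thesis: the simulator (Gaussian draws with unknown mean `theta`), the uniform prior, the summary statistic (the sample mean), the tolerance, and all function names. Parameter values are drawn from the prior, data are simulated under each value, and only the values whose simulated summary statistic falls close to the observed one are kept as an approximate posterior sample.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (assumption): n Gaussian draws with mean theta, unit variance."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic (assumption): the sample mean."""
    return statistics.fmean(data)


def abc_rejection(observed, prior_low, prior_high, n_sims, tolerance, rng):
    """Keep prior draws whose simulated summary lies within `tolerance` of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(prior_low, prior_high)  # draw from the uniform prior
        s_sim = summary(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= tolerance:  # rejection step
            accepted.append(theta)
    return accepted


rng = random.Random(42)  # fixed seed so the sketch is reproducible
observed = simulate(2.0, 50, rng)  # pseudo-observed data with a known "true" value
posterior = abc_rejection(observed, 0.0, 5.0, 5000, 0.1, rng)
posterior_mean = statistics.fmean(posterior)
print(len(posterior), round(posterior_mean, 2))
```

In an actual study the toy simulator would be replaced by a simulator of genetic or linguistic data under a given historical scenario, and the single summary statistic by a vector of statistics. The plain rejection step is also often refined, for instance by the regression adjustment proposed by Beaumont et al. (2002).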
Nevertheless, as we stressed earlier, methods of historical inference always presuppose a set of prior theoretical hypotheses, whatever their computational and statistical refinements. What, then, are the theoretical bases of inference methods in genetics and in linguistics? What implicit presuppositions accompany the use of these methods? Are these presuppositions justified?
In the remainder of this introduction, I detail the methodological chain that
starts from the field observation of genetic or linguistic data, passes through
graphical and statistical descriptions, and ends with historical inference proper. I
then review historical inference studies that couple genetic and linguistic diversity,
before presenting my own work.
1. Construction and observation of the object
1.1. Observation of genetic diversity
Contrary to what the term “data” (from the Latin for “given”) might suggest, the
idea that reality is simply “given” to the observer is a tenacious myth (Sellars, 1956),
notably in population genetics. A series of observations is first constructed through a
prior theoretical scaffolding, whether this scaffolding is built on scientific
assumptions or on common-sense assumptions (see the foreword). The steps that lead
population geneticists to infer the history of human populations rest on a varied body
of biological knowledge accumulated over the history of science. This knowledge
concerns the complexity of the biology of organisms and includes the mechanisms of
mitosis and meiosis, the mechanisms of cellular fertilization, and those of DNA
replication (Chakravarti, 1999). It is by taking into account this vast theoretical
architecture presupposed by their work that population geneticists can offer a
coherent account in which everything holds together (Murphy and Medin, 1985).
The formalization of the notion of population likewise sets the conditions for a
sampling campaign relevant to research in population genetics. The variations around
a single family resemblance (Wittgenstein, 1953) depend in part on differences in the
objectives that the sub-disciplines concerned set themselves a priori. For example,
ecologists tend to define populations in terms of demographic cohesion, since they are
mainly interested in interactions among individuals. Population geneticists, for their
part, tend to define populations in terms of reproductive cohesion, since they are
mainly interested in genetic transmission across generations. This polysemy results in
a certain semantic vagueness about the assumptions behind the different uses of the
term, which prevents settling on a single agreed definition (Waples and Gaggiotti,
2006).
Moreover, each of these disciplines is itself internally polysemous: the concept
of population is often used informally in a variety of senses (Debouzie, 1999; Hartl
and Clark, 2007). In population genetics in particular, it may refer to a group of
individuals of the same species sharing the same habitat (Lefevre et al., 2016), to a
group of individuals among which mating occurs at random, in a “panmictic” manner
(Jobling et al., 2003), or to a gene “pool” whose composition is liable to evolve
(Henry and Gouyon, 1999).
The same vagueness applies to the notion of “gene”, which is also central to
population genetics (Gayon, 2004). A “gene” may refer to a hereditary entity that
causally controls the production of a phenotype, to a transcribed DNA sequence, to a
coding sequence associated with a regulatory sequence, or to any variable portion of
the genome.
Be that as it may, population genetics is built around these two key notions
(“gene” and “population”), which guide the day-to-day thinking and practice of
population geneticists according to their own research questions. The design of a
sampling campaign is thus conditioned by the particular sense given to these terms
within this polysemy. Studies focused on the human species may place the notion of
“ethnic groups” at the core of their sampling structure (Balazs, 1993; Heyer et al.,
2009; Huang et al., 2015), whereas health-focused studies may place the notion of
“patient population” at the core of theirs (Khuri et al., 2007; McDonald et al., 2003).
Neighbouring branches of the discipline, such as phylogenetics, instead place other
notions at the centre of their theoretical constructions, such as “genealogical
relationships” (between sequences, individuals, or species), which leads them to
design different sampling campaigns (Wiley and Lieberman, 2011).
After the sample collection campaign, the knowledge and techniques accumulated
in biochemistry and cell biology make it possible to carry out laboratory protocols for
DNA extraction (Aljanabi and Martinez, 1997), amplification by PCR (Mullis and
Faloona, 1987), and sequencing (Mardis, 2008), yielding a computer file that contains,
for each sequenced DNA, a string of the letters A, T, G and C. It is the theoretical and
practical knowledge of the chain running from the sample collected in the field to the
sequencing file that ultimately gives meaning to the data sets available in digital form,
and allows hypotheses to be proposed about the sampled populations.
1.2. Observation of linguistic diversity
As in genetics, linguistic reality is not simply given. Linguistic data must be
constructed through a sampling protocol, by interpreting a set of utterances. How a
linguist carries out the sampling protocol depends on his or her assumptions about the
nature of language. A central concept in linguistics is that of “language” (or “lect” at
a finer scale); yet linguistic objects differ greatly in nature depending on the
fieldwork undertaken and on the disciplinary schools of the linguists collecting the
data.
The great semantic diversity attached to this notion makes it difficult to
clarify, and its meaning must be assessed case by case (Pateman, 1983). Unlike
population genetics, linguistics unfolds across a diversity of analytical paradigms:
philological, structural, strategic and neural (Alvarez-Pereyre, 2014), each
representing a different way of constructing the linguistic object.
The philological paradigm, historically the oldest, accounts for the change of
linguistic elements over time and for their geographical variation. The work of
William Jones, with his comparison of Sanskrit, Ancient Greek and Latin (Jones, 1786),
already belongs to this paradigm. The diachronic lens (concerned with unfolding over
time) makes it possible to establish historical branchings, diffusion phenomena, and
the factors of geographical and temporal variation. Languages are regarded as
aggregations of elementary units, the history of each unit being open to independent
study. This paradigm is currently enjoying a revival brought about by methods
developed in phylogenetics (Atkinson and Gray, 2005), the two fields cross-fertilizing
extensively and sharing a common theoretical basis described as “evolutionist” and
“diffusionist”. The analogy between linguistic units and genetic units is the source of
many methodological and conceptual exchanges between these disciplines (Pagel,
2009).
The structural paradigm is rooted in the work of Ferdinand de Saussure, through
the posthumous transcription of his Course in General Linguistics (Saussure, 1916),
and then in the work of the Russian linguists Nicolaï Troubetzkoy and Roman
Jakobson. Linguists working in this paradigm focus instead on understanding the
structure of sign systems at a given moment. The synchronic lens, in contrast to the
diachronic lens of philology, is a means of establishing the systemic relationships
within a particular language. The structural paradigm thus accounts for how the
elements of a linguistic system are articulated as a function of its internal
constraints. Languages are here regarded as systems in which each part has meaning
only relative to the others. The synchronic character of the structuralist school does
not preclude taking into account linguistic change over time, with approaches that
differ among authors (Verleyen, 2007).
The strategic paradigm reverses the structural perspective. Emphasis is placed
on the role of external parameters (cultural, social, environmental, and so on) acting
on a linguistic system (Bornand and Leguy, 2013). The tools of this paradigm bring to
light the diversity and influence of extra-linguistic parameters, making it possible to
spell out the more or less conscious dynamics reflected in language. Languages are
here regarded as human practices embedded in much broader cultural systems.
The neural paradigm, for its part, posits a central nervous system that
constrains behavioural functions, in particular those related to language (Chomsky,
2006). Adopting a cognitive perspective brings to light the links between nervous,
psychological and physiological structures on the one hand, and the perception,
processing and production of speech on the other. Languages are thus regarded here as
the realization of pre-existing cognitive constraints.
Each of these four paradigms captures a particular dimension of languages,
focusing on one of their aspects according to its own objectives. It should be noted
that a broad polysemy around the concept of language also exists within each of these
paradigms.
1.3. Philological positioning
My thesis work is positioned in the philological tradition inherited from the
work of Morris Swadesh, which is conceptually very close to phylogenetics. With the
aim of reconstructing the history of languages, Swadesh developed glottochronology
(1952). The principle is to assess the extent to which words sharing the same
etymological root, that is, descending from the same ancestral word, are found from
one language to another. Words with the same meaning and the same etymological
root are called cognates. Standard lists of 100 or 200 words are drawn up from the
vocabulary common to most of the world's languages so that languages can be
compared with one another. The words making up these Swadesh lists are chosen to be
universal and resistant to borrowing. The number of cognates that differ between two
languages then makes it possible to compute the date of divergence between these
languages (Lees, 1953) and to build a tree of linguistic history analogous to
phylogenetic trees. Such trees could thus be called “glossogenetic trees” (Fitch,
2008).
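The classical dating step can be sketched as follows. This is a minimal illustration,
assuming the standard glottochronological formula t = ln(c) / (2 ln(r)), where c is the
proportion of meanings for which the two languages share a cognate and r is the assumed
proportion of the list retained per millennium. The function name and the default
r = 0.86 (a value often quoted for the 100-word list) are assumptions made for
illustration, not values taken from this thesis.

```python
import math


def divergence_time(shared_cognates, list_size, retention_rate=0.86):
    """Classical glottochronological estimate of divergence time, in millennia.

    shared_cognates: number of meanings for which both languages use cognate words
    list_size: number of meanings compared (e.g. 100 or 200)
    retention_rate: assumed fraction of the list retained per millennium
                    (hypothetical default, to be set by the user)
    """
    if not 0 < shared_cognates <= list_size:
        raise ValueError("shared_cognates must be in (0, list_size]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    c = shared_cognates / list_size
    # Both lineages lose vocabulary independently, hence the factor 2.
    return math.log(c) / (2 * math.log(retention_rate))


if __name__ == "__main__":
    # Two languages differing on 26 of 100 meanings share 74 cognates.
    print(round(divergence_time(74, 100), 2))
```

The single constant retention rate is the strongest assumption in this sketch, which
is why it is exposed as a parameter rather than hard-coded.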
Recent computational tools make it possible to revise the classical
glottochronological approach and to try to dispense with its most criticized
assumptions (Atkinson et al., 2005; Campbell, 2006), for example by explicitly
specifying the processes of descent between languages rather than assuming kinship
solely on the basis of a numerical correlation. The idea is then to compare linguistic
elements from one language to another in the light of an explicit model of successive
branchings and of divergence through the accumulation of mutations. “Languages” are
treated as collections of elements that are present or absent (Atkinson et al., 2005):
presence or absence of certain sounds, of certain words, of certain syntactic forms,
and so on, all liable to mutate over time at variable rates.
The question of observation here amounts to establishing the presence or
absence of a list of elementary objects within each language, and this formalization
guides the sampling protocol in the field. In the example of a Swadesh list, the word
list is compiled for each language studied; the methods of comparative linguistics and
etymology then make it possible to assign the words to cognate sets.
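The presence/absence coding described above can be illustrated with a short sketch.
It assumes a simplified input in which each language has exactly one word per meaning,
already assigned to a cognate set by a linguist; the function name, the language
names and the labels A/B are hypothetical.

```python
def presence_absence(cognate_classes):
    """Turn per-language cognate judgements into binary characters.

    cognate_classes: {language: {meaning: cognate_set_label}}
    Returns the ordered list of (meaning, label) characters and,
    for each language, a 0/1 vector over those characters.
    """
    characters = sorted(
        {(meaning, label)
         for words in cognate_classes.values()
         for meaning, label in words.items()}
    )
    matrix = {
        language: [1 if words.get(meaning) == label else 0
                   for meaning, label in characters]
        for language, words in cognate_classes.items()
    }
    return characters, matrix


if __name__ == "__main__":
    # Hypothetical judgements: labels A/B mark cognate sets within a meaning.
    toy = {
        "lang1": {"hand": "A", "water": "A"},
        "lang2": {"hand": "A", "water": "B"},
        "lang3": {"hand": "B", "water": "B"},
    }
    characters, matrix = presence_absence(toy)
    for language, row in matrix.items():
        print(language, row)
```

Each column of the resulting matrix is one character that a branching model can treat
as gained or lost along the tree.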
More generally, and beyond the reconstruction of glossogenetic trees, a body of
research has taken up the philological perspective and the conceptual analogies
between genetic and linguistic evolution. The recent field of “evolutionary
linguistics” (for a detailed review, see Croft, 2008) encompasses a diversity of
approaches whose objects, methods and assumptions sometimes vary widely. One
formalization of the notion of “language”, proposed by Croft (1996), particularly
caught my attention. Language is there treated as a set of utterances produced by a
community of speakers. Croft proposes using the concept of “speech community”,
analogous to that of “genetic population” (Croft, 2006):
A speech community is a group of speakers who engage in intercourse, that is, talk to each
other, and more critically, are communicatively isolated from speakers in other speech
communities. […] The definition of a language spoken by a speech community is then more
“social” than “linguistic”.
Adopting such a definition of language has important consequences for the
sampling protocol: the linguistic diversity internal to each speech community becomes
relevant data for historical inference. The task is then to record the utterances
produced by a series of speakers, as opposed to considering a language at a more
general level, as in the glottochronological studies above. A data collection protocol
for a series of individuals sampled within speech communities is required. However,
to my knowledge, very few sampling campaigns of this type have been carried out so
far, apart from those of Mennecier et al. (2016) and Verdu et al. (2017).
2. Description of diversity and historical inference
2.1. Description of genetic diversity
Describing diversity in population genetics is a step that transforms the
complex information in the file of genetic sequences produced by sequencing into
information of lower dimension. It gives researchers a synthetic representation of the
data set and a first visualization of its complexity. This step may take the form of
summary statistics computed with dedicated software (Excoffier and Lischer, 2010;
Guillot et al., 2005), of graphical representations derived, for example, from principal
component analysis, of tree representations built, for example, with the
Neighbour-Joining algorithm (Saitou and Nei, 1987), or of clustering plots (Pritchard
et al., 2000). Here too, the construction of a description is theory-laden: it takes on
meaning only with reference to the assumptions entailed by the descriptive methods
applied to the data, these data being themselves constructed within a prior theoretical
framework.
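As an example of such a summary statistic, the sketch below computes expected
heterozygosity (Nei's gene diversity, with the usual n / (n - 1) sample-size
correction) at one locus from a list of sampled allele copies. The function name and
the toy allele sizes are assumptions made for illustration only.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Unbiased expected heterozygosity at one locus.

    alleles: list of sampled gene copies, e.g. microsatellite allele sizes.
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two gene copies are required")
    frequencies = [count / n for count in Counter(alleles).values()]
    # Probability that two copies drawn without replacement differ.
    return n / (n - 1) * (1 - sum(p * p for p in frequencies))


if __name__ == "__main__":
    # Hypothetical sample of eight gene copies at a microsatellite locus.
    print(expected_heterozygosity([10, 10, 12, 12, 12, 14, 14, 16]))
```

Even this simple index carries modelling choices (which copies count as a sample,
which alleles count as distinct), in line with the point made above about theory-laden
descriptions.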
These descriptions are sometimes interpreted directly, to produce historical
inferences that I would describe as “verbal”. The main risk of such accounts, built
from graphical or statistical representations alone, is that their underlying
hypotheses or taken-for-granted assumptions remain implicit. It is then often difficult
to assess their validity, and ignorance of these assumptions can lead to
misinterpretation of descriptive results, as several authors have pointed out (Falush
et al., 2016; Novembre and Stephens, 2008). Reich et al. (2008) show, for example,
that a genetic gradient observed on a PCA plot is commonly interpreted as the result
of a migration process, but that it could equally well be explained by, for example,
gradual isolation by distance.
2.2. Description of linguistic diversity
In philology, the methods for describing linguistic diversity are often
analogous to those used to describe genetic diversity. Accounting for horizontal
transfers is sometimes accompanied by graphical representations that include
reticulations, for example network graphs (Hamed, 2005; List et al., 2014). A direct
interpretation of these descriptions may be proposed. However, the same problem as
in genetics arises concerning the link between these statistical or graphical
representations and the historical processes that produced the observed diversity. For
verbal inferences of linguistic history drawn from descriptions alone, the problem is
even more acute. What causal links can be established between the descriptions and
the mechanisms that generate linguistic diversity? What implicit assumptions,
sometimes drawn directly from a “genetic” view of linguistic objects, guide the
interpretations?
2.3. Inference of the history underlying genetic diversity
Mathematical and computational models formalize a first layer of knowledge
related to the biological particulars of DNA replication and transmission. Knowledge
of the processes of autosomal recombination, base-pair mutation, gamete production
by mitosis and meiosis, and fertilization informs the general picture of how genetic
diversity is built up. The Y chromosome and mitochondrial DNA hold a special place
in population genetic inference: the former is transmitted only from father to son,
whereas the latter is transmitted only through the mother, which makes it possible to
reconstruct the history of specifically paternal lineages on the one hand and
specifically maternal lineages on the other.
After this first layer of formalization at the level of individuals, population
structure informs the second layer. Wright (1951) proposed a set of assumptions that
have since become standard in modelling work. Unless explicitly specified otherwise,
individuals are assumed to reproduce only with members of their own population,
choosing their mates at random: this is the panmixia assumption. Alleles are assumed
to have no influence on the life-history traits of individuals, or a neutral one: this is
the neutrality assumption. The history of splits, migrations, selection processes or
admixture between several populations can then make up the specific features of the
historical scenarios that will be evaluated against one another.
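Under these two assumptions, one generation of reproduction reduces to random
sampling of gene copies from the parental pool. A minimal sketch of such a neutral,
panmictic model at one biallelic locus is given below. The function name and the
parameter values are assumptions chosen for illustration; this is not the simulator
used in the thesis.

```python
import random


def wright_fisher(n_copies, p0, generations, rng):
    """Allele-frequency trajectory in a neutral, panmictic population.

    n_copies: number of gene copies in the population (constant size)
    p0: initial frequency of the focal allele
    generations: number of generations to simulate
    rng: random.Random instance, passed in for reproducibility
    """
    count = round(p0 * n_copies)
    trajectory = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Panmixia and neutrality: every copy of the next generation picks its
        # parent copy uniformly at random, regardless of the allele it carries.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        trajectory.append(count / n_copies)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(1)
    print(wright_fisher(n_copies=100, p0=0.5, generations=10, rng=rng))
```

Splits, migrations or admixture would be added on top of this baseline by running
several such populations and moving gene copies between them.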
These historical scenarios can be evaluated with several types of inferential
methods. Likelihood-based methods, such as maximum likelihood, directly explore the
probability of observing what is observed, across the set of scenarios considered
(Huelsenbeck and Crandall, 1997). Among likelihood-based approaches, Markov chain
Monte Carlo algorithms in a Bayesian framework offer many computational
advantages and are methods of choice in the discipline (Drummond and Rambaut,
2007). However, these methods sometimes fail to cope with very large data sets or
complex scenarios (Stephens and Donnelly, 2003), which can make the direct
computation of the likelihood, on which they rely, intractable. Approximate Bayesian
Computation (ABC) methods rely instead on computer simulations of the proposed
scenarios. Simulated data are compared with the real data to assess the relative
support for each scenario without directly computing the likelihood (Tavaré et al.,
1997). ABC methods can be very costly in simulation time, but they make it possible
to handle very large data sets and to evaluate scenarios involving complex migration
and admixture events that are difficult to handle with other inference methods
(Csilléry et al., 2010; Robinson et al., 2014).
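The principle of ABC can be sketched with its simplest variant, rejection sampling:
draw parameters from the prior, simulate data, and keep the draws whose simulated
summary statistic falls close to the observed one. The toy simulator, the prior, the
tolerance and all function names below are hypothetical and serve only to show the
logic; the methods cited above use richer simulators and more refined algorithms.

```python
import random


def abc_rejection(observed, simulate, sample_prior, distance,
                  tolerance, n_draws, rng):
    """Rejection ABC: keep prior draws whose simulated data fall close to the observation."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        simulated = simulate(theta, rng)
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)
    return accepted


# Toy problem (hypothetical): theta is the frequency of a variant, and the
# summary statistic is the number of carriers among 50 sampled individuals.
def sample_prior(rng):
    return rng.random()  # uniform prior on [0, 1)


def simulate(theta, rng):
    return sum(1 for _ in range(50) if rng.random() < theta)


def distance(a, b):
    return abs(a - b)


if __name__ == "__main__":
    rng = random.Random(42)
    posterior = abc_rejection(35, simulate, sample_prior, distance,
                              tolerance=2, n_draws=5000, rng=rng)
    print(len(posterior), sum(posterior) / len(posterior))
```

Comparing scenarios follows the same logic in this simple scheme: simulations are
drawn from each candidate scenario, and the share of accepted simulations coming from
a scenario approximates its posterior probability, given equal prior weights.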
2.4. Inference of the history underlying linguistic diversity
Work on building language trees (Atkinson et al., 2005) mainly uses Bayesian,
likelihood-based inferential methods (Drummond and Rambaut, 2007). The scenarios
evaluated against one another are scenarios of successive branchings from a common
origin, analogous to phylogenetic models. The possibility of admixture events between
languages is not taken into account, and borrowing is treated as a nuisance parameter
(Greenhill et al., 2009). These methods therefore presuppose an exclusively vertical
transmission of languages.
Unlike population genetics, which can draw on a body of results from biology,
historical inference in linguistics has no field that describes the processes of
linguistic change and their causal mechanisms. There is a real gap in the chain
running from observation to inference, with no synthesis linking the processes of
historical change at the scale of languages to the mechanisms of linguistic
transmission at the scale of individuals. Thus, whereas genetics provides a description
of the mechanisms of DNA transmission, it is difficult to settle unambiguously on the
mechanisms of linguistic acquisition and transmission at the scale of individuals.
3. Studies of genetic and linguistic co-evolution
Historical reconstructions combining genetics and linguistics date back at least
to the work of Ranajit Chakraborty in the 1970s (Chakraborty, 1976), who studied the
correlation between genetic and linguistic distances among Amerindians of the Andean
highlands. In the 1980s, Luigi Luca Cavalli-Sforza and his collaborators extended this
approach to a worldwide scale (Cavalli-Sforza et al., 1988), using allelic diversity to
build phylogenetic trees and compare them with independently proposed language
taxonomies (Ruhlen, 1991).
Genetic, linguistic and geographical distances are still widely used today to
describe their possible correlations at local (Balanovsky et al., 2011; Nettle and
Harriss, 2003; Ramallo et al., 2013) or global (Belle and Barbujani, 2007; Creanza et
al., 2015) scales. These studies tend to show a very weak correlation between genetic
and linguistic distances once geographical distances are accounted for. By contrast, a
strong correlation is very often measured between genetic and geographical distances,
and between linguistic and geographical distances. This seems to indicate that the
link between genetic and linguistic diversity results solely from a shared geography
(Ben Hamed and Darlu, 2007), rather than from an analogous transmission
mechanism, which would have produced a correlation between genetics and linguistics
over and above the correlation with geography.
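The phrase “once geographical distances are accounted for” corresponds to a partial
correlation between distance matrices. The sketch below computes it from the
first-order partial correlation formula applied to the upper triangles of three
matrices. The matrices and function names are hypothetical, and the studies cited
above may use different procedures; in practice, significance is usually assessed by
permuting populations (a partial Mantel test), which is omitted here for brevity.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


if __name__ == "__main__":
    # Hypothetical pairwise distances among four populations.
    genetic = [[0, 1, 2, 4], [1, 0, 1, 3], [2, 1, 0, 2], [4, 3, 2, 0]]
    linguistic = [[0, 2, 3, 5], [2, 0, 2, 4], [3, 2, 0, 1], [5, 4, 1, 0]]
    geographic = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
    g, l, d = map(upper_triangle, (genetic, linguistic, geographic))
    print(pearson(g, l), partial_correlation(g, l, d))
```

A raw correlation well above the partial one would be the numerical signature of the
shared-geography explanation discussed above.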
Relying nonetheless on a hypothesis of co-evolution between genes and
languages, inferred splits between genetic populations are sometimes used to evaluate
language taxonomies (Amorim et al., 2013). Genetic and linguistic histories are here
assumed to be identical, or at least analogous, so that one can inform the other. In
another context, comparing phylogenetic trees reconstructed from genetic and from
linguistic data strengthens historical inferences when the trees coincide (Balanovsky
et al., 2011; Duda and Zrzavý, 2016). It also makes it possible to suggest particular
events (language replacement, for example) when the trees do not coincide.
The reconstruction of phylogenetic trees in the historical inference studies
cited above follows directly from their modelling assumptions. The history that
produced genetic and linguistic diversity is assumed to be a process of successive
branchings starting from a common trunk corresponding to an ancestral form, dividing
into branches and then into leaves that represent the diversity observed today.
This type of model is much debated in the literature on cultural evolution in
general (Geisler and List, 2013). Two main criticisms are levelled at it (Cabrera,
2017). The first concerns the failure to account for transfers between the branches of
trees due to borrowing between languages: phylogenetic methods assume a strictly
vertical transmission of linguistic elements. The second concerns the unjustified
analogy between the event of genetic transmission and the event of cultural
transmission (Claidière and André, 2012). Cultural elements strongly influence their
own mode of transmission through a whole series of biases. For example, a simpler
linguistic structure will be more easily understood when it is uttered; likewise, a
more expressive word will be more easily remembered and, in turn, more often used.
By contrast, DNA does not bias the transmission event itself, apart from rare
segregation distorters, since its mode of transmission is always the same: half of each
parent's autosomes, plus the uniparental markers. Kirby (2000) thus points out that
languages, unlike DNA sequences (which are copied and then physically passed on),
must be reconstructed at each generation through learning or acquisition.
How can these particularities be taken into account when proposing joint
historical inferences from genetics and linguistics? What assumptions can be adopted
to make linguistic inferences as robust as possible?
4. Towards a joint analytical framework for genetic and linguistic diversity
4.1. Inference of the history of genetic populations and linguistic varieties
In the first chapter of this thesis, we develop a new ABC method for inferring
in parallel the histories of genetic populations and of linguistic varieties. It is
applied to a set of Central Asian populations for which genetic and linguistic data
were collected by the laboratory (21 populations genotyped at 26 autosomal
microsatellite markers and linguistically typed for a list of 185 words). For the first
time, the use of ABC methods makes it possible to account explicitly for the
possibility of linguistic borrowing or admixture as well as for the possibility of
migration or genetic admixture events. Inferences are carried out independently for
the two types of data, so that the proposed scenarios can be compared without
presupposing a strict co-evolution between the two.
4.2. Exploring a “population linguistics”
We noted above that an observation campaign centred on languages is only one
possible choice among others. A different perspective, closer to population genetics,
consists in recording in the field the linguistic diversity within speech communities
at the scale of individuals (Thomsen, 2006). The second chapter therefore builds a
series of different models of linguistic evolution at the scale of individuals. The aim
is to assess the relevance of several possible modes of transmission, while evaluating
the feasibility of within-population linguistic inference, as a first formalization of a
“population linguistics” (Manni, 2017). This framework is applied to the study of
modes of linguistic transmission in 10 populations of Tajikistan, for a list of 185
words.
4.3. Building an interface between population linguistics and population genetics
Because phylogenetic models cannot be applied to reconstruct linguistic
histories at the scale of individuals, new models detailing the mechanisms of
linguistic transmission must be built. Analogies with the mechanisms of genetic
replication can be regarded as sources of inspiration, but not as justifications. It is
therefore necessary to draw on knowledge from the various branches of linguistics,
which entails substantial interdisciplinary work. In the third chapter, we thus set out
to articulate a formalization of population genetics with a formalization of a possible
population linguistics. Making the underlying assumptions explicit and taking into
account the causal chain running from field observation to inference lead us to
propose a reversal in the way the linguistic object is constructed, placing the
utterance at the core of the modelling.
4.4. Sampling and analysis of linguistic data from the Cape Verde Islands
In the last chapter of this thesis, we detail the linguistic data collection
protocol that was applied during a field mission in the Cape Verde Islands with a
group of Cape Verdean speakers. This protocol was designed to meet the needs of the
population linguistics presented in the preceding chapter. In return, applying the
protocol in the field fed our theoretical perspectives. A preliminary data analysis
provides a first look at the structure of linguistic diversity in this region of the
world.
Chapter I – Genetic and linguistic
histories in Central Asia inferred
using Approximate Bayesian
Computation
Valentin Thouzeau†,1, Philippe Mennecier†, Paul Verdu†,2, Frédéric Austerlitz†,2
† CNRS, MNHN, Université Paris Diderot, UMR 7206 Eco-Anthropologie et
Ethnobiologie, Paris 75016, France
1 Corresponding author: valentin.thouzeau@mnhn.fr, +33 (0)1 44 05 73 44
2 These authors equally supervised this work
This article was published online August 23, 2017, with the following reference:
Thouzeau, V., Mennecier, P., Verdu, P., and Austerlitz, F. (2017). Genetic and linguistic histories in Central Asia inferred using approximate Bayesian computations. Proc. R. Soc. B 284, 20170706.
Abstract
Linguistic and genetic data have been widely compared, but the histories
underlying these descriptions are rarely jointly inferred. We developed a unique
methodological framework for jointly analysing language diversity and genetic
polymorphism data, to infer the past history of separation, exchange and admixture
events among human populations. This method relies on Approximate Bayesian
Computation, which allows us to identify the most probable historical scenario underlying
each type of data, and to infer the parameters of these scenarios. For this purpose, we
developed a new computer program PopLingSim that simulates the evolution of
linguistic diversity, which we coupled with an existing coalescent-based genetic
simulation program, to simulate both linguistic and genetic data within a set of
populations. Applying this new program to a wide linguistic and genetic data set of
Central Asia, we found several differences between linguistic and genetic histories. In
particular, we showed how genetic and linguistic exchanges differed in the past in this
area: some cultural exchanges were maintained without genetic exchanges. The
methodological framework and the linguistic simulation tool here developed can be
successfully used in future work for disentangling complex linguistic and genetic
evolutions underlying human biological and cultural histories.
1. Introduction
Human demographic history encompasses complex events such as migrations,
population size changes, and admixture events (Cavalli-Sforza and Feldman, 2003;
Hellenthal et al., 2014; Mallick et al., 2016; Ramachandran et al., 2005). These
demographic events, which impact within- and among-population genetic diversities,
are coupled with gradual cultural changes or bursts of innovation, and borrowings
(Atkinson et al., 2008; Cavalli-Sforza and Feldman, 1981; Mesoudi et al., 2006).
Since Darwin (1871), numerous authors have investigated genetic and linguistic
evolutionary processes. They found parallelisms between linguistic and genetic trees
(Cavalli-Sforza et al., 1988); identified homologous linguistic traits with a possible
common origin similar to homologous genetic markers (Atkinson and Gray, 2005);
and proposed that genes and languages are both composed of discrete heritable
replicators which may evolve in parallel (Pagel, 2009). Numerous studies have
compared linguistic and genetic diversities, and have investigated how well they
match at different geographical scales. For instance, a strong correlation was found
between genetic barriers and linguistic boundaries in Europe (Barbujani and Sokal,
1990); genes from North Island Melanesian populations appeared to diffuse more than
linguistic features (Hunley et al., 2008); African genetic diversity was more
structured geographically than linguistically (Scheinfeldt et al., 2010); strong
differences between genetic and phonemic drifts were found at the worldwide scale
(Creanza et al., 2015).
While complex mechanisms are often considered in demographic inferences
based on genetic data, the known complexity of mechanisms underlying linguistic
evolution has rarely been accounted for. Indeed, genes and languages differ by the
very nature of their transmission processes (Claidière and André, 2012). Genes are
transmitted vertically among individuals and gene flow only occurs through sexual
reproduction and/or individual migrations. On the other hand, languages can be
transmitted vertically, horizontally and obliquely (Tao Gong, 2010). This transmission
among generations may occur in parallel with genetic transmission, in particular
within families. However, linguistic exchanges or borrowings among populations may
also occur via cultural diffusion, without migration of individuals (Haspelmath and
Tadmor, 2009; Haugen, 1950). Conversely, gene flow can occur without language
borrowing when a migrating individual does not transmit his/her language to his/her
progeny. Therefore, cultural and demographic changes are not necessarily correlated
(Steele and Kandler, 2010), and genetic and linguistic data may reveal different
aspects of human history. For instance, all Central African Pygmy populations share a
common ancestral population long diverged from the ancestral non-Pygmy
neighbouring population (Verdu et al., 2009), but nevertheless now use the language
of their neighbours (Bahuchet, 2012).
Population genetics methods allow inferring complex demographic histories
from genetic polymorphism data, using elaborate statistical methods, such as Markov
chain Monte Carlo or Approximate Bayesian Computation (ABC) (Beaumont et al.,
2002; Tavaré et al., 1997) based on the coalescent theory (Kingman, 1982). They have
allowed inferring parameters of human demographic history at worldwide or regional
scales (Alves et al., 2016; Haber et al., 2016; Moreno-Estrada et al., 2013). Several
models have been proposed for the transmission and evolution of specific linguistic
features such as words (Atkinson et al., 2005). Computational linguistic approaches
have recently been applied to lexical datasets, allowing the inference of recent human
language diffusions (Bouckaert et al., 2012; Gray et al., 2009). These approaches did
not encompass horizontal or oblique borrowing events, as these processes cannot be
easily accommodated within a phylogenetic framework. However, neglecting
borrowings is expected to significantly bias the estimation of parameters such as
divergence dates (Greenhill et al., 2009).
Likelihood-based approaches cannot handle large data sets under highly
complex models (Csilléry et al., 2010; Schiffels and Durbin, 2014; Weiss and von
Haeseler, 1998). However, complex models are essential to interrogate the
multifaceted demographic and cultural histories. ABC methods provide an ideal
framework to overcome these challenges (Beaumont et al., 2002; Tavaré et al., 1997),
since they rely on explicit simulations, which allow considering altogether
phenomena such as admixture, changes in effective population size, and borrowings
among numerous populations.
We developed here an ABC framework to study the links between genetic
transmission (vertical with or without gene flow) and linguistic transmission (vertical
and/or horizontal) under a large number of possible complex scenarios. This
framework aims at choosing among different historical scenarios and inferring the
best parameters for the chosen scenarios, using linguistic and genetic data sets. Since
ABC methods require extensive simulations, we developed a new efficient linguistic
simulation program to simulate linguistic trees with possible borrowing and admixture
events among linguistic varieties or populations, generating ultimately simulated
cognate lists in each population. Cognates are homologous words with the same
etymological origin and the same meaning. They are usually obtained by comparing
word lists among populations or linguistic varieties, such as the 207-word list
designed by Swadesh (Swadesh, 1952). They have been previously used as cultural
markers of evolution (Bouckaert et al., 2012; Gray et al., 2009). In parallel to this
novel “language” simulator, we used the program fastsimcoal 2.5.1 (Excoffier and
Foll, 2011) to simulate large genetic polymorphism data sets under complex
demographic histories.
We specifically applied this novel inference framework to Central Asia, which
represents an ideal setting for the investigation of gene-language coevolution. A
complex history of settlements, migration waves and admixture events, expansions,
and secondary contacts has shaped the genetic and linguistic diversity of
populations in this area (Palstra et al., 2015). Central Asian populations belong to at
least two linguistically and genetically contrasted groups: Turkic speaking and Indo-
Iranian speaking populations (Martínez-Cruz et al., 2011). Since they often live
nearby, we expected gene flow and/or vocabulary exchanges between them (Martínez-
Cruz et al., 2011; Palstra et al., 2015).
We obtained linguistic and genetic data for 21 populations (11 Turkic speaking
and 10 Indo-Iranian speaking). We focused on two specific populations: the Uzbek
speaking population from the district of Soj-Mahalla in the city of Andizhan in the
Fergana valley (abbreviated UZA), and the Yagnob speaking population from the
Yagnob valley (abbreviated TJY). Linguistic replacement was hypothesized to explain
the previously observed mismatch between linguistic and genetic clustering of the
UZA population (Martínez-Cruz et al., 2011). In contrast, the TJY population is
assumed to be linguistically and genetically isolated from the other Indo-Iranian
speakers of this region due to its geographical isolation in valleys difficult to reach
(Gunya et al., 2002). These populations represent therefore two separate relevant case
studies to apply our new framework and contrast genetic and linguistic histories. We
reconstructed these histories separately for each population, and compared the
obtained inferences a posteriori. We focused on the chronology of genetic and
linguistic splits, and on the respective levels of genetic and linguistic exchanges.
2. Material
We studied the genetic and linguistic diversities of 21 Central Asian localities,
sampled in three countries (Uzbekistan, Tajikistan and Kyrgyzstan): 11 Turkic
speaking populations and 10 Indo-Iranian speaking populations (Figure I.1). The
national ethics committees of each country of sampling and the French research
ministry approved the study. All sampled individuals provided appropriate informed
consent.
2.1. Genetic data
We used previously published genetic data from these 21 populations (Aimé et
al., 2014), for a total of 643 individuals (24 to 49 individuals per population, see Table
S1), genotyped for 26 autosomal microsatellite markers that showed no significant
deviation from Hardy-Weinberg equilibrium and extremely low pairwise linkage
disequilibrium (Martínez-Cruz et al., 2011). All sampled individuals included in our
study were no closer than second-degree cousins (Martínez-Cruz et al., 2011).
2.2. Linguistic data
We obtained linguistic data from the same 21 populations, using phonetic
transcriptions on a subset of the individuals also sampled for DNA. Between one and
seven individuals per population participated in the linguistic questionnaires,
amounting to 74 individuals in total. For each individual, we recorded up to 185 words
corresponding to 185 meanings extracted from the classic extended Swadesh list
(Swadesh, 1952). For detailed linguistic data collection procedures, see Mennecier et
al. (2016b).
We consider as “cognate” a group of words with the same etymological origin
and the same meaning, such words being more likely to be related by a common
ancestry (Bouckaert et al., 2012). For example, the words “un” in French and “uno” in
Spanish belong to the same cognate: they have the same meaning, the number one,
and the same origin from the Latin “unus”. The words “papillon” in French and
“mariposa” in Spanish do not belong to the same cognate: they have the same meaning,
butterfly, but not the same etymological origin. The classification into cognates was
performed by Philippe Mennecier following previous work (Mennecier et al., 2016b).
Figure I.1 – Geographical distribution of the 21 populations and linguistic varieties under study, with 11 Turkic speaking populations (yellow circles) and 10 Indo-Iranian speaking populations (blue circles). The red arrows indicate the Uzbek speaking population from Soj-Mahalla (UZA) and the Yagnob speaking population (TJY).
Due to the low number of individuals sampled in each population, we did not
take into account the inter-individual cognate variability within populations. Instead,
we considered, for each word, only the most frequent cognate for each population,
reducing our linguistic dataset to a single cognate list per population, namely a
“linguistic variety”.
3. Methods
3.1. Genetic and Linguistic Dissimilarities among Populations
Using the 26 microsatellites, we computed pairwise FST values (Weir and
Cockerham, 1984) among the 21 populations using the Geneland R package (Guillot
et al., 2005). We tested their significance using 1,000 permutations of individuals
between populations (Excoffier and Lischer, 2010), with a significance level
α = 2.3 × 10⁻⁴ after Bonferroni correction for multiple testing. We set non-significant
FST values to zero. For the linguistic data, we computed the pairwise Manhattan
distances (R script in Repository) among the 21 populations, assuming, for each
meaning separately, a distance equal to 0 for the same cognate and 1 for different
cognates. Then, we constructed two weighted consensus trees, from the genetic and
linguistic dissimilarity matrices respectively, using the neighbour-joining algorithm
BioNJ (Gascuel, 1997) implemented in the R package ape (Paradis et al., 2004),
performing 1,000 bootstraps of populations for each tree. We set negative branch
lengths to zero. We performed Mantel tests (Mantel, 1967) between genetic distances
and linguistic distances using the R package ade4 (Thioulouse et al., 1997), testing
their significance with 10,000 permutations of both genetic and linguistic distances.
This was done on all pairs of populations, then on all pairs of Turkic speaking
populations (with or without the UZA population), and on all pairs of Indo-Iranian
speaking populations (with or without the TJY population).
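The linguistic distance and the Mantel test described above can be illustrated with a minimal Python sketch. This is not the R script from the repository nor the ade4 implementation: the function names, the toy cognate lists and the one-sided permutation scheme (permuting the populations of one matrix only) are assumptions made for illustration.

```python
import math
import random


def cognate_distance(list_a, list_b):
    """Manhattan distance between two cognate lists.

    Each position is one meaning; it contributes 0 when both varieties carry
    the same cognate identifier and 1 otherwise.
    """
    if len(list_a) != len(list_b):
        raise ValueError("cognate lists must cover the same meanings")
    return sum(1 for a, b in zip(list_a, list_b) if a != b)


def distance_matrix(varieties):
    """Return the sorted variety names and their pairwise distance matrix."""
    names = sorted(varieties)
    matrix = [[cognate_distance(varieties[p], varieties[q]) for q in names]
              for p in names]
    return names, matrix


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances must not all be equal")
    return cov / math.sqrt(var_x * var_y)


def mantel(m1, m2, n_perm=999, seed=0):
    """One-sided Mantel test: permute the populations of m2 and count how
    often the permuted correlation reaches the observed one."""
    n = len(m1)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [m1[i][j] for i, j in pairs]
    observed = pearson(x, [m2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        r = pearson(x, [m2[perm[i]][perm[j]] for i, j in pairs])
        if r >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: three varieties, four meanings, integer cognate ids.
toy = {"A": [1, 2, 3, 4], "B": [1, 2, 3, 5], "C": [1, 6, 7, 8]}
names, ling = distance_matrix(toy)
```

In practice the second matrix passed to `mantel` would hold the pairwise FST values for the same populations, in the same order as `names`.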
3.2. Approximate Bayesian Computation (ABC)
Using the genetic and linguistic data, we investigated the genetic and linguistic
histories of Central Asia using an ABC framework (Beaumont et al., 2002; Tavaré et
al., 1997). In short, we generated a large number of simulated data sets under several
competing scenarios, the parameters of each scenario being drawn randomly from
prior distributions. We then computed summary statistics for each simulated dataset.
We evaluated the proximity between the observed and the simulated summary
statistics to select the most likely scenario. We then inferred the posterior
distribution of each parameter for the most likely scenario. Since we did not assume a
priori that the genetic and linguistic histories were linked, we performed the
simulations and the ABC procedures separately for each type of data. Genetic data
were simulated using fastsimcoal 2.5.1. For details about the genetic model, the
priors of the parameters, and the summary statistics, see supplementary information.
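The general ABC logic (draw parameters from a prior, simulate, summarise, keep the simulations closest to the observation) can be sketched with a toy rejection scheme. This is only an illustration of the principle: the analyses in this chapter rely on Random Forest and neural network algorithms, and the toy model, the number of generations, the prior bounds and all names below are hypothetical.

```python
import random


def simulate_changed(mu, n_meanings=185, generations=50, rng=random):
    """Toy simulator: number of meanings whose cognate was replaced at least
    once after `generations`, each replacement having probability mu."""
    p_changed = 1.0 - (1.0 - mu) ** generations
    return sum(1 for _ in range(n_meanings) if rng.random() < p_changed)


def abc_rejection(observed, n_sims=5000, keep=0.05, seed=1):
    """Return the prior draws whose simulated summary statistic is closest
    to the observed one (a sample from the approximate posterior)."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        mu = rng.uniform(0.0, 0.1)            # hypothetical uniform prior
        stat = simulate_changed(mu, rng=rng)  # summary statistic
        sims.append((abs(stat - observed), mu))
    sims.sort()
    n_keep = max(1, int(keep * n_sims))
    return [mu for _, mu in sims[:n_keep]]


# Hypothetical observation: 60 of 185 meanings carry a replaced cognate.
posterior = abc_rejection(observed=60, n_sims=2000)
posterior_mean = sum(posterior) / len(posterior)
```

The retained values of `mu` approximate the posterior distribution; comparing them with the uniform prior shows how much information the summary statistic carries.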
3.2.1. Linguistic model
We extended the model of Gray and Atkinson (2002), with
substantial modifications (Figure S1), assuming:
1) Cognate evolution was tree-like, with possibilities of borrowing or admixture
between branches;
2) Each cognate corresponded to exactly one word, to be consistent with the
format of our dataset;
3) There was an infinite number of possible cognates, and a cognate may appear
only once.
We developed the C++ program PopLingSim (script in Repository, see
Appendix) using the CodeBlocks software to simulate cognate variation data. Each
linguistic variety carries a set of cognates. At each linguistic generation time, each
cognate i of each variety may be replaced by a new cognate (it adopts a completely new
identifier) with probability μL. The linguistic generation time is not necessarily on the
same absolute time-scale as the genetic generation time.
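The replacement process described above can be sketched in a few lines of Python. This is not the C++ code of PopLingSim: the class, its method names and the use of a global counter to model the infinite set of cognate identifiers are assumptions made for illustration, and borrowing and admixture are left out.

```python
import itertools
import random


class CognateSimulator:
    """Toy simulator of cognate replacement in linguistic varieties."""

    def __init__(self, n_meanings, seed=None):
        self.n_meanings = n_meanings
        self.rng = random.Random(seed)
        # Every call to next() yields an identifier never used before, so a
        # cognate can appear only once (infinite number of possible cognates).
        self._ids = itertools.count()

    def new_variety(self):
        """Ancestral variety: one fresh cognate per meaning."""
        return [next(self._ids) for _ in range(self.n_meanings)]

    def split(self, variety):
        """A daughter variety starts as an exact copy of its parent."""
        return list(variety)

    def evolve(self, variety, mu_l, generations):
        """At each linguistic generation, each cognate is replaced by a
        completely new identifier with probability mu_l."""
        for _ in range(generations):
            for i in range(len(variety)):
                if self.rng.random() < mu_l:
                    variety[i] = next(self._ids)
        return variety


# Hypothetical usage: two varieties diverging from a common ancestor.
sim = CognateSimulator(n_meanings=185, seed=42)
ancestor = sim.new_variety()
variety_a = sim.evolve(sim.split(ancestor), mu_l=0.01, generations=20)
variety_b = sim.evolve(sim.split(ancestor), mu_l=0.01, generations=20)
shared = sum(1 for a, b in zip(variety_a, variety_b) if a == b)
```

Because identifiers are never reused, two varieties share a cognate for a given meaning only if neither lineage has replaced it since their split.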
3.2.2. Scenarios of linguistic and genetic origins of the UZA population
Potentially numerous linguistic and genetic scenarios describe the origin of the
UZA population relative to the other Central Asian populations. We aimed at
evaluating (i) the linguistic and genetic origins of the studied populations and (ii) the
linguistic and genetic exchanges between the UZA population and the other Central
Asian populations. We chose to consider a set of scenarios, addressing these questions
specifically. We separately performed three-population analyses for the genetic case
and three-linguistic-variety analyses for the linguistic case, in each of which we tested five
possible scenarios (Figure I.2).
3.2.3. Scenarios of linguistic and genetic isolation of the Yagnob speaking
populations
In this case, we aimed (i) at evaluating whether the TJY population is
genetically and/or linguistically isolated, and (ii) at estimating the linguistic and
genetic exchanges between this population and the other Indo-Iranian speaking
populations. We chose to consider two scenarios either with or without genetic
migration or linguistic borrowing, respectively (Figure S3). Indeed, the TJY linguistic
variety is a subset of the Yagnob language, known to derive from the other Indo-
Iranian languages and to have recently started to resist linguistic changes (D’Errico
and Hombert, 2009).
3.2.4. Choice of scenarios and estimation of parameters
We defined a triplet of populations (resp. linguistic varieties) as a combination
of (i) the UZA or the TJY population (resp. variety), (ii) one of the nine Indo-Iranian
speaking populations (resp. varieties), excluding TJY, and (iii) one of the eight Turkic
speaking populations (resp. varieties), excluding UZA, UZB and UZT. This led to 72
triplets (9 × 8) of populations (resp. linguistic varieties) for each focal population.
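The count of 72 follows from pairing each of the nine Indo-Iranian speaking populations with each of the eight Turkic speaking populations. A short Python sketch makes the enumeration explicit; the population labels other than UZA and TJY are hypothetical placeholders, not the codes used in the data set.

```python
from itertools import product

# Hypothetical placeholder labels for the comparison populations.
indo_iranian = [f"II{k}" for k in range(1, 10)]  # nine populations, TJY excluded
turkic = [f"TC{k}" for k in range(1, 9)]         # eight populations


def triplets(focal):
    """All (focal, Indo-Iranian, Turkic) triplets for one focal population."""
    return [(focal, ii, tc) for ii, tc in product(indo_iranian, turkic)]


uza_triplets = triplets("UZA")
tjy_triplets = triplets("TJY")
print(len(uza_triplets), len(tjy_triplets))
```

Each triplet is then analysed independently, so that a scenario's support can be summarised as a number of decisions out of 72.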
Figure I.2 – Five competing scenarios for the origin of the UZA population, tested independently for linguistic and genetic history. In scenarios A and B, the ancestral Indo-Iranian and Turkic speaking populations (resp. varieties) split at time t0. At time t1, the ancestral UZA population (resp. variety) diverged from the Turkic lineage. Subsequent migration (resp. borrowing) events between the Indo-Iranian speaking populations (resp. varieties) and the UZA population (resp. variety) occurred in scenario B. In scenarios C and D, the ancestral Indo-Iranian and Turkic speaking populations (resp. varieties) split at time t0. At time t1, the ancestral UZA population (resp. variety) diverged from the Indo-Iranian lineage. Subsequent migration (resp. borrowing) events between the Turkic speaking populations (resp. varieties) and the UZA population (resp. variety) occurred in scenario D. In scenario E, the ancestral Indo-Iranian and Turkic speaking populations (resp. varieties) split at time t0. At time t1, the ancestral UZA population (resp. variety) resulted from an admixture event between these two lineages. Abbreviations: Tc = Turkic speaking population. UZA = UZA population from the Soj-Mahalla region in Uzbekistan. I-I = Indo-Iranian speaking population.
For each triplet, we conducted an ABC analysis to determine the best historical
scenario for the genetic and linguistic cases respectively using a Random Forest (RF)
algorithm, and then estimated the parameters of the selected scenarios with a neural
network (NN) algorithm (see Appendix).
4. Results
4.1. Central Asian linguistic and genetic structures
As shown in previous studies (Martínez-Cruz et al., 2011), the Indo-Iranian
speaking populations had higher genetic differentiation levels than the Turkic
speaking populations. Indeed, 47 pairwise FST values out of 55 were significantly
different from zero for the ten Indo-Iranian speaking populations, while it was the
case for only 14 pairwise FST out of 45 for the nine Turkic speaking populations.
The neighbour-joining tree analyses based either on genetic or linguistic data
showed a structure with two groups (Figure I.3), corresponding to the two main
linguistic families. This result can also be visualized with the distance matrices
(Figure S3). The Turkic linguistic variety UZA was closer to the other Turkic
varieties than to the Indo-Iranian varieties, but the UZA population was
genetically closer to the Indo-Iranian speaking populations than to the
Turkic speaking populations. The TJY population was both linguistically and
genetically distant from the other Indo-Iranian speaking populations, and even more
distant from the Turkic speaking populations.
Consistent with these results showing a clear overlap between the linguistic
and genetic groups, the Mantel test between linguistic and genetic distances computed
on all pairs of populations was significant (p = 0.0001, with α = 0.01 for five tests
after Bonferroni correction for multiple testing). However, it was not significant
(p = 0.112) among Turkic populations only, unless we excluded the UZA population
(p = 0.0091). The Mantel test including only the Indo-Iranian speaking populations
was not significant, whether including the TJY population or not (respectively
p = 0.0821 and p = 0.7449).
4.2. Model selection and parameter estimations for the UZA
population
According to the RF algorithm, the most supported scenario was an admixed
origin of the UZA population, both for genetic and linguistic data, with 36 decisions
out of the 72 tested population triplets (Figure I.4c) and 55 decisions out of 72 (Figure I.4a),
respectively.
Figure I.3 – Neighbour-joining trees based on (a) the linguistic distance matrix and (b) the pairwise FST matrix, with 11 Turkic speaking populations (in Yellow/Light Grey) and 10 Indo-Iranian speaking populations (in Blue/Dark Grey). The values at each node correspond to the number of bootstrap trees containing this node among 1,000 permutations. The red arrows indicate the UZA and the TJY populations, under specific scrutiny using Approximate Bayesian Computation inferences.
The modal estimates of the admixture rates differed between the linguistic
and genetic data. We found a strong bias toward the Turkic linguistic varieties, with
r̂L = 0.093 (95% CI 0.02-0.18, Figure I.4b), and a more balanced genetic
admixture, with r̂G = 0.48 (95% CI 0.05-0.95, Figure I.4d). It was difficult to
compare the divergence times t0 (ancient) and t1 (recent) directly between linguistic
and genetic processes, because the linguistic generation time was not necessarily on
the same absolute time scale as the genetic generation time. Therefore, we
compared the ratios t1/t0 between the genetic and linguistic histories, finding that they
differed by an order of magnitude, with a linguistic t̂1/t̂0 = 0.038 (95% CI 0.002-0.08)
and a genetic t̂1/t̂0 = 0.30 (95% CI 0.04-0.95).
Figure I.4 – ABC analyses for the UZA population. (a) Decisions of the Random Forest analysis over the 72 triplets for the selection of the linguistic scenarios. (b) Priors (dotted line) and posteriors (solid line) of the parameters t1/t0 and rL estimated from the linguistic simulations of scenario E. (c) Decisions of the Random Forest analysis over the 72 triplets for the selection of the genetic scenarios. (d) Priors (dotted line) and posteriors (solid line) of the parameters t1/t0 and rG estimated from the genetic simulations of scenario E.
Finally, we estimated an effective population size of 82,173 (95% CI 13,608-
98,179) for the UZA, and lower effective population sizes respectively for the Turkic
and the Indo-Iranian speaking populations [N̂0 = 16,862 (95% CI 6,399-87,812)
and N̂2 = 28,382 (95% CI 8,124-95,255)]. The increased estimated effective
population size in the UZA was likely due to the admixture process itself, which
increased the genetic diversity in the admixed population compared to each source
(Long, 1991).
4.3. Model selection and parameter estimations for the TJY
population
Genetically, the RF algorithm unambiguously supported the scenario of a strict
isolation of the TJY population (72 decisions out of 72). Conversely, the two scenarios
(split with or without subsequent migration) were almost equally supported
linguistically, with 37 and 35 decisions out of 72, respectively.
Since the two scenarios were equally supported for the linguistic case, we
performed the parameter estimations in both cases. The estimated split time ratios of
the TJY linguistic variety from the other Indo-Iranian linguistic varieties were similar
between the two scenarios: t̂1/t̂0 = 0.12 (95% CI 0.02-0.30, Figure S7) for the
isolation scenario 1, and t̂1/t̂0 = 0.15 (95% CI 0.002-0.97, Figure S7) for the
non-isolation scenario 2. Under the latter scenario, the estimated borrowing rate between the
TJY linguistic variety and the Indo-Iranian linguistic varieties was quite low: δ̂L
= 0.004 (95% CI 0.0009-0.019). This meant that each cognate was borrowed with a
probability of 0.4% at each linguistic generation since the split t1, a low estimate given
that the prior was U[0, 0.1].
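To give a sense of scale for a per-generation borrowing probability of 0.4%, the following Python sketch computes the probability that a given cognate has been borrowed at least once after a number of linguistic generations. It assumes that borrowing events are independent across generations, and the generation counts are arbitrary illustrative values, since the number of linguistic generations since t1 is not stated here.

```python
def cumulative_borrowing(delta, generations):
    """Probability that a cognate is borrowed at least once in `generations`
    independent linguistic generations, each with borrowing probability delta."""
    return 1.0 - (1.0 - delta) ** generations


# Estimated borrowing rate from the text; generation counts are hypothetical.
delta_hat = 0.004
for g in (1, 10, 100, 500):
    print(g, round(cumulative_borrowing(delta_hat, g), 3))
```

Even a low per-generation rate can therefore accumulate over many generations, which is why the rate is best interpreted together with the estimated split time.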
The estimated ratio of split times based on the genetic data, assuming an
isolation scenario, was much higher than for the linguistic data, with t̂1/t̂0 = 0.77
(95% CI 0.11-0.97, Table S4) compared to t̂1/t̂0 = 0.12 (95% CI 0.02-0.30, Table
S4). The estimated effective population size of the TJY population was 10,280 (95%
CI 2,534-75,516), lower than the estimated effective population size of 50,561 (95%
CI 12,909-97,032) for the other Indo-Iranian speaking populations.
5. Discussion
In this study, we built a new flexible simulator of cognate data under historical
models encompassing divergences and multiple borrowings and admixture events
between linguistic varieties. Using in parallel an existing population genetic data
simulation program, we developed an ABC framework that allows inferring the most
probable genetic and linguistic histories and estimating their underlying parameters,
using both types of data sampled in the same populations. We used this new
framework to reconstruct the evolutionary scenarios underlying linguistic and genetic
diversities of a range of populations from Central Asia.
5.1. Two different linguistic and genetic admixture histories for
the Soj-Mahalla Uzbek-speakers
We tested five possible genetic and linguistic scenarios to investigate the
relation between the UZA population and the other populations of the area, i.e. the
Indo-Iranian speaking populations and the other Turkic speaking populations. The
UZA population appeared to result from a similar general process of split and
admixture for both genetic and linguistic data. Nevertheless, these processes differed
in their chronology and in the intensity of the admixture process.
The ratio t̂1/t̂0 was indeed an order of magnitude higher for the genetic
scenario than for the linguistic scenario. Assuming that the genetic and linguistic
admixture events happened synchronously, the ancestral linguistic divergence
happened long before the ancestral genetic divergence. Conversely, assuming that the
ancestral genetic and linguistic divergences happened synchronously, the genetic
admixture event was older than the linguistic admixture event.
According to historical records (Soucek, 2000), the first hypothesis seems more
plausible: the recent Turkic speaking population invasions probably led to a linguistic
shift (D’Errico and Hombert, 2009). This shift seems to result from an admixture
between the Indo-Iranian and Turkic vocabularies, strongly biased toward the latter,
rather than a complete linguistic replacement as previously proposed (Martínez-Cruz
et al., 2011). Conversely, the estimated proportions of genes inherited from each
group appeared to be similar. Previous studies also indicate a low rate of genetic
replacement in Central Asia (Zerjal et al., 2003), in agreement with a cultural
diffusion through trading routes (e.g. the Silk Road) but without extensive genetic
exchanges (Palstra et al., 2015).
5.2. Stronger genetic than linguistic isolation in the Tajikistan
Yagnob speakers
We found that the scenario of genetic divergence without subsequent gene flow
was the best supported for the genetic data. Conversely, for the linguistic data, we could
not assess whether the linguistic divergence was followed by vocabulary borrowings
or not, as both scenarios appeared as equally likely. If borrowings occurred, they
would have nevertheless been very limited, as shown by the low estimated borrowing
rate of 0.4%.
Interestingly, we estimated, as above, a t̂1/t̂0 higher for the genetic scenario
than for the linguistic scenario. If the genetic and linguistic divergences between the
ancestral populations happened synchronously, then the linguistic divergence between
the ancestors of the TJY and the other Indo-Iranian speaking populations occurred
much more recently than the genetic divergence. Conversely, assuming the genetic
and linguistic divergences between the TJY ancestral population and the ancestors of
the other Indo-Iranian speaking populations happened synchronously, then the
divergence between the ancestral populations would be more ancient linguistically
than genetically.
Whichever scenario we considered, we showed limited linguistic exchanges and
no genetic exchanges, which indicated that cultural exchanges were maintained
without genetic exchanges, potentially through commercial relationships (Renfrew,
1987). Cultural norms may limit genetic exchanges between populations without
limiting cultural exchanges, as it is frequently the case in Central Asia (Heyer et al.,
2009). Indeed, ethnic constructs may produce endogamy rules, which limit the
probability of inter-marriages between groups. Economic relationships, geographical
proximity, and migration may favour cultural exchanges despite this genetic isolation.
5.3. Conclusions and Perspectives
In this study, we investigated the coevolution between genes and languages at a
regional scale. Genetic and linguistic diversities result, respectively, from the
demographic and cultural histories of the populations. Using separately one or the
other type of data may implicitly assume that demographic and cultural histories are
linked (Amorim et al., 2013; Gray et al., 2009). We showed that these histories can
differ substantially, as pointed out also by other authors (Creanza et al., 2015; Hunley
et al., 2012; Steele and Kandler, 2010). We did not assume a strict parallelism
between genetic and linguistic evolutions. On the contrary, our approach allowed us to
highlight discrepancies between genetic and linguistic inferences and to provide new
insights in the history of the studied populations.
As pointed out by Cavalli-Sforza et al. (1992), the
parallelism between genetic and linguistic evolutions should be weaker at a local scale
than at a more global, worldwide, scale. This is likely due to an intrinsic difference
between genes and languages: the former can only be transmitted vertically while the
latter can be transmitted vertically, horizontally and obliquely (Tao Gong, 2010).
Nevertheless, strong links between genetic and linguistic histories may also be
observed at a local scale in some cases (Lansing et al., 2007), whereas strong
discrepancies may be observed at a larger scale (Creanza et al., 2015; Hunley et al.,
2012). Thus, whether linguistic and genetic evolutions are congruent should be
assessed case by case, as the very histories of the populations under study may differ.
Several extensions of our model are possible. We assumed here a neutral
linguistic evolution, where each word evolved independently with its own mutation
rate, and where no burst of innovation occurred. Relaxing these assumptions could
improve our knowledge of language evolution. It may also allow us to perform better
inferences of parameters such as borrowing rates or divergence times among linguistic
varieties. Moreover, we assumed a model of linguistic evolution with discrete
generations. The linguistic generation time is not easily defined; we showed that it is
not strictly equivalent to demographic generation times. Finally, a linguistic sampling
at the individual scale could make it possible to build and study a much wider range
of models of evolution, and would also allow genetic and linguistic data to be compared
at the individual level, which cannot be achieved when considering population-level
language varieties. This type of model should allow us to better understand linguistic
and genetic evolutions, and the potential links between them.
Chapter II – Inferring linguistic
transmission between generations at
the scale of individuals
Valentin Thouzeau†, Antonin Affholder†, Philippe Mennecier†, Paul Verdu†,1, Frédéric
Austerlitz†,1
† CNRS, MNHN, Université Paris Diderot, UMR 7206 Eco-Anthropologie et
Ethnobiologie, Paris 75016, France
1 These authors equally supervised this work
This article is currently in preparation.
1. Introduction
Linguistic data have been extensively used recently in computational
frameworks to reconstruct some aspects of the history of human populations
(Atkinson, 2011; Bouckaert et al., 2012; Gray and Atkinson, 2002; Pagel et al., 2013).
These data consist mainly of the presence or absence of listed items in a
given set of contemporaneous languages, as in databases like the World Atlas of
Language Structures, WALS (Dryer and Haspelmath, 2013), or the Global Database of
Cultural, Linguistic and Environmental Diversity, D-PLACE (Kirby et al., 2016).
Most computational studies aiming at reconstructing language histories from current
linguistic data consider languages at a macro-evolutionary scale. For instance, Gray
and Atkinson (2002) used Swadesh lists from 87 languages to investigate the
origin of the Indo-European linguistic family. Atkinson (2011) studied the number of
phonemes used in 504 languages worldwide to test the hypothesis of a serial founder
effect due to the Out-of-Africa expansion. Reesink et al. (2009) studied the linguistic
diversity of the ancient Sahul continent (present-day Australia, New Guinea, and
surrounding islands) across 121 languages using diverse structural features.
These approaches rely implicitly on several assumptions. They primarily
require a clear division between differentiated languages, treated as a set of discrete
units. Nevertheless, this notion of distinct languages is sometimes irrelevant at a local
scale, in particular in a context of dialect continuum or linguistic contact (Heeringa
and Nerbonne, 2001; Livingstone and Fyfe, 1999). These studies thus do not take into
account the within-population linguistic diversity, since traditional linguistics often
considers languages as unique and coherent systems (Pateman, 1983). Indeed,
only a few field linguists systematically record the linguistic
diversity within a given language community. Sampling campaigns are mainly
conducted at the language scale, hiding the intra-language diversity.
This implies the loss of a large amount of information, given that
demographic phenomena at the population level (differences in population size,
bottlenecks, expansions) are expected to play a major role in language evolution
(Vogt, 2009). However, these phenomena are rarely taken into account by the models
reconstructing the history of languages. Including contemporaneous within-population
linguistic diversity in the reconstruction of the demographic history of human
populations at a local scale should thus open a whole new dimension in the field of
historical linguistic inferences.
Croft (1996) thus argued for replacing the ‘essentialist’ theory of
language change with a ‘population’ approach to language change. He proposed
a review of the “evolutionary linguistics” field (Croft, 2008), detailing some work
developed in an evolutionary historical linguistics framework. Nevertheless, very few
studies deal with the contemporaneous within-population linguistic diversity in a
historical-reconstruction perspective. Some recent examples include the use of
surnames in Austria as contemporaneous linguistic information (Barrai et al.,
2000), the use of family names in different contexts (Darlu
et al., 2012), or the use of the proportion of African words in free speech in Cape Verdean
Kriolu (Verdu et al., 2017).
Historical linguistic inferences imply knowledge about the causal mechanisms
linking the observed data to the set of possible historical scenarios which may have produced
them. Nevertheless, there is no consensual theoretical framework able
to handle within-population linguistic diversity data in order to infer the underlying
historical scenarios and evolutionary mechanisms. It is indeed impossible to assume
at the outset a clear and well-delimited mechanism of linguistic evolution, and then to study the
range of historical scenarios that could have produced the observed linguistic data.
We propose, in this article, to evaluate a series of models of linguistic evolution
between generations at the individual scale. These models are based on the personal
linguistic knowledge of each individual, instead of a language external to the
individuals. We thus do not study the history of higher-order objects such as “the
languages”, but the history of the linguistic diversity carried by individuals within the
populations. We aim here at understanding how the linguistic items are transmitted
from generation to generation, as functions of several demographic parameters.
Approximate Bayesian Computation methods (ABC, Beaumont et al., 2002;
Tavaré et al., 1997) provide a particularly well-adapted framework to tackle the
problems presented here. They offer a means of jointly inferring the most likely historical
scenario among a set of possible ones, along with the mechanisms of linguistic
transmission between generations. We used the recently developed Approximate
Bayesian Computation via Random Forest algorithm to choose among the
competing scenarios and to estimate the parameters of the “winning” models and
associated scenario (Breiman, 1999; Pudlo et al., 2016).
We implemented, in a computer program, the simulation of historical scenarios
under the models we proposed, and we evaluated the congruence of simulated data
with a real dataset from Central Asia. This dataset consists of 30 individuals sampled
for 185 words across 10 villages in Tajikistan. These villages are known to use the
same language, but with some variability among individuals (Mennecier et al.,
2016a). The analyses of these data provided a proof of concept for the use of
contemporaneous within-population linguistic diversity to infer historical features of
the cultural evolution of a human population.
2. Models
2.1. Production of utterances
We considered a linguistic population as a group of individuals who may
potentially interact through communication. The mechanisms of linguistic
communication and linguistic transmission may follow different modalities, which
correspond to different models of linguistic evolution. Nevertheless, we considered that
the unit of linguistic communication is the utterance, a production of linguistic items
associated with a meaning.
Each linguistic item is one possible variant within a class. For example, the words
“Multa” and “Papillon” are two items of the class Butterfly, and either may
be used in an utterance to express the same meaning. In linguistics, a class of
items is called a paradigm.
Here, cognates are specific to a context and an individual. This is different from
cognates sampled at the language scale, for which individuals are considered as users
of the language instead of producers of the language.
During the field work, the protocol of linguistic recording is an act of
communication through utterance. Despite the unusual setting of the linguistic
questionnaire, the utterances produced by the individuals are considered like any other
act of utterance that the individuals may produce during their lifespan.
2.2. Four models of acquisition of a new language
We developed an individual-based forward-in-time simulation model, in which
we assumed that populations were composed of only two types of individuals:
“learners” and “teachers”. Moreover, we assumed that the rules of utterance
production of a teacher depend only on the utterances he/she heard when he/she was
a learner. We assumed that each learner chooses only one item from each class during
the learning phase. Two learners may choose the same linguistic item. After the whole
learning phase, each teacher is discarded and each learner becomes a teacher.
We tested here four models of linguistic acquisition during learning (Figure
II.1). The models differed in the number of teachers involved in
language acquisition, and in the relative roles of these teachers.
Figure II.1 – Four models of linguistic transmission between generations. Each white circle represents an individual. The utterances that individuals produce depend only on the utterances that their teachers produced at the previous generation, and on the mutations induced during the transmission. Transmission of linguistic items by teachers follows four possible modalities: (a) a “Clonal” model with only one teacher per learner, (b) a “Sexual” model with two teachers associated with a distinct set of vocabulary for each sex, (c) a “Sexual2” model with two teachers without a distinct set of vocabulary for each sex, and (d) a “Social” model with the whole population as teacher for each learner.
In the first model, named the “Clonal” model, each learner selects a teacher at
random and copies “clonally” every item that he/she produces. In the second model,
named the “Sexual” model, two teachers (a male and a female) are assigned at
random to each learner. He/she then copies directly the first half of the items from
the male, and the second half of the items from the female. Thus, one
fixed half of the items is always transmitted by males, and the
other half is always transmitted by females. In the third model, named
the “Sexual2” model, each learner selects two teachers (a male and a female) at
random. For each class, he/she copies at random either the item from the male or the item
from the female. No item is transmitted only by males or only by females: every item is
transmitted from one parent chosen at random. In the fourth model, named the
“Social” model, for each class each learner copies an item drawn at random from the
items produced by all the teachers in the population.
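As an illustration, the four transmission rules can be sketched in a few lines of Python. This is a minimal sketch and not the PLS2 implementation: the function names, the representation of an individual as a list holding one item per class, and the use of Python's random module are all assumptions made for the example. Mutation is left out here.

```python
import random

# Hypothetical representation: an individual is a list with one item per class.

def clonal(teachers, rng):
    """Copy every item from a single teacher chosen at random."""
    return list(rng.choice(teachers))

def sexual(males, females, rng):
    """First half of the classes from a random male, second half from a random female."""
    father, mother = rng.choice(males), rng.choice(females)
    half = len(father) // 2
    return father[:half] + mother[half:]

def sexual2(males, females, rng):
    """For each class, copy at random the item of the male or of the female teacher."""
    father, mother = rng.choice(males), rng.choice(females)
    return [rng.choice(pair) for pair in zip(father, mother)]

def social(teachers, rng):
    """For each class, copy the item of a teacher drawn from the whole population."""
    n_classes = len(teachers[0])
    return [rng.choice(teachers)[j] for j in range(n_classes)]
```

In the Sexual sketch the split between the two halves is fixed by position, so a given class is always inherited from the same sex, whereas in the Sexual2 sketch the source is redrawn for every class.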
For each model, the copying process may produce an error and create a
completely new item. We call this type of error a “linguistic mutation”. The mean
mutation rate $\mu_L$ was drawn from a log-uniform distribution between $10^{-6}$ and $10^{-1}$
mutations per lexical item per generation. For each item, its mutation rate was drawn
from a beta distribution with mean $\mu_L$ and shape $\beta = 2$, allowing us to simulate a set
of linguistic items with different rates of change. We developed a new simulation
software, PopLingSim 2 (PLS2), according to these models of linguistic evolution.
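The drawing of mutation rates can be sketched as follows. This is a hedged example, not the PLS2 code: the function names are hypothetical, and we assume that the shape $\beta = 2$ is the second shape parameter of the beta distribution, so that the first shape parameter must be $\alpha = \mu_L \beta / (1 - \mu_L)$ for the expectation $\alpha / (\alpha + \beta)$ to equal $\mu_L$.

```python
import math
import random

def draw_mean_rate(rng, low=1e-6, high=1e-1):
    """Log-uniform draw: uniform on the log10 scale between low and high."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def draw_item_rates(mean_rate, n_items, rng, beta=2.0):
    """One rate per item, from a Beta(alpha, beta) whose expectation is mean_rate."""
    alpha = mean_rate * beta / (1.0 - mean_rate)  # assumed parameterisation
    return [rng.betavariate(alpha, beta) for _ in range(n_items)]

def copy_with_mutation(item, rate, rng, next_new_item):
    """Copy an item; with probability `rate`, replace it by a brand-new item.

    Returns the copied item and the updated counter of new item identifiers.
    """
    if rng.random() < rate:
        return next_new_item, next_new_item + 1
    return item, next_new_item
```

Numbering new items with a running counter is one simple way to guarantee that a mutation never recreates an existing item.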
2.3. Historical scenario
We focused here on a single linguistic population, defined as a language
community, whose individuals have been sampled using a linguistic questionnaire.
Forward in time, this linguistic population evolved with a constant size $N_0$ until
$t = 5 \times N_0$, a time that, as we visually checked, was sufficient to reach an equilibrium
between the production of linguistic diversity through mutation and the reduction of
this diversity through random sampling. This population then evolved with a new size
$N_1$ during $t_0$ generations. The linguistic items were then sampled at present day. This
historical scenario allows a range of histories, depending on the relative values of the
parameters $N_0$ and $N_1$ and on the value of $t_0$. The population sizes $N_0$ and $N_1$ were drawn
from a uniform distribution between 100 and 1000 individuals, this low upper bound
being set to limit the very high computational cost of these forward-in-time
simulations. Time $t_0$ was drawn from a uniform distribution between 0 and 1000
generations. The median, the minimum, the maximum, and the 2.5% and 97.5% quantiles of the
priors of the models are summarized in Table II.1.
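The two-phase structure of this scenario can be sketched as a short forward-in-time loop. The example below is a simplification under stated assumptions: it follows a single class of items, uses Social-type random copying from the whole previous generation, applies one mutation rate, and the function name and arguments are hypothetical rather than taken from PLS2.

```python
import random

def simulate_scenario(n0, n1, t0, rate, rng, n_sampled=30):
    """Items of one class sampled at present day under the two-phase scenario."""
    population = [0] * n0   # everybody starts with the same item
    next_new_item = 1       # identifier given to the next mutant item
    # Phase 1: size n0 during 5 * n0 generations (mutation-drift equilibrium).
    # Phase 2: size n1 during t0 generations.
    for size, duration in ((n0, 5 * n0), (n1, t0)):
        for _ in range(duration):
            learners = []
            for _ in range(size):
                item = rng.choice(population)   # copy from a random teacher
                if rng.random() < rate:         # linguistic mutation
                    item, next_new_item = next_new_item, next_new_item + 1
                learners.append(item)
            population = learners               # learners become teachers
    return rng.sample(population, min(n_sampled, len(population)))
```

The nested loops make the cost explicit: the first phase alone requires $5 \times N_0^2$ copying events per class, which is consistent with the remark on the computational cost of large population sizes.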
Figure II.2 – Historical scenario. Its structure depends on the relative values of the parameters $N_0$ and $N_1$. If $N_0 = N_1$, we assume a scenario of constant population size. If $N_0 < N_1$, we assume a scenario of population expansion. If $N_0 > N_1$, we assume a scenario of population contraction.
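| Parameter | Median | Min | Max | Quantile 2.5% | Quantile 97.5% |
|---|---|---|---|---|---|
| $N_0$ | 550 | 100 | 1000 | 122 | 978 |
| $N_1$ | 550 | 100 | 1000 | 122 | 978 |
| $t_0$ | 500 | 0 | 1000 | 25 | 975 |
| $\mu_L$ | $3.165 \times 10^{-4}$ | $10^{-6}$ | $10^{-1}$ | $1.35 \times 10^{-6}$ | $7.73 \times 10^{-2}$ |
| $N_0 \times \mu_L$ | 0.150 | $10^{-4}$ | 100 | $5.25 \times 10^{-4}$ | 44.5 |
| $N_1 \times \mu_L$ | 0.150 | $10^{-4}$ | 100 | $5.25 \times 10^{-4}$ | 44.5 |
| $t_0 \times \mu_L$ | 0.116 | 0 | 100 | $2.80 \times 10^{-4}$ | 42.0 |

Table II.1 – Summary of the prior distributions of the parameters for the four models.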
3. Materials
We sampled 30 individuals from 10 villages in Tajikistan (Figure II.3). For each
individual, we recorded the words used for 185 meanings from an adapted Swadesh
list. We considered as a “cognate” a group of words with the same etymological origin
and the same meaning, such words being more likely to be related by common
ancestry. The classification of the lexical data gathered in the field into cognates was
performed by Philippe Mennecier following previous work (2016).
4. Analyses
4.1. Simulations
For each model, we performed 10 000 simulations using our newly developed
software PopLingSim 2 (PLS2). We parallelized the computations over 250 cores of
the Genotoul cluster, for a total of approximately 90 000 CPU hours. Most of this
computational time was spent in the phase of $5 \times N_0$ generations needed to reach
the equilibrium between mutation and drift.
Figure II.3 – Geographical distribution of the 10 sampled units under study.
During the process of sampling linguistic items from our simulations, we drew a
number of missing values equal to the number of missing values in our real dataset,
to avoid the bias induced by missingness in the computation of the summary
statistics needed for the ABC procedures.
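One simple way to reproduce the amount of missing data in a simulated sample is sketched below. The text only states that the number of missing values matches that of the real dataset; placing them uniformly at random over individuals and classes, coding them as None, and the function name itself are assumptions of this example.

```python
import random

def add_missing_values(lists, n_missing, rng):
    """Return a copy of `lists` with n_missing entries replaced by None.

    lists[k][j] is the item recorded for individual k and class j.
    """
    cells = [(k, j) for k, row in enumerate(lists) for j in range(len(row))]
    masked = [list(row) for row in lists]      # leave the input untouched
    for k, j in rng.sample(cells, n_missing):  # distinct cells, no repeats
        masked[k][j] = None
    return masked
```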
4.2. Summary statistics
We constructed a new set of summary statistics, some of which were inspired
by classical population genetics statistics. After computing $p_{i,j}$, the proportion of
individuals using item $i$ of class $j$, we computed the linguistic diversity
$D_j = 1 - \sum_i p_{i,j}^2$, analogous to the gene diversity (Nei, 1987).
Then, we computed:
- The mean linguistic diversity, $D$;
- The range of the linguistic diversity, $R(D)$;
- The variance of the linguistic diversity, $V(D)$;
- The number of strictly different lists of items, $S$;
- The mean number of items in each class, $N$;
- The variance of the number of items in each class, $V(N)$;
- The frequency spectrum of the number of items per class, $F$.
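The statistics listed above can be computed as in the following sketch. Several choices here are assumptions and not necessarily those of the original analysis: the data are taken to contain no missing values, variances are population variances, and the frequency spectrum is represented as the number of classes observed for each number of items. The function and key names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance

def summary_statistics(lists):
    """lists[k][j] is the item used by individual k for class j."""
    n_classes = len(lists[0])
    diversities, n_items = [], []
    for j in range(n_classes):
        counts = Counter(row[j] for row in lists)
        total = sum(counts.values())
        # D_j = 1 - sum_i p_ij^2, analogous to gene diversity
        diversities.append(1.0 - sum((c / total) ** 2 for c in counts.values()))
        n_items.append(len(counts))
    return {
        "mean_D": mean(diversities),
        "range_D": max(diversities) - min(diversities),
        "var_D": pvariance(diversities),
        "S": len({tuple(row) for row in lists}),  # strictly different lists
        "mean_N": mean(n_items),
        "var_N": pvariance(n_items),
        "F": dict(Counter(n_items)),              # classes per number of items
    }
```

For four individuals and two classes, with two equally frequent items in the first class and four distinct items in the second, the diversities are 0.5 and 0.75.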
4.3. Model selection
Before the model selection, we performed a goodness-of-fit test with the R package abc
(Csilléry et al., 2012) to check whether the simulations were able to produce data
close to the real data. We performed model selection using the R package abcrf with
the RF algorithm and the function abcrf (Pudlo et al., 2016). We graphically checked
whether a forest of 500 trees allowed the error rate to converge. We then performed a
cross-validation analysis using the out-of-bag approach implemented in the package
abcrf, evaluating whether the algorithm was a priori able to distinguish between the four
models.
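The output of such a cross-validation is usually read as a confusion matrix. The sketch below shows how one could tabulate true against predicted models and derive an overall error rate; it does not call abcrf and does not train a forest, and the function names and the list-based inputs are assumptions made for illustration.

```python
from collections import Counter

MODELS = ("Clonal", "Sexual", "Sexual2", "Social")

def confusion_matrix(true_models, predicted_models):
    """Rows are the true models, columns the predicted models, in MODELS order."""
    counts = Counter(zip(true_models, predicted_models))
    return [[counts[(t, p)] for p in MODELS] for t in MODELS]

def error_rate(matrix):
    """Proportion of pseudo-observed datasets assigned to the wrong model."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1.0 - correct / total
```

Large off-diagonal counts shared between two models indicate that these models cannot be told apart a priori with the chosen summary statistics.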
4.4. Parameters estimation
We used the RF algorithm with the function regAbcrf of the package abcrf to
estimate the expectation, the median, the variance and the 2.5% and 97.5% quantiles of the
parameters $N_1$, $N_0$, $t_0$, $\mu_L$ and of the composite parameters $N_1 \times \mu_L$, $N_0 \times \mu_L$ and $t_0 \times \mu_L$. Note
that the RF algorithm does not estimate the whole distribution of the parameters
directly, but estimates the quantiles of the distribution instead.
5. Results
5.1. Model selection
Using the goodness-of-fit test, we verified that there was no significant
difference between the real and simulated datasets (p-value = 0.55, with a number of
replications = 1000). We performed the RF analysis using 500 trees, and we verified
graphically that the error rate converged. The RF analysis rejected the
Clonal and Sexual models, and favoured the Sexual2 and Social
models (Table II.2), with a posterior probability of 0.499 for the Social model.
The cross-validation analysis (Figure II.4) indicated a good a priori
differentiation between the Clonal model, the Sexual model and the group formed by the
Sexual2 and Social models. Nevertheless, the Sexual2 and the Social models cannot be
distinguished a priori. It is therefore impossible at this stage to choose, based on our data,
between the Sexual2 and the Social models, but we may be confident in the
falsification of the Clonal and the Sexual models.
| Clonal | Sexual | Sexual2 | Social | Post. Prob. |
|---|---|---|---|---|
| 0.002 | 0.04 | 0.478 | 0.48 | 0.499 |

Table II.2 – Proportion of votes for the four models of linguistic evolution, and the posterior probability of the Social model.
5.2. Parameter estimation
For the two more likely models (Sexual2 and Social), we could not estimate
separately the parameters $N_0$, $N_1$ and $t_0$: the estimated quantiles of their posterior
distributions were similar to the quantiles of the priors (Tables II.3 and
II.4). Nevertheless, the estimated quantiles of the parameter $\mu_L$ and of the composite
parameters $N_1 \times \mu_L$, $N_0 \times \mu_L$ and $t_0 \times \mu_L$ were substantially narrower than those of the priors (Tables
II.3 and II.4). Using the estimated posteriors for the Sexual2 and the Social models, we
estimated that the linguistic mutation rate ranged between $1.98 \times 10^{-4}$ and $1.44 \times 10^{-3}$.
Figure II.4 – Confusion matrices from the out-of-bag cross-validation analysis of the four models, using 10 000 pseudo-observed data.
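| Parameter | Expectation | Median | Variance | Quantile 2.5% | Quantile 97.5% |
|---|---|---|---|---|---|
| $N_0$ | 526 | 499 | 43331 | 126 | 968 |
| $N_1$ | 645 | 714 | 65762 | 154 | 975 |
| $t_0$ | 479 | 466 | 87448 | 21 | 937 |
| $\mu_L$ | $4.66 \times 10^{-4}$ | $3.23 \times 10^{-4}$ | $1.13 \times 10^{-7}$ | $2.18 \times 10^{-4}$ | $1.44 \times 10^{-3}$ |
| $N_0 \times \mu_L$ | 0.243 | 0.193 | 0.039 | 0.057 | 0.87 |
| $N_1 \times \mu_L$ | 0.255 | 0.244 | $4.10 \times 10^{-3}$ | 0.15 | 0.467 |
| $t_0 \times \mu_L$ | 0.239 | 0.177 | 0.064 | $8.092 \times 10^{-3}$ | 1.152 |

Table II.3 – Summary of the posterior distributions of the parameters, assuming a Sexual2 scenario.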
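| Parameter | Expectation | Median | Variance | Quantile 2.5% | Quantile 97.5% |
|---|---|---|---|---|---|
| $N_0$ | 544 | 542 | 60108 | 153 | 986 |
| $N_1$ | 655 | 681 | 61907 | 148 | 966 |
| $t_0$ | 353 | 290 | 109196 | 9 | 954 |
| $\mu_L$ | $4.26 \times 10^{-4}$ | $3.14 \times 10^{-4}$ | $1.03 \times 10^{-7}$ | $1.98 \times 10^{-4}$ | $1.28 \times 10^{-3}$ |
| $N_0 \times \mu_L$ | 0.203 | 0.175 | 0.028 | 0.074 | 0.553 |
| $N_1 \times \mu_L$ | 0.255 | 0.246 | $4.85 \times 10^{-3}$ | 0.122 | 0.432 |
| $t_0 \times \mu_L$ | 0.204 | 0.126 | 0.098 | $5.33 \times 10^{-3}$ | 1.09 |

Table II.4 – Summary of the posterior distributions of the parameters, assuming a Social scenario.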
6. Discussion
In this article, we built four models of intra-population linguistic evolution, at
the individual scale. We compared the simulated data with a real dataset of 30
individuals in Tajikistan carrying 185 cognates.
First, we showed that some of our models were able to produce simulated data
close to the contemporaneously observed data. This means that we were able to specify
mechanisms of linguistic reproduction between generations, in the form of a set of transmission
models at the individual scale, that are consistent with the linguistic diversity of the sampled
populations.
We provided inferences of some features of the linguistic history, selecting the
most plausible mechanisms of linguistic transmission, and estimating the parameters
of the selected models. The low support for the Clonal and Sexual models
compared to the Sexual2 and the Social models indicates that the mechanisms of
linguistic acquisition more probably follow a process of linguistic recombination involving
several teachers than a process of transmission without recombination. It would be of
great interest to distinguish between a transmission following a Sexual2 model (with
only two teachers), and a transmission following a Social model (with a whole
community as teacher).
The estimate we provided of the mean linguistic mutation rate of the lexical
items of the Swadesh list falls between $10^{-4}$ and $10^{-3}$ mutations per lexical item per
generation. Our micro-evolutionary context (i.e. at the scale of individuals) may
be compared with a macro-evolutionary context (i.e. at the scale of a whole language
or a linguistic variety). The mutation rate per item, per generation and per
individual estimated here falls in the same range as the mutation rate per item
per generation in macro-evolutionary studies (Pagel et al., 2007a). Considering that
languages at the global scale emerge from the interactions of individuals, our
result leads us to hypothesise that the mutation rate estimated globally emerges from the
mutation rate at a local scale.
Contrary to most other studies using within-population linguistic diversity
(Baxter et al., 2009; Danescu-Niculescu-Mizil et al., 2013; Kandler et al., 2010), we
only used contemporaneous linguistic diversity. This method allows us to perform
historical inferences based only on sampling campaigns conducted in existing
populations. The amount of information available then depends only on the
sampling effort, and not on the relatively limited historical records.
Some theoretical obstacles nevertheless remain. First, the models
of linguistic acquisition that we propose do not integrate the particular constraints of
communication processes: they assume a neutral production of variants without any
constraints on linguistic communication. Some evolutionary linguists would argue for
an integration of the particularity of languages as communication systems, associated
with a strong set of constraints (Beckner et al., 2009). Indeed, individuals tend to maximize
the probability of being understood and to minimize the cost of communication,
which is probably a main driver of evolutionary processes (Tamariz and Kirby, 2015).
These constraints are particularly strong in the case of the evolution of phonological,
morphological, or syntactical systems, and we may wonder if lexical variants are
subject to these constraints too. If so, these particularities of linguistic systems may
be at odds with inferences based on a model of neutral evolution, and should thus be
taken into account for a more accurate model of linguistic evolution at the individual
scale for historical inference purposes.
Moreover, we assumed that linguistic transmission occurs between generations,
overlooking the effects of iterated communication between individuals of the
same generation. We should thus consider in future investigations a set of alternative
models of language evolution, where the acquisition of language results from a series
of interactions between individuals rather than from a unique transmission event.
Finally, note that the formalism of our models is close to the formalism of
population genetics. This should make it possible to propose joint inferences coupling genetic
and linguistic data for the same set of populations and individuals, but some
theoretical limits remain. We may wonder whether a speech community (a “linguistic
population”) is identical to a reproductive group (a “genetic population”). It is far
from obvious that human reproductive boundaries overlap language boundaries in
human groups. A joint model between genetics and linguistics would thus require
clarifying and rigorously articulating the concepts of population genetics with the
concepts of population linguistics to propose robust joint inferences.
Chapter III – Building a formalised
interface between population genetics
and population linguistics
Valentin Thouzeau†, Mathieu Tiret‡, Frédéric Austerlitz†,1, Paul Verdu†,1
† CNRS, MNHN, Université Paris Diderot, UMR 7206 Eco-Anthropologie et
Ethnobiologie, Paris 75016, France
‡ INRA, GABI, UMR 1313 Population, statistique et génomique, Jouy-en-Josas 78352,
France
1 These authors equally supervised this work
This article is currently in preparation.
Introduction
Numerous research studies have investigated the conceptual analogies between
genetic and linguistic evolutions (Atkinson, 2013; Ben Hamed and Darlu, 2007;
Cavalli-Sforza, 1997; Fitch, 2008; Gray et al., 2007; Hunley, 2015; List et al., 2016).
In “The Descent of Man”, Darwin proposed a parallel between the formation of
languages and the formation of species (Darwin, 1871):
The formation of different languages and of distinct species, and the proofs that both have been
developed through a gradual process, are curiously parallel. ... We find in distinct languages striking
homologies due to community of descent, and analogies due to a similar process of formation.
The pioneering study published by Cavalli-Sforza et al. (1988) made it possible to evaluate
the relevance of this quote, showing striking homologies between population trees and
language classifications, and confirming for the first time the relevance of Darwin’s
intuition. Since then, several studies have coupled genetic and linguistic analyses to
produce parallel inferences for the same set of populations, aiming at understanding
the possible links between genetic and linguistic histories (see for instance
Balanovsky et al., 2011; Hunley et al., 2008; Thouzeau et al., 2017). Phylogenetic
methods have also recently been coupled with population genetics approaches to
propose a worldwide super-tree integrating genetic and linguistic information (Duda
and Zrzavý, 2016). These studies made it possible to compare the history of genetic
populations and that of languages at a macro-evolutionary scale, contrasting the
patterns of linguistic and genetic differentiation.
In a population genetics approach, the notions of within- and among-population
diversity are central (Hartl and Clark, 2007). Genetic diversity is seen as
resulting from historical processes emerging from the repeated interactions between
individuals through time. Therefore, historical inferences in population genetics
models and simulations are centred at the individual level (Hoban et al., 2012; Judson,
1994), implementing the rules governing the interactions between agents to study the
global properties emerging from these repeated interactions. These simulations
themselves rely on known mechanisms of genetic transmission at the individual level.
Events like population splits, migrations, admixture, expansions, or bottlenecks are
well studied because they are explicitly specified at the individual level in the models.
Conversely, within-population linguistic diversity is rarely taken into account
when reconstructing linguistic histories (see chapter 2). Indeed, the current absence of
within-population inferences stems in part from the lack of a clear consensus on the causal
mechanisms responsible for the construction of linguistic diversity among
individuals. Without explicit mechanisms describing the way linguistic items are
transmitted between individuals, it is impossible to infer the histories of the
populations of speakers from within-population linguistic diversity patterns using an
agent-based approach. Given these limitations, complex events like migrations,
admixture, or population-size changes are rarely taken into account in linguistic
inferences.
Some authors, from the emerging field of “evolutionary linguistics” (Croft,
2008), proposed a first step to take into account within-population linguistic diversity
(Croft, 1996; Niyogi and Berwick, 1997; Tamariz and Kirby, 2016). Languages are
viewed as complex adaptive systems (Beckner et al., 2009), their properties emerging
at macro-evolutionary levels as a result of the iterated interactions between
individuals (Haspelmath, 1999; Kirby et al., 2008, 2014; Steels, 2011). This type of
study thus relies mainly on agent-based models (Steels, 1997), making explicit the
underlying behaviour of the agents. For instance, Zuidema and de Boer (2009)
showed that combinatorial phonology (i.e. the fact that the sounds produced by
speakers are categorized in discrete units) may emerge from repeated interactions
between individuals. Kirby (2001) showed that a structured mapping between strings
and meanings (i.e. the fact that one word corresponds to one meaning) could also be
the result of iterated learning. Nevertheless, no attempt has been made, to our
knowledge, to use contemporaneous within-population linguistic data with agent-
based models for explicit linguistic historical inferences.
Following the perspective proposed in chapter 2, we aimed in this chapter at
formalising an interface between population genetics and population linguistics, to
build a general model describing individuals interacting genetically and linguistically
in an agent-based approach.
In the first part of this chapter, we construct a theoretical framework
associating genetic and linguistic evolutions at the within-population level. We
present in section 1 a formalisation of biological evolution, allowing us to delimit the
notions of reproduction relationship and genetic population. We develop in section 2
a “population linguistics” framework, delimiting the notions of linguistic
communication relationship, individual grammars and linguistic population. We
then develop in section 3 the diversity of mechanisms that may occur during a
linguistic communication relationship. We then assemble all these notions in section 4,
describing our genetic and linguistic coevolution framework. We adopt a
formalisation that avoids the risk of making implicit underlying assumptions that
may be at odds with known results in genetics or linguistics.
Chapter III – Building a formalised interface between population genetics and population linguistics
In the second part of this chapter, we statistically evaluate the possibilities offered
by joint inferences under the framework built in the first part. We detail, in
section 1, our modelling and the Approximate Bayesian Computation
(ABC) method that we used. We fully developed a new simulation software, the Population
Linguistic and Genetic Simulator (PLGS), which simulates genetic and linguistic
coevolution for a given set of individuals in one or several populations. We then
perform, in section 2, a series of cross-validation analyses on simulated data, aiming at
testing the a priori possibilities of the framework presented in the first part.
Part 1 – Formalising genetic and linguistic coevolution
1. A formalisation of biological evolution
Our first objective was to delimit a biological evolution framework able to link
genetic evolution and linguistic evolution based on what they have in common: the
individuals, who both carry genes and speak languages. The purpose of this section
is to delimit the notions of reproduction relationship and genetic population. To do so,
we built on the theoretical foundations proposed by Barberousse and Samadi (2015,
abbreviated B&S in the following), where the individuals are central, and we then
formalised the notion of genetic population in theoretical terms.
B&S recently proposed a preliminary step for a formalisation of the biological
theory of evolution. While previous formalisations have been proposed (Gould, 2002;
Lewontin, 1970; Maynard Smith, 1987; Szathmáry and Maynard Smith, 1997), the
particularity of the proposal of B&S lies in the centrality of the organisms. B&S
take a neutralist perspective (Gould and Lewontin, 1979; Kimura, 1983), where
genetic drift and selection are seen as sampling processes, contingently dependent on
the ecological and historical contexts (biotic and abiotic). The domain of the theory is
the genealogical network, the set of all organisms that are linked to one another by
descent relationships. This network is characterized as follows by B&S:
Within a genealogical network, each organism is related to at least one other organism by a
reproduction relationship we call RG (footnote 5) in the following.
Definition Let there be two organisms a and b, aRGb if a and b have common direct
offspring. This means that a or b, or both, have transmitted, within finite time, some material
5. To clarify our reasoning, we call the reproduction relation RG, instead of R as in B&S.
substrate to one or more other organisms. The material substrate may be modified; it provides
the offspring with the capacity to reproduce. This general definition of the reproduction
relationship allows us to formalise different reproduction modes that are common in earthly
organisms:
- {aRGb} ≠ Ø and ∀c {c/ cRGa or cRGb} = Ø represents strictly monogamic biparental
reproduction;
- {aRGa} ≠ Ø and ∀b {b/ bRGa} = Ø represents strictly clonal reproduction;
- {aRGb} ≠ Ø and {aRGc} ≠ Ø represents biparental, polygamic reproduction.
In order to represent other modalities, it is possible to generalise relation RG so that it can
take any (finite) number of organisms as relata (footnote 6).
In this formalisation, each individual comes from the realization of a relation
RG, which is a transmission relationship of a material substrate. In our perspective, in
the case of the human species, the reproduction relationships imply a known genetic
transmission structure (see figure III.1): autosomal DNA is transmitted biparentally
with recombination, while mitochondrial DNA and Y chromosomes are transmitted
uniparentally, through the female and the male line, respectively. Each genetic marker
may mutate at a given mutation rate per generation.
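As an illustration only, the transmission structure described above can be sketched in Python. The sketch assumes a single autosomal locus (so recombination between loci is not represented), integer-coded haplotypes, and a stepwise mutation function. The names Individual, mutate and reproduce are hypothetical and are not taken from any existing software.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Individual:
    sex: str                     # "F" or "M"
    autosomes: Tuple[int, int]   # (maternal allele, paternal allele) at one locus
    mtdna: int                   # mitochondrial haplotype, inherited from the mother
    y: Optional[int]             # Y haplotype, None for females


def mutate(allele, rate, rng):
    """Hypothetical stepwise mutation: +/-1 with probability `rate`."""
    if rng.random() < rate:
        return allele + rng.choice((-1, 1))
    return allele


def reproduce(mother, father, rate, rng):
    """One instantiation of R_G between a mother and a father."""
    sex = rng.choice(("F", "M"))
    # Autosomes: one allele drawn at random from each parent (biparental).
    maternal = mutate(rng.choice(mother.autosomes), rate, rng)
    paternal = mutate(rng.choice(father.autosomes), rate, rng)
    # Mitochondrial DNA: maternal line only.
    mtdna = mutate(mother.mtdna, rate, rng)
    # Y chromosome: paternal line only, and only for sons.
    y = mutate(father.y, rate, rng) if sex == "M" else None
    return Individual(sex, (maternal, paternal), mtdna, y)
```

With a mutation rate of zero, a child always carries the mitochondrial haplotype of the mother, and a son always carries the Y haplotype of the father.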
In particular for our population genetics perspective, the known structure of
genetic transmission through the reproductive network makes it possible to make explicit the
6. The relata are the objects of a logic relation.
Figure III.1 – Structure of the reproduction relationship in human species. Circles represent individuals, arrows represent transmission relationships: half of the autosomes plus the chromosome Y from the father if the child is a male or the X chromosome if the child is a female, and the other half of the autosomes from the mother plus the mitochondrial DNA and a chromosome X.
historically contingent events, such as population splits, expansions, bottlenecks,
migration or admixture events, using mathematical or simulation frameworks (see
Figure III.2). Indeed, it is possible to propose a series of explanations and predictions
according to a series of historical scenarios. The predictions are transcribed through
agent-based mathematical formulas or simulation programs, and may ultimately be
confronted with real data to select the best scenarios. Historical events are considered
contingent with respect to the very structure of evolution through reproductive
relationships.
This formalisation offers several advantages from our perspective. First, it
clarifies the underlying assumptions of evolutionary biology theory, making explicit
its notions of reproductive relationship, reproductive network, and historically
contingent events. Second, it explicitly links the different evolutionary scales
Figure III.2 – Classic representation of the setting up of the reproduction network. Each row represents a generation. The vertical green lines represent the boundaries of the population. Demographical events like population size changes, migration, admixture, selections, or constraints over population boundaries, are seen as historically contingent. On the contrary, the very mechanisms of reproduction relation RG are seen as necessary.
(molecular, individual, species): the molecular scale, where genes are transmitted
through the reproductive events; the individual scale, where reproductive events build
the network of inter-individual relationships; and the species scale, built by mapping
preferential units of reproductive relationships. Third, as said above, it places human
individuals and their reproductive interactions at the centre of the theory.
Based on this framework, we propose a formalisation of the notion of
“population”. We define a population as a set of individuals for which a relationship
RG is preferentially instantiated (see Figure III.3).
2. A formalisation of linguistic evolution
Our objective was to perform robust historical inferences from linguistic data,
sampled at the individual scale, through agent-based modelling. The purpose of this
section is to delimit the notions of linguistic communication relations, individual
grammars and linguistic population. To do so, we built a theoretical framework able to
Figure III.3 – Alternative representation of the reproductive network. Only the individuals that reproduced during a given time step are represented. Multiple green dotted lines represent multiple instantiation of the reproductive relation RG. The green circle line represents the boundary of the genetic population.
explain and predict the evolution of languages through time, considering only inter-
individual interactions and contingent historical events. We aimed at integrating
classical results from laboratory experiments, computer simulations and linguistic
fieldwork concerning languages.
Analogies between biological and linguistic evolution cannot be used as
justifications of modelling hypotheses (Blevins, 2004; Claidière and André, 2012;
Testart, 2011). As pointed out by Smith (2014) concerning models of language
evolution:
Modellers therefore need to be flexible, yet careful to ensure that their design and
implementation decisions are plausible, justifiable and systematically explained.
First, we delimit the objects and provide the definitions that we will use
hereafter. Linguistic evolution may refer to (Steels, 2004; Tamariz and Kirby, 2016):
(1) Origin, in the human species, of the biological capabilities to produce linguistic
communication (Diller and Cann, 2011);
(2) Emergence and modification of structural properties of language (Cangelosi
et al., 2006; Gong and Wang, 2005);
(3) Modification through time of the linguistic variants used by individuals.
Our formalisation can be applied to cases (2) and (3), but we will mainly focus
on case (3) in part 2 of this chapter. Other studies focused on case (1), which goes
beyond the scope of this chapter.
Following Croft’s (1996) perspective on linguistic evolution, we considered that
utterances play a central role in linguistic evolution. For several evolutionary linguists
(Croft, 2013; Kirby et al., 2015), the event of linguistic communication is at the centre
of linguistic change. This utterance-based perspective of linguistic evolution was
borrowed from the more general perspective of evolutionary theory from Hull (1988).
For Croft (1993), an utterance is:
a particular instance of actually-occurring language as it is pronounced, grammatically
structured, and semantically interpreted in its context.
Moreover, utterances are the very objects actually observed by linguists during
the sampling of within-population linguistic diversity. In this perspective, languages at
the macro-evolutionary level emerge only as the result of iterated
interactions between individuals through linguistic communication.
We defined the individual grammar of each individual as what she/he uses to
produce and comprehend utterances. This is the individual process underlying
utterance production. We then proposed a formulation of linguistic evolution
decoupling the structure of the communication network, on the one hand, from the
structure of the individual grammars, on the other hand, which we will discuss in
section 3.
We considered the human communication network as the set of all speakers
linked to each other by at least one linguistic communication relationship. We
assumed that each speaker is related to at least one other speaker by a linguistic
communication relationship denoted RL. We did not focus primarily on the history of
the linguistic items used by individuals, but instead on the linguistic communication
network. We then defined the relationship RL as follows:
Let there be a speaker a and a listener b.
aRLb if a and b are engaged in a linguistic communication event where a
produces one or several utterances, and b comprehends these utterances (Figure III.4).
Note that the instantiation of the relation RL depends on spatial and temporal
constraints.
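A single instantiation of the relation RL can be sketched as follows. This is a minimal illustration, not the model used later in this chapter: it assumes that an individual grammar is a table of variant counts, that the speaker samples a variant in proportion to these counts, and that the listener simply increments the count of the variant heard. The function names produce, comprehend and communicate are hypothetical.

```python
import random


def produce(grammar, rng):
    """Speaker side: draw one variant in proportion to its count in the grammar."""
    variants = list(grammar)
    weights = [grammar[v] for v in variants]
    return rng.choices(variants, weights=weights, k=1)[0]


def comprehend(grammar, utterance):
    """Listener side: return an updated copy of the grammar after hearing `utterance`."""
    updated = dict(grammar)
    updated[utterance] = updated.get(utterance, 0) + 1
    return updated


def communicate(speaker, listener, grammars, rng):
    """One event `speaker R_L listener`; only the listener's grammar is modified."""
    utterance = produce(grammars[speaker], rng)
    grammars[listener] = comprehend(grammars[listener], utterance)
    return utterance
```

In this sketch the speaker's grammar is left untouched by the event, which reflects the fact that the two roles in a communication event are distinct.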
Figure III.4 – Structure of the linguistic communication relationship in human. Circles represent individuals, with the speaker on the left and the listener on the right. The full black arrow represents the function determining which utterance the speaker will produce using his/her individual grammar, and the dotted arrow represents the updating function of the individual grammar of the listener.
In this formalisation, we adopted the “organism-centred” perspective following
the agent-based perspective of the genetic evolution previously proposed (section 1).
We assumed that language learning and the formation and
modification of individual grammars occur through linguistic communication relationships.
Our formalisation of the linguistic evolution differs to some extent from the
framework of genetic evolution. First, a linguistic relationship does not generate a
new organism. The structure of the linguistic communications network is then
contingently constrained by the reproductive network: only existing individuals,
resulting from a reproductive event, may be expected to communicate. Second, the
relation RL is not symmetrical. A listener and a speaker engaged in a linguistic
communication relationship may not always switch their roles. Third, the relation RL
is expected to be massively more frequent than the relation RG.
As in our population genetic formalism, we defined a “linguistic
population” as a set of individuals for which the relationship RL is preferentially
instantiated (Figure III.5). Similarly to population genetics, the structure of the
linguistic population depends directly on historical contingencies: the size of the
population, its variation through expansions or bottlenecks, migration etc.
Figure III.5 – Representation of the linguistic communication network. Only the individuals who communicated and the communication which occurred during a given time step are represented. Multiple purple arrows represent multiple instantiation of the linguistic communication relation RL. The purple dashed circle represents the boundary of the linguistic population.
Here, the notion of linguistic population is close to the notion of speech
community (Labov, 1972), defined as a set of speakers involved in a series of
linguistic interactions and using a common set of linguistic conventions and norms.
3. Modalities of linguistic communications at the scale of the individuals
Our formalism of the linguistic network does not require specifying the precise
linguistic mechanisms of utterance production or their influence on the individual
grammar of the listener. The purpose of this section is to develop the range of
linguistic mechanisms that may occur during a linguistic communication relationship.
As detailed in section 1, biological evolution relies on the well-known mechanisms
of genetic transmission (Figure III.1). Conversely, the linguistic mechanisms ruling
individual grammars are not well known in linguistics. This is partly because
no material substrate is transmitted through linguistic communication relationships:
languages are “products of the human mind” (Popper, 1979), which accentuates the
difficulty.
Another aspect is that different linguistic levels (phonological, lexical,
syntactical…) may imply different rules of individual grammar. Particular cognitive
and structural constraints over each of these levels may shape utterance production
and their comprehension (Kirby et al., 2015; Nowak et al., 2002). Moreover, the
utterances are a production of the individual grammars, not the individual grammars
themselves. Therefore, listeners only access a part of linguistic information, through
linguistic communications. Some cognitive linguists (Culbertson, 2012) pointed out
that this lack of linguistic information available to the listeners, coupled with the
inferential cognitive biases of the listeners, should strongly orient the evolution of
linguistic structures. The structure of the cognitive biases influencing individual
grammars is of great interest and is widely debated in linguistics (Evans and
Levinson, 2009).
Laboratory experiments make it possible to study the functioning of individual
grammars (Tamariz and Kirby, 2016). How are languages learned over the lifespan?
Are there universal cognitive biases concerning the production and the learning
of languages? What are the effects of such biases? What are the effects of the
functional constraints of communication? These experimental studies are
valuable, because they delimit a set of plausible linguistic hypotheses concerning the
functioning of individual grammars. We should expect that the particularities of the
local mechanisms, after integration over a large set of individuals interacting during a
long period of time, should produce huge effects on linguistic evolution and on the
resulting linguistic diversities (Kirby et al., 2007; Smith et al., 2003). We thus argue
that individual grammars should be studied case by case, for each type of linguistic
item, and for each type of language.
We will now focus on the evolution of frequencies of a series of linguistic
variants among individuals. Variants are defined as a set of different realisations of a
particular linguistic class. Linguistic variants can be phonetic, phonological,
morphological, lexical, etymological, syntactical… They constitute a generic category
which refers to a series of linguistic items of the same meaning, thus belonging to one
linguistic class. For instance, two words differing by only one sound, but with the
same meaning, are two linguistic variants of the same phonemic class: they are two
phonetic variants. For the meaning left, the pronunciations /lɛft/, common in British
English, and /lift/, common in New Zealand English, constitute two such variants,
differing in the sounds /ɛ/ and /i/ (Watson et al., 2000). In another instance, several
words with the same meaning but with different etymological origins, are linguistic
variants of the same etymological class, i.e. lexical variants. For the meaning car, the
words “char”, “auto”, “machine”, and “voiture” are used in Canadian French and
constitute four such variants (Nadasdi et al., 2008). The Swadesh lists of cognates
(1952) are classical datasets of etymological variants used to infer linguistic histories.
A series of models have been proposed to describe acquisitions and changes of
linguistic variants in an utterance-based perspective (Baxter et al., 2009; Kirby et al.,
2014). Some authors (Baxter et al., 2006; Reali and Griffiths, 2010) argue that
listeners are Bayesian learners. The hypothesis of statistical acquisition and change of
the individual grammars seems indeed plausible from the perspective of language-
learning studies (Saffran, 2003) and theoretical results (Reali and Griffiths, 2010).
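One common way to formalise a Bayesian learner of variant frequencies is a Dirichlet-multinomial model. The sketch below is a minimal illustration under assumptions of our own: a finite and known set of variants, observed utterances that all belong to this set, and a symmetric Dirichlet prior with a hypothetical parameter alpha. It computes the posterior mean probability of each variant after a series of heard utterances; it is not the implementation of any of the cited studies.

```python
from collections import Counter


def posterior_mean(observed, variants, alpha=1.0):
    """Posterior mean of variant probabilities under a symmetric Dirichlet(alpha) prior.

    observed: list of variants heard by the listener (assumed to be in `variants`)
    variants: list of all possible variants of the linguistic class
    """
    counts = Counter(observed)
    total = len(observed) + alpha * len(variants)
    return {v: (counts[v] + alpha) / total for v in variants}
```

With a large alpha the learner stays close to a uniform use of the variants, whereas with a small alpha the observed frequencies dominate.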
Knowing this previous framework of utterance-based learning, several
modalities may thus be additionally incorporated in our population linguistics
perspective. First, a kind of selection process may occur: are the different variants of a
linguistic structure equivalent? Or do the different variants have different weights in
individual grammars? Patterns of linguistic change through time seem to indicate
that linguistic behaviours may be cognitively biased, for instance favouring linguistic
variants that are easier to remember or easier to pronounce (Blythe and Croft, 2012; Sturtevant,
1947). The social status of speakers constitutes a second modality. Does this status
(sex, age, profession…) affect the individual grammars of listeners in the same way?
This modality is often described in terms of conformity bias or prestige bias, where,
for instance, the utterances of a socially prominent speaker affect the individual
grammars more efficiently than those of other, less prominent, speakers in the
population (Henrich, 2001). Third, what is the impact of the structure of the linguistic
communication network: how do this structure, its centrality and its connectivity
modify linguistic change over time? The emergence of particular linguistic structures
may depend only on the shape of the linguistic communication network (Kauhanen,
2016).
In our perspective, all these different modalities may be formalised in terms of
structure of the linguistic communication network or of individual grammars. First,
the linguistic communication network describes the population size, the frequencies
of communication events, the structure of the social network, and all other historical
contingencies at the scale of the individuals. The modalities occurring at the scale of
the individuals are now commonly evaluated in population genetics (see for instance
Guillot et al., 2015; Palstra et al., 2015; Verdu et al., 2009), and could be evaluated as
well in population linguistics. Second, the embedded modalities of individual
grammars can be explicitly specified, implemented in simulation programs, and
should be formally tested case by case (Palminteri et al., 2017). In part 2 below, we
propose, for instance, three models of individual grammars describing how
individuals produce utterances and re-evaluate their individual grammar when
listening to other utterances (see part two, Figure III.9).
4. Coupling the reproductive and the communication networks
We have introduced separate formalisations of biological and linguistic
evolution. Each of these formalisations makes it possible to clarify the
assumptions underlying its framework and to propose justified and unambiguous
predictions. This makes it possible to propose a series of explicit models, a necessary step for any
inference-based approach. We now propose a coupled formalisation of
genetic-linguistic coevolution.
As emphasised by B&S (2015),
The existence of an organism can be visualised as a trajectory in space and time. This allows
us to express an important constraint: only organisms whose trajectories intersect can be relata
of relation RG.
This property of the relation RG is also true about the relation RL: only
individuals whose trajectories intersect temporally and spatially can linguistically
communicate. This property may be seen through the double meaning of
“intercourse”, as pointed out by Croft (1996).
Both linguistic and reproductive networks share, therefore, common constraints
over the instantiation of their constitutive relationships, even if some constraints may
specifically affect either reproductive relationships or linguistic communication
relationships.
The notion of “population” may here be seen as the delimitation of these
constraints. The genetic population defines the sets of individuals for which the
relation RG is preferentially instantiated. The linguistic population is defined by the
sets of individuals for which the relation RL is preferentially instantiated. Under this
terminology, the questions of the coevolution between genes and languages may be
understood at the scale of individuals. We propose in the following several hypotheses
concerning this coevolution.
Hypothesis 1: the genetic and linguistic populations strictly overlap: the spatio-
temporal constraints are the same for the two types of relations (Figure III.6). In this
case, languages and genealogies are expected to show a common phylogeny. It is then
legitimate to use both types of data to infer the unique history of the genetic-linguistic
population. This may be the case for highly isolated human groups, without migration
or cultural contact with any other human group. The common history of genes and
languages then results mainly from a shared set of geographical constraints.
Hypothesis 2: the genetic population is included in the linguistic population
(Figure III.7). In this case, the two types of relations (RG and RL) are constrained differently, and the
linguistic and genetic populations do not coincide. In other words, genetic
Figure III.6 – Representation of the setting up of the reproduction network as well as the linguistic communication network. Only the individuals who communicated or reproduced during a given time period are represented. Multiple purple arrows represent multiple instantiation of the linguistic communication relation RL. Multiple green dotted lines represent multiple instantiation of the genetic reproduction relation RG. The dotted circle represents the boundary of the genetic-linguistic population.
reproduction events may be strictly more constrained than linguistic communication
events. This type of structure may emerge, intuitively, after a process of
standardisation of a unique national language taught at school and via public media,
leading for example to individuals sharing a common language but keeping
preferential inter-marriages between geographically close individuals.
This should also occur in a society forbidding inter-marriages between delimited
social groups, like castes or clans, leading to several differentiated genetic populations
within a unique linguistic population. In this case, the instantiation of the relation RL
is less constrained than the instantiation of the relation RG. This difference in the two
types of constraints should then imply different delimitations of the objects (the
populations) considered in the historical reconstructions.
Hypothesis 3: the linguistic population is included in the genetic population (see
Figure III.8). In other words, we may hypothesise that the constraints over linguistic
communications are strictly stronger than the constraints over the genetic
reproductions. This type of structure should be encountered for instance in cases of
Figure III.7 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 2. Only the individuals who communicated or reproduced during a given time step are represented. The purple circle represents the boundary of the linguistic population. The green circles represent the boundaries of the two genetic populations.
strong differentiation between linguistic groups with very few linguistic exchanges,
but with rules of inter-marriages between these groups. Figure III.8 shows a case of
instantiation of the relation RG less constrained than the instantiation of the relation
RL.
Hypothesis 4: it is also possible that the genetic and linguistic populations
partially overlap (see Figure III.9). This hypothesis is a combination of Hypothesis
2 and Hypothesis 3, and should be observed in complex cases, where the two
networks are partially disjoint. Here, there are only two genetic and two linguistic
populations, but we may delimit three different units. In the case of a very reduced
overlap between the genetic and the linguistic populations, we may expect largely
diverging patterns of genetic and linguistic diversity. In this case, a unique notion of
“population” that does not differentiate the genetic and the linguistic populations is
clearly misleading.
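When one genetic population and one linguistic population are each represented as a set of individuals, the four hypotheses correspond to elementary set relations. The following sketch is only illustrative: the function name coevolution_hypothesis is hypothetical, and the comparison of a single pair of populations is a simplification of the figures, where several populations of one type may be nested within a population of the other type.

```python
def coevolution_hypothesis(genetic, linguistic):
    """Return the hypothesis number (1 to 4) matching two sets of individuals.

    Returns None when the two populations share no individual.
    """
    genetic, linguistic = set(genetic), set(linguistic)
    if genetic == linguistic:
        return 1  # strict overlap
    if genetic < linguistic:
        return 2  # genetic population included in the linguistic population
    if linguistic < genetic:
        return 3  # linguistic population included in the genetic population
    if genetic & linguistic:
        return 4  # partial overlap
    return None   # disjoint populations, outside the four hypotheses
```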
Figure III.8 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 3. Only the individuals who communicated or reproduced during a given time step are represented. The green circle represents the boundary of the genetic population. The purple circles represent the boundaries of the two linguistic populations.
Another aspect of genetic and linguistic coevolution is the potential link
between the instantiations of the relations RL and RG and the linguistic and genetic
characteristics of the individuals. The question is whether the genetic or linguistic
traits of the individuals can bias the realisation of their linguistic communication
events or reproductive events. For example, a language endogamy bias, where
individuals are more likely to mate when they are closer linguistically, implies that the
instantiation of the relation RG depends on the linguistic proximity between
individuals. This seems to be the case for some human groups, for instance in Central
Asian populations, where marriage rules are more constrained by a common
language than by a common geography (Heyer et al., 2009). In this case, we may
expect that the patterns of linguistic diversity affect genetic diversity, as genetic
differentiation among populations could result from previous linguistic differentiation
events. This mechanism of sympatric linguistic differentiation could then be
responsible for the differentiation of genetic populations without the need for spatial
isolation.
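A language endogamy bias of this kind can be sketched by making the probability of instantiating RG depend on a linguistic distance between individuals. The sketch below rests on assumptions of our own: each individual is summarised by one variant per linguistic class, the distance is the fraction of classes with different variants, and the mating weight decreases exponentially with this distance according to a hypothetical parameter called strength. None of these choices is taken from the cited studies.

```python
import math
import random


def linguistic_distance(g1, g2):
    """Fraction of linguistic classes for which two individuals use different variants."""
    return sum(a != b for a, b in zip(g1, g2)) / len(g1)


def mating_weights(focal, candidates, strength):
    """Relative mating weights; strength = 0 means no language endogamy bias."""
    return [math.exp(-strength * linguistic_distance(focal, c)) for c in candidates]


def choose_mate(focal, candidates, strength, rng):
    """Draw the index of a mate in proportion to the mating weights."""
    weights = mating_weights(focal, candidates, strength)
    return rng.choices(range(len(candidates)), weights=weights, k=1)[0]
```

With strength set to zero, all candidates are equally likely to be chosen, which corresponds to reproduction events that are independent of linguistic proximity.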
On the contrary, a preferential sociolinguistic association between individuals
based on their genetically determined phenotypes implies that the instantiation of the
relation RL depends on the genetic proximity between two individuals. It may be the
case in societies with strong emphasis on the notions of “races”, “ethnicities” or
Figure III.9 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 4. Only the individuals who communicated or reproduced during a given time step are represented. The green circles represent the boundaries of the two genetic populations. The purple circles represent the boundaries of the two linguistic populations.
“origins”, based on a variety of criteria that, consciously or unconsciously, correlate
with genetic diversity patterns. This phenomenon could account for the reconstruction
of a set of distinct language communities related to genetic variation from a relatively
homogeneous linguistic substrate.
Part 2 – Inferring genetic and linguistic histories
In part 1, we presented a framework to test a wide range of questions. We
propose, in this second part, a series of specific questions, each one associated with a
series of models or historical scenarios, to illustrate how our genetic-linguistic
coevolution framework may be used in practice to better reconstruct human biological
and cultural evolution from genetic and linguistic data.
Question 1: Should the individuals be considered as copiers, probabilistic
copiers, or Bayesian learners? This question aims at delimiting the mechanisms of the
individual grammars.
Question 2: Are the mutation rates different between the linguistic classes? This
question aims at describing if the variants of different linguistic classes mutate at the
same rate or at different rates.
Question 3: Are the sizes of the genetic and linguistic populations different?
This question aims at evaluating the relative size of the genetic and the linguistic
population of a given sample set.
Question 4: Do the sampled individuals belong to genetically and/or
linguistically differentiated populations? This question aims at proposing a test of
differentiation of the genetic and the linguistic populations.
Question 5: What is the tree topology for three populations? This question aims
at determining the history of splits which produced three genetic and linguistic
populations.
To address each question, we evaluated in each case how well an ABC method based
on Random Forest is a priori able to select the right scenario. To do so, we built
new software, the Population Linguistic and Genetic Simulator (PLGS),
which simulates the models and scenarios framed in part 1. We first detail the
formalised assumptions of the models and the software, and then the analysis of the
five questions using ABC.
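The logic of ABC scenario choice can be illustrated with a deliberately simplified stand-in. The sketch below does not use Random Forest: it uses the classical rejection approach, in which the simulations closest to the observed summary statistic are retained and each scenario is scored by its share of the retained simulations. The simulator, the scenario names and the single Gaussian summary statistic are hypothetical toy choices and have no relation to PLGS.

```python
import random


def simulate(scenario, rng):
    """Hypothetical toy simulator returning one summary statistic."""
    mean = {"one_population": 0.0, "two_populations": 10.0}[scenario]
    return rng.gauss(mean, 1.0)


def abc_model_choice(observed, scenarios, n_sims, k, rng):
    """Rejection ABC: vote among the k simulations closest to `observed`."""
    reference = [(abs(simulate(s, rng) - observed), s)
                 for s in scenarios for _ in range(n_sims)]
    nearest = sorted(reference)[:k]
    votes = {s: sum(1 for _, label in nearest if label == s) for s in scenarios}
    best = max(votes, key=votes.get)
    return best, votes
```

A cross-validation analysis in the same spirit would repeatedly simulate a pseudo-observed value under a known scenario, apply the choice procedure, and record how often the known scenario is recovered.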
1. Modelling
1.1. Sampling
We considered a given set of sampled individuals, in one or more sampling
units. We define a sampling unit as a set of individuals assumed a priori to belong to
the same genetic/linguistic population. For each sampling unit, individuals were
assumed to be part of one genetic population, an entity for which the relations RG are
preferentially instantiated, and of one linguistic population, an entity for which the
relations RL are also preferentially instantiated. The genetic population and the
linguistic population may overlap or not.
We assumed that 30 individuals were sampled per genetic and linguistic
population respectively. For each individual, we considered that 25 microsatellites
were genotyped and 50 linguistic variants (see section 3 of part 1) were obtained
through a linguistic questionnaire. Such numbers of genetic and linguistic markers are
low: we aimed to test our framework in unfavourable conditions, to assess the
minimal statistical power available with our method. We consider here only linguistic
classes with a potentially infinite number of variants.
1.2. Genetic model
We assumed that each relation RG producing an individual birth in the
population was followed by the death of a random individual. This model is a Moran
process (1958), differing from Wright’s (1942) model in that births and
deaths occur individually and at random. Moran’s model allows overlapping
generations, whereas Wright’s model assumes the death of all
the individuals of one generation after the birth of the new generation. Moran’s model
thus allows linguistic communication within and between several reproductive
generations of speakers.
We assumed a population of diploid individuals, with separate sexes (males and
females). The instantiation of a relation RG was only possible between a male and a
female, both drawn at random. The loci were assumed to be independent. In the
instantiation of a relation RG, for each locus, each parent transferred one of her/his
alleles at random to the child. Each allele could mutate with a locus-specific probability
per reproductive event, following a strict stepwise mutation model.
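The genetic model above can be sketched in a few lines of Python. This is a minimal illustration and not the PLGS implementation: the dictionary layout of an individual, the function name `moran_birth_death`, the 1:1 sex ratio of the child, and the fact that the dying individual is drawn among all individuals (parents included) are assumptions made for the example. Allele sizes are left unbounded for simplicity.

```python
import random

def moran_birth_death(population, mu, rng):
    """One Moran event: a random male and a random female produce a child,
    then a randomly chosen individual dies and is replaced by that child.

    population: list of dicts {"sex": "M" or "F", "loci": [(a1, a2), ...]}
    mu: per-locus mutation probability for each transmitted allele
    """
    males = [ind for ind in population if ind["sex"] == "M"]
    females = [ind for ind in population if ind["sex"] == "F"]
    if not males or not females:
        raise ValueError("need at least one male and one female")
    father, mother = rng.choice(males), rng.choice(females)

    child_loci = []
    for locus, (paternal, maternal) in enumerate(zip(father["loci"], mother["loci"])):
        transmitted = []
        for parental_pair in (paternal, maternal):
            allele = rng.choice(parental_pair)      # one allele per parent, at random
            if rng.random() < mu[locus]:            # strict stepwise mutation:
                allele += rng.choice((-1, 1))       # size changes by exactly one step
            transmitted.append(allele)
        child_loci.append(tuple(transmitted))

    child = {"sex": rng.choice("MF"), "loci": child_loci}   # assumed 1:1 sex ratio
    population[rng.randrange(len(population))] = child      # death of a random individual
    return child
```

Because the child replaces a random individual, the population size stays constant and generations overlap, which is the property of the Moran process that the linguistic model relies on.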
1.3. Linguistic model
Following the Moran model, generations were not separated. Instead, they
overlapped widely, allowing linguistic communication between individuals
from different generations.
For each instantiation of the relation RG, we assumed the instantiation of a
number αL/G of relations RL in the linguistic population. If αL/G > 1, the number of
linguistic communication events in the linguistic population was higher than the
number of reproductive events in the genetic population. If αL/G < 1, the number of
linguistic communication events in the linguistic population was lower than the
number of reproductive events in the genetic population. It is logically expected that
αL/G >> 1 in human populations.
The instantiation of the relation RL was possible between two individuals drawn
at random, whatever their sexes. We assumed that mutations occurred during
utterances: each mutation caused the listener to receive a completely new
variant instead of the speaker's variant, according to a given mutation rate for each
linguistic class.
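The two ingredients of the linguistic model, the number of communication events per birth and the mutation of variants during an utterance, can be sketched as follows. The function names, the use of an integer counter to generate variants never seen before, and the handling of a non-integer αL/G (its fractional part is treated as a probability of one extra event) are assumptions of this sketch, not details given in the text.

```python
import itertools
import random

def utter(grammar, mutation_rates, fresh_ids, rng):
    """Return what the listener hears: one variant per linguistic class.
    With probability mu the speaker's variant is replaced by a brand new one."""
    heard = []
    for variant, mu in zip(grammar, mutation_rates):
        if rng.random() < mu:
            variant = next(fresh_ids)   # infinite-variants mutation
        heard.append(variant)
    return heard

def communications_per_birth(alpha, rng):
    """Number of R_L events instantiated for one R_G event, with mean alpha.
    Assumption: the fractional part of alpha is resolved by a Bernoulli draw."""
    whole = int(alpha)
    return whole + (1 if rng.random() < alpha - whole else 0)
```

For example, `utter([1, 2, 3], [0.0, 0.0, 0.0], itertools.count(100), random.Random(0))` returns the speaker's grammar unchanged, since no class can mutate.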
1.4. Parameters
In the scenarios presented below, the parameters were drawn from the following
prior distributions. Time t0 was drawn in a log-uniform distribution U
[0, 1000], as well as time t1, where appropriate, with the constraint t0 < t1. The sizes NG
and NL were drawn in a log-uniform distribution LogU [100, 1000]. The mean
mutation rates μG and μL were drawn separately in a log-uniform distribution LogU
[10⁻⁶, 10⁻¹]. The mutation rate of each linguistic variant was drawn in a beta
distribution with mean μL and shape β = 2. The mutation rate of each genetic locus
was drawn in a beta distribution with mean μG and shape β = 2. The number αL/G of
relations RL in the linguistic population per relation RG was drawn in a log-uniform
distribution LogU [1, 100]. The upper boundary of the parameter αL/G was set
according to computation time limits. The probability hL for each listener of adopting
the variant of the speaker was drawn in a uniform distribution U [0.01, 1]. The lower
boundary of this parameter was set to avoid excessive computation time, given
that a very low probability of adopting a new variant increases the time needed to
reach the equilibrium between mutation and drift.
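A draw from these priors could look like the sketch below. Several points are assumptions: the beta distribution is parameterised with β as its second shape parameter and the first one chosen to match the requested mean, population sizes are rounded to integers, and all names are hypothetical. Divergence times are omitted from the sketch.

```python
import math
import random

def log_uniform(low, high, rng):
    """Draw x such that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def beta_with_mean(mean, beta_shape, rng):
    """Beta draw with the requested mean. Assumption: beta_shape is the second
    shape parameter, so the first one is mean * beta_shape / (1 - mean)."""
    return rng.betavariate(mean * beta_shape / (1.0 - mean), beta_shape)

def draw_parameters(rng, n_loci=25, n_classes=50):
    mu_G = log_uniform(1e-6, 1e-1, rng)   # mean genetic mutation rate
    mu_L = log_uniform(1e-6, 1e-1, rng)   # mean linguistic mutation rate
    return {
        "N_G": round(log_uniform(100, 1000, rng)),   # rounding is an assumption
        "N_L": round(log_uniform(100, 1000, rng)),
        "mu_G": mu_G,
        "mu_L": mu_L,
        "locus_rates": [beta_with_mean(mu_G, 2.0, rng) for _ in range(n_loci)],
        "class_rates": [beta_with_mean(mu_L, 2.0, rng) for _ in range(n_classes)],
        "alpha_LG": log_uniform(1, 100, rng),        # R_L events per R_G event
        "h_L": rng.uniform(0.01, 1.0),               # adoption probability
    }
```

Calling `draw_parameters(random.Random(1))` returns one parameter set for one simulation, with 25 locus-specific and 50 class-specific mutation rates.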
1.5. Summary statistics
We computed summary statistics describing genetic diversity as well as
linguistic diversity. Hereafter, the terms “alleles” and “locus” used for genetics
summary statistics can be replaced, respectively, by the terms “linguistic variant” and
“linguistic class” for linguistic summary statistics.
1.5.1. Number of groups of individuals
We defined and computed M as the number of groups of individuals genetically
identical in the sample.
1.5.2. Number of monomorphic loci
We defined and computed S as the number of loci with only one allele in the
sample.
1.5.3. Number of different alleles
We defined k as the number of different alleles at a given locus. We then
computed in the linguistic and genetic populations respectively, the following
summary statistics:
- k, the mean number of different alleles across loci.
- V(k), the variance of the number of different alleles across loci.
- min(k), the minimum number of different alleles across loci.
- max(k), the maximum number of different alleles across loci.
- The range R(k), with R(k) = max(k) – min(k).
- med(k), the median of the number of different alleles across loci.
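The counting statistics M, S and those derived from k can be computed as in the sketch below. The data layout (each individual is a list holding one tuple of alleles per locus, two alleles for a diploid genotype and one for a linguistic variant), the reading of M as the number of distinct multilocus profiles, and the use of the sample variance for V(k) are assumptions of this example.

```python
from statistics import mean, median, variance

def count_statistics(sample):
    """sample: list of individuals; each individual is a list with one tuple
    of alleles per locus (or of variants per linguistic class)."""
    n_loci = len(sample[0])
    # M: groups of identical individuals, read here as distinct multilocus profiles
    profiles = {tuple(tuple(sorted(locus)) for locus in ind) for ind in sample}
    # k: number of different alleles observed at each locus
    k = [len({a for ind in sample for a in ind[l]}) for l in range(n_loci)]
    return {
        "M": len(profiles),
        "S": sum(1 for x in k if x == 1),     # monomorphic loci
        "mean_k": mean(k),
        "var_k": variance(k),                 # sample variance (assumption)
        "min_k": min(k),
        "max_k": max(k),
        "range_k": max(k) - min(k),
        "med_k": median(k),
    }
```

With three diploid individuals typed at two loci, `[[(10, 10), (5, 6)], [(10, 10), (6, 5)], [(10, 10), (7, 7)]]`, the first two individuals form one group, the first locus is monomorphic, and the second locus carries three alleles.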
1.5.4. Gene diversity
We defined the gene diversity H as the probability for two randomly chosen
alleles for a given locus to be different (Nei, 1987). For a given locus, the estimated
gene diversity is:
$$H = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right)$$
Where n is the sample size, k is the number of different alleles, and pi is the
frequency of the ith allele in the sample.
We then computed for each population the following summary statistics:
- H, the mean of gene diversity across loci.
- V(H), the variance of the gene diversity across loci.
- min(H), the minimum of the gene diversity across loci.
- max(H), the maximum of the gene diversity across loci.
- The range R(H), with R(H) = max(H) – min(H).
- med(H), the median of the gene diversity across loci.
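The estimator of gene diversity for one locus translates directly into code. The sketch below takes the list of allele copies observed at a locus (or of variants observed for a linguistic class); the function name is hypothetical.

```python
from collections import Counter

def gene_diversity(alleles):
    """Estimated gene diversity H = n / (n - 1) * (1 - sum of squared frequencies),
    where n is the number of sampled allele copies at the locus."""
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two sampled alleles are required")
    sum_sq = sum((count / n) ** 2 for count in Counter(alleles).values())
    return n / (n - 1) * (1.0 - sum_sq)
```

A monomorphic locus gives H = 0, and a locus at which every sampled copy is different gives H = 1. The summary statistics listed above are then the mean, variance, minimum, maximum, range and median of these per-locus values.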
1.5.5. Pairwise distance between populations
We computed the pairwise dissimilarity GST per locus between each pair of
populations as (Nei, 1973):
$$G_{ST} = 1 - \frac{H_S}{H_T}$$
Where
$$H_S = 1 - \left(\sum_{i=1}^{k}\frac{p_{1i}^2}{2} + \sum_{i=1}^{k}\frac{p_{2i}^2}{2}\right) \quad \text{and} \quad H_T = 1 - \sum_{i=1}^{k}\left(\frac{p_{1i}}{2} \times \frac{p_{2i}}{2}\right)$$
Where p1i is the frequency of the allele i in the first population, and p2i is the
frequency of the allele i in the second population.
We then computed for each population the following summary statistics:
- GST, the mean of GST across loci.
- V(GST), the variance of the GST across loci.
- min(GST), the minimum of the GST across loci.
- max(GST), the maximum of the GST across loci.
- R(GST), with R(GST) = max(GST) – min(GST).
- med(GST), the median of the GST across loci.
1.6. Simulations and model selection
We developed a new C++ program, Population Linguistic and Genetic
Simulator (PLGS), which simulates jointly genetic and linguistic evolutions according
to the models and the scenarios detailed above, and computes the summary statistics
described above on each simulated data set. For each scenario, we performed 10000
simulations. For each simulation, we waited a time 5×NG, a time that, as we
empirically verified, was sufficient to reach genetic equilibrium between mutation and
drift. Moreover, we waited a time 10×NL/(hL×αL/G), a time that, as we empirically
verified, was sufficient to reach the linguistic equilibrium between mutation and drift.
We then analysed the simulations produced, using the R package abcrf (Pudlo et al.,
2016) which implements approximate Bayesian computation (Beaumont et al., 2002;
Tavaré et al., 1997) using Random Forest (RF) (Breiman, 1999). We performed a
cross-validation analysis to assess if the method was a priori able to select the
scenario which produced a pseudo-observed dataset drawn at random, using the out-
of-bag approach included in the function abcrf of the package abcrf, with 500 trees
per forest.
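As a worked illustration of the two waiting times used before sampling, take the purely hypothetical values NG = NL = 500, hL = 0.5 and αL/G = 10, chosen inside the prior ranges of section 1.4 and not taken from any simulation reported here:

$$5 \times N_G = 5 \times 500 = 2500, \qquad \frac{10 \times N_L}{h_L \times \alpha_{L/G}} = \frac{10 \times 500}{0.5 \times 10} = 1000$$

The linguistic waiting time shrinks as hL or αL/G grows, since both parameters appear in the denominator.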
2. Results
2.1. Should the individuals be considered as copiers,
probabilistic copiers, or Bayesian learners?
In these models, we considered only one linguistic population. We aimed at
assessing how the individuals build an individual grammar considering a series of
linguistic variants for a series of linguistic classes (Figure III.10). In model a), we
considered that individuals were copiers. The speaker produced the linguistic variants
she/he knows, and the listener replaced every variant that she/he knows by the variant
of the utterance produced. In model b), we considered that individuals were
probabilistic copiers. The speaker produced the linguistic variants she/he knows, and
the listener had a probability hL to replace each variant that she/he knows by the
variant of the utterance produced. In model c), we considered that individuals were
Bayesian learners. For each linguistic class, the individual grammar consisted of a set
of two frequencies. For each class, the speaker produced a sequence of linguistic
variants according to the frequencies of her/his linguistic grammar, and the listener
updated the frequencies of her/his individual grammar as a linear combination of the
utterance and her/his individual grammar. See Baxter et al. (2006) for analytical
details about the model c).
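The three listener update rules can be contrasted in a short sketch. Grammars of models a) and b) are represented as lists holding one variant per linguistic class; for model c), the sketch handles a single class whose grammar is a pair of frequencies. The function names and the `weight` given to the utterance in the linear combination are hypothetical, and the exact update used for model c) is the one described by Baxter et al. (2006), which this sketch does not reproduce.

```python
import random

def copier(listener, utterance):
    """Model a): every variant of the listener is replaced by the one heard."""
    return list(utterance)

def probabilistic_copier(listener, utterance, h_L, rng):
    """Model b): each variant is replaced independently with probability h_L."""
    return [heard if rng.random() < h_L else own
            for own, heard in zip(listener, utterance)]

def bayesian_learner(freqs, heard_freqs, weight):
    """Model c), one linguistic class: the new frequencies are a linear
    combination of the listener's frequencies and of the frequencies
    observed in the utterance (weight is a hypothetical parameter)."""
    return [(1.0 - weight) * own + weight * heard
            for own, heard in zip(freqs, heard_freqs)]
```

With `h_L = 1`, `probabilistic_copier` returns the same result as `copier`, which is the nesting of model a) inside model b).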
Cross-validation results of model selection are summarized in Table III.1. The
results suggest that it is a priori difficult to select unambiguously models a) or b),
with 44% and 51% chances of erroneously selecting another model, compared with
the 66% expected at random. This difficulty could be explained by the fact that the
copying model is embedded in the probabilistic copying model when hL = 1.
Conversely, the selection of model c) is a priori very powerful, with only a ~1%
chance of erroneously selecting one of the two other models. In other words, we find
that it is difficult to tell a copying model from a probabilistic copying model,
whereas telling a Bayesian learning model from either of the two other models
is very efficient.
                     Estimated scenario
True scenario       a       b       c     Error
a                5604    3446     950    0.4396
b                3776    4859    1365    0.5141
c                  68      65    9867    0.0133
Table III.1 – Cross-validation results aiming at assessing a priori distinctions between three models of individual grammars, using 10000 pseudo-observed data, with a) copying model, b) probabilistic copying model, and c) Bayesian learning (see Figure III.10).
Figure III.10 – Description of the three models of individual grammars. a) Copying model, b) Probabilistic copying model, c) Bayesian learning model.
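The Error column of Table III.1 is one minus the proportion of pseudo-observed datasets assigned to their true model, which can be checked with a few lines of code. The function name is hypothetical; the counts are those of the table.

```python
def prior_error_rates(confusion):
    """Per-row error of a confusion matrix whose rows are true scenarios
    and whose columns are estimated scenarios."""
    return [1.0 - row[i] / sum(row) for i, row in enumerate(confusion)]

# Counts of Table III.1 (rows: true model a, b, c; columns: estimated a, b, c)
table_III_1 = [
    [5604, 3446, 950],
    [3776, 4859, 1365],
    [68, 65, 9867],
]
errors = [round(e, 4) for e in prior_error_rates(table_III_1)]
chance_error = 1.0 - 1.0 / len(table_III_1)   # error of a uniformly random choice
```

Here `errors` equals `[0.4396, 0.5141, 0.0133]`, and `chance_error` is about 0.667, the baseline against which the first two values are compared in the text.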
2.2. Are the mutation rates different between the linguistic
classes?
In these scenarios, we considered only one linguistic population and we
assumed that the individual grammars followed a probabilistic copying model (i.e. the
speaker produces the linguistic variants she/he knows, and the listener has a
probability hL to replace each variant that she/he knows by the variant of the utterance
produced). We aimed at assessing if the linguistic classes mutate with different
probabilities. In model a), we considered that the mutation rates of all linguistic classes
were the same and equal to μL. In model b), we considered that the mutation rates of
the linguistic classes were drawn in a beta distribution with mean μL and shape β = 2.
Cross-validation results of model selection are summarized in Table III.2. The
results suggest that it is a priori possible to distinguish relatively clearly between
models a) and b), with a 31% chance of erroneously selecting model b), and a 19%
chance of erroneously selecting model a). In other words, we find that it is relatively
easy to assess whether a set of linguistic classes mutate at the same rate or at different rates.
                     Estimated scenario
True scenario       a       b     Error
a                6871    3129    0.3129
b                1923    8077    0.1923
Table III.2 – Cross-validation results aiming at assessing a priori distinctions between two models of the mutation of the linguistic variants, using 10000 pseudo-observed data, with a) same mutation rate for every cognate, and b) mutation rate drawn in a beta distribution.
2.3. Are the sizes of the genetic and the linguistic populations
different?
In these scenarios, we considered only one population of each kind, and we
assumed that the individual grammars followed a probabilistic copying model. We
aimed at assessing if a) the linguistic population was larger than the genetic
population (NG < NL), b) if the genetic population was larger than the linguistic
population (NG > NL), or c) if the sizes of the two populations matched (NG = NL). We
considered one unit of sampling.
Cross-validation results of scenario selection are summarized in Table III.3. The
results suggest that it is a priori impossible to distinguish between the three scenarios,
with an error between 61% and 71%, close to the 66% expected at random. This means
that assessing the relative size of the genetic and the linguistic populations is nearly
impossible with our method. We may hypothesize that this difficulty arises because
many parameters (μG, μL, hL, αL/G) modify the relation between the genetic and the
linguistic summary statistics in the same way as the parameters NG and NL, hiding
any clear relation which could be used by the Random Forest.
Figure III.11 – Description of the three scenarios of different size of the genetic and the linguistic population. a) the linguistic population was larger than the genetic population, b) the genetic population was larger than the linguistic population, and c) the genetic and linguistic populations match.
2.4. Do the sampled individuals belong to genetically and/or
linguistically differentiated populations?
In these scenarios, we considered two sampling units, and we assumed that the
individual grammars followed a probabilistic copying model. We aimed at testing if
the individuals sampled in those units belong to two different populations, or to the
same population, both genetically and linguistically (Figure III.12). In scenario a), we
assumed that the two units corresponded to one linguistic population but two genetic
populations that diverged at time t0 in the past (see Figure III.8 for another
representation of the final state). In scenario b), we assumed that the two units
corresponded to one genetic population but two linguistic populations that diverged at
time t0 in the past (see Figure III.9 for another representation of the final state). In
scenario c), we assumed that the two units corresponded to only one genetic and one
linguistic population (see Figure III.6 for another representation of the final state). In
scenario d), we assumed that the two units corresponded to two genetic and two
linguistic populations that diverged at time t0 in the past.
Cross-validation results of scenario selection are summarized in Table III.4. The
results suggest that it is a priori possible to clearly distinguish between scenarios a), b),
c) and d), with an error between 9% and 15%. In other words, we find that our method
is a priori fairly efficient at assessing whether two sets of sampled individuals belong
to one or two genetic and/or linguistic populations.
                     Estimated scenario
True scenario       a       b       c     Error
a                2885    3511    3604    0.7115
b                2860    3854    3286    0.6146
c                2810    3323    3867    0.6133
Table III.3 – Cross-validation results aiming at assessing a priori distinctions between three models described in Figure III.11, using 10000 pseudo-observed data.
                     Estimated scenario
True scenario       a       b       c       d     Error
a                8533       4    1344     119    0.1467
b                   9    8899     223     869    0.1101
c                 830     116    9047       7    0.0953
d                 183    1237      41    8539    0.1461
Table III.4 – Cross-validation results aiming at assessing a priori distinctions between four scenarios described in Figure III.12, using 10000 pseudo-observed data.
Figure III.12 – Description of the four scenarios of genetic and/or linguistic population differentiation. a) The two units corresponded to one linguistic population but two genetic populations that diverged at time t0 in the past, b) The two units corresponded to one genetic population but two linguistic populations that diverged at time t0 in the past, c) The two units corresponded to only one genetic and one linguistic population, d) The two units corresponded to two genetic and two linguistic populations that diverged at time t0 in the past.
2.5. What is the tree topology of three populations?
In these scenarios (Figure III.13), we assumed that the individuals were sampled
from three genetic-linguistic populations, and we assumed that the individual
grammars followed a probabilistic copying model. We assumed three different
branching processes, corresponding to the three possible topologies. We considered
three sampling units, corresponding to three genetic-linguistic populations. In
scenario a), populations 0 and 1 have a more recent common ancestor than population
2. In scenario b), populations 1 and 2 have a more recent common ancestor than
population 0. In scenario c), populations 0 and 2 have a more recent common ancestor
than population 1. Considering only linguistic or, separately, only genetic pseudo-
observed data, the error rates ranged between 38% and 45% (Tables III.5 and III.6).
Considering linguistic and genetic diversities jointly reduced the error rates to around
32% (Table III.7). Thus the selection of a tree topology for three populations using
only genetic diversity is a priori about as efficient as using only linguistic diversity,
and coupling the two types of diversities increases the efficiency of the selection,
but only slightly.
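As a quick check on these figures, the counts reported in Tables III.5 to III.7 can be pooled into one overall error rate per data type, over the 30000 pseudo-observed datasets of each table. This pooling is only an illustration; it is not a statistic computed in the analysis itself:

$$
\begin{aligned}
E_{\text{linguistic}} &= 1 - \frac{5489 + 5897 + 5697}{30000} \approx 0.431 \\
E_{\text{genetic}} &= 1 - \frac{5641 + 6217 + 5754}{30000} \approx 0.413 \\
E_{\text{joint}} &= 1 - \frac{6733 + 6804 + 6809}{30000} \approx 0.322
\end{aligned}
$$

All three values lie well below the error of 2/3 ≈ 0.667 expected when one of three topologies is picked at random.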
Figure III.13 – Description of the three scenarios of historic topologies. a) Populations 0 and 1 have a more recent common ancestor than the population 2, b) Populations 1 and 2 have a more recent common ancestor than the population 0, c) Populations 0 and 2 have a more recent common ancestor than the population 1.
                     Estimated scenario
True scenario       a       b       c     Error
a                5489    2434    2077    0.4511
b                2443    5897    1660    0.4103
c                2457    1846    5697    0.4303
Table III.5 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13, using 10000 pseudo-observed data, using only linguistic data.
                     Estimated scenario
True scenario       a       b       c     Error
a                5641    2397    1962    0.4359
b                2065    6217    1718    0.3783
c                2110    2136    5754    0.4246
Table III.6 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13, using 10000 pseudo-observed data, using only genetic data.
                     Estimated scenario
True scenario       a       b       c     Error
a                6733    1616    1651    0.3267
b                1947    6804    1249    0.3196
c                1967    1224    6809    0.3191
Table III.7 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13, using 10000 pseudo-observed data, using genetic and linguistic data jointly.
Discussion
We proposed in this chapter to build an interface between population genetics
and population linguistics. We shaped a framework for studying genetic and linguistic
coevolution at the scale of the individuals, integrating a diversity of possible
individual grammars and population structures. This allowed us to perform a priori
inferences addressing several classical population genetics and linguistic questions.
We first needed to focus the theory of biological evolution on the organisms,
to make the structure of the genealogical network explicit, and to delimit the notion of genetic
population with respect to the network structure. We defined only subsequently the
genetic transmission mechanisms, as the way the alleles are passed from parents to
offspring during the reproductive events. Integrating the contingent historical events
affecting the reproductive network and the genetic transmissions mechanisms allowed
us to retrieve a classical population genetics framework.
Focusing this formalisation on the individuals allowed us to propose a
formulation of linguistic evolution. We proposed to differentiate the building of the
communication network from the processes occurring at the individual scale during
each linguistic communication event. Nevertheless, linguistic communication
mechanisms and their links with individual grammars are less clear or consensual than
the genetic transmission mechanisms. We thus proposed to study case by case, for
each language and each linguistic item considered, the local constraints underlying the
rules of the linguistic communication events, before proposing historical inferences.
Coupling the genetic and linguistic frameworks led us to propose a diversity of cases
of coevolution, depending on the underlying structure of the genetic and linguistic
populations, and their history. We explicitly formulated a range of hypotheses
concerning genetic and linguistic coevolution at the scale of the individual, allowing
formal testing. We argue that classical hypotheses in the literature of coevolution
between genes and languages could be formalized and explicitly tested through our
framework.
We performed a priori a range of these tests, evaluating the theoretical
possibilities given by such framework. We used Approximate Bayesian Computation
with Random Forests (Pudlo et al., 2016), a flexible statistical framework, integrating
the whole complexity of our formulations of the models and the scenarios in
competition.
We showed that it is possible to differentiate between individual grammars
based on copying or Bayesian learning. Moreover, we showed that it is also possible
to evaluate whether the mutation rate is equal or different across a set of linguistic classes.
These two results demonstrate that, using only linguistic data describing 30
individuals sampled contemporaneously, it is possible to evaluate the models
underlying linguistic communication events and individual grammars, a crucial step
to propose subsequent historical inferences. Other types of models of individual
grammars could be tested in future studies, for instance to
evaluate the existence of cognitive, structural, or social biases modifying the
mechanisms of linguistic evolution of a series of linguistic variants.
We showed that it is quite difficult to assess if the size of the linguistic
population and the genetic population associated to a given sample are different or
not. We may hypothesize that the summary statistics that we used do not reflect in any
way the relative size of the genetic and the linguistic populations. To overcome this
issue, future work will focus on developing a set of composite summary statistics,
computed using both types of data and reflecting their relative organisation, which
would give more precise access to the links between genetic and linguistic
evolution. In contrast, we showed that we are a priori able, considering two
sets of individuals, to assess if they belong to one or two genetic and/or linguistic
populations. This could give us a lot of information about the relative structure of the
reproduction network and the linguistic communication network, and the underlying
rules which could explain these structures, as detailed in section 4 part 1. Finally, we
showed that linguistic and genetic diversities are equally informative for resolving the
tree topology of three populations. Moreover, we showed that coupling the two types
of diversities led to a slight increase in precision.
All of these examples showed that, in most cases, a genetic population
framework coupled with a linguistic population framework allowed us to address a wide
range of questions about the genetic and linguistic coevolution of a given sample of
individuals.
Accounting for within-population diversity places the individuals at the centre
of the formal framework. Conversely, following recent conceptual and
methodological advances, phylogenetic studies have been adapted to linguistic data in
order to infer computationally the history of a set of languages (Atkinson and Gray,
2005; Bouckaert et al., 2012; Gray and Atkinson, 2002; Gray and Jordan, 2000; Gray
et al., 2009). In these phylogenetic studies, languages are considered at a macro-
evolutionary scale. Nevertheless, several authors pointed out the potential limitations
of the phylogenetic approach applied to cultural data (see for instance Moon, 1994;
Testart, 2011). It is often implicitly assumed that the linguistic trees are also
representative of the demographic history of the population of speakers. Nevertheless,
language histories can differ from genetic histories, as shown in several previous
works (Steele and Kandler, 2010; Thouzeau et al., 2017; Verdu et al., 2009; Ward et
al., 1993). Moreover, phylogenetic studies assume that language evolution is tree-like,
with a vertical transmission of languages, ignoring thus processes such as borrowing,
language shift, creolization, or linguistic admixture. These processes nonetheless seem
extremely frequent in the history of languages (Steele and Kandler, 2010); some
authors argue that complete isolation is the exception rather than the norm in
language evolution (Campbell, 2006). We argue that taking into
account within-population linguistic diversity through an agent-based approach is an
efficient way to overcome these limitations, taking into account all of these complex
events at the individual scale.
In our framework, the analogies between language phylogenies and genetic
population histories are no longer a set of premises for studying the evolutionary
history of human populations, but a consequence emerging from a set of shared
spatio-temporal constraints affecting the genealogical and the communication networks.
Applying this framework to real datasets could thus help disentangle several problems
concerning the coevolution of genes and languages in the future, making it possible to
infer a wide range of events encountered by human populations throughout their history.
Chapter IV – Sampling and describing
linguistic data from Cape Verdean
Kriolu
Valentin Thouzeau1, Ethan M. Jewett2,3, Sergio S. da Costa1, Cesar A. Fortes-Lima1,
Noah A. Rosenberg2, Frédéric Austerlitz1, Marlyse Baptista4 and Paul Verdu1
1 CNRS, MNHN, Université Paris Diderot, UMR 7206 Eco-Anthropologie et Ethnobiologie, Paris 75016, France
2 Department of Biology, Stanford University, Stanford, CA 94305, USA
3 Department of Statistics and Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
4 Department of Linguistics and Department of Afroamerican and African Studies, University of Michigan, Ann Arbor, MI 48109, USA
This chapter is a preliminary work.
Introduction
The Cape Verde Islands, an archipelago off the coast of Senegal and
Mauritania, have been the subject of a series of genetic and linguistic data sampling
missions involving French, American and Cape Verdean researchers since 2010. I had
the opportunity to participate in the mission from 1st July to 17th July 2016, and to
develop a sampling protocol for linguistic data. Moreover, given the
methodological developments proposed in chapters 2 and 3, this work was an
opportunity to go back and forth between fieldwork and theory, allowing each
to inform the other.
A first set of genetic and linguistic data sampled for 44 individuals was recently
statistically described by Verdu et al. (2017). This study was based on the joint
analysis of a series of genetic and linguistic markers. The genetic data were generated
at a genome-wide scale using the Illumina HumanOmni2.5-8 BeadChip genotyping
array. The linguistic data were generated by the recording and the transcription of
semi-spontaneous speeches of each DNA donor. The speeches were produced by the
speakers without interruption and without time limit, following the watching of a
speech-less movie of a little more than 5 minutes (The Pear Story, Chafe, 1980). The
words used by the speakers were then categorized by linguists according to their
African or non-African lexical or etymological basis.
The analysis performed in Verdu et al. (2017) has shown that the Cape Verdean
population from the main island of Santiago resulted from a process of admixture
between Iberian populations and Senegambian populations, thus echoing the known
peopling history of this archipelago by the Portuguese Crown and African slaves.
Moreover, speech patterns, described by word frequencies tabulated for each
individual, seemed to be significantly correlated with individuals’ birth-places as
well as parental and grand-parental birth places (correcting for shared parent-offspring
birth places). The authors proposed that this indicated that speech patterns transmitted
vertically from one generation to the next were not completely obliterated by speech
patterns acquired horizontally or obliquely by each individual throughout his/her
lifetime. Finally, results showed that genetic levels of African admixture were
positively correlated with the frequency of usage of African derived words or words
with a mixed African-European etymology. Altogether, these results seemed to
indicate that the processes underlying the observed genetic and linguistic admixture
patterns followed a parallel evolutionary trajectory, probably through co-transmission
processes.
The methods developed during this PhD thesis are in line with this type of
work. In order to propose formal historical inference to investigate the historical
events encountered by Cape Verdean populations, an explicit model was required. The
developments proposed in chapter 3 focused on describing a theoretical framework to
infer genetic and linguistic histories using within-population diversities jointly. For
clarity, we chose to present this theoretical construction separately from the sampling
protocol conducted in Cape Verde, but it is nevertheless essential to note that they
result, in reality, from a co-construction process. Without an underlying theory, it was
difficult to construct a data set that can be analysed in a relevant way, and without
fieldwork, it was difficult to propose a theory that reflects the object of study in a
relevant way.
We describe in the first part of this chapter the data sampling strategy developed
in the field work, and we then propose a first descriptive analysis of this new dataset.
Data sampling
We presented in chapter 3 a formalization of the notion of linguistic population,
as a set of individuals communicating preferentially. A first consequence of this
formalization lies in the sampling strategy: how do we determine if two individuals
belong to the same linguistic population? Some categorization is needed here. Each
sampling unit must therefore be defined in such a way as to provide a possible set of
population structures, using the available ethnographic information: sampling location,
place of residence, birth place, etc. Moreover, since our statistical framework uses
linguistic diversity as the prime source of information, a sufficient number of
individuals had to be sampled in order to perform statistical tests with sufficient
power. This was the case in our missions, with 104 individuals sampled in
2016 (49 individuals, sampled by V.T., S.DC., E.J., P.V. and M.B.) and 2017 (55
individuals, sampled by C.F-L, S.DC, P.V. and M.B.), instead of considering only a
few individuals as representative of the homogeneous linguistic variety of a given
location.
The formalization that we previously proposed took into account only linguistic
variants, a series of items of the same linguistic class. We might imagine a model of
individual grammar generating a whole discourse, in order to analyse the speeches
sampled throughout, but the linguistic constraints on the production of discourse
are highly complex, and still poorly understood. We thus chose to sample lexical
variants from the Swadesh list during the fieldwork, in order to access a type of data
that could correspond to the linguistic evolution model developed previously, and
that is computationally tractable using our novel simulation software PLGS (chapter 3).
In order to sample the linguistic variants from each individual, we could not use
a vehicular language, as was the case for the sampling campaign in Central Asia,
where Russian was used to sample Central Asian linguistic varieties (Mennecier et al.,
2016). We chose not to use English or Portuguese, knowing that Kriolu is a creole
language with a Portuguese substrate too close to these two languages. We therefore
decided to show each speaker a series of 96 pictures depicting meanings from the
Swadesh list. Objects and verbs were favoured so that the meanings could be
represented graphically. The set of pictures seemed unambiguous to our
research team, but some of the pictures turned out to be very ambiguous for the
sampled individuals. For instance, the picture of a forest rarely elicited the
corresponding word from sampled individuals, probably because forests are extremely
rare in Cape Verde. This led us to reduce the list to only 56 meanings rarely found
ambiguous during our experiments (Table IV.1). For each speaker, the pictures were
presented successively, and the speaker was asked to pronounce the word associated
with each picture twice, in order to record each linguistic variant properly. Some
speakers produced several variants for one picture, indicating that at least some of
them were conscious of the linguistic diversity in Cape Verde. We chose to use only
the first variant pronounced by each speaker in our subsequent analyses. Nevertheless,
the production of several variants gave us a clue about the possibility of memorizing
several variants, as implemented in the Bayesian-learning model studied in chapter 3.
From the fieldwork, it appears that speakers uttered linguistic variants
depending on the context of the interview. For instance, some of them used many
Portuguese linguistic variants during the interviews, but did not use these
variants when they spoke more freely. Knowing that Portuguese is more associated
with academic or socio-economically favoured environments, it was
clear that the formal context of the interview oriented their choice of linguistic variant
at the moment of the utterance. The importance of context led us to treat
the linguistic data as utterances produced in a particular context, instead of
the realisation of a particular dialect largely independent of the individuals and
interviewers. This perspective informed the theoretical construction of chapter 3,
leading to the formalisation of an utterance-based model.
 1. green     11. foot     21. hair      31. branch    41. smoke       51. to see
 2. red       12. leg      22. tongue    32. root      42. lake        52. to drink
 3. yellow    13. knee     23. dog       33. flower    43. sea         53. to eat
 4. black     14. hand     24. tail      34. seed      44. mountain    54. to cut
 5. white     15. neck     25. snake     35. fruit     45. rope        55. to bite
 6. one       16. head     26. fish      36. cloud     46. stone       56. to sew
 7. two       17. ear      27. feather   37. sun       47. to sing
 8. three     18. eye      28. egg       38. moon      48. to swim
 9. four      19. nose     29. tree      39. star      49. to sit
10. five      20. mouth    30. leaf      40. fire      50. to hear
Table IV.1 – List of meanings extracted from the Swadesh list.
Figure IV.1 – Geographical distribution of the 19 sampling localities under study in Cape Verde.
This protocol was carried out with 104 individuals in total, during missions from
1st July 2016 to 17th July 2016 and from 20th May 2017 to 7th June 2017, on the
islands of Santiago, Brava, Fogo, São Vicente, and Santo Antão (Figure IV.1). Then,
each linguistic variant was transcribed, leading to a table describing the linguistic
variants used by each speaker for each lexical class.
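As an illustration of the resulting table, the following minimal Python sketch builds a speaker-by-meaning table that keeps only the first variant uttered for each picture. The function name, the record format and the variant labels are hypothetical placeholders chosen for this sketch; they are not the actual transcriptions or the code used for this study.

```python
def build_variant_table(records, meanings):
    """Build {speaker: {meaning: variant}}, keeping the first variant only.

    records  -- iterable of (speaker_id, meaning, variant) in utterance order
    meanings -- list of retained meanings (the columns of the table)
    """
    table = {}
    for speaker, meaning, variant in records:
        if meaning not in meanings:
            continue  # meaning excluded from the final list
        # One row per speaker, with every retained meaning initialised to None.
        row = table.setdefault(speaker, dict.fromkeys(meanings))
        if row[meaning] is None:  # later variants for the same picture are ignored
            row[meaning] = variant
    return table


# Hypothetical transcriptions: the variant labels are placeholders, not real data.
meanings = ["dog", "to drink"]
records = [
    ("speaker_01", "dog", "dog_variant_A"),
    ("speaker_01", "dog", "dog_variant_B"),        # second variant, ignored
    ("speaker_01", "to drink", "drink_variant_A"),
    ("speaker_02", "dog", "dog_variant_B"),
    ("speaker_02", "forest", "forest_variant_A"),  # excluded meaning
]
table = build_variant_table(records, meanings)
for speaker in sorted(table):
    print(speaker, table[speaker])
```

A meaning for which a speaker produced no usable variant stays at `None`, so that missing data remain distinguishable from any actual variant.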
Descriptive analyses
Several descriptive analyses of the dataset were performed in order to
understand the structure of the linguistic data. A Multiple Correspondence Analysis
(abbreviated MCA, see for instance Abdi and Valentin, 2007) was produced using the
function MCA of the R package FactoMineR (Lê et al., 2008), treating each
meaning as a qualitative variable. This analysis makes it possible to describe a
multidimensional dataset by projecting it onto only two dimensions. In addition, each
meaning was analysed according to its contribution to the linguistic diversity, in order
to determine which meanings are associated with the clustering in the MCA. Finally,
Manhattan linguistic distances (see chapter 1) were calculated between each pair of
individuals, and were represented by a tree resulting from a Neighbour-Joining
algorithm using the function BioNJ (Gascuel, 1997) implemented in the R package ape
(Paradis et al., 2004). A total of 1000 bootstrap resamplings of the meanings were
performed to assess the robustness of the branches.
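The distance and bootstrap steps can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the definition of the linguistic distance given in chapter 1 is not reproduced here, so we assume that each speaker contributes a single variant per meaning, encoded as a one-hot indicator vector, in which case the Manhattan distance between two speakers equals twice the number of meanings on which they differ. The function names and the toy data are hypothetical, and the tree building itself (BioNJ in ape) is not reimplemented.

```python
import random


def manhattan_distance(row_a, row_b, meanings):
    """Manhattan distance between the one-hot encodings of two speakers.

    Assumption of this sketch: one variant per meaning, so a differing
    meaning contributes 2 (one indicator off, one on). Meanings with a
    missing variant (None) are skipped.
    """
    distance = 0
    for meaning in meanings:
        a, b = row_a.get(meaning), row_b.get(meaning)
        if a is None or b is None:
            continue
        if a != b:
            distance += 2
    return distance


def distance_matrix(table, meanings):
    """Return (sorted speaker ids, square matrix of pairwise distances)."""
    speakers = sorted(table)
    matrix = [[manhattan_distance(table[s], table[t], meanings)
               for t in speakers] for s in speakers]
    return speakers, matrix


def bootstrap_matrices(table, meanings, n_replicates=1000, seed=0):
    """Yield one distance matrix per bootstrap resampling of the meanings."""
    rng = random.Random(seed)
    for _ in range(n_replicates):
        # Resample the meanings with replacement, keeping their number fixed.
        resampled = [rng.choice(meanings) for _ in meanings]
        yield distance_matrix(table, resampled)[1]


# Toy data with placeholder variant labels (not real transcriptions).
toy_meanings = ["dog", "to drink", "sun"]
toy_table = {
    "speaker_01": {"dog": "A", "to drink": "A", "sun": "A"},
    "speaker_02": {"dog": "A", "to drink": "B", "sun": "A"},
    "speaker_03": {"dog": "B", "to drink": "B", "sun": None},
}
speakers, matrix = distance_matrix(toy_table, toy_meanings)
for speaker, row in zip(speakers, matrix):
    print(speaker, row)
```

Each bootstrap matrix would then be passed to the tree-building step, and the support of a branch would be the number of replicate trees in which it is recovered.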
The MCA differentiated two axes (Figure IV.2). The first axis differentiated
individuals from the northern and southern islands. The second axis differentiated
individuals from the south-eastern and south-western islands. This suggested that the
lexical diversity of the Cape Verde islands is structured in at least three different
groups.
The MCA makes it possible to determine which meanings among the
50 were responsible for the observed structure (Figure IV.3). The meanings close to
the origin (bottom left corner of Figure IV.3) were those for which diversity observed
across the islands did not contribute significantly to the three-group structure
observed in the MCA. The meanings close to 0 on the second axis, but high along the
first axis (bottom right corner of Figure IV.3), were those for which diversity reflects
the north / south linguistic structure. The meanings close to 0 on the first axis, but
high along the second axis (top left corner of Figure IV.3), were those for which
diversity reflects the south-east / south-west linguistic structure. Finally, the meanings
far along both the first and the second axis (top right corner of Figure IV.3) are those
which reflect the linguistic structure differentiating both the north / south islands and
the south-east / south-west islands. Note that it is almost exclusively the verbs that
structured both axes.
Figure IV.2 – MCA of the 84 individuals sampled in Cape Verde, coloured according to their birthplace: red for Santiago, orange for Brava, purple for Fogo, blue for São Vicente, and green for Santo Antão. The first axis differentiates the northern islands from the southern islands. The second axis differentiates the south-eastern island from the south-west island.
The Neighbour-Joining tree (Figure IV.4) showed four subgroups, with a clear
separation between the north and south islands, as well as a clear separation between
the islands of the south-east and the south-west, as previously shown by the MCA.
Moreover, the Neighbour-Joining tree seemed to indicate a structuring within the
south-eastern island (Santiago), between two subgroups linked to two different
birthplaces within the island. Categorizing the individuals according to their
birth place revealed a clearer structure than categorizing them
according to their sampling location.
Figure IV.3 – Representation of the weight of the 50 words in the MCA. The first axis differentiates the words with a low weight (close to the origin) from those with a high weight (far from the origin) in the differentiation between the north islands and the south islands. The second axis differentiates the words with a low weight (close to the origin) from those with a high weight (far from the origin) in the differentiation between the south-eastern island and the south-west island.
Figure IV.4 – Neighbour-joining trees based on the linguistic distance matrix, with a) each individual number coloured according to sampling location, and b) each individual number coloured according to birth place. The value at each edge gives the number of bootstrap trees, out of 1000, that contain this edge, shown for edges recovered at least 800 times. Individual 39 is coloured in black because she/he was born outside Cape Verde, in São Tomé and Príncipe.
Discussion
These preliminary analyses allowed us to describe the lexical linguistic diversity
sampled at the scale of individuals in several locations in Cape Verde. The graphic
representations of the MCA and the Neighbour-Joining tree seem to indicate a clear
structuring between different regions, even within the island of Santiago. We may
hypothesize that this result is the consequence of a structure between several linguistic
populations, in which the individuals preferentially communicate in a small
geographical area. Moreover, the structure was clearer when categorizing the
individuals according to their birth place than according to their sampling location.
This could indicate that the vocabulary used by the individuals is mainly determined
by the environment of their childhood, instead of their immediate environment. Verdu
et al. (2017) previously showed that the word frequencies sampled for each individual
were significantly correlated with their birth places, hypothesising that the vertical
linguistic transmission was not completely obliterated by horizontal or oblique
transmission.
Interestingly, verbs are responsible for the main part of this linguistic
structure, compared to objects. We may hypothesize that words describing actions
are more prone to mutate than words describing objects. Regardless of this
hypothesis, this indicates that each type of linguistic class should be studied
separately in order to understand their respective dynamics in the linguistic
history of populations (see chapter 3).
Based on this preliminary study, we think that going back and forth between
fieldwork and theory is particularly relevant when developing new methods
associated with the collection of new types of data. As we show here, some
methodological choices made in the field directly resulted from theoretical
imperatives, such as using a series of linguistic variants as evolving linguistic data, or
sampling a large set of individuals in order to obtain statistically usable
linguistic diversity.
Genetic data (more than 2.5 million SNPs) will be available in the fall of
2017, making it possible to perform joint analyses of genetic and linguistic diversities for the
same set of individuals. This will be an opportunity to apply the framework presented in
chapter 3, with the aim of inferring the genetic and linguistic coevolutionary history of
the Cape Verdean populations. It should allow us to assess whether the linguistic structure
described here reflects several differentiated linguistic populations. It should also
allow us to assess whether this linguistic structure matches a genetic structure,
providing valuable hints about the relative rules constraining reproductive and
communicative relationships in Cape Verde. Moreover, it should finally be possible to
infer the contingent historical events (population splits, population size changes,
migration, admixture, etc.) which affected, perhaps in different ways, the genetic
populations and the linguistic populations of this region.
Conclusion
Applying historical inference methods to genetic and linguistic data raised several
questions during this thesis. What are the theoretical and practical relationships
linking the collection of data in the field to historical inference? How can we build
models allowing inferences that are not merely verbal? How can inference methods
be articulated so as to combine genetic and linguistic diversities?
While historical inference in population genetics rests on a consensual
paradigm, this is not yet the case for historical inference in linguistics.
Several ways of constructing the object of research are associated with
several models, directly tied to different ways of sampling data in the
field and to different ways of conceiving the cultural diversity of human
populations. In the articulation between genetic and linguistic inferences,
several possible approaches delimit the multifaceted scientific objects that are
human populations, the genes they carry, and the languages they speak. Over the
course of this thesis, several lenses were adopted in an attempt to account for
genetic and linguistic co-evolution. This work was thus anchored in an
interdisciplinary practice seeking to clarify communication between disciplines and
the possible articulation between the presuppositions underlying each of them.
The use of Approximate Bayesian Computation (ABC) statistical methods
relies on fully explicit models of the processes at work. A thorough
formalization of genetic-linguistic co-evolution was therefore required, which in
turn demanded going beyond what each discipline takes for granted about genetic
and linguistic evolution and how they relate. My work as a whole can thus be
understood as the result of a double constraint: on the one hand, a powerful ABC
statistical method that requires the underlying models to be made explicit and
formalized; on the other, a set of disparate and informal disciplinary fields and
scientific practices, which obscure the conditions necessary for the work of
formalization.
The first chapter set out to propose, in parallel, an inference of the
history of genetic populations and an inference of the history of linguistic
varieties. Two objects of very different natures (genetic populations and
linguistic varieties) were brought together in order to reveal the complex
relationships that may exist between the genetic history of populations and the
history of the languages they speak. Linguistic and genetic data previously
collected in Central Asia were analysed within the framework of ABC methods, making
it possible to carry out parallel inferences that include complex scenarios of
migration and admixture, and to handle large datasets. The comparison between the
data from linguistic varieties and the data from genetic populations brought to light
histories that are sometimes decoupled, indicating that linguistic exchanges are
possible without genetic exchanges, and vice versa. This made it possible to assert
that the history of languages can differ from the history of genes, especially at a
local geographical scale.
A first shift in the way of considering the object of linguistic inference
was proposed in chapter two. This time, inter-individual linguistic diversity was
placed at the heart of the inference method, in order to assess the possibility of
an explicit population linguistics on an entirely different level from a historical
linguistics of languages. This theoretical reversal, from the linguistic general to
the inter-individual particular, made it possible to build another methodological
framework for historical inference, centred on a single population, again using ABC
methods. It was thus possible to select the transmission models most consistent with
real data from a set of Tajik speakers, and to estimate some parameters of the
selected models.
A second shift was made in chapter three, in continuity with the
previous one. Seeking to account for the plurality of approaches in evolutionary
linguistics and for the theoretical foundations of the different disciplines involved,
a formulation of linguistic evolution, associated with a prior formalization of the
theory of biological evolution, made it possible to clarify a conceptual framework
combining population genetics and population linguistics. The association between
these two fields then made it possible to make explicit several verbal hypotheses
that are classic in these research fields, and to integrate them into software that
jointly simulates the genetic and linguistic evolution of human populations
efficiently enough to make the computations feasible within a reasonable time. The
a priori evaluation of this theoretical and methodological framework using ABC
methods demonstrated its potential effectiveness, opening new perspectives for the
study of genetic and linguistic co-evolution at the scale of individuals.
Finally, the last chapter was an opportunity to detail fieldwork that
collected linguistic data at the inter-individual scale for Creole speakers of the
Cape Verde Islands. The descriptive analysis of the data brought to light a very
clear structuring between several sampling localities, resulting from a linguistic
history that remains to be studied.
These different studies belong to the tradition of the parallel originally
proposed by Darwin between biological evolution and linguistic evolution. It now
appears clearly that delimiting the objects of study is a crucial step in working
with this analogy, but that this step is far from trivial. Phylolinguistics, by
treating languages as homogeneous units, sets aside the complexity that can emerge
from the interaction among a set of agents. Thanks to a series of theoretical,
statistical and computational advances, it is now possible to study languages as
entities whose history emerges from the repeated interaction between the speakers
who make up linguistic populations. This change of perspective, analogous to the
move from species as homogeneous units to species as sets of diverse individuals,
opens avenues of analysis similar to those opened by population genetics relative to
the phylogeny of biological species.
Several future studies open up at the end of this work. This thesis first
offers a methodological framework for carrying out explicit historical inferences
from the integration of the genetic and linguistic diversities of human populations.
The methodology proposed here relies on explicit simulations; it therefore gives
access to very complex processes, and in particular makes it possible to study
events that affect individuals themselves and that carry over, through emergence, to
the diversities observed at broader scales. It thus seems possible to study the
history of genetic and linguistic co-evolution in a new light, enriched by a set of
questions that can now be addressed directly. What rules govern linguistic
interactions at the scale of individuals? Which historical events encountered by
human populations have structured the genetic and linguistic diversities observable
today? Can we determine to what extent the spatio-temporal constraints governing
reproductive relationships coincide with those governing relationships of linguistic
communication? These questions, along with many others related to the study of
genetic and linguistic co-evolution, remain to be elucidated.
These new perspectives nevertheless require many constraints to be
articulated with one another. Sampling in the field, consistent with theoretical
perspectives delimited beforehand, all within an efficient methodological framework:
these are the indispensable elements for the proper conduct of such a project. To
these three interrelated levels (field, theory, method) is added the need to bring
together several scientific disciplines. This is where interdisciplinary practice
must come in, which involves an essential effort to make communication possible
between the different disciplinary languages involved.
But a fourth constraint arises in the very possibility of interdisciplinary
work within an institution that can sometimes, in certain respects, prove resistant.
Indeed, despite an institutional discourse that tends to promote interdisciplinarity,
the rigour and the upstream work that such a practice requires are difficult to put
in place while also meeting the demand for productivity that weighs on young
researchers. Moreover, the organization of research institutions into clearly
separated disciplines sometimes makes it very difficult to find a place for a project
at the boundary between several disciplines. The aim of fully combining several
disciplines, without settling for an exchange of services or an occasional borrowing
from neighbouring disciplines, has to make its way through an environment whose
practices are often incommensurable with one another. This separation between
disciplines, as I mentioned in the foreword of this manuscript, is all the more
present because it is sustained by highly differentiated disciplinary languages whose
presuppositions are sometimes hard to reconcile. Thus, unless an interdisciplinary
practice already exists within research laboratories, it must be implemented
individually each time, drawing on solid methodological tools. Accordingly, Bühlera
et al. (2012) state that:
While the individual practice of interdisciplinarity already raises a set of questions in
itself, it seemed to us that the young researcher faced particular issues directly linked to
his or her status. Through the individualization of interdisciplinarity, the practice shifts
from a collective issue to an epistemological one, leading to questions about the
foundation and nature of the sciences.
It seems to me that taking up the epistemological dimension of
interdisciplinary research problems has every reason to be encouraged, in proportion
to the scientific perspectives it opens. I have tried to show over the course of this
thesis that it can be highly fruitful. It seems to me that interdisciplinarity is the
crucible of a set of infinitely rich mixtures between scientific disciplines, opening
the way to as many different ways of looking at the complexity of the world.
Bibliography
Abdi, H., and Valentin, D. (2007). Multiple Correspondence Analysis.
Aimé, C., Verdu, P., Ségurel, L., Martinez-Cruz, B., Hegay, T., Heyer, E., and Austerlitz, F. (2014). Microsatellite data show recent demographic expansions in sedentary but not in nomadic human populations in Africa and Eurasia. European Journal of Human Genetics.
Aljanabi, S.M., and Martinez, I. (1997). Universal and rapid salt-extraction of high quality genomic DNA for PCR-based techniques. Nucleic Acids Research 25, 4692–4693.
Alvarez-Péreyre, F. (2003). L’exigence interdisciplinaire: une pédagogie de l’interdisciplinarité en linguistique, ethnologie et ethnomusicologie (Paris, France: Éditions de la Maison des sciences de l’homme).
Alvarez-Pereyre, F. (2014). Linguistique, anthropologie, ethnomusicologie : Regards croisés. as 38, 47–61.
Alves, I., Arenas, M., Currat, M., Sramkova Hanulova, A., Sousa, V.C., Ray, N., and Excoffier, L. (2016). Long-Distance Dispersal Shaped Patterns of Human Genetic Diversity in Eurasia. Molecular Biology and Evolution 33, 946–958.
Amiel, P. (2010). Ethnométhodologie appliquée Éléments de sociologie praxéologique. (Paris: Les presses du Lema).
Amorim, C.E.G., Bisso-Machado, R., Ramallo, V., Bortolini, M.C., Bonatto, S.L., Salzano, F.M., and Hünemeier, T. (2013). A Bayesian Approach to Genome/Linguistic Relationships in Native South Americans. PLoS ONE 8, e64099.
Atkinson, Q.D. (2011). Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa. Science 332, 346–349.
Atkinson, Q.D. (2013). The descent of words. Proceedings of the National Academy of Sciences 110, 4159–4160.
Atkinson, Q., and Gray, R. (2005). Curious Parallels and Curious Connections—Phylogenetic Thinking in Biology and Historical Linguistics. Systematic Biology 54, 513–526.
Atkinson, Q., Nicholls, G., Welch, D., and Gray, R. (2005). From words to dates: water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society 103, 193–219.
Atkinson, Q.D., Meade, A., Venditti, C., Greenhill, S.J., and Pagel, M. (2008). Languages Evolve in Punctuational Bursts. Science 319, 588–588.
Bahuchet, S. (2012). Changing language, remaining pygmy. Human Biology 84, 11–43.
Balanovsky, O., Dibirova, K., Dybo, A., Mudrak, O., Frolova, S., Pocheshkhova, E., Haber, M., Platt, D., Schurr, T., Haak, W., et al. (2011). Parallel Evolution of Genes and Languages in the Caucasus Region. Molecular Biology and Evolution 28, 2905–2920.
Balazs, I. (1993). Population genetics of 14 ethnic groups using phenotypic data from VNTR loci. EXS 67, 193–210.
Barberousse, A., and Samadi, S. (2015). Formalising Evolutionary Theory. In Handbook of Evolutionary Thinking in the Sciences, T. Heams, P. Huneman, G. Lecointre, and M. Silberstein, eds. (Dordrecht: Springer Netherlands), pp. 229–246.
Barbujani, G., and Sokal, R.R. (1990). Zones of sharp genetic change in Europe are also linguistic boundaries. Proceedings of the National Academy of Sciences 87, 1816–1819.
Baxter, G.J., Blythe, R.A., Croft, W., and McKane, A.J. (2006). Utterance selection model of language change. Phys. Rev. E 73, 046118.
Baxter, G.J., Blythe, R.A., Croft, W., and McKane, A.J. (2009). Modeling language change: An evaluation of Trudgill’s theory of the emergence of New Zealand English. Language Variation and Change 21, 257.
Beaumont, M.A., and Rannala, B. (2004). The Bayesian revolution in genetics. Nat Rev Genet 5, 251–261.
Beaumont, M.A., Zhang, W., and Balding, D.J. (2002). Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035.
Beckner, C., Blythe, R., Bybee, J., Christiansen, M.H., Croft, W., Ellis, N.C., Holland, J., Ke, J., Larsen-Freeman, D., and Schoenemann, T. (2009). Language is a complex adaptive system: Position paper. Language Learning 59, 1–26.
Belle, E.M.S., and Barbujani, G. (2007). Worldwide analysis of multiple microsatellites: Language diversity has a detectable influence on DNA diversity. American Journal of Physical Anthropology 133, 1137–1146.
Ben Hamed, M., and Darlu, P. (2007). Gènes et Langues : une longue histoire commune ? Bulletins et mémoires de la Société d’Anthropologie de Paris 243–264.
Blevins, J. (2004). Evolutionary Phonology: The Emergence of Sound Patterns (Cambridge University Press).
Blum, M.G.B., and François, O. (2010). Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing 20, 63–73.
Blythe, R.A., and Croft, W. (2012). S-curves and the mechanisms of propagation in language change. Language 88, 269–304.
Bomin, S.L., Lecointre, G., and Heyer, E. (2016). The Evolution of Musical Diversity: The Key Role of Vertical Transmission. PLOS ONE 11, e0151570.
Bonfils, B. (1990). Connaissance scientifique et connaissance profane : de la générativité paradigmatique de l’opinion. Revue française de science politique 40, 382–391.
Bornand, S., and Leguy, C. (2013). Anthropologie des pratiques langagières (Armand Colin).
Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S.J., Alekseyenko, A.V., Drummond, A.J., Gray, R.D., Suchard, M.A., and Atkinson, Q.D. (2012). Mapping the Origins and Expansion of the Indo-European Language Family. Science 337, 957–960.
Bowern, C., and Atkinson, Q. (2012). Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88, 817–845.
Breiman, L. (1999). Random forests. UC Berkeley TR567.
Bühlera, È.A., Cavaillé, F., and Gambino, M. (2012). Le jeune chercheur et l’interdisciplinarité en sciences sociales, Young researchers and interdisciplinarity in social sciences. Reconsidering practices. Natures Sciences Sociétés 14, 392–398.
Cabrera, F. (2017). Cladistic Parsimony, Historical Linguistics and Cultural Phylogenetics. Mind Lang 32, 65–100.
Calame, C. (1986). Le récit en Grèce ancienne: énonciations et représentations de poètes (Méridiens/Klincksieck).
Campbell, L. (2006). Languages and Genes in Collaboration: some Practical Matters. (University of California, Santa Barbara), p.
Cangelosi, A., Smith, A.D.M., and Smith, K. (2006). The Evolution of Language: Proceedings of the 6th International Conference (EVOLANG6), Rome, Italy, 12-15 April 2006 (World Scientific).
Cann, R.L. (2001). Genetic Clues to Dispersal in Human Populations: Retracing the Past from the Present. Science 291, 1742–1748.
Cavalli-Sforza, L.L. (1997). Genes, peoples, and languages. Proceedings of the National Academy of Sciences 94, 7719–7724.
Cavalli-Sforza, L.L., and Feldman, M.W. (1981). Cultural transmission and evolution: a quantitative approach. Monogr Popul Biol 16, 1–388.
Cavalli-Sforza, L.L., and Feldman, M.W. (2003). The application of molecular genetic approaches to the study of human evolution. Nat. Genet. 33 Suppl, 266–275.
Cavalli-Sforza, L.L., Barrai, I., and Edwards, A.W.F. (1964). Analysis of Human Evolution Under Random Genetic Drift. Cold Spring Harb Symp Quant Biol 29, 9–20.
Cavalli-Sforza, L.L., Piazza, A., Menozzi, P., and Mountain, J. (1988). Reconstruction of human evolution: bringing together genetic, archaeological, and linguistic data. Proceedings of the National Academy of Sciences 85, 6002–6006.
Cavalli-Sforza, L.L., Minch, E., and Mountain, J.L. (1992). Coevolution of genes and languages revisited. Proceedings of the National Academy of Sciences 89, 5620–5624.
Čelakovský, F.L. (1853). Čtení o srovnavací mluvnici slovanské na Universitě pražské (Rivnáč).
Chafe, W.L. (1980). The Pear Stories: Cognitive, Cultural, and Linguistic Aspects of Narrative Production (Ablex).
Chakraborty, R. (1976). Cultural, language and geographical correlates of genetic variability in Andean highland Indians. Nature 264, 350–352.
Chakravarti, A. (1999). Population genetics—making sense out of sequence. Nature Genetics 21, 56–60.
Chomsky, N. (2006). Language and Mind (Cambridge ; New York: Cambridge University Press).
Claidière, N., and André, J.-B. (2012). The Transmission of Genes and Culture: A Questionable Analogy. Evolutionary Biology 39, 12–24.
Creanza, N., Ruhlen, M., Pemberton, T.J., Rosenberg, N.A., Feldman, M.W., and Ramachandran, S. (2015). A comparison of worldwide phonemic and genetic variation in human populations. Proceedings of the National Academy of Sciences 112, 1265–1272.
Croft, W. (1996). Linguistic Selection: An Utterance-based Evolutionary Theory of Language Change. Nordic Journal of Linguistics 19, 99.
Croft, W. (2006). The relevance of an evolutionary model to historical linguistics. In Competing Models of Linguistic Change: Evolution and Beyond, (John Benjamins Publishing), pp. 91–132.
Croft, W. (2008). Evolutionary Linguistics. Annual Review of Anthropology 37, 219–234.
Croft, W. (2013). Evolution: Language use and the evolution of languages. In The Language Phenomenon, (Springer), pp. 93–120.
Csilléry, K., Blum, M.G., Gaggiotti, O.E., and François, O. (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution 25, 410–418.
Csilléry, K., François, O., and Blum, M.G.B. (2012). abc: an R package for approximate Bayesian computation (ABC): R package: abc. Methods in Ecology and Evolution 3, 475–479.
Culbertson, J. (2012). Typological Universals as Reflections of Biased Learning: Evidence from Artificial Language Learning: Typological Universals as Reflections of Biased Learning. Language and Linguistics Compass 6, 310–329.
Danescu-Niculescu-Mizil, C., West, R., Jurafsky, D., Leskovec, J., and Potts, C. (2013). No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the 22nd International Conference on World Wide Web, (ACM), pp. 307–318.
Darlu, P., Bloothooft, G., Boattini, A., Brouwer, L., Brouwer, M., Brunet, G., Chareille, P., Cheshire, J., Coates, R., Dräger, K., et al. (2012). The Family Name as Socio-Cultural Feature and Genetic Metaphor: From Concepts to Methods. Human Biology 84, 169–214.
Darwin, C. (1871). The Descent of man (D. Appleton and Company).
Davidson, D. (1967). Truth and meaning. Synthese 17, 304–323.
Davidson, D. (1973). On the Very Idea of a Conceptual Scheme. Proceedings and Addresses of the American Philosophical Association 47, 5–20.
Debouzie, D. (1999). La notion de population en dynamique et génétique des populations. Nature Sciences Sociétés 7, 19–26.
Delamotte, É. (2004). Communautés professionnelles, sens commun et doctrine. Études de communication. langages, information, médiations.
D’Errico, F., and Hombert, J.M. (2009). Becoming eloquent advances in the emergence of language, human cognition, and modern cultures (Amsterdam; Philadelphia, Pa.: John Benjamins Pub. Co.).
Diller, K.C., and Cann, R.L. (2011). Genetic influences on language evolution: an evaluation of the evidence.
Drummond, A.J., and Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7, 214.
Dryer, M.S., and Haspelmath, M. (2013). The World Atlas of Language Structures Online (Leipzig: Max Planck Institute for Evolutionary Anthropology).
Duda, P., and Zrzavý, J. (2016). Human population history revealed by a supertree approach. Scientific Reports 6.
Estoup, A., Jarne, P., and Cornuet, J.-M. (2002). Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis. Molecular Ecology 11, 1591–1604.
Evans, N., and Levinson, S.C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32, 429.
Excoffier, L., and Foll, M. (2011). Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27, 1332–1334.
Excoffier, L., and Lischer, H.E.L. (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources 10, 564–567.
Excoffier, L., Dupanloup, I., Huerta-Sánchez, E., Sousa, V.C., and Foll, M. (2013). Robust Demographic Inference from Genomic and SNP Data. PLoS Genetics 9, e1003905.
Falush, D., van Dorp, L., and Lawson, D. (2016). A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. BioRxiv 066431.
Fitch, W.T. (2008). Glossogeny and phylogeny: cultural evolution meets genetic evolution. Trends in Genetics 24, 373–374.
Garza, J.C., and Williamson, E.G. (2001). Detection of reduction in population size using data from microsatellite loci. Molecular Ecology 10, 305–318.
Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14, 685–695.
Gayon, J. (2004). La génétique est-elle encore une discipline ? ms 20, 248–253.
Geisler, H., and List, J.-M. (2013). Do languages grow on trees? The tree metaphor in the history of linguistics. Classification and Evolution in Biology, Linguistics and the History of Science. Concepts–methods–visualization. Stuttgart: Franz Steiner Verlag 111–24.
Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L., and Feldman, M.W. (1995). Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. U.S.A. 92, 6723–6727.
Gong, T., and Wang, W.S. (2005). Computational modeling on language emergence: A coevolution model of lexicon, syntax and social structure. Language and Linguistics 6, 1.
Gould, S.J. (2002). The Structure of Evolutionary Theory (Harvard University Press).
Gould, S.J., and Lewontin, R.C. (1979). The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proceedings of the Royal Society of London B: Biological Sciences 205, 581–598.
Gray, R.D., and Atkinson, Q.D. (2002). Language-tree divergence times support the Anatolian theory of Indo-European origin. Geophysical Research Letters 29.
Gray, R.D., and Jordan, F.M. (2000). Language trees support the express-train sequence. Nature 405, 1052–1055.
Gray, R.D., Greenhill, S.J., and Ross, R.M. (2007). The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2, 360–375.
Gray, R.D., Drummond, A.J., and Greenhill, S.J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323, 479–483.
Greenhill, S.J., Currie, T.E., and Gray, R.D. (2009). Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society B: Biological Sciences 276, 2299–2306.
Guillot, E.G., Hazelton, M.L., Karafet, T.M., Lansing, J.S., Sudoyo, H., and Cox, M.P. (2015). Relaxed Observance of Traditional Marriage Rules Allows Social Connectivity without Loss of Genetic Diversity. Mol Biol Evol 32, 2254–2262.
Guillot, G., Mortier, F., and Estoup, A. (2005). Geneland: a computer package for landscape genetics. Molecular Ecology Notes 5, 712–715.
Gunya, A., N. F. Glazovsky., Leadership for Environment and Development., and Institut geografii (Rossijskaja akademija nauk) (2002). Yagnob valley: Nature, history, and chances of a mountain community development in Tadjikistan (Moscow: KMK Scientific Press).
Haber, M., Mezzavilla, M., Xue, Y., Comas, D., Gasparini, P., Zalloua, P., and Tyler-Smith, C. (2016). Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations. European Journal of Human Genetics 24, 931–936.
Haeckel, E. (1874). The Evolution of Man.
Hamed, M.B. (2005). Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China’s demic history. Proceedings of the Royal Society B: Biological Sciences 272, 1015–1022.
Hartl, D.L., and Clark, A.G. (2007). Principles of Population Genetics (Sinauer Associates, Incorporated).
Haspelmath, M. (1999). Optimality and diachronic adaptation. Zeitschrift Für Sprachwissenschaft 18, 180–205.
Haspelmath, M., and Tadmor, U. (2009). Loanwords in the World’s Languages: A Comparative Handbook (Walter de Gruyter).
Haugen, E. (1950). The Analysis of Linguistic Borrowing. Language 26, 210.
Heeringa, W., and Nerbonne, J. (2001). Dialect areas and dialect continua. Language Variation and Change 13, 375–400.
Hellenthal, G., Busby, G.B., Band, G., Wilson, J.F., Capelli, C., Falush, D., and Myers, S. (2014). A genetic atlas of human admixture history. Science 343, 747–751.
Henrich, J. (2001). Cultural transmission and the diffusion of innovations: Adoption dynamics indicate that biased cultural transmission is the predominate force in behavioral change. American Anthropologist 103, 992–1013.
Henry, J.-P., and Gouyon, P.-H. (1999). Précis de génétique des populations: cours, exercices et problèmes résolus (Dunod).
Heyer, E., Balaresque, P., Jobling, M.A., Quintana-Murci, L., Chaix, R., Segurel, L., Aldashev, A., and Hegay, T. (2009). Genetic diversity and the emergence of ethnic groups in Central Asia. BMC Genetics 10, 49.
Hoban, S., Bertorelle, G., and Gaggiotti, O.E. (2012). Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics 13, 110–122.
Holbrook, J.B. (2013). What is interdisciplinary communication? Reflections on the very idea of disciplinary integration. Synthese 190, 1865–1879.
Huang, T., Shu, Y., and Cai, Y.-D. (2015). Genetic differences among ethnic groups. BMC Genomics 16, 1093.
Huelsenbeck, J.P., and Crandall, K.A. (1997). Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood. Annual Review of Ecology and Systematics 28, 437–466.
Hull, D.L. (1988). Science as a Process. An Evolutionary Account of the Social and Conceptual Development of Science.
Hunley, K. (2015). Reassessment of global gene–language coevolution. Proceedings of the National Academy of Sciences 112, 1919–1920.
Hunley, K., Dunn, M., Lindström, E., Reesink, G., Terrill, A., Healy, M.E., Koki, G., Friedlaender, F.R., and Friedlaender, J.S. (2008). Genetic and Linguistic Coevolution in Northern Island Melanesia. PLoS Genetics 4, e1000239.
Hunley, K., Bowern, C., and Healy, M. (2012). Rejection of a serial founder effects model of genetic and linguistic coevolution. Proceedings of the Royal Society B: Biological Sciences 279, 2281–2288.
I. Barrai, A. Rodriguez-Larralde, E (2000). Elements of the surname structure of Austria. Annals of Human Biology 27, 607–622.
Jobling, M.A., Hurles, M., and Tyler-Smith, C. (2003). Human Evolutionary Genetics: Origins, Peoples and Disease (New York: Garland Science).
Jones, W. (1786). The Sanskrit Language.
Judson, O.P. (1994). The rise of the individual-based model in ecology. Trends in Ecology & Evolution 9, 9–14.
Kandler, A., Unger, R., and Steele, J. (2010). Language shift, bilingualism and the future of Britain’s Celtic languages. Philosophical Transactions of the Royal Society B: Biological Sciences 365, 3855–3864.
Kasavin, I.T. (2009). L’idée d’interdisciplinarité dans l’épistémologie contemporaine. Diogène 38–57.
Kauhanen, H. (2016). Neutral change. Journal of Linguistics 1–32.
Khuri, S.F., Henderson, W.G., Daley, J., Jonasson, O., Jones, R.S., Campbell, D.A., Fink, A.S., Mentzer, R.M., and Steeger, J.E. (2007). The Patient Safety in Surgery Study: Background, Study Design, and Patient Populations. Journal of the American College of Surgeons 204, 1089–1102.
Kimura, M. (1983). The Neutral Theory of Molecular Evolution (Cambridge University Press).
Kingman, J.F.C. (1982). The coalescent. Stochastic Processes and Their Applications 13, 235–248.
Kirby, S. (2000). The role of I-language in diachronic adaptation. Sprachwissenschaft 18, 2.
Kirby, S. (2001). Spontaneous evolution of linguistic structure-an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation 5, 102–110.
Kirby, K.R., Gray, R.D., Greenhill, S.J., Jordan, F.M., Gomes-Ng, S., Bibiko, H.-J., Blasi, D.E., Botero, C.A., Bowern, C., Ember, C.R., et al. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLOS ONE 11, e0158391.
Kirby, S., Dowman, M., and Griffiths, T.L. (2007). Innateness and culture in the evolution of language. PNAS 104, 5241–5245.
Kirby, S., Cornish, H., and Smith, K. (2008). Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences 105, 10681–10686.
Kirby, S., Griffiths, T., and Smith, K. (2014). Iterated learning and the evolution of language. Current Opinion in Neurobiology 28, 108–114.
Kirby, S., Tamariz, M., Cornish, H., and Smith, K. (2015). Compression and communication in the cultural evolution of linguistic structure. Cognition 141, 87–102.
Klein, J.T. (2013). Communication and collaboration in interdisciplinary research. Enhancing Communication & Collaboration in Crossdisciplinary Research, Edited by M. O’Rourke, S. Crowley, SD Eigenbrode, and JD Wulfhorst 11–30.
Labov, W. (1972). Sociolinguistic Patterns (University of Pennsylvania Press).
Lakatos, I. (1976). Falsification and the Methodology of Scientific Research Programmes. In Can Theories Be Refuted?, S.G. Harding, ed. (Springer Netherlands), pp. 205–259.
Lansing, J.S., Cox, M.P., Downey, S.S., Gabler, B.M., Hallmark, B., Karafet, T.M., Norquest, P., Schoenfelder, J.W., Sudoyo, H., Watkins, J.C., et al. (2007). Coevolution of languages and genes on the island of Sumba, eastern Indonesia. Proceedings of the National Academy of Sciences 104, 16022–16026.
Lê, S., Josse, J., Husson, F., and others (2008). FactoMineR: an R package for multivariate analysis. Journal of Statistical Software 25, 1–18.
Lees, R.B. (1953). The Basis of Glottochronology. Language 29, 113–127.
Lefevre, T., Raymond, M., and Thomas, F. (2016). Biologie évolutive (De Boeck Superieur).
Lewontin, R.C. (1970). The units of selection. Annual Review of Ecology and Systematics 1, 1–18.
List, J.-M., Nelson-Sathi, S., Geisler, H., and Martin, W. (2014). Networks of lexical borrowing and lateral gene transfer in language and genome evolution: Think again. BioEssays 36, 141–150.
List, J.-M., Pathmanathan, J.S., Lopez, P., and Bapteste, E. (2016). Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.
Livingstone, D., and Fyfe, C. (1999). Modelling the evolution of linguistic diversity. Advances in Artificial Life 704–708.
Long, J.C. (1991). The genetic structure of admixed populations. Genetics 127, 417–428.
MacIntyre, A. (1988). Whose justice? Which rationality? (Duckworth).
Maingueneau, D. (1979). L’analyse du discours. Repères pour la rénovation de l’enseignement du français à l’école élémentaire 51, 3–27.
Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A., et al. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206.
Manni, F. (2017). Linguistic probes into human history. University of Groningen.
Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research 27, 209–220.
Mardis, E.R. (2008). Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics 9, 387–402.
Martínez-Cruz, B., Vitalis, R., Ségurel, L., Austerlitz, F., Georges, M., Théry, S., Quintana-Murci, L., Hegay, T., Aldashev, A., Nasyrova, F., et al. (2011). In the heartland of Eurasia: the multilocus genetic landscape of Central Asian populations. European Journal of Human Genetics 19, 216–223.
Maynard Smith, J. (1987). How to model evolution. In The Latest on the Best, Essays on Evolution and Optimality, (Cambridge: MIT Press), pp. 119–131.
McDonald, S.P., Collins, J.F., and Johnson, D.W. (2003). Obesity Is Associated with Worse Peritoneal Dialysis Outcomes in the Australia and New Zealand Patient Populations. JASN 14, 2894–2901.
Mennecier, P., Nerbonne, J., Heyer, E., and Manni, F. (2016a). A Central Asian Language Survey. Language Dynamics and Change 6, 57–98.
Mennecier, P., Nerbonne, J., Heyer, E., and Manni, F. (2016b). A Central Asian language survey: Collecting data, measuring relatedness and detecting loans.
Mesoudi, A., Whiten, A., and Laland, K.N. (2006). Towards a unified science of cultural evolution. Behavioral and Brain Sciences 29, 329–347.
Moon, J.H. (1994). Putting Anthropology Back Together Again: The Ethnogenetic Critique of Cladistic Theory. American Anthropologist 96, 925–948.
Moran, P.A.P. (1958). Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society 54, 60.
Moreno-Estrada, A., Gravel, S., Zakharia, F., McCauley, J.L., Byrnes, J.K., Gignoux, C.R., Ortiz-Tello, P.A., Martínez, R.J., Hedges, D.J., Morris, R.W., et al. (2013). Reconstructing the Population Genetic History of the Caribbean. PLoS Genetics 9, e1003925.
Mullis, K.B., and Faloona, F.A. (1987). Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Meth. Enzymol. 155, 335–350.
Murphy, G.L., and Medin, D.L. (1985). The role of theories in conceptual coherence. Psychological Review 92, 289.
Nadasdi, T., Mougeon, R., and Rehner, K. (2008). Factors driving lexical variation in L2 French: A variationist study of automobile, auto, voiture, char and machine. Journal of French Language Studies 18, 365–381.
Nagel, E. (1961). The Structure of Science. American Journal of Physics 29, 716–716.
Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences 70, 3321–3323.
Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Annals of Human Genetics 41, 225–233.
Nei, M. (1987). Molecular Evolutionary Genetics (Columbia University Press).
Nettle, D., and Harriss, L. (2003). Genetic and linguistic affinities between human populations in Eurasia and West Africa. Human Biology 331–344.
Nguyên-Duy, V., and Luckerhoff, J. (2006). Constructivisme/positivisme : où en sommes-nous avec cette opposition? (Université McGill (Montréal),), p.
Niyogi, P., and Berwick, R.C. (1997). Evolutionary Consequences of Language Learning. Linguistics and Philosophy 20, 697–719.
Novembre, J., and Stephens, M. (2008). Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40, 646–649.
Nowak, M.A., Komarova, N.L., and Niyogi, P. (2002). Computational and evolutionary aspects of language. Nature 417, 611.
Pagel, M. (2009). Human language as a culturally transmitted replicator. Nature Reviews Genetics.
Pagel, M., Atkinson, Q.D., and Meade, A. (2007a). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717–720.
Pagel, M., Atkinson, Q.D., and Meade, A. (2007b). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717–720.
Pagel, M., Atkinson, Q.D., Calude, A.S., and Meade, A. (2013). Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences 110, 8471–8476.
Palminteri, S., Wyart, V., and Koechlin, E. (2017). The Importance of Falsification in Computational Cognitive Modeling. Trends in Cognitive Sciences 21, 425–433.
Palstra, F.P., Heyer, E., and Austerlitz, F. (2015). Statistical inference on genetic data reveals the complex demographic history of human populations in Central Asia. Molecular Biology and Evolution msv030.
Paradis, E., Claude, J., and Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289–290.
Pateman, T. (1983). What is a language? Language & Communication 3, 101–127.
Popper, K. (1979). Three worlds (Ann Arbor: University of Michigan).
Preyer, G., and Peter, G. (2005). Contextualism in philosophy: knowledge, meaning, and truth (Oxford : New York: Clarendon Press ; Oxford University Press).
Pritchard, J.K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945–959.
Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., and Robert, C.P. (2016). Reliable ABC model choice via random forests. Bioinformatics 32, 859–866.
Ramachandran, S., Deshpande, O., Roseman, C.C., Rosenberg, N.A., Feldman, M.W., and Cavalli-Sforza, L.L. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proceedings of the National Academy of Sciences of the United States of America 102, 15942–15947.
Ramallo, V., Bisso-Machado, R., Bravi, C., Coble, M.D., Salzano, F.M., Hünemeier, T., and Bortolini, M.C. (2013). Demographic expansions in South America: Enlightening a complex scenario with genetic and linguistic data. American Journal of Physical Anthropology 150, 453–463.
Reali, F., and Griffiths, T.L. (2010). Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift. Proceedings of the Royal Society B: Biological Sciences 277, 429–436.
Reesink, G., Singer, R., and Dunn, M. (2009). Explaining the Linguistic Diversity of Sahul Using Population Models. PLOS Biology 7, e1000241.
Reich, D., Price, A.L., and Patterson, N. (2008). Principal component analysis of genetic data. Nat Genet 40, 491–492.
Reich, D., Thangaraj, K., Patterson, N., Price, A.L., and Singh, L. (2009). Reconstructing Indian population history. Nature 461, 489–494.
Renfrew, C. (1987). Archaeology and language: the puzzle of Indo-European origins (J. Cape).
Resnick, L.B., Levine, J.M., and Teasley, S.D. (1991). Perspectives on Socially Shared Cognition (American Psychological Association).
Robinson, J.D., Bunnefeld, L., Hearn, J., Stone, G.N., and Hickerson, M.J. (2014). ABC inference of multi-population divergence with admixture from unphased population genomic data. Molecular Ecology 23, 4458–4471.
Rogers, D.S., Feldman, M.W., and Ehrlich, P.R. (2009). Inferring population histories using cultural data. Proceedings of the Royal Society B: Biological Sciences 276, 3835–3843.
Ruhlen, M. (1991). A Guide to the World’s Languages: Classification (Stanford University Press).
Saffran, J.R. (2003). Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science 12, 110–114.
Sagaut, P. (2008). Introduction à la pensée scientifique moderne.
Saitou, N., and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425.
Saussure, F. de (1916). Cours de linguistique générale (Payot).
Scheinfeldt, L.B., Soi, S., and Tishkoff, S.A. (2010). Working toward a synthesis of archaeological, linguistic, and genetic data for inferring African population history. Proceedings of the National Academy of Sciences 107, 8931–8938.
Schiffels, S., and Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics 46, 919–925.
Schleicher, A. (1853). Die ersten Spaltungen des indogermanischen Urvolkes [The first splits of the Indo-European prehistoric people].
Sellars, W.S. (1956). Empiricism and the Philosophy of Mind. Minnesota Studies in the Philosophy of Science 1, 253–329.
Smith, A.D.M. (2014). Models of language evolution and change. WIREs Cogn Sci 5, 281–293.
Smith, K., Kirby, S., and Brighton, H. (2003). Iterated learning: A framework for the emergence of language. Artificial Life 9, 371–386.
Soucek, S. (2000). A History of Inner Asia (Cambridge University Press).
Steele, J., and Kandler, A. (2010). Language trees ≠ gene trees. Theory in Biosciences 129, 223–233.
Steels, L. (1997). The Synthetic Modeling of Language Origins. Evolution of Communication 1, 1–34.
Steels, L. (2004). Analogies between genome and language evolution. In Artificial Life IX: Proceedings of the Ninth International Conference on the Simulation and Synthesis of Artificial Life, (MIT Press), p. 200.
Steels, L. (2011). Modeling the cultural evolution of language. Physics of Life Reviews 8, 339–356.
Stephens, M., and Donnelly, P. (2003). A comparison of bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics 73, 1162–1169.
Sturtevant, E.H. (1947). An Introduction to Linguistic Science.
Suppes, P. (1961). A comparison of the meaning and uses of models in mathematics and the empirical sciences. In The Concept and the Role of the Model in Mathematics and Natural and Social Sciences, (Springer), pp. 163–177.
Swadesh, M. (1952). Lexico-Statistic Dating of Prehistoric Ethnic Contacts: With Special Reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96, 452–463.
Szathmáry, E., and Maynard Smith, J. (1997). From replicators to reproducers: the first major transitions leading to life. J. Theor. Biol. 187, 555–571.
Tamariz, M., and Kirby, S. (2015). Culture: Copying, Compression, and Conventionality. Cognitive Science 39, 171–183.
Tamariz, M., and Kirby, S. (2016). The cultural evolution of language. Current Opinion in Psychology 8, 37–43.
Tao Gong (2010). Exploring the Roles of Horizontal, Vertical, and Oblique Transmissions in Language Evolution. Adaptive Behavior 18, 356–376.
Tavaré, S., Balding, D.J., Griffiths, R.C., and Donnelly, P. (1997). Inferring Coalescence Times from DNA Sequence Data. Genetics 145, 505–518.
Tehrani, J.J. (2013). The Phylogeny of Little Red Riding Hood. PLOS ONE 8, e78871.
Testart, A. (2011). Les modèles biologiques sont-ils utiles pour penser l’évolution des sociétés? Préhistoires Méditerranéennes.
Thioulouse, J., Chessel, D., Dolédec, S., and Olivier, J.-M. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing 7, 75–83.
Thomsen, O.N. (2006). Competing Models of Linguistic Change: Evolution and Beyond (John Benjamins Publishing).
Thouzeau, V., Mennecier, P., Verdu, P., and Austerlitz, F. (2017). Genetic and linguistic histories in Central Asia inferred using approximate Bayesian computations. Proc. R. Soc. B 284, 20170706.
Verdu, P., Austerlitz, F., Estoup, A., Vitalis, R., Georges, M., Théry, S., Froment, A., Le Bomin, S., Gessain, A., Hombert, J.-M., et al. (2009). Origins and Genetic Diversity of Pygmy Hunter-Gatherers from Western Central Africa. Current Biology 19, 312–318.
Verdu, P., Jewett, E.M., Pemberton, T.J., Rosenberg, N.A., and Baptista, M. (2017). Parallel Trajectories of Genetic and Linguistic Admixture in a Genetically Admixed Creole Population. Current Biology 27, 2529–2535.e3.
Verleyen, S. (2007). Le fonctionnalisme entre système linguistique et sujet parlant: Jakobson et Troubetzkoy face à Martinet. Cahiers Ferdinand de Saussure 163–188.
Vogt, P. (2009). Modeling interactions between language evolution and demography. Human Biology 81, 237–258.
Waples, R.S., and Gaggiotti, O. (2006). What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular Ecology 15, 1419–1439.
Ward, R.H., Redd, A., Valencia, D., Frazier, B., and Pääbo, S. (1993). Genetic and linguistic differentiation in the Americas. Proceedings of the National Academy of Sciences 90, 10663–10667.
Watson, C.I., Maclagan, M., and Harrington, J. (2000). Acoustic evidence for vowel change in New Zealand English. Language Variation and Change 12, 51–68.
Weber, J.L., and Wong, C. (1993). Mutation of human short tandem repeats. Human Molecular Genetics 2, 1123–1128.
Weir, B.S., and Cockerham, C.C. (1984). Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358.
Weiss, G., and von Haeseler, A. (1998). Inference of population history using a likelihood approach. Genetics 149, 1539–1546.
Wiley, E.O., and Lieberman, B.S. (2011). Phylogenetics: Theory and Practice of Phylogenetic Systematics (John Wiley & Sons).
Wittgenstein, L. (1953). Recherches philosophiques (Editions Gallimard).
Wright, S. (1942). Statistical genetics and evolution. Bull. Amer. Math. Soc. 48, 223–246.
Wright, S. (1951). The Genetical Structure of Populations. Annals of Eugenics 15, 323–354.
Zerjal, T., Xue, Y., Bertorelle, G., Wells, R.S., Bao, W., Zhu, S., Qamar, R., Ayub, Q., Mohyuddin, A., Fu, S., et al. (2003). The genetic legacy of the Mongols. The American Journal of Human Genetics 72, 717–721.
Zuidema, W., and Boer, B. de (2009). The evolution of combinatorial phonology. Journal of Phonetics 37, 125–144.
APPENDIX: Supplementary information on the Approximate Bayesian Computation procedures
1. Linguistic model
For models assuming a borrowing process between variety 1 and variety 2
(Figure S1a), each cognate was borrowed (i.e. it adopted the identifier of the other
variety) with probability δL. If we assumed an admixture event between varieties 0
and 2 (Figure S1b), a new variety 1 was created and each cognate of variety 1 was
drawn from variety 2 with probability rL and from variety 0 with probability 1 – rL.
The branches evolved independently. We assumed a constant cognate mutation rate μL
across branches and through time.
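
The borrowing and admixture steps described above can be sketched as follows. This is a minimal illustration and not the PopLingSim implementation: the function names, the representation of a variety as a list of cognate identifiers (one per meaning), and the direction of borrowing passed as arguments are all assumptions made for the example.

```python
import random

def borrow(recipient, donor, delta, rng):
    """Each cognate of `recipient` adopts the donor's identifier with probability delta."""
    return [d if rng.random() < delta else r for r, d in zip(recipient, donor)]

def admix(variety0, variety2, r, rng):
    """Build a new variety: each cognate comes from variety 2 with probability r,
    and from variety 0 with probability 1 - r."""
    return [c2 if rng.random() < r else c0 for c0, c2 in zip(variety0, variety2)]

# Hypothetical toy data: one cognate identifier per meaning.
rng = random.Random(0)
variety0 = [1, 2, 3, 4, 5]
variety2 = [6, 7, 8, 9, 10]
variety1 = admix(variety0, variety2, r=0.3, rng=rng)
variety1_after_contact = borrow(variety1, variety2, delta=0.05, rng=rng)
```

Both steps act on each cognate independently, which matches the per-cognate probabilities δL and rL given in the text.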
2. Prior distributions for the linguistic model parameters
We simulated datasets of 185 cognates each. For each simulation, the mean
mutation rate μL was drawn from U[0, 10⁻²]. This is consistent with previous
estimates (Pagel et al., 2007b) of a mean cognate mutation rate per generation
between 6.1 × 10⁻³ and 9.15 × 10⁻³ (for generation times of 20 and 30 years,
respectively). The mutation rate μL,i of each cognate i was then drawn independently
from a beta distribution with mean μL and parameter β = 1, which we implemented in our
simulation software PopLingSim as a ratio of gamma variates drawn using the C++
library <random>. Indeed, if X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent
variables, then X/(X + Y) ~ Beta(α, β). The borrowing rate δL was drawn from
U[0, 0.1], i.e. at most 10% of cognates could be borrowed at each linguistic
generation, a value that already represents a massive amount of linguistic exchange
in a single generation. The admixture rate rL was drawn from U[0, 1]. The split times
t0 and t1 were drawn from U[1, 1000], with the constraint t0 > t1. The upper limit of
the prior for these split times was thus roughly twice the previous estimates in
these language families (Pagel et al., 2013).
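
The gamma-ratio construction of the per-cognate rates can be illustrated with a short Python sketch (the original software is written in C++; this is not its code). It assumes that "mean μL and parameter β = 1" means a Beta(α, 1) distribution, so that α = μL / (1 − μL) follows from the beta mean α / (α + β). The function names are hypothetical.

```python
import random

def beta_from_gammas(alpha, beta, rng):
    """Draw from Beta(alpha, beta) as X / (X + Y) with independent gamma variates."""
    x = rng.gammavariate(alpha, 1.0)  # X ~ Gamma(alpha, theta = 1)
    y = rng.gammavariate(beta, 1.0)   # Y ~ Gamma(beta, theta = 1)
    return x / (x + y)

def draw_cognate_rates(n_cognates=185, upper=1e-2, rng=None):
    """Draw the mean rate from U[0, upper], then one rate per cognate."""
    if rng is None:
        rng = random.Random()
    mu_mean = rng.uniform(0.0, upper)
    while mu_mean == 0.0:  # the gamma shape parameter must be strictly positive
        mu_mean = rng.uniform(0.0, upper)
    # Assumption: beta = 1, so mean = alpha / (alpha + 1) gives alpha below.
    alpha = mu_mean / (1.0 - mu_mean)
    rates = [beta_from_gammas(alpha, 1.0, rng) for _ in range(n_cognates)]
    return mu_mean, rates

mu_mean, rates = draw_cognate_rates(rng=random.Random(1))
```

With α this small the distribution is strongly skewed: most cognates receive a rate close to zero and a few receive a much larger one, so the mean across 185 cognates in a single draw can differ noticeably from μL.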
3. Genetic model
Genetic data were simulated using FastSimCoal 2.5.1 (Excoffier and Foll, 2011;
Excoffier et al., 2013). Microsatellite data were simulated assuming a generalized
stepwise mutation model with an infinite number of potential alleles (Estoup et al.,
2002) and a probability of a mutation of more than one step p = 0.22, a value
commonly considered as realistic in the literature (Estoup et al., 2002). We assumed a
pure split process, a split followed by migration, or a split followed by admixture
(Figure S2). The three populations Pop0, Pop1 and Pop2 had effective sizes N0, N1, N2
(Figure S2).
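
A single mutation event under a generalized stepwise model can be sketched as below. This is only an illustration under stated assumptions, not the FastSimCoal code: the step size is taken to be geometric (one repeat unit, extended by a further unit with probability p each time), and gains and losses of repeats are taken to be equally likely. The function names are hypothetical.

```python
import random

def gsm_step(p, rng):
    """Signed change in repeat number for one mutation event.

    The step size starts at 1 and grows by one unit with probability p at each
    trial, so a step of more than one unit occurs with probability p.
    The sign is assumed symmetric (gain or loss with probability 1/2).
    """
    size = 1
    while rng.random() < p:
        size += 1
    return size if rng.random() < 0.5 else -size

def mutate_allele(repeats, p, rng):
    """Apply one mutation to an allele coded by its number of repeats.

    No bound is imposed on the result, in line with an infinite number of
    potential alleles.
    """
    return repeats + gsm_step(p, rng)

rng = random.Random(42)
new_allele = mutate_allele(12, p=0.22, rng=rng)
```

Setting p = 0 recovers the strict stepwise model, in which every mutation changes the allele by exactly one repeat.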
If we assumed a migration process between population 1 and population 2
(Figure S2a), each individual migrated with probability δG. Populations 0 and 1 split
at time t1 from ancestral population 3, of effective size N3. Populations 2 and 3 split at
time t0 from ancestral population 4, of effective size N4. For the model assuming an
admixture event between populations 0 and 2 (Figure S2b), a new population 1
appeared at time t1, with each founding individual drawn from population 0 with
probability 1 – rG and from population 2 with probability rG. In that case,
populations 0 and 2 split from the ancestral population 3 at time t0, with associated
effective size N3.
4. Prior distributions for the genetic model parameters
The mean mutation rate μG of the microsatellite loci was drawn from U[10⁻⁴, 10⁻³]
(Weber and Wong, 1993). The mutation rates μG,i of each locus i were drawn
independently from a beta distribution with mean μG and parameter β = 1. We assumed
that all markers were unlinked. Population effective sizes (N0, N1, N2, N3, N4) were
each drawn from U[100, 100000]. Migration rates δG were drawn from U[0, 0.1], and
admixture rates rG from U[0, 1]. The split times (t0, t1) were drawn from U[1, 1000]
with the constraint t0 > t1.
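The prior draws listed above can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: the function and key names are hypothetical, effective sizes are drawn as continuous values, and the constraint t0 > t1 is enforced by redrawing both times until it holds, which the text does not specify.

```python
import random

def draw_genetic_priors(rng=random):
    """Draw one parameter set from the uniform priors of the genetic model."""
    params = {
        "mu_G": rng.uniform(1e-4, 1e-3),   # mean microsatellite mutation rate
        "delta_G": rng.uniform(0.0, 0.1),  # migration rate
        "r_G": rng.uniform(0.0, 1.0),      # admixture rate
    }
    for name in ("N0", "N1", "N2", "N3", "N4"):
        params[name] = rng.uniform(100, 100000)  # effective sizes
    # Assumption: enforce t0 > t1 by redrawing both times until it holds.
    while True:
        t0, t1 = rng.uniform(1, 1000), rng.uniform(1, 1000)
        if t0 > t1:
            break
    params["t0"], params["t1"] = t0, t1
    return params

rng = random.Random(42)
draws = [draw_genetic_priors(rng) for _ in range(1000)]
```

Redrawing both times until the constraint holds gives the same joint distribution as drawing two independent uniform values and assigning the larger one to t0.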
5. Summary Statistics
For the genetic data, we used the ArlSumStats software (Arlequin suite v.3.5.2.2,
Excoffier and Lischer, 2010) to compute a large set of standard diversity indices
available for microsatellite data. For each population, we computed the means and
standard deviations across the 26 loci of the number of alleles K, the expected gene
diversity Ĥ (Nei, 1977), the difference between the maximum and the minimum
number of repeats R, and the G-W index (Garza and Williamson, 2001). We also
estimated all pairwise FST values between populations (Weir and Cockerham, 1984) and
the average pairwise squared distance between alleles (δμ)² (Goldstein et al., 1995).
Altogether, this provided 42 population genetics statistics.
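To make the per-locus indices concrete, the sketch below computes K, Ĥ, R and the G-W index for a single locus. It assumes the usual textbook definitions, namely the unbiased gene diversity n/(n − 1)·(1 − Σp²) and a G-W index equal to K/(R + 1); ArlSumStats may differ in details, and the function name and the allele sizes are hypothetical.

```python
from collections import Counter

def locus_statistics(repeat_counts):
    """Per-locus indices from a list of allele sizes (in repeat units)."""
    n = len(repeat_counts)
    counts = Counter(repeat_counts)
    k = len(counts)                                  # number of alleles K
    r = max(repeat_counts) - min(repeat_counts)      # allelic range R
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    h = n / (n - 1) * (1.0 - sum_sq_freq)            # expected gene diversity
    gw = k / (r + 1)                                 # G-W index
    return {"K": k, "H": h, "R": r, "GW": gw}

# Hypothetical sample of six gene copies at one locus.
stats = locus_statistics([10, 10, 11, 13, 13, 13])
```

The means and standard deviations reported in the text would then be taken over the 26 loci of each population.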
While these genetic summary statistics are classically used to describe genetic
diversity, no consensus statistics are available to describe linguistic diversity in
a set of cognates. We constructed six linguistic statistics to explore cognate
variability in the whole data set and between all pairs of linguistic varieties
(Figure S8, see code in Repository): the mean number of cognates per meaning C, the
variance of the number of cognates per meaning V(C), the range of the number of
cognates per meaning R(C), the number of meanings with only one cognate across the
linguistic varieties Cs, the number of meanings with a different cognate in each
linguistic variety Cd, and the number of pairwise differences between linguistic
varieties Di-j.
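The six statistics can be sketched on a toy example. The matrix below is hypothetical: it was chosen so that its values match the worked example of Figure S8 (three varieties, six meanings), and it assumes that each variety carries exactly one cognate per meaning. Note that the value V(C) = 2.83 given in Figure S8 corresponds to the sum of squared deviations from the mean, which is what this sketch reproduces. The function name is an assumption and this is not the code from the Repository.

```python
from itertools import combinations

# Hypothetical toy data: rows are linguistic varieties, columns are meanings,
# and each letter is a cognate class. Chosen to match the totals of Figure S8.
varieties = [
    ["a", "a", "a", "a", "a", "a"],
    ["a", "b", "a", "b", "b", "b"],
    ["a", "c", "b", "b", "b", "c"],
]

def linguistic_statistics(varieties):
    meanings = list(zip(*varieties))               # one tuple per meaning
    n_cognates = [len(set(m)) for m in meanings]   # cognates per meaning
    mean_c = sum(n_cognates) / len(n_cognates)
    # Sum of squared deviations, as in the worked value of Figure S8.
    v_c = sum((mean_c - c) ** 2 for c in n_cognates)
    r_c = max(n_cognates) - min(n_cognates)
    c_s = sum(c == 1 for c in n_cognates)               # shared by all varieties
    c_d = sum(c == len(varieties) for c in n_cognates)  # all varieties differ
    d = {
        (i + 1, j + 1): sum(x != y for x, y in zip(varieties[i], varieties[j]))
        for i, j in combinations(range(len(varieties)), 2)
    }
    return {"C": mean_c, "V(C)": v_c, "R(C)": r_c, "Cs": c_s, "Cd": c_d, "D": d}

stats = linguistic_statistics(varieties)
```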
Prior checking was performed by verifying that the simulated data sets were close to
the real data set, using direct inspection of a PCA and a goodness-of-fit test
(function gfit of the R package abc; Csilléry et al., 2012).
6. Scenario selection using random forest (RF)
For each triplet, we conducted an ABC analysis to determine the best historical
scenario for the genetic and the linguistic data, respectively. We generated a
reference table with 10,000 simulated datasets for each linguistic and each genetic
scenario, and for each triplet separately. This low number of simulations is made
possible by the random forest algorithm, which needs far fewer simulations than other
classical model-selection methods in ABC (Pudlo et al., 2016). In the UZA case, we
thus simulated 100,000 datasets per triplet for the five genetic and the five
linguistic scenarios. In the TJY case, we simulated 40,000 datasets per triplet for
the two genetic and the two linguistic scenarios. Over the 72 triplets, we therefore
conducted 7,200,000 simulations for the UZA case and 2,880,000 simulations for the
TJY case.
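The simulation counts above follow from simple multiplication, which the short check below spells out. The constant and function names are hypothetical, and the factor of two stands for the genetic and the linguistic reference tables.

```python
SIMS_PER_SCENARIO = 10_000
N_TRIPLETS = 72

def scenario_selection_budget(n_scenarios, n_data_types=2):
    """Simulations per triplet and in total for one case study."""
    per_triplet = SIMS_PER_SCENARIO * n_scenarios * n_data_types
    return per_triplet, per_triplet * N_TRIPLETS

uza = scenario_selection_budget(n_scenarios=5)  # (100000, 7200000)
tjy = scenario_selection_budget(n_scenarios=2)  # (40000, 2880000)
```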
For each triplet separately, we selected the most likely scenario using a random
forest (RF) decision algorithm implemented in the R package abcrf (Pudlo et al.,
2016) (script in Repository). This algorithm builds a forest of decision trees that
relate scenario indices to summary statistics. We graphically checked the
convergence of the error rates with 500 trees (Pudlo et al., 2016) in both case
studies conducted here (results not shown). For each case (UZA or TJY), we produced
72 decisions for the genetic data and 72 decisions for the linguistic data,
corresponding to the 72 triplets of populations analysed.
7. Parameter estimation using neural networks (NN)
After selecting the most likely scenario, we estimated the posterior distributions
of its constitutive parameters. We generated a reference table with 1,000,000
simulated datasets for each selected linguistic scenario and each selected genetic
scenario. The best genetic and linguistic scenarios for the UZA case required
2,000,000 simulations per triplet, and the best genetic and linguistic scenarios for
the TJY case required 3,000,000 simulations per triplet. Repeated over the 72
analyses, this yielded a total of 144,000,000 simulations for the UZA case and
216,000,000 simulations for the TJY case.
For each triplet, we obtained a posterior distribution for each parameter
using the neuralnet option of the R package abc, which implements a series of neural
networks (NN) with one hidden layer (Blum and François, 2010). As previously, we
performed a cross-validation analysis using an out-of-bag approach over 100
pseudo-observed datasets (Csilléry et al., 2012). We verified that the NN method
estimated the parameters of the models better than the linear, loclinear, and ridge
methods.
We first accepted the 1% of the simulations closest to the real data set, based
on the Euclidean distances between the observed and the simulated summary
statistics. A logistic transformation was then used to constrain the intervals
of the estimated parameters, before using the NN itself. The sizes of the hidden
layer were four and seven neurons for the linguistic and genetic parameter
estimations, respectively. These sizes were set according to the assumed
dimensionality of the problem, which is necessarily lower than the number of
parameters. No rule sets this layer size (Csilléry et al., 2012), but a risk of
over-fitting appears when it is large. To avoid this issue, we empirically tested
several numbers of neurons and retained the one giving the best estimation without
over-fitting. We finally pooled the parameter distributions of the triplets
supporting the most probable scenario (see Figures S12-S16 and Tables S2-S6).
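The first two steps of this procedure, the 1% rejection step and the logistic transformation, can be sketched as follows. This is a minimal Python illustration on a hypothetical one-parameter toy model, not the R code used in the analysis: the function names are assumptions, the sketch omits any rescaling of the summary statistics, and it does not include the neural-network regression itself.

```python
import math
import random

def reject(simulations, observed, tolerance=0.01):
    """Keep the fraction `tolerance` of simulations closest to the data.

    `simulations` is a list of (parameter, summary_statistics) pairs.
    Distances are plain Euclidean distances between statistic vectors.
    """
    ranked = sorted(simulations, key=lambda sim: math.dist(sim[1], observed))
    n_keep = max(1, int(round(tolerance * len(ranked))))
    return ranked[:n_keep]

def logit(value, lower, upper):
    """Map a parameter from (lower, upper) to the whole real line."""
    scaled = (value - lower) / (upper - lower)
    return math.log(scaled / (1.0 - scaled))

def inverse_logit(z, lower, upper):
    """Map a real number back into (lower, upper)."""
    return lower + (upper - lower) / (1.0 + math.exp(-z))

# Toy model (hypothetical): one parameter r in U[0, 1], two noisy statistics.
rng = random.Random(3)
simulations = []
for _ in range(10000):
    r = rng.uniform(0.0, 1.0)
    simulations.append((r, (r + rng.gauss(0, 0.05), 2 * r + rng.gauss(0, 0.05))))

accepted = reject(simulations, observed=(0.4, 0.8))
transformed = [logit(r, 0.0, 1.0) for r, _ in accepted]
back = [inverse_logit(z, 0.0, 1.0) for z in transformed]
```

Any adjustment performed on the transformed scale is mapped back with the inverse transformation, so the estimated values cannot leave the bounds of the prior.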
8. Cross-validation and posterior probabilities in the UZA case
8.1. Cross-validation
We produced simulations for the five genetic and five linguistic scenarios for
the UZA case. Our simulated data were congruent with the observed data, based on
the direct checking of the PCA performed on the simulated and the observed summary
statistics (Figure S9).
We then performed an a priori cross-validation of the RF scenario selection
procedure. We used the out-of-bag approach implemented in the function abcrf of the
R package abcrf to gauge the extent to which the method is able to choose the correct
scenario (Tables S7-S8). In short, this method consists in considering each
simulation in turn as pseudo-observed data and testing it with a separate RF
analysis, using all other simulated data to build the forest.
The linguistic RF performed a priori better than the genetic RF. This was not
surprising given the low genetic differentiation of the populations and the low
number of genetic markers. Moreover, RF performed better when scenario A, B, C or D
was the true scenario than when scenario E was the true scenario, in the
linguistic case as well as in the genetic case.
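The error rates discussed here can be recomputed from a confusion matrix. The sketch below uses the mean linguistic confusion matrix of the UZA case as reported in Table S7; the function names are hypothetical. The per-scenario error is the fraction of pseudo-observed datasets assigned to a wrong scenario, and the second function gives, for each other true scenario, the fraction of datasets wrongly assigned to scenario E.

```python
# Mean out-of-bag confusion matrix for the linguistic UZA case (Table S7).
# Rows: true scenario; columns: estimated scenario, both in the order A-E.
scenarios = ["A", "B", "C", "D", "E"]
confusion = [
    [6735, 14, 657, 1148, 1447],
    [41, 8500, 782, 259, 418],
    [644, 1161, 6771, 13, 1411],
    [739, 272, 35, 8527, 427],
    [1939, 403, 1912, 406, 5340],
]

def error_rates(matrix):
    """Fraction of misclassified pseudo-observed datasets per true scenario."""
    return [1.0 - row[i] / sum(row) for i, row in enumerate(matrix)]

def false_selection_rates(matrix, column):
    """For each other true scenario, fraction of datasets assigned to `column`."""
    return [row[column] / sum(row) for i, row in enumerate(matrix) if i != column]

errors = dict(zip(scenarios, (round(e, 2) for e in error_rates(confusion))))
chosen_e_wrongly = false_selection_rates(confusion, scenarios.index("E"))
```

The rounded errors match the Error column of Table S7, and scenario E is selected for fewer than 15% of the datasets simulated under each of the other scenarios.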
The pattern of scenario choices that we observed using real data across the 72
triplets for the UZA case indicated a higher support for scenario E (Figure I.4).
The cross-validation results above show that scenario E is difficult to distinguish
a priori from the others, but also that it is rather unlikely to be chosen when it is
not the true scenario. This reinforced our confidence in eventually accepting
scenario E based on our real-data analysis, both linguistically and genetically.
8.2. Posterior probabilities
We then performed an a posteriori estimation of the probability that each scenario
is selected when the model underlying the real dataset is known. We estimated this
posterior probability with the post.prob value returned by the function
predict.abcrf of the R package abcrf, for each triplet supporting the most probable
scenario, and we report the distributions of these posterior probabilities in
Table S11 and in Figure S17.
For the triplets supporting scenario E in the UZA case, the mean posterior
probability was 0.69 (SD = 0.12 over 55 triplets) for the linguistic data and 0.51
(SD = 0.04 over 36 triplets) for the genetic data. We can thus be confident in
assuming that scenario E was better than the four other scenarios for the UZA case
and for our data set.
9. Cross-validation and posterior probabilities in the TJY case
9.1. Cross-validation
We produced the simulations for the two genetic and two linguistic scenarios for
the TJY case. Our simulated data were congruent with the observed data, based on the
direct checking of the PCA performed on the simulated and the observed summary
statistics (Figure S10).
We performed an a priori cross-validation procedure for the scenario selection
similar to that of the UZA case described above. In general, RF performed better in
the TJY case than in the UZA case (Tables S9-S10). In particular, the linguistic RF
for the TJY case performed very well.
9.2. Posterior probabilities
We then performed an a posteriori estimation of the probability of each of the
two scenarios to be selected, when the model underlying the real dataset was known
(Table S11 and Figure S17). The mean posterior probability in the linguistic case was
0.52 (SD = 0.05 over 37 triplets) for the triplets supporting scenario 1, and 0.52
(SD = 0.05 over 35 triplets) for the triplets supporting scenario 2. These values are
consistent with the impossibility of distinguishing between scenarios with and
without borrowing using our data, which is itself consistent with the low a
posteriori estimate of the borrowing rate, δ̂L = 0.4%. Indeed, this result suggests
that, when considering a scenario including the possibility of borrowings, our data
would be best explained by a scenario with only a very low level of such borrowings,
hence explaining our difficulty in distinguishing a priori between the two linguistic
scenarios. The mean posterior probability in the genetic case was 0.91 (SD = 0.04
over the 72 triplets supporting this scenario) for scenario 1. We were thus confident
in rejecting scenario 2 for the genetic case, since all 72 triplets supported
scenario 1 (Figure I.4).
Repository
The scripts that we developed are freely available at
https://github.com/ValentinThouzeau/PopLingSim
Figure S1 – Models of linguistic evolution. (a) The ancestral linguistic variety splits into two varieties. One of these varieties splits again, with possible subsequent continuous borrowing. (b) The ancestral variety splits into two varieties. A third variety is subsequently generated by an admixture event between the two source varieties.
Figure S2 – Models of genetic evolution. (a) The ancestral population splits into two populations. One of these populations splits again, with possible subsequent migration. (b) The ancestral population splits into two populations. A third population is subsequently generated by an admixture event between the two source populations.
Figure S3 – Two competing scenarios of linguistic and genetic origin of the Yagnob-speaking population (TJY). These scenarios were tested independently for the linguistic history and the genetic history. Abbreviations: Tc = Turkic-speaking population. TJY = TJY population from the Yagnob valley in Tajikistan. I-I = Indo-Iranian-speaking population.
Figure S4 – Pairwise FST matrix (a) and linguistic distance matrix (b). See the grey scale to the right of each plot (different scale for each plot).
Figure S5 – Neighbour-joining trees based on the pairwise (δμ)² matrix, with 11 Turkic-speaking populations (in Yellow/Light Grey) and 10 Indo-Iranian-speaking populations (in Blue/Dark Grey). The values at each node correspond to the number of bootstrap trees containing this node among 1000 permutations. The red arrows indicate the UZA and the TJY populations, specifically investigated in this paper using Approximate Bayesian Computation.
Figure S6 – Principal Component Analysis of the Manhattan distances computed using the 185 cognates from the real linguistic dataset. The red arrows indicate the UZA and the TJY populations.
[Plot: PCA plane, Axis 1 versus Axis 2; points are labelled with the population codes listed in Table S1.]
Figure S7 – Principal Component Analysis of the pairwise FST distance matrix computed using the 26 microsatellites from the real genetic dataset. The red arrows indicate the UZA and the TJY populations.
C: Mean number of cognates per meaning: $C = \frac{1+3+2+2+2+3}{6} = 2.17$
V(C): Variance of the number of cognates per meaning: $V(C) = (2.17-1)^2 + (2.17-3)^2 + (2.17-2)^2 + (2.17-2)^2 + (2.17-2)^2 + (2.17-3)^2 = 2.83$
R(C): Range of the number of cognates per meaning: $R(C) = \max(\text{number of cognates}) - \min(\text{number of cognates}) = 3 - 1 = 2$
Cs: Number of meanings with only one cognate throughout the three linguistic varieties: $C_s = 1+0+0+0+0+0 = 1$
Cd: Number of meanings with one different cognate for each linguistic variety: $C_d = 0+1+0+0+0+1 = 2$
D1-2: Number of pairwise differences between linguistic varieties n°1 and n°2: $D_{1\text{-}2} = 0+1+0+1+1+1 = 4$
D1-3: Number of pairwise differences between linguistic varieties n°1 and n°3: $D_{1\text{-}3} = 0+1+1+1+1+1 = 5$
D2-3: Number of pairwise differences between linguistic varieties n°2 and n°3: $D_{2\text{-}3} = 0+1+1+0+0+1 = 3$
Figure S8 – An example of computation of the linguistic summary statistics, over three linguistic varieties and six meanings.
Figure S9 – PCA performed in the UZA case over one observed summary statistics set (yellow dot) and 5000 summary statistics sets of the associated simulated datasets. (a) Linguistic PCA over the 1st and the 2nd axes. (b) Linguistic PCA over the 1st and the 3rd axes. (c) Genetic PCA over the 1st and the 2nd axes. (d) Genetic PCA over the 1st and the 3rd axes.
Figure S10 – PCA performed in the TJY case over one observed summary statistics set (yellow dot) and 5000 summary statistics sets of the associated simulated datasets. (a) Linguistic PCA over the 1st and the 2nd axes. (b) Linguistic PCA over the 1st and the 3rd axes. (c) Genetic PCA over the 1st and the 2nd axes. (d) Genetic PCA over the 1st and the 3rd axes.
Figure S11 – Analysis of the TJY population history. (a) Decisions over the analysis of 72 triplets for the selection of the linguistic scenarios. (b) Priors (dotted line) and posteriors (solid line) of the parameter t1/t0 estimated from the linguistic simulations of scenario 1. (c) Priors (dotted line) and posteriors (solid line) of the parameters t1/t0 and δL estimated from the linguistic simulations of scenario 2. (d) Decisions over the analysis of 72 triplets for the selection of the genetic scenarios. (e) Priors (dotted line) and posteriors (solid line) of the parameter t1/t0 estimated from the genetic simulations of scenario 1.
Figure S12 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the UZA population (admixture model).
Figure S13 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the UZA population (admixture model).
Figure S14 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (isolation model).
Figure S15 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (non-isolation model).
Figure S16 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the TJY population (isolation model).
Figure S17 – Density of the distribution of the posterior probabilities for (a) Scenario E in the linguistic UZA case, (b) Scenario E in the genetic UZA case, (c) Scenario 1 in the linguistic TJY case, (d) Scenario 1 in the genetic TJY case, (e) Scenario 2 in the linguistic TJY case. Parameters of the distributions are detailed in Table S11.
Population                 Code   Sample   Country      Linguistic family
Kazaks (Gazli)             LKZ    25       Uzbekistan   Turkic
Kazaks (Raushan)           KAZ    49       Uzbekistan   Turkic
Kyrgyz (Ordaj)             KRA    47       Kyrgyzstan   Turkic
Kyrgyz (Akmuz)             KRB    24       Kyrgyzstan   Turkic
Kyrgyz (Kulanak)           KRL    22       Kyrgyzstan   Turkic
Kyrgyz (Barskoon)          KRT    37       Uzbekistan   Turkic
Karakalpaks (Halqabad)     OTU    45       Uzbekistan   Turkic
Karakalpaks (Shege)        KKK    45       Uzbekistan   Turkic
Uzbeks (SojMahalla)        UZA    25       Uzbekistan   Turkic
Uzbeks (Hitoj)             UZB    35       Uzbekistan   Turkic
Uzbeks (Urtoqqishloq)      UZT    25       Tajikistan   Turkic
Tajiks (Shink)             TDS    25       Uzbekistan   Indo-Iranian
Tajiks (Urmetan)           TDU    25       Uzbekistan   Indo-Iranian
Tajiks (Agakalik)          TJA    31       Uzbekistan   Indo-Iranian
Tajiks (Nimich)            TJE    25       Tajikistan   Indo-Iranian
Tajiks (Kaptarhona)        TJK    26       Tajikistan   Indo-Iranian
Tajiks (Navdi)             TJN    24       Tajikistan   Indo-Iranian
Tajiks (Rishtan)           TJR    29       Uzbekistan   Indo-Iranian
Tajiks (Nushnor)           TJT    25       Tajikistan   Indo-Iranian
Tajiks (Kamangaron)        TJU    29       Tajikistan   Indo-Iranian
Yagnobs (Dushanbe)         TJY    25       Tajikistan   Indo-Iranian
Table S1 – Information table for the 21 studied Central Asian populations.
Mean Mode Median Quantile 2.5% Quantile 97.5%
N0 30979 16862 23978 6399 87812
N1 61531 82173 63848 13608 98179
N2 42764 28382 38443 8124 95255
N4 49291 37749 46460 12803 94394
μG 2.38×10⁻⁴ 1.58×10⁻⁴ 1.97×10⁻⁴ 9.71×10⁻⁵ 6.14×10⁻⁴
t0 623 699 634 190 981
t1 256 133 214 20 710
rG 0.499 0.479 0.498 0.05 0.958
4×N0×μG 25.4 14.2 18.8 5.23 84.6
4×N1×μG 60.8 38.3 49 9.06 184
4×N2×μG 36 20.9 28.5 7.26 113
4×N4×μG 37.5 34.5 36.3 19.3 62.7
t0×μG 0.136 0.102 0.118 0.037 0.34
t1×μG 0.0566 0.0259 0.0421 0.00439 0.193
t1/t0 0.439 0.279 0.411 0.0374 0.948
t1×rG 121 39.5 84.7 4.79 450
Table S2 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E).
Mean Mode Median Quantile 2.5% Quantile 97.5%
μL 0.00501 0.00388 0.00466 0.00211 0.00946
t0 655 807 667 257 980
t1 22.7 22.5 22.2 2.31 48.7
rL 0.093 0.090 0.090 0.019 0.184
t0×μL 2.98 2.49 2.78 1.40 5.65
t1×μL 0.101 0.115 0.107 0.00946 0.180
t1/t0 0.0366 0.0379 0.0364 0.00289 0.0783
t1×rL 2.08 0.886 1.78 0.0616 5.98
Table S3 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E).
Mean Mode Median Quantile 2.5% Quantile 97.5%
N0 31782 20108 26302 7585 86447
N1 20393 10279 14958 2533 75516
N2 55987 50561 55543 12909 97031
N3 47733 30765 45011 6468 96594
N4 52161 41207 50090 14248 96019
μG 2.33×10⁻⁴ 1.60×10⁻⁴ 1.99×10⁻⁴ 9.61×10⁻⁵ 5.65×10⁻⁴
t0 743 891 778 323 987
t1 440 389 427 71.2 856
4×N0×μG 25.2 16.5 20.6 6.99 70.4
4×N1×μG 16 8.70 11.8 2.63 55.7
4×N2×μG 51.7 29.4 42.4 8.64 150
4×N3×μG 44.2 20.5 32.1 4.66 156
4×N4×μG 40.8 37.6 39.5 20.2 68.3
t0×μG 0.168 0.125 0.148 0.054 0.397
t1×μG 0.097 0.063 0.081 0.014 0.276
t1/t0 0.601 0.772 0.625 0.112 0.971
Table S4 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of isolation of the TJY population (scenario 1).
Mean Mode Median Quantile 2.5% Quantile 97.5%
μL 0.0058 0.0039 0.0056 0.0019 0.0107
t0 627 860 668 85.7 990
t1 80.6 51.8 68.8 21 211
t0×μL 3.78 2.41 2.95 1.65 9.92
t1×μL 0.402 0.364 0.390 0.272 0.586
t1/t0 0.129 0.123 0.123 0.0178 0.300
Table S5 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of isolation of the TJY population (scenario 1).
Mean Mode Median Quantile 2.5% Quantile 97.5%
μL 0.00730 0.00838 0.00758 0.00299 0.0107
t0 759 909 806 319 992
t1 319 130 263 7 873
δL 0.00695 0.00465 0.00574 0.000963 0.0197
t0×μL 5.55 3.82 5.70 1.95 9.72
t1×μL 2.54 0.939 1.96 0.102 7.67
t1/t0 0.408 0.149 0.355 0.00261 0.97
t1×δL 3.02 0.731 1.84 0.00958 13.9
Table S6 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of no-isolation of the TJY population (scenario 2).
                Estimated Scenario
True Scenario   A            B             C            D             E            Error
A               6735 [795]   14 [2]        657 [77]     1148 [134]    1447 [171]   0.33
B               41 [5]       8500 [1003]   782 [92]     259 [30]      418 [49]     0.15
C               644 [76]     1161 [138]    6771 [797]   13 [2]        1411 [166]   0.32
D               739 [89]     272 [33]      35 [4]       8527 [1002]   427 [50]     0.15
E               1939 [229]   403 [48]      1912 [225]   406 [47]      5340 [630]   0.47
Table S7 – Linguistic cross-validation of the UZA case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.
                Estimated Scenario
True Scenario   A            B            C            D            E            Error
A               6341 [749]   336 [39]     1137 [134]   264 [30]     1923 [227]   0.37
B               267 [30]     6037 [717]   834 [94]     2471 [294]   391 [43]     0.40
C               1195 [138]   1591 [186]   4492 [535]   899 [102]    1823 [218]   0.55
D               271 [31]     4071 [478]   524 [61]     4807 [569]   327 [39]     0.52
E               2989 [345]   986 [113]    2238 [268]   606 [65]     3181 [387]   0.68
Table S8 – Genetic cross-validation of the UZA case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.
                Estimated Scenario
True Scenario   1             2             Error
1               8770 [1034]   1230 [144]    0.13
2               994 [118]     9006 [1061]   0.10
Table S9 – Linguistic cross-validation of the TJY case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.
                Estimated Scenario
True Scenario   1             2             Error
1               7313 [868]    2687 [311]    0.27
2               1271 [144]    8729 [1035]   0.13
Table S10 – Genetic cross-validation of the TJY case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.
Min. 1st Qu. Median Mean 3rd Qu. Max.
UZA Linguistic, Model E 0.3716 0.6696 0.7267 0.6949 0.7771 0.8344
UZA Genetic, Model E 0.4099 0.4916 0.5084 0.5096 0.5340 0.5688
TJY Linguistic, Model 1 0.4037 0.4941 0.5148 0.5188 0.5524 0.6273
TJY Linguistic, Model 2 0.3980 0.4827 0.5146 0.5167 0.5467 0.6168
TJY Genetic, Model 1 0.7729 0.8906 0.9094 0.9059 0.9255 0.9617
Table S11 – Quantiles of the distributions of the posterior probabilities computed over the triplets supporting the most probable scenario. These distributions are shown in Figure S17.