
MUSEUM NATIONAL D’HISTOIRE NATURELLE

Ecole Doctorale Sciences de la Nature et de l’Homme – ED 227

Year 2017

To obtain the degree of

DOCTOR OF THE MUSEUM NATIONAL D’HISTOIRE NATURELLE

Specialty: GENETICS AND LINGUISTICS OF HUMAN POPULATIONS

Presented and publicly defended by

Valentin Thouzeau

on 28 November 2017

INFERRING THE HISTORY OF HUMAN POPULATIONS FROM GENETIC AND LINGUISTIC DIVERSITY

Supervised by Mr Frédéric Austerlitz, Research Director, and Mr Paul Verdu, Research Associate

JURY:

Ms Emmanuelle Porcher, Professor, MNHN, President of the jury

Mr Frédéric Austerlitz, Research Director, CNRS, Thesis Supervisor

Mr Paul Verdu, Research Associate, CNRS, Thesis Supervisor

Mr Arnaud Estoup, Research Director, INRA, Reviewer

Ms Anne Kandler, Senior Scientist, Max Planck Institute, Reviewer

Ms Anouk Barberousse, Professor, Université Lille 1, Examiner

Ms Mélissa Barkat-Defradas, Research Associate, CNRS, Examiner

Summary

Historical inferences are statistical methods that reconstruct past events from present-day data. Parallel inferences of genetic and linguistic histories have recently benefited from methodological advances that improve our understanding of the co-evolution of genes and languages. Nevertheless, the complex historical events that affect linguistic diversity remain very little studied. This thesis focused on the articulation between genetic and linguistic inferences, drawing on an interdisciplinary practice and on the application of the statistical methods of approximate Bayesian computation. This framework made it possible to account for complex events of migration, admixture, or population size change, for genetic history as well as for linguistic history. Genetic and linguistic data from several Central Asian populations were thus analyzed, showing that the history of genetic populations can sometimes differ from the history of the linguistic varieties spoken by these populations. A framework accounting for linguistic diversity between individuals was then developed and applied to a set of Tajik speakers from Central Asia. An individual-centered interface between genetics and linguistics was then formalized on the basis of theoretical work, whose methodological possibilities were confirmed a priori by approximate Bayesian computation, opening a new field of investigation in the study of the co-evolution of genetics and linguistics. Finally, a linguistic sampling protocol applied to a set of speakers from the Cape Verde Islands was built in order to integrate theoretical work and fieldwork. Taken together, this work formalizes a population linguistics coupled with human population genetics, and provides the methodological tools needed to reconstruct the history of the co-evolution of genetics and linguistics.

Abstract

Historical inferences are statistical methods that reconstruct past events from current data. Parallel inferences of genetic and linguistic histories have recently benefited from methodological advances that allow a better understanding of the coevolution between genes and languages. Nevertheless, the complex historical events affecting linguistic diversity are still understudied. This thesis was centered on the articulation between genetic and linguistic inferences, based on an interdisciplinary practice and the application of the statistical methods of approximate Bayesian computation. This framework made it possible to take into account complex events of migration, admixture, or change in population size, for both genetic and linguistic history. Genetic and linguistic data sampled from several populations in Central Asia were analyzed, showing that the history of genetic populations can differ from the history of the linguistic varieties spoken by these populations. A framework that takes into account within-population linguistic diversity was then developed and applied to a group of Tajik speakers from Central Asia. An interface between genetics and linguistics centered on individuals was then formalized on the basis of theoretical work whose methodological possibilities were confirmed a priori by approximate Bayesian computation, opening a new field of investigation in the study of the coevolution between genetics and linguistics. Finally, a linguistic sampling protocol applied to a group of speakers from the Cape Verde Islands was built in order to integrate theoretical work and fieldwork. Taken as a whole, this work formalizes a population linguistics coupled with human population genetics, and provides the methodological tools to reconstruct the history of genetic and linguistic coevolution.

Acknowledgements

One might think that a doctoral thesis is a rather solitary undertaking. In my case, it is only with the help of those I have had the good fortune to work alongside during these three years that I was able to complete this work. These few lines cannot suffice to express all my gratitude to them.

First of all, I wish to extend my deepest thanks to my thesis supervisors, Frédéric Austerlitz and Paul Verdu, for their day-to-day availability, their trust in the face of my eccentricities, and their support in times of doubt. Thank you to Frédéric for passing on to me a demand for theoretical rigour, and thank you to Paul for passing on to me the passion for fieldwork!

I would like to warmly thank Anne Kandler, Anouk Barberousse, Arnaud Estoup, Emmanuelle Porcher, and Mélissa Barkat-Defradas for agreeing to take on the task of evaluating a thesis with multiple scientific influences. Thanks also to Etienne Danchin, Michael Blum, and Sylvie Le Bomin for agreeing to sit on my thesis committee and for accompanying me through my first hesitant steps.

I thank Philippe Endicott for giving me the opportunity to collaborate with scientists on the other side of the world, as well as Russell Gray and Quentin Atkinson for their friendly welcome at the University of Auckland. I thank Ethan Jewett, Marlyse Baptista and Sergio da Costa for their help and energy in the field during our mission to Cape Verde, as well as all the participants who agreed to go along with our curious requests.

Thank you to all my colleagues at the Musée de l’Homme for all the diversity they bring, scientific and otherwise. Thank you to Bérénice Alard for her constantly renewed energy, Christophe Costes for his boundless curiosity, Goki Ly for the yo-yo lessons, and Nina Marchi for our very rich debates. Thank you to Bruno Toupance, Céline Bon, Evelyne Heyer, Flora Jay, Laure Ségurel, Marie-Françoise Rombi, Philippe Mennecier, Pierre Darlu, Priscille Touraille, Raphaëlle Chaix, Romain Laurent, and Samuel Pavard for their invaluable help and the conviviality they bring every day. Thank you to Marie-Claude Kergoat for sharing my unhealthy craving for "true knowledge". Thank you to Franz Manni for his valuable advice on matters both scientific and sartorial. Thank you to Antonin Affholder for his seriousness, curiosity, and energy in carrying out his internship, which I had the pleasure of co-supervising. I must also thank the Prehistorians, Archaeologists, Ethnographers, Anthropologists, Ethnomusicologists, Primatologists, Linguists, Jurists, and Those-Who-Do-Not-Fit-Into-Boxes, those researchers who let me see the richness of the world in so many different ways. Thanks also to Florence Loiseau and Taouès Lahrem for their kind administrative support.

It is with emotion that I extend very special thanks to Frank Alvarez-Peyrere, for his teaching in the epistemology of interdisciplinarity, for his ever-attentive listening, and for giving me a glimpse of the full richness of the hidden dimensions of the human being.

I very warmly thank Mathieu Tiret, a researcher whose quickness of mind will never cease to impress me, a colleague always ready to offer me unconditional help, a long-standing flatmate of limitless tolerance, but, above all, an irreplaceable friend.

I thank my parents, my family, and my friends from Savoie and elsewhere, whose constant support gave me all the strength I needed to pursue my work. Finally, thank you to Manon Potin, without whom no truth would be worth seeking.

Contents

Summary
Abstract
Acknowledgements
Contents
List of figures
List of tables

Foreword
Communication, linguistics and epistemology
Discourse, utterance, meaning
Analytical perspective and anthropological perspective
Conventions of disciplinary languages
Implicit theoretical presuppositions
Towards interdisciplinary communication

Introduction
1. Construction and observation of the object
1.1. Observation of genetic diversity
1.2. Observation of linguistic diversity
1.3. Philological positioning
2. Description of diversities and historical inferences
2.1. Description of genetic diversity
2.2. Description of linguistic diversity
2.3. Inference of the history underlying genetic diversity
2.4. Inference of the history underlying linguistic diversity
3. Studies of genetic and linguistic co-evolution
4. Towards a joint analytical framework for genetic and linguistic diversities
4.1. Inference of the history of genetic populations and linguistic varieties
4.2. Exploration of a "population linguistics"
4.3. Construction of an interface between population linguistics and population genetics
4.4. Sampling and analysis of linguistic data from the Cape Verde Islands

Chapter I – Genetic and linguistic histories in Central Asia inferred using Approximate Bayesian Computation
Abstract
1. Introduction
2. Material
2.1. Genetic data
2.2. Linguistic data
3. Methods
3.1. Genetic and Linguistic Dissimilarities among Populations
3.2. Approximate Bayesian Computation (ABC)
4. Results
4.1. Central Asian linguistic and genetic structures
4.2. Model selection and parameter estimations for the UZA population
4.3. Model selection and parameters estimation for the TJY population
5. Discussion
5.1. Two different linguistic and genetic historical admixture for the Soj-Mahalla Uzbek-speakers
5.2. Stronger genetic than linguistic isolation in the Tadjikistan Yagnob speakers
5.3. Conclusions and Perspectives

Chapter II – Inferring linguistic transmission between generations at the scale of individuals
1. Introduction
2. Models
2.1. Production of utterances
2.2. Four models of acquisition of a new language
2.3. Historical scenario
3. Materials
4. Analyses
4.1. Simulations
4.2. Summary statistics
4.3. Model selection
4.4. Parameters estimation
5. Results
5.1. Model selection
5.2. Parameter estimation
6. Discussion

Chapter III – Building a formalised interface between population genetics and population linguistics
Introduction
Part 1 – Formalising genetic and linguistic coevolution
1. A formalisation of biological evolution
2. A formalisation of linguistic evolution
3. Modalities of linguistic communications at the scale of the individuals
4. Coupling the reproductive and the communication networks
Part 2 – Inferring genetic and linguistic histories
1. Modelling
1.1. Sampling
1.2. Genetic model
1.3. Linguistic model
1.4. Parameters
1.5. Summary statistics
1.6. Simulations and model selection
2. Results
2.1. Should the individuals be considered as copiers, probabilistic copiers, or Bayesian learners?
2.2. Are the mutation rates different between the linguistic classes?
2.3. Are the sizes of the genetic and the linguistic populations different?
2.4. Do the sampled individuals belong to genetically and/or linguistically differentiated populations?
2.5. What is the tree topology of three populations?
Discussion

Chapter IV – Sampling and describing linguistic data from Cape Verdean Kriolu
Introduction
Data sampling
Descriptive analyses
Discussion

Conclusion
Bibliography
APPENDIX: Supplementary informations on the Approximate Bayesian Computation procedures
1. Linguistic Model
2. Prior distributions for the linguistic model parameters
3. Genetic model
4. Prior distributions for the genetic model parameters
5. Summary Statistics
6. Scenarios selection using random forest (RF)
7. Parameters estimation using neural networks (NN)
8. Cross-validation and posterior probabilities in the UZA case
9. Cross-validation and posterior probabilities in the TJY case

List of figures

Figure I.1 – Geographical distribution of the 21 populations and linguistic varieties under study
Figure I.2 – Five competing scenarios for the origin of the UZA population
Figure I.3 – Neighbour-joining trees based on (a) the linguistic distances matrix and (b) the pairwise FST matrix
Figure I.4 – ABC Analyses for the UZA population
Figure II.1 – Four models of linguistic transmission between generations
Figure II.2 – Historical scenario
Figure II.3 – Geographical distribution of the 10 sampled units under study
Figure II.4 – Confusion matrices from the out-of-bag cross-validation analysis of the four models
Figure III.1 – Structure of the reproduction relationship in human species
Figure III.2 – Classic representation of the setting up of the reproduction network
Figure III.3 – Alternative representation of the reproductive network
Figure III.4 – Structure of the linguistic communication relationship in human
Figure III.5 – Representation of the linguistic communication network
Figure III.6 – Representation of the setting up of the reproduction network as well as the linguistic communication network
Figure III.7 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 2
Figure III.8 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 3
Figure III.9 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 4
Figure III.10 – Description of the three models of individual grammars
Figure III.11 – Description of the three scenarios of different size of the genetic and the linguistic population
Figure III.12 – Description of the four scenarios of genetic and/or linguistic population differentiation
Figure III.13 – Description of the three scenarios of historic topologies
Figure IV.1 – Geographical distribution of the 19 sampling localities under study in Cape Verde
Figure IV.2 – MCA of the 84 individuals sampled in Cape Verde
Figure IV.3 – Representation of the weight of the 50 words in the MCA
Figure IV.4 – Neighbour-joining trees based on the linguistic distances matrix
Figure S1 – Models of linguistic evolution
Figure S2 – Models of genetic evolution
Figure S3 – Two competing scenarios of linguistic and genetic origin of the Yagnob speaking population
Figure S4 – Pairwise FST matrix (a) and linguistic distances matrix (b)
Figure S5 – Neighbour-joining trees based on the pairwise (δμ)² matrix
Figure S6 – Principal Component Analysis of the Manhattan distances
Figure S7 – Principal Component Analysis of the pairwise FST distance matrix
Figure S8 – An example of computation of the linguistic summary statistics, over three linguistic varieties and six meanings
Figure S9 – PCA performed in the UZA case over one observed summary statistics set
Figure S10 – PCA performed in the TJY case over one observed summary statistics set
Figure S11 – Analysis of the TJY population history
Figure S12 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the UZA population (admixture model)
Figure S13 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the UZA population (admixture model)
Figure S14 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (isolation model)
Figure S15 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (non-isolation model)
Figure S16 – Pooling of the parameters priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the TJY population (isolation model)
Figure S17 – Density of the distribution of the posterior probabilities

List of tables

Table II.1 – Summary of the prior distributions of the parameters for the four models
Table II.2 – Proportion of votes for the four models of linguistic evolution, and the posterior probability of the Social model
Table II.3 – Summary of the posterior distributions of the parameters, assuming a Sexual2 scenario
Table II.4 – Summary of the posterior distributions of the parameters, assuming a Social scenario
Table III.1 – Cross-validation results aiming at assessing a priori distinctions between three models of individual grammars
Table III.2 – Cross-validation results aiming at assessing a priori distinctions between two models of the mutation of the linguistic variants
Table III.3 – Cross-validation results aiming at assessing a priori distinctions between three models described in Figure III.11
Table III.4 – Cross-validation results aiming at assessing a priori distinctions between four scenarios described in Figure III.12
Table III.5 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13
Table III.6 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13
Table III.7 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13
Table IV.1 – List of meanings extracted from the Swadesh list
Table S1 – Information table for the 21 studied Central Asian populations
Table S2 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E)
Table S3 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E)
Table S4 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of isolation of the TJY population (scenario 2)
Table S5 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of isolation of the TJY population (scenario 1)
Table S6 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of no-isolation of the TJY population (scenario 2)
Table S7 – Linguistic cross-validation of the UZA case
Table S8 – Genetic cross-validation of the UZA case
Table S9 – Linguistic cross-validation of the TJY case
Table S10 – Genetic cross-validation of the TJY case
Table S11 – Quantiles of the distributions of the posterior probabilities computed over the triplets supporting the most probable scenario

Foreword

This thesis is entitled "Inferring the history of human populations from genetic and linguistic diversity". How is this title understood by a geneticist, or by a linguist? Is it possible to propose a title that removes all ambiguity of meaning for one discipline as well as for the other? As for the content of this work, must it meet all the requirements of the different disciplines, when these requirements sometimes seem highly contradictory?

These are among the questions that inevitably arise, sometimes abruptly, in the course of work at the boundary between several disciplines. A choice then has to be made: either anchor oneself in a single discipline and collaborate with other disciplines through exchanges of services, or take on these complex issues in an attempt to build a common interdisciplinary framework. The second option was chosen for this work. This foreword presents the reasons that led me to this choice, as well as the way in which it was put into practice over the course of my thesis.

The aim is to give an account of the reasoning that was built in order to consider the diversity of concepts drawn from both genetics and linguistics. The third disciplinary field drawn upon in this work is epistemology, sometimes called philosophy of science. This field makes it possible to understand how the diversity of reasoning is constructed. It also makes it possible, I hope, to exercise the greatest possible intellectual rigour. I have chosen to set out the presuppositions of the whole of this work from the outset, so that the reader may understand it as well as possible. Such intellectual disclosure leads to showing what will probably be identified as limits to the reasoning. Yet is a work of clarification anything other than shedding light on the limits of an object?

In this text I will try to clarify the problems of scientific communication in order to address the questions specific to interdisciplinarity. I thereby hope to lay the foundations of a reflection that will allow me to articulate the concepts of the different disciplines involved, and thus to respect the differences of each discipline and to engage in an effective, respectful and productive dialogue.

Communication, linguistics and epistemology

One of the first difficulties to emerge in interdisciplinary work concerns mutual understanding between disciplines (Holbrook, 2013). What makes sense to a geneticist does not necessarily make sense to a linguist, and vice versa. Effective communication between disciplines is always a process under construction and is never fully achieved.

I have chosen to address the question of interdisciplinary communication using a variety of tools proposed by different schools of linguistics and epistemology. These disciplines are particularly well suited to grasping the link between questions of knowledge and questions of communication, which lies at the heart of the problems of interdisciplinarity. They are very demanding disciplines when used to treat the problems detailed here, because they take on a recursive character: they are sciences that talk about sciences. Their discourse therefore applies, among other things, to themselves. This back-and-forth movement makes it possible to build an analysis of the problem while stepping back far enough to evaluate the tools we use to analyse it. We invite the reader to be aware of the levels that this recursion implies, by trying to understand the logic of this text with the tools it provides.

Discourse, utterance, meaning

Scientific output constructs discourses (Maingueneau, 1979) through articles, thesis manuscripts, seminars, conferences, books, lectures, talks for the general public, informal discussions between colleagues, and so on. A first difficulty of interdisciplinary work is to try to understand the diversity of the discourses of the disciplinary fields involved, and how these fields could put forward a common discourse.

For modern linguistics and the philosophy of science, the minimal meaning-bearing element within a discourse is the utterance. The following sentence, taken from the abstract of this thesis, is an example of an utterance: "Genetic and linguistic data from several Central Asian populations were thus analysed, showing that the history of genetic populations can sometimes differ from the history of the linguistic varieties spoken by these populations". How is the meaning of this utterance constructed? Two approaches, which are not a priori incompatible (Nguyên-Duy and Luckerhoff, 2006), can be considered: the analytic approach and the anthropological approach, each of which defines its own line of response.

Analytic perspective and anthropological perspective

An analytic way of understanding how the meaning of a scientific utterance is constructed is to refer to the conditions under which that utterance is true (Davidson, 1967). This way of conceiving utterances makes it possible to determine whether they are meaningless, whether two utterances have equivalent meaning, whether an utterance is necessarily true, and so on. This theoretical framework makes it possible to clarify utterances so that their meaning is as unequivocal as possible.

The desire to understand the meaning of utterances without ambiguity is a project pursued by analytic philosophy, heir to the logical empiricism of the first half of the twentieth century. This project attempts to clarify language, and scientific language in particular. The aim is to avoid imprecision and to bring out the meaning of utterances as rigorously as possible with the help of formal logic. This view rests on an aim of logical rigour and on confidence in the capacities of language. It has notably been considered as a way of formalising interdisciplinary communication through the translation of utterances from one disciplinary language into another (Davidson, 1973). This work then served as the basis for a theory of the integration of disciplinary languages, with the aim of generating a common understanding of a single scientific object (Klein, 2013). However, the practice of interdisciplinarity, and the communication difficulties observed to accompany it, have led to alternatives to the mere integration of disciplinary languages being considered (Holbrook, 2013).

Understanding how the meaning of an utterance is constructed from an anthropological perspective consists in analysing the implicit presuppositions and the context in which the utterance is produced. On this view, the meaning of an utterance is tied to each communicative situation and to each speaker (Calame, 1986). Meaning is no longer inscribed in the utterance itself: it is constructed by the addressee according to the context (Preyer and Peters, 2005). This theoretical framework has important consequences for the way communication is conceived: a discourse is not directly available to be understood, but must always be interpreted by an addressee. Thus, an anthropology of language practices (Bornand and Leguy, 2013) is particularly interested in the conditions that determine how utterances are interpreted in context.

Conventions of disciplinary languages

In the rest of this foreword, we will look in more detail at the analytic and anthropological perspectives, in order to determine the conditions for effective interdisciplinary communication.

For the founder of modern linguistics, Ferdinand de Saussure, the possibility of communicating is ensured by the linguistic conventions shared by the different speakers (Saussure, 1916). In other words, two speakers can communicate effectively only if they share a whole set of language conventions. These conventions make it possible to attach a signifier (the French word "population", for example) to a signified (the idea of population, with what it may imply in population genetics). But this association is not obvious a priori; on the contrary, it is arbitrary. Speakers learn it gradually as they learn a particular language. Most often, language conventions "go without saying", more or less explicitly and more or less consciously, in conversation.

Several related signifieds may exist for a single signifier within a language; this is called polysemy. For example, the French word "population" can refer to the idea of a group of individuals belonging to the same species and present in the same habitat, or to the idea of a group of individuals of the same species reproducing preferentially with one another (Debouzie, 1999). This polysemy is a major source of problems of mutual understanding between speakers of the same language: we may be under the illusion of talking about the same thing, and eventually realise that our respective conceptions differ in a number of ways. Within the sub-disciplines of genetics, the concept of "gene", central though it is, is highly polysemous (Gayon, 2004). A gene may be taken to be a coding sequence, or a coding sequence associated with a regulatory sequence, or an element that is transmitted hereditarily and has a causal influence on the phenotype, or yet other things. This strong polysemy seems to empty the term of its conceptual content, but J. Gayon notes that:

Nevertheless, the term "gene" remains. There are several possible explanations for this. First, there is a pragmatic reason. Scientists, like all people, need words to communicate with one another. For this purpose, approximate terms are often more useful than terms defined with perfect precision. Terms that are too precise limit the space of communication. The term gene, with its present ambiguity, plays an important role in this respect: it allows, with a reasonable degree of approximation, scientists from different disciplines (biochemists, molecular biologists, population geneticists, specialists in medical genetics, etc.) to understand one another. Moreover, scientists also need to place themselves within traditions of thought and to engage in dialogue with their teachers and predecessors. Since 1900, a number of concepts of the gene have succeeded one another, which overlap only in part. Their descriptive content, that is, the classes of observables to which they refer, coincide only partially. There is no dictionary that allows the various concepts of the gene to be translated into one another in a general way without equivocation. Indeed, no linguistic dictionary can ever do that. However, on a case-by-case basis, it is possible to translate into one another utterances that use different concepts of the gene.

Polysemy can also affect disciplines that are a priori very distant. The utterance "there is a phenomenon of drift" may mean something different to a population geneticist and to a linguist. For a geneticist, drift corresponds to the change in the frequencies of a set of objects (alleles, for example) through the effect of chance. For a linguist, drift corresponds instead to the change of several objects (different languages, for example) in a particular common direction, under structural constraints. Although these two concepts of "drift" share similarities, they are not equivalent. Nothing, however, distinguishes them a priori except the context in which they are used (Kasavin, 2009). This context, together with the language conventions of the speakers who encounter the term, is what allows them to interpret it correctly. A geneticist who reads this utterance in a linguistics publication, for example, and who is unaware of that discipline's conventions regarding the term, could interpret the utterance according to his or her own presuppositions and arrive at a misreading of what the author of the article means. Likewise, a linguist who is unaware of the conventions of genetics and reads this utterance in a genetics publication runs the risk of a misreading.

The sharing of a larger or smaller set of language conventions by a group of individuals constitutes a speech community (Labov, 1972). A speech community implicitly adopts a set of rules, often unconscious, which behave in conversation like the rules of a game (Wittgenstein, 1953). The members of a community can thus communicate with one another without needing to constantly specify the meaning of the utterances they produce.

The utterances of disciplinary languages are therefore not delivered with the rules for decoding them, even when all the terms appear to be rigorously defined¹. The conventions must be learned little by little by speakers within communities to enable them to understand the meaning of utterances. Universities allow students to learn the language conventions of their disciplinary languages. Definitions, examples, readings of articles, exercises, projects, discussions and assessments are all communicative situations that produce utterances. They allow speakers to infer, little by little, the language conventions of the community. An established researcher has a fine command of the language conventions of his or her discipline when able to communicate effectively with peers.

It should be added at once that the conventions of scientific communication are not built independently of other language conventions. On the contrary, scientific discourses are rooted in common sense, in a broader cultural and social context (Bonfils, 1990; Delamotte, 2004). We are then faced with a mixture of genres: disciplinary languages rest partly on the common sense shared more widely by a culture, while following certain rules that are specific to them.

Implicit theoretical presuppositions

Scientific disciplines have the particular feature of having to account for the objects of the world that they set out to study. Here we distinguish between the objects of the world, external to scientists, and the objects of the disciplines, constructed in order to account for the objects of the world. The knowledge-building activity of scientific disciplines can be seen as an attempt to describe the world as rigorously as possible. The many sources of linguistic uncertainty mentioned above therefore weigh heavily on the whole scientific field, hence the importance for scientific communities of specifying the terms they use and of clarifying the utterances they produce as far as possible. It is by constructing objects of study ("genes", "populations", "mutations", "selection pressures") that disciplines try to stay as close as possible to the objects of the world. Although these language constructions are sometimes very vague, scientists can rely on their presuppositions and on the context in which these terms are used to infer the meaning of the utterances in which they appear.

1. Indeed, one solution might be to define all the terms, so as to generate the meaning of utterances from the more elementary components that words are. However, definitions are possible paraphrases, calling on other terms that are themselves very often polysemous. These terms are in turn defined, by definitions that refer to yet other definitions, and so on in a circular fashion (Amiel, 2010). A dictionary is thus built as a set of words whose definitions refer to other words, which, within the network of interwoven definitions, necessarily end up looping into a closed system. Definitions are therefore useful, but only relatively so, since their usefulness depends on implicit linguistic assumptions that go without saying. They thus provide a means of access to meaning by supplying a chosen example utterance that tries to carry the broadest possible meaning.

Thus, the meaning of scientific utterances depends largely on the knowledge and beliefs shared by speakers. For example, the use and understanding of the word "epigenetics" by a geneticist depends directly on his or her own knowledge. This knowledge is made up of hypotheses, approximations and theories that often go unstated. The whole set of knowledge and beliefs shared by a speech community forms what may be called a linguistic common ground ("common ground" in Resnick et al., 1991).

When disciplinary languages produce utterances about the world, they rely on implicit theoretical presuppositions that correspond to different views of the world². Computational historical linguistics (Bowern and Atkinson, 2012; Gray and Atkinson, 2002; Gray et al., 2009), by applying phylogenetic methods to linguistic data, tacitly deploys a whole field of particular hypotheses about the nature of linguistic objects. The elements considered (phonemes, words, syntax, etc.) are seen as structurally independent of one another (like genes) and as directly comparable with those of other languages. This postulate, among others, is not problematic in itself, since it reflects the way computational historical linguistics constructs its object. It is foundational for this discipline, because it is partly what allows it to produce a relevant discourse on the objects of the world it sets out to study, namely the history of the languages composed of these independent elements.

2. The reader may remain puzzled by such a view of science. It is as if disciplinary communities built imaginary worlds independent of one another, and as if the members of these communities talked among themselves as in a shared fiction. This would be to forget that the objects of the world impose themselves on us forcefully: we do not construct the objects of our disciplines independently of the objects of the world. Our disciplinary objects are constructed with the aim of actually accounting for a particular facet of the objects of the world, which impose themselves on us in practice. The population geneticist whose aim is to infer history constructs the concepts of "locus" and "gene" in order to account, in practice, for the history and structure of populations. The molecular geneticist constructs the concept of "gene" in order to account, in practice, for the physico-chemical properties of DNA and the microscopic mechanisms that affect it.

Disciplinary languages thus rest on different sets of theoretical presuppositions, each approaching its objects from a particular angle. A hypothesis of "random mutation" may seem strange to a structuralist linguist, for whom linguistic changes take place according to very precise rules. The notion of "ethnic group" (Balazs, 1993; Huang et al., 2015) attached to groups of individuals in population genetics may seem strange to an ethnologist, for whom a central question is how identities are constructed in relation to others rather than from the properties of the individuals themselves. But whatever the discipline, each constructs its concepts in order to account, relevantly, for an object of the world. A large part of the difficulty of the interdisciplinary project lies in the boundaries between the objects that each discipline constructs, and in their relation to the boundaries of the objects of the world.

In this setting, how can a dialogue between disciplines be opened? And how can concepts be made to communicate despite these obstacles, which are often invisible at first sight?

Towards interdisciplinary communication

A first avenue for attempting to produce rigorous interdisciplinary work follows from everything discussed so far about disciplinary languages and knowledge. Given the influence that differences in the conventions of disciplinary languages have on communication, bringing them to light is of prime necessity. Interdisciplinary work thus involves clarifying as far as possible the discourses and what they construct, explicitly as well as implicitly, so that concepts can be articulated as rigorously as possible. Clarifying the discourse makes it possible to specify the theoretical basis on which knowledge is built in each discipline involved. Omitting this step risks leading into one of the many traps of language. Nevertheless, bringing disciplinary presuppositions to light is a perilous task, because making the limits of a scientific field explicit can prove uncomfortable for its practitioners.

Moreover, since disciplinary discourses take on their meaning through varied language conventions with sometimes disjoint theoretical presuppositions, each disciplinary field involved must be learned anew, as a second first language, by the practitioners of interdisciplinarity (MacIntyre, 1988). This learning is an essential prerequisite for the interdisciplinary approach. Indeed, a partial command of a discipline risks mutilating its theoretical foundations when the disciplines involved are set side by side. There is a great risk of imposing the presuppositions of one discipline on another through ignorance of their differences. It is out of a sense of responsibility towards what constitutes the disciplines that learning them de novo proves necessary.

Finally, instead of recognising the differences on which disciplinary languages are founded, it may be tempting to embrace the project of making them uniform, in order to have a single language. However, such uniformity implies a standardisation of the different possible relations to the objects of the world studied by the different scientific disciplines. In the face of this ideal of uniform knowledge, it seems to me that interdisciplinarity requires, on the contrary, acknowledging the constitutive differences between disciplines, which are irreducible to one another. The project must then consist in taking account of these constitutive differences, with an aim of unity (and not of uniformity), as indicated by F. Alvarez-Péreyre (2003):

The fact is that, in the relationship between the particular and the universal, as the human and social sciences strive to treat it, two paths present themselves. One consists in putting objects into perspective. It rests on two implicit assumptions. On the one hand, the conviction that these objects are easily comparable. On the other hand, the conviction (the illusion, against all daily observation to the contrary) that scientific discourses are closely related, unambiguous and non-conflictual.

The other path consists in supposing that objects have no naturalness and that scientific discourses are the expression of incessant constructions. Making room for the interdisciplinary requirement also means trying to work in this direction.

To this end, it seems to me that the interdisciplinary enterprise gains from drawing on the diversity of work stemming from the analytic and anthropological perspectives³. We have seen that these perspectives offer very useful tools for addressing the difficulties induced by the diversity of scientific practices and discourses. The recognition of the constructed character of the sciences and of their discourses, in its various dimensions, thus leads us to put the project of a uniform language into perspective, at least in the context of interdisciplinary work.

Interdisciplinarity is therefore a process in motion, a practice, and not a state of affairs. The rest of this thesis was conceived progressively with this in mind, seeking to clarify disciplinary boundaries as far as possible, to bring linguistic and epistemological presuppositions to light wherever possible, and to question the taken-for-granted assumptions of scientific practice. This process requires work that is sometimes laborious, because it often goes unheard with respect to the expectations of each discipline. To this difficulty is added an institutional rejection of such a practice of interdisciplinarity, as indicated by Bühlera et al. (2012):

3. We have drawn on several disciplines over the course of this foreword in order to build an interdisciplinary framework that we hope is rigorous. Yet the differences between these disciplinary domains bring us back to the questions of interdisciplinarity detailed above: how can these meta-disciplines best be articulated with one another, when each rests on presuppositions that at first sight are incompatible with the others? Without some familiarity with applying reasoning to itself, a certain vertigo may set in. We would point out that the recursive nature of our work, mentioned at the beginning of this foreword, is for us the key to a possible analysis of the analysis, opening the way to an interdisciplinarity of interdisciplinarity. This certainly goes beyond the scope of this thesis manuscript.

For most institutions, interdisciplinarity is conceivable only if it does not call into question the foundations of the disciplines or, better still, if it reinforces the disciplines in place: "[the] practice of transdisciplinarity requires, on the contrary, the constant strengthening of the 'hard core' [of the different disciplines]" (CNRS, 2002, p. 13).

Could avoiding any questioning of disciplinary taken-for-granted assumptions itself be a taken-for-granted assumption deeply rooted in our disciplines?

Introduction

The languages we speak and the genes we carry are the legacy of a rich and complex history. No two people speak in exactly the same way, just as no two people have exactly identical genomes. This wide genetic and linguistic diversity is in part the result of the history of human populations across the world. How can we glimpse this history by studying only the human diversity present today?

Reasoning by historical inference is a means of studying past events and processes. It is through the observation and analysis of contemporary data that inferences allow us to reconstruct the history of human populations. This type of reasoning, which is based on induction (Nagel, 1961; Sagaut, 2008), requires proposing a series of possible scenarios or models⁴ and evaluating them in the light of real data. The scenario or scenarios judged most likely to have produced the observed data are considered relevant representatives of past events. One then speaks of refutation (or "falsification") of the least relevant scenarios. This type of reasoning requires a common theoretical basis, as well as several competing ramifications, so as to arbitrate between them and eliminate the least plausible (Lakatos, 1976). Scenarios that are not refuted are retained until other competitors enter the fray.

Over the history of science, the observation of present-day diversity in a wide variety of traits has made it possible to infer the history of human populations. Linguistic diversity was used as early as the mid-nineteenth century by Schleicher to infer the hypothetical origin of the Indo-European peoples (Schleicher, 1853), as well as by Čelakovský in the study of the Slavic languages (Čelakovský, 1853). Morphological diversity was also used very early by Haeckel, with the aim of determining the place of humanity within the evolutionary history of living species (Haeckel, 1874). Genetic diversity was brought in later, from the second half of the twentieth century (Cavalli-Sforza et al., 1964), and has been used more and more following joint advances in genome sequencing and computer processing (Reich et al., 2009). Historical inferences using genetic data have, for example, provided decisive arguments concerning the exit of Homo sapiens from Africa (Cann, 2001). The use of linguistic diversity, benefiting from developments in phylogenetic methods, has recently made it possible to reconstruct the history of the settlement of the Austronesian islands and to point to its probable Taiwanese origin (Gray et al., 2009). Other types of cultural diversity have been used more recently for historical inference, also drawing on methods developed in phylogenetics. Material-culture datasets describing the characteristics of Polynesian canoes have thus made it possible to establish a probable Fijian origin for the populations that colonised these islands (Rogers et al., 2009). The study of the diversity of folktale structures has made it possible to establish the relatedness of two large groups of variants of "Little Red Riding Hood" across the world (Tehrani, 2013). More recently still, the study of musical diversity has made it possible to establish the significantly vertical character (from one generation to the next) of the transmission of musical traits in Gabon (Le Bomin et al., 2016).

4. The word "model" recurs many times in scientific discourse, including in this thesis manuscript. We invite the reader to be wary of this word, since it is widely used yet highly polysemous. For a historical overview of different possible meanings of the word, see for example Suppes (1961). We will favour the word "model" to designate a set of hypotheses (as in a "mutational model", for example), and the word "scenario" to designate a set of specifically historical hypotheses (as in a "scenario of demographic growth", for example).

The use and development of mathematical methods for inference have benefited greatly from the continuous growth of computing capacity (Beaumont and Rannala, 2004). In population genetics in particular, modern sequencing techniques generate large volumes of data that the power of current computers is able to process. Indeed, the human genome comprises about three billion base pairs, and the sequencing of complete genomes is now accessible through modern "next-generation sequencing" techniques. These data can provide relevant information for historical reconstruction, but taking them into account requires large storage and computing capacities.

Current aggregated knowledge about human populations suggests that their histories are very often made up of complex events, beyond a simple line of descent between populations: increases or decreases in size, bottlenecks, migration or admixture events, and so on. The development of statistical methods for historical inference, benefiting from the growth in computing capacity mentioned above, makes it possible to handle increasingly complex scenarios (Robinson et al., 2014). Methods based on a large number of computer simulations have become possible, making it possible to handle large datasets evaluated under arbitrarily complex scenarios. They are thus increasingly used for historical inference in population genetics, through approximate Bayesian computation (ABC; Beaumont et al., 2002; Tavaré et al., 1997).

Nevertheless, as we stressed earlier, methods of historical inference always presuppose a set of prior theoretical hypotheses, however refined their computational and statistical machinery. What, then, are the theoretical foundations of inference methods in genetics and in linguistics? What implicit presuppositions accompany the use of these methods? Are these presuppositions justified?

In the remainder of this introduction, I will detail the methodological chain that starts from the field observation of genetic or linguistic data, passes through graphical and statistical descriptions, and leads to historical inference proper. I will then review the historical inference studies that combine genetic and linguistic diversity, before presenting my own work.

1. Construction and observation of the object

1.1. Observing genetic diversity

Contrary to what the term “data” (literally, “what is given”) might suggest, the idea that reality is “given to be seen” is a persistent myth (Sellars, 1956), notably in population genetics. A series of observations is first constructed through a prior theoretical scaffolding, whether that scaffolding rests on scientific presuppositions or on common-sense ones (see the foreword). The steps that lead population geneticists to infer the history of human populations rely on a varied body of biological knowledge accumulated over the history of science. This knowledge concerns the complexity of the biology of organisms and includes the mechanisms of mitosis and meiosis, the mechanisms of fertilisation, and those of DNA replication (Chakravarti, 1999). It is by taking into account this vast theoretical architecture presupposed by their work that population geneticists can offer a discourse in which everything hangs together (Murphy and Medin, 1985).

The formalisation of the notion of population likewise sets the conditions for a sampling campaign that is relevant to research in population genetics. Variations around a shared family resemblance (Wittgenstein, 1953) depend in part on differences in the objectives that the sub-disciplines concerned set themselves a priori. For example, ecologists tend to define populations in terms of demographic cohesion, since they are mainly interested in interactions between individuals. Population geneticists, for their part, tend to define populations in terms of reproductive cohesion, since they are mainly interested in genetic transmission across generations. This polysemy leads to a certain semantic vagueness about the presuppositions behind the different uses of the term, so that no unanimous definition of the word can be settled on (Waples and Gaggiotti, 2006).

Moreover, each of these disciplines displays its own internal polysemy: the concept of population is often used informally in a variety of senses (Debouzie, 1999; Hartl and Clark, 2007). In population genetics in particular, it may refer to a group of individuals of the same species living in the same habitat (Lefevre et al., 2016), to a group of individuals among whom mating occurs at random, in a “panmictic” manner (Jobling et al., 2003), or to a gene “pool” whose composition is liable to change (Henry and Gouyon, 1999).

The same vagueness affects the notion of “gene”, which is equally central to population genetics (Gayon, 2004). A “gene” may refer to a hereditary entity that causally controls the production of a phenotype, to a transcribed DNA sequence, to a coding sequence associated with a regulatory sequence, or to any variable portion of the genome.

Be that as it may, population genetics is built around these two key notions (“gene” and “population”), which guide the day-to-day thinking and practice of population geneticists according to their own research questions. The design of a sampling campaign is thus conditioned by a particular understanding of these terms within this polysemy. Studies focused on the human species may place the notion of “ethnic groups” at the heart of their sampling structure (Balazs, 1993; Heyer et al., 2009; Huang et al., 2015), whereas health-focused studies may place the notion of “patient population” at the heart of theirs (Khuri et al., 2007; McDonald et al., 2003). Neighbouring branches of the discipline, such as phylogenetics, instead place other notions at the centre of their theoretical constructions, such as “genealogical relationships” (between sequences, between individuals, between species, etc.), which leads them to design different sampling campaigns (Wiley and Lieberman, 2011).

After the sample-collection campaign, the knowledge and techniques accumulated by biochemistry and cell biology make it possible to carry out laboratory protocols for DNA extraction (Aljanabi and Martinez, 1997), amplification by PCR (Mullis and Faloona, 1987) and sequencing (Mardis, 2008), yielding a computer file that contains, for each sequenced DNA sample, a string of the letters A, T, G and C. It is the theoretical and practical knowledge of the chain running from the sample collected in the field to the sequencing file that ultimately gives meaning to the data sets available in digital form and makes it possible to propose hypotheses about the sampled populations.

1.2. Observing linguistic diversity

As in genetics, linguistic reality is not simply given to be seen. Linguistic data must be constructed through a sampling protocol, by interpreting a set of utterances. How a linguist carries out the sampling protocol depends on his or her hypotheses about the nature of language. A central concept in linguistics is that of “language” (or “lect” at a finer scale); yet linguistic objects differ greatly in nature depending on the fieldwork undertaken and on the disciplinary schools of the linguists collecting the data.

The great semantic diversity attached to this notion makes it difficult to clarify, and its meaning must be assessed case by case (Pateman, 1983). Unlike population genetics, linguistics unfolds across a variety of analytical paradigms: philological, structural, strategic and neural (Alvarez-Pereyre, 2014), each representing a different way of constructing the linguistic object.

The philological paradigm, historically the oldest, accounts for the change of linguistic elements over time and for their geographical variation. The work of William Jones, with his comparison of Sanskrit, Ancient Greek and Latin (Jones, 1786), already belongs to this paradigm. The diachronic lens (concerned with unfolding over time) makes it possible to establish historical branchings, diffusion phenomena, and the factors of geographical and temporal variation. Languages are regarded as an aggregation of elementary units, the history of each of which can be studied independently. This paradigm is currently being renewed by methods developed in phylogenetics (Atkinson and Gray, 2005); the two fields cross-fertilise extensively and share a common theoretical basis described as “evolutionist” and “diffusionist”. The analogy between linguistic units and genetic units is the source of many methodological and conceptual exchanges between these disciplines (Pagel, 2009).

The structural paradigm is rooted in the work of Ferdinand de Saussure, through the posthumous transcription of his course in general linguistics (Saussure, 1916), and then in the work of the Russian linguists Nicolaï Troubetzkoy and Roman Jakobson. Linguists working in this paradigm seek rather to understand the structure of sign systems at a given moment. The synchronic lens, in contrast to the diachronic lens of philology, is a means of establishing the systemic relations within a particular language. The structural paradigm thus accounts for how the elements of a linguistic system are articulated according to its internal constraints. Languages are here regarded as systems in which each part has meaning only in relation to the others. The synchronic character of the structuralist school does not preclude taking linguistic change over time into account, with approaches that differ from author to author (Verleyen, 2007).

The strategic paradigm reverses the structural perspective. The emphasis is placed on the role of external parameters (cultural, social, environmental, etc.) acting on a linguistic system (Bornand and Leguy, 2013). The tools of this paradigm bring to light the diversity and influence of extra-linguistic parameters, making it possible to spell out the more or less conscious dynamics reflected in language. Languages are here regarded as human practices embedded in much broader cultural systems.

The neural paradigm, for its part, posits a central nervous system that constrains behavioural functions, in particular those related to language (Chomsky, 2006). Taking the cognitive perspective into account brings to light the links between nervous, psychological and physiological structures on the one hand and the perception, processing and production of speech on the other. Languages are thus regarded here as the realisation of pre-existing cognitive constraints.

Each of these four paradigms captures a particular dimension of languages, focusing on one of their aspects according to its own objectives. It should be noted that a broad polysemy around the concept of language also unfolds within each of these paradigms.

1.3. Philological positioning

My doctoral work is positioned within the philological tradition inherited from the work of Morris Swadesh, which is conceptually very close to phylogenetics. With the aim of reconstructing the history of languages, Swadesh developed glottochronology (1952). The principle is to assess the extent to which words sharing the same etymological root, that is, deriving from the same ancestral word, are found from one language to another. Words with the same meaning and the same etymological root are called cognates. Standard lists of 100 or 200 words are drawn up from the vocabulary common to most of the world’s languages so that languages can be compared with one another. The words that make up these Swadesh lists are chosen to be universal and resistant to borrowing. The number of cognates that differ between two languages then makes it possible to calculate the date of divergence between these languages (Lees, 1953) and to build a tree of linguistic history analogous to phylogenetic trees. Such trees could thus be called “glossogenetic trees” (Fitch, 2008).
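The dating step described above can be illustrated with a minimal sketch. It assumes the classical glottochronological relation t = ln(c) / (2 ln(r)), where c is the proportion of list items that are still cognate between the two languages and r is the proportion of items a language retains per millennium. The function name and the default value r = 0.86 are illustrative assumptions, not values taken from this thesis.

```python
import math

def divergence_time(shared_cognates, list_size, retention=0.86):
    """Estimated time since two languages split, in millennia.

    retention is the assumed fraction of list words that a language keeps
    per millennium (0.86 is an illustrative value, not taken from the text).
    """
    shared = shared_cognates / list_size
    if not 0 < shared <= 1:
        raise ValueError("shared proportion must be in (0, 1]")
    # Both lineages lose words independently, hence the factor 2.
    return math.log(shared) / (2 * math.log(retention))

# Two languages sharing 74 cognates on a 100-word list.
print(round(divergence_time(74, 100), 2))
```

The constant retention rate is precisely one of the assumptions of the classical approach that has been most criticised, which is why it appears here as an explicit parameter.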

Recent computational tools make it possible to revise the classical glottochronological approach and to try to discard its most criticised assumptions (Atkinson et al., 2005; Campbell, 2006), for example by explicitly specifying the processes of descent between languages rather than merely assuming relatedness from a numerical correlation. The idea is then to compare linguistic elements from one language to another in the light of an explicit model of successive branchings and of divergence through the accumulation of mutations. “Languages” are treated as collections of elements that are present or absent (Atkinson et al., 2005): presence or absence of certain sounds, of certain words, of certain syntactic forms, and so on, all liable to mutate over time at variable rates.

The question of observation here comes down to establishing the presence or absence of a list of elementary objects within each language, and this formalisation guides the sampling protocol in the field. In the example of a Swadesh list, the word list is compiled for each language studied, and the methods of comparative linguistics and etymology are then used to assign the words to cognate groups.
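As a minimal sketch of this presence/absence formalisation, the snippet below converts cognate judgements into binary vectors, one trait per (meaning, cognate class) pair. The language names, meanings and class labels are hypothetical placeholders, and the function `binary_coding` is introduced only for illustration.

```python
def binary_coding(cognate_classes):
    """Turn {language: {meaning: cognate_class}} into presence/absence vectors."""
    # One binary trait per (meaning, cognate class) pair seen in any language.
    traits = sorted({(meaning, cls)
                     for words in cognate_classes.values()
                     for meaning, cls in words.items()})
    matrix = {lang: [int(words.get(meaning) == cls) for meaning, cls in traits]
              for lang, words in cognate_classes.items()}
    return traits, matrix

# Hypothetical cognate judgements for three unnamed languages.
data = {
    "lang_A": {"water": "w1", "fire": "f1", "stone": "s1"},
    "lang_B": {"water": "w1", "fire": "f2", "stone": "s1"},
    "lang_C": {"water": "w2", "fire": "f2", "stone": "s1"},
}
traits, matrix = binary_coding(data)
# Number of cognate classes present in both lang_A and lang_B.
shared_AB = sum(a & b for a, b in zip(matrix["lang_A"], matrix["lang_B"]))
```

A matrix of this kind is the usual input of tree-building methods, each column playing a role comparable to that of a genetic site.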

More generally, and beyond the reconstruction of glossogenetic trees, a body of research has taken up the philological perspective and the conceptual analogies between genetic and linguistic evolution. The recent field of “evolutionary linguistics” (for a detailed review, see Croft, 2008) unfolds across a variety of approaches whose objects, methods and presuppositions are sometimes very different. A formalisation of the notion of “language” proposed by Croft (1996) particularly caught my attention. There, a language is treated rather as a set of utterances produced by a community of speakers. Croft proposes using the concept of “speech community”, analogous to that of “genetic population” (Croft, 2006):

A speech community is a group of speakers who engage in intercourse, that is, talk to each

other, and more critically, are communicatively isolated from speakers in other speech

communities. […] The definition of a language spoken by a speech community is then more

“social” than “linguistic”.

Adopting such a definition of language has important consequences for the sampling protocol: the linguistic diversity internal to each speech community becomes relevant data for historical inference. The task is then to record the utterances produced by a series of speakers, as opposed to considering a language at a more general level as in the glottochronological studies above. A data-collection protocol for a series of individuals sampled within speech communities is required. Nevertheless, to my knowledge, very few sampling campaigns of this kind have been carried out so far, apart from those of Mennecier et al. (2016) and Verdu et al. (2017).

2. Description of diversity and historical inference

2.1. Describing genetic diversity

In population genetics, describing diversity is the step that transforms the complex information in the file of genetic sequences produced by sequencing into information of lower dimension. It gives researchers in the discipline a synthetic representation of the data set and a first visualisation of its complexity. This step may take the form of summary statistics computed with software (Excoffier and Lischer, 2010; Guillot et al., 2005), of graphical representations derived, for example, from principal component analyses, of tree representations built, for example, with the Neighbour-Joining algorithm (Saitou and Nei, 1987), or of clustering plots (Pritchard et al., 2000). Here too, constructing a description is theory-laden: it can acquire meaning only with reference to the presuppositions induced by the descriptive methods applied to the data, the data themselves being constructed within a prior theoretical framework.

These descriptions are sometimes interpreted directly to produce what I would call “verbal” historical inferences. The major risk of such arguments, built solely on graphical or statistical representations, is that their underlying hypotheses or taken-for-granted assumptions remain implicit. It is then often difficult to judge their validity, and ignorance of these presuppositions can lead to misinterpretations of descriptive results, as several authors have pointed out (Falush et al., 2016; Novembre and Stephens, 2008). Reich et al. (2008), for example, show that a genetic gradient observed on a PCA plot is commonly interpreted as the result of a migration process, but that it could just as well be read as the result of gradual isolation by distance.

2.2. Describing linguistic diversity

In philology, methods for describing linguistic diversity are often analogous to those used to describe genetic diversity. Horizontal transfers are sometimes accommodated by graphical representations that include reticulations, for example network graphs (Hamed, 2005; List et al., 2014). These descriptions can be interpreted directly. Nevertheless, the same problem arises as in genetics regarding the link between these statistical or graphical representations and the historical processes that gave rise to the observed diversity. For verbal inferences of linguistic history drawn from descriptions alone, the problem is even more acute. What causal links can be established between the descriptions and the mechanisms that produce linguistic diversity? What implicit presuppositions, sometimes drawn directly from a “genetic” view of linguistic objects, guide the interpretations?

2.3. Inferring the history underlying genetic diversity

Mathematical and computational models formalise a first layer of knowledge about the biological particulars of DNA replication and transmission. Knowledge of autosomal recombination, base-pair mutation, gamete production by mitosis/meiosis and fertilisation informs the general picture of how genetic diversity is built up. The Y chromosome and mitochondrial DNA hold a special place in population genetic inference, since the former is transmitted only from father to son whereas the latter is transmitted only through the mother, which makes it possible to reconstruct the history of specifically paternal lineages on the one hand and specifically maternal lineages on the other.

After this first layer of formalisation at the level of individuals, population structure informs the second layer. Wright (1951) proposed a set of hypotheses that have since become standard in modelling work. Unless explicitly specified otherwise, individuals are assumed to reproduce only with members of their own population, choosing their mates at random: this is the panmixia hypothesis. Alleles are assumed to have no influence, or only a neutral one, on the life-history traits of individuals: this is the neutrality hypothesis. The history of splits, migrations, selection processes or admixture between several populations can then make up the specific features of the historical scenarios to be evaluated against one another.
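To make the two hypotheses concrete, here is a minimal sketch of genetic drift in a single panmictic, neutral population. It uses the textbook Wright-Fisher scheme, which is not detailed in this text: every gene copy of the next generation is drawn at random from the current allele frequency. The function name, population sizes and number of generations are arbitrary choices made for the illustration.

```python
import random

def wright_fisher(freq, pop_size, generations, rng):
    """Allele-frequency trajectory under random mating and neutrality."""
    copies = 2 * pop_size  # diploid individuals carry two gene copies each
    trajectory = [freq]
    for _ in range(generations):
        # Panmixia + neutrality: each copy is drawn independently, and the
        # only thing that matters is the current frequency of the allele.
        count = sum(rng.random() < freq for _ in range(copies))
        freq = count / copies
        trajectory.append(freq)
    return trajectory

rng = random.Random(42)
small = wright_fisher(0.5, pop_size=20, generations=50, rng=rng)
large = wright_fisher(0.5, pop_size=2000, generations=50, rng=rng)
```

Changes in population size, migration or admixture can then be layered on top of such a scheme, which is how scenarios of increasing complexity are specified.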

These historical scenarios can be evaluated with several types of inferential methods. Maximum-likelihood methods directly explore the probability of observing what is observed under the set of scenarios considered (Huelsenbeck and Crandall, 1997). In particular, Markov chain Monte Carlo algorithms in a Bayesian framework offer many computational advantages and are methods of choice for the discipline (Drummond and Rambaut, 2007). Nevertheless, these methods sometimes break down when faced with very large data sets or complex scenarios (Stephens and Donnelly, 2003). Such cases can make direct computation of the likelihood impossible, even though this computation is at the heart of these methods. Approximate Bayesian Computation (ABC) methods, for their part, rely on computer simulations of the proposed scenarios. The simulated data are compared with real data in order to assess the relative relevance of each scenario without relying on a direct computation of the likelihood (Tavaré et al., 1997). ABC methods are sometimes very costly in simulation time, but they nevertheless make it possible to handle very large data sets and to evaluate scenarios involving complex migration and admixture events that other inference methods handle with difficulty (Csilléry et al., 2010; Robinson et al., 2014).
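The logic of ABC can be sketched with its simplest variant, the rejection algorithm: draw a parameter from the prior, simulate data, and keep the parameter only if the simulated summary statistic is close to the observed one. The toy simulator below (a count of derived alleles in a sample), the uniform prior, the tolerance and all names are assumptions made for the illustration; they do not reproduce the models used in this thesis.

```python
import random

def simulate(theta, n, rng):
    """Toy simulator: number of derived alleles among n sampled gene copies."""
    return sum(rng.random() < theta for _ in range(n))

def abc_rejection(observed, n, n_sims, tolerance, rng):
    """Keep the parameter values whose simulated data resemble the observation."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.random()  # draw from a uniform prior on [0, 1]
        if abs(simulate(theta, n, rng) - observed) <= tolerance:
            accepted.append(theta)  # no likelihood is ever computed
    return accepted

rng = random.Random(1)
posterior = abc_rejection(observed=30, n=100, n_sims=20000, tolerance=2, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values approximate the posterior distribution of the parameter. Comparing scenarios follows the same principle: the scenario itself is drawn at random before each simulation, and the proportions of accepted simulations per scenario estimate their relative support.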

2.4. Inferring the history underlying linguistic diversity

Work on building language trees (Atkinson et al., 2005) mainly uses likelihood-based Bayesian inferential methods (Drummond and Rambaut, 2007). The scenarios evaluated against one another are scenarios of successive branchings from a common origin, analogous to phylogenetic models. The possibility of admixture events between languages is not taken into account, and borrowing is treated as a nuisance parameter (Greenhill et al., 2009). These methods therefore presuppose an exclusively vertical transmission of languages.

Unlike work in population genetics, which can draw on a body of results from biology, work on historical inference in linguistics has no field describing the processes of linguistic change and their causal mechanisms. There is a real gap in the chain running from observation to inference: no synthesis links the processes of historical change at the scale of languages to the mechanisms of linguistic transmission at the scale of individuals. Thus, whereas genetics provides a description of the mechanisms of DNA transmission, it is difficult to state unequivocally what the mechanisms of linguistic acquisition and transmission are at the scale of individuals.

3. Studies of genetic and linguistic co-evolution

Historical reconstructions combining genetics and linguistics go back at least to the work of Ranajit Chakraborty in the 1970s (Chakraborty, 1976), who studied the correlation between genetic and linguistic distances among Amerindians of the Andean highlands. Luigi Luca Cavalli-Sforza and his collaborators extended this approach to a worldwide scale in the 1980s (Cavalli-Sforza et al., 1988), using allelic diversity to build phylogenetic trees and compare them with language taxonomies proposed elsewhere (Ruhlen, 1991).

Genetic, linguistic and geographical distances are still widely used today to describe their possible correlations at local (Balanovsky et al., 2011; Nettle and Harriss, 2003; Ramallo et al., 2013) or global scales (Belle and Barbujani, 2007; Creanza et al., 2015). These studies tend to show a very weak correlation between genetic and linguistic distances once geographical distances are taken into account. By contrast, a strong correlation between genetic and geographical distances, and between linguistic and geographical distances, is very often measured. This therefore seems to indicate that the link between genetic and linguistic diversity results solely from a shared geography (Ben Hamed and Darlu, 2007), rather than from an analogous transmission mechanism, which would have produced a correlation between genetics and linguistics over and above the correlation with geography.
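The reasoning “once geographical distances are taken into account” can be illustrated with a first-order partial correlation computed on vectors of pairwise distances. The six distance values below are invented so that both genetic and linguistic distances follow geography plus independent noise. The significance testing used in practice on distance matrices (typically permutation-based) is omitted, and all names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of distances."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Invented distances for six pairs of populations.
geographic = [1, 3, 5, 7, 9, 11]
genetic = [2, 2, 5, 7, 8, 12]
linguistic = [2, 4, 3, 5, 10, 12]

raw = pearson(genetic, linguistic)  # strong apparent association
controlled = partial_correlation(genetic, linguistic, geographic)  # close to zero
```

In this constructed example the raw correlation between genetic and linguistic distances is high, yet it vanishes once geography is controlled for, which is the pattern described above.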

Relying nonetheless on a hypothesis of co-evolution between genes and languages, the inference of splits between genetic populations is sometimes used to evaluate language taxonomies (Amorim et al., 2013). Genetic and linguistic histories are here assumed to be identical, or at least analogous, so that one can inform the other. In another context, comparing phylogenetic trees reconstructed from genetic and from linguistic data strengthens historical inferences when the trees coincide (Balanovsky et al., 2011; Duda and Jan Zrzavý, 2016). This pairing also makes it possible to suggest particular events (language replacement, for example) when the trees do not coincide.

The reconstruction of phylogenetic trees in the historical inference studies cited above is a direct consequence of their modelling assumptions. The history that produced genetic and linguistic diversity is assumed to be a process of successive branchings starting from a common trunk corresponding to an ancestral form, dividing into branches and then into leaves that represent the diversity currently observed.

This type of model is much debated in the literature on the evolution of culture in general (Geisler and List, 2013). Two main criticisms are levelled against it (Cabrera, 2017). The first concerns the failure to take into account transfers between the branches of the trees due to borrowing between languages. Phylogenetic methods indeed assume a strictly vertical transmission of linguistic elements. The second criticism concerns the unjustified analogy between the event of genetic transmission and the event of cultural transmission (Claidière and André, 2012). Cultural elements strongly influence their own mode of transmission through a whole series of biases. For example, a simpler linguistic structure will be more easily understood when it is uttered. Likewise, a more expressive word will be more easily remembered and, in turn, used more. DNA, by contrast, does not bias the transmission event itself, apart from rare segregation distorters, since its mode of transmission is always the same: half of each parent’s autosomes, plus the uniparental markers. Kirby (2000) thus notes that:

[languages] must be reconstructed at each generation through learning or acquisition, unlike DNA sequences (which are copied and then physically transmitted).

How can these particularities be taken into account when proposing joint historical inferences between genetics and linguistics? Which presuppositions can be adopted in order to make linguistic inferences as robust as possible?

4. Towards a joint analytical framework for genetic and linguistic diversity

4.1. Inferring the history of genetic populations and linguistic varieties

In the first chapter of this thesis, we develop a new ABC method for inferring in parallel the histories of genetic populations and of linguistic varieties. It is applied to a set of Central Asian populations for which genetic and linguistic data were collected by the laboratory (21 Central Asian populations genotyped at 26 autosomal microsatellite markers and typed linguistically for a list of 185 words). For the first time, ABC methods make it possible to account explicitly for the possibility of linguistic borrowing or admixture as well as for the possibility of migration or genetic admixture events. The inferences are carried out independently for the two types of data, so that the proposed scenarios can be compared without presupposing strict co-evolution between the two.

4.2. Exploring a “population linguistics”

We noted earlier that an observation campaign centred on languages is only one possible choice among others. A different perspective, closer to population genetics, consists in recording in the field the linguistic diversity within speech communities at the scale of individuals (Thomsen, 2006). The second chapter therefore builds a series of different models of linguistic evolution at the scale of individuals. The objective is to assess the relevance of several possible modes of transmission, while evaluating the feasibility of intra-population linguistic inference, as a first formalisation of a “population linguistics” (Manni, 2017). This framework is applied to the study of modes of linguistic transmission in 10 populations of Tajikistan for a list of 185 words.

4.3. Building an interface between population linguistics and population genetics

Because phylogenetic models cannot be applied to reconstruct linguistic histories at the scale of individuals, new models detailing the mechanisms of linguistic transmission must be built. Analogies with the mechanisms of genetic replication can be regarded as sources of inspiration, but not as justifications. It is therefore necessary to draw on knowledge from the different branches of linguistics, which entails thorough interdisciplinary work. In the third chapter, we accordingly set out to articulate a formalisation of population genetics with a formalisation of a possible population linguistics. Making the underlying hypotheses explicit and taking into account the causal chain running from field observation to inference lead us to propose a reversal in the way the linguistic object is constructed, placing the utterance at the heart of the modelling.

4.4. Sampling and analysis of linguistic data from the Cape Verde Islands

In the last chapter of this thesis, we detail the linguistic data-collection protocol applied during a field mission in the Cape Verde Islands with a group of Cape Verdean speakers. This protocol was designed to meet the needs of the population linguistics presented in the preceding chapter. In return, applying the protocol in the field fed our theoretical perspectives. A preliminary data analysis gives us a first insight into the structure of linguistic diversity in this region of the world.

Chapter I – Genetic and linguistic

histories in Central Asia inferred

using Approximate Bayesian

Computation

Valentin Thouzeau†,1, Philippe Mennecier†, Paul Verdu†,2, Frédéric Austerlitz†,2

† CNRS, MNHN, Université Paris Diderot, UMR 7206 Eco-Anthropologie et Ethnobiologie, Paris 75016, France

1 Corresponding author: valentin.thouzeau@mnhn.fr, +33 (0)1 44 05 73 44

2 These authors equally supervised this work

This article was published online August 23, 2017, with the following reference:

Thouzeau, V., Mennecier, P., Verdu, P., and Austerlitz, F. (2017). Genetic and linguistic histories in Central Asia inferred using approximate Bayesian computations. Proc. R. Soc. B 284, 20170706.

Abstract

Linguistic and genetic data have been widely compared, but the histories

underlying these descriptions are rarely jointly inferred. We developed a unique

methodological framework for analysing jointly language diversity and genetic

polymorphism data, to infer the past history of separation, exchange and admixture

events among human populations. This method relies on Approximate Bayesian

Computation, which allows us to identify the most probable historical scenario underlying

each type of data, and to infer the parameters of these scenarios. For this purpose, we

developed a new computer program PopLingSim that simulates the evolution of

linguistic diversity, which we coupled with an existing coalescent-based genetic

simulation program, to simulate both linguistic and genetic data within a set of

populations. Applying this new program to a wide linguistic and genetic data set of

Central Asia, we found several differences between linguistic and genetic histories. In

particular, we showed how genetic and linguistic exchanges differed in the past in this

area: some cultural exchanges were maintained without genetic exchanges. The

methodological framework and the linguistic simulation tool here developed can be

successfully used in future work for disentangling complex linguistic and genetic

evolutions underlying human biological and cultural histories.

1. Introduction

Human demographic history encompasses complex events such as migrations,

population size changes, and admixture events (Cavalli-Sforza and Feldman, 2003;

Hellenthal et al., 2014; Mallick et al., 2016; Ramachandran et al., 2005). These

demographic events, which impact within- and among-population genetic diversities,

are coupled with gradual cultural changes or bursts of innovation, and borrowings

(Atkinson et al., 2008; Cavalli-Sforza and Feldman, 1981; Mesoudi et al., 2006).

Since Darwin (1871), numerous authors have investigated genetic and linguistic

evolutionary processes. They found parallelisms between linguistic and genetic trees

(Cavalli-Sforza et al., 1988); identified homologous linguistic traits with a possible

common origin similar to homologous genetic markers (Atkinson and Gray, 2005);

and proposed that genes and languages are both composed of discrete heritable

replicators which may evolve in parallel (Pagel, 2009). Numerous studies have

compared linguistic and genetic diversities, and have also investigated how they

match at different geographical scales. For instance, a strong correlation was found

between genetic barriers and linguistic boundaries in Europe (Barbujani and Sokal,

1990); genes from North Island Melanesian populations appeared to diffuse more than

linguistic features (Hunley et al., 2008); African genetic diversity was more

structured geographically than linguistically (Scheinfeldt et al., 2010); strong

differences between genetic and phonemic drifts were found at the worldwide scale

(Creanza et al., 2015).

While complex mechanisms are often considered in demographic inferences

based on genetic data, the known complexity of mechanisms underlying linguistic

evolution has rarely been accounted for. Indeed, genes and languages differ by the

very nature of their transmission processes (Claidière and André, 2012). Genes are

transmitted vertically among individuals and gene flow only occurs through sexual

reproduction and/or individual migrations. On the other hand, languages can be

transmitted vertically, horizontally and obliquely (Tao Gong, 2010). This transmission

among generations may occur in parallel with genetic transmission, in particular

within families. However, linguistic exchanges or borrowings among populations may

also occur via cultural diffusion, without migration of individuals (Haspelmath and

Tadmor, 2009; Haugen, 1950). Conversely, gene flow can occur without language

borrowing when a migrating individual does not transmit his/her language to his/her

progeny. Therefore, cultural and demographic changes are not necessarily correlated

(Steele and Kandler, 2010), and genetic and linguistic data may reveal different

aspects of human history. For instance, all Central African Pygmy populations share a

common ancestral population long diverged from the ancestral non-Pygmy

neighbouring population (Verdu et al., 2009), but nevertheless now use the language

of their neighbours (Bahuchet, 2012).

Population genetics methods allow complex demographic histories to be inferred

from genetic polymorphism data, using elaborate statistical methods, such as Markov

chain Monte Carlo or Approximate Bayesian Computation (ABC) (Beaumont et al.,

2002; Tavaré et al., 1997) based on the coalescent theory (Kingman, 1982). They have

allowed inferring parameters of human demographic history at worldwide or regional

scales (Alves et al., 2016; Haber et al., 2016; Moreno-Estrada et al., 2013). Several

models have been proposed for the transmission and evolution of specific linguistic

features such as words (Atkinson et al., 2005). Computational linguistic approaches

have recently been applied to lexical datasets, allowing the inference of recent human

language diffusions (Bouckaert et al., 2012; Gray et al., 2009). These approaches did

not encompass horizontal or oblique borrowing events, as these processes cannot be

easily accommodated within a phylogenetic framework. However, neglecting

borrowings is expected to significantly bias the estimation of parameters such as

divergence dates (Greenhill et al., 2009).

Likelihood-based approaches cannot handle large data sets under highly

complex models (Csilléry et al., 2010; Schiffels and Durbin, 2014; Weiss and von

Haeseler, 1998). However, complex models are essential to interrogate the

multifaceted demographic and cultural histories. ABC methods provide an ideal

framework to overcome these challenges (Beaumont et al., 2002; Tavaré et al., 1997),

since they rely on explicit simulations, which make it possible to consider jointly

phenomena such as admixture, changes in effective population size, and borrowings

among numerous populations.

We developed here an ABC framework to study the links between genetic

transmission (vertical with or without gene flow) and linguistic transmission (vertical

and/or horizontal) under a large number of possible complex scenarios. This

framework aims at choosing among different historical scenarios and inferring the

best parameters for the chosen scenarios, using linguistic and genetic data sets. Since

ABC methods require extensive simulations, we developed a new efficient linguistic

simulation program to simulate linguistic trees with possible borrowing and admixture

events among linguistic varieties or populations, generating ultimately simulated

cognate lists in each population. Cognates are homologous words with the same

etymological origin and the same meaning. They are usually obtained by comparing

word lists among populations or linguistic varieties, such as the 207-word list

designed by Swadesh (Swadesh, 1952). They have been previously used as cultural

markers of evolution (Bouckaert et al., 2012; Gray et al., 2009). In parallel to this

novel “language” simulator, we used the program fastsimcoal 2.5.1 (Excoffier and

Foll, 2011) to simulate large genetic polymorphism data sets under complex

demographic histories.

We specifically applied this novel inference framework to Central Asia, which

represents an ideal setting for the investigation of gene-language coevolution. A

complex history of settlements, migration waves and admixture events, expansions,

and secondary contacts, has shaped the genetic and linguistic diversity of

populations in this area (Palstra et al., 2015). Central Asian populations belong to at

least two linguistically and genetically contrasted groups: Turkic speaking and Indo-

Iranian speaking populations (Martínez-Cruz et al., 2011). Since they often live

nearby, we expected gene flow and/or vocabulary exchanges between them (Martínez-

Cruz et al., 2011; Palstra et al., 2015).

We obtained linguistic and genetic data for 21 populations (11 Turkic speaking

and 10 Indo-Iranian speaking). We focused on two specific populations: the Uzbek

speaking population from the district of Soj-Mahalla in the city of Andizhan in the

Fergana valley (abbreviated UZA), and the Yagnob speaking population from the

Yagnob valley (abbreviated TJY). Linguistic replacement was hypothesized to explain

the previously observed mismatch between linguistic and genetic clustering of the

UZA population (Martínez-Cruz et al., 2011). In contrast, the TJY population is

assumed to be linguistically and genetically isolated from the other Indo-Iranian

speakers of this region due to its geographical isolation in valleys difficult to reach

(Gunya et al., 2002). These populations represent therefore two separate relevant case

studies to apply our new framework and contrast genetic and linguistic histories. We

reconstructed these histories separately for each population, and compared the

obtained inferences a posteriori. We focused on the chronology of genetic and

linguistic splits, and on the respective levels of genetic and linguistic exchanges.

2. Material

We studied the genetic and linguistic diversities of 21 Central Asian localities,

sampled in three countries (Uzbekistan, Tajikistan and Kyrgyzstan): 11 Turkic

speaking populations and 10 Indo-Iranian speaking populations (Figure I.1). The

national ethics committees of each country of sampling and the French research

ministry approved the study. All sampled individuals provided appropriate informed

consent.

2.1. Genetic data

We used previously published genetic data from these 21 populations (Aimé et

al., 2014), for a total of 643 individuals (24 to 49 individuals per population, see Table

S1), genotyped for 26 autosomal microsatellite markers that showed no significant

deviation from Hardy-Weinberg equilibrium and extremely low pairwise linkage

disequilibrium (Martínez-Cruz et al., 2011). All sampled individuals included in our

study were no closer than second-degree cousins (Martínez-Cruz et al., 2011).

2.2. Linguistic data

We obtained linguistic data from the same 21 populations, using phonetic

transcriptions on a subset of the individuals also sampled for DNA. Between one and

seven individuals per population took part in the linguistic questionnaires,

amounting to 74 individuals in total. For each individual, we recorded up to 185 words

corresponding to 185 meanings extracted from the classic extended Swadesh list

(Swadesh, 1952). For detailed linguistic data collection procedures, see (Mennecier et

al., 2016b).

We consider as “cognate” a group of words with the same etymological origin

and the same meaning, such words being more likely to be related by a common

ancestry (Bouckaert et al., 2012). For example, the words “un” in French and “uno” in

Spanish belong to the same cognate: they have the same meaning, the number one,

and the same origin from the Latin “unus”. The words “papillon” in French and

“mariposa” in Spanish do not belong to the same cognate: they have the same meaning,

butterfly, but not the same etymological origin. The classification into cognates was

performed by Philippe Mennecier following previous work (Mennecier et al., 2016b).

Figure I.1 – Geographical distribution of the 21 populations and linguistic varieties under study, with 11 Turkic speaking populations (yellow circles) and 10 Indo-Iranian speaking populations (blue circles). The red arrows indicate the Uzbek speaking population from Soj-Mahalla (UZA) and the Yagnob speaking population (TJY).

Due to the low number of individuals sampled in each population, we did not

take into account the inter-individual cognate variability within populations. Instead,

we considered, for each word, only the most frequent cognate for each population,

reducing our linguistic dataset to a single cognate list per population, namely a

“linguistic variety”.
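
The reduction described above can be sketched as follows. This is a minimal illustration in Python, not the authors' script: the function name `consensus_variety`, the cognate identifiers, and the tie-breaking rule are assumptions introduced here (the text does not say how ties between equally frequent cognates were resolved).

```python
from collections import Counter

def consensus_variety(answers):
    """Reduce several speakers' cognate lists to one list per population.

    `answers` maps a meaning to the cognate identifiers given by the sampled
    speakers (None marks a missing answer). The result maps each meaning to
    its most frequent cognate, or to None when nobody answered.
    """
    variety = {}
    for meaning, cognates in answers.items():
        counts = Counter(c for c in cognates if c is not None)
        if not counts:
            variety[meaning] = None
            continue
        top = max(counts.values())
        # Assumption: ties are broken by taking the smallest identifier,
        # only so that the result is deterministic.
        variety[meaning] = min(c for c, n in counts.items() if n == top)
    return variety

# Hypothetical population with three speakers and three meanings.
example = {
    "one": ["c1", "c1", "c2"],
    "butterfly": ["c7", None, "c7"],
    "water": [None, None, None],
}
variety = consensus_variety(example)
```

With between one and seven speakers per population, ties are likely to occur in practice, so the tie-breaking choice would need to be stated explicitly in any real analysis.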

3. Methods

3.1. Genetic and Linguistic Dissimilarities among Populations

Using the 26 microsatellites, we computed pairwise FST values (Weir and

Cockerham, 1984) among the 21 populations using the Geneland R package (Guillot

et al., 2005). We tested their significance using 1,000 permutations of individuals

between populations (Excoffier and Lischer, 2010), with a significance level

α = 2.3×10⁻⁴ after Bonferroni correction for multiple testing. We set non-significant

FST values to zero. For the linguistic data, we computed the pairwise Manhattan

distances (R script in Repository) among the 21 populations, assuming, for each

meaning separately, a distance equal to 0 for the same cognate and 1 for different

cognates. Then, we constructed two weighted consensus trees, from the genetic and

linguistic dissimilarity matrices respectively, using the neighbour-joining algorithm

BioNJ (Gascuel, 1997) implemented in the R package ape (Paradis et al., 2004),

performing 1,000 bootstraps of populations for each tree. We set negative branch

lengths to zero. We performed Mantel tests (Mantel, 1967) between genetic distances

and linguistic distances using the R package ade4 (Thioulouse et al., 1997), testing

their significance with 10,000 permutations of both genetic and linguistic distances.

This was done on all pairs of populations, then on all pairs of Turkic speaking

populations (with or without the UZA population), and on all pairs of Indo-Iranian

speaking populations (with or without the TJY population).
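
As an illustration of the two linguistic steps above, the sketch below computes the 0/1 Manhattan distance between cognate lists and runs a simple one-sided Mantel permutation test. It is plain Python written for this purpose, not the R script from the repository nor the ade4 implementation; all function names and the toy matrices are assumptions. The permutation scheme shown (jointly relabelling the rows and columns of one matrix) is the standard Mantel procedure and may differ in detail from the one used in the study.

```python
import random

def manhattan_distance(variety_a, variety_b):
    """Number of meanings whose cognates differ (0 if same, 1 if different)."""
    return sum(1 for a, b in zip(variety_a, variety_b) if a != b)

def distance_matrix(varieties):
    n = len(varieties)
    return [[manhattan_distance(varieties[i], varieties[j]) for j in range(n)]
            for i in range(n)]

def _upper(matrix, order):
    """Upper triangle of `matrix` after relabelling populations by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mantel(m1, m2, permutations=999, seed=0):
    """Return (correlation, one-sided p-value) between two distance matrices."""
    rng = random.Random(seed)
    ids = list(range(len(m1)))
    x = _upper(m1, ids)
    observed = _pearson(x, _upper(m2, ids))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        perm = ids[:]
        rng.shuffle(perm)
        if _pearson(x, _upper(m2, perm)) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)

# Toy data: four hypothetical varieties over five meanings, and a made-up FST matrix.
varieties = [[1, 1, 1, 1, 1], [1, 1, 1, 1, 2], [3, 3, 3, 1, 2], [3, 3, 3, 4, 2]]
ling = distance_matrix(varieties)
fst = [[0.00, 0.01, 0.05, 0.06],
       [0.01, 0.00, 0.04, 0.05],
       [0.05, 0.04, 0.00, 0.01],
       [0.06, 0.05, 0.01, 0.00]]
r, p = mantel(ling, fst)
```

With only four populations there are just 24 possible relabellings, so the p-value of this toy example is coarse; the 21 populations of the study leave far more room for permutation testing.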

3.2. Approximate Bayesian Computation (ABC)

Using the genetic and linguistic data, we investigated the genetic and linguistic

histories of Central Asia using an ABC framework (Beaumont et al., 2002; Tavaré et

al., 1997). In short, we generated a large number of simulated data sets under several

competing scenarios, the parameters of each scenario being drawn randomly from

prior distributions. We then computed summary statistics for each simulated dataset.

We evaluated the proximity between the observed and the simulated summary

statistics to select the most likely scenario. We then inferred the posterior

distribution of each parameter for the most likely scenario. Since we did not assume a

priori that the genetic and linguistic histories were linked, we performed the

simulations and the ABC procedures separately for each type of data. Genetic data

were simulated using fastsimcoal 2.5.1. For details about the genetic model, the

priors of the parameters, and the summary statistics, see supplementary information.
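
The general ABC loop summarized above (draw from the prior, simulate, summarize, compare) can be sketched with the simplest variant, rejection ABC. This is a deliberately swapped-in technique for illustration: the study itself uses Random Forest for scenario choice and a Neural Network for parameter estimation. The toy simulator, the prior bounds, the tolerance, and every name below are assumptions.

```python
import random

def simulate_changed(mu, generations, n_meanings, rng):
    """Toy simulator: number of meanings whose cognate changed at least once."""
    p_changed = 1.0 - (1.0 - mu) ** generations
    return sum(rng.random() < p_changed for _ in range(n_meanings))

def abc_rejection(observed, n_draws, tolerance, rng,
                  prior=(0.0, 0.01), generations=100, n_meanings=185):
    """Keep the prior draws whose simulated summary is close to the observed one."""
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(*prior)                                   # draw from the prior
        simulated = simulate_changed(mu, generations, n_meanings, rng)  # simulate
        if abs(simulated - observed) <= tolerance:                 # compare summaries
            accepted.append(mu)                                    # approximate posterior
    return accepted

rng = random.Random(1)
true_mu = 0.004  # hypothetical value used to create pseudo-observed data
observed = simulate_changed(true_mu, 100, 185, rng)
posterior = abc_rejection(observed, n_draws=5000, tolerance=3, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values approximate the posterior distribution of the parameter; regression-based or machine-learning variants such as those used in this chapter make better use of the simulations than plain rejection does.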

3.2.1. Linguistic model

We extended the model of Gray and Atkinson (Gray and Atkinson, 2002), with

substantial modifications (Figure S1), assuming:

1) Cognate evolution was tree-like, with possibilities of borrowing or admixture

between branches;

2) Each cognate corresponded to exactly one word, to be consistent with the

format of our dataset;

3) There was an infinite number of possible cognates, and a cognate may appear

only once.

We developed the C++ program PopLingSim (script in Repository, see

Appendix) using the CodeBlocks software to simulate cognate variation data. Each

linguistic variety carries a set of cognates. At each linguistic generation time, each

cognate i of each variety may change to a new cognate (it adopts a completely new

identifier) with probability μL. The linguistic generation time is not necessarily on the

same absolute time-scale as the genetic generation time.
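
The mutation step described above can be sketched in a few lines of Python. This is an illustrative toy, not PopLingSim: it covers only a single split without borrowing or admixture, and the function names, the shared identifier counter, and all parameter values are assumptions chosen for the example.

```python
import itertools
import random

def evolve(variety, mu, generations, new_id, rng):
    """Return a copy of `variety` after `generations` linguistic generations.

    At each generation every cognate is replaced, with probability mu, by an
    identifier that has never been used before (infinitely many cognates).
    """
    variety = list(variety)
    for _ in range(generations):
        for i in range(len(variety)):
            if rng.random() < mu:
                variety[i] = next(new_id)
    return variety

def simulate_split(n_meanings, mu, t0, rng):
    """Two varieties diverging from one ancestor t0 generations ago."""
    new_id = itertools.count()  # shared counter: a cognate can appear only once
    ancestor = [next(new_id) for _ in range(n_meanings)]
    left = evolve(ancestor, mu, t0, new_id, rng)
    right = evolve(ancestor, mu, t0, new_id, rng)
    return left, right

rng = random.Random(42)
left, right = simulate_split(n_meanings=185, mu=0.001, t0=200, rng=rng)
shared = sum(a == b for a, b in zip(left, right))  # cognates still in common
```

Because every new cognate receives a fresh identifier, two varieties can share a cognate only by inheriting it unchanged from their common ancestor, which is what makes the count of shared cognates informative about divergence time.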

3.2.2. Scenarios of linguistic and genetic origins of the UZA population

Numerous linguistic and genetic scenarios could potentially describe the origin of the

UZA population relative to the other Central Asian populations. We aimed at

evaluating (i) the linguistic and genetic origins of the studied populations and (ii) the

linguistic and genetic exchanges between the UZA population and the other Central

Asian populations. We chose to consider a set of scenarios, addressing these questions

specifically. We performed separately three-population analyses for the genetic case

and three-linguistic-variety analyses for the linguistic case, in which we tested five

possible scenarios respectively (Figure I.2).

3.2.3. Scenarios of linguistic and genetic isolation of the Yagnob speaking

populations

In this case, we aimed (i) at evaluating whether the TJY population is

genetically and/or linguistically isolated, and (ii) at estimating the linguistic and

genetic exchanges between this population and the other Indo-Iranian speaking

populations. We chose to consider two scenarios either with or without genetic

migration or linguistic borrowing, respectively (Figure S3). Indeed, the TJY linguistic

variety is a subset of the Yagnob language, known to derive from the other Indo-

Iranian languages and to have recently started to resist linguistic changes (D’Errico

and Hombert, 2009).

3.2.4. Choice of scenarios and estimation of parameters

We defined a triplet of populations (resp. linguistic varieties) as a combination

of (i) the UZA or the TJY population (resp. variety), (ii) one of the nine Indo-Iranian

speaking populations (resp. varieties), excluding TJY, and (iii) one of the eight Turkic

speaking populations (resp. varieties), excluding UZA, UZB or UZT. This led to 72

analyses (9 × 8 combinations), each considering a different triplet of populations (resp. linguistic varieties).
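
The count of 72 follows from pairing each of the nine Indo-Iranian populations with each of the eight Turkic populations. A minimal sketch of that enumeration is given below; apart from UZA, UZB, UZT and TJY, which are named in the text, the population codes are hypothetical placeholders.

```python
from itertools import product

# Placeholder codes: only UZA, UZB, UZT and TJY are named in the text.
turkic = ["UZA", "UZB", "UZT"] + [f"TC{i}" for i in range(1, 9)]   # 11 populations
indo_iranian = ["TJY"] + [f"II{i}" for i in range(1, 10)]          # 10 populations

def triplets(focal):
    """All (focal, Indo-Iranian, Turkic) triplets for a focal population."""
    ii = [p for p in indo_iranian if p != "TJY"]
    tc = [p for p in turkic if p not in {"UZA", "UZB", "UZT"}]
    return [(focal, i, t) for i, t in product(ii, tc)]

uza_triplets = triplets("UZA")
tjy_triplets = triplets("TJY")
```

Running the same scenario comparison on every triplet gives 72 separate decisions per focal population, which is how the results are tallied in the following sections.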

Figure I.2 – Five competing scenarios for the origin of the UZA population, tested independently for linguistic and genetic history. In scenarios A and B, the ancestral Indo-Iranian and Turkic speaking populations (resp. varieties) split at time t0. At time t1, the ancestral UZA population (resp. variety) diverged from the Turkic lineage. Subsequent migration (resp. borrowing) events between the Indo-Iranian speaking populations (resp. varieties) and the UZA population (resp. variety) occurred in scenario B. In scenarios C and D, the ancestral Indo-Iranian and Turkic speaking populations (resp. varieties) split at time t0. At time t1, the ancestral UZA population (resp. variety) diverged from the Indo-Iranian lineage. Subsequent migration (resp. borrowing) events between the Turkic speaking populations (resp. varieties) and the UZA population (resp. variety) occurred in scenario D. In scenario E, the ancestral Indo-Iranian and Turkic speaking populations (resp. varieties) split at time t0. At time t1, the ancestral UZA population (resp. variety) resulted from an admixture event between these two lineages. Abbreviations: Tc = Turkic speaking population. UZA = UZA population from the Soj-Mahalla region in Uzbekistan. I-I = Indo-Iranian speaking population.

For each triplet, we conducted an ABC analysis to determine the best historical

scenario for the genetic and linguistic cases respectively using a Random Forest (RF)

algorithm, and then estimated the parameters of the selected scenarios with a Neural

Network (NN) algorithm (see Appendix).

4. Results

4.1. Central Asian linguistic and genetic structures

As shown in previous studies (Martínez-Cruz et al., 2011), the Indo-Iranian

speaking populations had higher genetic differentiation levels than the Turkic

speaking populations. Indeed, 47 pairwise FST values out of 55 were significantly

different from zero for the ten Indo-Iranian speaking populations, while it was the

case for only 14 pairwise FST out of 45 for the nine Turkic speaking populations.

The neighbour-joining tree analyses based either on genetic or linguistic data

showed a structure with two groups (Figure I.3), corresponding to the two main

linguistic families. This result can also be visualized with the distance matrices

(Figure S3). The Turkic linguistic variety UZA was closer to the other Turkic

varieties than to the Indo-Iranian varieties, but the UZA population was

genetically closer to the Indo-Iranian speaking populations than to the

Turkic speaking populations. The TJY population was both linguistically and

genetically distant from the other Indo-Iranian speaking populations, and even more

distant from the Turkic speaking populations.

Consistent with these results showing a clear overlap between the linguistic

and genetic groups, the Mantel test between linguistic and genetic distances computed

on all pairs of populations was significant (p = 0.0001, with α = 0.01 for five tests

after Bonferroni correction for multiple testing). However, it was not significant

(p = 0.112) among Turkic populations only, unless we excluded the UZA population

(p = 0.0091). The Mantel test including only the Indo-Iranian speaking populations

was not significant, whether including the TJY population or not (respectively

p = 0.0821 and p = 0.7449).
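
The two Bonferroni thresholds used in this chapter can be recovered by dividing a family-wise level by the number of tests. The family-wise level of 0.05 is an assumption here: it is not stated explicitly in the text, but it is consistent with both reported thresholds (five Mantel tests, and all pairs among 21 populations for the FST tests).

```latex
\alpha_{\text{Mantel}} = \frac{0.05}{5} = 0.01,
\qquad
\alpha_{F_{ST}} = \frac{0.05}{\binom{21}{2}} = \frac{0.05}{210} \approx 2.38 \times 10^{-4}.
```

Under the first threshold, p = 0.0001 and p = 0.0091 are significant, whereas p = 0.112, p = 0.0821 and p = 0.7449 are not. The second value matches the threshold quoted in Section 3.1 up to rounding.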

4.2. Model selection and parameter estimations for the UZA

population

According to the RF algorithm, the most supported scenario was an admixed

origin of the UZA population, both for genetic and linguistic data, with 36 decisions

out of the 72 tested population triplets (Figure I.4c) and 55 decisions (Figure I.4a)

respectively.

Figure I.3 – Neighbour-joining trees based on (a) the linguistic distance matrix and (b) the pairwise FST matrix, with 11 Turkic speaking populations (in yellow/light grey) and 10 Indo-Iranian speaking populations (in blue/dark grey). The values at each node correspond to the number of bootstrap trees containing this node among 1,000 permutations. The red arrows indicate the UZA and the TJY populations, under specific scrutiny using Approximate Bayesian Computation inferences.

The modal estimates of the admixture rates were different between the linguistic

and genetic data. We found a strong bias toward the Turkic linguistic varieties with

r̂L = 0.093 (95% CI 0.02-0.18, Figure I.4b), and a more balanced genetic

admixture, with r̂G = 0.48 (95% CI 0.05-0.95, Figure I.4d). It was difficult to

compare the divergence times t0 (ancient) and t1 (recent) directly between linguistic

and genetic processes, because the linguistic generation time was not necessarily on

the same absolute time scale as the genetic generation time. Therefore, we

compared the ratios t1/t0 between the genetic and linguistic histories, and found that they

differed by an order of magnitude, with a linguistic t̂1/t̂0 = 0.038 (95% CI 0.002-0.08)

and a genetic t̂1/t̂0 = 0.30 (95% CI 0.04-0.95).
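
The reason for comparing ratios rather than raw times can be made explicit. Suppose, as an assumption introduced only for this illustration, that one generation lasts g years (an unknown value that may differ between the linguistic and the genetic process), so that a time of t generations corresponds to T = g t years. The generation length then cancels out of the ratio, and the two reported estimates can be compared directly:

```latex
\frac{T_1}{T_0} = \frac{g\, t_1}{g\, t_0} = \frac{t_1}{t_0},
\qquad
\frac{(\hat t_1/\hat t_0)_{\text{genetic}}}{(\hat t_1/\hat t_0)_{\text{linguistic}}}
  = \frac{0.30}{0.038} \approx 7.9 .
```

This holds within each data type as long as the generation length is constant over the period considered, which is itself a modelling assumption.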

Figure I.4 – ABC analyses for the UZA population. (a) Decisions over the Random Forest analysis of 72 triplets for the selection of the linguistic scenarios. (b) Priors (dotted line) and posteriors (solid line) of the parameters t1/t0 and rL estimated from the linguistic simulations of scenario E. (c) Decisions over the Random Forest analysis of 72 triplets for the selection of the genetic scenarios. (d) Priors (dotted line) and posteriors (solid line) of the parameters t1/t0 and rG estimated from the genetic simulations of scenario E.

Finally, we estimated an effective population size of 82,173 (95% CI 13,608-98,179)

for the UZA population, and lower effective population sizes for the Turkic

and the Indo-Iranian speaking populations, respectively [N̂0 = 16,862 (95% CI 6,399-87,812)

and N̂2 = 28,382 (95% CI 8,124-95,255)]. The increased estimated effective

population size in the UZA was likely due to the admixture process itself, which

increased the genetic diversity in the admixed population compared to each source

(Long, 1991).
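
The effect of admixture on diversity can be illustrated with a toy single-locus calculation. This is a sketch under strong simplifying assumptions, not the analysis of the study: the two-allele frequencies are invented, real microsatellites have many alleles, and only the admixture proportion of 0.48 is taken from the estimate reported above.

```python
def heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) at one locus."""
    return 1.0 - sum(p * p for p in freqs)

def admix(freqs_a, freqs_b, r):
    """Allele frequencies after admixture, with proportion r from source A."""
    return [r * a + (1.0 - r) * b for a, b in zip(freqs_a, freqs_b)]

# Hypothetical, strongly differentiated source populations.
source_a = [0.9, 0.1]
source_b = [0.1, 0.9]
admixed = admix(source_a, source_b, r=0.48)

h_a, h_b, h_mix = (heterozygosity(f) for f in (source_a, source_b, admixed))
```

In this example the admixed population is more diverse than either source, and a model that reads diversity as effective size would therefore return a larger estimate for it.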

4.3. Model selection and parameters estimation for the TJY

population

Genetically, the RF algorithm unambiguously supported the scenario of a strict

isolation of the TJY population (72 decisions out of 72). Conversely, the two scenarios

(split with or without subsequent migration) were almost equally supported

linguistically, with 37 and 35 decisions out of 72, respectively.

Since the two scenarios were almost equally supported for the linguistic case, we

performed the parameter estimations in both cases. The estimated split time ratios of

the TJY linguistic variety from the other Indo-Iranian linguistic varieties were similar

between the two scenarios: t̂1/t̂0 = 0.12 (95% CI 0.02-0.30, Figure S7) for the

isolation scenario 1, and t̂1/t̂0 = 0.15 (95% CI 0.002-0.97, Figure S7) for the

non-isolation scenario 2. Under the latter scenario, the estimated borrowing rate between the

TJY linguistic variety and the Indo-Iranian linguistic varieties was quite low:

δ̂L = 0.004 (95% CI 0.0009-0.019). This meant that each cognate was borrowed with a

probability of 0.4% at each linguistic generation since the split t1, a low estimate given

that the prior was uniform on [0, 0.1].
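
To give a sense of what a per-generation borrowing probability of 0.4% implies over time, the sketch below computes the chance that a given cognate is borrowed at least once over n generations. It assumes a constant rate and independence between generations; the numbers of generations are hypothetical, since the text does not report t1 in absolute units.

```python
def prob_borrowed_at_least_once(delta, generations):
    """1 - (1 - delta)^n: chance of at least one borrowing in n generations."""
    return 1.0 - (1.0 - delta) ** generations

delta_hat = 0.004  # modal estimate reported in the text

# Hypothetical numbers of linguistic generations since the split.
for n in (10, 100, 500):
    print(n, round(prob_borrowed_at_least_once(delta_hat, n), 3))
```

Even a low per-generation rate accumulates: over 100 generations, roughly one cognate in three would have been borrowed at least once under these assumptions.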

The estimated ratio of split times based on the genetic data, assuming an

isolation scenario, was much higher than for the linguistic data, with t̂1/t̂0 = 0.77

(95% CI 0.11-0.97, Table S4) compared to t̂1/t̂0 = 0.12 (95% CI 0.02-0.30, Table

S4). The estimated effective population size of the TJY population was 10,280 (95%

CI 2,534-75,516), lower than the estimated effective population size of 50,561 (95%

CI 12,909-97,032) for the other Indo-Iranian speaking populations.

5. Discussion

In this study, we built a new flexible simulator of cognate data under historical

models encompassing divergences and multiple borrowings and admixture events

between linguistic varieties. Using in parallel an existing population genetic data

simulation program, we developed an ABC framework to infer the most

probable genetic and linguistic histories and estimate their underlying parameters,

using both types of data sampled in the same populations. We used this new

framework to reconstruct the evolutionary scenarios underlying linguistic and genetic

diversities of a range of populations from Central Asia.

5.1. Two different linguistic and genetic admixture histories for

the Soj-Mahalla Uzbek speakers

We tested five possible genetic and linguistic scenarios to investigate the

relation between the UZA population and the other populations of the area, i.e. the

Indo-Iranian speaking populations and the other Turkic speaking populations. The

UZA population appeared to result from a similar general process of split and

admixture for both genetic and linguistic data. Nevertheless, these processes differed

in their chronology and in the intensity of the admixture process.

The ratio t̂1/t̂0 was indeed an order of magnitude higher for the genetic

scenario than for the linguistic scenario. Assuming that the genetic and linguistic

admixture events happened synchronously, the ancestral linguistic divergence

happened long before the ancestral genetic divergence. Conversely, assuming that the

ancestral genetic and linguistic divergences happened synchronously, the genetic

admixture event was older than the linguistic admixture event.

According to historical records (Soucek, 2000), the first hypothesis seems more

plausible: the recent Turkic speaking population invasions probably led to a linguistic

shift (D’Errico and Hombert, 2009). This shift seems to result from an admixture

between the Indo-Iranian and Turkic vocabularies, strongly biased toward the latter,

rather than a complete linguistic replacement as previously proposed (Martínez-Cruz

et al., 2011). Conversely, the estimated proportions of genes inherited from each

group appeared to be similar. Previous studies also indicate a low rate of genetic

replacement in Central Asia (Zerjal et al., 2003), in agreement with a cultural

diffusion through trading routes (e.g. the Silk Road) but without extensive genetic

exchanges (Palstra et al., 2015).

5.2. Stronger genetic than linguistic isolation in the Tajikistan

Yagnob speakers

We found that the scenario of genetic divergence without subsequent gene flow

was the most supported for the genetic data. Conversely, for the linguistic data, we could

not assess whether the linguistic divergence was followed by vocabulary borrowings

or not, as both scenarios appeared equally likely. If borrowings occurred, they

would have nevertheless been very limited, as shown by the low estimated borrowing

rate of 0.4%.

Interestingly, we estimated, as above, a t̂1/t̂0 higher for the genetic scenario

than for the linguistic scenario. If the genetic and linguistic divergences between the

ancestral populations happened synchronously, then the linguistic divergence between

the ancestors of the TJY and the other Indo-Iranian speaking populations occurred

much more recently than the genetic divergence. Conversely, assuming the genetic

and linguistic divergences between the TJY ancestral population and the ancestors of

the other Indo-Iranian speaking populations happened synchronously, then the

divergence between the ancestral populations would be more ancient linguistically

than genetically.

Whichever scenario we considered, we showed limited linguistic exchanges and

no genetic exchanges, which indicated that cultural exchanges were maintained

without genetic exchanges, potentially through commercial relationships (Renfrew,

1987). Cultural norms may limit genetic exchanges between populations without

limiting cultural exchanges, as is frequently the case in Central Asia (Heyer et al.,

2009). Indeed, ethnic constructs may produce endogamy rules, which limit the

probability of inter-marriages between groups. Economic relationships, geographical

proximity, and migration may favour cultural exchanges despite this genetic isolation.

5.3. Conclusions and Perspectives

In this study, we investigated the coevolution between genes and languages at a

regional scale. Genetic and linguistic diversities result, respectively, from the

demographic and cultural histories of the populations. Using separately one or the

other type of data may implicitly assume that demographic and cultural histories are

linked (Amorim et al., 2013; Gray et al., 2009). We showed that these histories can

differ substantially, as pointed out also by other authors (Creanza et al., 2015; Hunley

et al., 2012; Steele and Kandler, 2010). We did not assume a strict parallelism

between genetic and linguistic evolutions. On the contrary, our approach allowed us to

highlight discrepancies between genetic and linguistic inferences and to provide new

insights into the history of the studied populations.

As pointed out by Cavalli-Sforza et al. (Cavalli-Sforza et al., 1992), the

parallelism between genetic and linguistic evolutions should be weaker at a local scale

than at a more global, worldwide, scale. This is likely due to an intrinsic difference

between genes and languages: the former can only be transmitted vertically while the

latter can be transmitted vertically, horizontally and obliquely (Tao Gong, 2010).

Nevertheless, strong links between genetic and linguistic histories may also be

observed at a local scale in some cases (Lansing et al., 2007), whereas strong

discrepancies may be observed at a larger scale (Creanza et al., 2015; Hunley et al.,

2012). Thus, congruence or not between linguistic and genetic evolutions should be

studied case by case, as the very histories of the populations under study may differ.

Several extensions will be possible for our model. We assumed here a neutral

linguistic evolution, where each word evolved independently with its own mutation

rate, and where no burst of innovation occurred. Relaxing these assumptions could

improve our knowledge of language evolution. It may also allow us to perform better

inferences of parameters such as borrowing rates or divergence times among linguistic

varieties. Moreover, we assumed a model of linguistic evolution with discrete

generations. The linguistic generation time is not easily defined; we showed that it is

not strictly equivalent to demographic generation times. Finally, a linguistic sampling

at the individual scale could make it possible to build and study a much wider range

of models of evolution, and would also make it possible to compare genetic and linguistic data

at the individual level, which cannot be achieved when considering population

language varieties. This type of model should allow us to better understand linguistic

and genetic evolutions, and the potential links between them.

Chapter II – Inferring linguistic transmission between generations at the scale of individuals

Valentin Thouzeau†, Antonin Affholder†, Philippe Mennecier†, Paul Verdu†,1, Frédéric

Austerlitz†,1

Ethnobiologie, Paris 75016, France

1 These authors equally supervised this work

This article is currently in preparation.

1. Introduction

Linguistic data have been extensively used recently in computational

frameworks to reconstruct some aspects of the history of human populations

(Atkinson, 2011; Bouckaert et al., 2012; Gray and Atkinson, 2002; Pagel et al., 2013).

These data consist mainly of the presence or absence of items in lists within a

given set of contemporaneous languages, as in databases like the World Atlas of

Language Structures WALS (Dryer and Haspelmath, 2013), or the Global Database of

Cultural, Linguistic and Environmental Diversity D-PLACE (Kirby et al., 2016).

Most computational studies aiming at reconstructing language histories from current

linguistic data consider languages at a macro-evolutionary scale. For instance, Gray

and Atkinson (2002) used a series of Swadesh lists for 87 languages to investigate the

origin of the Indo-European linguistic family. Atkinson (2011) studied the number of

phonemes used in 504 languages worldwide to test the hypothesis of a serial founder

effect due to the Out-Of-Africa expansion. Reesink et al. (2009) studied the linguistic

diversity of the ancient Sahul continent (present-day Australia, New Guinea, and

surrounding islands) across 121 languages using diverse structural features.

These approaches rely implicitly on several assumptions. They require

primarily a clear division between several differentiated languages, as a set of discrete

units. Nevertheless, this notion of distinct languages is sometimes irrelevant at a local

scale, in particular in a context of dialectal continuum or linguistic contacts (Heeringa

and Nerbonne, 2001; Livingstone and Fyfe, 1999). These studies thus do not take into

account the within-population linguistic diversity, since traditional linguistics often

considers the languages as unique and coherent systems (Pateman, 1983). Indeed,

only a small number of linguists in the field systematically record the linguistic

diversity within a given language community. Sampling campaigns are mainly

conducted at the language scale, hiding the intra-language diversity.

This implies the loss of a large amount of information, given that

demographic phenomena at the population level (different population sizes,

bottlenecks, expansions) are expected to play a major role in language evolution

(Vogt, 2009). However, these phenomena are rarely taken into account by the models

reconstructing the history of languages. Including contemporaneous within-population

linguistic diversity in the reconstruction of the demographic history of human

populations at a local scale should thus open a whole new dimension into the field of

historical linguistic inferences.

Croft (1996) thus argued for replacing the ‘essentialist’ theory of

language change with a ‘population’ approach to language change. He proposed

a review of the “evolutionary linguistics” field (Croft, 2008), detailing some work

developed in an evolutionary historical linguistics framework. Nevertheless, very few

studies deal with the contemporaneous within-population linguistic diversity in a

historical-reconstruction perspective. Some recent examples include the use of

surnames in Austria as contemporaneous linguistic information (Barrai et al., 2000),

the use of family names in different contexts (Darlu

et al., 2012), or the use of the proportion of African words in free speech in Cape Verdean

Kriolu (Verdu et al., 2017).

Historical linguistic inferences imply knowledge about the causal mechanisms

linking the observed data to the set of possible historical scenarios that produced

these observed data. Nevertheless, there is no consensual theoretical framework able

to handle within-population linguistic diversity data in order to infer the underlying

historical scenarios and evolutionary mechanisms. It is indeed impossible to simply

assume a clear and delimited mechanism of linguistic evolution, and to then study the

range of historical scenarios that could have produced the observed linguistic data.

We propose, in this article, to evaluate a series of models of linguistic evolution

between generations at the individual scale. These models are based on the personal

linguistic knowledge of each individual, instead of a language external from the

individuals. We thus do not study the history of higher-order objects such as “the

languages”, but the history of the linguistic diversity carried by individuals within the

populations. We aim here at understanding how the linguistic items are transmitted

from generation to generation, as functions of several demographic parameters.

Approximate Bayesian Computation methods (ABC, Beaumont et al., 2002;

Tavaré et al., 1997) provide a particularly well-adapted framework to tackle the

problems presented here. They provide a means of jointly inferring the most likely historical

scenarios among a set of possible ones, along with the mechanisms of linguistic

transmission between generations. We used the recently developed Approximate

Bayesian Computation via Random Forest algorithm to choose among the possible

competing scenarios and estimate the parameters of the “winning” models and

associated scenario (Breiman, 1999; Pudlo et al., 2016).

We implemented, in a computer program, the simulation of historical scenarios

under the models we proposed, and we evaluated the congruence of simulated data

with a real dataset from Central Asia. This dataset consists of 30 individuals sampled

for 185 words across 10 villages in Tajikistan. These villages are known to use the

same language, but with some variability among individuals (Mennecier et al.,

2016a). The analyses of these data provided a proof of the feasibility of the use of

contemporaneous within-population linguistic diversity to infer historical features of

the cultural evolution of a human population.

2. Models

2.1. Production of utterances

We considered a linguistic population as a group of individuals who may

potentially interact through communication. The mechanisms of linguistic

communications and linguistic transmissions may follow different modalities, which

correspond to different models of linguistic evolution. Nevertheless, we consider that

the unit of linguistic communication is the utterance, a production of linguistic items

associated with a meaning.

Each linguistic item is a possible version from a class. For example, the words

“Multa” and “Papillon” are two items of the class Butterfly, and one or another may

be used during an utterance to express the same meaning. In linguistics, a class of

items is called a paradigm.

Here, cognates are specific to a context and an individual. This is different from

cognates sampled at the language scale, for which individuals are considered as users

of the language instead of producers of the language.

During the field work, the protocol of linguistic recording is an act of

communication through utterance. Despite the unusual setting of the linguistic

questionnaire, the utterances produced by the individuals are considered like any other

act of utterance that the individuals may produce during their lifespan.

2.2. Four models of acquisition of a new language

We developed an individual-based forward-in-time simulation model, in which

we assumed that populations were composed of only two types of individuals:

“learners” and “teachers”. Moreover, we assumed that the rules of utterance

productions of a teacher depend only on the utterances he/she heard when he/she was

a learner. We assumed that each learner chooses only one item from each class during

the learning phase. Two learners may choose the same linguistic item. After the whole

learning phase, each teacher is discarded and each learner becomes a teacher.

We tested here four models of linguistic acquisition during learning (Figure

II.1). The models differ in the number of teachers involved in

language acquisition, and in the relative roles of these teachers.

In the first model, named the “Clonal” model, each learner selects a teacher at

random and copies “clonally” every item that he/she produces. In the second model,

named the “Sexual” model, two teachers (a male and a female) are assigned at

random to each learner. He/she then copies directly the first half of the items produced

by the male, and the second half of the items produced by the female. Thus, one

fixed half of the items is always transmitted by males, and the other

fixed half is always transmitted by females. In the third model, named

the “Sexual2” model, each learner selects two teachers (a male and a female) at

random. For each class, he/she copies at random the item from the male or the item

from the female. No item is transmitted only by males or only by females: every item is

transmitted from one parent chosen at random. In the fourth model, named the

“Social” model, for each class each learner copies an item drawn at random from the

items produced by every teacher in the population.

Figure II.1 – Four models of linguistic transmission between generations. Each white circle represents an individual. The utterances that individuals produce depend only on the utterances that their teachers produced at the previous generation, and on the mutations induced during the transmission. Transmission of linguistic items by teachers follows four possible modalities: (a) a “Clonal” model with only one teacher per learner, (b) a “Sexual” model with two teachers associated with a distinct set of vocabulary for each sex, (c) a “Sexual2” model with two teachers without a distinct set of vocabulary for each sex, and (d) a “Social” model with the whole population as teacher for each learner.
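As a rough illustration of the four acquisition rules described above, the sketch below produces one generation of learners from a list of teachers. It is not the PLS2 implementation: the function name `next_generation`, the representation of an individual as a list holding one item per class, and the way teacher sexes are passed in are assumptions made for this example, and mutation is deliberately left out.

```python
import random


def next_generation(teachers, sexes, n_learners, model, rng):
    """Return the item lists of n_learners learners (hypothetical helper).

    teachers: list of item lists, one item per class, for the current teachers.
    sexes: list of "M"/"F" labels, one per teacher (used by the two Sexual models).
    """
    n_classes = len(teachers[0])
    half = n_classes // 2
    males = [t for t, s in zip(teachers, sexes) if s == "M"]
    females = [t for t, s in zip(teachers, sexes) if s == "F"]
    learners = []
    for _ in range(n_learners):
        if model == "Clonal":
            # One teacher, every item copied from that teacher.
            learner = list(rng.choice(teachers))
        elif model == "Sexual":
            # First half always from the male, second half always from the female.
            father, mother = rng.choice(males), rng.choice(females)
            learner = father[:half] + mother[half:]
        elif model == "Sexual2":
            # For each class, the item comes from either parent at random.
            father, mother = rng.choice(males), rng.choice(females)
            learner = [rng.choice((f, m)) for f, m in zip(father, mother)]
        elif model == "Social":
            # For each class, the item comes from a teacher drawn from the whole population.
            learner = [rng.choice(teachers)[j] for j in range(n_classes)]
        else:
            raise ValueError(f"unknown model: {model}")
        learners.append(learner)
    return learners
```

With a population made of one all-"a" male and one all-"b" female, the Clonal rule can only return all-"a" or all-"b" learners, whereas the Sexual rule always returns the same half-and-half list, which is the contrast the four models are meant to capture.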

For each model, the copying process may produce an error and create a

completely new item. We call that type of error a “linguistic mutation”. The mean

mutation rate μL was drawn from a log-uniform distribution, between 10⁻⁶ and 10⁻¹

mutations per lexical item per generation. For each item, its mutation rate was drawn

from a beta distribution with a mean μL and a shape β = 2, allowing us to simulate a set

of linguistic items with different rates of change. We developed a new simulation

software, PopLingSim 2 (PLS2), according to these models of linguistic evolution.
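The two-level draw of mutation rates can be sketched as follows. This is only one possible reading of the text: we assume that “shape β = 2” refers to the second shape parameter of a Beta(α, β) distribution, so that α = μL·β / (1 − μL) gives the required mean μL. The function names are hypothetical and do not come from PLS2.

```python
import math
import random


def draw_mean_mutation_rate(rng, low=1e-6, high=1e-1):
    """Log-uniform draw: uniform on the base-10 exponent, then back-transformed."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def draw_item_rates(mu, n_items, rng, beta=2.0):
    """Per-item mutation rates from Beta(alpha, beta) with mean mu (assumed parametrisation)."""
    # alpha / (alpha + beta) = mu  <=>  alpha = mu * beta / (1 - mu)
    alpha = mu * beta / (1.0 - mu)
    return [rng.betavariate(alpha, beta) for _ in range(n_items)]
```

Under this parametrisation most items receive a rate far below μL and a few receive a much higher one, which is one way to obtain a set of items with different rates of change.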

2.3. Historical scenario

We focused here on a single linguistic population, defined as a language

community, where the individuals have been sampled using a linguistic questionnaire.

Forward in time, this linguistic population evolved with a constant size N0 until

t = 5×N0, a time that, as we visually checked, was sufficient to reach an equilibrium

between the production of linguistic diversity through mutation, and the reduction of

this diversity through random sampling. This population then evolved with a new size

N1 during t0 generations. The linguistic items were then sampled at present day. This

historical scenario allows a range of histories, depending on the relative values of the

parameters N0 and N1 and on the value of t0. The population sizes N0 and N1 were drawn

from a uniform distribution, between 100 and 1000 individuals, this low upper bound

being set to limit the very high computational cost of these forward-in-time

simulation models. Time t0 was drawn from a uniform distribution, between 0 and 1000

generations. The median, the minimum, the maximum, and the quantile 5% of the

priors of the models are summarized in Table II.1.

Figure II.2 – Historical scenario. Its structure depends on the relative values of the parameters N0 and N1. If N0 = N1, we assume a scenario of constant population size. If N0 < N1, we assume a scenario of expansion of the population. If N0 > N1, we assume a scenario of contraction of the population.
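The scenario amounts to a two-phase schedule of population sizes, which the following minimal sketch makes explicit. The function names are hypothetical; only the burn-in length (5×N0 generations), the second phase (t0 generations at size N1) and the three scenario labels come from the text.

```python
def population_sizes(n0, n1, t0):
    """Population size at each generation, forward in time."""
    burn_in = 5 * n0  # equilibrium phase at constant size N0
    return [n0] * burn_in + [n1] * t0


def scenario_type(n0, n1):
    """Label the history implied by the relative values of N0 and N1."""
    if n0 == n1:
        return "constant"
    return "expansion" if n0 < n1 else "contraction"
```

For example, N0 = 100, N1 = 200 and t0 = 30 give 530 simulated generations in total, 500 of which belong to the equilibrium phase; this also shows why most of the computing time is spent before the size change.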

Parameter   Median        Min      Max     Quantile (lower)   Quantile (upper)
N0          550           100      1000    122                978
N1          550           100      1000    122                978
t0          500           0        1000    25                 975
μL          3.165×10⁻⁴    10⁻⁶     10⁻¹    1.35×10⁻⁶          7.73×10⁻²
N0×μL       0.150         10⁻⁴     100     5.25×10⁻⁴          44.5
N1×μL       0.150         10⁻⁴     100     5.25×10⁻⁴          44.5
t0×μL       0.116         0        100     2.80×10⁻⁴          42.0

Table II.1 – Summary of the prior distributions of the parameters for the four models.
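The entries of Table II.1 for the simple parameters can be checked against closed-form quantiles of the uniform and log-uniform priors. In the sketch below, reading the two quantile columns as the 2.5% and 97.5% quantiles is an assumption on our part (the column headers do not state the level), and the helper names are hypothetical.

```python
def uniform_quantile(q, low, high):
    """Quantile of a uniform distribution on [low, high]."""
    return low + q * (high - low)


def log_uniform_quantile(q, low, high):
    """Quantile of a log-uniform distribution on [low, high]."""
    return low * (high / low) ** q


# N0 and N1 ~ U(100, 1000); t0 ~ U(0, 1000); mean mutation rate ~ log-U(1e-6, 1e-1)
for name, quantile, (low, high) in [
    ("N0, N1", uniform_quantile, (100, 1000)),
    ("t0", uniform_quantile, (0, 1000)),
    ("muL", log_uniform_quantile, (1e-6, 1e-1)),
]:
    print(name, [quantile(q, low, high) for q in (0.025, 0.5, 0.975)])
```

Under that assumption the closed-form values for N0, N1 and t0 (122.5, 550, 977.5 and 25, 500, 975) agree with the table up to rounding. The values for μL are close to the tabulated ones but not identical, which would be expected if the table was computed from the simulated draws rather than analytically.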

3. Materials

We sampled 30 individuals from 10 villages in Tajikistan (Figure II.3). For each

individual, we recorded the words used for 185 meanings from an adapted Swadesh

list. We considered as a “cognate” a group of words with the same etymological origin

and the same meaning, such words being more likely to be related by

common ancestry. The classification of lexical data gathered in the field into cognates was

performed by Philippe Mennecier following previous work (2016).

4. Analyses

4.1. Simulations

For each model, we performed 10 000 simulations using our newly-developed

software PopLingSim 2 (PLS2). We parallelized the computations using 250 cores of

the cluster station Genotoul, leading to approximately 90 000 CPU hours. Most of this

computational time was spent during the phase of equilibrium between mutation and

drift of t = 5×N0 generations.

During the process of sampling linguistic items from our simulations, we drew a

number of missing values equal to the number of missing values of our real data set,

to avoid the bias induced by the missingness in the computation of the summary

statistics needed for ABC procedures.

Figure II.3 – Geographical distribution of the 10 sampled units under study.
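The masking step can be sketched as below. The text only states that the number of missing values matches that of the real dataset; placing them uniformly at random over the individual-by-class cells is an assumption of this example, as is the function name `mask_missing`.

```python
import random


def mask_missing(data, n_missing, rng):
    """Return a copy of data with n_missing cells replaced by None.

    data: one list per simulated individual, with one item per class.
    Cells are chosen uniformly at random without replacement (assumption).
    """
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    masked = [list(row) for row in data]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked
```

Summary statistics are then computed on the masked simulated data in the same way as on the observed data, so that both are affected by missingness to the same extent.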

4.2. Summary statistics

We constructed a new set of summary statistics, some of which were inspired

from classical population genetics statistics. After computing $p_{i,j}$, the proportion of

individuals using the item $i$ of the class $j$, we computed the linguistic diversity

$D_j = 1 - \sum_i p_{i,j}^2$, analogous to the gene diversity (Nei, 1987).

Then, we computed:

- The mean linguistic diversity, D;

- The range of the linguistic diversity, R(D);

- The variance of the linguistic diversity, V(D);

- The number of strictly different lists of items, S;

- The mean number of items in each class, N;

- The variance of the number of items in each class, V(N);

- The frequency spectrum of the number of items per class, F.
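The statistics listed above could be computed along the following lines. Several details are assumptions, since the text does not specify them: the range is taken as maximum minus minimum, variances divide by the number of classes, missing answers (None) are ignored when computing proportions, and F is represented as a count of classes for each number of distinct items.

```python
from collections import Counter


def mean(values):
    return sum(values) / len(values)


def variance(values):
    # Population variance (divides by n); the estimator actually used is not stated.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def summary_statistics(data):
    """data: one list per individual, with one item (or None if missing) per class."""
    n_classes = len(data[0])
    diversities, n_items = [], []
    for j in range(n_classes):
        counts = Counter(ind[j] for ind in data if ind[j] is not None)
        total = sum(counts.values())
        if total == 0:
            continue  # class with no recorded answer: skipped (assumption)
        # D_j = 1 - sum_i p_ij^2, with p_ij the proportion of individuals using item i
        diversities.append(1.0 - sum((c / total) ** 2 for c in counts.values()))
        n_items.append(len(counts))
    return {
        "D": mean(diversities),
        "R(D)": max(diversities) - min(diversities),
        "V(D)": variance(diversities),
        "S": len({tuple(ind) for ind in data}),
        "N": mean(n_items),
        "V(N)": variance(n_items),
        "F": dict(Counter(n_items)),
    }
```

For four individuals and two classes, where the first class is split evenly between two items and the second class is uniform, the diversities are 0.5 and 0, so D = 0.25 and R(D) = 0.5.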

4.3. Model selection

Before the model selection, we performed a goodness-of-fit test to check if the

simulations were able to produce data close to the real data using the R package abc

(Csilléry et al., 2012). We performed model selection using the R package abcrf with

the RF algorithm and the function abcrf (Pudlo et al., 2016). We graphically checked

if a forest of 500 trees allowed a convergence of the error rate. We then performed a

cross-validation analysis using an out-of-bag approach implemented in the package

abcrf, evaluating if the algorithm was a priori able to distinguish between the four

models.

4.4. Parameter estimation

We used the RF algorithm with the function regAbcrf of the package abcrf to

estimate the expectation, the median, the variance and the quantiles 5% of the

parameters N1, N0, t0, μL and the composite-parameters N1×μL, N0×μL and t0×μL. Note

that the RF algorithm does not estimate the whole distribution of the parameters

directly, but estimates the quantiles of the distribution instead.

5. Results

5.1. Model selection

Using the goodness-of-fit test, we verified that there was no significant

difference between the real and simulated datasets (p-value = 0.55, with a number of

replications = 1000). We performed the RF analysis using 500 trees, and we verified

graphically that the error rate converged. The RF analysis rejected the

Clonal and Sexual models and favoured the Sexual2 and Social

models (Table II.2), with a posterior probability of 0.499 for the Social model.

The cross-validation analysis (Figure II.4) indicated a good a priori

differentiation between the Clonal model, the Sexual model and the group ‘Sexual2

and Social’ models. Nevertheless, the Sexual2 and the Social models cannot be

distinguished a priori. It is then impossible at that stage to choose, based on our data,

between the ‘Sexual2’ and the ‘Social’ models, but we may be confident in the

falsification of the Clonal and the Sexual models.

5.2. Parameter estimation

For the two most likely models (Sexual2 and Social), we could not estimate

separately the parameters N0, N1 and t0: the estimated quantiles of their posterior

distributions were similar to the quantiles of the priors considered (Tables II.3 and

II.4). Nevertheless, the estimated quantiles of the parameter μL and the composite

parameters N1×μL, N0×μL and t0×μL are substantially narrower than the priors (Tables

II.3 and II.4). Using the estimated posteriors for the Sexual2 and the Social models, we

estimated that the linguistic mutation rate ranged between 1.98×10⁻⁴ and 1.44×10⁻³.

Clonal   Sexual   Sexual2   Social   Post. prob.
0.002    0.04     0.478     0.48     0.499

Table II.2 – Proportion of votes for the four models of linguistic evolution, and the posterior probability of the Social model.

Figure II.4 – Confusion matrices from the out-of-bag cross-validation analysis of the four models, using 10000 pseudo-observed data.

Parameter   Expectation   Median       Variance     Quantile (lower)   Quantile (upper)
N0          526           499          43331        126                968
N1          645           714          65762        154                975
t0          479           466          87448        21                 937
μL          4.66×10⁻⁴     3.23×10⁻⁴    1.13×10⁻⁷    2.18×10⁻⁴          1.44×10⁻³
N0×μL       0.243         0.193        0.039        0.057              0.87
N1×μL       0.255         0.244        4.10×10⁻³    0.15               0.467
t0×μL       0.239         0.177        0.064        8.092×10⁻³         1.152

Table II.3 – Summary of the posterior distributions of the parameters, assuming a Sexual2 scenario.

Parameter   Expectation   Median       Variance     Quantile (lower)   Quantile (upper)
N0          544           542          60108        153                986
N1          655           681          61907        148                966
t0          353           290          109196       9                  954
μL          4.26×10⁻⁴     3.14×10⁻⁴    1.03×10⁻⁷    1.98×10⁻⁴          1.28×10⁻³
N0×μL       0.203         0.175        0.028        0.074              0.553
N1×μL       0.255         0.246        4.85×10⁻³    0.122              0.432
t0×μL       0.204         0.126        0.098        5.33×10⁻³          1.09

Table II.4 – Summary of the posterior distributions of the parameters, assuming a Social scenario.

6. Discussion

In this article, we built four models of intra-population linguistic evolution, at

the individual scale. We compared the simulated data with a real dataset of 30

individuals in Tajikistan carrying 185 cognates.

First, we showed that some of our models were able to produce simulated data

close to the contemporaneously observed data. It means that we were able to specify

linguistic reproduction mechanisms between generations, a set of transmission models

at an individual scale, that are consistent with the linguistic diversity of the sampled

populations.

We provided inferences of some features of the linguistic history, selecting the

most plausible mechanisms of linguistic transmission, and estimating the parameters

of the selected models. The low posterior probability of the Clonal and Sexual models

compared to the Sexual2 and the Social models indicates that the mechanisms of

linguistic acquisition probably follow a process of linguistic recombination with

several teachers rather than a process of transmission without recombination. It would be of

great interest to distinguish between a transmission following a Sexual2 model (with

only two teachers), and a transmission following a Social model (with a whole

community as teacher).

Our estimate of the mean linguistic mutation rate of the lexical

items of the Swadesh list falls between 10⁻⁴ and 10⁻³ mutations per lexical item per

generation. Our micro-evolutionary context (i.e. at the scale of the individuals) may

be compared with a macro-evolutionary context (i.e. at the scale of a whole language

or a linguistic variety). The mutation rate per item, per generation and per

individual estimated here falls in the same range as the mutation rate per item

per generation in macro-evolutionary studies (Pagel et al., 2007a). Considering that

languages at the global scale emerge from the interactions of individuals, our

result leads us to hypothesise that the mutation rate estimated globally emerges from the

mutation rate at a local scale.

Contrary to most other studies using within-population linguistic diversity

(Baxter et al., 2009; Danescu-Niculescu-Mizil et al., 2013; Kandler et al., 2010), we

only used contemporaneous linguistic diversity. This method allows us to perform

historical inferences based only on sampling campaigns conducted in existing

populations. The amount of information available is then only dependent on the

sampling effort, and not on the relatively limited historical records.

There are nevertheless some theoretical obstacles remaining. First, the models

of linguistic acquisition that we propose do not integrate the particular constraints of

communication processes, hypothesizing a neutral production of variants without any

constraints on linguistic communication. Some evolutionary linguists would argue for

an integration of the particularity of languages as communication systems, associated

with a strong set of constraints (Beckner et al., 2009). Indeed, individuals maximize

the probability of being understood, as well as minimize the cost of communication,

which probably mainly drives evolutionary processes (Tamariz and Kirby, 2015).

These constraints are particularly strong in the case of evolution of phonological,

morphological, or syntactical systems, and we may wonder if lexical variants are

subject to these constraints too. If so, these particularities of linguistic systems may

be at odds with inferences based on a model of neutral evolution, and should thus be

taken into account for a more accurate model of linguistic evolution at the individual

scale for historical inferences purposes.

Moreover, we assumed that linguistic transmission occurs between generations,

overlooking the resulting effects of iterated communication between individuals of the

same generation. We thus should consider in future investigations a set of alternative

models of language evolution, where the acquisition of language results from a series

of interactions between individuals rather than from a unique transmission event.

Finally, note that the formalism of our models is close to the formalism of

population genetics. This should make it possible to propose joint inferences coupling genetic

and linguistic data for the same set of populations and individuals, but some

theoretical limits remain. We may wonder whether a speech community (a “linguistic

population”) is identical to a reproductive group (a “genetic population”). It is far

from obvious that human reproductive boundaries overlap language boundaries in

human groups. A joint model between genetics and linguistics would thus require

clarifying and articulating rigorously the concepts of population genetics with the

concepts of population linguistics to propose robust joint inferences.

Chapter III – Building a formalised interface between population genetics and population linguistics

Valentin Thouzeau†, Mathieu Tiret‡, Frédéric Austerlitz†,1, Paul Verdu†,1

Ethnobiologie, Paris 75016, France

‡ INRA, GABI, UMR 1313 Population, statistique et génomique, Jouy en josas 78352, France

1 These authors equally supervised this work

This article is currently in preparation.

Introduction

Numerous research studies have investigated the conceptual analogies between

genetic and linguistic evolutions (Atkinson, 2013; Ben Hamed and Darlu, 2007;

Cavalli-Sforza, 1997; Fitch, 2008; Gray et al., 2007; Hunley, 2015; List et al., 2016).

In “The Descent of Man”, Darwin proposed a parallel between the formation of

languages and the formation of species (Darwin, 1871):

The formation of different languages and of distinct species, and the proofs that both have been

developed through a gradual process, are curiously parallel....We find in distinct languages striking

homologies due to community of descent, and analogies due to a similar process of formation.

The pioneering study published by Cavalli-Sforza et al. (1988) made it possible to evaluate

the relevance of this quote, showing striking homologies between population trees and

language classifications, confirming for the first time the relevance of Darwin’s

intuition. Since then, several studies coupled genetic and linguistic analyses to

produce parallel inferences, for the same set of populations, aiming at understanding

the possible links between genetic and linguistic histories (see for instance

Balanovsky et al., 2011; Hunley et al., 2008; Thouzeau et al., 2017). Phylogenetic

methods have also recently been coupled with population genetics approaches to

propose a worldwide super-tree integrating genetic and linguistic information (Duda

and Zrzavý, 2016). These studies made it possible to compare the history of genetic

populations and that of languages at a macro-evolutionary scale, contrasting the

patterns of linguistic and genetic differentiations.

In a population genetics approach, the notions of within- and among-

populations diversities are central (Hartl and Clark, 2007). Genetic diversity is seen as

resulting from historical processes emerging from the repeated interactions between

individuals through time. Therefore, historical inferences in population genetics

models and simulations are centred at the individual level (Hoban et al., 2012; Judson,

1994), implementing the rules concerning the interactions between agents to study the

global properties emerging from these repeated interactions. These simulations

themselves rely on known mechanisms of genetic transmission at the individual level.

Events like population split, migrations, admixture, expansions, or bottlenecks, are

well studied because they are explicitly specified at the individual level in the models.

Conversely, within-population linguistic diversity is rarely taken into account

when reconstructing linguistic histories (see chapter 2). Indeed, the current absence of

within-population inferences lies in part in the lack of clear consensus on the causal

mechanisms responsible for the construction of the linguistic diversity among

individuals. Without explicit mechanisms describing the way the linguistic items are

transmitted between individuals, it is impossible to infer the histories of the

populations of speakers from within-population linguistic diversity patterns using an

agent-based approach. Knowing these limitations, complex events like migrations,

admixture, or population-size changes, are rarely taken into account in linguistic

inferences.

Some authors, from the emerging field of “evolutionary linguistics” (Croft,

2008), proposed a first step to take into account within-population linguistic diversity

(Croft, 1996; Niyogi and Berwick, 1997; Tamariz and Kirby, 2016). Languages are

viewed as complex adaptive systems (Beckner et al., 2009), their properties emerging

at macro-evolutionary levels as a result of the iterated interactions between

individuals (Haspelmath, 1999; Kirby et al., 2008, 2014; Steels, 2011). This type of

study thus relies mainly on agent-based models (Steels, 1997), making explicit the

underlying behaviour of the agents. For instance, Zuidema and de Boer (2009)

showed that combinatorial phonology (i.e. the fact that the sounds produced by

speakers are categorized in discrete units) may emerge from repeated interactions

between individuals. Kirby (2001) showed that a structured mapping between strings

and meanings (i.e. the fact that one word corresponds to one meaning) could also be

the result of iterated learning. Nevertheless, no attempt has been made, to our

knowledge, to use contemporaneous within-population linguistic data with agent-

based models for explicit linguistic historical inferences.

Following the perspective proposed in chapter 2, we aimed in this chapter at

formalising an interface between population genetics and population linguistics, to

build a general model describing individuals interacting genetically and linguistically

in an agent-based approach.

In the first part of this chapter, we constructed a theoretical framework

associating genetic and linguistic evolutions at the within-population level. We

present in section 1 a formalisation of biological evolution, allowing us to delimit the

notions of reproduction relationship and genetic population. We develop in section 2

a “population linguistics” framework, delimiting the notions of linguistic

communication relationship, individual grammars and linguistic population. We

then develop in section 3 the diversity of mechanisms that may occur during a

linguistic communication relation. We then assemble all these notions in section 4,

describing our genetic and linguistic coevolution framework. We adopt a

formalization avoiding the risks of making non-explicit underlying assumptions that

may be at odds with known genetics or linguistics results.

In the second part of this chapter, we evaluate statistically the possibilities given

by the joint inferences following the framework built in the first part. We detail, in

section 1, our modelling and the method of Approximate Bayesian Computation

(ABC) that we used. We fully developed a new simulation software, Population

Linguistic and Genetic Simulator (PLGS), which simulates genetic and linguistic

coevolution for a given set of individuals in one or several populations. We then

perform, in section 2, a series of cross-validation analyses on simulated data, aiming at

testing the a priori possibilities of the framework presented in the first part.

Part 1 – Formalising genetic and

linguistic coevolution

1. A formalisation of biological evolution

Our first objective was to delimit a biological evolution framework able to link

genetic evolution and linguistic evolution based on what they have in common, the

individuals, which both carry genes and speak languages. The purpose of this section

is to delimit the notions of reproduction relations and genetic population. To do that,

we built on the theoretical foundations proposed by Barberousse and Samadi (2015,

abbreviated B&S in the following), where the individuals are central, and we then

formalized the notion of genetic population in theoretical terms.

B&S recently proposed a preliminary step for a formalisation of the biological

theory of evolution. While previous formalisations have been proposed (Gould, 2002;

Lewontin, 1970; Maynard Smith, 1987; Szathmáry and Maynard Smith, 1997), the

particularity of the proposition of B&S lies in the centrality of the organisms. B&S

take a neutralist perspective (Gould and Lewontin, 1979; Kimura, 1983), where

genetic drift and selection are seen as sampling processes, contingently dependent on

the ecological and historical contexts (biotic and abiotic). The domain of the theory is

the genealogical network, the set of all organisms that are linked to one another by

descent relationships. This network is characterized as follows by B&S:

Within a genealogical network, each organism is related to at least one other organism by a

reproduction relationship we call RG [5] in the following.

5. To clarify our reasoning, we call the reproduction relation RG, whereas B&S call it R.

Definition Let there be two organisms a and b, aRGb if a and b have common direct

offspring. This means that a or b, or both, have transmitted, within finite time, some material

substrate to one or more other organisms. The material substrate may be modified; it provides

the offspring with the capacity to reproduce. This general definition of the reproduction

relationship allows us to formalise different reproduction modes that are common in earthly

organisms:

- {aRGb} ≠ Ø and ∀c {c/ cRGa or cRGb} = Ø represents strictly monogamic biparental

reproduction;

- {aRGa} ≠ Ø and ∀b {b/ bRGa} = Ø represents strictly clonal reproduction;

- {aRGb} ≠ Ø and {aRGc} ≠ Ø represents biparental, polygamic reproduction.

In order to represent other modalities, it is possible to generalise relation RG so that it can

take any (finite) number of organisms as relata [6].
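The three reproduction modes listed above can be illustrated with a minimal sketch. It assumes that reproduction events are recorded as pairs of parents, with a clonal event written as a pair (a, a); the function names, the event encoding, and the extra labels for cases outside the three listed modes are illustrative assumptions, not part of the B&S formalisation.

```python
from collections import defaultdict

def partners(events):
    """Map each organism to the set of organisms it is RG-related to.

    `events` is an iterable of (parent_a, parent_b) pairs; a clonal
    event is written (a, a). This encoding is an assumption.
    """
    related = defaultdict(set)
    for a, b in events:
        related[a].add(b)
        related[b].add(a)
    return related

def reproduction_mode(organism, events):
    """Classify one organism according to the three modes listed above."""
    related = partners(events)
    mates = related.get(organism, set())
    if not mates:
        return "none"
    if organism in mates:
        # aRGa holds: strictly clonal only if no other partner exists.
        return "clonal" if len(mates) == 1 else "mixed"
    if len(mates) == 1:
        (mate,) = mates
        # Strict monogamy also requires the mate to have no other partner.
        if related[mate] == {organism}:
            return "monogamic biparental"
        return "biparental, partner polygamic"
    return "polygamic biparental"

events = [("a", "b"), ("c", "c"), ("d", "e"), ("d", "f")]
modes = {x: reproduction_mode(x, events) for x in "abcdef"}
```

In this toy network, a and b form a strictly monogamic pair, c is clonal, and d is polygamic, which makes e and f partners of a polygamic organism.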

In this formalisation, each individual comes from the realization of a relation

RG, which is a transmission relationship of a material substrate. In our perspective, in

the case of the human species, the reproduction relationships imply a known genetic

transmission structure (see Figure III.1): autosomal DNA is transmitted biparentally

with recombination, while mitochondrial DNA and Y chromosomes are transmitted

uniparentally, respectively through the female and the male line. Each genetic marker

may mutate at a given mutation rate per generation.

6. The relata are the objects of a logical relation.

Figure III.1 – Structure of the reproduction relationship in the human species. Circles represent individuals, arrows represent transmission relationships: half of the autosomes plus the Y chromosome from the father if the child is a male or the X chromosome if the child is a female, and the other half of the autosomes from the mother plus the mitochondrial DNA and an X chromosome.

In particular, for our population genetics perspective, the known structure of

genetic transmission through the reproductive network allows us to make explicit the

historically contingent events, such as population splits, expansions, bottlenecks,

migration or admixture events, using mathematical or simulation frameworks (see

Figure III.2). Indeed, it is possible to propose a series of explanations and predictions

according to a series of historical scenarios. The predictions are transcribed through

mathematical formulas or agent-based simulation programs, and may ultimately be

confronted with real data to select the best scenarios. Historical events are considered

contingent with respect to the very structure of evolution through reproductive

relationships.

Figure III.2 – Classic representation of the setting up of the reproduction network. Each row represents a generation. The vertical green lines represent the boundaries of the population. Demographic events like population size changes, migration, admixture, selection, or constraints over population boundaries, are seen as historically contingent. On the contrary, the very mechanisms of the reproduction relation RG are seen as necessary.

This formalisation offers several advantages in our perspective. First, it

clarifies the underlying assumptions of evolutionary biology theory, making explicit

its notions of reproductive relationship, reproductive network, and historically

contingent events. Second, it explicitly links the different evolutionary scales

(molecular, individual, species): the molecular scale, where genes are transmitted

through the reproductive events; the individual scale, where reproductive events build

the network of inter-individual relationships; and the species scale, built by mapping

preferential units of reproductive relationships. Third, as said above, it places human

individuals and their reproductive interactions at the centre of the theory.

Based on this framework, we propose a formalisation of the notion of

“population”. We define a population as a set of individuals for which a relationship

RG is preferentially instantiated (see Figure III.3).

2. A formalisation of linguistic evolution

Figure III.3 – Alternative representation of the reproductive network. Only the individuals that reproduced during a given time step are represented. Multiple green dotted lines represent multiple instantiations of the reproductive relation RG. The green circle line represents the boundary of the genetic population.

Our objective was to perform robust historical inferences from linguistic data,

sampled at the individual scale, through agent-based modelling. The purpose of this

section is to delimit the notions of linguistic communication relations, individual

grammars and linguistic population. To do so, we built a theoretical framework able to

explain and predict the evolution of languages through time, considering only

inter-individual interactions and contingent historical events. We aimed at integrating

classical results from laboratory experiments, computer simulations and linguistic

fieldwork concerning languages.

Analogies between biological and linguistic evolution cannot be used as

justifications of modelling hypotheses (Blevins, 2004; Claidière and André, 2012;

Testart, 2011). As pointed out by Smith (2014) concerning models of language

evolution:

Modellers therefore need to be flexible, yet careful to ensure that their design and

implementation decisions are plausible, justifiable and systematically explained.

First, we delimit the objects and provide the definitions that we will use

hereafter. Linguistic evolution may refer to (Steels, 2004; Tamariz and Kirby, 2016):

(1) Origin, in the human species, of biological capabilities to produce linguistic

communication (Diller and Cann, 2011);

(2) Emergence and modification of structural properties of language (Cangelosi

et al., 2006; Gong and Wang, 2005);

(3) Modification through time of linguistic variants used by individuals.

Our formalisation can be applied to cases (2) and (3), but we will mainly focus

on case (3) in part 2 of this chapter. Other studies focused on case (1), which goes

beyond the scope of this chapter.

Following Croft’s (1996) perspective on linguistic evolution, we considered that

utterances play a central role in linguistic evolution. For several evolutionary linguists

(Croft, 2013; Kirby et al., 2015), the event of linguistic communication is at the centre

of the linguistic change. This utterance-based perspective of linguistic evolution was

borrowed from the more general perspective of evolutionary theory from Hull (1988).

For Croft (1993), an utterance is:

a particular instance of actually-occurring language as it is pronounced, grammatically

structured, and semantically interpreted in its context.

Moreover, utterances are the very objects actually observed by linguists during

the sampling of within-population linguistic diversity. In this perspective, languages at

the macro-evolutionary level are only the result of the emergence of iterated

interactions between individuals through linguistic communication.

We defined the individual grammar of each individual as what she/he uses to

produce and comprehend utterances. This is the individual process underlying

utterance production. We then proposed a formulation of linguistic evolution

decoupling the structure of the communication network, on the one hand, from the

structure of the individual grammars, on the other hand, which we will discuss in

section 3.

We considered the human communication network as the set of all speakers

linked to each other by at least one linguistic communication relationship. We

assumed that each speaker is related to at least one other speaker by a linguistic

communication relationship denoted RL. We did not focus primarily on the history of

the linguistic items used by individuals, but instead on the linguistic communication

network. We then defined the relationship RL as follows:

Let there be a speaker a and a listener b.

aRLb if a and b are engaged in a linguistic communication event where a

produces one or several utterances, and b comprehends these utterances (Figure III.4).

Note that the instantiation of the relation RL depends on spatial and temporal

constraints.

Figure III.4 – Structure of the linguistic communication relationship in humans. Circles represent individuals, with the speaker on the left and the listener on the right. The full black arrow represents the function determining which utterance the speaker will produce using his/her individual grammar, and the dotted arrow represents the updating function of the individual grammar of the listener.

In this formalisation, we adopted the “organism-centred” perspective following

the agent-based perspective of the genetic evolution previously proposed (section 1).

We assumed that language learning and the formation and modification of individual

grammars occur through linguistic communication relationships.

Our formalisation of the linguistic evolution differs to some extent from the

framework of genetic evolution. First, a linguistic relationship does not generate a

new organism. The structure of the linguistic communications network is then

contingently constrained by the reproductive network: only existing individuals,

resulting from a reproductive event, may be expected to communicate. Second, the

relation RL is not symmetrical. A listener and a speaker engaged in a linguistic

communication relationship may not always switch their roles. Third, the relation RL

is expected to be massively more frequent than the relation RG.

As in our population genetics formalism, we defined a “linguistic

population” as a set of individuals for which the relationship RL is preferentially

instantiated (Figure III.5). Similarly to population genetics, the structure of the

linguistic population depends directly on historical contingencies: the size of the

population, its variation through expansions or bottlenecks, migration, etc.

Figure III.5 – Representation of the linguistic communication network. Only the individuals who communicated and the communications which occurred during a given time step are represented. Multiple purple arrows represent multiple instantiations of the linguistic communication relation RL. The purple dashed circle represents the boundary of the linguistic population.

Here, the notion of linguistic population is close to the notion of speech

community (Labov, 1972), defined as a set of speakers involved in a series of

linguistic interactions and using a common set of linguistic conventions and norms.

3. Modalities of linguistic communications at

the scale of the individuals

Our formalism of the linguistic network does not require specifying the actual

linguistic mechanisms of utterance production and their influence on the individual

grammar of the listener. The purpose of this section is to develop the range of the

linguistic mechanisms that may occur during a linguistic communication relation.

As detailed in section 1, biological evolution relies on the well-known mechanisms

of genetic transmission (Figure III.1). Conversely, the linguistic mechanisms ruling

individual grammars are not well known in linguistics. This is partly because no

material substrate is transmitted through linguistic communication relationships:

languages are “products of the human mind” (Popper, 1979), which accentuates the

difficulty.

Another aspect is that different linguistic levels (phonological, lexical,

syntactical…) may imply different rules of individual grammar. Particular cognitive

and structural constraints over each of these levels may shape utterance production

and their comprehension (Kirby et al., 2015; Nowak et al., 2002). Moreover, the

utterances are a production of the individual grammars, not the individual grammars

themselves. Therefore, listeners only access a part of the linguistic information, through

linguistic communications. Some cognitive linguists (Culbertson, 2012) pointed out

that this lack of linguistic information available to the listeners, coupled with the

inferential cognitive biases of the listeners, should strongly orient the evolution of

linguistic structures. The structure of the cognitive biases influencing individual

grammars is of great interest and is widely debated in linguistics (Evans and

Levinson, 2009).

Laboratory experiments allow studying the functioning of the individual

grammars (Tamariz and Kirby, 2016). How are languages learned during the lifespan?

Are there some universal cognitive biases concerning the production and the learning

of languages? What are the effects of such biases? What are the effects of the

functional constraints of the communication? Those experimental studies are

precious, because they delimit a set of plausible linguistic hypotheses concerning the

functioning of individual grammars. We expect that the particularities of the

local mechanisms, after integration over a large set of individuals interacting during a

long period of time, produce large effects on linguistic evolution and on the

resulting linguistic diversities (Kirby et al., 2007; Smith et al., 2003). We thus argue

that individual grammars should be studied case by case, for each type of linguistic

item, and for each type of language.

We will now focus on the evolution of frequencies of a series of linguistic

variants among individuals. Variants are defined as a set of different realisations of a

particular linguistic class. Linguistic variants can be phonetic, phonological,

morphological, lexical, etymological, syntactical… They constitute a generic category

which refers to a series of linguistic items of the same meaning, thus belonging to one

linguistic class. For instance, two words differing by only one sound, but with the

same meaning, are two linguistic variants of the same phonemic class: they are two

phonetic variants. For the meaning left, the pronunciations /lɛft/, common in British

English, and /lift/, common in New Zealand English, constitute two such variants,

differing in the sounds /ɛ/ and /i/ (Watson et al., 2000). In another instance, several

words with the same meaning but with different etymological origins, are linguistic

variants of the same etymological class, i.e. lexical variants. For the meaning car, the

words “char”, “auto”, “machine”, and “voiture” are used in Canadian French and

constitute four such variants (Nadasdi et al., 2008). The Swadesh lists of cognates

(1952) are classical datasets of etymological variants used to infer linguistic histories.

A series of models have been proposed to describe acquisitions and changes of

linguistic variants in an utterance-based perspective (Baxter et al., 2009; Kirby et al.,

2014). Some authors (Baxter et al., 2006; Reali and Griffiths, 2010) argue that

listeners are Bayesian learners. The hypothesis of statistical acquisition and change of

the individual grammars seems indeed plausible from the perspective of language-

learning studies (Saffran, 2003) and theoretical results (Reali and Griffiths, 2010).

Given this framework of utterance-based learning, several

modalities may be additionally incorporated in our population linguistics

perspective. First, a kind of selection process may occur: are the different variants of a

linguistic structure equivalent? Or do the different variants have different weights in

individual grammars? Patterns of linguistic changes through time seem to indicate

that linguistic behaviours may be cognitively biased, for instance favouring linguistic

variants that are easier to remember or easier to pronounce (Blythe and Croft, 2012; Sturtevant,

1947). The social status of speakers constitutes a second modality. Do speakers of different status

(sex, age, profession…) affect the individual grammars of listeners in the same way?

This modality is often described in terms of conformity bias or prestige bias, where,

for instance, the utterances of a socially prominent speaker affect the individual

grammars more efficiently than those of other, less prominent, speakers in the

population (Henrich, 2001). Third, what is the impact of the structure of the linguistic

communication network: how does this structure, its centrality, its connectivity,

modify linguistic change over time? The emergence of particular linguistic structures

may depend only on the shape of the linguistic communication network (Kauhanen,

2016).

In our perspective, all these different modalities may be formalised in terms of

structure of the linguistic communication network or of individual grammars. First,

the linguistic communication network describes the population size, the frequencies

of communication events, the structure of the social network, and all other historical

contingencies at the scale of the individuals. The modalities occurring at the scale of

the individuals are now commonly evaluated in population genetics (see for instance

Guillot et al., 2015; Palstra et al., 2015; Verdu et al., 2009), and could be evaluated as

well in population linguistics. Second, the embedded modalities of individual

grammars can be explicitly specified, implemented in simulation programs, and

should be formally tested case by case (Palminteri et al., 2017). In part 2 below, we

propose, for instance, three models of individual grammars describing how

individuals produce utterances and re-evaluate their individual grammar when

listening to other utterances (see part two, Figure III.9).

4. Coupling the reproductive and the

communication networks

We have introduced separate formalisations of biological and linguistic

evolution. Each of these formalisations offers the possibility to clarify the

assumptions underlying their frameworks and to propose justified and disambiguated

predictions. This allows proposing a series of explicit models, a necessary step for any

inference-based approach. We now propose a coupled formalisation of

genetic-linguistic coevolution.

As emphasised by B&S (2015),

The existence of an organism can be visualised as a trajectory in space and time. This allows

us to express an important constraint: only organisms whose trajectories intersect can be relata

of relation RG.

This property of the relation RG is also true about the relation RL: only

individuals whose trajectories intersect temporally and spatially can linguistically

communicate. This property may be seen through the double meaning of

“intercourse”, as pointed out by Croft (1996).

Both linguistic and reproductive networks share, therefore, common constraints

over the instantiation of their constitutive relationships, even if some constraints may

specifically affect either reproductive relationships or linguistic communication

relationships.

The notion of “population” may here be seen as the delimitation of these

constraints. The genetic population defines the sets of individuals for which the

relation RG is preferentially instantiated. The linguistic population is defined by the

sets of individuals for which the relation RL is preferentially instantiated. Under this

terminology, the questions of the coevolution between genes and languages may be

understood at the scale of individuals. We propose in the following several hypotheses

concerning this coevolution.

Hypothesis 1: the genetic and linguistic populations strictly overlap: the

spatio-temporal constraints are the same for the two types of relations (Figure III.6). In this

case, languages and genealogies are expected to show a common phylogeny. It is then

legitimate to use both types of data to infer the unique history of the genetic-linguistic

population. This may be the case for highly isolated human groups, without migration

or cultural contact with any other human group. The common history of genes and

languages then results mainly from a shared set of geographical constraints.

Figure III.6 – Representation of the setting up of the reproduction network as well as the linguistic communication network. Only the individuals who communicated or reproduced during a given time period are represented. Multiple purple arrows represent multiple instantiations of the linguistic communication relation RL. Multiple green dotted lines represent multiple instantiations of the genetic reproduction relation RG. The dotted circle represents the boundary of the genetic-linguistic population.

Hypothesis 2: the genetic population is included in the linguistic population

(Figure III.7). In this case, the constraints on the two types of relations (RG and RL) differ,

and the linguistic and genetic populations do not coincide. In other words, genetic

reproduction events may be strictly more constrained than linguistic communication

events. This type of structure may emerge, intuitively, after a process of

standardisation of a unique national language taught at school and via public media,

leading for example to individuals sharing a common language but keeping

preferential inter-marriages between geographically close individuals.

This should also occur in a society forbidding inter-marriages between delimited

social groups, like castes or clans, leading to several differentiated genetic populations

within a unique linguistic population. In this case, the instantiation of the relation RL

is less constrained than the instantiation of the relation RG. This difference in the two

types of constraints should then imply different delimitations of the objects (the

populations) considered in the historical reconstructions.

Figure III.7 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 2. Only the individuals who communicated or reproduced during a given time step are represented. The purple circle represents the boundary of the linguistic population. The green circles represent the boundaries of the two genetic populations.

Hypothesis 3: the linguistic population is included in the genetic population (see

Figure III.8). In other words, we may hypothesise that the constraints over linguistic

communications are strictly stronger than the constraints over genetic

reproduction. This type of structure should be encountered for instance in cases of

strong differentiation between linguistic groups with very few linguistic exchanges,

but with rules of inter-marriage between these groups. Figure III.8 shows a case where

the instantiation of the relation RG is less constrained than the instantiation of the relation RL.

Hypothesis 4: it is also possible that the genetic and linguistic populations

partially overlap (see Figure III.9). This hypothesis is a combination of Hypothesis

2 and Hypothesis 3, and should be observed in complex cases, where the two

networks are partially disjoint. Here, there are only two genetic and two linguistic

populations, but we may delimit three different units. In the case of a very reduced

overlap between the genetic and the linguistic populations, we may expect largely

diverging patterns of genetic and linguistic diversities. In this case, a unique notion of

“population” that does not differentiate the genetic and the linguistic populations is

clearly misleading.
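As a rough illustration, the four hypotheses can be told apart by comparing two sets of individuals. The sketch below assumes that one genetic and one linguistic population are each given as a plain set of individual identifiers, which simplifies the definition used here (populations as sets where RG or RL is preferentially instantiated); the function name is hypothetical.

```python
def coevolution_hypothesis(genetic_pop, linguistic_pop):
    """Return the hypothesis number (1 to 4) matching two populations.

    Both arguments are collections of individual identifiers. Returns
    None when the two populations share no individual, a case not
    covered by the four hypotheses.
    """
    g, l = set(genetic_pop), set(linguistic_pop)
    if g == l:
        return 1  # strict overlap
    if g < l:
        return 2  # genetic population included in the linguistic one
    if l < g:
        return 3  # linguistic population included in the genetic one
    if g & l:
        return 4  # partial overlap
    return None
```

For example, `coevolution_hypothesis({1, 2}, {1, 2, 3})` falls under Hypothesis 2, since every member of the genetic population also belongs to the larger linguistic population.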

Figure III.8 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 3. Only the individuals who communicated or reproduced during a given time step are represented. The green circle represents the boundary of the genetic population. The purple circles represent the boundaries of the two linguistic populations.

Another aspect of genetic and linguistic coevolution concerns the potential links

between the instantiations of the relations RL and RG, and the linguistic and genetic

characteristics of the individuals. The question is whether the genetic or linguistic

traits of the individuals can bias the realisation of their linguistic communication

events or reproductive events. For example, a language endogamy bias, where

individuals are more likely to mate when they are closer linguistically, implies that the

instantiation of the relation RG depends on the linguistic proximity between

individuals. This seems to be the case for some human groups, for instance in Central

Asian populations, where marriage rules are more constrained by a common

language than by a common geography (Heyer et al., 2009). In this case, we may

expect that the patterns of linguistic diversities affect genetic diversity, as genetic

differentiation among populations could result from previous linguistic differentiation

events. This mechanism of sympatric linguistic differentiation could then be

responsible for the differentiation of genetic populations without the need for spatial

isolation.

Figure III.9 – Representation of the reproduction network as well as the linguistic communication network under Hypothesis 4. Only the individuals who communicated or reproduced during a given time step are represented. The green circles represent the boundaries of the two genetic populations. The purple circles represent the boundaries of the two linguistic populations.

On the contrary, a preferential sociolinguistic association between individuals

based on their genetically determined phenotypes implies that the instantiation of the

relation RL depends on the genetic proximity between two individuals. It may be the

case in societies with strong emphasis on the notions of “races”, “ethnicities” or

“origins”, based on a variety of criteria that, consciously or unconsciously, correlate

with genetic diversity patterns. This phenomenon could account for the reconstruction

of a set of distinct language communities related to genetic variation from a relatively

homogeneous linguistic substrate.

Part 2 – Inferring genetic and linguistic

histories

In part 1, we presented a framework to test a wide range of questions. We

propose, in this second part, a series of specific questions, each one associated with a

series of models or historical scenarios, to illustrate how our genetic-linguistic

coevolution framework may be used in practice to better reconstruct human biological

and cultural evolution from genetic and linguistic data.

Question 1: Should the individuals be considered as copiers, probabilistic

copiers, or Bayesian learners? This question aims at delimiting the mechanisms of the

individual grammars.

Question 2: Are the mutation rates different between the linguistic classes? This

question aims at describing if the variants of different linguistic classes mutate at the

same rate or at different rates.

Question 3: Are the sizes of the genetic and linguistic populations different?

This question aims at evaluating the relative size of the genetic and the linguistic

population of a given sample set.

Question 4: Do the sampled individuals belong to genetically and/or

linguistically differentiated populations? This question aims at proposing a test of

differentiation of the genetic and the linguistic populations.

Question 5: What is the tree topology for three populations? This question aims

at determining the history of splits which produced three genetic and linguistic

populations.

To address each question, we evaluated in each case whether an ABC method based

on Random Forest is a priori able to select the right scenario. To do that, we built a

new computer software named Population Linguistic and Genetic Simulator (PLGS),

which simulates the models and scenarios framed in part 1. We first detail the

formalized assumptions of the models and the software, and then the analysis of the

five questions using ABC.

1. Modelling

1.1. Sampling

We considered a given set of sampled individuals, in one or more sampling

units. We define a sampling unit as a set of individuals assumed a priori to belong to

the same genetic/linguistic population. For each sampling unit, individuals were

assumed to be part of one genetic population, an entity for which the relation RG is

preferentially instantiated, and of one linguistic population, an entity for which the

relation RL is also preferentially instantiated. The genetic population and the

linguistic population may overlap or not.

We assumed that 30 individuals were sampled per genetic and linguistic

population respectively. For each individual, we considered that 25 microsatellites

were genotyped and 50 linguistic variants (see section 3 of part 1) were obtained

through a linguistic questionnaire. Such numbers of genetic and linguistic markers are

low, aiming at testing our framework in non-favourable conditions to assess the

minimal statistical power available with our method. We consider here only linguistic

classes with a potentially infinite number of variants.

1.2. Genetic model

We assumed that each relation RG producing an individual birth in the

population was followed by the death of a random individual. This model is a Moran

process (1958), differing from Wright’s (1942) model by the fact that births and

deaths occur individually and at random. Moran’s model allows overlapping

generations, whereas Wright’s model assumes the death of all

the individuals of one generation after the birth of the new generation. Moran’s model

thus allows linguistic communication within and between several reproductive

generations of speakers.

We assumed a population of diploid individuals, with separate sexes (males and

females). The instantiation of a relation RG was only possible between a male and a

female, both drawn at random. The loci were assumed to be independent. In the

instantiation of a relation RG, for each locus, each parent transferred one of her/his

alleles at random to the child. Each allele might mutate with a locus-specific probability per

reproductive event, following a strict stepwise mutation model.
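The genetic model described above can be sketched as a single Moran step. This is a minimal illustration under stated assumptions, not the PLGS implementation: the data layout, the function names, the 50/50 sex of the newborn, the symmetric ±1 step, and the choice to let the newborn replace any individual (possibly a parent) are all assumptions made for the example.

```python
import random

def stepwise_mutation(allele, mu, rng):
    """Strict stepwise mutation: with probability mu, the allele
    (a repeat count) moves by +1 or -1 (symmetry is an assumption)."""
    if rng.random() < mu:
        return allele + rng.choice((-1, 1))
    return allele

def moran_step(population, mutation_rates, rng):
    """One RG event: a birth from a random male and a random female,
    followed by the death of a random individual.

    population: list of dicts {"sex": "M" or "F",
                               "genotype": [(allele1, allele2), ...]}
    mutation_rates: one mutation probability per (independent) locus.
    """
    males = [ind for ind in population if ind["sex"] == "M"]
    females = [ind for ind in population if ind["sex"] == "F"]
    if not males or not females:
        raise ValueError("both sexes are needed for an RG event")
    father, mother = rng.choice(males), rng.choice(females)
    genotype = []
    for locus, mu in enumerate(mutation_rates):
        # Each parent transfers one of its two alleles at random.
        paternal = stepwise_mutation(rng.choice(father["genotype"][locus]), mu, rng)
        maternal = stepwise_mutation(rng.choice(mother["genotype"][locus]), mu, rng)
        genotype.append((paternal, maternal))
    child = {"sex": rng.choice("MF"), "genotype": genotype}
    # The newborn takes the place of a randomly drawn individual,
    # so the population size stays constant.
    population[rng.randrange(len(population))] = child
    return child

rng = random.Random(1)
population = [{"sex": "M" if i % 2 else "F", "genotype": [(10, 12)] * 3}
              for i in range(10)]
child = moran_step(population, [0.0, 0.0, 0.0], rng)
```

Because only one individual is replaced per event, individuals born at different times coexist, which is what allows communication between overlapping generations in the linguistic model.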

1.3. Linguistic model

Following the Moran model, generations were not separated. Instead,

they overlapped widely, allowing linguistic communication between individuals

from different generations.

For each instantiation of the relation RG, we assumed the instantiation of a

number αL/G of relations RL in the linguistic population. If αL/G > 1, the number of

linguistic communication events in the linguistic population was higher than the

number of reproductive events in the genetic population. If αL/G < 1, the number of

linguistic communication events in the linguistic population was lower than the

number of reproductive events in the genetic population. It is logically expected that

αL/G >> 1 in human populations.

The instantiation of the relation RL was possible between two individuals drawn

at random, whatever their sexes. We assumed that mutations occurred during

utterances: with each mutation, the listener received a completely new

variant instead of the speaker's variant, according to a given mutation rate for each

linguistic class.
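The communication events described above can be sketched as follows. This is only one possible reading, under several assumptions: each individual holds a single variant per linguistic class, one RL event covers every class, the listener adopts the received variant with a fixed probability (here `adoption_prob`), and a non-integer αL/G is rounded. All names are hypothetical.

```python
import itertools
import random

def communication_event(grammars, mutation_rates, adoption_prob, new_variants, rng):
    """One RL event between a speaker and a listener drawn at random.

    grammars: one list of variants per individual (one per linguistic class).
    mutation_rates: one mutation probability per linguistic class.
    new_variants: iterator yielding variants never seen before.
    """
    speaker, listener = rng.sample(range(len(grammars)), 2)
    for cls, mu in enumerate(mutation_rates):
        uttered = grammars[speaker][cls]
        if rng.random() < mu:
            # Mutation during the utterance: the listener receives a
            # completely new variant instead of the speaker's variant.
            uttered = next(new_variants)
        if rng.random() < adoption_prob:
            grammars[listener][cls] = uttered
    return speaker, listener

def communication_round(grammars, alpha, mutation_rates, adoption_prob, new_variants, rng):
    """Run the alpha RL events that accompany one RG event."""
    for _ in range(round(alpha)):
        communication_event(grammars, mutation_rates, adoption_prob, new_variants, rng)

grammars = [[0, 0] for _ in range(5)]
communication_round(grammars, 10, [0.0, 0.0], 1.0,
                    itertools.count(100), random.Random(0))
```

With a mutation rate of zero, as in this usage example, every individual keeps the initial variant, since listeners can only copy variants that already exist in the population.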

1.4. Parameters

In the scenarios presented below, the parameters were drawn from the following probability
distributions. Time t0 was drawn from a log-uniform distribution U [0, 1000], as was time t1
where appropriate, with the constraint t0 < t1. The sizes NG and NL were drawn from a
log-uniform distribution LogU [100, 1000]. The mean mutation rates μG and μL were drawn
separately from a log-uniform distribution LogU [10⁻⁶, 10⁻¹]. The mutation rate of each
linguistic variant was drawn from a beta distribution with mean μL and shape β = 2. The
mutation rate of each genetic locus was drawn from a beta distribution with mean μG and
shape β = 2. The number αL/G of relations RL in the linguistic population per relation RG was
drawn from a log-uniform distribution LogU [1, 100]. The upper boundary of the parameter
αL/G was set according to computation time limits. The probability hL for each listener of
adopting the variant of the speaker was drawn from a uniform distribution U [0.01, 1]. The
lower boundary of this parameter was set to avoid excessive computation time, knowing that a
very low probability of adopting a new variant increases the time needed to reach the
equilibrium between mutation and drift.
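The prior draws for the sizes, mutation rates, αL/G and hL can be sketched as follows. This is an illustration under stated assumptions, not the PLGS code. In particular, the beta distribution "with mean μ and shape β = 2" is read here as Beta(a, b) with b = 2 and a chosen so that a / (a + b) = μ. The rounding of population sizes to integers and all function names are also assumptions. The times t0 and t1 are left out.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw x such that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def beta_with_mean(rng, mean, shape_b=2.0):
    """Beta(a, b) draw with b = shape_b and a set so that a / (a + b) = mean (assumed reading)."""
    shape_a = mean * shape_b / (1.0 - mean)
    return rng.betavariate(shape_a, shape_b)

def draw_parameters(rng, n_loci, n_classes):
    mu_g = log_uniform(rng, 1e-6, 1e-1)
    mu_l = log_uniform(rng, 1e-6, 1e-1)
    return {
        "N_G": round(log_uniform(rng, 100, 1000)),   # rounding is an assumption
        "N_L": round(log_uniform(rng, 100, 1000)),
        "mu_G": mu_g,
        "mu_L": mu_l,
        "locus_rates": [beta_with_mean(rng, mu_g) for _ in range(n_loci)],
        "class_rates": [beta_with_mean(rng, mu_l) for _ in range(n_classes)],
        "alpha_LG": log_uniform(rng, 1, 100),
        "h_L": rng.uniform(0.01, 1.0),
    }

params = draw_parameters(random.Random(3), n_loci=10, n_classes=10)
```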

1.5. Summary statistics

We computed summary statistics describing genetic diversity as well as

linguistic diversity. Hereafter, the terms “allele” and “locus” used for genetic

summary statistics can be replaced, respectively, by the terms “linguistic variant” and

“linguistic class” for linguistic summary statistics.

1.5.1. Number of groups of individuals

We defined and computed M as the number of groups of genetically identical individuals in
the sample.

1.5.2. Number of monomorphic loci

We defined and computed S as the number of loci with only one allele in the

sample.

1.5.3. Number of different alleles

We defined k as the number of different alleles at a given locus. We then

computed in the linguistic and genetic populations respectively, the following

summary statistics:

- k, the mean number of different alleles across loci.

- V(k), the variance of the number of different alleles across loci.

- min(k), the minimum number of different alleles across loci.

- max(k), the maximum number of different alleles across loci.

- The range R(k), with R(k) = max(k) – min(k).

- med(k), the median of the number of different alleles across loci.
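The statistics M, S and those derived from k can be computed as in the following sketch. It assumes that each sampled individual is encoded as a tuple holding one state per locus (or one variant per linguistic class), and that V(k) is the sample variance. Neither choice is specified in the text, and the function names are hypothetical.

```python
import statistics

def number_of_groups(sample):
    """M: number of groups of identical individuals in the sample."""
    return len(set(sample))

def alleles_per_locus(sample):
    """k for each locus: number of different alleles observed at that locus."""
    n_loci = len(sample[0])
    return [len({individual[locus] for individual in sample}) for locus in range(n_loci)]

def summarize_k(sample):
    k = alleles_per_locus(sample)
    return {
        "S": sum(1 for count in k if count == 1),   # monomorphic loci
        "mean_k": statistics.mean(k),
        "var_k": statistics.variance(k),            # sample variance (assumption)
        "min_k": min(k),
        "max_k": max(k),
        "range_k": max(k) - min(k),
        "med_k": statistics.median(k),
    }

# Four individuals described at three loci (made-up data).
sample = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", "p"), ("a", "x", "p")]
M = number_of_groups(sample)
k_stats = summarize_k(sample)
```

In this toy sample, two individuals are identical, so M is 3, and the third locus is the only monomorphic one, so S is 1.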

1.5.4. Gene diversity

We defined the gene diversity H as the probability for two randomly chosen

alleles for a given locus to be different (Nei, 1987). For a given locus, the estimated

gene diversity is:

\[ H = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^{2}\right) \]

where n is the sample size, k is the number of different alleles, and p_i is the
frequency of the i-th allele in the sample.

We then computed for each population the following summary statistics:

- H, the mean of gene diversity across loci.

- V(H), the variance of the gene diversity across loci.

- min(H), the minimum of the gene diversity across loci.

- max(H), the maximum of the gene diversity across loci.

- The range R(H), with R(H) = max(H) – min(H).

- med(H), the median of the gene diversity across loci.
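As a worked example of the gene diversity estimator of Nei (1987), the sketch below computes H at one locus from the list of sampled alleles. The function name and the toy data are illustrative assumptions.

```python
from collections import Counter

def gene_diversity(alleles):
    """Estimated gene diversity at one locus: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(alleles)
    frequencies = [count / n for count in Counter(alleles).values()]
    return n / (n - 1) * (1 - sum(p * p for p in frequencies))

# Two alleles at equal frequency in a sample of four: H = 4/3 * (1 - 0.5) = 2/3.
h_balanced = gene_diversity(["a", "a", "b", "b"])
# A monomorphic locus has no diversity.
h_monomorphic = gene_diversity(["a", "a", "a", "a"])
```

The per-locus values can then be summarized across loci (mean, variance, minimum, maximum, range and median) exactly as for k.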

1.5.5. Pairwise distance between populations

We computed the pairwise dissimilarity GST per locus between each pair of
populations as (Nei, 1973):

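\[ G_{ST} = 1 - \frac{H_S}{H_T} \]

where

\[ H_S = 1 - \frac{\sum_{i=1}^{k} p_{1i}^{2} + \sum_{i=1}^{k} p_{2i}^{2}}{2} \quad \text{and} \quad H_T = 1 - \sum_{i=1}^{k} \left(\frac{p_{1i} + p_{2i}}{2}\right)^{2}, \]

where p_{1i} is the frequency of the allele i in the first population, and p_{2i} is the
frequency of the allele i in the second population.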

We then computed for each pair of populations the following summary statistics:

- GST, the mean of GST across loci.

- V(GST), the variance of the GST across loci.

- min(GST), the minimum of the GST across loci.

- max(GST), the maximum of the GST across loci.

- R(GST), with R(GST) = max(GST) – min(GST).

- med(GST), the median of the GST across loci.
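The per-locus GST of Nei (1973) between two populations can be sketched as follows, with HS the mean within-population diversity and HT the diversity of the pooled frequencies. Returning 0 when both populations are fixed for the same allele (HT = 0) is a convention assumed here, since the text does not specify this case. The function names are hypothetical.

```python
from collections import Counter

def allele_frequencies(alleles):
    n = len(alleles)
    return {allele: count / n for allele, count in Counter(alleles).items()}

def gst(pop1_alleles, pop2_alleles):
    """G_ST = 1 - H_S / H_T at one locus, for two populations."""
    p1 = allele_frequencies(pop1_alleles)
    p2 = allele_frequencies(pop2_alleles)
    all_alleles = set(p1) | set(p2)
    h_s = 1 - (sum(p * p for p in p1.values()) + sum(p * p for p in p2.values())) / 2
    h_t = 1 - sum(((p1.get(a, 0.0) + p2.get(a, 0.0)) / 2) ** 2 for a in all_alleles)
    if h_t == 0:
        return 0.0   # assumed convention: no diversity at all, hence no differentiation
    return 1 - h_s / h_t

# Populations fixed for different alleles are fully differentiated.
gst_fixed = gst(["a", "a", "a"], ["b", "b", "b"])
# Populations with identical frequencies are not differentiated.
gst_identical = gst(["a", "b"], ["a", "b"])
```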

1.6. Simulations and model selection

We developed a new C++ program, Population Linguistic and Genetic Simulator (PLGS), which
jointly simulates genetic and linguistic evolution according to the models and the scenarios
detailed above, and computes the summary statistics described above on each simulated data
set. For each scenario, we performed 10000 simulations. For each simulation, we waited a time
5×NG, which we empirically verified to be sufficient to reach genetic equilibrium between
mutation and drift. Moreover, we waited a time 10×NL/(hL×αL/G), which we empirically
verified to be sufficient to reach the linguistic equilibrium between mutation and drift.

We then analysed the simulations produced, using the R package abcrf (Pudlo et al.,

2016) which implements approximate Bayesian computation (Beaumont et al., 2002;

Tavaré et al., 1997) using Random Forest (RF) (Breiman, 1999). We performed a

cross-validation analysis to assess if the method was a priori able to select the

scenario which produced a pseudo-observed dataset drawn at random, using the out-

of-bag approach included in the function abcrf of the package abcrf, with 500 trees

per forest.
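The two waiting times used before each simulation can be made concrete with a short worked example. The function names are hypothetical, and the time unit is whatever unit the simulator uses, which is not restated here.

```python
def genetic_waiting_time(n_g):
    """Waiting time 5 x N_G before the genetic population is considered at equilibrium."""
    return 5 * n_g

def linguistic_waiting_time(n_l, h_l, alpha_lg):
    """Waiting time 10 x N_L / (h_L x alpha_L/G) for the linguistic population."""
    return 10 * n_l / (h_l * alpha_lg)

t_genetic = genetic_waiting_time(500)                # 2500
t_linguistic = linguistic_waiting_time(500, 0.5, 10)   # 1000.0
# At the lower prior bounds of h_L and alpha_L/G, the waiting time becomes very long.
t_slowest = linguistic_waiting_time(1000, 0.01, 1)
```

The last line illustrates why the lower boundary of hL was constrained: the linguistic waiting time grows as 1/hL.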

2. Results

2.1. Should the individuals be considered as copiers,

probabilistic copiers, or Bayesian learners?

In these models, we considered only one linguistic population. We aimed at

assessing how the individuals build an individual grammar considering a series of

linguistic variants for a series of linguistic classes (Figure III.10). In model a), we

considered that individuals were copiers. The speaker produced the linguistic variants

she/he knows, and the listener replaced every variant that she/he knows by the variant

of the utterance produced. In model b), we considered that individuals were

probabilistic copiers. The speaker produced the linguistic variants she/he knows, and

the listener had a probability hL to replace each variant that she/he knows by the

variant of the utterance produced. In model c), we considered that individuals were

Bayesian learners. For each linguistic class, the individual grammar consisted of a set

of two frequencies. For each class, the speaker produced a sequence of linguistic
variants according to the frequencies of her/his individual grammar, and the listener
updated the frequencies of her/his individual grammar as a linear combination of the
utterance and her/his individual grammar. See Baxter et al. (2006) for analytical
details about model c).
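The listener's update under the three models can be sketched as follows. This is a simplified illustration, not the PLGS code: grammars are plain lists, and the parameter `weight` of the linear combination in model c) is a hypothetical name, the exact weighting being given in Baxter et al. (2006).

```python
import random

def copy_update(listener_grammar, utterance):
    """Model a): the listener replaces every variant by the variant heard."""
    return list(utterance)

def probabilistic_copy_update(listener_grammar, utterance, h_l, rng):
    """Model b): each variant is replaced by the variant heard with probability h_L."""
    return [heard if rng.random() < h_l else own
            for own, heard in zip(listener_grammar, utterance)]

def bayesian_update(listener_freqs, utterance_freqs, weight):
    """Model c): linear combination of the listener's and the utterance's frequencies."""
    return [tuple((1 - weight) * own + weight * heard
                  for own, heard in zip(class_own, class_heard))
            for class_own, class_heard in zip(listener_freqs, utterance_freqs)]

rng = random.Random(4)
grammar_a = copy_update([1, 2, 3], [7, 8, 9])
grammar_b = probabilistic_copy_update([1, 2, 3], [7, 8, 9], h_l=0.3, rng=rng)
# One class with two variants: the listener moves from (0.5, 0.5) towards (1.0, 0.0).
grammar_c = bayesian_update([(0.5, 0.5)], [(1.0, 0.0)], weight=0.2)
```

With `h_l=1`, `probabilistic_copy_update` returns the same grammar as `copy_update`, so model a) is a special case of model b).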

Cross-validation results of model selection are summarized in Table III.1. The results
suggest that it is a priori difficult to select models a) or b) unambiguously, with
probabilities of 44% and 51%, respectively, of erroneously selecting another model, compared
with the 66% expected at random. This difficulty could be explained by the fact that the
copying model is nested within the probabilistic copying model when hL = 1. Conversely, the
selection of model c) is a priori very reliable, with only a ~1% chance of erroneously
selecting one of the two other models. In other words, we find that it is difficult to assess
whether individuals follow a copying model rather than a probabilistic copying model, whereas
assessing whether individuals follow a Bayesian learning model rather than one of the two
other models is very efficient.

            Estimated scenario
Scenario    a       b       c       Error
a           5604    3446    950     0.4396
b           3776    4859    1365    0.5141
c           68      65      9867    0.0133

Table III.1 – Cross-validation results aiming at assessing a priori distinctions between three models of individual grammars, using 10000 pseudo-observed data, with a) copying model, b) probabilistic copying model, and c) Bayesian learning (see Figure III.10).
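The Error column of such a cross-validation table is the proportion of pseudo-observed data sets of a given scenario that were assigned to another scenario. The following sketch recomputes it from the counts of Table III.1. The dictionary layout and the function name are illustrative choices.

```python
def error_rates(confusion):
    """Per-scenario error: 1 - (correctly classified) / (total simulated under the scenario)."""
    return {scenario: 1 - row[scenario] / sum(row.values())
            for scenario, row in confusion.items()}

# Counts taken from Table III.1 (rows: simulated scenario, columns: estimated scenario).
table_iii_1 = {
    "a": {"a": 5604, "b": 3446, "c": 950},
    "b": {"a": 3776, "b": 4859, "c": 1365},
    "c": {"a": 68, "b": 65, "c": 9867},
}
rates = error_rates(table_iii_1)
rounded = {scenario: round(rate, 4) for scenario, rate in rates.items()}
```

Rounded to four decimals, the three rates are 0.4396, 0.5141 and 0.0133, as reported in the table.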

Figure III.10 – Description of the three models of individual grammars. a) Copying model, b) Probabilistic copying model, c) Bayesian learning model.

2.2. Are the mutation rates different between the linguistic

classes?

In these scenarios, we considered only one linguistic population and we

assumed that the individual grammars followed a probabilistic copying model (i.e. the

speaker produces the linguistic variants she/he knows, and the listener has a

probability hL to replace each variant that she/he knows by the variant of the utterance

produced). We aimed at assessing if the linguistic classes mutate with different

probabilities. In model a), we considered that the mutation rate of each linguistic class
was the same and equal to μL. In model b), we considered that the mutation rates of
the linguistic classes were drawn in a beta distribution with mean μL and shape β = 2.

Cross-validation results of model selection are summarized in Table III.2. The results
suggest that it is a priori possible to distinguish relatively clearly between models a) and
b), with a 31% probability of erroneously selecting model b), and a 19% probability of
erroneously selecting model a). In other words, we find that it is relatively easy to assess
whether a set of linguistic classes mutate at the same rate or at different rates.

            Estimated scenario
Scenario    a       b       Error
a           6871    3129    0.3129
b           1923    8077    0.1923

Table III.2 – Cross-validation results aiming at assessing a priori distinctions between two models of the mutation of the linguistic variants, using 10000 pseudo-observed data, with a) same mutation rate for every cognate, and b) mutation rate drawn in a beta distribution.

2.3. Are the sizes of the genetic and the linguistic populations

different?

In these scenarios, we considered only one population of each kind, and we
assumed that the individual grammars followed a probabilistic copying model. We
aimed at assessing if a) the linguistic population was larger than the genetic
population (NG < NL), b) if the genetic population was larger than the linguistic
population (NG > NL), or c) if the sizes of the two populations matched (NG = NL). We
considered one unit of sampling.

Cross-validation results of scenario selection are summarized in Table III.3. The results
suggest that it is a priori impossible to distinguish between the three scenarios, with an
error between 61% and 71%. This means that assessing the relative size of the genetic and the
linguistic populations is nearly impossible with our method. We may hypothesize that this
difficulty arises because many parameters (μG, μL, hL, αL/G) modify the relation between the
genetic and the linguistic summary statistics in the same way as the parameters NG and NL,
hiding a clear relation which could be used by the Random Forest.

Figure III.11 – Description of the three scenarios of different size of the genetic and the linguistic population. a) the linguistic population was larger than the genetic population, b) the genetic population was larger than the linguistic population, and c) the genetic and linguistic populations match.

2.4. Do the sampled individuals belong to genetically and/or

linguistically differentiated populations?

In these scenarios, we considered two sampling units, and we assumed that the

individual grammars followed a probabilistic copying model. We aimed at testing if

the individuals sampled in those units belong to two different populations, or to the

same population, both genetically and linguistically (Figure III.12). In scenario a), we

assumed that the two units corresponded to one linguistic population but two genetic

populations that diverged at time t0 in the past (see Figure III.8 for another

representation of the final state). In scenario b), we assumed that the two units
corresponded to one genetic population but two linguistic populations that diverged at
time t0 in the past (see Figure III.9 for another representation of the final state). In
scenario c), we assumed that the two units corresponded to only one genetic and one
linguistic population (see Figure III.6 for another representation of the final state). In
scenario d), we assumed that the two units corresponded to two genetic and two
linguistic populations that diverged at time t0 in the past.

Cross-validation results of scenario selection are summarized in Table III.4. The results
suggest that it is a priori possible to clearly distinguish between scenarios a), b), c) and
d), with an error between 9% and 15%. In other words, we find that our method is a priori
quite efficient at assessing whether two sets of sampled individuals belong to one or two
genetic and/or linguistic populations.

            Estimated scenario
Scenario    a       b       c       Error
a           2885    3511    3604    0.7115
b           2860    3854    3286    0.6146
c           2810    3323    3867    0.6133

Table III.3 – Cross-validation results aiming at assessing a priori distinctions between three models described in Figure III.11, using 10000 pseudo-observed data.

            Estimated scenario
Scenario    a       b       c       d       Error
a           8533    4       1344    119     0.1467
b           9       8899    223     869     0.1101
c           830     116     9047    7       0.0953
d           183     1237    41      8539    0.1461

Table III.4 – Cross-validation results aiming at assessing a priori distinctions between four scenarios described in Figure III.12, using 10000 pseudo-observed data.

Figure III.12 – Description of the four scenarios of genetic and/or linguistic population differentiation. a) The two units corresponded to one linguistic population but two genetic populations that diverged at time t0 in the past, b) The two units correspond to one genetic population but two linguistic populations that diverged at time t0 in the past, c) The two units corresponded to only one genetic and one linguistic populations, d) The two units corresponded to two genetic and two linguistic populations that diverged at time t0 in the past.

2.5. What is the tree topology of three populations?

In these scenarios (Figure III.13), we assumed that the individuals were sampled

from three genetic-linguistic populations, and we assumed that the individual

grammars followed a probabilistic copying model. We assumed three different

branching processes, corresponding to the three possible topologies. We considered

three sampling units, corresponding to three genetic-linguistic populations. In

scenario a), populations 0 and 1 have a more recent common ancestor than population

2. In scenario b), populations 1 and 2 have a more recent common ancestor than

population 0. In scenario c), populations 0 and 2 have a more recent common ancestor

than population 1. Considering only linguistic or, separately, only genetic pseudo-

observed data, the error rates ranged between 38% and 45% (Tables III.5 and III.6).

Considering jointly linguistic and genetic diversities reduced the error rates, to around

32% (Table III.7). This indicates that the joint inferences allowed a priori a slightly

more precise selection of the topology of three populations. Thus the selection of a

tree topology including three populations using only genetic diversity is as efficient as

using only linguistic diversity. Moreover, coupling the two types of diversities

increases, but only slightly, the efficiency of the selection.

Figure III.13 – Description of the three scenarios of historic topologies. a) Populations 0 and 1 have a more recent common ancestor than the population 2, b) Populations 1 and 2 have a more recent common ancestor than the population 0, c) Populations 0 and 2 have a more recent common ancestor than the population 1.

            Estimated scenario
Scenario    a       b       c       Error
a           5489    2434    2077    0.4511
b           2443    5897    1660    0.4103
c           2457    1846    5697    0.4303

Table III.5 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13, using 10000 pseudo-observed data, using only linguistic data.

            Estimated scenario
Scenario    a       b       c       Error
a           5641    2397    1962    0.4359
b           2065    6217    1718    0.3783
c           2110    2136    5754    0.4246

Table III.6 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13, using 10000 pseudo-observed data, using only genetic data.

            Estimated scenario
Scenario    a       b       c       Error
a           6733    1616    1651    0.3267
b           1947    6804    1249    0.3196
c           1967    1224    6809    0.3191

Table III.7 – Cross-validation results aiming at assessing a priori distinctions between three scenarios described in Figure III.13, using 10000 pseudo-observed data, using genetic and linguistic data jointly.

Discussion

We proposed in this chapter to build an interface between population genetics

and population linguistics. We shaped a framework for studying genetic and linguistic

coevolution at the scale of the individuals, integrating a diversity of possible

individual grammars and population structures. This allowed us to perform a priori

inferences addressing several classical population genetics and linguistic questions.

We first needed to focus the theory of biological evolution on organisms, to make the
structure of the genealogical network explicit, and to delimit the notion of genetic
population with respect to the network structure. Only subsequently did we define the
genetic transmission mechanisms, as the way alleles are passed from parents to offspring
during reproductive events. Integrating the contingent historical events affecting the
reproductive network and the genetic transmission mechanisms allowed us to retrieve a
classical population genetics framework.

Focusing this formalisation on the individuals allowed us to propose a

formulation of linguistic evolution. We proposed to differentiate the building of the

communication network from the processes occurring at the individual scale during

each linguistic communication event. Nevertheless, linguistic communication

mechanisms and their links with individual grammars are less clear or consensual than

the genetic transmission mechanisms. We thus proposed to study case by case, for

each language and each linguistic item considered, the local constraints underlying the

rules of the linguistic communication events, before proposing historical inferences.

Coupling genetic and linguistic frameworks led us to propose a diversity of cases

of coevolution, depending on the underlying structure of the genetic and linguistic

populations, and their history. We explicitly formulated a range of hypotheses

concerning genetic and linguistic coevolution at the scale of the individual, allowing

formal testing. We argue that classical hypotheses in the literature of coevolution

between genes and languages could be formalized and explicitly tested through our

framework.

We performed a priori a range of these tests, evaluating the theoretical

possibilities given by such framework. We used Approximate Bayesian Computation

with Random Forests (Pudlo et al., 2016), a flexible statistical framework, integrating

the whole complexity of our formulations of the models and the scenarios in

competition.

We showed that it is possible to differentiate between individual grammars based on copying
or Bayesian learning. Moreover, we showed that it is also possible to evaluate whether the
mutation rate is equal or different across a set of linguistic classes. These two results
demonstrate that, using only linguistic data describing 30 individuals sampled
contemporaneously, it is possible to evaluate the models underlying linguistic communication
events and individual grammars, a crucial step before proposing subsequent historical
inferences. In future studies, other types of models concerning individual grammars could be
tested as well, for instance to evaluate the existence of cognitive bias, structural bias, or
social bias modifying the mechanisms of linguistic evolution of a series of linguistic
variants.

We showed that it is quite difficult to assess whether the sizes of the linguistic population
and of the genetic population associated with a given sample are different or not. We may
hypothesize that the summary statistics that we used do not reflect in any way the relative
size of the genetic and the linguistic populations. To overcome this issue, future work will
focus on developing a set of composite summary statistics, computed using both types of data
and reflecting their relative organisation, which would give more precise access to the links
between genetic and linguistic evolution. On the contrary, we showed that we are a priori
able, considering two sets of individuals, to assess whether they belong to one or two
genetic and/or linguistic populations. This could give us a lot of information about the
relative structure of the reproduction network and the linguistic communication network, and
the underlying rules which could explain these structures, as detailed in section 4 part 1.
Finally, we showed that using linguistic or genetic diversities is equivalent for resolving a
tree topology for three populations. Moreover, we showed that coupling the two types of
diversities led to a slight increase in precision.

All of these examples showed that, in most cases, a genetic population framework coupled
with a linguistic population framework allowed us to address a wide range of questions about
the genetic and linguistic coevolution of a given sample of individuals.

Accounting for within-population diversity places the individuals at the centre

of the formal framework. Conversely, following recent conceptual and

methodological advances, phylogenetic studies have been adapted to linguistic data in

order to infer computationally the history of a set of languages (Atkinson and Gray,

2005; Bouckaert et al., 2012; Gray and Atkinson, 2002; Gray and Jordan, 2000; Gray

et al., 2009). In these phylogenetic studies, languages are considered at a macro-

evolutionary scale. Nevertheless, several authors pointed out the potential limitations

of the phylogenetic approach applied to cultural data (see for instance Moon, 1994;

Testart, 2011). It is often implicitly assumed that the linguistic trees are also

representative of the demographic history of the population of speakers. Nevertheless,

language histories can differ from genetic histories, as shown in several previous

works (Steele and Kandler, 2010; Thouzeau et al., 2017; Verdu et al., 2009; Ward et

al., 1993). Moreover, phylogenetic studies assume that language evolution is tree-like,

with a vertical transmission of languages, ignoring thus processes such as borrowing,

language shift, creolization, or linguistic admixture. These processes nonetheless seem
extremely frequent in the history of languages (Steele and Kandler, 2010); some authors argue
that complete isolation is the exception rather than the norm in language evolution
(Campbell, 2006). We argue that taking into

account within-population linguistic diversity through an agent-based approach is an

efficient way to overcome these limitations, taking into account all of these complex

events at the individual scale.

In our framework, the analogies between language phylogenies and genetic population
histories are no longer a set of premises for studying the evolutionary history of human
populations, but a consequence emerging from a set of shared spatio-temporal constraints
affecting the genealogical and the communication networks. Applying this framework to real
datasets could thus help disentangle several problems concerning gene and language
coevolution in the future, making it possible to infer a wide range of events encountered by
human populations throughout their history.

Chapter IV – Sampling and describing

linguistic data from Cape Verdean

Kriolu

Valentin Thouzeau1, Ethan M. Jewett2,3, Sergio S. da Costa1, Cesar A. Fortes-Lima1,
Noah A. Rosenberg2, Frédéric Austerlitz1, Marlyse Baptista4 and Paul Verdu1

1 CNRS, MNHN, Université Paris Diderot, UMR 7206 Eco-Anthropologie et Ethnobiologie, Paris 75016, France
2 Department of Biology, Stanford University, Stanford, CA 94305, USA
3 Department of Statistics and Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
4 Department of Linguistics and Department of Afroamerican and African Studies, University of Michigan, Ann Arbor, MI 48109, USA

This chapter is a preliminary work.

Introduction

The Cape Verde Islands, an archipelago off the coast of Senegal and Mauritania, have been
the subject of a series of genetic and linguistic data sampling missions involving French,
American and Cape Verdean researchers since 2010. I had the opportunity to participate in
the mission from 1st July to 17th July 2016, and to develop a sampling protocol for
linguistic data. Moreover, given the methodological developments proposed in chapters 2 and
3, this work was an opportunity to go back and forth between fieldwork and theory, allowing
each experience to inform the other.


A first set of genetic and linguistic data sampled for 44 individuals was recently

statistically described by Verdu et al. (2017). This study was based on the joint

analysis of a series of genetic and linguistic markers. The genetic data were generated

at a genome-wide scale using the Illumina HumanOmni2.5-8 BeadChip genotyping

array. The linguistic data were generated by the recording and the transcription of

semi-spontaneous speeches of each DNA donor. The speeches were produced by the

speakers without interruption and without time limit, following the watching of a

speech-less movie of a little more than 5 minutes (The Pear Story, Chafe, 1980). The

words used by the speakers were then categorized by linguists according to their

African or non-African lexical or etymological basis.

The analysis performed in Verdu et al. (2017) has shown that the Cape Verdean

population from the main island of Santiago resulted from a process of admixture

between Iberian populations and Senegambian populations, thus echoing the known

peopling history of this archipelago by the Portuguese Crown and African slaves.

Moreover, speech patterns, described by word frequencies tabulated for each
individual, seemed to be significantly correlated with individuals’ birth-places as

well as parental and grand-parental birth places (correcting for shared parent-offspring

birth places). The authors proposed that this indicated that speech patterns transmitted

vertically from one generation to the next were not completely obliterated by speech

patterns acquired horizontally or obliquely by each individual throughout his/her

lifetime. Finally, results showed that genetic levels of African admixture were

positively correlated with the frequency of usage of African derived words or words

with a mixed African-European etymology. Altogether, these results seemed to

indicate that the processes underlying the observed genetic and linguistic admixture

patterns followed a parallel evolutionary trajectory, probably through co-transmission

processes.

The methods developed during this PhD thesis are in line with this type of

work. In order to propose formal historical inference to investigate the historical

events encountered by Cape Verdean populations, an explicit model was required. The

developments proposed in chapter 3 focused on describing a theoretical framework to

infer genetic and linguistic histories using within-population diversities jointly. For

clarity, we chose to present this theoretical construction separately from the sampling

protocol conducted in Cape Verde, but it is nevertheless essential to note that they

result, in reality, from a co-construction process. Without an underlying theory, it was

difficult to construct a data set that can be analysed in a relevant way, and without

fieldwork, it was difficult to propose a theory that reflects the object of study in a

relevant way.

We describe in the first part of this chapter the data sampling strategy developed

in the field work, and we then propose a first descriptive analysis of this new dataset.

Data sampling

We presented in chapter 3 a formalization of the notion of linguistic population,

as a set of individuals communicating preferentially. A first consequence of this

formalization lies in the sampling strategy: how do we determine if two individuals

belong to the same linguistic population? A categorization work is needed here. Each

sampling unit must therefore be defined in such a way as to provide a possible set of
population structures, using the available ethnographic information: sampling location,
living place, birth place, etc. Moreover, since our statistical framework uses linguistic
diversity as the prime source of information, a sufficient number of individuals had to be
sampled in order to perform statistical tests with sufficient power. This was the case in
our missions, with 104 individuals sampled in 2016 (49 individuals, sampled by V.T., S.DC.,
E.J., P.V. and M.B.) and 2017 (55 individuals, sampled by C.F-L, S.DC, P.V. and M.B.),
instead of considering only a few individuals as representative of the homogeneous
linguistic variety of a given location.

The formalization that we previously proposed took into account only linguistic

variants, a series of items of the same linguistic class. We might imagine a model of

individual grammar generating a whole discourse, in order to analyse the speeches

sampled throughout, but the linguistic constraints on the production of the discourses

are highly complex, and still poorly understood. We thus chose to sample lexical

variants from the Swadesh list during the fieldwork, in order to access a type of data

that could correspond to the linguistic evolution model developed previously, and

computationally tractable using our novel simulation software PLGS (chapter 3).

In order to sample the linguistic variants with each individual, we could not use a
vehicular language, as was the case for the sampling campaign in Central Asia, where Russian
was used to sample Central Asian linguistic varieties (Mennecier et al., 2016). We chose not
to use English or Portuguese, knowing that Kriolu is a creole language with a Portuguese
substrate, too close to these two languages. We then decided to show each speaker a series
of 96 pictures depicting meanings from the Swadesh list. Objects and verbs were favoured so
that the meanings could be represented graphically. The set of pictures seemed unambiguous
to our research team, but some of the pictures turned out to be very ambiguous for the
sampled individuals. For instance, the picture of a forest rarely elicited the expected word
from sampled individuals, likely because forests are extremely rare in Cape Verde. This led
us to reduce the list to only 56 meanings rarely found ambiguous during our experiments
(Table IV.1). For each speaker, the pictures were presented successively, and the speaker
was asked to pronounce the word associated with each picture twice, in order to record each
linguistic variant properly. Some speakers produced several variants for one picture,
indicating that at least some of them were conscious of the linguistic diversity in Cape
Verde. We chose to use only the first variant pronounced by each speaker in our subsequent
analysis. Nevertheless, the production of several variants gave us a clue about the
possibility of memorizing several variants, as implemented in the Bayesian-learning model
studied in chapter 3.

From the fieldwork, it appears that speakers uttered linguistic variants depending on the
context of the interview. For instance, some of them used many Portuguese linguistic
variants during the interviews, but did not use these variants when they spoke more freely.
Knowing that Portuguese is more associated with academic or socio-economically favoured
environments, it was clear that the formal context of the interview oriented their choice of
linguistic variant at the moment of the utterance. The importance of the context led us to
treat the linguistic data as utterances produced in a particular context, instead of the
realisation of a particular dialect quite independent from the individuals and interviewers.
This perspective fed into the theoretical construction of chapter 3, leading to the
formalisation of an utterance-based model.

 1. green     11. foot     21. hair       31. branch    41. smoke       51. to see
 2. red       12. leg      22. tongue     32. root      42. lake        52. to drink
 3. yellow    13. knee     23. dog        33. flower    43. sea         53. to eat
 4. black     14. hand     24. tail       34. seed      44. mountain    54. to cut
 5. white     15. neck     25. snake      35. fruit     45. rope        55. to bite
 6. one       16. head     26. fish       36. cloud     46. stone       56. to sew
 7. two       17. ear      27. feather    37. sun       47. to sing
 8. three     18. eye      28. egg        38. moon      48. to swim
 9. four      19. nose     29. tree       39. star      49. to sit
10. five      20. mouth    30. leaf       40. fire      50. to hear

Table IV.1 – List of meanings extracted from the Swadesh list.

Figure IV.1 – Geographical distribution of the 19 sampling localities under study in Cape Verde.

This protocol was performed with 104 individuals in total, during missions from 1st July 2016 to 17th July 2016 and from 20th May 2017 to 7th June 2017, on the islands of Santiago, Brava, Fogo, São Vicente, and Santo Antão (Figure IV.1). Each linguistic variant was then transcribed, leading to a table describing the linguistic variants used by each speaker for each lexical class.

Descriptive analyses

Several descriptive analyses of the dataset were performed in order to understand the structure of the linguistic data. A Multiple Correspondence Analysis (abbreviated MCA, see for instance Abdi and Valentin, 2007) was produced using the function MCA of the R package FactoMineR (Lê et al., 2008), treating each meaning as a qualitative variable. This analysis makes it possible to describe a multidimensional dataset projected onto only two dimensions. In addition, each meaning was analysed according to its contribution to the linguistic diversity, in order to determine which meanings are associated with the clustering of the MCA. Finally, Manhattan linguistic distances (see chapter 1) were calculated between each pair of individuals, and were represented by a tree built with a Neighbour-Joining algorithm using the function BioNJ (Gascuel, 1997) implemented in the R package ape (Paradis et al., 2004). A total of 1000 bootstrap resamplings of the meanings were performed to determine the robustness of the branches.
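As an illustration of the distance and bootstrap steps described above, the sketch below computes a Manhattan distance between individuals from a table holding one variant per meaning, and resamples the meanings with replacement. It is a minimal sketch in Python rather than the R workflow used in the study. The 0/1 indicator coding of variants, the toy table, and the function names are assumptions made for the example; the distance actually used is the one defined in chapter 1.

```python
import random

def manhattan_distance(a, b):
    """Manhattan distance between two speakers under 0/1 indicator coding.

    Each speaker is a list with one variant per meaning. With indicator
    coding (an assumption of this sketch), two different variants for a
    meaning contribute 2 to the sum (one mismatch in each variant's
    column), and identical variants contribute 0.
    """
    return sum(2 for x, y in zip(a, b) if x != y)

def distance_matrix(table):
    """Pairwise distances for a dict {speaker: [variant per meaning]}."""
    names = sorted(table)
    return {(i, j): manhattan_distance(table[i], table[j])
            for i in names for j in names}

def bootstrap_tables(table, n_replicates, seed=0):
    """Yield tables whose meanings (columns) are resampled with replacement."""
    rng = random.Random(seed)
    n_meanings = len(next(iter(table.values())))
    for _ in range(n_replicates):
        cols = [rng.randrange(n_meanings) for _ in range(n_meanings)]
        yield {name: [row[c] for c in cols] for name, row in table.items()}

# Toy data: hypothetical variants, not taken from the study.
toy = {
    "s1": ["A", "A", "B", "A"],
    "s2": ["A", "A", "B", "B"],
    "s3": ["B", "C", "A", "B"],
}
d = distance_matrix(toy)
replicates = [distance_matrix(t) for t in bootstrap_tables(toy, 1000)]
```

Each replicate distance matrix would then be passed to a tree-building step (BioNJ in the study) in order to count how often each edge of the original tree is recovered.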

The MCA differentiated two axes (Figure IV.2). The first axis differentiated

individuals from the northern and southern islands. The second axis differentiated

individuals from the south-eastern and south-western islands. This suggested that the

lexical diversity of the Cape Verde islands is structured in at least three different

groups.

The MCA makes it possible to determine which meanings among the 50 were responsible for the observed structure (Figure IV.3). The meanings close to the origin (bottom left corner of Figure IV.3) were those for which the diversity observed across the islands did not contribute significantly to the three-group structure observed in the MCA. The meanings close to 0 on the second axis, but high along the first axis (bottom right corner of Figure IV.3), were those for which diversity reflects the north / south linguistic structure. The meanings close to 0 on the first axis, but high along the second axis (top left corner of Figure IV.3), were those for which diversity reflects the south-east / south-west linguistic structure. Finally, the meanings far along both the first and the second axis (top right corner of Figure IV.3) were those which reflect the linguistic structure differentiating both the north / south islands and the south-east / south-west islands. Note that it is almost exclusively the verbs that structured both axes.
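The reading of Figure IV.3 described above can be sketched as a simple rule applied to the two axis weights of each meaning. This is a hypothetical illustration in Python: the threshold, the weights, the meaning labels, and the function name are all assumptions made for the example, not values or code from the study.

```python
def classify_meaning(weight_axis1, weight_axis2, threshold=0.5):
    """Assign a meaning to one of the four regions of the weight plot.

    threshold is an arbitrary cut-off (an assumption of this sketch)
    separating 'close to 0' from 'high' on each axis.
    """
    high1 = weight_axis1 >= threshold  # first axis: north / south
    high2 = weight_axis2 >= threshold  # second axis: south-east / south-west
    if high1 and high2:
        return "both structures"
    if high1:
        return "north / south"
    if high2:
        return "south-east / south-west"
    return "no clear contribution"

# Hypothetical weights (axis 1, axis 2) for placeholder meanings.
weights = {
    "meaning_a": (0.9, 0.8),
    "meaning_b": (0.1, 0.05),
    "meaning_c": (0.7, 0.1),
    "meaning_d": (0.2, 0.6),
}
labels = {m: classify_meaning(w1, w2) for m, (w1, w2) in weights.items()}
```

In practice the cut-off would have to be chosen from the observed distribution of weights; a fixed value is used here only to keep the example short.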

Figure IV.2 – MCA of the 84 individuals sampled in Cape Verde, coloured according to their birthplace: red for Santiago, orange for Brava, purple for Fogo, blue for São Vicente, and green for Santo Antão. The first axis differentiates the northern islands and the southern islands. The second axis differentiates the south-eastern island and the south-western islands.

The Neighbour-Joining tree (Figure IV.4) presented four subgroups, with a clear separation between the north and south islands, as well as a clear separation between the islands of the south-east and the south-west, as previously shown by the MCA. Moreover, the Neighbour-Joining tree seemed to indicate a structuring within the south-eastern island (Santiago), between two subgroups linked to two different birthplaces within the island. The categorization of the individuals according to their birthplace revealed a clearer structure than the categorization of the individuals according to their sampling location.

Figure IV.3 – Representation of the weight of the 50 words in the MCA. The first axis differentiates the words with a low weight (close to the origin) from those with a high weight (far from the origin) in the differentiation between the north islands and the south islands. The second axis differentiates the words with a low weight (close to the origin) from those with a high weight (far from the origin) in the differentiation between the south-eastern island and the south-western islands.

Figure IV.4 – Neighbour-Joining trees based on the linguistic distance matrix, with a) each individual number coloured according to its sampling location, and b) each individual number coloured according to birthplace. The value at each edge corresponds to the number of bootstrap trees containing this edge; only edges found at least 800 times over the 1000 bootstrap replicates are labelled. Individual 39 is coloured in black because her/his birthplace is outside Cape Verde, in São Tomé and Príncipe.

Discussion

These preliminary analyses allowed us to describe the lexical diversity sampled at the scale of individuals in several locations in Cape Verde. The graphical representations of the MCA and the Neighbour-Joining tree seem to indicate a clear structuring between different regions, even within the island of Santiago. We may hypothesize that this result is the consequence of a structure between several linguistic populations, in which individuals preferentially communicate within a small geographical area. Moreover, the structure was clearer when categorizing the individuals according to their birthplace than according to their sampling location. This could indicate that the vocabulary used by individuals is mainly determined by the environment of their childhood rather than by their immediate environment. Verdu et al. (2017) previously showed that the word frequencies sampled for each individual were significantly correlated with their birthplaces, hypothesising that vertical linguistic transmission was not completely obliterated by horizontal or oblique transmission.

Interestingly, verbs are responsible for the main part of this linguistic structure, compared to words for objects. We may hypothesize that words describing actions are more prone to mutate than words describing objects. Regardless of this hypothesis, this indicates that each type of linguistic class should be studied separately in order to understand their respective dynamics in the linguistic history of populations (see chapter 3).

Following this preliminary study, we think that going back and forth between fieldwork and theory is particularly relevant when developing new methods associated with the collection of new types of data. As we show here, some methodological choices made in the field directly resulted from theoretical imperatives, such as using a series of linguistic variants as evolving linguistic data, or sampling a large set of individuals in order to access a statistically usable linguistic diversity.

Genetic data (with more than 2.5 million SNPs) will be available in the fall of 2017, making it possible to perform joint analyses of genetic and linguistic diversities for the same set of individuals. This will be the opportunity to apply the framework presented in chapter 3, aiming at inferring the genetic and linguistic coevolutionary history of the Cape Verdean populations. It should allow us to assess whether the linguistic structure described here reflects several differentiated linguistic populations. It should also allow us to assess whether this linguistic structure matches a genetic structure, providing very interesting hints on the relative rules constraining reproductive and communicative relationships in Cape Verde. Moreover, it should finally be possible to infer the contingent historical events (population splits, population size changes, migration, admixture, etc.) which affected, maybe in different ways, the genetic populations and the linguistic populations of this region.

Conclusion

Applying historical inference methods to genetic and linguistic data raised several questions during this thesis. What theoretical and practical relationships link data collection in the field to historical inference? How can models be built that allow inferences that are not merely verbal? How can methods of inference be articulated so as to combine genetic and linguistic diversities?

While historical inference in population genetics has a consensual paradigm, this is not yet the case for historical inference in linguistics. Several ways of constructing the object of research are associated with several models, which are directly tied to different ways of sampling data in the field and to different ways of conceiving the cultural diversity of human populations. In the articulation between genetic and linguistic inferences, several possible approaches delimit the multifaceted scientific objects that are human populations, the genes they carry, and the languages they speak. During this thesis, several prisms were adopted in an attempt to account for genetic and linguistic co-evolution. This work was thus anchored in an interdisciplinary practice seeking to clarify communication between disciplines and the possible articulation between their respective presuppositions.

The use of Approximate Bayesian Computation (ABC) statistical methods relies on fully explicit models of the processes at work. An extensive formalisation of genetic-linguistic co-evolution was therefore required, which in turn demanded going beyond the disciplinary taken-for-granted assumptions about genetic and linguistic evolution and their articulation. My work as a whole can therefore be understood as the result of a double constraint: on one side, a powerful ABC statistical method that requires the underlying models to be made explicit and formalised; on the other, a set of disparate and informal disciplinary fields and scientific practices, which obscure the conditions necessary for the work of formalisation.

The first chapter proposed, in parallel, an inference of the history of genetic populations and an inference of the history of linguistic varieties. Two objects of very different natures (genetic populations and linguistic varieties) were associated in order to reveal the complex relationships that can exist between the genetic history of populations and the history of the languages they speak. Linguistic and genetic data previously collected in Central Asia were analysed within the ABC framework, which made it possible to carry out parallel inferences including complex scenarios of migration and admixture, and to handle large datasets. The comparison between data from linguistic varieties and data from genetic populations brought to light histories that are sometimes dissociated, indicating possible linguistic exchanges without genetic exchanges, and vice versa. This made it possible to assert that the history of languages can differ from the history of genes, especially at a local geographical scale.

A first shift in the way of considering the object of linguistic inference was proposed in chapter two. This time, inter-individual linguistic diversity was placed at the heart of the inference method, in order to assess the possibility of an explicit population linguistics on an entirely different plane from a historical linguistics of languages. This theoretical reversal, from the linguistic general to the inter-individual particular, made it possible to build another methodological framework for historical inference, centred on a single population and again using ABC methods. It was thus possible to select the transmission models most consistent with real data from a set of Tajik speakers, and to estimate some parameters of the selected models.

A second shift was made in chapter three, in continuity with the previous one. Seeking to account for the plurality of approaches in evolutionary linguistics and for the theoretical foundations of the different disciplines involved, a formulation of linguistic evolution, associated with a prior formalisation of the theory of biological evolution, made it possible to clarify a conceptual framework associating population genetics and population linguistics. The association between these two fields then made it possible to make explicit several verbal hypotheses that are classic in these fields of research, and to integrate them into software that jointly simulates the genetic and linguistic evolution of human populations efficiently enough for the computations to be feasible in a reasonable time. The a priori evaluation of this theoretical and methodological framework using ABC methods demonstrated its potential effectiveness, opening new perspectives for the study of genetic and linguistic co-evolution at the scale of individuals.

Finally, the last chapter was the occasion to detail fieldwork that collected linguistic data at the inter-individual scale for Creole speakers of the Cape Verde Islands. The descriptive analysis of the data brought to light a very clear structuring between several sampling localities, resulting from a linguistic history that remains to be studied.

These different works belong to the tradition of the parallel originally proposed by Darwin between biological evolution and linguistic evolution. It now appears clearly that delimiting the objects of study is a crucial step in working with this analogy, but that this step is far from trivial. Phylolinguistics, by considering languages as homogeneous units, sets aside the complexity that can emerge from the interaction between a set of agents. Thanks to a series of theoretical, statistical, and computational advances, it is now possible to study languages as entities whose history emerges from the repeated interaction between the speakers who make up linguistic populations. This change of perspective, analogous to the one that moved from species as homogeneous units to species as sets of diverse individuals, opens avenues of analysis similar to those opened by population genetics relative to the phylogeny of biological species.

Several future studies open up at the end of this work. This thesis first offers a methodological framework for carrying out explicit historical inferences from the integration of the genetic and linguistic diversities of human populations. The methodology proposed here relies on explicit simulations; it therefore provides access to very complex processes, and makes it possible in particular to study events that affect individuals themselves and that propagate, through emergence, to the diversities observed at more global scales. It therefore seems possible to study the history of genetic and linguistic co-evolution in a new light, enriched by a set of questions that can now be addressed directly. What rules govern linguistic interactions at the scale of individuals? Which historical events encountered by human populations have structured the genetic and linguistic diversities observable today? Can we determine to what extent the spatio-temporal constraints governing reproductive relationships coincide with those governing relationships of linguistic communication? These questions, along with many others related to the study of genetic and linguistic co-evolution, remain to be elucidated.

These new perspectives nevertheless require articulating many constraints with one another. Field sampling, consistent with the theoretical perspectives delimited beforehand, all within an efficient methodological framework: these are the indispensable elements for the proper conduct of such a project. To these three interrelated levels (fieldwork, theory, method) is added the need to mobilise several scientific disciplines. This is where interdisciplinary practice must come in, which requires an essential effort to make communication possible between the different disciplinary languages involved.

But a fourth constraint arises from the very possibility of interdisciplinary work within an institution that can sometimes, in certain respects, prove resistant. Indeed, despite an institutional discourse that tends to promote interdisciplinarity, the rigour and upstream work that such a practice requires are difficult to implement while meeting the demand for productivity that weighs on young researchers. Moreover, the organisation of research institutions, divided into well-differentiated disciplines, sometimes makes it very difficult to place a project at the boundary between several disciplines. The objective of fully associating several disciplines, without settling for an exchange of services or an occasional borrowing from neighbouring disciplines, has to make its way through an environment whose practices are often incommensurable with one another. This separation between disciplines, as I mentioned in the foreword of this manuscript, is all the more present because it is sustained by highly differentiated disciplinary languages whose presuppositions are sometimes hardly compatible. Thus, unless an interdisciplinary practice already exists within research laboratories, it must be implemented individually each time, drawing on solid methodological tools. Accordingly, Bühlera et al. (2012) indicate that:

While the individual practice of interdisciplinarity already raises a set of questions in itself, it seemed to us that the young researcher faced particular problems directly linked to his or her status. Through the individualisation of interdisciplinarity, the practice shifts from a collective issue to an epistemological one, leading to questions about the foundation and nature of the sciences.

It seems to me that the epistemological handling of interdisciplinary research problems has every reason to be encouraged, in proportion to the scientific perspectives it opens. I have tried to show throughout this thesis that great fruitfulness can result from it. It seems to me that interdisciplinarity is the crucible of a set of infinitely rich mixtures between scientific disciplines, opening the way to as many different ways of looking at the complexity of the world.

Bibliography

Abdi, H., and Valentin, D. (2007). Multiple Correspondence Analysis.

Aimé, C., Verdu, P., Ségurel, L., Martinez-Cruz, B., Hegay, T., Heyer, E., and Austerlitz, F. (2014). Microsatellite data show recent demographic expansions in sedentary but not in nomadic human populations in Africa and Eurasia. European Journal of Human Genetics.

Aljanabi, S.M., and Martinez, I. (1997). Universal and rapid salt-extraction of high quality genomic DNA for PCR-based techniques. Nucleic Acids Research 25, 4692–4693.

Alvarez-Péreyre, F. (2003). L’exigence interdisciplinaire: une pédagogie de l’interdisciplinarité en linguistique, ethnologie et ethnomusicologie (Paris, France: Éditions de la Maison des sciences de l’homme).

Alvarez-Pereyre, F. (2014). Linguistique, anthropologie, ethnomusicologie : Regards croisés. as 38, 47–61.

Alves, I., Arenas, M., Currat, M., Sramkova Hanulova, A., Sousa, V.C., Ray, N., and Excoffier, L. (2016). Long-Distance Dispersal Shaped Patterns of Human Genetic Diversity in Eurasia. Molecular Biology and Evolution 33, 946–958.

Amiel, P. (2010). Ethnométhodologie appliquée Éléments de sociologie praxéologique. (Paris: Les presses du Lema).

Amorim, C.E.G., Bisso-Machado, R., Ramallo, V., Bortolini, M.C., Bonatto, S.L., Salzano, F.M., and Hünemeier, T. (2013). A Bayesian Approach to Genome/Linguistic Relationships in Native South Americans. PLoS ONE 8, e64099.

Atkinson, Q.D. (2011). Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa. Science 332, 346–349.

Atkinson, Q.D. (2013). The descent of words. Proceedings of the National Academy of Sciences 110, 4159–4160.

Atkinson, Q., and Gray, R. (2005). Curious Parallels and Curious Connections—Phylogenetic Thinking in Biology and Historical Linguistics. Systematic Biology 54, 513–526.

Atkinson, Q., Nicholls, G., Welch, D., and Gray, R. (2005). From words to dates: water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society 103, 193–219.

Atkinson, Q.D., Meade, A., Venditti, C., Greenhill, S.J., and Pagel, M. (2008). Languages Evolve in Punctuational Bursts. Science 319, 588–588.

Bahuchet, S. (2012). Changing language, remaining pygmy. Human Biology 84, 11–43.

Balanovsky, O., Dibirova, K., Dybo, A., Mudrak, O., Frolova, S., Pocheshkhova, E., Haber, M., Platt, D., Schurr, T., Haak, W., et al. (2011). Parallel Evolution of Genes and Languages in the Caucasus Region. Molecular Biology and Evolution 28, 2905–2920.

Balazs, I. (1993). Population genetics of 14 ethnic groups using phenotypic data from VNTR loci. EXS 67, 193–210.

Barberousse, A., and Samadi, S. (2015). Formalising Evolutionary Theory. In Handbook of Evolutionary Thinking in the Sciences, T. Heams, P. Huneman, G. Lecointre, and M. Silberstein, eds. (Dordrecht: Springer Netherlands), pp. 229–246.

Barbujani, G., and Sokal, R.R. (1990). Zones of sharp genetic change in Europe are also linguistic boundaries. Proceedings of the National Academy of Sciences 87, 1816–1819.

Baxter, G.J., Blythe, R.A., Croft, W., and McKane, A.J. (2006). Utterance selection model of language change. Phys. Rev. E 73, 046118.

Baxter, G.J., Blythe, R.A., Croft, W., and McKane, A.J. (2009). Modeling language change: An evaluation of Trudgill’s theory of the emergence of New Zealand English. Language Variation and Change 21, 257.

Beaumont, M.A., and Rannala, B. (2004). The Bayesian revolution in genetics. Nat Rev Genet 5, 251–261.

Beaumont, M.A., Zhang, W., and Balding, D.J. (2002). Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035.

Beckner, C., Blythe, R., Bybee, J., Christiansen, M.H., Croft, W., Ellis, N.C., Holland, J., Ke, J., Larsen-Freeman, D., and Schoenemann, T. (2009). Language is a complex adaptive system: Position paper. Language Learning 59, 1–26.

Belle, E.M.S., and Barbujani, G. (2007). Worldwide analysis of multiple microsatellites: Language diversity has a detectable influence on DNA diversity. American Journal of Physical Anthropology 133, 1137–1146.

Ben Hamed, M., and Darlu, P. (2007). Gènes et Langues : une longue histoire commune ? Bulletins et mémoires de la Société d’Anthropologie de Paris 243–264. 

Blevins, J. (2004). Evolutionary Phonology: The Emergence of Sound Patterns (Cambridge University Press).

Blum, M.G.B., and François, O. (2010). Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing 20, 63–73.

Blythe, R.A., and Croft, W. (2012). S-curves and the mechanisms of propagation in language change. Language 88, 269–304.

Bomin, S.L., Lecointre, G., and Heyer, E. (2016). The Evolution of Musical Diversity: The Key Role of Vertical Transmission. PLOS ONE 11, e0151570.

Bonfils, B. (1990). Connaissance scientifique et connaissance profane : de la générativité paradigmatique de l’opinion. Revue française de science politique 40, 382–391.

Bornand, S., and Leguy, C. (2013). Anthropologie des pratiques langagières (Armand Colin).

Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S.J., Alekseyenko, A.V., Drummond, A.J., Gray, R.D., Suchard, M.A., and Atkinson, Q.D. (2012). Mapping the Origins and Expansion of the Indo-European Language Family. Science 337, 957–960.

Bowern, C., and Atkinson, Q. (2012). Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88, 817–845.

Breiman, L. (1999). Random forests. UC Berkeley TR567.

Bühlera, È.A., Cavaillé, F., and Gambino, M. (2012). Le jeune chercheur et l’interdisciplinarité en sciences sociales, Young researchers and interdisciplinarity in social sciences. Reconsidering practices. Natures Sciences Sociétés 14, 392–398.

Cabrera, F. (2017). Cladistic Parsimony, Historical Linguistics and Cultural Phylogenetics. Mind Lang 32, 65–100.

Calame, C. (1986). Le récit en Grèce ancienne: énonciations et représentations de poètes (Méridiens/Klincksieck).

Campbell, L. (2006). Languages and Genes in Collaboration: some Practical Matters. (University of California, Santa Barbara), p.

Cangelosi, A., Smith, A.D.M., and Smith, K. (2006). The Evolution of Language: Proceedings of the 6th International Conference (EVOLANG6), Rome, Italy, 12-15 April 2006 (World Scientific).

Cann, R.L. (2001). Genetic Clues to Dispersal in Human Populations: Retracing the Past from the Present. Science 291, 1742–1748.

Cavalli-Sforza, L.L. (1997). Genes, peoples, and languages. Proceedings of the National Academy of Sciences 94, 7719–7724.

Cavalli-Sforza, L.L., and Feldman, M.W. (1981). Cultural transmission and evolution: a quantitative approach. Monogr Popul Biol 16, 1–388.

Cavalli-Sforza, L.L., and Feldman, M.W. (2003). The application of molecular genetic approaches to the study of human evolution. Nat. Genet. 33 Suppl, 266–275.

Cavalli-Sforza, L.L., Barrai, I., and Edwards, A.W.F. (1964). Analysis of Human Evolution Under Random Genetic Drift. Cold Spring Harb Symp Quant Biol 29, 9–20.

Cavalli-Sforza, L.L., Piazza, A., Menozzi, P., and Mountain, J. (1988). Reconstruction of human evolution: bringing together genetic, archaeological, and linguistic data. Proceedings of the National Academy of Sciences 85, 6002–6006.

Cavalli-Sforza, L.L., Minch, E., and Mountain, J.L. (1992). Coevolution of genes and languages revisited. Proceedings of the National Academy of Sciences 89, 5620–5624.

Čelakovský, F.L. (1853). Čtení o srovnavací mluvnici slovanské na Universitě pražské (Rivnáč).

Chafe, W.L. (1980). The Pear Stories: Cognitive, Cultural, and Linguistic Aspects of Narrative Production (Ablex).

Chakraborty, R. (1976). Cultural, language and geographical correlates of genetic variability in Andean highland Indians. Nature 264, 350–352.

Chakravarti, A. (1999). Population genetics—making sense out of sequence. Nature Genetics 21, 56–60.

Chomsky, N. (2006). Language and Mind (Cambridge ; New York: Cambridge University Press).

Claidière, N., and André, J.-B. (2012). The Transmission of Genes and Culture: A Questionable Analogy. Evolutionary Biology 39, 12–24.

Creanza, N., Ruhlen, M., Pemberton, T.J., Rosenberg, N.A., Feldman, M.W., and Ramachandran, S. (2015). A comparison of worldwide phonemic and genetic variation in human populations. Proceedings of the National Academy of Sciences 112, 1265–1272.

Croft, W. (1996). Linguistic Selection: An Utterance-based Evolutionary Theory of Language Change. Nordic Journal of Linguistics 19, 99.

Croft, W. (2006). The relevance of an evolutionary model to historical linguistics. In Competing Models of Linguistic Change: Evolution and Beyond, (John Benjamins Publishing), pp. 91–132.

Croft, W. (2008). Evolutionary Linguistics. Annual Review of Anthropology 37, 219–234.

Croft, W. (2013). Evolution: Language use and the evolution of languages. In The Language Phenomenon, (Springer), pp. 93–120.

Csilléry, K., Blum, M.G., Gaggiotti, O.E., and François, O. (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution 25, 410–418.

Csilléry, K., François, O., and Blum, M.G.B. (2012). abc: an R package for approximate Bayesian computation (ABC): R package: abc. Methods in Ecology and Evolution 3, 475–479.

Culbertson, J. (2012). Typological Universals as Reflections of Biased Learning: Evidence from Artificial Language Learning: Typological Universals as Reflections of Biased Learning. Language and Linguistics Compass 6, 310–329.

Danescu-Niculescu-Mizil, C., West, R., Jurafsky, D., Leskovec, J., and Potts, C. (2013). No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the 22nd International Conference on World Wide Web, (ACM), pp. 307–318.

Darlu, P., Bloothooft, G., Boattini, A., Brouwer, L., Brouwer, M., Brunet, G., Chareille, P., Cheshire, J., Coates, R., Dräger, K., et al. (2012). The Family Name as Socio-Cultural Feature and Genetic Metaphor: From Concepts to Methods. Human Biology 84, 169–214.

Darwin, C. (1871). The Descent of man (D. Appleton and Company).

Davidson, D. (1967). Truth and meaning. Synthese 17, 304–323.

Davidson, D. (1973). On the Very Idea of a Conceptual Scheme. Proceedings and Addresses of the American Philosophical Association 47, 5–20.

Debouzie, D. (1999). La notion de population en dynamique et génétique des populations. Nature Sciences Sociétés 7, 19–26.

Delamotte, É. (2004). Communautés professionnelles, sens commun et doctrine. Études de communication. langages, information, médiations.

D’Errico, F., and Hombert, J.M. (2009). Becoming eloquent advances in the emergence of language, human cognition, and modern cultures (Amsterdam; Philadelphia, Pa.: John Benjamins Pub. Co.).

Diller, K.C., and Cann, R.L. (2011). Genetic influences on language evolution: an evaluation of the evidence.

Drummond, A.J., and Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7, 214.

Dryer, M.S., and Haspelmath, M. (2013). The World Atlas of Language Structures Online (Leipzig: Max Planck Institute for Evolutionary Anthropology).

Duda, P., and Zrzavý, J. (2016). Human population history revealed by a supertree approach. Scientific Reports 6.

Estoup, A., Jarne, P., and Cornuet, J.-M. (2002). Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis. Molecular Ecology 11, 1591–1604.

Evans, N., and Levinson, S.C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32, 429.

Excoffier, L., and Foll, M. (2011). Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27, 1332–1334.

Excoffier, L., and Lischer, H.E.L. (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources 10, 564–567.

Excoffier, L., Dupanloup, I., Huerta-Sánchez, E., Sousa, V.C., and Foll, M. (2013). Robust Demographic Inference from Genomic and SNP Data. PLoS Genetics 9, e1003905.

Falush, D., van Dorp, L., and Lawson, D. (2016). A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. BioRxiv 066431.

Fitch, W.T. (2008). Glossogeny and phylogeny: cultural evolution meets genetic evolution. Trends in Genetics 24, 373–374.

Garza, J.C., and Williamson, E.G. (2001). Detection of reduction in population size using data from microsatellite loci. Molecular Ecology 10, 305–318.

Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14, 685–695.

Gayon, J. (2004). La génétique est-elle encore une discipline ? ms 20, 248–253.

Geisler, H., and List, J.-M. (2013). Do languages grow on trees? The tree metaphor in the history of linguistics. Classification and Evolution in Biology, Linguistics and the History of Science. Concepts–methods–visualization. Stuttgart: Franz Steiner Verlag 111–24.

Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L., and Feldman, M.W. (1995). Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. U.S.A. 92, 6723–6727.

Gong, T., and Wang, W.S. (2005). Computational modeling on language emergence: A coevolution model of lexicon, syntax and social structure. Language and Linguistics 6, 1.

Gould, S.J. (2002). The Structure of Evolutionary Theory (Harvard University Press).

Gould, S.J., and Lewontin, R.C. (1979). The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proceedings of the Royal Society of London B: Biological Sciences 205, 581–598.

Gray, R.D., and Atkinson, Q.D. (2002). Language-tree divergence times support the Anatolian theory of Indo-European origin. Geophysical Research Letters 29.

Gray, R.D., and Jordan, F.M. (2000). Language trees support the express-train sequence. Nature 405, 1052–1055.

Gray, R.D., Greenhill, S.J., and Ross, R.M. (2007). The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2, 360–375.

Gray, R.D., Drummond, A.J., and Greenhill, S.J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323, 479–483.

Greenhill, S.J., Currie, T.E., and Gray, R.D. (2009). Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society B: Biological Sciences 276, 2299–2306.

Guillot, E.G., Hazelton, M.L., Karafet, T.M., Lansing, J.S., Sudoyo, H., and Cox, M.P. (2015). Relaxed Observance of Traditional Marriage Rules Allows Social Connectivity without Loss of Genetic Diversity. Mol Biol Evol 32, 2254–2262.

Guillot, G., Mortier, F., and Estoup, A. (2005). Geneland: a computer package for landscape genetics. Molecular Ecology Notes 5, 712–715.

Gunya, A., N. F. Glazovsky., Leadership for Environment and Development., and Institut geografii (Rossijskaja akademija nauk) (2002). Yagnob valley: Nature, history, and chances of a mountain community development in Tadjikistan (Moscow: KMK Scientific Press).

Haber, M., Mezzavilla, M., Xue, Y., Comas, D., Gasparini, P., Zalloua, P., and Tyler-Smith, C. (2016). Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations. European Journal of Human Genetics 24, 931–936.

Haeckel, E. (1874). The Evolution of Man.

Hamed, M.B. (2005). Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China’s demic history. Proceedings of the Royal Society B: Biological Sciences 272, 1015–1022.

Hartl, D.L., and Clark, A.G. (2007). Principles of Population Genetics (Sinauer Associates, Incorporated).

Haspelmath, M. (1999). Optimality and diachronic adaptation. Zeitschrift für Sprachwissenschaft 18, 180–205.

Haspelmath, M., and Tadmor, U. (2009). Loanwords in the World’s Languages: A Comparative Handbook (Walter de Gruyter).

Haugen, E. (1950). The Analysis of Linguistic Borrowing. Language 26, 210.

Heeringa, W., and Nerbonne, J. (2001). Dialect areas and dialect continua. Language Variation and Change 13, 375–400.

Hellenthal, G., Busby, G.B., Band, G., Wilson, J.F., Capelli, C., Falush, D., and Myers, S. (2014). A genetic atlas of human admixture history. Science 343, 747–751.

Henrich, J. (2001). Cultural transmission and the diffusion of innovations: Adoption dynamics indicate that biased cultural transmission is the predominate force in behavioral change. American Anthropologist 103, 992–1013.

Henry, J.-P., and Gouyon, P.-H. (1999). Précis de génétique des populations: cours, exercices et problèmes résolus (Dunod).

Heyer, E., Balaresque, P., Jobling, M.A., Quintana-Murci, L., Chaix, R., Segurel, L., Aldashev, A., and Hegay, T. (2009). Genetic diversity and the emergence of ethnic groups in Central Asia. BMC Genetics 10, 49.

Hoban, S., Bertorelle, G., and Gaggiotti, O.E. (2012). Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics 13, 110–122.

Holbrook, J.B. (2013). What is interdisciplinary communication? Reflections on the very idea of disciplinary integration. Synthese 190, 1865–1879.

Huang, T., Shu, Y., and Cai, Y.-D. (2015). Genetic differences among ethnic groups. BMC Genomics 16, 1093.

Huelsenbeck, J.P., and Crandall, K.A. (1997). Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood. Annual Review of Ecology and Systematics 28, 437–466.

Hull, D.L. (1988). Science as a Process. An Evolutionary Account of the Social and Conceptual Development of Science.

Hunley, K. (2015). Reassessment of global gene–language coevolution. Proceedings of the National Academy of Sciences 112, 1919–1920.

Hunley, K., Dunn, M., Lindström, E., Reesink, G., Terrill, A., Healy, M.E., Koki, G., Friedlaender, F.R., and Friedlaender, J.S. (2008). Genetic and Linguistic Coevolution in Northern Island Melanesia. PLoS Genetics 4, e1000239.

Hunley, K., Bowern, C., and Healy, M. (2012). Rejection of a serial founder effects model of genetic and linguistic coevolution. Proceedings of the Royal Society B: Biological Sciences 279, 2281–2288.

I. Barrai, A. Rodriguez-Larralde, E (2000). Elements of the surname structure of Austria. Annals of Human Biology 27, 607–622.

Jobling, M.A., Hurles, M., and Tyler-Smith, C. (2003). Human Evolutionary Genetics: Origins, Peoples and Disease (New York: Garland Science).

Jones, W. (1786). The Sanskrit Language.

Judson, O.P. (1994). The rise of the individual-based model in ecology. Trends in Ecology & Evolution 9, 9–14.

Kandler, A., Unger, R., and Steele, J. (2010). Language shift, bilingualism and the future of Britain’s Celtic languages. Philosophical Transactions of the Royal Society B: Biological Sciences 365, 3855–3864.

Kasavin, I.T. (2009). L’idée d’interdisciplinarité dans l’épistémologie contemporaine. Diogène 38–57.

Kauhanen, H. (2016). Neutral change. Journal of Linguistics 1–32.

Khuri, S.F., Henderson, W.G., Daley, J., Jonasson, O., Jones, R.S., Campbell, D.A., Fink, A.S., Mentzer, R.M., and Steeger, J.E. (2007). The Patient Safety in Surgery Study: Background, Study Design, and Patient Populations. Journal of the American College of Surgeons 204, 1089–1102.

Kimura, M. (1983). The Neutral Theory of Molecular Evolution (Cambridge University Press).

Kingman, J.F.C. (1982). The coalescent. Stochastic Processes and Their Applications 13, 235–248.

Kirby, S. (2000). The role of I-language in diachronic adaptation. Sprachwissenschaft 18, 2.

Kirby, S. (2001). Spontaneous evolution of linguistic structure-an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation 5, 102–110.

Kirby, K.R., Gray, R.D., Greenhill, S.J., Jordan, F.M., Gomes-Ng, S., Bibiko, H.-J., Blasi, D.E., Botero, C.A., Bowern, C., Ember, C.R., et al. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLOS ONE 11, e0158391.

Kirby, S., Dowman, M., and Griffiths, T.L. (2007). Innateness and culture in the evolution of language. PNAS 104, 5241–5245.

Kirby, S., Cornish, H., and Smith, K. (2008). Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences 105, 10681–10686.

Kirby, S., Griffiths, T., and Smith, K. (2014). Iterated learning and the evolution of language. Current Opinion in Neurobiology 28, 108–114.

Kirby, S., Tamariz, M., Cornish, H., and Smith, K. (2015). Compression and communication in the cultural evolution of linguistic structure. Cognition 141, 87–102.

Klein, J.T. (2013). Communication and collaboration in interdisciplinary research. Enhancing Communication & Collaboration in Crossdisciplinary Research, Edited by M. O’Rourke, S. Crowley, SD Eigenbrode, and JD Wulfhorst 11–30.

Labov, W. (1972). Sociolinguistic Patterns (University of Pennsylvania Press).

Lakatos, I. (1976). Falsification and the Methodology of Scientific Research Programmes. In Can Theories Be Refuted?, S.G. Harding, ed. (Springer Netherlands), pp. 205–259.

Lansing, J.S., Cox, M.P., Downey, S.S., Gabler, B.M., Hallmark, B., Karafet, T.M., Norquest, P., Schoenfelder, J.W., Sudoyo, H., Watkins, J.C., et al. (2007). Coevolution of languages and genes on the island of Sumba, eastern Indonesia. Proceedings of the National Academy of Sciences 104, 16022–16026.

Lê, S., Josse, J., Husson, F., and others (2008). FactoMineR: an R package for multivariate analysis. Journal of Statistical Software 25, 1–18.

Lees, R.B. (1953). The Basis of Glottochronology. Language 29, 113–127.

Lefevre, T., Raymond, M., and Thomas, F. (2016). Biologie évolutive (De Boeck Supérieur).

Lewontin, R.C. (1970). The units of selection. Annual Review of Ecology and Systematics 1, 1–18.

List, J.-M., Nelson-Sathi, S., Geisler, H., and Martin, W. (2014). Networks of lexical borrowing and lateral gene transfer in language and genome evolution: Think again. BioEssays 36, 141–150.

List, J.-M., Pathmanathan, J.S., Lopez, P., and Bapteste, E. (2016). Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.

Livingstone, D., and Fyfe, C. (1999). Modelling the evolution of linguistic diversity. Advances in Artificial Life 704–708.

Long, J.C. (1991). The genetic structure of admixed populations. Genetics 127, 417–428.

MacIntyre, A. (1988). Whose justice? Which rationality? (Duckworth).

Maingueneau, D. (1979). “L’analyse du discours” - Persée. Repères pour la rénovation de l’enseignement du français à l’école élémentaire 51, 3–27.

Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A., et al. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206.

Manni, F. (2017). Linguistic probes into human history. University of Groningen.

Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research 27, 209–220.

Mardis, E.R. (2008). Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics 9, 387–402.

Martínez-Cruz, B., Vitalis, R., Ségurel, L., Austerlitz, F., Georges, M., Théry, S., Quintana-Murci, L., Hegay, T., Aldashev, A., Nasyrova, F., et al. (2011). In the heartland of Eurasia: the multilocus genetic landscape of Central Asian populations. European Journal of Human Genetics 19, 216–223.

Maynard Smith, J. (1987). How to model evolution. In The Latest on the Best, Essays on Evolution and Optimality, (Cambridge: MIT Press), pp. 119–131.

McDonald, S.P., Collins, J.F., and Johnson, D.W. (2003). Obesity Is Associated with Worse Peritoneal Dialysis Outcomes in the Australia and New Zealand Patient Populations. JASN 14, 2894–2901.

Mennecier, P., Nerbonne, J., Heyer, E., and Manni, F. (2016a). A Central Asian Language Survey. Language Dynamics and Change 6, 57–98.

Mennecier, P., Nerbonne, J., Heyer, E., and Manni, F. (2016b). A Central Asian language survey: Collecting data, measuring relatedness and detecting loans.

Mesoudi, A., Whiten, A., and Laland, K.N. (2006). Towards a unified science of cultural evolution. Behavioral and Brain Sciences 29, 329–347.

Moon, J.H. (1994). Putting Anthropology Back Together Again: The Ethnogenetic Critique of Cladistic Theory. American Anthropologist 96, 925–948.

Moran, P.A.P. (1958). Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society 54, 60.

Moreno-Estrada, A., Gravel, S., Zakharia, F., McCauley, J.L., Byrnes, J.K., Gignoux, C.R., Ortiz-Tello, P.A., Martínez, R.J., Hedges, D.J., Morris, R.W., et al. (2013). Reconstructing the Population Genetic History of the Caribbean. PLoS Genetics 9, e1003925.

Mullis, K.B., and Faloona, F.A. (1987). Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Meth. Enzymol. 155, 335–350.

Murphy, G.L., and Medin, D.L. (1985). The role of theories in conceptual coherence. Psychological Review 92, 289.

Nadasdi, T., Mougeon, R., and Rehner, K. (2008). Factors driving lexical variation in L2 French: A variationist study of automobile, auto, voiture, char and machine. Journal of French Language Studies 18, 365–381.

Nagel, E. (1961). The Structure of Science. American Journal of Physics 29, 716–716.

Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences 70, 3321–3323.

Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Annals of Human Genetics 41, 225–233.

Nei, M. (1987). Molecular Evolutionary Genetics (Columbia University Press).

Nettle, D., and Harriss, L. (2003). Genetic and linguistic affinities between human populations in Eurasia and West Africa. Human Biology 331–344.

Nguyên-Duy, V., and Luckerhoff, J. (2006). Constructivisme/positivisme : où en sommes-nous avec cette opposition? (Université McGill (Montréal),), p.

Niyogi, P., and Berwick, R.C. (1997). Evolutionary Consequences of Language Learning. Linguistics and Philosophy 20, 697–719.

Novembre, J., and Stephens, M. (2008). Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40, 646–649.

Nowak, M.A., Komarova, N.L., and Niyogi, P. (2002). Computational and evolutionary aspects of language. Nature 417, 611.

Pagel, M. (2009). Human language as a culturally transmitted replicator. Nature Reviews Genetics.

Pagel, M., Atkinson, Q.D., and Meade, A. (2007a). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717–720.

Pagel, M., Atkinson, Q.D., and Meade, A. (2007b). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717–720.

Pagel, M., Atkinson, Q.D., Calude, A.S., and Meade, A. (2013). Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences 110, 8471–8476.

Palminteri, S., Wyart, V., and Koechlin, E. (2017). The Importance of Falsification in Computational Cognitive Modeling. Trends in Cognitive Sciences 21, 425–433.

Palstra, F.P., Heyer, E., and Austerlitz, F. (2015). Statistical inference on genetic data reveals the complex demographic history of human populations in Central Asia. Molecular Biology and Evolution msv030.

Paradis, E., Claude, J., and Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289–290.

Pateman, T. (1983). What is a language? Language & Communication 3, 101–127.

Popper, K. (1979). Three worlds (Ann Arbor: University of Michigan).

Preyer, G., and Peter, G. (2005). Contextualism in philosophy: knowledge, meaning, and truth (Oxford; New York: Clarendon Press; Oxford University Press).

Pritchard, J.K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945–959.

Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., and Robert, C.P. (2016). Reliable ABC model choice via random forests. Bioinformatics 32, 859–866.

Ramachandran, S., Deshpande, O., Roseman, C.C., Rosenberg, N.A., Feldman, M.W., and Cavalli-Sforza, L.L. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proceedings of the National Academy of Sciences of the United States of America 102, 15942–15947.

Ramallo, V., Bisso-Machado, R., Bravi, C., Coble, M.D., Salzano, F.M., Hünemeier, T., and Bortolini, M.C. (2013). Demographic expansions in South America: Enlightening a complex scenario with genetic and linguistic data. American Journal of Physical Anthropology 150, 453–463.

Reali, F., and Griffiths, T.L. (2010). Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift. Proceedings of the Royal Society B: Biological Sciences 277, 429–436.

Reesink, G., Singer, R., and Dunn, M. (2009). Explaining the Linguistic Diversity of Sahul Using Population Models. PLOS Biology 7, e1000241.

Reich, D., Price, A.L., and Patterson, N. (2008). Principal component analysis of genetic data. Nat Genet 40, 491–492.

Reich, D., Thangaraj, K., Patterson, N., Price, A.L., and Singh, L. (2009). Reconstructing Indian population history. Nature 461, 489–494.

Renfrew, C. (1987). Archaeology and language: the puzzle of Indo-European origins (J. Cape).

Resnick, L.B., Levine, J.M., and Teasley, S.D. (1991). Perspectives on Socially Shared Cognition (American Psychological Association).

Robinson, J.D., Bunnefeld, L., Hearn, J., Stone, G.N., and Hickerson, M.J. (2014). ABC inference of multi-population divergence with admixture from unphased population genomic data. Molecular Ecology 23, 4458–4471.

Rogers, D.S., Feldman, M.W., and Ehrlich, P.R. (2009). Inferring population histories using cultural data. Proceedings of the Royal Society B: Biological Sciences 276, 3835–3843.

Ruhlen, M. (1991). A Guide to the World’s Languages: Classification (Stanford University Press).

Saffran, J.R. (2003). Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science 12, 110–114.

Sagaut, P. (2008). Introduction à la pensée scientifique moderne.

Saitou, N., and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425.

Saussure, F. de (1916). Cours de linguistique générale (Payot).

Scheinfeldt, L.B., Soi, S., and Tishkoff, S.A. (2010). Working toward a synthesis of archaeological, linguistic, and genetic data for inferring African population history. Proceedings of the National Academy of Sciences 107, 8931–8938.

Schiffels, S., and Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics 46, 919–925.

Schleicher, A. (1853). Die ersten Spaltungen des indogermanischen Urvolkes [The first splits of the Indo-European prehistoric people].

Sellars, W.S. (1956). Empiricism and the Philosophy of Mind. Minnesota Studies in the Philosophy of Science 1, 253–329.

Smith, A.D.M. (2014). Models of language evolution and change. WIREs Cogn Sci 5, 281–293.

Smith, K., Kirby, S., and Brighton, H. (2003). Iterated learning: A framework for the emergence of language. Artificial Life 9, 371–386.

Soucek, S. (2000). A History of Inner Asia (Cambridge University Press).

Steele, J., and Kandler, A. (2010). Language trees ≠ gene trees. Theory in Biosciences 129, 223–233.

Steels, L. (1997). The Synthetic Modeling of Language Origins. Evolution of Communication 1, 1–34.

Steels, L. (2004). Analogies between genome and language evolution. In Artificial Life IX: Proceedings of the Ninth International Conference on the Simulation and Synthesis of Artificial Life, (MIT Press), p. 200.

Steels, L. (2011). Modeling the cultural evolution of language. Physics of Life Reviews 8, 339–356.

Stephens, M., and Donnelly, P. (2003). A comparison of bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics 73, 1162–1169.

Sturtevant, E.H. (1947). An Introduction to Linguistic Science.

Suppes, P. (1961). A comparison of the meaning and uses of models in mathematics and the empirical sciences. In The Concept and the Role of the Model in Mathematics and Natural and Social Sciences, (Springer), pp. 163–177.

Swadesh, M. (1952). Lexico-Statistic Dating of Prehistoric Ethnic Contacts: With Special Reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96, 452–463.

Szathmáry, E., and Maynard Smith, J. (1997). From replicators to reproducers: the first major transitions leading to life. J. Theor. Biol. 187, 555–571.

Tamariz, M., and Kirby, S. (2015). Culture: Copying, Compression, and Conventionality. Cognitive Science 39, 171–183.

Tamariz, M., and Kirby, S. (2016). The cultural evolution of language. Current Opinion in Psychology 8, 37–43.

Tao Gong (2010). Exploring the Roles of Horizontal, Vertical, and Oblique Transmissions in Language Evolution. Adaptive Behavior 18, 356–376.

Tavaré, S., Balding, D.J., Griffiths, R.C., and Donnelly, P. (1997). Inferring Coalescence Times from DNA Sequence Data. Genetics 145, 505–518.

Tehrani, J.J. (2013). The Phylogeny of Little Red Riding Hood. PLOS ONE 8, e78871.

Testart, A. (2011). Les modèles biologiques sont-ils utiles pour penser l’évolution des sociétés? Préhistoires Méditerranéennes.

Thioulouse, J., Chessel, D., Dolédec, S., and Olivier, J.-M. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing 7, 75–83.

Thomsen, O.N. (2006). Competing Models of Linguistic Change: Evolution and Beyond (John Benjamins Publishing).

Thouzeau, V., Mennecier, P., Verdu, P., and Austerlitz, F. (2017). Genetic and linguistic histories in Central Asia inferred using approximate Bayesian computations. Proc. R. Soc. B 284, 20170706.

Verdu, P., Austerlitz, F., Estoup, A., Vitalis, R., Georges, M., Théry, S., Froment, A., Le Bomin, S., Gessain, A., Hombert, J.-M., et al. (2009). Origins and Genetic Diversity of Pygmy Hunter-Gatherers from Western Central Africa. Current Biology 19, 312–318.

Verdu, P., Jewett, E.M., Pemberton, T.J., Rosenberg, N.A., and Baptista, M. (2017). Parallel Trajectories of Genetic and Linguistic Admixture in a Genetically Admixed Creole Population. Current Biology 27, 2529–2535.e3.

Verleyen, S. (2007). Le fonctionnalisme entre système linguistique et sujet parlant: Jakobson et Troubetzkoy face à Martinet. Cahiers Ferdinand de Saussure 163–188.

Vogt, P. (2009). Modeling interactions between language evolution and demography. Human Biology 81, 237–258.

Waples, R.S., and Gaggiotti, O. (2006). What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular Ecology 15, 1419–1439.

Ward, R.H., Redd, A., Valencia, D., Frazier, B., and Pääbo, S. (1993). Genetic and linguistic differentiation in the Americas. Proceedings of the National Academy of Sciences 90, 10663–10667.

Watson, C.I., Maclagan, M., and Harrington, J. (2000). Acoustic evidence for vowel change in New Zealand English. Language Variation and Change 12, 51–68.

Weber, J.L., and Wong, C. (1993). Mutation of human short tandem repeats. Human Molecular Genetics 2, 1123–1128.

Weir, B.S., and Cockerham, C.C. (1984). Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358.

Weiss, G., and von Haeseler, A. (1998). Inference of population history using a likelihood approach. Genetics 149, 1539–1546.

Wiley, E.O., and Lieberman, B.S. (2011). Phylogenetics: Theory and Practice of Phylogenetic Systematics (John Wiley & Sons).

Wittgenstein, L. (1953). Recherches philosophiques (Editions Gallimard).

Wright, S. (1942). Statistical genetics and evolution. Bull. Amer. Math. Soc. 48, 223–246.

Wright, S. (1951). The Genetical Structure of Populations. Annals of Eugenics 15, 323–354.

Zerjal, T., Xue, Y., Bertorelle, G., Wells, R.S., Bao, W., Zhu, S., Qamar, R., Ayub, Q., Mohyuddin, A., Fu, S., et al. (2003). The genetic legacy of the Mongols. The American Journal of Human Genetics 72, 717–721.

Zuidema, W., and Boer, B. de (2009). The evolution of combinatorial phonology. Journal of Phonetics 37, 125–144.

APPENDIX: Supplementary information on the Approximate Bayesian Computation procedures

1. Linguistic Model

For models assuming a borrowing process between variety 1 and variety 2 (Figure S1a), each cognate was borrowed (i.e. it adopted the identifier of the other variety) with probability δL. For models assuming an admixture event between varieties 0 and 2 (Figure S1b), a new variety 1 was created, and each cognate of variety 1 was drawn from variety 2 with probability rL and from variety 0 with probability 1 – rL. The branches evolved independently. We assumed a constant cognate mutation rate μL across branches and through time.
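As an illustration only, the per-cognate borrowing, admixture and mutation steps of this model could be sketched as follows. This is a minimal Python sketch and not the PopLingSim implementation: the function names, the representation of a variety as a list of integer cognate identifiers (one per meaning), the one-directional borrowing, and the rule that a mutation creates a brand-new identifier are all assumptions made for the example.

```python
import random

def borrow(recipient, donor, delta_l, rng):
    """Each cognate of `recipient` adopts the donor's identifier with probability delta_l.
    (Assumption: borrowing is shown in one direction only.)"""
    return [d if rng.random() < delta_l else r for r, d in zip(recipient, donor)]

def admix(variety_0, variety_2, r_l, rng):
    """Create a new variety 1: each cognate comes from variety 2 with
    probability r_l and from variety 0 with probability 1 - r_l."""
    return [c2 if rng.random() < r_l else c0 for c0, c2 in zip(variety_0, variety_2)]

def mutate(variety, mu_l, next_id, rng):
    """Replace each cognate by a new, previously unused identifier with
    probability mu_l (assumed mutation mechanism). Returns the new variety
    and the next free identifier."""
    out = []
    for cognate in variety:
        if rng.random() < mu_l:
            out.append(next_id)
            next_id += 1
        else:
            out.append(cognate)
    return out, next_id

if __name__ == "__main__":
    rng = random.Random(2017)
    v0 = [0] * 10   # ten meanings, all sharing the ancestral cognate 0
    v2 = [1] * 10   # a diverged variety carrying cognate 1 everywhere
    v1 = admix(v0, v2, r_l=0.3, rng=rng)
    v1, next_free = mutate(v1, mu_l=0.005, next_id=2, rng=rng)
    print(v1, next_free)
```

In a full simulation these steps would be applied once per linguistic generation along each branch, with the branches otherwise evolving independently.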

2. Prior distributions for the linguistic model parameters

We simulated datasets of 185 cognates each. For each simulation, the mean mutation rate μL was drawn from U[0, 10⁻²]. This is consistent with previous estimates (Pagel et al., 2007b), which give a mean cognate mutation rate per generation between 6.1×10⁻³ and 9.15×10⁻³ (for generation times of 20 and 30 years, respectively). The mutation rate μL,i of each cognate i was then drawn independently from a beta distribution with mean μL and parameter β = 1, which we implemented in our simulation software PopLingSim using the ratio between two gamma distributions drawn using the C++ library <random>. Indeed, if X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent variables, then X/(X + Y) ~ Beta(α, β). The borrowing rate δL was drawn from U[0, 0.1], i.e. at most 10% of cognates could be borrowed at each linguistic generation, a value that already represents a massive amount of linguistic exchange in a single generation. The admixture rate rL was drawn from U[0, 1]. The split times t0 and t1 were drawn from U[1, 1000], with the constraint t0 > t1. The upper limit of the prior for these split times was thus roughly twice the previous estimates in these language families (Pagel et al., 2013).
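The gamma-ratio construction of the beta distribution can be illustrated with a short Python sketch. It is not the C++ code of PopLingSim: the function names are hypothetical, and the way the shape parameter α is obtained from the mean is our own derivation, using the fact that the mean of Beta(α, β) is α/(α + β), so that a target mean μ with β = 1 gives α = μ/(1 − μ).

```python
import random

def beta_from_gammas(alpha, beta, rng):
    """Draw from Beta(alpha, beta) as X / (X + Y), with X ~ Gamma(alpha, 1)
    and Y ~ Gamma(beta, 1) independent (common scale parameter)."""
    x = rng.gammavariate(alpha, 1.0)
    y = rng.gammavariate(beta, 1.0)
    total = x + y
    # Guard against the (extremely unlikely) case where both draws underflow to 0.
    return x / total if total > 0.0 else 0.0

def cognate_rates(mean_rate, n_cognates, rng, beta=1.0):
    """Per-cognate mutation rates from a beta distribution with the given mean.
    Solving mean = alpha / (alpha + beta) gives alpha = mean * beta / (1 - mean)."""
    if mean_rate <= 0.0:
        return [0.0] * n_cognates          # degenerate case: no mutation at all
    alpha = mean_rate * beta / (1.0 - mean_rate)
    return [beta_from_gammas(alpha, beta, rng) for _ in range(n_cognates)]

if __name__ == "__main__":
    rng = random.Random(185)
    mu_l = rng.uniform(0.0, 1e-2)          # mean rate drawn from U[0, 10^-2]
    rates = cognate_rates(mu_l, 185, rng)  # one rate per cognate
    print(mu_l, sum(rates) / len(rates))
```

With β = 1 and a small mean, α is well below 1, so the resulting distribution is strongly right-skewed: most cognates receive a rate close to zero and a few receive a much larger one.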

3. Genetic model

Genetic data were simulated using FastSimCoal 2.5.1 (Excoffier and Foll, 2011; Excoffier et al., 2013). Microsatellite data were simulated assuming a generalized stepwise mutation model with an infinite number of potential alleles (Estoup et al., 2002) and a probability of a mutation of more than one step p = 0.22, a value commonly considered realistic in the literature (Estoup et al., 2002). We assumed a pure split process, a split followed by migration, or a split followed by admixture (Figure S2). The three populations Pop0, Pop1 and Pop2 had effective sizes N0, N1 and N2, respectively (Figure S2).

If we assumed a migration process between population 1 and population 2 (Figure S2a), each individual migrated with probability δG. Populations 0 and 1 split at time t1 from ancestral population 3, of effective size N3. Populations 2 and 3 split at time t0 from ancestral population 4, of effective size N4. For the model assuming an admixture event between populations 0 and 2 (Figure S2b), a new population 1 appeared at time t1, made up of individuals drawn from population 0 with probability 1 – rG and from population 2 with probability rG. In that case, populations 0 and 2 split from the ancestral population 3 at time t0, with associated effective size N3.

4. Prior distributions for the genetic model parameters

The mean mutation rate μG of the microsatellite loci was drawn from U[10⁻⁴, 10⁻³] (Weber and Wong, 1993). The mutation rate μG,i of each locus i was drawn independently from a beta distribution with mean μG and parameter β = 1. We assumed that all markers were unlinked. Population effective sizes (N0, N1, N2, N3, N4) were each drawn from U[100, 100000]. Migration rates δG were drawn from U[0, 0.1], and admixture rates rG from U[0, 1]. The split times (t0, t1) were drawn from U[1, 1000] with the constraint t0 > t1.
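A draw from these genetic priors might look like the following Python sketch. The dictionary layout and the function name are hypothetical. Two further choices are assumptions rather than statements about the original pipeline: split times and effective sizes are drawn as integers, and the constraint t0 > t1 is enforced by redrawing the pair until it holds.

```python
import random

def draw_genetic_priors(rng):
    """One joint draw of the genetic model parameters from their priors."""
    # Rejection step: redraw both split times until t0 > t1 (assumed mechanism).
    while True:
        t0 = rng.randint(1, 1000)
        t1 = rng.randint(1, 1000)
        if t0 > t1:
            break
    return {
        "mu_g": rng.uniform(1e-4, 1e-3),                          # mean mutation rate
        "sizes": [rng.randint(100, 100000) for _ in range(5)],    # N0..N4
        "delta_g": rng.uniform(0.0, 0.1),                         # migration rate
        "r_g": rng.uniform(0.0, 1.0),                             # admixture rate
        "t0": t0,
        "t1": t1,
    }

if __name__ == "__main__":
    rng = random.Random(26)
    print(draw_genetic_priors(rng))
```

Depending on the scenario, only a subset of these values would be used: δG for the migration scenario, rG for the admixture scenario, and N4 only when two ancestral populations exist.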

5. Summary Statistics

For the genetic data, we used the ArlSumStats software (Arlequin suite v.3.5.2.2, Excoffier and Lischer, 2010) to compute a large set of standard diversity indices available for microsatellite data. For each population, we computed the means and the standard deviations across the 26 loci of the number of alleles K, the expected gene diversity Ĥ (Nei, 1977), the difference between the maximum and the minimum number of repeats R, and the G-W index (Garza and Williamson, 2001). We also estimated all pairwise FST values between populations (Weir and Cockerham, 1984) and the average pairwise squared distance between alleles (δμ)² (Goldstein et al., 1995). Altogether, this provided 42 population genetics statistics.
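For readers unfamiliar with the within-population indices, the following Python sketch computes them for one population from repeat counts. It is not ArlSumStats and may differ from it in detail. The function names are hypothetical, the gene diversity uses the usual unbiased estimator n/(n − 1)·(1 − Σ pᵢ²), and the G-W index is taken as K/(R + 1), which is our assumption about the variant being computed.

```python
from collections import Counter
from statistics import mean, stdev

def locus_stats(alleles):
    """Diversity indices for one microsatellite locus.
    `alleles` is a list of repeat counts, one per sampled gene copy (n >= 2)."""
    n = len(alleles)
    counts = Counter(alleles)
    k = len(counts)                                    # number of alleles K
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    h = n / (n - 1) * (1.0 - homozygosity)             # expected gene diversity
    r = max(alleles) - min(alleles)                    # allelic range R
    gw = k / (r + 1)                                   # G-W index (assumed K/(R+1) form)
    return {"K": k, "H": h, "R": r, "GW": gw}

def population_stats(loci):
    """Mean and standard deviation of each index across loci (needs >= 2 loci)."""
    per_locus = [locus_stats(a) for a in loci]
    summary = {}
    for key in ("K", "H", "R", "GW"):
        values = [s[key] for s in per_locus]
        summary["mean_" + key] = mean(values)
        summary["sd_" + key] = stdev(values)
    return summary

if __name__ == "__main__":
    toy_population = [[10, 10, 12, 13], [5, 5, 5, 5], [8, 9, 9, 11]]
    print(population_stats(toy_population))
```

The pairwise statistics (FST and (δμ)²) compare two populations at a time and are not covered by this sketch.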

While these genetic summary statistics are classically used to describe genetic diversity, there are no consensus statistics for describing linguistic diversity in a set of cognates. We constructed six linguistic statistics to explore the cognate variability in the whole data set and between all pairs of linguistic varieties (Figure S8, see code in Repository): the mean number of cognates per meaning C, the variance of the number of cognates per meaning V(C), the range of the number of cognates per meaning R(C), the number of meanings with only one cognate throughout the linguistic varieties Cs, the number of meanings with a different cognate in each linguistic variety Cd, and the number of pairwise differences between linguistic varieties Di-j.
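One possible reading of these six statistics is sketched in Python here; the code referred to in the Repository remains the authoritative version. The sketch assumes that the data are a table with one row per meaning and one cognate identifier per variety, that the "number of cognates per meaning" is the number of distinct identifiers in a row, and that V(C) is a population variance. The function name and the dictionary keys are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance

def linguistic_summary(data):
    """Six summary statistics for a cognate table.
    data[m][v] is the cognate identifier of meaning m in variety v."""
    n_varieties = len(data[0])
    counts = [len(set(row)) for row in data]       # distinct cognates per meaning
    stats = {
        "C": mean(counts),                         # mean number of cognates
        "V(C)": pvariance(counts),                 # variance (population form, assumed)
        "R(C)": max(counts) - min(counts),         # range
        "Cs": sum(1 for c in counts if c == 1),    # one cognate shared by all varieties
        "Cd": sum(1 for c in counts if c == n_varieties),  # all varieties differ
    }
    # Pairwise differences: meanings for which varieties i and j carry different cognates.
    for i, j in combinations(range(n_varieties), 2):
        stats[f"D{i}-{j}"] = sum(1 for row in data if row[i] != row[j])
    return stats

if __name__ == "__main__":
    toy = [
        [1, 1, 1],   # same cognate everywhere
        [1, 2, 3],   # a different cognate in each variety
        [1, 1, 2],
        [4, 5, 4],
    ]
    print(linguistic_summary(toy))
```

For a triplet of varieties this yields five global statistics plus three pairwise differences.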

Prior checking was performed by verifying that simulated data sets were close to the real data set, using direct checking on a PCA and a goodness-of-fit test (R package abc, function gfit (Csilléry et al., 2012)).

6. Scenario selection using random forest (RF)

For each triplet, we conducted an ABC analysis to determine the best historical scenario for the genetic and the linguistic data, respectively. We generated a reference table of 10,000 simulated datasets for each linguistic and each genetic scenario, and for each triplet separately. This low number of simulations is made possible by the use of a random forest algorithm, which needs far fewer simulations than other classical model-selection methods in ABC (Pudlo et al., 2016). In the UZA case, we thus simulated 100,000 datasets per triplet for the five genetic and the five linguistic scenarios. In the TJY case, we simulated 40,000 datasets per triplet for the two genetic and two linguistic scenarios. Over the 72 triplets, we therefore conducted 7,200,000 simulations for the UZA case and 2,880,000 simulations for the TJY case.
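The simulation counts given for scenario selection follow directly from the number of scenarios, the size of each reference table and the number of triplets. The short Python check here makes the arithmetic explicit; the constant and function names are ours.

```python
SIMS_PER_SCENARIO = 10_000   # reference-table size per scenario and per triplet
N_TRIPLETS = 72              # triplets of populations analysed in each case

def scenario_selection_counts(n_genetic, n_linguistic):
    """Return (simulations per triplet, simulations over all triplets)."""
    per_triplet = (n_genetic + n_linguistic) * SIMS_PER_SCENARIO
    return per_triplet, per_triplet * N_TRIPLETS

if __name__ == "__main__":
    print("UZA:", scenario_selection_counts(5, 5))   # five genetic, five linguistic
    print("TJY:", scenario_selection_counts(2, 2))   # two genetic, two linguistic
```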

For each triplet separately, we selected the most likely scenario using a random forest (RF) decision algorithm implemented in the R package abcrf (Pudlo et al., 2016) (script in Repository). This algorithm builds a forest of decision trees that predict the scenario index from the summary statistics. We graphically checked the convergence of error rates with 500 trees (Pudlo et al., 2016) in both case studies conducted here (results not shown). For each case (UZA or TJY), we produced 72 decisions for the genetic data and 72 decisions for the linguistic data, corresponding to the 72 triplets of populations analysed.

7. Parameter estimation using neural networks (NN)

After selecting the most likely scenario, we estimated the posterior distributions of its constitutive parameters. We generated a reference table of 1,000,000 simulated datasets for the selected linguistic scenarios and the selected genetic scenarios. The best genetic and linguistic scenarios for the UZA case produced 2,000,000 simulations. The best genetic and linguistic scenarios for the TJY case produced 3,000,000 simulations. Repeated over the 72 analyses, this yielded a total of 144,000,000 simulations for the UZA case and 216,000,000 simulations for the TJY case.

For each triplet, we obtained a posterior distribution for each parameter using the neuralnet option of the R package abc, which implements a series of neural networks (NN) with one hidden layer (Blum and François, 2010). As previously, we performed a cross-validation analysis using an out-of-bag approach over 100 pseudo-observed datasets (Csilléry et al., 2012). We verified that the NN method estimated the model parameters better than the linear, loclinear, and ridge methods.

We first accepted the 1% of the simulations closest to the real dataset, based

on the Euclidean distances between the observed summary statistics and the simulated

summary statistics. A logistic transformation was then used to constrain the intervals

of the estimated parameters, before using the NN itself. The sizes of the hidden layer

were four and seven neurons for the linguistic and genetic parameter estimations,

respectively. These sizes were set according to the assumed dimensionality of the

problem, which is necessarily lower than the number of parameters. No rule sets this

layer size (Csilléry et al., 2012), but the risk of over-fitting increases when it is large. To

limit this risk, we empirically tested several numbers of neurons

to obtain the best estimation while avoiding over-fitting. We finally pooled the

posterior parameter distributions over the triplets supporting the most probable scenario (see

Figures S12-S16 and Tables S2-S6).
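
The two preliminary steps described in this section, the 1% rejection step and the logistic transformation, can be illustrated with a minimal Python sketch. It is not the abc package code: the toy reference table, the function names (`reject`, `logit`, `inv_logit`) and the prior bounds are assumptions made for illustration, and any rescaling of the summary statistics before computing distances is left out.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two vectors of summary statistics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reject(observed, sim_stats, sim_params, tol=0.01):
    """Keep the fraction `tol` of simulations closest to `observed`."""
    order = sorted(range(len(sim_stats)),
                   key=lambda i: euclidean(sim_stats[i], observed))
    n_keep = max(1, int(round(tol * len(sim_stats))))
    return [sim_params[i] for i in order[:n_keep]]

def logit(theta, lower, upper):
    """Map a parameter from (lower, upper) to the real line."""
    p = (theta - lower) / (upper - lower)
    return math.log(p / (1.0 - p))

def inv_logit(z, lower, upper):
    """Map a real value back into (lower, upper)."""
    return lower + (upper - lower) / (1.0 + math.exp(-z))

# Toy reference table (purely illustrative): one parameter with a uniform
# prior on (0, 1) and two noisy summary statistics that depend on it.
random.seed(1)
params = [random.uniform(0.0, 1.0) for _ in range(10_000)]
stats = [(p + random.gauss(0, 0.05), p ** 2 + random.gauss(0, 0.05))
         for p in params]
observed = (0.30, 0.09)

accepted = reject(observed, stats, params, tol=0.01)
transformed = [logit(p, 0.0, 1.0) for p in accepted]  # scale used for regression
back = [inv_logit(z, 0.0, 1.0) for z in transformed]  # stays inside the prior bounds
```

Working on the transformed scale means that any regression adjustment, once mapped back with `inv_logit`, cannot produce parameter values outside the prior interval.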

8. Cross-validation and posterior probabilities in the UZA case

8.1. Cross-validation

We produced simulations for the five genetic and five linguistic scenarios for

the UZA case. Our simulated data were congruent with the observed data, as checked

by visual inspection of the PCA performed on the simulated and the observed summary

statistics (Figure S9).

We then performed an a priori cross-validation of the RF scenario-selection

procedure. We used the out-of-bag approach implemented in the function abcrf of the

R package abcrf to gauge the extent to which the method is able to choose the correct

scenario (Tables S7-S8). In short, this method tests each simulation in turn,

treating it as a pseudo-observed dataset in a separate RF analysis that uses

all other simulated data to build the forest.
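
As a rough illustration of the out-of-bag idea, the Python sketch below classifies each simulated dataset using only the classifiers whose bootstrap sample did not contain it. It is a toy example, not the abcrf implementation: simple threshold classifiers stand in for decision trees, and the two-scenario data, the single summary statistic and the number of classifiers are all invented for the example.

```python
import random

random.seed(0)

# Toy reference table: (scenario index, one summary statistic).
data = [(0, random.gauss(0.0, 1.0)) for _ in range(200)] + \
       [(1, random.gauss(3.0, 1.0)) for _ in range(200)]
n = len(data)

def fit_stump(sample):
    """Threshold halfway between the two class means (stand-in for a tree)."""
    m0 = [x for y, x in sample if y == 0]
    m1 = [x for y, x in sample if y == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2.0

votes = [[0, 0] for _ in range(n)]   # out-of-bag votes per simulation
for _ in range(100):
    idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
    in_bag = set(idx)
    threshold = fit_stump([data[i] for i in idx])
    for i in range(n):
        if i not in in_bag:                         # only unseen simulations vote
            votes[i][int(data[i][1] > threshold)] += 1

# Out-of-bag error: majority vote compared with the true scenario index.
scored = [(data[i][0], v.index(max(v))) for i, v in enumerate(votes) if sum(v) > 0]
oob_error = sum(truth != guess for truth, guess in scored) / len(scored)
print(f"out-of-bag error: {oob_error:.3f}")
```

Because every simulation is scored only by classifiers that never saw it, the error estimate comes from the reference table itself, without a separate set of test simulations.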

The linguistic RF performed a priori better than the genetic RF. This was not

surprising given the low genetic differentiation of the populations and the low

number of genetic markers. Moreover, when scenario A, B, C or D was the true

scenario, RF performed better than when scenario E was the true scenario, in the

linguistic case as well as in the genetic case.

The pattern of scenario choices that we observed using real data across the 72

triplets for the UZA case indicated a higher support for scenario E (Figure I.4).

The cross-validation results above show that scenario E is difficult to identify

when it is the true scenario, but also that it is rather unlikely to be

chosen when it is not the true scenario. This a priori reinforced our

confidence in eventually accepting scenario E based on our real-data analysis,

linguistically and genetically.

8.2. Posterior probabilities

We then estimated a posteriori the probability of the selected scenario

for each triplet, given the real dataset. We estimated this

posterior probability with the post.prob value returned by the function

predict.abcrf of the R package abcrf, for each triplet supporting the most probable

scenario, and we report the distributions of these posterior probabilities in

Table S11 and in Figure S17.

The mean posterior probability of scenario E in the UZA case was 0.69 (SD = 0.12 over

the 55 triplets supporting this scenario) in the linguistic case, and 0.51

(SD = 0.04 over the 36 triplets supporting this scenario) in the genetic

case. We can thus be confident that scenario E was better supported than the

four other scenarios for the UZA case and for our dataset.

9. Cross-validation and posterior probabilities in the TJY case

9.1. Cross-validation

We produced the simulations for the two genetic and two linguistic scenarios for

the TJY case. Our simulated data were congruent with the observed data, as checked by

visual inspection of the PCA performed on the simulated and the observed summary

statistics (Figure S10).

We performed an a priori cross-validation procedure for the scenario selection

similar to the UZA case described above. In general, RF performed better in the TJY

case than in the UZA case (Tables S9-S10). In particular, the linguistic RF for the TJY

case performed best, with error rates of 0.13 and 0.10 (Table S9).

9.2. Posterior probabilities

We then estimated a posteriori the probability of each of the

two scenarios for each triplet, given the real dataset

(Table S11 and Figure S17). The mean posterior probability in the linguistic case was

0.52 (SD = 0.05 over 37 triplets) for the triplets supporting scenario 1, and 0.52

(SD = 0.05 over 35 triplets) for the triplets supporting scenario 2. These values are

consistent with the impossibility of distinguishing, with our data, between scenarios with and without

borrowing, which agrees with the low a posteriori

estimate of the borrowing rate, δ̂L = 0.4%. Indeed, this result suggests that,

when considering a scenario that allows borrowing, our data would

be best explained by a very low level of borrowing,

which explains our difficulty in distinguishing a priori between the two linguistic

scenarios. The mean posterior probability in the genetic case was 0.91 (SD = 0.04

over the 72 triplets supporting this scenario) for scenario 1. We were thus confident in

rejecting scenario 2 for the genetic case, since 72 triplets out of 72

supported scenario 1 (Figure I.4).

Repository

The scripts that we developed are freely available at

https://github.com/ValentinThouzeau/PopLingSim

Figure S1 – Models of linguistic evolution. (a) The ancestral linguistic variety splits into two varieties. One of these varieties splits again, with possible subsequent continuous borrowing. (b) The ancestral variety splits into two varieties. A third variety is subsequently generated by an admixture event between the two source varieties.

Figure S2 – Models of genetic evolution. (a) The ancestral population splits into two populations. One of these populations splits again, with possible subsequent migration. (b) The ancestral population splits into two populations. A third population is subsequently generated by an admixture event between the two source populations.

Figure S3 – Two competing scenarios of linguistic and genetic origin of the Yagnob-speaking population (TJY). These scenarios have been tested independently for linguistic history and genetic history. Abbreviations: Tc = Turkic-speaking population. TJY = TJY population from the Yagnob valley in Tajikistan. I-I = Indo-Iranian-speaking population.

Figure S4 – Pairwise FST matrix (a) and linguistic distance matrix (b). See the gray-scale to the right of each plot (different scale for each plot).

Figure S5 – Neighbour-joining trees based on the pairwise (δμ)² matrix, with 11 Turkic-speaking populations (in Yellow/Light Grey) and 10 Indo-Iranian-speaking populations (in Blue/Dark Grey). The values at each node correspond to the number of bootstrap trees containing this node among 1000 permutations. The red arrows indicate the UZA and the TJY populations, specifically investigated in this paper using Approximate Bayesian Computations.

Figure S6 – Principal Component Analysis of the Manhattan distances computed using the 185 cognates from the real linguistic dataset. The red arrows indicate the UZA and the TJY populations.

Figure S7 – Principal Component Analysis of the pairwise FST distance matrix computed using the 26 microsatellites from the real genetic dataset. The red arrows indicate the UZA and the TJY populations.

C : Mean number of cognates per meaning: $C = \frac{1+3+2+2+2+3}{6} = 2.17$

V(C) : Variance of the number of cognates per meaning: $V(C) = (2.17-1)^2 + (2.17-3)^2 + (2.17-2)^2 + (2.17-2)^2 + (2.17-2)^2 + (2.17-3)^2 = 2.83$

R(C) : Range of the number of cognates per meaning: $R(C) = \max(\text{number of cognates}) - \min(\text{number of cognates}) = 3 - 1 = 2$

Cs : Number of meanings with only one cognate throughout the three linguistic varieties: $C_s = 1+0+0+0+0+0 = 1$

Cd : Number of meanings with one different cognate for each linguistic variety: $C_d = 0+1+0+0+0+1 = 2$

D1-2 : Number of pairwise differences between linguistic varieties n°1 and n°2: $D_{1\text{-}2} = 0+1+0+1+1+1 = 4$

Figure S8 – An example of computation of the linguistic summary statistics, over three linguistic varieties and six meanings.
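
The worked example of Figure S8 can be reproduced with a few lines of Python. The cognate table below is hypothetical: it was invented so that the per-meaning counts match those of the figure, with each variety assumed to carry exactly one cognate per meaning. Variable names are illustrative and do not come from the authors' scripts. Note that the printed value 2.83 matches the unnormalised sum of squared deviations, which is what the sketch computes; whether the authors' script applies a further normalisation is not stated here.

```python
from itertools import combinations

# Hypothetical cognate classes: one row per meaning, one column per variety.
# Invented so that the counts match the worked example of Figure S8.
cognates = [
    ("a", "a", "a"),
    ("a", "b", "c"),
    ("a", "a", "b"),
    ("a", "b", "a"),
    ("a", "b", "b"),
    ("a", "b", "c"),
]
n_varieties = len(cognates[0])

counts = [len(set(row)) for row in cognates]        # cognates per meaning
mean_c = sum(counts) / len(counts)                  # C
ss_c = sum((c - mean_c) ** 2 for c in counts)       # V(C) as printed (sum of squares)
range_c = max(counts) - min(counts)                 # R(C)
c_s = sum(c == 1 for c in counts)                   # Cs: a single shared cognate
c_d = sum(c == n_varieties for c in counts)         # Cd: all varieties differ

def pairwise_differences(i, j):
    """Number of meanings for which varieties i and j carry different cognates."""
    return sum(row[i] != row[j] for row in cognates)

# Pairwise differences, keyed by 1-based variety numbers as in the figure.
d = {(i + 1, j + 1): pairwise_differences(i, j)
     for i, j in combinations(range(n_varieties), 2)}

print(round(mean_c, 2), round(ss_c, 2), range_c, c_s, c_d, d[(1, 2)])
```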

Figure S9 – PCA performed in the UZA case over one observed summary statistics set (yellow dot) and 5000 summary statistics sets of the associated simulated datasets. (a) Linguistic PCA over the 1st and the 2nd axis (b) Linguistic PCA over the 1st and the 3rd axis (c) Genetic PCA over the 1st and the 2nd axis (d) Genetic PCA over the 1st and the 3rd axis.

Figure S10 – PCA performed in the TJY case over one observed summary statistics set (yellow dot) and 5000 summary statistics sets of the associated simulated datasets. (a) Linguistic PCA over the 1st and the 2nd axis (b) Linguistic PCA over the 1st and the 3rd axis (c) Genetic PCA over the 1st and the 2nd axis (d) Genetic PCA over the 1st and the 3rd axis.

Figure S11 – Analysis of the TJY population history. (a) Decisions over the analysis of 72 triplets for the selection of the linguistic scenarios. (b) Priors (dotted line) and posteriors (solid line) of the parameter t1/t0 estimated from the linguistic simulations of scenario 1. (c) Priors (dotted line) and posteriors (solid line) of the parameters t1/t0 and δL estimated from the linguistic simulations of scenario 2. (d) Decisions over the analysis of 72 triplets for the selection of the genetic scenarios. (e) Priors (dotted line) and posteriors (solid line) of the parameter t1/t0 estimated from the genetic simulations of scenario 1.

Figure S12 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the UZA population (admixture model).

Figure S13 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the UZA population (admixture model).

Figure S14 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (isolation model).

Figure S15 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the linguistic origin of the TJY population (non-isolation model).

Figure S16 – Pooling of the parameter priors (in black) and posteriors (in blue) of the triplets tested for the genetic origin of the TJY population (isolation model).

Figure S17 – Density of the distribution of the posterior probabilities for (a) Scenario E in the linguistic UZA case, (b) Scenario E in the genetic UZA case, (c) Scenario 1 in the linguistic TJY case, (d) Scenario 1 in the genetic TJY case, (e) Scenario 2 in the linguistic TJY case. Parameters of the distributions are detailed in Table S11.

Population               Code   Sample size   Country       Linguistic family
Kazaks (Gazli)           LKZ    25            Uzbekistan    Turkic
Kazaks (Raushan)         KAZ    49            Uzbekistan    Turkic
Kyrgyz (Ordaj)           KRA    47            Kyrgyzstan    Turkic
Kyrgyz (Akmuz)           KRB    24            Kyrgyzstan    Turkic
Kyrgyz (Kulanak)         KRL    22            Kyrgyzstan    Turkic
Kyrgyz (Barskoon)        KRT    37            Uzbekistan    Turkic
Karakalpaks (Halqabad)   OTU    45            Uzbekistan    Turkic
Karakalpaks (Shege)      KKK    45            Uzbekistan    Turkic
Uzbeks (SojMahalla)      UZA    25            Uzbekistan    Turkic
Uzbeks (Hitoj)           UZB    35            Uzbekistan    Turkic
Uzbeks (Urtoqqishloq)    UZT    25            Tajikistan    Turkic
Tajiks (Shink)           TDS    25            Uzbekistan    Indo-Iranian
Tajiks (Urmetan)         TDU    25            Uzbekistan    Indo-Iranian
Tajiks (Agakalik)        TJA    31            Uzbekistan    Indo-Iranian
Tajiks (Nimich)          TJE    25            Tajikistan    Indo-Iranian
Tajiks (Kaptarhona)      TJK    26            Tajikistan    Indo-Iranian
Tajiks (Navdi)           TJN    24            Tajikistan    Indo-Iranian
Tajiks (Rishtan)         TJR    29            Uzbekistan    Indo-Iranian
Tajiks (Nushnor)         TJT    25            Tajikistan    Indo-Iranian
Tajiks (Kamangaron)      TJU    29            Tajikistan    Indo-Iranian
Yagnobs (Dushanbe)       TJY    25            Tajikistan    Indo-Iranian

Table S1 – Information table for the 21 studied Central Asian populations.

Parameter Mean Mode Median Quantile 2.5% Quantile 97.5%

N0 30979 16862 23978 6399 87812

N1 61531 82173 63848 13608 98179

N2 42764 28382 38443 8124 95255

N4 49291 37749 46460 12803 94394

μG 2.38×10⁻⁴ 1.58×10⁻⁴ 1.97×10⁻⁴ 9.71×10⁻⁵ 6.14×10⁻⁴

t0 623 699 634 190 981

t1 256 133 214 20 710

rG 0.499 0.479 0.498 0.05 0.958

4×N0×μG 25.4 14.2 18.8 5.23 84.6

4×N1×μG 60.8 38.3 49 9.06 184

4×N2×μG 36 20.9 28.5 7.26 113

4×N4×μG 37.5 34.5 36.3 19.3 62.7

t0×μG 0.136 0.102 0.118 0.037 0.34

t1×μG 0.0566 0.0259 0.0421 0.00439 0.193

t1/t0 0.439 0.279 0.411 0.0374 0.948

t1×rG 121 39.5 84.7 4.79 450

Table S2 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E).

Parameter Mean Mode Median Quantile 2.5% Quantile 97.5%

μL 0.00501 0.00388 0.00466 0.00211 0.00946

t0 655 807 667 257 980

t1 22.7 22.5 22.2 2.31 48.7

rL 0.093 0.090 0.090 0.019 0.184

t0×μL 2.98 2.49 2.78 1.40 5.65

t1×μL 0.101 0.115 0.107 0.00946 0.180

t1/t0 0.0366 0.0379 0.0364 0.00289 0.0783

t1×rL 2.08 0.886 1.78 0.0616 5.98

Table S3 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of an admixed origin of the UZA population (scenario E).

Parameter Mean Mode Median Quantile 2.5% Quantile 97.5%

N0 31782 20108 26302 7585 86447

N1 20393 10279 14958 2533 75516

N2 55987 50561 55543 12909 97031

N3 47733 30765 45011 6468 96594

N4 52161 41207 50090 14248 96019

μG 2.33×10⁻⁴ 1.60×10⁻⁴ 1.99×10⁻⁴ 9.61×10⁻⁵ 5.65×10⁻⁴

t0 743 891 778 323 987

t1 440 389 427 71.2 856

4×N0×μG 25.2 16.5 20.6 6.99 70.4

4×N1×μG 16 8.70 11.8 2.63 55.7

4×N2×μG 51.7 29.4 42.4 8.64 150

4×N3×μG 44.2 20.5 32.1 4.66 156

4×N4×μG 40.8 37.6 39.5 20.2 68.3

t0×μG 0.168 0.125 0.148 0.054 0.397

t1×μG 0.097 0.063 0.081 0.014 0.276

t1/t0 0.601 0.772 0.625 0.112 0.971

Table S4 – Summary of the posterior distributions of the genetic parameters, assuming a scenario of isolation of the TJY population (scenario 1).

Parameter Mean Mode Median Quantile 2.5% Quantile 97.5%

μL 0.0058 0.0039 0.0056 0.0019 0.0107

t0 627 860 668 85.7 990

t1 80.6 51.8 68.8 21 211

t0×μL 3.78 2.41 2.95 1.65 9.92

t1×μL 0.402 0.364 0.390 0.272 0.586

t1/t0 0.129 0.123 0.123 0.0178 0.300

Table S5 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of isolation of the TJY population (scenario 1).

Parameter Mean Mode Median Quantile 2.5% Quantile 97.5%

μL 0.00730 0.00838 0.00758 0.00299 0.0107

t0 759 909 806 319 992

t1 319 130 263 7 873

δL 0.00695 0.00465 0.00574 0.000963 0.0197

t0×μL 5.55 3.82 5.70 1.95 9.72

t1×μL 2.54 0.939 1.96 0.102 7.67

t1/t0 0.408 0.149 0.355 0.00261 0.97

t1×δL 3.02 0.731 1.84 0.00958 13.9

Table S6 – Summary of the posterior distributions of the linguistic parameters, assuming a scenario of non-isolation of the TJY population (scenario 2).

                 Estimated scenario
True scenario    A             B             C             D             E             Error
A                6735 [795]    14 [2]        657 [77]      1148 [134]    1447 [171]    0.33
B                41 [5]        8500 [1003]   782 [92]      259 [30]      418 [49]      0.15
C                644 [76]      1161 [138]    6771 [797]    13 [2]        1411 [166]    0.32
D                739 [89]      272 [33]      35 [4]        8527 [1002]   427 [50]      0.15
E                1939 [229]    403 [48]      1912 [225]    406 [47]      5340 [630]    0.47

Table S7 – Linguistic cross-validation of the UZA case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.
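
The Error column of Table S7 can be recomputed from the mean counts in each row, and the same matrix gives the share of simulations from the other scenarios that are wrongly assigned to scenario E. The Python sketch below does both; the function names `error_rate` and `false_selection_rate` are illustrative, and the counts are copied from the table (standard deviations omitted).

```python
# Mean confusion matrix of Table S7 (rows: true scenario, columns: estimated).
labels = ["A", "B", "C", "D", "E"]
confusion = [
    [6735, 14, 657, 1148, 1447],
    [41, 8500, 782, 259, 418],
    [644, 1161, 6771, 13, 1411],
    [739, 272, 35, 8527, 427],
    [1939, 403, 1912, 406, 5340],
]

def error_rate(matrix, k):
    """Share of simulations from scenario k assigned to another scenario."""
    return 1.0 - matrix[k][k] / sum(matrix[k])

def false_selection_rate(matrix, k):
    """Share of simulations from the other scenarios that are assigned to k."""
    others = [row for i, row in enumerate(matrix) if i != k]
    return sum(row[k] for row in others) / sum(sum(row) for row in others)

errors = {lab: round(error_rate(confusion, i), 2) for i, lab in enumerate(labels)}
fsr_e = false_selection_rate(confusion, labels.index("E"))
print(errors)
print(f"false selection of E: {fsr_e:.3f}")
```

Comparing the two quantities makes explicit the asymmetry discussed in section 8.1: scenario E has the highest error rate when it is the true scenario, while simulations from the other scenarios are assigned to E much less often.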

                 Estimated scenario
True scenario    A             B             C             D             E             Error
A                6341 [749]    336 [39]      1137 [134]    264 [30]      1923 [227]    0.37
B                267 [30]      6037 [717]    834 [94]      2471 [294]    391 [43]      0.40
C                1195 [138]    1591 [186]    4492 [535]    899 [102]     1823 [218]    0.55
D                271 [31]      4071 [478]    524 [61]      4807 [569]    327 [39]      0.52
E                2989 [345]    986 [113]     2238 [268]    606 [65]      3181 [387]    0.68

Table S8 – Genetic cross-validation of the UZA case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.

                 Estimated scenario
True scenario    1              2              Error
1                8770 [1034]    1230 [144]     0.13
2                994 [118]      9006 [1061]    0.10

Table S9 – Linguistic cross-validation of the TJY case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.

                 Estimated scenario
True scenario    1              2              Error
1                7313 [868]     2687 [311]     0.27
2                1271 [144]     8729 [1035]    0.13

Table S10 – Genetic cross-validation of the TJY case. Mean of the confusion matrices over the 72 triplets using 10000 pseudo-observed data per triplet, from the out-of-bag analysis. Standard deviation of the confusion matrices is indicated between square brackets.

Min. 1st Qu. Median Mean 3rd Qu. Max.

UZA Linguistic, Model E 0.3716 0.6696 0.7267 0.6949 0.7771 0.8344

UZA Genetic, Model E 0.4099 0.4916 0.5084 0.5096 0.5340 0.5688

TJY Linguistic, Model 1 0.4037 0.4941 0.5148 0.5188 0.5524 0.6273

TJY Linguistic, Model 2 0.3980 0.4827 0.5146 0.5167 0.5467 0.6168

TJY Genetic, Model 1 0.7729 0.8906 0.9094 0.9059 0.9255 0.9617

Table S11 – Quantiles of the distributions of the posterior probabilities computed over the triplets supporting the most probable scenario. These distributions are shown in Figure S17.
