phoneme similarity aline evaluation algorithms for ...gjaeger/lehre/ws1213/biohistlingpublic/... ·...

48
Armin Buch Introduction Alignment Phoneme similarity ALINE Evaluation Identification of cognates COGIT Evaluation Identification of sound correspondences CORDI Evaluation Outlook Algorithms for Language Reconstruction Kondrak’s 2002 thesis Armin W. Buch February 8, 2013

Upload: hathuan

Post on 07-Sep-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Algorithms for Language ReconstructionKondrak’s 2002 thesis

Armin W. Buch

February 8, 2013

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

How to reconstruct a proto-language?

I Identification of cognatesI Alignment of cognatesI Discovery of sound correspondencesI Reconstruction of proto-formsI Kondrak contributes unsupervised algorithms for the first

three tasks

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Alignment

I Alignment is usually calculated with a dynamicprogramming algorithm (Wagner-Fischer)

I It needs a distance metric

I Kondrak adapts extensions to the algorithm to phoneticdata

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Similarity vs. distance

I To a large extent, similarity measures and distance metricscan be exchanged

I The metric properties do not always make sense forphoneme distance (we will see examples)

I Linguistic intuitions are sometimes easier to express assimilarities

I The alignment algorithm is easily adapted to similaritiesI Assign similarity scores instead of costsI Choose the maximum, not the minimum

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Local alignment

I Let the usual alignment be called globalI Local alignment strips off prefixes and suffixesI by having no indel costs at the beginning and at the end of

wordsI instead, it maximizes the similarity of similar substrings

(possibly the root)

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Half-local alignment

I Words tend to change a lot at their right edge, while theleft edge is quite stable

I Half-local alignment aligns globally on the left, andlocally on the right

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Gap penalties

I Gaps can be longer than just one segmentI e.g. by loss of an entire syllableI In order to weigh this less than a series of deletions, gap

costs can be calculated with a linear functionI initial gap cost + segment cost * number of deleted

segments

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Compression and expansion

I Many-to-one and one-to-many relations can be modeledas substitution plus deletion/insertion

I but this is not linguistically adequateI and its cost/similarity would be judged differentlyI As an example, consider En. ‘fact’ vs. Sp. ‘hecho’

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Transposition

I In phonology, transposition is rareI Span. cocodriloI The most common instance is metathesis of adjacent

segmentsI Metathesis is highly irregularI For practical purposes, it will be ignored here

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Phoneme similarity

I The easiest measure of phoneme distance is identity

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Covington’s measureI Covington (1996) defines a phonetic distance measureI gap penalty equals 10 base costs + 40 per segment

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Phoneme similarity 2

I Covington’s measure has a low resolution

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Covington’s measure 2

I it is not a metricI zero property violated with a:iI Preference for matching identical C over matching id. V

cannot be expressed in a metricI triangle inequality violated with a:i:yI cf. labio-velars (double marked, close to both); also cf.

j/dS

I “just a stand-in for a more sophisticated, perhapsfeature-based, system”

I Kondrak reports a good correlation between thesetrial-and-error costs and feature based Hamming distance,when the latter is an average over all sounds in thecategory

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Phoneme similarity 3

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Problems with binary features

I Binary features are interpreted within a languageI they do not always reflect confusability / possible

historical change:I /j/ → /dS/ is likely, but the two are very dissimilar

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Multi-valued features

I e.g. with values within [0,1]I possibly also weighted features (place > manner of

articulation)I efforts at the time (Nerbonne & Heringa 1997) found

worse alignments with better weightingsI still, beneficial weightings might be derived automaticallyI possibly today with more hand-annotated cognate data

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Phoneme similarity 4

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Kondrak’s ALINE algorithm

I similarities, not distancesI best alignments within a threshold εI local alignments; this replaces gap functionsI indels, substitution, expansion, compressionI transpositions are rare and too irregularI multivalued features

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Annotations to ALINE

I diff(p,q,f) returns the difference between p and q forfeature f

I Vowel features: syllabic, nasal, retroflex, high, back,round, long

I Consonant features: syllabic, manner, voice, nasal,retroflex, lateral, aspirated, place & double (= secondaryplace)

I Double leads to violation of triangle inequality, becausethe closest is taken

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Evaluation

I 82 words (from Covington 1996), manually coded forcognacy

I Spanish–French, English–German, English–Latin,Fox–Menomini, and some solitary examples

I This was the best data availableI And still it may contain errors, and it has too many too

easy pairsI Furthermore, it’s used for development and for evaluationI ALINE outperforms Covington’s method, but still has

errors

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Results

I ALINE achieves 95% accuracy compared to Kondrak’smanual alignments

I it outperforms earlier approachesI ‘tooth’ cannot be correctly aligned without referring to

regular sound changes

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Not everything is a cognate

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Cognate: a working definition

I For the present purposes, everything with similar meaningand form is a cognate

I Useful for unsupervised methods, including Greenberg’smass lexical comparison

I Better, and still to be established: automatically findingsound correspondences, and defining cognates accordingly

I Best data available on a large scale: transcribed word listswith glosses

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Example word list 1

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Example word list 2

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Kondrak’s program COGIT

I An algorithm to identify cognatesI It needs to evaluate phonetic similarity (via ALINE) and

semantic similarityI Phonetic similarity is normalized by dividing by the

self-similarity of the more self-similar word1

I Semantic similarity via WordNetI Identity of glosses is in general not enough

1As I understand it

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Problems in establishing semantic similarity

I Spelling errors / variantsI InflectionI Modifiers: determiners, adjectives, compounds,

complements, adjunctsI synonymy (‘tomb’, ‘grave’)I Semantic changes (‘fowl’, ‘turkey’; ‘broth’, ‘grease’)

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Addressing these problems

I Spelling correction (even if manually)I Removal of stop words (‘a kind of’, . . . )I Extraction of keywords (syntactic heads heuristically

found after POS-tagging)I LemmatizationI Employing WordNet

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

WordNet relations

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Semantic shift

I generalization & specialization (‘deer’, ‘Tier’)I melioration (Ancient Greek ‘guna’ “woman”, ‘queen’)I pejoration (‘Frau’; ‘Weib’)I metaphor (‘star’)I metonymy (attribute for whole): ‘crown’I synechdoche (pars pro toto)

⇒ some of them happen along WordNet’s semantic relations

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Weighing semantic similarity

I WordNet paths longer than 1 are considered useless

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Example calculation

I COGIT’s similarity score is a weighted sum of thephonetic and semantic similarity

I The weight is empirically set to 80% phonology, 20%semantics

I if it exceeds a threshold, record the pair as a cognatecandidate

I Example: Cree wahkwa ‘a lump of roe’, Ojibwa wakk‘fish eggs’

I remove determinerI identify keywords (lump, roe; fish, eggs)I lemmatize (egg)I hypernymy (roe IS-A egg) beats meronymy (roe PART-OF

fish): 0.25I phonetic score 0.4167I overall score 0.3834

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Evaluation

I evaluated on a set of dictionaries of North Americanlanguages, with its own inconsistencies

I weighting experimentally set to 80–20, so semantics isn’ta strong indicator

I no threshold set: it is a trade-off between recall andprecision

I precision levels reported as an average over 0%, 10%,. . . 100% recall thresholds

I better than older methods

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

The role of semantics

I gloss identity holds for 62.7% of all cognates (no specialmethod needed at all)

I keyword identity holds for 12%I others insignificantI 19.3% are not connected via their glosses at all (by this

method)I No word sense disambiguation in the process → false

positives via WordNetI imperfect keyword extractionI missing entries in WordNet

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Identity vs. correspondence

I English ‘have’ is not cognate with Latin ‘habere’, but with‘capire’

I by regular sound changes (Grimm’s Law, . . . )I Is automatic identification of correspondences possible?I Is it possible on data un-annotated for actual cognacy?I That is, are correspondences stable enough to be visible

under noise?

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Phoneme vs. word alignment

I Segment alignment is well-known from syntaxI Kondrak relies on Melamed’s (2000) algorithmI first, initialize correspondence likelihoods using

co-occurrence counts (G2 statistics, which I will not try toexplain here)

I greedily link words 1-to-1, highest scores firstI re-estimate likelihoods and repeat (serves to prune

accidental or indirect co-occurrences)I extended for contiguous sequences being treated as one

segment (many-to-one, one-to-many, many-to-many)

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Kondrak’s CORDI algorithm

I no crossing links expected, so the greedy aligner isreplaced with a variant of the standard aligner

I half-local (don’t consider word endings)I threshold on links: Don’t match everything even if you

couldI negative weight on indelsI positive weight on each link

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Evaluation 1I 112 English-Latin cognate pairsI Now, tooth:dent can be aligned correctlyI y:w is claimed to result from the diphtong [ay]

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

NoiseI pure cognate data is hard to getI 200 words (English/Latin), out of which only 29% are

cognatesI highly robust

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Armin Buch

Introduction

AlignmentPhoneme similarity

ALINE

Evaluation

Identification ofcognatesCOGIT

Evaluation

Identification ofsoundcorrespondencesCORDI

Evaluation

Outlook

Outlook

I A phoneme-by-phoneme correspondence likelihood tablederived from actual (cognate) data wasn’t available at thetime

I Automatic reconstruction of proto-forms is still a hot topic