Embracing Language Diversity

Post on 23-Nov-2021


More than 4,000 live languages

Most are resource-poor

Key Questions

Can we improve monolingual performance by exploiting multilingual connections?

Multilingual Learning

Linguistic Motivation:

Languages related structurally and genetically

But differ systematically in patterns of expression and ambiguity

Goal:

• Induce individual language structures

• Induce cross-lingual connections

Motivation for Multilingual Learning

• Learn from differences in lexical ambiguity

fish/poissons [N] vs. fish/pêcher [V]

• Learn from differences in structural ambiguity

(1) determiner “les” signals noun
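The fish/poissons vs. fish/pêcher example can be made concrete with a small sketch. It only illustrates the intuition that an aligned translation narrows down the part of speech. The tag lexicons `EN_TAGS` and `FR_TAGS` and the function `disambiguate` are hypothetical names introduced here; the approach in the slides is unsupervised and does not assume such hand-written lexicons.

```python
# Hypothetical toy lexicons: the POS tags each word may take in its own language.
EN_TAGS = {"fish": {"N", "V"}}
FR_TAGS = {"poissons": {"N"}, "pêcher": {"V"}}


def disambiguate(en_word, fr_word):
    """Keep only the English tags that the aligned French word also allows."""
    shared = EN_TAGS[en_word] & FR_TAGS[fr_word]
    # Fall back to the full English tag set if the two languages share no tag.
    return shared or EN_TAGS[en_word]


print(disambiguate("fish", "poissons"))  # {'N'}
print(disambiguate("fish", "pêcher"))    # {'V'}
```

English "fish" is ambiguous on its own, but each French translation carries a single tag, so the aligned pair is unambiguous.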

Multilingual Learning for POS Tagging

Input: Untagged bilingual parallel corpus

Goal: Induce a POS tagger for each language (test on monolingual data)

Two Monolingual HMMs

Align Words

Merge Nodes of Aligned Words
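The three steps above (two monolingual chains, word alignment, merging of aligned nodes) can be sketched as a small graph-construction routine. This is a minimal sketch that assumes one-to-one alignments given as index pairs; the function name `merge_aligned_nodes` and the node encoding are hypothetical, and the probabilistic model itself is not shown.

```python
def merge_aligned_nodes(src_len, tgt_len, alignments):
    """Return the nodes of the merged bilingual graph.

    alignments: one-to-one (i, j) pairs linking source word i to target word j.
    Each aligned pair becomes a single coupled node; every unaligned word
    keeps its own monolingual node.
    """
    aligned_src = {i for i, _ in alignments}
    aligned_tgt = {j for _, j in alignments}
    nodes = [("pair", i, j) for i, j in sorted(alignments)]
    nodes += [("src", i) for i in range(src_len) if i not in aligned_src]
    nodes += [("tgt", j) for j in range(tgt_len) if j not in aligned_tgt]
    return nodes


# Illustrative sentence pair (an assumption, not from the slides):
# "I love fish" / "J' adore les poissons", with "les" left unaligned.
nodes = merge_aligned_nodes(3, 4, [(0, 0), (1, 1), (2, 3)])
print(nodes)
```

In this toy case the unaligned determiner "les" stays a separate node, while the three aligned word pairs each share one node whose tags are chosen jointly.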

Performance

Bilingual Tagging Performance: Serbian

[Figure: bar chart of part-of-speech tagging accuracy for Serbian, axis from 65 to 81; bars: Mono, HU, RO, SL, CS, BG, ET, EN]

The More The Merrier!

Beyond Multilingual Tagging

Morphology, Parsing, POS Tagging

NN AC CC DTNN

Proposed Research

• Learn from non-parallel corpora

Benefit from the world’s wealth of language resources

• Move towards language-neutral semantic representation

he, הוא, وہ: num singular, person 3rd, animacy yes

smells, מריח, سونگھتا ہے: num singular, transitive yes, time present

flowers, פרחים, پھول: num plural, animacy no

Using Linguistic Universals for Structure Analysis

Constrain unsupervised grammar induction using language-independent syntactic rules

Root → Auxiliary
Root → Verb
Verb → Noun
Verb → Pronoun
Verb → Adverb
Verb → Verb
Auxiliary → Verb
Noun → Adjective
Noun → Article
Noun → Noun
Noun → Numeral
Preposition → Noun
Adjective → Adverb

(Naseem et al., EMNLP 2010)
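To show how such rules can be checked against a candidate dependency parse, here is a minimal sketch. It reads the rule list as head → dependent pairs over coarse part-of-speech categories, which is an interpretation of the flattened slide table. The names `UNIVERSAL_RULES` and `count_rule_edges`, and the two example parses, are hypothetical.

```python
# Rule list from the slide, read as (head category, dependent category) pairs.
UNIVERSAL_RULES = {
    ("Root", "Auxiliary"), ("Root", "Verb"),
    ("Verb", "Noun"), ("Verb", "Pronoun"), ("Verb", "Adverb"), ("Verb", "Verb"),
    ("Auxiliary", "Verb"),
    ("Noun", "Adjective"), ("Noun", "Article"), ("Noun", "Noun"), ("Noun", "Numeral"),
    ("Preposition", "Noun"), ("Adjective", "Adverb"),
}


def count_rule_edges(edges):
    """Count the dependency edges of one parse that match a universal rule."""
    return sum(1 for edge in edges if edge in UNIVERSAL_RULES)


# Two hypothetical parses of "Kids eat apples", as (head, dependent) categories.
good_parse = [("Root", "Verb"), ("Verb", "Noun"), ("Verb", "Noun")]
odd_parse = [("Root", "Noun"), ("Noun", "Verb"), ("Verb", "Noun")]

print(count_rule_edges(good_parse))  # 3
print(count_rule_edges(odd_parse))   # 1
```

The parse headed by the verb matches a rule on every edge, while the parse headed by the first noun matches only one.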

Using Universals for Structure Analysis

[Figure: bar chart, axis from 20 to 80, for English, Danish, Slovene, Spanish, Swedish, Portuguese; series: No rules vs. Universal Rules]

Adding the Universal Rules

[Figure: model posterior over the parses of the data, e.g. "◊ Kids eat apples."; x-axis: parses of data, y-axis: posterior probability]

Count(edges ∈ rules) for each parse: … 1 … … … 3 … …

Each count is multiplied by the posterior probability of its parse (0.005 for the parse with count 1, 0.01 for the parse with count 3):

E[edges ∈ rules] = (… + 0.005 + … + … + 0.03 + …) = 2.79

Constraint: E[edges ∈ rules] ≥ 0.8 × total edges, where 0.8 is a pre-specified threshold
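The constraint E[edges ∈ rules] ≥ 0.8 × total edges can be illustrated with a short numeric sketch. It assumes the posterior is available as an explicit list of (probability, rule-edge count) pairs, which is only feasible for toy inputs. The function names and the toy numbers are hypothetical; only the 0.8 threshold comes from the slides.

```python
def expected_rule_edges(parses):
    """parses: (posterior probability, number of edges matching a rule) pairs."""
    return sum(prob * count for prob, count in parses)


def satisfies_constraint(parses, total_edges, threshold=0.8):
    """Check that E[edges in rules] >= threshold * total edges."""
    return expected_rule_edges(parses) >= threshold * total_edges


# Toy posterior over three parses of a sentence with 3 dependency edges.
peaked = [(0.7, 3), (0.2, 2), (0.1, 1)]   # expectation is about 2.6
spread = [(0.5, 1), (0.5, 2)]             # expectation is 1.5

print(satisfies_constraint(peaked, total_edges=3))  # True  (2.6 >= 2.4)
print(satisfies_constraint(spread, total_edges=3))  # False (1.5 <  2.4)
```

A posterior that puts most of its mass on rule-consistent parses meets the threshold; one that favors parses with few rule edges does not.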

The Gap Remains

Unsupervised, Headden III et al. (2009): 68.8

Universal rules, Naseem et al. (2010): 71.9

Supervised, McDonald et al. (2006): 91.5

Leverage Language Diversity in Language Analysis

• Typological Analysis: compare languages based on structural patterns (also known as typological parameters)

• Parameters encode dimensions of language variance

Subject Verb Object Positioning

Number of Genders

Definite Article

The World Atlas of Language Structures Online: 2,650 Languages, 142 Features

Feature: English / Russian / Hebrew

Exponence of Selected Inflectional Formatives: No case / Case + number / No case

Definite Articles: Definite word distinct from demonstrative / No definite or indefinite article / Definite affix

Systems of Gender Assignment: Semantic / Semantic and formal / Semantic and formal

Order of Adjective and Noun: Adjective-Noun / Adjective-Noun / Noun-Adjective

Hand and Arm: Different / Identical / Different
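One simple way to compare languages on typological parameters is to count the feature values they share. The sketch below uses only the five features and three languages listed above; the short feature keys, the dictionary `TYPOLOGY`, and the function `shared_features` are hypothetical, and counting matches is just one possible similarity measure, not necessarily the one used in this work.

```python
# Feature values for the three languages, copied from the slide table.
TYPOLOGY = {
    "English": {
        "exponence": "No case",
        "definite_articles": "Definite word distinct from demonstrative",
        "gender_assignment": "Semantic",
        "adjective_noun_order": "Adjective-Noun",
        "hand_and_arm": "Different",
    },
    "Russian": {
        "exponence": "Case + number",
        "definite_articles": "No definite or indefinite article",
        "gender_assignment": "Semantic and formal",
        "adjective_noun_order": "Adjective-Noun",
        "hand_and_arm": "Identical",
    },
    "Hebrew": {
        "exponence": "No case",
        "definite_articles": "Definite affix",
        "gender_assignment": "Semantic and formal",
        "adjective_noun_order": "Noun-Adjective",
        "hand_and_arm": "Different",
    },
}


def shared_features(lang_a, lang_b):
    """Count the features on which two languages have the same value."""
    a, b = TYPOLOGY[lang_a], TYPOLOGY[lang_b]
    return sum(1 for feature in a if a[feature] == b[feature])


print(shared_features("English", "Russian"))  # 1
print(shared_features("English", "Hebrew"))   # 2
print(shared_features("Russian", "Hebrew"))   # 1
```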

From Typological Tables to Rule Distributions

[Figure: bar charts of P(. | Verb) for English and Portuguese, axis from 0 to 0.4]

[Figure: bar charts of P(. | Noun) for English and Portuguese, axis from 0 to 0.5]

Proposed Approach: Bilingual Scenario

Low Density Language (Unsupervised): p(. | NP)

Resource Rich Language (Supervised): p(. | NP)

Typology Reference

Model for Low Density Language: KL divergence between the p(. | NP) of the low density language and the p(. | NP) of the resource rich language
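The KL divergence between two p(. | NP) distributions can be computed as in the sketch below. The direction of the divergence, the smoothing constant `eps`, and the toy expansion probabilities are all assumptions made for illustration; the slides do not specify them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as dicts.

    Outcomes missing from q are floored at eps to avoid division by zero.
    """
    total = 0.0
    for outcome, p_x in p.items():
        if p_x > 0.0:
            q_x = max(q.get(outcome, 0.0), eps)
            total += p_x * math.log(p_x / q_x)
    return total


# Hypothetical NP expansion distributions for two languages.
low_density = {"DT NN": 0.5, "NN": 0.3, "NN JJ": 0.2}
resource_rich = {"DT NN": 0.6, "NN": 0.3, "NN JJ": 0.1}

print(kl_divergence(low_density, resource_rich))
print(kl_divergence(low_density, low_density))  # 0.0
```

The divergence is zero when the two distributions agree and grows as the low density language's distribution drifts away from the reference.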

Proposed Approach: Multilingual Scenario

Arabic: Low Density Language (Unsupervised), p(. | NP)

English

Chinese

Typology Reference

Model for Low Density Language

Semantic Analysis for Low-density Languages

Goal: Construct language-neutral abstract representation

Semantic Ambiguity

He smells flowers

smells (x1,x2): pos verb, num singular, transitive yes, time present

smells (x1): pos verb, num singular, transitive no, time present

smells: pos noun, num plural, count yes

smells/سونگھتا ہے flowers/پھول he/وہ

سونگھتا ہے بدبو آتی ہے بدبوئیں

ריחסמ מריח תרחו

smells/מריח flowers/פרחים he/הוא

Construct a Language Neutral Semantic Representation

• Align trees of multi-parallel corpus

• Extract minimal set of frequently occurring fragments

Model with Dirichlet processes (adaptor grammar induction)

• Learn semantic parsing in a monolingual setting

• Project representation into low density language via bilingual corpus

He smells flowers / הוא מריח פרחים / وہ پھول سونگھتا ہے

he, הוא, وہ: num singular, person 3rd, animacy yes

smells, מריח, سونگھتا ہے: num singular, transitive yes, time present

flowers, פרחים, پھول: num plural, animacy no
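The shared feature bundles above suggest a simple data structure: one language-neutral entry per concept, indexed by its surface forms in each language. The sketch below is only an illustration of that idea with a subset of the features from the slides; the names `CONCEPTS`, `INDEX`, and `lookup` are hypothetical, and a learned representation would be induced from aligned trees rather than written by hand.

```python
# Each concept pairs a language-neutral feature bundle with its surface forms.
CONCEPTS = [
    {"features": {"num": "singular", "animacy": "yes"},
     "surface": {"English": "he", "Hebrew": "הוא", "Urdu": "وہ"}},
    {"features": {"num": "singular", "transitive": "yes", "time": "present"},
     "surface": {"English": "smells", "Hebrew": "מריח", "Urdu": "سونگھتا ہے"}},
    {"features": {"num": "plural", "animacy": "no"},
     "surface": {"English": "flowers", "Hebrew": "פרחים", "Urdu": "پھول"}},
]

# Index every (language, word) pair to the concept it realizes.
INDEX = {
    (language, word): concept
    for concept in CONCEPTS
    for language, word in concept["surface"].items()
}


def lookup(language, word):
    """Return the language-neutral feature bundle for a surface word, if known."""
    concept = INDEX.get((language, word))
    return concept["features"] if concept else None


print(lookup("English", "flowers"))
print(lookup("Urdu", "پھول") == lookup("English", "flowers"))  # True
```

Because words from all three languages resolve to the same bundle, features derived from the bundle do not depend on the input language.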

Benefits of Multilingual Semantic Representation

• Developing tools with scarce target language annotations

– Reduces the need for training data by abstracting over alternative surface realizations

• Developing tools with no target language annotations

– Supports cross-lingual transfer due to language-neutral features derived from the representation