Empirical Approaches to Multilingual Lexical Acquisition

Lecturer: Timothy Baldwin

Empirical Approaches to Multilingual Lexical Acquisition Lecture 1 (16/7/2008)

Lecture 1

Course Overview

Empirical Approaches to Multilingual Lexical Acquisition

• Lecturer: Timothy Baldwin ([email protected])

• Website: http://www.coli.uni-saarland.de/~tbaldwin/lexacq/

(Approximate) Schedule

Day  Time         Content
Wed  16:00-16:55  Introduction to multilingual lexical acquisition
     17:00-17:55  Introduction to machine learning
     18:00-18:55  Data discovery: language identification
Thu  17:00-17:55  Unsupervised approaches to lexical acquisition: word segmentation and MWE extraction
     18:00-18:55  Monolingual countability learning
Fri  16:00-16:55  Crosslingual countability learning
     17:00-17:55  Learning Verb Syntax
     18:00-18:55  General-purpose lexical acquisition

Prerequisites

• Linguistic skills:
  ◦ basic notions of word classes, phrase structure, constituency
  ◦ basic understanding of ontological semantics, esp. in the context of WordNet
• Mathematical skills:
  ◦ familiarity with formal mathematical notation
  ◦ basic familiarity with probability/information theory

Introduction to Multilingual Lexical Acquisition

Basic Terminology

• Corpora

• Tokens and types

• Ambiguity and disambiguation

• Words and multiword expressions (MWEs)

Corpora

• A corpus (plural corpora) is a body of written or spoken language, generally either from a homogeneous source or balanced across multiple sources in an attempt to be representative of a given language type
• Examples: British National Corpus (BNC), Penn Treebank (Brown, WSJ, Switchboard), Tiger Corpus, EUROPARL, ...

Types and Tokens

• The number of types in a corpus is the number of unique word forms, and the number of tokens is the total word count

    Pease porridge hot
    Pease porridge cold
    Pease porridge in the pot
    Nine days old

• Types: 10 (Pease, porridge, hot, cold, ...)
• Tokens: 14 (Pease, porridge, hot, Pease, ...)
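The counts for the rhyme can be reproduced with a few lines of Python. This is only a minimal sketch: splitting on whitespace is an assumption made for illustration, and real corpora generally need a proper tokeniser (and a decision about case folding).

```python
from collections import Counter

rhyme = """Pease porridge hot
Pease porridge cold
Pease porridge in the pot
Nine days old"""

# Assumption: whitespace tokenisation, no case folding.
tokens = rhyme.split()
type_counts = Counter(tokens)

num_tokens = len(tokens)       # total word count
num_types = len(type_counts)   # number of unique word forms

print(num_types, num_tokens)   # prints "10 14"
```

`type_counts` also gives the frequency of each type, e.g. three occurrences each of "Pease" and "porridge".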

Ambiguity and Disambiguation

• Ambiguity: observation that a given word occurs in multiple configurations
• Disambiguation: determination of which of a fixed set of classes a given word conforms to

    The gang held up the bank
    The boat pulled up at the bank
    We stopped by the bank

Words and Multiword Expressions

• (Escapist definition) A word is what we would expect to occur as an atomic, independent entry in a dictionary (e.g. reconsider, shogakko “primary school”)
• (Narrow definition) A multiword expression (MWE) is made up of multiple words and is lexically, syntactically, semantically, pragmatically and/or statistically idiosyncratic (e.g. look up, phonebook, off screen)

INTRODUCTION

What is Lexical Acquisition?

• Lexical acquisition is the process of (semi-)automatically learning lexical properties (usually defined by a given language resource), i.e., we are in the business of “filling in the gaps” in a language resource
• Why bother?
  ◦ language is productive
  ◦ language is dynamic
  ◦ language is domain-dependent
  ◦ with ≈7,000 living languages in the world, there aren’t enough computational linguists to go around!

OK, so what are Language Resources (LRs)?

• From the ELRA website, a language resource is:

    ... a set of speech or language data and descriptions in machine readable form, used e.g. for building, improving or evaluating natural language and speech algorithms or systems, or, as [a] core resource for the software localisation and language services industries, for language studies, electronic publishing, international transactions, subject-area specialists and end users.

Deep Language Resources (DLRs)

• Definition: language resources which encode precise symbolic linguistic knowledge based on a well-defined linguistic theory
• Examples:
  ◦ lexical semantic resources (e.g. WordNet)
  ◦ syntactic resources (e.g. COMLEX, Penn Treebank, CCGBank)
  ◦ lexico-semantic resources (e.g. PropBank, FrameNet)
  ◦ precision grammars (e.g. ERG, PARC grammars)

Keystone (Deep) Language Resource (#1)

• WordNet: applications in word sense disambiguation, information retrieval, PP attachment, document summarisation, information extraction ...

    {savings bank, coin bank, money box, bank}
      ∈ {container}
      ∈ {instrumentality, instrumentation}
      ∈ {artifact, artefact}
      ∈ {whole, unit}
      ...
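To make the chain concrete, here is a minimal sketch of walking up a hypernym hierarchy. The table below is a toy dictionary holding only the links shown on the slide (one representative word per synset); it is not the real WordNet database or any WordNet API, and the function names are hypothetical.

```python
# Toy hypernym table: child synset -> parent synset (links taken from the slide only).
HYPERNYM = {
    "savings bank": "container",
    "container": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "whole",
}

def hypernym_chain(synset):
    """Return the synset followed by all of its ancestors, nearest first."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_a(synset, ancestor):
    """True if `ancestor` appears strictly above `synset` in the toy hierarchy."""
    return ancestor in hypernym_chain(synset)[1:]

print(" < ".join(hypernym_chain("savings bank")))
```

Applications such as word sense disambiguation typically rely on this kind of ancestor lookup, for example to test whether a candidate sense of "bank" is a kind of artifact.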

Keystone (Deep) Language Resource (#2)

• COMLEX: applications in parsing, information extraction, word sense disambiguation, computational lexical semantics, ...

    (ADJECTIVE :ORTH "ablative" :FEATURES ((ATTRIBUTIVE)))
    (NOUN :ORTH "ablative" :FEATURES ((COUNTABLE)))
    (NOUN :ORTH "ablaut" :PLURAL *NONE*)
    (ADJECTIVE :ORTH "ablaze" :FEATURES ((AINRN)
                                         (PREDICATIVE)))
    (ADVERB :ORTH "ablaze" :MODIF ((PRED-ADV)
                                   (POST-NOUN)
                                   (CLAUSAL-ADV :VERB-OBJ T
                                                :FINAL T)))

Keystone (Deep) Language Resource (#3)

• English Resource Grammar: applications in parsing, language understanding, ontology extraction, grammar checking, ...

    ability_n2 := n_vp_c_le &
      [ STEM < "ability" >,
        SYNSEM [ LKEYS.KEYREL.PRED "_ability_n_rel" ] ].

    able_a1 := aj_-_i_le &
      [ STEM < "able" >,
        SYNSEM [ LKEYS.KEYREL.PRED "_able_a_rel" ] ].

    able_a2 := aj_vp_i-seq_le &
      [ STEM < "able" >,
        SYNSEM [ LKEYS.KEYREL.PRED "_able_a_rel" ] ].

Birds of a Feather ...

• Larger “family” of HPSGs being developed as part of DELPH-IN: German, Norwegian, Korean, Modern Greek, Spanish, Swedish, Catalan, Chinese, ...

The Rose

• DLRs are:
  ◦ attempts to capture (part of) a language in its full complexity
  ◦ glass-box tools for testing the generality of theories, hypotheses, analyses etc. (mono- and cross-lingually)
  ◦ valuable in applications requiring a fine-grained level of representation (deep linguistic processing)

The Thorns

• But DLRs are also:
  ◦ expensive to build
  ◦ restricted/skewed in domain (lack of portability)
  ◦ limited in their system coverage
    ∗ constructions without analyses, unannotated lexical relations, etc.
  ◦ limited in their lexical coverage
    ∗ lexemes with partial coverage (rare word usages)
    ∗ lexemes with no coverage (rare words, MWEs, etc.)

DLR Development Process

• DLR development made up of two tasks:
  1. system design = development of description language/core infrastructure (e.g. lexical hierarchy/ontology)
  2. data classification = population of ontology/lexical types with data items

[Figure: example data items to be classified: information, data; beer, paper; train, cable; ability, right; absence, wife]

The Name of the Game

• Deep lexical acquisition (DLA) = automatic methods for performing data classification

But First a Clarification ...

• DLR development and statistical methods are not at opposite extremes of the NLP continuum:
  ◦ historically, the existence of DLRs has been a driver of statistical NLP (POS tagging, treebank parsing, SRL, WSD, ...)
  ◦ equally, DLR development is drawing on statistical methods for (semi-)automation more and more

DLA ca. 2008

• Largely English-centric
• Presupposition of:
  ◦ a large-scale corpus
  ◦ reasonable amounts of annotated corpus data
  ◦ preprocessing (POS tagger, parser, ...)
  ◦ expert linguistic knowledge (e.g. template set) ...

CLASSIFICATION OF DLA METHODS

(Baldwin 2007)

Basic Approaches to DLA

• General-purpose vs. targeted
  ◦ Is the method applicable to a range of tasks or specialised to a particular lexical property?
• In vitro vs. in vivo
  ◦ Is the DLA method embedded within the target DLR, or based on secondary DLRs?
• Token- vs. type-based

Applicability

• What is the relative “portability” of a given method?
• General-purpose DLA:
  ◦ applicable to any DLR
  ◦ generally employ a combination of type- and token-level features OR take the form of resource alignment
• Targeted DLA:
  ◦ specialised methodology to (automatically) learn a particular linguistic property

Reliance on Secondary DLRs

• In vitro: analyse lexemes in a context independent of the target DLR, via a secondary DLR/preprocessor
  ◦ often the only option in the absence of training data
• In vivo: leverage the target DLR directly in learning new lexical items

Data Point Granularity

• What is the instance granularity of a given method?
• Token-level DLA: identify token-level instances of a given lexical property
• Type-level DLA: extract type-level instances of a given lexical property
• N.B. DLRs can similarly be token- or type-based (e.g. treebank vs. wordnet), but granularity of DLA doesn’t necessarily correspond to the granularity of the DLR

DLA SPECIMEN #1: General-purpose DLA

General-purpose DLA (Baldwin 2005)

• General-purpose, in vitro, type-level DLA
• Use supervised classifiers to learn deep linguistic properties of novel words
  ◦ feature vectors from a given secondary LR
  ◦ class labels from seed data in the ERG lexicon
  ◦ evaluate by 10-fold stratified cross-validation
• Learn a binary classifier for each lexical type (110 binary classifiers for each LR type, with default backoff)
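As a rough illustration of how the outputs of per-type binary classifiers might be combined, the sketch below accepts every lexical type whose classifier fires and falls back to a default type when none does. The function name, the 0.5 threshold, the scores and the choice of default type are all assumptions made for illustration; the slide does not specify how the backoff is chosen, and the lexical type names are simply borrowed from the ERG example earlier in the lecture.

```python
def predict_lexical_types(scores, default_type, threshold=0.5):
    """Combine per-type binary classifier scores into a set of predicted types.

    scores: dict mapping lexical type name -> score from that type's binary classifier.
    default_type: type to back off to when no classifier accepts the word (assumed).
    """
    accepted = sorted(t for t, s in scores.items() if s >= threshold)
    return accepted if accepted else [default_type]

# Invented scores for two hypothetical novel words.
scores_a = {"aj_-_i_le": 0.81, "aj_vp_i-seq_le": 0.64, "n_vp_c_le": 0.03}
scores_b = {"aj_-_i_le": 0.10, "aj_vp_i-seq_le": 0.22, "n_vp_c_le": 0.31}

print(predict_lexical_types(scores_a, default_type="n_vp_c_le"))
print(predict_lexical_types(scores_b, default_type="n_vp_c_le"))
```

Because each lexical type has its own classifier, a word can receive several types at once (first call), and the backoff guarantees that every word receives at least one (second call).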

Related work: Joanis and Stevenson (2003), Pantel and Pennacchiotti (2006), Snow et al. (2006)

Secondary Language Resources

• Use a range of LRs of varying availability:

Secondary LR type          Preprocessor(s)
Word list∗∗∗               none
Morphological lexicon∗     none
Raw text corpus∗∗∗         POS tagger∗∗
                           Chunk parser∗
                           Dependency parser∗
WordNet-style ontology∗    none

Predicted availability: ∗ = low; ∗∗ = medium; ∗∗∗ = high.

DLA SPECIMEN #2: Countability Learning

English Countability Learning (Baldwin and Bond 2003)

• General-purpose/targeted, in vitro, type-level DLA
• Classify English nouns according to powerset of 4 countability classes:
  ◦ countable: one book, two books
  ◦ uncountable: *one equipment, much equipment
  ◦ plural only: *one clothes, clothes horse
  ◦ bipartite: *one scissors, scissor kick, pair of scissors
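Classifying over the powerset means a noun may belong to several countability classes at once. A minimal sketch of the resulting label space follows; the helper name `powerset` and the example noun are illustrative assumptions rather than anything taken from the paper.

```python
from itertools import combinations

CLASSES = ("countable", "uncountable", "plural only", "bipartite")

def powerset(items):
    """All subsets of `items`, from the empty set up to the full set."""
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

label_sets = powerset(CLASSES)
print(len(label_sets))  # 2**4 = 16 subsets, including the empty set

# Illustrative example: "beer" can be used countably (one beer) and uncountably (much beer).
beer = frozenset({"countable", "uncountable"})
print(beer in label_sets)
```

In practice, a multi-label space like this is often handled by training one binary classifier per class rather than one classifier over all 16 subsets; that is a common design choice, not a claim about the method used here.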

Related Work: Nagata et al. (2006), Briscoe and Carroll (1997), Korhonen (2002)

Method 1: Lexico-syntactic Patterns

• Intuition: the countability properties of a noun type are reflected in its corpus token occurrences:

    Acyclovir given intravenously, ...
    ... is also probably responsible for a coagulopathy ...

• Identify token occurrences of lexico-syntactic patterns associated with each countability class
• For given noun, combine token-level counts for each pattern into combined feature vector [targeted]
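The idea of turning token-level pattern matches into a type-level feature vector can be sketched as follows. The three regular-expression patterns are hypothetical stand-ins chosen for illustration; they are not the pattern set used by Baldwin and Bond (2003), which draws on richer preprocessing such as POS tagging and chunking.

```python
import re
from collections import Counter

# Hypothetical patterns; "{n}" is replaced by the target noun.
PATTERNS = {
    "indef_article": r"\b(?:a|an)\s+{n}\b",                  # suggests countable
    "much": r"\bmuch\s+{n}\b",                               # suggests uncountable
    "numeral_plural": r"\b(?:two|three|many)\s+{n}s\b",      # suggests countable
}

def pattern_features(noun, sentences):
    """Count matches of each pattern over all sentences; one vector per noun type."""
    counts = Counter()
    for name, template in PATTERNS.items():
        regex = re.compile(template.format(n=re.escape(noun)), re.IGNORECASE)
        for sentence in sentences:
            counts[name] += len(regex.findall(sentence))
    return [counts[name] for name in PATTERNS]

corpus = [
    "... is also probably responsible for a coagulopathy ...",
    "She bought two books and a book.",
    "Too much equipment was left behind.",
]
for noun in ("coagulopathy", "book", "equipment"):
    print(noun, pattern_features(noun, corpus))
```

Each vector has one dimension per pattern, in the order the patterns are declared, and could be passed to any supervised classifier.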

Method 2: Semantic Similarity

• Intuition: countability is to some degree deterministic given the semantics of a word:

    dog, pooch, canine, mongrel, ...
    gold, silver, copper, bronze, ...

  BUT suitcases vs. luggage, leaves vs. foliage, etc.
• Take an existing ontology and determine the default countability for each synset (semantic class) [general purpose]
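One simple way to obtain a default countability per synset is a majority vote over the members whose countability is already known. The sketch below assumes that strategy and uses a tiny hand-made lexicon; both are illustrative assumptions, not the procedure reported in the paper.

```python
from collections import Counter

# Toy seed lexicon of known countabilities (invented for illustration).
KNOWN = {
    "dog": "countable", "pooch": "countable",
    "gold": "uncountable", "silver": "uncountable",
    "suitcase": "countable", "luggage": "uncountable",
}

def default_countability(synset_members, known):
    """Majority countability class among members with a known label, else None."""
    votes = Counter(known[w] for w in synset_members if w in known)
    return votes.most_common(1)[0][0] if votes else None

print(default_countability(["dog", "pooch", "canine", "mongrel"], KNOWN))
print(default_countability(["gold", "silver", "copper", "bronze"], KNOWN))
```

Unseen members such as "mongrel" or "bronze" would then inherit the synset default. The suitcases/luggage contrast on the slide shows why this is only a default: semantically close nouns can still differ in countability.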

DLA SPECIMEN #3: Supertagging

Supertagging (Blunsom and Baldwin 2006)

• General-purpose, in vivo, token/type-based DLA
• Supertagging = POS tagging with a very fine-grained tagset (e.g. full set of lexical types for precision grammar)
• Keep the feature set as general as possible to ensure compatibility with any structured learning task
• ML backbone: pseudo-likelihood CRF

Related Work: Bangalore and Joshi (1999), Clark and Curran (2004), Zhang and Kordoni (2005)

Features

• Supertagging based on a combination of word context and (generic) lexical features
• Lexical features based on n-gram prefixes & suffixes, and basic character sets in the given language
  ◦ English = 5 character sets (upper case, lower case, numbers, punctuation and hyphens)
  ◦ Japanese = 6 character sets (Roman letters, hiragana, katakana, kanji, (Arabic) numerals and punctuation)
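The lexical features for English might be extracted along the following lines. This is a sketch under stated assumptions: the maximum affix length of 3, the feature naming scheme and the function names are all hypothetical, and the word-context features fed to the CRF are not shown.

```python
import string

def char_class(ch):
    """Map a character to one of the five English character sets (or 'other')."""
    if ch == "-":
        return "hyphen"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    if ch in string.punctuation:
        return "punct"
    return "other"

def lexical_features(word, max_n=3):
    """Prefix/suffix character n-grams plus the character sets present in the word."""
    feats = set()
    for n in range(1, min(max_n, len(word)) + 1):
        feats.add(f"prefix={word[:n]}")
        feats.add(f"suffix={word[-n:]}")
    for ch in word:
        feats.add(f"has_{char_class(ch)}")
    return feats

print(sorted(lexical_features("Re-run")))
```

Because these features depend only on the character string, the same extractor can be reused for another language by swapping in that language's character sets, which is in keeping with the aim of keeping the feature set as general as possible.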

SUMMARY

Summary

• Raft of different methods for tackling same basic problem, based on the availability of different resources
• Most tasks/methods fit into our classification of:
  ◦ general-purpose vs. targeted
  ◦ in vitro vs. in vivo
  ◦ token- vs. type-based

Big Questions We’ll be Looking at

• What are the basic empirical methods used in DLA?
• What do we do if we don’t have a language resource corpus handy?
• What is the relative performance of different approaches/representations in DLA?
• How can we leverage one language in analysing a second?
• What gains do we get from specialist linguistic knowledge? (template development, feature engineering, etc.)

References

Baldwin, Timothy. 2005. Bootstrapping deep lexical resources: Resources for courses. In Proc. of the ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition, 67–76, Ann Arbor, USA.

Baldwin, Timothy. 2007. Scalable deep linguistic processing: Mind the lexical gap. In Proc. of the 21st Pacific Asia Conference on Language, Information and Computation (PACLIC 21), 3–12, Seoul, Korea.

Baldwin, Timothy, and Francis Bond. 2003. Learning the countability of English nouns from corpus data. In Proc. of the 41st Annual Meeting of the ACL, 463–70, Sapporo, Japan.

Bangalore, Srinivas, and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics 25.237–65.

Blunsom, Phil, and Timothy Baldwin. 2006. Multilingual deep lexical acquisition for HPSGs via supertagging. In Proc. of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), 164–71, Sydney, Australia.

Briscoe, Ted, and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proc. of the 5th Conference on Applied Natural Language Processing (ANLP), 356–63, Washington DC, USA.

Clark, Stephen, and James R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In Proc. of the 20th International Conference on Computational Linguistics (COLING 2004), 282–8, Geneva, Switzerland.

Joanis, Eric, and Suzanne Stevenson. 2003. A general feature space for automatic verb classification. In Proc. of the 10th Conference of the EACL (EACL 2003), 163–70, Budapest, Hungary.

Korhonen, Anna. 2002. Subcategorization Acquisition. University of Cambridge dissertation.

Nagata, Ryo, Atsuo Kawai, Koichiro Morihiro, and Naoki Isu. 2006. Reinforcing English countability prediction with one countability per discourse property. In Proc. of COLING/ACL 2006, 595–602, Sydney, Australia.

Pantel, Patrick, and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proc. of COLING/ACL 2006, 113–20, Sydney, Australia.

Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proc. of COLING/ACL 2006, 801–8, Sydney, Australia.

Zhang, Yi, and Valia Kordoni. 2005. A statistical approach towards unknown word type prediction for deep grammars. In Proc. of the Australasian Language Technology Workshop 2005, 24–31, Sydney, Australia.
