corpus-based analysis of near- synonymous constructions: larger, faster and more objective...
TRANSCRIPT
Corpus-based analysis of near-synonymous constructions:
Larger, faster and more objectiveNatalia Levshina
Kris HeylenDirk Geeraerts
University of LeuvenRU Quantitative Lexicology and Variational
Linguistics
Outline
• quantitative approaches to constructional semantics: the problem of semantic classes
• distributional semantic models as a method of semantic classification
• experiments with nominal and verbal classes in Dutch doen and laten CCx
• discussion and future research
Leiden 2011/03/25
Quantitative Constructional Semantics
Leiden 2011/03/25
semasiological perspective onomasiological perspective
collexeme-based feature-based collexeme-
based feature-based
collostructional analysis
(Stefanowitsch & Gries 2003)
cluster analysis (Gries 2006)
distinctive collexeme analysis
(Gries & Stefanowitsch
2004)
- logistic regression(Speelman & Geeraerts 2009)- MCA (Glynn 2007)- behavioural profiles (Divjak & Gries 2006)
constructional spaces (Levshina et al. In press)
Quantitative Constructional Semantics
Leiden 2011/03/25
semasiological perspective onomasiological perspective
collexeme-based feature-based collexeme-
based feature-based
collostructional analysis
(Stefanowitsch & Gries 2003)
cluster analysis (Gries 2006)
distinctive collexeme analysis
(Gries & Stefanowitsch
2004)
- logistic regression(Speelman & Geeraerts 2009)- MCA (Glynn 2007)- behavioural profiles (Divjak & Gries 2006)
constructional spaces (Levshina et al. In press)
All approaches require semantic classes
Semantic Classes: State of the Art
• as a rule, intuitive and subjective• ‘standard’ classifications (e.g. Levin 1993,
Garretson 2004):- not many- for English- incomplete- not tested empirically
• Gries and Stefanowitsch 2010: corpus-driven verb classes, but- limited contextual features (18 prepositions)- subjective evaluation
Leiden 2011/03/25
Semantic Classes: Desiderata
Leiden 2011/03/25
• data-driven, (potentially) entire vocabulary• objective validation• semantic relationships are multidimensional
different criteria of similarity• varying schematicity of semantic relationships
different levels of granularity
Instead of working with one a priori classification, let’s compare different ones and see which works
the best
Outline
• quantitative approaches to constructional semantics: the problem of semantic classes
• our proposal: distributional semantic models as a method of semantic classification
• experiments with nominal and verbal classes in Dutch doen and laten CCx
• discussion and future research
Leiden 2011/03/25
Our proposal
• a bottom-up quantitative approach based on distributional Semantic Vector Space models
• task-specific:- adjustable criteria of similarity- adjustable granularity
• validation in a real data set for near-synonymous doen and laten CCx (onomasiological perspective)
Leiden 2011/03/25
Semantic Vector Space Models
Standard technique in Computational Linguistics:• corpus based, bottom-up clustering of semantically
related words into semantic classesBased on the Distributional Hypothesis (Firth):• You shall know a word by the company it keeps• Words appearing in similar contexts tend to have
similar meaningsMethod• each word is assigned a vector stating the word's
co-occurrence frequencies with a range of possible contexts
• words with similar context vectors have similar meanings
Leiden 2011/03/25
Semantic Vector Space Models
Leiden 2011/03/25
gun
psychopat
knife
cruelly
lovingly
mother
lovers
toilet
...
kiss 2 2 0 1 89 56 98 3
hug 3 1 2 5 77 49 88 0
kill 10 59 67 69 0 8 12 1
murder 97 65 58 81 1 9 9 2
....
• co-occurrence frequency of target words (rows) with context words (columns)
• High dimensional matrix (only small subset shown)
Semantic Vector Space Models
Leiden 2011/03/25
gun
psychopat
knife
cruelly
lovingly
mother
lovers
toilet
...
kiss 2 2 0 1 89 56 98 3
hug 3 1 2 5 77 49 88 0
kill 10 59 67 69 0 8 12 1
murder 97 65 58 81 1 9 9 2
....
• general overview of collactional behaviour and distributional properties of a language's vocabulary
• Behavioural profiles (Divjak) but many more features
Semantic Vector Space Models
• weighting of frequencies (pointwise mutual information)
• projection of vectors into a semantic "word space"• measure proximity of vectors in space (cosine)• cluster words based on vector proximity
Leiden 2011/03/25
kill
mur
der
hug
kiss
cru
elly
lovingly
Semantic Vector Space Models
SVS come in many different flavours• Technical parameters (frequency weighting scheme,
similarity measure, dimensionality reduction technique, clustering technique,...)
• Number of clusters granularity of semantic distinctions– dependent on specific application (cf. infra)
• Definition of 'context' type of semantics captured– Document-based context features: topical relations– Window-based, bag-of-words context features: loose
associations– Dependency-based context features: tight relations
(synonymy)– Subcategorisation Frame features (verbs only): Levin-like
classes
Leiden 2011/03/25
Bag of words model
Leiden 2011/03/25
gun
psychopat
knife
cruelly
lovingly
mother
lovers
toilet
...
kiss 2 2 0 1 89 56 98 3
hug 3 1 2 5 77 49 88 0
kill 10 59 67 69 0 8 12 1
murder 97 65 58 81 1 9 9 2
....
• context feature = co-occurring word in window left and right of target word
• looser, associative semantic relations: e.g. doctor-hospital
Bag of words model
Leiden 2011/03/25
gun
psychopat
knife
cruelly
lovingly
mother
lovers
toilet
...
kiss 2 2 0 1 89 56 98 3
hug 3 1 2 5 77 49 88 0
kill 10 59 67 69 0 8 12 1
murder 97 65 58 81 1 9 9 2
....
• context feature = co-occurring word in window left and right of target word
• looser, associative semantic relations: e.g. doctor-hospital
...The psychopat killed his victims with a blunt knife. ...
Dependency based model
Leiden 2011/03/25
+PP with gun
+SU psychopat
+OBJ psychopat
+PP with knife
+ ADV cruelly
+ADV lovingly
+ SU mother
+SU Lovers
+PP on toilet
...
kiss 2 2 0 0 1 89 56 98 3
hug 3 1 0 2 5 77 49 88 0
kill 10 59 2 67 69 0 8 12 1
murder 97 65 1 58 81 1 9 9 2
....• context feature = word in specific syntactic dependency relation with target word
• tight semantic relations: hospital - clinic
Dependency based model
Leiden 2011/03/25
+PP with gun
+SU psychopat
+OBJ psychopat
+PP with knife
+ ADV cruelly
+ADV lovingly
+ SU mother
+SU Lovers
+PP on toilet
...
kiss 2 2 0 0 1 89 56 98 3
hug 3 1 0 2 5 77 49 88 0
kill 10 59 2 67 69 0 8 12 1
murder 97 65 1 58 81 1 9 9 2
....• context feature = word in specific syntactic dependency relation with target word
• tight semantic relations: hospital - clinic
...The psychopat killed his victims with a blunt knife. ...
SU OBJ PP
Subcat frame model
Leiden 2011/03/25
SU SU / OBJ
SU / OBJ / ADV
SU / PP
SU / OBJ / PP
SU / OBJ / IOBJ
...
kiss 2 2 0 0 1 89
hug 3 1 0 2 5 77
kill 10 59 2 67 69 0
murder 97 65 1 58 81 1
....• context feature = subcategorization frame co-
occurring with target verb (only used for verbs!)• Levin-like verb classes: e.g. lie, stand, sit, lean
Subcat frame model
Leiden 2011/03/25
SU SU / OBJ
SU / OBJ / ADV
SU / PP
SU / OBJ / PP
SU / OBJ / IOBJ
...
kiss 2 2 0 0 1 89
hug 3 1 0 2 5 77
kill 10 59 2 67 69 0
murder 97 65 1 58 81 1
....• context feature = subcategorization frame co-
occurring with target verb (only used for verbs!)• Levin-like verb classes: e.g. lie, stand, sit, lean
...The psychopat killed his victims with a blunt knife. ...
SU / OBJ / PP
Semantic Vector Space Models
3 models form a continuum lexical to syntactic purely lexical distributional information
lexical and syntactic (dependency) informtion
purely syntactic (subcat.) distributional propertiesmore intermediate forms, depending on• number of dependency relations (e.g. arguments
only)• inclusion of some "lexical" info in subcat frames
(e.g. prepositions or semantic noun classes)Leiden 2011/03/25
Semantic Vector Space Models
Bag of words
dependen-cy based
semantically / lexically
enriched SFM
Subcat Frame Models
# D
EP.
REL /
Win
dow
.
BOW4
BOW15
DEP3
DEP7
DEP13
SFr5
SFr9
SFr23
SFp9
SFp23
SFc5
SFc9
SFc23
SFs5
SFs9
SFs23
LEXICAL SYNTACTIC
• 16 different models for verbs• 2 different models for nouns
BOW5
DEP8
Outline
• quantitative approaches to constructional semantics: the problem of semantic classes
• distributional semantic models as a method of semantic classification
• experiments with nominal and verbal classes in Dutch doen and laten CCx
• discussion and future research
Leiden 2011/03/25
Our Experiments
• Causer, Causee and Effected Predicate slots of Dutch CCx with doen and laten
De politie deed de auto stoppen.
liet
(slot names: Kemmer & Verhagen 1994)
The police
CAUSER
mademade/let
CAUSATIVEAUXILIARY
the car
CAUSEE
stop
EFFECTEDPREDICATE
Leiden 2011/03/25
Classes of Cr and Ce
• data: Twente News Corpus• 2 models:
- purely lexical distributional information (bag of words)- lexical and syntactic dependency information (e.g. noun N as a subject of a verb V)
• granularity: from 2 to 100 classes emerging from hierarchical cluster analysis
Leiden 2011/03/25
Classes of Effected Predicates
• 16 models, a continuum from purely lexical information to purely syntactic information (subcat frames)
• different granularity (number of clusters in hierarchical cluster analysis): from 5 to 100
Leiden 2011/03/25
Evaluation
• test set: 6800 obs. with causative doen and laten from Dutch newspaper corpora
• objects: all explicit non-pronominal fillers of Causer, Causee and Effected Predicate slots
• criterion: prediction of doen or laten in the observations
• method: logistic regression model, several indicators R2, C, Somer’s Dxy, Gamma, Tau, AIC
Leiden 2011/03/25
Prediction by Causer Classes
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
Causer's classes, Nagelkerke R^2
n of clusters
R^2
BOW
DepRel8
6 Clusters of Causer
Leiden 2011/03/25
No Top freq nouns doen or laten
1 cd, cijfer, plaat, herstel, stem, aanslag, afwezigheid, rentree, resultaat, aanpak
doen
2 Feyenoord, PSV, dirigent, speler, beurs, orkest, doelman, zanger, Van Gaal, componist
laten
3 Gergiev, Van Hecke, gemeente bestuur, Harnoncourt, Morissette, Pollini, secretaris generaal, AH To Go, alleskunner, Ax
laten
4 Verenigde Staten, VS, Amerika, Europa, Washington, geheel, tempo, Duitsland, Engels, India
laten
5 regering, minister, bedrijf, trainer, president, belegger, muziek, ploeg, premier, man
laten
6 Mahlerstem, zaal technicus, beroepskader, Neal Evans, Bastuba en contrabasclarinet, Tuomarila, Wanderlied
NA
Prediction by Causee Classes
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
Causee classes, Nagelkerke R^2
n of clusters
R^2
BOW
DepRel8
Prediction by Effected Predicate
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
EP classes, Nagelkerke R^2
n of clusters
R^2
BOW15BOW15rVrelVarelVbarel23richsubcat9relrichsubcat5relrichsubcat23relprep9relprep23sclass9sclass5sclass23syn9syn5syn
35 Clusters for EP (Examples)
Leiden 2011/03/25
No Top freq verbsdoen or
laten
3 herleven, kantelen, versmelten, beven, samensmelten, verslappen, sidderen, smelten, instorten, vervagen
doen
24 stijgen, zakken, dalen, groeien, overlopen, rijzen, terugzakken, wankelen, daveren, overvloeien
doen
25 denken, vermoeden, geloven, besluiten, vrezen, hopen, verzuchten, toegeven, betogen, concluderen
doen
23 zien, weten, leiden, maken, doen, blijken, voelen, kennen, worden, schijnen
laten
12 horen, verstaan, betrappen, afleiden, merken, delen, rijmen, verkiezen, verwachten, associëren
laten
34 liggen, vallen, gaan, komen, lopen, staan, zitten laten
The best models for 3 Slots
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
Best models with Causer, Causee and EP classes, Nagelkerke R^2
n of clusters
R^2
Cr DepRel8
Ce DepRel8
EP 23syn
The best models for 3 Slots
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
Best models with Causer, Causee and EP classes, Nagelkerke R^2
n of clusters
R^2
Cr DepRel8
Ce DepRel8
EP 23syn
Causers give more data reduction than
verbs!
All three slots in one model
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
3 slots, Nagelkerke R^2
n of EP clusters
R^2
only EP
EP + 2 clusters Cr Ce
EP + 5 clusters Cr Ce
EP + 10 clusters Cr Ce
EP + 20 clusters Cr Ce
EP + 30 clusters Cr Ce
All three slots in one model
Leiden 2011/03/25
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
3 slots, Nagelkerke R^2
n of EP clusters
R^2
only EP
EP + 2 clusters Cr Ce
EP + 5 clusters Cr Ce
EP + 10 clusters Cr Ce
EP + 20 clusters Cr Ce
EP + 30 clusters Cr Ce
slots interact in conveying
constructional meaning!
Outline
• quantitative approaches to constructional semantics: the problem of semantic classes
• distributional semantic models as a method of semantic classification
• experiments with nominal and verbal classes in Dutch doen and laten CCx
• discussion and future research
Leiden 2011/03/25
Desiderata Revisited
Leiden 2011/03/25
• data-driven, (potentially) entire vocabulary
• objective validation
• semantic relationships are multidimensional different criteria of similarity
• varying schematicity of semantic relationships different levels of
granularity
Desiderata Revisited
Leiden 2011/03/25
• data-driven, (potentially) entire vocabularybottom-up classes work!
• objective validation
• semantic relationships are multidimensional different criteria of similarity
• varying schematicity of semantic relationships different levels of
granularity
Desiderata Revisited
Leiden 2011/03/25
• data-driven, (potentially) entire vocabularybottom-up classes work!
• objective validationCr and EP classes perform better than Ce
• semantic relationships are multidimensional different criteria of similarity
• varying schematicity of semantic relationships different levels of
granularity
Desiderata Revisited
Leiden 2011/03/25
• data-driven, (potentially) entire vocabularybottom-up classes work!
• objective validationCr and EP classes perform better than Ce
• semantic relationships are multidimensional different criteria of similarity
syntax-sensitive models perform the best• varying schematicity of semantic relationships
different levels of granularity
Desiderata Revisited
Leiden 2011/03/25
• data-driven, (potentially) entire vocabularybottom-up classes work!
• objective validationCr and EP classes perform better than Ce
• semantic relationships are multidimensional different criteria of similarity
syntax-sensitive models perform the best• varying schematicity of semantic relationships
different levels of granularity
nouns ‘need’ less classes than verbs
Future research
• fuzzy clustering (a continuum of similarity)• compare with other constructions• compare with lexical near-synonyms• classes for semasiological studies
…suggestions?
for further information:http://wwwling.arts.kuleuven.be/[email protected]
Thank you!