"ssc" - geometria e semantica del linguaggio
TRANSCRIPT
Meaning in the real world
Extensional representation of meaning!
automobile
Meaning in the real world
Intensional representation of meaning!
automòbile adj. and fem. noun [from Fr. automobile ‹otomobìl›, compound of auto-1 and the Lat. adj. mobĭlis "that moves"]. – 1. adj. That moves by itself, chiefly of vehicles that move on the ground (or also in water, e.g. self-propelled craft such as torpedoes and missiles) by means of their own engine. 2. fem. noun. Four-wheeled motor vehicle, generally with an internal-combustion engine, used to carry a limited number of people on ordinary roads (called, in the...
AUTOMOBILE
automobile
Meaning in our mind
The representation of the concept AUTOMOBILE in our mind
Connectionist (neural networks)
Symbolic (logical formalism)
Meaning in text
Distributional semantics: what is an automobile?
AUTOMOBILE is the set of linguistic contexts in which the word automobile occurs
A bottle of Tezguno is on the table.
Everyone likes Tezguno.
Tezguno makes you drunk.
We make Tezguno out of corn.
What’s Tezguno?
Distributional semantic models
"You shall know a word by the company it keeps!" (Firth)
The meaning of a word is determined by its usage
Distributional semantic models
• Computational models that build semantic representations of words by analyzing corpora
– Words are represented as vectors
– Vectors are built by statistically analyzing the linguistic contexts in which words occur
Distributional vector
1. count how many times a word occurs in a given context
2. build a vector from the occurrence counts computed in step 1
SIMILAR WORDS WILL HAVE SIMILAR VECTORS
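The two steps above can be sketched in a few lines of Python; the toy corpus, the context words, and the window size here are invented for illustration only:

```python
from collections import Counter

def cooccurrence_vector(target, corpus_sentences, context_words, window=2):
    """Step 1: count how often `target` occurs near each context word;
    step 2: lay the counts out as a vector over a fixed context order."""
    counts = Counter()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in context_words:
                    counts[tokens[j]] += 1
    return [counts[c] for c in context_words]

contexts = ["abbaia", "mangia", "dorme"]
corpus = ["il cane abbaia e mangia", "il cane dorme", "il gatto dorme e mangia"]
print(cooccurrence_vector("cane", corpus, contexts, window=3))   # [1, 1, 1]
print(cooccurrence_vector("gatto", corpus, contexts, window=3))  # [0, 1, 1]
```

Note how "cane" and "gatto" already get similar count vectors because they occur with the same contexts.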
Matrix = geometric space
• Matrix: words × contexts
C1 C2 C3 C4 C5 C6 C7
cane 5 0 11 2 2 9 1
gatto 4 1 7 1 1 7 2
pane 0 12 0 0 9 1 9
pasta 0 8 1 2 14 0 10
carne 0 7 1 1 11 1 8
topo 4 0 8 0 1 8 1
Matrix = geometric space
[Plot of the word vectors on the C3/C5 plane: cane, gatto and topo cluster together; pasta lies far from them]
Similarity -> closeness in a multi-dimensional space (cosine similarity)
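The claim can be checked directly on the matrix above: cosine similarity between the count vectors puts cane close to gatto and far from pane. A minimal sketch in plain Python:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# rows of the word-by-context matrix above (contexts C1..C7)
vectors = {
    "cane":  [5, 0, 11, 2, 2, 9, 1],
    "gatto": [4, 1, 7, 1, 1, 7, 2],
    "pane":  [0, 12, 0, 0, 9, 1, 9],
    "pasta": [0, 8, 1, 2, 14, 0, 10],
}
print(cosine(vectors["cane"], vectors["gatto"]))  # high: shared contexts
print(cosine(vectors["cane"], vectors["pane"]))   # low: disjoint contexts
```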
Generalization
• A distributional model can be defined by a tuple <T, C, R, W, M, d, S>
– T: target elements -> the words (generally)
– C: the contexts
– R: the relation linking T to C
– W: the weighting scheme
– M: the geometric space T×C
– d: a reduction function of the space M -> M'
– S: a similarity function in M'
Building a semantic space
1. Pre-process the corpus
2. Identify words and contexts
3. Count word/context co-occurrences
4. Weighting (optional, but recommended)
5. Build the T×C matrix
6. Reduce the matrix (optional)
7. Compute the similarity between vectors
The parameters
• The definition of context
– A window of size n, sentence, paragraph, document, a particular syntactic context
• Weighting scheme
• Similarity function
An example
• Term-term matrix
– T: words
– C: words
– R: T occurs "near" C
– W: number of times T and C co-occur
– M: term/term matrix
– d: none, or e.g. SVD (Latent Semantic Analysis)
– S: cosine similarity
1. Pre-processing
• Tokenization is necessary!
– PoS tagging
– Lemmatization
– Parsing
• Too deep an analysis
– introduces errors
– requires additional parameters
– depends on the language
• Pre-processing affects the choice of words and contexts
2. Defining the context
• The document
– the whole document
– paragraph, sentence, portion of text (passage)
• The other words
– Usually the n most frequent are chosen
– Where?
• at a distance fixed a priori (window)
• syntactic dependency
• pattern
3. Weighting
• Frequency, or log(frequency) to damp contexts that occur very often
• Idea: co-occurrence with a rarer context signals a stronger relation
– Mutual Information, Log-Likelihood Ratio
• Information Retrieval: tf-idf, word entropy, …
Pointwise Mutual Information

MI(w1, w2) = log2( P(w1, w2) / (P(w1) · P(w2)) )

where P(wi) = freq(wi) / N and P(w1, w2) = freq(w1, w2) / N

Local MI: LMI(w1, w2) = freq(w1, w2) · MI(w1, w2)
Local Mutual Information

Example (corpus of N = 10^6 tokens; here the joint probability, i.e. the frequency rescaled by 1/N, is used as the weight):
P(bere) = 100/10^6
P(birra) = 25/10^6
P(acqua) = 150/10^6
P(bere, acqua) = 60/10^6
P(bere, birra) = 20/10^6
MI(bere, birra) = log2(8000) ≈ 12.97
MI(bere, acqua) = log2(4000) ≈ 11.97
LMI(bere, birra) = (20/10^6) · 12.97 ≈ 0.00026
LMI(bere, acqua) = (60/10^6) · 11.97 ≈ 0.00072
MI tends to give a lot of weight to rare events; LMI reverses the ranking here, preferring the more frequent pair (bere, acqua).
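A quick way to reproduce the slide's numbers (as noted above, the example weights MI by the joint probability P(w1, w2) rather than the raw frequency, which only rescales by 1/N):

```python
import math

N = 10**6  # corpus size from the slide example

freq = {"bere": 100, "birra": 25, "acqua": 150}
cofreq = {("bere", "birra"): 20, ("bere", "acqua"): 60}

def mi(w1, w2):
    """Pointwise Mutual Information: log2 of observed vs. expected co-occurrence."""
    p_joint = cofreq[(w1, w2)] / N
    return math.log2(p_joint / ((freq[w1] / N) * (freq[w2] / N)))

def lmi(w1, w2):
    """Local MI: MI weighted by the joint probability, damping rare events."""
    return (cofreq[(w1, w2)] / N) * mi(w1, w2)

print(mi("bere", "birra"))   # ≈ 12.97
print(mi("bere", "acqua"))   # ≈ 11.97
print(lmi("bere", "birra"))  # ≈ 0.00026
print(lmi("bere", "acqua"))  # ≈ 0.00072
```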
5. Reducing the matrix
• M (T×C) is a high-dimensional matrix; it can be useful to reduce it:
1. identify the latent dimensions: LSI, PCA
2. reduce the space: approximations of M, e.g. Random Indexing
Random Indexing
The method
• Assign a "random vector" to each context: the random/context vector
• The semantic vector associated with each target (e.g. a word) is the sum of the context vectors of all the contexts in which the target (word) occurs
Context vector
• sparse
• high-dimensional
• ternary, with values in {-1, 0, +1}
• a small number of non-zero elements, randomly distributed
0 0 0 0 0 0 0 -1 0 0 0 0 1 0 0 -1 0 1 0 0 0 0 1 0 0 0 0 -1
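A context vector like the one above can be generated as follows; the dimension, the number of non-zero entries, and the seed are arbitrary choices made for illustration:

```python
import random

def random_index_vector(dim=1000, nonzero=10, seed=None):
    """Sparse ternary context vector: mostly zeros, with a few
    randomly placed +1/-1 entries (half of each sign)."""
    rng = random.Random(seed)
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzero)  # distinct random positions
    for k, pos in enumerate(positions):
        vec[pos] = 1 if k < nonzero // 2 else -1
    return vec

v = random_index_vector(dim=28, nonzero=6, seed=42)
print(v)
```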
Random Indexing (formal)

B(n×k) = A(n×m) · R(m×k), with k << m

B preserves the distances between points up to a constant factor (Johnson-Lindenstrauss lemma): d_B ≈ c · d_A
Example

John eats a red apple
Rjohn -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
Reat  -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
Rred  -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)

SVapple = Rjohn + Reat + Rred = (1, 0, 0, 1, -1, 0, 1, -1, -1, 0)
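The example is easy to verify in code; the vector names follow the slide:

```python
def add_vectors(u, v):
    """Pointwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

R_john = [0, 0, 0, 0, 0, 0, 1, 0, -1, 0]
R_eat  = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
R_red  = [0, 0, 0, 1, 0, 0, 0, -1, 0, 0]

# semantic vector of "apple": sum of the context vectors it occurs with
SV_apple = add_vectors(add_vectors(R_john, R_eat), R_red)
print(SV_apple)  # [1, 0, 0, 1, -1, 0, 1, -1, -1, 0]
```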
Random Indexing
• Advantages
– Simple and fast
– Scalable and parallelizable
– Incremental
• Disadvantages
– Requires a lot of memory
Permutations
• Use different permutations of the elements of the random vector to encode different contexts
– word order
– (syntactic/relational) dependency between terms
• Before being summed, the random vector is permuted according to the context being encoded
Example (word order)

John eats a red apple
Rjohn  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
Reat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
Rapple -> (0, 1, 0, 0, 0, 0, 0, 0, 0, -1)

SVred = Π-2(Rjohn) + Π-1(Reat) + Π+1(Rapple) =
= (0, 0, 0, 0, 1, 0, -1, 0, 0, 0) + (0, 0, 0, -1, 0, 0, 0, 0, 0, 1) + (-1, 0, 1, 0, 0, 0, 0, 0, 0, 0) =
= (-1, 0, 1, -1, 1, 0, -1, 0, 0, 1)
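A sketch of the same computation, where rotate implements Π^n as a circular shift (right for positive n, left for negative) and each word's vector is rotated by its position relative to red:

```python
def rotate(vec, n):
    """Π^n as a circular shift: right for n > 0, left for n < 0."""
    n %= len(vec)
    return vec[-n:] + vec[:-n]

R_john  = [0, 0, 0, 0, 0, 0, 1, 0, -1, 0]
R_eat   = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
R_apple = [0, 1, 0, 0, 0, 0, 0, 0, 0, -1]

# relative positions w.r.t. "red" (ignoring the stopword "a"):
# John at -2, eats at -1, apple at +1
terms = [(R_john, -2), (R_eat, -1), (R_apple, +1)]
SV_red = [sum(col) for col in zip(*(rotate(v, n) for v, n in terms))]
print(SV_red)  # [-1, 0, 1, -1, 1, 0, -1, 0, 0, 1]
```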
Permutations (query)
• At query time, apply the inverse permutation according to the context
• Word order: <t> ? (the most similar terms occurring to the right of <t>)
– Permute the random vector of t by -1: Π-1(Rt)
• t must appear to the left
– Compute the similarity of Π-1(Rt) with all the vectors in the term space
SIMPLE DSMS AND SIMPLE OPERATORS
Simple DSMs…
Term-term co-occurrence matrix (TTM): each cell contains the co-occurrences between two terms within a prefixed distance
dog cat computer animal mouse
dog 0 4 0 2 1
cat 4 0 0 3 5
computer 0 0 0 0 3
animal 2 3 0 0 2
mouse 1 5 3 2 0
…Simple DSMs
Latent Semantic Analysis (LSA): relies on the Singular Value Decomposition (SVD) of the co-occurrence matrix
Random Indexing (RI): based on Random Projection
Latent Semantic Analysis over Random Indexing (RILSA)
Latent Semantic Analysis over Random Indexing
1. Reduce the dimension of the co-occurrence matrix using RI
2. Perform LSA over the RI space (RILSA)
– reduces LSA computation time: the RI matrix has fewer dimensions than the co-occurrence matrix
Simple operators…
Addition (+): pointwise sum of components
Multiplication (∘): pointwise multiplication of components
Addition and multiplication are commutative
– they do not take word order into account
Complex structures are represented by summing or multiplying the words that compose them
…Simple operators
Given two word vectors u and v
– composition by sum p = u + v
– composition by multiplication p = u ∘ v
Can be applied to any sequence of words
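Both operators are one-liners over plain lists; u and v here are arbitrary toy vectors chosen for illustration:

```python
def compose_sum(u, v):
    """Composition by addition: pointwise sum of components."""
    return [a + b for a, b in zip(u, v)]

def compose_mult(u, v):
    """Composition by multiplication: pointwise product of components."""
    return [a * b for a, b in zip(u, v)]

u = [1, 0, 2, -1]
v = [3, 1, 0, -1]
print(compose_sum(u, v))   # [4, 1, 2, -2]
print(compose_mult(u, v))  # [3, 0, 0, 1]
```

Swapping u and v gives the same result with either operator, which is exactly why word order is lost.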
SYNTACTIC DEPENDENCIES IN DSMS
Syntactic dependencies…
John eats a red apple.
[Dependency tree: "eats" has subject "John" and object "apple"; "red" is a modifier of "apple"]
…Syntactic dependencies
John eats a red apple.
[Same tree, with each dependency linking a HEAD to a DEPENDENT]
Representing dependencies
Use the filler/role binding approach to represent a dependency dep(u, v):
rd ⊚ u + rh ⊚ v
rd and rh are the role vectors representing, respectively, the dependent and head roles
⊚ is a placeholder for a composition operator
Representing dependencies (example)
obj(apple, eat)
rd ⊚ apple + rh ⊚ eat
Structured DSMs
1. Vector permutation in RI (PERM) to encode dependencies
2. Circular convolution (CONV) as filler/binding operator to represent syntactic dependencies in DSMs
3. LSA over PERM and CONV yields two spaces: PERMLSA and CONVLSA
Vector permutation in RI (PERM)
Using permutation of elements in context vectors to encode dependencies
– right rotation of n elements to encode dependents (permutation)
– left rotation of n elements to encode heads (inverse permutation)
PERM (method)
Create and assign a context vector to each term
Assign a rotation function Π+1 to the dependent and Π-1 to the head
Each term is represented by a vector which is
– the sum of the permuted vectors of all its dependent terms
– the sum of the inverse-permuted vectors of all its head terms
– the sum of the non-permuted vectors of both dependent and head words
PERM (example…)

John eats a red apple
John  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red   -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)

TVapple = Π+1(CVred) + Π-1(CVeat) + CVred + CVeat

[Dependency fragment: "red" is a dependent of "apple", which is a dependent of "eats"]
PERM (…example)

John eats a red apple
John  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red   -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)

TVapple = Π+1(CVred) + Π-1(CVeat) + CVred + CVeat =
= (0, 0, 0, 0, 1, 0, 0, 0, -1, 0) + (0, 0, 0, -1, 0, 0, 0, 0, 0, 1) +
+ (0, 0, 0, 1, 0, 0, 0, -1, 0, 0) + (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
(first term: right shift of CVred; second term: left shift of CVeat)
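The PERM construction above can be verified with a circular-shift helper; the vectors are the slide's:

```python
def rotate(vec, n):
    """Π^n as a circular shift: right for n > 0, left for n < 0."""
    n %= len(vec)
    return vec[-n:] + vec[:-n]

CV_eat = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
CV_red = [0, 0, 0, 1, 0, 0, 0, -1, 0, 0]

# "apple" has dependent "red" (Π+1, right shift) and head "eat" (Π-1, left shift),
# plus the unpermuted vectors of both words
parts = [rotate(CV_red, +1), rotate(CV_eat, -1), CV_red, CV_eat]
TV_apple = [sum(col) for col in zip(*parts)]
print(TV_apple)  # [1, 0, 0, 0, 0, 0, 0, -1, -1, 1]
```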
Convolution (CONV)
Create and assign a context vector to each term
Create two context vectors for the head and dependent roles
Each term is represented by a vector which is
– the sum of the convolutions between its dependent terms and the dependent role vector
– the sum of the convolutions between its head terms and the head role vector
– the sum of the vectors of both dependent and head words
Circular convolution operator

Circular convolution p = u ⊛ v is defined as:

p_j = Σ_{k=1..n} u_k · v_{((j−k) mod n)+1}

Example (n = 5):
U = <1, 1, -1, -1, 1>
V = <1, -1, 1, -1, -1>
P = U ⊛ V = <-1, 3, -1, -1, -1>

(each p_j sums one "wrapped diagonal" of the outer-product table of U and V)
Circular convolution by FFTs
Circular convolution is computed in O(n^2)
– using FFTs it is computed in O(n log n)
Given f the discrete FFT and f^-1 its inverse:
– u ⊛ v = f^-1( f(u) ∘ f(v) )
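Both the O(n^2) definition and the FFT formulation are easy to check against the U, V, P example above (using NumPy's FFT):

```python
import numpy as np

def circ_conv_direct(u, v):
    """O(n^2) definition: p_j = sum_k u_k * v_{(j-k) mod n} (0-indexed)."""
    n = len(u)
    return [sum(u[k] * v[(j - k) % n] for k in range(n)) for j in range(n)]

def circ_conv_fft(u, v):
    """O(n log n): inverse FFT of the pointwise product of the FFTs."""
    p = np.fft.ifft(np.fft.fft(u) * np.fft.fft(v))
    return np.real(p).round(6).tolist()  # imaginary parts are numerical noise

U = [1, 1, -1, -1, 1]
V = [1, -1, 1, -1, -1]
print(circ_conv_direct(U, V))  # [-1, 3, -1, -1, -1]
print(circ_conv_fft(U, V))     # same values, as floats
```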
CONV (example)

John eats a red apple
John  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red   -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)
rd    -> (0, 0, 1, 0, -1, 0, 0, 0, 0, 0)   (context vector for the dependent role)
rh    -> (0, -1, 1, 0, 0, 0, 0, 0, 0, 0)   (context vector for the head role)

apple = eat + red + (rd ⊛ red) + (rh ⊛ eat)
Complex operators
Based on filler/role binding, taking the syntactic role into account: rd ⊚ u + rh ⊚ v
– u and v can themselves be recursive structures
Two vector operators to bind the role:
– convolution (⊛)
– tensor product (⊗)
– convolution (⊛+): also exploits the sum of the term vectors
rd ⊛ u + rh ⊛ v + v + u
Complex operators (remarks)
Existing operators
– t1 ⊚ t2 ⊚ … ⊚ tn: does not take the syntactic role into account
– t1 ⊛ t2 is commutative
– t1 ⊗ t2 ⊗ … ⊗ tn: the tensor order depends on the phrase length
• two phrases of different length are not comparable
– t1 ⊗ r1 ⊗ t2 ⊗ r2 ⊗ … ⊗ tn ⊗ rn: also depends on the sentence length
System setup
• Corpus
– WaCkypedia EN based on a 2009 dump of Wikipedia
– about 800 million tokens
– dependency parse by MaltParser
• DSMs
– 500-dimensional vectors (LSA/RI/RILSA)
– 1,000-dimensional vectors (PERM/CONV/PERMLSA/CONVLSA)
– 50,000 most frequent words
– co-occurrence distance: 4
Evaluation
• GEMS 2011 Shared Task for compositional semantics
– list of pairs of word combinations, e.g.:
(support offer) (help provide) 7
(old person) (right hand) 1
• rated by humans
• 5,833 ratings
• 3 types involved: noun-noun (NN), adjective-noun (AN), verb-object (VO)
– GOAL: compare the system performance against human scores
• Spearman correlation
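Spearman correlation compares the ranking induced by the system's scores with the ranking induced by the human ratings. A minimal tie-free implementation (the scores and ratings below are invented for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation; this sketch assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical system scores vs. human ratings for a few pairs
system = [0.9, 0.1, 0.5, 0.7]
human  = [7, 1, 3, 5]
print(spearman(system, human))  # 1.0: identical ranking
```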
Results (simple spaces)…

        |        NN        |        AN        |        VO
        | TTM LSA RI RILSA | TTM LSA RI RILSA | TTM LSA RI RILSA
+       | .21 .36 .25 .42  | .22 .35 .33 .41  | .23 .31 .28 .31
∘       | .31 .15 .23 .22  | .21 .20 .22 .18  | .13 .10 .18 .21
⊛       | .21 .38 .26 .35  | .20 .33 .31 .44  | .15 .31 .24 .34
⊛+      | .21 .34 .28 .43  | .23 .32 .31 .37  | .20 .31 .25 .29
⊗       | .21 .38 .25 .39  | .22 .38 .33 .43  | .15 .34 .26 .32
human   |       .49        |       .52        |       .55

Simple Semantic Spaces
…Results (structured spaces)

        |            NN             |            AN             |            VO
        | CONV PERM CONVLSA PERMLSA | CONV PERM CONVLSA PERMLSA | CONV PERM CONVLSA PERMLSA
+       | .36  .39  .43     .42     | .34  .39  .42     .45     | .27  .23  .30     .31
∘       | .22  .17  .10     .13     | .23  .27  .13     .15     | .20  .15  .06     .14
⊛       | .31  .36  .37     .35     | .39  .39  .45     .44     | .28  .23  .27     .28
⊛+      | .30  .36  .40     .36     | .38  .32  .48     .44     | .27  .22  .30     .32
⊗       | .34  .37  .37     .40     | .36  .40  .45     .45     | .27  .24  .31     .32
human   |           .49             |           .52             |           .55

Structured Semantic Spaces
Final remarks
• The best results are obtained when complex operators, complex spaces, or both are involved
• No single best operator/space combination exists
– it depends on the type of relation (NN, AN, VO)
• Tensor product and convolution provide good results, in spite of previous findings
– filler/role binding is effective
• Future work
– generate several rd and rh vectors for each kind of dependency
– apply this approach to other directed graph-based representations
Thank you for your attention!
Questions? <[email protected]>