"ssc" - geometria e semantica del linguaggio
TRANSCRIPT
Meaning in the real world
Extensional representation of meaning!
automobile
Meaning in the real world
Intensional representation of meaning!
automòbile adj. and fem. noun [from Fr. automobile ‹otomobìl›, compound of auto-1 and the Lat. adj. mobĭlis "that moves"]. – 1. adj. That moves by itself, chiefly of vehicles that move on the ground (or also in water, e.g. self-propelled craft such as torpedoes and missiles) by means of their own engine. 2. fem. noun. Four-wheeled motor vehicle, generally with an internal-combustion engine, used to carry a limited number of people on ordinary roads (called, in the...
AUTOMOBILE
automobile
Meaning in our mind
The representation of the concept AUTOMOBILE in our mind
Connectionist (neural networks)
Symbolic (logical formalism)
Meaning in text
Distributional semantics: what is an automobile?
AUTOMOBILE is the set of linguistic contexts in which the word automobile occurs
A bottle of Tezguno is on the table.
Everyone likes Tezguno.
Tezguno makes you drunk.
We make Tezguno out of corn.
What’s Tezguno?
Distributional semantic models
"You shall know a word by the company it keeps!" (Firth)
The meaning of a word is determined by its usage
Distributional semantic models
• Computational models that build semantic representations of words by analyzing corpora
– Words are represented as vectors
– Vectors are built by statistically analyzing the linguistic contexts in which words occur
Distributional vector
1. count how many times a word occurs in a given context
2. build a vector from the occurrence counts computed in step 1
SIMILAR WORDS WILL HAVE SIMILAR VECTORS
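The two steps above can be sketched in a few lines of Python; the toy corpus, the context words, and the window size here are invented for illustration only:

```python
from collections import Counter

def cooccurrence_vector(target, corpus_sentences, context_words, window=2):
    """Step 1: count how often `target` occurs near each context word;
    step 2: lay the counts out as a vector over a fixed context order."""
    counts = Counter()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in context_words:
                    counts[tokens[j]] += 1
    return [counts[c] for c in context_words]

contexts = ["abbaia", "mangia", "dorme"]
corpus = ["il cane abbaia e mangia", "il cane dorme", "il gatto dorme e mangia"]
print(cooccurrence_vector("cane", corpus, contexts, window=3))   # [1, 1, 1]
print(cooccurrence_vector("gatto", corpus, contexts, window=3))  # [0, 1, 1]
```

Note how "cane" and "gatto" already get similar count vectors because they occur with the same contexts.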
Matrix = geometric space
• Matrix: words × contexts
C1 C2 C3 C4 C5 C6 C7
cane 5 0 11 2 2 9 1
gatto 4 1 7 1 1 7 2
pane 0 12 0 0 9 1 9
pasta 0 8 1 2 14 0 10
carne 0 7 1 1 11 1 8
topo 4 0 8 0 1 8 1
Matrix = geometric space
[Plot of the word vectors on the C3/C5 plane: cane, gatto and topo cluster together; pasta lies far from them]
Similarity -> closeness in a multi-dimensional space (cosine similarity)
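The claim can be checked directly on the matrix above: cosine similarity between the count vectors puts cane close to gatto and far from pane. A minimal sketch in plain Python:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# rows of the word-by-context matrix above (contexts C1..C7)
vectors = {
    "cane":  [5, 0, 11, 2, 2, 9, 1],
    "gatto": [4, 1, 7, 1, 1, 7, 2],
    "pane":  [0, 12, 0, 0, 9, 1, 9],
    "pasta": [0, 8, 1, 2, 14, 0, 10],
}
print(cosine(vectors["cane"], vectors["gatto"]))  # high: shared contexts
print(cosine(vectors["cane"], vectors["pane"]))   # low: disjoint contexts
```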
Generalization
• A distributional model can be defined by a tuple <T, C, R, W, M, d, S>
– T: target elements -> the words (generally)
– C: the contexts
– R: the relation linking T to C
– W: the weighting scheme
– M: the geometric space T×C
– d: a reduction function of the space M -> M'
– S: a similarity function in M'
Building a semantic space
1. Pre-process the corpus
2. Identify words and contexts
3. Count word/context co-occurrences
4. Weighting (optional, but recommended)
5. Build the T×C matrix
6. Reduce the matrix (optional)
7. Compute the similarity between vectors
The parameters
• The definition of context
– A window of size n, sentence, paragraph, document, a particular syntactic context
• Weighting scheme
• Similarity function
An example
• Term-term matrix
– T: words
– C: words
– R: T occurs "near" C
– W: number of times T and C co-occur
– M: term/term matrix
– d: none, or e.g. SVD (Latent Semantic Analysis)
– S: cosine similarity
1. Pre-processing
• Tokenization is necessary!
– PoS tagging
– Lemmatization
– Parsing
• Too deep an analysis
– introduces errors
– requires additional parameters
– depends on the language
• Pre-processing affects the choice of words and contexts
2. Defining the context
• The document
– the whole document
– paragraph, sentence, portion of text (passage)
• The other words
– Usually the n most frequent are chosen
– Where?
• at a distance fixed a priori (window)
• syntactic dependency
• pattern
3. Weighting
• Frequency, or log(frequency) to damp contexts that occur very often
• Idea: co-occurrence with a rarer context signals a stronger relation
– Mutual Information, Log-Likelihood Ratio
• Information Retrieval: tf-idf, word entropy, …
Pointwise Mutual Information

MI(w1, w2) = log2( P(w1, w2) / (P(w1) · P(w2)) )

where P(wi) = freq(wi) / N and P(w1, w2) = freq(w1, w2) / N

Local MI: LMI(w1, w2) = freq(w1, w2) · MI(w1, w2)
Local Mutual Information

Example (corpus of N = 10^6 tokens; here the joint probability, i.e. the frequency rescaled by 1/N, is used as the weight):
P(bere) = 100/10^6
P(birra) = 25/10^6
P(acqua) = 150/10^6
P(bere, acqua) = 60/10^6
P(bere, birra) = 20/10^6
MI(bere, birra) = log2(8000) ≈ 12.97
MI(bere, acqua) = log2(4000) ≈ 11.97
LMI(bere, birra) = (20/10^6) · 12.97 ≈ 0.00026
LMI(bere, acqua) = (60/10^6) · 11.97 ≈ 0.00072
MI tends to give a lot of weight to rare events; LMI reverses the ranking here, preferring the more frequent pair (bere, acqua).
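A quick way to reproduce the slide's numbers (as noted above, the example weights MI by the joint probability P(w1, w2) rather than the raw frequency, which only rescales by 1/N):

```python
import math

N = 10**6  # corpus size from the slide example

freq = {"bere": 100, "birra": 25, "acqua": 150}
cofreq = {("bere", "birra"): 20, ("bere", "acqua"): 60}

def mi(w1, w2):
    """Pointwise Mutual Information: log2 of observed vs. expected co-occurrence."""
    p_joint = cofreq[(w1, w2)] / N
    return math.log2(p_joint / ((freq[w1] / N) * (freq[w2] / N)))

def lmi(w1, w2):
    """Local MI: MI weighted by the joint probability, damping rare events."""
    return (cofreq[(w1, w2)] / N) * mi(w1, w2)

print(mi("bere", "birra"))   # ≈ 12.97
print(mi("bere", "acqua"))   # ≈ 11.97
print(lmi("bere", "birra"))  # ≈ 0.00026
print(lmi("bere", "acqua"))  # ≈ 0.00072
```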
5. Reducing the matrix
• M (T×C) is a high-dimensional matrix; it can be useful to reduce it:
1. identify the latent dimensions: LSI, PCA
2. reduce the space: approximations of M, e.g. Random Indexing
Random Indexing
The method
• Assign a "random vector" to each context: the random/context vector
• The semantic vector associated with each target (e.g. a word) is the sum of the context vectors of all the contexts in which the target (word) occurs
Context vector
• sparse
• high-dimensional
• ternary, with values in {-1, 0, +1}
• a small number of non-zero elements, randomly distributed
0 0 0 0 0 0 0 -1 0 0 0 0 1 0 0 -1 0 1 0 0 0 0 1 0 0 0 0 -1
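A context vector like the one above can be generated as follows; the dimension, the number of non-zero entries, and the seed are arbitrary choices made for illustration:

```python
import random

def random_index_vector(dim=1000, nonzero=10, seed=None):
    """Sparse ternary context vector: mostly zeros, with a few
    randomly placed +1/-1 entries (half of each sign)."""
    rng = random.Random(seed)
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzero)  # distinct random positions
    for k, pos in enumerate(positions):
        vec[pos] = 1 if k < nonzero // 2 else -1
    return vec

v = random_index_vector(dim=28, nonzero=6, seed=42)
print(v)
```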
Random Indexing (formal)

B(n×k) = A(n×m) · R(m×k), with k << m

B preserves the distances between points up to a constant factor (Johnson-Lindenstrauss lemma): d_B ≈ c · d_A
Example

John eats a red apple
Rjohn -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
Reat  -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
Rred  -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)

SVapple = Rjohn + Reat + Rred = (1, 0, 0, 1, -1, 0, 1, -1, -1, 0)
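The example is easy to verify in code; the vector names follow the slide:

```python
def add_vectors(u, v):
    """Pointwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

R_john = [0, 0, 0, 0, 0, 0, 1, 0, -1, 0]
R_eat  = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
R_red  = [0, 0, 0, 1, 0, 0, 0, -1, 0, 0]

# semantic vector of "apple": sum of the context vectors it occurs with
SV_apple = add_vectors(add_vectors(R_john, R_eat), R_red)
print(SV_apple)  # [1, 0, 0, 1, -1, 0, 1, -1, -1, 0]
```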
Random Indexing
• Advantages
– Simple and fast
– Scalable and parallelizable
– Incremental
• Disadvantages
– Requires a lot of memory
Permutations
• Use different permutations of the elements of the random vector to encode different contexts
– word order
– (syntactic/relational) dependency between terms
• Before being summed, the random vector is permuted according to the context being encoded
Example (word order)

John eats a red apple
Rjohn  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
Reat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
Rapple -> (0, 1, 0, 0, 0, 0, 0, 0, 0, -1)

SVred = Π-2(Rjohn) + Π-1(Reat) + Π+1(Rapple) =
= (0, 0, 0, 0, 1, 0, -1, 0, 0, 0) + (0, 0, 0, -1, 0, 0, 0, 0, 0, 1) + (-1, 0, 1, 0, 0, 0, 0, 0, 0, 0) =
= (-1, 0, 1, -1, 1, 0, -1, 0, 0, 1)
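A sketch of the same computation, where rotate implements Π^n as a circular shift (right for positive n, left for negative) and each word's vector is rotated by its position relative to red:

```python
def rotate(vec, n):
    """Π^n as a circular shift: right for n > 0, left for n < 0."""
    n %= len(vec)
    return vec[-n:] + vec[:-n]

R_john  = [0, 0, 0, 0, 0, 0, 1, 0, -1, 0]
R_eat   = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
R_apple = [0, 1, 0, 0, 0, 0, 0, 0, 0, -1]

# relative positions w.r.t. "red" (ignoring the stopword "a"):
# John at -2, eats at -1, apple at +1
terms = [(R_john, -2), (R_eat, -1), (R_apple, +1)]
SV_red = [sum(col) for col in zip(*(rotate(v, n) for v, n in terms))]
print(SV_red)  # [-1, 0, 1, -1, 1, 0, -1, 0, 0, 1]
```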
Permutations (query)
• At query time, apply the inverse permutation according to the context
• Word order: <t> ? (the most similar terms occurring to the right of <t>)
– Permute the random vector of t by -1: Π-1(Rt)
• t must appear to the left
– Compute the similarity of Π-1(Rt) with all the vectors in the term space
SIMPLE DSMS AND SIMPLE OPERATORS
Simple DSMs…
Term-term co-occurrence matrix (TTM): each cell contains the co-occurrences between two terms within a prefixed distance
dog cat computer animal mouse
dog 0 4 0 2 1
cat 4 0 0 3 5
computer 0 0 0 0 3
animal 2 3 0 0 2
mouse 1 5 3 2 0
…Simple DSMs
Latent Semantic Analysis (LSA): relies on the Singular Value Decomposition (SVD) of the co-occurrence matrix
Random Indexing (RI): based on Random Projection
Latent Semantic Analysis over Random Indexing (RILSA)
Latent Semantic Analysis over Random Indexing
1. Reduce the dimension of the co-occurrence matrix using RI
2. Perform LSA over the RI space (RILSA)
– reduces LSA computation time: the RI matrix has fewer dimensions than the co-occurrence matrix
Simple operators…
Addition (+): pointwise sum of components
Multiplication (∘): pointwise multiplication of components
Addition and multiplication are commutative
– they do not take word order into account
Complex structures are represented by summing or multiplying the words that compose them
…Simple operators
Given two word vectors u and v
– composition by sum p = u + v
– composition by multiplication p = u ∘ v
Can be applied to any sequence of words
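Both operators are one-liners over plain lists; u and v here are arbitrary toy vectors chosen for illustration:

```python
def compose_sum(u, v):
    """Composition by addition: pointwise sum of components."""
    return [a + b for a, b in zip(u, v)]

def compose_mult(u, v):
    """Composition by multiplication: pointwise product of components."""
    return [a * b for a, b in zip(u, v)]

u = [1, 0, 2, -1]
v = [3, 1, 0, -1]
print(compose_sum(u, v))   # [4, 1, 2, -2]
print(compose_mult(u, v))  # [3, 0, 0, 1]
```

Swapping u and v gives the same result with either operator, which is exactly why word order is lost.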
SYNTACTIC DEPENDENCIES IN DSMS
Syntactic dependencies…
John eats a red apple.
[Dependency tree: "eats" has subject "John" and object "apple"; "red" is a modifier of "apple"]
…Syntactic dependencies
John eats a red apple.
[Same tree, with each dependency linking a HEAD to a DEPENDENT]
Representing dependencies
Use the filler/role binding approach to represent a dependency dep(u, v):
rd ⊚ u + rh ⊚ v
rd and rh are the role vectors representing, respectively, the dependent and head roles
⊚ is a placeholder for a composition operator
Representing dependencies (example)
obj(apple, eat)
rd ⊚ apple + rh ⊚ eat
Structured DSMs
1. Vector permutation in RI (PERM) to encode dependencies
2. Circular convolution (CONV) as filler/binding operator to represent syntactic dependencies in DSMs
3. LSA over PERM and CONV yields two spaces: PERMLSA and CONVLSA
Vector permutation in RI (PERM)
Using permutation of elements in context vectors to encode dependencies
– right rotation of n elements to encode dependents (permutation)
– left rotation of n elements to encode heads (inverse permutation)
PERM (method)
Create and assign a context vector to each term
Assign a rotation function Π+1 to the dependent and Π-1 to the head
Each term is represented by a vector which is
– the sum of the permuted vectors of all its dependent terms
– the sum of the inverse-permuted vectors of all its head terms
– the sum of the non-permuted vectors of both dependent and head words
PERM (example…)

John eats a red apple
John  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red   -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)

TVapple = Π+1(CVred) + Π-1(CVeat) + CVred + CVeat

[Dependency fragment: "red" is a dependent of "apple", which is a dependent of "eats"]
PERM (…example)

John eats a red apple
John  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red   -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)

TVapple = Π+1(CVred) + Π-1(CVeat) + CVred + CVeat =
= (0, 0, 0, 0, 1, 0, 0, 0, -1, 0) + (0, 0, 0, -1, 0, 0, 0, 0, 0, 1) +
+ (0, 0, 0, 1, 0, 0, 0, -1, 0, 0) + (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
(first term: right shift of CVred; second term: left shift of CVeat)
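The PERM construction above can be verified with a circular-shift helper; the vectors are the slide's:

```python
def rotate(vec, n):
    """Π^n as a circular shift: right for n > 0, left for n < 0."""
    n %= len(vec)
    return vec[-n:] + vec[:-n]

CV_eat = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
CV_red = [0, 0, 0, 1, 0, 0, 0, -1, 0, 0]

# "apple" has dependent "red" (Π+1, right shift) and head "eat" (Π-1, left shift),
# plus the unpermuted vectors of both words
parts = [rotate(CV_red, +1), rotate(CV_eat, -1), CV_red, CV_eat]
TV_apple = [sum(col) for col in zip(*parts)]
print(TV_apple)  # [1, 0, 0, 0, 0, 0, 0, -1, -1, 1]
```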
Convolution (CONV)
Create and assign a context vector to each term
Create two context vectors for the head and dependent roles
Each term is represented by a vector which is
– the sum of the convolutions between its dependent terms and the dependent role vector
– the sum of the convolutions between its head terms and the head role vector
– the sum of the vectors of both dependent and head words
Circular convolution operator

Circular convolution p = u ⊛ v is defined as:

p_j = Σ_{k=1..n} u_k · v_{((j−k) mod n)+1}

Example (n = 5):
U = <1, 1, -1, -1, 1>
V = <1, -1, 1, -1, -1>
P = U ⊛ V = <-1, 3, -1, -1, -1>

(each p_j sums one "wrapped diagonal" of the outer-product table of U and V)
Circular convolution by FFTs
Circular convolution is computed in O(n^2)
– using FFTs it is computed in O(n log n)
Given f the discrete FFT and f^-1 its inverse:
– u ⊛ v = f^-1( f(u) ∘ f(v) )
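Both the O(n^2) definition and the FFT formulation are easy to check against the U, V, P example above (using NumPy's FFT):

```python
import numpy as np

def circ_conv_direct(u, v):
    """O(n^2) definition: p_j = sum_k u_k * v_{(j-k) mod n} (0-indexed)."""
    n = len(u)
    return [sum(u[k] * v[(j - k) % n] for k in range(n)) for j in range(n)]

def circ_conv_fft(u, v):
    """O(n log n): inverse FFT of the pointwise product of the FFTs."""
    p = np.fft.ifft(np.fft.fft(u) * np.fft.fft(v))
    return np.real(p).round(6).tolist()  # imaginary parts are numerical noise

U = [1, 1, -1, -1, 1]
V = [1, -1, 1, -1, -1]
print(circ_conv_direct(U, V))  # [-1, 3, -1, -1, -1]
print(circ_conv_fft(U, V))     # same values, as floats
```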
CONV (example)

John eats a red apple
John  -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat   -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red   -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)
rd    -> (0, 0, 1, 0, -1, 0, 0, 0, 0, 0)   (context vector for the dependent role)
rh    -> (0, -1, 1, 0, 0, 0, 0, 0, 0, 0)   (context vector for the head role)

apple = eat + red + (rd ⊛ red) + (rh ⊛ eat)
Complex operators
Based on filler/role binding, taking the syntactic role into account: rd ⊚ u + rh ⊚ v
– u and v can themselves be recursive structures
Two vector operators to bind the role:
– convolution (⊛)
– tensor product (⊗)
– convolution (⊛+): also exploits the sum of the term vectors
rd ⊛ u + rh ⊛ v + v + u
Complex operators (remarks)
Existing operators
– t1 ⊚ t2 ⊚ … ⊚ tn: does not take the syntactic role into account
– t1 ⊛ t2 is commutative
– t1 ⊗ t2 ⊗ … ⊗ tn: the tensor order depends on the phrase length
• two phrases of different length are not comparable
– t1 ⊗ r1 ⊗ t2 ⊗ r2 ⊗ … ⊗ tn ⊗ rn: also depends on the sentence length
System setup
• Corpus
– WaCkypedia EN based on a 2009 dump of Wikipedia
– about 800 million tokens
– dependency parse by MaltParser
• DSMs
– 500-dimensional vectors (LSA/RI/RILSA)
– 1,000-dimensional vectors (PERM/CONV/PERMLSA/CONVLSA)
– 50,000 most frequent words
– co-occurrence distance: 4
Evaluation
• GEMS 2011 Shared Task for compositional semantics
– list of pairs of word combinations, e.g.:
(support offer) (help provide) 7
(old person) (right hand) 1
• rated by humans
• 5,833 ratings
• 3 types involved: noun-noun (NN), adjective-noun (AN), verb-object (VO)
– GOAL: compare the system performance against human scores
• Spearman correlation
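Spearman correlation compares the ranking induced by the system's scores with the ranking induced by the human ratings. A minimal tie-free implementation (the scores and ratings below are invented for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation; this sketch assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical system scores vs. human ratings for a few pairs
system = [0.9, 0.1, 0.5, 0.7]
human  = [7, 1, 3, 5]
print(spearman(system, human))  # 1.0: identical ranking
```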
Results (simple spaces)…

        |        NN        |        AN        |        VO
        | TTM LSA RI RILSA | TTM LSA RI RILSA | TTM LSA RI RILSA
+       | .21 .36 .25 .42  | .22 .35 .33 .41  | .23 .31 .28 .31
∘       | .31 .15 .23 .22  | .21 .20 .22 .18  | .13 .10 .18 .21
⊛       | .21 .38 .26 .35  | .20 .33 .31 .44  | .15 .31 .24 .34
⊛+      | .21 .34 .28 .43  | .23 .32 .31 .37  | .20 .31 .25 .29
⊗       | .21 .38 .25 .39  | .22 .38 .33 .43  | .15 .34 .26 .32
human   |       .49        |       .52        |       .55

Simple Semantic Spaces
…Results (structured spaces)

        |            NN             |            AN             |            VO
        | CONV PERM CONVLSA PERMLSA | CONV PERM CONVLSA PERMLSA | CONV PERM CONVLSA PERMLSA
+       | .36  .39  .43     .42     | .34  .39  .42     .45     | .27  .23  .30     .31
∘       | .22  .17  .10     .13     | .23  .27  .13     .15     | .20  .15  .06     .14
⊛       | .31  .36  .37     .35     | .39  .39  .45     .44     | .28  .23  .27     .28
⊛+      | .30  .36  .40     .36     | .38  .32  .48     .44     | .27  .22  .30     .32
⊗       | .34  .37  .37     .40     | .36  .40  .45     .45     | .27  .24  .31     .32
human   |           .49             |           .52             |           .55

Structured Semantic Spaces
Final remarks
• The best results are obtained when complex operators, complex spaces, or both are involved
• No single best operator/space combination exists
– it depends on the type of relation (NN, AN, VO)
• Tensor product and convolution provide good results, in spite of previous findings
– filler/role binding is effective
• Future work
– generate several rd and rh vectors for each kind of dependency
– apply this approach to other directed graph-based representations
Thank you for your attention!
Questions? <[email protected]>