text comparison using soft cardinality

Text Comparison Using Soft Cardinality. Sergio Jimenez 1, Fabio González 1, Alexander Gelbukh 2. 1 Universidad Nacional de Colombia – Bogota; 2 Centro de Investigaciones en Computación (CIC), IPN, Mexico. SPIRE’10

Upload: sergio-jimenez

Post on 25-Jun-2015

DESCRIPTION

Classical set theory provides a method for comparing objects using cardinality and intersection, in combination with well-known resemblance coefficients such as Dice, Jaccard, and cosine. However, set operations are intrinsically crisp: they disregard similarities between elements. We propose a new general-purpose method for object comparison using a soft cardinality function that assesses set cardinality via an auxiliary affinity (similarity) measure. Our experiments with 12 text matching datasets show that the soft cardinality method is superior to known approximate string comparison methods in the text comparison task.

TRANSCRIPT

Page 1: Text Comparison Using Soft Cardinality

Text Comparison Using Soft Cardinality

Sergio Jimenez 1

Fabio González 1

Alexander Gelbukh 2

1 Universidad Nacional de Colombia – Bogota
2 Centro de Investigaciones en Computación (CIC), IPN, Mexico

SPIRE’10

Page 2: Text Comparison Using Soft Cardinality

Can you make a better comparison between two “bags” by removing redundancy?

Page 3: Text Comparison Using Soft Cardinality

Classic vs. Soft Cardinality

[Figure: example set S]

Classic set cardinality

|S|=4

Soft cardinality

|S|α=2.??

Repeated elements are counted only once

Similar elements count less for soft cardinality

Page 4: Text Comparison Using Soft Cardinality

Soft Cardinality Definition

1. Consider each element as a subset.
   S = {s1, s2, …, sn}
   S’ = {{s1}, {s2}, …, {sn}}

2. Set the cardinality of each subset to be equal to 1.
   |{si}| = 1

3. Consider the similarity between pairs of elements in S as intersections between their corresponding subsets.
   |{si} ∩ {sj}| = α(si, sj)

4. Let |S|α be the soft cardinality of S, defined as:
   |S|α = |{s1} ∪ {s2} ∪ … ∪ {sn}|

Page 5: Text Comparison Using Soft Cardinality

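Soft Cardinality

[Figure: the elements of S drawn as overlapping regions S’, each with area = 1; the overlap between two regions is α(si, sj), and |S|α is the area of their union]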

Page 6: Text Comparison Using Soft Cardinality

Cardinality of the union of sets

$|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2|$

$|A_1 \cup A_2 \cup A_3| = |A_1| + |A_2| + |A_3| - |A_1 \cap A_2| - |A_1 \cap A_3| - |A_2 \cap A_3| + |A_1 \cap A_2 \cap A_3|$

1st problem: binary similarity measures α(*,*) are common, but n-ary similarity functions are not.

2nd problem: the number of terms for n sets is $2^n - 1$.

3rd problem: if pair-wise intersections are large (close to 1), n-wise intersections are also large, so they cannot be ignored.
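The growth in the number of terms can be checked with a short sketch. It enumerates the index subsets that each contribute one term to the inclusion-exclusion expansion, and also shows the only case that needs nothing beyond a binary α: two elements. The function names are illustrative and not part of the slides.

```python
from itertools import combinations


def inclusion_exclusion_terms(n):
    """Return every non-empty index subset of {1..n}; each one is a term."""
    indices = range(1, n + 1)
    return [c for k in range(1, n + 1) for c in combinations(indices, k)]


def soft_cardinality_pair(alpha):
    """Exact two-element case: |{s1} U {s2}| = 1 + 1 - alpha(s1, s2)."""
    return 2.0 - alpha


for n in (2, 3, 10):
    # The count always equals 2**n - 1.
    print(n, len(inclusion_exclusion_terms(n)), 2**n - 1)

print(soft_cardinality_pair(0.70))
```

For three or more elements, the expansion needs intersections of three or more subsets, which a pairwise α does not provide.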

Page 7: Text Comparison Using Soft Cardinality

An Approximation to Soft Cardinality …

[Figure: set S of three elements with pairwise affinities 0.70, 0.20 and 0.15 given by the affinity (similarity) function α(·,·)]

Affinity matrix:

1.00 0.70 0.15
0.70 1.00 0.20
0.15 0.20 1.00

1.0 < |S|α < 3.0

Page 8: Text Comparison Using Soft Cardinality

… An Approximation to Soft Cardinality

$S = \{s_1, s_2, \ldots, s_n\}$

1.00 0.70 0.15

0.70 1.00 0.20

0.15 0.20 1.00

|S|α=1.81
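The approximation formula itself did not survive in this transcript. A minimal sketch, assuming the approximation adds, for each element, the reciprocal of its row sum in the affinity matrix: this assumption reproduces the value 1.81 for the matrix above and stays within the bounds 1.0 and 3.0 (an identity matrix gives 3.0, an all-ones matrix gives 1.0). The function name is illustrative.

```python
def soft_cardinality(affinity):
    """Approximate soft cardinality from a square affinity matrix.

    Assumption (not shown in the transcript): each element contributes
    1 / (sum of its affinities to every element, itself included).
    """
    return sum(1.0 / sum(row) for row in affinity)


A = [
    [1.00, 0.70, 0.15],
    [0.70, 1.00, 0.20],
    [0.15, 0.20, 1.00],
]

print(f"{soft_cardinality(A):.2f}")  # 1.81
```

Under this assumption an element that resembles many others has a large row sum and therefore counts for less, which matches the behaviour described for soft cardinality.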

Page 9: Text Comparison Using Soft Cardinality

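Weighted Soft Cardinality

[Figure: set S of three elements labeled 500kg, 0.1kg and 0.001kg, with pairwise affinities 0.007, 0.001 and 0.001]

Affinity matrix and importance weights wi:

Affinity               Importance (wi)
1.000 0.007 0.001      500
0.007 1.000 0.001      0.01
0.001 0.001 1.000      0.001

Non-weighted: |S|α=2.98 (elements)

Weighted: |S|α=496 (kg)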
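A minimal sketch of the weighted case, assuming each element contributes its importance weight wi divided by its row sum in the affinity matrix. The formula is an assumption, not taken from the slide; it reproduces both totals shown, using the wi values as printed (500, 0.01, 0.001). The function name is illustrative.

```python
def weighted_soft_cardinality(affinity, weights):
    """Approximate weighted soft cardinality.

    Assumption (not shown in the transcript): element i contributes
    weights[i] / (sum of row i of the affinity matrix).
    """
    return sum(w / sum(row) for w, row in zip(weights, affinity))


A = [
    [1.000, 0.007, 0.001],
    [0.007, 1.000, 0.001],
    [0.001, 0.001, 1.000],
]
importance = [500, 0.01, 0.001]  # wi column as printed on the slide

print(f"{weighted_soft_cardinality(A, [1, 1, 1]):.2f}")   # 2.98 (elements)
print(f"{weighted_soft_cardinality(A, importance):.0f}")  # 496 (kg)
```

With unit weights the function counts elements; with the importance weights the total is dominated by the heaviest element.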

Page 10: Text Comparison Using Soft Cardinality

Soft Cardinality for Text Comparison

S1 = {“Sergio”, “Jimenes”, “Vargaz”}
S2 = {“Cergio”, “Gimenez”, “Vargas”}

Affinity matrix of S1:

         Sergio  Jimenes  Vargaz
Sergio   1.000   0.000    0.333
Jimenes  0.000   1.000    0.000
Vargaz   0.333   0.000    1.000

Affinity matrix of S2:

         Cergio  Gimenez  Vargas
Cergio   1.000   0.000    0.333
Gimenez  0.000   1.000    0.000
Vargas   0.333   0.000    1.000

S1 ∪ S2 = {“Sergio”, “Jimenes”, “Vargaz”, “Cergio”, “Gimenez”, “Vargas”}

Affinity matrix of S1 ∪ S2:

         Sergio  Jimenes  Vargaz  Cergio  Gimenez  Vargas
Sergio   1.000   0.000    0.333   0.833   0.000    0.333
Jimenes  0.000   1.000    0.000   0.000   0.714    0.143
Vargaz   0.333   0.000    1.000   0.333   0.143    0.833
Cergio   0.833   0.000    0.333   1.000   0.000    0.333
Gimenez  0.000   0.714    0.143   0.000   1.000    0.000
Vargas   0.333   0.143    0.833   0.333   0.000    1.000

$|S_1|_\alpha = 2.50$, $|S_2|_\alpha = 2.50$, $|S_1 \cup S_2|_\alpha = 2.63$

$|S_1 \cap S_2|_\alpha = |S_1|_\alpha + |S_2|_\alpha - |S_1 \cup S_2|_\alpha = 2.37$

$\mathrm{Jaccard}(S_1, S_2) = \frac{|S_1 \cap S_2|_\alpha}{|S_1 \cup S_2|_\alpha} = 0.90$

α(x,y) is a normalized edit distance converted to similarity.
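The whole example can be reproduced end to end with a short sketch. Two things here are assumptions rather than content of the slide: α is taken to be 1 minus the Levenshtein distance divided by the length of the longer word, and the soft cardinality is approximated by summing the reciprocals of each word's total affinity. Under those assumptions the values 2.50, 2.63 and 0.90 come out as shown. All function names are illustrative.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def affinity(x, y):
    """Assumed alpha: edit distance normalized by the longer length."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest


def soft_cardinality(words):
    """Assumed approximation: sum of reciprocal affinity row sums."""
    return sum(1.0 / sum(affinity(w, v) for v in words) for w in words)


def soft_jaccard(s1, s2):
    """Jaccard coefficient built from soft cardinalities."""
    union_words = list(dict.fromkeys(s1 + s2))  # ordered, no duplicates
    union = soft_cardinality(union_words)
    intersection = soft_cardinality(s1) + soft_cardinality(s2) - union
    return intersection / union


S1 = ["Sergio", "Jimenes", "Vargaz"]
S2 = ["Cergio", "Gimenez", "Vargas"]

print(f"{soft_cardinality(S1):.2f}")       # 2.50
print(f"{soft_cardinality(S1 + S2):.2f}")  # 2.63
print(f"{soft_jaccard(S1, S2):.2f}")       # 0.90
```

A crisp Jaccard coefficient on the same two sets would be 0, since no word is spelled identically in both names.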

Page 11: Text Comparison Using Soft Cardinality

The Name Matching Problem

[Figure: records in Relation #1 linked to records in Relation #2; 5 matches]

Page 12: Text Comparison Using Soft Cardinality

12 Name Matching Datasets

Page 13: Text Comparison Using Soft Cardinality

IAP (interpolated average precision) results

[Figure: IAP results for the three methods below]

• State of the art: SoftTF-IDF [Cohen et al. 2003]. Adaptive approach: Similarity(A,B,corpus)

• Soft Cardinality. Static approach: Similarity(A,B)

• Soft Cardinality, weighted IDF. Adaptive approach: Similarity(A,B,corpus)

Page 14: Text Comparison Using Soft Cardinality

Conclusions

• Soft Cardinality provides a “nice” method to compare bags of words, considering both similarity among terms and term weighting.

• Experimental evidence shows that the Soft Cardinality method is comparable to (slightly better than) soft versions of the word-space model.

Page 15: Text Comparison Using Soft Cardinality

Questions?