Text Comparison Using Soft Cardinality

Sergio Jimenez¹, Fabio González¹, Alexander Gelbukh²

¹ Universidad Nacional de Colombia, Bogota

² Centro de Investigaciones en Computación (CIC), IPN, Mexico

SPIRE’10

Can two “bags” be compared better once redundancy is removed?

Classic vs. Soft Cardinality

[Figure: the set S]

Classic set cardinality

|S|=4

Soft cardinality

|S|α=2.??

Repeated elements are counted only once

Similar elements count less for soft cardinality

Soft Cardinality Definition

1. Consider each element as a subset:
   S = {s1, s2, …, sn}
   S’ = {{s1}, {s2}, …, {sn}}

2. Set the cardinality of each subset equal to 1:
   |{si}| = 1

3. Consider the similarity between pairs of elements in S as the intersection between their corresponding subsets:
   |{si} ∩ {sj}| = α(si, sj)

4. Let |S|α be the soft cardinality of S, defined as:
   |S|α = |{s1} ∪ {s2} ∪ … ∪ {sn}|
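As a small worked case of the definition, the two-element set can be expanded by hand. The derivation below is only an illustration; it assumes nothing beyond the four steps above and the usual rule for the cardinality of a union of two sets.

```latex
\begin{align*}
|S|_\alpha &= |\{s_1\} \cup \{s_2\}| \\
           &= |\{s_1\}| + |\{s_2\}| - |\{s_1\} \cap \{s_2\}| \\
           &= 2 - \alpha(s_1, s_2)
\end{align*}
```

If α(s1, s2) = 1 (identical elements) the result is 1, and if α(s1, s2) = 0 (unrelated elements) it is 2, which is the classic cardinality.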

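Soft Cardinality

[Figure: the set S and its image S’; each subset in S’ has area = 1, each overlap between two subsets is α(·,·), and |S|α is the area of the union.]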

Cardinality of the union of sets

|A1 ∪ A2| = |A1| + |A2| − |A1 ∩ A2|

|A1 ∪ A2 ∪ A3| = |A1| + |A2| + |A3| − |A1 ∩ A2| − |A2 ∩ A3| − |A1 ∩ A3| + |A1 ∩ A2 ∩ A3|

1st problem: binary similarity measures α(*,*) are common, but n-ary similarity functions are not.

2nd problem: the number of terms for n sets is 2^n − 1.

3rd problem: if the pair-wise intersections are large (close to 1), the n-wise intersections are also large, so they cannot be ignored.
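To make the term count concrete, here is a minimal Python sketch of the inclusion-exclusion expansion for ordinary (crisp) sets. The function name and the example sets are hypothetical and chosen only for illustration. The sketch needs every n-wise intersection, which is exactly what is unavailable when only a binary α is given.

```python
from itertools import combinations


def union_size_inclusion_exclusion(sets):
    """Return (size of the union, number of terms) via inclusion-exclusion."""
    total = 0
    terms = 0
    for k in range(1, len(sets) + 1):
        # Intersections of an odd number of sets are added, even ones subtracted.
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
            terms += 1
    return total, terms


# Hypothetical example sets, used only to exercise the function.
A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
size, terms = union_size_inclusion_exclusion([A1, A2, A3])
print(size, terms)  # 5 7
```

For three sets the expansion has 7 = 2^3 − 1 terms; each additional set roughly doubles the count.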

An Approximation to Soft Cardinality …

[Figure: a set S of three elements with pair-wise affinities 0.70, 0.20 and 0.15.]

Affinity (similarity) function α(·,·)

Affinity matrix:

1.00 0.70 0.15
0.70 1.00 0.20
0.15 0.20 1.00

1.0 < |S|α < 3.0

… An Approximation to Soft Cardinality

S = {s1, s2, …, sn}

1.00 0.70 0.15
0.70 1.00 0.20
0.15 0.20 1.00

|S|α = 1.81
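The approximation formula itself did not survive on this slide. The sketch below therefore assumes the form |S|α ≈ Σi 1 / Σj α(si, sj)^p with p = 1; this is an assumption, adopted because it reproduces the value 1.81 from the affinity matrix shown here. The function name `soft_cardinality` and the parameter `p` are illustrative.

```python
def soft_cardinality(affinity, p=1.0):
    """Approximate soft cardinality from a square affinity matrix.

    Assumed formula (not shown on the slide): each element contributes
    1 / (sum of its affinities to every element, itself included).
    """
    return sum(1.0 / sum(a ** p for a in row) for row in affinity)


# Affinity matrix from the slide.
AFFINITY = [
    [1.00, 0.70, 0.15],
    [0.70, 1.00, 0.20],
    [0.15, 0.20, 1.00],
]

print(round(soft_cardinality(AFFINITY), 2))  # 1.81
```

Under this assumed formula, three mutually unrelated elements (identity matrix) give 3.0 and three identical elements (all-ones matrix) give 1.0, which agrees with the bounds 1.0 < |S|α < 3.0.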

Weighted Soft Cardinality

[Figure: a set S of three elements labeled 500 kg, 0.1 kg and 0.001 kg, with pair-wise affinities 0.007, 0.001 and 0.001.]

Affinity               Importance (wi)
1.000 0.007 0.001      500
0.007 1.000 0.001      0.01
0.001 0.001 1.000      0.001

Non-weighted: |S|α = 2.98 (elements)

Weighted: |S|α = 496 (kg)
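The two results on this slide can be checked with a short sketch. It assumes the weighted form |S|α ≈ Σi wi / Σj α(si, sj)^p with p = 1, which is not spelled out on the slide but reproduces both values. The weights follow the figure labels (500, 0.1, 0.001 kg); the importance column on the slide shows 0.01 for the second element, and either value rounds to 496. Function and variable names are illustrative.

```python
def weighted_soft_cardinality(affinity, weights, p=1.0):
    """Assumed weighted form: element i contributes w_i / sum_j alpha(i, j)^p."""
    return sum(
        w / sum(a ** p for a in row)
        for w, row in zip(weights, affinity)
    )


# Affinity matrix from the slide.
AFFINITY = [
    [1.000, 0.007, 0.001],
    [0.007, 1.000, 0.001],
    [0.001, 0.001, 1.000],
]
WEIGHTS_KG = [500.0, 0.1, 0.001]  # taken from the figure labels
UNIT_WEIGHTS = [1.0, 1.0, 1.0]    # non-weighted case

print(round(weighted_soft_cardinality(AFFINITY, WEIGHTS_KG)))       # 496
print(round(weighted_soft_cardinality(AFFINITY, UNIT_WEIGHTS), 2))  # 2.98
```

Because the elements are nearly unrelated, the non-weighted value stays close to the classic cardinality of 3, while the weighted value is dominated by the heaviest element.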

Soft Cardinality for Text Comparison

S1 = {“Sergio”, “Jimenes”, “Vargaz”}
S2 = {“Cergio”, “Gimenez”, “Vargas”}

Affinity matrix for S1:

          Sergio   Jimenes  Vargaz
Sergio    1.000    0.000    0.333
Jimenes   0.000    1.000    0.000
Vargaz    0.333    0.000    1.000

Affinity matrix for S2:

          Cergio   Gimenez  Vargas
Cergio    1.000    0.000    0.333
Gimenez   0.000    1.000    0.000
Vargas    0.333    0.000    1.000

S1 ∪ S2 = {“Sergio”, “Jimenes”, “Vargaz”, “Cergio”, “Gimenez”, “Vargas”}

Affinity matrix for S1 ∪ S2:

          Sergio   Jimenes  Vargaz   Cergio   Gimenez  Vargas
Sergio    1.000    0.000    0.333    0.833    0.000    0.333
Jimenes   0.000    1.000    0.000    0.000    0.714    0.143
Vargaz    0.333    0.000    1.000    0.333    0.143    0.833
Cergio    0.833    0.000    0.333    1.000    0.000    0.333
Gimenez   0.000    0.714    0.143    0.000    1.000    0.000
Vargas    0.333    0.143    0.833    0.333    0.000    1.000

|S1|α = 2.50
|S2|α = 2.50
|S1 ∪ S2|α = 2.63

|S1 ∩ S2|α = |S1|α + |S2|α − |S1 ∪ S2|α = 2.37

Jaccard(S1, S2) = |S1 ∩ S2|α / |S1 ∪ S2|α = 0.90

α(x,y) is a normalized edit distance converted to similarity.
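The whole example can be reproduced end to end with a minimal sketch. Two assumptions are made and are not stated on the slide: α(x, y) = 1 − editdistance(x, y) / max(|x|, |y|) with a case-sensitive Levenshtein distance (this matches entries such as 0.833 for Sergio/Cergio and 0.714 for Jimenes/Gimenez), and the soft cardinality approximation Σi 1 / Σj α(si, sj)^p with p = 1. All function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed alpha: edit distance normalized by the longer string, as a similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def soft_cardinality(words, p=1.0):
    """Assumed approximation: sum over words of 1 / (sum of affinities to all words)."""
    return sum(1.0 / sum(similarity(x, y) ** p for y in words) for x in words)


S1 = ["Sergio", "Jimenes", "Vargaz"]
S2 = ["Cergio", "Gimenez", "Vargas"]

c1 = soft_cardinality(S1)
c2 = soft_cardinality(S2)
union = soft_cardinality(S1 + S2)
intersection = c1 + c2 - union
jaccard = intersection / union

print(round(c1, 2), round(c2, 2), round(union, 2), round(jaccard, 2))
# 2.5 2.5 2.63 0.9
```

Note that the intersection is never computed directly; it is derived from the three cardinalities, so only the binary function α is needed.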

The Name Matching Problem

[Figure: names in Relation #1 linked to names in Relation #2; 5 matches.]

12 Name Matching Datasets

IAP (interpolated average precision) results

Methods compared:

• SoftTF-IDF [Cohen et al. 2003]: state of the art; adaptive approach, Similarity(A,B,corpus)

• Soft Cardinality: static approach, Similarity(A,B)

• Soft Cardinality weighted by IDF: adaptive approach, Similarity(A,B,corpus)

Conclusions

• Soft Cardinality provides a “nice” method for comparing bags of words that accounts for both similarity among terms and term weighting.

• Experimental evidence shows that the Soft Cardinality method performs comparably to (slightly better than) soft versions of the word-space model.

Questions?
