text comparison using soft cardinality
DESCRIPTION
Classical set theory provides a method for comparing objects using cardinality and intersection, in combination with well-known resemblance coefficients such as Dice, Jaccard, and cosine. However, set operations are intrinsically crisp: they disregard similarities between elements. We propose a new general-purpose method for object comparison using a soft cardinality function that assesses set cardinality via an auxiliary affinity (similarity) measure. Our experiments with 12 text matching datasets show that the soft cardinality method is superior to known approximate string comparison methods on the text comparison task.
TRANSCRIPT
Text Comparison Using Soft Cardinality
Sergio Jimenez¹
Fabio González¹
Alexander Gelbukh²
¹ Universidad Nacional de Colombia, Bogota
² Centro de Investigaciones en Computación (CIC), IPN, Mexico
SPIRE’10
Can you make a better comparison between two “bags” by removing redundancy?
Classic vs. Soft Cardinality
Classic set cardinality
|S|=4
Soft cardinality
|S|α=2.??
Repeated elements are counted only once
Similar elements count less for soft cardinality
Soft Cardinality Definition
1. Consider each element as a subset.
   S={s1, s2, …, sn}
   S’={{s1}, {s2}, …, {sn}}
2. Set the cardinality of each subset to be equal to 1.
   |{si}|=1
3. Consider the similarity between pairs of elements in S as the intersection between their corresponding subsets.
   |{si}∩{sj}|=α(si,sj)
4. Let |S|α be the soft cardinality of S, defined as:
   |S|α=|{s1}∪{s2}∪ … ∪{sn}|
Soft Cardinality
[Figure: S and S’, with labels area=1, α(·,·), and |S|α]
Cardinality of the union of sets
$|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2|$
$|A_1 \cup A_2 \cup A_3| = |A_1| + |A_2| + |A_3| - |A_1 \cap A_2| - |A_2 \cap A_3| - |A_1 \cap A_3| + |A_1 \cap A_2 \cap A_3|$
1st problem: binary similarity measures α(*,*) are common, but n-ary similarity functions are not.
2nd problem: the number of terms for n sets is 2^n − 1.
3rd problem: if pair-wise intersections are large (close to 1), n-wise intersections are also large, so they cannot be ignored.
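The union cardinalities on this slide are instances of the general inclusion-exclusion principle. The slide does not show the general form; it is standard set theory, written here for n sets. The sum ranges over every non-empty subset J of the indices, which is where the 2^n − 1 terms of the 2nd problem come from, and every term with |J| > 2 needs an n-ary intersection, which is the 1st problem.

```latex
\left| \bigcup_{i=1}^{n} A_i \right|
  = \sum_{\emptyset \neq J \subseteq \{1,\dots,n\}}
    (-1)^{|J|+1} \left| \bigcap_{j \in J} A_j \right|
```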
An Approximation to Soft Cardinality …
Affinity (similarity) function α(·,·) over the elements of S: 0.70, 0.20, 0.15
Affinity Matrix
1.00 0.70 0.15
0.70 1.00 0.20
0.15 0.20 1.00
1.0<|S|α<3.0
… An Approximation to Soft Cardinality
S={s1, s2, …, sn}
1.00 0.70 0.15
0.70 1.00 0.20
0.15 0.20 1.00
|S|α=1.81
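The approximation formula itself did not survive in this transcript. The value |S|α=1.81 is, however, consistent with summing over the elements the reciprocal of each row sum of the affinity matrix. The sketch below assumes that row-sum form; the function name `soft_cardinality` is illustrative and not taken from the slides.

```python
def soft_cardinality(affinity):
    """Approximate soft cardinality as the sum over elements of 1 / (row sum).

    `affinity` is a square matrix (list of rows) with 1.0 on the diagonal.
    Assumption: this row-sum form is inferred from the numbers on the slide.
    """
    return sum(1.0 / sum(row) for row in affinity)


# Affinity matrix from the slide.
affinity = [
    [1.00, 0.70, 0.15],
    [0.70, 1.00, 0.20],
    [0.15, 0.20, 1.00],
]

print(round(soft_cardinality(affinity), 2))  # 1.81
```

Under this form, three mutually dissimilar elements (identity matrix) give 3 and three identical elements (all-ones matrix) give 1, which matches the range 1.0<|S|α<3.0 shown on the previous slide.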
Weighted Soft Cardinality
Elements of S: 500kg, 0.1kg, 0.001kg
Pairwise affinities: 0.007, 0.001, 0.001
Affinity
1.000 0.007 0.001
0.007 1.000 0.001
0.001 0.001 1.000
Importance (wi)
500
0.01
0.001
Non-weighted: |S|α=2.98 (elements)
Weighted: |S|α=496 (kg)
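Both results on this slide can be reproduced by letting each element contribute its importance weight divided by its row sum in the affinity matrix. This is a hedged sketch: the weighted formula is inferred from the slide's numbers, and the name `weighted_soft_cardinality` is hypothetical. The slide lists the second weight as both 0.1 and 0.01; the code uses the importance vector as listed, and either value rounds to 496.

```python
def weighted_soft_cardinality(affinity, weights):
    """Sum over elements of w_i / (row sum of affinities for element i).

    Assumption: this weighted form is inferred from the slide's results.
    """
    return sum(w / sum(row) for w, row in zip(weights, affinity))


# Affinity matrix and importance weights from the slide.
affinity = [
    [1.000, 0.007, 0.001],
    [0.007, 1.000, 0.001],
    [0.001, 0.001, 1.000],
]
weights = [500, 0.01, 0.001]

# Unit weights recover the non-weighted soft cardinality.
unweighted = weighted_soft_cardinality(affinity, [1, 1, 1])
weighted = weighted_soft_cardinality(affinity, weights)

print(round(unweighted, 2))  # 2.98
print(round(weighted))       # 496
```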
Soft Cardinality for Text Comparison
S1={“Sergio”, “Jimenes”, “Vargaz”}
S2={“Cergio”, “Gimenez”, “Vargas”}

         Sergio  Jimenes  Vargaz
Sergio   1.000   0.000    0.333
Jimenes  0.000   1.000    0.000
Vargaz   0.333   0.000    1.000

         Cergio  Gimenez  Vargas
Cergio   1.000   0.000    0.333
Gimenez  0.000   1.000    0.000
Vargas   0.333   0.000    1.000

S1∪S2={“Sergio”, “Jimenes”, “Vargaz”, “Cergio”, “Gimenez”, “Vargas”}

         Sergio  Jimenes  Vargaz  Cergio  Gimenez  Vargas
Sergio   1.000   0.000    0.333   0.833   0.000    0.333
Jimenes  0.000   1.000    0.000   0.000   0.714    0.143
Vargaz   0.333   0.000    1.000   0.333   0.143    0.833
Cergio   0.833   0.000    0.333   1.000   0.000    0.333
Gimenez  0.000   0.714    0.143   0.000   1.000    0.000
Vargas   0.333   0.143    0.833   0.333   0.000    1.000

|S1|α=2.50
|S2|α=2.50
|S1∪S2|α=2.63
|S1∩S2|α=|S1|α+|S2|α−|S1∪S2|α
|S1∩S2|α=2.37
$\mathrm{Jaccard}(S_1, S_2) = \frac{|S_1 \cap S_2|_\alpha}{|S_1 \cup S_2|_\alpha} = 0.90$
α(x,y) is a normalized edit distance converted to similarity.
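The numbers on this slide can be reproduced end to end with the sketch below. It rests on two assumptions that the slide does not state explicitly: α(x,y) is taken as 1 minus the Levenshtein distance divided by the length of the longer word (this matches the matrix entries, for example 0.833 for Sergio/Cergio), and soft cardinality uses the row-sum approximation. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def alpha(x, y):
    """Assumed affinity: 1 - edit distance / length of the longer word."""
    return 1.0 - edit_distance(x, y) / max(len(x), len(y))


def soft_cardinality(words):
    """Row-sum approximation: each word adds 1 / (sum of its affinities)."""
    return sum(1.0 / sum(alpha(x, y) for y in words) for x in words)


s1 = ["Sergio", "Jimenes", "Vargaz"]
s2 = ["Cergio", "Gimenez", "Vargas"]

card1, card2 = soft_cardinality(s1), soft_cardinality(s2)
union = soft_cardinality(s1 + s2)
intersection = card1 + card2 - union  # |S1 ∩ S2| = |S1| + |S2| - |S1 ∪ S2|
jaccard = intersection / union

print(f"{card1:.2f} {card2:.2f} {union:.2f} {intersection:.2f} {jaccard:.2f}")
# 2.50 2.50 2.63 2.37 0.90
```

A crisp Jaccard coefficient on the same two sets would be 0, since no word is spelled identically in both.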
The Name Matching Problem
[Figure: Relation #1 and Relation #2, with 5 matches]
12 Name Matching Datasets
IAP (interpolated average precision) results
• State of the art: SoftTF-IDF [Cohen et al. 2003]. Adaptive approach: Similarity(A,B,corpus)
• Soft Cardinality. Static approach: Similarity(A,B)
• Soft Cardinality, weighted with IDF. Adaptive approach: Similarity(A,B,corpus)
Conclusions
• Soft Cardinality provides a “nice” method for comparing bags of words that takes into account both similarity among terms and term weighting.
• Experimental evidence shows that the Soft Cardinality method is comparable to (slightly better than) soft versions of the word-space model.
Questions?