soft cardinality: a parameterized similarity function for text comparison
DESCRIPTION
Abstract: We present an approach for constructing text similarity functions that combines a parameterized resemblance coefficient with a softened cardinality function called soft cardinality. The approach provides a consistent, recursive model that works at varying levels of granularity, from sentences down to characters. Accordingly, we used the model to compare sentences divided into words and, in turn, words divided into character q-grams. Experimentally, we observed that the performance (correlation) function over the space defined by all parameters was relatively smooth and had a single maximum reachable by hill climbing. Our approach used only surface text information, a stop-word remover, and a stemmer to tackle the semantic textual similarity task (Task 6) at SemEval 2012. The proposed method ranked 3rd (average), 5th (normalized correlation), and 15th (aggregated correlation) among 89 systems submitted by 31 teams.
TRANSCRIPT
Soft Cardinality: A Parameterized Similarity Function for Text Comparison
Sergio Jimenez Claudia Becerra Alexander Gelbukh
Center for Computing Research, Instituto Politécnico Nacional
(National Polytechnic (Technical) Institute), Mexico
Outline
• Cardinality-based similarity functions
• What is Soft Cardinality?
• Parameterized resemblance coefficient
• Building text similarity functions
• Optimizing parameters
• Results in STS SemEval-2012
• Conclusions
Cardinality-based similarity functions
Jaccard (1905): $SIM(A,B) = \frac{|A \cap B|}{|A \cup B|}$
Dice (1945): $SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$
Only two things are needed:
1. Cardinality function (here: soft cardinality)
2. Resemblance coefficient of the form "commonalities" / "referent" (here: parameterized resemblance coefficient)
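As a minimal sketch, the Jaccard and Dice coefficients from this slide can be written with Python's built-in sets, using the crisp count `len` as the cardinality function. The function names and the choice to return 0.0 for two empty sets are assumptions of this example, not part of the slides.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    union = a | b
    # Assumption: two empty sets get similarity 0.0 to avoid dividing by zero.
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: |A ∩ B| / (0.5|A| + 0.5|B|)."""
    referent = 0.5 * len(a) + 0.5 * len(b)
    return len(a & b) / referent if referent else 0.0


a = {"soft", "cardinality", "text"}
b = {"cardinality", "text", "similarity"}
print(jaccard(a, b))  # 2 shared words out of 4 distinct words
print(dice(a, b))     # 2 shared words over an average size of 3
```

Both functions share the same numerator (the commonalities) and differ only in the referent, which is the part the later slides parameterize.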
Soft Cardinality
[Figure: two example sets, A and B, each with three elements]
Classical cardinality (crisp count): |A| = 3, |B| = 3
Soft cardinality (soft count): |A|' = 2.9, |B|' = 1.3
How to compute soft cardinality?
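Given $A = \{a_1, a_2, \ldots, a_n\}$:
$|A|' = \sum_{i=1}^{n} \frac{1}{\sum_{j=1}^{n} sim(a_i, a_j)}$
where the similarity function satisfies:
$\forall a: sim(a, a) = 1$
$\forall a, b: sim(a, b) = sim(b, a)$
$\forall a, b: sim(a, b) \in [0, 1]$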
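The computation on this slide can be sketched in a few lines of Python: each element contributes the inverse of its total similarity to all elements of the set, itself included. The helper names `crisp` and `toy_sim` are hypothetical; `toy_sim` is an illustrative similarity that satisfies the three properties listed on the slide, not a function used by the authors.

```python
from typing import Callable, Sequence


def soft_cardinality(items: Sequence[str], sim: Callable[[str, str], float]) -> float:
    """Sum over elements of 1 / (total similarity of that element to the whole set)."""
    return sum(1.0 / sum(sim(a, b) for b in items) for a in items)


def crisp(a: str, b: str) -> float:
    """Crisp similarity: 1 for identical elements, 0 otherwise."""
    return 1.0 if a == b else 0.0


def toy_sim(a: str, b: str) -> float:
    """Hypothetical similarity: identical -> 1, same first letter -> 0.5, else 0."""
    if a == b:
        return 1.0
    return 0.5 if a[0] == b[0] else 0.0


words = ["cat", "cats", "dog"]
print(soft_cardinality(words, crisp))    # 3.0, the classical count
print(soft_cardinality(words, toy_sim))  # about 2.33: "cat" and "cats" partly merge
```

With a crisp similarity the function reduces to the classical count; as elements become more similar to each other, the soft count shrinks toward 1.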
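Example:
$|A|' = 2.67$, $|B|' = 3.71$, $|A \cup B|' = 3.92$
$|A \cap B|' = |A|' + |B|' - |A \cup B|' = 2.46$
Soft cardinality: $\frac{|A \cap B|'}{|A \cup B|'} = 0.63$
Classical cardinality: $\frac{|A \cap B|}{|A \cup B|} = 0.40$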
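The arithmetic of this example can be checked directly. The sketch below only reuses the three soft cardinalities shown on the slide (2.67, 3.71 and 3.92); the function name is hypothetical.

```python
def soft_intersection(card_a: float, card_b: float, card_union: float) -> float:
    """|A ∩ B|' = |A|' + |B|' - |A ∪ B|' (soft cardinality of the intersection)."""
    return card_a + card_b - card_union


card_a, card_b, card_union = 2.67, 3.71, 3.92  # values from the slide
card_inter = soft_intersection(card_a, card_b, card_union)
print(round(card_inter, 2))               # 2.46
print(round(card_inter / card_union, 2))  # 0.63, the soft Jaccard value
```

The intersection is never computed directly: it is inferred from the soft cardinalities of A, B and their union.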
An extended soft cardinality model
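$|A|' = \sum_{i=1}^{n} \frac{w_{a_i}}{\sum_{j=1}^{n} sim(a_i, a_j)^p}$
$w_{a_i}$: weights for the elements (words), e.g. tf-idf
$p$: controls the "softness" of the soft cardinality; for large $p$, the terms with $sim(a_i, a_j) < 1$ vanish and $|A|'$ approaches $\sum_i w_{a_i}$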
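A minimal sketch of the extended model, assuming the weights sit in the numerator and the exponent p applies to each pairwise similarity. The similarity function, the example weights and the restriction to p > 0 are assumptions of this illustration.

```python
from typing import Callable, Optional, Sequence


def soft_cardinality(
    items: Sequence[str],
    sim: Callable[[str, str], float],
    weights: Optional[Sequence[float]] = None,
    p: float = 1.0,
) -> float:
    """Weighted soft cardinality: sum_i w_i / sum_j sim(a_i, a_j) ** p (p > 0 assumed)."""
    if weights is None:
        weights = [1.0] * len(items)  # unweighted case
    return sum(
        w / sum(sim(a, b) ** p for b in items)
        for a, w in zip(items, weights)
    )


def toy_sim(a: str, b: str) -> float:
    """Hypothetical similarity: identical -> 1, same first letter -> 0.5, else 0."""
    if a == b:
        return 1.0
    return 0.5 if a[0] == b[0] else 0.0


words = ["cat", "cats", "dog"]
weights = [1.0, 1.0, 2.0]  # hypothetical idf-like weights
print(soft_cardinality(words, toy_sim))                 # p = 1, no weights
print(soft_cardinality(words, toy_sim, p=2.0))          # larger p: a "harder" count
print(soft_cardinality(words, toy_sim, weights, p=50))  # close to sum(weights)
```

Raising p pushes partial similarities toward zero, so the result moves from the soft count toward the plain sum of the weights.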
Parameterized resemblance coefficient
Dice: $SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$
The referent is a balance between the "sizes" of A and B.
Tversky (1977):
• "the son resembles the father", not "the father resembles the son"
• "an ellipse is like a circle", not "a circle is like an ellipse"
• "North Korea is like Red China", not "Red China is like North Korea"
In general, "the variant is more similar to the prototype":
$SIM(A,B) = \frac{|A \cap B|}{|A|}$
Overlap coefficient: $SIM(A,B) = \frac{|A \cap B|}{\min(|A|, |B|)}$
Parameterized resemblance coefficient
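$SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$
$SIM(A,B) = \frac{|A \cap B|}{0.5\max(|A|,|B|) + 0.5\min(|A|,|B|)}$
$SIM(A,B) = \frac{|A \cap B|}{\alpha\max(|A|,|B|) + (1-\alpha)\min(|A|,|B|)}$
α = 0: overlap coefficient (Tversky)
α = 0.5: Dice coefficient
α = 1: $SIM(A,B) = \frac{|A \cap B|}{\max(|A|,|B|)}$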
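The effect of α can be illustrated with a short sketch. It assumes the referent is α·max(|A|,|B|) + (1−α)·min(|A|,|B|), which is how the scale on this slide (overlap at 0, Dice at 0.5) is read here; the set sizes are hypothetical.

```python
def resemblance(card_a: float, card_b: float, card_inter: float, alpha: float) -> float:
    """|A ∩ B| / (alpha * max(|A|, |B|) + (1 - alpha) * min(|A|, |B|))."""
    # Assumption: alpha weights the larger set, (1 - alpha) the smaller one.
    referent = alpha * max(card_a, card_b) + (1 - alpha) * min(card_a, card_b)
    return card_inter / referent


# Hypothetical sizes: |A| = 3, |B| = 5, |A ∩ B| = 2
for alpha in (0.0, 0.5, 1.0):
    print(alpha, resemblance(3, 5, 2, alpha))
```

Under this reading, α = 0 divides by the smaller set (overlap coefficient), α = 0.5 by the average size (Dice), and α = 1 by the larger set.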
Parameterized resemblance coefficient
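$SIM(A,B) = \frac{|A \cap B| + bias}{\alpha\max(|A|,|B|) + (1-\alpha)\min(|A|,|B|)}$
SIM(A,B) at α = 0.5 for three values of bias:
      |A|   |B|   |A∩B|   bias = 0.0   bias = 0.2   bias = -0.2
#1    3     5     2       0.5          0.550        0.450
#2    6     10    4       0.5          0.525        0.475
#3    9     15    6       0.5          0.517        0.483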
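The three rows of this table can be reproduced with a small sketch. It assumes the bias is added to the numerator and that α = 0.5, the value at which the printed numbers match; the function name is hypothetical.

```python
def resemblance(card_a, card_b, card_inter, alpha=0.5, bias=0.0):
    """(|A ∩ B| + bias) / (alpha * max(|A|, |B|) + (1 - alpha) * min(|A|, |B|))."""
    referent = alpha * max(card_a, card_b) + (1 - alpha) * min(card_a, card_b)
    return (card_inter + bias) / referent


rows = [(3, 5, 2), (6, 10, 4), (9, 15, 6)]  # (|A|, |B|, |A ∩ B|) from the slide
table = [
    [round(resemblance(a, b, inter, bias=bias), 3) for bias in (0.0, 0.2, -0.2)]
    for a, b, inter in rows
]
for row in table:
    print(row)
```

All three pairs have the same proportions, so the unbiased score is 0.5 in every row; the bias shifts the score more for small sets than for large ones.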
Building text similarity functions
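$SIM(A,B) = \frac{|A \cap B|' + bias}{\alpha\max(|A|',|B|') + (1-\alpha)\min(|A|',|B|')}$
Compares two texts as sets of words.
Soft cardinality:
$|A|' = \sum_{i=1}^{n} \frac{w_{a_i}}{\sum_{j=1}^{n} sim(a_i, a_j)^p}$
$sim(a_i, b_j) = \frac{|a_i \cap b_j| + bias_{sim}}{\alpha_{sim}\max(|a_i|,|b_j|) + (1-\alpha_{sim})\min(|a_i|,|b_j|)}$
Compares two words as sets of q-grams.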
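The two levels on this slide can be put together in a rough sketch: words are compared as sets of character q-grams with the crisp count, and texts are compared as sets of words with the soft count. Everything below is an assumption made for illustration: the q-gram range 2 to 3, lowercase whitespace tokenization, unit weights, no stemming or stop-word removal, the reading of the referent as α·max + (1−α)·min, and the default parameters (α = 0.5, bias = 0, p = 1), which keep word similarities inside [0, 1].

```python
def qgrams(word, q_min=2, q_max=3):
    """Set of character q-grams of the word, for q in [q_min, q_max] (no padding)."""
    return {word[i:i + q] for q in range(q_min, q_max + 1)
            for i in range(len(word) - q + 1)}


def coefficient(card_a, card_b, card_inter, alpha=0.5, bias=0.0):
    """Parameterized resemblance coefficient on precomputed cardinalities."""
    referent = alpha * max(card_a, card_b) + (1 - alpha) * min(card_a, card_b)
    return (card_inter + bias) / referent if referent else 0.0


def word_sim(a, b, alpha_sim=0.5, bias_sim=0.0):
    """Word level: crisp cardinalities of q-gram sets."""
    if a == b:
        return 1.0
    ga, gb = qgrams(a), qgrams(b)
    return coefficient(len(ga), len(gb), len(ga & gb), alpha_sim, bias_sim)


def soft_card(words, p=1.0):
    """Text level: unweighted soft cardinality of a collection of distinct words."""
    return sum(1.0 / sum(word_sim(a, b) ** p for b in words) for a in words)


def text_sim(text_a, text_b, alpha=0.5, bias=0.0, p=1.0):
    """Compare two texts as sets of words, using soft cardinalities."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    card_a, card_b = soft_card(sorted(a), p), soft_card(sorted(b), p)
    card_union = soft_card(sorted(a | b), p)
    card_inter = card_a + card_b - card_union
    return coefficient(card_a, card_b, card_inter, alpha, bias)


print(text_sim("the cat sat", "the cats sat"))  # high: "cat" and "cats" share q-grams
print(text_sim("the cat sat", "a dog ran"))     # low: no shared words or q-grams
```

With parameter values outside these defaults, such as the negative biases in the results table, word similarities can leave [0, 1], so a fuller implementation would need to decide how to handle that case.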
Optimal parameters found by hill climbing
DATA SET*           q-grams   α      bias    p      α_sim   bias_sim   Pearson r
MSRpar.training     [4]       0.62   1.14    0.77   -0.04   -0.38      0.6598
MSRpar.test         [4]       0.60   1.02    0.90   -0.02   -0.40      0.6335
MSRvid.training     [1:4]     0.42   -0.80   2.28   0.18    0.08       0.8323
MSRvid.test         [1:4]     0.32   -0.80   1.88   1.08    0.08       0.8579
SMTeuro.training    [2:4]     0.74   -0.06   0.91   1.88    2.90       0.6193
SMTeuro.test        [2:4]     0.84   -0.16   0.71   1.78    3.00       0.5178
OnWN.test           [2:5]     0.88   -0.62   1.36   -0.02   -0.70      0.7202
SMTnews.test        [1:4]     0.88   0.88    1.57   0.80    3.21       0.5344
Weights: $w_{a_i} = idf(a_i)$
RankMean = 0.6788
* Lemmatized with Porter stemmer
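The search itself is not detailed on the slides. A generic coordinate-wise hill climber, sketched below, shows the idea: change one parameter at a time by a fixed step and keep any change that improves the objective. The step size, the stopping rule and the stand-in objective are assumptions; in the setting of the slides the objective would be the Pearson correlation obtained on training data for a given parameter vector.

```python
def hill_climb(objective, start, step=0.02, max_iters=10000):
    """Greedy coordinate search: accept any single-parameter move that improves the score."""
    current = list(start)
    best = objective(current)
    for _ in range(max_iters):
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = current.copy()
                candidate[i] += delta
                score = objective(candidate)
                if score > best:
                    current, best, improved = candidate, score, True
        if not improved:  # no neighbor is better: a local maximum at this step size
            break
    return current, best


# Stand-in objective with a single maximum at (0.6, -0.4); hypothetical values.
def smooth_objective(params):
    return -((params[0] - 0.6) ** 2) - (params[1] + 0.4) ** 2


params, score = hill_climb(smooth_objective, start=[0.0, 0.0])
print(params, score)
```

Such a search only finds the global optimum when the objective has a single maximum, which is the behavior the authors report observing for their parameter space.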
Top three runs and the resources they used
Run: baer/task6-UKP-run2
Used resources: KB similarity, Lemmatizer, String Similarity, Dictionaries, Distributional thesaurus, Monolingual corpora, Multilingual corpora, Wikipedia, WordNet, Distributional Similarity, POS tagger, SMT, Textual Entailment, Other
Mean (r): 0.6773; RankMean: 1st; Difference: 0.969%
Run: jan_snajder/task6-takelab-simple
Used resources: KB similarity, Lemmatizer, Dictionaries, Distributional thesaurus, Monolingual corpora, Stop words, Wikipedia, WordNet, Distributional Similarity, Lexical Substitution, Machine Learning, POS tagger, Other
Mean (r): 0.6753; RankMean: 2nd; Difference: 0.671%
Run: sgjimenezv/task6-SOFT-CARDINALITY
Used resources: KB similarity, Lemmatizer, String Similarity
Mean (r): 0.6708; RankMean: 3rd; Difference: 0%
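The Difference row appears to be each run's mean correlation relative to the soft cardinality run. That reading is an assumption, but it reproduces the printed percentages:

```python
means = {
    "baer/task6-UKP-run2": 0.6773,
    "jan_snajder/task6-takelab-simple": 0.6753,
    "sgjimenezv/task6-SOFT-CARDINALITY": 0.6708,
}


def relative_difference(mean_r, reference):
    """Percentage by which mean_r exceeds the reference mean."""
    return 100.0 * (mean_r - reference) / reference


reference = means["sgjimenezv/task6-SOFT-CARDINALITY"]
for run, mean_r in means.items():
    print(run, f"{relative_difference(mean_r, reference):.3f}%")
```

The gap between the first and third ranked runs is therefore under one percent of the mean correlation.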
Results for other measures
• Cosine tf-idf + lemmatizer: RankMean = 0.6326 (10th)
• SoftTFIDF + lemmatizer: RankMean = 0.6415 (7th)
• Soft Cardinality + lemmatizer + parameterized resemblance coefficient + hill climbing: RankMean = 0.6788
[Figure: Text A and Text B diagrams]
Conclusions
1. The soft cardinality approach proved to be an effective and low-cost text similarity function, even in Semantic Textual Similarity scenarios.
2. The parameters of the proposed function were meaningful, and their optimal values were easy to find when training data was available.