soft cardinality: a parameterized similarity function for text comparison

Soft Cardinality: A Parameterized Similarity Function for Text Comparison. Sergio Jimenez, Claudia Becerra, Alexander Gelbukh. Center for Computing Research, Instituto Politécnico Nacional (National Polytechnic (Technical) Institute), Mexico

Upload: sergio-jimenez

Post on 08-Jul-2015


DESCRIPTION

Abstract: We present an approach for constructing text similarity functions that combines a parameterized resemblance coefficient with a softened cardinality function called soft cardinality. The approach provides a consistent, recursive model that works at varying levels of granularity, from sentences down to characters. Accordingly, we used the model to compare sentences divided into words and, in turn, words divided into character q-grams. Experimentally, we observed that the performance function (correlation) over the space defined by all parameters was relatively smooth and had a single maximum reachable by "hill climbing." Our approach used only surface text information, a stop-word remover, and a stemmer to tackle the semantic textual similarity task (Task 6) at SemEval 2012. The proposed method ranked 3rd (average), 5th (normalized correlation), and 15th (aggregated correlation) among 89 systems submitted by 31 teams.

TRANSCRIPT

Page 1: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Sergio Jimenez, Claudia Becerra, Alexander Gelbukh

Center for Computing Research, Instituto Politécnico Nacional (National Polytechnic (Technical) Institute), Mexico

Page 2: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Outline

• Cardinality-based similarity functions

• What is Soft Cardinality?

• Parameterized resemblance coefficient

• Building text similarity functions

• Optimizing parameters

• Results in STS SemEval-2012

• Conclusions

Page 3: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Cardinality-based similarity functions

Jaccard (1905): $SIM(A,B) = \frac{|A \cap B|}{|A \cup B|}$

Dice (1945): $SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$

Only two things are needed:

1. Cardinality function → soft cardinality

2. Resemblance coefficient, "commonalities" / "referent" → parameterized resemblance coefficient
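As a point of reference, the two classical coefficients above can be sketched in a few lines of Python using ordinary (crisp) set cardinality. The example word sets are made up for illustration and do not come from the slides.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| with classical (crisp) cardinality."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    """|A ∩ B| / (0.5|A| + 0.5|B|)."""
    referent = 0.5 * len(a) + 0.5 * len(b)
    return len(a & b) / referent if referent else 0.0


# Illustrative word sets (hypothetical data).
A = {"soft", "cardinality", "text"}
B = {"text", "similarity", "soft", "function"}

print(jaccard(A, B))  # 2 shared words out of 5 distinct words
print(dice(A, B))     # 2 shared words over a referent of 3.5
```

Both functions share the same numerator (the commonalities) and differ only in the referent, which is the part the later slides parameterize.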

Page 4: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Soft Cardinality

[Figure: two example sets, A and B, with three elements each]

Classical cardinality (crisp count): $|A| = 3$, $|B| = 3$

Soft cardinality (soft count): $|A|' = 2.9$, $|B|' = 1.3$

Page 5: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

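How to compute soft cardinality?

$A = \{a_1, a_2, \ldots, a_n\}$

$$|A|' = \sum_{i=1}^{n} \frac{1}{\sum_{j=1}^{n} sim(a_i, a_j)}$$

where:

• $\forall a:\ sim(a,a) = 1$

• $\forall a,b:\ sim(a,b) = sim(b,a)$

• $\forall a,b:\ sim(a,b) \in [0,1]$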
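A minimal Python sketch of the soft cardinality formula follows. The two similarity functions, `crisp` and `char_jaccard`, are hypothetical stand-ins chosen only because they satisfy the three conditions listed on the slide (for non-empty strings); they are not the similarity function used by the authors.

```python
from typing import Callable, Sequence


def soft_cardinality(elements: Sequence[str],
                     sim: Callable[[str, str], float]) -> float:
    """Sum over a_i of 1 / sum_j sim(a_i, a_j)."""
    return sum(1.0 / sum(sim(a, b) for b in elements) for a in elements)


def crisp(a: str, b: str) -> float:
    """Hypothetical sim: 1 for identical elements, 0 otherwise."""
    return 1.0 if a == b else 0.0


def char_jaccard(a: str, b: str) -> float:
    """Hypothetical sim: Jaccard over character sets (non-empty strings)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


words = ["abc", "abd", "xyz"]
print(soft_cardinality(words, crisp))         # 3.0, the classical count
print(soft_cardinality(words, char_jaccard))  # about 2.33: "abc" and "abd" overlap
```

With a crisp similarity the formula reduces to the classical count; any similarity between distinct elements pulls the soft count below it.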

Page 6: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

[Figure: a set A with classical cardinality $|A| = 8$ and its soft cardinality $|A|'$]

Page 7: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

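Soft cardinality:

$|A|' = 2.67$, $|B|' = 3.71$, $|A \cup B|' = 3.92$

$|A \cap B|' = |A|' + |B|' - |A \cup B|' = 2.46$

$\frac{|A \cap B|'}{|A \cup B|'} = 0.63$

Classical cardinality:

$\frac{|A \cap B|}{|A \cup B|} = 0.40$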
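The arithmetic on this slide can be checked directly: the soft cardinality of the intersection is obtained from the soft cardinalities of A, B, and their union, and is then divided by the union as in the Jaccard coefficient. The variable names below are illustrative.

```python
# Soft cardinalities as given on the slide.
card_a, card_b, card_union = 2.67, 3.71, 3.92

# The intersection is derived, not measured directly.
card_inter = card_a + card_b - card_union
soft_jaccard = card_inter / card_union

print(round(card_inter, 2), round(soft_jaccard, 2))  # 2.46 0.63
```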

Page 8: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

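An extended soft cardinality model

$$|A|' = \sum_{i=1}^{n} w_{a_i} \left( \sum_{j=1}^{n} sim(a_i, a_j)^p \right)^{-1}$$

• $w_{a_i}$: weights for the elements (words), e.g. tf-idf

• $p \geq 0$: controls the "softness" of the soft cardinality

• As $p \to \infty$, every similarity below 1 vanishes, so $|A|' \to \sum_{i} w_{a_i}$, which is the classical $|A|$ when all weights are 1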
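The effect of the weights and of p can be sketched as follows. The similarity function `half_similar` and the weight values are hypothetical, picked only to make the numbers easy to follow.

```python
from typing import Callable, Mapping, Optional, Sequence


def soft_cardinality(elements: Sequence[str],
                     sim: Callable[[str, str], float],
                     weights: Optional[Mapping[str, float]] = None,
                     p: float = 1.0) -> float:
    """Sum over a_i of w_{a_i} / sum_j sim(a_i, a_j)**p (unit weights by default)."""
    total = 0.0
    for a in elements:
        w = 1.0 if weights is None else weights[a]
        total += w / sum(sim(a, b) ** p for b in elements)
    return total


def half_similar(a: str, b: str) -> float:
    """Hypothetical sim: distinct elements are always 0.5 similar."""
    return 1.0 if a == b else 0.5


items = ["x", "y", "z"]
for p in (1.0, 2.0, 50.0):
    # Larger p makes the count "harder", approaching the classical value 3.
    print(p, soft_cardinality(items, half_similar, p=p))

# Hypothetical weights: "x" counts double.
print(soft_cardinality(items, half_similar, weights={"x": 2.0, "y": 1.0, "z": 1.0}))
```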

Page 9: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

[Figure: example set A with $|A| = 8$]

Page 10: Soft Cardinality: A Parameterized Similarity Function for Text Comparison
Page 11: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Parameterized resemblance coefficient

Dice: $SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$

The referent is a balance between the "sizes" of A and B.

Tversky (1977):

• "the son resembles the father", not "the father resembles the son"

• "an ellipse is like a circle", not "a circle is like an ellipse"

• "North Korea is like Red China", not "Red China is like North Korea"

[Figure: sets A and B]

In general, "the variant is more similar to the prototype":

$SIM(A,B) = \frac{|A \cap B|}{|A|}$

Overlap coefficient: $SIM(A,B) = \frac{|A \cap B|}{\min(|A|,|B|)}$

Page 12: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

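Parameterized resemblance coefficient

$SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$

$SIM(A,B) = \frac{|A \cap B|}{0.5\max(|A|,|B|) + 0.5\min(|A|,|B|)}$

$SIM(A,B) = \frac{|A \cap B|}{\alpha\max(|A|,|B|) + (1-\alpha)\min(|A|,|B|)}$

Special cases of $\alpha$ (axis labeled "Tversky"):

• $\alpha = 0$: overlap coefficient

• $\alpha = 0.5$: Dice coefficient

• $\alpha = 1$: $\frac{|A \cap B|}{\max(|A|,|B|)}$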
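The three special cases of α can be illustrated with a short sketch. The function name `resemblance` and the example cardinalities are assumptions made for illustration.

```python
def resemblance(card_a: float, card_b: float, card_inter: float,
                alpha: float = 0.5) -> float:
    """|A ∩ B| / (alpha * max(|A|, |B|) + (1 - alpha) * min(|A|, |B|))."""
    referent = alpha * max(card_a, card_b) + (1 - alpha) * min(card_a, card_b)
    return card_inter / referent


# Hypothetical cardinalities: |A| = 3, |B| = 5, |A ∩ B| = 2.
for alpha, name in ((0.0, "overlap"), (0.5, "Dice"), (1.0, "max-normalized")):
    print(name, resemblance(3, 5, 2, alpha=alpha))
```

Because the referent is built from max and min rather than from |A| and |B| directly, the coefficient stays symmetric in its arguments for every α.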

Page 13: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Parameterized resemblance coefficient

$SIM(A,B) = \frac{|A \cap B| + bias}{\alpha\max(|A|,|B|) + (1-\alpha)\min(|A|,|B|)}$

      |A|   |B|   |A∩B|   bias=0.0   bias=0.2   bias=-0.2
#1     3     5      2      0.5        0.550      0.450
#2     6    10      4      0.5        0.525      0.475
#3     9    15      6      0.5        0.517      0.483
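The table values can be reproduced with the sketch below. The slide does not state which α was used; the numbers match α = 0.5, which is what the code assumes. The function name `resemblance` is illustrative.

```python
def resemblance(card_a: float, card_b: float, card_inter: float,
                alpha: float = 0.5, bias: float = 0.0) -> float:
    """(|A ∩ B| + bias) / (alpha * max(|A|, |B|) + (1 - alpha) * min(|A|, |B|))."""
    referent = alpha * max(card_a, card_b) + (1 - alpha) * min(card_a, card_b)
    return (card_inter + bias) / referent


# Rows #1 to #3 of the table: (|A|, |B|, |A ∩ B|).
rows = [(3, 5, 2), (6, 10, 4), (9, 15, 6)]
table = [
    [round(resemblance(a, b, inter, bias=bias), 3) for bias in (0.0, 0.2, -0.2)]
    for a, b, inter in rows
]
for line in table:
    print(line)
```

The three pairs have the same proportions, so the unbiased score is 0.5 in every row; a fixed bias moves the score less as the sets grow, which makes it a size-dependent correction.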

Page 14: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

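Building text similarity functions

$SIM(A,B) = \frac{|A \cap B|' + bias}{\alpha\max(|A|',|B|') + (1-\alpha)\min(|A|',|B|')}$

$SIM(A,B)$ compares two texts as sets of words.

Soft cardinality:

$$|A|' = \sum_{i=1}^{n} w_{a_i} \left( \sum_{j=1}^{n} sim(a_i, a_j)^p \right)^{-1}$$

$sim(a_i,b_j) = \frac{|a_i \cap b_j| + bias_{sim}}{\alpha_{sim}\max(|a_i|,|b_j|) + (1-\alpha_{sim})\min(|a_i|,|b_j|)}$

$sim(a_i,b_j)$ compares two words as sets of q-grams.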
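The two levels can be combined as in the following sketch, which is a simplified reading of the slide rather than the authors' implementation. Assumptions: whitespace tokenization, unit weights instead of idf, a single q-gram size without padding, the intersection computed as $|A|' + |B|' - |A \cup B|'$, and word similarities clamped to [0, 1] so that the power with p stays well defined. It also assumes default parameters for which $sim(a,a) > 0$. All function names are hypothetical.

```python
def qgrams(word: str, q: int = 2) -> set:
    """Character q-grams of a word; short words are kept whole (assumption)."""
    if len(word) <= q:
        return {word}
    return {word[i:i + q] for i in range(len(word) - q + 1)}


def coefficient(card_a, card_b, card_inter, alpha=0.5, bias=0.0):
    """Parameterized resemblance coefficient."""
    referent = alpha * max(card_a, card_b) + (1 - alpha) * min(card_a, card_b)
    return (card_inter + bias) / referent if referent else 0.0


def word_sim(a, b, q=2, alpha_sim=0.5, bias_sim=0.0):
    """Compares two words as sets of q-grams, clamped to [0, 1] (assumption)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    score = coefficient(len(ga), len(gb), len(ga & gb), alpha_sim, bias_sim)
    return min(1.0, max(0.0, score))


def soft_cardinality(words, p=1.0, **sim_kw):
    """Soft cardinality with unit weights (idf weights omitted here)."""
    return sum(1.0 / sum(word_sim(a, b, **sim_kw) ** p for b in words)
               for a in words)


def text_sim(text_a, text_b, alpha=0.5, bias=0.0, p=1.0, **sim_kw):
    """Compares two texts as sets of words."""
    a = sorted(set(text_a.lower().split()))
    b = sorted(set(text_b.lower().split()))
    card_a = soft_cardinality(a, p, **sim_kw)
    card_b = soft_cardinality(b, p, **sim_kw)
    card_union = soft_cardinality(sorted(set(a) | set(b)), p, **sim_kw)
    card_inter = card_a + card_b - card_union
    return coefficient(card_a, card_b, card_inter, alpha, bias)


print(text_sim("the cat sat", "the cats sat"))  # high: "cat" and "cats" share q-grams
print(text_sim("aa bb", "cc dd"))               # no shared words or q-grams
```

The same `coefficient` function serves both levels, which is the recursive aspect of the model: only the notion of "element" changes, from words to q-grams.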

Page 15: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Optimal parameters found by hill climbing

DATA SET*          q-grams   α      bias    p      α_sim   bias_sim   Pearson r
MSRpar.training    [4]       0.62    1.14   0.77   -0.04   -0.38      0.6598
MSRpar.test        [4]       0.60    1.02   0.90   -0.02   -0.40      0.6335
MSRvid.training    [1:4]     0.42   -0.80   2.28    0.18    0.08      0.8323
MSRvid.test        [1:4]     0.32   -0.80   1.88    1.08    0.08      0.8579
SMTeuro.training   [2:4]     0.74   -0.06   0.91    1.88    2.90      0.6193
SMTeuro.test       [2:4]     0.84   -0.16   0.71    1.78    3.00      0.5178
OnWN.test          [2:5]     0.88   -0.62   1.36   -0.02   -0.70      0.7202
SMTnews.test       [1:4]     0.88    0.88   1.57    0.80    3.21      0.5344

RankMean = 0.6788

Weights: $w_{a_i} = idf(a_i)$

* Lemmatized with Porter stemmer
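The slides do not say which hill-climbing variant was used, so the following is only a generic coordinate-wise sketch. The step sizes, the starting point, and the toy objective are assumptions; in practice the objective would be the Pearson correlation between the similarity scores and the gold standard on training data.

```python
from typing import Callable, Dict, Tuple


def hill_climb(objective: Callable[[Dict[str, float]], float],
               start: Dict[str, float],
               step: float = 0.1,
               min_step: float = 1e-3) -> Tuple[Dict[str, float], float]:
    """Move one parameter at a time by +/- step; halve the step when stuck."""
    params = dict(start)
    best = objective(params)
    while step >= min_step:
        improved = False
        for name in list(start):
            for delta in (step, -step):
                candidate = dict(params)
                candidate[name] += delta
                score = objective(candidate)
                if score > best:
                    params, best, improved = candidate, score, True
        if not improved:
            step /= 2
    return params, best


def toy_objective(params: Dict[str, float]) -> float:
    """Hypothetical smooth surface with a single maximum at alpha=0.6, p=1.2."""
    return -(params["alpha"] - 0.6) ** 2 - (params["p"] - 1.2) ** 2


found, score = hill_climb(toy_objective, {"alpha": 0.0, "p": 0.0})
print(found, score)
```

A simple local search of this kind is sufficient only when the surface is smooth and has a single maximum, which is the property the authors report observing experimentally.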

Page 16: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Used resources

Run: baer/task6-UKP-run2
Used resources: KB similarity, Lemmatizer, String Similarity, Dictionaries, Distributional thesaurus, Monolingual corpora, Multilingual corpora, Wikipedia, WordNet, Distributional Similarity, POS tagger, SMT, Textual Entailment, Other
Mean (r): 0.6773
RankMean: 1st
Difference: 0.969%

Run: jan_snajder/task6-takelab-simple
Used resources: KB similarity, Lemmatizer, Dictionaries, Distributional thesaurus, Monolingual corpora, Stop words, Wikipedia, WordNet, Distributional Similarity, Lexical Substitution, Machine Learning, POS tagger, Other
Mean (r): 0.6753
RankMean: 2nd
Difference: 0.671%

Run: sgjimenezv/task6-SOFT-CARDINALITY
Used resources: KB similarity, Lemmatizer, String Similarity
Mean (r): 0.6708
RankMean: 3rd
Difference: 0%

Page 17: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Results for other measures

Cosine tf-idf + lemmatizer: RankMean = 0.6326 (10th)

SoftTFIDF + lemmatizer: RankMean = 0.6415 (7th)

Soft Cardinality + lemmatizer + parameterized resemblance coefficient + hill climbing: RankMean = 0.6788

Page 18: Soft Cardinality: A Parameterized Similarity Function for Text Comparison

Conclusions

1. The soft cardinality approach proved to be an effective, low-cost text similarity function, even in Semantic Textual Similarity scenarios.

2. The parameters of the proposed function were meaningful, and their optimal values were easy to find when training data was available.