framework for syntactic string similarity measurescs.uef.fi/sipu/pub/titlesimilarity.pdf ·...
TRANSCRIPT
![Page 1: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/1.jpg)
Framework for Syntactic String Similarity Measures
Najlah Gali, Radu Mariescu-Istodor, Damien Hostettler, Pasi Fränti
24.4.2019
N. Gali, R. Mariescu-Istodor, D. Hostettler and P. Fränti, "Framework for syntactic string similarity measures",
Expert Systems with Applications, 2019.
![Page 2: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/2.jpg)
Introduction
![Page 3: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/3.jpg)
Application examples
Titles of web pages:
V-caféViet-Café
Keywords and keyphrases:
Theatertheatre
Named entities:
U.S State DepartmentUS Department of State
Personal names:
Gail VestGayle Vesty
Place names:
Ting Tsi RiverTingtze River
Ontology alignments:
associate professorsenior lecturer
Short segments of text:
Apple computerApple pie
Sentences:
I haven't watched television for agesIt's been a long time since I watched television
![Page 4: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/4.jpg)
Similarity framework
![Page 5: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/5.jpg)
Existing packages
![Page 6: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/6.jpg)
StringSim package
Existing measureNew combination
![Page 7: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/7.jpg)
Character-level measures
• Exact match
• Transformation
• Longest common substring (LCS)
![Page 8: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/8.jpg)
Edit distanceLevenshtein 1966
dist=5
Solved by dynamic programming algorithm
![Page 9: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/9.jpg)
Character-level measures
![Page 10: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/10.jpg)
String segmentation
• Tokenization
• Q-grams
![Page 11: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/11.jpg)
Segmentation examplesThe club at the Ivy
![Page 12: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/12.jpg)
Matching techniques
• Sequence matching
• Set matching
• Bag-of-tokens
![Page 13: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/13.jpg)
String matching at token level
![Page 14: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/14.jpg)
Problem of crisp sets
gray color
color gray
gray color
colour grey
?
Similarity = 1.0 Similarity = 0.0
![Page 15: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/15.jpg)
Soft set-matchingSmith-Waterman-Gotoh
Similarity = ( ) 63.08.09.02.03
1=++
![Page 16: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/16.jpg)
Soft cardinalities of sets{gray, grey} …. {gray, color}
![Page 17: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/17.jpg)
Soft cardinalities of sets{gray, grey} …. {gray, color}
![Page 18: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/18.jpg)
Another example
![Page 19: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/19.jpg)
Summary of measuresSequence and set-matching
![Page 20: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/20.jpg)
Summary of measuresBag-of-tokens
![Page 21: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/21.jpg)
Results
![Page 22: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/22.jpg)
Datasets
![Page 23: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/23.jpg)
Ch/ Q Edit MongeElkan Brau-Ban Simpson JaccardDice & Rouge
Cosine Manhattan Euclidean
Exact match2
4 10
Hamming
Levenshtein & Damerau-Levenshtein 6 12 6 9 6Needleman Wunch 14Smith Waterman & SWG 2 6 10 15 9 12 15Jaro 7
11 13 12Jaro Winkler 8LCS 1 9 12 6 9 62Grams 5
2 33Grams 3Word2Vec
Text manipulationCharacter changes
![Page 24: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/24.jpg)
Exact matchWord2Vec
Expected
7. Edit-Damerau
16. BraunBan-Needle15. BraunBan-JaroWinkler
14. Euclidean
13. BraunBan-SWG
12. Edit-Needle11. BraunBan-LCS10. BraunBan-Damerau
6. Edit-SWG
4. Edit-LCS
5. BraunBan-3Grams
3. Edit-3Gram
s
8. BraunBan-Ham
ming
2. LCS
1. Manhattan-Hamming
Hu
man
intu
ition
9. Edit-JaroWinkler
Effect of char changes
![Page 25: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/25.jpg)
Ch/ Q Edit MongeElkan Brau-Ban Simpson Jaccard Dice & Rouge Cosine Manhattan Euclidean
Exact match2
4 10
Hamming
Levenshtein & Damerau-Levenshtein 6 12 6 9 6Needleman Wunch 14Smith Waterman & SWG 2 6 10 15 9 12 15Jaro 7
11 13 12Jaro Winkler 8LCS 1 9 12 6 9 62Grams 5
2 33Grams 3
Text manipulationToken changes
![Page 26: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/26.jpg)
Effect of token changes
![Page 27: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/27.jpg)
Correlation to human intuition
![Page 28: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/28.jpg)
Qualitative examples
Excellent match
Poor match
![Page 29: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/29.jpg)
Correlation to distance
![Page 30: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/30.jpg)
Correlation to distance
![Page 31: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/31.jpg)
Clustering experiment
• 180 photos• 15 clusters
![Page 32: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/32.jpg)
Clustering results
![Page 33: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/33.jpg)
Name matching
![Page 34: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/34.jpg)
Name matching
![Page 35: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/35.jpg)
Summary of the results
![Page 36: Framework for Syntactic String Similarity Measurescs.uef.fi/sipu/pub/TitleSimilarity.pdf · 2019-11-14 · Framework for Syntactic String Similarity Measures Najlah Gali, Radu Mariescu-Istodor,](https://reader034.vdocuments.site/reader034/viewer/2022050515/5f9f26c44cbeaa23e4350284/html5/thumbnails/36.jpg)
Conclusions
Token level measures
• Well-maintained databases: ok as such• Free text: soft variants improves!
Semantic similarity
• Suffers from single char changes
Recommendation:
• Dice or Rouge (token level) + Q-grams• No single measure work for all applications