Search explained T3DD15
![Page 1: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/1.jpg)
![Page 2: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/2.jpg)
Search explained
![Page 3: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/3.jpg)
My name is Hans Höchtl
Technical director @ Onedrop Solutions
PHP, Java, Ruby Developer
Participation in TYPO3 Solr
![Page 4: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/4.jpg)
SELECT * FROM mytable WHERE field LIKE '%searchword%'
SELECT * FROM mytable WHERE field SOUNDS LIKE 'searchword'
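A minimal sketch of the LIKE approach, assuming an in-memory SQLite database (the table contents are made up for illustration; `SOUNDS LIKE` is MySQL-specific and is not shown). It finds the rows that contain the word but says nothing about which row matters more:

```python
import sqlite3

# In-memory database with a few hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, field TEXT)")
conn.executemany(
    "INSERT INTO mytable (field) VALUES (?)",
    [
        ("a searchword in passing",),
        ("searchword searchword searchword",),
        ("nothing here",),
    ],
)


def like_search(term):
    """Return the ids of all rows whose field contains the term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM mytable WHERE field LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


matches = like_search("searchword")
print(matches)
```

Rows 1 and 2 both match, but the query gives no hint that one of them might be the better result.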
![Page 5: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/5.jpg)
Whether a word appears in a text is easy to determine.
But is it relevant?
![Page 6: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/6.jpg)
Relevance is subjective and depends on the judgement of users.
We use "scoring" to predict relevance.
![Page 7: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/7.jpg)
The score is computed by a function that is applied to the indexed documents and takes the search term as its input parameter.
![Page 8: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/8.jpg)
- TF-IDF: Term frequency-inverse document frequency
- BM25: Okapi BM25 (Best Matching)
- DFR: Divergence from randomness
- and many more
![Page 9: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/9.jpg)
All those scoring calculations should fulfill these two requirements:
1. Precision: Are the results relevant to the user?
2. Recall: Have we found all relevant content in the index?
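As a rough illustration of the two requirements, here is a small sketch that computes precision and recall for one query. The document ids and the set of "relevant" documents are invented; in practice the relevant set comes from human judgement.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: ids returned by the search
    relevant:  ids a user would judge as relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 documents are actually relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Here half of the returned results are relevant (precision), and two of the three relevant documents were found (recall).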
![Page 10: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/10.jpg)
How do we store documents so that scores can be computed efficiently?
![Page 11: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/11.jpg)
Vector Space Model (default in Solr and Elasticsearch)
Document: A vector of terms
Term: A "word" inside a document
Each unique term is a dimension
![Page 12: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/12.jpg)
Vector Space Model
The best match is the document whose vector forms the smallest angle with the query vector.
![Page 13: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/13.jpg)
Document 1: "unique unique bag"
Document 2: "unique bag bag"
Query: "unique bag"
[Figure: the vectors v(d1), v(q) and v(d2) plotted on the axes "unique" and "bag"]
![Page 14: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/14.jpg)
Calculating the cosine of the angle between the vectors is much cheaper (in CPU cycles) than calculating the angle itself.
![Page 15: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/15.jpg)
cos(θ) = (d2 · q) / (||d2|| · ||q||), where d2 · q is the dot product of the document and the query vectors.
||q|| is the norm (length) of the vector q.
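The cosine can be sketched in a few lines, using the two example documents and the query from the earlier slide. Splitting on whitespace and using raw term counts as vector components are simplifying assumptions for this sketch.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a term-count vector (one dimension per unique term)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = to_vector("unique unique bag")
d2 = to_vector("unique bag bag")
q = to_vector("unique bag")

print(cosine(d1, q), cosine(d2, q))
```

With raw counts, both documents form the same angle with the query, so they receive the same score. This is one reason the term weights are refined later with TF-IDF.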
![Page 16: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/16.jpg)
A cosine value of zero means that the query and document vectors are orthogonal: they share no terms, so there is no match.
![Page 17: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/17.jpg)
TF-IDF
In the vector space model (VSM), the weight of each term in the vector of a document d is now the product of two factors:
Term frequency × Inverse document frequency
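The formula on the slide did not survive extraction. A common textbook form of the weight, given here as an assumption rather than the slide's exact notation, is:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

Here $\mathrm{tf}_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the total number of documents in the index, and $\mathrm{df}_t$ is the number of documents that contain $t$. Rare terms get a high weight, terms that occur in every document get a weight of zero.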
![Page 18: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/18.jpg)
TF-IDF
Now we have everything together to calculate the similarity between documents using TF-IDF:
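A minimal sketch of this calculation, assuming the plain textbook weighting tf · log(N / df) and whitespace tokenization (not Lucene's variant, which differs in detail). The three sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one TF-IDF weighted term vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["red car", "red bike", "green apple"]
vectors = tfidf_vectors(docs)
sim_car_bike = cosine(vectors[0], vectors[1])
sim_car_apple = cosine(vectors[0], vectors[2])
print(sim_car_bike, sim_car_apple)
```

The first two documents share only the fairly common term "red", so their similarity is small but non-zero. The first and third share nothing and score zero.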
![Page 19: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/19.jpg)
TF-IDF
PROs
- Simple model based on linear algebra
- Term weights are not binary
- Allows computing a continuous degree of similarity between queries and documents
- Allows ranking of documents according to their possible relevance
- Allows partial matching
CONs
- Long documents have poor similarity values (small scalar product and large dimensionality)
- Search keywords must precisely match terms
- Missing semantic sensitivity
- Order of terms in the document is not taken into account
- Terms are usually not statistically independent (as this model assumes)
![Page 20: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/20.jpg)
TF-IDF - The Lucene way
Coord: Boosts documents that match more of the search terms in a multi-word query, e.g. 3/4 vs. 4/4 terms matched.
Norm: Length normalization boosts fields that are shorter.
![Page 21: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/21.jpg)
TF-IDF - Multiple fields
TF-IDF expects a document to be just one field containing text. But in reality we have semi-structured documents containing fields like author, subtitle, etc.
![Page 23: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/23.jpg)
TF-IDF - Multiple fields
Solr solution: DisMax Query Parser (Maximum Disjunction)
Search term: "my funny house"
- Documents matching the query in field title
- Documents matching the query in field subtitle
- Documents matching the query in field content
TF-IDF is calculated for every field independently. The score of a document is the highest of its per-field scores.
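The "highest field score wins" rule can be sketched as follows. The per-field scores are invented numbers. The optional `tie` parameter mirrors Solr's DisMax tie breaker, which lets the lower-scoring fields contribute a fraction of their score; with `tie=0.0` the result is the pure maximum described above.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field plus tie * sum of the other fields."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical per-field TF-IDF scores for one document.
scores = {"title": 1.2, "subtitle": 0.4, "content": 0.7}

pure_max = dismax_score(scores)
with_tie = dismax_score(scores, tie=0.1)
print(pure_max, with_tie)
```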
![Page 24: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/24.jpg)
Natural languages
Adjectives, Adverbs, Nouns, Verbs, Conjunctions, Prepositions, Predicates, Compounds, Plurals, Past tense, Declension, Semantics, etc.
![Page 25: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/25.jpg)
Language families
Indo-European languages
Sino-Tibetan languages
![Page 26: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/26.jpg)
TF-IDF Problem
Only exact term matches are considered a hit.
"Car" is not the same term as "Cars"
![Page 27: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/27.jpg)
Handling human languages (Analyzers)
Tokenizers: Split a stream of characters into a series of tokens.
Filters: The generated tokens are passed through a series of filters that add, change or remove tokens.
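To make the tokenizer and filter idea concrete, here is a toy analyzer chain. It is only a sketch: the stopword list is tiny and the plural filter is deliberately naive, whereas real analyzers in Solr and Elasticsearch use proper language-specific stemmers.

```python
import re

STOPWORDS = {"a", "an", "the", "is", "of"}


def tokenize(text):
    """Tokenizer: split a stream of characters into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter that changes tokens."""
    return [t.lower() for t in tokens]


def stopword_filter(tokens):
    """Filter that removes tokens."""
    return [t for t in tokens if t not in STOPWORDS]


def plural_filter(tokens):
    """Naive stemmer: strip one trailing 's' from tokens longer than 3 characters."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters=(lowercase_filter, stopword_filter, plural_filter)):
    """Run the tokenizer, then pass the tokens through each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Cars of the house"))
print(analyze("Car"))
```

After analysis, "Cars" and "Car" end up as the same term, so a query for one can match a document containing the other.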
![Page 28: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/28.jpg)
Index Analyzers vs. Query Analyzers
Index Analyzers: Run their analysis chain on the token stream during indexing. The generated tokens are what gets indexed.
Query Analyzers: Run their analysis chain on the search query at query time. Without this, the query would only hit exact matches.
Beware of Synonyms!
![Page 29: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/29.jpg)
Available analyzers
Solr (https://goo.gl/TXEjZK) Language best practices (https://goo.gl/11O2Qz)
Elasticsearch (https://goo.gl/QR1IYb) Language best practices (https://goo.gl/6FQt7A)
![Page 30: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/30.jpg)
FieldTypes
Solr and Elasticsearch use fieldTypes, assigned to fields, to define the analyzer chain that is run for each field.
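As a rough sketch of what such a definition can look like on the Elasticsearch side, the snippet below builds index settings with a custom analyzer and assigns it to two fields. The analyzer name `demo_text` and the field names are hypothetical, and the exact mapping layout depends on the Elasticsearch version (the typeless `properties` form is assumed here). In Solr the equivalent lives in the schema as a fieldType with an analyzer element.

```python
import json

# Hypothetical index definition: one custom analyzer, used by two text fields.
index_definition = {
    "settings": {
        "analysis": {
            "analyzer": {
                "demo_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "demo_text"},
            "content": {"type": "text", "analyzer": "demo_text"},
        }
    },
}

# This JSON document would be sent as the body when creating the index.
print(json.dumps(index_definition, indent=2))
```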
![Page 31: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/31.jpg)
Let’s take a look at the configuration of TYPO3 Solr and Neos Elasticsearch
![Page 32: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/32.jpg)
Let’s test the analyzer chain
Solr and Elasticsearch
![Page 33: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/33.jpg)
Display score calculation
Solr: /solr/core_de/select?q=test&debugQuery=1
Elasticsearch: /_explain instead of /_search
![Page 34: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/34.jpg)
Let’s take a look at the output:
0.51602894 = (MATCH) sum of:
  0.51602894 = (MATCH) max of:
    0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
      0.51602894 = fieldWeight in 5, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        3.3025851 = idf(docFreq=4, maxDocs=50)
        0.078125 = fieldNorm(doc=5)
    0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
      0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0), product of:
        0.05 = queryWeight, product of:
          2.0 = boost
          3.3025851 = idf(docFreq=4, maxDocs=50)
          0.0075698276 = queryNorm
        3.3025851 = fieldWeight in 5, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.3025851 = idf(docFreq=4, maxDocs=50)
          1.0 = fieldNorm(doc=5)
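The numbers in this explain output can be re-derived by hand. The sketch below assumes the formulas of Lucene's DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes fieldNorm, boost and queryNorm directly from the output above.

```python
import math


def tf(freq):
    """Term frequency factor as used by Lucene's DefaultSimilarity."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency as used by Lucene's DefaultSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Field "content": fieldWeight = tf * idf * fieldNorm
content_field_weight = tf(4.0) * idf(4, 50) * 0.078125

# Field "keywords": score = queryWeight * fieldWeight
query_weight = 2.0 * idf(4, 50) * 0.0075698276  # boost * idf * queryNorm
keywords_field_weight = tf(1.0) * idf(4, 50) * 1.0
keywords_score = query_weight * keywords_field_weight

# "max of": the best field determines the document score.
document_score = max(content_field_weight, keywords_score)
print(content_field_weight, keywords_score, document_score)
```

The results agree with the explain output to within rounding (Lucene computes with 32-bit floats). Note how the fieldNorm of 0.078125 pulls down the long content field, while the short keywords field keeps a norm of 1.0.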
![Page 35: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/35.jpg)
Product-Codes
"AS1134-B"
"131555813"
"EOS 500D"
"13 S24 36-G"
![Page 36: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/36.jpg)
Product-Codes
Index the code in multiple fields, each with a different analyzer, and boost the fields in order from strict to fuzzy matching, so that strict matches rank highest.
Make use of N-Grams, Edge N-Grams, WordDelimiter, Trim, etc.
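What these filters produce for a product code can be sketched like this. The functions are simplified stand-ins, not the Solr or Elasticsearch implementations: in particular, the real WordDelimiter filter also splits on case changes and letter/digit transitions, while this sketch only splits on whitespace and hyphens.

```python
import re


def ngrams(text, n):
    """All substrings of length n (matches anywhere inside the code)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def edge_ngrams(text, min_len, max_len):
    """Prefixes of the text (matches codes that start with the typed input)."""
    return [text[:n] for n in range(min_len, min(max_len, len(text)) + 1)]


def split_on_delimiters(code):
    """Simplified word delimiter: split on whitespace and hyphens."""
    return [part for part in re.split(r"[\s\-]+", code) if part]


print(edge_ngrams("AS1134-B", 2, 4))
print(ngrams("EOS", 2))
print(split_on_delimiters("13 S24 36-G"))
```

A strict field would hold the untouched code, while fields built from these tokens catch partial or differently formatted input and receive a lower boost.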
![Page 37: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/37.jpg)
Use the knowledge you gain from your customers to improve your search, … like Google does.
![Page 38: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/38.jpg)
- Use Google Analytics during index time (preAddModifyDocuments hook)
- Use recency of news (boost function)
- Analyze the search behavior of your customers (popularity of pages)
- Track search result clicks
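For the recency boost, a commonly used Solr function query is `recip(ms(NOW,date),3.16e-11,1,1)`, where `recip(x,m,a,b)` computes a / (m·x + b). The sketch below re-implements that arithmetic to show how the boost decays with age. The helper name `recency_boost` and the dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b), as in Solr's recip() function query."""
    return a / (m * x + b)


def recency_boost(published, now):
    """Boost that is 1.0 for brand-new documents and about 0.5 after one year."""
    age_ms = (now - published).total_seconds() * 1000.0
    # 3.16e-11 is roughly 1 / (milliseconds per year).
    return recip(age_ms, 3.16e-11, 1.0, 1.0)


now = datetime(2015, 7, 1, tzinfo=timezone.utc)
fresh = recency_boost(now, now)
one_year_old = recency_boost(now - timedelta(days=365), now)
two_years_old = recency_boost(now - timedelta(days=730), now)
print(fresh, one_year_old, two_years_old)
```

Multiplying the relevance score by such a factor lets recent news outrank older articles that match the query equally well.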
![Page 39: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/39.jpg)
Some more interesting things
- Facets
- Spellchecking
- Phonetics
- Spatial
![Page 40: Search explained T3DD15](https://reader031.vdocuments.site/reader031/viewer/2022032118/55ccee30bb61eb515b8b458c/html5/thumbnails/40.jpg)