search explained t3dd15
Post on 14-Aug-2015
Search explained
My name is Hans Höchtl
Technical director @ Onedrop Solutions
PHP, Java, Ruby Developer
Participation in TYPO3 Solr
SELECT * FROM mytable WHERE field LIKE '%searchword%'
SELECT * FROM mytable WHERE field SOUNDS LIKE 'searchword'
Appearance of a word inside a text can be determined easily.
But is it relevant?
Relevance is subjective and depends on the judgement of users.
We use „scoring“ to predict relevance.
Scoring is computed by a function applied on our indexed documents using the search term as input parameter.
TF-IDF: Term frequency-inverse document frequency
BM25: Okapi BM25 - Best Matching
DFR: Divergence from randomness
and many more
All those scoring calculations should fulfill these two requirements:
1. Precision: Are the results relevant to the user?
2. Recall: Have we found all relevant content in the index?
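A minimal sketch of how the two measures could be computed for a single query, assuming we already know which documents are relevant. The function name and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned
    relevant:  IDs a human judged relevant (assumed to be known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned results that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 documents are actually relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(precision, recall)
```

Returning more results tends to raise recall and lower precision, which is why a scoring function has to balance the two.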
How to store documents for efficient computing of scoring?
Vector Space Model (default in Solr and Elasticsearch)
Document: A vector of terms
Term: A „word“ inside a document
Each unique term is a dimension
Vector Space Model
The best match is the narrowest angle between query and document
Document 1: „unique unique bag“
Document 2: „unique bag bag“
Query: „unique bag“
[Figure: the vectors v(d1), v(q) and v(d2) drawn in a plane with the axes "unique" and "bag"]
Calculating the cosine of the angle between the vectors is much cheaper (in CPU cycles) than calculating the angle itself.
cos(θ) = (d2 · q) / (||d2|| · ||q||)
Where d2 · q is the dot product of the document and the query vectors.
||d2|| and ||q|| are the norms (lengths) of the vectors d2 and q.
A cosine value of zero means that the query and document vectors are orthogonal and have no match.
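The cosine comparison can be sketched in a few lines, using raw term counts as vector components and the two example documents from the slides. The helper names and the example queries are assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector: one dimension per unique term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = term_vector("unique unique bag")
d2 = term_vector("unique bag bag")
q_both = term_vector("unique bag")   # illustrative query
q_unique = term_vector("unique")     # illustrative query

print(cosine_similarity(d1, q_both), cosine_similarity(d2, q_both))
print(cosine_similarity(d1, q_unique), cosine_similarity(d2, q_unique))
```

With the query "unique bag" both documents sit at the same angle to the query. With the query "unique", Document 1 is the closer match because it contains the term twice.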
TF-IDF
In the vector space model (VSM), the weight of a term t in the vector of a document d is now:
weight(t, d) = term frequency of t in d × inverse document frequency of t
Now we have everything together to calculate the similarity between documents using TF-IDF:
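As a rough sketch of the whole pipeline, the following weights every term with tf × log(N / df) and then compares documents by cosine. The exact formula from the slide is not preserved in this transcript, so this textbook TF-IDF variant is an assumption, as are the function names and the third example document.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["unique unique bag", "unique bag bag", "red car"]
vectors = tf_idf_vectors(docs)
sim_12 = cosine(vectors[0], vectors[1])  # documents sharing terms
sim_13 = cosine(vectors[0], vectors[2])  # documents sharing no terms
print(sim_12, sim_13)
```

Note that with this idf variant a term that occurs in every document gets the weight zero, which is one reason real engines use smoothed variants.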
TF-IDF: PROs and CONs
PROs
- Simple model based on linear algebra
- Term weights not binary
- Allows computing a continuous degree of similarity between queries and documents
- Allows ranking of documents according to their possible relevance
- Allows partial matching
CONs
- Long documents have poor similarity values (small scalar product and large dimensionality)
- Search keywords must precisely match terms
- Missing semantic sensitivity
- Order of terms in document not taken into account
- Terms are usually not statistically independent (as this model assumes)
TF-IDF - The Lucene way
Coord: Boosts documents that match more of the search terms (multiple words), e.g. a document matching 4 of 4 terms (4/4) scores higher than one matching 3 of 4 (3/4)
Norm: Length normalization boosts fields that are shorter
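For orientation, the practical scoring function documented for Lucene's classic TF-IDF similarity has roughly the following shape. The notation below is a paraphrase of the Lucene documentation, not a copy of the slide:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
```

Here coord and norm are the two factors described above, boost is the query-time boost of a term, and queryNorm only makes scores of different queries comparable without changing the ranking.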
TF-IDF - Multiple fields
TF-IDF expects a document to be just one field containing text. But in reality we have semi-structured documents containing fields like author, subtitle, etc.
Solr Solution: DisMax Query Parser (Maximum Disjunction)
Search term: „my funny house“
- Documents matching query in field title
- Documents matching query in field subtitle
- Documents matching query in field content
TF-IDF is calculated for every field independently. The score of a document is the highest of its per-field scores.
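A minimal sketch of the "maximum disjunction" idea, assuming the per-field scores have already been computed. The function name and the numbers are illustrative. The optional tie parameter mirrors the tie breaker that DisMax offers for letting the weaker fields contribute a little.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field wins, others count via the tie breaker.

    field_scores: {field name: score of the query in that field}
    tie:          0.0 = pure maximum, 1.0 = plain sum of all fields
    """
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    rest = sum(field_scores.values()) - best
    return best + tie * rest


# Hypothetical per-field scores for one document.
scores = {"title": 0.2, "subtitle": 0.5, "content": 0.1}
print(dismax_score(scores))           # only the best field counts
print(dismax_score(scores, tie=0.1))  # other fields add a small bonus
```

Taking the maximum instead of the sum avoids rewarding a document just because the same word is repeated across many fields.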
Natural languages
Adjectives, Adverbs, Nouns, Verbs, Conjunctions, Prepositions, Predicates, Compounds, Plurals, Past tense, Declination, Semantics, etc.
Language families
Indo-European languages
Sino-Tibetan languages
TF-IDF Problem
Only exact term matches are considered a hit.
„Car“ is not the same term as „Cars“
Handling human languages (Analyzers)
Tokenizers: Split a stream of characters into a series of tokens.
Filters: The generated tokens are passed through a series of filters that add, change or remove tokens.
Index Analyzers vs. Query Analyzers
Index Analyzers: Perform their analysis chain on the token stream during indexing. The generated tokens will be indexed.
Query Analyzers: Perform their analysis chain on the entered search query during query execution. Otherwise the query would hit just an exact match.
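To make the idea concrete, here is a toy analyzer chain: one tokenizer followed by two filters, applied both at index time and at query time. This is only a sketch. The function names are invented, and the plural filter is deliberately naive compared to the stemmers that ship with Solr and Elasticsearch.

```python
import re


def tokenize(text):
    """Tokenizer: split a stream of characters into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter: change tokens (normalize case)."""
    return [t.lower() for t in tokens]


def plural_filter(tokens):
    """Filter: strip a trailing 's' from longer tokens (very naive stemming)."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def analyze(text, filters=(lowercase_filter, plural_filter)):
    """Run the tokenizer, then pass the tokens through every filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


indexed_terms = set(analyze("Two fast Cars"))  # index-time analysis
query_terms = analyze("car")                   # query-time analysis
matches = all(term in indexed_terms for term in query_terms)
print(sorted(indexed_terms), query_terms, matches)
```

Because the same chain runs on both sides, „Cars“ in the document and „car“ in the query end up as the same indexed term.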
Beware of Synonyms!
Available analyzers
Solr (https://goo.gl/TXEjZK) Language best practices (https://goo.gl/11O2Qz)
Elasticsearch (https://goo.gl/QR1IYb) Language best practices (https://goo.gl/6FQt7A)
FieldTypes
Solr and Elasticsearch use fieldTypes assigned to fields to define the analyzer chain that should be applied.
Let’s take a look in the configuration of TYPO3 Solr and Neos Elasticsearch
Let’s test the analyzer chain
Solr and Elasticsearch
Display score calculation
Solr: /solr/core_de/select?q=test&debugQuery=1
Elasticsearch: /_explain instead of /_search
Let’s take a look at the score explanation:
0.51602894 = (MATCH) sum of:
  0.51602894 = (MATCH) max of:
    0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
      0.51602894 = fieldWeight in 5, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        3.3025851 = idf(docFreq=4, maxDocs=50)
        0.078125 = fieldNorm(doc=5)
    0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
      0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0 ), product of:
        0.05 = queryWeight, product of:
          2.0 = boost
          3.3025851 = idf(docFreq=4, maxDocs=50)
          0.0075698276 = queryNorm
        3.3025851 = fieldWeight in 5, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.3025851 = idf(docFreq=4, maxDocs=50)
          1.0 = fieldNorm(doc=5)
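The numbers in this explanation can be reproduced by hand. The sketch below assumes the formulas of Lucene's classic DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes fieldNorm and queryNorm straight from the output above. The function names are made up for illustration.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: rarer terms get a higher weight."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """tf * idf * fieldNorm, as listed under 'fieldWeight' in the explanation."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


def query_weight(boost, doc_freq, max_docs, query_norm):
    """boost * idf * queryNorm, as listed under 'queryWeight' in the explanation."""
    return boost * idf(doc_freq, max_docs) * query_norm


QUERY_NORM = 0.0075698276  # taken from the explain output

# content:sony^40.0 -> the output lists only fieldWeight; with these values
# the query weight (40 * idf * queryNorm) works out to roughly 1.0.
content_score = field_weight(freq=4, doc_freq=4, max_docs=50, field_norm=0.078125)

# keywords:sony^2.0 -> queryWeight * fieldWeight
keywords_score = (query_weight(2.0, 4, 50, QUERY_NORM)
                  * field_weight(freq=1, doc_freq=4, max_docs=50, field_norm=1.0))

# 'max of': the better field decides the document score.
document_score = max(content_score, keywords_score)
print(content_score, keywords_score, document_score)
```

The short keywords field has the better fieldNorm (1.0 vs 0.078125), but the content field still wins because of its much higher boost and term frequency.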
Product-Codes
„AS1134-B“
„131555813“
„EOS 500D“
„13 S24 36-G“
Index the code in multiple fields to have different analyzers and boost them from strict to fuzzy.
Make use of N-Grams, EdgeN-Grams, WordDelimiter, Trim, etc.
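What such filters produce can be sketched with a few string helpers. These are simplified stand-ins written for illustration, not the Solr or Elasticsearch implementations. In particular, the delimiter split below only handles whitespace and hyphens, while the real WordDelimiter filter has many more rules.

```python
import re


def edge_ngrams(token, min_len=2, max_len=5):
    """Prefixes of a token: lets 'AS11' find 'AS1134-B' while the user types."""
    upper = min(max_len, len(token))
    return [token[:n] for n in range(min_len, upper + 1)]


def ngrams(token, n=3):
    """All substrings of length n: allows matches in the middle of a code."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def split_on_delimiters(code):
    """Trim the code and split it into parts at whitespace and hyphens."""
    return [part for part in re.split(r"[\s\-]+", code.strip()) if part]


print(edge_ngrams("AS1134-B", min_len=2, max_len=4))
print(ngrams("EOS", n=2))
print(split_on_delimiters(" 13 S24 36-G "))
```

A strict field would index the code as a single untouched token, while the fuzzier fields index the output of helpers like these and receive a lower boost.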
Use the knowledge you gain from your customers to improve your search, … like Google does.
- Use Google Analytics during index time (preAddModifyDocuments hook)
- Use recency of news (boost function)
- Analyze the search behavior of your customers (popularity of pages)
- Track search result clicks
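As an illustration of the recency idea, the sketch below mimics the reciprocal function that Solr offers for function queries, recip(x, m, a, b) = a / (m * x + b), applied to the age of a document in milliseconds. The constants are a commonly used starting point rather than values from the slides, and recency_boost is a hypothetical helper name.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, m=3.16e-11, a=1.0, b=1.0):
    """Boost factor that decays with document age.

    m is roughly 1 / (milliseconds per year), so a one-year-old document
    gets about half the boost of a brand-new one.
    """
    return recip(age_ms, m, a, b)


print(recency_boost(0))                # brand-new document
print(recency_boost(MS_PER_YEAR))      # one year old
print(recency_boost(5 * MS_PER_YEAR))  # five years old
```

The boost never reaches zero, so old documents stay findable and are only ranked lower than fresh ones with a similar text score.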
Some more interesting things
- Facets
- Spellchecking
- Phonetics
- Spatial
Thank you
Mail: hhoechtl@1drop.de or jhoechtl@gmail.com
Twitter: @hhoechtl
Blog: http://blog.1drop.de