phpconf2008 sphinx en
DESCRIPTION
sphinxTRANSCRIPT
![Page 1: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/1.jpg)
The riddles of the Sphinx
Full-text engine anatomy atlas
![Page 2: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/2.jpg)
Who are you?
• Sphinx – FOSS full-text search engine
![Page 3: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/3.jpg)
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball
![Page 4: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/4.jpg)
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball
![Page 5: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/5.jpg)
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball• Good at passing the ball to a team-mate
![Page 6: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/6.jpg)
Who are you?
• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball• Good at passing the ball to a team-mate• Good at many other “inferior” games
– “faceted” search, geosearch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives
![Page 7: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/7.jpg)
What are you here for?
• What will not be covered?– No entry-level “what’s that Sphinx and
what’s in it for me” overview– No long quotes from the documentation– No C++ architecture details
![Page 8: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/8.jpg)
What are you here for?
• What will not be covered?– No entry-level “what’s that Sphinx and
what’s in it for me” overview– No long quotes from the documentation– No C++ architecture details
• What will be?– How does it generally work inside– How things can be optimized– How things can be parallelized
![Page 9: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/9.jpg)
Chapter 1. Engine insides
![Page 10: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/10.jpg)
Total workflow
• Indexing first• Searching second
![Page 11: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/11.jpg)
Total workflow
• Indexing first• Searching second
• There are data sources (what to fetch, where from)
• There are indexes– What data sources to index– How to process the incoming text– Where to put the results
![Page 12: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/12.jpg)
How indexing works
• In two acts, with an intermission• Phase 1 – collecting documents
– Fetch the documents (loop over the sources)– Split the documents into words– Process the words (morphology, *fixes)– Replace the words with their wordid’s
(CRC32/64)– Emit a number of temp files
![Page 13: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/13.jpg)
How indexing works
• Phase 2 – sorting hits– Hit (occurrence) is a (docid,wordid,wordpos) record– Input is a number of partially sorted (by wordid) hit
lists– The incoming lists are merge-sorted– Output is essentially a single fully sorted hit list
• Intermezzo– Collect and sort MVA values– Sort ordinals– Sort extern attributes
![Page 14: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/14.jpg)
Dumb & dumber
• The index format is… simple• Several sorted lists
– Dictionary (the complete list of wordid’s)– Attributes (only if docinfo=extern)– Document lists (for each keyword)– Hit lists (for each keyword)
• Everything is laid out linearly, good for IO
![Page 15: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/15.jpg)
How searching works
• For each local index– Build a list of candidates (documents that
satisfy the full-text query)– Filter (the analogy is WHERE)– Rank (compute the documents’ relevance
values)– Sort (the analogy is ORDER BY)– Group (the analogy is GROUP BY)
• Merge the results from all the local indexes
![Page 16: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/16.jpg)
1. Searching cost
• Building the candidates list– 1 keyword = 1+ IO (document list)– Boolean operations on document lists– Cost is proportional (~) to the lists lengths– That is, to the sum of all the keyword
frequencies– In case of phrase/proximity/etc search,
there also will be operations on hit lists – approx. 2x IO/CPU
![Page 17: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/17.jpg)
1. Searching cost
• Building the candidates list– 1 keyword = 1+ IO (document list)– Boolean operations on document lists– Cost is proportional (~) to the lists lengths– That is, to the sum of all the keyword
frequencies– In case of phrase/proximity/etc search, there also
are operations on hit lists – approx. 2x IO/CPU
• Bottom line – “The Who” are really bad
![Page 18: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/18.jpg)
2. Filtering cost• docinfo=inline
– Attributes are inlined in the document lists– ALL the values are duplicated MANY times! – Immediately accessible after disk read
• docinfo=extern– Attributes are stored in a separate list (file)– Fully cached in RAM– Hashed by docid + binary search
• Simple loop over all filters• Cost ~ number of candidates and filters
![Page 19: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/19.jpg)
3. Ranking cost
• Direct – depends on the ranker– To account for keyword positions –
• Helps the relevancy• But costs extra resources – double impact!
• Cost ~ number of results• Most expensive – phrase proximity + BM25• Most cheap – none (weight=1)
• Indirect – induced in the sorting
![Page 20: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/20.jpg)
4. Sorting cost
• Cost ~ number of results• Also depends on the sorting criteria
(documents will be supplied in @id asc order)• Also depends on max_matches• The more the max, the worse the server feels• 1-10K is acceptable, 100K is way too much• 10-20 is not enough (makes little sense)
![Page 21: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/21.jpg)
5. Grouping cost
• Grouping is internally a kind of sorting• Cost affected by the number of results,
too• Cost affected by max_matches, too• Additionally, max_matches setting
affects @count and @distinct precision
![Page 22: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/22.jpg)
Chapter 2. Optimizing things
![Page 23: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/23.jpg)
How to optimize queries
• Partitioning the data
• Choosing ranking vs. sorting mode• Filters vs. keywords• Filters vs. manual MTF• Multi queries
![Page 24: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/24.jpg)
How to optimize queries
• Partitioning the data
• Choosing ranking vs. sorting mode• Filters vs. keywords• Filters vs. manual MTF• Multi queries
• Last line of defense – Three Big Buttons
![Page 25: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/25.jpg)
1. Partitioning the data
• Swiss army knife, for different tasks• Bound by indexing time?
– Partition, re-index the recent changes only
• Bound by filtering?– Partition, search the needed indexes only
• Bound by CPU/HDD?– Partition, move out to different
cores/HDDs/boxes
![Page 26: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/26.jpg)
1a. Partitioning vs. indexing
• Vital to keep the balance right• Under-partition – and indexing will be
slow• Over-partition – and searching will be
slow• 1-10 indexes – work reasonably well• Some users are fine with 50+ (30+24...)• Some users are fine with 2000+ (!!!)
![Page 27: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/27.jpg)
1b. Partitioning vs. filtering
• Totally, 100% dependent on production query statistics– Analyze your very own production logs– Add comments if needed (3rd arg to Query())
• Justified only if the amount of processed data is going to decrease significantly– Move out last week’s documents – yes– Move out English-only documents – no (!)
![Page 28: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/28.jpg)
1c. Partitioning vs. CPU/HDD
• Use a distributed index, explicitly map the chunks to physical devices
• Point searchd “at itself” –
index dist1{
type = distributedlocal = chunk01agent = localhost:3312:chunk02agent = localhost:3312:chunk03agent = localhost:3312:chunk04
}
![Page 29: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/29.jpg)
1c. How to find CPU/HDD bottlenecks
• Three standard tools– vmstat – what’s the CPU busy with? how busy is it?– oprofile – specifically who eats the CPU?– iostat – how busy is the HDD?
• Also use logs, also use searchd --iostats option• Normally everything is clear (us/sy/bi/bo…),
but!• Caveat – HDD might be iops bound• Caveat – CPU load from Sphinx might be
induced and “hidden” in sy
![Page 30: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/30.jpg)
2. Ranking
• Can now be very different (so called rankers in extended2 mode)
• Default ranker – phrase+BM25, accounts for keyword positions – not for free
• Sometimes it’s ok to use simpler ranker• Sometimes @weight is ignored at all
совсем (searching for ipod, sorting by price)
• Sometimes you can save on ranker
![Page 31: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/31.jpg)
3. Filters vs. keywords• Well-known trick
– When indexing, add a special, fake keyword to the document (_authorid123)
– When searching, add it to the query• Obvious questions
– What’s faster, what’s better?• Simple answer
– Count the change before moving away from the cashier
![Page 32: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/32.jpg)
3. Filters vs. keywords• Cost of searching ~ keyword frequencies• Cost of filtering ~ number of candidates• Searching – CPU+IO, filtering – CPU only
• Fake keyword frequency= filter value selectivity
• Frequent value + few candidates → bad!• Rare value + many candidates → good!
![Page 33: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/33.jpg)
4. Filters vs. manual MTF
• Filters are looped over sequentially• In the order specified by the app!
• Narrowest filter – better at the start• Widest filter – better at the end
• Does not matter if you use fake keywords• Exercise to the reader – why?
![Page 34: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/34.jpg)
5. Multi-queries
• Any queries can be sent together in a batch
• Always saves on network roundtrip• Sometimes allows the optimizer to trigger
• Especially important and frequent case –different sorting/grouping modes
• 2x+ optimization for “faceted” searches
![Page 35: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/35.jpg)
5. Multi-queries
$client = new SphinxClient ();$q = “laptop”; // coming from website user
$client->SetSortMode ( SPH_SORT_EXTENDED, “@weight desc”);$client->AddQuery ( $q, “products” );
$client->SetGroupBy ( SPH_GROUPBY_ATTR, “vendor_id” );$client->AddQuery ( $q, “products” );
$client->ResetGroupBy ();$client->SetSortMode ( SPH_SORT_EXTENDED, “price asc” );$client->SetLimit ( 0, 10 );
$result = $client->RunQueries ();
![Page 36: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/36.jpg)
6. Three Big Buttons
• If nothing else helps…• Cutoff (см. SetLimits())
– Forcibly stops searching after first N matches– Per-index, not overall
• MaxQueryTime (см. SetMaxQueryTime())– Forcibly stops searching after M milli-seconds– Per-index, not overall
![Page 37: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/37.jpg)
6. Three Big Buttons
• If nothing else helps…• Consulting
– We can notice the unnoticed– We can implement the unimplemented
![Page 38: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/38.jpg)
Chapter 3. Parallelization sample
![Page 39: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/39.jpg)
Combat mission
• Got ~160M cross-links• Needed misc reports (by
domains→groupby)*************************** 1. row *************************** domain_id: 440682 link_id: 15 url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750 url_to: http://xbox360achievements.org/content/view/101/114/ anchor: NULL from_site_id: 9835 from_forum_id: 1818 from_author_id: 282 from_message_id: 2586message_published: 2006-09-30 00:00:00 ...
![Page 40: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/40.jpg)
Tackling – one
• Partitioned the data• 8 boxes, 4x CPU, ~5M links per CPU
• Used Sphinx• In theory, we could had used MySQL• It practice, way too complicated
– Would had resulted in 15-20M+ rows/CPU– Would had resulted in “manual” aggregation
code
![Page 41: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/41.jpg)
Tackling – two
• Extracted “interesting parts” of the URL when indexing, using an UDF
• Replaced the SELECT with full-text query*************************** 1. row *************************** url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750urlize(url_from,0): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php insidegamer$nl$forum$viewtopic.php$t=40750urlize(url_from,1): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php
![Page 42: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/42.jpg)
Tackling – three
• 64 indexes– 4 searchd instances per box, by CPU/HDD
count– 2 indexes (main+delta) per CPU
• All searched in parallel– Web box queries the main instance at each box– Main instance queries itself and other 3 copies– Using 4 instances, because of startup/update– Using plain HDDs, because of IO stepping
![Page 43: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/43.jpg)
Results
• The precision is acceptable• “Rare” domains – precise results• “Frequent” domains – precision within 0.5%
• Average query time – 0.125 sec• 90% queries – under 0.227 sec• 95% queries – under 0.352 sec• 99% queries – under 2.888 sec
![Page 44: Phpconf2008 Sphinx En](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b6bdc94a79593e4f8b4786/html5/thumbnails/44.jpg)
The end