phpconf2008 sphinx en

The riddles of the Sphinx

Full-text engine anatomy atlas

Who are you?

• Sphinx – FOSS full-text search engine

Who are you?

• Sphinx – FOSS full-text search engine• Good at playing ball

Who are you?

• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball

Who are you?

• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball• Good at passing the ball to a team-mate

Who are you?

• Sphinx – FOSS full-text search engine• Good at playing ball• Good at not playing ball• Good at passing the ball to a team-mate• Good at many other “inferior” games

– “faceted” search, geosearch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives

What are you here for?

• What will not be covered?– No entry-level “what’s that Sphinx and

what’s in it for me” overview– No long quotes from the documentation– No C++ architecture details

What are you here for?

• What will not be covered?– No entry-level “what’s that Sphinx and

what’s in it for me” overview– No long quotes from the documentation– No C++ architecture details

• What will be?– How does it generally work inside– How things can be optimized– How things can be parallelized

Chapter 1. Engine insides

Total workflow

• Indexing first• Searching second

Total workflow

• Indexing first• Searching second

• There are data sources (what to fetch, where from)

• There are indexes– What data sources to index– How to process the incoming text– Where to put the results

How indexing works

• In two acts, with an intermission• Phase 1 – collecting documents

– Fetch the documents (loop over the sources)– Split the documents into words– Process the words (morphology, *fixes)– Replace the words with their wordid’s

(CRC32/64)– Emit a number of temp files

How indexing works

• Phase 2 – sorting hits– Hit (occurrence) is a (docid,wordid,wordpos) record– Input is a number of partially sorted (by wordid) hit

lists– The incoming lists are merge-sorted– Output is essentially a single fully sorted hit list

• Intermezzo– Collect and sort MVA values– Sort ordinals– Sort extern attributes

Dumb & dumber

• The index format is… simple• Several sorted lists

– Dictionary (the complete list of wordid’s)– Attributes (only if docinfo=extern)– Document lists (for each keyword)– Hit lists (for each keyword)

• Everything is laid out linearly, good for IO

How searching works

• For each local index– Build a list of candidates (documents that

satisfy the full-text query)– Filter (the analogy is WHERE)– Rank (compute the documents’ relevance

values)– Sort (the analogy is ORDER BY)– Group (the analogy is GROUP BY)

• Merge the results from all the local indexes

1. Searching cost

• Building the candidates list– 1 keyword = 1+ IO (document list)– Boolean operations on document lists– Cost is proportional (~) to the lists lengths– That is, to the sum of all the keyword

frequencies– In case of phrase/proximity/etc search,

there also will be operations on hit lists – approx. 2x IO/CPU

1. Searching cost

• Building the candidates list– 1 keyword = 1+ IO (document list)– Boolean operations on document lists– Cost is proportional (~) to the lists lengths– That is, to the sum of all the keyword

frequencies– In case of phrase/proximity/etc search, there also

are operations on hit lists – approx. 2x IO/CPU

• Bottom line – “The Who” are really bad

2. Filtering cost• docinfo=inline

– Attributes are inlined in the document lists– ALL the values are duplicated MANY times! – Immediately accessible after disk read

• docinfo=extern– Attributes are stored in a separate list (file)– Fully cached in RAM– Hashed by docid + binary search

• Simple loop over all filters• Cost ~ number of candidates and filters

3. Ranking cost

• Direct – depends on the ranker– To account for keyword positions –

• Helps the relevancy• But costs extra resources – double impact!

• Cost ~ number of results• Most expensive – phrase proximity + BM25• Most cheap – none (weight=1)

• Indirect – induced in the sorting

4. Sorting cost

• Cost ~ number of results• Also depends on the sorting criteria

(documents will be supplied in @id asc order)• Also depends on max_matches• The more the max, the worse the server feels• 1-10K is acceptable, 100K is way too much• 10-20 is not enough (makes little sense)

5. Grouping cost

• Grouping is internally a kind of sorting• Cost affected by the number of results,

too• Cost affected by max_matches, too• Additionally, max_matches setting

affects @count and @distinct precision

Chapter 2. Optimizing things

How to optimize queries

• Partitioning the data

• Choosing ranking vs. sorting mode• Filters vs. keywords• Filters vs. manual MTF• Multi queries

How to optimize queries

• Partitioning the data

• Choosing ranking vs. sorting mode• Filters vs. keywords• Filters vs. manual MTF• Multi queries

• Last line of defense – Three Big Buttons

1. Partitioning the data

• Swiss army knife, for different tasks• Bound by indexing time?

– Partition, re-index the recent changes only

• Bound by filtering?– Partition, search the needed indexes only

• Bound by CPU/HDD?– Partition, move out to different

cores/HDDs/boxes

1a. Partitioning vs. indexing

• Vital to keep the balance right• Under-partition – and indexing will be

slow• Over-partition – and searching will be

slow• 1-10 indexes – work reasonably well• Some users are fine with 50+ (30+24...)• Some users are fine with 2000+ (!!!)

1b. Partitioning vs. filtering

• Totally, 100% dependent on production query statistics– Analyze your very own production logs– Add comments if needed (3rd arg to Query())

• Justified only if the amount of processed data is going to decrease significantly– Move out last week’s documents – yes– Move out English-only documents – no (!)

1c. Partitioning vs. CPU/HDD

• Use a distributed index, explicitly map the chunks to physical devices

• Point searchd “at itself” –

index dist1{

type = distributedlocal = chunk01agent = localhost:3312:chunk02agent = localhost:3312:chunk03agent = localhost:3312:chunk04

}

1c. How to find CPU/HDD bottlenecks

• Three standard tools– vmstat – what’s the CPU busy with? how busy is it?– oprofile – specifically who eats the CPU?– iostat – how busy is the HDD?

• Also use logs, also use searchd --iostats option• Normally everything is clear (us/sy/bi/bo…),

but!• Caveat – HDD might be iops bound• Caveat – CPU load from Sphinx might be

induced and “hidden” in sy

2. Ranking

• Can now be very different (so called rankers in extended2 mode)

• Default ranker – phrase+BM25, accounts for keyword positions – not for free

• Sometimes it’s ok to use simpler ranker• Sometimes @weight is ignored at all

совсем (searching for ipod, sorting by price)

• Sometimes you can save on ranker

3. Filters vs. keywords• Well-known trick

– When indexing, add a special, fake keyword to the document (_authorid123)

– When searching, add it to the query• Obvious questions

– What’s faster, what’s better?• Simple answer

– Count the change before moving away from the cashier

3. Filters vs. keywords• Cost of searching ~ keyword frequencies• Cost of filtering ~ number of candidates• Searching – CPU+IO, filtering – CPU only

• Fake keyword frequency= filter value selectivity

• Frequent value + few candidates → bad!• Rare value + many candidates → good!

4. Filters vs. manual MTF

• Filters are looped over sequentially• In the order specified by the app!

• Narrowest filter – better at the start• Widest filter – better at the end

• Does not matter if you use fake keywords• Exercise to the reader – why?

5. Multi-queries

• Any queries can be sent together in a batch

• Always saves on network roundtrip• Sometimes allows the optimizer to trigger

• Especially important and frequent case –different sorting/grouping modes

• 2x+ optimization for “faceted” searches

5. Multi-queries

$client = new SphinxClient ();$q = “laptop”; // coming from website user

$client->SetSortMode ( SPH_SORT_EXTENDED, “@weight desc”);$client->AddQuery ( $q, “products” );

$client->SetGroupBy ( SPH_GROUPBY_ATTR, “vendor_id” );$client->AddQuery ( $q, “products” );

$client->ResetGroupBy ();$client->SetSortMode ( SPH_SORT_EXTENDED, “price asc” );$client->SetLimit ( 0, 10 );

$result = $client->RunQueries ();

6. Three Big Buttons

• If nothing else helps…• Cutoff (см. SetLimits())

– Forcibly stops searching after first N matches– Per-index, not overall

• MaxQueryTime (см. SetMaxQueryTime())– Forcibly stops searching after M milli-seconds– Per-index, not overall

6. Three Big Buttons

• If nothing else helps…• Consulting

– We can notice the unnoticed– We can implement the unimplemented

Chapter 3. Parallelization sample

Combat mission

• Got ~160M cross-links• Needed misc reports (by

domains→groupby)*************************** 1. row *************************** domain_id: 440682 link_id: 15 url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750 url_to: http://xbox360achievements.org/content/view/101/114/ anchor: NULL from_site_id: 9835 from_forum_id: 1818 from_author_id: 282 from_message_id: 2586message_published: 2006-09-30 00:00:00 ...

Tackling – one

• Partitioned the data• 8 boxes, 4x CPU, ~5M links per CPU

• Used Sphinx• In theory, we could had used MySQL• It practice, way too complicated

– Would had resulted in 15-20M+ rows/CPU– Would had resulted in “manual” aggregation

code

Tackling – two

• Extracted “interesting parts” of the URL when indexing, using an UDF

• Replaced the SELECT with full-text query*************************** 1. row *************************** url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750urlize(url_from,0): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php insidegamer$nl$forum$viewtopic.php$t=40750urlize(url_from,1): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php

Tackling – three

• 64 indexes– 4 searchd instances per box, by CPU/HDD

count– 2 indexes (main+delta) per CPU

• All searched in parallel– Web box queries the main instance at each box– Main instance queries itself and other 3 copies– Using 4 instances, because of startup/update– Using plain HDDs, because of IO stepping

Results

• The precision is acceptable• “Rare” domains – precise results• “Frequent” domains – precision within 0.5%

• Average query time – 0.125 sec• 90% queries – under 0.227 sec• 95% queries – under 0.352 sec• 99% queries – under 2.888 sec

The end

phpconf2008 sphinx en

Technology

sorting cost cost

document lists cost

ball good

filters cost

text search engine good

hit lists

cost grouping

extern document lists