Optimising Xapian, Richard Boulton, UKUUG, Birmingham, 9th August 2009


DESCRIPTION

Talk given to UKUUG on 9th August 2009 about the Xapian search engine, and some of the experiences I've had trying to optimise its design and implementation.

TRANSCRIPT

Page 1: Optimising Xapian

Optimising Xapian

Richard Boulton

UKUUG, Birmingham, 9th August 2009

Page 2: Optimising Xapian

Xapian is a search engine

Page 3: Optimising Xapian

Xapian is a search engine, or rather an information retrieval toolkit

Page 4: Optimising Xapian

Index my stuff

Page 5: Optimising Xapian

Find things

Page 6: Optimising Xapian

… quickly

Page 7: Optimising Xapian

Arbitrary Boolean restrictions

Correct spelling mistaks

Suggest search completions

Find similar things

Browse facets of things

Find nearby things

Get diverse results

Find images which look similar

Different sort orders

Arbitrary weight influences

Page 8: Optimising Xapian

Mature

10 year old vintage

Page 9: Optimising Xapian

Not going to talk about...

Scaling across multiple machines

Big topic – go to a cloud talk.

Optimising ranking of results

Huge topic – go to an IR conference!

Optimising specific installations

Filesystems, hardware specification, SSDs etc

Page 10: Optimising Xapian

Am going to talk about...

Two types of optimisation

Algorithms

Implementation

Page 11: Optimising Xapian

Making the most of hardware

Single machine

Limited memory

Database on a slow disk

Page 12: Optimising Xapian

Requirements

Given a set of documents

terms and frequencies

And a set of queries

terms, frequencies and operators

Find the best matches

Page 13: Optimising Xapian

Analysing the problem

Do as much work at indexing time as possible

Precalculated searches?

Can't precalculate everything...

Calculate all single-term queries

Page 14: Optimising Xapian

Stored data

Posting lists:

Felt 1 6 8

Pens 3 6 7 9
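A minimal sketch of how posting lists like these could be built at indexing time. The function name and the toy collection are hypothetical, chosen only so that the output reproduces the two lists on the slide; this is not Xapian's storage format.

```python
from collections import defaultdict

def build_posting_lists(documents):
    """Map each term to a sorted list of the ids of documents containing it."""
    postings = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so a term repeated within a document is listed only once
        for term in sorted(set(documents[doc_id])):
            postings[term].append(doc_id)
    return dict(postings)

# Hypothetical toy collection: doc_id -> terms in that document.
docs = {
    1: ["felt"], 3: ["pens"], 6: ["felt", "pens"],
    7: ["pens"], 8: ["felt"], 9: ["pens"],
}
posting_lists = build_posting_lists(docs)
```

Because documents are visited in increasing id order, each list comes out sorted, which is what the later merging and skipping steps rely on.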

Page 15: Optimising Xapian

Single term search

Read a posting list

Remember the best

Felt 1 6 8

Pens 3 6 7 9
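"Read a posting list, remember the best" can be sketched with a bounded heap. This assumes each posting carries a precomputed weight; the weights used in the example are made up for illustration.

```python
import heapq

def single_term_search(posting_list, n):
    """Return the ids of the n highest-weighted documents, best first.

    posting_list: iterable of (doc_id, weight) pairs in doc_id order.
    """
    best = []  # min-heap of (weight, doc_id); never holds more than n entries
    for doc_id, weight in posting_list:
        if len(best) < n:
            heapq.heappush(best, (weight, doc_id))
        elif weight > best[0][0]:
            # Beats the weakest result kept so far, so it takes its place.
            heapq.heapreplace(best, (weight, doc_id))
    return [doc_id for weight, doc_id in sorted(best, reverse=True)]

# Hypothetical weights for the "Pens" list.
pens = [(3, 0.5), (6, 2.0), (7, 1.0), (9, 1.5)]
top_two = single_term_search(pens, 2)
```

Memory use depends on n, not on the length of the posting list.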

Page 16: Optimising Xapian

AND search

Naive approach:

Read first list.

Hold it in memory: 1 6 8

Read next list

Merge it in: 1 3 6 7 8 9

Select the best

Page 17: Optimising Xapian

AND search

Problem – limited by amount of memory

Problem – no way to avoid reading all of the list

Page 18: Optimising Xapian

Better AND search

Read lists in parallel

Start with the shortest

Jump forward in second list to keep up with first

Keep only the best N items
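The parallel read with skipping might look like the sketch below. PostingCursor is a hypothetical in-memory stand-in for an on-disk list reader; a binary search plays the role of the jump forward. Keeping only the best N items is left out to keep the example short.

```python
from bisect import bisect_left

class PostingCursor:
    """Hypothetical cursor over a sorted list of doc ids."""
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.pos = 0

    def at_end(self):
        return self.pos >= len(self.doc_ids)

    def current(self):
        return self.doc_ids[self.pos]

    def next(self):
        self.pos += 1

    def skip_to(self, target):
        # Jump to the first entry >= target without visiting those in between.
        self.pos = bisect_left(self.doc_ids, target, self.pos)

def and_search(list_a, list_b):
    """Return doc ids present in both sorted lists."""
    shorter, longer = sorted((list_a, list_b), key=len)
    lead, follow = PostingCursor(shorter), PostingCursor(longer)
    matches = []
    while not lead.at_end():
        follow.skip_to(lead.current())
        if follow.at_end():
            break
        if follow.current() == lead.current():
            matches.append(lead.current())
            lead.next()
        else:
            # The follower overshot, so let the leader catch up.
            lead.skip_to(follow.current())
    return matches

both = and_search([1, 6, 8], [3, 6, 7, 9])
```

Neither list is held in memory as a whole, and entries that cannot match are never read.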

Page 20: Optimising Xapian

OR search

Read lists in parallel

But, unlike AND, we can't skip items

So … make it into an AND

How?

Page 21: Optimising Xapian

OR search

ASSUMPTION: we only want the top few results

Track only those

Keep track of the lowest weight of those

Also, calculate upper bound on weight of each term

When each term's upper bound < lowest weight, a document needs both terms to reach the top results, so the OR becomes an AND
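A simplified two-term sketch of this idea. It assumes n >= 1 and that a document's score is the sum of its term weights. The upper bounds are computed here by scanning the lists, whereas a real index would have them stored. Once in AND mode this version still steps through the lists one entry at a time and merely avoids scoring; a real matcher would skip forward instead.

```python
import heapq

def or_search_top_n(list_a, list_b, n):
    """Each list is [(doc_id, weight), ...] sorted by doc_id; n must be >= 1.

    Returns (top doc ids, best first; number of documents scored).
    """
    # Upper bounds on what each term alone can contribute (assumed stored).
    max_a = max((w for _, w in list_a), default=0.0)
    max_b = max((w for _, w in list_b), default=0.0)
    best = []  # min-heap of (weight, doc_id), at most n entries
    scored = 0
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        threshold = best[0][0] if len(best) == n else None
        # If neither term alone can beat the weakest kept result,
        # only documents containing both terms still matter.
        and_mode = (threshold is not None
                    and max_a < threshold and max_b < threshold)
        a = list_a[i][0] if i < len(list_a) else None
        b = list_b[j][0] if j < len(list_b) else None
        if and_mode and (a is None or b is None):
            break  # one list is exhausted: nothing left can contain both
        if a is not None and (b is None or a < b):
            doc_id, weight, has_both = a, list_a[i][1], False
            i += 1
        elif b is not None and (a is None or b < a):
            doc_id, weight, has_both = b, list_b[j][1], False
            j += 1
        else:
            doc_id, weight, has_both = a, list_a[i][1] + list_b[j][1], True
            i += 1
            j += 1
        if and_mode and not has_both:
            continue  # cannot reach the top n, so do not score it
        scored += 1
        if len(best) < n:
            heapq.heappush(best, (weight, doc_id))
        elif weight > best[0][0]:
            heapq.heapreplace(best, (weight, doc_id))
    return [d for w, d in sorted(best, reverse=True)], scored

# Hypothetical weighted posting lists.
term_a = [(1, 1.0), (2, 1.5), (3, 1.0), (5, 0.5), (8, 1.0)]
term_b = [(1, 2.0), (2, 2.0), (4, 1.0), (5, 2.0)]
top, scored = or_search_top_n(term_a, term_b, 2)
```

In this example the lists cover six distinct documents, but only three are scored: after documents 1 and 2 fill the top two, every later single-term document is passed over.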

Page 22: Optimising Xapian

Taking it further

Can apply this idea across whole query tree

Can introduce other operators – AND_MAYBE

Phrase queries

AND, followed by checking positions

Or, store pairs of adjacent terms, and then check positions

Or, store certain pairs...
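The first option, an AND followed by a position check, could be sketched as follows. It assumes a hypothetical positional index mapping each document id to the positions at which a term occurs, and handles two-word phrases only.

```python
def phrase_matches(positions_a, positions_b):
    """Return ids of documents where term B directly follows term A.

    positions_a, positions_b: dict of doc_id -> list of word positions.
    """
    matches = []
    # AND step: only documents containing both terms are candidates.
    for doc_id in sorted(positions_a.keys() & positions_b.keys()):
        following = set(positions_b[doc_id])
        # Position check: some occurrence of A has B at the next position.
        if any(p + 1 in following for p in positions_a[doc_id]):
            matches.append(doc_id)
    return matches

# Hypothetical position data for the phrase "felt pens".
felt = {1: [0], 6: [2, 7], 8: [4]}
pens = {3: [1], 6: [3], 7: [1], 9: [0]}
hits = phrase_matches(felt, pens)
```

Storing adjacent pairs as terms in their own right would replace the position check with a plain posting list lookup, at the cost of a larger index.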

Page 23: Optimising Xapian

Does it work?

Page 24: Optimising Xapian

YES

Page 25: Optimising Xapian
Page 26: Optimising Xapian
Page 27: Optimising Xapian

Implementation

… not a small job

Datastructures

Compression techniques

Micro-optimisations

Page 28: Optimising Xapian

Datastructures

Assumption – too much data to fit it all in memory

Disks are slow

But faster when reading in chunks

B+-trees – traditional but good

Block structured, massively branching tree – very shallow
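A quick worked calculation of why heavy branching keeps the tree shallow. The branching factor of 256 is an illustrative assumption, not a Xapian figure.

```python
def tree_height(num_entries, branching_factor):
    """Smallest number of levels whose capacity covers num_entries."""
    height, capacity = 1, branching_factor
    while capacity < num_entries:
        capacity *= branching_factor
        height += 1
    return height

binary_levels = tree_height(10**9, 2)    # a binary tree over a billion keys
wide_levels = tree_height(10**9, 256)    # hypothetical 256-way branching
```

With these numbers a lookup touches 4 blocks rather than 30 nodes, and each block is read from disk as one chunk.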

Page 29: Optimising Xapian

Posting list chunks

Store posting lists in chunks

Work out what statistics to store, where

Get tighter bounds on possible weights, so we can skip better
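One way to picture per-chunk statistics: each chunk records the largest weight it contains, so whole chunks can be skipped once the matcher knows the weight it has to beat. The chunk layout and field names below are hypothetical.

```python
def make_chunks(postings, chunk_size):
    """Split [(doc_id, weight), ...] into chunks with a stored max weight."""
    chunks = []
    for start in range(0, len(postings), chunk_size):
        part = postings[start:start + chunk_size]
        chunks.append({"max_weight": max(w for _, w in part),
                       "postings": part})
    return chunks

def candidates_above(chunks, min_weight):
    """Return (doc ids with weight > min_weight, number of chunks read)."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max_weight"] <= min_weight:
            continue  # nothing in this chunk can beat min_weight
        chunks_read += 1
        hits.extend(d for d, w in chunk["postings"] if w > min_weight)
    return hits, chunks_read

# Hypothetical weighted posting list, two entries per chunk.
postings = [(1, 0.2), (2, 0.3), (3, 0.9), (4, 0.1), (5, 0.2), (6, 0.1)]
chunks = make_chunks(postings, 2)
hits, chunks_read = candidates_above(chunks, 0.5)
```

A bound per chunk is tighter than a single bound for the whole list, so more of the list can be skipped.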

Page 30: Optimising Xapian

Document length

Needed for weight calculation

Store it in each posting list – duplicated, but no side lookup

Or store it only once?

Currently, we store it in all posting lists

New backend stores it only once → 40% smaller!

But, currently 10 times slower :(

Page 31: Optimising Xapian

Measurements

Page 32: Optimising Xapian
Page 33: Optimising Xapian
Page 34: Optimising Xapian
Page 35: Optimising Xapian

New problems

We often have enough memory these days!

500M = A huge collection 10 years ago, now only medium

10M = A large collection 10 years ago, now small – will often fit fully in memory

=> IO less of a bottleneck – optimise CPU

Page 36: Optimising Xapian

New problems

Faceted search

Display information about all the items in the result set

=> Have to calculate the whole result set!

Or – approximate

Or – precalculate the facet values somehow
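A small sketch of the exact and approximate options. The approximation shown, counting facets over only the first few results, is one possible choice made for illustration; the facet values are made up.

```python
from collections import Counter

def facet_counts(result_doc_ids, facet_values, sample_size=None):
    """Count facet values over the results, or over the first sample_size."""
    if sample_size is None:
        ids = result_doc_ids                 # exact: needs every result
    else:
        ids = result_doc_ids[:sample_size]   # approximate: top results only
    return Counter(facet_values[doc_id] for doc_id in ids)

# Hypothetical facet data: doc_id -> colour.
colours = {1: "red", 2: "blue", 3: "red", 4: "green"}
results = [3, 1, 2, 4]  # ranked result set
exact = facet_counts(results, colours)
approx = facet_counts(results, colours, sample_size=2)
```

The exact version is what defeats early termination: it cannot return until the full result set is known.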

Page 37: Optimising Xapian

New problems

Bias results with external weights

Page rank / product rank

Fixed weights – so store documents in decreasing weight order – lets us finish early

But – harder to update dynamically
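A sketch of finishing early, under the simplifying assumption that results are ranked purely by the fixed weight (a real system would combine it with the query weight). The document ids and the match set are made up.

```python
def top_n_by_static_weight(docs_in_weight_order, matches, n):
    """Return (first n matching doc ids, number of documents examined).

    docs_in_weight_order: doc ids sorted by decreasing fixed weight.
    matches: function returning True if a doc id matches the query.
    """
    results, examined = [], 0
    for doc_id in docs_in_weight_order:
        examined += 1
        if matches(doc_id):
            results.append(doc_id)
            if len(results) == n:
                break  # everything after this has a lower fixed weight
    return results, examined

# Hypothetical storage order and set of matching documents.
order = [42, 7, 19, 3, 88, 5]
matching = {7, 3, 5, 88}
results, examined = top_n_by_static_weight(order, matching.__contains__, 2)
```

The cost shows up on update: changing one document's fixed weight means moving it within the stored order.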

Page 38: Optimising Xapian
Page 39: Optimising Xapian

Geolocation

Bias results by distance from a location

Generate hierarchies of terms

HTM (Hierarchical Triangular Mesh) is the easiest way to implement this

Use to restrict candidates

Combine candidates with dynamically calculated weight
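To illustrate hierarchies of terms, the sketch below uses a plain latitude/longitude quadtree instead of HTM, because it is shorter to write down. The "G" term prefix and the function name are hypothetical.

```python
def cell_terms(lat, lon, levels):
    """Return one term per level, each naming a cell nested in the previous."""
    south, north, west, east = -90.0, 90.0, -180.0, 180.0
    path = ""
    terms = []
    for _ in range(levels):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        path += str(quadrant)
        terms.append("G" + path)  # a longer path names a smaller cell
    return terms

terms = cell_terms(52.2, 0.12, 3)
```

Each document is indexed under all of its cell terms, so a query can restrict candidates with a term at whichever level matches the search radius. Points on either side of a cell boundary share no fine-level term, which is why the exact distance is still calculated for the candidates.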

Page 40: Optimising Xapian

Image similarity

Terms representing features

Queries with hundreds of terms

Current optimisations help

… but distribution of frequencies and weights is less amenable to early termination.

Page 41: Optimising Xapian

Variety

Strict relevance order leads to duplication

Similar items get similar scores

Usually want to present a selection of results

Order based on combination of novelty and relevance

Score depends on earlier documents

=> our early termination doesn't work
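A greedy reranking sketch shows why the score depends on earlier documents. The scoring rule (relevance minus a penalty times the greatest similarity to anything already chosen) and the penalty value are assumptions made for illustration.

```python
def diverse_order(relevance, similarity, n, penalty=0.5):
    """Pick up to n documents, trading relevance against novelty.

    relevance: dict of doc -> relevance score.
    similarity: function (doc, doc) -> value in [0, 1].
    """
    chosen = []
    remaining = set(relevance)
    while remaining and len(chosen) < n:
        def adjusted(doc):
            # Novelty is judged against the results already selected.
            overlap = max((similarity(doc, c) for c in chosen), default=0.0)
            return relevance[doc] - penalty * overlap
        best = max(sorted(remaining), key=adjusted)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical scores: "a" and "b" are near-duplicates.
relevance = {"a": 1.0, "b": 0.95, "c": 0.6}
def similarity(x, y):
    return 1.0 if {x, y} == {"a", "b"} else 0.0

order = diverse_order(relevance, similarity, 3)
```

Here "c" overtakes "b" although it is less relevant. Since a document's adjusted score is unknown until the earlier picks are made, no fixed per-term upper bound can rule it out in advance.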

Page 42: Optimising Xapian

http://searchevent.org/

“A day of informal presentations, open discussion and hacking on open source search technologies.”

Tuesday 29th September 2009

Friends Meeting House, Cambridge, UK

Learn more

Page 43: Optimising Xapian

Questions

Xapian: http://xapian.org/

Me: [email protected]

Photo credits:

http://www.flickr.com/photos/striatic/729822/

http://www.flickr.com/photos/stephmcg/1592886057/

http://www.flickr.com/photos/dullhunk/3389581452/

http://www.flickr.com/photos/katielips/3367600309/