Search Engines 1 TDT4125 Algoritmekonstruksjon, Spring 2011
Øystein Torbjørnsen Microsoft® Development Center Norway
Outline
• Inverted index
• Constructing inverted indexes
• Compression
• Succinct index (Holger Bast)
• Hierarchical inverted indexes
• Skip lists
Inverted index
[Figure: the dictionary (a, cal, dark, darker, drill, excellent, zebra) points into the posting file; each posting list entry holds docid, frequency, and position list.]
Inverted index
• Posting list is sorted on docid
• Usually 2 disk IOs to look up one term, O(1)
– One to read the dictionary entry
– One to read the posting list (possibly large)
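The dictionary and posting-list layout can be sketched in memory as follows. This is a minimal illustration, not the on-disk format from the lecture; the function name `build_index` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a posting list of (docid, frequency, positions)."""
    index = defaultdict(list)
    for docid in sorted(docs):          # visiting docids in order keeps each list sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[docid].lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((docid, len(plist), plist))
    return dict(index)

# Hypothetical sample documents, keyed by docid.
docs = {1: "a dark zebra", 2: "dark dark drill", 3: "excellent zebra"}
index = build_index(docs)
print(index["dark"])   # [(1, 1, [1]), (2, 2, [0, 1])]
```

Here a Python dict plays the role of the dictionary, so a term lookup is one hash probe followed by one read of the posting list, mirroring the two disk IOs above.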
Searching
• Phrase search "jens stoltenberg"
• Proximity search jens w/5 prime
• Wildcard search
– Prefix search stolt* je*s
– Postfix search *berg
– Full wildcard search *olten* *ol*be*
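A rough sketch of how these query types could use the index: phrase and proximity search compare position lists within one document, and prefix search is a range scan over the sorted dictionary. The helper names and the sample data are assumptions for illustration only; suffix and full wildcard search need additional structures and are not shown.

```python
import bisect

def phrase_match(pos_a, pos_b):
    """True if some occurrence of term B directly follows an occurrence of term A."""
    following = set(pos_b)
    return any(p + 1 in following for p in pos_a)

def within(pos_a, pos_b, k):
    """True if some occurrences of A and B are at most k positions apart."""
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)

def prefix_terms(sorted_terms, prefix):
    """All dictionary terms starting with prefix, found by two binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

# Hypothetical position lists for one document:
# "jens stoltenberg is the prime minister"
positions = {"jens": [0], "stoltenberg": [1], "prime": [4]}
print(phrase_match(positions["jens"], positions["stoltenberg"]))  # True
print(within(positions["jens"], positions["prime"], 5))           # True

terms = ["jens", "stol", "stoltenberg", "storting", "zebra"]
print(prefix_terms(terms, "stolt"))                               # ['stoltenberg']
```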
Construction
• Create sorted subfiles
• Merge the subfiles into one large file
Needs twice as much disk storage as the final index
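The two construction steps can be sketched with in-memory lists standing in for the sorted subfiles. The names `make_runs` and `merge_runs` and the run size are assumptions for the example; a real implementation would write each run to disk and stream the merge.

```python
import heapq

def make_runs(postings, run_size):
    """Split (term, docid) pairs into sorted runs of at most run_size entries."""
    return [sorted(postings[i:i + run_size])
            for i in range(0, len(postings), run_size)]

def merge_runs(runs):
    """k-way merge of the sorted runs into one sorted sequence."""
    return list(heapq.merge(*runs))

# Hypothetical (term, docid) pairs in arrival order.
postings = [("zebra", 1), ("dark", 1), ("a", 2), ("dark", 2), ("cal", 3), ("a", 3)]
runs = make_runs(postings, 2)
merged = merge_runs(runs)
print(merged[:3])   # [('a', 2), ('a', 3), ('cal', 3)]
```

During the merge both the runs and the output exist at the same time, which is where the doubled storage requirement comes from.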
Compression
• Basic idea:
– Use knowledge of value distribution to compress data
• Costly to compress and decompress, but
– Less disk IO
– More data fits in main memory
– Better locality in memory
• Many different schemes:
– Delta coding
– vByte
– PFOR-DELTA
– Huffman, Golomb, Rice, Simple9, Simple16
Delta coding
• Works on sorted lists
• Encoded as difference from previous entry
• To be combined with other compression
Values: 17 31 62 88 89 97 113 187 199
Deltas: 17 14 31 26 1 8 16 74 12
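A minimal sketch of delta coding on the docid list from the slide. The function names are chosen for the example; the first value is stored as its difference from 0.

```python
from itertools import accumulate

def delta_encode(sorted_values):
    """Replace each value by its difference from the previous one."""
    out, prev = [], 0
    for v in sorted_values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Running sum restores the original sorted values."""
    return list(accumulate(deltas))

values = [17, 31, 62, 88, 89, 97, 113, 187, 199]
deltas = delta_encode(values)
print(deltas)                 # [17, 14, 31, 26, 1, 8, 16, 74, 12]
print(delta_decode(deltas))   # [17, 31, 62, 88, 89, 97, 113, 187, 199]
```

The deltas are smaller than the original values, which is what makes a following scheme such as vByte effective.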
vByte
• Variable-byte encoding
• Using full bytes
• 1 marker bit + 7 value bits
• Fast encoding and decoding
Each byte: 1 end-marker bit + 7 value bits
0 1001100 = 76 × 128 × 128 = 1245184
0 0111001 = 57 × 128 = 7296
1 1101010 = 106 = 106
Total = 1252586
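The example can be reproduced with a short sketch. It assumes the convention shown on the slide: most significant 7-bit group first, and the marker bit set to 1 only on the last byte of a value. Other vByte variants flip either choice; the function names are illustrative.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as 7-bit groups, end marker on the last byte."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()        # most significant group first
    groups[-1] |= 0x80      # marker bit 1 = last byte of this value
    return bytes(groups)

def vbyte_decode(data):
    """Decode a byte string holding one or more vByte-coded integers."""
    values, cur = [], 0
    for b in data:
        cur = (cur << 7) | (b & 0x7F)
        if b & 0x80:        # end marker reached, emit the value
            values.append(cur)
            cur = 0
    return values

encoded = vbyte_encode(1252586)
print([format(b, "08b") for b in encoded])   # ['01001100', '00111001', '11101010']
print(vbyte_decode(encoded))                 # [1252586]
```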
PFOR-DELTA
• Combination of three techniques
– P = Patched (outliers are patched in from the exception list)
– FOR = Frame Of Reference
– DELTA = delta coding
• Blocks of e.g. 128 values
• Fixed number of bits per value
• Exception list for outliers
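A simplified sketch of the idea, assuming a frame base of 0 after delta coding and keeping slots as plain Python integers rather than packed bits. The exception layout here, a list of (index, delta) pairs, is an assumption for readability and is not the layout used in the Zukowski et al. paper.

```python
def pfor_delta_encode(sorted_values, bits):
    """Delta-code a block; deltas that fit in `bits` bits go in slots, the rest are exceptions."""
    limit = 1 << bits
    slots, exceptions = [], []
    prev = 0
    for i, v in enumerate(sorted_values):
        d = v - prev
        prev = v
        if d < limit:
            slots.append(d)
        else:
            slots.append(0)              # placeholder, patched on decode
            exceptions.append((i, d))
    return slots, exceptions

def pfor_delta_decode(slots, exceptions):
    """Patch the exceptions back in, then undo the delta coding."""
    deltas = list(slots)
    for i, d in exceptions:
        deltas[i] = d
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

block = [17, 31, 62, 88, 89, 97, 113, 187, 199]
slots, exceptions = pfor_delta_encode(block, bits=5)
print(slots)        # [17, 14, 31, 26, 1, 8, 16, 0, 12]
print(exceptions)   # [(7, 74)]
```

With 5 bits per slot only the delta 74 becomes an exception, so the common case decodes with a fixed bit width and no branching per value.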
Succinct index
• Variation of inverted index
• Index ranges of words
• Prefix and range search
• Smaller dictionary
• Longer lists to process
• Better compression
• Fewer disk IOs
– Disk positioning vs. transfer time
Hierarchical inverted indexes
• Incremental indexing
• Build vs lookup time
Never merge
• Just keep sub-files and never merge into large file
• Construction is O(n)
• Fastest possible construction time
• Slow lookup with many files O(n)
Hierarchy
[Figure: hierarchy of index files with n=3, showing Level 1, Level 2, and Level 3.]
Merging strategy
[Figure: two merging strategies, "Merge into same level" and "Merge to level above", illustrated with m=2, n=3.]
Issues
• Needs twice the space
• Merge of upper layer takes a long time
• Larger initial files lead to fewer merges
• Lookup time varies over time depending on the number of files at each level
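A small simulation of the "merge to level above" strategy. The slides do not define n precisely, so this sketch assumes n is the number of files a level may collect before they are merged into a single file one level up; `add_file` and the unit file sizes are hypothetical.

```python
def add_file(levels, size, n):
    """Add a new file at level 1; whenever a level reaches n files, merge them upward."""
    if not levels:
        levels.append([])
    levels[0].append(size)
    i = 0
    while len(levels[i]) >= n:
        merged = sum(levels[i])          # merged file holds all entries of the level
        levels[i] = []
        if i + 1 == len(levels):
            levels.append([])
        levels[i + 1].append(merged)
        i += 1
    return levels

levels = []
for _ in range(10):                      # ten initial files of size 1
    add_file(levels, 1, n=3)
print(levels)                                # [[1], [], [9]]
print(sum(len(files) for files in levels))   # 2 files to consult per lookup
```

The number of files a lookup has to consult changes with every insertion, which is the varying lookup time noted above.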
Column organization
• Field selection
– Based on query
• Phrase queries and proximity scoring need positions
• Simple boolean queries do not need position or frequency
• Relevance scoring needs frequency
– Don’t decompress what you don’t need
– Don’t read from disk what you don’t need
– Locality
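Field selection can be sketched as a function from the query to the set of columns that must be read and decompressed. The query-type strings and the name `columns_needed` are assumptions made for this example.

```python
def columns_needed(query_type, ranked):
    """Return the posting-list columns a query has to read."""
    cols = {"docid"}                       # every query needs the docids
    if ranked:
        cols.add("frequency")              # relevance scoring uses term frequency
    if query_type in ("phrase", "proximity"):
        cols.add("position")               # positions are only needed for these
    return cols

print(sorted(columns_needed("boolean", ranked=False)))   # ['docid']
print(sorted(columns_needed("phrase", ranked=True)))     # ['docid', 'frequency', 'position']
```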
More than text search
• Context info
• Meta data
• Values
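[Figure: columns stored alongside the text index. The text column holds docid, frequency, position list, and context; further columns map docid to date, size, owner, person, zip code, company, and URI. Position labels appear on three of the columns.]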
Skipping
• Search engine and skipping
– Used in merging (AND queries)
– Semi-sequential access
– Direct lookup
– Disk-based
• Skip list
• Skip list vs. B-Tree
• Variants
Skip list
• 0 < p < 1 (e.g. p=1/2 or p=1/4)
• Lookup and insertion are O(log n) expected
• Size vs. speed
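A compact in-memory skip list in the style of Pugh's paper, given as a sketch rather than a tuned implementation. The class and method names, the `max_level` cap, and the `seed` parameter are choices made for this example. A node is promoted one level further with probability p, so a smaller p gives fewer pointers (less space) and slower searches.

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level     # forward[i] = next node on level i

class SkipList:
    def __init__(self, p=0.5, max_level=16, seed=None):
        self.p = p
        self.max_level = max_level
        self.level = 1                    # number of levels currently in use
        self.head = Node(None, max_level)
        self.rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while self.rng.random() < self.p and level < self.max_level:
            level += 1
        return level

    def contains(self, key):
        node = self.head
        for i in reversed(range(self.level)):      # start at the top level
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.max_level      # last node visited per level
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        nxt = node.forward[0]
        if nxt is not None and nxt.key == key:
            return                                 # key already present
        level = self._random_level()
        self.level = max(self.level, level)
        new = Node(key, level)
        for i in range(level):                     # splice in on each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def keys(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]

sl = SkipList(p=0.5, seed=1)
for docid in [88, 17, 199, 31, 113, 62, 97, 187, 89]:
    sl.insert(docid)
print(list(sl.keys()))   # [17, 31, 62, 88, 89, 97, 113, 187, 199]
print(sl.contains(97))   # True
```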
Issues
• Compression
• Can be skewed
Skip list vs B-Tree
Skip list
• Main-memory structure
• Less space
B-Tree
• Disk-based structure
• Better locality
Variations
• Deterministic skip list
• 1 level skips
• Separate skip table
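The "separate skip table" variation can be sketched for an AND query: every step-th docid of the long posting list is copied into a small table, and the intersection jumps through that table before scanning sequentially. The function names, the step size, and the sample lists are assumptions for the example.

```python
import bisect

def build_skip_table(postings, step):
    """Sample every step-th docid together with its offset in the posting list."""
    offsets = list(range(0, len(postings), step))
    return [postings[i] for i in offsets], offsets

def skip_to(postings, skip_docids, skip_offsets, start, target):
    """Index of the first docid >= target, never moving back before start."""
    j = bisect.bisect_right(skip_docids, target) - 1   # last sample <= target
    i = start if j < 0 else max(start, skip_offsets[j])
    while i < len(postings) and postings[i] < target:  # short sequential scan
        i += 1
    return i

def intersect(short, long_list, step=8):
    """AND of two sorted docid lists, skipping through the longer one."""
    skip_docids, skip_offsets = build_skip_table(long_list, step)
    result, i = [], 0
    for docid in short:
        i = skip_to(long_list, skip_docids, skip_offsets, i, docid)
        if i == len(long_list):
            break
        if long_list[i] == docid:
            result.append(docid)
    return result

long_list = list(range(0, 200, 3))      # hypothetical long posting list
short = [9, 60, 61, 150, 198]           # hypothetical short posting list
print(intersect(short, long_list))      # [9, 60, 150, 198]
```

Keeping the table separate from the posting list means the list itself can stay compressed in blocks, with the table pointing at block starts.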
Literature
• Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Comput. Surv. 38, 2, July 2006.
• Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06).
• Holger Bast and Ingmar Weber. Type less, find more: fast autocompletion search with a succinct index. In Proceedings of the 29th annual international ACM SIGIR conference (SIGIR '06).
• William Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM 33, 6, June 1990. ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf