
Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann

LECTURE 4 INDEX COMPRESSION 10.10.2012


Overview

1. Dictionary Compression
2. Zipf's Law
3. Posting List Compression
4. Gamma Codes
5. Golomb Code
6. Index Compression in Practice


DICTIONARY COMPRESSION


Vocabulary Growth: Heaps' Law

§  Can we assume there is an upper bound on vocabulary size?
§  Not really: the vocabulary keeps growing with collection size.
§  Heaps' law: M = k · T^β
§  M is the size of the vocabulary, T is the number of tokens in the collection.
§  Typical values for the parameters: 30 ≤ k ≤ 100 and β ≈ 0.5.
§  Empirical law: in log-log space, Heaps' law is linear, the simplest possible relationship between collection size and vocabulary size.

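Heaps' law is easy to evaluate numerically. A minimal sketch (the parameter values below are illustrative picks from the typical ranges above, not fitted to any particular collection):

```python
# Heaps' law: M = k * T**beta, vocabulary size M as a function of token count T.
# k and beta are illustrative values within the typical ranges quoted above.
k, beta = 44, 0.49

def heaps(T, k=k, beta=beta):
    """Estimated vocabulary size for a collection of T tokens."""
    return k * T ** beta

for T in (10**5, 10**6, 10**8):
    print(f"T = {T:>11,}  ->  M ~ {heaps(T):,.0f}")
```

Note how sublinear the growth is: multiplying the collection size by 1000 only multiplies the estimated vocabulary by roughly 30.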

Dictionary Hash Table


[Figure: dictionary as a hash table. Each term (e.g. class, ETHZ, mountain, weather) is mapped by a fixed (known) hash function to a bucket index 0, 1, 2, ..., h; collisions are resolved with collision lists, e.g. mountain → 549283471, ETHZ → 398437231, class → 234443989, weather → 770209991. Note the storage needed for the token strings themselves.]

Dictionary as a String


Dictionary as a String with Blocking


Example: Space Estimate

Example: block size k = 4.

Without blocking we used 4 × 3 = 12 bytes for term pointers; with blocking we use 3 bytes for one pointer per block plus 4 × 1 byte to store the length of each term.

We save 12 − (3 + 4) = 5 bytes per block. Total savings: 400,000/4 × 5 bytes = 0.5 MB.

This reduces the size of the Reuters dictionary from 7.6 MB to 7.1 MB.
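The arithmetic above is easy to check; a minimal sketch with the Reuters numbers from the slide:

```python
# Space saved by storing the dictionary as a string with block size k = 4.
n_terms = 400_000          # terms in the Reuters dictionary
k = 4                      # block size (terms per block)
ptr_bytes = 3              # bytes per term pointer into the string
len_bytes = 1              # bytes to store one term's length

without_blocking = k * ptr_bytes              # 12 bytes of pointers per block
with_blocking = ptr_bytes + k * len_bytes     # 3 + 4 = 7 bytes per block
saved_per_block = without_blocking - with_blocking
total_saved_mb = (n_terms // k) * saved_per_block / 1e6

print(saved_per_block, "bytes per block,", total_saved_mb, "MB in total")
# 5 bytes per block, 0.5 MB in total: 7.6 MB shrinks to 7.1 MB.
```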

Information Retrieval, ETHZ 2012 8

Front Coding


Example: Dictionary Compression for Reuters CV1


ZIPF’S LAW


Zipf’s Law

We have characterized the growth of the vocabulary in collections with Heaps' law.

We also want to know how many frequent vs. infrequent terms we should expect in a collection.

In natural language, there are a few very frequent terms and very many very rare terms.

Zipf's law: the i-th most frequent term has frequency proportional to 1/i, i.e., cf_i ∝ 1/i, where cf_i is the collection frequency: the number of occurrences of term t_i in the collection.

Equivalent: cf_i = c · i^k, or log(cf_i) = log(c) + k · log(i) (for k = −1).

Zipf's law is an example of a power law.
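As an illustration (not from the slides), a synthetic frequency list obeying cf_i ∝ 1/i is exactly a line of slope k = −1 in log-log space:

```python
import math

c = 1_000_000                          # frequency of the most frequent term
cf = [c / i for i in range(1, 1001)]   # Zipf: cf_i proportional to 1/i

# In log-log space the points (log i, log cf_i) lie on a line of slope -1:
# here computed from the first and last rank.
slope = (math.log(cf[999]) - math.log(cf[0])) / (math.log(1000) - math.log(1))
print(round(slope, 6))   # -1.0
```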

Example: Zipf’s Law

Fit is not perfect for Reuters CV1.

What is important is the key insight:

Few frequent terms, many rare terms.


POSTING LIST COMPRESSION


Posting List Compression

The postings file is much larger than the dictionary, by a factor of at least 10.

Key desideratum: store each posting compactly. A posting, for our purposes, is a doc-id.

For Reuters (800,000 documents), we would use 32 bits per doc-id when using 4-byte integers.

Alternatively, we could use log2 800,000 ≈ 20 bits per doc-id. Our goal: use a lot less than 20 bits per doc-id.


Gap Encoding of Doc-IDs

Each postings list is ordered in increasing order of doc-id. Example postings list: computer: 283154, 283159, 283202, ...

It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43. The postings list then becomes: computer: ..., 5, 43, ...

Gaps for frequent terms are small. Thus: we can encode small gaps with fewer than 20 bits.

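The gap transformation itself is a few lines of code; a minimal sketch using the example postings above:

```python
# Gap (delta) encoding of a sorted posting list of doc-ids.
def to_gaps(doc_ids):
    """Keep the first doc-id as-is; the rest become differences."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Prefix sums restore the original doc-ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283154, 283159, 283202]
gaps = to_gaps(postings)
print(gaps)   # [283154, 5, 43]
assert from_gaps(gaps) == postings
```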

VARIABLE LENGTH ENCODING


Variable Length Encoding

Aim: For "arachnocentric" and other rare terms, we will use about 20 bits per gap (= posting).

For "the" and other very frequent terms, we will use about 1 bit per gap (= posting).

In order to implement this, we need to devise some form of variable length encoding.

Use few bits for small gaps, many bits for large gaps.


Variable Byte Code

Used by many commercial and research systems. A good low-tech blend of variable-length coding and sensitivity to memory alignment (compare the bit-level codes later).

Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.

If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.

Else: encode G's lower-order 7 bits and then use one or more additional bytes to encode the higher-order bits using the same algorithm.

At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0).


Variable Byte Code: Example

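The slide's example table is lost in this transcript, but the scheme described above can be sketched directly (a minimal, illustrative implementation; the test gaps are arbitrary):

```python
def vb_encode(gap):
    """Variable byte code: 7 payload bits per byte; the high (continuation)
    bit is set only on the last byte of a number."""
    assert gap >= 0
    out = []
    while True:
        out.insert(0, gap % 128)   # lower-order 7 bits first, prepended
        if gap < 128:
            break
        gap //= 128
    out[-1] += 128                 # continuation bit on the last byte
    return bytes(out)

def vb_decode(data):
    gaps, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b        # continuation bit 0: more bytes follow
        else:
            gaps.append(128 * n + (b - 128))
            n = 0
    return gaps

stream = b"".join(vb_encode(g) for g in (824, 5, 214577))
print(vb_decode(stream))   # [824, 5, 214577]
```

Because each encoded byte stands on its own, the decoder never has to shift across byte boundaries, which is exactly the alignment advantage over bit-level codes.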

GAMMA CODES


Gamma Codes

§  Even better compression is possible with bit-level codes.
§  The gamma code is the best known of these.
§  Represent a gap G as a pair of length and offset.
§  Offset is the gap in binary, with the leading bit chopped off. For example 13 → 1101 → 101.
§  Length is the length of the offset. For 13 (offset 101), this is 3.
§  Encode length in unary code: 1110.
§  The gamma code of 13 is the concatenation of length and offset: 1110101.


Unary Code

§  Represent n as n 1s followed by a final 0.
§  The unary code for 3 is 1110.
§  The unary code for 40 is 11111111111111111111111111111111111111110 (forty 1s and a final 0).


Gamma Code: Examples

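The slide's example table is not reproduced here; as a sketch, the construction above translates to (a minimal, illustrative implementation of the Elias gamma code):

```python
def gamma_encode(g):
    """Elias gamma code for g >= 1: unary(length of offset) + offset,
    where offset is g in binary with the leading 1 removed."""
    assert g >= 1
    offset = bin(g)[3:]               # binary of g without '0b' and leading 1
    length = "1" * len(offset) + "0"  # unary code for len(offset)
    return length + offset

def gamma_decode(bits):
    """Decode a single gamma-coded value from a bit string."""
    n = bits.index("0")               # number of leading 1s = offset length
    offset = bits[n + 1 : n + 1 + n]
    return int("1" + offset, 2) if n else 1

print(gamma_encode(13))   # 1110101
assert gamma_decode(gamma_encode(13)) == 13
assert gamma_encode(1) == "0"         # empty offset, length 0 in unary
```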

Length of Gamma Code

§  The length of offset is ⌊log2 G⌋ bits.
§  The length of length is ⌊log2 G⌋ + 1 bits.
§  So the length of the entire code is 2 × ⌊log2 G⌋ + 1 bits. Gamma codes are always of odd length.

§  Gamma codes are within a factor of 2 of the optimal encoding length log2 G.

§  Assuming equal-probability gaps – but the distribution is actually highly skewed.


Gamma Codes: Alignment

§  Machines have word boundaries – 8, 16, 32 bits.
§  Compressing and manipulating at individual bit granularity can slow down query processing.

§  Variable byte alignment is potentially more efficient

§  Regardless of efficiency, variable byte is conceptually simpler at little additional space cost


Gamma Code: Encode

Taken from en.wikipedia.org/wiki/Elias_gamma_coding

Gamma Code: Decode

Taken from en.wikipedia.org/wiki/Elias_gamma_coding

GOLOMB CODE


Shannon Limit

§  Is it possible to derive codes that are optimal (under certain assumptions)?

§  What is the optimal average code length for a code that encodes each integer (gap length) independently?

§  Lower bound on average code length: Shannon entropy

§  Asymptotically optimal codes (finite alphabets): arithmetic coding, Huffman codes


Bernoulli Model

§  Assumption: term occurrences are Bernoulli events.
§  Notation:
§  n: number of documents
§  m: number of terms in the vocabulary
§  N: total number of (unique) term-document occurrences

§  Probability of term tj occurring in document di: p = N/(nm)

§  Each term-document occurrence is an independent event.

§  The probability of a gap of length x is then given by the geometric distribution: P(gap = x) = (1 − p)^(x−1) · p


Golomb Code


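The content of the Golomb code slides is not reproduced in this transcript. As a sketch of the scheme: with parameter b, a gap x ≥ 1 is coded as the quotient q = ⌊(x − 1)/b⌋ in unary, followed by the remainder x − 1 − q·b in binary; for the Bernoulli model, b ≈ 0.69 × (average gap) is near-optimal. A minimal illustration, restricted to b being a power of two (the Rice code special case) so the remainder has fixed width:

```python
def golomb_encode(x, b):
    """Golomb code for a gap x >= 1 with parameter b. b is restricted to a
    power of two here (Rice code), so the remainder is a fixed-width field."""
    assert x >= 1 and b >= 2 and b & (b - 1) == 0
    q, r = (x - 1) // b, (x - 1) % b
    width = b.bit_length() - 1        # log2(b) bits for the remainder
    return "1" * q + "0" + format(r, f"0{width}b")

# With b = 4: x = 9 gives q = 2, r = 0, i.e. unary '110' + binary '00'.
print(golomb_encode(9, 4))   # 11000
```

The smaller b is, the cheaper small gaps become and the more expensive large ones, which is why frequent terms (short gaps) want a small b and rare terms a large one.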

Local Bernoulli Model

§  If the length of each posting list is known, a Bernoulli model can be fit to each individual inverted list.

§  Frequent words are coded with a smaller b, infrequent words with a larger b.

§  The term frequencies then need to be encoded as well (use the gamma code).

§  The local Bernoulli model outperforms the global Bernoulli model in practice (method of choice!).


Compression of Reuters: Summary


INDEX COMPRESSION IN PRACTICE


Block-Based Index Format

Block-based, variable-length format to reduce space and CPU cost.

Reduced index size by ~30%, and much faster to decode.


CPU Optimized Compression

Block index format: very good compression, but CPU-intensive to decode.

A better format: a single flat position space.
§  Data structures on the side keep track of document boundaries.
§  Posting lists are just lists of delta-encoded positions.
§  These need to be compact (we can't afford a 32-bit value per occurrence) ...
§  ... but they also need to be very fast to decode.


Improved Byte-Aligned Variable-Length Encodings

Varint encoding:
§  7 bits per byte with a continuation bit.
§  Con: decoding requires lots of branches, shifts, and masks.

Idea: encode the byte length in the low 2 bits.
§  Better: fewer branches, shifts, and masks.
§  Con: limited to 30-bit values, and still some shifting to decode.


Group Varint Encoding

Idea: encode groups of 4 values in 5–17 bytes. Pull the four 2-bit binary lengths out into a single byte prefix.

Decode: load the prefix byte and look up the value lengths in a 256-entry table.

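A minimal sketch of the scheme (the bit layout of the prefix byte and the little-endian value order are illustrative choices; a production decoder would use a precomputed 256-entry table of lengths and offsets rather than recomputing them per value):

```python
def group_varint_encode(values):
    """Encode exactly 4 uint32 values: 1 prefix byte holding four 2-bit
    lengths (bytes-per-value minus 1), followed by the value bytes."""
    assert len(values) == 4
    prefix, body = 0, b""
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)   # 1..4 bytes
        prefix = (prefix << 2) | (nbytes - 1)
        body += v.to_bytes(nbytes, "little")
    return bytes([prefix]) + body                    # 5..17 bytes total

def group_varint_decode(data):
    """Decode one group of 4 values; returns (values, bytes consumed)."""
    prefix, pos, values = data[0], 1, []
    for shift in (6, 4, 2, 0):
        nbytes = ((prefix >> shift) & 3) + 1
        values.append(int.from_bytes(data[pos : pos + nbytes], "little"))
        pos += nbytes
    return values, pos

enc = group_varint_encode([824, 5, 214577, 1])
print(group_varint_decode(enc))   # ([824, 5, 214577, 1], 8)
```

Because all four lengths are read from one prefix byte, the decoder's branches and shifts per value collapse into a single table lookup, which is the point of the format.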