Hinrich Schütze and Christina Lioma, Lecture 5: Index Compression


Page 1: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Introduction to Information Retrieval

Hinrich Schütze and Christina Lioma
Lecture 5: Index Compression

Page 2: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Overview

❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression

Page 3: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Overview

Motivation for compression in information retrieval systems

Term statistics: two statistical properties of the collection that may offer information for compression

How can we compress the dictionary component of the inverted index?

How can we compress the postings component of the inverted index?

Page 4: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Outline

❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression

Page 5: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Why compression in information retrieval?

First, we will consider space for the dictionary. Main motivation for dictionary compression: make it small enough to keep in main memory.

Then the postings file. Motivation: reduce the disk space needed and decrease the time needed to read postings lists from disk.

We will devise various compression schemes for the dictionary and the postings.

Page 6: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Outline

❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression

Page 7: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Model collection: The RCV-1 collection

symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per document                       200
M        terms                                            400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
T        non-positional postings                          100,000,000

Page 8: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Effect of preprocessing for RCV-1

∆% : the reduction in size from the previous line
T% : the cumulative reduction from unfiltered

                 dictionary            non-positional index     positional index
                 size (K)  ∆%   T%     size (K)  ∆%   T%        size (K)  ∆%   T%
unfiltered       484                   109,971                  197,879
no numbers       474       -2   -2     100,680   -8   -8        179,158   -9   -9
case folding     392       -17  -19    96,969    -3   -12       179,158   0    -9
30 stopwords     391       -0   -19    83,390    -14  -24       121,858   -31  -38
150 stopwords    391       -0   -19    67,002    -30  -39       94,517    -47  -52
stemming         322       -17  -33    63,812    -4   -42       94,517    0    -52

Page 9: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Lossy vs. lossless compression

Lossy compression: discard some information. Several of the preprocessing steps we frequently use can be viewed as lossy compression: case folding, stop word removal, stemming (Porter).

Lossless compression: all information is preserved. This is what we mostly do in index compression.

The compression techniques described in the remainder of this chapter are lossless.

Page 10: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics – dictionary size vs. collection size

• Heaps’ law is an empirical rule describing the growth of the dictionary as a function of the collection size.

• Heaps’ law: M = kT^b

• M is the size of the dictionary (number of terms), T is the number of tokens in the collection.

• In a log-log plot of M against T, Heaps’ law predicts a straight line.

Page 11: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics - Heaps’ law on RCV-1

Dictionary size M as a function of collection size T (number of tokens) for Reuters-RCV1.

The dashed line

log10 M = 0.49 · log10 T + 1.64

is the best least-squares fit.

Thus M = 10^1.64 · T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49.
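A one-line check of how k and b fall out of that fitted line (a plain Python sketch; the variable names are ours):

```python
# Heaps' law: M = k * T^b. Taking log10 of both sides gives
# log10(M) = b * log10(T) + log10(k), so from the fitted line
# log10(M) = 0.49 * log10(T) + 1.64 we can read off b and k directly.
b = 0.49
k = 10 ** 1.64
print(b, round(k, 1))   # 0.49, 43.7  (i.e. k ≈ 44)
```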

Page 12: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics - Heaps’ law on RCV-1

The empirical fit is good, as we just saw in the graph.

Example: for the first 1,000,020 tokens, Heaps’ law predicts 38,323 terms:

44 × 1,000,020^0.49 ≈ 38,323

The actual number is 38,365 terms, very close to the prediction.

Empirical observation: the fit is good in general.
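The same prediction, spelled out as a small Python sketch using the fitted parameters k = 44 and b = 0.49:

```python
# Heaps' law prediction for the vocabulary size after the first
# 1,000,020 tokens of Reuters-RCV1, with k = 44 and b = 0.49.
k, b = 44, 0.49
T = 1_000_020
predicted_terms = k * T ** b
print(round(predicted_terms))   # ≈ 38,323; the actual dictionary size is 38,365
```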

Page 13: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics - Heaps’ law on RCV-1

Heaps’ law suggests two hypotheses:

1. The dictionary size continues to increase with more documents in the collection, rather than reaching a maximum vocabulary size.

2. The dictionary size is quite large for large collections.

Both hypotheses have been empirically shown to be true for large text collections.

Page 14: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics - Relative frequencies of terms

Now we have characterized the growth of the dictionary in a collection.

We also want to know how many frequent vs. infrequent terms we should expect in a collection.

In natural language, there are a few very frequent terms and very many very rare terms.


Page 15: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics - Zipf’s law

Zipf’s law: the i-th most frequent term has collection frequency cf_i proportional to 1/i, i.e. cf_i ∝ 1/i.

cf_i is the collection frequency of term i: the number of occurrences of the term in the collection.

Intuitively, frequency decreases very rapidly with rank i.

Equivalent: cf_i = c · i^k and log cf_i = log c + k · log i (for k = −1).
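A tiny sketch of what the law with k = −1 implies in practice (the frequency of the most frequent term used here is a made-up illustration, not an RCV-1 statistic):

```python
def zipf_predicted_cf(cf_1, i):
    """Collection frequency of the i-th most frequent term under Zipf's law
    with k = -1: cf_i = cf_1 / i."""
    return cf_1 / i

# With a hypothetical most-frequent term occurring 1,000,000 times:
# zipf_predicted_cf(1_000_000, 2) -> 500000.0   (half as many occurrences)
# zipf_predicted_cf(1_000_000, 3) -> ~333333.3  (a third as many occurrences)
```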

Page 16: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term Statistics - Zipf’s law on RCV-1

The fit is not great. What is important is the key insight: few frequent terms, many rare terms.

Page 17: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Outline

❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression

Page 18: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Dictionary compression

The dictionary is small compared to the postings file. Why do we want to compress it?

Disk seeks are reduced if a larger portion of the dictionary is kept in memory.

The dictionary of a large collection may not fit into the memory of a standard desktop machine.

There is also competition with other applications for memory; think of cell phones, onboard computers, and fast startup time.

So compressing the dictionary is important.

Page 19: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Recall: Dictionary as array of fixed-width entries

Space needed: 20 bytes per term, 4 bytes per document frequency, 4 bytes per pointer to the postings list.

For Reuters RCV-1: (20 + 4 + 4) × 400,000 = 11.2 MB.

Page 20: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Fixed-width entries are bad.

The average length of a term in English is about 8 characters, so most of the bytes in the term column are wasted.

We allocate 20 bytes even for terms of length 1 character, and we can’t handle HYDROCHLOROFLUOROCARBONS and SUPERCALIFRAGILISTICEXPIALIDOCIOUS.

How can we use on average 8 characters per term?

Page 21: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Dictionary as a string


Page 22: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Space for dictionary as a string

4 bytes per term for the frequency.
4 bytes per term for the pointer to the postings list.
8 bytes (on average) for the term itself in the string.
3 bytes per pointer into the string (we need log2(8 · 400,000) < 24 bits to resolve 8 · 400,000 positions).

Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for the fixed-width array).
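The same arithmetic, as a quick sketch:

```python
# Space for the dictionary on Reuters-RCV1 (numbers from the slides).
terms = 400_000
fixed_width = terms * (20 + 4 + 4)          # 20-byte term + freq + postings pointer
as_string = terms * (4 + 4 + 3 + 8)         # freq + postings ptr + term ptr + avg. term bytes
print(fixed_width / 1e6, as_string / 1e6)   # 11.2 MB vs. 7.6 MB
```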

Page 23: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Dictionary as a string with blocking


Page 24: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Space for dictionary as a string with blocking

Example block size k = 4.

Where we used 4 × 3 bytes for term pointers without blocking, we now use 3 bytes for one pointer plus 4 bytes (one per term) for the length of each term.

We save 12 − (3 + 4) = 5 bytes per block.

Total savings: (400,000 / 4) × 5 bytes = 0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
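And the blocking saving, as a sketch:

```python
# Savings from blocking the dictionary string with block size k = 4.
k = 4
pointers_without_blocking = k * 3      # 4 term pointers of 3 bytes each = 12 bytes per block
pointers_with_blocking = 3 + k * 1     # 1 term pointer + 4 one-byte term lengths = 7 bytes
saved_per_block = pointers_without_blocking - pointers_with_blocking   # 5 bytes
total_saving_mb = (400_000 / k) * saved_per_block / 1e6
print(total_saving_mb)                 # 0.5 MB, so 7.6 MB -> 7.1 MB
```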

Page 25: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Lookup of a term without blocking


Page 26: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Lookup of a term with blocking: (slightly) slower


Page 27: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Front coding

One block in blocked compression (k = 4):

8automata8automate9automatic10automation

⇓ further compressed with front coding:

8automat*a1⋄e2⋄ic3⋄ion
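A minimal Python sketch of front coding for one block (the function name and exact output format are our illustration; the shared prefix is written once, terminated by '*', and each later term contributes only the length of its extra suffix, a '⋄' marker, and the suffix):

```python
def front_code_block(terms):
    """Front-code one block of the blocked dictionary string."""
    first = terms[0]
    prefix = first
    for term in terms[1:]:
        while not term.startswith(prefix):   # shrink until it is a common prefix
            prefix = prefix[:-1]
    encoded = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for term in terms[1:]:
        suffix = term[len(prefix):]
        encoded.append(f"{len(suffix)}⋄{suffix}")
    return "".join(encoded)

# front_code_block(["automata", "automate", "automatic", "automation"])
# -> '8automat*a1⋄e2⋄ic3⋄ion'
```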

Page 28: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Dictionary compression for Reuters: Summary

data structure                             size in MB
dictionary, fixed-width                    11.2
dictionary, term pointers into string      7.6
∼, with blocking, k = 4                    7.1
∼, with blocking & front coding            5.9

Page 29: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Outline

❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression

Page 30: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Postings compression

The postings file is much larger than the dictionary, by a factor of at least 10.

A posting for our purposes here is a docID.

For RCV-1 (800,000 documents), we would use 32 bits per docID when using 4-byte integers.

Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per docID.

Page 31: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Key idea: Store gaps instead of docIDs

Each postings list is ordered in increasing order of docID.

Example postings list: COMPUTER: 283154, 283159, 283202, . . .

It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43.

Example postings list using gaps: COMPUTER: 283154, 5, 43, . . .

Gaps for frequent terms are small. Thus we can encode small gaps with fewer than 20 bits.
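A small sketch of the docID-to-gap conversion and its inverse (function names are ours):

```python
def docids_to_gaps(docids):
    """Turn an ascending list of docIDs into the first docID followed by gaps."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def gaps_to_docids(gaps):
    """Invert the transformation by accumulating the gaps."""
    docids, current = [], 0
    for gap in gaps:
        current += gap
        docids.append(current)
    return docids

# docids_to_gaps([283154, 283159, 283202]) -> [283154, 5, 43]
# gaps_to_docids([283154, 5, 43])          -> [283154, 283159, 283202]
```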

Page 32: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Gap encoding


Page 33: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Variable length encoding

Aim:

For ARACHNOCENTRIC and other rare terms, we will use about 20 bits per gap (= posting).

For THE and other very frequent terms, we will use only a few bits per gap (= posting).

In order to implement this, we need to devise some form of variable length encoding.

Variable length encoding uses few bits for small gaps and many bits for large gaps.

Page 34: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Variable byte (VB) code

Used by many commercial/research systems.

Good low-tech blend of variable-length coding and sensitivity to alignment (in contrast to the bit-level codes we will see later).

Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.

If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.

Else: encode the lower-order 7 bits and then use one or more additional bytes to encode the higher-order bits using the same algorithm.

At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0).

Page 35: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


VB code examples

docIDs    824                  829         215406
gaps                           5           214577
VB code   00000110 10111000    10000101    00001101 00001100 10110001

Page 36: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


VB code encoding algorithm

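A minimal Python sketch of the encoding scheme described two slides back (function names are ours):

```python
def vb_encode_number(n):
    """Variable-byte-encode one gap: 7 payload bits per byte; the continuation
    bit (high bit) is set to 1 only on the last byte of the number."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)   # prepend the lower-order 7 bits
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128               # set the continuation bit on the last byte
    return chunks

def vb_encode(gaps):
    """Encode a whole list of gaps into one byte stream."""
    return [byte for gap in gaps for byte in vb_encode_number(gap)]

# [format(b, '08b') for b in vb_encode_number(824)] -> ['00000110', '10111000']
```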

Page 37: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


VB code decoding algorithm

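A matching minimal sketch of the decoder (again, names are ours):

```python
def vb_decode(byte_stream):
    """Decode a variable-byte-encoded byte stream back into a list of gaps."""
    gaps, n = [], 0
    for byte in byte_stream:
        if byte < 128:                 # continuation bit not set: more bytes follow
            n = 128 * n + byte
        else:                          # continuation bit set: last byte of this gap
            n = 128 * n + (byte - 128)
            gaps.append(n)
            n = 0
    return gaps

# vb_decode([0b00000110, 0b10111000, 0b10000101]) -> [824, 5]
```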

Page 38: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Other variable codes

Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles), etc.

Variable byte alignment wastes space if you have many small gaps – nibbles do better on those.

Page 39: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Gamma codes for gap encoding

You can get even more compression with another type of variable length encoding: bit-level codes.

The gamma code is the best known of these.

First, we need the unary code in order to introduce the gamma code.

Unary code: represent n as n 1s followed by a final 0.

The unary code for 3 is 1110. The unary code for 10 is 11111111110.

Page 40: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Gamma code

Represent a gap G as a pair of length and offset.

The offset is G in binary, with the leading bit chopped off. For example 13 → 1101 → 101 = offset.

The length is the length of the offset. For 13 (offset 101), this is 3.

Encode the length in unary code: 1110.

The gamma code of 13 is the concatenation of length and offset: 1110101.
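A minimal Python sketch of the unary and gamma encoders described above (names are ours):

```python
def unary(n):
    """Unary code: n ones followed by a final zero, e.g. unary(3) = '1110'."""
    return "1" * n + "0"

def gamma_encode(gap):
    """Gamma code of a gap G >= 1: unary code of the offset length, followed by
    the offset (G in binary with the leading bit chopped off)."""
    offset = bin(gap)[3:]              # bin(13) = '0b1101' -> offset '101'
    return unary(len(offset)) + offset

# gamma_encode(13) -> '1110101'  (length part 1110, offset 101, as on the slide)
# gamma_encode(1)  -> '0'        (empty offset, length part is just '0')
```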

Page 41: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Gamma code examples


Page 42: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Gamma codes: Alignment

Machines have word boundaries – 8, 16, 32 bits.

Compressing and manipulating bits can be slow.

Variable byte encoding is aligned and thus potentially more efficient.

Regardless of efficiency, variable byte is conceptually simpler at little additional space cost.

Page 43: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Compression of Reuters

data structure                             size in MB
dictionary, fixed-width                    11.2
dictionary, term pointers into string      7.6
∼, with blocking, k = 4                    7.1
∼, with blocking & front coding            5.9
collection (text)                          960.0
T/D incidence matrix                       40,000.0
postings, uncompressed (32-bit words)      400.0
postings, uncompressed (20 bits)           250.0
postings, variable byte encoded            116.0
postings, encoded by Gamma code            101.0

Page 44: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Term-document incidence matrix

Entry is 1 if the term occurs. Example: CALPURNIA occurs in Julius Caesar.

Entry is 0 if the term doesn’t occur. Example: CALPURNIA doesn’t occur in The Tempest.

Page 45: Hinrich Schütze and Christina Lioma Lecture 5: Index Compression


Summary

We can now create an index for highly efficient retrieval that is also very space efficient.

It takes up only 10-15% of the total size of the text in the collection.

However, we have ignored positional and frequency information. For this reason, space savings are less in reality.