modern information retrieval chapter 8 – indexing and searching

52
Modern information Modern information retrieval retrieval Chapter 8 – Indexing and Chapter 8 – Indexing and Searching Searching

Upload: lenard-wilcox

Post on 18-Dec-2015

265 views

Category:

Documents


8 download

TRANSCRIPT

Page 1: Modern information retrieval Chapter 8 – Indexing and Searching

Modern information Modern information retrievalretrieval

Chapter 8 – Indexing and Chapter 8 – Indexing and Searching Searching

Page 2: Modern information retrieval Chapter 8 – Indexing and Searching

22

Indexing and Searching TOCIndexing and Searching TOC

• Indexing techniques:Indexing techniques:– Inverted filesInverted files– Suffix arraysSuffix arrays– Signature filesSignature files

• Technique used to search each type Technique used to search each type of indexof index

• Other searching techniquesOther searching techniques

Page 3: Modern information retrieval Chapter 8 – Indexing and Searching

33

MotivationMotivation

• Just like in traditional RDBMSs searching Just like in traditional RDBMSs searching for data may be costlyfor data may be costly

• In a RDB one can take (a lot of) advantage In a RDB one can take (a lot of) advantage from the well defined structure of (and from the well defined structure of (and constraints on) the dataconstraints on) the data

• Linear scan of the data is not feasible for Linear scan of the data is not feasible for non-trivial datasets (real life)non-trivial datasets (real life)

• Indices are not optional in IR (not meaning Indices are not optional in IR (not meaning that they are in RDBMS)that they are in RDBMS)

Page 4: Modern information retrieval Chapter 8 – Indexing and Searching

44

MotivationMotivation

• Traditional indices, e.g., B-trees, are not Traditional indices, e.g., B-trees, are not well suited for IR well suited for IR

• Main approaches:Main approaches:– Inverted files (or lists)Inverted files (or lists)– Suffix arraysSuffix arrays– Signature filesSignature files

Page 5: Modern information retrieval Chapter 8 – Indexing and Searching

55

Inverted FilesInverted Files• There are two main elements: There are two main elements:

– vocabulary – set of unique terms vocabulary – set of unique terms – Occurrences – where those terms appear Occurrences – where those terms appear

• The occurrences can be recorded as The occurrences can be recorded as terms or byte offsetsterms or byte offsets

• Using term offset is good to retrieve Using term offset is good to retrieve concepts such as proximity, whereas concepts such as proximity, whereas byte offsets allow direct accessbyte offsets allow direct access

VocabularyVocabulary Occurrences (byte Occurrences (byte offset)offset)

…… ……

Page 6: Modern information retrieval Chapter 8 – Indexing and Searching

66

Inverted FilesInverted Files

• The number of indexed terms is often The number of indexed terms is often several orders of magnitude smaller when several orders of magnitude smaller when compared to the documents size (Mbs vs compared to the documents size (Mbs vs Gbs)Gbs)

• The space consumed by the occurrence list The space consumed by the occurrence list is not trivial. Each time the term appears it is not trivial. Each time the term appears it must be added to a list in the inverted filemust be added to a list in the inverted file

• That may lead to a quite considerable That may lead to a quite considerable index overheadindex overhead

Page 7: Modern information retrieval Chapter 8 – Indexing and Searching

77

Inverted Files - layoutInverted Files - layout

IndexedTerms

Number of occurrences

Occurrences Lists

Vocabulary

Posting File

This could be a tree like structure !

Page 8: Modern information retrieval Chapter 8 – Indexing and Searching

88

ExampleExample

• Text: Text:

• Inverted fileInverted file

1 6 12 16 18 25 29 36 40 45 54 58 66 70

That house has a garden. The garden has many flowers. The flowers are beautiful

beautiful

flowers

garden

house

70

45, 58

18, 29

6

Vocabulary Occurrences

Page 9: Modern information retrieval Chapter 8 – Indexing and Searching

99

Inverted FilesInverted Files

• Coarser addressing may be usedCoarser addressing may be used

• All occurrences within a block (perhaps a whole All occurrences within a block (perhaps a whole document) are identified by the same block offsetdocument) are identified by the same block offset

• Much smaller overheadMuch smaller overhead• Some searches will be less efficient, e.g., Some searches will be less efficient, e.g.,

proximity searches. Linear scan may be needed, proximity searches. Linear scan may be needed, though hardly feasible (specially on-line)though hardly feasible (specially on-line)

TermsTerms Occurrences (Occurrences (blockblock offset)offset)

…… ……

Page 10: Modern information retrieval Chapter 8 – Indexing and Searching

1010

Space RequirementsSpace Requirements• The space required for the vocabulary is rather The space required for the vocabulary is rather

small. According to small. According to Heaps’Heaps’ law the vocabulary law the vocabulary grows as grows as O(nO(n),), where where is a constant between is a constant between 0.4 and 0.6 in practice0.4 and 0.6 in practice

• On the other hand, the occurrences demand On the other hand, the occurrences demand much more space. Since each word appearing much more space. Since each word appearing in the text is referenced once in that structure, in the text is referenced once in that structure, the extra space is the extra space is O(n)O(n)

• To reduce space requirements, a technique To reduce space requirements, a technique called called block addressingblock addressing is used is used

Page 11: Modern information retrieval Chapter 8 – Indexing and Searching

1111

Block AddressingBlock Addressing

• The text is divided in blocksThe text is divided in blocks

• The occurrences point to the blocks The occurrences point to the blocks where the word appearswhere the word appears

• Advantages:Advantages:– the number of pointers is smaller than positionsthe number of pointers is smaller than positions– all the occurrences of a word inside a single all the occurrences of a word inside a single

block are collapsed to one referenceblock are collapsed to one reference

• Disadvantages:Disadvantages:– online search over the qualifying blocks if exact online search over the qualifying blocks if exact

positions are requiredpositions are required

Page 12: Modern information retrieval Chapter 8 – Indexing and Searching

1212

ExampleExample

• Text:Text:

• Inverted file:Inverted file:

Block 1 Block 2 Block 3 Block 4

That house has a garden. The garden has many flowers. The flowers are beautiful

beautiful

flowers

garden

house

4

3

2

1

Vocabulary Occurrences

Page 13: Modern information retrieval Chapter 8 – Indexing and Searching

1414

Inverted Files - constructionInverted Files - construction

• Building the index in main memory is not Building the index in main memory is not feasible (wouldn’t fit, and swapping would feasible (wouldn’t fit, and swapping would be unbearable)be unbearable)

• Building it entirely in disk is not a good Building it entirely in disk is not a good idea either (would take a long time)idea either (would take a long time)

• One idea is to build several partial indices One idea is to build several partial indices in main memory, one at a time, saving in main memory, one at a time, saving them to disk and then merging all of them them to disk and then merging all of them to obtain a single indexto obtain a single index

Page 14: Modern information retrieval Chapter 8 – Indexing and Searching

1515

Inverted Files - constructionInverted Files - construction

• The procedure works as follows:The procedure works as follows:– Build and save partial indices l1, I2, …, InBuild and save partial indices l1, I2, …, In– Merge Ij and Ij+1 into a single partial index Ij,j+1Merge Ij and Ij+1 into a single partial index Ij,j+1

• Merging indices mean that their sorted vocabularies Merging indices mean that their sorted vocabularies are merged, and if a term appears in both indices then are merged, and if a term appears in both indices then the respective lists should be merged (keeping the the respective lists should be merged (keeping the document order)document order)

– Then indices Ij,j+1 and Ij+2,j+3 are merged into Then indices Ij,j+1 and Ij+2,j+3 are merged into partial index Ij,j+3, and so on and so forth until a partial index Ij,j+3, and so on and so forth until a single index is obtainedsingle index is obtained

– Several partial indices can be merged together Several partial indices can be merged together at onceat once

Page 15: Modern information retrieval Chapter 8 – Indexing and Searching

1616

ExampleExample

I 1...8

I 1...4 I 5...8

I 1...2 I 3...4 I 5...6 I 7...8

I 1 I 2 I 3 I 4 I 5 I 6 I 7 I 8

1 2 4 5

3 6

7

final index

initial dumps

level 1

level 2

level 3

Page 16: Modern information retrieval Chapter 8 – Indexing and Searching

1717

Inverted Files - constructionInverted Files - construction

• This procedure takes O(n log (n/M)) time This procedure takes O(n log (n/M)) time plus O(n) to build the partial indices – plus O(n) to build the partial indices – where n is the document size and M is the where n is the document size and M is the amount of main memory availableamount of main memory available

• Adding a new document is a matter of Adding a new document is a matter of merging its (partial) index (indices) to the merging its (partial) index (indices) to the index already builtindex already built

• Deletion can be done in O(n) time – Deletion can be done in O(n) time – scanning over all lists of terms occurring in scanning over all lists of terms occurring in the deleted documentthe deleted document

Page 17: Modern information retrieval Chapter 8 – Indexing and Searching

1818

SearchingSearching• The search algorithm on an inverted The search algorithm on an inverted

index follows three steps:index follows three steps:– Vocabulary search: Vocabulary search: the words present in the words present in

the query are searched in the vocabularythe query are searched in the vocabulary

– Retrieval occurrences: Retrieval occurrences: the lists of the the lists of the occurrences of all words found are retrievedoccurrences of all words found are retrieved

– Manipulation of occurrences: Manipulation of occurrences: the the occurrences are processed to solve the queryoccurrences are processed to solve the query

Page 18: Modern information retrieval Chapter 8 – Indexing and Searching

1919

SearchingSearching

• Searching task on an inverted file Searching task on an inverted file always starts in the vocabulary ( It is always starts in the vocabulary ( It is better to store the vocabulary in a better to store the vocabulary in a separate file )separate file )

• The structures most used to store The structures most used to store the vocabulary are the vocabulary are hashing, trieshashing, tries or or B-treesB-trees

• An alternative is simply storing the An alternative is simply storing the words in lexicographical order words in lexicographical order ( cheaper in space and very ( cheaper in space and very competitive with competitive with O(log v)O(log v) cost ) cost )

Page 19: Modern information retrieval Chapter 8 – Indexing and Searching

2020

Inverted Files - searchingInverted Files - searching• Searching using an inverted fileSearching using an inverted file

– Vocabulary searchVocabulary search•The terms used in the query (decoupled in the case The terms used in the query (decoupled in the case

of phrase or proximity queries) are searched of phrase or proximity queries) are searched separatelyseparately

– Retrieval of occurrences listsRetrieval of occurrences lists– Filtering answerFiltering answer

• If the query was boolean then the retrieved lists If the query was boolean then the retrieved lists have to be “booleany” processed as wellhave to be “booleany” processed as well

• If the inverted file used blocking and the query used If the inverted file used blocking and the query used proximity (for instance) then the actual byte/term proximity (for instance) then the actual byte/term offset has to be obtained from the documentsoffset has to be obtained from the documents

Page 20: Modern information retrieval Chapter 8 – Indexing and Searching

2121

Inverted Files - searchingInverted Files - searching• Processing the lists of occurrences (filtering the Processing the lists of occurrences (filtering the

answer set) may be criticalanswer set) may be critical• For instance, how to process a proximity query For instance, how to process a proximity query

(involving two terms) ?(involving two terms) ?– The lists are built in increasing order, so they may be The lists are built in increasing order, so they may be

traversed in a synchronous way, and each occurrence traversed in a synchronous way, and each occurrence is checked for the proximityis checked for the proximity

• What if blocking is used ?What if blocking is used ?– No positional information is kept, so a linear scan of the No positional information is kept, so a linear scan of the

document is requireddocument is required

• The traversal and merging of the obtained lists The traversal and merging of the obtained lists are sensitive operationsare sensitive operations

Page 21: Modern information retrieval Chapter 8 – Indexing and Searching

2222

ConstructionConstruction

• All the vocabulary is kept in a All the vocabulary is kept in a suitable data structure storing for suitable data structure storing for each word a list of its occurrenceseach word a list of its occurrences

• Each word of the text is read and Each word of the text is read and searched in the vocabularysearched in the vocabulary

• If it is not found, it is added to the If it is not found, it is added to the vocabulary with a empty list of vocabulary with a empty list of occurrences and the new position is occurrences and the new position is added to the end of its list of added to the end of its list of occurrencesoccurrences

Page 22: Modern information retrieval Chapter 8 – Indexing and Searching

2323

Compression of Inverted Compression of Inverted IndexIndex• I/O to read a occurrence list is reduced if the inverted I/O to read a occurrence list is reduced if the inverted

index takes less storageindex takes less storage

• Stop words eliminate about half the size of an inverted Stop words eliminate about half the size of an inverted index. “the” occurs in 7 percent of English text.index. “the” occurs in 7 percent of English text.

• Other compressionOther compression– Occurrence ListOccurrence List– Term DictionaryTerm Dictionary

• Half of terms occur only once Half of terms occur only once (hapax legomena)(hapax legomena) so they so they only have one entry in their Occurrence listonly have one entry in their Occurrence list

• Problem is some terms have very long posting lists -- in Problem is some terms have very long posting lists -- in Excite’s search engine 1997 occurs 7 million times.Excite’s search engine 1997 occurs 7 million times.

Page 23: Modern information retrieval Chapter 8 – Indexing and Searching

2424

Things to CompressThings to Compress

• Term name in the term listTerm name in the term list

• Term Frequency in each occurrence Term Frequency in each occurrence list entrylist entry

• Document Identifier in each Document Identifier in each occurrence list entryoccurrence list entry

Page 24: Modern information retrieval Chapter 8 – Indexing and Searching

2525

Data CompressionData Compression• Applied to occurrence listsApplied to occurrence lists

– term: (dterm: (d11,tf,tf11), (d), (d22,tf,tf22), ... (d), ... (dnn,tf,tfnn))

• Documents are ordered, so each dDocuments are ordered, so each dii is is replaced by the interval difference, replaced by the interval difference, namely, namely, ddii - d - di - 1i - 1

• Numbers are encoded using fewer bits for Numbers are encoded using fewer bits for smaller, common numberssmaller, common numbers

• Index is reduced to 10-15% of database Index is reduced to 10-15% of database sizesize

Page 25: Modern information retrieval Chapter 8 – Indexing and Searching

2626

Compressing Compressing tf: tf: Elias EncodingElias Encoding

X 1 0 2 10 0 3 10 1 4 110 00 5 110 01 6 110 10 7 110 11 8 1110 000

63 111110 11111

To represent a value X:To represent a value X:• ||loglog22 X X|| ones representing the ones representing the

highest highest power of 2 not exceeding power of 2 not exceeding XX

• a 0 markera 0 marker• ||loglog22 X X|| bits representing to bits representing to

represent represent the remainder X - 2^ the remainder X - 2^ ||loglog22 X X|| in in binary.binary.

• The smaller the integer, the fewer The smaller the integer, the fewer the bits used to represent the value. the bits used to represent the value. Most Most tf’stf’s are relatively small. are relatively small.

Page 26: Modern information retrieval Chapter 8 – Indexing and Searching

2727

Elias CodeElias Code

1 = 02 = 1 0 03 = 1 0 14 = 11 0 005 = 11 0 016 = 11 0 107 = 11 0 118 = 111 0 0009 = 111 0 001

For 63, its 2^5 = 32 +31 in binary (11111)11111 0 11111

...

• 3 parts, not byte aligned3 parts, not byte aligned1. n ones, one for each bit 1. n ones, one for each bit

in part 3in part 3

2. a 0 to mark the end of 2. a 0 to mark the end of part 1.part 1.

3. 3. the next n numbers the next n numbers in in binarybinary

Instead of two bytes for the tf we now Instead of two bytes for the tf we now are using only a few bits.are using only a few bits.

Page 27: Modern information retrieval Chapter 8 – Indexing and Searching

2828

ConclusionConclusion

• Inverted file is probably the most Inverted file is probably the most adequate indexing technique for adequate indexing technique for database textdatabase text

• The indices are appropriate when the The indices are appropriate when the text collection is large and semi-statictext collection is large and semi-static

• Otherwise, if the text collection is Otherwise, if the text collection is volatile online searching is the only volatile online searching is the only optionoption

• Some techniques combine online and Some techniques combine online and indexed searchingindexed searching

Page 28: Modern information retrieval Chapter 8 – Indexing and Searching

2929

Suffix Trees/ArraysSuffix Trees/Arrays

• Suffix trees are a generalization of inverted filesSuffix trees are a generalization of inverted files

• For traditional queries, i.e., those based on simple For traditional queries, i.e., those based on simple terms, inverted files are the structure of choiceterms, inverted files are the structure of choice

• This type of index treats the text to be indexed as This type of index treats the text to be indexed as a finite, but long string. a finite, but long string.

• Thus each position in the text represent a suffix Thus each position in the text represent a suffix of that text, and each suffix is uniquely identified of that text, and each suffix is uniquely identified by its positionby its position

• Such position determine what is and what is not Such position determine what is and what is not indexed and is called index pointindexed and is called index point

Page 29: Modern information retrieval Chapter 8 – Indexing and Searching

3030

Suffix TreesSuffix Trees

• Note that the choice of index points is crucial Note that the choice of index points is crucial to the retrieval capabilitiesto the retrieval capabilities

• Consider the document:Consider the document:This is a text. A text has many words. Words …This is a text. A text has many words. Words …

• And the index points/suffixes:And the index points/suffixes:11: text. A text has many words. Words …11: text. A text has many words. Words …

19: text has many words. Words …19: text has many words. Words …

28: many words. Words …28: many words. Words …

33: words. Words …33: words. Words …

40: Words …40: Words …

Page 30: Modern information retrieval Chapter 8 – Indexing and Searching

3131

Suffix TreesSuffix Trees

• Suffix Suffix TrieTrie

-

- -

-

-

- - -

- - - -

28

11

19

33

40

a

m

n

t e x t

‘ ’

.

w

o r d s

‘ ’

.

Page 31: Modern information retrieval Chapter 8 – Indexing and Searching

3232

Suffix TreesSuffix Trees

Suffix Tree

-

28

11

19

33

40

many

text‘ ’

.

words ‘ ’

.

Page 32: Modern information retrieval Chapter 8 – Indexing and Searching

3333

Suffix TreesSuffix Trees

• Even though the example showed only Even though the example showed only terms, one can use suffix trees to index and terms, one can use suffix trees to index and search phrasessearch phrases

• For practical purposes a rather small phrase For practical purposes a rather small phrase length (typically much smaller than the text) length (typically much smaller than the text) should be usedshould be used

• Suffix arrays are a better implementation of Suffix arrays are a better implementation of suffix treessuffix trees

Page 33: Modern information retrieval Chapter 8 – Indexing and Searching

3434

Suffix ArraysSuffix Arrays

• Careful inspection of the leaves in the suffix Careful inspection of the leaves in the suffix tree shows that there is a lexicographical tree shows that there is a lexicographical order among the index pointsorder among the index points

• An equivalent suffix array would be:An equivalent suffix array would be:[28, 19, 11, 40, 33] [28, 19, 11, 40, 33] forfor(many, text(many, text , text., words, text., words , words.}, words.}

• This allows the search to be processed as a This allows the search to be processed as a binary search, but note that one must use the binary search, but note that one must use the array array andand the text at query time the text at query time

• The number of I/Os may become a problemThe number of I/Os may become a problem

Page 34: Modern information retrieval Chapter 8 – Indexing and Searching

3535

Suffix ArraysSuffix Arrays

• One idea is to use a supra-indexOne idea is to use a supra-index• A supra-index will contain a sample of the A supra-index will contain a sample of the

indexed terms, e.g., indexed terms, e.g.,

[…, text, word, …][…, text, word, …]

[28, 19, 11, 40, 33][28, 19, 11, 40, 33]

• This alleviates the binary search, but makes it This alleviates the binary search, but makes it quite similar to inverted filesquite similar to inverted files

Page 35: Modern information retrieval Chapter 8 – Indexing and Searching

3636

Suffix ArraysSuffix Arrays

• This similarity is true if only single terms are This similarity is true if only single terms are indexed, but suffix trees/arrays aim more than indexed, but suffix trees/arrays aim more than that, they aim mostly at phrase queriesthat, they aim mostly at phrase queries

• However, the array (or tree) is unlikely to fit in However, the array (or tree) is unlikely to fit in main memory, whereas for the inverted files the main memory, whereas for the inverted files the vocabulary could fit in main memoryvocabulary could fit in main memory

• This requires careful implementation to reduce This requires careful implementation to reduce I/OsI/Os

• Good implementations show reduction of over Good implementations show reduction of over 50% compared to naive ones50% compared to naive ones

Page 36: Modern information retrieval Chapter 8 – Indexing and Searching

3737

Signature FilesSignature Files

• Unlike the case of inverted files where, most Unlike the case of inverted files where, most of the time, there is a tree structure of the time, there is a tree structure underneath, signature files use hash tablesunderneath, signature files use hash tables

• The main idea is to divide the document into The main idea is to divide the document into blocks of fixed size and each block has blocks of fixed size and each block has assigned to it a signature (also fixed size), assigned to it a signature (also fixed size), which is used to search the document for the which is used to search the document for the queried patternqueried pattern

• The block signature is obtained by OR’ing the The block signature is obtained by OR’ing the hashed bitstrings of each term in the blockhashed bitstrings of each term in the block

Page 37: Modern information retrieval Chapter 8 – Indexing and Searching

3838

Signature FilesSignature Files

• Consider:Consider:– H(information) = 010001H(information) = 010001– H(text) = 010010H(text) = 010010– H(data) = 110000H(data) = 110000– H(retrieval) = 100010H(retrieval) = 100010– The block signatures of a document D containing The block signatures of a document D containing

the text “textual retrieval and information retrieval” the text “textual retrieval and information retrieval” (after removing stopwords and stemming) for a (after removing stopwords and stemming) for a block size of two terms – would be:block size of two terms – would be:• B1D = 110010 and B1D = 110010 and

• B2D = 110011B2D = 110011

Page 38: Modern information retrieval Chapter 8 – Indexing and Searching

3939

Signature Files - searchingSignature Files - searching• To search for a given term we compare whether the To search for a given term we compare whether the

term’s bitstring could be “inside” the block signaturesterm’s bitstring could be “inside” the block signatures• Consider we are searching for “text” in document DConsider we are searching for “text” in document D

– H(text) = 010010 and B1D = 110010 H(text) = 010010 and B1D = 110010 – H(text) bit-wise-AND B1D = 010010 = H(text)H(text) bit-wise-AND B1D = 010010 = H(text)– Therefore “text” Therefore “text” couldcould be in B1D (it is in this particular be in B1D (it is in this particular

case)case)

• Consider we are now searching for “data”Consider we are now searching for “data”– H(data) bit-wise-AND B1D = 110000 = H(data)H(data) bit-wise-AND B1D = 110000 = H(data)– H(data) bit-wise-AND B2D = 110000 = H(data)H(data) bit-wise-AND B2D = 110000 = H(data)– Though “data” is not in either block !Though “data” is not in either block !

• Signature files may yield false hits …Signature files may yield false hits …

Page 39: Modern information retrieval Chapter 8 – Indexing and Searching

4040

Signature File - layoutSignature File - layout

Block signature (bitmask)

Logical Blocks Pointers to blocksin the documents

Page 40: Modern information retrieval Chapter 8 – Indexing and Searching

4141

Signature Files – design Signature Files – design issuesissues• How to keep the probability of a false alarms How to keep the probability of a false alarms

low ? How to predict how good a signature is ?low ? How to predict how good a signature is ?• Consider:Consider:

– B: the size (number of bits) of each term’s bitstring B: the size (number of bits) of each term’s bitstring – b: number of terms per document blockb: number of terms per document block– l: the minimum number of bits set (turned on) in B, l: the minimum number of bits set (turned on) in B,

this depends on the hashing functionthis depends on the hashing function

• The value of l which minimizes the false hits The value of l which minimizes the false hits (yielding probability of false hits equal to 2(yielding probability of false hits equal to 2-l-l) is) is– l = B ln(2)/bl = B ln(2)/b

• B/b indicates the space overhead to payB/b indicates the space overhead to pay

Page 41: Modern information retrieval Chapter 8 – Indexing and Searching

4242

Signature Files - alternativesSignature Files - alternatives

• There are many strategies which can be There are many strategies which can be used with signature filesused with signature files

• Signature compressionSignature compression– Given that the bitstring is likely to have many Given that the bitstring is likely to have many

0s a simple run-length encoding technique can 0s a simple run-length encoding technique can be usedbe used

– 0001 0010 0000 0001 is encoded as “328” 0001 0010 0000 0001 is encoded as “328” (each digit in the code represents the number (each digit in the code represents the number of 0s preceding a 1) of 0s preceding a 1)

– The main drawback is that comparing The main drawback is that comparing bitstrings require decoding at search timebitstrings require decoding at search time

Page 42: Modern information retrieval Chapter 8 – Indexing and Searching

4343

Signature Files - alternativesSignature Files - alternatives

• Using the original signature file layout and a Using the original signature file layout and a bitstring (queried term signature) all logical bitstring (queried term signature) all logical blocks are checked as to whether that block blocks are checked as to whether that block could contain that bitstringcould contain that bitstring

• This may take a long time if the signature file is This may take a long time if the signature file is long (this depends on the block size)long (this depends on the block size)

• Could we check only at the bits we are Could we check only at the bits we are interested ? Yes, we can …interested ? Yes, we can …

Page 43: Modern information retrieval Chapter 8 – Indexing and Searching

4444

Signature Files - alternativesSignature Files - alternatives

• Original signature file layout: Original signature file layout: – Each row has a bitmask corresponding to a blockEach row has a bitmask corresponding to a block– Each column tells whether that bit is set or notEach column tells whether that bit is set or not

• Bit-sliced signature files layout:Bit-sliced signature files layout:– Each row correspond to a particular bit in the Each row correspond to a particular bit in the

bitmasksbitmasks– Each column is the bitmask for the blocksEach column is the bitmask for the blocks

• The signature file is “transposed”The signature file is “transposed”

Page 44: Modern information retrieval Chapter 8 – Indexing and Searching

4545

Signature Files - alternativesSignature Files - alternatives

Block signature (bitmask)

Bits in the bitmask

Pointers to Blocks in the documents

Each row isa file !

Page 45: Modern information retrieval Chapter 8 – Indexing and Searching

4646

Signature Files - alternativesSignature Files - alternatives

• Advantages of bit-slicing:Advantages of bit-slicing:• Given a term bitstring one can read only Given a term bitstring one can read only

the lines (files) corresponding to those set the lines (files) corresponding to those set bits and if they are set in a given column, bits and if they are set in a given column, then the whole column may be retrieved.then the whole column may be retrieved.

• Empty answers are found very fast Empty answers are found very fast • Insertion is expensive, as all rows must be Insertion is expensive, as all rows must be

updated, but in an append-only fashion, updated, but in an append-only fashion, this is good for WORM media typesthis is good for WORM media types

Page 46: Modern information retrieval Chapter 8 – Indexing and Searching

4747

Signature Files - alternativesSignature Files - alternatives

• It has been proposed that m = 50% (half of It has been proposed that m = 50% (half of the bits to be 1)the bits to be 1)

• A typical value of m is 10, that implies 10 A typical value of m is 10, that implies 10 accesses for a single term queryaccesses for a single term query

• Some authors have suggested to use a Some authors have suggested to use a smaller m (thus smaller number of accesses)smaller m (thus smaller number of accesses)

• The trade-off is that to maintain low The trade-off is that to maintain low probability of false-hits, the signature has to probability of false-hits, the signature has to be larger, thus the space required for the be larger, thus the space required for the signature grows larger as wellsignature grows larger as well

Page 47: Modern information retrieval Chapter 8 – Indexing and Searching

4848

Searching - MotivationSearching - Motivation

• Thus far, we’ve seen how to pre-process texts Thus far, we’ve seen how to pre-process texts (stemming, “stopping” words, compressing, etc)(stemming, “stopping” words, compressing, etc)

• We’ve also seen how to build, process, improve We’ve also seen how to build, process, improve and assess the quality of queries and the and assess the quality of queries and the returned answer setsreturned answer sets

• In the last classes, we saw how to effectively In the last classes, we saw how to effectively index the textsindex the texts

• Now, how do we search text ? It goes beyond Now, how do we search text ? It goes beyond searching the indices, specially in the cases searching the indices, specially in the cases where the indices are not enough to answer where the indices are not enough to answer queries (e.g., proximity queries)queries (e.g., proximity queries)

Page 48: Modern information retrieval Chapter 8 – Indexing and Searching

4949

Sequential SearchingSequential Searching• If no index is used this is the only option, however If no index is used this is the only option, however

sometimes (e.g., when blocking is used) it is sometimes (e.g., when blocking is used) it is necessary even if an index existsnecessary even if an index exists

• The problem is “simple”The problem is “simple”– Given a pattern P of length n and a text T of length m (m Given a pattern P of length n and a text T of length m (m

>> n) find all positions in T where P occurs>> n) find all positions in T where P occurs

• There is much more work on this than we can There is much more work on this than we can cover, including many theoretical results. Thus we cover, including many theoretical results. Thus we will cover some well known approacheswill cover some well known approaches

Page 49: Modern information retrieval Chapter 8 – Indexing and Searching

5050

Brute ForceBrute Force• The name says it all …The name says it all …

• Starting from all possible initial positions (i.e., all Starting from all possible initial positions (i.e., all positions) whether the pattern could start at that positions) whether the pattern could start at that positionposition

• It takes O(mn) time in the worst case, but O(n) in the It takes O(mn) time in the worst case, but O(n) in the average case – not that bad average case – not that bad

• The most important thing is that it suggests the use The most important thing is that it suggests the use of a sliding window over the text. The idea is to see of a sliding window over the text. The idea is to see whether we can see the pattern through the windowwhether we can see the pattern through the window

Page 50: Modern information retrieval Chapter 8 – Indexing and Searching

5151

Knuth-Morris-PrattKnuth-Morris-Pratt

• Consider the following text:Consider the following text:TEXTURE OF A TEXT IS NOT TEXTUALTEXTURE OF A TEXT IS NOT TEXTUAL

• And that the queried term is ‘TEXT ’ (note And that the queried term is ‘TEXT ’ (note the blank at the end !)the blank at the end !)

• The window is slidThe window is slidTEXTUTEXTURE OF A TEXT IS NOT TEXTUALRE OF A TEXT IS NOT TEXTUAL

• At this point there cannot be a match so the At this point there cannot be a match so the window can be slid towindow can be slid toTEXTUTEXTURE OFRE OF A TEXT IS NOT TEXTUAL A TEXT IS NOT TEXTUAL

• Is that correct ?Is that correct ?

Page 51: Modern information retrieval Chapter 8 – Indexing and Searching

5252

Knuth-Morris-PrattKnuth-Morris-Pratt

• Only if we are not interested in the occurrence Only if we are not interested in the occurrence of pattern within terms, otherwise notof pattern within terms, otherwise not

• For instance, search for ABRACADABRA within For instance, search for ABRACADABRA within ABRACABRACABRACADABRAABRACADABRA

• The previous idea would do:The previous idea would do:ABRACABABRACABRACADABRA and then skip toRACADABRA and then skip toABRACABABRACABRACADABRA RACADABRA missing the patternmissing the pattern

• The correct move would be to The correct move would be to ABRACABRACABRACADABRAABRACADABRA

• How come ?How come ?

Page 52: Modern information retrieval Chapter 8 – Indexing and Searching

5353

Knuth-Morris-PrattKnuth-Morris-Pratt

• The processing time of this algorithm The processing time of this algorithm is O(n), which is linear, thus better is O(n), which is linear, thus better than the Brute Force Algorithmthan the Brute Force Algorithm

• However, in the average case the However, in the average case the Brute Force algorithm is O(n) as wellBrute Force algorithm is O(n) as well

• Many, many others algorithms exist …Many, many others algorithms exist …