Indexing and Searching

Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8

Upload: maire

Post on 22-Jan-2016

TRANSCRIPT

Page 1: Indexing and Searching

Indexing and Searching

Modern Information Retrieval

by R. Baeza-Yates and B. Ribeiro-Neto

Chapter 8

Page 2: Indexing and Searching

Outline

Inverted Files

Other Indices for Text

Sequential Searching

Pattern Matching

Compression

Page 3: Indexing and Searching

Inverted Files

An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.

Structure: vocabulary and occurrences

Block addressing: the text is divided into blocks, and the occurrences point to the blocks

Full inverted indices: exact occurrences
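The two addressing schemes above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `build_inverted_index`, the `block_size` parameter, and the choice of character offsets as "exact occurrences" are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word (the vocabulary) to its sorted list of occurrences.

    block_size=None gives a full inverted index: occurrences are exact
    character offsets. Otherwise the text is cut into blocks of block_size
    characters (an assumed blocking rule) and occurrences are block numbers.
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        pos = match.start()
        entry = pos if block_size is None else pos // block_size
        postings = index[match.group()]
        # With block addressing, a word repeated inside one block is stored once.
        if not postings or postings[-1] != entry:
            postings.append(entry)
    return dict(index)

text = "to be or not to be"
full = build_inverted_index(text)                  # exact occurrences
blocks = build_inverted_index(text, block_size=9)  # block addressing
```

With block addressing the occurrence lists are shorter and the pointers smaller, at the price of having to scan the matching blocks to find exact positions.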

Page 4: Indexing and Searching

Page 5: Indexing and Searching

Page 6: Indexing and Searching

Inverted Files

The search algorithm on an inverted index: vocabulary search, retrieval of occurrences, manipulation of occurrences

Construction (split the index into two files)

Posting file: the lists of occurrences are stored contiguously

The vocabulary is stored in lexicographical order and points to its list.
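The two-file layout and the three search steps can be sketched as follows. The names `split_index`, `occurrences`, and `both` are hypothetical, the "files" are plain Python lists, and the AND-style intersection is only one possible example of manipulating occurrences.

```python
from bisect import bisect_left

def split_index(index):
    """Split a word -> occurrences mapping into a sorted vocabulary and one posting list."""
    vocabulary, offsets, postings = [], [], []
    for word in sorted(index):              # lexicographical order
        vocabulary.append(word)
        offsets.append(len(postings))       # where this word's list starts
        postings.extend(index[word])        # lists are stored contiguously
    offsets.append(len(postings))           # sentinel: end of the last list
    return vocabulary, offsets, postings

def occurrences(word, vocabulary, offsets, postings):
    i = bisect_left(vocabulary, word)       # step 1: vocabulary search
    if i == len(vocabulary) or vocabulary[i] != word:
        return []
    return postings[offsets[i]:offsets[i + 1]]   # step 2: retrieval of occurrences

def both(word_a, word_b, vocabulary, offsets, postings):
    """Step 3, manipulation of occurrences: entries listed under both words."""
    a = set(occurrences(word_a, vocabulary, offsets, postings))
    return [x for x in occurrences(word_b, vocabulary, offsets, postings) if x in a]

# Block-addressed example index (block numbers as occurrences).
index = {"not": [1], "be": [0, 1], "to": [0, 1], "or": [0]}
voc, off, post = split_index(index)
```

Because the vocabulary is sorted, the lookup is a binary search, and each word costs one slice of the posting list.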

Page 7: Indexing and Searching

Page 8: Indexing and Searching

Inverted Files

For large texts: partial indices

Merging two indices consists of merging the sorted vocabularies.
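A minimal sketch of the merge step, assuming each partial index is a list of `(word, occurrences)` pairs sorted by word, and assuming the second index covers a later part of the text so that lists for a shared word can simply be concatenated. The name `merge_indices` is hypothetical.

```python
def merge_indices(left, right):
    """Merge two partial indices given as lists of (word, occurrences) sorted by word.

    Assumption: every occurrence in `right` is larger than every occurrence
    in `left`, so concatenating the two lists keeps them sorted.
    """
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (lw, lo), (rw, ro) = left[i], right[j]
        if lw < rw:
            merged.append((lw, lo))
            i += 1
        elif rw < lw:
            merged.append((rw, ro))
            j += 1
        else:                               # same word in both indices
            merged.append((lw, lo + ro))
            i += 1
            j += 1
    merged.extend(left[i:])                 # whatever remains is already sorted
    merged.extend(right[j:])
    return merged

first_half = [("be", [3]), ("to", [0])]
second_half = [("not", [9]), ("to", [13])]
whole = merge_indices(first_half, second_half)
```

The merge reads both inputs once from front to back, so it works on partial indices that are too large to hold in memory at the same time.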

Page 9: Indexing and Searching

Page 10: Indexing and Searching

Other Indices for Text

Suffix Trees

Suffix Arrays

Signature Files

Page 11: Indexing and Searching

Suffix Trees and Suffix Arrays

Each position in the text is considered as a text suffix.

Index points are selected from the text; they point to the beginning of the text positions that will be retrievable.

Page 12: Indexing and Searching

Page 13: Indexing and Searching

Suffix arrays

The main drawback of suffix arrays is their costly construction process.

They allow binary searches, done by comparing the contents of each pointer.

Supra-indices (for large suffix arrays)
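As an illustration of both points, here is a small sketch that builds a suffix array naively and searches it by binary search. It assumes every character position is an index point, and the function names are hypothetical. Sorting whole suffixes as done here is simple but slow on large texts, which hints at why construction is considered costly.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by the suffix they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Return the sorted text positions where pattern occurs, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                    # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                    # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])       # all pointers in between are occurrences

text = "abracadabra"
sa = build_suffix_array(text)
```

Each comparison follows a pointer into the text and compares the contents found there, so a search costs O(log n) such comparisons.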

Page 14: Indexing and Searching

Page 15: Indexing and Searching

Page 16: Indexing and Searching

Construction of Suffix Arrays for Large Texts

Page 17: Indexing and Searching

Signature Files

Word-oriented index structures based on hashing

Maps words to bit masks of B bits

Divides the text into blocks of b words each

The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.

Hash the query to a bit mask W

If W & Bi = W, where Bi is the mask of block i, the text block may contain the word
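The scheme above can be sketched as follows. The values `B = 16` and `K = 2` (bits set per word), the SHA-256 based hash, and all function names are assumptions chosen for the example, not parameters from the slides.

```python
import hashlib

B = 16   # bits per mask (assumed value)
K = 2    # bits set per word signature (assumed value)

def word_signature(word):
    """Hash a word to a B-bit mask with up to K bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for byte in digest[:K]:
        mask |= 1 << (byte % B)
    return mask

def block_masks(words, b):
    """Cut the word list into blocks of b words and OR the signatures in each block."""
    masks = []
    for start in range(0, len(words), b):
        mask = 0
        for word in words[start:start + b]:
            mask |= word_signature(word)
        masks.append(mask)
    return masks

def candidate_blocks(query, masks):
    """Blocks whose mask Bi satisfies W & Bi == W; they MAY contain the query word."""
    w = word_signature(query)
    return [i for i, bi in enumerate(masks) if (w & bi) == w]

words = "to be or not to be".split()
masks = block_masks(words, 3)   # block 0: to be or, block 1: not to be
```

The test never misses a block that contains the word, but it can report a block that does not (a false drop), so candidate blocks still have to be checked against the text.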

Page 18: Indexing and Searching

Page 19: Indexing and Searching

Sequential Searching

Brute Force

Knuth-Morris-Pratt

Boyer-Moore Family

Shift-Or

Suffix Automaton: Backward DAWG matching (BDM), BNDM

Page 20: Indexing and Searching

Knuth-Morris-Pratt
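The slide for this algorithm carries no text in the transcript, so the following is only a generic sketch of Knuth-Morris-Pratt as it is commonly presented; the function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0                  # k = number of pattern characters currently matched
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]          # fall back without moving backwards in the text
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]          # keep going to find overlapping occurrences
    return hits
```

The text pointer never moves backwards, which gives a worst case linear in the text length plus the pattern length.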

Page 21: Indexing and Searching

Boyer-Moore Family
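The transcript gives no detail for this slide. As one small member of the family, here is a sketch of the Horspool simplification, which keeps only a shift table keyed on the text character aligned with the end of the pattern. For brevity the window is compared with a slice instead of character by character from right to left; the function name is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text (Horspool variant)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character (excluding the final
    # pattern position) to the end of the pattern.
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the entry of the text character under the last pattern position;
        # characters that do not occur in the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

On typical text most windows are skipped after inspecting a single character, which is what makes this family fast in practice.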

Page 22: Indexing and Searching

Shift-Or
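The slide itself has no text in the transcript. The following is a generic sketch of Shift-Or using Python integers as bit vectors; the function name is hypothetical. Bit j of the state is 0 exactly when the first j+1 pattern characters match the text ending at the current position.

```python
def shift_or(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # table[c] has bit j cleared if and only if pattern[j] == c.
    table = {}
    for j, c in enumerate(pattern):
        table[c] = table.get(c, all_ones) & ~(1 << j)
    state = all_ones                 # nothing matched yet
    hits = []
    for i, c in enumerate(text):
        # Shifting brings in a 0 (a new match may start here); OR-ing with the
        # table entry sets every bit whose pattern character differs from c.
        state = ((state << 1) | table.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(i - m + 1)
    return hits
```

Each text character costs one shift and one OR, as long as the pattern fits in a machine word (Python integers remove that limit here, at some cost in speed).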

Page 23: Indexing and Searching

Suffix Automaton

Page 24: Indexing and Searching

Page 25: Indexing and Searching

Pattern Matching

Searching allowing errors: Dynamic Programming, Automaton

Regular Expressions and Extended patterns

Pattern Matching Using Indices: Inverted files, Suffix Trees and Suffix Arrays

Page 26: Indexing and Searching

Dynamic Programming
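The transcript has no text for this slide. A common way to search allowing errors with dynamic programming is to keep one column of edit distances and let a match start at any text position; the sketch below follows that approach, with `approx_search` and the error bound `k` as assumed names.

```python
def approx_search(text, pattern, k):
    """Return the end positions in text where pattern matches with at most k errors.

    Errors are insertions, deletions, and substitutions of single characters.
    """
    m = len(pattern)
    col = list(range(m + 1))             # col[i] = errors to match pattern[:i] so far
    ends = []
    for j, t in enumerate(text):
        prev_diag, col[0] = col[0], 0    # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cur = col[i]                 # value from the previous text position
            if pattern[i - 1] == t:
                col[i] = prev_diag
            else:
                col[i] = 1 + min(prev_diag, cur, col[i - 1])
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only one column is stored, so the space is proportional to the pattern length while the time is proportional to the pattern length times the text length.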

Page 27: Indexing and Searching

Automaton

Page 28: Indexing and Searching

Regular Expressions

Page 29: Indexing and Searching

Pattern Matching Using Indices

Inverted Files

Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary.

The restriction is that approximate matches or regular expressions that span many words cannot be found efficiently this way.
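A minimal sketch of this idea, assuming the index is a plain word-to-occurrences dictionary and using Python's `re` module as the sequential matcher; the function name `regex_query` and the sample index are made up for the example.

```python
import re

def regex_query(index, pattern):
    """Scan the vocabulary sequentially and merge the occurrences of every matching word."""
    regex = re.compile(pattern)
    hits = set()
    for word, occs in index.items():     # sequential search over the vocabulary
        if regex.fullmatch(word):        # the pattern must match a whole word
            hits.update(occs)
    return sorted(hits)

index = {"search": [0, 40], "searching": [12], "research": [25], "index": [7]}
suffix_hits = regex_query(index, r".*search")   # suffix query
prefix_hits = regex_query(index, r"search.*")   # prefix query
```

Because each vocabulary entry is a single word, a pattern that needs to match across a word boundary never matches any entry, which is the restriction noted above.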

Page 30: Indexing and Searching

Pattern Matching Using Indices

Suffix Trees

Suffix trees are able to perform complex searches: word, prefix, suffix, substring, and range queries; regular expressions; unrestricted approximate string matching.

Useful in specific areas: find the longest substring that appears more than once; find the most common substring of a fixed size.

Page 31: Indexing and Searching

Pattern Matching Using Indices

Suffix Arrays

Some patterns can be searched directly in the suffix array without simulating the suffix tree: word, prefix, suffix, subword search, and range search.

Page 32: Indexing and Searching

Compression

Compressed text (Huffman coding): taking words as symbols; using an alphabet of bytes instead of bits

Compressed indices: Inverted Files, Suffix Trees and Suffix Arrays, Signature Files
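To make "taking words as symbols" concrete, here is a small sketch that builds a Huffman code over whole words. It is deliberately simplified: the code is binary (bits), not the byte-oriented alphabet the slide mentions, and the function name `word_huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter

def word_huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if len(freq) == 1:                   # degenerate case: a single distinct word
        return {word: "0" for word in freq}
    # Heap entries: (weight, tie_breaker, {word: code_so_far}).
    # The unique tie_breaker keeps the dictionaries from ever being compared.
    heap = [(n, i, {w: ""}) for i, (w, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # the two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}
        merged.update({w: "1" + c for w, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

words = "to be or not to be".split()
codes = word_huffman_codes(words)
encoded = "".join(codes[w] for w in words)
```

Frequent words receive codes that are no longer than those of rare words, and since the codes are prefix-free the encoded text can be decoded word by word.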