special topics in computer science the art of information retrieval chapter 8: indexing and...
TRANSCRIPT
![Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 8: Indexing and Searching Alexander Gelbukh](https://reader036.vdocuments.site/reader036/viewer/2022062712/56649b4b550346318e8c1e3e/html5/thumbnails/1.jpg)
Special Topics in Computer Science
The Art of Information Retrieval
Chapter 8: Indexing and Searching
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
Text transformation: meaning instead of strings
o Lexical analysis
o Stopwords
o Stemming
POS, WSD, syntax, semantics
Ontologies to collate similar stems
Text compression
o Searchable (compress the query, then search)
o Random access
o Word-based statistical methods (Huffman)
Index compression
Previous Chapter: Research topics
All computational linguistics
o Improved POS tagging
o Improved WSD
Uses of thesaurus
o for user navigation
o for collating similar terms
Better compression methods
o Searchable compression
o Random access
Types of searching
Sequential
o Small texts
o Volatile, or space limited
Indexed
o Semi-static
o Space overhead
First, we discuss indexed searching, then sequential
Inverted files
Vocabulary: sqrt(n) (Heaps’ law); 1 GB of text → about 5 MB of vocabulary
Occurrences: n * 40% (stopwords)
o positions (word, char), files, sections...
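The two parts of an inverted file (vocabulary and occurrences) can be illustrated with a minimal sketch. The function name `build_inverted_index` and the choice of word positions (rather than character positions or file numbers) are assumptions made for illustration only.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each word (vocabulary entry) to the list of its word positions."""
    index = defaultdict(list)
    # Tokenize on runs of word characters; position = ordinal of the word.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)

index = build_inverted_index("to be or not to be")
print(sorted(index))   # the vocabulary
print(index["to"])     # the occurrences of one word
```

The keys play the role of the vocabulary, and each list plays the role of the occurrences of that word.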
Compression: Block addressing
Block addressing: 5% overhead
o 256, 64K, ... blocks (1, 2, ... bytes per pointer)
o Equal size (faster search) or logical sections (retrieval units)
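A rough sketch of block addressing, assuming equal-size blocks measured in words: the index stores only block numbers, so exact positions must be recovered by scanning the candidate blocks sequentially. The names `build_block_index` and `find_positions` are hypothetical.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    index = {}
    for i, word in enumerate(words):
        block = i // block_size
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:  # store each block only once
            blocks.append(block)
    return index

def find_positions(words, index, block_size, query):
    """Recover exact positions by scanning only the candidate blocks."""
    positions = []
    for block in index.get(query, []):
        start = block * block_size
        for i in range(start, min(start + block_size, len(words))):
            if words[i] == query:
                positions.append(i)
    return positions

words = "a rose is a rose is a rose".split()
block_index = build_block_index(words, block_size=4)
print(block_index["rose"])                          # block numbers only
print(find_positions(words, block_index, 4, "rose"))  # exact positions
```

The index is smaller because pointers are short and repeated occurrences inside one block collapse into a single entry; the price is the sequential scan inside each block.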
Searching in inverted files
Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation with occurrences: ~sqrt(n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging: one list is shorter (Zipf law)
Only inverted files allow both space and time to be sublinear
Suffix trees and signature files don’t
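The point about lexicographic search versus hashing can be sketched as follows, assuming the vocabulary is kept as a sorted list in memory. Binary search locates a word in O(log V) steps, and all words sharing a prefix are adjacent, which a hash table cannot offer. The function names are illustrative.

```python
from bisect import bisect_left

def find_word(vocabulary, word):
    """Binary search in a sorted vocabulary; returns the index or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def prefix_range(vocabulary, prefix):
    """All words starting with `prefix`; they form one contiguous run."""
    lo = bisect_left(vocabulary, prefix)
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1
    return vocabulary[lo:hi]

vocabulary = sorted(["index", "indexing", "inverted", "search", "suffix", "signature"])
print(find_word(vocabulary, "inverted"))
print(prefix_range(vocabulary, "ind"))
```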
Building inverted file: 1
Infinite memory? Use trie to store vocabulary
o append positions
O(n)
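A minimal sketch of the in-memory construction, assuming the text is already split into words: each word is walked down a trie and its position is appended at the final node, so the total work is linear in the number of characters. The class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.positions = []  # occurrences of the word ending at this node

def build_trie_index(words):
    """Insert every word into the trie and append its position."""
    root = TrieNode()
    for position, word in enumerate(words):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.positions.append(position)
    return root

def lookup(root, word):
    """Return the list of positions of `word`, or [] if it is absent."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.positions

root = build_trie_index("to be or not to be".split())
print(lookup(root, "to"))
```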
Building inverted file: 2
Finite memory?
Fill the memory
Write partial index; n/M pieces
Merge partial indices (hierarchically): n log (n/M)
Insertion: index, merge. n + n' log(n'/M)
Deleting: eliminate every occurrence. n
Very fast creation and maintenance
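The finite-memory procedure can be sketched like this, under the simplifying assumption that "memory" holds `memory_limit` words and that partial indices are kept as sorted lists in memory instead of being written to disk. Adjacent partial indices are merged pairwise, level by level. All names are illustrative.

```python
import heapq

def build_partial_indices(words, memory_limit):
    """Index the text chunk by chunk; each chunk fits in 'memory'."""
    partials = []
    for start in range(0, len(words), memory_limit):
        chunk = {}
        for offset, word in enumerate(words[start:start + memory_limit]):
            chunk.setdefault(word, []).append(start + offset)
        partials.append(sorted(chunk.items()))  # sorted by word
    return partials

def merge_two(a, b):
    """Merge two sorted partial indices; `a` must cover the earlier text."""
    merged = []
    for word, positions in heapq.merge(a, b, key=lambda item: item[0]):
        if merged and merged[-1][0] == word:
            merged[-1][1].extend(positions)   # same word: concatenate lists
        else:
            merged.append((word, list(positions)))
    return merged

def merge_hierarchically(partials):
    """Merge neighbours pairwise until one index remains."""
    while len(partials) > 1:
        next_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                next_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                next_level.append(partials[i])
        partials = next_level
    return dict(partials[0]) if partials else {}

words = "to be or not to be that is the question".split()
full_index = merge_hierarchically(build_partial_indices(words, memory_limit=3))
print(full_index["to"], full_index["be"])
```

Because only neighbouring pieces are merged, the position lists stay sorted without any extra sorting step.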
Suffix trees
Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space
For text retrieval, inverted files are better
Suffix array
All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
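A small sketch of a suffix array and its binary search. The construction below simply sorts the suffix start positions by comparing whole suffixes, which is fine for a toy example but is not how large texts are handled; the function names are assumptions.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for the run of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "abra"))
```

Any substring can be searched this way, not only whole words, since every occurrence of the pattern is the prefix of some suffix.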
Searching. Construction
Searching
Patterns, prefixes, phrases. Not only words
Suffix tree: O(m), but: space (m = query size)
Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
o Large text: n^2 log(M)/M, more than for inverted files
o Skip details
Addition: n n' log(M)/M
Deletion: n
Signature files
Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of the word's pattern are set in the block's signature
Sequential search for blocks
False drops!
o Design of the hash function
o Have to traverse the block
Good to search ANDs or proximity queries
o bit patterns are ORed
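A minimal sketch of a signature file. The signature width (64 bits), the number of bits per word (3), and the use of SHA-256 as the hash function are all arbitrary assumptions for illustration; the slides do not prescribe them.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_signature(word):
    """Map a word to a bit pattern with (up to) BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(words):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, word):
    """Scan all signatures; verify candidates to eliminate false drops."""
    wanted = word_signature(word)
    hits = []
    for i, signature in enumerate(signatures):
        if signature & wanted == wanted:   # all bits of the word are set
            if word in blocks[i]:          # traverse the block to confirm
                hits.append(i)
    return hits

blocks = [["inverted", "files"], ["suffix", "arrays"], ["signature", "files"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "files"))
```

The verification step is what the "have to traverse the block" bullet refers to: a matching signature only says the block may contain the word.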
Boolean operations
Merging file (occurrences) lists
o AND: to find repetitions
According to query syntax tree
Complexity linear in intermediate results
o Can be slow if they are huge
There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
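The two ways of computing an AND over sorted occurrence lists can be sketched as follows: a linear merge, and the optimization where each element of the short list is looked up in the long one by binary search. Function names are illustrative.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Linear merge of two sorted lists: O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short, long):
    """Binary-search each element of the short list in the long one."""
    result = []
    lo = 0
    for doc in short:
        lo = bisect_left(long, doc, lo)   # never look left of the last hit
        if lo == len(long):
            break
        if long[lo] == doc:
            result.append(doc)
    return result

rare = [2, 40, 97]
frequent = list(range(0, 100, 2))
print(intersect_merge(rare, frequent))
print(intersect_by_search(rare, frequent))
```

When one list is much shorter than the other, the second variant touches only about len(short) * log(len(long)) entries.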
Sequential search
Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
Knuth-Morris-Pratt: linear worst case, but the same average
Boyer-Moore: n log(m) / m. Not all chars are examined!
o If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once
Shift-Or: uses logical operations on all 32 bits in parallel
BDM: automaton. Complexity same as Boyer-Moore
Combination of BDM with bit parallelism
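A sketch of Shift-Or, as one example of bit parallelism. In the state word, bit i is 0 when the first i + 1 pattern characters match the text ending at the current position. Python integers are unbounded, so the state is masked to m bits here; on a real machine the pattern length would be limited by the word size (e.g., 32 bits).

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of `pattern`."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for pos, ch in enumerate(text):
        # One shift and one OR update all m partial matches in parallel.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:   # whole pattern matched
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))
```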
Approximate string matching
Match with k errors
Levenshtein distance
Dynamic programming: O(mn), O(kn)
Automaton: non-deterministic
o Convert to deterministic: O(n), but huge structure
o Bit-parallel: O(n), the fastest known
Filtering: sublinear!
o k errors cannot alter all k + 1 segments of the pattern
o multipattern exact search; detect suspicious places
o uses approximate algorithm only when needed
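The O(mn) dynamic programming approach can be sketched as below. It keeps one column of the edit-distance matrix, with the first cell fixed at 0 so that a match may start anywhere in the text, and reports every text position where an occurrence with at most k errors ends. The function name is an assumption.

```python
def approximate_search(text, pattern, k):
    """End positions in `text` of substrings within k edits of `pattern`."""
    m = len(pattern)
    # column[i] = min edits to match pattern[:i] against a text suffix
    # ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        new_column = [0]  # empty pattern prefix matches anywhere at no cost
        for i in range(1, m + 1):
            new_column.append(min(
                column[i - 1] + (pattern[i - 1] != ch),  # match / substitute
                column[i] + 1,                           # extra text character
                new_column[i - 1] + 1,                   # missing pattern character
            ))
        column = new_column
        if column[m] <= k:
            ends.append(pos)
    return ends

print(approximate_search("zabcz", "abc", 0))
print(approximate_search("zabdz", "abc", 1))
```

With k = 0 this degenerates to exact search; with k = 1 the text "zabdz" matches both as "ab" (one deletion) and as "abd" (one substitution).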
Regular expressions
Regular expressions
o Automaton: O(m 2^m) + O(n); bad for long patterns
o Bit-parallel (simulates non-deterministic)
Using indices to search for words with errors
o Inverted files: search in vocabulary, then each word
o Suffix trees and suffix arrays: the same algorithms!
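Searching for words with errors through an inverted file can be sketched as two steps: scan the (small) vocabulary for words within k edits of the query, then collect the occurrences of each matching word. The toy index and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match / substitute
            ))
        previous = current
    return previous[-1]

def search_with_errors(index, query, k):
    """Step 1: approximate search in the vocabulary. Step 2: merge occurrences."""
    result = set()
    for word, occurrences in index.items():
        if edit_distance(query, word) <= k:
            result.update(occurrences)
    return sorted(result)

index = {"retrieval": [1, 4], "retrieve": [2], "search": [3]}
print(search_with_errors(index, "retreival", 2))
print(search_with_errors(index, "serch", 1))
```

The sequential approximate search runs only over the vocabulary, which is much smaller than the text.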
Structural queries
Ad-hoc index for structure
Indexing tags as words
o Inverted files are good since they store occurrences in order
Search over compression
Improves both space AND time (fewer disk operations)
Compress query and search
o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
o Look up each word in the vocabulary to get its code
o More sophisticated algorithms
Compressed inverted files: less disk, less time
Text and index compression can be combined
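As an illustration of compressed inverted files, here is a sketch of one common scheme: store the gaps between consecutive occurrences and encode each gap with a variable number of bytes. The slides do not name a specific method, so gap plus variable-byte coding is an assumption chosen for its simplicity.

```python
def vbyte_encode(number):
    """7 data bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while number >= 128:
        out.append((number & 0x7F) | 0x80)
        number >>= 7
    out.append(number)
    return bytes(out)

def compress_postings(postings):
    """Encode a sorted list of positions as variable-byte coded gaps."""
    out = bytearray()
    previous = 0
    for position in postings:
        out += vbyte_encode(position - previous)
        previous = position
    return bytes(out)

def decompress_postings(data):
    """Inverse of compress_postings."""
    postings = []
    value = shift = previous = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: keep accumulating
            shift += 7
        else:                    # last byte of this gap
            previous += value
            postings.append(previous)
            value = shift = 0
    return postings

postings = [3, 7, 300, 100000]
compressed = compress_postings(postings)
print(len(compressed), "bytes instead of", 4 * len(postings))
```

Small gaps, which dominate in the lists of frequent words, take a single byte each, so less data has to be read from disk.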
...compression
Suffix trees can be compressed almost to size of suffix arrays
Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression
Signature files are sparse, so can be compressed
o ratios up to 70%
Research topics
Perhaps, new details in integration of compression and search
“Linguistic” indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms
Conclusions
Inverted files seem to be the best option
Other structures are good for specific cases
o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching
Compression can be integrated with search
Thank you!
Till compensation lecture?