Database Tuning, Spring 2009
Text Indexing
Rasmus Pagh
in part based on slides by S. Srinivasa Rao and Paolo Ferragina
Today’s lecture
• B+-trees and strings.
• Inverted indexes.
• Fingerprinting - Rabin-Karp.
• Full-text indexing
  – Suffix arrays and suffix trees.
  – String B-trees.
Why is string data interesting ?
Ubiquitous:
• Digital libraries (studied a lot already in the ‘80s).
• Product catalogues, e-commerce websites (this is what got the DB world interested in text).
• Electronic white and yellow pages (Apptus guest lecture)
• Specialized information (e.g. genomic or patent DBs)
• Web page repositories (search engines)
• ...
String collections grow at a staggering rate:
...10s of Tb textual data on the web
...10s of Gb of base pairs in the genomic DBs
The need for indexing
• Need special facilities, not just equality comparison.
• Brute-force scanning is usually not a viable approach.
• Indexing allows fast “simple” searches.
• Can use multiple simple searches for complex queries.
An information retrieval (IR) system also encompasses many aspects we will not go into:
• Ranking algorithms
• Query languages and operations
• ...
Strings in a B-tree
• Allows searching for strings with a given prefix.
• Does not help finding, say, a string containing a given word.
• Even when a B-tree is appropriate, string keys can be problematic:
  – Reduce fan-out, increase depth of tree
  – Prefix compression (i.e., omit part of key shared with the previous key) may help, but not always.
• Motivates looking at special indexes.
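Prefix compression as described above can be sketched as follows. This is a minimal illustration, not how any particular B-tree implements it; the function names `front_encode` and `front_decode` are made up for this example.

```python
import os

def front_encode(sorted_keys):
    """Store each key as (length of prefix shared with previous key, rest)."""
    encoded, prev = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded

def front_decode(encoded):
    """Rebuild the original keys from the (shared, rest) pairs."""
    keys, prev = [], ""
    for shared, rest in encoded:
        prev = prev[:shared] + rest
        keys.append(prev)
    return keys

keys = ["bear", "bid", "bulk", "bull"]
print(front_encode(keys))
```

Keys that share little with their predecessor (here "bid" after "bear") save almost nothing, which is one reason the technique does not always help.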
Two families of indexes
Types of data and types of query:
• Linguistic or tokenizable text → word-based query (exact word, word prefix or suffix, phrase).
• Raw sequence of characters or bytes (DNA sequences, audio-video files, executables) → character-based query (arbitrary substring, complex matches).
Two indexing approaches:
• Word-based indexes, here a concept of “word” must be devised!
  » Inverted files, signature files or bitmaps.
• Full-text indexes, no constraint on text and queries!
  » Suffix array, suffix tree, hybrid indexes, or String B-tree.
Word-based: Inverted files
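Doc #1: Now is the time for all good men to come to the aid of their country
Doc #2: It was a dark and stormy night in the country manor. The time was past midnight

An inverted file (or list, or index) consists of a vocabulary and, for each term, its occurrences:

| Term | N docs | Tot freq | Occurrences (doc #: freq) |
|---|---|---|---|
| a | 1 | 1 | 2: 1 |
| aid | 1 | 1 | 1: 1 |
| all | 1 | 1 | 1: 1 |
| and | 1 | 1 | 2: 1 |
| come | 1 | 1 | 1: 1 |
| country | 2 | 2 | 1: 1, 2: 1 |
| dark | 1 | 1 | 2: 1 |
| for | 1 | 1 | 1: 1 |
| good | 1 | 1 | 1: 1 |
| in | 1 | 1 | 2: 1 |
| is | 1 | 1 | 1: 1 |
| it | 1 | 1 | 2: 1 |
| manor | 1 | 1 | 2: 1 |
| men | 1 | 1 | 1: 1 |
| midnight | 1 | 1 | 2: 1 |
| night | 1 | 1 | 2: 1 |
| now | 1 | 1 | 1: 1 |
| of | 1 | 1 | 1: 1 |
| past | 1 | 1 | 2: 1 |
| stormy | 1 | 1 | 2: 1 |
| the | 2 | 4 | 1: 2, 2: 2 |
| their | 1 | 1 | 1: 1 |
| time | 2 | 2 | 1: 1, 2: 1 |
| to | 1 | 2 | 1: 2 |
| was | 1 | 2 | 2: 2 |

Query answering is a two-phase process: look up the terms in the vocabulary, then process their occurrence lists.
Example: midnight AND time → Doc #2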
Observations on inverted files
• Can be implemented directly in the relational model.
• Use standard index on vocabulary and occurrences.
• Very efficient for single-word search.
• When searching for multiple words, an intersection (i.e. join) is needed.
  – Time depends on the number of occurrences of each word.
• Size will often be smaller than the indexed strings. (Why?)
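The observations above can be tried out on the two example documents. The following is a minimal in-memory sketch, assuming a simple tokenizer that lowercases the text and keeps runs of letters; the names `build_inverted_index` and `query_and` are illustrative only.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: frequency}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def query_and(index, *terms):
    """Doc ids containing every term: intersect the occurrence lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}
index = build_inverted_index(docs)
print(query_and(index, "midnight", "time"))
```

The cost of `query_and` is driven by the sizes of the occurrence lists being intersected, which matches the remark that time depends on the number of occurrences of each word.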
Refinements to inverted files
• Keep track of locations of words - e.g. allows higher priority to strings where words occur close together.
• Non-exact matches of words
  – Prefix matches, e.g. Ras*
  – General regular expressions
• Ranking mechanisms for results
  – Often, need only “top k”.
• Next: Full-text indexes - allow searches for any substring (e.g. of a word).
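A prefix match such as Ras* can be answered on a sorted vocabulary with one binary search followed by a scan. The sketch below assumes the vocabulary fits in a sorted Python list; `prefix_matches` and the sample words are made up for illustration.

```python
import bisect

def prefix_matches(sorted_vocab, prefix):
    """Return all vocabulary terms that start with prefix."""
    i = bisect.bisect_left(sorted_vocab, prefix)  # first term >= prefix
    matches = []
    while i < len(sorted_vocab) and sorted_vocab[i].startswith(prefix):
        matches.append(sorted_vocab[i])
        i += 1
    return matches

vocab = sorted(["rabin", "ras", "rasmus", "raster", "rat"])
print(prefix_matches(vocab, "ras"))
```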
Fingerprinting strings
• Consider a string s to be a sequence of numbers $a_0, a_1, \dots, a_n$.
• Corresponds to a polynomial $p_s(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$.
• Value for a given x can be computed with 2n arithmetic operations. (How?)
• Fact: Two distinct polynomials of degree at most n have the same value for at most n values of x.
• Idea: Choose x at random from a large set, then $p_s(x)$ is unique for each string with high probability.
[Karp-Rabin, 1987]
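One way to reach the 2n operation count is Horner's rule, sketched below with exact integer arithmetic. The function name and the choice of character codes as coefficients are assumptions of this example.

```python
def evaluate_polynomial(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n for a non-empty coefficient list.

    Horner's rule: n multiplications and n additions.
    """
    result = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        result = result * x + a   # one multiplication, one addition
    return result

coeffs = [ord(c) for c in "abc"]   # [97, 98, 99]
print(evaluate_polynomial(coeffs, 2))
```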
Problem session
• As stated, the technique works, but is inefficient!
• Try to figure out why. Hint: What is the cost of an arithmetic operation?
• If you can, propose a better way.
Fingerprinting strings
• Central observation: If all computation is done modulo a fixed prime p > x, then everything works.
• Further observations:
  – Can compute fingerprints of all prefixes of a string in linear time.
  – Can compute fingerprints of all substrings of length k in linear time (not k times linear). (Google’s BigTable system uses this in the compression algorithm.)
  – With fingerprints, we get fast equality search: Use them as B-tree keys.
[Karp-Rabin, 1987]
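The linear-time computation for all substrings of length k can be sketched as a sliding window. The sketch makes two assumptions not taken from the slide: the first character of a window gets the highest power of x (this avoids a modular division when sliding), and x = 256 with p = 1,000,000,007 are fixed defaults so the output is reproducible. In actual use x would be chosen at random, as described earlier.

```python
def substring_fingerprints(s, k, x=256, p=1_000_000_007):
    """Fingerprints of all length-k substrings of s, modulo the prime p.

    Window convention (assumed): F = s[i]*x^(k-1) + ... + s[i+k-1].
    """
    codes = [ord(c) for c in s]
    if k <= 0 or k > len(codes):
        return []
    top = pow(x, k - 1, p)                  # x^(k-1) mod p
    f = 0
    for c in codes[:k]:                     # first window, Horner's rule
        f = (f * x + c) % p
    result = [f]
    for i in range(k, len(codes)):
        f = (f - codes[i - k] * top) % p    # drop the leftmost character
        f = (f * x + codes[i]) % p          # shift and append the new one
        result.append(f)
    return result

fps = substring_fingerprints("mississippi", 3)
print(fps[1] == fps[4])   # both windows are "iss"
```

Each slide costs a constant number of arithmetic operations on numbers below p, so the total is linear in the string length rather than k times linear.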
Full-text indexes
Their need is pervasive:
– Raw data: DNA sequences, audio-video files, ...
– Linguistic texts: data mining, statistics, ...
– Vocabulary for inverted lists
– XPath queries on XML documents
– Intrusion detection, anti-viruses, ...
Classes of indexes:
• Suffix array, suffix tree (variants)
• Multi-level indexes: Short Pat array
• B-tree based data structures: Prefix B-tree, String B-tree
Terminology
• An alphabet, denoted Σ, is a set of (ordered) characters.
• A string S is an array of characters, S[1,n] = S[1] S[2] … S[n].
• S[i,j] = S[i] … S[j] is a substring of S.
• S[1,j] is a prefix of S; S[i,n] is a suffix of S.
• Σ* denotes all strings over alphabet Σ.
• Lexicographic order:
  – Example: For Σ = { a, b, c, …, z }, where a < b < c < … < z, the lexicographic order is the same as in a dictionary.
• SUF(T) = sorted set of suffixes of T
Indexed string matching problem
• T is a set of K strings in Σ*
  – N is the total length of all strings in T.
• String matching query on T:
  – Given a pattern P, find all occurrences of P in T.
• Full-text index: Store T in a data structure that supports string matching queries.
• Dynamic full-text index: Also supports insertions and deletions of strings in the full-text index.
A simple but crucial observation
Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n]
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: in T = This is a visual example, the pattern occurs at positions 3, 6, 12.
Can transform the string matching problem to a “prefix matching problem” over all the suffixes.
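The observation can be checked directly by testing every suffix for the pattern as a prefix. In the sketch below the pattern "is" is an assumption: it is one pattern consistent with the positions 3, 6, 12 listed for the example text.

```python
def occurrences(text, pattern):
    """1-based positions i such that pattern is a prefix of text[i..n]."""
    return [i + 1 for i in range(len(text)) if text[i:].startswith(pattern)]

print(occurrences("This is a visual example", "is"))
```

This brute-force scan only illustrates the definition; the indexes that follow avoid looking at every suffix.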
Suffix Array [Manber-Myers, 1990]
Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order.
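T = mississippi#

| i | SA[i] (suffix pointer) | Suffix in SUF(T) |
|---|---|---|
| 1 | 12 | # |
| 2 | 11 | i# |
| 3 | 8 | ippi# |
| 4 | 5 | issippi# |
| 5 | 2 | ississippi# |
| 6 | 1 | mississippi# |
| 7 | 10 | pi# |
| 8 | 9 | ppi# |
| 9 | 7 | sippi# |
| 10 | 4 | sissippi# |
| 11 | 6 | ssippi# |
| 12 | 3 | ssissippi# |

Storing SUF(T) explicitly takes Θ(N^2) space.
Suffix Array
• SA: array of ints, 4N bytes
• Text T: N bytes
• 5N bytes of space occupancy in total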
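The suffix array for the example can be reproduced by sorting suffix start positions by the suffixes they point to. This is a naive sketch for small inputs only (the sort compares whole suffixes), and it relies on '#' sorting before the letters, which holds for ASCII.

```python
def build_suffix_array(text):
    """1-based start positions of all suffixes, in lexicographic order."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]

print(build_suffix_array("mississippi#"))
```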
Two key properties [Manber-Myers, 1990]
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Starting position is the lexicographic position of P.
Example: T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, P = si. The suffixes with prefix si are sippi# and sissippi# (SA entries 7 and 4), which are adjacent in SUF(T).
Searching in a Suffix Array [Manber-Myers, 1990]
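Indirected binary search on SA: O(p log_2 N) time
• 2 accesses per binary step (the SA entry, then the text it points to).
Example: T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, P = si
• P is larger than the probed suffix, so the search continues to the right.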
• In a later step of the same search (P = si), P is smaller than the probed suffix, so the search continues to the left.
Listing the occurrences [Manber-Myers, 1990]
Brute-force comparison: O(p × occ) time
Example: T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, P = si, occ = 2. Scanning SA from the position found by the binary search:
• sippi#: P is a prefix
• sissippi#: P is a prefix
• ssippi#: P is not a prefix, stop
Suffix Array search
• O(p (log_2 N + occ)) time
• O(log_2 N + occ) in practice
External memory: simple disk paging for SA
• Assume pattern is at most 1 disk block.
• O(log_2 N + occ) I/Os
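The search and listing steps can be put together in a short in-memory sketch. It assumes SA holds 1-based positions as on the slides; the function name is illustrative.

```python
def suffix_array_search(text, sa, pattern):
    """Sorted 1-based positions where pattern occurs in text."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                            # indirected binary search
        mid = (lo + hi) // 2
        start = sa[mid] - 1                   # access 1: the SA entry
        if text[start:start + p] < pattern:   # access 2: the text
            lo = mid + 1                      # P is larger, go right
        else:
            hi = mid                          # P is smaller or equal, go left
    hits = []
    # Brute-force listing: matching suffixes are contiguous in SA.
    while lo < len(sa) and text[sa[lo] - 1:sa[lo] - 1 + p] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(suffix_array_search(T, SA, "si"))
```

Each comparison looks at up to p characters, which gives the O(p log_2 N) search bound and the O(p × occ) listing bound.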
Problem session
• Draw the suffix array for the string PAPAYAS.
• How will a search for the string PAY proceed?
Suffix arrays and updates
• Suppose there is a (small) change in the text. The suffix array must be updated.
• Problem: Most of the suffixes are likely to have changed.
• Possible solution:
  – Bound the length of comparison by k.
  – Suffixes that are equal in the first k characters may occur in any order.
  – Use a B-tree to store suffix order.
• Problem session:
  – Argue that after an update we need to move at most k suffixes in the tree.
  – Give an example where this leads to more expensive queries for string length > k.
Summary: suffix array
[Manber-Myers, 1990]
• Space: O(N)
• String matching: O(p log N + occ), can be improved to O(p + log N + occ).
• Can be constructed in O(N log N) time.
• Recently: Simple O(N) time algorithms
  – My favorite: Kärkkäinen-Sanders, 2003.
  – Also works well on external memory.
• Static.
  – Can be adapted to be dynamic, but then performance guarantees are worse.
Tries
• Trie (name from the word “retrieval”): a data structure for storing a set of strings
  – Let’s assume that all strings end with “$” (not in Σ)
Set of strings: {bear, bid, bulk, bull, sun, sunday}
[Figure: trie for this set, one character per edge, each stored string ending at a $ leaf.]
Tries
• Properties of a trie:
  – A multi-way tree.
  – Each non-leaf node has from 1 to |Σ|+1 children.
  – Each edge of the tree is labeled with a character.
  – Each leaf node corresponds to a stored string, which is the concatenation of the characters on the path from the root to this node.
Analysis of the Trie
Given k strings of total length N:
• Size:
  – O(N) in the worst case
• Search, insertion, and deletion (string of length m):
  – O(m) (assuming |Σ| is constant)
• Observation:
  – Having chains of one-child nodes is wasteful
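A trie with the "$" terminator convention can be sketched with nested dictionaries. This is a minimal illustration (insert and lookup only, no deletion), and the class name is made up for the example.

```python
class Trie:
    """Uncompacted trie; every stored string is terminated by '$'."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word + "$":
            node = node.setdefault(ch, {})   # one edge per character

    def contains(self, word):
        node = self.root
        for ch in word + "$":
            if ch not in node:
                return False
            node = node[ch]
        return True

trie = Trie()
for w in ["bear", "bid", "bulk", "bull", "sun", "sunday"]:
    trie.insert(w)
print(trie.contains("sun"), trie.contains("su"))
```

Both operations follow one edge per character, which is the O(m) bound for a string of length m.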
Compact Tries
• Compact Trie:
  – Replace a chain of one-child nodes with an edge labeled with a string
  – Each non-leaf node (except root) has at least two children
[Figure: the trie for {bear, bid, bulk, bull, sun, sunday} next to its compact version. Compact trie edges: root → b, sun; b → ear$, id$, ul; ul → k$, l$; sun → day$, $.]
Compact Tries
• Implementation:
  – Strings are stored outside the structure in one array; edges are labeled with indices into the array (from, to)
  – Improves the space to O(k)
T = bear$bid$bulk$bull$sun$sunday$ (positions 1 to 30)
Edge labels as (from, to): b = (1,1), sun = (20,22), ear$ = (2,5), id$ = (7,9), ul = (11,12), k$ = (13,14), l$ = (18,19), day$ = (27,30), $ = (5,5)
Patricia Tries
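• Patricia trie:
  – a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)
T = bear$bid$bulk$bull$sun$sunday$ (positions 1 to 30)

| Edge string | (from, to) | Patricia label |
|---|---|---|
| b | (1,1) | (b,1) |
| sun | (20,22) | (s,3) |
| ear$ | (2,5) | (e,4) |
| id$ | (7,9) | (i,3) |
| ul | (11,12) | (u,2) |
| k$ | (13,14) | (k,2) |
| l$ | (18,19) | (l,2) |
| day$ | (27,30) | (d,4) |
| $ | (5,5) | ($,1) |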
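The relabeling from (from, to) to (T[from], to – from + 1) can be checked on the example text. The helper names below are illustrative; positions are 1-based as on the slides.

```python
T = "bear$bid$bulk$bull$sun$sunday$"

def edge_string(text, frm, to):
    """The substring text[frm..to], with 1-based inclusive positions."""
    return text[frm - 1:to]

def patricia_label(text, frm, to):
    """Keep only the first character of the edge and the edge length."""
    return (text[frm - 1], to - frm + 1)

for frm, to in [(1, 1), (20, 22), (2, 5), (27, 30), (5, 5)]:
    print(edge_string(T, frm, to), patricia_label(T, frm, to))
```

The Patricia label no longer identifies the full edge string, which is why a search in a Patricia trie has to verify its candidate against the stored text.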
Suffix Trees [McCreight, 1976]
• Suffix tree – a compact trie (or similar structure) of all suffixes of the text
  – Patricia trie of suffixes is sometimes called a Pat tree
Example: T = carrara$ (positions 1 to 8)
[Figure: suffix tree of carrara$ with string edge labels (e.g. carrara$, rara$, ra$, a$), and the corresponding Pat tree with (first character, length) edge labels such as (c,8), (r,5), (a,2); leaves hold suffix start positions.]
Search in suffix trees
T = abababbc# (positions 1 to 9), P = ba
[Figure: suffix tree of T; the search for P follows a path from the root and ends above the leaves 2 and 4, the occurrences of P.]
• Search is a path traversal: O(p) time, and O(occ) time to list the occurrences
• O(N) space
What about ST in external memory?
– Unbalanced tree topology
– Updates
– Large space ~ 15N
– Packing?!
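Search as a path traversal can be illustrated on the same example. The sketch below deliberately swaps in a simpler structure: an uncompacted suffix trie that stores, at every node, the start positions of the suffixes passing through it. That takes O(N^2) space, unlike a real suffix tree, but the search follows the same root-to-node path.

```python
def build_suffix_trie(text):
    """Uncompacted trie of all suffixes (O(N^2) space, illustration only)."""
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []})
            node["starts"].append(i + 1)     # 1-based suffix start
    return root

def search(root, pattern):
    """Follow pattern from the root; the node reached lists the occurrences."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]

trie = build_suffix_trie("abababbc#")
print(search(trie, "ba"))
```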
Problem session
• When searching in a suffix tree, why is it not a problem that some nodes may have large degree?
• Draw the suffix tree for the string PAPAYAS.
• How will a search for the string PAY proceed?
Summary: suffix tree
For a string of length n:
• Space: O(n).
• String matching queries: O(p + occ).
• Can be constructed in O(n) time.
• Static
  – Again, bounding the length of comparison can be done to allow reasonably fast updates.
• Question: What are the pros and cons of suffix trees versus suffix arrays?
Text indexing in Oracle
Main index types: CONTEXT and CTXCAT.
create index myindex on mytable(attr) indextype is ctxsys.context;
create index myindex on mytable(attr) indextype is ctxsys.ctxcat;
(myindex is a placeholder index name.)
To use, must use special string query language (not just a LIKE operator). (Why?)
Text indexing in Oracle
• Unlike ordinary indexes, CONTEXT indexes are NOT updated as content is changed. A “refresh interval” can be specified.
• This means they are only useful when the tables are largely read-only and/or “the end-users don’t mind not having 100% search recall”.
• CTXCAT is transactional (i.e. kept updated) - but indexes each row separately.
• Unfortunately, I was not able to find information about the implementation.
Text indexing in Oracle
From the documentation:
• CTXCAT Indexes - A CTXCAT index is best for smaller text fragments that must be indexed along with other standard relational data (VARCHAR2).
  WHERE CATSEARCH(text_column, 'ipod') > 0;
• CONTEXT Indexes - The CONTEXT index type is used to index large amounts of text such as Word, PDF, XML, HTML or plain text documents.
  WHERE CONTAINS(test_column, 'ipod', 1) > 0
The String B-tree [Ferragina-Grossi, 95]
The prologue
We are left with many open issues:
– Suffix array: updates
– Suffix tree: difficult packing and Ω(p) I/Os
B-tree is ubiquitous in large-scale applications:
– Atomic keys: integers, reals, ...
– Prefix B-tree: bounded length keys (≤ 255 chars)
Suffix trees + B-trees? String B-tree [Ferragina-Grossi, 95]
– Index unbounded length keys
– Good worst-case I/O-bounds in search and update
Some considerations
Strings have arbitrary length:
– Disk page cannot ensure the storage of Θ(B) strings
– M may be unable to store even one single string
String storage:
– Pointers allow fitting Θ(B) strings per disk page
– String comparison needs disk access and may be expensive
String pointers organization seen so far:
• Suffix array: simple but static and not optimal
• Patricia trie: sophisticated and very efficient (optimal?)
Recall the problem:
– Τ is a string collection
– Search(P[1,p]): retrieve all occurrences of P in Τ’s strings
– Update(S[1,s]): insert or delete a text S from Τ
1st step: B-tree on string pointers
Strings on disk (positions 1 to 30): AATCAGCGAATGCTGCTT CTGTTGATGA
B-tree over the suffix pointers, in lexicographic order of the suffixes:
• Leaf level: 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
• Next level: 29 2 26 13 20 25 6 18 3 14 21 23
• Top level: 29 13 20 18 3 23
Example query: P = AT
• O((p/B) log_2 B) I/Os per node, O(log_B N) levels
Search(P)
• O((p/B) log_2 N) I/Os
• O(occ/B) I/Os
It is dynamic!! Insertion of a string S: O(s (s/B) log_2 N) I/Os
2nd step: The Patricia trie
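Strings on disk (numbered 1 to 7):
1. AGAAGA
2. AGAAGG
3. AGAC
4. GCGCAGA
5. GCGCAGG
6. GCGCGGA
7. GCGCGGGA
[Figure: compact trie over these strings. Each edge is labeled (string; from, to), pointing at a substring on disk. Labels shown: (1; 1,3), (4; 1,4), (6; 5,6), (5; 5,6), (2; 1,2), (3; 4,4), (1; 6,6), (2; 6,6), (4; 7,7), (5; 7,7), (7; 7,8), (6; 7,7).]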
2nd step: The Patricia trie
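Strings on disk (numbered 1 to 7): AGAAGA, AGAAGG, AGAC, GCGCAGA, GCGCAGG, GCGCGGA, GCGCGGGA
[Figure: Patricia trie over these strings; nodes store string depths, edges only their first character.]
Space of PT: O(k), not O(N)
Two-phase search, example P = GCACGCAC:
• First phase: no string access. The trie is traversed comparing only the branching characters, which leads to one candidate leaf.
• Second phase: O(p/B) I/Os. P is compared with the candidate string; the LCP with P and the mismatch determine P’s position.
Just one string is checked!!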
Alternative 2nd step: The Patricia trie
Strings on disk (numbered 1 to 7): AGAAGA, AGAAGG, AGAC, GCGCAGA, GCGCAGG, GCGCGGA, GCGCGGGA
[Figure: the same Patricia trie, searched with P = GCACGCAC; a signature mismatch at a node determines P’s position.]
Store fingerprint for each node in Patricia trie. Can be used to find a longest prefix match.
No string checked on disk!
3rd step: B-tree + Patricia tree
Strings on disk (positions 1 to 30): AATCAGCGAATGCTGCTT CTGTTGATGA
[Figure: the B-tree on suffix pointers from the 1st step, with a Patricia trie (PT) inside every B-tree node.]
Example query: P = AT
• O(p/B) I/Os per node, O(log_B N) levels
Search(P)
• O((p/B) log_B N) I/Os
• O(occ/B) I/Os
Insert(S)
• O(s (s/B) log_B N) I/Os
4th step: Incremental Search
Search(P)
• O(log_B N) I/Os just to go to the leaf level
[Figure: Patricia tries (PT) at level i, at level i+1 (adjacent nodes) and at the leaf level; Max_lcp(i) marks the part of P matched at level i.]
An issue only if the pattern is more than 1 disk block. We will not go into the details.
4th step: Incremental Search
Inductive step, from the PT at level i to the PT at level i+1:
• Skip the first Max_lcp(i) characters of P and continue from there up to Max_lcp(i+1). No rescanning.
• i-th step: O((lcp_{i+1} – lcp_i)/B + 1) I/Os
Search(P)
• O(p/B + log_B N) I/Os
• O(occ/B) I/Os
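The overall search bound follows from summing the per-level cost over the levels of the tree. As a sketch, writing h = O(log_B N) for the number of levels (a symbol introduced here) and using that no lcp value exceeds p, the sum telescopes:

$$\sum_{i=1}^{h} O\!\left(\frac{\mathrm{lcp}_{i+1} - \mathrm{lcp}_i}{B} + 1\right) = O\!\left(\frac{\mathrm{lcp}_{h+1} - \mathrm{lcp}_1}{B} + h\right) = O\!\left(\frac{p}{B} + \log_B N\right)$$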
Summary: String B-tree
String B-tree performance [Ferragina-Grossi, 95]:
– Search(P) takes O(p/B + log_B N + occ/B) I/Os
– Update(S) takes O(s log_B N) I/Os (more efficient buffered variant is possible)
– Space is Θ(N/B) disk pages
Using the String B-tree in internal memory:
– Search(P) takes O(p + log_2 N + occ) time
– Update(S) takes O(s log_2 N) time
– Space is Θ(N) bytes
It is a sort of dynamic suffix array.
Crux: Search time independent of the length of the strings stored.