Lecture 4: Indexing and Searching I


Upload: priyankaprakasan

Post on 07-Apr-2018


8/4/2019 Lecture4- Indexing and Searching I

http://slidepdf.com/reader/full/lecture4-indexing-and-searching-i 1/56

Indexing and Searching

The main techniques


Introduction

There are two ways to search a text.

• First: scan the text sequentially (online searching). This is appropriate when:

 – the text is small (i.e., a few megabytes),

 – the text collection is very volatile (i.e., it undergoes modifications very frequently), or

 – the index space overhead cannot be afforded.
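As a minimal sketch of online searching, the scan below walks the text from left to right with Python's built-in `str.find` and needs no index at all. The function name `scan` and the sample sentence are illustrative assumptions, not part of the lecture.

```python
from typing import List


def scan(text: str, pattern: str) -> List[int]:
    """Return the 0-based start position of every occurrence of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are found too.
        start = text.find(pattern, start + 1)
    return positions


print(scan("to be or not to be", "be"))  # [3, 16]
```

Every query rereads the whole text, which is acceptable only under the conditions listed above.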


Introduction

• Second: build data structures over the text (called indices).

 – This speeds up the search.

 – It is worthwhile when the text collection is large and semi-static.

 – Most real databases are like this, e.g., dictionaries, Web search engines, journal archives.

Semi-static collections are collections that can be updated at reasonably regular intervals.
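To illustrate the trade-off, here is a small sketch of an index built once over the text and then consulted for each query. The word-to-offsets layout and the name `build_index` are assumptions chosen for illustration; the lecture's own index structures come later.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(text: str) -> Dict[str, List[int]]:
    """Map each word to the 0-based character offsets where it starts."""
    index = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:
            index[word].append(offset)
        offset += len(word) + 1  # +1 for the separating space
    return dict(index)


index = build_index("to be or not to be")
print(index["be"])  # [3, 16]
```

The index costs extra space and must be rebuilt or updated when the text changes, which is why it pays off mainly for large, semi-static collections.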


Introduction

• Nowadays, the most successful techniques for medium-size databases (say, up to 200 MB) combine online and indexed searching.


Introduction

• We cover two main indexing techniques

 – Inverted files

 – Suffix arrays


Introduction

• Before covering these topics, you should be familiar with:

 – Sorted arrays

 – Binary search trees

 – B-trees

 – Hash tables

 – Tries


Introduction

• Sorted arrays

 – An array whose items are kept sorted,

 – so a search can use binary search, which needs O(log n) comparisons instead of the O(n) of a linear scan.
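A short sketch of why sorting helps: with the items in order, the standard-library `bisect_left` locates a key by repeated halving. The helper name `contains` and the sample keys are assumptions for illustration.

```python
from bisect import bisect_left
from typing import List


def contains(sorted_items: List[int], target: int) -> bool:
    """Binary search: True if target is present in the sorted list."""
    i = bisect_left(sorted_items, target)  # leftmost position where target could go
    return i < len(sorted_items) and sorted_items[i] == target


items = [3, 8, 15, 21, 33, 42]
print(contains(items, 21))  # True
print(contains(items, 22))  # False
```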


Introduction

• Binary search trees

 – A binary tree in which each internal node x stores an element.

 – The elements stored in the left subtree of x are <= x, and the elements stored in the right subtree of x are >= x.

 – Both the left and right subtrees must also be binary search trees.
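The ordering property can be sketched as follows. This is a minimal, unbalanced binary search tree; the class and function names are illustrative assumptions, and sending duplicates to the left subtree is one of the choices the <= / >= rule allows.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key <= root.key:          # duplicates go left (an assumption)
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def search(root: Optional[Node], key: int) -> bool:
    """Walk down from the root, going left or right by comparison."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root: Optional[Node]) -> List[int]:
    """Left subtree, node, right subtree: yields the keys in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
print(in_order(root))    # [20, 30, 40, 50, 60, 70, 80]
print(search(root, 60))  # True
```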


Binary Tree

Each node has at most 2 children.

Binary Search Tree

[Figures: examples of binary search trees]

Introduction

• B-trees

 – A B-tree is a specialized multiway tree designed especially for use on disk.

 – It is used when part or all of the tree must be maintained in secondary storage, such as a magnetic disk.

 – It is an indexing technique very commonly used in databases and file systems.


Introduction

• B-trees

 – A multiway tree of order m is an ordered tree where each node has at most m children.

 – [Figure: a multiway search tree of order 4]

Introduction

• B-trees (contd.)

 – Pointers to data are placed in a balanced tree structure, so that any item can be reached in roughly equal time (all leaves are at the same depth).

 – Data in a B-tree is kept sorted, so that searching, inserting, and deleting can be done in logarithmic amortized time.

 – A B-tree tries to minimize the number of disk accesses.
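A sketch of B-tree search under stated assumptions: the small tree below is hand-built and hypothetical (it is not the tree from the lecture figures), and each node visited stands in for one disk access. Only search is shown; insertion with node splitting is omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BTreeNode:
    keys: List[int]                                               # sorted keys in this node
    children: List["BTreeNode"] = field(default_factory=list)     # empty for a leaf


def btree_search(node: BTreeNode, key: int) -> Tuple[bool, int]:
    """Return (found, nodes_visited); each visit models one disk access."""
    visited = 0
    while True:
        visited += 1
        i = bisect_left(node.keys, key)      # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:                # reached a leaf without a match
            return False, visited
        node = node.children[i]              # descend into the i-th subtree


# Hypothetical two-level tree; all leaves are at the same depth.
root = BTreeNode(
    keys=[20, 40],
    children=[
        BTreeNode([5, 10, 15]),
        BTreeNode([21, 25, 30]),
        BTreeNode([45, 50]),
    ],
)
print(btree_search(root, 21))  # (True, 2)
print(btree_search(root, 22))  # (False, 2)
```

Because each node holds many keys, the tree stays shallow, so few nodes (and hence few disk blocks) are touched per search.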

[Figures: B-tree examples]

[Figure: searching a B-tree for key 21]

[Figures: inserting key 33 into a B-tree, with a node split]

Introduction

• Hash table

 – A data structure that uses a hash function to efficiently map certain identifiers or keys (e.g., person names) to associated values (e.g., their telephone numbers).

 – The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

 – E.g.: the division method

Introduction

• Hash table (division method)

 – Keys: 123456, 123467, 123450

 – 123456 % 10 = 6 (the remainder is 6 when dividing by 10)

 – 123467 % 10 = 7 (the remainder is 7)

 – 123450 % 10 = 0 (the remainder is 0)
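The arithmetic above can be reproduced with a few lines of code. The function name `division_hash` is illustrative, and storing keys in per-slot lists (chaining) is an assumption; the slides do not say how collisions are handled.

```python
def division_hash(key: int, table_size: int = 10) -> int:
    """Division method: the slot is the remainder of key / table_size."""
    return key % table_size


# One list per slot, so colliding keys can share a slot (assumed chaining).
table = [[] for _ in range(10)]
for key in (123456, 123467, 123450):
    table[division_hash(key)].append(key)

print(division_hash(123456))  # 6
print(division_hash(123467))  # 7
print(division_hash(123450))  # 0
```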


Tries

• A trie is an ordered tree data structure that is used to store an associative array whose keys are usually strings.

• It can be used to do a fast search in a large text.

• The term trie comes from the word "retrieval".

• It is used to implement the dictionary abstract data type (ADT), where basic operations like search, insert, and delete can be performed.


Tries

• They can be used for encoding and compression.

• They can be used in regular expression search and approximate string matching.


Non Compact and Compact Tries

• A non-compact trie is one in which every edge of the underlying tree represents a symbol of the alphabet.

• Let's construct the trie from the following 5 strings: BIG, BIGGER, BILL, GOOD, GOSH.

[Figure: non-compact trie for BIG, BIGGER, BILL, GOOD, GOSH]

Non Compact Tries

• When we look for the string GOOD, we start at the root and follow the G, O, O, D edges.

• If we want to look for the string BAD, we start from the root, follow the B edge, and find that there is no A edge after it. Thus BAD is not in the text.

• The above structure is rather wasteful because each edge represents a single symbol.

• It is not practical for huge texts.
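The lookups for GOOD and BAD can be sketched with nested dictionaries, one dictionary per node and one key per outgoing edge. The end-of-word marker `END` and the function names are assumptions made for this illustration.

```python
from typing import Dict, Iterable

END = "$"  # hypothetical end-of-word marker, assumed not to occur in the words


def build_trie(words: Iterable[str]) -> Dict:
    """Non-compact trie: every edge carries exactly one symbol."""
    root: Dict = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})  # follow or create the edge
        node[END] = {}                          # mark that a word ends here
    return root


def contains(trie: Dict, word: str) -> bool:
    """Follow one edge per symbol; fail as soon as an edge is missing."""
    node = trie
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return END in node


trie = build_trie(["BIG", "BIGGER", "BILL", "GOOD", "GOSH"])
print(contains(trie, "GOOD"))  # True
print(contains(trie, "BAD"))   # False: there is no A edge after B
```

The marker is what lets BIG be stored even though it is a prefix of BIGGER.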


Compact Tries

• This type of trie resembles the non-compact trie above, except that chains which lead to leaves are trimmed.

[Figure: the compact form of the trie]

Compact Tries

• The number of leaves is n+1, where n is the number of input strings.

• In the leaves, we may store either the strings themselves or pointers to the strings (that is, integers).


Tries called "PATRICIA"

• "PATRICIA" stands for "practical algorithm to retrieve information coded in alphanumeric".

• The difference is that an edge can be labeled with more than one character.

• All the unary nodes will be collapsed.
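Collapsing unary nodes can be sketched as a pass over a non-compact trie that merges each chain of single-child nodes into one multi-character edge label. This is a simplified radix-tree style illustration, not a full PATRICIA implementation; the `END` marker and function names are assumptions.

```python
from typing import Dict, Iterable

END = "$"  # hypothetical end-of-word marker


def build_trie(words: Iterable[str]) -> Dict:
    """Non-compact trie with one symbol per edge; leaves are empty dicts."""
    root: Dict = {}
    for word in words:
        node = root
        for symbol in word + END:
            node = node.setdefault(symbol, {})
    return root


def collapse(node: Dict) -> Dict:
    """Return a copy in which every chain of unary nodes becomes one edge."""
    compact: Dict = {}
    for label, child in node.items():
        # While the child has exactly one outgoing edge, absorb it into the label.
        while len(child) == 1:
            ((next_label, next_child),) = child.items()
            label += next_label
            child = next_child
        compact[label] = collapse(child)
    return compact


compact = collapse(build_trie(["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]))
print(sorted(compact))        # ['BI', 'GO']
print(sorted(compact["GO"]))  # ['OD$', 'SH$']
```

After collapsing, every internal node below the root branches at least two ways, so the tree has far fewer nodes than the one-symbol-per-edge version.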

[Figure: the very compact (PATRICIA) trie]

Tries called "PATRICIA"

• Binary PATRICIA tries use only 2 symbols (a binary alphabet) to label edges.


Suffix Tree

• The suffix tree T(x) of a string x[1..n] is the compacted trie of all suffixes x[i..n] for i = 1, ..., n+1, i.e., including the empty suffix.

• It allows for a particularly fast implementation of many important string operations.

• The suffix tree for a string S is a tree (more specifically, a trie) whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf.


Suffix Tree

• The idea behind the suffix tree is to assign to each symbol in a text an index corresponding to its position in the text.

 – I.e., the first symbol has index 1 and the last symbol has index n, where n is the number of symbols in the text.

• In the tree we use indices instead of the actual object.


Suffix tree

• The advantages are:

 – It requires less storage space.

 – We do not have to worry about how the text is represented (binary, ASCII, etc.).

 – We do not have to store the same object twice (no duplicates).


Suffix trie

• We begin by giving a position to every suffix in the text. We can now build a SUFFIX trie for all n suffixes of the text.

• E.g.

 – TEXT:     G O O G O L $

 – POSITION: 1 2 3 4 5 6 7


Suffix trie

The resulting tree has n leaves and height n
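The claim about leaves and height can be checked with a small sketch that builds the suffix trie of GOOGOL$ as nested dictionaries. The function names are illustrative assumptions; the sketch relies on the terminator $ being unique, so that no suffix is a prefix of another.

```python
from typing import Dict


def build_suffix_trie(text: str) -> Dict:
    """Insert every suffix of text, one symbol per edge."""
    root: Dict = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node: Dict) -> int:
    """A leaf is a node with no outgoing edges."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def height(node: Dict) -> int:
    """Number of edges on the longest root-to-leaf path."""
    if not node:
        return 0
    return 1 + max(height(child) for child in node.values())


trie = build_suffix_trie("GOOGOL$")
print(count_leaves(trie))  # 7
print(height(trie))        # 7
```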


Suffix tree

• The suffix tree is created by TRIMMING (compacting + collapsing every unary node) the suffix TRIE.

[Figure: a compact suffix tree]

Suffix tree

• In a suffix tree we can store pointers rather than words in the leaves.

• Also, we can replace every string by a pair of indices (a,b), where a is the index of the beginning of the string and b is the index of the end of the string.

• I.e., we write

 – (3,7) for OGOL$

 – (1,2) for GO

 – (7,7) for $
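A short sketch of how an index pair maps back to its string, using the text GOOGOL$ with 1-based, inclusive positions as in the slides. The helper name `label` is an assumption.

```python
TEXT = "GOOGOL$"


def label(a: int, b: int, text: str = TEXT) -> str:
    """Return the substring for the 1-based, inclusive index pair (a, b)."""
    return text[a - 1:b]


print(label(3, 7))  # OGOL$
print(label(1, 2))  # GO
print(label(7, 7))  # $
```

Each edge label then costs two integers regardless of how long the string it stands for is.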


Suffix tree

[Figure: the corresponding suffix tree]

Search in suffix tree

• Pseudo-code for searching for a string S in a suffix tree:

 – Start at the root.

 – Go down the tree, each time taking the bifurcation that corresponds to the next symbols of S.

 – If S corresponds to a node, then return all leaves in its subtree.

 – If S encounters a NIL pointer, then S is not in the tree.


Search in suffix tree

• If S = "GO", we take the GO bifurcation and return: GOOGOL$, GOL$.

• If S = "OR", we take the O bifurcation and then hit a NIL pointer, so "OR" is not in the tree.
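The search procedure can be sketched on the text GOOGOL$. For brevity this uses an uncompacted suffix trie (one symbol per edge) instead of a true suffix tree; the descent and the "return all leaves in the subtree" step are the same idea. Function names are assumptions.

```python
from typing import Dict, List


def build_suffix_trie(text: str) -> Dict:
    """Insert every suffix of text, one symbol per edge."""
    root: Dict = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def leaves_below(node: Dict, prefix: str) -> List[str]:
    """Collect the full suffix spelled out by each leaf under node."""
    if not node:
        return [prefix]
    found: List[str] = []
    for symbol, child in node.items():
        found.extend(leaves_below(child, prefix + symbol))
    return found


def search(trie: Dict, s: str) -> List[str]:
    """Descend along s; a missing edge plays the role of the NIL pointer."""
    node = trie
    for symbol in s:
        if symbol not in node:
            return []          # S is not in the tree
        node = node[symbol]
    return leaves_below(node, s)


trie = build_suffix_trie("GOOGOL$")
print(search(trie, "GO"))  # ['GOOGOL$', 'GOL$']
print(search(trie, "OR"))  # []
```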


Applications of suffix tree

• Exact matching

• Common substrings, with applications

• Matching statistics

• Suffix arrays

• Genome-scale projects


Exact Matching

• Given string x and pattern y, report where y occurs in x 

• Pattern ata occurs at position 2 in tatat


Exact Matching

• Given string x and pattern y, report where y occurs in x 

• Pattern tatt does not occur in tatat
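Both examples can be reproduced with a small sketch. Instead of a suffix tree it uses a sorted list of suffixes (a naive suffix array) with binary search, which is a swapped-in, simpler structure; positions are 1-based as in the slides, and the function name `occurrences` is an assumption.

```python
from bisect import bisect_left
from typing import List


def occurrences(x: str, y: str) -> List[int]:
    """Return the 1-based positions where pattern y occurs in string x."""
    # Sort all suffixes of x, remembering where each one starts.
    suffixes = sorted((x[i:], i + 1) for i in range(len(x)))
    keys = [suffix for suffix, _ in suffixes]
    # Suffixes that start with y form one contiguous run in sorted order.
    i = bisect_left(keys, y)
    positions = []
    while i < len(keys) and keys[i].startswith(y):
        positions.append(suffixes[i][1])
        i += 1
    return sorted(positions)


print(occurrences("tatat", "ata"))   # [2]
print(occurrences("tatat", "tatt"))  # []
```

Sorting whole suffixes like this is fine for short strings but far too slow and memory-hungry for large texts, where a proper suffix tree or suffix array construction is needed.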


Assumptions in indexing and searching

• We make the following assumptions.

 – We call n the size of the text database.

 – Whenever a pattern is searched, we assume that it is of length m, which is much smaller than n.

 – We call M the amount of main memory available.

 – The modifications which a text database undergoes are additions, deletions, and replacements of pieces of text of size n' < n.
