IR Presentation

By Bushra Al-Za’areer introducing Signature File – Suffix Tree & Suffix Array

Upload: bushra-al-zaareer

Post on 25-Dec-2015


DESCRIPTION

chapter 9

TRANSCRIPT

Page 1: IR Presentation

By Bushra Al-Za’areer

introducing

Signature File – Suffix Tree & Suffix Array

Page 2: IR Presentation

Chapter 9 Indexing & Searching

Introducing: Signature File – Suffix Tree & Suffix Array

1. Signature File

2. Suffix Tree

3. Suffix Array

Page 3: IR Presentation

Signature File

Signature File – Suffix Tree & Suffix Array

1

Page 4: IR Presentation

Signature File chapter 9

• Consider:
• H(information) = 010001
• H(text) = 010010
• H(data) = 110000
• H(retrieval) = 100010

• The block signatures of a document D containing the text “textual retrieval and information retrieval” (after removing stop words and stemming), for a block size of two terms, would be:
o B1D = 110010
o B2D = 110011
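The block signatures above can be reproduced with a short sketch. It assumes each block signature is the bit-wise OR of the term signatures in that block, which is consistent with the values on the slide; the function name `block_signatures` and the table `H` are illustrative names, not part of the original material.

```python
# Term signatures copied from the slide (6-bit values).
H = {
    "information": 0b010001,
    "text": 0b010010,
    "data": 0b110000,
    "retrieval": 0b100010,
}


def block_signatures(terms, block_size=2):
    """OR together the signatures of each run of `block_size` terms."""
    signatures = []
    for start in range(0, len(terms), block_size):
        sig = 0
        for term in terms[start:start + block_size]:
            sig |= H[term]  # superimpose this term's bits on the block
        signatures.append(sig)
    return signatures


# "textual retrieval and information retrieval" after stop-word removal
# and stemming (assumed result: "textual" -> "text", "and" dropped).
terms = ["text", "retrieval", "information", "retrieval"]

for n, sig in enumerate(block_signatures(terms), start=1):
    print(f"B{n}D = {sig:06b}")
# prints:
# B1D = 110010
# B2D = 110011
```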

Page 5: IR Presentation

Signature File chapter 9

To search for a given term, we check whether the term’s bit string could be “inside” the block signatures:

• Consider we are searching for “text” in document D
o H(text) = 010010 and B1D = 110010
o H(text) bit-wise-AND B1D = 010010 = H(text)
o Therefore “text” could be in B1D (it is in this particular case)

• Consider we are now searching for “data”
o H(data) bit-wise-AND B1D = 110000 = H(data)
o H(data) bit-wise-AND B2D = 110000 = H(data)
o Though “data” is not in either block!

• Signature files may yield false hits …
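A minimal sketch of this test, using the slide's values, makes the false hits visible. The names `may_contain`, `search`, `BLOCKS` and `BLOCK_TERMS` are hypothetical; the actual block contents are listed only so the candidate answer can be compared with the truth.

```python
# Term and block signatures copied from the slides.
H = {
    "information": 0b010001,
    "text": 0b010010,
    "data": 0b110000,
    "retrieval": 0b100010,
}
BLOCKS = {"B1D": 0b110010, "B2D": 0b110011}

# What each block really contains, used only to expose false hits.
BLOCK_TERMS = {
    "B1D": {"text", "retrieval"},
    "B2D": {"information", "retrieval"},
}


def may_contain(block_sig, term_sig):
    """True when every bit of the term signature is set in the block."""
    return (block_sig & term_sig) == term_sig


def search(term):
    """Map each block name to (signature says maybe, term really there)."""
    return {
        name: (may_contain(sig, H[term]), term in BLOCK_TERMS[name])
        for name, sig in BLOCKS.items()
    }


for term in ("text", "data", "information"):
    for name, (candidate, actual) in search(term).items():
        note = "false hit" if candidate and not actual else ""
        print(f"{term:12} {name}  candidate={candidate!s:5} actual={actual!s:5} {note}")
```

A signature match only marks a block as a candidate, so any practical use of this test would still verify the candidate blocks against the text itself.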

Page 6: IR Presentation

Signature File chapter 9

How to keep the probability of false alarms low? How to predict how good a signature is?

o A false drop occurs when a document signature matches a query’s signature but the query’s word does not match any word in the document.

• The rate of false drops depends on:
o The size of the signature.
o The number of words per block.
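The dependence on both quantities can be illustrated with a rough estimate. This is not taken from the slides: it assumes every word sets `bits_per_word` bit positions chosen uniformly and independently at random, and all names below are illustrative. Under that assumption, a given bit stays 0 in a block with probability (1 - 1/sig_bits) raised to the number of bits written, and a false drop needs all of the query word's bits to be 1.

```python
def false_drop_estimate(sig_bits, words_per_block, bits_per_word):
    """Approximate chance that a block signature matches a word it lacks.

    Assumes uniformly random, independent bit positions (an idealisation).
    """
    bits_written = words_per_block * bits_per_word
    p_bit_still_zero = (1 - 1 / sig_bits) ** bits_written
    return (1 - p_bit_still_zero) ** bits_per_word


# Larger signatures lower the estimate; more words per block raise it.
for sig_bits in (64, 128, 256):
    for words_per_block in (4, 8, 16):
        estimate = false_drop_estimate(sig_bits, words_per_block, bits_per_word=4)
        print(f"bits={sig_bits:3}  words/block={words_per_block:2}  estimate={estimate:.4f}")
```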

Page 7: IR Presentation

Signature File chapter 9

• Inverted or Signature? Inverted Files:

1. Slower retrieval
2. More accurate
3. Easier to maintain

• In fact, inverted files are still the most popular storage for information retrieval.

Page 8: IR Presentation

2 Suffix Tree summary

Chapter 9

Page 9: IR Presentation

Signature File chapter 9

• Example:

Page 10: IR Presentation

3 Suffix Array summary

Chapter 9

Page 11: IR Presentation

Signature File chapter 9

• Suffix tree and suffix array indexes treat the text as one long string. Each position in the text is considered the start of a text suffix, so each suffix is uniquely identified by its position.

• Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable.

• This structure can be used to index words or characters.
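As a small sketch of the two choices of index points, the helpers below list either every character position or only the word beginnings. The function names are illustrative, and using the regular expression `\w+` to find word starts is an assumption, not something the slides specify.

```python
import re


def char_index_points(text):
    """Every character position is an index point."""
    return list(range(len(text)))


def word_index_points(text):
    """Only word beginnings are index points (words matched by \\w+)."""
    return [match.start() for match in re.finditer(r"\w+", text)]


text = "information retrieval"

# Each index point identifies the suffix that starts there.
for pos in word_index_points(text):
    print(pos, repr(text[pos:]))
# prints:
# 0 'information retrieval'
# 12 'retrieval'
```

Indexing words keeps the index small, while indexing characters also allows matches that begin in the middle of a word.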

Page 12: IR Presentation

Signature File chapter 9


Page 13: IR Presentation

Signature File chapter 9

• Suffix arrays provide essentially the same functionality as suffix trees with much lower space requirements.

• A suffix array is simply an array containing all the pointers to the text suffixes listed in lexicographical order.

• Suffix arrays are designed to allow binary search: at each step the query is compared with the text suffix that the pointer refers to.
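A minimal sketch of both ideas follows: the array is built by sorting suffix start positions, and a lookup runs two binary searches to find the range of suffixes that begin with the pattern. The construction here is deliberately naive (it compares whole suffixes while sorting), and the names `build_suffix_array` and `find` are illustrative.

```python
def build_suffix_array(text, index_points=None):
    """Return suffix start positions sorted by the suffix they point to."""
    if index_points is None:
        index_points = range(len(text))  # index every character
    return sorted(index_points, key=lambda i: text[i:])


def find(text, suffix_array, pattern):
    """Return sorted positions whose suffix starts with `pattern`."""
    m = len(pattern)

    # First binary search: leftmost suffix whose m-char prefix >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if text[pos:pos + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Second binary search: leftmost suffix whose m-char prefix > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if text[pos:pos + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "text retrieval and textual retrieval"
sa = build_suffix_array(text)

print(find(text, sa, "retrieval"))      # a word          -> [5, 27]
print(find(text, sa, "text"))           # a word prefix   -> [0, 19]
print(find(text, sa, "retrieval and"))  # a phrase        -> [5]
print(find(text, sa, "data"))           # not in the text -> []
```

Because every match is a prefix of some suffix, the same lookup serves words, prefixes and phrases without any change.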

Page 14: IR Presentation

Signature File chapter 9

• With suffix trees and suffix arrays we can search for
– Words
– Prefixes & suffixes
– Phrases.

Page 15: IR Presentation

Any questions? Ask me!

Chapter 9

Page 16: IR Presentation

Conclusion

The most popular storage for information retrieval: inverted files…

Page 17: IR Presentation

Thank You