

Indexing 1 - 7, CMSC 476/676, SP2020

These notes are pulled from Victor Lavrenko’s IR videos in the included links. Copyright is by Victor Lavrenko.

*Indexing 1: What makes Google fast? Talk is here.
1. An inverted index is a sparse representation of a Term x Document matrix.
2. Posting lists are linked lists coming off of the hash bucket for that term. Each posting has the information for one document.
3. Documents are assigned numeric IDs and each postings list is sorted from lowest to highest ID. This facilitates merging the postings lists for all the different terms in a query, and finding the proximity of different words within a given document. (A sketch follows this list.)
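A minimal sketch of such an index in Python, assuming already-tokenized documents. A dict stands in for the hash table and sorted lists stand in for the linked postings lists; the data and names are illustrative, not from the videos.

    # Toy inverted index: term -> postings list of (doc_id, [positions]).
    def build_index(docs):
        """docs maps doc_id (int) to a list of tokens."""
        index = {}
        for doc_id in sorted(docs):              # keep postings in doc-ID order
            for pos, term in enumerate(docs[doc_id]):
                postings = index.setdefault(term, [])
                if postings and postings[-1][0] == doc_id:
                    postings[-1][1].append(pos)  # same document: add a position
                else:
                    postings.append((doc_id, [pos]))
        return index

    docs = {1: ["cat", "sat"], 2: ["cat", "ran"], 3: ["dog", "sat"]}
    index = build_index(docs)
    # index["cat"] == [(1, [0]), (2, [0])]  -- sparse: "cat" skips doc 3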

*Indexing 2: Inverted Index. Talk is here.
1. Heaps' law, v = k*n^b, says that the vocabulary size v keeps growing with the number of tokens n, so for large collections the vocabulary is very large. (A quick numeric check follows this list.)
2. Zipf's law says that roughly half of all distinct words occur only once.
3. In large collections you are still seeing new words after 30 million documents. These include spelling errors, personal names, product names, etc.
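A quick numeric check of Heaps' law. The constants are illustrative, not fitted: k roughly between 30 and 100 and b around 0.5 are the ranges typically reported for English text.

    # Heaps' law: vocabulary size v = k * n**b after seeing n tokens.
    k, b = 50, 0.5

    for n in (10**6, 10**8, 10**10):
        v = k * n ** b
        print(f"{n:>14,} tokens -> ~{int(v):,} distinct words")
    # 10,000x more text still multiplies the vocabulary by 100:
    # new words keep appearing no matter how much has been read.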

*Indexing 3: Sparseness and Linear Merge. Talk is here.
1. Zipf's law says that roughly half of all distinct words occur only once.
2. We don't want to store a huge matrix (on the order of 20 billion cells) that is mostly zeros.
3. Store the non-zero (term, document) tuples as a linked list per term instead, called a postings list.


4. Store each list in sorted order by document number so that you can do a linear merge. This provides a fast method for extracting matches across multiple lists: for a multi-word query, it finds the documents that contain all of the query terms. Place a pointer at the beginning of each list and compare the document numbers they point at. If they are the same document, record the match and advance both pointers; if not, advance the pointer at the smaller document number and compare again. Keep going until a list is exhausted. (Sketched in code below, after item 5.)

5. This works with both the Boolean and the tf.idf scoring functions. Depending on your tf.idf weight, you can ignore terms that are zero, just as with the Boolean scoring function. (That is, the individual terms in the dot product are non-zero only when the query term is non-zero.)

[Figure: Lavrenko's scoring function]

[Figure: basic cosine similarity]

With this simple tf.idf weighting scheme, if the query doesn't have term i, then that element of the sum is zero.
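A minimal sketch of the linear merge from item 4, assuming each postings list is simply a sorted Python list of document numbers:

    # Intersect two sorted postings lists in O(len(a) + len(b)) time.
    def linear_merge(a, b):
        i, j, matches = 0, 0, []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:           # same document: contains both terms
                matches.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:          # advance the pointer at the smaller ID
                i += 1
            else:
                j += 1
        return matches

    print(linear_merge([1, 3, 5, 8, 13], [2, 3, 8, 9]))  # [3, 8]

The two figures are the scoring formulas. The basic cosine similarity is the standard definition (written out here from the usual formula, not copied from the slides):

    \mathrm{cosine}(q, d) = \frac{\sum_i q_i\, d_i}{\sqrt{\sum_i q_i^2}\;\sqrt{\sum_i d_i^2}}

where q_i and d_i are the tf.idf weights of term i in the query and the document. Any i with q_i = 0 contributes nothing to the sum, which is why zero-weight terms can be skipped.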

*Indexing 4: Phrases and Proximity. Talk is here.
1. Can use linear merge to find words that occur close together within a document.
2. Can also index word pairs. In this video he uses n-grams to mean sequences of words, not characters. Beware!
3. A proximity index stores the positions of the words within each document. (A sketch follows this list.)
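A minimal sketch of a proximity check, assuming each term's posting for a document carries a sorted list of positions (illustrative data):

    # For each position of term A, find the first position of term B
    # within k words of it -- the linear-merge idea applied to positions.
    def within_k(pos_a, pos_b, k):
        pairs, j = [], 0
        for pa in pos_a:
            while j < len(pos_b) and pos_b[j] < pa - k:
                j += 1                  # skip B positions too far to the left
            if j < len(pos_b) and pos_b[j] <= pa + k:
                pairs.append((pa, pos_b[j]))
        return pairs

    print(within_k([4, 17, 40], [2, 19, 33], k=3))  # [(4, 2), (17, 19)]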

*Indexing 5: XML, Structure, and Metadata. Talk is here.
1. Can store structure and tags (annotations such as part of speech, translations, etc.).
2. One option is to push the structure into the index terms themselves, but it is better to use an extent index.
3. Extent index: introduce a special new term for each structure you want to index, e.g. the author term, hyperlinks, etc. Its postings record the span of positions each occurrence of the structure covers. Use linear merge again to match when a word's position falls into the extent. (A sketch follows this list.)
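A minimal sketch of the extent check, assuming the special term's postings store (start, end) position extents and the word's postings store plain positions (illustrative data and names):

    # Which positions of a word fall inside an author extent?
    # extents: sorted (start, end) ranges for the special "author" term;
    # positions: sorted positions of the query word in the same document.
    def positions_in_extents(positions, extents):
        hits, j = [], 0
        for p in positions:
            while j < len(extents) and extents[j][1] < p:
                j += 1                  # this extent ends before p: skip it
            if j < len(extents) and extents[j][0] <= p <= extents[j][1]:
                hits.append(p)
        return hits

    # Word at positions 3, 12, 30; author extents cover 2-5 and 28-33.
    print(positions_in_extents([3, 12, 30], [(2, 5), (28, 33)]))  # [3, 30]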

*Indexing 6: Delta Encoding (compression). Talk is here.
1. Inverted indices are big.
2. They take lots of disk space, and I/O to disk is slow.
3. Index compression reduces space and improves I/O time.
4. Basic ideas:
   a. Convert large numbers into small ones (delta encoding).
   b. Represent numbers with as few bits as possible.
5. Delta encoding: store the gap between consecutive document numbers instead of the numbers themselves. (A sketch follows these notes.)

Delta encoding works well for frequent words but not for infrequent words! Frequent words have dense postings lists, so the gaps between document numbers are small; rare words have sparse lists, so the gaps stay large.
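A minimal sketch of delta (gap) encoding of a postings list:

    # Store the gaps between consecutive doc IDs instead of the IDs.
    def delta_encode(doc_ids):
        gaps, prev = [], 0
        for d in doc_ids:
            gaps.append(d - prev)
            prev = d
        return gaps

    def delta_decode(gaps):
        doc_ids, total = [], 0
        for g in gaps:
            total += g
            doc_ids.append(total)
        return doc_ids

    postings = [33, 47, 154, 159, 202]
    print(delta_encode(postings))                # [33, 14, 107, 5, 43]
    print(delta_decode(delta_encode(postings)))  # [33, 47, 154, 159, 202]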

*Indexing 7: v-byte Encoding. Talk is here.
Represent numbers with as few bits as possible: use a variable number of bytes per value, so the small gaps produced by delta encoding fit in a single byte. (A sketch follows.)
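A minimal sketch of v-byte encoding. Seven bits of the number go in each byte; here the high bit marks the final byte of a number, which is one common convention:

    # v-byte: emit a number 7 bits at a time; set the high bit (0x80)
    # on the final byte so the decoder knows where each number ends.
    def vbyte_encode(n):
        out = []
        while True:
            out.insert(0, n & 0x7F)   # prepend the low 7 bits of n
            if n < 128:
                break
            n >>= 7
        out[-1] |= 0x80               # flag the last byte
        return bytes(out)

    def vbyte_decode(data):
        nums, n = [], 0
        for byte in data:
            n = (n << 7) | (byte & 0x7F)
            if byte & 0x80:           # last byte of this number
                nums.append(n)
                n = 0
        return nums

    gaps = [5, 130, 824]              # small gaps cost one byte each
    enc = b"".join(vbyte_encode(g) for g in gaps)
    print(list(enc))                  # [133, 1, 130, 6, 184]
    print(vbyte_decode(enc))          # [5, 130, 824]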

*Indexing 15: Index Construction. Talk is here.