1
CS 430: Information Discovery
Lecture 3: Inverted Files
2
Course Administration
• Course team:
Bill Arms [email protected]
Matt Schultz [email protected]
Doug Mitarotonda [email protected]
Trishul Patel [email protected]
• Email sent to:
will go to these four people
3
Course Administration
• Assignment 1 will be posted early next week. It is a programming assignment and is due on Friday, September 20 at 5 p.m.
• A textbook was left in the discussion class yesterday.
4
Inverted File -- Concept (Basic Definition)
Inverted file: a list of the words in a set of documents and the documents in which they appear.
Word      Documents
abacus    3  19  22
actor     2  19  29
aspen     5
atoll     11  34
Stop words are removed and stemming carried out before building the index.
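The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the stop list and the three documents are hypothetical, words are split on whitespace only, and stemming is left out to keep the example short.

```python
from collections import defaultdict

# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "has", "in"}

def build_inverted_file(documents):
    """Map each word to the sorted list of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:   # stop words are removed
                index[word].add(doc_id)  # stemming is omitted in this sketch
    return {word: sorted(doc_ids) for word, doc_ids in sorted(index.items())}

# Hypothetical documents, numbered 1 to 3.
documents = {
    1: "the abacus",
    2: "the actor has an abacus",
    3: "an actor in the aspen grove",
}
inverted_file = build_inverted_file(documents)
# inverted_file == {"abacus": [1, 2], "actor": [2, 3], "aspen": [3], "grove": [3]}
```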
5
Inverted List -- Concept
Inverted list: All the entries in an inverted file that apply to a specific word, e.g.
abacus 3 19 22
Posting: Entry in an inverted list, e.g., the postings for "abacus" are documents 3, 19, 22.
6
Keywords and Controlled Vocabulary
Keyword:
A term that is used to describe the subject matter in a document. It is sometimes called an index term.
Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.
Controlled vocabulary:
A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms.
Inverted file (more complete definition):
A list of the keywords that apply to a set of documents, the documents in which they appear and related information.
7
Enhancements to Inverted Files -- Concept
Location: The inverted file holds information about the location of each term within the document.
Uses
• adjacency and near operators
• user interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings for each term.
Uses
• term weighting
• query processing optimization
8
Inverted File -- Concept (Enhanced)
Word      Postings   Document   Location
abacus    4          3          94
                     19         7
                     19         212
                     22         56
actor     3          2          66
                     19         213
                     29         45
aspen     1          5          43
atoll     3          11         3
                     11         70
                     34         40
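One possible in-memory form of the enhanced inverted file is a dictionary from each word to a list of (document, location) pairs. The sketch below uses the values from the table; the names `enhanced_file`, `posting_count` and `documents_for` are hypothetical helpers introduced for illustration.

```python
# Each word maps to its postings, stored as (document, location) pairs
# sorted by document and then by location.
enhanced_file = {
    "abacus": [(3, 94), (19, 7), (19, 212), (22, 56)],
    "actor": [(2, 66), (19, 213), (29, 45)],
    "aspen": [(5, 43)],
    "atoll": [(11, 3), (11, 70), (34, 40)],
}

def posting_count(word):
    """Frequency: the number of postings held for a word."""
    return len(enhanced_file.get(word, []))

def documents_for(word):
    """The distinct documents in which a word appears, in sorted order."""
    return sorted({doc for doc, _ in enhanced_file.get(word, [])})

# posting_count("abacus") is 4; documents_for("atoll") is [11, 34]
```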
9
Example: Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g.,
and or not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
11
Evaluating a Boolean Query
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
Example: abacus and actor
Postings for abacus: 3 19 22
Postings for actor: 2 19 29
Document 19 is the only document that contains both terms, "abacus" and "actor".
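The merge can be sketched as a single sequential pass over two sorted lists. This is an illustrative sketch that assumes both lists are sorted by document number and contain no duplicates; `intersect` is a hypothetical name.

```python
def intersect(list_a, list_b):
    """Merge two sorted postings lists, keeping documents present in both."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])  # document contains both terms
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1                    # advance the list with the smaller entry
        else:
            j += 1
    return result

abacus = [3, 19, 22]
actor = [2, 19, 29]
both = intersect(abacus, actor)  # [19]
```

Each list is read once from front to back, so the cost grows with the combined length of the two lists.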
12
Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other as in the string
"abacus actor"
abacus near 4 actor
Terms abacus and actor are near to each other as in the string
"the actor has an abacus"
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
13
Evaluating an Adjacency Operation
Example: abacus adj actor
Postings for abacus (document, location): (3, 94) (19, 7) (19, 212) (22, 56)
Postings for actor (document, location): (2, 66) (19, 213) (29, 45)
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
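A sketch of how location information supports these operators, using the postings above. Two assumptions are made here that the slides do not state: adj is treated as ordered (the first term immediately followed by the second), and near n is treated as unordered (the terms at most n word positions apart). The code favours clarity over a sequential merge.

```python
abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]

def adjacent(postings_a, postings_b):
    """(document, location) of term A where term B is the very next word."""
    following = set(postings_b)
    return [(doc, loc) for doc, loc in postings_a if (doc, loc + 1) in following]

def near(postings_a, postings_b, distance):
    """(document, loc_a, loc_b) where the terms are at most `distance` apart."""
    matches = []
    for doc_a, loc_a in postings_a:
        for doc_b, loc_b in postings_b:
            if doc_a == doc_b and 0 < abs(loc_a - loc_b) <= distance:
                matches.append((doc_a, loc_a, loc_b))
    return matches

adjacent_hits = adjacent(abacus, actor)  # [(19, 212)]
near_hits = near(abacus, actor, 4)       # [(19, 212, 213)]
```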
14
Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near high
and, not
or low
Example
A and B or C and B
is evaluated as
(A and B) or (C and B)
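The effect of precedence can be illustrated with a small evaluator that handles only and and or, without parentheses. Splitting the query on or first and on and second makes and bind more tightly. The postings for A, B and C are hypothetical values chosen so that the two possible groupings give different answers.

```python
def evaluate(query, postings):
    """Evaluate a query of terms joined by 'and' and 'or' (no parentheses)."""
    result = set()
    for conjunction in query.split(" or "):      # 'or' has the lowest precedence
        terms = conjunction.split(" and ")       # 'and' binds more tightly
        docs = set(postings[terms[0].strip()])
        for term in terms[1:]:
            docs &= set(postings[term.strip()])  # logical AND of postings
        result |= docs                           # logical OR of the groups
    return sorted(result)

# Hypothetical postings lists.
postings = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [4, 5]}
answer = evaluate("A and B or C and B", postings)  # [2, 3, 4]
```

With these postings, (A and B) or (C and B) gives documents 2, 3 and 4, whereas the grouping A and (B or C) and B would give only documents 2 and 3.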
15
Sizes of Inverted Files
Set   Records   Unique Terms
A     2,653     5,123
B     38,304    c. 25,000
Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term.
Set B has an average of 88 postings per record.
Examples from Harman and Candela, 1990
16
Representation of Inverted Files
Index (word list, vocabulary) file: Stores list of terms (keywords). Designed for rapid searching and processing range queries (lexicographic index). Often held in memory.
Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists. Each list may be stored sequentially.
Document file: Stores the documents. Important for user interface design.
[Repositories for the storage of document collections are covered in CS 502.]
17
Organization of Inverted Files
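[Figure: three files side by side: index file, postings file, documents file. The index file holds each term (ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its postings; the postings file holds the inverted lists.]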
18
Decisions in Building Inverted Files
• Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
• Whether to use a controlled vocabulary. If so, what words to include.
• List of stopwords.
• Rules to decide the beginning and end of words, e.g., spaces or punctuation.
• Character sequences not to be indexed, e.g., sequences of numbers.
19
Efficiency Criteria
Storage
Inverted files are big, typically 10% to 100% the size of the collection of documents.
Update performance
It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance
Retrieval must be fast enough to satisfy users and not use excessive resources.
20
Index Files
On disk
If an index is held on disk, search time is dominated by the number of disk accesses.
In memory
Suppose that an index has 1,000,000 distinct terms.
Each index entry consists of the term and a pointer to the inverted list, average 100 characters.
Size of index is 100 megabytes, which can easily be held in memory.
21
Postings File
Merging inverted lists is the most computationally intensive task in many information retrieval systems.
Since inverted lists may be very long, it is important to match postings efficiently.
Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially.
For efficient matching, the inverted lists should all be sorted in the same sequence.
Inverted lists are commonly cached to minimize disk accesses.
22
Efficiency and Query Languages
Some query options may require huge computation, e.g.,
Regular expressions
If inverted files are stored in alphabetical order,
comp* can be processed efficiently
*comp cannot be processed efficiently
Boolean terms
If A and B are search terms
A or B can be processed by comparing two moderate sized lists
(not A) or (not B) requires two very large lists
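The difference between comp* and *comp can be sketched with a sorted list of terms and Python's standard `bisect` module. The term list is hypothetical, and the sketch assumes no term contains the character U+FFFF, which is used as an upper bound for the prefix range.

```python
from bisect import bisect_left

# Hypothetical index terms, held in alphabetical order.
terms = ["abacus", "compass", "compile", "compute", "decomp", "recomp"]

def prefix_search(sorted_terms, prefix):
    """Terms starting with `prefix`: two binary searches give a contiguous range."""
    low = bisect_left(sorted_terms, prefix)
    high = bisect_left(sorted_terms, prefix + "\uffff")  # assumed upper bound
    return sorted_terms[low:high]

def suffix_search(sorted_terms, suffix):
    """Terms ending with `suffix`: alphabetical order does not help, so scan all."""
    return [term for term in sorted_terms if term.endswith(suffix)]

prefix_hits = prefix_search(terms, "comp")  # ['compass', 'compile', 'compute']
suffix_hits = suffix_search(terms, "comp")  # ['decomp', 'recomp']
```

The prefix query touches only a logarithmic number of entries to find the range, while the suffix query examines every term in the index.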
23
SMART System
An experimental system for automatic information retrieval
• automatic indexing to assign terms to documents and queries
• collect related documents into common subject classes
• identify documents to be retrieved by calculating similarities between documents and queries
• procedures for producing an improved search query based on information obtained from earlier searches
Gerard Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988
24
Vector Space Methods
Problem: Given two text documents, how similar are they? (One document may be a query.)
Vector space methods that measure similarity do not assume exact matches.
Benefits of similarity measures rather than exact matches
• Encourage long queries, which are rich in information. An abstract should be very similar to its source document.
• Accept probabilistic aspects of writing and searching. Different words will be used if an author writes the same document twice.
25
Vector Space Revision
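$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.
Length of $\mathbf{x}$ is given by (extension of Pythagoras's theorem):
$$|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$
If $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors:
Inner product (or dot product) is given by
$$\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23} + \cdots + x_{1n}x_{2n}$$
Cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$:
$$\cos(\theta) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1|\,|\mathbf{x}_2|}$$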
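The three definitions above translate directly into code. This is a minimal sketch using plain Python lists as vectors; it assumes both vectors have the same number of components and that neither has zero length, since the cosine is undefined in that case.

```python
import math

def dot(x1, x2):
    """Inner product: the sum of the products of corresponding components."""
    return sum(a * b for a, b in zip(x1, x2))

def length(x):
    """Length of a vector: the square root of its inner product with itself."""
    return math.sqrt(dot(x, x))

def cosine(x1, x2):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x1, x2) / (length(x1) * length(x2))

right_angle = cosine([1, 0], [0, 1])           # 0.0, the vectors are orthogonal
same_direction = cosine([1, 2, 3], [2, 4, 6])  # approximately 1.0
```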