1 cs 430 / info 430 information retrieval lecture 5 searching full text 5

1

CS 430 / INFO 430 Information Retrieval

Lecture 5

Searching Full Text 5

2

Course Administration

Assignment 1

• The final version of the assignment has now been posted. There are a number of small changes to simplify the assignment for everybody, including the graders.

• The submission instructions have been made more explicit. Points will be deleted for incorrect submission.

Course Management System

• The course has been linked to the Course Management System, https://cms2.csuglab.cornell.edu/. If you do not see CS 430 when you log in, send email to the course team.

3


Completion of Lecture 4

4

File Structures for Inverted Files: Binary Tree

elk

bee hog

cat

dog

foxant

gnu

Input: elk, hog, bee, fox, cat, gnu, ant, dog

5


Advantages

Can be searched quickly

Convenient for batch updating

Easy to add an extra term

Economical use of storage

Disadvantages

Less good for lexicographic processing, e.g., comp*

Tree tends to become unbalanced

If the index is held on disk, important to optimize the number of disk accesses

6


Calculation of maximum depth of tree.

Illustrates importance of balanced trees.

One possible variant is a red-black tree, which is a reasonable compromise between update performance and balance.

Worst case: depth = n

O(n)

Ideal case: depth = log(n + 1)/log 2

O(log n)

7

File Structures for Inverted Files: Right Threaded Binary Tree

Threaded tree:

A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.

Right-threaded tree:

A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Can be used for lexicographic processing.

A good data structure when held in memory

Knuth vol 1, 2.3.1, page 325.

8

File Structures for Inverted Files: Right Threaded Binary Tree

dog

bee

ant cat

gnu

elk

fox

hog

NULL

9

File Structures for Inverted Files: B-trees

B-tree of order m:

A balanced, multiway search tree:

• Each node stores many keys

• Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.

• If ki is the ith key in a given internal node

-> all keys in the (i-1)th child are smaller than ki

-> all keys in the ith child are bigger than ki

• All leaves are at the same depth

10

File Structures for Inverted Files: B-trees

B-tree example (order 2)

50 65

10 19 35 55 59 70 90 98

1 5 8 9

12 14 18

36 47 66 68

72 73

91 95 97

Every arrow points to a node containing between 2 and 4 keys.A node with k keys has k + 1 pointers.

21 24 28

11

File Structures for Inverted Files: B+-tree

• A B-tree is used as an index

• Data is stored in the leaves of the tree, known as buckets

50 65

10 25 55 59 70 81 90

... D9 D51 ... D54 D66... D81 ...

Example: B+-tree of order 2, bucket size 4

(Implementation of B+-trees is covered in CS 432.)

12


Lecture 5

Searching Full Text 5

13

SMART System

An experimental system for automatic information retrieval

• automatic indexing to assign terms to documents and queries

• collect related documents into common subject classes

• identify documents to be retrieved by calculating similarities between documents and queries

• procedures for producing an improved search query based on information obtained from earlier searches

Gerald Salton and colleagues Harvard 1964-1968 Cornell 1968-1988

14

Indexing Subsystem

Documents

break into tokens

stop list*

stemming*

term weighting*

Inverted file system

text

non-stoplist tokens

tokens

stemmed terms

terms with weights

*Indicates optional operation.

assign document IDsdocuments

document numbers

and *field numbers

15

Search Subsystem

Inverted file system

query parse query

stemming*stemmed terms

stop list* non-stoplist tokens

query tokens

Boolean operations*

ranking*

relevant document set

ranked document set

retrieved document set

*Indicates optional operation.

16

Decisions in Building the Word List: What is a Term?

• Underlying character set, e.g., printable ASCII, Unicode, UTF8.

• Is there a controlled vocabulary? If so, what words are included?

• List of stopwords.

• Rules to decide the beginning and end of words, e.g., spaces or punctuation.

• Character sequences not to be indexed, e.g., sequences of numbers.

17

Lexical Analysis: Term

What is a term?

Free text indexing

A term is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word.

Usually, terms are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc.

18

Oxford English Dictionary

19

Lexical Analysis: Choices

Punctuation: In technical contexts, punctuation may be used as a character within a term, e.g., wordlist.txt.

Case: Case of letters is usually not significant.

Hyphens:

(a) Treat as separators: state-of-art is treated as state of art.

(b) Ignore: on-line is treated as online.

(c) Retain: Knuth-Morris-Pratt Algorithm is unchanged.

Digits: Most numbers do not make good terms, but some are parts of proper nouns or technical terms: CS430, Opus 22.

20

Lexical Analysis: Choices

The modern tendency, for free text searching, is to map upper and lower case letters together in index terms, but otherwise to minimize the changes made at the lexical analysis stage.

With controlled vocabulary, the lexical decisions are made in creating the vocabulary.

21

Lexical Analysis Example: Query Analyzer

A term is a letter followed by a sequence of letters and digits.

Upper case letters are mapped into the lower case equivalents.

The following characters have significance as operators:

( ) & |

22

Lexical Analysis: Transition Diagram

0

1

2

3

4

5

6

7

letter ()

&

|

other

end-of-string

space

letter, digit

23

Lexical Analysis: Transition Table

State space letter ( ) & | other end-of digit string

0 0 1 2 3 4 5 6 7 6 1 1 1 1 1 1 1 1 1 1

States in red are final states.

24

This use of a transition table allows the system administrator to establish differ lexical choices for different collections of documents.

Example:

To change the lexical analyzer to accept tokens that begin with a digit, change the top right element of the table to 1.

Changing the Lexical Analyzer

25

Very common words, such as of, and, the, are rarely of use in information retrieval.

A stop list is a list of such words that are removed during lexical analysis.

A long stop list saves space in indexes, speeds processing, and eliminates many false hits.

However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?")

Stop Lists

26

Example: Stop List for Assignment 1

a about an and

are as at be

but by for from

has have he his

in is it its

more new of on

one or said say

that the their they

this to was who

which will with you

27

Example: the WAIS stop list(first 84 of 363 multi-letter words)

about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything

28

Suggestions for Including Words in a Stop List

• Include the most common words in the English language (perhaps 50 to 250 words).

• Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world).

• In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).

29

Stop list policies

How many words should be in the stop list?

• Long list lowers recall

Multi-lingual document collections have special problems, e.g., die is a very common word in German but less common in English.

There is very little systematic evidence to use in selecting a stop list.

30

Stop Lists in Practice

The modern tendency is:

(a) have very short stop lists for broad-ranging or multi-lingual document collections, especially when the users are not trained.

(b) have longer stop lists for document collections in well-defined fields, especially when the users are trained professional.

31

Stemming

Morphological variants of a word (morphemes). Similar terms derived from a common stem:

engineer, engineered, engineeringuse, user, users, used, using

Stemming in Information Retrieval. Grouping words with a common stem together.

For example, a search on reads, also finds read, reading, and readable

Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.

32

Categories of Stemmer

The following diagram illustrate the various categories of stemmer. Porter's algorithm (which will be discussed in Lecture 6) is shown by the red path.

Conflation methods

Manual Automatic (stemmers)

Affix Successor Table n-gramremoval variety lookup

Longest Simplematch removal

33

Stemming in Practice

Evaluation studies have found that stemming can affect retrieval performance, usually for the better, but the results are mixed.

• Effectiveness is dependent on the vocabulary. Fine distinctions may be lost through stemming.

• Automatic stemming is as effective as manual conflation.

• Performance of various algorithms is similar.

Porter's Algorithm is entirely empirical, but has proved to be an effective algorithm for stemming English text with trained users.

34

Selection of tokens, weights, stop lists and stemming

Special purpose collections (e.g., law, medicine, monographs)

Best results are obtained by tuning the search engine for the characteristics of the collections and the expected queries.

It is valuable to use a training set of queries, with lists of relevant documents, to tune the system for each application.

General purpose collections (e.g., news articles)

The modern practice is to use a basic weighting scheme (e.g., tf.idf), a simple definition of token, a short stop list and little stemming except for plurals, with minimal conflation.

Web searching combine similarity ranking with ranking based on document importance.