
Page 1

CS 430 / INFO 430 Information Retrieval

Lecture 2

Searching Full Text 2

Page 2

Course Administration

Web site:

http://www.cs.cornell.edu/courses/cs430/2006fa

Notices:

See the home page on the course Web site

Sign-up sheet:

If you did not sign up at the first class, please sign up now.

Programming assignments:

The assignments require Java or C++. Other languages are under consideration.

Page 3

Course Administration

Please send all questions about the course to:

[email protected]

The message will be sent to

William Arms
Teaching Assistants

Page 4

Course Administration

Discussion class: Wednesday, August 30, Phillips Hall 203, 7:30 to 8:30 p.m.

Prepare for the class as instructed on the course Web site.

Participation in the discussion classes counts for one third of the grade, but tomorrow's class will not be included in the grade calculation.

Page 5

Discussion Classes

Format:

Questions.

Ask a member of the class to answer.

Provide opportunity for others to comment.

When answering:

Stand up.

Give your name. Make sure that the TA hears it.

Speak clearly so that all the class can hear.

Suggestions:

Do not be shy about presenting partial answers.

Differing viewpoints are welcome.

Page 6

Discussion Class: Preparation

You are given two problems to explore:

• What is the medical evidence that red wine is good or bad for your health?

• What in history led to the current turmoil in Palestine and the neighboring countries?

In preparing for the class, focus on the question: What characteristics of the three search services are helpful or lead to difficulties in addressing these two problems? The aim of your preparation is to explore the search services, not to solve these two problems.

Take care. Many of the documents that you might find are written from a one-sided viewpoint.

Page 7

Discussion Class: Preparation

In preparing for the discussion classes, you may find it useful to look at the slides from last year's class on the old Web site:

http://www.cs.cornell.edu/Courses/cs430/2005fa/

Page 8

Similarity Ranking Methods

Similarity ranking methods: measure the degree of similarity between a query and a document.

[Diagram: a query is compared against a set of documents to judge how similar each document is.]

Similar: How similar is a document to a query?

[Contrast with methods that look for exact matches (e.g., Boolean). Those methods assume that a document is either relevant to a query or not relevant.]

Page 9

Similarity Ranking Methods: Use of Indexes

[Diagram: the query is matched against an index database, the mechanism for determining the similarity of the query to each document; the output is the set of documents ranked by how similar they are to the query.]

Page 10

Term Similarity: Example

Problem: Given two text documents, how similar are they?

A document can be any length, from one word to thousands of words. The following examples use very short artificial documents.

Example

Here are three documents. How similar are they?

d1: ant ant bee
d2: dog bee dog hog dog ant dog
d3: cat gnu dog eel fox

Page 11

Concept: Two documents are similar if they contain some of the same terms.

Term Similarity: Basic Concept

Page 12

Term Vector Space: No Weighting

Term vector space

n-dimensional space, where n is the number of different terms used to index a set of documents (i.e. size of the word list).

Vector

Document i is represented by a vector. Its magnitude in dimension j is tij, where:

tij = 1 if term j occurs in document i
tij = 0 otherwise

[This is the basic method with no term weighting.]

Page 13

A Document Represented in a 3-Dimensional Term Vector Space

[Figure: a single document vector d1 in a 3-dimensional term vector space with axes t1, t2, t3; its components along the axes are t11, t12, t13.]

Page 14

Basic Method: Term Incidence Matrix

(No Weighting)

document   text                          terms
d1         ant ant bee                   ant bee
d2         dog bee dog hog dog ant dog   ant bee dog hog
d3         cat gnu dog eel fox           cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     1    1
d2     1    1         1                   1
d3               1    1    1    1    1

tij = 1 if document i contains term j and zero otherwise

3 vectors in 8-dimensional term vector space
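The incidence matrix is straightforward to compute. Below is a minimal Python sketch (not from the original slides; the names are invented for illustration) that builds these binary vectors from the three example documents.

```python
# Minimal sketch: binary term incidence vectors for the example documents.

documents = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# The word list defines the dimensions of the term vector space.
terms = sorted({word for text in documents.values() for word in text.split()})

def incidence_vector(text, terms):
    """tij = 1 if term j occurs in the document, 0 otherwise (no weighting)."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]

matrix = {name: incidence_vector(text, terms) for name, text in documents.items()}

print(terms)             # ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']
for name, vector in matrix.items():
    print(name, vector)  # e.g. d1 [1, 1, 0, 0, 0, 0, 0, 0]
```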

Page 15

Similarity between two Documents

Similarity

The similarity between two documents, d1 and d2, is a function of the angle between their vectors in the term vector space.

Page 16

Two Documents Represented in 3-Dimensional Term Vector Space

[Figure: two document vectors d1 and d2 in a 3-dimensional term vector space with axes t1, t2, t3; the angle between the vectors measures their similarity.]

Page 17

Vector Space Revision

x = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space

Length of x is given by (an extension of Pythagoras's theorem):

|x|² = x1² + x2² + x3² + ... + xn²

If x1 and x2 are vectors:

Inner product (or dot product) is given by x1.x2 = x11x21 + x12x22 + x13x23 + ... + x1nx2n

Cosine of the angle between the vectors x1 and x2:

cos(θ) = x1.x2 / (|x1| |x2|)
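These definitions translate directly into code. The sketch below is an illustration (not part of the lecture); it computes the inner product, length, and cosine for plain Python lists of numbers.

```python
import math

def inner_product(x, y):
    """x.y = x1*y1 + x2*y2 + ... + xn*yn"""
    return sum(a * b for a, b in zip(x, y))

def length(x):
    """|x| = sqrt(x1^2 + x2^2 + ... + xn^2)"""
    return math.sqrt(sum(a * a for a in x))

def cosine(x, y):
    """cos(theta) = x.y / (|x| |y|)"""
    return inner_product(x, y) / (length(x) * length(y))
```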

Page 18

Example: Comparing Documents (No Weighting)

      ant  bee  cat  dog  eel  fox  gnu  hog   length
d1     1    1                                   √2
d2     1    1         1                   1     √4
d3               1    1    1    1    1          √5

Page 19

Example: Comparing Documents

Similarity of the documents in the example:

      d1     d2     d3
d1    1      0.71   0
d2    0.71   1      0.22
d3    0      0.22   1
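As a check, a short sketch (reusing the hypothetical `matrix` dictionary and cosine() helper from the earlier sketches) reproduces this table:

```python
# Reproduce the document-document similarity table from the example,
# reusing `matrix` and cosine() from the earlier sketches.
names = ["d1", "d2", "d3"]
for a in names:
    row = [round(cosine(matrix[a], matrix[b]), 2) for b in names]
    print(a, row)
# d1 [1.0, 0.71, 0.0]
# d2 [0.71, 1.0, 0.22]
# d3 [0.0, 0.22, 1.0]
```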

Page 20

Similarity between a Query and a Document

Consider a query as another vector in the term vector space.

The similarity between a query, q, and a document, d, is a function of the angle between their vectors in the term vector space.

Page 21

Similarity between a Query and a Document in 3-Dimensional Term Vector Space

[Figure: a query vector q and a document vector d in a 3-dimensional term vector space with axes t1, t2, t3.]

cos(θ) is used as a measure of similarity.

Page 22

Similarity of Query to Documents (Term Incidence Matrix: No Weighting)

query      q    ant dog

document   text                          terms
d1         ant ant bee                   ant bee
d2         dog bee dog hog dog ant dog   ant bee dog hog
d3         cat gnu dog eel fox           cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
q      1              1
d1     1    1
d2     1    1         1                   1
d3               1    1    1    1    1

Page 23

Calculate Ranking

Similarity of the query to the documents in the example:

      d1     d2      d3
q     1/2    1/√2    1/√10
      0.50   0.71    0.32

If the query q is searched against this document set, the ranked results are:

d2, d1, d3
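The same calculation can be scripted. The sketch below (again reusing the helpers from the earlier sketches, so the names are assumptions, not the lecture's code) scores the query against each document and ranks the results.

```python
# Score the query "ant dog" against d1, d2, d3 and rank the results,
# reusing terms, matrix, incidence_vector() and cosine() from the earlier sketches.
query_vector = incidence_vector("ant dog", terms)

scores = {name: cosine(query_vector, vector) for name, vector in matrix.items()}
# scores is approximately {'d1': 0.50, 'd2': 0.71, 'd3': 0.32}

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['d2', 'd1', 'd3']
```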

Page 24

Simple Uses of Vector Similarity in Information Retrieval

Threshold

For query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50.

Ranking

For query q, return the n most similar documents ranked in order of similarity.

[This is the standard practice.]
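Both uses are only a few lines once the similarity scores are available; a sketch, assuming the `scores` dictionary from the previous example:

```python
# Threshold: retrieve all documents with similarity above 0.50.
above_threshold = [name for name, score in scores.items() if score > 0.50]
# ['d2']

# Ranking: return the n most similar documents in order of similarity.
n = 2
top_n = sorted(scores, key=scores.get, reverse=True)[:n]
# ['d2', 'd1']
```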

Page 25

Extending the Basic Concept with Term Weighting

An improved measure of similarity might take account of:

(a) Whether the terms are common or unusual

(b) How many times each term appears in a document

(c) The lengths of the documents

(d) The place in the document where a term appears

(e) Terms that are adjacent to each other (phrases)

Page 26

Term Vector Space with Weighting

Term vector space

n-dimensional space, where n is the number of different terms used to index a set of documents (i.e. size of the word list).

Vector

Document i is represented by a vector. Its magnitude in dimension j is tij, where:

tij > 0 if term j occurs in document i
tij = 0 otherwise

tij is the weight of term j in document i.
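The slide leaves the choice of weight open. One simple possibility (an assumption for illustration, not the lecture's definition) is to use the raw count of the term in the document:

```python
from collections import Counter

def weighted_vector(text, terms):
    """tij > 0 if term j occurs in document i, 0 otherwise.
    Here the weight is the raw count of term j in document i
    (one possible weighting; the slide leaves the weight unspecified)."""
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in terms]

# Example: d2 = "dog bee dog hog dog ant dog"
# -> [1, 1, 0, 4, 0, 0, 0, 1] over ['ant','bee','cat','dog','eel','fox','gnu','hog']
```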