Information Retrieval (for beginners)

Page 1: Information Retrieval (for beginners)

Information Retrieval

James Melzer

June 15, 2006

Page 2: Information Retrieval (for beginners)

How Does Search Work?

Page 3: Information Retrieval (for beginners)

The basics of search

• A search engine mediates between a user’s query and the metadata surrogates that stand in for documents

• Documents are reduced to metadata

• The user’s need is translated into a query

• Query terms are used to find matching metadata terms

• Lots and lots of room for error...

Page 4: Information Retrieval (for beginners)

The search process

1. Crawl content for metadata

2. Index document terms into an inverted file; an inverted file is very fast to search

3. Search the index to identify the result set; search the index, not the documents

4. Rank the results for display; ranking is the hardest part
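
Steps 2 and 3 can be illustrated with a minimal sketch, assuming whitespace tokenization and a small in-memory collection. The function names `build_index` and `search` and the sample documents are hypothetical and not from the talk; a production engine would also handle stemming, stop words, and storage on disk.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "search engines index documents",
    2: "an inverted file maps terms to documents",
    3: "ranking is the hardest part of search",
}
index = build_index(docs)
print(sorted(search(index, "search")))
print(sorted(search(index, "documents index")))
```

The query is answered by looking up each term in the index and intersecting the sets, so the documents themselves are never scanned at query time.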

Page 5: Information Retrieval (for beginners)

Search algorithm 1

Term-based Ranking (tf/idf)

• tf = term frequency: documents that use the query terms most are presumed to be most relevant

• idf = inverse document frequency: terms that are rarer are better indicators of relevance

• Assumptions: 1) relevance can be measured with document terms
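
A minimal sketch of term-based ranking follows. The talk does not give a formula, so this assumes one common variant: tf is the term count divided by document length, and idf is the natural log of (number of documents / number of documents containing the term). The function name `tfidf_rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "refrigerator repair and refrigerator parts",
    "b": "repair manual for a washer",
    "c": "washer parts catalog",
}
ranked = tfidf_rank(docs, "refrigerator repair")
for doc_id, score in ranked:
    print(doc_id, round(score, 4))
```

Document "a" ranks first because it uses the rarer term "refrigerator" twice, while "b" matches only the more common term "repair".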

Page 6: Information Retrieval (for beginners)

Search algorithm 2

PageRank (Google)

• Relevant set is still identified by term matching

• A revolution in ranking: based on linking between documents

• Assumptions: 1) important sites link to other important sites 2) if many people link to a site, it is important
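
The link-based idea can be sketched as a simple power iteration. This is a simplified illustration of the published PageRank idea, not Google's production system; the damping factor of 0.85, the iteration count, the function name `pagerank`, and the sample link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over a dict of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph, for illustration only.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home", "products"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this graph "home" ends up with the highest score because every other page links to it, and "blog" the lowest because nothing links to it.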

Page 7: Information Retrieval (for beginners)

Citation Analysis

• Authors carefully select articles to cite

• The more citations an article gets, the better it must be

• Citations from authors who are themselves heavily cited confer some of that authority on the articles they cite

• Aggregate and leverage all these small individual decisions...

Page 8: Information Retrieval (for beginners)

How Complex is Google?

Google has about 36 ranking algorithms

Examples:

Citation Analysis

Statistical Clustering

Parsing Document Structure

Parsing Data in the Document

Microcontent Parsing

Page 9: Information Retrieval (for beginners)

How to Make Search Better?

Page 10: Information Retrieval (for beginners)

Evaluating Search

Recall

the percentage of all relevant documents retrieved

100% recall means every relevant document is retrieved

Precision

the percentage of documents retrieved that are relevant

100% precision means only relevant documents are retrieved
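
Both measures can be computed from two sets: the documents retrieved and the documents judged relevant. The sketch below returns fractions rather than percentages; the function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall about 0.67).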

Page 11: Information Retrieval (for beginners)

Thoughts & Reservations about Evaluating Search

• Precision and Recall usually trade off against each other, so improving one often reduces the other.

• Given a corpus of content like the web (tens of billions of items), Recall is unmeasurable, and thus essentially meaningless

• What is relevance?

• Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’ - relevance is very hard to quantify)

Page 12: Information Retrieval (for beginners)

Best Bets

• Manually selected results, tied to specific query terms or phrases

• User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply

• Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see

[Figure: Zipf distribution]
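
A minimal sketch of both halves of Best Bets, assuming a plain list of logged queries and a hand-maintained lookup table. The names `top_phrases` and `results_with_best_bets`, the query log, and the URLs are hypothetical.

```python
from collections import Counter


def top_phrases(query_log, n):
    """Return the n most frequent query phrases, ignoring case and padding."""
    counts = Counter(q.strip().lower() for q in query_log)
    return [phrase for phrase, _ in counts.most_common(n)]


def results_with_best_bets(query, best_bets, organic_results):
    """Put manually selected results ahead of the regular results."""
    pinned = best_bets.get(query.strip().lower(), [])
    rest = [r for r in organic_results if r not in pinned]
    return pinned + rest


# Hypothetical search traffic and best-bet table, for illustration only.
query_log = ["benefits", "Benefits", "holiday schedule",
             "benefits ", "parking", "holiday schedule"]
best_bets = {"benefits": ["/hr/benefits-overview"]}

print(top_phrases(query_log, 2))
print(results_with_best_bets(
    "Benefits", best_bets,
    ["/news/benefits-fair", "/hr/benefits-overview"]))
```

Because query frequency falls off sharply, covering only the first few phrases returned by `top_phrases` already handles a large share of traffic.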

Page 13: Information Retrieval (for beginners)

Relevance Feedback

• The user provides direct or indirect feedback on the search results

• Click tracking

• “More like this” or “Find similar”

• Clustering

Page 14: Information Retrieval (for beginners)

Structured Search

• Designers use patterns in search behavior to guess the user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily)

Examples

• Zip Code -> Zip Code Lookup Tool

• Person’s name -> Directory Listing

• Product Name -> Shop or Support?

• Address -> Map this?

• Topic -> Introduction, Forms, Policies or Reports?
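
The examples above can be sketched as pattern matching on the query string. The regular expressions (US-style ZIP codes and street addresses), the intent labels, and the function name `guess_intent` are hypothetical; a real site would derive its patterns from observed search behavior.

```python
import re

# Hypothetical patterns and destinations, for illustration only.
INTENT_PATTERNS = [
    # Five digits, optionally followed by a four-digit extension.
    (re.compile(r"^\d{5}(-\d{4})?$"), "zip_code_lookup"),
    # A house number followed by text ending in a street suffix.
    (re.compile(r"^\d+\s+.*\b(st|street|ave|avenue|rd|road|blvd)\b\.?$",
                re.IGNORECASE), "map"),
]


def guess_intent(query, product_names=frozenset()):
    """Return a label for the most likely intent behind a query."""
    query = query.strip()
    for pattern, intent in INTENT_PATTERNS:
        if pattern.search(query):
            return intent
    if query.lower() in product_names:
        return "shop_or_support"
    return "general_search"


products = {"widgetpro"}  # hypothetical product catalog
for q in ["20170", "123 Main St", "WidgetPro", "travel policy"]:
    print(q, "->", guess_intent(q, products))
```

Queries that match no pattern fall through to ordinary search, so a wrong guess costs nothing more than a missed shortcut.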

Page 15: Information Retrieval (for beginners)

Controlled Vocabularies

• Classification with a controlled vocabulary is the best way to ensure 100% Recall

• Lead-in synonyms: enter “fridge”; get “refrigerator” instead; best if the collection is well-cataloged; increases precision (e.g. in a library)

• Term-expansion synonyms: enter “refrigerator”; get “fridge” too; best if the collection is not well-cataloged; increases recall at the cost of precision (e.g. on eBay)

• Spell check on query phrases
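
The difference between the two synonym strategies can be shown with a short sketch. The synonym tables and the function names `lead_in` and `expand` are hypothetical; the "icebox" entries are added only to make the example less trivial.

```python
# Hypothetical synonym tables, for illustration only.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}
EXPANSIONS = {"refrigerator": ["fridge", "icebox"]}


def lead_in(term):
    """Replace a variant with the preferred (controlled) term."""
    term = term.lower()
    return LEAD_IN.get(term, term)


def expand(term):
    """Return the term plus its synonyms, to be OR-ed together in a search."""
    term = term.lower()
    return [term] + EXPANSIONS.get(term, [])


print(lead_in("fridge"))
print(expand("refrigerator"))
```

Lead-in narrows the query to the single term the catalogers used, while expansion widens it to every variant an uncataloged document might contain.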

Page 16: Information Retrieval (for beginners)

Why is search important?

IF: About half of all users prefer to search first*

THEN: What percentage of a content site’s development effort should be devoted to search?

* This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.

Page 17: Information Retrieval (for beginners)

Questions?

James Melzer, Information Architect, SRA [email protected]
