information retrieval (for beginners)

Information Retrieval

James Melzer

June 15, 2006

How Does Search Work?

The basics of search

• A search engine mediates between a user’s query and metadata surrogates for documents

• Documents are reduced to metadata

• User’s need is translated into a query

• Query terms are used to find matching metadata terms

• Lots and lots of room for error...

The search process

1. Crawl content for metadata

2. Index document terms into an inverted file; an inverted file is very fast to search

3. Search the index to identify the result set; search the index, not the documents

4. Rank the results for display; ranking is the hardest part
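Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine works: the tokenizer is a bare whitespace split, the sample documents are made up, and the function names are hypothetical.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (real engines do much more)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up sample documents, for illustration only.
docs = {
    1: "search engines index documents",
    2: "users type a query",
    3: "an index makes search fast",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search index")))  # [1, 3]
```

The query only ever touches the index, never the document text, which is why the lookup stays fast as the collection grows.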

Search algorithm 1

Term-based Ranking (tf/idf)

• tf = term frequency: documents that use the query terms most are presumed to be most relevant

• idf = inverse document frequency: rarer terms are better indicators of relevance

• Assumptions: 1) relevance can be measured with document terms
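A rough sketch of term-based ranking follows. The slide does not give a formula, so this assumes one common textbook variant: tf is the term count divided by document length, idf is log(N / df), and a document's score is the sum of tf × idf over the query terms. The sample documents are made up.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term?
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        if not terms:
            scores[doc_id] = 0.0
            continue
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term appears nowhere in the collection
            tf = counts[term] / len(terms)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores[doc_id] = score
    return scores

# Made-up sample documents, for illustration only.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "the dog barked",
}
scores = tf_idf_scores(docs, "dog barked")
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [3, 2, 1]
```

Note that a term found in every document (here, "the") gets an idf of zero and contributes nothing to the score, which is the "rare terms matter more" idea taken to its limit.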

Search algorithm 2

PageRank (Google)

• Relevant set is still identified by term matching

• A revolution in ranking: based on linking between documents

• Assumptions: 1) important sites link to other important sites 2) if many people link to a site, it is important
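The link-based idea can be illustrated with a simplified power-iteration sketch. This is a teaching version, not Google's production ranking: the damping factor of 0.85 is the value commonly cited from the published PageRank description, and the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank, split evenly, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Made-up link graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```

Page "c" ranks highest because many pages link to it (assumption 2), and "a" outranks "b" and "d" because its single inbound link comes from the important page "c" (assumption 1).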

Citation Analysis

• Authors carefully select articles to cite

• The more citations an article gets, the better it must be

• Citations by authors who themselves have a lot of citations confer their power on those they cite

• Aggregate and leverage all these small individual decisions...

How Complex is Google?

Google has about 36 ranking algorithms

Examples:

Citation Analysis

Statistical Clustering

Parsing Document Structure

Parsing Data in the Document

Microcontent Parsing

How to Make Search Better?

Evaluating Search

Recall

the percentage of all relevant documents retrieved

100% recall means every relevant document is retrieved

Precision

the percentage of documents retrieved that are relevant

100% precision means only relevant documents are retrieved
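Both measures reduce to simple set arithmetic once someone has judged which documents are relevant. A small sketch, with made-up document ids and a hypothetical function name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: 4 documents retrieved, 5 documents judged relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Computing recall requires knowing the full set of relevant documents, which is only practical for small or carefully judged collections.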

Thoughts & Reservations about Evaluating Search

• Precision and Recall usually trade off against each other, so improving one often reduces the other.

• Given a corpus of content like the web (tens of billions of items), Recall is unmeasurable, and thus essentially meaningless

• What is relevance?

• Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’ - relevance is very hard to quantify)

Best Bets

• Manually selected results, tied to specific query terms or phrases

• User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply

• Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see
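In its simplest form, a Best Bets layer is a hand-maintained lookup table consulted before the engine's own results. The sketch below assumes exactly that; the phrases, the paths, and the `organic_search` callable are all hypothetical placeholders.

```python
# Hypothetical hand-curated table: normalized query phrase -> pinned results.
BEST_BETS = {
    "benefits": ["/hr/benefits-overview"],
    "office locations": ["/about/locations"],
}

def normalize(query):
    """Lowercase and collapse whitespace so minor variations still match."""
    return " ".join(query.lower().split())

def search_with_best_bets(query, organic_search):
    """Put manually selected results ahead of the engine's own results."""
    pinned = BEST_BETS.get(normalize(query), [])
    # Avoid showing a pinned result a second time further down the list.
    organic = [r for r in organic_search(query) if r not in pinned]
    return pinned + organic

# Stand-in for a real search engine call, for illustration only.
def fake_engine(query):
    return ["/news/new-office", "/about/locations"]

print(search_with_best_bets("Office  Locations", fake_engine))
# ['/about/locations', '/news/new-office']
```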

Zipf

Relevance Feedback

• The user provides direct or indirect feedback on the search results

• Click tracking

• “More like this” or “Find similar”

• Clustering

Structured Search

• Designers use patterns in search behavior to guess the user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily)

Examples

• Zip Code -> Zip Code Lookup Tool

• Person’s name -> Directory Listing

• Product Name -> Shop or Support?

• Address -> Map this?

• Topic -> Introduction, Forms, Policies or Reports?
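The examples above amount to pattern rules checked before the general search runs. A rough sketch, assuming US-style zip codes, a deliberately naive street-address pattern, and hypothetical lookup sets standing in for a staff directory and a product catalog:

```python
import re

# Assumed patterns: US zip codes and a very naive street-address shape.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
ADDRESS_RE = re.compile(
    r"^\d+\s+\w+.*\b(st|street|ave|avenue|rd|road|blvd)\b\.?$",
    re.IGNORECASE,
)

def guess_intent(query, directory_names, product_names):
    """Return a label for the tool or page type the query most likely wants."""
    q = " ".join(query.split())  # trim and collapse whitespace
    if ZIP_RE.match(q):
        return "zip_code_lookup"
    if ADDRESS_RE.match(q):
        return "map"
    if q.lower() in directory_names:
        return "directory_listing"
    if q.lower() in product_names:
        return "shop_or_support"
    return "general_search"

# Hypothetical lookup data, for illustration only.
people = {"jane doe"}
products = {"widget pro"}

print(guess_intent("22030", people, products))         # zip_code_lookup
print(guess_intent("123 Main St.", people, products))  # map
print(guess_intent("Jane Doe", people, products))      # directory_listing
```

Rules like these only guess at intent, so the matched tool is usually offered alongside the normal results rather than in place of them.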

Controlled Vocabularies

• Classification with a controlled vocabulary is the best way to ensure 100% Recall

• Lead-in synonyms: enter “fridge”; get “refrigerator” instead; best if the collection is well-cataloged; increases precision (e.g. in a library)

• Term-expansion synonyms: enter “refrigerator”; get “fridge” too; best if the collection is not well-cataloged; increases recall at the cost of precision (e.g. on eBay)

• Spell check on query phrases
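The two synonym strategies differ only in the direction of the mapping. A minimal sketch, assuming a single hand-built table from variant terms to preferred terms (the function names are hypothetical):

```python
# Variant term -> preferred term from the controlled vocabulary.
LEAD_IN = {"fridge": "refrigerator"}

def lead_in(term):
    """Replace a variant with the preferred term; suits a well-cataloged collection."""
    term = term.lower()
    return LEAD_IN.get(term, term)

def expand(term):
    """Return the preferred term plus its known variants; suits an uncataloged collection."""
    preferred = lead_in(term)
    variants = {variant for variant, target in LEAD_IN.items() if target == preferred}
    return {preferred} | variants

print(lead_in("fridge"))               # refrigerator
print(sorted(expand("refrigerator")))  # ['fridge', 'refrigerator']
```

Lead-in narrows the query to one cataloged term, while expansion widens it to every term a seller or author might have typed.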

Why is search important?

IF: About half of all users prefer to search first*

THEN: What percentage of a content site’s development effort should be devoted to search?

* This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.

Questions?

James Melzer, Information Architect, SRA, [email protected]
