information retrieval (for beginners)
Information Retrieval
James Melzer
June 15, 2006
1
How Does Search Work?
2
The basics of search
• A search engine mediates between the user’s query and metadata surrogates for documents
• Documents are reduced to metadata
• User’s need is translated into a query
• Query terms are used to find matching metadata terms
• Lots and lots of room for error...
3
The search process
1. Crawl content for metadata
2. Index document terms into an inverted file; an inverted file is very fast to search
3. Search the index to identify the result set; search the index, not the documents
4. Rank the results for display; ranking is the hardest part
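As a rough illustration of steps 2 and 3, here is a minimal sketch of an inverted file in Python. The whitespace tokenizer, the sample documents, and the function names are assumptions made for this example; a real engine does far more (stemming, stop words, positions, ranking).

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND search).

    Only the index is consulted; the documents themselves are never read.
    """
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample collection.
docs = {
    1: "how search engines rank results",
    2: "inverted index makes search fast",
    3: "ranking is the hardest part",
}
index = build_inverted_index(docs)
print(search(index, "search fast"))
```

The lookup cost depends on the number of query terms and the size of their posting sets, not on the total length of the documents, which is why the index is searched instead of the content.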
4
Search algorithm 1
Term-based Ranking (tf/idf)
• tf = term frequency: documents that use the query terms most are presumed to be most relevant
• idf = inverse document frequency: terms that are more rare are better indicators of relevance
• Assumptions: 1) relevance can be measured with document terms
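One common way to combine the two measures is to multiply them and sum over the query terms. The sketch below assumes tf is the term count divided by document length and idf is log(N / df); many other weighting variants exist, and the sample documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if not tokens or df[term] == 0:
                continue  # empty document, or term absent from the collection
            tf = counts[term] / len(tokens)       # frequent in this document
            idf = math.log(n_docs / df[term])     # rare across the collection
            score += tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical sample collection.
docs = {
    "a": "fridge repair guide for fridge owners",
    "b": "guide to kitchen appliances",
    "c": "a short guide to buying a fridge",
}
scores = tf_idf_scores(docs, "fridge guide")
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 4))
```

In this sample, "guide" appears in every document, so its idf is zero and it contributes nothing; the ranking is decided entirely by the rarer term "fridge".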
5
Search algorithm 2
PageRank (Google)
• Relevant set is still identified by term matching
• A revolution in ranking: based on linking between documents
• Assumptions: 1) important sites link to other important sites 2) if many people link to a site, it is important
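The link-based idea can be sketched as a simple power iteration. This is a textbook-style simplification, not Google's production system; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home", "products"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Both assumptions from the slide show up here: "home" ranks highest because many pages link to it, and a link from a highly ranked page passes on more rank than a link from an obscure one.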
6
Citation Analysis
• Authors carefully select articles to cite
• The more citations an article gets, the better it must be
• Citations by authors who have a lot of citations confer their power on those they cite
• Aggregate and leverage all these small individual decisions...
7
How Complex is Google?
8
Google has about 36 ranking algorithms
Examples:
Citation Analysis
Statistical Clustering
Parsing Document Structure
Parsing Data in the Document
Microcontent Parsing
How to Make Search Better?
9
Evaluating Search
Recall
the percentage of all relevant documents retrieved
100% recall means every relevant document is retrieved
Precision
the percentage of documents retrieved that are relevant
100% precision means only relevant documents are retrieved
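Both measures are simple ratios once the relevant set is known. A minimal sketch, with made-up document ids and a hypothetical function name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents actually relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that the recall calculation needs the full list of relevant documents, which is exactly what is hard to obtain for a large collection.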
10
Thoughts & Reservations about Evaluating Search
• Precision and Recall usually trade off against each other, so improving one often reduces the other.
• Given a corpus of content like the web (tens of billions of items), Recall is unmeasurable, and thus essentially meaningless
• What is relevance?
• Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’ - relevance is very hard to quantify)
Best Bets
• Manually selected results, tied to specific query terms or phrases
• User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply
• Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see
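In code, best bets can be as simple as a lookup table consulted before the regular results. The phrases, URLs, and the stand-in organic search function below are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical, hand-maintained table: normalized phrase -> pinned results.
BEST_BETS = {
    "widget pro": ["/products/widget-pro"],
    "office locations": ["/about/locations"],
}

def organic_search(query):
    """Stand-in for the real search engine; returns a fixed result list."""
    return ["/products/widget-pro/reviews", "/products/widget-pro", "/blog/widgets"]

def search_with_best_bets(query, organic=organic_search):
    """Put manually selected results first, then the regular results."""
    key = " ".join(query.lower().split())  # normalize case and spacing
    pinned = BEST_BETS.get(key, [])
    rest = [r for r in organic(query) if r not in pinned]  # avoid duplicates
    return pinned + rest

print(search_with_best_bets("Widget  Pro"))
```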
Zipf
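Assuming the slide refers to Zipf's law, under which the frequency of the query at rank r is roughly proportional to 1/r, the diminishing returns of best bets can be estimated with a few lines. The pure 1/r model and the figure of 10,000 distinct queries are assumptions for illustration; real search logs only approximate this shape.

```python
def zipf_coverage(top_k, n_queries):
    """Fraction of search traffic covered by the top_k queries,
    assuming the frequency of the query at rank r is proportional to 1/r."""
    total = sum(1.0 / r for r in range(1, n_queries + 1))
    head = sum(1.0 / r for r in range(1, top_k + 1))
    return head / total

n = 10_000  # hypothetical number of distinct queries in the log
for k in (10, 100, 200, 1000):
    print(f"top {k:>4} queries cover {zipf_coverage(k, n):.0%} of traffic")
```

Under this model, the first hundred queries cover a bit over half of all traffic, while the second hundred add only a few percentage points more.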
12
Relevance Feedback
• The user provides direct or indirect feedback on the search results
• Click tracking
• “More like this” or “Find similar”
• Clustering
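Click tracking is the easiest of these to sketch: count which results users click for each query and use the counts to nudge the ranking. The additive boost and its weight of 0.1 are arbitrary assumptions for the example, not a recommended formula.

```python
from collections import Counter

click_counts = Counter()  # (query, doc_id) -> number of clicks observed

def record_click(query, doc_id):
    """Indirect feedback: the user clicked this result for this query."""
    click_counts[(query, doc_id)] += 1

def rerank(query, results, weight=0.1):
    """results is a list of (doc_id, base_score); add a click-based boost."""
    boosted = [
        (doc_id, score + weight * click_counts[(query, doc_id)])
        for doc_id, score in results
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)

# Hypothetical session: users keep choosing the second-ranked result.
results = [("faq", 1.0), ("manual", 0.9)]
for _ in range(3):
    record_click("reset password", "manual")
print(rerank("reset password", results))
```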
13
Structured Search
• Designers use patterns in search behavior to guess user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily)
Examples
• Zip Code -> Zip Code Lookup Tool
• Person’s name -> Directory Listing
• Product Name -> Shop or Support?
• Address -> Map this?
• Topic -> Introduction, Forms, Policies or Reports?
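A few of these patterns can be detected with simple rules. The sketch below assumes US-style zip codes, a tiny hypothetical product list, and a crude street-address pattern; the intent labels are invented, and real systems would need much richer rules grounded in observed user behavior.

```python
import re

# Hypothetical product catalog used for exact-name matching.
PRODUCTS = {"widget pro", "gadget mini"}

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
ADDRESS_RE = re.compile(
    r"\d+\s+\w+(\s+\w+)*\s+(st|street|ave|avenue|rd|road|blvd)\.?",
    re.IGNORECASE,
)

def guess_intent(query):
    """Guess what the user wants from the shape of the query."""
    q = query.strip()
    if ZIP_RE.fullmatch(q):
        return "zip_code_lookup"
    if q.lower() in PRODUCTS:
        return "product_shop_or_support"
    if ADDRESS_RE.fullmatch(q):
        return "map"
    return "general_search"

for q in ["20500", "Widget Pro", "1600 Pennsylvania Ave", "expense policy"]:
    print(q, "->", guess_intent(q))
```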
14
Controlled Vocabularies
• Classification with a controlled vocabulary is the best way to ensure 100% Recall
• Lead-in synonyms: enter “fridge”, get “refrigerator” instead; best if the collection is well-cataloged; increases precision (e.g. in a library)
• Term-expansion synonyms: enter “refrigerator”, get “fridge” too; best if the collection is not well-cataloged; increases recall at the cost of precision (e.g. on eBay)
• Spell check on query phrases
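The difference between the two synonym strategies, plus a naive spell check, can be sketched as follows. The vocabulary and synonym tables are hypothetical, and the spell check uses Python's standard `difflib.get_close_matches` as a stand-in for a real spelling corrector.

```python
import difflib

# Lead-in: map entry terms onto the single preferred (cataloged) term.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}

# Term expansion: treat all members of a synonym ring as equivalent.
SYNONYM_RINGS = [{"refrigerator", "fridge", "icebox"}]

# Hypothetical controlled vocabulary used for spell checking.
VOCABULARY = ["refrigerator", "freezer", "dishwasher"]

def lead_in(term):
    """Replace the user's term with the preferred term, if one exists."""
    t = term.lower()
    return LEAD_IN.get(t, t)

def expand(term):
    """Return the term plus all of its synonyms, to search for any of them."""
    t = term.lower()
    for ring in SYNONYM_RINGS:
        if t in ring:
            return set(ring)
    return {t}

def suggest(term):
    """Suggest the closest vocabulary term, or None if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), VOCABULARY, n=1)
    return matches[0] if matches else None

print(lead_in("fridge"))
print(sorted(expand("refrigerator")))
print(suggest("refridgerator"))
```

Lead-in narrows the query to one well-defined term, so it only works when documents are reliably tagged with that term; expansion widens the query to catch untagged variants, which pulls in more results of mixed quality.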
15
Why is search important?
IF: About half of all users prefer to search first*
THEN: What percentage of a content site’s development effort should be devoted to search?
16
* This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.