Information Retrieval
Presented By:
Smridh Thapar
st2385@columbia
Columbia University
Objectives
Discuss in brief:
  Information Retrieval
  IR Components
  IR Models
  IR Evaluation
  IR Applications
  IR Future
Information Retrieval
IR: a very important aspect of today’s world.
Broad Definition: “Information retrieval (IR) generally refers to the activity of finding information from various types of data.”
IR in CS: “IR is the science of finding material (usually documents) of an unstructured nature (usually text) that satisfy an information need from within large collections (usually on local servers or the internet).”
Implementing an efficient IR system is mostly an empirical science.
[Figure: Seeker, Source, Retriever]
Information Retrieval
                      Data Retrieval (DR)   Information Retrieval (IR)
Matching              Exact match           Partial/Best match
Model                 Deterministic         Probabilistic
Query language        Artificial            Natural
Query specification   Complete              Incomplete
Items wanted          Matching              Relevant
IR Components
1. Crawling – Data Collection
2. Indexing – Data Processing
3. Searching – Information Lookup
1. Crawling
Crawlers are responsible for keeping data up-to-date.
Problems with crawling the Web:
  Large volume
  Fast rate of change
  Dynamic page generation
Web crawlers require high bandwidth.
Crawler designs:
  Pipelined process – Eg: initial Googlebot. Stages: link extraction, relative-to-absolute link conversion, component retrieval.
  Distributed, parallel crawling – Eg: Mercator and WebFountain. Each process is responsible for its allotted pages; pages are allotted by domain or some other criterion.
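The link extraction and relative-to-absolute conversion stages, and the idea of allotting pages to crawler processes by domain, can be sketched roughly as follows. This is a minimal illustration using Python's standard library, not the design of Googlebot, Mercator, or WebFountain. The names LinkExtractor, extract_absolute_links, and assign_worker are hypothetical, and hashing the domain is only one assumed allotment policy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import zlib


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_absolute_links(page_url, html):
    """Link extraction followed by relative-to-absolute conversion."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(page_url, link) for link in parser.links]


def assign_worker(url, num_workers):
    """Allot a URL to a crawler process by hashing its domain (assumed policy)."""
    domain = urlparse(url).netloc
    return zlib.crc32(domain.encode("utf-8")) % num_workers


html = (
    '<a href="/about">About</a> '
    '<a href="news/today.html">News</a> '
    '<a href="http://other.example/">Other</a>'
)
links = extract_absolute_links("http://example.com/docs/index.html", html)
for link in links:
    print(assign_worker(link, 4), link)
```

Because the worker is chosen from the domain alone, all pages of one site land on the same process, which keeps per-site bookkeeping in one place.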
IR Components (continued)
2. Indexing
Computationally expensive.
Provides the bridge between crawling and searching, and a better ground for efficient searching.
Major considerations:
  Size of the index: Major search engines store the entire document information in the index, so the index is very large. This makes it difficult to keep the entire index in memory and calls for compression techniques.
  Index update: Updating the index while it is being used to serve queries is a challenging task. How frequently should it be updated?
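To make the indexing step concrete, here is a minimal sketch of an inverted index, the structure commonly used to map terms to the documents containing them. The slides do not name a specific index layout, so this is an assumption. The helper to_gaps shows one simple idea behind postings compression (storing differences between sorted document ids); the function names and the toy documents are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def to_gaps(postings):
    """Store each doc id as the gap from the previous one; small gaps compress well."""
    return [doc_id - prev for prev, doc_id in zip([0] + postings[:-1], postings)]


docs = {
    1: "web crawlers collect pages",
    2: "the index maps terms to pages",
    3: "searching uses the index",
}
index = build_inverted_index(docs)
print(index["index"], index["pages"])
print(to_gaps([3, 7, 12]))
```

A real engine would also store positions and term frequencies and would encode the gaps with a variable-length code; the sketch leaves both out.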
3. Searching
Computationally expensive.
Various searching models exist, such as the Boolean, vector, and probabilistic models.
The index needs to be designed according to the model used.
Various other ranking methods are also used, PageRank being the most famous.
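As an illustration of link-based ranking, the following is a simplified power-iteration sketch of the PageRank idea. It is a textbook-style approximation, not the production algorithm of any search engine. The damping value of 0.85, the handling of pages without out-links, and the toy graph are all assumptions made for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively redistribute rank along out-links (simplified sketch)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy link graph: each key links to the pages in its list.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page C collects links from three pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.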
IR Models
Boolean Model
Simple model based on whether or not a query term exists in the document. Terms may be connected by AND, OR, or NOT operators.
Advantages of Boolean Model:
  Easy to implement.
  Computationally efficient.
  Easy to express structural constraints.
  Most logical in nature of usage.
Disadvantages of Boolean Model:
  Boolean queries are difficult to construct; users rarely use the logical operators.
  Offers no notion of relevance: everything is done by matching exact text.
  Does not allow ranked retrieval.
The Extended Boolean model uses fuzzy OR logic to allow ranked retrieval in the Boolean model.
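The plain Boolean model can be sketched with set operations over postings. This is a minimal example under assumed data: the toy INDEX, ALL_DOCS, and the helper names are hypothetical. Note that the result is an unordered set, which reflects the lack of ranked retrieval.

```python
# Toy postings: term -> set of ids of the documents that contain it.
INDEX = {
    "crawler": {1, 3},
    "index": {2, 3},
    "search": {2, 4},
}
ALL_DOCS = {1, 2, 3, 4}


def postings(term):
    """Documents containing the term; an unknown term matches nothing."""
    return INDEX.get(term, set())


def op_and(a, b):
    return a & b


def op_or(a, b):
    return a | b


def op_not(a):
    """NOT needs the full collection to take the complement."""
    return ALL_DOCS - a


# Query: index AND NOT crawler
result = op_and(postings("index"), op_not(postings("crawler")))
print(result)
```

Every document either matches or does not; there is no score that would say one match is better than another.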
Vector Model
Best-match model based on the vector representation of documents. The angle between the vectors is used to measure similarity.
Advantages of Vector Model:
  Provides simplified query formulation.
  Has an inbuilt notion of relevancy, which helps in making the best match.
  Ranked retrieval is straightforward.
IR Models (continued)
Vector Model
Disadvantages of Vector Model:
  Cannot execute NOT queries; there is a lack of query structure and user control.
  The comparison between vectors is not defined by the model; the cosine (or scalar product) comparison is mostly used.
  Devising a good scheme to weight the different terms appropriately is a challenge.
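A rough sketch of ranked retrieval in the vector model, using cosine similarity between a query vector and document vectors. The term weighting here is one common TF-IDF variant chosen for illustration; since the model itself does not prescribe a weighting scheme, the formula, the function names, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse term -> weight vectors for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed weighting: log(N / df) + 1, so terms found in every doc keep some weight.
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc position) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(query.lower().split())
    q_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    return sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)


docs = [
    "web crawlers collect web pages",
    "the index maps terms to pages",
    "cosine similarity compares vectors",
]
ranked = rank("web pages", docs)
print(ranked)
```

A document that shares only one query term still receives a partial score, which is the "best match" behaviour that the Boolean model lacks.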
Probabilistic Model
Best-probability match, based on the probability that a document matches a given query.
Advantages of Probabilistic Model:
  Query formulation is easy, as the system helps the user build a good query using the query expansion technique.
  Users may specify the desired relevancy level explicitly, which gives them a margin for error in matches and can help find material even if an exact match doesn’t exist.
Disadvantages of Probabilistic Model:
  There is a lack of structure in the query.
  Cannot handle NOT queries.
  Computationally expensive, and generally requires a large number of terms to improve retrieval performance.
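The slides do not say which probabilistic model is meant. As one possible illustration, the sketch below scores documents with the Robertson/Sparck Jones term weight in its form without relevance information, log((N - n + 0.5) / (n + 0.5)), where N is the collection size and n is the number of documents containing the term. The choice of this weight, the function names, and the toy collection are assumptions.

```python
import math


def rsj_weights(docs):
    """Per-term weights from document frequencies (no relevance feedback assumed)."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    df = {}
    for terms in tokenized:
        for term in terms:
            df[term] = df.get(term, 0) + 1
    weights = {
        term: math.log((n_docs - n + 0.5) / (n + 0.5)) for term, n in df.items()
    }
    return tokenized, weights


def rank_probabilistic(query, docs):
    """Score each document by summing the weights of the query terms it contains."""
    tokenized, weights = rsj_weights(docs)
    q_terms = set(query.lower().split())
    scores = [
        (sum(weights[t] for t in q_terms & terms), doc_id)
        for doc_id, terms in enumerate(tokenized)
    ]
    # Highest score first; ties broken by document position.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


docs = [
    "information retrieval systems",
    "database systems",
    "retrieval of images",
    "cooking recipes",
    "travel guides",
]
ranking = rank_probabilistic("information retrieval", docs)
print(ranking)
```

Rare query terms contribute more to the score than common ones, so the document matching both terms ranks above the one matching only the more frequent term.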
IR Evaluation (formal)
IR is inherently empirical, and careful evaluation is essential.
Basic measures (numbers) used to evaluate IR systems are:
  Precision: Fraction of retrieved documents that are relevant.
    Precision = #(relevant documents retrieved) / #(retrieved documents) = P(relevant | retrieved)
  Recall: Fraction of relevant documents that are retrieved.
    Recall = #(relevant documents retrieved) / #(relevant documents) = P(retrieved | relevant)
  F-measure: Combines Precision and Recall in one value.
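The three measures can be computed from the set of retrieved documents and the set of relevant documents, as in the sketch below. The slide does not give the F-measure formula; the code assumes the usual weighted harmonic mean, which for beta = 1 reduces to 2PR / (P + R). The function name and the sample document ids are hypothetical.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # Assumed formula: weighted harmonic mean of precision and recall.
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f


# 4 documents retrieved, 5 relevant in the collection, 2 in common.
p, r, f1 = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r, round(f1, 4))
```

Here precision is 2/4 and recall is 2/5; the F-measure sits between them, closer to the lower value.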
IR Evaluation (user satisfaction)
Factors affecting user satisfaction:
  Wait time, language of interaction, size and type of documents indexed.
User satisfaction can be measured by:
  Surveys or monitored user studies.
  Number of returning users.
  In the context of eCommerce search engines: number of users turning into buyers.
“Search is changing the nature of the Web as much as the Web has changed the nature of search”
IR Applications
Web search: one of many, but the most famous IR application.
Multi-media Retrieval. Eg: Snap, Google Video
Directory Retrieval. Eg: DMOZ, Yahoo
Encyclopedia. Eg: Wikipedia
Domain Specific. For law, finance, medical field, etc.
E-commerce.
UGC IR Apps. Eg: YouTube, Digg, Flickr
many more…
IR Future
IR in the past decade has changed the way people search for information.
Future IR systems:
  will be more sensitive to user needs.
  will have more data to analyze and produce. Eg: “which shirt looks good with blue jeans”
Personalized search. Being explored by many, including Google, AskJeeves, Amazon. Eg: “which shirts among the ones I like look good with blue jeans”
New and sophisticated IR systems will make managing personal data easier. Early examples: Google Desktop, Gmail.
THANK YOU