information retrieval presented by: smridh thapar st2385@columbia columbia university

Information Retrieval

Presented By:

Smridh Thapar

st2385@columbia

Columbia University

Objectives

Discuss in brief: Information Retrieval, IR Components, IR Models, IR Evaluation, IR Applications, IR Future

Information Retrieval

IR is a very important aspect of today’s world.

Broad Definition: “Information retrieval (IR) generally refers to the activity of finding information from various types of data.”

IR in CS: “IR is the science of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually on local servers or the internet).”

Implementing an efficient IR system is mostly an empirical science.

[Diagram: Seeker, Retriever, Source]

                    | Data Retrieval (DR) | Information Retrieval (IR)
Matching            | Exact match         | Partial/Best match
Model               | Deterministic       | Probabilistic
Query language      | Artificial          | Natural
Query specification | Complete            | Incomplete
Items wanted        | Matching            | Relevant


IR Components

1. Crawling – Data Collection
2. Indexing – Data Processing
3. Searching – Information Lookup

1. Crawling

Crawlers are responsible for keeping data up-to-date.

Problems with crawling the Web:
Large volume
Fast rate of change
Dynamic page generation

Web crawlers require high bandwidth.

Crawler designs:
Pipelined process, e.g. the initial Googlebot: link extraction, relative-to-absolute URL conversion, component retrieval
Distributed, parallel crawling, e.g. Mercator and WebFountain: each process is responsible for its allotted pages; pages are allotted by domain or some other criterion
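Two of the crawler steps named above, link extraction with relative-to-absolute conversion and allotting pages to processes by domain, can be sketched with the Python standard library. This is a minimal illustration, not how Googlebot, Mercator, or WebFountain are actually built; the names `LinkExtractor`, `extract_links`, and `assign_worker` are hypothetical, and hashing the host name is just one assumed allotment criterion.

```python
import zlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative-to-absolute conversion.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def assign_worker(url, num_workers):
    """Allot a URL to a crawler process by hashing its host name
    (assumed criterion), so every page of a domain goes to one process."""
    host = urlparse(url).netloc.lower()
    return zlib.crc32(host.encode("utf-8")) % num_workers


page = '<a href="c.html">c</a> <a href="/d">d</a> <a href="http://other.org/x">x</a>'
links = extract_links("http://example.com/a/b.html", page)
```

Here `crc32` is used instead of the built-in `hash()` because the latter is randomized per process for strings, which would make the allotment differ between crawler processes.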

IR Components (continued)

2. Indexing

Computationally expensive.
Provides the bridge between crawling and searching, and a better ground for efficient searching.

Major considerations:
Size of the index: Major search engines store the entire document information in the index, so the index is very large. This makes it difficult to keep the entire index in memory and calls for compression techniques.
Index update: Updating the index while it is being used to serve queries is challenging. How frequently should it be updated?
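As an illustration of what an index and its compression might look like, the sketch below builds an inverted index (term to sorted document IDs) and stores each list as gaps between consecutive IDs. The slides do not name a specific index structure or compression scheme, so both are assumptions here, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def gap_encode(postings):
    """Keep the first ID, then only the differences between neighbours.
    Small gaps need fewer bits than full IDs once a variable-length
    integer encoding is applied (not shown here)."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def gap_decode(gaps):
    """Rebuild the original document IDs from the gap list."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


docs = {1: "web search engine", 2: "search index", 5: "index compression"}
index = build_inverted_index(docs)
```

With this sample collection, `index["search"]` holds documents 1 and 2, and `gap_encode([3, 7, 8, 20])` gives `[3, 4, 1, 12]`.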

3. Searching

Computationally expensive.
Various searching models exist, such as the Boolean, vector, and probabilistic models.
The index needs to be designed according to the model used.
Various other ranking methods are also used, PageRank being the most famous.
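For readers unfamiliar with PageRank, the following is a small power-iteration sketch on a toy link graph. It is a textbook-style simplification rather than any search engine's production code; the damping factor of 0.85, the iteration count, and the even redistribution of rank from pages without outgoing links are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph page `c` is linked from both `a` and `b`, so it ends up ranked above `b`, which is linked only from `a`.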

IR Models

Boolean Model

A simple model based on whether or not a query term exists in the document. Terms may be connected by AND, OR, or NOT operators.
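The Boolean model maps naturally onto set operations: each term has a set of documents containing it, and AND, OR, and NOT become intersection, union, and complement. The sketch below assumes whitespace tokenization and a tiny in-memory collection; the helper names are hypothetical.

```python
def build_index(docs):
    """Map each term to the set of IDs of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def docs_with(index, term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


def op_and(a, b):
    return a & b


def op_or(a, b):
    return a | b


def op_not(all_ids, a):
    return all_ids - a


docs = {
    1: "information retrieval models",
    2: "boolean retrieval",
    3: "vector space models",
}
index = build_index(docs)
all_ids = set(docs)

# retrieval AND NOT boolean
result_1 = op_and(docs_with(index, "retrieval"), op_not(all_ids, docs_with(index, "boolean")))
# boolean OR vector
result_2 = op_or(docs_with(index, "boolean"), docs_with(index, "vector"))
```

The results are plain, unordered sets: a document either matches or it does not, and nothing in the result says which match is better.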

Advantages of the Boolean model:
Easy to implement.
Computationally efficient.
Easy to express structural constraints.
Most logical in nature and usage.

Disadvantages of the Boolean model:
Boolean queries are difficult to construct; users rarely use the logical operators.
Offers no notion of relevance: everything is done by matching exact text.
Does not allow ranked retrieval.

The Extended Boolean model uses fuzzy OR logic to allow ranked retrieval within the Boolean model.

Vector Model

A best-match model based on the vector representation of documents. The angle between the vectors is used to measure similarity.

Advantages of the vector model:
Provides simplified query formulation.
Has a built-in notion of relevance, which helps in making the best match.
Ranked retrieval is straightforward.

IR Models (continued)

Vector Model

Disadvantages of the vector model:
Cannot execute NOT queries, and lacks query structure and user control.
The model does not define how vectors are compared; the cosine (or scalar product) comparison is most commonly used.
Devising a good scheme to weight the different terms appropriately is a challenge.
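To make the vector model concrete, the sketch below represents documents and the query as sparse term-weight vectors and ranks documents by the cosine of the angle between them. TF-IDF is used as the weighting scheme, which is a common choice but an assumption here, since the slides leave the weighting open; the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document position) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(query.lower().split())
    query_vec = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


docs = ["web search engine", "vector space model", "probabilistic model of retrieval"]
ranking = rank("vector model", docs)
```

Unlike the Boolean sketch, every document gets a score, so a document matching only one query term still appears in the ranking, below the one matching both.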

Probabilistic Model

A best-match model that ranks documents by the probability that they match a given query.

Advantages of the probabilistic model:
Query formulation is easy, as the system helps the user build a good query through query expansion.
Users may specify the desired relevance level explicitly, which gives them a margin for error in matches and can help find material even if an exact match doesn’t exist.

Disadvantages of the probabilistic model:
The query lacks structure.
Cannot handle NOT queries.
Computationally expensive, and generally requires a large number of terms to improve retrieval performance.
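The slides do not say which probabilistic model is meant. As one possible illustration, the sketch below scores documents in the style of the Binary Independence Model, using the Robertson/Sparck Jones term weight for the case where no relevance information is available: log((N - n + 0.5) / (n + 0.5)), with N the number of documents and n the number containing the term. The choice of model and the function names are assumptions.

```python
import math


def rsj_weight(num_docs, doc_freq):
    """Term weight when no relevance judgments are available."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bim_rank(query, docs):
    """Score each document by summing the weights of the query terms it
    contains; return (score, document position) pairs, best first."""
    doc_terms = [set(doc.lower().split()) for doc in docs]
    num_docs = len(doc_terms)
    query_terms = set(query.lower().split())

    # Document frequency of each query term.
    doc_freq = {t: sum(1 for terms in doc_terms if t in terms) for t in query_terms}

    scores = []
    for i, terms in enumerate(doc_terms):
        score = sum(rsj_weight(num_docs, doc_freq[t]) for t in query_terms if t in terms)
        scores.append((score, i))
    return sorted(scores, reverse=True)


docs = ["apple banana", "apple cherry", "apple", "banana", "date"]
ranking = bim_rank("banana cherry", docs)
```

The rarer term `cherry` carries more weight than `banana`, so the document containing it is ranked first. Note that with this weight a term occurring in more than half of the documents gets a negative score.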

IR Evaluation (formal)

IR is inherently empirical and careful evaluation is essential.

Basic measures (numbers) used to evaluate IR systems are:

Precision: fraction of retrieved documents that are relevant.
Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)

Recall: fraction of relevant documents that are retrieved.
Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)

F-measure: combines precision and recall in one value.
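The three measures can be computed directly from the set of retrieved documents and the set of relevant ones. The sketch below uses the balanced F-measure (F1, the harmonic mean of precision and recall); the slides do not say which variant is intended, so that choice is an assumption, as is returning 0.0 when a denominator is empty.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 4 documents retrieved, 3 relevant in total, 2 in common.
p, r, f = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

In this example precision is 2/4, recall is 2/3, and F1 is 4/7, which sits between the two but closer to the lower value.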

IR Evaluation (user satisfaction)

Factors affecting user satisfaction:
Wait time, language of interaction, size and type of documents indexed.

User satisfaction can be measured by:
Surveys or monitored user studies.
Number of returning users.
For e-commerce search engines: the number of users who turn into buyers.

“Search is changing the nature of the Web as much as the Web has changed the nature of search”

IR Applications

Web search is one of many IR applications, and the most famous.

Multimedia retrieval, e.g. Snap, Google Video
Directory retrieval, e.g. DMOZ, Yahoo
Encyclopedia, e.g. Wikipedia
Domain-specific retrieval, for law, finance, the medical field, etc.
E-commerce
UGC IR apps, e.g. YouTube, Digg, Flickr
Many more…

IR Future

IR in the past decade has changed the way people search for information.

Future IR systems:
Will be more sensitive to user needs.
Will have more data to analyze and produce. E.g. “which shirt looks good with blue jeans”
Personalized search: being explored by many, including Google, AskJeeves, and Amazon. E.g. “which shirts among the ones I like look good with blue jeans”
New and sophisticated IR systems will make managing personal data easier. Early examples: Google Desktop, GMail.


THANK YOU
