Information Retrieval, Unit 1. Seema Chandak.
Unit 1 : Objective & Content
Objective: To deal with the representation, storage, organization of, and access to information items in IR.
Unit 1 : Content
• Content: Basic Concepts of IR, Data Retrieval & Information Retrieval, IR System Block Diagram. Automatic Text Analysis, Luhn's Ideas, Conflation Algorithm, Indexing and Index Term Weighting, Probabilistic Indexing, Automatic Classification. Measures of Association, Different Matching Coefficients, Classification Methods, Cluster Hypothesis. Clustering Algorithms, Single Pass Algorithm, Single Link Algorithm, Rocchio's Algorithm, Dendrogram.
What is IR
Information retrieval: a subfield of computer science that deals with the automated retrieval of information (especially text) based on its content and context.
The term Information Retrieval was coined by Calvin Mooers (1950). “It is concerned with the representation, storage, organization, and accessing of information items.”
Need for IR
• Information is considered the most important resource for most activities.
• Example: timely weather reports.
• Timely sharing of information.
• The timely retrieval of information plays a major role, in keeping with the motto “right information at the right time”.
Types of IR
– Structured (all database management systems)
– Unstructured (search engines)
– Semi-structured (data warehouses)
IR Based on Structured Data
• Recollect terms related to DBMS ..
– Data organization in the form of schema, keys, index, metadata….
– Query structure
– Results set
– …..
– ….
IR Vs. DR
Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus.
Example: Get documents about Java, except for ones that are about the Java coffee.
Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Example: Get all documents containing the term “Java” but not containing the term “coffee”.
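The data retrieval example can be sketched as an exact Boolean filter. This is a minimal illustration, assuming whitespace tokenization; the function name `boolean_retrieve` and the sample documents are hypothetical and not part of the original slides.

```python
def boolean_retrieve(docs, must_have, must_not_have):
    """Return ids of documents that contain `must_have` but not `must_not_have`."""
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        # Exact test: the term is either present or absent, nothing in between.
        if must_have in words and must_not_have not in words:
            hits.append(doc_id)
    return hits


# Hypothetical sample corpus.
docs = {
    1: "Java is a programming language",
    2: "Java coffee is grown in Indonesia",
    3: "Python is also a programming language",
    4: "The Java virtual machine runs bytecode",
    5: "I drink coffee while writing Java code",
}

print(boolean_retrieve(docs, "java", "coffee"))  # [1, 4]
```

Document 5 is about Java programming, so it would satisfy the information need "documents about Java, except the coffee", yet the exact-match rule drops it because the word "coffee" appears. That gap between matching a query and satisfying a need is the difference the two definitions describe.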
IR Vs. DR
1. Matching
– In data retrieval we are normally looking for an exact match, that is, we are checking whether an item is or is not present in the file.
– E.g. SELECT * FROM Student WHERE per >= 75.0
– In information retrieval we more generally want to find those items which partially match the request and then select from those a few of the best matching ones.
– E.g. the free-text request “students of PICT college having 75 percent or more”.
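Partial matching followed by best-match selection can be sketched as below. This is only an illustration: the score (fraction of query terms found in a document) is an assumed, deliberately simple measure, and `best_match` and the sample texts are hypothetical.

```python
def best_match(docs, query, top_k=2):
    """Rank documents by the fraction of query terms they contain (partial match)."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = len(q_terms & words) / len(q_terms)
        if score > 0:  # keep anything that matches at least partially
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 2)) for score, doc_id in scored[:top_k]]


# Hypothetical sample corpus.
docs = {
    1: "students of pict college with 75 percent marks",
    2: "pict college admission notice",
    3: "weather report for pune",
}

print(best_match(docs, "pict college students 75 percent"))  # [(1, 1.0), (2, 0.4)]
```

Unlike the SQL query, document 2 is still returned although it matches only two of the five query terms; it is simply ranked lower.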
IR Vs. DR
2. Inference
– In data retrieval, inference is of the simple deductive kind, that is, if aRb and bRc then aRc.
– In information retrieval, inference is inductive.
– Relations are only specified with a degree of certainty or uncertainty, and hence our confidence in the inference is variable.
3. Model
– Data retrieval is deterministic but information retrieval is probabilistic.
– Frequently Bayes' Theorem is invoked to carry out inferences in IR, but in DR probabilities do not enter into the processing.
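As a rough illustration of how Bayes' Theorem can enter an IR inference, the sketch below computes the probability that a document is relevant given that it contains a query term. The function name and all three input probabilities are made-up assumptions chosen for the arithmetic, not values from the slides.

```python
def posterior_relevance(p_term_given_rel, p_term_given_nonrel, p_rel):
    """P(relevant | term) by Bayes' Theorem.

    P(rel | term) = P(term | rel) * P(rel) / P(term), where
    P(term) = P(term | rel) * P(rel) + P(term | nonrel) * (1 - P(rel)).
    """
    evidence = p_term_given_rel * p_rel + p_term_given_nonrel * (1 - p_rel)
    return p_term_given_rel * p_rel / evidence


# Hypothetical figures: the term occurs in 80% of relevant documents,
# in 10% of non-relevant ones, and 20% of the collection is relevant.
p = posterior_relevance(0.8, 0.1, 0.2)
print(round(p, 3))  # 0.667
```

The answer is a degree of belief rather than a yes/no result, which is the sense in which IR is probabilistic while DR is deterministic.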
IR Vs. DR
4. Classification
– In DR a monothetic classification is most likely to be used.
– That is, one with classes defined by objects possessing attributes both necessary and sufficient to belong to a class.
– In IR such a classification is not very useful.
– A polythetic classification is mostly used.
– Each individual in a class will possess only a proportion of all the attributes possessed by all the members of that class.
– Hence no attribute is either necessary or sufficient for membership of a class.
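The two kinds of classification can be contrasted with a small sketch. The attribute sets, the function names, and the 50% membership threshold are all assumptions made for illustration; the slides do not prescribe a particular threshold.

```python
def monothetic_member(item, required):
    """Member only if the item has every defining attribute."""
    return required <= item  # subset test


def polythetic_member(item, class_attrs, min_share=0.5):
    """Member if the item has at least `min_share` of the class attributes."""
    return len(item & class_attrs) / len(class_attrs) >= min_share


# Hypothetical attributes describing a class of documents.
class_attrs = {"index", "query", "ranking", "relevance"}

doc_a = {"index", "query", "ranking"}
doc_b = {"ranking", "relevance", "feedback"}
doc_c = {"index", "query", "relevance"}

for doc in (doc_a, doc_b, doc_c):
    print(monothetic_member(doc, class_attrs), polythetic_member(doc, class_attrs))
# Each line prints: False True

# No single attribute is shared by all three polythetic members.
print(doc_a & doc_b & doc_c)  # set()
```

All three documents fail the monothetic test yet pass the polythetic one, and their common intersection is empty, so no attribute is necessary for membership.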
IR Vs. DR
5. Query Language
– The query language for DR is one with restricted syntax and vocabulary.
– In IR we prefer to use natural language, although there are some notable exceptions.
6. Query Specification
– In DR the query is generally a complete specification of what is wanted.
– In IR it is invariably incomplete.
IR Vs. DR
7. Items Wanted
– In IR we are searching for relevant documents, as opposed to exactly matching items in DR.
8. Error Response
– DR is more sensitive to error, in the sense that an error in matching will not retrieve the wanted item, which implies a total failure of the system.
– In IR small errors in matching generally do not affect the performance of the system significantly.
IR Vs. DR
                     Data Retrieval (DR)                         Information Retrieval (IR)
Matching             Exact match                                 Partial match, best match
Inference            Deduction                                   Induction
Model                Deterministic                               Probabilistic
Classification       Monothetic                                  Polythetic
Data                 Database tables, structured                 Free text, unstructured
Query language       Artificial, SQL, relational algebras        Natural, keywords, free text
Query specification  Complete                                    Incomplete
Items wanted         Matching                                    Relevant
Error response       Sensitive                                   Insensitive
Results              Exact matches                               Approximate matches
Result ordering      Unordered                                   Ordered by relevance
Accessibility        Knowledgeable users or automatic processes  Non-expert humans
Information Retrieval deals with uncertainty and vagueness in information systems.
• Uncertainty: the available representation typically does not reflect the true semantics/meaning of objects (text, images, video, etc.).
• Vagueness: the information need of the user lacks clarity and is only vaguely expressed in the query, feedback, or user actions.
• Differs conceptually from database queries!
Issues with Information Retrieval?
Recall the Definition
• What is IR?
• “Finding some desired information in large data sets or a store of information.”
• Means:
– Searching for documents
– Searching for information in documents
– Searching for metadata which describes documents
– Searching within databases
• Web search engines like Google and Lycos are the most visible IR applications.
• IR systems are used to reduce information overload.
Definition
Automatic Information Retrieval
Automatic – as against ‘manual’.
Information – as against ‘data’.
Defn: An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.
Media – Where Does Information Reside?
• Text documents: web pages, books, articles, papers, emails etc.
• Manuscripts
• Graphics & Images
• Speech & Video
• Maps & Satellite Imagery
• Local Information, Yellow Pages
• Mismatch: given representation in specific medium vs. semantic description of information (semantic gap)
Scale - How Much Information is out there?
• World Wide Web
– Tens or hundreds of billions of documents?
– At approx. 10 KB/doc, that is 100s of TB
• Then there is everything else
– Email, personal files, proprietary databases, broadcast media, print
• Estimated 5 exabytes p.a. (growing at 30%)
• 800 MB per person p.a.
• Web is just a tiny starting point….
IR problem
It mainly deals with a very large, mostly unstructured data set.
The IR problem consists of:
– building efficient indexes.
– processing user queries with high performance.
– improving the ‘quality’ of the answer set.
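The first two parts of the IR problem, indexing and query processing, can be sketched with an inverted index: a map from each word to the set of documents that contain it. This is a minimal in-memory illustration with hypothetical function names and sample texts, not a description of any particular system.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return sorted ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample corpus.
docs = {
    1: "information retrieval systems",
    2: "database retrieval systems",
    3: "information overload",
}

index = build_inverted_index(docs)
print(search_all_terms(index, "information retrieval"))  # [1]
print(search_all_terms(index, "retrieval systems"))      # [1, 2]
```

The index is built once, so each query only touches the posting sets of its own terms instead of scanning every document.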
Basic Concepts
• Information retrieval is directly affected by:
– User tasks
– The logical view of the documents
User Tasks
• Classical information retrieval systems allow information retrieval.
• Hypertext systems are usually tuned for quick browsing.
• Modern digital libraries and Web interfaces might attempt to combine these tasks.
Logical view of the document
• The representation of documents by keywords or index terms is known as the logical view of the documents.
• Keywords are either extracted directly from the text of the document or specified by a human.
• Modern computers represent a document by its set of:
– Full words.
– Small words (a reduced set of words), obtained through:
• Stopwords: elimination of articles and connectives.
• Stemming: reduces distinct words to their common grammatical roots.
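The reduction from full text to a smaller set of index terms can be sketched as follows. The stopword list and the suffix-stripping rule are crude assumptions for illustration only; they are not the conflation algorithm covered in the unit, and a real stemmer handles many more cases.

```python
# Hypothetical, deliberately tiny stopword list (articles and connectives).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is", "are"}
# Suffixes tried in order; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def logical_view(text):
    """Lowercase, drop punctuation and stopwords, then stem what is left."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOPWORDS]


print(logical_view("The system is indexing the indexed documents."))
# ['system', 'index', 'index', 'document']
```

Here "indexing" and "indexed" are conflated to the same term "index", so a query using either form would match the document.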
Introduction…
• Information Retrieval System:
[Figure: A typical IR system. Labels: Input, Queries, Documents, Processor, Feedback, Output, Sample retrieval.]
Introduction…
• Information Retrieval System:
– Input: Store only a representation of the document (or query), which means that the text of a document is lost once it has been processed for the purpose of generating its representation.
– A document representative could be a list of extracted words considered to be significant.
– The user has to express the needed information in a language the system can process.
– Processor: Performs the actual retrieval function, executing the search strategy in response to a query.
– Feedback: Improving the subsequent run after a sample retrieval.
– Output: A set of document numbers. The evaluation can then be done.
Introduction
[Figure: Information Retrieval Process. Labels: Information need, text input, Query, Parse, Collections, Pre-process, Index, Rank. Annotations: “How is the query constructed?”, “How is the text processed?”]
Definitions
• Searching: Seeking specific information within a body of information. The result of a search is a set of hits.
• Browsing: Unstructured exploration of a body of information.
• Linking: Moving from one item to another following links, such as citations, references, etc.
The Basics of Information Retrieval
Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term.
A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.
Full text searching: Methods that compare the query with every word in the text, without distinguishing the function of the various words.
Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.
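The difference between full text and fielded searching can be sketched on a couple of records. The records, field names, and function names below are invented for illustration.

```python
# Hypothetical bibliographic records.
records = [
    {"id": 1, "author": "Mehta", "heading": "Notes on Java",
     "body": "Written with help from Rao"},
    {"id": 2, "author": "Rao", "heading": "Indexing basics",
     "body": "Covers index terms"},
]


def full_text_search(records, term):
    """Match the term against every word of every field, ignoring field roles."""
    term = term.lower()
    return [
        r["id"] for r in records
        if any(term in str(value).lower().split()
               for key, value in r.items() if key != "id")
    ]


def fielded_search(records, field, term):
    """Match the term only against the words of one named field."""
    term = term.lower()
    return [r["id"] for r in records if term in r[field].lower().split()]


print(full_text_search(records, "Rao"))          # [1, 2]
print(fielded_search(records, "author", "Rao"))  # [2]
```

Full text searching finds "Rao" wherever it occurs, including the body of record 1, while the fielded search returns only the record where Rao is the author.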
SORTING AND RANKING HITS
When a user submits a query to a search system, the system returns a set of hits. With a large collection of documents, the set of hits may be very large.
The value to the user depends on the order in which the hits are presented.
Three main methods:
• Sorting the hits, e.g., by date
• Ranking the hits by similarity between query and document
• Ranking the hits by the importance of the documents
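The three methods differ only in the key used to order the same set of hits, as the sketch below shows. The hits and their similarity and importance scores are made-up values; how such scores are computed is outside this sketch.

```python
# Hypothetical hits with precomputed scores.
hits = [
    {"id": "A", "date": "2021-03-01", "similarity": 0.40, "importance": 0.90},
    {"id": "B", "date": "2023-07-15", "similarity": 0.85, "importance": 0.20},
    {"id": "C", "date": "2019-11-30", "similarity": 0.60, "importance": 0.50},
]


def order_hits(hits, key):
    """Return hit ids ordered by the given field, largest (or newest) first."""
    return [h["id"] for h in sorted(hits, key=lambda h: h[key], reverse=True)]


print(order_hits(hits, "date"))        # ['B', 'A', 'C']  sorting by date
print(order_hits(hits, "similarity"))  # ['B', 'C', 'A']  ranking by similarity
print(order_hits(hits, "importance"))  # ['A', 'C', 'B']  ranking by importance
```

The same three hits come back in three different orders, which is why the choice of ordering affects the value of the result list to the user.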
Examples of Search Systems
Find file on a computer system (Spotlight for Macintosh).
Library catalog for searching bibliographic records about books and other objects (Library of Congress catalog).
Abstracting and indexing system for finding research information about specific topics (Medline for medical information).
Web search service for finding web pages (Google).