Information Retrieval and Web Search
Lecture 1. Course overview
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300
Slide 2
What is this course about?
•Processing
•Indexing
•Retrieving
•… textual data
•Fits in four lines, but much more complex and interesting than that
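A minimal sketch of the three steps above, assuming a toy in-memory collection. It is written in Python purely for illustration. The names `tokenize`, `build_index`, and `retrieve`, and the sample documents, are hypothetical and are not course-provided code. Retrieval here is a simple "all query terms must appear" match over an inverted index, which is only one of many possible strategies.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Processing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Retrieving: return ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical toy collection (not from the course materials).
docs = {
    1: "Information retrieval on the Web",
    2: "Indexing textual data",
    3: "Retrieval of textual data from the Web",
}
index = build_index(docs)
print(sorted(retrieve(index, "textual data")))   # [2, 3]
print(sorted(retrieve(index, "web retrieval")))  # [1, 3]
```

Each function is a few lines here, but real collections raise the harder questions (scale, ranking, ambiguity) that the rest of the course covers.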
Slide 3
Need for IR
•With the growth of the WWW, more than 3 billion documents are indexed by Google
•Various needs for information:
– Search for documents that fall under a given topic
– Search for specific information
– Search for an answer to a question
– Search for information in a different language
Slide 4
Some definitions of Information Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”
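Salton's definition says that retrieval depends on a similarity between records and requests, measured by comparing attribute values. A minimal sketch of that idea, under stated assumptions: the "attributes" are taken to be word counts, and the similarity is the cosine of the two count vectors. Both choices are illustrative, one common option rather than what the definition prescribes. The function names and sample records are hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Assumed attribute set: how often each word occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(records, request):
    """Return (score, record_id) pairs with nonzero similarity, best first."""
    q = term_counts(request)
    scored = [(cosine(term_counts(text), q), rec_id)
              for rec_id, text in records.items()]
    matches = [(s, r) for s, r in scored if s > 0]
    return sorted(matches, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical records (not from the course materials).
records = {
    "r1": "library catalog search by author",
    "r2": "image search by color and shape",
    "r3": "music retrieval",
}
for score, rec_id in rank(records, "search by author"):
    print(rec_id, round(score, 3))
```

For the request "search by author", record r1 ranks above r2 because it shares more terms with the request, and r3 is not returned because it shares none.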
Slide 5
Examples of IR systems
• Conventional (library catalog). Search by keyword, title, author, etc. E.g., you are probably familiar with www.library.unt.edu
• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus). Search in (restricted) natural language.
• Other: cross language information retrieval, music retrieval
Slide 6
Slide 7
Slide 8
IR systems on the Web
•Search for Web pages http://www.google.com
•Search for images http://images.google.com
•Search for image content http://wang.ist.psu.edu/IMAGE/
•Search for answers to questions http://www.askjeeves.com
•Search for music?
Slide 9
Course information
•Instructor: Rada Mihalcea
•Contact info: NTRP 228, 940-369-7630, [email protected]
•Teaching assistant: TBA
•Class meets TTh, 2:00-3:20pm
•Office hours:
– T, 4:00-5:30pm
– Any time electronically
– For grading and programming problems, first try to get in touch with the TA.
Slide 10
Course resources
•Textbook:
– Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto
•Recommended:
– Readings in Information Retrieval, K. Sparck Jones and P. Willett
– See the class website for pointers to places to buy them for less
•Papers from conferences and journals will be assigned throughout the course. Whenever possible, a copy of each paper will be placed on the class website.
Slide 11
Grading
•Homeworks: 30%
– Start early! Some may be time-consuming
– 3-day late policy
•Midterm I: 15%
•Midterm II: 15%
•Project: 30%
•Class participation: 10%
•Good news! No final – final is replaced by the project
Slide 12
Programming language
• Students are free to choose the programming language they want to work with
• However:
– I recommend working with Perl
– We’ll have a short Perl tutorial in the next 1-2 lectures
– Why Perl?
  • Makes life much easier for text processing problems and for Web-based applications
  • Information Retrieval involves a lot of text processing, and often involves Web access
– Code reusability
• Regardless of the language, code MUST compile and run on the CSP Linux machines.
– No credit will be given for programs that do not compile!
Slide 13
Tentative schedule
Course Overview
Short Perl Tutorial
Introduction to IR models and methods
Text analysis / document preprocessing
Vectorial model
Boolean model
Probabilistic model; other IR models
IR collections
IR evaluation
Query operations
Query languages
Natural Language IR (Named Entity recognition)
Slide 14
Tentative schedule
Natural Language IR (Semantic ambiguity, conceptual indexing)
Natural Language IR (Phrase indexing, other)
Question Answering: TREC / Web
Information extraction
Text classification/Topic tracking and detection
Web IR: crawlers
Web IR: search engines
Web IR: link based / content based
Web IR: evaluation metrics / Midterm review
Special topics: Cross Language IR
Special topics
Final IR overview, future directions
… Midterm I, Midterm II, Project presentations