![Page 1: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/1.jpg)
Information Retrieval:
Introduction
Romi Satria Wahono
romisatriawahono.net
0878-804804-85
Born in Madiun, 2 October 1974
SD Sompok (elementary school), Semarang (1987)
SMPN 8 (junior high school), Semarang (1990)
SMA Taruna Nusantara (senior high school), Magelang (1993)
Undergraduate, master's, and doctoral (on-leave) studies at the Department of Computer Sciences, Saitama University, Japan (1994-2004)
Core Competence: Software Engineering, Computational Intelligence
Founder and Coordinator of IlmuKomputer.Com
CEO of PT Brainmatics Cipta Informatika
Romi Satria Wahono
Learning Methods
Lecture
Discussion
Case Study
Practice
Textbook
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
References
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press, 2010
Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009
David A. Grossman and Ophir Frieder, Information Retrieval: Algorithms and Heuristics 2nd edition, Springer, 2004
Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L. Barry, Text Information Retrieval Systems, Third Edition, Library and Information Science, 2007
Course Contents
1. Introduction
2. Boolean Retrieval
3. The Term Vocabulary
4. Dictionaries and Tolerant Retrieval
5. Index Construction
6. Index Compression
7. Vector Space Model
8. Computing Scores
9. Evaluation in Information Retrieval
10. Relevance Feedback and Query Expansion
11. XML Retrieval
12. Probabilistic Information Retrieval
13. Language Models for Information Retrieval
14. Text Classification and Naive Bayes
15. Vector Space Classification
16. Support Vector Machines and Machine Learning on Documents
17. Flat Clustering
18. Hierarchical Clustering
19. Latent Semantic Indexing
20. Web Search
21. Web Crawling and Indexes
22. Link Analysis
INTRODUCTION
History of Information Retrieval (IR)
1940-
late 1940s: The US military confronted problems of indexing and retrieving wartime scientific research documents captured from the Germans
1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly
1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds
1950-
1950s: mechanized literature searching systems (Allen Kent et al.) and the invention of citation indexing (Eugene Garfield)
1950: The term "information retrieval" appears to have been coined by Calvin Mooers
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master's thesis at MIT
1955: Kent and colleagues published a paper in American Documentation describing precision and recall, the IR evaluation measures
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
1960-
early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell
1960: Melvin Earl (Bill) Maron and John Lary Kuhns published "On relevance, probabilistic indexing, and information retrieval" in the Journal of the ACM 7(3):216–244, July 1960
1962: Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation
1963: Joseph Becker and Robert M. Hayes published a text on information retrieval: Information Storage and Retrieval: Tools, Elements, Theories (New York: Wiley, 1963)
1964: Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
mid-1960s: The National Library of Medicine developed MEDLARS (Medical Literature Analysis and Retrieval System), the first major machine-readable database and batch-retrieval system. Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.
1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval
1968: Gerard Salton published Automatic Information Organization and Retrieval. John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model
1970-
early 1970s: First online systems: NLM's AIM-TWX and MEDLINE; Lockheed's Dialog; SDC's ORBIT. Theodor Nelson promoted the concept of hypertext and published Computer Lib/Dream Machines.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." (Information Storage and Retrieval, 7(5), pp. 217–240, December 1971)
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
• A Theory of Indexing (Society for Industrial and Applied Mathematics)
• A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
• A Vector Space Model for Automatic Indexing (CACM 18:11)
1978: The First ACM SIGIR conference
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths), with heavy emphasis on probabilistic models
1980-
1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing
1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models
mid-1980s: Efforts to develop end-user versions of commercial IR systems
1989: First World Wide Web proposals by Tim Berners-Lee at CERN
1990-
1992: First TREC conference
1997: Publication of Korfhage's Information Storage and Retrieval with emphasis on visualization and multi-reference point systems
mid-1990s: Searching FTPable documents on the Internet (Archie, WAIS) and searching the World Wide Web (Lycos, Yahoo, Altavista)
late 1990s: Web search engines implemented many features formerly found only in experimental IR systems. Search engines became the most common, and perhaps the best, instantiation of IR models, research, and implementation
2000-
Link analysis for Web Search (Google)
Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
Question Answering
• TREC Q/A track
Automated Text Categorization & Clustering
Recommender Systems (Ringo, Amazon, NetPerceptions)
Multimedia IR
• Image
• Video
• Audio and music
Cross-Language IR
• DARPA Tides
Document Summarization
Related Areas
Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning
Database Management
Focused on structured data stored in relational tables rather than free-form text
Focused on efficient processing of well-defined queries in a formal language (SQL)
Clearer semantics for both data and queries
Recent move towards semi-structured data (XML) brings it closer to IR
Library and Information Science
Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization)
Concerned with effective categorization of human knowledge
Concerned with citation analysis and bibliometrics (structure of information)
Recent work on digital libraries brings it closer to CS & IR
Artificial Intelligence
Focused on the representation of knowledge, reasoning, and intelligent action
Formalisms for representing knowledge and queries:
• First-order Predicate Logic
• Bayesian Networks
Recent work on web ontologies and intelligent information agents brings it closer to IR
Natural Language Processing
Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse
Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords
Natural Language Processing: IR Directions
Methods for determining the sense of an ambiguous word based on context (word sense disambiguation)
Methods for identifying specific pieces of information in a document (information extraction)
Methods for answering specific NL questions from document corpora
Machine Learning
Focused on the development of computational systems that improve their performance with experience
Automated classification of examples based on learning concepts from labeled training examples (supervised learning)
Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning)
Machine Learning: IR Directions
Text Categorization
• Automatic hierarchical classification (Yahoo)
• Adaptive filtering/routing/recommending
• Automated spam filtering
Text Clustering
• Clustering of IR query results
• Automatic formation of hierarchies (Yahoo)
Learning for Information Extraction
Text Mining
Basic Concepts
Information Retrieval (IR)
Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
Structured vs Unstructured Data
Structured data tends to refer to information in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith
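The example query can be sketched in a few lines of Python. This is only an illustration: the in-memory list `employees` and the helper `select` are hypothetical names, not part of the slides, and a real DBMS would evaluate the same condition in SQL.

```python
# Hypothetical in-memory copy of the slide's Employee/Manager/Salary table.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, max_salary, manager):
    """Return rows where salary < max_salary AND manager matches exactly."""
    return [
        row for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]


# Salary < 60000 AND Manager = Smith
matches = select(employees, 60000, "Smith")
print([row["employee"] for row in matches])  # ['Ivy']
```

Every row either satisfies the condition or does not; there is no notion of a partial or "better" match, which is the contrast with text retrieval.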
Structured vs Unstructured Data in 1996
Structured vs Unstructured Data in 2009
IR Fields: Document Filtering
Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic.
Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class, if any, each of a set of documents belongs to
IR Fields: Operation Scale
1. Web Search:
• Provides search over billions of documents stored on millions of computers
• Distinctive issues include gathering documents for indexing, building systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as exploiting hypertext and not being fooled by site providers who manipulate page content to boost their search engine rankings, given the commercial importance of the web
2. Personal information retrieval:
• Consumer operating systems have integrated information retrieval (e.g., Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). Email programs usually provide not only search but also text classification: they at least provide a spam filter, and commonly also provide manual or automatic means of classifying mail so that it can be placed directly into particular folders
• Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance-free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner
3. Enterprise, Institutional, and Domain-Specific Search
• provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry
• the documents will typically be stored on centralized file systems and one or a handful of dedicated machines will provide search over the collection
Comparison of IRS, IS, and AI
System      Data object          Function                    Database size
IRS         Documents            Retrieval (probabilistic)   Small to large
IS (DBMS)   Tables               Retrieval (deterministic)   Small to large
AI          Logical statements   Inference                   Small
What is a Document?
Examples:
• web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, etc.
Common properties
• Significant text content
• Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Documents vs Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
• e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
Easy to compare fields with well-defined semantics to queries in order to find matches
Text is more difficult
Documents vs Records
Example bank database query
• Find records with balance > $50,000 in branches located in Amherst, MA.
• Matches easily found by comparison with field values of records
Example search engine query
• bank scandals in western mass
• This text must be compared to the text of entire news stories
Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
Exact matching of words is not enough
• Many different ways to write the same thing in a “natural language” like English
• e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
• Some stories will be better matches than others
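To illustrate why exact matching falls short, the sketch below compares the example query with the example story in two ways: as an exact phrase, and as the fraction of query words that appear in the story. The functions `tokenize`, `exact_match`, and `overlap_score` are hypothetical helpers written for this illustration; word overlap is just one crude way to grade matches, not the method the slides prescribe.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def exact_match(query, document):
    """True only if the whole query string occurs verbatim in the document."""
    return query.lower() in document.lower()


def overlap_score(query, document):
    """Fraction of distinct query words that also occur in the document."""
    query_terms = set(tokenize(query))
    doc_terms = set(tokenize(document))
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

print(exact_match(query, story))    # False
print(overlap_score(query, story))  # 0.4
```

The story is plausibly relevant, yet only "bank" and "in" overlap: "scandals" versus "steals funds" and "western mass" versus "Amherst" are the kind of vocabulary mismatch that retrieval models must cope with.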
Information Retrieval System (IRS)
A system whose function is to find information relevant to the user's needs
The information processed is contained in textual documents
Information retrieval deals with the representation and storage of documents and with access to them
It is not certain that the retrieved documents are relevant to the user's information need as expressed in the query
Information Systems and IRS
By hierarchy
• Transaction Processing System
• Management Information System
• Executive Information System
By function
• Marketing Information System
• Personnel Information System
• Financial Information System
• etc.
Independent of hierarchy and function
• Decision Support System
• Artificial Intelligence System
• Information Retrieval System
• Library Information System
• etc.
Components of an IRS
[Diagram; its labels are: User, Document Collection, Matching Function, Retrieved Documents, Relevance Determination]
IRS User Categories
Novice: beginner users
• Do not yet have a clear information need
• Still want to browse for information
Intermediate: users who have begun to learn
• Have an information need, but it is still somewhat vague
• Want to both browse and search
Expert: expert users
• Have a clearly defined information need
• Search for the information they need
Big Issues in IR
Relevance
Evaluation
User and Information Needs
Big Issues in IR: Relevance
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style
Topical relevance (same topic) vs. user relevance (everything else)
Retrieval models define a view of relevance
Ranking algorithms used in search engines are based on retrieval models
Most models describe statistical properties of text rather than linguistic ones
• i.e., counting simple text features such as words instead of parsing and analyzing the sentences
• Statistical approach to text processing started with Luhn in the 1950s
• Linguistic features can be part of a statistical model
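As a rough illustration of the statistical view, the sketch below ranks documents purely by counting how often the query words occur in them. The functions `term_counts`, `score`, and `rank`, and the three sample documents, are assumptions made for this example; real retrieval models weight and normalize these counts rather than summing them raw.

```python
import re
from collections import Counter


def term_counts(text):
    """Count occurrences of each lowercase word in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def score(query, document):
    """Sum the document frequencies of the distinct query words."""
    counts = term_counts(document)
    query_terms = set(re.findall(r"[a-z]+", query.lower()))
    return sum(counts[term] for term in query_terms)


def rank(query, documents):
    """Return ids of documents with a positive score, best first."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    # Sort by descending score; break ties by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in scored if value > 0]


# Hypothetical three-document collection.
documents = {
    "d1": "bank scandal shakes western bank",
    "d2": "river bank erosion",
    "d3": "weather report",
}

print(rank("bank scandal", documents))  # ['d1', 'd2']
```

No sentence is parsed here: "river bank" still scores for a financial query, which shows both the simplicity and the limits of counting words.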
Big Issues in IR: Evaluation
Experimental procedures and measures for comparing system output with user expectations
• Originated in the Cranfield experiments in the 1960s
IR evaluation methods now used in many fields
Typically use a test collection of documents, queries, and relevance judgments
• Most commonly used are TREC collections
Recall and precision are two examples of effectiveness measures
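A minimal sketch of how the two measures are computed for one query, assuming the usual set-based definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved). The function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and relevance judgments for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved, relevant)
print(precision)  # 0.5
print(recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of two thirds). In a test collection, the `relevant` set comes from the relevance judgments.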
Big Issues in IR: User and Information Needs
Search evaluation is user-centered
Keyword queries are often poor descriptions of actual information needs
Interaction and context are important for understanding user intent
Query refinement techniques such as query expansion, query suggestion, and relevance feedback can improve ranking
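One of these techniques, query expansion, can be sketched as adding related terms to the user's keywords. The table `SYNONYMS` below is a hand-made, hypothetical stand-in for a thesaurus or for terms learned from feedback; the function `expand_query` is likewise only illustrative.

```python
# Hypothetical related-term table; a real system might derive this from a
# thesaurus, query logs, or relevance feedback.
SYNONYMS = {
    "mass": ["massachusetts"],
    "scandals": ["scandal", "fraud"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by any related terms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in synonyms.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


print(expand_query("bank scandals in western mass"))
# ['bank', 'scandals', 'in', 'western', 'mass', 'scandal', 'fraud', 'massachusetts']
```

The expanded query can now match stories that say "fraud" or "Massachusetts" even though the user typed neither word.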
IR and Search Engines
A search engine is the practical application of information retrieval techniques to large-scale text collections
Web search engines are the best-known examples, but there are many others
• Open source search engines are important for research and development
• e.g., Lucene, Lemur/Indri, Galago
![Page 51: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/51.jpg)
IR and Search Engines
Information Retrieval
• Relevance: effective ranking
• Evaluation: testing and measuring
• Information needs: user interaction
Search Engines
• Performance: efficient search and indexing
• Incorporating new data: coverage and freshness
• Scalability: growing with data and users
• Adaptability: tuning for applications
• Specific problems: e.g., spam
![Page 52: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/52.jpg)
Search Engine Issues
Performance
• Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed
• Indexes are data structures designed to improve search efficiency
designing and implementing them are major issues for search engines
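One common index structure is the inverted index, which maps each term to the documents that contain it, so a query term can be looked up directly instead of scanning every document. The sketch below is a minimal in-memory version; the sample documents and the whitespace tokenizer are assumptions for illustration, not how any particular engine does it.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document collection.
docs = {
    1: "information retrieval and search engines",
    2: "search engines crawl web pages",
    3: "retrieval models rank documents",
}
index = build_inverted_index(docs)
print(index["search"])     # ids of documents containing "search"
print(index["retrieval"])  # ids of documents containing "retrieval"
```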
![Page 53: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/53.jpg)
Search Engine Issues
Dynamic data
• The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
e.g., web pages
• Acquiring or “crawling” the documents is a major task
Typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed)
• Updating the indexes while processing queries is also a design issue
![Page 54: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/54.jpg)
Search Engine Issues
Scalability
• Making everything work with millions of users every day, and many terabytes of documents
• Distributed processing is essential
Adaptability
• Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications
![Page 55: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/55.jpg)
Spam
For Web search, spam in all its forms is one of the major issues
Affects the efficiency of search engines and, more seriously, the effectiveness of the results
Many types of spam
• e.g. spamdexing or term spam, link spam, “optimization”
New subfield called adversarial IR, since spammers are “adversaries” with different goals
![Page 56: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/56.jpg)
Information Retrieval Model (Techniques)
![Page 57: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/57.jpg)
Taxonomy of IR Models
![Page 58: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/58.jpg)
Taxonomy of IR Models (Kuropka, 2004)
![Page 59: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/59.jpg)
Ways of Finding Information (User Tasks)
Browsing
• For users who are not yet sure what information they are looking for
• Browsing can be done at random or in a structured way (menu based)
Searching
• For users who already know what information they are looking for
• Uses keywords
![Page 60: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/60.jpg)
Classic Models of Retrieval Techniques
1. Boolean model
   1. Fuzzy
   2. Extended Boolean
2. Vector model
   1. General vector space
   2. Latent semantic indexing
   3. Neural network
3. Probabilistic model
   1. Inference network
   2. Neural network
![Page 61: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/61.jpg)
Characteristics of the Classic Models
Documents are represented using index terms
Binary values are used for index term weights
An index term's weight indicates how specific the term is to a particular document
Computation follows a mathematical and statistical approach
![Page 62: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/62.jpg)
Boolean Model
![Page 63: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/63.jpg)
Boolean Model
The model is based on set theory and Boolean algebra
A document is a set of terms
A query is a Boolean expression written over terms
Each document is predicted to be either relevant or not relevant
The model uses Boolean operators
Terms in a query are connected with the operators AND, OR, or NOT
This is the method most often used in search engines because of its speed
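Because documents are sets of terms, the three operators map directly onto set operations: OR is union, AND is intersection, and NOT is the complement with respect to the whole collection. The sketch below shows this on a made-up four-document collection; the document ids, terms, and the helper `matching` are hypothetical.

```python
# Hypothetical collection: document id -> set of terms.
docs = {
    "d1": {"college", "campus", "tuition"},
    "d2": {"university", "campus"},
    "d3": {"college", "university"},
    "d4": {"pets", "cats"},
}

def matching(term):
    """Set of document ids whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_docs = set(docs)

or_result = matching("college") | matching("university")   # OR  -> union
and_result = matching("college") & matching("university")  # AND -> intersection
not_result = all_docs - matching("college")                # NOT -> complement

print(sorted(or_result), sorted(and_result), sorted(not_result))
```

Each document is either in the result set or not, which is the exact-match behavior of the model: there is no score to rank by.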
![Page 64: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/64.jpg)
Boolean OR
OR is a Boolean operator used to broaden your search by retrieving records that contain any, some, or all of the keywords used in the search statement
OR helps you make sure you aren't missing anything valuable
Query: College OR University (I would like information about college)
We retrieve records in which AT LEAST ONE of the search terms is present
We are searching on the terms college and also university since documents containing either of these words might be relevant
![Page 65: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/65.jpg)
Query and Result (OR)
Search terms Results
college 17,320,770
university 33,685,205
college OR university 33,702,660
college OR university OR campus 33,703,082
![Page 66: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/66.jpg)
X OR Y
X Y Z (Z = X OR Y)
1 0 1
0 1 1
1 1 1
0 0 0
![Page 67: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/67.jpg)
Boolean AND
AND is a Boolean operator used to narrow your search by ensuring that all keywords used appear in each record of the search results
Since the Web is already huge, it is important you use AND effectively.
Query: Poverty AND Crime (I'm interested in the relationship between poverty and crime)
We retrieve records in which both of the search terms are present
Notice how we do not retrieve any records with only "poverty" or only "crime."
![Page 68: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/68.jpg)
Query and Result (AND)
Search terms Results
poverty 783,447
crime 2,962,165
poverty AND crime 1,677
poverty AND crime AND gender 76
![Page 69: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/69.jpg)
X AND Y
X Y Z (Z = X AND Y)
1 0 0
0 1 0
1 1 1
0 0 0
![Page 70: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/70.jpg)
Leap Year Conditions (Tahun = year)
1. Tahun % 400 == 0
OR
2. (Tahun % 4 == 0) && !(Tahun % 100 == 0)
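The leap-year rule is a compact example of combining OR, AND, and NOT in one expression. A minimal Python version, with the function name `is_leap_year` chosen here for illustration:

```python
def is_leap_year(year):
    """Divisible by 400, OR (divisible by 4 AND NOT divisible by 100)."""
    return year % 400 == 0 or (year % 4 == 0 and not (year % 100 == 0))

for year in (1900, 2000, 2004, 2010):
    print(year, is_leap_year(year))
```

Note that the second condition needs the explicit `== 0` test on the remainder by 4; a bare `year % 4` is true exactly when the year is not divisible by 4.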
![Page 71: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/71.jpg)
Boolean NOT
NOT is a Boolean operator used to eliminate an unwanted concept or word in your search statement.
Query: Pets NOT Cats
I want to see information about pets, but I want to avoid seeing anything about cats.
We retrieve records in which the first search term is present and the second is absent.
No records are retrieved in which the word "cats" appears, even if the word "pets" appears there, too
![Page 72: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/72.jpg)
Query and Result (NOT)
Search terms Results
pets 4,556,515
cats 3,651,252
pets NOT cats 81,497
![Page 73: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/73.jpg)
Nesting
A method of combining Boolean operators in a logical order
When using Boolean operators in combination, it is important to "nest" them
Nesting means putting operators in parentheses in order to tell the library catalog, database, or Internet search engine how it should search for your terms
![Page 74: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/74.jpg)
Example of Nesting
(A OR B) AND C -- finds concept A or B where it intersects with concept C
Example: (ford OR chevrolet) AND recall -- finds Fords or Chevrolets and the recalls on each
(A OR B OR C) AND (D OR E) -- finds either concept A or B or C, then finds concept D or E, and then combines A or B or C with D or E
Example: (smoking OR tobacco OR nicotine) AND (adolescents OR teenagers) -- finds smoking or tobacco or nicotine for adolescents or for teenagers
(A OR B) AND (C NOT D) -- finds either concept A or B, then finds concept C but not D, and then combines A or B with C
Example: (treatment OR outcomes) AND (anorexia NOT bulimia) -- finds the treatment or outcomes for anorexia but not for bulimia
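To see why the parentheses matter, the sketch below evaluates the ford/chevrolet query with two different groupings over made-up posting sets (term mapped to the ids of documents containing it). The ids are hypothetical; only the grouping differs between the two results.

```python
# Hypothetical postings: term -> set of matching document ids.
postings = {
    "ford": {1, 2, 3},
    "chevrolet": {3, 4},
    "recall": {2, 4, 5},
}
ford, chevrolet, recall = (postings[t] for t in ("ford", "chevrolet", "recall"))

nested = (ford | chevrolet) & recall    # (ford OR chevrolet) AND recall
unnested = ford | (chevrolet & recall)  # ford OR (chevrolet AND recall)

print(sorted(nested))
print(sorted(unnested))
```

The first grouping keeps only documents about recalls; the second also returns every Ford document whether or not it mentions a recall.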
![Page 75: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/75.jpg)
Example of the Boolean Model
A AND B: D1AB, D2AB, ... with d1AB > d2AB > ..., where dAB = min(dA, dB)
A OR B: D1AB, D2AB, ... with d1AB > d2AB > ..., where dAB = max(dA, dB)
NOT A: U - dA
• dA is the weight of term A in document D
• Term weights are obtained from the indexing process
• min(dA, dB) means that a document is retrieved with a weight equal to the smallest of its term weights
• max(dA, dB) means that a document is retrieved with a weight equal to the largest of its term weights
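The min/max rule can be sketched as follows. The term weights are invented for the example, and a term missing from a document is assumed to have weight 0; NOT is left out because the slide gives it only as U - dA.

```python
# Hypothetical index weights: document -> {term: weight in [0, 1]}.
weights = {
    "D1": {"A": 0.9, "B": 0.2},
    "D2": {"A": 0.5, "B": 0.6},
    "D3": {"A": 0.1, "B": 0.8},
}

def score_and(doc, a, b):
    """A AND B: the smaller of the two term weights."""
    return min(weights[doc].get(a, 0.0), weights[doc].get(b, 0.0))

def score_or(doc, a, b):
    """A OR B: the larger of the two term weights."""
    return max(weights[doc].get(a, 0.0), weights[doc].get(b, 0.0))

def rank(score_fn, a, b):
    """Documents ordered from highest to lowest score."""
    scored = [(doc, score_fn(doc, a, b)) for doc in weights]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank(score_and, "A", "B"))
print(rank(score_or, "A", "B"))
```

Under AND, D2 ranks first because both of its weights are moderately high; under OR, D1 ranks first on the strength of a single high weight.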
![Page 76: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/76.jpg)
Exercise
1. (A OR B) AND C
2. (A OR B OR C) AND (D OR E)
3. (A OR B) AND (NOT C)
![Page 77: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/77.jpg)
(A OR B) AND C
A B C A OR B (A OR B) AND C
0 0 0 0 0
0 0 1 0 0
0 1 0 1 0
0 1 1 1 1
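A truth table like this can be generated mechanically by enumerating every combination of inputs. The sketch below does so for (A OR B) AND C; the helper `truth_table` is a hypothetical name, and the intermediate A OR B column is omitted for brevity.

```python
from itertools import product

def truth_table(names, expr):
    """Return one row per input combination: the inputs followed by the result."""
    rows = []
    for values in product([0, 1], repeat=len(names)):
        env = dict(zip(names, values))
        rows.append(values + (int(bool(expr(**env))),))
    return rows

rows = truth_table(("A", "B", "C"), lambda A, B, C: (A or B) and C)
print("A B C (A OR B) AND C")
for row in rows:
    print(*row)
```

The same helper works for the other exercise expressions by changing the variable names and the lambda.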
![Page 78: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/78.jpg)
Advantages of the Boolean Model
The Boolean model is a simple model built on basic set theory, so it is easy to implement
The Boolean model can be extended with proximity operators and wildcard operators
The cost of changing software and database structures, especially in commercial systems, is a reason to keep using it
![Page 79: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/79.jpg)
Weaknesses of the Boolean Model
The Boolean model cannot rank the retrieved documents
Only documents that exactly satisfy the Boolean expression or query are retrieved (exact match)
As a result, the retrieved set can be very large or very small, which makes decision making difficult
Boolean expressions can be complex, so users need to know how to write Boolean queries for the search to be efficient
Partial matching of a query is not supported
![Page 80: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/80.jpg)
Exercise
Read the article Teknik Pencarian Efektif dengan Google (Romi Satria Wahono)
Try it out by searching for keywords on Google
![Page 81: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/81.jpg)
Extended Boolean Model
The Extended Boolean technique, based on the p-norm model, is a further development of the Boolean model
The technique uses operators computed with the formulas of Savoy (1993)
![Page 82: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/82.jpg)
Extended Boolean Formulas
Query and its Retrieval Status Value (RSV):
A OR <p> B
$$RSV = \left( \frac{W_{ia}^{p} + W_{ib}^{p}}{2} \right)^{1/p}$$
A AND <p> B
$$RSV = 1 - \left( \frac{(1 - W_{ia})^{p} + (1 - W_{ib})^{p}}{2} \right)^{1/p}$$
NOT A
$$RSV = 1 - W_{ia}$$
p is the p-norm value supplied with the query
Wia is the weight of term A in the index for document Di
Wib is the weight of term B in the index for document Di
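The p-norm formulas for two-term queries can be sketched directly in code. The function names are chosen here for illustration, and the weights are assumed to lie in [0, 1] with p >= 1.

```python
def rsv_or(w_a, w_b, p):
    """A OR<p> B: p-norm average of the two term weights."""
    return ((w_a ** p + w_b ** p) / 2) ** (1 / p)

def rsv_and(w_a, w_b, p):
    """A AND<p> B: one minus the p-norm average of the missing weight."""
    return 1 - (((1 - w_a) ** p + (1 - w_b) ** p) / 2) ** (1 / p)

def rsv_not(w_a):
    """NOT A."""
    return 1 - w_a

for p in (1, 2, 50):
    print(p, round(rsv_or(0.2, 0.8, p), 3), round(rsv_and(0.2, 0.8, p), 3))
```

With p = 1 both operators reduce to the plain average of the two weights, so AND and OR are indistinguishable. As p grows, OR approaches max(Wia, Wib) and AND approaches min(Wia, Wib), which recovers the min/max behavior of the weighted Boolean model.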
![Page 83: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/83.jpg)
Ranking with Extended Boolean
Sort documents directly (from highest to lowest) by the document weight obtained from the RSV (retrieval status value) formula
Use the Learning Scheme formula
$$RSV(D_i) = RSV_{init}(D_i) + \sum_{k} \alpha_{ik}^{norm} \cdot RSV_{init}(D_k), \quad i = 1, 2, \ldots, n$$
where:
• RSVinit(Di) is the retrieval status value of document i, computed with the p-norm retrieval formula
• α_ik is the link weight between documents i and k. This weight comes from the relevance link value produced by the learning process
![Page 84: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/84.jpg)
Boolean vs. Extended Boolean
Search: Citra and Komputer
Rank Boolean (document, score) Extended Boolean (document, score)
1. S048 2.000000 S005 0.099570
2. S005 1.000000 S048 0.039120
3. S006 1.000000 T044 0.031300
4. S030 1.000000 S006 0.026080
5. S067 1.000000 T005 0.022350
6. T005 1.000000 S030 0.013040
7. T044 1.000000 S067 0.013040
![Page 85: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/85.jpg)
Vector Space Model
![Page 86: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/86.jpg)
Vector Space Model
The vector model is based on keyterms
The vector model supports partial matching and document ranking
Basic principles of the vector model:
• Documents are represented as keyterm vectors
• The dimensions of the space are defined by the keyterms
• Queries are represented as keyterm vectors
• Document-keyterm similarity is computed from the distance between vectors
The vector model requires
• Keyterm weights for document vectors
• Keyterm weights for queries
• A distance computation for document-keyterm vectors
![Page 87: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/87.jpg)
Vector Space Model Procedure
1. Index the documents
2. Weight the index terms so that relevant documents can be retrieved
3. Rank the documents according to a similarity measure
![Page 88: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/88.jpg)
Advantages of the Vector Space Model
Very efficient
• Uses sparse matrix methods
• Uses simple linear algebra
• Easy to represent
• Can be applied to document matching
Flexible
• Used in query resolution
• Uses document-to-document similarity
• Uses clustering
Very popular and widely used
Sangat populer dan sering digunakan
![Page 89: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/89.jpg)
Disadvantages of the Vector Space Model
Its theoretical framework is unclear
Produces indexes that lie close together
Assumes that index terms are independent
![Page 90: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/90.jpg)
Document Indexing
Some words in a document, such as the, is, and a, do not describe the document's content
These words are known as stop words. Automatic document indexing removes them from the document
The index can be built based on
• The frequency with which a term occurs in a document
• Non-linguistic methods: probabilistic indexing
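A minimal sketch of stop-word removal followed by term-frequency counting is shown below. The stop-word list is a tiny illustrative sample rather than a standard list, and the tokenizer simply splits on whitespace and strips punctuation.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to"}

def index_terms(text):
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]

terms = index_terms("The index is a data structure; the index speeds up the search.")
frequencies = Counter(terms)  # term -> number of occurrences in the document
print(terms)
print(frequencies["index"])
```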
![Page 91: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/91.jpg)
Index Weighting (Term Weighting)
Term weighting in the vector space is based entirely on single-term statistics. There are three main factors in term weighting for the vector space:
1. Term frequency factor
2. Collection frequency factor
3. Length normalization factor
The three factors are multiplied together to produce the term weight
The most common weighting scheme for terms in a document uses frequency of occurrence
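One widely used way to combine the three factors is tf-idf with cosine length normalization. The sketch below assumes raw term counts for the term frequency factor and log(N / df) for the collection frequency factor; the slide does not fix these choices, and other variants exist. The sample documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: weight}}."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        raw = {t: tf[t] * math.log(n / df[t]) for t in tf}          # tf * idf
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0   # vector length
        vectors[doc_id] = {t: w / norm for t, w in raw.items()}     # normalize
    return vectors

# Hypothetical, already-indexed documents.
docs = {
    "d1": ["search", "engine", "search"],
    "d2": ["search", "index"],
    "d3": ["vector", "model"],
}
vectors = tfidf_vectors(docs)
print(vectors["d1"])
```

In d1, "engine" ends up with a higher weight than "search" even though "search" occurs twice, because "search" also appears in another document and so carries less information about d1.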
![Page 92: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/92.jpg)
Document Ranking
Similarity in the vector space model is measured with an associative coefficient based on the inner product of the document vector and the query vector, where word overlap indicates similarity
The inner product is usually normalized
The most popular similarity measure is the cosine coefficient, which measures the angle between the document vector and the query vector
Other similarity measures are the Jaccard and Dice coefficients
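The three coefficients can be sketched as below. Vectors are represented as term-to-weight dictionaries; the Jaccard and Dice functions use the binary (set-overlap) forms, which is an assumption since weighted variants also exist. The query and documents are made up.

```python
import math

def cosine(u, v):
    """Normalized inner product of two term -> weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(a, b):
    """Shared terms divided by all distinct terms (binary form)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Twice the shared terms divided by the total number of terms (binary form)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# Hypothetical query and document vectors.
query = {"search": 1.0, "engine": 1.0}
doc_vectors = {
    "d1": {"search": 2.0, "engine": 1.0},
    "d2": {"search": 1.0, "index": 1.0},
    "d3": {"vector": 1.0},
}
ranking = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranking)
```

Unlike the Boolean model, d2 still receives a nonzero score although it matches only one of the two query terms, which is the partial matching the vector model supports.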
![Page 93: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/93.jpg)
Probabilistic Model
![Page 94: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/94.jpg)
Probabilistic Model
Estimates the relevance of a page using probability
Has a clear theoretical framework
• Based on statistical principles
• Document relevance can be updated
• Incorporates feedback from the user
Basic idea
• A query can produce a correct answer
• Uses index terms
• Uses an initial estimate
• Uses an initial result
• User feedback can improve the probability of relevance
![Page 95: Information Retrieval: Introductiondinus.ac.id/repository/docs/ajar/romi-ir-01-introduction-17nopember2010.pdf · SMA Taruna Nusantara , Magelang (1993) S1, S2 dan S3 (on-leave) Department](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d2a265b88c993140a8b5512/html5/thumbnails/95.jpg)
References
1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
2. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press, 2010
3. Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009
4. David A. Grossman and Ophir Frieder, Information Retrieval: Algorithms and Heuristics 2nd edition, Springer, 2004
5. Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L Barry, Text Information Retrieval Systems Third Edition, Library and Information Science, 2007