ta lecture 11 information retrieval.ppt - ut lecture ir 2up.pdf · vector model with tf-idf weights...
TRANSCRIPT
3.12.2008
Text Algorithms (4AP): Information Retrieval
Jaak Vilo
2008 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
Materials
• Modern Information Retrieval by Ricardo Baeza‐Yates and Berthier Ribeiro‐Neto.
– http://people.ischool.berkeley.edu/~hearst/irbook/
– http://www.amazon.co.uk/Modern‐Information‐Retrieval‐ACM‐Press/dp/020139829X/ref=sr_1_1?ie=UTF8&s=books&qid=1228237684&sr=8‐1
– http://books.google.com/books?id=HLyAAAAACAAJ&dq=modern+information+retrieval
– New edition in May 2009
• Google Books: Information Retrieval
– http://books.google.com/books?q=information+retrieval
• ESSCaSS’08: Ricardo Baeza‐Yates and Filippo Menczer
– http://courses.cs.ut.ee/schools/esscass2008/Main/Materials
• Given a set of documents, find those relevant to topic X
• User formulates a query, documents are returned and retrieved by user
• Looking at the first 100 results: how many are relevant to the topic, and how many of all relevant documents fit in the first 100?
• Given an interesting document (one?), how to find similar ones?
• Which keywords characterise documents similar to other documents?
• How to present the answer to user?
– Topic hierarchies
– Self organising maps (see WebSom)
– ...
What was searched for?
car bus train tram trolleybus petrol diesel wood hydrogen natural gas electricity
• Which document should be the most similar?
– Similarity of document and query
– Ranking of documents
– Declensions/conjugations (inflected word forms)
– Ontologies (structure of concepts)
– Importance of the document itself (e.g. PageRank)
Information retrieval (IR)
• Finding relevant information
– From unstructured document database(s)
• Relevance, measures
• Presenting information (UI, relevance)
• Free text queries (Natural Language Processing)
• User feedback
• http://www.google.com/search?q=define:information+retrieval
– Information Retrieval is the "science of search”
– The study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms
History of IR
• 1960‐70’s: Small text retrieval systems; basic boolean and vector‐space retrieval models
• 1980’s: Large document database systems, many run by companies: (e.g. Lexis‐Nexis, Dialog, MEDLINE)
• 1990’s: Searching FTPable documents on the Internet (e.g. Archie, WAIS); Searching the World Wide Web (e.g. Lycos, Yahoo, Altavista)
History cont.
• 2000’s:
– Link analysis for Web Search (e.g. Google)
– Automated Information Extraction (e.g. Whizbang, Fetch, Burning Glass)
– Question Answering (e.g. TREC Q/A track, Ask Jeeves)
– Multimedia IR (Image, Video, Audio and music)
– Cross‐Language IR (e.g. DARPA Tides)
– Document Summarization
Cont …
• 2000’s:
– Recommender Systems
• (e.g. MovieLens, Pandora, LastFM)
– Automated Text Categorization & Clustering
• iTunes “Top Songs”
• Amazon “people who bought this also bought…”
• Bloglines “similar blogs”
• Del.icio.us “most popular” bookmarks
• Flickr.com “most viewed pictures”
• NYTimes “most emailed articles”
• IR discipline that deals with:
– retrieval
• representation
• storage
• organization
• access
– of structured, semi‐structured and unstructured data (information objects)
– in response to query (topic statement)
• structured (e.g. boolean expression)
• unstructured (e.g. sentence, document)
Concepts
Information Retrieval ‐ the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.
Information Filtering ‐ given a large amount of data, return the data that the user wants to see
Information Need ‐ what the user really wants to know; a query is an approximation to the information need.
Query ‐ a string of words that characterizes the information that the user seeks
Browsing ‐ a sequence of user interaction tasks that characterizes the information that the user seeks
• The process of applying algorithms over unstructured, semi‐structured or structured data in order to satisfy a given (explicit) information query
• Efficiency with respect to:
– algorithms
– query building
– data organization/structure
Data vs. Information Retrieval
• Information Retrieval:
– Set of keywords (loose semantics)
– Semantics of the information need
– Errors are tolerable
• Data Retrieval:
– Regular expression (well defined query)
– Constraints for the objects in the answer set
– Single error results in a failure
Summary
• Retrieval task: compare the information need with the information; generate a ranking which reflects relevance
[Figure: information need, user query, IR system, ranked list of documents, feedback]
Source: Lecture 2: Query Languages & Operations, 2ID10: Information Retrieval (2005‐2006), Lora Aroyo
[Figure: topic map: IR introduction, IR research issues, applications of IR]
1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
4. Language Modeling for IR
5. Search Engines
6. Semantic in IR
8. Multimedia IR
9. Structured Content
• classification and categorization (catalogues)
• systems and languages (NL‐based systems)
• user interfaces and visualization
• The Web phenomenon
– universal repository of knowledge
– free (low cost) universal access
– no central editorial board
• IR – the key to finding the solutions
Logical View of Documents
[Figure: logical view of documents. Text operations shown: structure recognition; accents, spacing; stop-words; noun groups; stemming; automatic or manual indexing. Representations shown: text + structure, text, structure, full text, index terms.]
Source: Lecture 1: Introduction, 2ID10: Information Retrieval (2005‐2006)
• Document representation continuum
• Intermediate representations (transformations)
• Text operations to reduce complexity of documents
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Labels: text that specifies the user need; text that defines the logical view; logical view; query generated; builds; retrieved docs; ranking docs; user feedback (change the query).]
Inverted index: document level
http://en.wikipedia.org/wiki/Inverted_index
• T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
• Q: "what", "is", "it"
– {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
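The document-level index above can be built and queried in a few lines. This is a minimal sketch, assuming whitespace tokenisation and no normalisation; the function names `build_inverted_index` and `and_query` are illustrative and not from the lecture.

```python
from collections import defaultdict

def build_inverted_index(texts):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for word in text.split():  # assumption: words are separated by spaces
            index[word].add(doc_id)
    return dict(index)

def and_query(index, words):
    """Return ids of documents that contain every query word."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()

texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)
result = and_query(index, ["what", "is", "it"])
print(sorted(result))  # [0, 1]
```

The query is answered by intersecting three posting sets, without scanning the document texts again.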
Inverted index: word level
• T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
• Q: "what is it"
"what": {(0, 2), (1, 0)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
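With (document, position) pairs the query can be treated as a phrase: the words must occur at consecutive positions. The sketch below is one possible way to do this, again assuming whitespace tokenisation; `build_positional_index` and `phrase_query` are hypothetical names.

```python
from collections import defaultdict

def build_positional_index(texts):
    """Map each word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(texts):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return dict(index)

def phrase_query(index, phrase):
    """Return ids of documents where the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return set()
    # Candidate start points: every occurrence of the first word.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        occurrences = set(index.get(word, []))
        # Keep a start point only if this word occurs `offset` positions later.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in occurrences}
    return {d for d, _ in candidates}

texts = ["it is what it is", "what is it", "it is a banana"]
pos_index = build_positional_index(texts)
phrase_hits = phrase_query(pos_index, "what is it")
print(sorted(phrase_hits))  # [1]
```

T0 contains all three words but not in this order, so only T1 matches the phrase, while the document-level intersection returns both T0 and T1.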
• The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and to each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document in the forward index, the inverted index data structure is developed which lists the documents per word.
Measures
• Precision is the fraction of the documents retrieved that are relevant to the user's information need.
• Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
Measures: Precision & Recall

             Retrieved   Not retrieved
User need    TP          FN              Relevant
Not needed   FP          TN              Irrelevant
             TP+FP       TN+FN

TP – True Positive, TN – True Negative, FP – False Positive, FN – False Negative

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$

$$\text{Recall} = \frac{TP}{TP+FN} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$$
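Both measures follow directly from the sets of relevant and retrieved documents. A minimal sketch, with made-up document ids and the hypothetical helper name `precision_recall`:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)  # true positives: relevant and retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative data: 4 relevant documents, 3 retrieved, 2 of them relevant.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero; other conventions are possible.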
Measures: Precision & Recall (alternative names)
• Recall = TP / (TP+FN) is also called sensitivity.
• Precision = TP / (TP+FP) is also called positive predictive value. Specificity is a different measure: TN / (TN+FP).
Measure: F‐Measure
• The harmonic mean of precision and recall, the traditional F‐measure or balanced F‐score, is:

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

• Two other commonly used weighted variants are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.
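The balanced F‐score, F2 and F0.5 are instances of the general weighted form $F_\beta = (1+\beta^2) \cdot P \cdot R / (\beta^2 P + R)$. That general formula is the standard definition and is not spelled out on the slide; the sketch below assumes it, and `f_beta` is an illustrative name.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision, beta = 1 is balanced.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Illustrative values: perfect recall, mediocre precision.
p, r = 0.5, 1.0
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)
f05 = f_beta(p, r, beta=0.5)
print(f1, f2, f05)
```

Because recall is the stronger of the two values here, the recall-heavy F2 comes out highest and the precision-heavy F0.5 lowest.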
ROC – Receiver Operating Characteristic
AUC – Area Under Curve
[Figure: ROC curves of 3 systems compared; axes: FP (irrelevant) and TP (relevant)]
Vector space model
• http://en.wikipedia.org/wiki/Vector_space_model
• Document: a vector of words
– A sparse vector over all possible words…
• Similarity between query and document:
– Scalar product
– An angle between the two vectors
Scalar product
• Query Q is a document with perhaps just a single word.
• Similarity of query and document

$$M(Q, D_i) = Q \cdot D_i$$

• $X \cdot Y = \sum_i x_i y_i$
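Since the vectors are sparse, it is convenient to store only the non-zero components. The sketch below uses word counts as components and also computes the cosine of the angle between two vectors; the helper names and the example texts are assumptions for illustration.

```python
import math
from collections import Counter

def to_vector(text):
    """Sparse vector: word -> number of occurrences (zero entries are not stored)."""
    return Counter(text.split())

def scalar_product(x, y):
    """Sum of x_i * y_i over the words present in both vectors."""
    return sum(weight * y[word] for word, weight in x.items() if word in y)

def cosine(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is empty."""
    norm = math.sqrt(scalar_product(x, x)) * math.sqrt(scalar_product(y, y))
    return scalar_product(x, y) / norm if norm else 0.0

query = to_vector("banana")
docs = [to_vector(t) for t in ["it is what it is", "what is it", "it is a banana"]]
scores = [scalar_product(query, d) for d in docs]
print(scores)  # [0, 0, 1]
```

The cosine divides the scalar product by both vector lengths, so a long document is not favoured merely for containing more words.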
Weighted version
• The more often a word occurs, the more relevant it is
• Same word vectors, but count occurrences

$$M(Q, D) = \sum_i w_{q,i}\, w_{d,i}$$

• The weight w differs for each word in each document
• Extend: add weight for a word in a "more important" context
• Can you add term weight on query words?
Limitations of vector space
• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
• Search keywords must precisely match document terms; word substrings might result in a "false positive match"
• Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
• The order in which the terms appear in the document is lost in the vector space representation.
Term Weight Calculation
• quantification of intra-document (-cluster) contents (similarity) = tf factor
– the term frequency within a document
– how well a term describes a document
• quantification of inter-documents (-cluster) separation (dissimilarity) = idf factor
– the inverse document frequency
– frequency of the term in docs of the collection

$$w_{ij} = tf(i,j) \cdot idf(i)$$
TF and IDF Factors
• Let,
– N be the total number of docs in the collection
– n_i be the number of docs which contain k_i
– freq(i,j) be the raw frequency of k_i within d_j
• A normalized frequency (tf factor) is given by:

$$f(i,j) = \frac{freq(i,j)}{\max_l freq(l,j)}$$

– where max is computed over all terms occurring in doc d_j
• The idf factor is computed as:

$$idf(i) = \log \frac{N}{n_i}$$

– the log makes the tf and idf values comparable; it can also be read as the amount of information associated with term k_i
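The two factors combine into the term weight w_ij = f(i,j) * idf(i). The sketch below computes these weights for a toy collection, assuming whitespace tokenisation and the natural logarithm (the slide does not fix the base); `tf_idf_weights` is an illustrative name.

```python
import math
from collections import Counter

def tf_idf_weights(texts):
    """Return one dict per document mapping term -> f(i,j) * idf(i)."""
    docs = [Counter(text.split()) for text in texts]
    n_docs = len(docs)                      # N
    doc_freq = Counter()                    # n_i: number of docs containing term i
    for counts in docs:
        doc_freq.update(counts.keys())
    weights = []
    for counts in docs:
        max_freq = max(counts.values(), default=1)  # most frequent term in this doc
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        })
    return weights

texts = ["it is what it is", "what is it", "it is a banana"]
w = tf_idf_weights(texts)
print(w[2])
```

Terms that occur in every document ("it", "is") get idf = log(1) = 0 and so contribute nothing to the ranking, while "banana", which occurs in a single document, gets the largest weight.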
• vector model with tf-idf weights
– a good ranking strategy in general collections
– simple and fast to compute
Pros & Cons
• Advantages:
– term-weighting improves quality of the answer set
– partial matching allows retrieval of docs that approximate the query conditions
– cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
– assumes independence of index terms
– not clear whether this is bad though
Ontology
• Ontology: a conceptualisation of things…
• An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
[Figure: example concept hierarchy. Vehicles: watercraft, cars, aircraft; Seaplane]
Ontology driven search
• Query => map to an ontology
– Use ontology to guide what you “really want”
• Map documents to the same ontology
• Fetch most relevant to term, ontology, etc…
gopubmed
Importance of a document
• Can we say that some documents are a priori more important than others?
• Type of a document /law, news, chat,…/
• “Good source”
• Relevant (often cited, popular)
What is a Markov Chain?
• A Markov chain has two components:
1) A network structure much like a web site, where each node is called a state. So the complete web is the set of all possible states.
2) A transition probability of traversing a link given that the chain is in a state.
– For each state the sum of outgoing probabilities is one.
• A sequence of steps through the chain is called a random walk.
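A random walk can be simulated directly from the two components. The sketch below assumes a small hypothetical three-page graph and, as a simplification, gives every outgoing link of a state the same probability; the names `transition_probabilities` and `random_walk` are illustrative.

```python
import random

def transition_probabilities(links):
    """Assumption: each of the m outgoing links of a state has probability 1/m."""
    return {state: {target: 1.0 / len(targets) for target in targets}
            for state, targets in links.items()}

def random_walk(transitions, start, steps, rng):
    """Follow `steps` transitions from `start`; return the list of visited states."""
    state = start
    path = [state]
    for _ in range(steps):
        targets = list(transitions[state])
        probs = [transitions[state][t] for t in targets]
        state = rng.choices(targets, weights=probs, k=1)[0]
        path.append(state)
    return path

# Hypothetical graph: every state has at least one outgoing link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
transitions = transition_probabilities(links)
path = random_walk(transitions, "A", 10, random.Random(0))
print(path)
```

A state without outgoing links would stop this walk; the sketch simply assumes none exist.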
The Random Surfer
• Assume the web is a Markov chain.
• Surfers randomly click on links, where the probability of an outlink from page A is 1/m, where m is the number of outlinks from A.
• The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.
• Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
Dangling Pages
[Figure: example graph with pages A, C and B]
• Problem: A and B have no outlinks.
• Solution: Assume A and B have links to all web pages with equal probability.
Rank Sink
• Problem: Pages in a loop accumulate rank but do not distribute it.
• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.
PageRank (PR) ‐ Definition
• P is a web page
$$PR(P) = \frac{d}{N} + (1-d)\left(\frac{PR(P_1)}{O(P_1)} + \frac{PR(P_2)}{O(P_2)} + \dots + \frac{PR(P_n)}{O(P_n)}\right)$$
• Pi are the web pages that have a link to P
• O(Pi) is the number of outlinks from Pi
• d is the teleportation probability
• N is the size of the web
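The definition can be evaluated by repeated substitution until the values stop changing. This is a minimal sketch, not a production implementation: it assumes every link target is itself a key of `links`, treats pages without outlinks as linking to all pages with equal probability (the dangling-page solution), and uses a fixed iteration count. The example graph is hypothetical.

```python
def pagerank(links, d=0.15, iterations=100):
    """links: page -> list of pages it links to. d: teleportation probability."""
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by dangling pages is spread evenly over all pages.
        dangling = sum(pr[p] for p in pages if not links[p])
        new_pr = {p: d / n + (1 - d) * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                # Each outlink of p passes on PR(p) / O(p), damped by (1 - d).
                new_pr[target] += (1 - d) * pr[p] / len(links[p])
        pr = new_pr
    return pr

# Hypothetical graph: C links to A and B, which have no outlinks.
links = {"A": [], "B": [], "C": ["A", "B"]}
pr = pagerank(links)
print(pr)
```

With d/N as the teleportation term the values form a probability distribution, so they sum to 1.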
Example Web Graph
Iteratively Computing PageRank
• Replace d/N in the def. of PR(P) by d, so the PR values average 1 over all pages (they sum to N instead of 1).
• d is normally set to 0.15, but for simplicity let's set it to 0.5
• Set initial PR values to 1
• Solve the following equations iteratively:

$$PR(A) = 0.5 + 0.5\,PR(C)$$
$$PR(B) = 0.5 + 0.5\,(PR(A)/2)$$
$$PR(C) = 0.5 + 0.5\,(PR(A)/2 + PR(B))$$
Example Computation of PR
Iteration PR(A) PR(B) PR(C)
0 1 1 1
1 1 0.75 1.125
2 1.0625 0.765625 1.1484375
3 1.07421875 0.76855469 1.15283203
4 1.07641602 0.76910400 1.15365601
5 1.07682800 0.76920700 1.15381050
… … … …
12 1.07692308 0.76923077 1.15384615
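The table can be reproduced with a few lines of code. The sketch assumes the graph implied by the three equations (A links to B and C, B links to C, C links to A) and updates the values in place, so each equation already uses the values computed earlier in the same iteration; that in-place order is what matches the rows of the table.

```python
def iterate_example(iterations, d=0.5):
    """Return the PR values after each iteration, starting from all ones."""
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}
    history = [dict(pr)]
    for _ in range(iterations):
        pr["A"] = d + (1 - d) * pr["C"]
        pr["B"] = d + (1 - d) * (pr["A"] / 2)               # uses the new PR(A)
        pr["C"] = d + (1 - d) * (pr["A"] / 2 + pr["B"])     # uses new PR(A), PR(B)
        history.append(dict(pr))
    return history

history = iterate_example(12)
for i in (0, 1, 2, 12):
    print(i, history[i])
```

The values converge to PR(A) = 14/13, PR(B) = 10/13 and PR(C) = 15/13, which is the exact solution of the three equations and sums to N = 3.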
Large Matrix Computation
• Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns.
• The matrix is sparse, as the average number of outlinks is between 7 and 8.
• Setting d = 0.15 or above requires at most 100 iterations to converge.
• Researchers are still trying to speed up the computation.
PageRank - Motivation
• A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.
• The number of incoming links to a page is a measure of importance and authority of the page.
• Also take into account the quality of recommendation, so a page is more important if the sources of its incoming links are important.
The Anatomy of a Large‐Scale Hypertextual Web Search Engine
• http://infolab.stanford.edu/~backrub/google.html
• In this paper, we present Google, a prototype of a large‐scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
• Personalized PageRank
– Teleportation to a set of pages defining the preferences of a particular user
• Topic‐sensitive PageRank [Haveliwala 02]
– Teleportation to a set of pages defining a particular topic
• TrustRank [Gyöngyi 04]
– Teleportation to “trustworthy” pages
• Many papers on analyzing PageRank and numerical methods for efficient computation
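All three variants change only where the surfer lands when teleporting. The sketch below illustrates the idea rather than any of the cited papers' algorithms: teleportation goes uniformly to a `preferred` set of pages instead of to all pages. The graph, the function name and the handling of dangling pages (their rank follows the teleport distribution) are assumptions.

```python
def personalized_pagerank(links, preferred, d=0.15, iterations=100):
    """PageRank where teleportation lands only on pages in `preferred`."""
    if not preferred:
        raise ValueError("preferred must contain at least one page")
    pages = list(links)
    teleport = {p: (1.0 / len(preferred) if p in preferred else 0.0) for p in pages}
    pr = dict(teleport)
    for _ in range(iterations):
        new_pr = {p: d * teleport[p] for p in pages}
        for p in pages:
            if links[p]:
                share = (1 - d) * pr[p] / len(links[p])
                for target in links[p]:
                    new_pr[target] += share
            else:
                # Dangling page: its rank follows the teleport distribution.
                for q in pages:
                    new_pr[q] += (1 - d) * pr[p] * teleport[q]
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr_uniform = personalized_pagerank(links, set(links))  # ordinary teleportation
pr_b = personalized_pagerank(links, {"B"})             # user who prefers page B
print(pr_uniform["B"], pr_b["B"])
```

Passing all pages as `preferred` gives ordinary PageRank; restricting the set shifts rank towards the preferred pages and the pages they link to.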
Future? Or current?
• Recommendations (Tagging)
• Common behaviour (news/epidemics spread)
• Social networks
• Focus
• Generalisation
• Rich get richer; Googlearchy?; …
• Your contribution?