![Page 1: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/1.jpg)
Web- and Multimedia-based Information Systems
Lecture 2
![Page 2: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/2.jpg)
Vector Model
Non-binary weights
Degree of similarity
Result ranking possible
Fast, with good results
![Page 3: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/3.jpg)
Vector Model
Document Vector with weights for every index term
Query Vector with weights for every index term
Vectors of the dimension of the total number of index terms in the collection
![Page 4: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/4.jpg)
Documents in Vector Space
[Figure: documents D1 to D11 shown as points in a three-dimensional vector space whose axes are the index terms t1, t2, and t3.]
![Page 5: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/5.jpg)
Vector Model
Position 1 corresponds to term 1, position 2 to term 2, position t to term t
The weight of the term is stored in each position
$$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$$

$$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$$

$w = 0$ if a term is absent
![Page 6: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/6.jpg)
Vector Model
Cosine of the angle between the vectors taken as similarity measure
Sorting/ranking of results
Threshold for results
More precise answer with more relevant docs on the top
![Page 7: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/7.jpg)
Similarity Function
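$$sim(D_i, D_j) = \cos\theta = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}| \, |\vec{d_j}|} = \frac{\sum_{k=1}^{t} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{t} w_{ik}^2} \cdot \sqrt{\sum_{k=1}^{t} w_{jk}^2}}$$

Here $w_{ik}$ is the weight of term $k$ in $D_i$, and $t$ is the total number of index terms.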
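The cosine measure can be sketched in a few lines of Java. This is a minimal illustration, not code from the lecture: the class name `CosineSimilarity`, the method name `cosine`, the sample vectors, and the choice to return 0 for an all-zero vector are all assumptions made for the example.

```java
// Illustrative sketch; names and sample data are assumptions, not from the lecture.
final class CosineSimilarity {

    /** Cosine of the angle between two weight vectors of the same dimension t. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];      // numerator: sum of w_ik * w_jk
            normA += a[k] * a[k];    // sum of squared weights of the first vector
            normB += b[k] * b[k];    // sum of squared weights of the second vector
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here: an all-zero vector matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1, 0, 1}; // hypothetical query over terms t1, t2, t3
        double[] docA = {2, 0, 3};  // hypothetical document weights
        double[] docB = {0, 8, 0};
        System.out.println(cosine(query, docA));
        System.out.println(cosine(query, docB));
    }
}
```

With these sample vectors the first document scores about 0.98 and the second scores 0, since it shares no term with the query.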
![Page 8: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/8.jpg)
Vector Model Index Terms Weighting
Binary weights
Raw term weights
Term frequency x inverse document frequency
![Page 9: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/9.jpg)
Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector
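
| docs | t1 | t2 | t3 |
|------|----|----|----|
| D1 | 1 | 0 | 1 |
| D2 | 1 | 0 | 0 |
| D3 | 0 | 1 | 1 |
| D4 | 1 | 0 | 0 |
| D5 | 1 | 1 | 1 |
| D6 | 1 | 1 | 0 |
| D7 | 0 | 1 | 0 |
| D8 | 0 | 1 | 0 |
| D9 | 0 | 0 | 1 |
| D10 | 0 | 1 | 1 |
| D11 | 1 | 0 | 1 |
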
![Page 10: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/10.jpg)
Raw Term Weights
The frequency of occurrence for the term in each document is included in the vector
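
| docs | t1 | t2 | t3 |
|------|----|----|----|
| D1 | 2 | 0 | 3 |
| D2 | 1 | 0 | 0 |
| D3 | 0 | 4 | 7 |
| D4 | 3 | 0 | 0 |
| D5 | 1 | 6 | 3 |
| D6 | 3 | 5 | 0 |
| D7 | 0 | 8 | 0 |
| D8 | 0 | 10 | 0 |
| D9 | 0 | 0 | 1 |
| D10 | 0 | 3 | 5 |
| D11 | 4 | 0 | 1 |
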
![Page 11: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/11.jpg)
Term frequency x Inverse document frequency
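$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$

$$idf_k = \log\left(\frac{N}{n_k}\right)$$

$$tf_{ik} = \frac{freq_{ik}}{\max_{l} freq_{il}}$$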
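As a rough sketch, the tf x idf weight could be computed as follows. The class and method names are hypothetical. Two further assumptions are made: the logarithm is base 10, and the term frequency is normalised by the highest frequency of any term in the same document, which is the common convention.

```java
// Illustrative sketch; names are assumptions, not from the lecture.
final class TfIdf {

    /** Normalised term frequency: frequency of term k divided by the highest frequency in the document. */
    static double tf(int[] docFreqs, int k) {
        int max = 0;
        for (int f : docFreqs) {
            max = Math.max(max, f);
        }
        return max == 0 ? 0.0 : (double) docFreqs[k] / max;
    }

    /** Inverse document frequency: log10(N / n_k). Base 10 is an assumption. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    /** Weight of term k in a document: tf * idf. */
    static double weight(int[] docFreqs, int k, int totalDocs, int docsWithTerm) {
        return tf(docFreqs, k) * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        int[] freqs = {2, 0, 3}; // hypothetical raw frequencies of t1, t2, t3 in one document
        // Hypothetical collection: 10000 documents, 20 of which contain t1.
        System.out.println(weight(freqs, 0, 10000, 20));
    }
}
```

A term that occurs in every document gets an idf of 0, so it contributes nothing to the weight regardless of how often it appears.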
![Page 12: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/12.jpg)
IDF Example
IDF provides high values for rare words and low values for common words
$$\log\left(\frac{10000}{1}\right) = 4$$

$$\log\left(\frac{10000}{20}\right) = 2.698$$

$$\log\left(\frac{10000}{5000}\right) = 0.301$$

$$\log\left(\frac{10000}{10000}\right) = 0$$

(The values correspond to a base-10 logarithm.)
Probabilistic Model
Based on probability
For every document, a probability is calculated for:
– Document being relevant to the query
– Document being irrelevant to the query
Documents that are more likely relevant than not are ranked in decreasing order of relevance
![Page 14: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/14.jpg)
Text Operations in Detail
Goal: automated generation of index terms
All terms conveying meaning vs. space requirements
Rules for extraction from documents
– Rules for division of terms: punctuation, dashes
– List of stop words: articles, prepositions, conjunctions
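The extraction rules above can be sketched as a small tokenizer. This is only an illustration: the class name, the splitting rule (anything that is not a letter or digit divides terms), and the very short stop list are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch; names and the stop list are assumptions, not from the lecture.
final class IndexTermExtractor {

    // Tiny sample stop list: a few articles, prepositions, and conjunctions.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "of", "in", "on", "to", "and", "or");

    /** Splits text into lower-case terms and drops stop words. */
    static List<String> extract(String text) {
        List<String> terms = new ArrayList<>();
        // Punctuation, dashes, and whitespace all act as term dividers.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(extract("The state-of-the-art in information retrieval."));
    }
}
```

Treating a dash as a divider splits "state-of-the-art" into separate words, most of which are then removed as stop words. Whether dashes should divide terms is exactly the kind of rule the slide refers to.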
![Page 15: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/15.jpg)
Word-oriented Reduction Schemes
Lemmatisation
Smaller term lists
Generalization of terms
Methods
– Reduction to the infinitive
– Reduction to a stem
Algorithmic methods for English
German:
– Biggest problems: prefixes & compounds
– Only with dictionaries: explicit listing of all forms, or rules to derive forms
![Page 16: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/16.jpg)
Stemming
Different methods
Most efficient: affix removal
– Porter algorithm
– Implement later
– Series of rules to strip suffixes
s -> nil
sses -> ss
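A sketch of how such suffix rules might be applied is shown below. It covers only the plural rules of the first step of the Porter algorithm: the two rules listed above plus the rules `ies -> i` and `ss -> ss` from the published algorithm. It is not a full stemmer, and the class and method names are made up for the example.

```java
// Illustrative sketch of a single rule group; not a complete Porter stemmer.
final class SuffixStripper {

    /** Applies the first matching rule, testing the longest suffix first. */
    static String stripPlural(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> nil
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + stripPlural(w));
        }
    }
}
```

Testing the longest suffix first matters: without it, "caresses" would match the plain `s` rule and become "caresse".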
![Page 17: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/17.jpg)
Word Type Index Term Selection
Nouns usually convey most meaning
Elimination of other word types
Clustering of compounds (computer science)
– Noun groups
– Maximum distance between terms
![Page 18: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/18.jpg)
Thesauri
"Treasury of words"
For every entry
– Definition
– Synonyms
Useful with a specific knowledge domain where a controlled vocabulary can easily be obtained
Difficult with a large and dynamic document collection such as the Web
![Page 19: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/19.jpg)
Creation of Inverted List
Create vocabulary
Note document and position in document for each term
Sort list (first by terms, then by positions)
Split terms & positions
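The steps above could be sketched with sorted maps, which keep the vocabulary ordered by term and the occurrences ordered by document and position. The class name, the nested-map layout, and the simple word splitting (ASCII word characters only) are assumptions for illustration, not the lecture's prescribed design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch; names and data layout are assumptions, not from the lecture.
final class InvertedList {

    /**
     * Maps each term to the documents containing it, and each document id
     * to the word positions at which the term occurs.
     */
    static SortedMap<String, SortedMap<Integer, List<Integer>>> build(List<String> documents) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            int position = 0;
            for (String word : documents.get(docId).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield an empty first token
                }
                index.computeIfAbsent(word, t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(position++);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical two-document collection.
        System.out.println(build(List.of("to be or not to be", "be quick")));
    }
}
```

The keys of the outer map form the vocabulary, and the nested maps hold the positions, which mirrors the final split into terms and positions.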
![Page 20: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/20.jpg)
Basic Query
Terms of the query isolated
Get pointer to positions for every term
Conduct set operations
Get result documents and present
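A minimal sketch of this procedure for a conjunctive (AND) query follows. The index is simplified to a map from term to the set of document ids containing it; that layout, the class name, and the sample data are assumptions for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch; names and data layout are assumptions, not from the lecture.
final class BooleanQuery {

    /** Returns the ids of documents that contain every term of the query. */
    static Set<Integer> and(Map<String, Set<Integer>> index, String query) {
        Set<Integer> result = null;
        // Isolate the query terms by splitting on whitespace.
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            Set<Integer> docs = index.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs); // first term: start with its documents
            } else {
                result.retainAll(docs);       // later terms: set intersection
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        // Hypothetical index: term -> ids of documents containing it.
        Map<String, Set<Integer>> index = Map.of(
                "web", Set.of(1, 2, 3),
                "search", Set.of(2, 3, 4),
                "engine", Set.of(3));
        System.out.println(and(index, "web search"));
        System.out.println(and(index, "web search engine"));
    }
}
```

An OR query would use `addAll` (set union) in place of `retainAll`.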
![Page 21: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/21.jpg)
Advanced Query Functionality
Comparison operators for metadata
String of multiple terms
More general: take into account distance and order of terms
Truncation (wildcards)
![Page 22: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/22.jpg)
Information Retrieval System Evaluation
Functionality analysis
Performance
– Time
– Space
Retrieval performance
– Batch vs. interactive mode
![Page 23: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/23.jpg)
Retrieval Performance Measures
Recall
– The fraction of relevant documents which has been retrieved
Precision
– The fraction of the retrieved documents which is relevant
![Page 24: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/24.jpg)
Precision vs. Recall
Users usually do not inspect all results
Example: relevant documents R = {d2, d5}
Result ranking returned by system
1. d1
2. d5
3. d2
For the second result, recall is at 50%, precision is also 50%
For the third result, recall is 100%, precision is 67%
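The figures in this example can be reproduced with a short sketch that computes recall and precision after the first k results. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch; names are assumptions, not from the lecture.
final class RetrievalMeasures {

    /** Number of relevant documents among the first k results. */
    static int hits(List<String> ranking, Set<String> relevant, int k) {
        int count = 0;
        for (String doc : ranking.subList(0, Math.min(k, ranking.size()))) {
            if (relevant.contains(doc)) {
                count++;
            }
        }
        return count;
    }

    /** Fraction of the relevant documents that appear in the first k results. */
    static double recallAt(List<String> ranking, Set<String> relevant, int k) {
        return (double) hits(ranking, relevant, k) / relevant.size();
    }

    /** Fraction of the first k results that are relevant. */
    static double precisionAt(List<String> ranking, Set<String> relevant, int k) {
        int inspected = Math.min(k, ranking.size());
        return inspected == 0 ? 0.0 : (double) hits(ranking, relevant, k) / inspected;
    }

    public static void main(String[] args) {
        List<String> ranking = List.of("d1", "d5", "d2");
        Set<String> relevant = Set.of("d2", "d5");
        for (int k = 1; k <= ranking.size(); k++) {
            System.out.printf("k=%d recall=%.2f precision=%.2f%n",
                    k, recallAt(ranking, relevant, k), precisionAt(ranking, relevant, k));
        }
    }
}
```

At k = 1 both measures are 0, because the top-ranked document d1 is not relevant.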
![Page 25: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/25.jpg)
Programming Assignment
![Page 26: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/26.jpg)
Programming Assignment
Different part each week
Web search engine
![Page 27: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/27.jpg)
WWW Search Engine
[Figure: architecture of a WWW search engine. Components shown: Robot, Indexer, Index (DB), Search Engine, WWW-Server, WWW-Client. Labeled flows: Request, Documents, Files, Query, Results, Result List.]
![Page 28: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/28.jpg)
Assignment Part 1
Program a web robot
Starts at a user-defined URL
Navigates the Web via hypertext links
Speaks HTTP (see RFC 1945)
Stores the path it took (URLs), preferably in a tree-like data structure
Stores result code & important header fields for every request to disk in a format suitable for further processing
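One possible shape for the tree-like structure is a node per visited URL, with the pages reached from it as children. The sketch below is only a suggestion: the class name `CrawlNode`, its fields, and the example URLs are all hypothetical, and a real robot would also need to persist the header fields.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; names and fields are assumptions, not part of the assignment text.
final class CrawlNode {
    final String url;
    final int statusCode;
    final List<CrawlNode> children = new ArrayList<>();

    CrawlNode(String url, int statusCode) {
        this.url = url;
        this.statusCode = statusCode;
    }

    /** Records a page reached by following a link from this page. */
    CrawlNode addChild(String childUrl, int childStatus) {
        CrawlNode child = new CrawlNode(childUrl, childStatus);
        children.add(child);
        return child;
    }

    /** Renders the subtree as text, one request per line, indented by depth. */
    String render() {
        StringBuilder sb = new StringBuilder();
        render(sb, 0);
        return sb.toString();
    }

    private void render(StringBuilder sb, int depth) {
        sb.append("  ".repeat(depth)).append(statusCode).append(' ').append(url).append('\n');
        for (CrawlNode child : children) {
            child.render(sb, depth + 1);
        }
    }

    public static void main(String[] args) {
        CrawlNode root = new CrawlNode("http://example.org/", 200);
        CrawlNode page = root.addChild("http://example.org/a.html", 200);
        page.addChild("http://example.org/missing.html", 404);
        System.out.print(root.render());
    }
}
```

The indented text produced by `render` is one simple format that could be written to disk and parsed again later.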
![Page 29: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/29.jpg)
Assignment Part 1 (cont.)
Implementation in Java
Pure TCP socket communications
No need to save documents in this assignment
Robot shall identify itself via HTTP User-Agent header
Extensibility required for future assignments
![Page 30: Web- and Multimedia-based Information Systems Lecture 2](https://reader035.vdocuments.site/reader035/viewer/2022062421/56649f455503460f94c66042/html5/thumbnails/30.jpg)
Example HTTP session
TCP connection:
telnet www 80
HTTP request:
GET / HTTP/1.0
<CRLF>
Response headers:
HTTP/1.0 200 Document follows
Date: Tue, 10 Sep 1996 14:34:06 GMT
Server: NCSA/1.4.2
Content-type: image/gif
Last-modified: Tue, 10 Sep 1996 13:25:26 GMT
Content-length: 9755
<CRLF>
Start of content:
<HTML>
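The same exchange can be sketched in Java over a plain TCP socket. This is a minimal illustration under stated assumptions, not a reference solution for the assignment: the class name `Http10Client`, its method names, the host `www.example.org`, and the User-Agent value `ExampleRobot/0.1` are all made up, and error handling is kept to a minimum.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; names, host, and User-Agent value are assumptions.
final class Http10Client {

    /** Builds an HTTP/1.0 GET request; the empty line (CRLF) ends the request. */
    static String buildRequest(String path, String userAgent) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "User-Agent: " + userAgent + "\r\n"
                + "\r\n";
    }

    /** Extracts the result code, e.g. 200 from "HTTP/1.0 200 Document follows". */
    static int parseStatusCode(String statusLine) {
        return Integer.parseInt(statusLine.split(" ", 3)[1]);
    }

    /** Turns "Name: value" lines into a map with lower-cased header names. */
    static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(),
                        line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /** Sends one request and prints the result code and response headers. */
    static void fetchHeaders(String host, int port, String path, String userAgent)
            throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(path, userAgent).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String statusLine = in.readLine();
            if (statusLine == null) {
                throw new IOException("connection closed before a status line was received");
            }
            List<String> headerLines = new ArrayList<>();
            String line;
            // The first empty line separates the headers from the content.
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                headerLines.add(line);
            }
            System.out.println(parseStatusCode(statusLine) + " " + parseHeaders(headerLines));
        }
    }

    public static void main(String[] args) throws IOException {
        fetchHeaders("www.example.org", 80, "/", "ExampleRobot/0.1");
    }
}
```

Keeping request building and header parsing separate from the socket code makes those parts testable without a network connection.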