Web Search, Week 6, LBSC 796/INFM 718R, October 15, 2007

TRANSCRIPT
Web Search
Week 6
LBSC 796/INFM 718R
October 15, 2007
What “Caused” the Web?
• Affordable storage
  – 300,000 words/$ by 1995
• Adequate backbone capacity
  – 25,000 simultaneous transfers by 1995
• Adequate “last mile” bandwidth
  – 1 second/screen (of text) by 1995
• Display capability
  – 10% of US population could see images by 1995
• Effective search capabilities
  – Lycos and Yahoo! achieved useful scale in 1994-1995
Defining the Web
• HTTP, HTML, or URL?
• Static, dynamic or streaming?
• Public, protected, or internal?
Number of Web Sites
What’s a Web “Site”?
• Any server at port 80?
  – Misses many servers at other ports
• Some servers host unrelated content
  – Geocities
• Some content requires specialized servers
  – rtsp
Estimated Web Size (2005)

[Bar chart: billions of pages for MSN, Ask, Yahoo, Google, the indexed Web, and the total Web; vertical axis 0 to 12 billion pages]

Gulli and Signorini, 2005
Crawling the Web
Web Crawling Algorithm
• Put a set of known sites on a queue
• Repeat until the queue is empty:
  – Take the first page off of the queue
  – Check whether this page has been processed
  – If this page has not yet been processed:
    • Add this page to the index
    • Add each link on the current page to the queue
    • Record that this page has been processed
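The steps above can be sketched as a breadth-first crawl in Python; `fetch` and `extract_links` are assumed helper functions (not part of the slides) that retrieve a page and pull out its links:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links):
    """Breadth-first crawl following the slide's algorithm.

    fetch(url) -> page text, and extract_links(url, text) -> iterable
    of URLs, are assumed helpers supplied by the caller."""
    queue = deque(seed_urls)   # put a set of known sites on a queue
    processed = set()          # pages already handled
    index = {}                 # stand-in for the index: url -> text
    while queue:               # repeat until the queue is empty
        url = queue.popleft()  # take the first page off the queue
        if url in processed:   # check if this page has been processed
            continue
        text = fetch(url)
        index[url] = text      # add this page to the index
        for link in extract_links(url, text):
            queue.append(link) # add each link on the page to the queue
        processed.add(url)     # record that this page was processed
    return index
```

A real crawler would also throttle per-server requests and respect robots.txt, challenges covered later in the deck.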
Link Structure of the Web
Global Internet Users

[Pie chart of native speakers; Global Reach projection for 2004 (as of Sept 2003): English 31%, Chinese 18%, Japanese 9%, Spanish 7%, German 7%, Korean 5%, French 4%, Portuguese 3%, Italian 3%, Russian 2%, Other 11%]
[Page 11 repeats the chart above alongside a second distribution: 68%, 4%, 6%, 2%, 6%, 1%, 3%, 1%, 2%, 2%, 5% (labels not legible in the transcript)]
Web Crawl Challenges

• Discovering “islands” and “peninsulas”
• Duplicate and near-duplicate content
  – 30-40% of total content
• Server and network loads
• Dynamic content generation
• Link rot
  – Changes at 1% per week
• Temporary server interruptions
Duplicate Detection
• Structural
  – Identical directory structure (e.g., mirrors, aliases)
• Syntactic
  – Identical bytes
  – Identical markup (HTML, XML, …)
• Semantic
  – Identical content
  – Similar content (e.g., with a different banner ad)
  – Related content (e.g., translated)
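The “similar content” case is commonly detected by comparing word shingles with the Jaccard coefficient, a standard technique not named on the slide; a minimal sketch, with an arbitrary shingle size of 3:

```python
def shingles(text, k=3):
    """Set of k-word shingles from a document's token sequence."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Resemblance of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical pages: the near-duplicate differs only by a trailing ad.
page = "breaking news the storm hit the coast early this morning"
near_dup = page + " sponsored ad buy now"
unrelated = "a guide to pruning roses in the early spring garden"

print(jaccard(shingles(page), shingles(near_dup)) >
      jaccard(shingles(page), shingles(unrelated)))  # True
```

Production systems avoid the pairwise comparison by hashing shingles into compact sketches, but the resemblance measure is the same idea.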
Robots Exclusion Protocol
• Depends on voluntary compliance by crawlers
• Exclusion by site
  – Create a robots.txt file at the server’s top level
  – Indicate which directories not to crawl
• Exclusion by document (in HTML head)
  – Not implemented by all crawlers

  <meta name="robots" content="noindex,nofollow">
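Python’s standard library includes a parser for the site-level protocol; a sketch that parses an assumed robots.txt body directly (no network fetch) and checks two URLs:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler asks before fetching each URL.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))      # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x.html"))  # False
```

As the slide notes, nothing enforces this: compliance is voluntary on the crawler’s side.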
Adversarial IR
• Search is user-controlled suppression
  – Everything is known to the search system
  – Goal: avoid showing things the user doesn’t want
• Other stakeholders have different goals
  – Authors risk little by wasting your time
  – Marketers hope for serendipitous interest
• Metadata from trusted sources is more reliable
Index Spam
• Goal: Manipulate rankings of an IR system
• Multiple strategies:
  – Create bogus user-assigned metadata
  – Add invisible text (font in background color, …)
  – Alter your text to include desired query terms
  – “Link exchanges” create links to your page
Hands On: The Internet Archive

• Web crawls since 1997
  – http://archive.org
• Check out Maryland’s Web site in 1997
• Check out the history of your favorite site
Estimating Authority from Links

[Diagram: a hub page linking to two authority pages]
Rethinking Ranked Retrieval
• We ask “is this document relevant?”
  – Vector space: we answer “somewhat”
  – Probabilistic: we answer “probably”
• The key is to know what “probably” means
  – First, we’ll formalize that notion
  – Then we’ll apply it to ranking
Retrieval w/ Language Models
• Build a model for every document
• Rank document d based on P(M_D | q)
• Expand using Bayes’ Theorem:

  P(M_D | q) = P(q | M_D) P(M_D) / P(q)

P(q) is the same for all documents, so it doesn’t change ranks. P(M_D), the prior, comes from the PageRank score.
“Noisy-Channel” Model of IR
Information need
Query
User has an information need, “thinks” of a relevant document…
and writes down some queries
Information retrieval: given the query, guess the document it came from.
[Diagram: documents d1, d2, …, dn in the document collection]
Where do the probabilities fit?

[Diagram: an information need passes through query formulation to a query, then through a representation function to a query representation (query processing); a document passes through a representation function to a document representation (document processing); a comparison function combines the two into a retrieval status value, which a human judges for utility. The probabilities enter at the comparison function: P(d is Rel | q)]
Probability Ranking Principle
• Assume binary relevance, document independence
  – Each document is either relevant or it is not
  – Relevance of one doc reveals nothing about another
• Assume the searcher works down a ranked list
  – Seeking some number of relevant documents
• Documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q)
Probabilistic Inference
• Suppose there’s a horrible, but very rare disease
• But there’s a very accurate test for it
• Unfortunately, you tested positive…
The probability that you contracted it is 0.01%
The test is 99% accurate
Should you panic?
Bayes’ Theorem
• You want to find P(“have disease” | “test positive”)
• But you only know
  – How rare the disease is
  – How accurate the test is
• Use Bayes’ Theorem (hence Bayesian Inference):

  P(A | B) = P(B | A) P(A) / P(B)

P(A) is the prior probability; P(A | B) is the posterior probability.
Applying Bayes’ Theorem
Two cases:
1. You have the disease, and you tested positive
2. You don’t have the disease, but you tested positive (error)

Case 1: (0.0001)(0.99) = 0.000099
Case 2: (0.9999)(0.01) = 0.009999
Case 1 + 2 = 0.010098

P(“have disease” | “test positive”) = (0.99)(0.0001) / 0.010098 = 0.009804 < 1%
Don’t worry!
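The slide’s arithmetic, worked as a quick check in code:

```python
# The slide's two cases, worked with Bayes' theorem.
p_disease = 0.0001            # prior: 0.01% chance of having contracted it
p_pos_given_disease = 0.99    # the test is 99% accurate
p_pos_given_healthy = 0.01    # false-positive rate

# Total probability of testing positive (case 1 + case 2).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive).
posterior = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 6), round(posterior, 6))  # 0.010098 0.009804
```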
Another View

In a population of one million people:
  100 are infected; 999,900 are not
  Of the 100 infected: 99 test positive, 1 tests negative
  Of the 999,900 healthy: 9,999 test positive, 989,901 test negative

10,098 will test positive… Of those, only 99 really have the disease!
Language Models
• Probability distribution over strings of text
  – How likely is a string in a given “language”?
• Probabilities depend on what language we’re modeling
p1 = P(“a quick brown dog”)
p2 = P(“dog quick a brown”)
p3 = P(“быстрая brown dog”)
p4 = P(“быстрая собака”)
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4
Unigram Language Model
• Assume each word is generated independently
  – Obviously, this is not true…
  – But it seems to work well in practice!
• The probability of a string, given a model:

  P(q1 … qk | M) = ∏_{i=1}^{k} P(qi | M)

The probability of a sequence of words decomposes into a product of the probabilities of individual words.
A Physical Metaphor
• Colored balls are randomly drawn from an urn (with replacement)

[Diagram: an urn M of colored balls; the probability of drawing a particular sequence of four balls is the product of the four draw probabilities, e.g. (4/9)(2/9)(4/9)(3/9)]
An Example
Model M:

  w       P(w)
  the     0.2
  a       0.1
  man     0.01
  woman   0.01
  said    0.03
  likes   0.02
  …

P(“the man likes the woman” | M)
  = P(the|M) P(man|M) P(likes|M) P(the|M) P(woman|M)
  = 0.2 × 0.01 × 0.02 × 0.2 × 0.01
  = 0.00000008
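The slide’s computation, reproduced directly from the table:

```python
# The slide's model M as a unigram distribution (only the listed words).
M = {"the": 0.2, "a": 0.1, "man": 0.01,
     "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(words, model):
    """Unigram probability of a word sequence: the product of P(w|M)."""
    p = 1.0
    for w in words:
        p *= model[w]
    return p

print(p_string("the man likes the woman".split(), M))  # ≈ 8e-08
```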
Comparing Language Models
Model M1:             Model M2:

  w         P(w)        w         P(w)
  the       0.2         the       0.2
  yon       0.0001      yon       0.1
  class     0.01        class     0.001
  maiden    0.0005      maiden    0.01
  sayst     0.0003      sayst     0.03
  pleaseth  0.0001      pleaseth  0.02
  …                     …

For s = “the class pleaseth yon maiden”:

  word:       the    class    pleaseth   yon      maiden
  under M1:   0.2    0.01     0.0001     0.0001   0.0005
  under M2:   0.2    0.001    0.02       0.1      0.01

P(s|M2) > P(s|M1)

What exactly does this mean?
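The same comparison in code (the word order of s is reconstructed from the garbled slide; the probabilities are from the tables):

```python
# The two models from the tables above.
M1 = {"the": 0.2, "yon": 0.0001, "class": 0.01,
      "maiden": 0.0005, "sayst": 0.0003, "pleaseth": 0.0001}
M2 = {"the": 0.2, "yon": 0.1, "class": 0.001,
      "maiden": 0.01, "sayst": 0.03, "pleaseth": 0.02}

def p_string(words, model):
    """Unigram probability: product of each word's probability."""
    p = 1.0
    for w in words:
        p *= model[w]
    return p

s = "the class pleaseth yon maiden".split()
p1, p2 = p_string(s, M1), p_string(s, M2)
print(p2 > p1)  # True: s looks far more like M2's language
```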
Retrieval w/ Language Models
• Build a model for every document
• Rank document d based on P(M_D | q)
• Expand using Bayes’ Theorem:

  P(M_D | q) = P(q | M_D) P(M_D) / P(q)

P(q) is the same for all documents, so it doesn’t change ranks. P(M_D), the prior, comes from the PageRank score.
Visually …
Ranking by P(M_D | q), where each of model1, model2, …, modeln is asked “what’s the probability this query came from you?”, is the same as ranking by P(q | M_D), where each model is asked “what’s the probability that you generated this query?”
Ranking Models?
Ranking by P(q | M_D), where each of model1, model2, …, modeln is asked “what’s the probability that you generated this query?”, is the same as ranking the documents themselves, because each modeli is a model of documenti.
Building Document Models

• How do we build a language model for a document?

Physical metaphor: what’s in the urn M? What colored balls, and how many of each?
A First Try
• Simply count the frequencies in the document = maximum likelihood estimate

  P(w | M_S) = #(w, S) / |S|

  #(w, S) = number of times w occurs in S
  |S| = length of S

[Diagram: from a sequence S of colored balls, the model M assigns P = 1/2, 1/4, 1/4 to the three observed colors]
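A sketch of the maximum likelihood estimate, using arbitrary color names to stand in for the slide’s colored balls:

```python
from collections import Counter

def mle_model(tokens):
    """Maximum likelihood estimate: P(w | M_S) = #(w, S) / |S|."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

# Two draws of one color, one each of two others -> 1/2, 1/4, 1/4.
model = mle_model(["red", "red", "blue", "green"])
print(model)  # {'red': 0.5, 'blue': 0.25, 'green': 0.25}
```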
Zero-Frequency Problem
• Suppose some event is not in our observation S
  – Model will assign zero probability to that event

[Diagram: model M estimated from sequence S assigns P = 1/2, 1/4, 1/4 to the observed colors; a sequence containing an unobserved color then gets probability (1/2)(1/4)(0)(1/4) = 0]
Smoothing
[Figure: the maximum likelihood estimate of P(w) versus a smoothed probability distribution over words w]

  p_ML(w) = (count of w) / (count of all words)

The solution: “smooth” the word probabilities.
Implementing Smoothing
• Assign some small probability to unseen events
  – But remember to take away “probability mass” from other events
• Some techniques are easily understood
  – Add one to all the frequencies (including zero)
• More sophisticated methods improve ranking
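Add-one (Laplace) smoothing, the “easily understood” technique above, as a sketch over an assumed fixed vocabulary:

```python
from collections import Counter

def laplace_model(tokens, vocabulary):
    """Add-one (Laplace) smoothing over a fixed vocabulary:
    P(w | M) = (#(w, S) + 1) / (|S| + |V|).
    Mass is taken from seen words so unseen words get a little."""
    counts = Counter(tokens)
    denom = len(tokens) + len(vocabulary)
    return {w: (counts[w] + 1) / denom for w in vocabulary}

vocab = {"red", "blue", "green", "black"}
model = laplace_model(["red", "red", "blue", "green"], vocab)
print(model["black"])  # 0.125 -- the unseen word no longer gets zero
```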
Recap: LM for IR
• Indexing time
  – Build a language model for every document
• Query time (ranking)
  – Estimate the probability of generating the query according to each model
  – Rank the documents according to these probabilities
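The two phases can be sketched end to end; the documents, query, and add-one smoothing here are illustrative choices, not from the slides:

```python
from collections import Counter

def laplace_model(tokens, vocab):
    """Indexing time: one smoothed unigram model per document."""
    counts = Counter(tokens)
    denom = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

def query_likelihood(query, model):
    """P(q | M_D): product of the query words' model probabilities."""
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)
    return p

docs = {
    "d1": "the man likes the woman".split(),
    "d2": "the woman said she likes dogs".split(),
}
vocab = {w for toks in docs.values() for w in toks}

# Indexing time: build a language model for every document.
models = {d: laplace_model(toks, vocab) for d, toks in docs.items()}

# Query time: rank documents by the probability of generating the query.
query = "man likes".split()
ranking = sorted(models, key=lambda d: query_likelihood(query, models[d]),
                 reverse=True)
print(ranking)  # ['d1', 'd2'] -- only d1 contains 'man'
```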
Key Ideas
• Probabilistic methods formalize assumptions
  – Binary relevance
  – Document independence
  – Term independence
  – Uniform priors
  – Top-down scan
• Natural framework for combining evidence
  – e.g., non-uniform priors
A Critique
• Most of the assumptions are not satisfied!
  – Searchers want utility, not relevance
  – Relevance is not binary
  – Terms are clearly not independent
  – Documents are often not independent
• Smoothing techniques are somewhat ad hoc
But It Works!
• Ranked retrieval paradigm is powerful
  – Well suited to human search strategies
• Probability theory has explanatory power
  – At least we know where the weak spots are
  – Probabilities are good for combining evidence
• Good implementations exist (e.g., Lemur)
  – Effective, efficient, and large-scale
Comparison With Vector Space
• Similar in some ways
  – Term weights based on frequency
  – Terms often used as if they were independent
• Different in others
  – Based on probability rather than similarity
  – Intuitions are probabilistic rather than geometric
Conjunctive Queries
• Perform an initial Boolean query
  – Balancing breadth with understandability
• Rerank the results
  – Using either Okapi or a language model
  – Possibly also accounting for proximity, links, …
Blogs Doubling

[Chart: number of weblogs tracked, March 2003 through October 2005, rising from 0 to nearly 20,000,000, with three successive doublings marked]

18.9 million weblogs tracked. Doubling in size approximately every 5 months; consistent doubling over the last 36 months.
[Bar chart: sites ranked by inbound links (0 to 60,000), mixing mainstream media (blue) and blogs (red): Eschaton, Common Dreams, The Economist, Binary Bonsai, Davenetics, NPR, Talking Points Memo, The Times, PBS, ESPN, Boston.com, Engadget, National Review, Asahi Shinbun, Slate, FARK, Gizmodo, LA Times, Instapundit, Daily Kos, MTV, Salon, SF Gate, Reuters, News.com, Fox News, USA Today, Boing Boing, Wired News, MSNBC, Guardian, BBC, Yahoo News, Washington Post, New York Times]
Daily Posting Volume

[Chart: daily blog posts from 9/1/04 through 9/28/05, 0 to 1,400,000, with spikes annotated: Kryptonite lock controversy, US election day, Indian Ocean tsunami, Superbowl, Schiavo dies, Newsweek Koran story, Deep Throat revealed, Justice O’Connor, Live 8 concerts, London bombings, Katrina]

1.2 million legitimate posts/day. Spam posts are marked in red; on average an additional 5.8% are spam posts, and some spam spikes reach as high as 18%.
The “Deep Web”
• Dynamic pages, generated from databases
• Not easily discovered using crawling
• Perhaps 400-500 times larger than surface Web
• Fastest growing source of new information
Deep Web

• 60 deep sites exceed the surface Web by 40 times

  Name | Type | URL | Web Size (GBs)
  National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
  NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
  National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
  Alexa | Public (partial) | http://www.alexa.com/ | 15,860
  Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
  MP3.com | Public | http://www.mp3.com/ |
Content of the Deep Web
Semantic Web
• RDF provides the schema for interchange
• Ontologies support automated inference
  – Similar to thesauri supporting human reasoning
• Ontology mapping permits distributed creation
  – This is where the magic happens