CS246
Web Characteristics
Junghoo "John" Cho (UCLA Computer Science) 2
Web Characteristics
What is the Web like? Any questions on some of the characteristics
and/or properties of the Web?
Web Characteristics
- Size of the Web
- Search engine coverage
- Link structure of the Web
How Many Web Sites?
Polling every IP
- 2^32 ≈ 4.3B addresses, 10 sec/IP, 1000 simultaneous connections
- 2^32 * 10 / (1000 * 24 * 60 * 60) ≈ 497 days (about 460 days if 2^32 is rounded down to 4B)
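The polling estimate above is plain arithmetic and can be checked in a few lines. This is only a sketch: the function name `polling_days` and its parameter names are illustrative, and the 10 seconds per IP and 1000 connections are the slide's own figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def polling_days(num_addresses, seconds_per_ip=10, connections=1000):
    """Days needed to probe every address once with a fixed number of parallel connections."""
    return num_addresses * seconds_per_ip / (connections * SECONDS_PER_DAY)

exact_days = polling_days(2**32)            # all 32-bit addresses
rounded_days = polling_days(4_000_000_000)  # 2^32 rounded down to 4 billion

print(f"{exact_days:.0f} days for 2^32 addresses")
print(f"{rounded_days:.0f} days for 4 billion addresses")
```

Either way the answer is well over a year, which is why exhaustive polling is set aside in favor of sampling.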
How Many Web Sites?
Sampling based
- T: All IPs
- S: Sampled IPs
- V: Valid replies

$$\text{number of sites} \approx |T| \cdot \frac{|V|}{|S|}$$
How Many Web Sites?
1. Select |S| random IPs
2. Send HTTP requests to port 80 at the selected IPs
3. Count valid replies: “HTTP 200 OK” = |V|
4. |T| = 2^32
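The four steps above can be sketched as a small estimator that scales the fraction of valid replies in the sample up to the whole address space, |T| * |V| / |S|. No real HTTP request is sent here: `fake_probe` is a hypothetical stand-in that pretends every 1024th address runs a Web server, and the sample size and seed are arbitrary choices.

```python
import random

TOTAL_IPS = 2**32  # |T|: all 32-bit addresses

def estimate_sites(sample_size, probe, rng):
    """Estimate the number of Web servers as |T| * |V| / |S|."""
    valid = 0
    for _ in range(sample_size):
        ip = rng.randrange(TOTAL_IPS)  # step 1: pick a random address
        if probe(ip):                  # steps 2-3: probe it and count valid replies
            valid += 1
    return TOTAL_IPS * valid / sample_size

def fake_probe(ip):
    """Hypothetical probe: every 1024th address 'answers', so the truth is 2**32 / 1024."""
    return ip % 1024 == 0

estimate = estimate_sites(200_000, fake_probe, random.Random(0))
print(f"estimated servers: {estimate:,.0f} (true value in this toy setup: {2**32 // 1024:,})")
```

In a real measurement `probe` would open a connection to port 80 and check for a valid reply; it is passed in as a parameter so the estimator itself stays independent of the networking code.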
How Many Web Sites?
OCLC (Online Computer Library Center) results: http://wcp.oclc.org
- Total number of available IPs: 2^32 ≈ 4.3 billion
- Growth (in terms of sites) has slowed down

| Year  | 1998      | 1999      | 2000      | 2001      | 2002      |
|-------|-----------|-----------|-----------|-----------|-----------|
| Sites | 2,636,000 | 4,662,000 | 7,128,000 | 8,443,000 | 8,712,000 |
Issues
Multi-hosted servers
- cnn.com: 207.25.71.5, 207.25.71.20, …
- Select the lowest IP address

For each sampled IP:
1. Look up the domain name
2. Resolve the name to its IP addresses
3. Is our sampled IP the lowest?
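The lowest-IP rule can be sketched as a predicate that counts a sampled address only when it is the numerically smallest address of its host name. The lookups are passed in as functions, and the `REVERSE` and `FORWARD` tables are hypothetical in-memory stand-ins for DNS. Counting an address that has no name at all is an assumption made here, not something the slide states.

```python
import ipaddress

def counts_as_site(sampled_ip, reverse_lookup, forward_lookup):
    """Return True only if sampled_ip is the lowest address of its host name."""
    name = reverse_lookup(sampled_ip)   # look up the domain name
    if name is None:
        return True                     # assumption: an unnamed address counts by itself
    addresses = forward_lookup(name)    # resolve the name to all of its IPs
    # Compare numerically; as plain strings "207.25.71.20" would sort before "207.25.71.5".
    lowest = min(addresses, key=ipaddress.ip_address)
    return ipaddress.ip_address(sampled_ip) == ipaddress.ip_address(lowest)

# Hypothetical DNS tables built from the example addresses on the slide.
REVERSE = {"207.25.71.5": "cnn.com", "207.25.71.20": "cnn.com"}
FORWARD = {"cnn.com": ["207.25.71.20", "207.25.71.5"]}

for ip in ("207.25.71.5", "207.25.71.20"):
    print(ip, counts_as_site(ip, REVERSE.get, FORWARD.__getitem__))
```

With this rule a multi-hosted site contributes exactly one valid reply no matter which of its addresses happens to be sampled.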
Issues
Virtual hosting
- Multiple sites on the same IP
- Find the average number of hosted sites per IP
- 7.4M sites on 3.4M IPs by polling all available site names [Netcraft, 2000]

Other ports?
Temporarily unavailable sites?
Where Are They Located?
What Language?
(Based on Web sites)
Questions?
How Many Web Pages?
Infinite number of URLs

Sampling based?
- T: All URLs
- S: Sampled URLs
- V: Valid replies

$$\text{number of pages} \approx |T| \cdot \frac{|V|}{|S|}$$
How Many Web Pages?
Solution 1: Estimate the average number of pages per site:
(average no of pages) * (total no of sites)

Algorithm:
- For each site with valid reply, download all pages
- Take average

Result [LG99]: 289 pages per site, 2.8M sites → about 800M pages
Issues
A small number of sites with TONS of pages
- Very likely to miss these sites
- Lots of samples necessary

[Figure: No of Pages (0 to 1,000,000) vs. No of Sites (0 to 1000); annotation: 99.99% of the sites]
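To see why lots of samples are necessary, it helps to compute how likely a uniform sample of sites is to miss the heavy sites entirely. The sketch below reads the figure annotation as meaning that roughly 0.01% of sites hold the very large page counts; that reading, the function name, and the sample sizes are assumptions for illustration, and sites are assumed to be drawn independently.

```python
def miss_probability(heavy_fraction, sample_size):
    """Chance that a uniform sample of sites contains no heavy site at all."""
    return (1.0 - heavy_fraction) ** sample_size

HEAVY_FRACTION = 1e-4  # assumed: 0.01% of sites, the complement of 99.99%

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} sampled sites: miss probability {miss_probability(HEAVY_FRACTION, n):.5f}")
```

Under this assumption a sample of 1,000 sites misses every heavy site about nine times out of ten, so the per-site average would be badly underestimated.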
How Many Pages?
Solution 2: Sampling-based
- T: All pages
- B: Base set
- S: Random samples

$$\frac{|B|}{|T|} = \frac{|B \cap S|}{|S|}$$
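Solving the ratio above for |T| gives |T| = |B| * |S| / |B ∩ S|, which can be sketched as follows. The toy "Web" of 100,000 page ids, the base set, the sample size, and the seed are all hypothetical; the sample is assumed to be uniform over T.

```python
import random

def estimate_total(base_size, sample, in_base):
    """Estimate |T| from |B| / |T| = |B ∩ S| / |S|."""
    hits = sum(1 for page in sample if in_base(page))  # |B ∩ S|
    if hits == 0:
        raise ValueError("no sampled page is in the base set; sample more pages")
    return base_size * len(sample) / hits

# Hypothetical toy Web: page ids 0..99,999, base set B = the first 35,000 ids.
rng = random.Random(1)
total_pages = 100_000
base = set(range(35_000))
sample = [rng.randrange(total_pages) for _ in range(5_000)]

estimate = estimate_total(len(base), sample, base.__contains__)
print(f"estimated |T|: {estimate:,.0f} (true size of the toy Web: {total_pages:,})")
```

The estimator only needs the size of B and a membership test for it, which is what makes a search engine index usable as the base set.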
Related Question
How many deer in Yosemite National Park?
Random Page?
Idea: Random walk
- Start from the Yahoo home page
- Follow random links, say 10,000 times
- Select the page

Problem: Biased to “popular” pages, e.g., Microsoft, Google
Random Page?
Random walks on a regular, undirected graph → uniform random sample
- Regular graph: an equal number of edges for all nodes
- After $O\left(\frac{1}{\lambda}\log N\right)$ steps
  - λ: depends on the graph structure
  - N: number of nodes

Idea:
- Transform the Web graph to a regular, undirected graph
- Perform a random walk
Ideal Random Walk
Generate the regular, undirected graph:
- Make edges undirected
- Decide d, the maximum # of edges per page: say, 300,000
- If edge(n) < 300,000, then add self-loops

Perform random walks on the graph
- λ (the graph-dependent parameter) ≈ 10^-5 for the 1996 Web, N ≈ 10^9
- 3,000,000 steps, but mostly self-loops
- About 100 actual walk steps
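The effect of the self-loops can be checked on a tiny graph by tracking the walk's probability distribution step by step. On the plain undirected graph the walk settles on probabilities proportional to node degree, while on the padded, regular graph it settles on the uniform distribution. The four-node graph, d = 4, and the helper names are hypothetical choices for illustration.

```python
def walk_distribution(neighbors, start, steps=500):
    """Distribution of a random walk after `steps` steps; neighbors[n] lists n's edge endpoints."""
    dist = {n: 0.0 for n in neighbors}
    dist[start] = 1.0
    for _ in range(steps):
        nxt = {n: 0.0 for n in neighbors}
        for n, p in dist.items():
            for m in neighbors[n]:            # each edge is followed with equal probability
                nxt[m] += p / len(neighbors[n])
        dist = nxt
    return dist

def pad_with_self_loops(neighbors, d):
    """Add self-loops so that every node has exactly d edges."""
    if any(len(adj) > d for adj in neighbors.values()):
        raise ValueError("d must be at least the maximum degree")
    return {n: adj + [n] * (d - len(adj)) for n, adj in neighbors.items()}

# Hypothetical undirected graph: a triangle a-b-c plus a pendant node d attached to a.
GRAPH = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}

plain = walk_distribution(GRAPH, "a")
regular = walk_distribution(pad_with_self_loops(GRAPH, 4), "a")
print("plain walk:  ", {n: round(p, 3) for n, p in plain.items()})
print("regular walk:", {n: round(p, 3) for n, p in regular.items()})
```

The low-degree node d carries the most self-loops, so the padded walk spends most of its time standing still there, which is the price paid for uniformity.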
Different Interpretation
Random walk on irregular Web graph
- High chance to be at a “popular” node at a particular time
- Increase the chance to be at an “unpopular” node by staying there longer through self-loops

[Figure: a popular node and unpopular nodes]
Issues
How to get edges to/from node n?
- Edges discovered so far
- From search engines, like AltaVista, HotBot
- Still limited incoming links
WebWalker [BBCF00]
Our graph does not have to be the same as the real Web

Construct regular undirected graphs while performing the random walk
- Add new node n when it visits n
- Find edges for node n at that time
  1. Edges discovered so far
  2. From search engines
- Add self-loops as necessary
- Ignore any more edges to n later
WebWalker
[Figure: example WebWalker graph with d = 5]
WebWalker
Why ignore “new incoming” edges?
- Make the graph regular
- “Discovered parts” of the graph do not change
- “Uniformity theorem” still holds

Can we arrive at “all reachable” pages?
- We ignore only the edges to “visited nodes”

Can we use the same λ (the graph-dependent parameter of the step bound)? No
WebWalker results
Size of the Web
- AltaVista: |B| = 250M
- |B ∩ S|/|S| = 35%
- |T| = 720M

Avg page size: 12K
Avg no of out-links: 10
WebWalker results
Pages by domain
- .com: 49%
- .edu: 8%
- .org: 7%
- .net: 6%
- .de: 4%
- .jp: 3%
- .uk: 3%
- .gov: 2%
What About Other Web Pages?
Pages that are
- Available within corporate Intranet
- Protected by authentication
- Not reachable through following links
  - E.g., pages within e-commerce sites

Deep Web vs Hidden Web
- Information reachable through search interface
- What if a page is reachable both through links and search interface?
Size of Deep Web?
Estimation: (Avg no of records per site) * (Total no of Deep Web sites)

How to estimate?
- By sampling
Size of Deep Web?
Total # of Deep Web sites: |BS|/|S|
Avg no of records per site:
- Contact the site directly
- Use a query such as “NOT zzxxyyxx” if the site reports the number of matches (a negated nonsense term should match every record)
Size of Deep Web
BrightPlanet report
- Avg no of records per site: 5 million
- Total no of Deep Web sites: 200,000
- Avg size of a record: 14KB
- Size of the Deep Web: 10^16 bytes (10 petabytes)
- 1000 times larger than the “Surface Web”
How to access it?
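The BrightPlanet size figure follows from multiplying the three reported averages. A quick check, taking 1KB as 1000 bytes (an assumption; the slide does not say which convention is used):

```python
records_per_site = 5_000_000
deep_web_sites = 200_000
bytes_per_record = 14 * 1000  # 14KB per record, assuming 1KB = 1000 bytes

total_records = records_per_site * deep_web_sites
total_bytes = total_records * bytes_per_record

print(f"{total_records:.1e} records, {total_bytes:.1e} bytes")
```

The product is on the order of 10^16 bytes, in line with the rounded figure of 10 petabytes quoted in the report.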
Web Characteristics
- Size of the Web
- Search engines
- Link structure of the Web
Search Engines
- Coverage
- Overlap
- Dead links
- Indexing delay
Coverage?
Q: How to estimate coverage?
A: Create a random sample and measure how many of them are indexed by a search engine

In 1999
- Estimated Web size: 800M
- Reported indexed pages: 128M (Northern Light)
- → 16%

No reliable Web size estimate at this point
- Search engines often claim ~20B index
Overlap?
How many pages are commonly indexed?

Method 1
- Create a random sample and measure how many are indexed only by A or B and commonly by A and B

Method 2
- Send common queries, compare returned pages, and measure overlap

Result from method 2: Little overlap
- E.g., Infoseek and AltaVista: 20% overlap [Bharat and Broder 1997]

Is it still true?
- Results seem to converge
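Method 1 can be sketched with plain set membership: each sampled page is classified by which engine indexes it. The two index sets and the sample below are hypothetical, and defining the overlap as "pages in both, divided by pages in at least one" is an assumption; other definitions, such as the share of A's pages that B also indexes, are equally plausible.

```python
def overlap_stats(sample, index_a, index_b):
    """Split a page sample by which of two engines indexes each page."""
    only_a = sum(1 for p in sample if p in index_a and p not in index_b)
    only_b = sum(1 for p in sample if p in index_b and p not in index_a)
    both = sum(1 for p in sample if p in index_a and p in index_b)
    covered = only_a + only_b + both
    # Assumed definition of overlap: pages in both / pages in at least one engine.
    overlap = both / covered if covered else 0.0
    return {"only_a": only_a, "only_b": only_b, "both": both, "overlap": overlap}

# Hypothetical sample of ten page ids and two toy indexes.
sample = list(range(10))
index_a = {0, 1, 2, 3}
index_b = {2, 3, 4, 5, 6, 7}

stats = overlap_stats(sample, index_a, index_b)
print(stats)
```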
Dead Links?
Q: How can we measure what fraction of pages in search engines are dead?
A: Issue random queries and check whether the returned pages are dead

Result in Feb 2000
- AltaVista: 13.7%
- Excite: 8.7%
- Google: 4.3%

Search engines have gotten much better due to better recrawling algorithms
- A topic for later study
How Early Do Pages Get Indexed?
Method 1:
- Create pages at random locations
- Check when they are available at search engines
- Cons: Difficult to create pages at random locations

Method 2:
- Repeatedly issue the same queries over time
- When a new page appears in the result, record the “last modified date”
- Cons: last modified date is only a “lower bound”
How Early are Pages Indexed?
Mean time [Lawrence and Giles 2000]
- Northern Light: 141 days
- AltaVista: 166 days
- HotBot: 192 days
How Stable Are the Sites?
- Monitor a set of random sites
- Percentage of Web servers available:

| Year      | 1998 | 1999 | 2000 | 2001 | 2002 |
|-----------|------|------|------|------|------|
| Available | 100% | 56%  | 35%  | 25%  | 13%  |

(similar results for other years)
Web Characteristics
- Size of the Web
- Search engines
- Link structure of the Web
Web As A Graph
- Page: Node
- Link: Edge
Link Degree
- How many links?
- In-degree: power law

$$(\text{No of pages}) \propto \frac{1}{(\text{No of links})^{2.1}}$$

- Why consistently 2.1?
Link Degree
Out-degree

$$(\text{No of pages}) \propto \frac{1}{(\text{No of links})^{2.7}}$$
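A power law of this form is a straight line on a log-log plot, so its exponent can be read off as the negated slope of log(No of pages) against log(No of links). The sketch below fits that slope by ordinary least squares on a synthetic histogram constructed to follow the law exactly. The function name and the synthetic data are illustrative, and on real, noisy degree data this simple fit is known to be a crude estimator.

```python
import math

def fit_exponent(degree_counts):
    """Fit count ∝ degree^(-gamma) by least squares on log-log values; returns gamma."""
    points = [(math.log(d), math.log(c)) for d, c in degree_counts.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return -slope

# Synthetic histogram: number of pages with each link count from 1 to 1000.
synthetic = {d: 1e9 * d ** -2.1 for d in range(1, 1001)}

gamma = fit_exponent(synthetic)
print(f"fitted exponent: {gamma:.3f}")
```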
Large-Scale Structure?
- Study by AltaVista & IBM, 1999
- Based on 200M pages downloaded by the AltaVista crawler
- “Bow-tie” result based on two experiments
Experiment 1: Strongly Connected Components
Strongly connected component (SCC): C is a strongly connected component if:
- ∀ a, b ∈ C, there are paths from a → b and from b → a

[Figure: two example graphs on nodes a, b, c; SCC? Yes for one, No for the other]
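For a concrete view of the definition, the sketch below finds strongly connected components with Tarjan's algorithm, a standard technique that the slides do not name. The two three-node graphs are hypothetical examples rather than the ones in the figure: a directed cycle forms one SCC, while a directed chain does not. The recursive form is fine for toy graphs, but a crawl-sized graph would need an iterative version.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to a list of successor nodes."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = low[v] = len(index)   # discovery order doubles as the node's index
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of a component: pop it off
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components

cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a -> b -> c -> a
chain = {"a": ["b"], "b": ["c"], "c": []}      # a -> b -> c

print(strongly_connected_components(cycle))
print(strongly_connected_components(chain))
```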
Result 1: SCC
- Identified all SCCs from 200M pages
- Biggest SCC: 50M (25%)
- Other SCCs are small
  - Second largest: 150K
  - Mostly fewer than 1000 nodes
Experiment 2: Reachability
How many pages can we reach starting from a random page?

Experiment
- Pick 500 random pages
- Follow links in a breadth-first manner until no more links
- Repeat the same experiment following links in the “reverse direction”
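The reachability experiment amounts to a breadth-first search from a start page, run once on the link graph and once on the graph with every link reversed. A minimal sketch on a hypothetical four-page graph (the names `WEB`, `reachable`, and `reverse` are illustrative):

```python
from collections import deque

def reachable(graph, start):
    """Set of pages reachable from start by following links breadth-first."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Graph with every link reversed, for following in-links."""
    rev = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

# Hypothetical toy Web: i -> s1 <-> s2 -> o
WEB = {"i": ["s1"], "s1": ["s2"], "s2": ["s1", "o"], "o": []}

forward = reachable(WEB, "s1")
backward = reachable(reverse(WEB), "s1")
print("forward: ", sorted(forward))
print("backward:", sorted(backward))
```

Repeating this from many random start pages and recording the size of each reachable set gives the kind of distribution the experiment reports.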
Result 2: Reachability
Out-links (forward direction)
- 50% reaches 100M
- 50% reaches fewer than 1000
Result 2: Reachability
In-links (reverse direction)
- 50% reaches 100M
- 50% reaches fewer than 1000
What Can We Conclude?
- 50M (25%) SCC

[Figure: SCC (50M, 25%)]
What Can We Conclude?
How many nodes would we reach from SCC?
- Clearly not 1000, then 100M
- 50M more pages reachable from SCC (no way back, though)

[Figure: SCC (50M, 25%), Out (50M, 25%)]
What Can We Conclude?
Similar result for “in-links” when we followed links backwards
- 50M more pages reachable by following in-links

[Figure: In (50M, 25%), SCC (50M, 25%), Out (50M, 25%)]
What Can We Conclude?
- 25% Miscellaneous

[Figure: In (50M, 25%), SCC (50M, 25%), Out (50M, 25%), Miscellaneous (50M, 25%)]
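The four regions can be derived from forward and backward reachability alone. Starting from a page assumed to lie in the core, SCC is what is reachable in both directions, Out is reachable only forward, In only backward, and everything else is miscellaneous. The toy graph, the choice of `core_node`, and the function names below are hypothetical.

```python
def reach(graph, start):
    """Nodes reachable from start by following edges (depth-first, iterative)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(graph, core_node):
    """Split nodes into SCC / OUT / IN / MISC relative to the component of core_node."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rev = {}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    fwd = reach(graph, core_node)   # following out-links
    bwd = reach(rev, core_node)     # following in-links
    scc = fwd & bwd
    return {"SCC": scc, "OUT": fwd - scc, "IN": bwd - scc, "MISC": nodes - fwd - bwd}

# Hypothetical toy Web: i -> s1 <-> s2 -> o, plus a disconnected pair x -> y.
WEB = {"i": ["s1"], "s1": ["s2"], "s2": ["s1", "o"], "o": [], "x": ["y"], "y": []}

parts = bow_tie(WEB, "s1")
for name, members in parts.items():
    print(name, sorted(members))
```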
Questions
- How did they “crawl” 50M In and 50M Misc nodes in the first place?
- There may be many more In and Misc nodes that were not crawled (25% is a lower bound)
- Only 25% SCC is surprising (will be explained)
SCC
If there are only two links, A → B and B → A, then A and B become one SCC.

[Figure: A ⇄ B]
Links between In, SCC and Out
- No single link from SCC to In
- No single link from Out to SCC
- At least 50% of the Web “unknown” to the core SCC
Diameter of SCC
On average, 16 links between two nodes in SCC
The “maximum distance” (diameter) is at least 28
Questions?
More Sources For Web Characteristics
- OCLC (Online Computer Library Center): http://wcp.oclc.org
- Netcraft Survey: http://www.netcraft.com/survey/
- NEC Web Analysis: http://www.webmetrics.com
How To Sample?
Method 1: Take the last page and repeat
Many “wasted” visits
How To Sample?
Method 2: Take last k pages
Are they random samples?
How To Sample?
Theorem: If k is large enough, they are approximately random pages
- Intuition: If we visit many pages, we visit all different pages
How To Sample?
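Goal: Estimate A/N by m/k. Make A/N ~ m/k, i.e.,

$$\Pr\left[\,1-\epsilon \le \frac{m/k}{A/N} \le 1+\epsilon\,\right] \ge 1-\delta
\quad\text{if}\quad
k = O\!\left(\frac{1}{\lambda}\cdot\frac{1}{A/N}\cdot\frac{1}{\epsilon^{2}}\cdot\log\frac{1}{\delta}\right)$$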
How To Sample?
Assuming A is 20% of the Web
- ε = 0.1: less than 10% error
- δ = 0.01: 99% confidence
- λ = 10^-5: the value from the 1996 Web crawl

k = 350,000,000
- About 12,000 non-self-loop steps
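The order of magnitude of k can be reproduced under a few stated assumptions: the hidden constant in the bound is taken as 1, the logarithm is base 2, and the three parameters with values 0.1, 0.01, and 10^-5 are named `eps`, `delta`, and `lam` here. The share of real (non-self-loop) steps is approximated as average degree over d, using the roughly 10 links per page and d = 300,000 quoted on earlier slides.

```python
import math

def walk_steps(fraction, eps, delta, lam):
    """(1/lam) * (1/fraction) * (1/eps^2) * log2(1/delta), taking the O() constant as 1."""
    return (1 / lam) * (1 / fraction) * (1 / eps**2) * math.log2(1 / delta)

k = walk_steps(fraction=0.2, eps=0.1, delta=0.01, lam=1e-5)

AVG_DEGREE = 10        # average number of links per page, from the WebWalker results
MAX_DEGREE = 300_000   # d, the number of edges per node after padding with self-loops
real_steps = k * AVG_DEGREE / MAX_DEGREE

print(f"k is about {k:.2e} steps, of which about {real_steps:,.0f} are not self-loops")
```

This lands at about 3.3 × 10^8 steps, the same order as the 350,000,000 on the slide; rounding the logarithm up to 7 would give that figure exactly.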