![Page 1: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/1.jpg)
WebGather Design and Implementation
Hongfei Yan
Network Group, CST, PKU, Dec. 15, 2000
Email: [email protected]
http://net.cs.pku.edu.cn/~yhf
![Page 2: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/2.jpg)
Outline
Introduction to search engines
WebGather
Conclusion
![Page 3: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/3.jpg)
Introduction: http://www.yahoo.com/
![Page 4: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/4.jpg)
Introduction: http://sohu.com/
![Page 5: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/5.jpg)
Introduction: http://sina.com.cn/
![Page 6: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/6.jpg)
Introduction: http://www.google.com/
![Page 7: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/7.jpg)
Introduction: http://e.pku.edu.cn/
![Page 8: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/8.jpg)
Introduction: Search Engine Sizes (Search Engine Watch, Nov. 8, 2000)
GG=Google, WT=WebTop.com, AV=AltaVista, FAST=FAST, NL=Northern Light, EX=Excite, INK=Inktomi, Go=Go (Infoseek)
![Page 9: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/9.jpg)
Introduction: a new study by Inktomi and the NEC Research Institute, Inc., Feb. 2000
Number of indexable pages on the web: over 1 billion
Number of servers discovered: 6,409,521
Number of mirrors in servers discovered: 1,457,946
Number of sites (total servers minus mirrors): 4,951,247
Number of good sites (reachable over a 10-day period): 4,217,324
Number of bad sites (unreachable): 733,923
Average web pages per good site: 1,000,000,000 / 4,217,324 = 237.1
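
The per-site average above can be reproduced with a few lines of Python. This is only a check of the slide's arithmetic: "over 1 billion" is taken as exactly 1,000,000,000, and the variable names are illustrative.

```python
# Figures quoted on the slide (Inktomi / NEC Research Institute study, Feb. 2000).
indexable_pages = 1_000_000_000  # "over 1 billion", taken as exactly 1e9 here
good_sites = 4_217_324           # reachable over a 10-day period
bad_sites = 733_923              # unreachable

# Good and bad sites together give the total number of sites on the slide.
total_sites = good_sites + bad_sites

# Average number of indexable pages per reachable site.
pages_per_good_site = indexable_pages / good_sites

print(f"total sites: {total_sites:,}")
print(f"pages per good site: {pages_per_good_site:.1f}")
```

Because the page count is a lower bound, the resulting average is best read as a rough estimate.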
![Page 10: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/10.jpg)
Introduction:
Inktomi Search Engine cluster
In the picture: 9 × 8 × 2 = 144
![Page 11: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/11.jpg)
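WebGather: Introduction
The "Tianwang" (WebGather) Chinese-English search engine system, developed by the Network and Distributed Systems Laboratory of the Department of Computer Science at Peking University, is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Coding and Distributed Chinese-English Information Discovery". It officially began providing web information navigation services to Internet users on CERNET on October 29, 1997. While serving the public, the Tianwang system has widely adopted users' comments and suggestions and has continually improved its service quality; to date it has received more than 8 million visits. The Tianwang search engine research group, newly formed in early 2000 and funded by the National 973 Key Basic Research Development Program, carries on the fine tradition of the original development team and is committed to exploring and studying the key technologies of Chinese-English search engine systems, in order to provide users with faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. We welcome comments and suggestions from all users.
http://e.pku.edu.cn/ "Though we lack the paired wings of the bright phoenix, our hearts are linked as one."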
![Page 12: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/12.jpg)
WebGather: as of Dec. 1, 2000
2.5 million scale
Index of 2.5 million web pages
More than 200,000 web pages every day
Ten days to update all data
Three PCs
![Page 13: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/13.jpg)
WebGather: Design goals for a distributed web-crawling system
Collect all the web pages in China
Keep pace with the rapid growth of Chinese web information
238 × 40,000 = 9,520,000
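
The target size can be set against the gathering rate reported for Dec. 1, 2000 to see why a distributed crawler is needed. The sketch below assumes that 238 is the rounded average number of pages per site from the study quoted earlier and that 40,000 is a count of sites; the slide states neither explicitly, and the variable names are illustrative.

```python
pages_per_site = 238     # assumption: rounded per-site average from the study
sites = 40_000           # assumption: the slide's 40,000 is a number of sites
pages_per_day = 200_000  # "more than 200,000 web pages every day"

# Target collection size, matching the product on the slide.
target_pages = pages_per_site * sites

# Days needed for one full pass at the reported daily rate.
days_for_full_pass = target_pages / pages_per_day

print(f"target pages: {target_pages:,}")
print(f"days for one full pass: {days_for_full_pass:.1f}")
```

Under these assumptions one pass over the target would take well over a month at the current rate, compared with the ten-day update cycle reported for 2.5 million pages.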
![Page 14: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/14.jpg)
WebGather 2.0: architecture
Diagram components: WWW, Gatherer, Gather Database, Indexer, Retrieve Database, Retriever, Client, Client log database, User behavior
![Page 15: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/15.jpg)
WebGather 1.2: architecture of gather subsystem 1/4
Main Control
Gather1, Gather2, …, GatherN
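
One way a main control process could divide work among N gatherers is to partition URLs by host, so that each host is always handled by the same gatherer. The slides do not say how WebGather assigns URLs; the hash-based policy and the names `assign_gatherer` and `dispatch` below are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_gatherer(url: str, num_gatherers: int) -> int:
    """Map a URL to a gatherer index using a stable hash of its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_gatherers


def dispatch(urls, num_gatherers: int) -> dict:
    """Group URLs into one work queue per gatherer (hypothetical main control)."""
    queues = defaultdict(list)
    for url in urls:
        queues[assign_gatherer(url, num_gatherers)].append(url)
    return dict(queues)


queues = dispatch(
    [
        "http://e.pku.edu.cn/",
        "http://e.pku.edu.cn/help.html",
        "http://www.google.com/",
    ],
    num_gatherers=3,
)
```

Keeping a host on a single gatherer makes it easier to limit the request rate per site, at the cost of uneven load when a few hosts are very large.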
![Page 16: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/16.jpg)
WebGather 2.0: architecture of gather subsystem 1/4
![Page 17: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/17.jpg)
WebGather: technologies in gather subsystem 1/4
Distributed system architecture
High availability
……
Load balancing
Low bandwidth
Scalability
Re-configurability
……
Word segmentation ("cut words")
Position relativity
Anchor text, link popularity
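
"Cut words" refers to splitting Chinese text, which has no spaces between words, into terms that can be indexed. The slides do not name the algorithm used in WebGather; the sketch below uses forward maximum matching against a small dictionary, a common baseline, purely as an assumption. The function name `cut_words` and the sample dictionary are hypothetical.

```python
def cut_words(text: str, dictionary: set, max_len: int = 4) -> list:
    """Segment text by forward maximum matching.

    At each position, take the longest dictionary word (up to max_len
    characters); fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


sample_dictionary = {"北京", "大学", "北京大学", "搜索", "引擎"}
segments = cut_words("北京大学搜索引擎", sample_dictionary)
print(segments)
```

Preferring the longest match keeps "北京大学" as one term instead of splitting it into "北京" and "大学".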
![Page 18: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/18.jpg)
WebGather: architecture of indexer subsystem 2/4
Diagram A: each web page (webpage1, webpage2, …, webpageK, …, webpageN) lists its features (feature1, feature2, feature3, …)
Diagram B: each feature (feature1, feature2, …, featureK, …, featureN) lists the web pages that contain it (webpage1, webpage2, webpage3, …)
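
The two structures in the indexer diagram read as a page-to-feature table (A) and a feature-to-page table (B). A minimal sketch of deriving B from A is shown below; the sample data and the function name `invert` are illustrative and are not taken from WebGather.

```python
from collections import defaultdict


def invert(forward_index: dict) -> dict:
    """Build a feature -> pages mapping from a page -> features mapping."""
    inverted = defaultdict(list)
    for page, features in forward_index.items():
        # dict.fromkeys drops repeated features while keeping their order.
        for feature in dict.fromkeys(features):
            inverted[feature].append(page)
    return dict(inverted)


# Illustrative page -> features table in the style of diagram A.
forward = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature1", "feature2", "feature3"],
    "webpage3": ["feature2"],
}
inverted = invert(forward)
print(inverted)
```

A retriever can then answer a one-term query with a single dictionary lookup instead of scanning every page.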
![Page 19: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/19.jpg)
WebGather: technologies in retriever subsystem 3/4
Traditional IR (VSM)
Query cache, hot click
Word segmentation ("cut words")
Anchor text, link popularity
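
In the vector space model (VSM), documents and queries are represented as term-weight vectors and ranked by the cosine of the angle between them. The slides do not give WebGather's weighting scheme; the TF-IDF weights and the names `tfidf_vectors`, `cosine`, and `rank` below are assumptions chosen to keep the sketch small.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict):
    """Return per-document TF-IDF vectors and the IDF table."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {
        name: {t: c * idf[t] for t, c in Counter(terms).items()}
        for name, terms in docs.items()
    }
    return vectors, idf


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms: list, docs: dict) -> list:
    """Rank document names by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: c * idf[t] for t, c in Counter(query_terms).items() if t in idf}
    scores = {name: cosine(query, vec) for name, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "d1": ["search", "engine", "web"],
    "d2": ["web", "crawler"],
    "d3": ["chinese", "news"],
}
ranking = rank(["search", "engine"], docs)
```

Here `d1` ranks first because it is the only document that shares terms with the query. In a real retriever the document vectors would be built once at index time, not on every query.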
![Page 20: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/20.jpg)
WebGather: technologies in user behavior subsystem 4/4
Link popularity
Replica popularity
User popularity
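
Link popularity is commonly measured by how many distinct pages link to a page. The slides do not define WebGather's measure, so the in-link count below is an assumed, simplified stand-in; the function name `link_popularity` and the sample URLs are hypothetical.

```python
from collections import Counter


def link_popularity(links) -> Counter:
    """Count distinct in-links per target page.

    `links` is an iterable of (source_url, target_url) pairs; duplicate
    pairs and self-links are ignored.
    """
    unique_links = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_links)


popularity = link_popularity(
    [
        ("http://a.example/", "http://c.example/"),
        ("http://b.example/", "http://c.example/"),
        ("http://a.example/", "http://c.example/"),  # duplicate pair
        ("http://c.example/", "http://c.example/"),  # self-link
        ("http://c.example/", "http://a.example/"),
    ]
)
print(popularity.most_common())
```

Counting distinct sources instead of raw links keeps one page from inflating a target's score by linking to it many times.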
![Page 21: WebGather Design and Implementation](https://reader033.vdocuments.site/reader033/viewer/2022061603/56812d16550346895d920055/html5/thumbnails/21.jpg)
Conclusion:
Search engines are becoming more and more important.
The web is a good experimental subject; we can do a lot of R&D on it.