WebGather Design and Implementation. Hongfei Yan, Network Group, CST, PKU, Dec. 15, 2000. Email: yhf@net.cs.pku.edu.cn. http://net.cs.pku.edu.cn/~yhf

Page 1: WebGather Design and Implementation

WebGather Design and Implementation

Hongfei Yan

Network Group, CST, PKU, Dec. 15, 2000

Email: yhf@net.cs.pku.edu.cn

http://net.cs.pku.edu.cn/~yhf

Page 2: WebGather Design and Implementation

Outline

Introduction to search engines

WebGather

Conclusion

Page 3: WebGather Design and Implementation

Introduction: http://www.yahoo.com/

Page 4: WebGather Design and Implementation

Introduction: http://sohu.com/

Page 5: WebGather Design and Implementation

Introduction: http://sina.com.cn/

Page 6: WebGather Design and Implementation

Introduction: http://www.google.com/

Page 7: WebGather Design and Implementation

Introduction: http://e.pku.edu.cn/

Page 8: WebGather Design and Implementation

Introduction: Search Engine Sizes (Search Engine Watch, Nov. 8, 2000)

Chart legend: GG = Google, WT = WebTop.com, AV = AltaVista, FAST = FAST, NL = Northern Light, EX = Excite, INK = Inktomi, Go = Go (Infoseek)

Page 9: WebGather Design and Implementation

Introduction: a new study by Inktomi and the NEC Research Institute, Inc., Feb. 2000

Number of indexable pages on the web: over 1 billion

Number of servers discovered: 6,409,521

Number of mirrors in servers discovered: 1,457,946

Number of sites (total servers minus mirrors): 4,951,247

Number of good sites (reachable over a 10-day period): 4,217,324

Number of bad sites (unreachable): 733,923

Web pages per good site: 1,000,000,000 / 4,217,324 ≈ 237.1
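The per-site average on this slide can be checked with a few lines of Python. The figures are the ones quoted on the slide; the variable names are illustrative only, and "over 1 billion" is treated as exactly one billion, so the result is a lower bound.

```python
# Figures quoted on the slide (Inktomi / NEC Research Institute study).
indexable_pages = 1_000_000_000   # "over 1 billion", used here as a lower bound
good_sites = 4_217_324            # reachable over a 10-day period
bad_sites = 733_923               # unreachable

# Good and bad sites together give the total number of sites on the slide.
total_sites = good_sites + bad_sites

# Average number of indexable pages per reachable site.
pages_per_good_site = indexable_pages / good_sites

print(total_sites)                     # 4951247
print(round(pages_per_good_site, 1))   # 237.1
```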

Page 10: WebGather Design and Implementation

Introduction:

Inktomi Search Engine cluster

In the picture: 9 × 8 × 2 = 144

Page 11: WebGather Design and Implementation

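WebGather: Introduction

The "Tianwang" (天网) Chinese-English search engine system, developed by the Networks and Distributed Systems Laboratory of the Department of Computer Science at Peking University, is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Encoding and Distributed Chinese-English Information Discovery". It began providing web information navigation services to Internet users on CERNET on October 29, 1997. While in service, the Tianwang system has widely adopted users' comments and suggestions and continually improved its service quality; to date it has received more than 8 million visits. The Tianwang search engine project group, newly formed in early 2000 and funded by the National 973 Key Basic Research Development Program, carries on the fine tradition of the original development team and is devoted to exploring and researching the key technologies of Chinese-English search engine systems, in order to provide users with faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. We welcome users' comments and suggestions.

http://e.pku.edu.cn/ Though we have no wings of the bright phoenix to fly side by side, our hearts are linked as one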

Page 12: WebGather Design and Implementation

WebGather: as of Dec. 1, 2000

2.5 million scale

Index 2.5 million web pages

More than 200,000 web pages every day

Ten days to update all data

Three PCs

Page 13: WebGather Design and Implementation

WebGather: Design goals for a distributed web-crawling system

Collect all the web pages in China

Keep pace with the rapid growth of Chinese web information

238 × 40,000 = 9,520,000 (about 238 pages per site, the rounded per-site average from the study, times what is presumably 40,000 sites)
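The estimate on this slide can be worked through as below. This is only a sketch under stated assumptions: 40,000 is read as a number of sites (the slide does not label it), and the "more than 200,000 web pages every day" figure from the Dec. 1, 2000 slide is read as a gathering rate. Under those assumptions, one full pass over the target collection would take well over the ten days quoted for the 2.5 million page index, which is consistent with the stated goal of a distributed crawler.

```python
pages_per_site = 238       # rounded per-site average from the study slide
assumed_sites = 40_000     # assumption: interpreted as a count of sites
pages_per_day = 200_000    # assumption: daily figure read as a gathering rate

# Target collection size, as computed on the slide.
target_pages = pages_per_site * assumed_sites

# Days needed for one full pass at the assumed rate.
days_for_full_pass = target_pages / pages_per_day

print(target_pages)          # 9520000
print(days_for_full_pass)    # 47.6
```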

Page 14: WebGather Design and Implementation

WebGather 2.0: architecture

[Figure: architecture diagram. Component labels: WWW, Gatherer, Gather Database, Indexer, Retrieve Database, Retriever, Client, Client log database, User behavior]
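One plausible reading of how the components named on this slide fit together is sketched below. The data flow (Gatherer fills the Gather Database, Indexer builds the Retrieve Database, Retriever answers Client queries, and clicks are logged for the user-behavior step) is an assumption inferred from the component names, not something the slide states. All function names, the in-memory dictionaries, and the example.org URLs are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the stores named on the slide.
gather_db = {}                    # url -> raw page text      ("Gather Database")
retrieve_db = defaultdict(set)    # term -> set of urls       ("Retrieve Database")
client_log = []                   # (query, clicked_url)      ("Client log database")

def gatherer(fetched_pages):
    """Store fetched pages; a real gatherer would download them from the WWW."""
    gather_db.update(fetched_pages)

def indexer():
    """Build the term -> urls mapping from everything in the gather database."""
    for url, text in gather_db.items():
        for term in text.lower().split():
            retrieve_db[term].add(url)

def retriever(query):
    """Return the urls that contain every query term, in sorted order."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(retrieve_db.get(t, set()) for t in terms))
    return sorted(hits)

def client(query, click=None):
    """Issue a query and log the optional click for later behavior analysis."""
    results = retriever(query)
    client_log.append((query, click))
    return results

gatherer({
    "http://example.org/a": "distributed web crawling",
    "http://example.org/b": "web search engine",
})
indexer()
print(client("web", click="http://example.org/b"))
print(client("web crawling"))
```

The first query returns both example pages and the second returns only the page that contains both terms.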

Page 15: WebGather Design and Implementation

WebGather 1.2: architecture of gather subsystem 1/4

[Figure: diagram labels: Main Control; Gather1, Gather2, …, GatherN]
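The slide shows only a main control and N gatherers; how work is divided among them is not stated. The sketch below illustrates one common policy, assumed here for illustration: the main control assigns each URL to a gatherer by hashing its host name, so all pages of a site go to the same gatherer. The function names are hypothetical, and `zlib.crc32` is used because, unlike Python's built-in `hash`, it gives the same value in every run.

```python
import zlib
from urllib.parse import urlsplit

def assign_gatherer(url, num_gatherers):
    """Pick a gatherer index for a URL by hashing its host name.

    Hashing the host (an assumed policy) keeps every page of one site on
    the same gatherer, so per-site rate limits can be enforced locally.
    """
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_gatherers

def dispatch(urls, num_gatherers):
    """Main-control step: split a list of URLs into one queue per gatherer."""
    queues = [[] for _ in range(num_gatherers)]
    for url in urls:
        queues[assign_gatherer(url, num_gatherers)].append(url)
    return queues

urls = [
    "http://www.pku.edu.cn/",
    "http://www.pku.edu.cn/about.html",
    "http://e.pku.edu.cn/",
    "http://www.sina.com.cn/",
]
for i, queue in enumerate(dispatch(urls, 3), start=1):
    print(f"Gather{i}: {queue}")
```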

Page 16: WebGather Design and Implementation

WebGather 2.0: architecture of gather subsystem 1/4

Page 17: WebGather Design and Implementation

WebGather: technologies in gather subsystem 1/4

Distributed system architecture

High availability

……

Load balance

Low bandwidth

Scalability

Re-configurability

……

Cut words (Chinese word segmentation)

Position relativity

Anchor text, link popularity
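Two of the techniques listed on this slide, anchor text and link popularity, can be illustrated with Python's standard `html.parser`. This is a minimal sketch, not WebGather's implementation: link popularity is taken here to mean the number of in-links to a page, which is an assumption, and the class name, function name, and example URLs are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorParser(HTMLParser):
    """Collect (absolute target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (target, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None

def analyze(pages):
    """Return (in-link counts, anchor texts per target) for {url: html}."""
    popularity = Counter()
    anchors = defaultdict(list)
    for url, html in pages.items():
        parser = AnchorParser(url)
        parser.feed(html)
        parser.close()
        for target, text in parser.links:
            popularity[target] += 1
            anchors[target].append(text)
    return popularity, anchors

pages = {
    "http://a.example/": '<a href="http://c.example/">PKU search</a>',
    "http://b.example/": '<a href="http://c.example/">Tianwang</a> <a href="/x">local</a>',
}
popularity, anchors = analyze(pages)
print(popularity["http://c.example/"])   # 2
print(anchors["http://c.example/"])      # ['PKU search', 'Tianwang']
```

The anchor texts collected for a target page can then be indexed as if they were part of that page, which lets a page be found by the words other pages use to describe it.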

Page 18: WebGather Design and Implementation

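WebGather: architecture of indexer subsystem 2/4

[Figure: two panels. Panel A lists web pages (webpage1, webpage2, …, webpageK, …, webpageN), each with its features (feature1, feature2, feature3, …). Panel B lists features (feature1, feature2, …, featureK, …, featureN), each with its web pages (webpage1, webpage2, webpage3, …).]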
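The two structures labeled A and B on this slide look like a forward index (each web page with its features) and an inverted index (each feature with its web pages). Assuming that reading, the conversion from A to B can be sketched as follows; the function name and the sample data are illustrative only.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn {page: [features]} into {feature: [pages]} with sorted page lists."""
    inverted = defaultdict(set)
    for page, features in forward_index.items():
        for feature in features:
            inverted[feature].add(page)
    return {feature: sorted(pages) for feature, pages in inverted.items()}

# Structure A: each web page with the features extracted from it.
forward_index = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature1", "feature2", "feature3"],
}

# Structure B: each feature with the web pages that contain it.
inverted_index = invert(forward_index)
print(inverted_index["feature1"])   # ['webpage1', 'webpage2']
print(inverted_index["feature3"])   # ['webpage2']
```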

Page 19: WebGather Design and Implementation

WebGather: technologies in retriever subsystem 3/4

Traditional IR (VSM)

Query cache, hot click

Cut words (Chinese word segmentation)

Anchor text, link popularity
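The vector space model (VSM) named on this slide ranks documents by the cosine of the angle between a query vector and each document vector. The sketch below is a generic textbook version, not WebGather's ranking function: the TF-IDF weighting with `log(1 + N/df)` is one common variant chosen here as an assumption, and whitespace splitting stands in for the word segmentation a Chinese-language engine would need.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (a dict) per document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n = len(docs)
    # Assumed smoothing variant: idf = log(1 + N / df).
    idf = {term: math.log(1 + n / count) for term, count in df.items()}
    vectors = {
        name: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for name, tokens in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, docs):
    """Return document names ranked by cosine similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {term: tf * idf[term] for term, tf in counts.items() if term in idf}
    scored = [(cosine(q, vec), name) for name, vec in vectors.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

docs = {
    "doc1": "web search engine",
    "doc2": "distributed web crawler",
    "doc3": "chinese poetry",
}
print(search("web crawler", docs))   # ['doc2', 'doc1']
```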

Page 20: WebGather Design and Implementation

WebGather: technologies in user behavior subsystem 4/4

Link popularity

Replica popularity

User popularity
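The slide names three popularity signals without defining them. The sketch below shows one way each could be computed, under these assumptions: link popularity is the number of in-links, replica popularity is the number of gathered pages with identical content, and user popularity is the number of clicks recorded in the client log. The function names, record layouts, and sample data are hypothetical.

```python
import hashlib
from collections import Counter

def link_popularity(links):
    """Count in-links per target from (source, target) pairs."""
    return Counter(target for _source, target in links)

def replica_popularity(pages):
    """For each URL, count how many gathered pages share its exact content."""
    digests = {
        url: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for url, body in pages.items()
    }
    copies = Counter(digests.values())
    return {url: copies[digest] for url, digest in digests.items()}

def user_popularity(click_log):
    """Count clicks per URL from (query, clicked_url) log records."""
    return Counter(url for _query, url in click_log if url)

links = [("u1", "u3"), ("u2", "u3"), ("u3", "u1")]
pages = {"u1": "same text", "u2": "same text", "u3": "other text"}
click_log = [("web", "u1"), ("search", "u1"), ("pku", None)]

print(link_popularity(links)["u3"])     # 2
print(replica_popularity(pages))        # {'u1': 2, 'u2': 2, 'u3': 1}
print(user_popularity(click_log)["u1"]) # 2
```

How the three signals are weighted against each other in ranking is not given on the slide, so no combined score is shown.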

Page 21: WebGather Design and Implementation

Conclusion:

Search engines are more and more important.

The Web is a good experimental subject; we can do a lot of R&D on it.