WebGather Design and Implementation

Post on 31-Dec-2015



WebGather Design and Implementation

Hongfei Yan, Network Group, CST, PKU, Dec. 15, 2000

Email: yhf@net.cs.pku.edu.cn

http://net.cs.pku.edu.cn/~yhf

Outline

Introduction of search engines

WebGather

Conclusion

Introduction: http://www.yahoo.com/

Introduction: http://sohu.com/

Introduction: http://sina.com.cn/

Introduction: http://www.google.com/

Introduction: http://e.pku.edu.cn/

Introduction: Search Engine Sizes (Search Engine Watch, Nov. 8, 2000)

GG = Google, WT = WebTop.com, AV = AltaVista, FAST = FAST, NL = Northern Light, EX = Excite, INK = Inktomi, Go = Go (Infoseek)

Introduction: a new study by Inktomi and the NEC Research Institute, Inc., Feb. 2000

Number of indexable pages on the web: over 1 billion

Number of servers discovered: 6,409,521

Number of mirrors in servers discovered: 1,457,946

Number of sites (total servers minus mirrors): 4,951,247

Number of good sites (reachable over 10 day period): 4,217,324

Number of bad sites (unreachable): 733,923

Web pages per site: 1,000,000,000 / 4,217,324 ≈ 237.1
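The per-site average can be checked with a few lines of Python. This is a minimal sketch: the variable and function names are illustrative, and the page count uses the study's "over 1 billion" as a round lower bound, so the result is itself only a rough estimate.

```python
# Figures quoted from the study on this slide.
indexable_pages = 1_000_000_000  # "over 1 billion", taken as a round lower bound
good_sites = 4_217_324           # sites reachable over the 10 day period


def pages_per_site(pages: int, sites: int) -> float:
    """Average number of indexable pages per reachable site."""
    return pages / sites


average = pages_per_site(indexable_pages, good_sites)
print(f"{average:.1f} pages per site")  # prints "237.1 pages per site"
```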

Introduction: Inktomi Search Engine cluster

In the picture: 9 × 8 × 2 = 144

WebGather: Introduction

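The Tianwang (天网) Chinese-English search engine system, developed by the Network and Distributed Systems Laboratory of the Department of Computer Science at Peking University, is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Encoding and Distributed Chinese-English Information Discovery". It officially began providing web information navigation services to Internet users on CERNET on October 29, 1997. While in service, Tianwang has widely adopted users' comments and suggestions and has continually improved its service quality; to date it has received more than 8 million visits. The Tianwang search engine research group, newly formed in early 2000 and funded by the National 973 Key Basic Research Development Program, carries on the fine tradition of the original development team and is dedicated to exploring the key technologies of Chinese-English search engine systems, in order to provide users with faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. We welcome users' comments and suggestions.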

http://e.pku.edu.cn/ "Though we lack the paired wings of the bright phoenix to fly together, our hearts are linked as one."

WebGather: as of Dec. 1, 2000

2.5 million scale

Indexes 2.5 million web pages

More than 200,000 web pages every day

Ten days to update all data

Three PCs

Collect all the web pages in China

Keep pace with the rapid growth of Chinese web information

WebGather: Design goals for a distributed web-crawling system for WebGather

238 × 40,000 = 9,520,000
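The multiplication above reads as a size estimate: about 238 pages per site (the 237.1 average rounded up) times 40,000. The slide does not say what the 40,000 stands for; the sketch below assumes it is an estimated number of web sites to be covered, and it relates the result to the rate of more than 200,000 pages per day quoted for WebGather. All names are illustrative.

```python
import math

pages_per_site = 238     # 237.1 from the study, rounded up
assumed_sites = 40_000   # ASSUMPTION: number of sites to cover (not stated on the slide)
pages_per_day = 200_000  # lower bound quoted for WebGather as of Dec. 1, 2000

# Estimated number of pages the crawler would have to hold.
estimated_pages = pages_per_site * assumed_sites

# Days for one full pass at the quoted daily rate, rounded up to whole days.
days_for_full_pass = math.ceil(estimated_pages / pages_per_day)

print(estimated_pages)     # 9520000
print(days_for_full_pass)  # 48
```

Under these assumptions a single pass at the current rate would take well over the ten days quoted for a full update, which is consistent with the stated goal of keeping pace with the growth of Chinese web information.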

WebGather 2.0: architecture

[Figure: WebGather 2.0 architecture diagram. Components shown: Client log database, User behavior, Gather Database, Indexer, Retrieve Database, Client, Retriever, Gatherer, WWW.]

WebGather 1.2: architecture of gather subsystem 1/4

[Figure: Main Control; Gather1, Gather2, …, GatherN]
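One way a main control process could split work across N gatherers is to hash each URL's host name, so that all pages of one host go to the same gatherer. The slide only names the components; the hashing scheme, the function names, and the example path below are assumptions made for illustration, not a description of how WebGather 1.2 actually dispatches URLs.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_gatherer(url: str, num_gatherers: int) -> int:
    """Map a URL to a gatherer index by hashing its host name (assumed scheme)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first four bytes as an unsigned integer; stable across runs.
    return int.from_bytes(digest[:4], "big") % num_gatherers


def dispatch(urls, num_gatherers):
    """Main-control step: group URLs into one work queue per gatherer."""
    queues = defaultdict(list)
    for url in urls:
        queues[assign_gatherer(url, num_gatherers)].append(url)
    return dict(queues)


seed_urls = [
    "http://e.pku.edu.cn/",
    "http://net.cs.pku.edu.cn/~yhf",
    "http://e.pku.edu.cn/index.html",  # hypothetical path, for illustration only
]
queues = dispatch(seed_urls, num_gatherers=3)
for gatherer_id, assigned in sorted(queues.items()):
    print(gatherer_id, assigned)
```

Hashing on the host rather than the full URL keeps each site on a single gatherer, which makes per-site politeness limits easier to enforce; the price is uneven load when a few hosts are much larger than the rest.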

WebGather 2.0: architecture of gather subsystem 1/4

WebGather: technologies in gather subsystem 1/4

Distributed system architecture

High availability

……

Load balance

Low bandwidth

Scalability

Re-configurability

……

Word segmentation (cut words)

Position relativity

Anchor text, Link popularity

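WebGather: architecture of indexer subsystem 2/4

[Figure: two tables, A and B. Table A appears to list web pages (webpage1, webpage2, …, webpageK, …, webpageN), each with its features (feature1, feature2, feature3). Table B appears to list features (feature1, feature2, …, featureK, …, featureN), each with the web pages that contain it (webpage1, webpage2, webpage3).]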
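The indexer figure pairs a page-to-features table (A) with a features-to-pages table (B), which matches the usual forward index and inverted index. The sketch below shows how B can be derived from A; the dictionary representation and the function name are assumptions for illustration, and the slide does not describe the actual storage format.

```python
from collections import defaultdict


def invert(forward_index):
    """Build a feature -> pages table (B) from a page -> features table (A)."""
    inverted = defaultdict(list)
    for page, features in forward_index.items():
        for feature in features:
            # Record each page at most once per feature.
            if page not in inverted[feature]:
                inverted[feature].append(page)
    return dict(inverted)


# Table A: each web page with the features extracted from it (toy data).
table_a = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature1", "feature2", "feature3"],
}

# Table B: each feature with the web pages that contain it.
table_b = invert(table_a)
for feature, pages in table_b.items():
    print(feature, pages)
```

A retriever can then answer a query by looking up each query feature in table B instead of scanning every page in table A.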

WebGather: technologies in retriever subsystem 3/4

Traditional IR (VSM)

Query cache, hot click

Word segmentation (cut words)

Anchor text, Link popularity
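As a rough illustration of two items on this slide, the sketch below ranks documents with a vector space model (VSM) using raw term-frequency vectors and cosine similarity, and caches results per query string. The toy documents, the whitespace tokenizer, and the use of `functools.lru_cache` as the query cache are all assumptions; for Chinese text a word segmentation step would replace `split()`, and the slide does not say how WebGather weights terms.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy collection standing in for indexed web pages.
DOCS = {
    "webpage1": "search engine design",
    "webpage2": "web crawler design and implementation",
    "webpage3": "campus news",
}


def vector(text):
    """Term-frequency vector; whitespace splitting stands in for word segmentation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a)  # Counter yields 0 for missing terms
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


DOC_VECTORS = {name: vector(text) for name, text in DOCS.items()}


@lru_cache(maxsize=1024)  # query cache: repeated queries skip the scoring loop
def search(query):
    """Return names of matching documents, best match first."""
    q = vector(query)
    scored = [(cosine(q, dv), name) for name, dv in DOC_VECTORS.items()]
    ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return tuple(name for _, name in ranked)


print(search("design"))  # ('webpage1', 'webpage2')
```

The shorter document ranks first because cosine similarity normalizes by vector length, so one matching term out of three outweighs one out of five.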

WebGather: technologies in user behavior subsystem 4/4

Link popularity

Replica popularity

User popularity
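The slide names link popularity without defining it. A common simple reading is the number of distinct pages that link to a given page; the sketch below computes that count under this assumption. The function name and the example file names are hypothetical, and WebGather's actual measure may differ.

```python
from collections import Counter


def link_popularity(links):
    """Count, for each target page, the distinct other pages that link to it."""
    inbound = Counter()
    for source, target in set(links):  # set() drops duplicate links
        if source != target:           # ignore self-links
            inbound[target] += 1
    return inbound


# Hypothetical (source, target) link pairs extracted from gathered pages.
links = [
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("a.html", "b.html"),
    ("a.html", "c.html"),  # duplicate link, counted once
    ("c.html", "c.html"),  # self-link, ignored
]

popularity = link_popularity(links)
print(popularity["c.html"])  # 2
print(popularity["b.html"])  # 1
```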

Conclusion:

Search engines are becoming more and more important.

The Web is a good experimental subject; we can do a lot of R&D on it.
