web search engine design


Post on 18-Dec-2014


Page 1: Web Search Engine Design

Web Search Engine Design

Lee-Feng Chien (簡立峰)

Web Knowledge Discovery Lab
Institute of Information Science

Academia Sinica

http://csmart.iis.sinica.edu.tw/

Page 2: Web Search Engine Design

Outline

• Basics of Search Engine Design
• Why Google Can Do It?
• New-Generation Search Technologies

Page 3: Web Search Engine Design

About the Speaker

Working Position
• Research Fellow, IIS, Academia Sinica (1993~)

Education
• Ph.D., CS&IE, NTU, 1991

Professional Activities
• Associate Editor, ACM Trans. on Asian Lang. Info. Proc. (2000~)
• Editorial Board Member, J. of Information Processing & Management (1995~2000)
• PC member, ACM SIGIR (1999~2003)
• Speech and Search Technology Consultant, Microsoft Research

Page 4: Web Search Engine Design

Part I.

Basics of Search Engine Design

Page 5: Web Search Engine Design

Differences

Scale
• Personal, site/intranet (Tornado/Verity), internet (Google)
• Thousands, millions, or billions (documents, users, queries)

Media
• Text, e.g., Web pages, documents, bibliographic data
• Audio, e.g., music, speech, broadcast news
• Image, e.g., pictures, computer graphics
• Video, e.g., films

Subject
• General or specific subjects, languages

Structure
• Unstructured, semi-structured, structured

Interface
• Web-based, WAP-based, voice-based

Page 6: Web Search Engine Design

Components

• Crawler/Spider
• Index Server
• Query Server
• Document Delivery

Page 7: Web Search Engine Design

Architecture

[Architecture diagram. Components: Web, Spider, Indexer, Archive, multiple Index nodes, multiple SE nodes, Browser. Numbered stages (1) to (5). Annotations: 5B pages; 1B queries/day; quality results; log; spam; freshness; scalable.]

Page 8: Web Search Engine Design

Spider

Get all pages from the Web
• Web traversal

Challenges
• Performance, e.g., number of pages per PC
• Coverage
• Currency
• Spam filtering
• Hidden Web
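The Web traversal named on this slide can be illustrated with a minimal breadth-first sketch. It is only an illustration under stated assumptions: the function name `crawl`, the `fetch_links` callback, and the in-memory `TOY_WEB` graph are hypothetical and stand in for real HTTP fetching and link extraction, which the slides do not describe.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first traversal; returns URLs in the order they were visited."""
    seen = set(seeds)        # every URL discovered so far
    queue = deque(seeds)     # frontier of URLs waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:      # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical toy "web": page -> out links (no network access involved).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `max_pages` cap is a crude stand-in for the per-PC performance budget; coverage, currency, and spam filtering would need extra logic on top of this loop.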

Page 9: Web Search Engine Design

Index Server

Index occurrences of all words in the pages
• Data cleanness

Challenges
• Space overhead, number of pages per PC
• Incremental
• Scalability & distributed processing
• Multiple languages
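Indexing "occurrences of all words in the pages" is commonly done with a positional inverted index. The sketch below assumes whitespace tokenization and lowercase folding, which would not be adequate for Chinese text or the multilingual case the slide mentions; `build_index` and `DOCS` are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical three-document collection.
DOCS = {
    1: "gold and silver",
    2: "silver screen film",
    3: "film noir",
}

index = build_index(DOCS)
print(index["silver"])
```

Storing positions costs extra space (the "space overhead" challenge), but it is what makes phrase and proximity search possible later.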

Page 10: Web Search Engine Design

Query Server

Search relevant URLs for queries by looking up indices

Challenges
• Speed, e.g., number of queries per second
• Functions supported
• Localization
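Looking up indices to answer a query can be sketched as set operations over posting lists. This is an assumption-laden illustration, not the design of any particular engine: `search`, `INDEX`, and the document ids are hypothetical, and ranking is left out entirely.

```python
def search(index, required=(), excluded=()):
    """Return sorted doc ids containing every required term and no excluded term."""
    if not required:
        return []
    result = set(index.get(required[0], ()))
    for term in required[1:]:
        result &= set(index.get(term, ()))   # AND: intersect posting sets
    for term in excluded:
        result -= set(index.get(term, ()))   # NOT: remove matching documents
    return sorted(result)

# Hypothetical index: term -> set of document ids.
INDEX = {
    "film": {1, 2, 3},
    "noir": {2, 3},
    "pinot": {3},
}

hits = search(INDEX, required=["film", "noir"], excluded=["pinot"])
print(hits)
```

Intersecting the shortest posting list first is a common way to raise queries per second, though it is not shown here.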

Page 11: Web Search Engine Design

AltaVista’s Search Functions

• Phrase search, e.g. "petite galerie"
• Truncation, e.g. librar*, wom*n
• Constraining search, e.g. title:"The Wall Street Journal"
• Proximity search, e.g. gold near silver
• Boolean, e.g. +noir +film -"pinot noir"
• Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
• Limit search, e.g. limit by date range
• Capitalization, e.g. turkey vs. Turkey
• Ranking fields and refine search
• LiveTopics
• Translate service
• Other
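Two of the functions listed above, truncation and phrase search, can be sketched in a few lines. The sketch assumes shell-style wildcard semantics via Python's `fnmatch`, which may differ from AltaVista's actual `*` rules; `expand_truncation`, `phrase_match`, and `VOCABULARY` are hypothetical names.

```python
from fnmatch import fnmatchcase

def expand_truncation(pattern, vocabulary):
    """Return the vocabulary words matching a pattern such as 'librar*'."""
    return sorted(w for w in vocabulary if fnmatchcase(w, pattern))

def phrase_match(phrase, text):
    """True if the words of `phrase` occur consecutively, in order, in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Hypothetical index vocabulary.
VOCABULARY = ["library", "librarian", "liberty", "woman", "women", "wombat"]

print(expand_truncation("librar*", VOCABULARY))
print(expand_truncation("wom*n", VOCABULARY))
print(phrase_match("petite galerie", "la petite galerie du louvre"))
```

In a real engine the expanded terms would be OR-ed together before the index lookup, and phrase matching would use stored word positions rather than rescanning text.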

Page 12: Web Search Engine Design

Document Delivery

• Bandwidth bottleneck
• Presentation
• Caching: queries, search results
• Akamai model
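Caching queries and their search results is often done with a least-recently-used (LRU) policy. The sketch below is one possible approach, not something the slides specify: `QueryCache`, its `backend` callback, and the query normalization rule are all assumptions.

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache that sits in front of a search backend."""

    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend          # callable: normalized query -> results
        self.entries = OrderedDict()    # oldest entry first
        self.hits = 0
        self.misses = 0

    def search(self, query):
        key = " ".join(query.lower().split())   # fold case and extra spaces
        if key in self.entries:
            self.entries.move_to_end(key)       # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        results = self.backend(key)
        self.entries[key] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return results

# Hypothetical backend that just echoes the query.
cache = QueryCache(2, backend=lambda q: [q + "-result"])
cache.search("Academia Sinica")
cache.search("academia  sinica")   # same normalized query: served from cache
cache.search("google")
cache.search("altavista")          # capacity exceeded: oldest entry evicted
print(cache.hits, cache.misses, list(cache.entries))
```

Because a small set of queries recurs very often in real logs, even a small cache of this kind can absorb a noticeable share of traffic.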

Page 13: Web Search Engine Design

Others

• Security
• System maintenance

Page 14: Web Search Engine Design

Evaluation

• Contents and Scopes
• Search Logic
• Display Results
• Search Performance
• Interface Design

Page 15: Web Search Engine Design

Contents and Scopes

• Size of database
• Sources covered (WWW, Usenet, FTP, Gopher, etc.)
• Index depth (e.g. HTML title, header, summary of content, full text)
• Indexing method (automatic or manual indexing)
• Currency and frequency of updating the index
• Multilingual support
• Services covered (e.g. Excite includes Web search, Usenet search, subject guide, city.net, NewsTracker, etc.)
• Reviews provided

Page 16: Web Search Engine Design

Search Logic

• Boolean logic (e.g. AND, OR, NOT)
• Nested Boolean
• Truncation (automatic or user-defined)
• Proximity operators (e.g. NEAR, FOLLOWED BY)
• Phrase searching
• Field search (e.g. URL, title, etc.)
• Handling of capitalization, punctuation, etc.
• Keyword search
• Natural language query
• Relevance feedback
• Refine search (narrow down)
• Weighted search
• Duplicate detection
• Search set manipulation
• Other

Page 17: Web Search Engine Design

Display Results

• Relevance ranking
• Limit on the number of results displayed
• Limit on the level of detail displayed (annotation or summary)
• Direct links to resources

Page 18: Web Search Engine Design

Search Performance

• Precision ratio
• Recall ratio
• Response time
• Accessibility (ease of connecting)
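The precision and recall ratios follow their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A small worked sketch, with hypothetical document ids and a hypothetical helper name `precision_recall`:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)             # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
print(p, r)   # 0.5 0.4
```

On the open Web the full relevant set is unknown, so recall can usually only be estimated.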

Page 19: Web Search Engine Design

Part II.

Why Google Can Do It?

Page 20: Web Search Engine Design

Spider

[Diagram labels: index page; out links; duplication; authority. Caption: traverse authorized pages by following out links.]

Page 21: Web Search Engine Design

Indexing & Ranking

[Diagram: an indexed page with page title "Academia Sinica" and an abstract. Links pointing to it carry anchor text such as "Government Research Institution in Taiwan" and "My CS Lab". Labels: popularity, authority.]
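One way to read this slide is that a page is indexed not only under its own title but also under the anchor text of links pointing to it, with link counts serving as a popularity signal. The sketch below illustrates that reading only. It is not Google's actual ranking; the inbound-link count is a deliberately crude proxy, and every name and `.example` URL in it is hypothetical.

```python
from collections import defaultdict

def index_with_anchors(titles, links):
    """titles: {url: title}; links: [(source_url, anchor_text, target_url)]."""
    index = defaultdict(set)      # word -> set of URLs
    inlinks = defaultdict(int)    # URL -> number of inbound links
    for url, title in titles.items():
        for word in title.lower().split():
            index[word].add(url)
    for _source, anchor, target in links:
        inlinks[target] += 1
        for word in anchor.lower().split():
            index[word].add(target)   # anchor words describe the target page
    return index, inlinks

def lookup(index, inlinks, word):
    """URLs matching `word`, most-linked first (ties broken by URL)."""
    urls = index.get(word.lower(), ())
    return sorted(urls, key=lambda u: (-inlinks.get(u, 0), u))

titles = {
    "sinica.example": "Academia Sinica",
    "lab.example": "Research Lab Taiwan",
}
links = [
    ("gov.example", "Government Research Institution in Taiwan", "sinica.example"),
    ("student.example", "My CS Lab", "sinica.example"),
    ("other.example", "a lab in Taiwan", "lab.example"),
]

index, inlinks = index_with_anchors(titles, links)
print(lookup(index, inlinks, "taiwan"))
print(lookup(index, inlinks, "government"))
```

The second lookup finds the page through a word that appears only in anchor text, never in its own title.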

Page 22: Web Search Engine Design

Inverted File

Google’s Index File Structure

Page 23: Web Search Engine Design

Distributed Search

[Diagram: Query, Query Processor, four SE nodes, Document Delivery.]
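The diagram suggests a query processor fanning a query out to several SE nodes and merging their answers. A minimal scatter-gather sketch follows, assuming each node returns its own top-k `(score, doc_id)` pairs and that scores are comparable across nodes; `make_shard`, `distributed_search`, and the term-count scoring are hypothetical simplifications.

```python
import heapq

def make_shard(docs):
    """Build a search function over one shard's documents ({doc_id: text})."""
    def shard_search(query, k):
        term = query.lower()
        scored = [(text.lower().split().count(term), doc_id)
                  for doc_id, text in docs.items()]
        # Keep only documents that contain the term, best k first.
        return heapq.nlargest(k, [s for s in scored if s[0] > 0])
    return shard_search

def distributed_search(query, shards, k=3):
    """Send the query to every shard, then merge the partial top-k lists."""
    partial = []
    for shard_search in shards:      # sequential here; parallel in practice
        partial.extend(shard_search(query, k))
    return heapq.nlargest(k, partial)

shards = [
    make_shard({"a1": "web search web", "a2": "image search"}),
    make_shard({"b1": "web web web crawler", "b2": "speech"}),
]

top = distributed_search("web", shards, k=2)
print(top)
```

Because each shard returns at most k candidates, the merge step stays cheap no matter how many documents each shard holds.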

Page 24: Web Search Engine Design

Abstract Modeling

[Diagram labels: User Space, Index Space, Document Space; Users, Authors; Information Need, Information Use; Seek, Use; Short Query, Subject Terms, Real Names; X, Y; X1, X2, ...; Y1, Y2, ...]

Page 25: Web Search Engine Design

Facts (I)

Query
• Short query problem
• 50% are personal and company names
• Boolean and natural language queries are rare

Browsing
• Users seldom go on to the 2nd page of results
• Precision is more important than recall

Information gathering (Robot)
• Low coverage, dead links, garbage sites and pages

Page 26: Web Search Engine Design

Facts (II) - Accuracy: Whose Responsibility?

Users
• Short query or NLQ?
• HFQ or LFQ?

Search engines
• Technology, data volume, ranking?

Page 27: Web Search Engine Design

Facts (III) - Speed: Whose Responsibility?

Users
• Keywords, bandwidth

Search engines
• Bandwidth, document delivery

Page 28: Web Search Engine Design

Language Ratio

Table 3. Statistics concerning the language used in each search term

          All Chinese   All English   Other
Dreamer   78.20%        19.18%        2.62%
GAIS      78.22%        16.90%        4.88%

Page 29: Web Search Engine Design

Query Length

Table 4. Statistics concerning the number of terms per query

          in Chinese        in English   All
Dreamer   3.18 characters   1.10 words   6.31 bytes
GAIS      3.55 characters   1.22 words   7.26 bytes

Page 30: Web Search Engine Design

Query Frequency

Table 5. Statistics concerning how often distinct queries are asked

Query occurs   1 time   2 times   3 times   > 3 times
Dreamer        52.4%    17.8%     8.4%      21.4%
GAIS
AltaVista      63.7%    16.2%     6.5%      13.6%
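Statistics of the kind shown in Table 5 can be derived from a query log by counting how often each distinct query occurs. The sketch below assumes the percentages are taken over distinct queries and that queries are compared after trimming and lowercasing; the log contents and the name `frequency_buckets` are hypothetical.

```python
from collections import Counter

def frequency_buckets(queries):
    """Share of distinct queries asked once, twice, three times, or more."""
    counts = Counter(q.strip().lower() for q in queries)
    buckets = Counter()
    for n in counts.values():
        buckets[min(n, 4)] += 1          # 4 stands for "more than 3 times"
    total = len(counts)
    labels = [(1, "1 time"), (2, "2 times"), (3, "3 times"), (4, "> 3 times")]
    if total == 0:
        return {label: 0.0 for _, label in labels}
    return {label: buckets[key] / total for key, label in labels}

# Hypothetical log: 10 queries, 5 distinct.
LOG = ["mp3", "mp3", "mp3", "mp3", "yahoo", "yahoo",
       "news", "map", "Map ", "game"]

stats = frequency_buckets(LOG)
print(stats)
```

A large "1 time" share, as in the table, means many queries never recur, while the small "> 3 times" group is where result caching pays off.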

Page 31: Web Search Engine Design

Core Keywords

Table 1. Coverage comparison between Dreamer and GAIS

GAIS / Dreamer   top 1000      top 20k        ALL
top 1000         583/58.30%    977/97.70%     992/99.20%
top 20k          914/91.40%    9709/50.71%    14721/76.89%

Page 32: Web Search Engine Design

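Subject Domains

[Chart of query subject domains. Labels: Adult, Computer, Entertainment, Chat, Life, Education, Travel, Game, Business, Society, Media, Humanities, Health, Science, Other. Second group of labels: Software, Graph, Search engine, Network, Company, Hardware, BBS, Other.]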

Page 33: Web Search Engine Design

Part III.

New-Generation Search Technologies

Page 34: Web Search Engine Design

New-Generation IR

Information Perspectives
• Web IR
• Multimedia IR
• Semantic Web IR
• User-Oriented IR

Retrieval Perspectives
• Question Answering
• Information Extraction
• Information Filtering
• Web Mining

Page 35: Web Search Engine Design

New-Generation IR

Information Perspective
• Web IR: Global/Specific Search Engines, Spiders
• Multimedia IR: Speech, Music, Image, Video IR
• Semantic Web IR: XML IR, Ontology, IE
• User-Oriented IR: Log Mining, Ontology

Retrieval Perspective
• Question Answering: NLQ, FAQ Search
• Information Extraction
• Information Filtering: e-mail Spam, Web Page

Mining Perspective
• Web Mining, Log Mining

Page 36: Web Search Engine Design

Query by Image

Page 37: Web Search Engine Design

Audio and Video Browsing

Page 38: Web Search Engine Design

Video Summarization

Page 39: Web Search Engine Design

Document Classification

Page 40: Web Search Engine Design

Cross-Language Search

Page 41: Web Search Engine Design

Intelligent Question Answering

Page 42: Web Search Engine Design

Ask an Expert

Page 43: Web Search Engine Design

IR Research at WKD (I)

Information Perspective
• Web IR: Cross-Language Web Search, Concept-based Search
• Multimedia IR: Speech Retrieval, Image Retrieval
• User-Oriented IR: Query Taxonomy Generation

Page 44: Web Search Engine Design

Cross-Language Web Search

LiveTrans

Page 53: Web Search Engine Design

LiveConcept

Page 54: Web Search Engine Design

LiveConcept

Concept-based Web Search

Page 55: Web Search Engine Design

Speech Retrieval (Dr. 陳柏琳)

[Screenshot labels: Query: "Please help me find the China-US military aircraft collision"; Indexing Approach; Query by Exemplar; Speech Query (recognition results); Retrieved documents (ranked); Recording Time; Relevance Score; Spoken Document (recognition results).]

Page 56: Web Search Engine Design

Web Image Annotation

Top 4 keywords assigned to each image (images not reproduced):
• Rainbow, Weather, Flower, Nature
• Sunflower, Flower, Plant, Desert
• Seal, Mammal, Coast, Animal
• Solar System, Comet, Tropical Fish, Universe
• Waterfall, Landform, Nature, Cockroach
• Dog, Mammal, Pangolin, Sheep

Page 57: Web Search Engine Design

Web Image Annotation

Page 59: Web Search Engine Design

Q&A

Thanks !

Web Knowledge Discovery Lab
Institute of Information Science

Academia Sinica

http://csmart.iis.sinica.edu.tw/