web search engine design

Post on 18-Dec-2014


Web Search Engine Design

Lee-Feng Chien (簡立峰)

Web Knowledge Discovery Lab, Institute of Information Science

Academia Sinica

http://csmart.iis.sinica.edu.tw/

Outline

• Basics of Search Engine Design
• Why Google Can Do It?
• New-Generation Search Technologies

About the Speaker

Working Position
• Research Fellow, IIS, Academia Sinica (1993~)

Education
• Ph.D., CS&IE, NTU, 1991

Professional Activities
• Associate Editor, ACM Trans. on Asian Lang. Info. Proc. (2000~)
• Editorial Board Member, J. of Information Processing & Management (1995~2000)
• PC member, ACM SIGIR (1999~2003)
• Speech and Search Technology Consultant, Microsoft Research

Part I.

Basics of Search Engine Design

Differences

Scale
• Personal, site/intranet (Tornado/Verity), internet (Google)
• Thousand, million or billion (documents, users, queries)

Media
• Text, e.g., Web pages, documents, bibliographic data
• Audio, e.g., music, speech, broadcast news
• Image, e.g., pictures, computer graphics
• Video, e.g., films

Subject
• General or specific subjects, languages

Structure
• Unstructured, semi-structured, structured

Interface
• Web-based, WAP-based, voice-based

Components

• Crawler/Spider
• Index Server
• Query Server
• Document Delivery

Architecture

[Figure: architecture diagram. Component labels: Web, Spider, Indexer, Archive, Index, SE, Log, Browser. Annotations: 5B pages, 1B queries/day, quality results, spam, freshness, scalable. Numbered callouts (1) to (5).]

Spider

• Get all pages from the Web
• Web traversal

Challenges
• Performance, e.g., #pages per PC
• Coverage
• Currency
• Spam filtering
• Hidden Web
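To make "Web traversal" concrete, here is a minimal sketch of a breadth-first spider. It is not the design from the slides: the link graph `TOY_WEB`, the `crawl` function, and the `max_pages` limit are all hypothetical, and the "Web" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of out-link URLs.
# A real spider would fetch pages over HTTP and parse the links instead.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "a.example/2"],
    "a.example/2": [],
    "b.example/": ["a.example/2", "c.example/missing"],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first traversal from seed; returns (visit order, dead links)."""
    seen = {seed}
    queue = deque([seed])
    order, dead = [], []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        links = fetch_links(url)
        if links is None:          # page could not be fetched: a dead link
            dead.append(url)
            continue
        order.append(url)
        for link in links:
            if link not in seen:   # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return order, dead

pages, dead_links = crawl("a.example/", TOY_WEB.get)
print(pages)
print(dead_links)
```

The `seen` set is what keeps the traversal from looping on cycles. Coverage, currency, and spam filtering would each need policy on top of this loop, such as revisit schedules and URL filters.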

Index Server

• Index occurrences of all words in the pages
• Data cleanness

Challenges
• Space overhead, #pages/PC
• Incremental scalability & distributed processing
• Multiple languages
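As an illustration of "index occurrences of all words", the following sketch builds a small positional inverted index. The `tokenize` and `build_index` helpers and the sample documents are assumptions made for this example. The regular-expression tokenizer in particular is only a stand-in, since handling multiple languages (for example Chinese word segmentation) needs much more than this.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokens; a stand-in for language-aware tokenization."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [positions]} for every occurrence."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical sample documents.
docs = {
    1: "Gold and silver prices",
    2: "Silver screen film noir",
    3: "Gold rush film",
}
index = build_index(docs)
print(index["gold"])
print(index["film"])
```

Storing positions, not just document ids, is what makes phrase and proximity queries possible later. It is also a large part of the space overhead listed among the challenges.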

Query Server

• Search for URLs relevant to a query by looking up the indices

Challenges
• Speed, e.g., #queries per second
• Functions supported
• Localization
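Looking up indices for a multi-word query usually comes down to intersecting posting lists. The sketch below shows one common way to do that with a linear merge of sorted lists. The `POSTINGS` and `URLS` tables and the `search` function are hypothetical, and only a plain AND query is handled.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings (word -> sorted doc ids) and doc id -> URL table.
POSTINGS = {"web": [1, 2, 4, 7], "search": [2, 3, 4, 9], "engine": [2, 4, 8]}
URLS = {2: "http://example.org/a", 4: "http://example.org/b"}

def search(query):
    """AND query: URLs of the documents that contain every query word."""
    lists = [POSTINGS.get(word, []) for word in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)              # start from the shortest posting list
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return [URLS.get(doc, f"doc:{doc}") for doc in result]

print(search("web search engine"))
```

Starting from the shortest list keeps intermediate results small, which matters when throughput is measured in queries per second.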

AltaVista’s Search Functions

• Phrase search, e.g. "petite galerie"
• Truncation, e.g. librar*, wom*n
• Constraining search, e.g. title:"The Wall Street Journal"
• Proximity search, e.g. gold near silver
• Boolean, e.g. +noir +film -"pinot noir"
• Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
• Limit search, e.g. limit by date range
• Capitalization, e.g. turkey vs. Turkey
• Ranking fields and refine search
• LiveTopics
• Translate Service
• Other
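The Boolean example from the list, `+noir +film -"pinot noir"`, can be sketched in a few lines. This is not AltaVista's implementation: `parse` and `matches` are hypothetical helpers, the sample documents are invented, and matching is done by plain substring tests instead of an index lookup.

```python
import shlex

def parse(query):
    """Split +required and -excluded terms; quoted phrases stay together."""
    required, excluded = [], []
    for token in shlex.split(query):
        if token.startswith("-"):
            excluded.append(token[1:].lower())
        else:
            required.append(token.lstrip("+").lower())
    return required, excluded

def matches(text, query):
    """Substring matching only; a real engine would consult its index."""
    required, excluded = parse(query)
    text = text.lower()
    return (all(term in text for term in required)
            and not any(term in text for term in excluded))

# Hypothetical documents.
docs = [
    "Classic film noir of the 1940s",
    "Pinot noir in film: a wine documentary",
    "Silent film history",
]
hits = [d for d in docs if matches(d, '+noir +film -"pinot noir"')]
print(hits)
```

`shlex.split` is used here only because it keeps a quoted phrase as one token, so `-"pinot noir"` excludes the phrase and not the two separate words.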

Document Delivery

• Bottleneck of bandwidth
• Presentation
• Caching: queries, search results
• Aakman Model
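Caching query results can be sketched as a small least-recently-used (LRU) cache. The slides do not say which eviction policy is meant, so LRU is an assumption here, as are the `ResultCache` class and the `slow_search` backend.

```python
from collections import OrderedDict

class ResultCache:
    """Least-recently-used cache mapping a query string to its results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        if query in self.items:
            self.items.move_to_end(query)    # mark as most recently used
            self.hits += 1
            return self.items[query]
        self.misses += 1
        results = compute(query)
        self.items[query] = results
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used
        return results

def slow_search(query):
    """Hypothetical backend standing in for a real index lookup."""
    return [f"result for {query}"]

cache = ResultCache(capacity=2)
for q in ["taiwan", "google", "taiwan", "sinica", "google"]:
    cache.get(q, slow_search)
print(cache.hits, cache.misses)
```

Because a small set of queries is asked repeatedly in real logs, even a modest cache can absorb a noticeable share of traffic before it reaches the index.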

Others

• Security
• System maintenance

Evaluation

• Contents and Scopes
• Search Logic
• Display Results
• Search Performance
• Interface Design

Contents and Scopes

• Size of database
• Sources covered (WWW, Usenet, FTP, Gopher, etc.)
• Index depth (e.g. HTML title, header, summary of content, full text)
• Indexing method (automatic or manual)
• Currency and frequency of index updates
• Multilingual support
• Services covered (e.g. Excite includes Web search, Usenet search, subject guide, city.net, NewsTracker, etc.)
• Reviews provided

Search Logic

• Boolean logic (e.g. AND, OR, NOT)
• Nested Boolean
• Truncation (automatic or user-defined)
• Proximity operators (e.g. NEAR, FOLLOWED BY)
• Phrase searching
• Field search (e.g. URL, title, etc.)
• Handling of capitalization, punctuation, etc.
• Keyword search
• Natural language query
• Relevance feedback
• Refine search (narrow down)
• Weighted search
• Duplicate detection
• Search set manipulation
• Other

Display Results

• Relevance ranking
• Limit on the number of results displayed
• Limit on the level of detail displayed (annotation or abstract)
• Direct links to resources

Search Performance

• Precision ratio
• Recall ratio
• Response time
• Accessibility (ease of connecting)
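Precision and recall have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with a hypothetical `precision_recall` helper and made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgment data: 4 results returned, 5 documents truly relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
print(p, r)
```

Here two of the four returned documents are relevant, and two of the five relevant documents were found.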

Part II.

Why Google Can Do It?

Spider

[Figure labels: Indexed pages; Out links; Duplication; Authority; Traverse from out links to authorized pages]

Indexing & Ranking

[Figure: an indexed page and the signals attached to it]
• Indexed Page
• Page Title: Academia Sinica
• Anchor Text: Government Research Institution in Taiwan
• Anchor Text: My CS Lab
• Abstract
• Popularity
• Authority
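The slide lists "Authority" and "Popularity" without naming an algorithm. One well-known way to score authority from the link graph is PageRank, sketched below with power iteration. Treating PageRank as what the slide has in mind is an assumption, and the `pagerank` function, the damping value, and the toy link graph are illustrative only.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict: page -> list of pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new = {page: (1.0 - damping) / n for page in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical link graph: two home pages link to an institute and its lab.
links = {
    "sinica": ["lab"],
    "lab": ["sinica"],
    "home1": ["sinica"],
    "home2": ["sinica", "lab"],
}
rank = pagerank(links)
best = max(rank, key=rank.get)
print(best, round(sum(rank.values()), 6))
```

Pages that collect more in-links from well-ranked pages end up with higher scores, which is one way to turn link structure into a ranking signal alongside titles and anchor text.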

Inverted File

Google’s Index File Structure

Distributed Search

[Figure labels: Query; Query Processor; four SE nodes]
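One plausible reading of the diagram is that a query processor sends each query to several SE nodes and merges their partial results. The sketch below shows that scatter-gather pattern under this assumption. The `SHARDS` data, scores, and function names are hypothetical, and the nodes are called one after another here, whereas a real system would query them in parallel.

```python
import heapq

# Hypothetical shards: each SE node holds (score, url) hits per query word.
SHARDS = [
    {"taiwan": [(0.9, "u1"), (0.4, "u2")]},
    {"taiwan": [(0.7, "u3")], "google": [(0.8, "u4")]},
    {"taiwan": [(0.95, "u5"), (0.1, "u6")]},
]

def search_shard(shard, query, k):
    """Each node returns only its own k best hits."""
    return heapq.nlargest(k, shard.get(query, []))

def distributed_search(query, k=3):
    """Query processor: send the query to every node, then merge."""
    partial = [search_shard(shard, query, k) for shard in SHARDS]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [url for _score, url in merged]

top = distributed_search("taiwan")
print(top)
```

Since each node returns at most `k` hits, the merge step handles at most `k` times the number of nodes, however large each shard's index is.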

Document Delivery

Abstract Modeling

[Figure labels: Index Space; User Space; Document Space; Information Need; Information Use; Seek; Use; Users; Authors; Short Query; Subject Terms; Real Names; X; Y; X1, X2, ...; Y1, Y2, ...]

Facts (I)

Query
• Short query problem
• 50% are personal and company names
• Boolean and natural language queries are few

Browsing
• Users do not go on to the 2nd page of results
• Precision is more important than recall

Information gathering (Robot)
• Low coverage, dead links, garbage sites and pages

Facts (II) - Accuracy: Whose Responsibility?

Users
• Short query or NLQ?
• HFQ, LFQ?

Search engines
• Technology, data size, ranking?

Facts (III) - Speed: Whose Responsibility?

Users
• Keywords, bandwidth

Search engines
• Bandwidth, document delivery

Language Distribution

Table 3. Statistics concerning what language is used in each search term

           All Chinese   All English   Other
Dreamer    78.20%        19.18%        2.62%
GAIS       78.22%        16.90%        4.88%
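The slides do not describe how search terms were assigned to the three language categories. The sketch below shows one simple way such a breakdown could be computed, using Unicode ranges. The `classify` rule (CJK Unified Ideographs for Chinese, ASCII letters for English, everything else "Other") and the sample terms are assumptions, not the method behind Table 3.

```python
from collections import Counter

def classify(term):
    """Label a search term as all Chinese, all English, or other."""
    chars = [c for c in term if not c.isspace()]
    if not chars:
        return "Other"
    if all("\u4e00" <= c <= "\u9fff" for c in chars):   # CJK Unified Ideographs
        return "All Chinese"
    if all(c.isascii() and c.isalpha() for c in chars):
        return "All English"
    return "Other"                                       # mixed, digits, symbols

# Hypothetical search terms.
terms = ["搜尋引擎", "google", "mp3 下載", "中央研究院", "yahoo"]
counts = Counter(classify(t) for t in terms)
shares = {label: counts[label] / len(terms)
          for label in ("All Chinese", "All English", "Other")}
print(shares)
```

Under this rule a term such as "mp3" lands in "Other" because of the digit. A real study would have to decide how to treat digits and punctuation.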

Query Length

Table 4. Statistics concerning the number of terms per query

           In Chinese         In English    All
Dreamer    3.18 characters    1.10 words    6.31 bytes
GAIS       3.55 characters    1.22 words    7.26 bytes

Query Frequency

Table 5. Statistics concerning how often distinct queries are asked

Query occurs   1 time    2 times   3 times   > 3 times
Dreamer        52.4%     17.8%     8.4%      21.4%
GAIS
AltaVista      63.7%     16.2%     6.5%      13.6%
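A frequency breakdown of this kind can be computed from a query log with a counter. The sketch below assumes the percentages are shares of distinct queries, which is one reading of the table caption. The `frequency_buckets` helper and the sample log are hypothetical.

```python
from collections import Counter

def frequency_buckets(queries):
    """Share of distinct queries asked 1, 2, 3, and more than 3 times."""
    counts = Counter(q.strip().lower() for q in queries)
    buckets = Counter()
    for n in counts.values():
        buckets[n if n <= 3 else ">3"] += 1
    total = len(counts)
    return {key: buckets[key] / total for key in (1, 2, 3, ">3")}

# Hypothetical query log with 6 distinct queries.
log = ["mp3", "mp3", "mp3", "mp3", "yahoo", "yahoo",
       "news", "NTU", "map", "map", "map", "bbs"]
dist = frequency_buckets(log)
print(dist)
```

In this toy log half of the distinct queries occur only once, which is the same kind of long tail the table reports.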

Core Query Terms

Table 1. Coverage comparison between Dreamer and GAIS

GAIS / Dreamer   top 1000      top 20k        ALL
top 1000         583/58.30%    977/97.70%     992/99.20%
top 20k          914/91.40%    9709/50.71%    14721/76.89%

Subject Domains

[Chart labels: Adult, Computer, Entertainment, Chat, Life, Education, Travel, Game, Business, Society, Media, Humanities, Health, Science, Other]

[Chart labels: Software, Graph, Search engine, Network, Company, Hardware, BBS, Other]

Part III.

New-Generation Search Technologies

New-Generation IR

Information Perspectives
• Web IR
• Multimedia IR
• Semantic Web IR
• User-Oriented IR

Retrieval Perspectives
• Question Answering
• Information Extraction
• Information Filtering
• Web Mining

New-Generation IR

Information Perspective
• Web IR: Global/Specific Search Engines, Spiders
• Multimedia IR: Speech, Music, Image, Video IR
• Semantic Web IR: XML IR, Ontology, IE
• User-Oriented IR: Log Mining, Ontology

Retrieval Perspective
• Question Answering: NLQ, FAQ Search
• Information Extraction
• Information Filtering: e-mail Spam, Web Page

Mining Perspective
• Web Mining, Log Mining

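Query by Image

Audio and Video Browsing

Video Summarization

Document Classification

Cross-Language Search

Intelligent Question Answering

Ask an Expert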

IR Research at WKD (I)

Information Perspective
• Web IR: Cross-Language Web Search, Concept-based Search
• Multimedia IR: Speech Retrieval, Image Retrieval
• User-Oriented IR: Query Taxonomy Generation

Cross-Language Web Search

LiveTrans

LiveConcept


Concept-based Web Search

Query: 請幫我找中美軍機擦撞 ("Please help me find the China-US military aircraft collision")

Speech Retrieval (Dr. 陳柏琳)

[Figure labels: Indexing Approach; Query by Exemplar; Speech Query (recognition results); Retrieved documents (ranked); Recording Time; Relevance Score; Spoken Document (recognition results)]

Web Image Annotation

[Table: images and their top 4 keywords. Keyword sets, one line per image:]
• Rainbow, Weather, Flower, Nature
• Sunflower, Flower, Plant, Desert
• Seal, Mammal, Coast, Animal
• Solar System, Comet, Tropical Fish, Universe
• Waterfall, Landform, Nature, Cockroach
• Dog, Mammal, Pangolin, Sheep

Q&A

Thanks!

