Web Search Engine Design
Lee-Feng Chien (簡立峰)
Web Knowledge Discovery Lab, Institute of Information Science
Academia Sinica
http://csmart.iis.sinica.edu.tw/
Outline
• Basics of Search Engine Design
• Why Google Can Do It?
• New-Generation Search Technologies
About the Speaker
• Working Position
  • Research Fellow, IIS, Academia Sinica (1993~)
• Education
  • Ph.D., CS&IE, NTU, 1991
• Professional Activities
  • Associate Editor, ACM Trans. on Asian Lang. Info. Proc. (2000~)
  • Editorial Board Member, J. of Information Processing & Management (1995~2000)
  • PC member, ACM SIGIR (1999~2003)
  • Speech and Search Technology Consultant, Microsoft Research
Part I.
Basics of Search Engine Design
Differences
• Scale
  • Personal, site/intranet (Tornado/Verity), internet (Google)
  • Thousands, millions, or billions (documents, users, queries)
• Media
  • Text, e.g., Web pages, documents, bibliographic data
  • Audio, e.g., music, speech, broadcast news
  • Image, e.g., pictures, computer graphics
  • Video, e.g., films
• Subject
  • General or specific subjects, languages
• Structure
  • Unstructured, semi-structured, structured
• Interface
  • Web-based, WAP-based, voice-based
Components
• Crawler/Spider
• Index Server
• Query Server
• Document Delivery
Architecture
[Figure: search engine architecture. Labels: Browser, Web, SE (search engine) servers, Log, Index, Indexer, Spider, Archive. Annotations: 1B queries/day, quality results, spam, freshness, 5B pages, scalable. Numbered callouts (1) to (5).]
Spider
• Get all pages from the Web
• Web traversal
• Challenges
  • Performance, e.g., number of pages per PC
  • Coverage
  • Currency
  • Spam Filtering
  • Hidden Web
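The traversal step can be illustrated with a minimal sketch. It assumes a breadth-first strategy and a hypothetical in-memory link table (TOY_WEB) standing in for real HTTP fetching and HTML parsing; the slides do not say which traversal order the speaker had in mind.

```python
from collections import deque

# Hypothetical toy "web": URL -> out-links. A real spider would fetch
# and parse pages over HTTP instead of reading this table.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def toy_fetch(url):
    """Return the out-links of a page in the toy web."""
    return TOY_WEB.get(url, [])


def crawl(seed, fetch_links, max_pages=1000):
    """Breadth-first traversal from seed; returns URLs in visit order."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never schedule the same URL twice
                seen.add(link)
                queue.append(link)
    return order


print(crawl("http://a.example/", toy_fetch))
```

The max_pages cap is a stand-in for the performance and coverage trade-off listed under Challenges: a crawler with a fixed budget has to decide which pages are worth visiting first.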
Index Server
• Index occurrences of all words in the pages
• Data Cleanness
• Challenges
  • Space Overhead, number of pages per PC
  • Incremental Scalability & Distributed Processing
  • Multiple Languages
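"Index occurrences of all words" is commonly realized as an inverted index. The sketch below is an illustration only: the tokenizer, the sample documents, and the word -> {doc_id: [positions]} layout are assumptions, not the structure of any particular engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


# Hypothetical sample documents.
DOCS = {1: "gold and silver", 2: "Silver screen", 3: "gold rush gold"}
INDEX = build_index(DOCS)

print(INDEX["gold"])
```

Storing positions as well as document IDs costs extra space (the space overhead noted under Challenges) but is what makes phrase and proximity search possible later.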
Query Server
• Search for URLs relevant to queries by looking up the indices
• Challenges
  • Speed, measured in queries per second
  • Functions supported
  • Localization
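A minimal sketch of the index lookup, assuming each term maps to a sorted list of document IDs (the POSTINGS table is hypothetical). Required terms are intersected with a linear merge, shortest list first, and excluded terms are filtered out at the end.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(postings, required, excluded=()):
    """Return IDs of documents containing all required and no excluded terms."""
    lists = sorted((postings.get(t, []) for t in required), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:  # start from the shortest list to keep work small
        result = intersect(result, other)
    banned = set()
    for term in excluded:
        banned.update(postings.get(term, []))
    return [doc for doc in result if doc not in banned]


# Hypothetical postings: term -> sorted document IDs.
POSTINGS = {"noir": [1, 2, 5, 8], "film": [2, 3, 5, 9], "pinot": [5]}

print(search(POSTINGS, ["noir", "film"], excluded=["pinot"]))
```

Because the merge is linear in the lengths of the lists, query speed depends directly on how long the posting lists of common words are.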
AltaVista’s Search Functions
• Phrase search, e.g. "petite galerie"
• Truncation, e.g. librar*, wom*n
• Constraining search, e.g. title:"The Wall Street Journal"
• Proximity search, e.g. gold near silver
• Boolean, e.g. +noir +film -"pinot noir"
• Parentheses and Nested Boolean, e.g. silver and not (gold or platinum)
• Limit search, e.g. limit by date range
• Capitalization, e.g. turkey vs. Turkey
• Ranking fields and refine search
• LiveTopics
• Translate Service
• Other
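Two of the functions listed above, truncation and proximity search, can be sketched in a few lines. This is an illustration under stated assumptions, not AltaVista's implementation: truncation is emulated with shell-style wildcard matching from Python's fnmatch module, and the proximity window of 10 words is an assumed default.

```python
from fnmatch import fnmatchcase


def expand_truncation(pattern, vocabulary):
    """Return the vocabulary words matching a pattern such as 'librar*'."""
    return sorted(w for w in vocabulary if fnmatchcase(w, pattern))


def near(text, a, b, window=10):
    """True if words a and b occur within `window` words of each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


# Hypothetical vocabulary and sentence.
VOCAB = ["library", "libraries", "liberal", "woman", "women", "womb"]

print(expand_truncation("librar*", VOCAB))
print(expand_truncation("wom*n", VOCAB))
print(near("gold is found near deposits of silver", "gold", "silver"))
```

In a real engine the expanded words would be looked up in the index and their posting lists merged, and proximity would be checked against stored word positions rather than by rescanning the text.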
Document Delivery
• Bottleneck of Bandwidth
• Presentation
• Caching
  • Queries, Search Results
  • Aakman Model
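Caching query results can be sketched as follows. The least-recently-used eviction policy, the ResultCache class, and the slow_search backend are all assumptions made for illustration; the slides only name queries and search results as things to cache.

```python
from collections import OrderedDict


class ResultCache:
    """Small LRU cache mapping a query string to its result list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        """Return cached results, calling compute(query) only on a miss."""
        if query in self._items:
            self._items.move_to_end(query)  # mark as most recently used
            self.hits += 1
            return self._items[query]
        self.misses += 1
        results = compute(query)
        self._items[query] = results
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return results


def slow_search(query):
    """Hypothetical stand-in for an expensive index lookup."""
    return [query.upper()]


cache = ResultCache(capacity=2)
for q in ["a", "b", "a", "c", "b"]:
    cache.get(q, slow_search)
print(cache.hits, cache.misses)
```

A cache like this pays off when a small set of queries is asked repeatedly, which is the pattern query logs tend to show for the most popular queries.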
Others
• Security
• System Maintenance
Evaluation
• Contents and Scopes
• Search Logic
• Display Results
• Search Performance
• Interface Design
Contents and Scopes
• Size of database
• Resources covered (WWW, Usenet, FTP, Gopher, etc.)
• Index depth (e.g. HTML title, header, summary of content, full text)
• Indexing method (automatic or manual indexing)
• Currency and frequency of updating the index
• Multilingual processing
• Types of service covered (e.g. Excite includes Web search, Usenet search, subject guide, city.net, NewsTracker, etc.)
• Reviews provided
Search Logic
• Boolean logic (e.g. AND, OR, NOT)
• Nested Boolean
• Truncation (automatic or user-defined)
• Proximity operators (e.g. NEAR, FOLLOWED BY)
• Phrase searching
• Field search (e.g. URL, title, etc.)
• Handling of capitalization, punctuation, etc.
• Keyword search
• Natural language query
• Relevance feedback
• Refine search (narrow down)
• Weighted search
• Duplicate detection
• Search set manipulation
• Other
Display Results
• Relevance ranking
• Limit on the number of results displayed
• Limit on the level of detail displayed (annotation or summary)
• Direct links to resources
Search Performance
• Precision ratio
• Recall ratio
• Response time
• Accessibility (ease of connecting)
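Precision and recall follow the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 documents truly relevant,
# 2 of the relevant ones were found.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(p, r)
```

Returning 0.0 for an empty set is a convention chosen here to avoid division by zero; other conventions exist.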
Part II.
Why Google Can Do It?
Spider
[Figure labels]
• Indexed pages
• Out-links
• Duplication
• Authority
• Traverse from out-links to authorized pages
Indexing & Ranking
[Figure: an indexed page and the information attached to it.]
• Page title: Academia Sinica
• Anchor text: Government Research Institution in Taiwan
• Anchor text: My CS Lab
• Abstract
• Popularity
• Authority
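The slide names popularity and authority as ranking signals without giving a formula. One well-known link-based way to score authority is a PageRank-style power iteration; the sketch below is that swapped-in illustration, not a description of Google's production ranking. The damping factor of 0.85, the iteration count, and the toy link graph are assumptions, and every link target must also appear as a key.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph given as page -> list of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new[target] += share
        rank = new
    return rank


# Hypothetical graph: "blog" links out but nobody links to it.
LINKS = {"home": ["lab"], "lab": ["home"], "blog": ["home", "lab"]}
RANKS = pagerank(LINKS)
print(sorted(RANKS, key=RANKS.get, reverse=True))
```

Pages that receive links from well-ranked pages end up with higher scores, which is the intuition behind treating in-links as votes of authority.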
Inverted File
Google’s Index File Structure
Distributed Search
[Figure labels: Query, Query Processor, and four SE (search engine) nodes.]
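Distributed search is often organized as scatter and gather: a query processor sends the query to every search node, each node searches its own partition, and the partial results are merged into one ranked list. The sketch below assumes that design; the shard contents, the term-count scoring, and the use of threads are all illustrative choices.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Score documents in one shard by how often the term occurs."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term)
        if count:
            hits.append((count, doc_id))
    return hits


def distributed_search(shards, term, k=3):
    """Query every shard in parallel and merge the top-k hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), shards))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)  # highest score first


# Hypothetical shards, each standing in for one SE node's documents.
SHARDS = [
    {"d1": "gold gold silver", "d2": "silver"},
    {"d3": "gold"},
    {"d4": "gold gold gold"},
]

print(distributed_search(SHARDS, "gold", k=2))
```

Since each node only needs to return its own best hits, the merge step stays cheap even when the number of nodes grows.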
Document Delivery
[Figure labels: Index Space, User Space, Document Space; Information Need, Information Use; Seek, Use; Users, Authors; Short Query, Subject Terms, Real Names; X, Y; X1, X2, ...; Y1, Y2, ...]
Abstract Modeling
Facts (I)
• Query
  • Short query problem
  • 50% are personal and company names
  • Boolean and natural language queries are rare
• Browsing
  • Users seldom go on to a 2nd page of results
  • Precision is more important than recall
• Information collection (Robot)
  • Low coverage, dead links, garbage sites and pages
Facts (II) - Accuracy: Whose Responsibility?
• Users
  • Short query or NLQ?
  • HFQ, LFQ?
• Search engine
  • Technology, data size, ranking?
Facts (III) - Speed: Whose Responsibility?
• Users
  • Keywords, bandwidth
• Search engine
  • Bandwidth, document delivery
Language Ratio
Table 3. Statistics concerning the language used in each search term
           All Chinese   All English   Other
Dreamer    78.20%        19.18%        2.62%
GAIS       78.22%        16.90%        4.88%
Keyword Length
Table 4. Statistics concerning the number of terms per query
           In Chinese        In English   All
Dreamer    3.18 characters   1.10 words   6.31 bytes
GAIS       3.55 characters   1.22 words   7.26 bytes
Keyword Frequency
Table 5. Statistics concerning how often distinct queries are asked
Query occurs   1 time   2 times   3 times   > 3 times
Dreamer        52.4%    17.8%     8.4%      21.4%
GAIS           -        -         -         -
AltaVista      63.7%    16.2%     6.5%      13.6%
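Figures like those in Table 5 can be derived from a raw query log by counting how many times each distinct query appears and then bucketing the counts. The sketch below assumes the log is simply a list of query strings; the sample log is made up and does not reproduce the Dreamer, GAIS, or AltaVista data.

```python
from collections import Counter


def repeat_distribution(queries):
    """Fraction of distinct queries asked 1, 2, 3, and more than 3 times."""
    per_query = Counter(queries)  # query -> number of times asked
    buckets = Counter(min(n, 4) for n in per_query.values())
    total = len(per_query)
    labels = [(1, "1 time"), (2, "2 times"), (3, "3 times"), (4, "> 3 times")]
    if total == 0:
        return {label: 0.0 for _, label in labels}
    return {label: buckets[key] / total for key, label in labels}


# Hypothetical log: 5 distinct queries asked 1, 2, 3, 5, and 1 times.
LOG = ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d", "d", "e"]
print(repeat_distribution(LOG))
```

The share of queries that occur only once indicates how much of the query stream a result cache cannot help with.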
Core Keywords
Table 1. Coverage comparison between Dreamer and GAIS
GAIS/Dreamer   top 1000      top 20k        ALL
top 1000       583/58.30%    977/97.70%     992/99.20%
top 20k        914/91.40%    9709/50.71%    14721/76.89%
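Subject Domains
[Figure: two groups of category labels. First group: Adult, Computer, Entertainment, Chat, Life, Education, Travel, Game, Business, Society, Media, Humanities, Health, Science, Other. Second group: Software, Graph, Search engine, Network, Company, Hardware, BBS, Other.]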
Part III.
New-Generation Search Technologies
New-Generation IR
• Information Perspectives
  • Web IR
  • Multimedia IR
  • Semantic Web IR
  • User-Oriented IR
• Retrieval Perspectives
  • Question Answering
  • Information Extraction
  • Information Filtering
  • Web Mining
New-Generation IR
• Information Perspective
  • Web IR: Global/Specific Search Engines, Spiders
  • Multimedia IR: Speech, Music, Image, Video IR
  • Semantic Web IR: XML IR, Ontology, IE
  • User-Oriented IR: Log Mining, Ontology
• Retrieval Perspective
  • Question Answering: NLQ, FAQ Search
  • Information Extraction
  • Information Filtering: e-mail Spam, Web Page
• Mining Perspective
  • Web Mining, Log Mining
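Image Search by Example Image
Audio/Video Browsing
Video Summarization
Document Classification
Cross-Language Search
Intelligent Question Answering
Ask an Expert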
IR Research at WKD (I)
• Information Perspective
  • Web IR: Cross-Language Web Search, Concept-based Search
  • Multimedia IR: Speech Retrieval, Image Retrieval
  • User-Oriented IR: Query Taxonomy Generation
Cross-Language Web Search
LiveTrans
LiveConcept
Concept-based Web Search
Query: "Please help me find the China-US military aircraft collision" (請幫我找中美軍機擦撞)
Indexing Approach
Query by Exemplar
Retrieved documents (Ranked)
Recording Time
Relevance Score
Speech Query (Recognition results)
Spoken Document (Recognition results)
Speech Retrieval (Dr. 陳柏琳)
Web Image Annotation
[Figure: sample images, each shown with its top 4 annotated keywords.]
• Rainbow, Weather, Flower, Nature
• Sunflower, Flower, Plant, Desert
• Seal, Mammal, Coast, Animal
• Solar System, Comet, Tropical Fish, Universe
• Waterfall, Landform, Nature, Cockroach
• Dog, Mammal, Pangolin, Sheep
Web Image Annotation
Q&A
Thanks !