Search Engines · widit2.knu.ac.kr/~kiyang/teaching/se/s20/lectures/3.se... · 2020-03-29 · ...

Search Engines Introduction



Search Engine Overview

User
  What am I looking for? - Identification of information need
  What question do I ask? - Query formulation

Intermediary
  What is the searcher looking for? - Discovery of the user's information need
  How should the question be posed? - Query representation
  Where is the relevant information? - Query-document matching

Information
  What data to collect? - Collection development
  What information to index? - Indexing/representation
  How to represent it? - Data structure

Diagram components: Query, Searchable Index, Search Results

Data (0)
  1. Document Collection - e.g., spider/crawler
  2. Document Indexing
     - term indexing (tokenizing, stop & stem)
     - term weighting

Search
  (1) Query Indexing
  (2) Document Ranking
  (3) Result Display


Search Engine: Data

Document Collection
  Select target data sources - e.g., domain, corpus, WWW
  Harvest data - e.g., data entry, data import, spider/crawler

Document Indexing
  Select indexing sources (index terms) - e.g., metadata, keywords, content
  Extract indexing terms - e.g., tokenization, stop & stem
  Assign term weights - e.g., tf-idf, Okapi

"The frequency of word occurrence in an article furnishes a useful measurement of word significance."
- The words that occur in a document can be used to analyze its content, and how often a word occurs is a measure of its significance as a subject term.

Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
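
The slide names tf-idf as one way to assign term weights but gives no formula. The sketch below assumes one common formulation, weight = tf x log(N / df), where tf is the term's count in the document, N the number of documents, and df the number of documents containing the term. The function name `tf_idf` and the three toy documents are illustrative assumptions, not part of the lecture.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps doc_id -> list of index terms; returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw term frequency within this document
        # Assumed weighting: tf * log(N / df); other tf-idf variants exist.
        weights[doc_id] = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    return weights

# Toy collection (illustrative): terms are already tokenized and normalized.
docs = {
    "D1": ["information", "retrieval", "seminar"],
    "D2": ["retrieval", "model", "information", "retrieval"],
    "D3": ["information", "model"],
}
weights = tf_idf(docs)
for doc_id, term_weights in weights.items():
    print(doc_id, {t: round(w, 3) for t, w in term_weights.items()})
```

Under this formulation a term that occurs in every document (here "information") gets weight 0, while a term confined to one document (here "seminar") gets the highest idf.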


Search Engine: Indexing Process

Process steps: Tokenization, Token Selection, Token Normalization, Term Weighting
Outputs: Tokens, Selected Tokens, Sequential Index, Inverted Index

Documents (Text)
  D1: Information retrieval seminars
  D2: Retrieval Models and Information Retrieval
  D3: Information Model

Tokens
  D1: information, retrieval, seminar(s)
  D2: retrieval, model(s), and, information, retrieval
  D3: information, model

Sequential Index
  D1  information 1, retrieval 1, seminar 1
  D2  information 1, model 1, retrieval 2
  D3  information 1, model 1

Inverted Index
  Index Term          D1  D2  D3
  wd1 (information)    1   1   1
  wd2 (model)          0   1   1
  wd3 (retrieval)      1   2   0
  wd4 (seminar)        1   0   0
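
The indexing steps on this slide can be sketched roughly as follows, using the slide's three documents. The stop list, the plural-stripping rule (a crude stand-in for a real stemmer), and all function names are assumptions made for illustration; the slide does not specify them.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption; real stop lists are much longer).
STOPWORDS = {"a", "an", "and", "of", "the"}

def tokenize(text):
    """Tokenization: split text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def select(tokens):
    """Token selection: drop stopwords."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def normalize(token):
    """Token normalization: lowercase, then strip a trailing plural 's'."""
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def build_indexes(docs):
    """Return (sequential index: doc -> term counts, inverted index: term -> doc counts)."""
    sequential, inverted = {}, {}
    for doc_id, text in docs.items():
        terms = [normalize(t) for t in select(tokenize(text))]
        sequential[doc_id] = Counter(terms)
        for term, tf in sequential[doc_id].items():
            inverted.setdefault(term, {})[doc_id] = tf
    return sequential, inverted

docs = {
    "D1": "Information retrieval seminars",
    "D2": "Retrieval Models and Information Retrieval",
    "D3": "Information Model",
}
sequential_index, inverted_index = build_indexes(docs)
print(inverted_index["retrieval"])  # documents containing "retrieval", with counts
```

The sequential index is keyed by document and the inverted index by term. The inverted form lets a search look up only the documents that contain a query term instead of scanning every document.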


Search Engine: Search

Query Indexing
  Tokenization
  Stop & Stem
  Term Weighting

Document Ranking
  Query-document matching
  Document score computation

Result Display
  Content - e.g., title & snippets
  Layout - e.g., grouped by category
  Toppings - e.g., related searches

Query: What is information retrieval?
Q: information 1, retrieval 1

  Index Term          D1  D2  D3
  wd1 (information)    1   1   1
  wd2 (model)          0   1   1
  wd3 (retrieval)      1   2   0
  wd4 (seminar)        1   0   0

  Rank  docID  score
  1     D2     3
  2     D1     2
  3     D3     1
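
The slide does not state how the document score is computed. The scores in its ranking table are consistent with a simple inner product of query and document term weights, so the sketch below assumes that scoring. The function name `rank` and the tie-breaking by document ID are illustrative choices.

```python
from collections import Counter

# Inverted index from the slide: term -> {doc_id: term weight}.
inverted_index = {
    "information": {"D1": 1, "D2": 1, "D3": 1},
    "model": {"D2": 1, "D3": 1},
    "retrieval": {"D1": 1, "D2": 2},
    "seminar": {"D1": 1},
}

def rank(query_weights, index):
    """Score each document by the inner product of query and document weights."""
    scores = Counter()
    for term, q_weight in query_weights.items():
        # Only documents listed under the query term contribute to the score.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc ID (an arbitrary choice).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# "What is information retrieval?" after stop-word removal.
query = {"information": 1, "retrieval": 1}
ranking = rank(query, inverted_index)
for position, (doc_id, score) in enumerate(ranking, start=1):
    print(position, doc_id, score)
```

With these weights D2 scores 1 + 2 = 3, D1 scores 1 + 1 = 2, and D3 scores 1 + 0 = 1, matching the ranking table on the slide.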


[Figure: search results page screenshot, 2015; result categories labeled 1-14]


[Figure: search results page screenshot, continued; result categories labeled 15-20]

Query: 정보검색 (Information Retrieval), 2015

Result Categories
  1. Encyclopedia
  2. Naver Books
  3. Q&A DB (지식iN)
  4. Magazine
  5. Café
  6. Blog
  7. Book
  8. Map
  9. Website
  10. Advertisement (파워링크)
  11. Image
  12. Webpage
  13. Naver News Library
  14. Video
  15. Naver AppStore
  16. Naver Scholar
  17. Naver Post
  18. Naver Shopping
  19. News
  20. Naver Dictionary

Proprietary (Naver-specific) content
Dynamic category order
Toppings
  • Search by Category
  • Related Searches
  • Popular Searches (by category)


Query: 정보검색 (Information Retrieval), 2020

NAVER 2020                                NAVER 2015
1. Encyclopedia (지식백과)                  1. Encyclopedia
2. Naver Dictionary (어학사전)              20. Naver Dictionary
3. Website (웹사이트)                       9. Website
4. Advertisement (파워링크)                 10. Advertisement
5. Naver Post (포스트)                      17. Naver Post
6. Blog (블로그)                            6. Blog
7. Video                                  14. Video
8. Online Open Courses (온라인공개강좌)      -
9. Q&A DB (지식iN)                         3. Q&A DB
10. Café (카페)                            5. Café
10. Naver AppStore (앱정보)                 15. Naver AppStore
10. Naver Books (Naver 책)                 2. Naver Books
                                          7. Book (full-text search)
10. Image                                 11. Image
-                                         4. Magazine
-                                         8. Map
-                                         12. Webpage
-                                         13. Naver News Library
-                                         16. Naver Scholar
-                                         18. Naver Shopping
-                                         19. News


Query: 검색엔진 (Search Engine), 2020

NAVER 2020 (검색엔진)                       NAVER 2020 (정보검색)
1. Advertisement (파워링크)                 1. Encyclopedia (지식백과)
2. Encyclopedia (지식백과)                  2. Naver Dictionary (어학사전)
3. Naver Dictionary (어학사전)              3. Website (웹사이트)
4. Website (웹사이트)                       4. Advertisement (파워링크)
5. Naver Post (포스트)                      5. Naver Post (포스트)
6. Advertisement (비즈사이트)               6. Blog (블로그)
7. Blog (블로그)                            7. Video
8. Video                                  8. Online Open Courses (온라인공개강좌)
9. Q&A DB (지식iN)                         9. Q&A DB (지식iN)
10. Café (카페)                            10. Café (카페)
10. Naver AppStore (앱정보)                 10. Naver AppStore (앱정보)
10. Naver Books (Naver 책)                 10. Naver Books (Naver 책)
10. Image                                 10. Image


[Figure: search results page screenshot, 2015; result categories labeled 1-2]

Query: Information Retrieval, 2015

Result Categories
  1. Webpage
  2. Advertisement

Webpage-centric content
Dynamic category order
Toppings
  • Search by Category
  • Related Searches


Query: Information Retrieval, 2020

Google 2020                NAVER 2020
1. Wikipedia               1. Encyclopedia (지식백과)
2. Knowledge Panel         2. Naver Dictionary (어학사전)
3. Answer Box              3. Website (웹사이트)
4. Webpage                 4. Advertisement (파워링크)
-                          5. Naver Post (포스트)
-                          6. Blog (블로그)
-                          7. Video
-                          8. Online Open Courses (온라인공개강좌)
-                          9. Q&A DB (지식iN)
-                          10. Café
-                          10. AppStore, Books, Image
5. Related Searches        Related Searches (연관검색어)

Top Categories             Top Categories (subset)
Image                      Naver Dictionary (어학사전)
Video                      Image
News                       Blog
Books                      News
Maps                       Books
Shopping                   Encyclopedia (지식백과)
Finance                    Website


Query: Search Engines, 2020

Google 2020 (Search Engines)      Google 2020 (Information Retrieval)
1. Wikipedia                      1. Wikipedia
2. Knowledge Panel                2. Knowledge Panel
3. Answer Box                     3. Answer Box
4. Disambiguation Box             4. Webpage
5. Webpage                        -
6. Top Stories (News)             -
7. Webpage                        -
8. Related Searches               5. Related Searches

Google SERP Features by Overthink Group
  Knowledge Graphs
  Knowledge Panel
  Answer Box
  Disambiguation Box
  Carousels
  Google Posts


Search Engine vs. Database vs. Directories

                    Search Engine                 Database                      Directories
Corpus Type         General                       Specific                      General/Specific
Data Collection     Automatic - crawler/spider    Manual - data entry/import    Manual - classification
Data Quality        Not controlled                Controlled                    Controlled
Data Organization   None (bag-of-words)           Structured - Relational       Structured - Hierarchical
Query Input         Text box                      Field-specific - Boolean      Text box
Search Result       Ranked - documents            Not ranked - records          Ranked - categories
Search Index        Document text                 Database Tables               Category Tree
e.g.                Google                        Library Search                curlie.org