[2b1] Paradigm Shift in Search Engines

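Paradigm Shift in Search Engines: Convergence of Big Data Analytics and Search

Department of Computer Science, College of Informatics, Korea University

Jaewoo Kang (강재우)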

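Research Background

Changes in what users want from search

The arrival of the Web 2.0 era of participation, sharing, and openness

A shift to a user-centered structure of information production and consumption

An explosion of personal opinions and subjective information on the Web and SNS

Growing demand for subjective information such as “a good Korean restaurant in Bundang for a sanggyeonrye (the formal meeting of two families before a wedding)”, “a thriller with a good plot twist”, or “handbags in fashion”

• Demand for fact search (e.g., ‘action movie’) is flat or irregular, whereas subjective queries such as ‘best action movie’ and ‘best SUV’ are growing steadily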


Google search trend graph for the queries ‘action movie’ and ‘best action movie’ (Google Trends, http://www.google.com/trends/)


Aardvark: Large-Scale Social Search Engine (Horowitz and Kamvar, WWW2010)

“64% of queries contain subjective element in Aardvark”

(e.g., “Do you know of any great delis in Baltimore, MD?”, “What are the things/crafts/toys your children have made that made them really proud of themselves?”)

Acquired by Google in 2010 for USD 50,000,000 (about KRW 53 billion)

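Fact Search vs. Consensus Search

Growing demand for consensus search

Search Engine vs. Consensus Engine

Limits of conventional document-based search engines

Objective information (e.g., ‘action movies’ or ‘handbag prices’) can be found with today's search engines, but they cannot respond adequately to subjective queries (‘fun action movies’, ‘handbags in fashion these days’)

This calls for a fundamental paradigm shift toward a new search technology that identifies the objects described within documents, treats those objects as the units to index, and ranks them by aggregating and analyzing, object by object, the user opinions scattered across many documents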


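• Simple product sorting only: lowest price, highest price, newest listing, most product reviews

User reviews are simply listed one after another

• Hard to grasp the content • Hard to synthesize the information

Complicated option selection

A result list with no useful information beyond TV screen size (inches) and price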


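Purchase review | 2013.04.12

I hesitated a lot about buying an expensive electronics product online. Now that it is installed and I have seen it, I am very satisfied. The screen is big, the picture is good, and I feel good about getting it at a low price.

LG Electronics 47LM6200

A TV with powerful performance for the price. | 2013.04.01

The product itself is an entry-level model at a low price. It has powerful features such as internet and 3D, and after reading product reviews here and there I saw that everyone was satisfied, so I bought it with confidence. I think I bought a good product at a reasonable price. Thank you.

An excellent choice... LG Smart TV 47LM6200... | 2012.09.10

A very good choice for the price. In particular, the remote control functions and the 3D glasses are much more convenient and usable than those from company S. The 3D glasses are also far more comfortable than other makers' battery-powered 3D glasses, and the clip-on type, which is convenient for people who wear glasses, is a standout idea.

Clean picture quality and wall-mount installation: Good. Delivery delayed due to product supply | 2012.07.02

I think it is a 3D smart LED TV with excellent performance for the price. The picture is clean and clear, and above all it was installed very well as a wall-mounted unit, so I am satisfied.

It's not bad. | 2013.04.19

For the price, this seems decent. However, the mouse remote is something of a mixed blessing. A smart TV definitely needs one, but its sensitivity makes it quite uncomfortable. The remote is also extremely simple.. so simple that it is awkward to operate.. Apart from the remote control system, it is not bad.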

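Search: a TV with a good price-performance ratio

The product itself is an entry-level model at a low price (LG 47LM)

A very good choice for the price. (LG 47LM)

I think it is a 3D smart LED TV with excellent performance for the price. (LG 47LM)

For the price, this seems decent. (LG 47LM)

The screen is big, the picture is good, and I feel good about getting it at a low price. (Samsung UN50)

Above all, I want to say it is the best product for the price (Samsung UN50)

I am satisfied that I got it at a very good price (Samsung UN50)

Good size and picture quality for the price. (Samsung UN50)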

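Truly the best product & service. | 2013.07.31

I ordered yesterday and never expected delivery this fast!!! The delivery technician also installed it just the way I wanted. Above all, I want to say it is the best product for the price. Satisfied with everything!!

Satisfied with the friendly price. | 2012.12.18

I am satisfied that I got it at a very good price. It is a Samsung smart TV, and its performance and appearance are no different from what I had seen in department stores, so I am satisfied. I have been using it for about two weeks now and I am satisfied with both the features and the appearance.

The best-value model for the price | 2013.03.21

Good size and picture quality for the price. I ordered in the evening and it was delivered the next morning!!! I bought the wall-mount version; it is big and should be great for watching movies. Good picture quality, good size, and lightning-fast delivery!!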

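Aspect segments matched with the query terms

bought at a low price

best for the price

low price

excellent performance for the price

good size and picture quality for the price

a very good price

decent for the price

a very good choice for the price

Segment scores (in the order they appear on the slide): 0.5, 0.8, 0.9, 0.7, 0.5, 0.8, 0.7, 0.6

Samsung total: 2.9

LG total: 2.6

Final search ranking

1. Samsung UN50ES6800F 2. LG 47LM6200

Click! Samsung Electronics UN50ES6800F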

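Consensus Search

Users today actively use internet search to make decisions about purchases and cultural activities

They consult other users' reviews and comments before attending a performance or buying a product

Each review is written from its author's “subjective opinion”

Reading as many reviews as possible is what helps decision making

What is a consensus engine?

It analyzes in advance the many reviews that other users have already written

A search system that analyzes and synthesizes other users' reviews from the viewpoint (query) the user cares about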


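Consensus Engine

Today's search engines are not enough!

The information you want may be in a few of the top-ranked documents

But each document is only its author's opinion

It cannot represent the consensus of the public

But the answer already exists on the Web!

Many users post their opinions online in various forms (SNS, blogs, reviews)

Treat these online opinions as “latent votes”

Collecting and analyzing these already-expressed online opinions at query time makes consensus search possible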


Uhm.. Yeah.. It is noisy, but…


Online consumer posts: the second most trusted form of advertising (The Nielsen Company, Q3 2011)

Is consensus search ever possible…?

“Best Action Movies in 2013”

Not immediately answerable with conventional search engines

Because the answer should be based on consensus, which cannot be found in any single one of the “top-10” documents

However, the answers are already on the Web

Numerous implicit votes from people on the Web and social networks

If only we could process them …

… ONLINE!


CONSENTO Overview


The Key Ideas (I)

Subdocument-level Indexing

Capture semantics from user opinion more precisely

The indexing unit is no longer a page but:

• a review within a page, if the page contains more than one review,

• or a sentence within a review,

• or even a clause or phrase within a sentence that discusses one aspect of the target entity

Maximal Coherent Semantic Unit (MCSU)

• the finest-grained indexing unit used in CONSENTO indexing

• a maximal subsequence of words within a sentence that carries a single coherent meaning

Indexing MCSUs instead of documents allows semantic analysis to be performed at indexing time

• which facilitates online processing of consensus search at query time


The Key Ideas (II)

ConsensusRank: A Unique Ranking Method Based on Public Sentiment

Virtually all existing ranking methods rank target objects (either documents or entities) directly, based on their relevance to the query terms

In contrast, ConsensusRank ranks entities indirectly, by aggregating the scores of the referring segments (e.g., MCSUs) that match the query context

It can be viewed as a voting process in which each reviewer casts a weighted vote on an entity with respect to a query by expressing a positive or negative opinion about that entity
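The voting view can be written as a simple sum. The notation below is an assumption made for illustration and does not appear on the slides: M(q) is the set of MCSUs matching query q, w(s) is a non-negative vote weight (for example, the quality of the review containing s), and pol(s) is the polarity of the opinion expressed in s.

```latex
\mathrm{score}(e, q) \;=\; \sum_{\substack{s \in M(q) \\ \mathrm{entity}(s) = e}} w(s)\,\mathrm{pol}(s),
\qquad w(s) \ge 0,\quad \mathrm{pol}(s) \in \{+1, -1\}
```

Entities are then ranked by descending score(e, q). How the actual ConsensusRank weights and combines votes is not specified in these slides.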


CONSENTO Architecture

[Figure: CONSENTO architecture]

(A) Indexing Subsystem: Web Documents → 1. Parsing & Preprocessing (DOM-tree Parsing, Contents Extraction) → Review Contents → 2. Contents Segmentation (Sentence Splitter, MCSU Extraction) → MCSUs → 3. Indexing (Inverted Entry Construction & Indexing) → Entity Search Index

(B) Searching Subsystem: User Query → 4. Query Parsing (Query Preprocessing & Expansion) → Expanded Query → 5. Retrieval (Matching MCSU Retrieval) → MCSU Posting List → 6. Ranking (Segment Grouping, Score Aggregation) → Entity List

Indexing Subsystem

Parsing & Preprocessing

Contents Segmentation

Indexing

Searching Subsystem

Query Parsing

Retrieval

Ranking

I: Parsing & Preprocessing

The current working prototype of CONSENTO is built for the movie domain

CONSENTO crawled review pages from popular movie review sites such as IMDb and Metacritic

Review contents are extracted using DOM-tree parsing and XPath queries

Extracted information includes:

entity name (i.e., movie name)

review text

date and time

review quality (e.g., “20 out of 30 people found the review helpful”)
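A minimal sketch of the extraction step, assuming a hypothetical, well-formed page fragment. The tag names, class names, and the `extract_reviews` helper are invented for illustration, and turning the helpfulness ratio into a quality score between 0 and 1 is also an assumption. Real review pages are rarely well-formed XML and would need a proper HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the markup is invented for this example.
PAGE = """
<html><body>
  <h1 class="title">Avatar</h1>
  <div class="review">
    <span class="date">2010-01-15 21:30</span>
    <p class="text">Navi looks very real, good performance.</p>
    <span class="helpful">20 out of 30 people found the review helpful</span>
  </div>
</body></html>
"""


def extract_reviews(page: str) -> list[dict]:
    """Parse the DOM tree and pull the four extracted fields with XPath-style queries."""
    root = ET.fromstring(page)
    entity = root.find(".//h1[@class='title']").text
    reviews = []
    for node in root.findall(".//div[@class='review']"):
        helpful = node.find("./span[@class='helpful']").text
        yes, total = map(int, re.search(r"(\d+) out of (\d+)", helpful).groups())
        reviews.append({
            "entity": entity,
            "text": node.find("./p[@class='text']").text,
            "timestamp": node.find("./span[@class='date']").text,
            # Assumption: review quality = fraction of readers who found it helpful.
            "quality": yes / total if total else 0.0,
        })
    return reviews


reviews = extract_reviews(PAGE)
```

The standard-library `ElementTree` supports only a subset of XPath, which is enough for attribute predicates such as `[@class='review']`.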

II: Contents Segmentation

Split the review contents into MCSUs

e.g., “The storyline is ridiculous, the acting is laughable, and the camera work is terrible.”

s1) “The storyline is ridiculous”

s2) “the acting is laughable”

s3) “the camera work is terrible”
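A naive sketch of the splitting step, applied to the example sentence from the slide. The boundary rule (a comma, optionally followed by "and" or "but", or a bare "but") is an assumption chosen to reproduce that example; the slides do not say how CONSENTO actually detects MCSU boundaries, and a real splitter would likely need syntactic analysis.

```python
import re

# Assumed boundary pattern, not CONSENTO's actual rule.
BOUNDARY = re.compile(r"\s*,\s*(?:and\s+|but\s+)?|\s+but\s+", re.IGNORECASE)


def split_into_segments(sentence: str) -> list[str]:
    """Split one sentence into candidate MCSUs at the assumed clause boundaries."""
    body = sentence.strip().rstrip(".!?")
    return [part.strip() for part in BOUNDARY.split(body) if part.strip()]


segments = split_into_segments(
    "The storyline is ridiculous, the acting is laughable, "
    "and the camera work is terrible."
)
for number, segment in enumerate(segments, start=1):
    print(f"s{number}) {segment}")
```

A comma-based rule like this over-splits enumerations such as "great cast, music, and pacing", which is one reason it is only a sketch.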

III: Indexing

CONSENTO indexes MCSUs in a conventional inverted index of the kind used in most modern search engines.

Only the mapping needs to be redefined logically, from (terms → documents) to (terms → MCSUs)


* Conventional Indexing Method Example

Document #1 (Entity Name: Transformer 3): “excellent visual effects, but plot was hard to follow”

Feature 1: visual effects (sentiment: excellent)

Feature 2: plot (sentiment: hard to follow)

Bag of words for Doc#1: excellent, visual, effects, plot, hard, follow

Traditional inverted index

Term        Doc
excellent   #1
hard        #1
follow      #1
plot        #1
visual      #1
effects     #1

Query: “excellent plot”. The system returns this document, even though “excellent” describes the visual effects and not the plot.


* Subdocument-level Indexing Example

“excellent visual effects, but plot was hard to follow”

Segment ID   Object Name     Feature          Sentiment
Segment 1    Transformer 3   visual effects   excellent
Segment 2    Transformer 3   plot             hard to follow

Sub-document level indexing

Term        Segment ID   Object Name     Feature          Sentiment
excellent   SID1         Transformer 3   visual effects   excellent
visual      SID1         Transformer 3   visual effects   excellent
effect      SID1         Transformer 3   visual effects   excellent
plot        SID2         Transformer 3   plot             hard
hard        SID2         Transformer 3   plot             hard
follow      SID2         Transformer 3   plot             hard

Query: “excellent plot” does not match any segment
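The contrast between the two indexing examples can be reproduced with a small inverted index. This is a sketch under stated assumptions: the stop-word list, the tokenizer, the AND query semantics, and the function names are all invented for illustration. Only the example sentence, its two segments, and the query come from the slides.

```python
from collections import defaultdict

STOP_WORDS = {"but", "was", "to"}  # assumed minimal stop list for this example


def tokenize(text: str) -> list[str]:
    """Lowercase, strip trailing punctuation, and drop stop words."""
    words = (w.strip(",.").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(units: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the IDs of the indexing units (documents or MCSUs) containing it."""
    index = defaultdict(set)
    for unit_id, text in units.items():
        for term in tokenize(text):
            index[term].add(unit_id)
    return index


def search_all_terms(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the units that contain every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


doc_index = build_index({"Doc#1": "excellent visual effects, but plot was hard to follow"})
seg_index = build_index({
    "SID1": "excellent visual effects",
    "SID2": "plot was hard to follow",
})

doc_hits = search_all_terms(doc_index, "excellent plot")  # matches the whole document
seg_hits = search_all_terms(seg_index, "excellent plot")  # no single segment has both terms
```

The same index code serves both cases; only the choice of indexing unit changes, which is the point made by the (terms → documents) versus (terms → MCSUs) mapping.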

III: Indexing

Simply treat an MCSU as a document

Store additional information in each posting for use in the ranking stage

MCSU posting structure

Review table (rid: review ID, ts: timestamp, rq: review quality)

rid   ts    rq
r1    ts1   0.8
r2    ts2   0.4
r3    ts3   0.6
r4    ts4   0.9
r5    ts5   0.4
r6    ts6   0.5
r7    ts7   0.7
r8    ts8   0.6
r9    ts9   0.8

Site Name    Source ID
IMDb         w1
Flixster     w2
Metacritic   w3
Yahoo!       w4

Feature       id
music         a1
soundtrack    a2
story         a3
plot          a4
performance   a5
acting        a6

Sentiword   id
great       m1
excellent   m2
superb      m3
tragic      m4

Entity               id
Titanic              e1
Brokeback Mountain   e2
Dark Knight          e3
Avatar               e4

Term → Postings

Posting format: <segment ID, entity ID, [feature IDs], [sentiword IDs], review ID, source ID>

Cameron: <s19, e4, [−], [m3], r7, w3>

Pandora: <s16, e4, [a2], [−], r6, w3>, <s18, e4, [−], [−], r6, w3>

tragic: <s7, e2, [a3], [m4], r3, w1>

performance: <s5, e1, [a6], [m6], r2, w1>, <s9, e2, [a6], [m3], r3, w1>, <s11, e2, [a6], [m1], r4, w1>, <s13, e3, [a6], [−], r5, w2>, <s15, e4, [a6], [−], r5, w3>, <s20, e3, [a6], [−], r8, w4>, <s21, e3, [a6], [m6], r9, w4>

soundtrack: <s4, e1, [a2], [−], r2, w1>, <s10, e2, [a2], [m2], r4, w1>, <s16, e4, [a2], [−], r6, w2>, <s22, e3, [a2], [m1], r9, w4>

plot: <s14, e3, [a4], [−], r5, w2>

acting: <s13, e4, [a6], [−], r9, w4>

music: <s2, e1, [a1], [m1], r1, w1>, <s8, e2, [a1], [m1], r3, w1>

Yeston: <s2, e1, [a1], [−], r1, w1>

story: <s1, e1, [a3], [m1], r1, w1>, <s7, e2, [a3], [−], r3, w1>, <s12, e2, [a3], [m2], r4, w1>, <s17, e4, [a3], [−], r6, w3>

Sample reviews by entity and review ID (segments separated by //)

Titanic

r1: (s1) the greatest love stories of all // (s2) and beautiful music from Yeston. // (s3) Everything about this movie was excellent...

r2: (s4) touching soundtrack, // (s5) and perfect handling of the known tragedy with nice performance. // (s6) This has the best love scene I have ever seen…

Brokeback Mountain

r3: (s7) beautiful tragic love story, // (s8) with great music. // (s9) superb performances in movies ever!

r4: (s10) The soundtrack is also excellent, // (s11) great performance, // (s12) excellent presentation of a love story…

The Dark Knight

r5: (s13) The performance by Heath Ledger was outstanding // (s14) and plot is amazing too…

r8: (s20) Joker shows phonemically awesome performance!…

r9: (s21) nice performance // (s22) and backed up with great soundtrack. // (s23) excellent casting!

Avatar

r6: (s15) Navi looks very real, good performance, // (s16) beautiful soundtrack that emphasize the vastness of the Pandora, // (s17) with love story. // (s18) The world of Pandora is stunning

r7: (s19) James Cameron deserves high praise for this creation…
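The posting structure can be modeled as a small record type. This is a sketch only: the `Posting` class, its field names, and the `describe` helper are hypothetical, while the three postings and the review-quality values are copied from the posting-list example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One MCSU posting: <segment, entity, [features], [sentiwords], review, source>."""
    segment_id: str
    entity_id: str
    feature_ids: tuple[str, ...]
    sentiword_ids: tuple[str, ...]
    review_id: str
    source_id: str


# Lookup tables, restricted to the IDs used below.
ENTITY_NAMES = {"e2": "Brokeback Mountain", "e3": "Dark Knight"}
REVIEW_QUALITY = {"r3": 0.6, "r4": 0.9, "r9": 0.8}

# A small slice of the term -> postings index from the example.
INDEX = {
    "soundtrack": [
        Posting("s10", "e2", ("a2",), ("m2",), "r4", "w1"),
        Posting("s22", "e3", ("a2",), ("m1",), "r9", "w4"),
    ],
    "tragic": [Posting("s7", "e2", ("a3",), ("m4",), "r3", "w1")],
}


def describe(term: str) -> list[tuple[str, str, float]]:
    """Resolve each posting for a term to (entity name, segment ID, review quality)."""
    return [
        (ENTITY_NAMES[p.entity_id], p.segment_id, REVIEW_QUALITY[p.review_id])
        for p in INDEX.get(term, [])
    ]


soundtrack_hits = describe("soundtrack")
```

Keeping the entity ID and review ID inside each posting means the ranking stage can group and weight segments without going back to the original review text.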

IV: Query Parsing

CONSENTO preprocesses the query and performs query expansion:

• stop-word removal

• polarity-only word removal

• feature expansion

• stemming

Polarity-only word removal

"good action movie" and "great action movie" should be treated as the same query

Feature words are expanded for better recall

‘plot’ → {plot, story}

‘music’ → {music, soundtrack}
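The four query-parsing steps can be sketched as follows. The word lists, the `naive_stem` placeholder, and the `parse_query` function are assumptions made for illustration. Only the two feature expansions and the "good/great action movie" example come from the slide.

```python
STOP_WORDS = {"a", "an", "the", "of", "with", "in", "for"}            # assumed list
POLARITY_ONLY_WORDS = {"good", "great", "best", "nice", "excellent"}  # assumed list
FEATURE_EXPANSIONS = {                                                # from the slide
    "plot": {"plot", "story"},
    "music": {"music", "soundtrack"},
}


def naive_stem(word: str) -> str:
    """Placeholder stemmer that strips a plural 's'; a real system would use a proper stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def parse_query(query: str) -> list[set[str]]:
    """Return one set of alternative terms per query word that survives filtering."""
    groups = []
    for token in query.lower().split():
        if token in STOP_WORDS or token in POLARITY_ONLY_WORDS:
            continue  # stop-word and polarity-only word removal
        expanded = FEATURE_EXPANSIONS.get(token, {token})  # feature expansion
        groups.append({naive_stem(term) for term in expanded})  # stemming
    return groups


same_query = parse_query("good action movie") == parse_query("great action movie")
expanded_query = parse_query("movies with great plot")
```

Returning one set per surviving query word keeps the expanded alternatives together, so retrieval can treat the terms inside a set as interchangeable.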

V: Retrieval

Retrieve the MCSU segments that match the query terms

This works the same way conventional systems retrieve document posting lists

VI: Ranking

Group MCSU postings by entity and aggregate the scores of the postings to compute the score of the corresponding entity
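A minimal sketch of grouping and aggregation, using a subset of the "performance" postings and the review-quality (rq) values from the posting-list example. The scoring rule is an assumption: each matching segment counts as a positive vote weighted by the quality of its review. The slides do not give the actual ConsensusRank formula.

```python
from collections import defaultdict

ENTITY_NAMES = {"e1": "Titanic", "e2": "Brokeback Mountain", "e3": "Dark Knight"}
REVIEW_QUALITY = {"r2": 0.4, "r3": 0.6, "r4": 0.9, "r5": 0.4, "r8": 0.6, "r9": 0.8}

# (segment ID, entity ID, review ID): a subset of the postings for "performance".
PERFORMANCE_POSTINGS = [
    ("s5", "e1", "r2"),
    ("s9", "e2", "r3"),
    ("s11", "e2", "r4"),
    ("s13", "e3", "r5"),
    ("s20", "e3", "r8"),
    ("s21", "e3", "r9"),
]


def consensus_rank(postings):
    """Group postings by entity, sum the assumed vote weights, and sort by total score."""
    totals = defaultdict(float)
    for _segment_id, entity_id, review_id in postings:
        # Assumption: vote weight = review quality, polarity treated as positive.
        totals[entity_id] += REVIEW_QUALITY[review_id]
    scored = [(ENTITY_NAMES[e], round(score, 2)) for e, score in totals.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranking = consensus_rank(PERFORMANCE_POSTINGS)
for name, score in ranking:
    print(f"{name}: {score}")
```

Under this assumed rule, an entity with many well-rated matching segments outranks one with a single match, which is the voting behavior the ranking stage is described as having.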


Experimental Setup: Data Set

Movie data sets

Source

• Amazon, IMDb, Metacritic, Flixster, Rotten Tomatoes, and Yahoo Movies

Period

• 2008 ~ 2010

More than 740 movies and 30K reviews

Hotel data sets

Hotel data set from Ganesan and Zhai

Reviews for hotels in 10 major cities from TripAdvisor

The authors provided us with the corrected judgment set for our test

Experiment

Methods

Ganesan and Zhai’s OE and QAM methods

• Opinion expansion word

• Query aspect model

Baseline

1) BM25

• b = 0.75

• k1 = 2

2) VSMBM (Lucene default)

• Vector space model + Boolean model

3) ConsensusRank
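For reference, the BM25 baseline with the parameters listed above (b = 0.75, k1 = 2) can be sketched as below. This is the standard Okapi BM25 term-weighting formula with one common smoothed IDF variant; which IDF variant and tokenization the experiments used is not stated, so treat those details and the toy documents as assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=2.0, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF variant (assumption): always non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy documents, invented for this example.
docs = [
    ["great", "plot", "twist"],
    ["great", "music"],
    ["boring", "plot", "plot", "holes"],
]
scores = bm25_scores(["plot"], docs)
```

Note that BM25 rewards the third toy document for repeating "plot" even though the opinion is negative, which illustrates why relevance-only baselines are a natural comparison point for a sentiment-aware ranking.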

Experimental Result - Movie

Experimental Result - Hotel

[Figure: the destinations Hawaii, Cebu, and Gold Coast labeled with the activities Honeymoon, Snorkeling, Whale Watching, and Active Volcano; the slide highlights “Hawaii!”]

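1. Analyze and index in advance the diverse information on the Web and social networks

2. Produce real-time results for ad-hoc decision-making queries

Example queries:

A thriller movie with a plot twist?

A backpack for college students?

A trustworthy used-car dealer?

A trustworthy daycare center nearby

A hair salon for job-interview makeup

A good place to study near the academy

A Korean restaurant in Gangnam for a sanggyeonrye

Accommodation for backpacking trips

A trainer in my neighborhood who is good at PT?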


best thriller with plot twist

The Artist vs. Jack and Jill


good pizza restaurant

Click!

CONSENTO Local service example

‘Napk-In’ service example

‘슝’ service example

The latent consensus search market

Fact search

Consensus search

ENGINEERING KNOWLEDGE

SEARCHING WISDOM

CONSENTO

THANK YOU
