using lucene solr to build advertising systems

Using Lucene/Solr to build Advertising Systems. Hide (Hatayama Hideharu), Big Data Department, Targeting Section, Advertising Group, Rakuten, Inc. May 2nd 2013

Upload: lucenerevolution

Post on 15-Jan-2015

DESCRIPTION

Presented by Hideharu Hatayama, Rakuten, Inc. I want to talk about architecture patterns of Solr-centered ad systems and the practical knowledge we gained by operating such a system with high availability for years. These topics should also apply to other systems, such as e-commerce sites or restaurant recommendation sites. Through the presentation, I aim to give beginners hints on how to design a high-performance system architecture using Solr, and how to manage and operate such systems while avoiding downtime.

TRANSCRIPT

Page 1: Using lucene solr to build advertising systems

Using Lucene/Solr to build Advertising Systems

Hide (Hatayama Hideharu)

Big Data Department, Targeting Section, Advertising Group

Rakuten, Inc. May 2nd 2013

Page 3

Intro

Agenda | www.lucenerevolution.org

http://www.lucenerevolution.org/2013/agenda

35 min...orz

my talk is NOT about... m(_ _)m

NRT

SolrCloud

complicated queries

or other Solr hot topics

my talk is just about

Overview of Solr, most common features

Our empirical knowledge about Solr

Page 4

Agenda

1 Introduction of Me & Rakuten

2 Solr centered Advertising Systems

3 Solr performance

4 Solr plug-in

5 (Solr with Japanese language)

Page 10

Who am I?

Hatayama Hideharu (call me Hide)

M.Eng, Tokyo Institute of Technology, Japan

Worked on advertising systems at Rakuten for 3 years

ad management system development

ad distribution system development

system architecture design

increase the performance of systems

increase profitability of ad services

User of Solr, not implementer

http://6109.hidepiy.com/

Page 11

Who are we?

Rakuten, Inc.

Internet services company

Founded: Feb. 7th, 1997, Tokyo, Japan

The first service: Rakuten Ichiba (shopping mall)

Page 12

Who are we?

Page 13

Rakuten in Japan

Page 14

Rakuten Ichiba

Ichiba: The largest online shopping mall in Japan

user info

campaign

other services

item search

category navigation

personalized item

item history

sale event

shop history

bookmarked item

service tab

Page 15

Rakuten’s Global Expansion

[Map of Rakuten's global presence. Legend: E-Commerce, eBook, Travel, Other services & businesses, Development center]

Page 16

Agenda

2 Solr centered Advertising Systems

Page 17

Types of advertisements on Rakuten Ichiba [1/3]

Listing Ad (search word related ad)

item search

searched ads

searched items

Page 18

Types of advertisements on Rakuten Ichiba [2/3]

Display Ad (placement related ad)

where, when …

Targeting Ad (user related ad)

sex, age, browsing history …

Page 19

Types of advertisements on Rakuten Ichiba [3/3]

... Ad ?

120 ads on 1 page ...orz

Page 20

ad system function landscape

Tool users: Rakuten staff, Merchants, Other staff

Media: Rakuten Owned Media (Web/Email), Owned Ad Network, External ADNW, AdEx

Device: PC, Mobile, Smartphone, Tablet

Ad types: Tenancy Ad (Fixed placement/fee/term), P4P Ad (CPM/CPC/CPA etc.)

ad management: Ad placement def., Sales mgmt., Creative mgmt., Campaign mgmt., Budget mgmt., Bidding, Delivery mgmt., Reporting, Merchant Tool, Targeting/media

ad distribution: Ad server, Log processing, Targeting (Placement, keyword, behavioral, demographic, etc.), Beacon server, Redirect server

Additional Function: Big Data Analysis (Attribution, Behavior, Optimization), Advanced targeting, Creative optimization, Connect to affiliate network, Programmatic media buying

Page 21

ad distribution system [1/2]

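[Diagram: a request carrying parameters (placement, keyword, ad type, ...) and a cookie goes to the ad distribution system (???), which performs ad searching, ad filtering, ad sorting, logging, ... and returns JSON, HTML, or JavaScript]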

Page 22

ad distribution system [2/2]

need high performance, high availability

e.g., more than 7,000 req / sec for 1 server with 100.00% avail.

collect & analyze log, then improve profitability

the basic architecture is the same across our various ad types

using...

Kyoto Tycoon

Page 23

system design: few years ago [1/5]

[Diagram: web server clusters, app server clusters, and master/slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

Page 24

system design: few years ago [2/5]

[Diagram: web server clusters, app server clusters, and master/slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

cluster

web server x 4

app server x 5

Page 25

system design: few years ago [3/5]

[Diagram: web server clusters, app server clusters, and master/slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

SLB connects app <-> Solr

Page 26

system design: few years ago [4/5]

[Diagram: web server clusters, app server clusters, and master/slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

High availability, robust

simplified task for each server

Web server only runs Apache

Solr server only does searching

...

make full use of resources, on-demand provisioning

e.g., add 1 front cluster

e.g., swap a broken Apache server

e.g., tune up performance, decrease app servers 5 -> 3

Page 27

system design: few years ago [5/5]

[Diagram: web server clusters, app server clusters, and master/slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

so many servers, so many configurations

we didn’t have automated deployment or operation tools

so many external network connections

Apache <-> Tomcat

app <-> Solr

...

Apache, Tomcat, Solr, and Redis never went down,

but performance was our biggest issue.

Page 28

system design: little bit changed [1/4]

[Diagram: server clusters with master and slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

Page 29

system design: little bit changed [2/4]

[Diagram: server clusters with master and slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

merged web & app server

1 physical server contains both Apache & Tomcat

Page 30

system design: little bit changed [3/4]

[Diagram: server clusters with master and slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

easy to understand whole system network

easy to operate

easy to deploy or change configurations

Page 31

system design: little bit changed [4/4]

[Diagram: server clusters with master and slave servers connected through SLBs. Legend: 1 physical server; SLB; 1 server cluster]

Solr is still far from apps

Page 32

system design: current [1/4]

[Diagram: app servers and master servers connected through an SLB. Legend: 1 physical server; SLB]

Page 33

system design: current [2/4]

[Diagram: app servers and master servers connected through an SLB. Legend: 1 physical server; SLB]

Solr slave is included in app server

Page 34

system design: current [3/4]

[Diagram: app servers and master servers connected through an SLB. Legend: 1 physical server; SLB]

SLB connects master <-> slave

Page 35

system design: current [4/4]

[Diagram: app servers and master servers connected through an SLB. Legend: 1 physical server; SLB]

no SPOF (Solr master is redundant)

easy to understand whole system process

easy to operate

easy to deploy or change configurations

easy to scale out

good performance (7000 req / sec by 1 server)

but we can’t make full use of server resources

e.g., we want 0.7 Solr instance for 1 app instance...

Page 36

system design: in the near future

server instance

physical on-premise, private cloud, public cloud

PaaS

Apache or Nginx?

shared cache

master <-> slave or SolrCloud?

Solr or Elasticsearch?

abolish servlet & tomcat style?

collaborate more with Hadoop family members

Page 37

system design: in the near future

m(_ _)m

UNDER CONSTRUCTION

Page 38

operation e.g. Solr schema update [1/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

Page 39

operation e.g. Solr schema update [2/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

Stop replication of Solr & Redis

Page 40

operation e.g. Solr schema update [3/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

Separated from the net

Service IN, Service IN, Service OUT

Page 41

operation e.g. Solr schema update [4/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

update schema & app

Service IN, Service IN, Service OUT

Page 42

operation e.g. Solr schema update [5/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

update schema

Service IN, Service IN, Service OUT

Page 43

operation e.g. Solr schema update [6/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

restart replication

Service IN, Service IN, Service OUT

Page 44

operation e.g. Solr schema update [7/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

test app functions with reverse proxy

Service IN, Service IN, Service OUT

Page 45

operation e.g. Solr schema update [8/8]

[Diagram: three app server groups and a master connected through an SLB. Legend: 1 physical server; SLB]

Service IN, Service IN, Service IN

connected to the net
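
The rolling update above pauses and later resumes replication. A minimal sketch of the HTTP commands that could drive the Solr part, assuming the master/slave ReplicationHandler is mounted at /replication (the slides mention replication?command=details in the monitoring section). The host URL and the choice of disablepoll/enablepoll on the slave side are assumptions, not the speaker's actual procedure, and Redis replication is not covered.

```java
import java.util.Arrays;
import java.util.List;

class ReplicationCommandSketch {
    /** Builds a ReplicationHandler URL such as .../replication?command=details. */
    static String commandUrl(String coreBaseUrl, String command) {
        return coreBaseUrl + "/replication?command=" + command;
    }

    /** URLs a deploy script might call on one slave, in order (assumed flow). */
    static List<String> slaveUpdateSequence(String slaveBaseUrl) {
        return Arrays.asList(
            commandUrl(slaveBaseUrl, "disablepoll"), // stop pulling from the master
            commandUrl(slaveBaseUrl, "details"),     // inspect replication state
            commandUrl(slaveBaseUrl, "enablepoll")); // resume after the update
    }

    public static void main(String[] args) {
        // Hypothetical host; this only prints URLs, no HTTP request is sent.
        for (String url : slaveUpdateSequence("http://slave1.example:8983/solr")) {
            System.out.println(url);
        }
    }
}
```

Keeping the URL construction in one place makes it easy to swap in a different pause mechanism (for example, disabling replication on the master) without touching the rest of a deploy script.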

Page 46

Agenda

3 Solr performance

Page 47

Solr cache

about the various kinds of Lucene/Solr caches

fieldCache (Lucene level)

fieldValueCache

documentCache

filterCache

queryResultCache

HTTP cache

and user defined cache

Page 48

filter cache

we’re using it for caching the results of filter queries

<!-- default in solrconfig.xml -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>

Page 49

query result cache

we used to enable it to avoid redundant searches

<!-- default in solrconfig.xml -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
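
Both caches above are size-bounded LRU caches. A conceptual sketch of LRU eviction using java.util.LinkedHashMap follows; it is not how Solr's LRUCache or FastLRUCache are implemented, and the class name and filter strings are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Conceptual LRU cache; a stand-in for illustration, not Solr's implementation. */
class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruSketch(int maxSize) {
        // accessOrder=true: iteration runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the cache grows past maxSize
        return size() > maxSize;
    }

    public static void main(String[] args) {
        LruSketch<String, int[]> cache = new LruSketch<>(2);
        cache.put("placement:footer", new int[] {1, 2, 3});
        cache.put("placement:header", new int[] {2, 3});
        cache.get("placement:footer");                  // touch: now most recently used
        cache.put("adtype:listing", new int[] {3});     // evicts "placement:header"
        System.out.println(cache.keySet());
    }
}
```

The size attribute in solrconfig.xml plays the role of maxSize here: once the limit is reached, entries that have not been used recently are dropped first.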

Page 50

application cache

about caching on the app side

processing time excluding searching is 0 – 1 msec

-> converting from doc to DTO is relatively wasteful

-> SolrJ with javabin works well, but...

Page 51

sizing & memory usage

monitoring -> tuning configuration, memory allocation

server: traffic, load, cpu, memory, page, swap

Apache: busy, rps, bps, cpu, state, processing time

Tomcat: thread, rps, bps, eps, memory, jmx

Solr: index size, doc num, memory, cache hit ratio

admin page, admin/Luke, replication?command=details...

server mon GrowthForecast Solr admin, command, Luke

Page 52

avoid Full GC

Full GC

if we allocate 2GB for a tomcat heap

-> “Stop the World” would be more than 1 sec

Concurrent GC (we’re still struggling in tuning)

e.g.)

HEAP_OPTS="-Xmx2g -Xms2g -Xss512k"
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"
FULL_GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=32 -XX:TargetSurvivorRatio=90"
JMX_OPTS="-Dcom.sun.management.config.file=${CATALINA_HOME}/conf/management.properties"
CATALINA_OPTS="-server ${HEAP_OPTS} ${GC_LOG_OPTS} ${FULL_GC_OPTS} ${JMX_OPTS}"
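
GC activity like the pauses described above can also be read from inside the JVM. A minimal sketch using the standard java.lang.management API follows; the class name GcStats is hypothetical, and this is not the monitoring setup from the talk, which enables JMX through JMX_OPTS.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStats {
    /** Total time in ms spent in GC across all collectors, as reported by the JVM. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector: name, number of collections, accumulated time
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC time: " + totalGcTimeMillis() + "ms");
    }
}
```

Sampling these counters periodically and plotting the deltas gives a rough view of how often and how long the collectors run, which complements the GC log options above.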

Page 53

Agenda

4 Solr plug-in

Page 54

Solr plugin

RequestHandler, SearchHandler

SearchComponent, QueryComponent

QParserPlugin, PostFilter

QueryResponseWriter

-> implemented these classes for our own use

Page 55

RequestHandler & SearchHandler

for logging

for health check

like /admin/ calls AdminHandlers

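import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OurRequestHandler extends RequestHandlerBase {

    /** Logger */
    private static Logger log = LoggerFactory.getLogger(OurRequestHandler.class);

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        log.info(req.toString());   // log every incoming request
        rsp.setHttpCaching(false);  // turn off HTTP caching for this response
        ...
    }
}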

Page 56

Solr index situation [1/2]

We thought Solr indexing was very costly (just thought...)

-> so we separated the data into these two

basic stable data

additional unstable data

or

Page 57

Solr index situation [2/2]

Solr index: for searching

keyword, placement data (Japan, Ichiba, footer...)

a few GB

Redis data (previously MySQL): for filtering or sorting

ad status (active or not)

ad price, ad rank (based on CTR, CVR...)

and ad contents data (image path, link, text...)

100MB – 10GB (depends on advertisement types)
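
The split above means that search hits from Solr have to be combined with status and rank data held elsewhere. A minimal sketch of that filter-and-sort step, assuming an in-memory map stands in for Redis; AdSelectionSketch, AdInfo, and the ad ids are hypothetical and not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdSelectionSketch {
    /** Hypothetical per-ad side data; in the talk this kind of data lives in Redis. */
    static final class AdInfo {
        final boolean active;
        final double rank;

        AdInfo(boolean active, double rank) {
            this.active = active;
            this.rank = rank;
        }
    }

    /** Keeps only active ads known to the side store, highest rank first. */
    static List<String> filterAndSort(List<String> searchHits, Map<String, AdInfo> sideData) {
        List<String> result = new ArrayList<>();
        for (String adId : searchHits) {
            AdInfo info = sideData.get(adId);
            if (info != null && info.active) {
                result.add(adId);
            }
        }
        result.sort(Comparator.comparingDouble((String id) -> sideData.get(id).rank).reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, AdInfo> sideData = new HashMap<>();
        sideData.put("ad1", new AdInfo(true, 0.2));
        sideData.put("ad2", new AdInfo(false, 0.9)); // inactive: filtered out
        sideData.put("ad3", new AdInfo(true, 0.7));
        System.out.println(filterAndSort(Arrays.asList("ad1", "ad2", "ad3"), sideData));
    }
}
```

Because status and rank are looked up at request time, they can change without re-indexing the search data, which is the point of keeping them outside the index.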

Page 58

searching: handle ads in app [1/2]

handle req

search

filter

sort

...

Page 59

searching: handle ads in Solr [2/2]

handle req

search

...

Page 60

Solr with Redis data handling [1/2]

ResponseWriter

-> unsuitable for searching or filtering

SearchComponent

-> easy to implement, configure

-> basic process is already handled in QueryComponent

Page 61

Solr with Redis data handling [2/2]

modify QueryComponent

-> good position in terms of functionality

-> base for default searching

-> relatively big component

ConstantScoreQuery with our own Filter?

Page 62

QueryParserPlugin & PostFilter [1/2]

e.g.)

<!-- solrconfig.xml -->
<!-- put jar file here -->
<lib dir=".../orochi_search" />
<!-- define implemented class -->
<queryParser name="redis" class="...orochi.search.ExtendedQParserPlugin" />

public class ExtendedQParserPlugin extends QParserPlugin {

    public void init(NamedList args) { /* NOOP */ }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            ...
            @Override
            public Query parse() throws ParseException {
                return new RedisPostFilter(rows, preview, currentTimeMillis);
            }
        };
    }
}

Page 63

QueryParserPlugin & PostFilter [2/2]

public class RedisPostFilter extends ExtendedQueryBase implements PostFilter {

    public RedisPostFilter(int rows, long preview, long currentTimeMillis) {
        setCache(false);
        ...
    }

    public boolean isValid(int docId, IndexSearcher indexSearcher) {
        // returns whether the document is valid
        document = indexSearcher.doc(docId, fieldSelector);
        ...
    }

    public DelegatingCollector getFilterCollector(final IndexSearcher indexSearcher) {
        return new DelegatingCollector() {
            @Override
            public void collect(int docId) throws IOException {
                if (isValid(docId, indexSearcher)) {
                    super.collect(docId);
                    ...
                }
            }
        };
    }

    @Override
    public int getCost() {
        return Math.max(super.getCost(), 100);
    }
    ...
}
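
The core of the PostFilter above is a collector that forwards only valid documents to the next collector. The same pattern in plain Java, without Solr on the classpath, might look like the sketch below; DocCollector and FilteringCollector are hypothetical stand-ins for Solr's collector classes, and the validity check is a placeholder for the Redis lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class PostFilterSketch {
    /** Hypothetical stand-in for a collector that receives matching doc ids. */
    interface DocCollector {
        void collect(int docId);
    }

    /** Forwards a doc id to the wrapped collector only if it passes the check. */
    static final class FilteringCollector implements DocCollector {
        private final DocCollector delegate;
        private final IntPredicate isValid;

        FilteringCollector(DocCollector delegate, IntPredicate isValid) {
            this.delegate = delegate;
            this.isValid = isValid;
        }

        @Override
        public void collect(int docId) {
            if (isValid.test(docId)) {
                delegate.collect(docId);
            }
        }
    }

    /** Feeds the matched doc ids through the filter and returns what survives. */
    static List<Integer> run(int[] matchedDocIds, IntPredicate isValid) {
        List<Integer> collected = new ArrayList<>();
        DocCollector chain = new FilteringCollector(docId -> collected.add(docId), isValid);
        for (int docId : matchedDocIds) {
            chain.collect(docId);
        }
        return collected;
    }

    public static void main(String[] args) {
        // Placeholder rule: treat even doc ids as valid
        System.out.println(run(new int[] {1, 2, 3, 4}, id -> id % 2 == 0));
    }
}
```

Since the check runs only on documents that already matched the main query and cheaper filters, an expensive per-document lookup is applied to as few documents as possible.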

Page 64

Merge Solr & Redis

handle req

search

...

Page 65

Agenda

5 (Solr with Japanese language)

Page 66

Japanese linguistics

すもももももももも

(pronunciation) sumomomomomomomomo

すもも も もも も もも

(words) sumomo mo momo mo momo

李も桃も桃

(meaning) Plums and peaches are both part of peaches

Page 67

Japanese linguistics

最中を食べている最中ですm(_ _)m

(pronunciation) monakawotabeteirusaichudesu

(meaning) I’m eating monaka. (excuse me)

how to separate this sentence into tokens for indexing?

Page 68

Tokenize approach: N-gram

最中を食べている最中ですm(_ _)m

unigram

最 中 を 食 べ て い る 最 中 で す m ( _ _ ) m

bigram

最中 中を を食 食べ べて てい いる る最 最中 中で です すm m( (_ _ _ _) )m

trigram

最中を 中を食 を食べ 食べて べてい ている いる最 る最中 最中で 中です ですm すm( m(_ (_ _ _ _) _)m
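
The unigram, bigram, and trigram lists above all come from sliding a fixed-size window over the text. A minimal sketch of character N-gram generation follows; the class and method names are hypothetical, and this is not Lucene's N-gram tokenizer (for one thing, it does not treat whitespace or punctuation specially).

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    /** Returns all character n-grams of text, stepping one code point at a time. */
    static List<String> ngrams(String text, int n) {
        int[] codePoints = text.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            grams.add(new String(codePoints, i, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = "最中を食べている最中です";
        System.out.println(ngrams(text, 1)); // unigrams
        System.out.println(ngrams(text, 2)); // bigrams
        System.out.println(ngrams(text, 3)); // trigrams
    }
}
```

No dictionary is involved, which is why N-gram indexing needs no preparation and copes with new words, at the price of a larger index.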

Page 69

Tokenize approach: Morphological Analysis [1/2]

最中を食べている最中ですm(_ _)m

using dictionary

最中 を 食べ て いる 最中 です m(_ _)m

text | 最中 | を | 食べ | て | いる | 最中 | です | m(_ _)m
partOfSpeech | noun-common | particle-case-misc | verb-main | particle-conjunctive | verb-auxiliary | noun-adverbial | auxiliary-verb | -
pronunciation | monaka | o | tabe | te | iru | saichu | desu | -

Page 70

Tokenize approach: Morphological Analysis [2/2]

最中を食べている最中ですm(_ _)m

Page 71

Tokenize approach: compare 2 ways

 | N-gram | Morphological Analysis
index size | big | small
preparation | not needed | make & maintain word dictionary
implementation | very easy | hard (NLP, ML, statistic)
new word | no problem | update dictionary, re-index
search relevancy | without omission, contains trivial | with omission, human like
processing time | ... | ...

Page 72

Solr with Morphological Analysis

ver. -3.5 : setup component & dictionary manually

Sen

Lucene gosen

...

ver. 3.6- : field type text_ja works well

“kuromoji” is inside

Page 73

issues of kuromoji

some adjustments are needed for migration

supported dictionaries would be different between

previous engine & kuromoji

half width & full width characters

Windows8 <-> Ｗｉｎｄｏｗｓ８

AKB48 <-> ＡＫＢ４８
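
The half width versus full width mismatch above can be illustrated with Unicode normalization. A minimal sketch using java.text.Normalizer with NFKC, which maps full-width Latin letters and digits to their ASCII forms; this is a stand-in for illustration and not kuromoji's or CJKWidthFilter's implementation, and NFKC also applies other compatibility mappings beyond width folding.

```java
import java.text.Normalizer;

class WidthFoldSketch {
    /** Folds full-width alphanumerics to ASCII via NFKC compatibility normalization. */
    static String foldWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width "Windows8" and "AKB48", written with Unicode escapes
        String fullWidthWindows8 = "\uFF37\uFF49\uFF4E\uFF44\uFF4F\uFF57\uFF53\uFF18";
        String fullWidthAkb48 = "\uFF21\uFF2B\uFF22\uFF14\uFF18";
        System.out.println(foldWidth(fullWidthWindows8));
        System.out.println(foldWidth(fullWidthAkb48));
    }
}
```

Applying the same folding at index time and at query time makes both spellings match the same terms.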

Page 74

Japanese Analyzer

JapaneseTokenizer

JapaneseBaseFormFilter

JapanesePartOfSpeechStopFilter

CJKWidthFilter

StopFilter

JapaneseKatakanaStemFilter

LowerCaseFilter

Page 76

Thank you, San Diego

any question?

any comment?

any advice?

If you have some, let’s talk later (not now...?)

Page 77

Hide (Hatayama Hideharu)

Big Data Department, Targeting Section, Advertising Group

Rakuten Inc.

blog: http://6109.hidepiy.com

facebook: http://www.facebook.com/hatayama.hideharu

twitter: ... I don’t remember