Using Lucene/Solr to build Advertising Systems

Hide (Hatayama Hideharu)

Big Data Department, Targeting Section, Advertising Group

Rakuten, Inc. May 2nd 2013

Agenda | www.lucenerevolution.org

http://www.lucenerevolution.org/2013/agenda

35 min...orz

my talk is NOT about... m(_ _)m

SolrCloud

complicated queries

or other Solr hot topics

my talk is just about

Overview of Solr, most common features

Our empirical knowledge about Solr

Agenda

1 Introduction of Me & Rakuten

2 Solr centered Advertising Systems

3 Solr performance

4 Solr plug-in

5 (Solr with Japanese language)

Who am I?

Hatayama Hideharu (call me Hide)

M.Eng, Tokyo Institute of Technology, Japan

Worked on advertising systems at Rakuten for 3 years

ad management system development

ad distribution system development

system architecture design

increase the performance of systems

increase profitability of ad services

User of Solr, not implementer

http://6109.hidepiy.com/

Who are we?

Rakuten, Inc.

Internet services company

Founded : Feb. 7th 1997, Tokyo, Japan

The first service: Rakuten Ichiba (shopping mall)

Who are we?

Rakuten in Japan

Rakuten Ichiba

Ichiba: The largest online shopping mall in Japan

user info

campaign

other services

item search

category navigation

personalized item

item history

sale event

shop history

bookmarked item

service tab

Rakuten’s Global Expansion

[Map of Rakuten's global locations. Legend: E-Commerce, Travel, Other services & businesses, Development center]

Types of advertisements on Rakuten Ichiba [1/3]

Listing Ad (search word related ad)

item search

searched ads

searched items

Display Ad (placement related ad)

where, when …

Targeting Ad (user related ad)

sex, age, browsing history …

... Ad ?

120 ads on 1 page ...orz

ad system function landscape

ad system

Rakuten

Media (Web/Email)

Owned Ad

Network

Rakuten staff

Merchants

Tool User Media

External

Other staff

Tenancy Ad (Fixed placement/fee/term)

P4P Ad (CPM/CPC/CPA etc.)

Ad placement def.

Sales mgmt.

Creative mgmt.

Campaign mgmt.

Budget mgmt.

Bidding

Additional Function

Big Data Analysis

Advanced targeting

Creative optimization

Connect to affiliate network

Programmatic media buying

- Attribution

- Behavior

- Optimization

Delivery mgmt.

Reporting

Merchant Tool

Targeting/media

Reporting

Merchant Tool

Ad server.

ad management

ad distribution

Log processing

Targeting (Placement, keyword, behavioral, demographic, etc.)

Beacon server.

Redirect server.

Device: PC, Mobile phone, Tablet

ad distribution system [1/2]

JavaScript

ad searching

ad filtering

ad sorting

logging

parameter

placement

keyword

ad type

cookie

ad distribution system [2/2]

need high performance, high availability

e.g., more than 7,000 req / sec for 1 server with 100.00% avail.

collect & analyze log, then improve profitability

the basic architecture is the same across our various ad types

using...

Kyoto Tycoon

system design: few years ago [1/5]

[Diagram: master, web svr and app svr clusters (each labelled x4) connected through SLBs. Legend: 1 physical server, SLB, 1 server cluster]

system design: few years ago [2/5]

cluster

web server x 4

app server x 5

system design: few years ago [3/5]

SLB connect

app <-> Solr

system design: few years ago [4/5]

High availability, robust

simplified task for each server

Web servers only run Apache

Solr servers only handle searching

make full use of resources, on demand provisioning

e.g., add 1 front cluster

e.g., swap broken apache server

e.g., tune up performance, decrease app server 5 -> 3

system design: few years ago [5/5]

so many servers, so many configurations

we didn't have automated deployment or operation tools

so many network connections between servers

Apache <-> Tomcat

app <-> Solr

Apache, Tomcat, Solr, and Redis never went down,

but performance was our biggest issue.

system design: little bit changed [1/4]

[Diagram. Legend: 1 physical server, SLB, 1 server cluster]

system design: little bit changed [2/4]

merged web & app servers

1 physical server contains both Apache & Tomcat

system design: little bit changed [3/4]

easy to understand whole system network

easy to operate

easy to deploy or change configurations

system design: little bit changed [4/4]

Solr is still far from apps

system design: current [1/4]

[Diagram: SLB (x2), app servers (x2 x2), master. Legend: 1 physical server, SLB]

system design: current [2/4]

Solr slave is included in app server

system design: current [3/4]

SLB connect

master <-> slave

system design: current [4/4]

no SPOF (Solr master is redundant)

easy to understand whole system process

easy to operate

easy to deploy or change configurations

easy to scale out

good performance (7000 req / sec by 1 server)

but we can’t make full use of server resources

e.g., we want 0.7 Solr instance for 1 app instance...

system design: in the near future

server instance

physical on-premise, private cloud, public cloud

Apache or Nginx?

shared cache

master <-> slave or SolrCloud?

Solr or Elasticsearch?

abolish servlet & tomcat style?

collaborate more with Hadoop family members

m(_ _)m

CONSTRUCTION

operation e.g. Solr schema update [1/8]

[Diagram: SLB (x2), app servers (x2 x2), master. Legend: 1 physical server, SLB]

operation e.g. Solr schema update [2/8]

Stop replication of

Solr & Redis

operation e.g. Solr schema update [3/8]

Separated from the net

Service IN Service IN Service OUT

operation e.g. Solr schema update [4/8]

update schema & app

operation e.g. Solr schema update [5/8]

update schema

operation e.g. Solr schema update [6/8]

restart replication

operation e.g. Solr schema update [7/8]

test app functions

with reverse proxy

operation e.g. Solr schema update [8/8]

Service IN Service IN Service IN

connected to the net
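A minimal sketch of how the "stop replication" and "restart replication" steps could be driven over HTTP. It assumes the Solr side is handled through the ReplicationHandler commands disablepoll and enablepoll on a slave; the slides do not say which mechanism was actually used. The base URL and core name are hypothetical, and the Redis side is not covered.

```java
/** Sketch only: builds ReplicationHandler URLs; it does not send any request. */
final class ReplicationUrlSketch {

    /** baseUrl and core are hypothetical values supplied by the caller. */
    static String replicationUrl(String baseUrl, String core, String command) {
        return baseUrl + "/" + core + "/replication?command=" + command;
    }

    /** Slave-side calls around a schema update: pause polling first, resume it afterwards. */
    static String[] pauseAndResume(String slaveBaseUrl, String core) {
        return new String[] {
            replicationUrl(slaveBaseUrl, core, "disablepoll"),
            replicationUrl(slaveBaseUrl, core, "enablepoll"),
        };
    }
}
```

The same helper with the command "details" gives the status URL that can be checked before putting the server back into service.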

Solr cache

the various kinds of Lucene/Solr caches

fieldCache (Lucene level)

fieldValueCache

documentCache

filterCache

queryResultCache

HTTP cache

and user defined cache

filter cache

we’re using it for caching the results of filter queries

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

query result cache

we used to enable it to avoid redundant searches

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
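Both cache definitions above set a fixed size with LRU eviction. As a rough illustration of what size="512" means in practice, here is a minimal LRU cache sketch built on LinkedHashMap that also tracks the hit ratio. This is not Solr's LRUCache or FastLRUCache implementation; the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU cache with hit-ratio tracking; NOT Solr's implementation. */
final class LruCacheSketch<K, V> {
    private final LinkedHashMap<K, V> map;
    private long lookups;
    private long hits;

    LruCacheSketch(final int maxSize) {
        // accessOrder=true moves an entry to the end of the map on every get().
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > maxSize;
            }
        };
    }

    synchronized V get(K key) {
        lookups++;
        V value = map.get(key);
        if (value != null) {
            hits++;
        }
        return value;
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int size() {
        return map.size();
    }

    /** hits / lookups, or 0.0 before the first lookup. */
    synchronized double hitRatio() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }
}
```

A low hit ratio with a full cache suggests that the size is too small for the working set, or that the keys rarely repeat.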

application cache

caching on the app side

processing time excluding the search itself is 0 – 1 msec

-> converting docs to DTOs is relatively wasteful

-> SolrJ with javabin works well, but...
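A minimal sketch of an app-side cache that avoids repeating the doc-to-DTO conversion. A plain Map stands in for a Solr document, and the AdDto class and the field names id, link and text are assumptions for illustration, not the real schema.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: cache converted DTOs by ad id so each doc is converted only once. */
final class AdDtoCacheSketch {

    /** Hypothetical DTO; the fields are assumptions. */
    static final class AdDto {
        final String id;
        final String link;
        final String text;

        AdDto(String id, String link, String text) {
            this.id = id;
            this.link = link;
            this.text = text;
        }
    }

    private final ConcurrentHashMap<String, AdDto> cache = new ConcurrentHashMap<>();

    /** Counts real conversions, to make the effect of the cache visible. */
    final AtomicInteger conversions = new AtomicInteger();

    /** doc is a stand-in for a search result document. */
    AdDto toDto(Map<String, Object> doc) {
        String id = String.valueOf(doc.get("id"));
        return cache.computeIfAbsent(id, key -> {
            conversions.incrementAndGet();
            return new AdDto(key, String.valueOf(doc.get("link")), String.valueOf(doc.get("text")));
        });
    }
}
```

A real cache would also need invalidation when an ad changes, and a size limit.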

sizing & memory usage

monitoring -> tuning configuration, memory allocation

server: traffic, load, cpu, memory, page, swap

Apache: busy, rps, bps, cpu, state, processing time

Tomcat: thread, rps, bps, eps, memory, jmx

Solr: index size, doc num, memory, cache hit ratio

admin page, admin/Luke, replication?command=details...

[Screenshots: server monitoring, GrowthForecast, Solr admin / command / Luke]

avoid Full GC

Full GC

if we allocate 2 GB for the Tomcat heap

-> a "Stop the World" pause can exceed 1 sec

Concurrent GC (we're still struggling with tuning)

e.g.

HEAP_OPTS="-Xmx2g -Xms2g -Xss512k"

GC_LOG_OPTS="-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"

FULL_GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=32 -XX:TargetSurvivorRatio=90"

JMX_OPTS="-Dcom.sun.management.config.file=${CATALINA_HOME}/conf/management.properties"

CATALINA_OPTS="-server ${HEAP_OPTS} ${GC_LOG_OPTS} ${FULL_GC_OPTS} ${JMX_OPTS}"
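Besides the GC log, collector activity can be read from inside the JVM through the standard java.lang.management API (the same data is exposed over JMX). A minimal sketch; the class and method names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: summarize GC counts and accumulated GC time of the current JVM. */
final class GcStatsSketch {

    /** Total accumulated collection time in milliseconds across all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** One line per collector: name, collection count, accumulated time. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Sampling these values periodically and looking at the difference between samples shows how much time is spent in GC per interval.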

Solr plugin

RequestHandler, SearchHandler

SearchComponent, QueryComponent

QParserPlugin, PostFilter

QueryResponseWriter

-> implemented these classes for our own use

RequestHandler & SearchHandler

for logging

for health check

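just as /admin/ calls AdminHandlers

import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OurRequestHandler extends RequestHandlerBase {

    /** Logger */
    private static Logger log = LoggerFactory.getLogger(OurRequestHandler.class);

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        log.info(req.toString());   // log every request
        rsp.setHttpCaching(false);  // disable HTTP caching for this response
        ...
    }
}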

Solr index situation [1/2]

we thought Solr indexing would be very costly (just thought...)

-> then separated into these two

basic stable data

additional unstable data

Solr index situation [2/2]

Solr index: for searching

keyword, placement data (Japan, Ichiba, footer...)

a few GB

Redis data (previously MySQL): for filtering or sorting

ad status (active or not)

ad price, ad rank (based on CTR, CVR...)

and ad contents data (image path, link, text...)

100MB – 10GB (depends on advertisement types)

searching: handle ads in app [1/2]

handle req

search

filter

searching: handle ads in Solr [2/2]

handle req

search

Solr with Redis data handling [1/2]

ResponseWriter

-> unsuitable for searching or filtering

SearchComponent

-> easy to implement, configure

-> basic process is already handled in QueryComponent

Solr with Redis data handling [2/2]

modify QueryComponent

-> good position in terms of functionality

-> base for default searching

-> relatively big component

ConstantScoreQuery with our own Filter?

QueryParserPlugin & PostFilter [1/2]

<!-- solrconfig.xml -->
<lib dir=".../orochi_search" />
<queryParser name="redis" class="...orochi.search.ExtendedQParserPlugin" />

public class ExtendedQParserPlugin extends QParserPlugin {

    public void init(NamedList args) { /* NOOP */ }

    @Override
    public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            ...
            @Override
            public Query parse() throws ParseException {
                return new RedisPostFilter(rows, preview, currentTimeMillis);
            }
        };
    }
}

QueryParserPlugin & PostFilter [2/2]

public class RedisPostFilter extends ExtendedQueryBase implements PostFilter {

    public RedisPostFilter(int rows, long preview, long currentTimeMillis) {
        setCache(false);  // a post filter must not be cached
        ...
    }

    public boolean isValid(int docId, IndexSearcher indexSearcher) {
        // return whether the document is valid or not.
        document = indexSearcher.doc(docId, fieldSelector);
        ...
    }

    public DelegatingCollector getFilterCollector(final IndexSearcher indexSearcher) {
        return new DelegatingCollector() {
            @Override
            public void collect(int docId) throws IOException {
                // note: Solr passes a segment-relative id here;
                // a top-level lookup may need docBase + docId
                if (isValid(docId, indexSearcher)) {
                    super.collect(docId);
                    ...
                }
            }
        };
    }

    @Override
    public int getCost() {
        // Solr runs a filter as a post filter only when cache=false and cost >= 100
        return Math.max(super.getCost(), 100);
    }
    ...
}
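The core of the PostFilter is the delegating collector: every document that already matched the main query is checked once more, and only valid ones are passed on. A minimal stand-alone sketch of that pattern, without Solr classes. The DocCollector interface and the predicate are hypothetical stand-ins; in the talk's setup the check would look at the Redis data (e.g. ad status).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch of the delegating-collector pattern behind a post filter. */
final class PostFilterSketch {

    /** Stand-in for a Lucene/Solr collector: receives ids of matching documents. */
    interface DocCollector {
        void collect(int docId);
    }

    /** Forwards a document to the delegate only if the (expensive) check accepts it. */
    static final class FilteringCollector implements DocCollector {
        private final DocCollector delegate;
        private final IntPredicate isValid;

        FilteringCollector(DocCollector delegate, IntPredicate isValid) {
            this.delegate = delegate;
            this.isValid = isValid;
        }

        @Override
        public void collect(int docId) {
            if (isValid.test(docId)) {
                delegate.collect(docId);
            }
        }
    }

    /** Runs the chain over ids that already matched the main query. */
    static List<Integer> run(int[] matchedDocIds, IntPredicate isValid) {
        List<Integer> collected = new ArrayList<>();
        DocCollector chain = new FilteringCollector(d -> collected.add(d), isValid);
        for (int docId : matchedDocIds) {
            chain.collect(docId);
        }
        return collected;
    }
}
```

Because the check runs only on documents that survived the query and the cheaper filters, an expensive lookup is executed as rarely as possible.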

Merge Solr & Redis

handle req

search

Japanese linguistics

すもももももももも

(pronunciation) sumomomomomomomomo

すもももももももも

(words) sumomo mo momo mo momo

李も桃も桃

(meaning) Plums and peaches are both part of peaches

Japanese linguistics

最中を食べている最中ですm(_ _)m

(pronunciation) monakawotabeteirusaichudesu

(meaning) I’m eating monaka. (excuse me)

how to separate this sentence into tokens for indexing?

Tokenize approach: N-gram

unigram

最 中 を 食 べ て い る 最 中 で す m ( _ _ ) m

bigram

最中 中を を食 食べ べて てい いる る最 最中 中で です すm m( (_ _ _ _) )m

trigram

最中を 中を食 を食べ 食べて べてい ている いる最 る最中 最中で 中です ですm すm( m(_ (_ _ _ _) _)m
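A minimal sketch of character N-gram tokenization as shown above. The class and method names are hypothetical; a real setup would use an N-gram tokenizer from Lucene instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split text into overlapping character n-grams. */
final class NGramSketch {

    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // Work on code points so that characters outside the BMP are not split.
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            tokens.add(new String(codePoints, i, n));
        }
        return tokens;
    }
}
```

No dictionary is involved, so new words need no preparation; the price is many more tokens per document.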

Tokenize approach: Morphological Analysis [1/2]

using dictionary

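text: 最中を食べている最中です m(_ _)m

tokens: 最中 | を | 食べ | て | いる | 最中 | です | m(_ _)m

part of speech (labels cut off in the slide): common | particle- | conjuncti | auxiliary | adverbial | auxiliary-

pronunciation: monaka | o | tabe | te | iru | saichu | desu | -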
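A minimal sketch of the dictionary idea, using greedy longest match against a tiny word set. This is a simplification: real morphological analyzers such as kuromoji pick the best path through a lattice using costs and part-of-speech information, and a greedy matcher cannot tell monaka from saichu. The class name and the sample dictionary are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch: greedy longest-match segmentation against a word dictionary. */
final class DictionaryTokenizerSketch {

    static List<String> tokenize(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = null;
            // Try the longest candidate first, then shorter ones.
            for (int j = end; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(i, i + 1); // unknown character becomes its own token
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }
}
```

With the dictionary {最中, を, 食べ, て, いる, です}, the sentence 最中を食べている最中です is split into the same seven tokens as in the pronunciation row above.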

Tokenize approach: Morphological Analysis [2/2]

Tokenize approach: compare 2 ways

                    N-gram                        Morphological Analysis
index size          big                           small
preparation         not needed                    make & maintain word dictionary
implementation      very easy                     hard (NLP, ML, statistic)
new word            no problem                    update dictionary, re-index
search relevancy    without omission,             with omission,
                    but contains trivial matches  human like
processing time     ...                           ...

Solr with Morphological Analysis

up to ver. 3.5: set up component & dictionary manually

lucene-gosen

ver. 3.6 and later: field type text_ja works well

"kuromoji" is included

issues of kuromoji

some adjustments are needed for migration

supported dictionaries may differ between the previous engine & kuromoji

half width & full width characters

Windows8 <-> Ｗｉｎｄｏｗｓ８

AKB48 <-> ＡＫＢ４８
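A minimal sketch of folding full-width characters to half-width before comparing strings, using Unicode NFKC normalization from the JDK. This is a stand-in for illustration: inside Solr this job is done in the analysis chain, and NFKC also rewrites other compatibility characters, so it is broader than width folding alone. The class name is hypothetical.

```java
import java.text.Normalizer;

/** Sketch: make full-width and half-width spellings compare equal. */
final class WidthNormalizeSketch {

    static String normalizeWidth(String s) {
        // NFKC maps full-width Latin letters and digits to their ASCII forms.
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

Applying the same normalization at index time and at query time lets both spellings of a product name match.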

Japanese Analyzer

JapaneseTokenizer

JapaneseBaseFormFilter

JapanesePartOfSpeechStopFilter

CJKWidthFilter

StopFilter

JapaneseKatakanaStemFilter

LowerCaseFilter

Thank you, San Diego

any question?

any comment?

any advice?

If you have some, let’s talk later (not now...?)

Hide (Hatayama Hideharu)

Big Data Department, Targeting Section, Advertising Group

Rakuten Inc.

blog: http://6109.hidepiy.com

facebook: http://www.facebook.com/hatayama.hideharu

twitter: ... I don’t remember
