slash n near real time indexing

A real time search index for e-commerce. Umesh Prasad, Thejus V M

Upload: umesh-prasad

Post on 16-Apr-2017


Page 1: Slash n   near real time indexing

A real time search index for e-commerce

Umesh Prasad, Thejus V M

Page 2: Slash n   near real time indexing

Oh!! Out Of Stock

Page 3: Slash n   near real time indexing

Damn !! Out of Stock

Page 4: Slash n   near real time indexing

Damn !! Missed the Offer

Page 5: Slash n   near real time indexing

E-commerce Index Attributes

Catalogue Service

Promise Engine

Availability Service

Seller Rating

LISTING

PRODUCT aka SKU

Offer Engine

Pricing Engine

Page 6: Slash n   near real time indexing

Out Of Stock, but Why Show?

Index has Stale Availability Data

234K Products

Page 7: Slash n   near real time indexing

Outline

❏ E-commerce search Challenge

❏ Challenges in Keeping an Inverted Index Updated

❏ Our approach to Near Real Time indexing

Page 8: Slash n   near real time indexing

Challenge 1 : Update rates

Signal              Updates/sec (min)   Updates/sec (max)   Max updates/hr
text / catalogue    ~10                 ~100                ~100K
pricing             ~100                ~1K                 ~10 million
availability        ~100                ~10K                ~10 million
offer               ~100                ~10K                ~10 million
seller rating       ~10                 ~1K                 ~1 million
signal 6            ~10                 ~100                ~1 million
signal 7            ~100                ~10K                ~10 million
signal 8            ~100                ~10K                ~10 million

Page 9: Slash n   near real time indexing

Challenge 2 : Lucene Index Update

● Lucene doesn’t support Partial Updates.

● Update = Delete Old Doc + Add New Document

– Recreate the entire document for every update

– Not friendly with multiple micro-services with different update rates

● Problem Compounded By MarketPlace

● Product + All Its Listings == SINGLE BLOCK

● BLOCK structure chosen for query performance (~100X better latencies)
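
The cost of "Update = Delete Old Doc + Add New Document" can be illustrated with a toy in-memory index. This is only a sketch and not Lucene code: `ToyIndex`, its methods, and the `fields_written` counter are hypothetical names introduced here to show that changing one field forces every field of the document to be written again.

```python
class ToyIndex:
    """Toy append-only index: documents are never modified in place."""

    def __init__(self):
        self.docs = []           # doc id -> dict of fields (None once deleted)
        self.live = {}           # external key -> current doc id
        self.fields_written = 0  # how many field values were (re)indexed so far

    def add(self, key, fields):
        doc_id = len(self.docs)
        self.docs.append(dict(fields))
        self.live[key] = doc_id
        self.fields_written += len(fields)
        return doc_id

    def update(self, key, changes):
        # No partial update: rebuild the full document,
        # delete the old copy, then add the new one.
        old_id = self.live[key]
        rebuilt = {**self.docs[old_id], **changes}
        self.docs[old_id] = None  # tombstone the old document
        return self.add(key, rebuilt)


index = ToyIndex()
index.add("ProductA", {"brand": "Apple", "availability": True, "price": 45000})

# Only availability changed, but brand and price are written again too.
new_id = index.update("ProductA", {"availability": False})
print(new_id, index.fields_written)
```

With a block of one product plus all its listings, the same effect applies to the whole block rather than to a single document.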

Page 10: Slash n   near real time indexing

Challenge 3 : Refresh Cycle

[Diagram: the Ingestion pipeline sends a Batch of documents to the Solr Master, which performs Commit (fsync). Replication then copies the index to six Solr Slaves, and each Solr Slave performs Open new Index.]

Page 11: Slash n   near real time indexing

Documents

ProductA    brand : Apple      availability : T    price : 45000
ProductB    brand : Samsung    availability : F    price : 23000
ProductC    brand : Apple      availability : T    price : 5000

Lucene Index, Lucene Segment

Document ID Mappings

0 ProductA
1 ProductB
2 ProductC

DocValues (columnar data)

Price    45000 23000 5000

Posting List (Inverted Index): Terms → Sparse Bitsets

availability : T
brand : Samsung
brand : Apple    0 , 2
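
As a rough illustration of the three structures on this slide, the sketch below builds the document ID mapping, the posting lists, and a price column from the three example products. It is plain Python rather than Lucene, and `build_segment` and its parameters are hypothetical names chosen for this example.

```python
from collections import defaultdict

products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "F", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "T", "price": 5000}),
]


def build_segment(products, inverted_fields, docvalue_fields):
    doc_ids = {}                                   # doc id -> external product id
    postings = defaultdict(list)                   # "field:value" -> sorted doc ids
    doc_values = {f: [] for f in docvalue_fields}  # field -> column indexed by doc id
    for doc_id, (product_id, fields) in enumerate(products):
        doc_ids[doc_id] = product_id
        for f in inverted_fields:
            postings[f"{f}:{fields[f]}"].append(doc_id)
        for f in docvalue_fields:
            doc_values[f].append(fields[f])
    return doc_ids, dict(postings), doc_values


doc_ids, postings, doc_values = build_segment(
    products, inverted_fields=["brand", "availability"], docvalue_fields=["price"]
)
print(postings["brand:Apple"])   # doc ids whose brand is Apple
print(doc_values["price"])       # price column, one entry per doc id
```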

Page 12: Slash n   near real time indexing

Root Cause :Updating Data Structures

POSTING LIST (Millions of Terms)

Term 1 → BitSet 1
Term 2 → BitSet 2
Term 3 → BitSet 3
...

Document (Thousands of Terms)

Term1 Term2 Term3 Term4 ...

Posting List / Bit Set (Millions of Documents)

Representation            Example                        Updatable ?
D  (dense bitset)         0 1 0 0 0 0 1 0 0 0 0 0 0 1    Yes
S  (sparse positions)     2,7,14                         May Be
SE (sparse encoded)       2,5,7                          NO
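
The three rows D, S and SE can be reproduced with a few lines of Python. The sketch assumes that S lists 1-based positions of the set bits and that SE stores the gaps between consecutive positions, which matches the numbers on the slide. The function names are illustrative only.

```python
def dense_to_sparse(bits):
    """1-based positions of the set bits."""
    return [i for i, b in enumerate(bits, start=1) if b]


def delta_encode(positions):
    """Store each position as the gap from the previous one."""
    prev, out = 0, []
    for p in positions:
        out.append(p - prev)
        prev = p
    return out


def delta_decode(deltas):
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
sparse = dense_to_sparse(dense)   # [2, 7, 14]
encoded = delta_encode(sparse)    # [2, 5, 7]

# Dense: adding document 4 is a single in-place write.
dense[4 - 1] = 1

# Delta-encoded: adding document 4 changes the gap that follows it,
# so the list is decoded, modified, and encoded again.
updated = delta_encode(sorted(delta_decode(encoded) + [4]))
print(sparse, encoded, updated)
```

This is why the dense form is easy to update, while the encoded form is effectively write-once: one new document shifts the gaps after it.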

Page 13: Slash n   near real time indexing

Outline

❏ E-commerce search Challenge

❏ Challenges in Keeping an Inverted Index Updated

❏ Our approach to Near Real Time indexing

Page 14: Slash n   near real time indexing

A Typical Search Flow

[Diagram: search flow. Query → Query Rewrite → Matching → Ranking / Faceting / Stats / Other Components → Results. Data sources: the Lucene Segment (Posting List = Inverted Index, Doc Values = Forward Index) and the NRT Store.]

Page 15: Slash n   near real time indexing

NRT Forward Index - Considerations

● Lookup efficiency

– 50th percentile : ~10K matches

– 99th percentile : ~1 million matches

● Data on Java heap

– Memory efficiency

● Hook it to Lucene

Page 16: Slash n   near real time indexing

NRT Store - Forward Index Naive

Lucene Segment (DocId → ProductId)

0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT Forward Index

ProductId   Availability   Price
ProductA    True           100
ProductB    False          150
ProductC    False          200
ProductD    True           250

Lookup Engine: (DocId : 3, field : price) → ProductId(3) → <ProductC,price> → 200

Latency : ~10 secs for ~1 Million lookups
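
A minimal sketch of the naive lookup path, assuming the forward index is a hash map keyed by product ID and that doc IDs are 0-based positions in the segment. The variable and function names are hypothetical. Every lookup resolves the product ID first and then does hash lookups on strings, which is the per-document cost that adds up over ~1 million matches.

```python
# Lucene segment: internal doc id -> external product id
segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product id
nrt_forward_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    product_id = segment_doc_to_product[doc_id]   # step 1: doc id -> product id
    return nrt_forward_index[product_id][field]   # step 2: hash lookups by product id and field


prices = [naive_lookup(doc_id, "price") for doc_id in range(4)]
print(prices)
```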

Page 17: Slash n   near real time indexing

NRT Store - Forward Index Optimized

Lucene Segment (DocId → ProductId)

0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT Forward Index (Segment Independent)

NrtId   ProductId
0       ProductA
1       ProductC
2       ProductD
3       ProductB

Availability    T F F T
Price           100 200 250 150

DocId - NrtId

0 → 3
1 → 0
2 → 1
3 → 2

Lookup Engine: (DocId : 3, field : price) → NrtId(3) → 2 → Price(2) → 200
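
The optimized layout replaces string hashing with two array reads. The sketch below is an illustration under assumptions: doc IDs and NRT IDs are 0-based array positions, the per-segment DocId → NrtId map is built once, and `build_doc_to_nrt`, `lookup_price` and `update_price` are hypothetical names rather than the speakers' actual code.

```python
from array import array

segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]
nrt_products = ["ProductA", "ProductC", "ProductD", "ProductB"]  # position = NRT id
price_column = array("i", [100, 200, 250, 150])                  # indexed by NRT id


def build_doc_to_nrt(segment_docs, nrt_products):
    """Built once per segment: doc id -> NRT id as a compact int array."""
    nrt_id_of = {p: i for i, p in enumerate(nrt_products)}
    return array("i", (nrt_id_of[p] for p in segment_docs))


doc_to_nrt = build_doc_to_nrt(segment_doc_to_product, nrt_products)


def lookup_price(doc_id):
    # Two array reads, no string hashing on the query path.
    return price_column[doc_to_nrt[doc_id]]


def update_price(product_id, new_price):
    # An update touches only the column; the Lucene segment is untouched.
    price_column[nrt_products.index(product_id)] = new_price


before = [lookup_price(d) for d in range(4)]
update_price("ProductC", 180)
after = [lookup_price(d) for d in range(4)]
print(doc_to_nrt.tolist(), before, after)
```

Because the columns are keyed by NRT ID rather than by Lucene doc ID, the same arrays can serve every segment, and only the small DocId - NrtId map is per segment.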

Page 18: Slash n   near real time indexing

NRT Store - Invert index

[Diagram: NRT Forward Store → NRT Inverter → NRT Invert Store]

Lucene Segment

0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT Invert Store (term → DocIds)

Availability : T    0 3
Offer : O1          2 3

Availability:T → Matching BitSet
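
One way to picture the NRT Inverter is a scan over a segment that checks each document's current value in the forward store and sets a bit for every match. The sketch below assumes exactly that. The `forward_store` contents are illustrative values chosen so that the results match the posting lists shown on the slide, and `invert` and `doc_ids_of` are hypothetical helper names.

```python
segment_docs = ["ProductB", "ProductA", "ProductC", "ProductD"]  # doc id -> product id

# Illustrative forward store contents, keyed by product id.
forward_store = {
    "ProductA": {"availability": "F", "offer": None},
    "ProductB": {"availability": "T", "offer": None},
    "ProductC": {"availability": "F", "offer": "O1"},
    "ProductD": {"availability": "T", "offer": "O1"},
}


def invert(segment_docs, forward_store, field, value):
    """Return a bitset (as an int) with bit d set when doc d has field == value."""
    bitset = 0
    for doc_id, product_id in enumerate(segment_docs):
        if forward_store[product_id].get(field) == value:
            bitset |= 1 << doc_id
    return bitset


def doc_ids_of(bitset):
    """Expand a bitset into the sorted list of doc ids it contains."""
    return [d for d in range(bitset.bit_length()) if bitset >> d & 1]


available = invert(segment_docs, forward_store, "availability", "T")
offer_o1 = invert(segment_docs, forward_store, "offer", "O1")
print(doc_ids_of(available), doc_ids_of(offer_o1))

# Matching on both terms is a bitwise AND of the two bitsets.
both = doc_ids_of(available & offer_o1)
```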

Page 19: Slash n   near real time indexing

Near Real Time Solr Architecture

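[Architecture diagram. Source services: Catalogue, Pricing, Availability, Offers, Seller Quality. Text Updates: Ingestion pipeline → Solr Master → Commit + Replicate + Reopen → Solr. NRT Updates: Kafka → Solr, feeding the NRT Forward Index and the NRT Inverted store, with Bootstrap from Redis. Components inside Solr: Lucene, Matching, Ranking, Faceting, Others.]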

Page 20: Slash n   near real time indexing

Accomplishments

● Real time consumption for Ranking Signals

● BBD saw up to ~30K updates/second

● Query latency comparable to DocValues

– Consistent 99% performance

Page 21: Slash n   near real time indexing

Thank you & Questions

Page 22: Slash n   near real time indexing

A Typical Search Flow

[Diagram: search flow. Query → Query Rewrite → Matching → Ranking / Faceting / Stats / Other Components → Results. Data sources: the Lucene Index (Posting List = Inverted Index, Doc Values = Forward Index, Schema) and the NRT Store (Schema).]

Page 23: Slash n   near real time indexing

Lucene Index

docId   External ID   Brand    Availability   Price
0       ProductA      Adidas   True           250
1       ProductB      Adidas   False          230
2       ProductC      Nike     True           500

Posting List (inverted index)

term ords   Terms Dictionary      Posting List
0           availability:true     0,2
1           availability:false    1
0           brand:adidas          0,1
1           brand:nike            2
1           price:230             1
2           price:250             0

Doc Value (Forward index)

field           0   1   2
price           2   2   3
brand           0   0   1
availability    0   1   0

● Lucene Index = Multiple Mini Indexes aka Segments

● Lucene Segment

○ Write Once → Immutable Data structures

○ Posting List (Sparse encoded bitsets)

○ Doc Values (Columnar Data structures)

Page 25: Slash n   near real time indexing

C5 : Lucene in-place update

● Only numeric / byte Array fields

● Updates to go through the entire refresh cycle

● Not exposed via Solr

Page 26: Slash n   near real time indexing

Forward Index - API Hook

● Lucene API Hook

– ValueSource

● Input

– Lucene Internal Document Id

– Field Name

● Output

– Field Value

Page 27: Slash n   near real time indexing

NRT Store - Inverted Index

● Input

– Lucene Segment

– query

• Field Name : Field Value

• offer : o1

● Output

– DocSet (posting list)