BlaBlaCar ElasticSearch feedback
Post on 12-Jul-2015
TRANSCRIPT
1/37
ElasticSearch feedback
2/37
Introduction
3/37
Nicolas Blanc - BlaBlArchitect
Sinfomic (1999)
@thewhitegeek
(2001)
(2005)
(2008)
(2012)
4/37
What is BlaBlaCar ?
5/37
3 000 000 MEMBERS IN EUROPE
6/37
10 countries (previously 9)
● France
● Spain
● Italy
● UK
● Poland
● Portugal
● Netherlands
● Belgium
● Luxembourg
● NEW Germany
7/37
Growth
[Chart: growth curve from January 2008 to January 2013, with axis marks at 25 million and 50 million]
8/37
Infrastructure
2 front web servers
2 MySQL masters (+4 SSD slaves)
1 private cloud (KVM + Open vSwitch)
● Redis
● Memcache
● RabbitMQ/workers
1 ElasticSearch cluster
9/37
Changing the Search Engine
10/37
What exists? Why change?
MySQL database
● Relational DB (lots of joins needed)
● Plain SQL queries
● Home-made geographical search
Recent problems
● New features mean more complex queries
● Scalability: performance depends on the DB load
11/37
Initial requirements
Scalability
● A trip search must complete in less than 200 ms
● The system side of the solution must be easy to maintain
● It must be possible to cluster it (also to avoid a SPOF)
Low code impact on the existing application
● Same features as today (geographical search)
● Minimize the developers' work
● Add one missing feature: facets
12/37
Initial Competitors
SenseiDB
13/37
Why ElasticSearch
✔ Easiest clustering
✔ Good performance when indexing
✔ Little code to write to use it
✔ Schemaless
✔ Based on Lucene
✔ Written in Java (we needed to code a grouping feature)
14/37
ElasticSearch has won, now let's migrate our search!
15/37
Changing our mindset
Object in a relational database
● Can be split across multiple tables
● Lots of information reachable through JOINs
Object in a document-oriented database
● Only one big index for these objects
● All information needs to be in the object, not spread over multiple tables
17/37
Defining our objects well
Need to know what we want to search
● Searching trips (front office usage)
● Searching members (back office usage)
● Searching FAQ (front office usage)
Think of all the needed fields
● The ones used for queries
● The ones used for filters
● The ones used for facets
18/37
Thinking about defining the index well
System point of view
● Number of nodes in the cluster
● Number of shards
● Number of replicas
Application point of view
● Define type and attributes for all fields (mapping)
● Use parent/child or nested documents to improve indexing
● How to push documents from the DB?
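As an illustration of where the system-side and application-side choices end up, here is a minimal Python sketch of the body of an index-creation request. The setting names number_of_shards and number_of_replicas are ElasticSearch index settings; the helper name build_index_body and any values passed to it are hypothetical and are not taken from the slides.

```python
import json


def build_index_body(shards, replicas, mappings):
    """Build the JSON body of an index-creation request.

    shards/replicas are the system-side choices; mappings holds the
    application-side field definitions, keyed by document type.
    """
    if shards < 1:
        raise ValueError("an index needs at least one shard")
    if replicas < 0:
        raise ValueError("the number of replicas cannot be negative")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": mappings,
    }


# Placeholder values, chosen only for the example.
body = build_index_body(1, 1, {"trip": {"properties": {}}})
print(json.dumps(body, indent=2))
```

The number of shards is fixed when the index is created, which is why it is worth deciding before any document is pushed.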
19/37
Indexing: using a river or not?
River advantages
● Plugs directly into our source backend
● An ElasticSearch API exists to code a new one
River problems
● Not easy to add business logic on some fields
● Really hard when your DB is unconventional
● A full reindex goes through all the documents
20/37
Indexing: our manual way
We wrote an asynchronous indexer
● Written in Java
● Applies business logic when fetching from the DB
● Fetches from multiple DBs/sources
● Uses the Java ES library
● Easy interface
  ● Send {"trip":1234567} and the server answers {"OK"}
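The idea of an asynchronous indexer can be sketched as a queue plus a background worker. The real indexer is written in Java and its code is not shown in the slides, so the Python below is only an in-memory illustration: AsyncIndexer, fetch_trip_document and the index callback are hypothetical names, and the fetch function stands in for the SQL queries and business logic.

```python
import json
import queue
import threading


def fetch_trip_document(trip_id):
    """Hypothetical stand-in for the SQL queries and business logic."""
    return {"trip_id": trip_id, "seats_left": 3}


class AsyncIndexer:
    """Accepts {"trip": id} messages and indexes them in the background."""

    def __init__(self, fetch, index):
        self._fetch = fetch      # builds a document from the source DBs
        self._index = index      # pushes the document to the search engine
        self._queue = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, message):
        """Queue the trip id and answer immediately, before indexing."""
        event = json.loads(message)
        self._queue.put(event["trip"])
        return "OK"

    def wait(self):
        """Block until every queued trip has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            trip_id = self._queue.get()
            try:
                self._index(trip_id, self._fetch(trip_id))
            finally:
                self._queue.task_done()


# Usage: a dict plays the role of the index.
indexed = {}
indexer = AsyncIndexer(fetch_trip_document, indexed.__setitem__)
print(indexer.submit('{"trip": 1234567}'))
indexer.wait()
```

Because the caller only sends an identifier, the website stays decoupled from how the document is built.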
21/37
One index sample : Trip
22/37
Defining our Trip object well
Think of all the needed fields
● The ones used for queries
  ● Trip departure date, from where, to where, user id
● The ones used for filters
  ● User ratings, price, vehicle, seats left, is user blocked (a blocked user is a user who performed a forbidden action on the website)
● The ones used for facets
  ● User ratings, price, vehicle
23/37
Defining our Trip index well
Think of all the system requirements
● The cluster has 2 nodes
● We keep the default configuration for shards/replicas
Think of the object mapping
● For each field:
  ● Define the type (string, long, geo_point, date, float, boolean)
  ● Define the scope (include_in_all)
  ● Define the analyzer (for type string)
Trip Mapping
"trip": {
  "properties": {
    "is_user_blocked": { "type": "boolean", "include_in_all": false },
    "user_ratings": { "type": "long", "include_in_all": false },
    "from": { "type": "geo_point", "include_in_all": false },
    "price": { "include_in_all": false, "type": "float" },
    "price_euro": { "type": "float", "include_in_all": false },
    "seats_left": { "include_in_all": false, "type": "long" },
    "seats_offered": { "include_in_all": false, "type": "long" },
    "to": { "include_in_all": false, "type": "geo_point" },
    "trip_date": { "format": "dateOptionalTime", "include_in_all": false, "type": "date" },
    "vehicle": { "include_in_all": false, "type": "string" },
    "userid": { "include_in_all": false, "index": "not_analyzed", "type": "string" }
  }
}
25/37
Indexing events well
Which modifications send a change event?
● All trip creations/deletions/modifications
● Member modifications (blocked or not)
● New ratings from other members
● A seat has been reserved
● A member changes their vehicle
A change event is a call to the internal indexer
● Send '{"trip":123456}' to the indexer (create/update)
● Send '{"tripd":123456}' to the indexer (delete)
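The two message shapes above can be produced and decoded with a few lines of code. This is a minimal Python sketch: the keys "trip" and "tripd" come from the slide, while the function names and the "upsert"/"delete" labels are assumptions made for the example.

```python
import json

# Message keys taken from the slide; the action labels are hypothetical.
EVENT_KEYS = {"upsert": "trip", "delete": "tripd"}


def build_trip_event(trip_id, action="upsert"):
    """Serialize a change event for the internal indexer."""
    if action not in EVENT_KEYS:
        raise ValueError("unknown action: %r" % (action,))
    return json.dumps({EVENT_KEYS[action]: trip_id}, separators=(",", ":"))


def parse_trip_event(message):
    """Return (action, trip_id) for a message sent to the indexer."""
    payload = json.loads(message)
    if len(payload) != 1:
        raise ValueError("expected exactly one key in the event")
    (key, trip_id), = payload.items()
    for action, event_key in EVENT_KEYS.items():
        if key == event_key:
            return action, trip_id
    raise ValueError("unknown event key: %r" % (key,))


print(build_trip_event(123456))            # create/update
print(build_trip_event(123456, "delete"))  # delete
```

Every event listed above (a new rating, a reserved seat, a blocked member) reduces to the same message: the identifier of the trip whose document must be rebuilt or removed.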
26/37
Sample trip index query
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [
          { "geo_distance": { "distance": "40.14937866995km", "from": { "lat": 48.856614, "lon": 2.3522219 } } },
          { "geo_distance": { "distance": "40.14937866995km", "to": { "lat": 45.764043, "lon": 4.835659 } } },
          { "range": { "price": { "from": 0, "include_lower": false } } }
        ]
      }
    }
  },
  "sort": [{ "trip_date": { "order": "asc" } }],
  "filter": { "term": { "is_user_blocked": false } },
  "from": 0,
  "size": 10
}
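A query of this shape is usually generated from the search form rather than written by hand. The sketch below builds the same request body in Python; build_trip_search and its parameters are hypothetical names, and the structure mirrors the query DSL used on the slide (the filtered query and the and filter belong to the ElasticSearch versions of that period and were replaced by the bool query in later releases).

```python
import json


def build_trip_search(origin, destination, radius_km, page=0, page_size=10):
    """Build a trip search body: trips leaving near origin and arriving
    near destination, with a price above 0, sorted by departure date.

    origin and destination are (lat, lon) tuples.
    """
    def near(field, point):
        lat, lon = point
        return {
            "geo_distance": {
                "distance": "%skm" % radius_km,
                field: {"lat": lat, "lon": lon},
            }
        }

    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "and": [
                        near("from", origin),
                        near("to", destination),
                        {"range": {"price": {"from": 0, "include_lower": False}}},
                    ]
                },
            }
        },
        "sort": [{"trip_date": {"order": "asc"}}],
        # Top-level filter: hides trips whose owner is blocked.
        "filter": {"term": {"is_user_blocked": False}},
        "from": page * page_size,
        "size": page_size,
    }


paris = (48.856614, 2.3522219)
lyon = (45.764043, 4.835659)
print(json.dumps(build_trip_search(paris, lyon, 40), indent=2))
```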
27/37
The Real World
A trip now has more than 30 fields
● (FAQ is around 25 fields)
● (members even more...)
To build a trip document we need 3 different SQL queries
● (FAQ: 2 different SQL queries)
● (Member: 10 different SQL queries)
The trip index has only 1 shard (because of grouping)
28/37
And now the caveats
29/37
Preloaded Scripts
We use MVEL scripts to improve scoring
● They are not clustered
● Each node needs to have the scripts
● A node restart is needed to add or modify them
Solution: Chef (tool from Opscode)
All node configurations are centralized in a Chef repository
30/37
Grouping documents
Home-made patches to ElasticSearch (based on work by Martijn van Groningen for lusini.de)
Soon in ElasticSearch (I really hope so)
31/37
Mapping modification
On a running index:
● Changing a type is not allowed
● Changing an analyzer is not allowed
Solution: index alias
1) Changing the mapping → create a new index
2) When the new index is up to date → change the alias
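Step 2 can be done in a single request to the ElasticSearch _aliases endpoint, whose body lists remove and add actions. The Python sketch below only builds that body; the function name and the index and alias names (trips, trips_v1, trips_v2) are hypothetical.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Body for a POST to /_aliases that repoints alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application always searches "trips".
print(json.dumps(build_alias_swap("trips", "trips_v1", "trips_v2"), indent=2))
```

Since the application only ever queries the alias, the new mapping goes live without any change to the application code.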
32/37
I/O limits
We have only 2 nodes
● The Trip index is around 2 GB
● But there is only 1 shard for the Trip index
● Can index 100 trips/second on a busy evening
Solution: we put in Intel SSDs (while waiting for the distributed grouping feature)
33/37
Choosing the analyzer
Some fields must not be analyzed
● If you use ISO codes for countries (IT for Italy or DE for Germany are ignored in some cases)
A global analyzer has limits
● Accented characters from languages such as French, German or Spanish are not always parsed correctly
● One analyzer per country is difficult to implement in some cases
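The slide does not say why some country codes are ignored. One plausible explanation, stated here as an assumption, is stopword removal: once lowercased, "IT" becomes the English word "it" and "DE" becomes the French or Spanish word "de". The toy Python analyzer below illustrates that effect; it is not ElasticSearch code, and the stopword sets are small hand-picked samples.

```python
import re

# Small illustrative samples, not the real analyzer stopword lists.
ENGLISH_STOPWORDS = {"a", "an", "and", "is", "it", "the", "to"}
FRENCH_STOPWORDS = {"de", "et", "la", "le", "les"}


def analyze(text, stopwords):
    """Toy analyzer: lowercase, split on non-word characters, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


def keep_as_is(text):
    """What a not_analyzed field stores: the exact value, as a single term."""
    return [text]


print(analyze("IT", ENGLISH_STOPWORDS))  # the country code disappears
print(analyze("DE", FRENCH_STOPWORDS))   # same problem with another language
print(keep_as_is("IT"))                  # the code survives untouched
```

Under that assumption, declaring the country field with "index": "not_analyzed", as the mapping does for userid, avoids the problem.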
34/37
OK, sweet. What's next?
35/37
Using ElasticSearch to ease log analysis
36/37
By the way…
We’re hiring !!! Dev, HTML Ninja, leader,…
Come & See me right now… or send me your friends
(And we have beer, table football and an arcade cabinet)
37/37
Thank you !
Follow us !
@covoiturage
Apply now :
join@BlaBlaCar.com