The journey to real-time analytics

The journey to real-time analytics Ido Friedman

Upload: nosql-tlv

Post on 19-Feb-2017

TRANSCRIPT

Page 1: The journey to real-time analytics

The journey to real-time analytics

Ido Friedman

Page 2: The journey to real-time analytics

IdoFriedman.yml

Name: Ido Friedman
Past: "SQL Server consultant, Instructor, Team Leader"
Present: "Data engineer and Architect, Elasticsearch, CouchBase, MongoDB, Python", …]
WorkPlace: Perion
WhenNotWorking: @Sea

Page 3: The journey to real-time analytics

Agenda

What is real-time analytics

Our use case, goals and insights

What’s next

Page 4: The journey to real-time analytics

Real-Time analytics

Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources when they are needed. It consists of dynamic analysis and reporting, based on data entered into a system less than one minute before the actual time of use. Real-time analytics is also known as real-time data analytics, real-time data integration, and real-time intelligence.

Page 5: The journey to real-time analytics

Time dimensions/SLAs

Real Time: Msec/Secs

Near Real Time: Min/Hour

Batch: Hours/Days

Page 6: The journey to real-time analytics

Analytics

Batch Analytics

Real-Time Analytics

Stream Analytics

Page 7: The journey to real-time analytics

Our goals

Online segmentation

User report dashboard

Page 8: The journey to real-time analytics

Segmentation

Single event granularity

Full filtering flexibility, with no predefined filters

No restriction on time range queries

No data purging

Msec response time

Hundreds to Thousands of requests per minute

Page 9: The journey to real-time analytics

So it began

Elasticsearch was selected because:

No overhead on indexing fields – everything is indexed

VERY fast filtering and aggregation

Rich aggregation and querying

Relatively easy maintenance of large data sets

Page 10: The journey to real-time analytics

Some words on Elasticsearch

Full-text engine gone wild

Highly available Search and analytics

Ultra scalable and easily maintainable

By developers for Developers

https://www.elastic.co/products/elasticsearch

Page 11: The journey to real-time analytics

ES Examples

Date histograms

Filters

Aggs

Cardinality

Top

Many more...
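As a rough illustration of the aggregation types listed above, the sketch below builds one search body that combines a filter, a date histogram, a cardinality aggregation, and a terms ("top values") aggregation. The field names (`event_type`, `timestamp`, `user_id`, `country`) and the function name are assumptions made for illustration; they are not taken from the talk. The `interval` parameter matches Elasticsearch versions from the time of the talk; newer versions use `calendar_interval` instead.

```python
import json


def daily_unique_users_query(event_type, start, end):
    """Build a search body: filter events, then aggregate them per day.

    All field names are hypothetical placeholders.
    """
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event_type": event_type}},
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "per_day": {
                # one bucket per calendar day
                "date_histogram": {"field": "timestamp", "interval": "day"},
                "aggs": {
                    # approximate distinct count of users in each bucket
                    "unique_users": {"cardinality": {"field": "user_id"}},
                    # the five most frequent countries in each bucket
                    "top_countries": {"terms": {"field": "country", "size": 5}},
                },
            }
        },
    }


body = daily_unique_users_query("click", "2017-01-01", "2017-02-01")
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the request body to an index's `_search` endpoint.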

Page 12: The journey to real-time analytics

POC

Number of indexes and shards was decide…

Index mapping was set

Search patterns, queries and SLA were achieved

Data set was not big enough

RE – POC

IN PRODUCTION

Page 13: The journey to real-time analytics

POC v2 - Goals

Find the correct sharding / nodes combination

Create a manageable cluster

Achieve repeatable results

Reduce costs

Page 14: The journey to real-time analytics

The insights

Sharding

Replication

Nodes

Routing

Cluster management

Doc Values vs Field Data

Master nodes
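Sharding and routing are closely linked: Elasticsearch chooses a document's shard by hashing its routing value (the document ID by default) and taking the result modulo the number of primary shards. The sketch below imitates that rule to show why the shard count has to be chosen carefully up front. It is an illustration only: `pick_shard` is a hypothetical helper, and CRC32 stands in for the Murmur3 hash that Elasticsearch actually uses.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Map a routing value to a shard number: hash(routing) % primary shards."""
    # Stand-in hash: CRC32 is used only because it ships with Python and is
    # deterministic. Elasticsearch itself uses Murmur3.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


user_ids = ["user-1", "user-2", "user-3", "user-4"]

# The same routing value always lands on the same shard for a fixed shard count.
with_4_shards = {u: pick_shard(u, 4) for u in user_ids}

# With a different shard count the placement is recomputed from scratch,
# so existing documents would no longer be where the formula expects them.
with_5_shards = {u: pick_shard(u, 5) for u in user_ids}

print(with_4_shards)
print(with_5_shards)
```

Because placement depends on the shard count, changing the number of primary shards generally means reindexing the data.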

Page 15: The journey to real-time analytics

The insights - Nodes

[Diagram: 1 TB of data placed on data nodes holding 250 GB each, comparing node layout options. Effect of a single node downtime: 50% in one option, 25% in the other.]
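The 50% and 25% figures follow from simple arithmetic: if data is spread evenly and there are no replicas, losing one of n data nodes makes 1/n of the data unavailable. The sketch below works this through. The even-spread and no-replica assumptions, and the function names, are mine and are not stated on the slide.

```python
def share_lost_on_node_failure(node_count):
    """Fraction of data unavailable when one node is down.

    Assumes data is spread evenly across nodes and there are no replicas.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1.0 / node_count


def data_per_node_gb(total_gb, node_count):
    """Data held by each node under the same even-spread assumption."""
    return total_gb / node_count


for nodes in (2, 4):
    lost = share_lost_on_node_failure(nodes)
    per_node = data_per_node_gb(1000, nodes)
    print(f"{nodes} nodes: {per_node:.0f} GB per node, {lost:.0%} affected by one node down")
```

More, smaller nodes reduce the impact of a single failure, at the price of more machines to manage.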

Page 16: The journey to real-time analytics

Data loading

• Analyze your needs and choose tools to suit them

• If you know your data, don't invest in a generic solution

• Check your data load processes and verify their accuracy

Page 17: The journey to real-time analytics

Re-sharding

Will be handled internally by Elasticsearch in future versions

Page 18: The journey to real-time analytics

$$$$$

Money is not your enemy

Use costs as the main driver to improve your solution

Use costs as the main metric; it will keep your company running

Page 19: The journey to real-time analytics

Issues – not all is perfect

Cardinality aggregation

Performance

Accuracy

Data set size
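The accuracy and performance concerns are linked: the cardinality aggregation returns an approximate distinct count, and its `precision_threshold` setting trades memory for accuracy (the default is 3000 and the maximum supported value is 40000). The sketch below builds such an aggregation. The helper name and the `user_id` field are hypothetical; the talk does not say which settings were used.

```python
MAX_PRECISION_THRESHOLD = 40000  # upper limit accepted by Elasticsearch


def unique_count_agg(field, precision_threshold=3000):
    """Build a cardinality aggregation with an explicit precision threshold.

    Counts below the threshold are expected to be close to exact; above it,
    results become more approximate. Higher thresholds use more memory.
    """
    if not 0 <= precision_threshold <= MAX_PRECISION_THRESHOLD:
        raise ValueError("precision_threshold must be between 0 and 40000")
    return {
        "unique": {
            "cardinality": {
                "field": field,
                "precision_threshold": precision_threshold,
            }
        }
    }


# Hypothetical field name; ask for the highest supported precision.
aggs = unique_count_agg("user_id", precision_threshold=40000)
print(aggs)
```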

Page 20: The journey to real-time analytics

Hardware resource balance

Find your real bottleneck

Choose the correct node for your load

Best practices are sometimes too general

Page 21: The journey to real-time analytics

We are not happy yet

We need joins – data modeling

Elasticsearch main issue for us –> data piping

Page 22: The journey to real-time analytics

Where do we go next?

Other analytics engines?

Druid

MongoDB

Couchbase

Page 23: The journey to real-time analytics

MongoDB Aggregation framework
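For readers unfamiliar with it, the aggregation framework expresses a query as an ordered list of stages. The sketch below builds a pipeline that counts events and unique users per day, roughly the kind of report discussed in this talk. The field names and the function name are assumptions for illustration, not taken from the slides.

```python
from datetime import datetime


def events_per_day_pipeline(event_type, start, end):
    """Build an aggregation pipeline: filter, group per day, count, sort.

    All field names are hypothetical placeholders.
    """
    return [
        # Stage 1: keep only the matching events in the time range
        {"$match": {"event_type": event_type,
                    "timestamp": {"$gte": start, "$lt": end}}},
        # Stage 2: group by calendar day, counting events and collecting users
        {"$group": {
            "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$timestamp"}},
            "events": {"$sum": 1},
            "users": {"$addToSet": "$user_id"},
        }},
        # Stage 3: replace the collected user set with its size
        {"$project": {"events": 1, "unique_users": {"$size": "$users"}}},
        # Stage 4: order the days chronologically
        {"$sort": {"_id": 1}},
    ]


pipeline = events_per_day_pipeline("click", datetime(2017, 1, 1), datetime(2017, 2, 1))
print([next(iter(stage)) for stage in pipeline])
```

With a driver such as PyMongo, a list like this is passed to a collection's `aggregate` method.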

Page 24: The journey to real-time analytics

CouchBase - Global Secondary Indexes (GSI)

Replication (the same index definition on two nodes):

CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
  WHERE type="product"
  USING GSI WITH {"nodes":"node1:8091"};

CREATE INDEX productName_index2 ON bucket_name(productName, ProductID)
  WHERE type="product"
  USING GSI WITH {"nodes":"node2:8091"};

Scale out (the index split by productName range across two nodes):

CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
  WHERE type="product" AND productName BETWEEN "A" AND "K"
  USING GSI WITH {"nodes":"node1:8091"};

CREATE INDEX productName_index2 ON bucket_name(productName, ProductID)
  WHERE type="product" AND productName BETWEEN "K" AND "Z"
  USING GSI WITH {"nodes":"node2:8091"};

Manual scale out and replication

Page 25: The journey to real-time analytics

Druid

Page 26: The journey to real-time analytics

Joins in ElasticSearch

http://siren.solutions/relational-joins-for-elasticsearch-the-siren-join-plugin/

$ curl -XGET 'http://localhost:9200/articles/_coordinate_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : {
        "filterjoin" : {
          "mentions" : {
            "indices" : ["companies"],
            "path" : "id",
            "query" : {
              "term" : { "name" : "orient" }
            }
          }
        }
      }
    }
  }
}'

Page 27: The journey to real-time analytics

Summary

No magic solutions

Always understand your data and needs

Invest the time on modeling and optimization

Page 28: The journey to real-time analytics