The journey to real-time analytics

Posted on 19-Feb-2017

The journey to real-time analytics

Ido Friedman

IdoFriedman.yml

Name: Ido Friedman

Past: "SQL Server consultant, Instructor, Team Leader"

Present: "Data engineer and Architect, Elasticsearch, CouchBase, MongoDB, Python",…]

WorkPlace: Perion

WhenNotWorking: @Sea

Agenda

What is real-time analytics

Our use case goals and insight

What’s next

Real-time analytics

Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources when they are needed. It consists of dynamic analysis and reporting, based on data entered into a system less than one minute before the actual time of use. Real-time analytics is also known as real-time data analytics, real-time data integration, and real-time intelligence.

Time dimensions/SLAs

Real time (msec/secs)

Near real time (min/hour)

Batch (hours/days)

Analytics types: batch analytics, real-time analytics, stream analytics

Our goals

Online segmentation

User report dashboard

Segmentation

Single event granularity

Full filtering flexibility, with no predefined filters

No restriction on time range queries

No data purging

Msec response time

Hundreds to Thousands of requests per minute

So it began

Elasticsearch was selected because

No overhead on indexing fields – everything is indexed

VERY fast filtering and aggregation

Rich aggregation and querying

Relatively easy maintenance of large data sets

Some words on Elasticsearch

Full-text engine gone wild

Highly available Search and analytics

Ultra scalable and easily maintainable

By developers for Developers

https://www.elastic.co/products/elasticsearch

ES Examples

Date histograms

Filters

Aggs

Cardinality

Top

Many more..
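As a rough illustration of how these features combine, the sketch below builds one search body that filters events and then runs a date histogram, a cardinality count and a top-values aggregation. The field names (timestamp, user_id, country) are hypothetical, and the interval key is a parameter because older Elasticsearch versions use "interval" where recent ones use "calendar_interval".

```python
import json


def build_segment_query(start, end, filters, interval="1h",
                        interval_key="calendar_interval"):
    """Build a search body: filter events first, then aggregate them."""
    # Each filter becomes an exact-match term clause (no scoring needed).
    clauses = [{"term": {field: value}}
               for field, value in sorted(filters.items())]
    # Restrict to the requested time range on the assumed "timestamp" field.
    clauses.append({"range": {"timestamp": {"gte": start, "lt": end}}})
    return {
        "size": 0,  # return aggregations only, no individual hits
        "query": {"bool": {"filter": clauses}},
        "aggs": {
            "events_over_time": {
                "date_histogram": {"field": "timestamp", interval_key: interval}
            },
            "unique_users": {"cardinality": {"field": "user_id"}},
            "top_countries": {"terms": {"field": "country", "size": 5}},
        },
    }


body = build_segment_query("2017-01-01", "2017-02-01", {"country": "IL"})
print(json.dumps(body, indent=2))
```

The body can be sent to the _search endpoint of an index with any HTTP client; only the dictionary construction is shown here.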

POC

Number of indexes and shards was decide…

Index mapping was set

Search patterns, queries and SLA were achieved

Data set was not big enough

RE – POC

IN PRODUCTION

POC v2 - Goals

Find the correct sharding / nodes combination

Create a manageable cluster

Achieve repeatable results

Reduce costs

The insights

Sharding

Replication

Nodes

Routing

Cluster management

Doc Values vs Field Data

Master nodes

The insights - Nodes

[Figure: data node options, with node boxes labeled "1 TB Data" and "250 GB Data". Effect of a single node downtime: 50% vs. 25%.]
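The 50% and 25% figures are consistent with simple arithmetic, sketched below under two assumptions that the slide does not state: the data is spread evenly across the data nodes, and there are no replicas to cover for a lost node. Under those assumptions 50% corresponds to 2 nodes and 25% to 4 nodes of 250 GB each.

```python
def offline_fraction(node_count, nodes_down=1):
    """Fraction of evenly spread, unreplicated data affected by node loss."""
    if node_count <= 0 or not 0 <= nodes_down <= node_count:
        raise ValueError("invalid node counts")
    return nodes_down / node_count


total_gb = 1000  # roughly 1 TB, as on the slide
for nodes in (2, 4):
    per_node = total_gb / nodes
    print(f"{nodes} nodes: {per_node:.0f} GB each, "
          f"{offline_fraction(nodes):.0%} affected by one node down")
```

With replicas the picture changes, since a surviving copy can serve the data while the node is down.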

Data loading

• Analyze your needs and choose tools that suit them

• If you know your data, don't invest in a generic solution

• Check your data load processes and verify their accuracy
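One way to verify a load, sketched here for the Elasticsearch bulk API, is to build the newline-delimited request body yourself and then inspect the per-item statuses in the response. The index name, documents and sample response below are made up for illustration; only the body format and the "items"/"status" response shape come from the bulk API.

```python
import json


def build_bulk_body(index, docs):
    """Serialize (id, document) pairs into the newline-delimited bulk format."""
    lines = []
    for doc_id, doc in docs:
        # Action line, followed by the document source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def failed_items(bulk_response):
    """Return the ids of items that did not get a 2xx status."""
    failed = []
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action type, e.g. "index".
        for result in item.values():
            if result.get("status", 500) >= 300:
                failed.append(result.get("_id"))
    return failed


docs = [("1", {"user_id": "u1"}), ("2", {"user_id": "u2"})]
body = build_bulk_body("events", docs)

# Hand-written response shaped like a bulk reply, for illustration only.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 429}},
    ],
}
print(failed_items(sample_response))
```

Comparing the number of documents sent with the number accepted catches silent drops that a top-level success code would hide.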

Resharding

Will be handled internally by Elasticsearch in future versions
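Resharding is costly because documents are assigned to shards by hashing a routing value modulo the number of primary shards, so changing that number reassigns most documents. The sketch below illustrates the effect in simplified form; zlib.crc32 is only a stand-in for the hash Elasticsearch actually uses, and the document keys are invented.

```python
import zlib


def shard_for(routing_key, primary_shards):
    """Pick a shard the way hash-based routing does: hash modulo shard count."""
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % primary_shards


def moved_fraction(keys, old_shards, new_shards):
    """Share of documents whose shard changes when the shard count changes."""
    moved = sum(1 for key in keys
                if shard_for(key, old_shards) != shard_for(key, new_shards))
    return moved / len(keys)


keys = [f"doc-{i}" for i in range(1000)]
print(f"{moved_fraction(keys, 4, 5):.0%} of documents would move "
      "when going from 4 to 5 shards")
```

This is why the shard count chosen in a POC matters so much: getting it wrong usually means reindexing the data.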

$$$$$

Money is not your enemy

Use costs as the main driver to improve your solution

Use costs as the main metric; it will keep your company running

Issues – not all is perfect

Cardinality aggregation

Performance

Accuracy

Data set size
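The performance and accuracy trade-off comes from the cardinality aggregation being approximate. Its precision_threshold setting trades memory for accuracy, with a documented upper bound of 40000. A minimal sketch of building such an aggregation, with a hypothetical user_id field:

```python
MAX_PRECISION_THRESHOLD = 40000  # documented upper bound for this setting


def cardinality_agg(field, precision_threshold=3000):
    """Build an approximate distinct-count aggregation for one field."""
    # Values above the maximum bring no benefit, so clamp them here.
    clamped = max(0, min(precision_threshold, MAX_PRECISION_THRESHOLD))
    return {"cardinality": {"field": field, "precision_threshold": clamped}}


aggs = {"unique_users": cardinality_agg("user_id", precision_threshold=100000)}
print(aggs)
```

Above the threshold the counts become estimates, which is where data set size starts to affect accuracy.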

Hardware resource balance

Find your real bottleneck

Choose the correct node for your load

Best practices are sometimes too general

We are not happy yet

We need joins – data modeling

Elasticsearch's main issue for us –> data piping

Where do we go next?

Other analytics engines?

Druid

MongoDB

Couchbase

MongoDB Aggregation framework
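For comparison, a segmentation-style query in the MongoDB aggregation framework is a list of pipeline stages. The sketch below only builds the pipeline; the collection and field names (events, timestamp, country) are hypothetical.

```python
from datetime import datetime


def events_per_country_pipeline(start, end, limit=5):
    """Count events per country in a time range and keep the top results."""
    return [
        {"$match": {"timestamp": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$country", "events": {"$sum": 1}}},
        {"$sort": {"events": -1}},
        {"$limit": limit},
    ]


pipeline = events_per_country_pipeline(datetime(2017, 1, 1),
                                       datetime(2017, 2, 1))
# With pymongo this would run as: db.events.aggregate(pipeline)
print(pipeline)
```

Putting $match first lets the engine use an index on the time field before grouping.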

Couchbase - Global Secondary Indexes (GSI)

CREATE INDEX productName_index1 ON bucket_name(productName, ProductID) WHERE type="product" USING GSI WITH {"nodes":"node1:8091"};

CREATE INDEX productName_index2 ON bucket_name(productName, ProductID) WHERE type="product" USING GSI WITH {"nodes":"node2:8091"};

CREATE INDEX productName_index1 ON bucket_name(productName, ProductID) WHERE type="product" AND productName BETWEEN "A" AND "K" USING GSI WITH {"nodes":"node1:8091"};

CREATE INDEX productName_index2 ON bucket_name(productName, ProductID) WHERE type="product" AND productName BETWEEN "K" AND "Z" USING GSI WITH {"nodes":"node2:8091"};

Manual scale out and replication
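Because the ranges and node placement are chosen by hand, statements like the ones above tend to be generated by a script. The helper below is a hypothetical sketch that reproduces the same statement shape from a list of (low, high, node) ranges; it uses plain string formatting, so it is only suitable for trusted, hard-coded values.

```python
def partitioned_index_statements(bucket, fields, ranges, doc_type="product"):
    """Generate one range-restricted CREATE INDEX statement per node."""
    statements = []
    for i, (low, high, node) in enumerate(ranges, start=1):
        statements.append(
            f'CREATE INDEX {fields[0]}_index{i} '
            f'ON {bucket}({", ".join(fields)}) '
            f'WHERE type="{doc_type}" '
            f'AND {fields[0]} BETWEEN "{low}" AND "{high}" '
            f'USING GSI WITH {{"nodes":"{node}"}};'
        )
    return statements


statements = partitioned_index_statements(
    "bucket_name",
    ["productName", "ProductID"],
    [("A", "K", "node1:8091"), ("K", "Z", "node2:8091")],
)
for statement in statements:
    print(statement)
```

Adding a node means adding a range to the list and rebuilding the affected indexes, which is the manual scale-out the slide refers to.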

Druid

Joins in Elasticsearch

http://siren.solutions/relational-joins-for-elasticsearch-the-siren-join-plugin/

$ curl -XGET 'http://localhost:9200/articles/_coordinate_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : {
        "filterjoin" : {
          "mentions" : {
            "indices" : ["companies"],
            "path" : "id",
            "query" : {
              "term" : { "name" : "orient" }
            }
          }
        }
      }
    }
  }
}'
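Without the plugin, a similar result can be approximated with an application-side join: query the companies index for matching ids, then filter articles whose mentions field contains any of them. The sketch below only builds the two request bodies and uses a hand-written sample response; it is an alternative technique, not what the Siren plugin does internally, and it only works while the id list stays reasonably small.

```python
def company_ids_query(name):
    """First step: find companies by name, returning only their id field."""
    return {"_source": ["id"], "query": {"term": {"name": name}}}


def extract_ids(search_response):
    """Pull the id field out of each hit of the first query's response."""
    return [hit["_source"]["id"] for hit in search_response["hits"]["hits"]]


def articles_by_mentions_query(company_ids):
    """Second step: keep articles that mention any of the given companies."""
    return {
        "query": {"bool": {"filter": [{"terms": {"mentions": list(company_ids)}}]}}
    }


# Hand-written response shaped like a search reply, for illustration only.
sample_response = {
    "hits": {"hits": [{"_source": {"id": "c1"}}, {"_source": {"id": "c2"}}]}
}
second_query = articles_by_mentions_query(extract_ids(sample_response))
print(second_query)
```

The cost is two round trips and the transfer of every matching id, which is the data piping problem mentioned earlier.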

Summary

No magic solutions

Always understand your data and needs

Invest the time on modeling and optimization
