Javantura v3 - Real-time big data ingestion and querying of aggregated data – Davor Poldrugo
Upload: hujak-hrvatska-udruga-java-korisnika-croatian-java-user-association
Post on 08-Jan-2017
REAL-TIME BIG DATA INGESTION AND QUERYING OF AGGREGATED DATA
Davor Poldrugo, software engineer
Davor Poldrugo @ Infobip
Software engineer with an interest in backend development, high availability and distributed systems.
https://about.me/davor.poldrugo
MOBILE SERVICES: Professional SMS, number validation, voice, USSD, mobile payments; deeply integrated into the telecoms world
ENTERPRISE PRODUCTS for businesses of any scale and need (mGate, fully-featured web apps, SMS authentication solutions, reseller solutions...)
APP ENGAGEMENT PLATFORM based on advanced push notifications
APIs and protocols for EASY INTEGRATION: xml, soap/rest, smpp, http, json
Full 24/7 TECHNICAL SUPPORT regardless of location
QUALITY guaranteed by a strict SLA
Our services
Presentation overview
Dictionary
The real-time use case and the challenges (because there are no problems ;)
The platform and how we got here
Our path towards real-time data
Architecture and component overview
Numbers and conclusion
Dictionary
REAL-TIME (noun): the actual time during which something takes place
real-time (adjective)
http://www.merriam-webster.com/dictionary/real%20time
BIG DATA (noun): an accumulation of data that is too large and complex for processing by traditional database management tools
http://www.merriam-webster.com/dictionary/big%20data
I'll use only this meaning although there are many meanings for the noun big data.
Real time: can anything really be real time?
Even the world as we experience it has latency... what we see, hear, smell and touch has to be processed by our brain. This processing takes time, maybe as little as milliseconds.
Dictionary
INGEST (verb): to take (something, such as food) into your body : to swallow (something); sometimes used figuratively
She ingested [=absorbed] large amounts of information very quickly.
http://www.learnersdictionary.com/definition/ingest
I'll use this figurative meaning... in the context of data ingestion.
The real-time use case and the challenges
Our new web requirement: provide real-time data and graphs of traffic
SMS Campaigns Web application
Near real-time
But we wanted real-time!
The platform and how we got here
There was only one node: a monolith
One transactional database (OLTP)
Traffic increased
After a while the database became a bottleneck
Then we introduced multiple transaction databases
Then multiple monolith nodes were introduced, one per database
Then load balancers were needed
OLTP - online transaction processing
The platform and how we got here
After that, querying became complex:
when one or more databases were down for maintenance, data from those databases was missing
queries had to span multiple databases and then the results had to be joined
aggregate reports became a problem (complexity, availability)
Aggregation databases were introduced (ETL) that pulled from the transactional databases
In the meantime we decoupled our monolithic node into many microservice nodes (IpCore, Billing, Contacts, Campaigns, ...)
As traffic increased, non-transactional queries (apps, reports) became a problem: throughput decreased
The platform and how we got here
Our Database Team introduced GREEN, our ODS/DWH
named after the color of the pencil used to draw on the board ;)
Near real-time ETL (for traffic tables with 150+ columns)
Centralized reporting
Decreased workload from transactional databases
Throughput increase of our core nodes (IpCore)
Specialized indexes
Specialized aggregations
But still... near real-time...
1 to 60 minutes out of sync with the transactional databases (depending on the load)
ODS - operational data store
Our path towards real-time data
GREEN ODS/DWH provided an abstract solution for all our traffic data, but it was not REAL-TIME
GREEN runs on big hardware and scales vertically
This approach solves particular REAL-TIME use cases one by one; it is not a silver bullet!
Because REAL-TIME isn't always needed
Resources are limited
The path towards horizontal scalability
Our path towards real-time data
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from batch views and real-time views.
Lambda architecture ( http://lambda-architecture.net/ )
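The last point, merging batch views with real-time views, can be sketched roughly as below. This is only an illustration: the view contents and the `merge_views` helper are hypothetical, and it assumes both views hold per-campaign counts over non-overlapping time ranges, so that adding them is correct.

```python
from collections import Counter

def merge_views(batch_view, realtime_view):
    """Combine pre-computed batch totals with recent real-time totals per key."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds the counts of matching keys
    return dict(merged)

# Hypothetical per-campaign message counts (campaignId -> count).
batch_view = {29680: 999_000, 29681: 500}   # older data, pre-computed
realtime_view = {29680: 1_000, 29682: 7}    # recent data only

totals = merge_views(batch_view, realtime_view)
```

A campaign that appears in only one of the views is kept as-is; a campaign present in both gets the sum.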
Our path towards real-time data
Know thyself! Adapt lambda architecture to fit your needs!
[Diagram: the adapted architecture. Messaging Cloud: IpCore nodes (core message processing), Apps, transactional databases (OLTP). MessageEvents enter the REAL-TIME LAYER through ingest points. GREEN DB (ODS/DWH) is the newly proclaimed BATCH/SERVING LAYER. Messaging Cloud Apps use the QUERY LAYER, which queries REAL-TIME or BATCH.]
Architecture and component overview
[Diagram: Apps in the Messaging Cloud emit MessageEvents. In the REAL-TIME LAYER, the Billing ingest point and the IpCore ingest point feed the Data Ingestion Service (ProcessMessage, ProcessDelta: pairing and composing a new message), followed by the Kafka cluster and the Druid cluster.]
Architecture and component overview
REAL-TIME LAYER: Data Ingestion Service, Kafka cluster, Druid cluster
{
  "sendDateTime": "2016-02-19T12:07:47Z",
  "campaignId": 29680,
  "currencyId": 2,
  "currencyHNBCode": "EUR",
  "currencySymbol": "",
  "countDelta": 1,
  "priceDelta": 0.02
}
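A rough sketch of how such delta messages could add up to per-campaign totals. It assumes that `countDelta` and `priceDelta` are meant to be summed per `campaignId`; the `roll_up` helper and the target names `totalCount` and `price` are illustrative choices, not the actual ingestion code.

```python
import json
from collections import defaultdict

def roll_up(raw_messages):
    """Sum countDelta and priceDelta per campaignId (assumed aggregation)."""
    totals = defaultdict(lambda: {"totalCount": 0, "price": 0.0})
    for raw in raw_messages:
        event = json.loads(raw)
        entry = totals[event["campaignId"]]
        entry["totalCount"] += event["countDelta"]
        entry["price"] += event["priceDelta"]
    return dict(totals)

# The message from the slide, serialized as it would travel on the wire.
delta = {
    "sendDateTime": "2016-02-19T12:07:47Z",
    "campaignId": 29680,
    "currencyId": 2,
    "currencyHNBCode": "EUR",
    "currencySymbol": "",
    "countDelta": 1,
    "priceDelta": 0.02,
}
raw_messages = [json.dumps(delta)] * 3  # three identical deltas for illustration

totals = roll_up(raw_messages)
```

Sending deltas instead of absolute values means the consumer only ever has to add, which is why a sum-style aggregation fits.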
Architecture and component overview
[Diagram: Messaging Cloud Apps call the DataQueryService in the QUERY LAYER. It checks "Is realtime?": TRUE goes to the REAL-TIME LAYER (Data Ingestion Service, Kafka cluster, Druid cluster), FALSE goes to the BATCH LAYER (GREEN DB, ODS/DWH).]
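The "Is realtime?" decision can be sketched as a simple router. Everything here is hypothetical: `query_druid` and `query_green` are stand-ins for the real backends, and the `is_realtime` flag is assumed to come from the caller.

```python
def query_druid(query):
    """Hypothetical stand-in for the real-time layer (Druid)."""
    return {"source": "real-time", "query": query}

def query_green(query):
    """Hypothetical stand-in for the batch layer (GREEN)."""
    return {"source": "batch", "query": query}

def run_query(query, is_realtime):
    """Route the query to exactly one layer, as in the TRUE/FALSE branch."""
    backend = query_druid if is_realtime else query_green
    return backend(query)

live = run_query({"campaignId": 29680}, is_realtime=True)
report = run_query({"campaignId": 29680}, is_realtime=False)
```

Note that this routes to one layer or the other instead of merging both, which is the adaptation of the lambda architecture the slides describe.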
Architecture and component overview
REAL-TIME LAYER: Druid cluster. QUERY LAYER: DataQueryService.
Request to Druid
POST /druid/v2 HTTP/1.1
Host: druid-broker-node:8080
Content-Type: application/json

{
  "queryType": "groupBy",
  "dataSource": "campaign-totals-v2",
  "granularity": "all",
  "intervals": [ "2012-01-01T00:00:00.000/2100-01-01T00:00:00.000" ],
  "dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
  "filter": { "type": "selector", "dimension": "campaignId", "value": 29680 },
  "aggregations": [
    { "type": "longSum", "name": "totalCountSum", "fieldName": "totalCount" },
    { "type": "doubleSum", "name": "totalPriceSum", "fieldName": "price" }
  ]
}
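A minimal sketch of how a client might build this request, using only the Python standard library. The query body and the broker address are taken from the slide; the helper names are made up for illustration, and the request is only constructed here, not sent.

```python
import json
import urllib.request

def build_campaign_totals_query(campaign_id):
    """Build the groupBy query from the slide for one campaign."""
    return {
        "queryType": "groupBy",
        "dataSource": "campaign-totals-v2",
        "granularity": "all",
        "intervals": ["2012-01-01T00:00:00.000/2100-01-01T00:00:00.000"],
        "dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
        "filter": {"type": "selector", "dimension": "campaignId", "value": campaign_id},
        "aggregations": [
            {"type": "longSum", "name": "totalCountSum", "fieldName": "totalCount"},
            {"type": "doubleSum", "name": "totalPriceSum", "fieldName": "price"},
        ],
    }

def build_request(query, broker_url="http://druid-broker-node:8080/druid/v2"):
    """Wrap the query in an HTTP POST with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        broker_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_request(build_campaign_totals_query(29680))
```

Passing `request` to `urllib.request.urlopen` would send it to the broker, assuming the host is reachable.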
Architecture and component overview
REAL-TIME LAYER: Druid cluster. QUERY LAYER: DataQueryService.
Response from Druid
[
  {
    "version": "v1",
    "timestamp": "2012-01-01T00:00:00.000Z",
    "event": {
      "totalCountSum": 1000000,
      "currencyid": "2",
      "totalPriceSum": 20000,
      "currencysymbol": "",
      "currencyhnbcode": "EUR",
      "campaignid": "29680"
    }
  }
]
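Reading such a response could look roughly like this. The field names follow the response shown on the slide (note the lowercase dimension names and string values); the `extract_totals` helper and its output shape are illustrative assumptions.

```python
import json

RESPONSE_TEXT = """
[ { "version": "v1",
    "timestamp": "2012-01-01T00:00:00.000Z",
    "event": { "totalCountSum": 1000000, "currencyid": "2",
               "totalPriceSum": 20000, "currencysymbol": "",
               "currencyhnbcode": "EUR", "campaignid": "29680" } } ]
"""

def extract_totals(response_text):
    """Turn groupBy result rows into flat per-campaign totals."""
    rows = json.loads(response_text)
    return [
        {
            "campaignId": int(row["event"]["campaignid"]),  # returned as a string
            "currency": row["event"]["currencyhnbcode"],
            "count": row["event"]["totalCountSum"],
            "price": row["event"]["totalPriceSum"],
        }
        for row in rows
    ]

totals = extract_totals(RESPONSE_TEXT)
average_price = totals[0]["price"] / totals[0]["count"]
```

For the sample response the average price per message works out to 0.02, which matches the `priceDelta` in the ingested message.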
Architecture and component overview
KAFKA - https://kafka.apache.org/
Kafka maintains feeds of messages in categories called topics.
A distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
FEATURES
- two messaging models, queue and publish-subscribe, incorporated in an abstraction called the consumer group (group id)
- queue - a pool of consumers may read from a server and each message goes to one of them
- publish-subscribe - the message is broadcast to all consumers
- constant performance with respect to data size
- replay - all messages are stored and can be accessed with a sequential id number called the offset
- If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
- If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
- Kafka's performance is effectively constant with respect to data size, so retaining lots of data is not a problem.
- You can forget about Kafka: it just works!
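The consumer group semantics and offset-based replay can be illustrated with a toy in-memory model. This is not the Kafka client API: `ToyTopic` is a made-up class, it models a single log, and it hands messages to group members round-robin, whereas real Kafka assigns partitions to the consumers of a group.

```python
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for a topic; illustrates semantics only."""

    def __init__(self):
        self.log = []                        # append-only message log
        self.members = defaultdict(list)     # group id -> consumer inboxes
        self.next_member = defaultdict(int)  # group id -> round-robin cursor

    def subscribe(self, group_id):
        """Register a consumer in a group and return its inbox."""
        inbox = []
        self.members[group_id].append(inbox)
        return inbox

    def publish(self, message):
        """Every group sees the message; one member per group receives it."""
        self.log.append(message)
        for group_id, inboxes in self.members.items():
            index = self.next_member[group_id] % len(inboxes)
            inboxes[index].append(message)
            self.next_member[group_id] += 1

    def replay(self, offset):
        """Re-read all stored messages starting at a sequential offset."""
        return self.log[offset:]

topic = ToyTopic()
worker_a = topic.subscribe("workers")  # same group: behaves like a queue
worker_b = topic.subscribe("workers")
auditor = topic.subscribe("audit")     # different group: gets every message

for number in range(4):
    topic.publish(f"m{number}")
```

Here the two `workers` consumers split the messages between them, while `auditor` receives all four, and `topic.replay(2)` returns the messages from offset 2 onward.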
Architecture and component overview
DRUID - http://druid.io/
Druid is a fast column-oriented distributed data store.
Real-time Streams
Druid supports streaming data ingestion and offers insights on events immediately after they occur. Retain events indefinitely and unify real-time and historical views.
Sub-Second Queries
Druid supports fast aggregations and sub-second OLAP queries.
Scalable to Petabytes
Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Druid is extremely cost effective, even at scale.
Deploy Anywhere
Druid runs on commodity hardware. Deploy it in the cloud or on-premise. Integrate with existing data systems such as Hadoop, Spark, Kafka, Storm, and Samza.
Numbers and conclusion
Data pipeline                                      Max. throughput (msg/s)
Ingest points -> Data Ingestion Service            7700
Billing ingest point -> Data Ingestion Service     5500
IpCore ingest point -> Data Ingestion Service      2200
Data Ingestion Service -> Kafka                    2130
Druid firehose (pull and aggregate from Kafka)     29000
Real-time!