imc summit 2016 breakout - roman shtykh - apache ignite as a data processing hub
TRANSCRIPT
APACHE IGNITE AS A DATA PROCESSING HUB
Roman Shtykh, CyberAgent, Inc.
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
INTRODUCTION
ABOUT ME
Roman Shtykh, R&D Engineer at CyberAgent, Inc.
Areas of focus: data streaming and NLP
Committer on the Apache Ignite and MyBatis projects
Judoka
@rshtykh
CYBERAGENT, INC.
Business segments*: Internet ads 52%, Games 25%, Media 13%, Investing 3%, Other 7%
* As of Sep 2015
AMEBA SERVICES
• Monthly visitors (DUB total): 6 billion*
• Number of member users: about 39 million*
* As of Dec 2014
• Games
• Community services
• Content curation
• Other
AMEBA SERVICES
Ameba Pigg
CONTENTS
Apache Ignite: feed your data
Log aggregation with Apache Flume: integration with Apache Ignite
Streaming data with Apache Kafka
Data pipeline with Kafka and Ignite: example
APACHE IGNITE
“High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.”
High performance, unlimited scalability, and resiliency
High-performance transactions and fast analytics
Hadoop acceleration, Apache Spark
Apache project
https://ignite.apache.org/
MAKING APACHE IGNITE A DATA PROCESSING HUB
Question: How to feed data? A simple solution: Create a client node
Is it reliable?
Does it scale?
Is it an Ignite-only solution?
Does it keep your operational costs low?
LOG AGGREGATION WITH APACHE FLUME
Flume: “Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”
Scalable
Flexible
Robust and fault tolerant
Declarative configuration
Apache project
DATA FLOW IN FLUME
[Diagram: incoming data enters an Agent through its Source, passes through a Channel, and is delivered by a Sink to another Agent or the destination]
DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)
[Diagram: incoming data enters the Source; a Channel Selector replicates or multiplexes events to multiple Channels, each drained by its own Sink]
DATA FLOW IN FLUME (RELIABILITY)
No data is lost (configurable)
[Diagram: the Source writes to the Channel within a source transaction and the Sink takes from it within a sink transaction]
LOG TRANSFER AT AMEBA
[Diagram: Ameba services send logs to a tier of Aggregators, which fan out to Monitoring, a Recommender System, Elasticsearch, Hadoop (batch processing), HBase, and stream processing (Onix, HBaseSink)]
LOG TRANSFER AT AMEBA
Web hosts: more than 1,600
Size: 5.0 TB/day (raw)
Traffic at peak: 160 Mbps (compressed)
IGNITE SINK
Reads Flume events from a channel
Converts them into cacheable entries with a user-implemented, pluggable transformer
Adding it requires no modification to the existing architecture
FLUME ⇒ IGNITE (1)
[Diagram: the Ignite Sink in the Agent opens a new connection to the Ignite cluster]
FLUME ⇒ IGNITE (2)
[Diagram: the Ignite Sink starts a sink transaction on the Channel]
FLUME ⇒ IGNITE (3)
[Diagram: within the sink transaction, the sink takes events from the Channel and sends them to Ignite]
ENABLING FLUME SINK
Steps:
1. Implement EventTransformer to convert Flume events into cacheable entries (java.util.Map<K, V>)
2. Put the transformer’s jar into ${FLUME_HOME}/plugins.d/ignite/lib
3. Put the IgniteSink and Ignite core jar files into ${FLUME_HOME}/plugins.d/ignite/libext
4. Set up a Flume agent

Sink setup:
a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
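The slides don’t show what an EventTransformer implementation looks like. The sketch below illustrates the batch-of-events-to-map idea in a self-contained way; the real class would implement org.apache.ignite.stream.flume.EventTransformer against org.apache.flume.Event, so SimpleEvent and the "userId" header are stand-in assumptions, not the actual APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for org.apache.flume.Event: headers plus a byte[] body.
class SimpleEvent {
    final Map<String, String> headers;
    final byte[] body;

    SimpleEvent(Map<String, String> headers, byte[] body) {
        this.headers = headers;
        this.body = body;
    }
}

// Sketch of a transformer mirroring EventTransformer#transform(List<Event>):
// a batch of Flume events becomes a Map of cacheable entries.
class MyEventTransformer {
    static Map<String, String> transform(List<SimpleEvent> events) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (SimpleEvent evt : events) {
            // Illustrative key choice: take the cache key from a "userId" header.
            String key = evt.headers.get("userId");
            if (key != null)
                entries.put(key, new String(evt.body, StandardCharsets.UTF_8));
        }
        return entries;
    }
}
```

The returned map is what IgniteSink would put into the cache named in the agent configuration, one batch per sink transaction.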
FLUME SINKS
HDFS, Thrift, Avro, HBase, Elasticsearch, IRC, Ignite
APACHE FLUME & APACHE IGNITE
If you do data aggregation with Flume: adding an Ignite cluster is as simple as writing a data transformer and deploying a new Flume agent.
If you store your data (and do computations) in Ignite: data injection becomes easy with the Flume sink.
Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing):
Scalable
Reliable
Highly performant
STREAMING DATA WITH APACHE KAFKA
APACHE KAFKA
“Publish-subscribe messaging rethought as a distributed commit log”
Low latency
High throughput
Partitioned and replicated
Kafka is an essential component of any data pipeline today
http://kafka.apache.org/
APACHE KAFKA
Messages are grouped into topics
Each partition is a log
Each partition is managed by a broker (when replicated, one broker is the partition leader)
Producers and consumers (consumer groups)
Used for: log aggregation, activity tracking, monitoring, stream processing
http://kafka.apache.org/documentation.html
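The “each partition is a log” model can be sketched as an append-only list whose indices are the offsets. This is a toy mental model, not the Kafka API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Kafka partition: an append-only log where a record's
// offset is simply its index in the list.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its offset, as the partition leader would.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads the record at a given offset; each consumer group tracks its
    // own position in the log independently.
    String read(long offset) {
        return records.get((int) offset);
    }
}
```

Because consumers only advance an offset, the same log can be re-read (replayed) by any number of independent consumer groups.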
KAFKA CONNECT
Designed for large-scale stream data integration using Kafka
Abstracts away communication with your Kafka cluster: offset management, delivery semantics, fault tolerance, monitoring, etc.
Worker (scalability and fault tolerance)
Connector (task configuration)
Task (thread)
Standalone and distributed execution models
http://www.confluent.io/blog/apache-kafka-0.9-is-released
INGESTING DATA STREAMS
Two ways:
Kafka Streamer
Sink Connector
[Diagram: data flows through Kafka Connect into Ignite, where SQL queries, distributed closures, and transactions run as part of ETL]
STREAMING VIA SINK CONNECTOR
Configure your connector
Configure the Kafka Connect worker
Start your connector

# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml

$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties
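The worker properties file passed as the first argument above is not shown in the slides. A minimal standalone-worker sketch might look like the following; the broker address, converters, and offsets file path are illustrative assumptions, not values from the talk:

```properties
# myconfig/connect-standalone.properties (illustrative)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
# In standalone mode, source offsets are persisted to a local file.
offset.storage.file.filename=/tmp/connect.offsets
```

In standalone mode one worker process runs all connector tasks; the distributed mode spreads them across workers for scalability and fault tolerance.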
STREAMING VIA SINK CONNECTOR
Easy data pipeline
Records from Kafka are written to the Ignite grid via the high-performance IgniteDataStreamer
At-least-once delivery guarantee
As of 1.6, start a new connector to write to a different cache
[Diagram: records at Kafka offsets 0, 1, 2, ... are turned into key/value pairs and streamed into Ignite cache entries]
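At-least-once delivery means a record may arrive more than once. Writing by key with overwriting allowed (cf. cacheAllowOverwrite=true) makes redelivery harmless, because applying the same record twice leaves the cache unchanged. The toy sketch below uses a plain map as a stand-in for the cache, not the Ignite API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrates why at-least-once delivery is safe for keyed, overwriting
// writes: putting the same (key, value) pair twice is idempotent.
class AtLeastOnceDemo {
    static Map<String, String> deliver(List<String[]> records) {
        Map<String, String> cache = new HashMap<>();
        for (String[] rec : records)
            cache.put(rec[0], rec[1]); // rec[0] = key, rec[1] = value
        return cache;
    }
}
```

A redelivered record simply overwrites the entry it already created, so duplicates never inflate the cache.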
INGESTING DATA STREAMS
Bi-directional streaming
[Diagram: a Sink connector moves data from Kafka into Ignite (SQL queries, distributed closures, transactions), and a Source connector streams cache events and continuous-query results from Ignite back into Kafka]
STREAMING BACK TO KAFKA
Listening to cache events: PUT, READ, REMOVED, EXPIRED, etc.
Remote filtering can be enabled
Kafka Connect offsets are ignored
Currently, no delivery guarantees
[Diagram: cache events evt1, evt2, evt3 are published to Kafka as records]
ENABLING SOURCE CONNECTOR
Configure your connector
Define a remote filter if needed: cacheFilterCls=MyCacheEventFilter
Make sure that event listening is enabled on the server nodes
Configure the Kafka Connect worker
Start your connector

# connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
# topics, events
topicNames=test
cacheEvts=put,removed
# cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
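A remote filter such as the MyCacheEventFilter referenced above runs on the server nodes and decides which cache events are forwarded at all; in Ignite it would implement IgnitePredicate over CacheEvent. The sketch below shows only the filtering logic with self-contained stand-in types; CacheEvt and the event-type strings are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal stand-in for an Ignite cache event: its type and its key.
class CacheEvt {
    final String type;
    final String key;

    CacheEvt(String type, String key) {
        this.type = type;
        this.key = key;
    }
}

// Sketch of a remote filter: only event types we subscribed to
// (matching cacheEvts=put,removed above) pass through.
class MyCacheEventFilter {
    private static final Set<String> WANTED = Set.of("put", "removed");

    // The predicate: true means "forward this event to Kafka".
    static boolean apply(CacheEvt evt) {
        return WANTED.contains(evt.type);
    }

    // Applies the predicate to a batch of events.
    static List<CacheEvt> filter(List<CacheEvt> events) {
        List<CacheEvt> out = new ArrayList<>();
        for (CacheEvt e : events)
            if (apply(e))
                out.add(e);
        return out;
    }
}
```

Filtering remotely keeps uninteresting events off the network entirely, instead of discarding them after they reach the connector.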
APACHE KAFKA & APACHE IGNITE
If you do data streaming with Kafka: adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need one for the source).
If you store your data (and do computations) in Ignite: data injection and listening for data events become easy with the Kafka connectors.
Combining Apache Kafka and Ignite makes/keeps your data pipeline:
Scalable
Reliable
Highly performant
Covering a wide range of ETL contexts
DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE
DATA PIPELINE WITH KAFKA AND IGNITE
Requirements:
Instant processing and analysis
Scalable and resilient to failures
Low latency
High throughput
Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
Filter and aggregate events
[Diagram: data from many sources is filtered/transformed by Flume agents; the pipeline slows down on heavy loads, calling for more channels/layers]
DATA PIPELINE WITH KAFKA AND IGNITE
[Diagram: data and other sources feed Kafka; stream consumers filter, transform, etc., producing derived streams]
• Parsimonious resource use
• Replay enabled
• More operations on streams
• Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
Filter and aggregate events
Store events
Notify about updates on aggregates
[Diagram: derived streams flow from Kafka through Connectors into Ignite]
DATA PIPELINE WITH KAFKA AND IGNITE
Improving ads delivery
[Diagram: clicks, impressions, and ads stream through Kafka into Ignite, which serves as storage/computation for ads delivery and the ads recommender, alongside image storage: data and computation in one place]
DATA PIPELINE WITH KAFKA AND IGNITE
Improving ads delivery
Better network utilization and reliability
[Diagram: the same pipeline, extended with anomaly detection on the streams]
OTHER INTEGRATIONS
OTHER COMPLETED INTEGRATIONS
Camel, MQTT, Storm, Flink (sink), Twitter
THE END