imc summit 2016 breakout - roman shtykh - apache ignite as a data processing hub
TRANSCRIPT
APACHE IGNITE AS A DATA PROCESSING HUB
Roman Shtykh, CyberAgent, Inc.
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
INTRODUCTION
ABOUT ME
Roman Shtykh, R&D Engineer at CyberAgent, Inc.
Areas of focus: data streaming and NLP
Committer on the Apache Ignite and MyBatis projects
Judoka
@rshtykh
CYBERAGENT, INC.
Business segments*: Internet ads 52%, Games 25%, Media 13%, Investing 3%, Other 7%
* As of Sep 2015
AMEBA SERVICES
• Monthly visitors (DUB total): 6 billion*
• Number of member users: about 39 million*
* As of Dec 2014
• Games
• Community services
• Content curation
• Other
AMEBA SERVICES
Ameba Pigg
CONTENTS
Apache Ignite: feed your data
Log aggregation with Apache Flume: integration with Apache Ignite
Streaming data with Apache Kafka
Data pipeline with Kafka and Ignite: example
APACHE IGNITE
“High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.”
High performance, unlimited scalability, and resiliency
High-performance transactions and fast analytics
Hadoop acceleration, Apache Spark
Apache project
https://ignite.apache.org/
MAKING APACHE IGNITE A DATA PROCESSING HUB
Question: How to feed data? A simple solution: Create a client node
Is it reliable?
Does it scale?
Is it an Ignite-only solution?
Does it keep your operational costs low?
LOG AGGREGATION WITH APACHE FLUME
Flume: “Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”
Scalable
Flexible
Robust and fault tolerant
Declarative configuration
Apache project
DATA FLOW IN FLUME
[Diagram: incoming data enters an Agent through its Source, passes through a Channel, and is delivered by a Sink to another Agent or the destination]
DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)
[Diagram: incoming data enters the Source; a Channel Selector replicates or multiplexes events to multiple Channels, each drained by its own Sink]
DATA FLOW IN FLUME (RELIABILITY)
No data is lost (configurable)
[Diagram: the Source writes to the Channel within a source transaction and the Sink takes from it within a sink transaction]
LOG TRANSFER AT AMEBA
[Diagram: Ameba services send logs to a tier of Aggregators, which fan out to Monitoring, a Recommender System, Elasticsearch, Hadoop (batch processing), HBase, and stream processing (Onix, HBaseSink)]
LOG TRANSFER AT AMEBA
Web hosts: more than 1,600
Size: 5.0 TB/day (raw)
Traffic at peak: 160 Mbps (compressed)
IGNITE SINK
Reads Flume events from a channel
Converts them into cacheable entries with a user-implemented, pluggable transformer
Adding it requires no modification to the existing architecture
FLUME ⇒ IGNITE (1)
[Diagram: the Ignite Sink in the Agent opens a new connection to the Ignite cluster]
FLUME ⇒ IGNITE (2)
[Diagram: the Ignite Sink starts a sink transaction on the Channel]
FLUME ⇒ IGNITE (3)
[Diagram: within the sink transaction, the sink takes events from the Channel and sends them to Ignite]
ENABLING FLUME SINK
Steps:
1. Implement EventTransformer to convert Flume events into cacheable entries (java.util.Map<K, V>)
2. Put the transformer’s jar into ${FLUME_HOME}/plugins.d/ignite/lib
3. Put the IgniteSink and Ignite core jar files into ${FLUME_HOME}/plugins.d/ignite/libext
4. Set up a Flume agent

Sink setup:
a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
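The slides don’t show what an EventTransformer implementation looks like. The sketch below illustrates the batch-of-events-to-map idea in a self-contained way; the real class would implement org.apache.ignite.stream.flume.EventTransformer against org.apache.flume.Event, so SimpleEvent and the "userId" header are stand-in assumptions, not the actual APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for org.apache.flume.Event: headers plus a byte[] body.
class SimpleEvent {
    final Map<String, String> headers;
    final byte[] body;

    SimpleEvent(Map<String, String> headers, byte[] body) {
        this.headers = headers;
        this.body = body;
    }
}

// Sketch of a transformer mirroring EventTransformer#transform(List<Event>):
// a batch of Flume events becomes a Map of cacheable entries.
class MyEventTransformer {
    static Map<String, String> transform(List<SimpleEvent> events) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (SimpleEvent evt : events) {
            // Illustrative key choice: take the cache key from a "userId" header.
            String key = evt.headers.get("userId");
            if (key != null)
                entries.put(key, new String(evt.body, StandardCharsets.UTF_8));
        }
        return entries;
    }
}
```

The returned map is what IgniteSink would put into the cache named in the agent configuration, one batch per sink transaction.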
FLUME SINKS
HDFS, Thrift, Avro, HBase, Elasticsearch, IRC, Ignite
APACHE FLUME & APACHE IGNITE
If you do data aggregation with Flume: adding an Ignite cluster is as simple as writing a data transformer and deploying a new Flume agent.
If you store your data (and do computations) in Ignite: data injection becomes easy with the Flume sink.
Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing):
Scalable
Reliable
Highly performant
STREAMING DATA WITH APACHE KAFKA
APACHE KAFKA
“Publish-subscribe messaging rethought as a distributed commit log”
Low latency
High throughput
Partitioned and replicated
Kafka is an essential component of any data pipeline today
http://kafka.apache.org/
APACHE KAFKA
Messages are grouped into topics
Each partition is a log
Each partition is managed by a broker (when replicated, one broker is the partition leader)
Producers and consumers (consumer groups)
Used for: log aggregation, activity tracking, monitoring, stream processing
http://kafka.apache.org/documentation.html
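The “each partition is a log” model can be sketched as an append-only list whose indices are the offsets. This is a toy mental model, not the Kafka API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Kafka partition: an append-only log where a record's
// offset is simply its index in the list.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its offset, as the partition leader would.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads the record at a given offset; each consumer group tracks its
    // own position in the log independently.
    String read(long offset) {
        return records.get((int) offset);
    }
}
```

Because consumers only advance an offset, the same log can be re-read (replayed) by any number of independent consumer groups.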
KAFKA CONNECT
Designed for large-scale stream data integration using Kafka
Abstracts away communication with your Kafka cluster: offset management, delivery semantics, fault tolerance, monitoring, etc.
Worker (scalability and fault tolerance)
Connector (task configuration)
Task (thread)
Standalone and distributed execution models
http://www.confluent.io/blog/apache-kafka-0.9-is-released
INGESTING DATA STREAMS
Two ways:
Kafka Streamer
Sink Connector
[Diagram: data flows through Kafka Connect into Ignite, where SQL queries, distributed closures, and transactions run as part of ETL]
STREAMING VIA SINK CONNECTOR
Configure your connector
Configure the Kafka Connect worker
Start your connector

# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml

$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties
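The worker properties file passed as the first argument above is not shown in the slides. A minimal standalone-worker sketch might look like the following; the broker address, converters, and offsets file path are illustrative assumptions, not values from the talk:

```properties
# myconfig/connect-standalone.properties (illustrative)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
# In standalone mode, source offsets are persisted to a local file.
offset.storage.file.filename=/tmp/connect.offsets
```

In standalone mode one worker process runs all connector tasks; the distributed mode spreads them across workers for scalability and fault tolerance.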
STREAMING VIA SINK CONNECTOR
Easy data pipeline
Records from Kafka are written to the Ignite grid via the high-performance IgniteDataStreamer
At-least-once delivery guarantee
As of 1.6, start a new connector to write to a different cache
[Diagram: records at Kafka offsets 0, 1, 2, ... are turned into key/value pairs and streamed into Ignite cache entries]
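At-least-once delivery means a record may arrive more than once. Writing by key with overwriting allowed (cf. cacheAllowOverwrite=true) makes redelivery harmless, because applying the same record twice leaves the cache unchanged. The toy sketch below uses a plain map as a stand-in for the cache, not the Ignite API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrates why at-least-once delivery is safe for keyed, overwriting
// writes: putting the same (key, value) pair twice is idempotent.
class AtLeastOnceDemo {
    static Map<String, String> deliver(List<String[]> records) {
        Map<String, String> cache = new HashMap<>();
        for (String[] rec : records)
            cache.put(rec[0], rec[1]); // rec[0] = key, rec[1] = value
        return cache;
    }
}
```

A redelivered record simply overwrites the entry it already created, so duplicates never inflate the cache.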
INGESTING DATA STREAMS
Bi-directional streaming
[Diagram: a Sink connector moves data from Kafka into Ignite (SQL queries, distributed closures, transactions), and a Source connector streams cache events and continuous-query results from Ignite back into Kafka]
STREAMING BACK TO KAFKA
Listening to cache events: PUT, READ, REMOVED, EXPIRED, etc.
Remote filtering can be enabled
Kafka Connect offsets are ignored
Currently, no delivery guarantees
[Diagram: cache events evt1, evt2, evt3 are published to Kafka as records]
ENABLING SOURCE CONNECTOR
Configure your connector
Define a remote filter if needed: cacheFilterCls=MyCacheEventFilter
Make sure that event listening is enabled on the server nodes
Configure the Kafka Connect worker
Start your connector

# connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
# topics, events
topicNames=test
cacheEvts=put,removed
# cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
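A remote filter such as the MyCacheEventFilter referenced above runs on the server nodes and decides which cache events are forwarded at all; in Ignite it would implement IgnitePredicate over CacheEvent. The sketch below shows only the filtering logic with self-contained stand-in types; CacheEvt and the event-type strings are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal stand-in for an Ignite cache event: its type and its key.
class CacheEvt {
    final String type;
    final String key;

    CacheEvt(String type, String key) {
        this.type = type;
        this.key = key;
    }
}

// Sketch of a remote filter: only event types we subscribed to
// (matching cacheEvts=put,removed above) pass through.
class MyCacheEventFilter {
    private static final Set<String> WANTED = Set.of("put", "removed");

    // The predicate: true means "forward this event to Kafka".
    static boolean apply(CacheEvt evt) {
        return WANTED.contains(evt.type);
    }

    // Applies the predicate to a batch of events.
    static List<CacheEvt> filter(List<CacheEvt> events) {
        List<CacheEvt> out = new ArrayList<>();
        for (CacheEvt e : events)
            if (apply(e))
                out.add(e);
        return out;
    }
}
```

Filtering remotely keeps uninteresting events off the network entirely, instead of discarding them after they reach the connector.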
APACHE KAFKA & APACHE IGNITE
If you do data streaming with Kafka: adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need one for the source).
If you store your data (and do computations) in Ignite: data injection and listening for data events become easy with the Kafka connectors.
Combining Apache Kafka and Ignite makes/keeps your data pipeline:
Scalable
Reliable
Highly performant
Covering a wide range of ETL contexts
DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE
DATA PIPELINE WITH KAFKA AND IGNITE
Requirements:
Instant processing and analysis
Scalable and resilient to failures
Low latency
High throughput
Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
Filter and aggregate events
[Diagram: data from many sources is filtered/transformed by Flume agents; the pipeline slows down on heavy loads, calling for more channels/layers]
DATA PIPELINE WITH KAFKA AND IGNITE
[Diagram: data and other sources feed Kafka; stream consumers filter, transform, etc., producing derived streams]
• Parsimonious resource use
• Replay enabled
• More operations on streams
• Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
Filter and aggregate events
Store events
Notify about updates on aggregates
[Diagram: derived streams flow from Kafka through Connectors into Ignite]
DATA PIPELINE WITH KAFKA AND IGNITE
Improving ads delivery
[Diagram: clicks, impressions, and ads stream through Kafka into Ignite, which serves as storage/computation for ads delivery and the ads recommender, alongside image storage: data and computation in one place]
DATA PIPELINE WITH KAFKA AND IGNITE
Improving ads delivery
Better network utilization and reliability
[Diagram: the same pipeline, extended with anomaly detection on the streams]
OTHER INTEGRATIONS
OTHER COMPLETED INTEGRATIONS
Camel, MQTT, Storm, Flink (sink), Twitter
THE END