Confluent Kafka Meetup, Seattle, January 2017
TRANSCRIPT
11Confidential
State of the Streaming Platform 2017: What’s new in Apache Kafka and the Confluent Platform
David Tucker, Confluent
The shift to streams
“By 2020, 70% of organizations will adopt data streaming to enable real-time analytics.” [1]
“Streaming ingestion and analytics will become a must-have for digital winners.” [2]
1: Gartner: Harness Streaming Data for Real-Time Analytics - Nov 2016
2: Forrester’s 2016 Predictions: Turn Data Into Insight And Action - Nov 2015
Vision of a Streaming Enterprise
[Diagram: a central Streaming Platform connected to Search, NewSQL / NoSQL, RDBMS, Monitoring, Document Store, Real-time Analytics, Data Warehouse, Mobile Apps, Legacy Apps, and Hadoop]
What Can You Do with a Streaming Platform?
• Publish and subscribe to streams of data
  • Analogous to traditional messaging systems
• Store streams of data
  • Consumers can look back in time
• Process streams of data
  • Analyze and correlate events in real time
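The three capabilities above can be illustrated with a toy, in-memory log. This is a plain-Python sketch, not Kafka client code: `ToyLog`, the `analytics` group name, and the event strings are all hypothetical, and a real topic is partitioned, replicated, and persisted.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a single-partition topic (illustration only)."""

    def __init__(self):
        self.records = []                  # the stored stream: append-only
        self.offsets = defaultdict(int)    # next offset to read, per consumer group

    def publish(self, value):
        """Append a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def poll(self, group):
        """Return every record this group has not read yet."""
        start = self.offsets[group]
        batch = self.records[start:]
        self.offsets[group] = len(self.records)
        return batch

    def seek(self, group, offset):
        """Move the group's position, for example back to 0 to replay."""
        self.offsets[group] = offset


log = ToyLog()
for event in ["login:alice", "click:alice", "login:bob"]:
    log.publish(event)

first_read = log.poll("analytics")   # subscribe: receives all three events
log.seek("analytics", 0)             # look back in time
replayed = log.poll("analytics")     # the same three events again

# Process: derive something from the events as they are read.
logins = [e.split(":")[1] for e in replayed if e.startswith("login:")]
```

Because records are retained after they are read, a new or repaired consumer can rewind and rebuild its view, which is the main difference from a traditional queue that deletes messages on delivery.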
The typical integration architecture
[Diagram: point-to-point connections between data sources (User Tracking, Operational Logs, Operational Metrics, MySQL, Cassandra, Oracle) and destinations (Search, Security, Fraud Detection, Application, Hadoop, Data Warehouse). An App and a Monitoring App each have their own Databases, Storage, and Interfaces.]
Challenges abound
[Diagram: point-to-point connections between data sources (User Tracking, Operational Logs, Operational Metrics, Espresso, Cassandra, Oracle) and destinations (Search, Security, Fraud Detection, Application, Hadoop, Data Warehouse). An App and a Monitoring App each have their own Databases, Storage, and Interfaces.]
Difficult to handle massive amounts of data
Diverse data sets, arriving at an increasing rate
Many complex data pipelines
Require a separate cluster for real-time
Difficult & time consuming to change
Require mission-critical availability of the most recent and relevant data
Modernized architecture using Apache Kafka
[Diagram: Apache Kafka at the center. Labels around it: User Tracking, Operational Logs, Operational Metrics, Espresso, Cassandra, Oracle, Search, Security, Fraud Detection, Application, Hadoop, Data Warehouse, and an App and a Monitoring App that each use the Streams API.]
Challenges addressed by a streaming platform
[Diagram: Apache Kafka at the center. Labels around it: User Tracking, Operational Logs, Operational Metrics, Espresso, Cassandra, Oracle, Search, Security, Fraud Detection, Application, Hadoop, Data Warehouse, and an App and a Monitoring App that each use the Streams API.]
Rewind data stream to re-load into any target system
Scale to meet demands of diverse streams
Pub/sub to data streams
Lightweight, easy to modify with minimal disruption
Decoupled from upstream apps, creating agility
Real-time, context-specific data in the moment
From Big Data to Stream Data
Big Data was: The More the Better
Stream Data is: The Faster the Better
Stream Data can be: Big or Fast (Lambda)
Stream Data will be: Big AND Fast (Kappa)
Apache Kafka is the Enabling Technology of this Transition
[Charts: Value of Data vs. Volume of Data; Value of Data vs. Age of Data]
[Diagram labels: Streams, Job 1, Job 2, Table 1, Table 2, DB; Streams, Hadoop, Speed Table, Batch Table, DB]
Ingest, Process, Load, and Serve Data at a Global Scale
[Diagram labels: Data System A …, Data System B …, FIX, Raw data / Events, Kafka cluster (two), Applications, Other data stores]
Kafka Streams (Data Enrichment and Transformation)
Kafka Connect (Connectors to Extract and Load data)
Confluent Replicator or Custom Replication (between Kafka clusters)
Confluent: Enterprise Streaming Platform based on Apache Kafka™
[Diagram: Confluent Platform. Data sources: Database Changes, Log Events, IoT Data, Web Events, … Connected systems: CRM, Data Warehouse, Database, Hadoop, … Uses: Data Integration, Monitoring, Analytics, Custom Apps, Transformations, Real-time Applications, … Components: Apache Kafka™, Clients, Connectors, Data Compatibility, Monitoring & Administration, Operations. Legend: Apache Open Source, Confluent Open Source, Confluent Commercial (Confluent Enterprise).]
Complete
Open
Trusted
Enterprise Grade
How do I get streams of data into and out of my apps?
Connect Clients REST
Apache Kafka™ Connect – Streaming Data Capture
[Diagram: Kafka Connect API. Source connectors (JDBC, IRC / Twitter, MySQL) feed the Kafka Pipeline; sink connectors (Elastic, NoSQL, HDFS) read from it.]
Fault tolerant
Manage hundreds of data sources and sinks
Preserves data schema
Part of Apache Kafka project
Integrated within Confluent Platform’s Control Center
Apache Kafka™ Connect – Let the framework do the hard work
• Serialization / de-serialization
• Schema Registry integration
• Fault tolerance, automatic fail-over
• Partitioning and scale-out
• … and let the developer focus on the domain-specific details of copying data
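As a rough sketch of what is left for the user once the framework handles serialization, fail-over, and scale-out: a connector is typically declared as a small configuration document and submitted to a Connect worker's REST interface. The connector name, database URL, and `localhost:8083` address below are assumptions for illustration, and the property names are given as they appear in the Confluent JDBC source connector documentation; check them against the version in use.

```python
import json
import urllib.request

# Hypothetical connector definition: copy new rows from a MySQL database
# into topics prefixed with "mysql-".
connector = {
    "name": "example-jdbc-source",  # hypothetical name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://db.example.com:3306/shop",  # assumed host
        "mode": "incrementing",                 # detect new rows by a growing id
        "incrementing.column.name": "id",
        "topic.prefix": "mysql-",
        "tasks.max": "2",                       # upper bound on parallel tasks
    },
}

# Build (but do not send) the request that would register the connector.
request = urllib.request.Request(
    "http://localhost:8083/connectors",         # assumed Connect worker address
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would submit it; omitted so the sketch
# runs without a Connect cluster.
```

Nothing in the definition mentions offsets, retries, or serialization formats at the record level; those are the parts the slide assigns to the framework.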
Kafka Connect Architecture: Logical Model
Connect has three main components: Connectors, Tasks, and Workers
Data flowing into / out of the connectors is a stream; each stream is 1 or more partitions. In practice, a stream partition could be a database table, a log file, etc. There may or may not be an exact alignment of streams to Kafka topics.
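One way to picture the logical model is a connector dividing its stream partitions (here, database tables) among a bounded number of tasks. The round-robin policy and the table names below are assumptions chosen for illustration; real connectors decide their own grouping.

```python
def assign_partitions(partitions, max_tasks):
    """Spread stream partitions over at most `max_tasks` tasks, round-robin."""
    task_count = min(max_tasks, len(partitions))
    tasks = [[] for _ in range(task_count)]
    for index, partition in enumerate(partitions):
        tasks[index % task_count].append(partition)
    return tasks


# Hypothetical stream: five tables, each treated as one stream partition.
tables = ["orders", "customers", "payments", "shipments", "returns"]
task_assignments = assign_partitions(tables, max_tasks=2)
# Each inner list is the work of one task; workers then run the tasks.
```

Capping the task count at the number of partitions avoids creating tasks that would have nothing to copy.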
Kafka Connect Architecture: Execution Model
[Diagram: Host 1 and Host 2 run Worker 1, Worker 2, and Worker 3; Task 1 through Task 4 are distributed across the workers.]
Kafka Connect API Library of Connectors
[Logo grid of connectors in four categories: Databases, Datastore/File Store, Analytics, Applications / Other]
* Denotes connectors developed and distributed by Confluent. Extensive validation and testing has been performed.
Kafka Clients
[Logo grid of clients in three groups: Apache Kafka Native Clients, Confluent Native Clients, and Community Supported Clients. Legible labels include Ruby, Proxy (http/REST), and stdin/stdout.]
REST Proxy: Talking to Legacy Apps and Across Restricted Networks
[Diagram: Legacy Applications talk to the REST Proxy over REST / HTTP; Native Kafka Applications connect directly; the Schema Registry sits alongside.]
Simplifies administrative actions
Simplifies message creation and consumption
Provides a RESTful interface to a Kafka cluster
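A minimal sketch of message creation through the RESTful interface, assuming a REST Proxy listening at `http://localhost:8082` and a topic named `legacy-events` (both hypothetical). The request shape follows the proxy's documented produce endpoint with the v2 embedded-JSON content type; older releases use a `v1` content type instead.

```python
import json
import urllib.request


def build_produce_request(base_url, topic, values):
    """Build (but do not send) an HTTP request that produces JSON records."""
    body = {"records": [{"value": value} for value in values]}
    return urllib.request.Request(
        f"{base_url}/topics/{topic}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


request = build_produce_request(
    "http://localhost:8082",                    # assumed proxy address
    "legacy-events",                            # hypothetical topic
    [{"order_id": 17, "status": "shipped"}],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch
# runs without a proxy.
```

Any application that can issue an HTTP POST can publish this way, which is why the proxy suits legacy systems and networks where the native Kafka protocol is blocked.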
How do I maintain my data formats and ensure compatibility?
The Challenge of Data Compatibility at Scale
[Diagram: App 1, App 2, and App 3 publish into a shared pipeline; one message is labeled "Incompatibly formatted message".]
Many sources without a policy cause mayhem in a centralized data pipeline
Ensuring downstream systems can use the data is key to an operational stream pipeline
Example: Date formats
Even within a single application, different formats can be presented
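The date-format example can be made concrete with a small normalizer. The list of accepted formats is hypothetical; the point of the sketch is that every consumer would need such a list, and any producer emitting a format outside it breaks the pipeline.

```python
from datetime import datetime, timezone

# Hypothetical set of formats seen from different producers.
KNOWN_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%m/%d/%Y %H:%M", "%d %b %Y"]


def normalize_timestamp(raw):
    """Return epoch milliseconds (UTC assumed), or raise ValueError."""
    if isinstance(raw, (int, float)):
        return int(raw)                     # already epoch milliseconds
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue                        # try the next known format
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    raise ValueError(f"incompatibly formatted date: {raw!r}")


same_instant = [
    normalize_timestamp("2017-01-25T00:00:00"),
    normalize_timestamp("01/25/2017 00:00"),
    normalize_timestamp("25 Jan 2017"),
]
```

Agreeing on one representation at the producer side removes the need for this kind of guessing downstream.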
Schema Registry
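[Diagram: App 1 and App 2 each publish through a Serializer that consults the Schema Registry before writing to the Kafka Topic; incompatible messages are flagged with "!". Example consumers: Elastic, NoSQL, HDFS.]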
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)
Prevent backwards incompatible changes
Supports multi-datacenter environments
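A simplified sketch of what "prevent backwards incompatible changes" means: a new schema is accepted only if it can still read data written with the old one. The dictionary layout and the rule set here are assumptions loosely modeled on Avro-style records; the real registry applies the full schema resolution rules of the serialization format.

```python
def is_backward_compatible(old_schema, new_schema):
    """True if a reader using new_schema can read records written with old_schema.

    Simplified rules: an added field must carry a default, and a field
    kept from the old schema must not change type.
    """
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            if "default" not in field:
                return False        # old records have no value for this field
        elif previous["type"] != field["type"]:
            return False            # type changes are rejected in this sketch
    return True


# Hypothetical schemas for a page-edit topic.
v1 = {"fields": [{"name": "wikipage", "type": "string"},
                 {"name": "bytechange", "type": "int"}]}
v2 = {"fields": v1["fields"] + [{"name": "isbot", "type": "boolean", "default": False}]}
v3 = {"fields": v1["fields"] + [{"name": "editor_id", "type": "long"}]}

accepted = is_backward_compatible(v1, v2)   # new field has a default
rejected = is_backward_compatible(v1, v3)   # new required field
```

Running a check like this at registration time lets the bad change fail at the producer rather than in every downstream consumer.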
How do I build stream processing apps?
Architecture of Kafka Streams API, a Part of Apache Kafka
[Diagram: an application built on the Kafka Streams API uses a Producer and Consumers to write to and read from Topics in the Kafka Cluster.]
Key benefits
• No additional cluster
• Easy to run as a service
• Supports large aggregations and joins
• Security and permissions fully integrated from Kafka
Example Use Cases
• Microservices
• Continuous queries
• Continuous transformations
• Event-triggered processes
Kafka Streams API: the Easiest Way to Process Data in Apache Kafka™
Example Use Cases
• Microservices
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• Customer 360-degree view, fraud detection, location-based marketing, smart electrical grids, fleet management, …
Key Benefits of Apache Kafka’s Streams API
• Build Apps, Not Clusters: no additional cluster required
• Elastic, highly-performant, distributed, fault-tolerant, secure
• Equally viable for small, medium, and large-scale use cases
• “Run Everywhere”: integrates with your existing deployment strategies such as containers, automation, cloud
[Diagram: Your App with the Kafka Streams API embedded in it]
Architecture Example
Before: Complexity for development and operations, heavy footprint
1. Capture business events in Kafka
2. Must process events with separate, special-purpose clusters (Your Processing Job)
3. Write results back to Kafka
Architecture Example
With Kafka Streams: App-centric architecture that blends well into your existing infrastructure
1. Capture business events in Kafka
2. Process events fast, reliably, securely with standard Java applications (Your App, using the Kafka Streams API)
3a. Write results back to Kafka
3b. External apps can directly query the latest results
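Steps 2, 3a, and 3b can be pictured with a small stateful processor. This is a plain-Python illustration of the idea only, not the Kafka Streams Java API: `PageEditCounter`, its `output` list (standing in for a results topic), and the sample events are hypothetical.

```python
from collections import Counter


class PageEditCounter:
    """Counts edits per page, emits each update, and answers point queries."""

    def __init__(self):
        self.state = Counter()      # local state kept inside the application
        self.output = []            # stands in for the results topic

    def process(self, event):
        """Step 2: update state for one event. Step 3a: emit the new count."""
        page = event["wikipage"]
        self.state[page] += 1
        self.output.append((page, self.state[page]))

    def query(self, page):
        """Step 3b: let another application read the latest result directly."""
        return self.state[page]


app = PageEditCounter()
for event in [{"wikipage": "Seattle"}, {"wikipage": "Kafka"}, {"wikipage": "Seattle"}]:
    app.process(event)

latest = app.query("Seattle")
```

The processing logic and its state live in an ordinary application process, which is the contrast the slide draws with running a separate special-purpose cluster.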
How do I manage and monitor my streaming platform at scale?
Confluent Control Center: End-to-end Monitoring
See exactly where your messages are going in your Kafka cluster
Confluent Control Center: Connector Management
Confluent Control Center: Alerting
Alerts
• Configure alerts on incomplete data delivery, high latency, Kafka connector status, and more
• Manage alerts for different users and applications from a web UI
User authentication
• Control access to Confluent Control Center
• Integrates with existing enterprise authentication systems
Data Pipeline Demo: Real-time data firehose archived to searchable stores
Demo Scenario: Multiple Streaming Data Pipelines
• IRC feed of Wikipedia updates
  • IRC Source connector publishes real-time stream of Wikipedia updates to Kafka topic
  • Kafka Streams application parses records and re-writes to new topic
  • Elasticsearch Sink connector indexes parsed data
  • Kibana dashboards visualize Wikipedia updates in real time
• Twitter feed augmented with sentiment data
  • Twitter Source connector configured to publish data to Kafka topic
  • Kafka Streams application strips extraneous Twitter fields and adds sentiment score
  • Sink connector saves K-Streams output to key-value store (e.g. Couchbase or DynamoDB)
  • Key-value queries can track sentiment trends
Wikipedia-to-Elastic Data Pipeline
Wikipedia Transformation
• Raw input records
{"createdat":1485386068652,"channel":"#en.wikipedia","sender":{"nick":"rc-pmtpa","login":"~rc-pmtpa","hostname":"special.user"},"message":"[[List of Iranian Americans]] https://en.wikipedia.org/w/index.php?diff=761978901&oldid=760575313 * 01:445:4080:1510:F1A4:7C08:B276:FA8B * (+0) /* Media/Journalism */"}
{"createdat":1485386069199,"channel":"#en.wikipedia","sender":{"nick":"rc-pmtpa","login":"~rc-pmtpa","hostname":"special.user"},"message":"[[In the Bleak Midwinter]] https://en.wikipedia.org/w/index.php?diff=761978902&oldid=761960970 * Grover cleveland * (+422) /* Settings */"}
• Parsed records
{"createdat":1485386068652,"wikipage":"List of Iranian Americans","isnew":false,"isminor":false,"isunpatrolled":false,"isbot":false,"diffurl":"https://en.wikipedia.org/w/index.php?diff=761978901&oldid=760575313","username":"01:445:4080:1510:F1A4:7C08:B276:FA8B","bytechange":0,"commitmessage":"/* Media/Journalism */"}
{"createdat":1485386069199,"wikipage":"In the Bleak Midwinter","isnew":false,"isminor":false,"isunpatrolled":false,"isbot":false,"diffurl":"https://en.wikipedia.org/w/index.php?diff=761978902&oldid=761960970","username":"Grover cleveland","bytechange":422,"commitmessage":"/* Settings */"}
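The transformation from raw to parsed records could be sketched as a regular-expression parse of the `message` field. The demo itself used a Kafka Streams application; this is a plain-Python illustration of the same mapping. The flag letters (`N` new, `M` minor, `B` bot, `!` unpatrolled) are an assumption about the IRC feed format, since the sample messages carry no flags.

```python
import re

# Layout assumed from the samples: [[title]] flags url * user * (+bytes) comment
EDIT_PATTERN = re.compile(
    r"\[\[(?P<title>.*?)\]\]\s+"
    r"(?P<flags>[!NMB]*)\s*"
    r"(?P<url>https?://\S+)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+"
    r"\((?P<delta>[+-]?\d+)\)\s*"
    r"(?P<comment>.*)"
)


def parse_edit(raw_record):
    """Turn one raw IRC record into the flat parsed form, or None if unparseable."""
    match = EDIT_PATTERN.match(raw_record["message"])
    if match is None:
        return None
    flags = match["flags"]
    return {
        "createdat": raw_record["createdat"],
        "wikipage": match["title"],
        "isnew": "N" in flags,
        "isminor": "M" in flags,
        "isunpatrolled": "!" in flags,
        "isbot": "B" in flags,
        "diffurl": match["url"],
        "username": match["user"],
        "bytechange": int(match["delta"]),
        "commitmessage": match["comment"],
    }


raw = {
    "createdat": 1485386069199,
    "channel": "#en.wikipedia",
    "message": "[[In the Bleak Midwinter]] https://en.wikipedia.org/w/index.php"
               "?diff=761978902&oldid=761960970 * Grover cleveland * (+422) /* Settings */",
}
parsed = parse_edit(raw)
```

Returning `None` for messages that do not match lets the caller drop or divert them instead of failing the whole stream.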
Twitter Transformation
• Raw input records (128 separate fields)
"CreatedAt": 1479252348000,
"Id": 798668350956126200,
"Text": "Iago Aspas pays tribute to #Spain players for making his international debut “easy” vs #England… https://t.co/G13NUaZj8W",
"Source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"User": { }
• Filtered records
{"sentiment":"Negative","sentimentScore":1,"UserName":"tits","CreatedAt":1485387765000,"Text":"RT @STsportsdesk: Football: Real Madrid eliminated from #CopaDelRey by Celta Vigo https://t.co/QfCLayqRsHhttps://t.co/53GWANPDXj","id":"824402156707049475","UserScreenName":"titusanghongwen"}
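A rough sketch of the filter-and-enrich step, in plain Python rather than the Kafka Streams application used in the demo. Several things here are assumptions: the nested `User` field names (`Name`, `ScreenName`) are guessed because the sample elides that object, the word lists are a toy stand-in for whatever sentiment model the demo used, and `sentimentScore` is left out because its scale is not stated.

```python
# Toy word lists; a real pipeline would call a sentiment model instead.
NEGATIVE_WORDS = {"eliminated", "lost", "defeat"}
POSITIVE_WORDS = {"tribute", "win", "easy"}


def score_sentiment(text):
    """Label text by counting hits in the toy word lists."""
    words = {word.strip(".,:!?#@\"“”").lower() for word in text.split()}
    balance = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if balance > 0:
        return "Positive"
    if balance < 0:
        return "Negative"
    return "Neutral"


def slim_tweet(raw):
    """Keep a handful of fields from the full tweet and add a sentiment label."""
    user = raw.get("User", {})              # nested field names are assumed
    return {
        "sentiment": score_sentiment(raw["Text"]),
        "UserName": user.get("Name"),
        "CreatedAt": raw["CreatedAt"],
        "Text": raw["Text"],
        "id": str(raw["Id"]),
        "UserScreenName": user.get("ScreenName"),
    }


tweet = {
    "CreatedAt": 1485387765000,
    "Id": 824402156707049475,
    "Text": "Football: Real Madrid eliminated from #CopaDelRey by Celta Vigo",
    "Source": "Twitter Web Client",
    "User": {"Name": "example", "ScreenName": "example_user"},  # hypothetical user
}
filtered = slim_tweet(tweet)
```

Dropping unused fields before the sink keeps the key-value store small and makes the stored record shape explicit.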
Kafka Connect Demonstration
[Diagram: Kafka Connect, Apache Kafka Brokers, and K-Streams app(s), with data-flow steps numbered 1 through 5]
Thank You