Event Driven Architecture


Upload: benjamin-joyen-conseil

Post on 16-Apr-2017


Category: Data & Analytics



TRANSCRIPT

Page 1: Event Driven Architecture


© OCTO 2015

Event Driven Architecture

Page 2: Event Driven Architecture


The first problem was how to transport data between systems.

The second part of this problem was the need to do richer analytical data processing with very low latency.

Page 3: Event Driven Architecture


The pipeline for log data was scalable but lossy and could only deliver data with high latency.

The pipeline between Oracle instances was fast, exact, and real-time, but not available to any other systems.

Page 4: Event Driven Architecture


The pipeline of Oracle data for Hadoop was periodic CSV dumps—high throughput, but batch.

The pipeline of data to our search system was low latency, but unscalable and tied directly to the database.

The messaging systems were low latency but unreliable and unscalable.

Page 5: Event Driven Architecture


Page 6: Event Driven Architecture


As we added data centers geographically distributed around the world, we had to build out geographical replication for each of these data flows.

The data was always unreliable. Our reports were untrustworthy, derived indexes and stores were questionable, and everyone spent a lot of time battling data quality issues of all kinds.

At the same time, we weren't just shipping data from place to place; we also wanted to do things with it.

Hadoop had given us a platform for batch processing, data archival, and ad hoc processing, and this had been enormously successful, but we lacked an analogous platform for low-latency processing.

Page 7: Event Driven Architecture


Stream Data Platform

Page 8: Event Driven Architecture


Stream Data Platform

Page 9: Event Driven Architecture


Your database stores the current state of your data. But the current state is always caused by some actions that took place in the past. The actions are the events.

Much of what people refer to when they talk about "big data" is really the act of capturing these events that previously weren't recorded anywhere and putting them to use for analysis, optimization, and decision making.

Event streams are an obvious fit for log data or things like "orders", "sales", "clicks" or "trades" that are obviously event-like.
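The distinction between state and events can be made concrete with a small sketch in Python (the field names here are illustrative assumptions, not part of any standard):

```python
import time

def make_event(event_type, payload):
    """Build a minimal event record: what happened, when, and the details."""
    return {
        "type": event_type,        # e.g. "order", "sale", "click", "trade"
        "timestamp": time.time(),  # when the action took place
        "payload": payload,        # event-specific details
    }

# A database row would store the *current* order state; the event
# records the *action* that produced it:
order_placed = make_event("order", {"order_id": 42, "amount": 19.99})
```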

The Rise of Events and Event Streams

Page 10: Event Driven Architecture


Data in databases can also be thought of as an event stream. Consider the process of creating a backup or standby copy of a database: you can dump out the full contents, or you can take a "diff" of what has changed.

Change capture: if we take our diffs more and more frequently, what we will be left with is a continuous sequence of single-row changes.

By publishing the database changes into the stream data platform, you add them to the other set of event streams. You can use these streams to synchronize other systems such as a Hadoop cluster, a replica database, or a search index, or you can feed these changes into applications or stream processors to directly compute new things off the changes.
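As a toy illustration of the idea, diffing two snapshots of a table yields exactly this sequence of row-change events (a sketch only; real change-capture systems read the database's own log rather than comparing snapshots):

```python
def diff_rows(before, after):
    """Diff two table snapshots (dicts keyed by primary key) and emit
    one change event per inserted, updated, or deleted row."""
    events = []
    for key, row in after.items():
        if key not in before:
            events.append({"op": "insert", "key": key, "row": row})
        elif before[key] != row:
            events.append({"op": "update", "key": key, "row": row})
    for key in before:
        if key not in after:
            events.append({"op": "delete", "key": key})
    return events

# Taken frequently enough, these diffs become a continuous stream
# of single-row changes:
old = {1: {"name": "alice"}, 2: {"name": "bob"}}
new = {2: {"name": "bobby"}, 3: {"name": "carol"}}
changes = diff_rows(old, new)
```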

Databases Are Event Streams

Page 11: Event Driven Architecture


A stream data platform has two primary uses:

Data integration: the stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse.

Stream processing: it enables continuous, real-time processing and transformation of these streams and makes the results available system-wide.

The stream data platform is a central hub for data streams.

It also acts as a buffer between these systems: the publisher of data doesn't need to be concerned with the various systems that will eventually consume and load the data. This means consumers of data can come and go and are fully decoupled from the source.
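This decoupling can be sketched with a toy in-memory hub in Python (a real platform such as Kafka also persists the stream and scales it out, which this sketch omits):

```python
from collections import defaultdict

class StreamHub:
    """Toy central hub: producers publish to named streams without
    knowing which consumers, if any, are subscribed."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, stream, handler):
        self._subscribers[stream].append(handler)

    def publish(self, stream, event):
        for handler in self._subscribers[stream]:
            handler(event)

hub = StreamHub()
search_index, hadoop_load = [], []
# Consumers can come and go; the producer's code never changes:
hub.subscribe("orders", search_index.append)
hub.subscribe("orders", hadoop_load.append)
hub.publish("orders", {"order_id": 1})
```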

What Is a Stream Data Platform For?

Page 12: Event Driven Architecture


Hadoop wants to be able to maintain a full copy of all the data in your organization and act as a "data lake" or "enterprise data hub". Directly integrating each data source with HDFS is a hugely time-consuming proposition, and the end result only makes that data available to Hadoop.

This type of data capture isn't suitable for real-time processing or syncing other real-time applications.

This same pipeline can run in reverse: Hadoop and the data warehouse environment can publish out results that need to flow into appropriate systems for serving in customer-facing applications.

What Is a Stream Data Platform For? Zoom Hadoop

Page 13: Event Driven Architecture


The stream processing use case plays off the data integration use case.

The results of the stream processing are just a new, derived stream.

Stream processing is both a way to develop applications that need low-latency transformations and a direct part of the data integration usage: integrating systems often requires some munging of data streams in between.
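A minimal sketch of this in Python: a stream processor consumes one stream, and its output is itself just a new, derived stream (the transformation shown is an arbitrary example of in-between munging):

```python
def derive(stream, transform):
    """Apply a transformation to each event; the output is a new stream.
    A transform may return None to filter an event out."""
    for event in stream:
        out = transform(event)
        if out is not None:
            yield out

clicks = [{"user": "a", "page": "/x"}, {"user": None, "page": "/y"}]
# Munge the stream on its way to another system: drop anonymous
# events and keep only the field the destination needs.
pages = list(derive(clicks, lambda e: e["page"] if e["user"] else None))
```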

What Is a Stream Data Platform For? Zoom ETL

Page 14: Event Driven Architecture


A stream data platform is similar to an enterprise messaging system—it receives messages and distributes them to interested subscribers. There are three important differences:

Messaging systems are typically run in one-off deployments for different applications. The purpose of the stream data platform is very much that of a central data hub.

Messaging systems do a poor job of supporting integration with batch systems, such as a data warehouse or a Hadoop cluster, as they have limited data storage capacity.

Messaging systems do not provide semantics that are easily compatible with rich stream processing.

How Does a Stream Data Platform Relate To Existing Things?

Page 15: Event Driven Architecture


In other words, a stream data platform is a messaging system whose role has been rethought at a company-wide scale.

How Does a Stream Data Platform Relate To Existing Things?

Page 16: Event Driven Architecture


A stream data platform is a true platform that any other system can choose to tap into and many applications can build around.

By making data available in a uniform format in a single place with a common stream abstraction, many of the routine data clean-up tasks can be avoided entirely.

Data Integration Tools

Page 17: Event Driven Architecture


The advantage of a stream data platform is that transformation is fundamentally decoupled from the stream itself.

This code can live in applications or stream processing tasks, allowing teams to iterate at their own pace without a central bottleneck for application development.

Enterprise Service Buses

Page 18: Event Driven Architecture


Databases have long had similar log mechanisms such as Golden Gate. However these mechanisms are limited to database changes only and are not a general purpose event capture platform.

Change Capture Systems

Page 19: Event Driven Architecture


A stream data platform doesn't replace your data warehouse; in fact, quite the opposite: it feeds it data.

Data Warehouses and Hadoop

Page 20: Event Driven Architecture


They attempt to add richer processing semantics to subscribers and can make implementing data transformation easier.

Stream Processing Systems

Page 21: Event Driven Architecture


Everything from user activity to database changes to administrative actions like restarting a process is captured in real-time streams that are subscribed to and processed in real time.

What Does This Look Like In Practice?

Page 22: Event Driven Architecture


Part of the promise of this approach to data management is having a central repository with the full set of data streams your organization generates. This works best when the data is all in the same place.

A single cluster also simplifies the system architecture: fewer integration points for data consumers, fewer things to operate, lower incremental cost for adding new applications, and it is easier to reason about data flow.

But there are several reasons you may end up with multiple clusters: to keep activity local to a datacenter, for security reasons, or for SLA control.

Recommendations: Limit The Number of Clusters

Page 23: Event Driven Architecture


Apache Kafka does not enforce any particular format.

If each individual or application chooses a representation of their own preference—say some use JSON, others XML, and others CSV—the result is that any system or process which uses multiple data streams has to munge and understand each of these. Local optimization—choosing your favorite format for data you produce—leads to huge global sub-optimization since now each system needs to write N adaptors, one for each format it wants to ingest.

Imagine how useless the Unix toolchain would be if each tool invented its own format: you would have to translate between formats every time you wanted to pipe one command to another.
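The adaptor cost can be illustrated with a small sketch (the formats and records here are made up): with mixed formats, every consumer needs one parser per producer format, whereas a single agreed format needs exactly one.

```python
import csv
import io
import json

# With local optimization, a consumer of three streams needs three adaptors:
def from_json(raw):
    return json.loads(raw)

def from_csv(raw):
    return next(csv.DictReader(io.StringIO(raw)))

def from_kv(raw):  # a made-up "key=value" format
    return dict(pair.split("=") for pair in raw.split(","))

# With one shared format, every stream goes through the same parser:
parse = from_json

record = parse('{"user": "a", "page": "/x"}')
```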

Recommendations: Pick A Single Data Format

Page 24: Event Driven Architecture


Connecting all systems directly would look something like this

Whereas having this central stream data platform looks something like this

Recommendations: Pick A Single Data Format

Page 25: Event Driven Architecture


We think Avro is the best choice for a number of reasons:

1. It has a direct mapping to and from JSON.
2. It has a very compact format. The bulk of JSON, repeating every field name with every single record, is what makes JSON inefficient for high-volume usage.
3. It is very fast.
4. It has great bindings for a wide variety of programming languages, so you can generate Java objects that make working with event data easier, but it does not require code generation, so tools can be written generically for any data stream.
5. It has a rich, extensible schema language defined in pure JSON.
6. It has the best notion of compatibility for evolving your data over time.
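For illustration, here is what a small Avro schema looks like, written in Avro's JSON schema language (the record and field names are hypothetical):

```python
import json

# A hypothetical Avro schema for a page-view event. Avro schemas are
# themselves plain JSON, so they can be loaded and inspected with any
# JSON tooling.
page_view_schema = json.loads("""
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "page", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
""")
# The ["null", "string"] union with a default is what lets a field be
# added later without breaking readers of older records.
```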

Recommendations: Use Avro as Your Data Format

Page 26: Event Driven Architecture


Isn't the modern world of big data all about unstructured data, dumped in whatever form is convenient, and parsed later when it is queried?

One of the primary advantages of this type of architecture where data is modeled as streams is that applications are decoupled. Applications produce a stream of events capturing what occurred without knowledge of which things subscribe to these streams.

The Need For Schemas

Page 27: Event Driven Architecture


Whenever you see a common activity across multiple systems, try to use a common schema for this activity. An example of this that is common to all businesses is application errors.
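As a sketch, a shared application-error schema might look like this (the field names are illustrative assumptions, not a standard):

```python
def error_event(service, message, severity="ERROR"):
    """One common shape for application errors, whichever system emits them."""
    return {
        "event_type": "application_error",
        "service": service,    # which system produced the error
        "severity": severity,
        "message": message,
    }

# Because checkout and search share one schema, a single consumer
# (alerting, a dashboard) can process errors from both:
e1 = error_event("checkout", "payment timeout")
e2 = error_event("search", "index unavailable", severity="WARN")
```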

Share Event Schemas

Page 28: Event Driven Architecture


MODELING SPECIFIC DATA TYPES IN KAFKA