streaming in scala with avro

29
Streaming in Scala*, with Avro

Upload: jonathan-winandy

Post on 18-Jan-2017

274 views

Category:

Technology


0 download

TRANSCRIPT

Streaming in Scala*, with Avro

• My name is Jonathan Winandy

• @ahoy_jon

> Me

> Introduction

Streams

What are Streams ? It’s an abstract data structure with the following :

operations : • append(bytes) -> void? • readAt(int) -> null | bytes

rule 1 : ∀p ∈ ℕ, for some definition of ‘==‘ x := readAt(p) y := readAt(p)

x != null => x == y

Rule 1 implies : Infinite cacheability once the data is available at a position.

> Introduction

Streams are the simplest way to manage data.

0 1 2 3 4 5 6

> Introduction

HARD DRIVES : Magnetic and mutable.

And we know that since the XVth century …

So what happened ?

Memorial Journal Ledger

> Introduction

Exemple of architecturesCLASSICAL

APP

REPLICATION(BIN/LOG)

APP

APP

DB

DB

APPAPP

APP

append

broadcast

WITH STREAMS

The broadcast mechanism is equivalent to a db replication mechanism.

State of data encodings in the industry

• As always worse is considered better.

• Most of streams have data encoded in :

• CSV/TSV

• JSON

• Platform specific serialisations (eg: Java serialisation, Kryo)

> Data encoding

Why Data Encoding is Important ?

• Some streams may contain very large amounts of Data, the chosen encoding must be CPU and space efficient.

• Streams are processed by many programs, and many intermediaries, for many years, the chosen encoding must be processable in a generic way.

> Data encoding

JSON is the Common Denominator

Pros :

• It reaches the browser, you can produce and consume data from inside a web page.

vs a lot of Cons :

• Inefficient,

• No dates, no proper numerics,

• Very basic data structures,

• Very prone to errors.

We all need JSON,but we should use it only when we can't avoid it.

> Data encoding

Eg : In our databases, we can avoid JSONs ;)

:02:06:62:6f:62:02:16:02:da:01

{“name”:"Bob", “age":11, "gender":"Male"}

> Data encoding

How bad is JSON ?39 Bytes for 10 Bytes of data

relevantones

popular binaries low tech cognitect “papa ?!”

Avro Thrift Proto Buf JSON CSV Fressian Transit EDN XML RDF

binary YES NO YES OK NO ??

generic YES ?? NO YES YES YES

schema based

YES NO YES NO ?? meta

specific encoding

YES NO “STRINGS” YES OK Literal

s

reach the browser

YES NO +++++ OK NO YES OK

easy ? NO I PASS “true”

YEPHUM

?…

safe ? YES HUM? NO NOMISMATCH

<!YES

has dates? Soon NO NO YES YES

Building your own Data Empire with Scala and Avro

TOPIC

TOPIC!?

Collector

Collector

Collector

> Sandboxing

> Pull to Push

> Sandboxing

TOPIC

TOPIC!?

Collector

Collector

Collector

TOPIC

TOPIC!?

Collector

Collector

Collector

TOPIC

TOPIC!?

Collector

Collector

Collector

Metadata ?

> Durability

> Example Collector Event

using avro-scala-macro-annotations

> Example

AA | 00 | 00 | 04FFFF

BaseEvent

> Example

AA | 02D5CB1A5355EE405BADDC1133C628F7A6 | 00 | 06FFFFFF

BaseEvent

> Example Write Context

> Example

write-context

collector-staging

Collector

1

22222222

> Example

write-context

collector-staging

Consumer

> To improve

• Writer Context / Schema manager

• Avro support :- type providers from repository - bug fixes

• Producers, Encoder and Decoder for Kafka.

Conclusion / Questions ?

Bonus :What is a CAS ?A Content Addressable Storage is a specific “key value store” :

operations : • store(bytes) -> key • get(key) -> null | bytes

rule 1 : key = h(data) h being a cryptographic hash function like md5 or sha1.

rule 2 : ∀data get(store(data)) = data

Rule 1 and 2 imply : Infinite cacheability and scalability.