Post on 29-Jan-2018


Efficient Schemas in Motion with Kafka and Schema Registry

Pat Patterson

Community Champion

@metadaddy

pat@streamsets.com

Enterprise Data DNA

Commercial Customers Across Verticals

250,000+ downloads

50+ of the Fortune 100

Doubling each quarter

Strong Partner Ecosystem

Open Source Success

Mission: empower enterprises to harness their data in motion.

Who is StreamSets?

Avro

Schema Registry

Demo

Agenda

Joined ASF as a Hadoop subproject in 2009

Record-oriented serialization format

Binary (most common) and JSON (human readable) encodings

Apache Avro

Avro Prehistory

Schema defined in JSON
• Relatively readable

Schema evolution
• Can add new fields, rename fields in schema
• Existing data can still be read under the new schema

Untagged binary data
• Space-efficient!

Avro Advantages

{

"type": "record",

"namespace": "com.example",

"name": "Person",

"fields": [

{ "name": "first_name", "type": "string" },

{ "name": "last_name", "type": "string" }

]

}

Avro Schema Definition

• null: 0 bytes

• boolean: 1 byte

• int/long: variable-length, zig-zag encoded

• float/double: 4/8 bytes

• bytes: length as long, then data

• string: length as long, then UTF-8-encoded data

Avro Binary Encoding - Simple Types
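The int/long rule above can be made concrete. As a rough Python sketch (not the official Avro library, just an illustration of the rule), zig-zag mapping followed by base-128 varint encoding looks like this:

```python
def encode_long(n: int) -> bytes:
    """Avro int/long: zig-zag map to unsigned, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)           # zig-zag: small magnitudes -> small values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

# Small values, positive or negative, fit in a single byte
assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
assert encode_long(64) == b"\x80\x01"  # first value needing two bytes
```

The zig-zag step is what keeps negative numbers short: -1 maps to 1 rather than to a sign-extended 64-bit pattern.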

• Record: concatenate the field encodings

• Enum: zero-based index of symbol, as int

• Array: blocks of items, each preceded by a long count; zero count terminates array

• Map: blocks of key-value pairs, each preceded by a long count; zero count terminates the map

• Union: position of item in schema as a long, then the item

• Fixed: the number of bytes defined in the schema

Avro Binary Encoding - Complex Types
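To see "concatenate the field encodings" in action, here is a hedged Python sketch encoding the Person record from the schema above by hand (encode_person is an illustrative helper, not a real API):

```python
def encode_long(n: int) -> bytes:
    """Avro long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data    # length as long, then UTF-8 bytes

def encode_person(first_name: str, last_name: str) -> bytes:
    # Record: just the concatenation of the field encodings - no tags,
    # no field names on the wire. The schema supplies the structure.
    return encode_string(first_name) + encode_string(last_name)

msg = encode_person("Pat", "Patterson")
assert msg == b"\x06Pat\x12Patterson"       # 14 bytes total
```

This is why the data is space-efficient and also why it is unreadable without the schema: nothing in those 14 bytes says "first_name".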

{

"type": "record",

"namespace": "com.example",

"name": "Person",

"fields": [

{ "name": "first_name", "type": "string" },

{ "name": "last_name", "type": "string" },

{ "name": "age", "type": "int", "default": -1 }

]

}

Avro Schema Evolution

Compatibility Rules:
• New fields must have a default
• Deleted field must have had a default
• Doc/Order can be added/removed/changed
• Field default can be added/changed
• Field/type aliases can be added/removed
• Non-union can be converted to union with just that type, or vice versa

The general rule is that old data can still be read under the new schema

Avro Schema Evolution
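To illustrate why new fields need a default, here is a rough Python sketch (hand-rolled decoder, not the Avro library) of a reader using the evolved schema on bytes written under the old, two-field schema. The missing age field is filled from the new schema's default:

```python
def decode_long(buf: bytes, pos: int):
    """Inverse of Avro's zig-zag varint; returns (value, new_position)."""
    shift = z = 0
    while True:
        b = buf[pos]; pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos   # undo zig-zag

def decode_string(buf: bytes, pos: int):
    n, pos = decode_long(buf, pos)
    return buf[pos:pos + n].decode("utf-8"), pos + n

def read_person_v2(buf: bytes) -> dict:
    # Writer's schema had only first_name and last_name; the reader's
    # schema adds age, so the reader supplies the declared default (-1).
    first, pos = decode_string(buf, 0)
    last, pos = decode_string(buf, pos)
    return {"first_name": first, "last_name": last, "age": -1}

old_bytes = b"\x06Pat\x12Patterson"   # written under the old schema
assert read_person_v2(old_bytes) == {
    "first_name": "Pat", "last_name": "Patterson", "age": -1}
```

Without the default, the reader would have no way to produce a value for age, and the two schemas would be incompatible.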

Avro Schema Serialization

Various options, depending on file/message orientation, but generally:
• Metadata, including the schema
• Data

Great for files - schema is sent just once, but what about messages?
• Send just once? Periodically?
• Send per message?
• Agree out of band?

Schema Overhead

Demo

Online schema repository
• Simple REST API

Each schema has an ID
• Unique within the repository

Schemas versioned within subjects
• Supports schema evolution
• Subject loosely corresponds to topic
• Subject + version -> ID

Schema Registry
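As a sketch of the REST API, registering a schema is a POST to the registry's /subjects/{subject}/versions endpoint, with the Avro schema passed as a JSON string inside a JSON body. The snippet below only constructs the request (host and subject name are hypothetical) rather than sending it:

```python
import json
from urllib.request import Request

schema = {
    "type": "record", "namespace": "com.example", "name": "Person",
    "fields": [{"name": "first_name", "type": "string"},
               {"name": "last_name", "type": "string"}],
}

# The registry expects the schema as a JSON *string* inside a JSON body
body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")

req = Request(
    "http://localhost:8081/subjects/person-value/versions",  # hypothetical host/subject
    data=body,
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    method="POST",
)
# A successful registration responds with the schema's ID, e.g. {"id": 1}
```

Note the double json.dumps: the body is JSON, and its "schema" field is itself a JSON-encoded string, which is an easy thing to get wrong on a first attempt.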

Register schema, registry returns an ID

Sender passes schema ID in each message

Recipient looks up ID in registry

Solves the Avro-by-Message Problem

Schema By Reference
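In Confluent's wire format, "passes schema ID in each message" means a 5-byte prefix: a magic byte (0) followed by the schema ID as a 4-byte big-endian integer, then the Avro payload. A minimal Python sketch of the framing:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format: magic byte, 4-byte big-endian ID, Avro payload

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == MAGIC_BYTE
    # The consumer looks schema_id up in the registry, then decodes the rest
    return schema_id, message[5:]

msg = frame(42, b"\x06Pat\x12Patterson")
assert len(msg) == 5 + 14           # 5 bytes of framing instead of a full schema
assert unframe(msg) == (42, b"\x06Pat\x12Patterson")
```

Five bytes of overhead per message, versus hundreds of bytes to resend the schema every time, is the whole economic argument for the registry.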

Demo

Just register a new (compatible) schema under the same subject

Schema is assigned a new ID

Evolution with Schema Registry

Schema Evolution

Demo

Landoop schema-registry-ui
https://github.com/Landoop/schema-registry-ui

Bonus Feature: Web UI

Schema Evolution Part Deux

Demo

Conclusion

Avro: a row-oriented, self-describing format for data serialization

Default Avro is inefficient in a message-passing setting

Referencing schema by ID dramatically reduces the volume of network traffic

Thank You! Pat Patterson

Community Champion

@metadaddy

pat@streamsets.com
