Post on 29-Jan-2018


Efficient Schemas in Motion with Kafka and Schema Registry

Pat Patterson

Community Champion

@metadaddy

pat@streamsets.com

Enterprise Data DNA

Commercial Customers Across Verticals

250,000+ downloads

50+ of the Fortune 100

Doubling each quarter

Strong Partner Ecosystem

Open Source Success

Mission: empower enterprises to harness their data in motion.

Who is StreamSets?

Avro

Schema Registry

Demo

Agenda

Joined ASF as a Hadoop subproject in 2009

Record-oriented serialization format

Binary (most common) and JSON (human readable) encodings

Apache Avro

Avro Prehistory

Schema defined in JSON
• Relatively readable

Schema evolution
• Can add new fields, rename fields in schema
• Existing data can still be read under the new schema

Untagged binary data
• Space-efficient!

Avro Advantages

{

"type": "record",

"namespace": "com.example",

"name": "Person",

"fields": [

{ "name": "first_name", "type": "string" },

{ "name": "last_name", "type": "string" }

]

}

Avro Schema Definition

• null: 0 bytes

• boolean: 1 byte

• int/long: variable-length, zig-zag encoded

• float/double: 4/8 bytes

• bytes: length as long, then data

• string: length as long, then UTF-8-encoded data

Avro Binary Encoding - Simple Types
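The int/long rule above can be made concrete. As a rough Python sketch (not the official Avro library, just an illustration of the rule), zig-zag mapping followed by base-128 varint encoding looks like this:

```python
def encode_long(n: int) -> bytes:
    """Avro int/long: zig-zag map to unsigned, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)           # zig-zag: small magnitudes -> small values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

# Small values, positive or negative, fit in a single byte
assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
assert encode_long(64) == b"\x80\x01"  # first value needing two bytes
```

The zig-zag step is what keeps negative numbers short: -1 maps to 1 rather than to a sign-extended 64-bit pattern.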

• Record: concatenate the field encodings

• Enum: zero-based index of symbol, as int

• Array: blocks of items, each preceded by a long count; zero count terminates array

• Map: blocks of key-value pairs, each preceded by a long count; zero count terminates the map

• Union: position of item in schema as a long, then the item

• Fixed: the number of bytes defined in the schema

Avro Binary Encoding - Complex Types
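To see "concatenate the field encodings" in action, here is a hedged Python sketch encoding the Person record from the schema above by hand (encode_person is an illustrative helper, not a real API):

```python
def encode_long(n: int) -> bytes:
    """Avro long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data    # length as long, then UTF-8 bytes

def encode_person(first_name: str, last_name: str) -> bytes:
    # Record: just the concatenation of the field encodings - no tags,
    # no field names on the wire. The schema supplies the structure.
    return encode_string(first_name) + encode_string(last_name)

msg = encode_person("Pat", "Patterson")
assert msg == b"\x06Pat\x12Patterson"       # 14 bytes total
```

This is why the data is space-efficient and also why it is unreadable without the schema: nothing in those 14 bytes says "first_name".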

{

"type": "record",

"namespace": "com.example",

"name": "Person",

"fields": [

{ "name": "first_name", "type": "string" },

{ "name": "last_name", "type": "string" },

{ "name": "age", "type": "int", "default": -1 }

]

}

Avro Schema Evolution

Compatibility Rules:
• New fields must have a default
• Deleted field must have had a default
• Doc/Order can be added/removed/changed
• Field default can be added/changed
• Field/type aliases can be added/removed
• Non-union can be converted to union with just that type, or vice versa

The general rule is that old data can still be read under the new schema

Avro Schema Evolution
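To illustrate why new fields need a default, here is a rough Python sketch (hand-rolled decoder, not the Avro library) of a reader using the evolved schema on bytes written under the old, two-field schema. The missing age field is filled from the new schema's default:

```python
def decode_long(buf: bytes, pos: int):
    """Inverse of Avro's zig-zag varint; returns (value, new_position)."""
    shift = z = 0
    while True:
        b = buf[pos]; pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos   # undo zig-zag

def decode_string(buf: bytes, pos: int):
    n, pos = decode_long(buf, pos)
    return buf[pos:pos + n].decode("utf-8"), pos + n

def read_person_v2(buf: bytes) -> dict:
    # Writer's schema had only first_name and last_name; the reader's
    # schema adds age, so the reader supplies the declared default (-1).
    first, pos = decode_string(buf, 0)
    last, pos = decode_string(buf, pos)
    return {"first_name": first, "last_name": last, "age": -1}

old_bytes = b"\x06Pat\x12Patterson"   # written under the old schema
assert read_person_v2(old_bytes) == {
    "first_name": "Pat", "last_name": "Patterson", "age": -1}
```

Without the default, the reader would have no way to produce a value for age, and the two schemas would be incompatible.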

Avro Schema Serialization

Various options, depending on file/message orientation, but generally:
• Metadata, including the schema
• Data

Great for files - schema is sent just once, but what about messages?
• Send just once? Periodically?
• Send per message?
• Agree out of band?

Schema Overhead

Demo

Online schema repository
• Simple REST API

Each schema has an ID
• Unique within the repository

Schemas versioned within subjects
• Supports schema evolution
• Subject loosely corresponds to topic
• Subject + version -> ID

Schema Registry
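As a sketch of the REST API, registering a schema is a POST to the registry's /subjects/{subject}/versions endpoint, with the Avro schema passed as a JSON string inside a JSON body. The snippet below only constructs the request (host and subject name are hypothetical) rather than sending it:

```python
import json
from urllib.request import Request

schema = {
    "type": "record", "namespace": "com.example", "name": "Person",
    "fields": [{"name": "first_name", "type": "string"},
               {"name": "last_name", "type": "string"}],
}

# The registry expects the schema as a JSON *string* inside a JSON body
body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")

req = Request(
    "http://localhost:8081/subjects/person-value/versions",  # hypothetical host/subject
    data=body,
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    method="POST",
)
# A successful registration responds with the schema's ID, e.g. {"id": 1}
```

Note the double json.dumps: the body is JSON, and its "schema" field is itself a JSON-encoded string, which is an easy thing to get wrong on a first attempt.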

Register schema, registry returns an ID

Sender passes schema ID in each message

Recipient looks up ID in registry

Solves the Avro-by-Message Problem

Schema By Reference
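In Confluent's wire format, "passes schema ID in each message" means a 5-byte prefix: a magic byte (0) followed by the schema ID as a 4-byte big-endian integer, then the Avro payload. A minimal Python sketch of the framing:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format: magic byte, 4-byte big-endian ID, Avro payload

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == MAGIC_BYTE
    # The consumer looks schema_id up in the registry, then decodes the rest
    return schema_id, message[5:]

msg = frame(42, b"\x06Pat\x12Patterson")
assert len(msg) == 5 + 14           # 5 bytes of framing instead of a full schema
assert unframe(msg) == (42, b"\x06Pat\x12Patterson")
```

Five bytes of overhead per message, versus hundreds of bytes to resend the schema every time, is the whole economic argument for the registry.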

Demo

Just register a new (compatible) schema under the same subject

Schema is assigned a new ID

Evolution with Schema Registry

Schema Evolution

Demo

Landoop schema-registry-ui
https://github.com/Landoop/schema-registry-ui

Bonus Feature: Web UI

Schema Evolution Part Deux

Demo

Conclusion

Avro: a row-oriented, self-describing format for data serialization

Default Avro is inefficient in a message-passing setting

Referencing schema by ID dramatically reduces the volume of network traffic

Thank You! Pat Patterson

Community Champion

@metadaddy

pat@streamsets.com
