A Real-Time Data Ingestion System Or: How I Learned to Stop Worrying and Love the Bomb Avro Maciej Arciuch

Upload: maciej-arciuch

Post on 21-Apr-2017

Page 1: A real-time data ingestion system or: How I learned to stop worrying and love Avro by Maciej Arciuch at Big Data Spain 2015

A Real-Time Data Ingestion System

Or: How I Learned to Stop Worrying and Love the Bomb Avro

Maciej Arciuch

Allegro.pl

● biggest online auction website in Poland

● sites in other countries

● “Polish eBay” (but better!)

Clickstream at Allegro.pl

● how do our users behave?

● ~400 M raw clickstream events daily

● collected at the front-end

● web and mobile devices

● valuable source of information

Legacy system

● HDFS, Flume and MapReduce

● Main issues:

  ○ batch processing - per hour or day

  ○ data formats

  ○ how to make data more accessible for others?

How to do it better? (1)

stream processing: Spark Streaming and Kafka - data available “almost” instantly

new applications: security, recommendations & search

How to do it better? (2)

Use Avro:

● mature software, good support in Hadoop ecosystem

● space-efficient

● schema: structure + doc placeholder

● the same format for stream and batch processing

● backward/forward compatibility control
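
To make the “schema: structure + doc placeholder” point concrete, here is a minimal sketch of what an Avro record schema for a clickstream event could look like. The record name, namespace and field names are hypothetical (the slides do not show the actual Allegro schema); `type`, `name`, `namespace`, `doc`, `fields` and `default` are standard Avro schema attributes. The schema is built as plain JSON, so no Avro library is needed to run it.

```python
import json

# Hypothetical clickstream schema; all names are illustrative only.
CLICK_EVENT_SCHEMA = {
    "type": "record",
    "name": "ClickEvent",
    "namespace": "example.clickstream",
    "doc": "A single raw clickstream event collected at the front-end.",
    "fields": [
        {"name": "timestamp", "type": "long",
         "doc": "Event time, epoch milliseconds."},
        {"name": "url", "type": "string",
         "doc": "Requested URL."},
        # A field added later carries a default, so a reader using this
        # schema can still read records written before the field existed.
        {"name": "device", "type": "string", "default": "unknown",
         "doc": "Device type, e.g. web or mobile."},
    ],
}


def field_docs(schema):
    """Map each field name to its doc string (the 'doc placeholder')."""
    return {f["name"]: f.get("doc", "") for f in schema["fields"]}


# Avro schemas are exchanged as JSON text (typically an .avsc file).
schema_json = json.dumps(CLICK_EVENT_SCHEMA, indent=2)
```

Because the documentation lives inside the schema, the same file can describe the data to both machines and people.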

How to do it better? (3)

Create a central schema repository:

● single source of truth

● all the elements of the system refer to the latest version

● validate backward/forward compatibility on commit

● immutable schemas

● propagate info to Hive metastore, files, HTMLs
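
The “validate backward/forward compatibility on commit” step could be sketched roughly as below. This is not the repository's actual check, which the slides do not show, and it is deliberately stricter and simpler than Avro's real schema resolution rules: it only looks at top-level record fields, requires identical types (Avro also allows some type promotions), and treats a field missing on the writer side as readable only if the reader declares a default. The function names and the two sample schemas are hypothetical.

```python
def fields_by_name(schema):
    """Index a record schema's fields by name."""
    return {f["name"]: f for f in schema["fields"]}


def can_read(reader, writer):
    """Simplified check: can `reader` decode data written with `writer`?"""
    written = fields_by_name(writer)
    for name, field in fields_by_name(reader).items():
        if name not in written:
            # The reader expects a field the writer never wrote:
            # only acceptable if the reader schema supplies a default.
            if "default" not in field:
                return False
        elif field["type"] != written[name]["type"]:
            # Assumption: types must match exactly (no promotions).
            return False
    return True


def check_commit(old, new):
    """Backward: new schema reads old data. Forward: old schema reads new data."""
    return {"backward": can_read(new, old), "forward": can_read(old, new)}


# Hypothetical schema versions, for illustration only.
OLD = {"type": "record", "name": "ClickEvent", "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "url", "type": "string"},
]}
NEW_OK = {"type": "record", "name": "ClickEvent", "fields": OLD["fields"] + [
    {"name": "device", "type": "string", "default": "unknown"},
]}
NEW_BAD = {"type": "record", "name": "ClickEvent", "fields": OLD["fields"] + [
    {"name": "device", "type": "string"},  # no default
]}
```

With these inputs, `check_commit(OLD, NEW_OK)` reports both directions as compatible, while `check_commit(OLD, NEW_BAD)` fails the backward check, so a commit hook built on it would reject the second change.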

How to do it better? (4)

How to do it better? (5)

New system:

● two separate Kafka instances (buffer and destination)

● if your infrastructure is down – you still collect data

● collectors – only save HTTP requests, no logic

● logic in Spark Streaming

● dead letter queue – you can reprocess failed messages
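
The buffer, destination and dead letter queue flow can be illustrated with a small sketch. It uses plain Python lists in place of Kafka topics and an ordinary function in place of the Spark Streaming job, so it shows only the routing idea, not the real system. `parse_request` and its validation rule are hypothetical stand-ins for the processing logic.

```python
import json


def parse_request(raw):
    """Hypothetical processing step: decode a collected request body."""
    event = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(event, dict) or "url" not in event:
        raise ValueError("missing url")
    return event


def process_batch(buffer, destination, dead_letters):
    """Route each raw message to the destination or the dead letter queue."""
    for raw in buffer:
        try:
            destination.append(parse_request(raw))
        except ValueError as err:
            # Keep the untouched raw message so it can be reprocessed
            # later, e.g. after the processing logic has been fixed.
            dead_letters.append({"raw": raw, "error": str(err)})


buffer = ['{"url": "/offer/1"}', "not json", '{"referrer": "/"}']
destination, dead_letters = [], []
process_batch(buffer, destination, dead_letters)
```

Reprocessing then amounts to feeding `[d["raw"] for d in dead_letters]` back through `process_batch` once the cause of the failure is resolved. Keeping the collectors free of logic is what makes this possible: the raw input is always preserved.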

How to do it better? (6)

New system:

● data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool)

● Hive tables and partitions created automatically (look for camus2hive on GitHub)
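
As an illustration of how hourly batches can map to Hive partitions, the sketch below derives a partition from an event timestamp and builds the matching `ALTER TABLE ... ADD PARTITION` statement. It is not how Camus or camus2hive work internally. The directory layout, the partition column names `dt` and `hour`, the base path and the table name are all assumptions made for the example.

```python
from datetime import datetime, timezone


def hourly_partition(topic, epoch_ms, base="/data"):
    """Derive an hourly partition (assumed layout) from an epoch-ms timestamp."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return {
        "dt": ts.strftime("%Y-%m-%d"),
        "hour": ts.strftime("%H"),
        # Hypothetical layout: <base>/<topic>/hourly/YYYY/MM/DD/HH
        "location": f"{base}/{topic}/hourly/{ts:%Y/%m/%d/%H}",
    }


def add_partition_ddl(table, part):
    """Build a HiveQL statement that registers the partition if it is missing."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{part['dt']}', hour='{part['hour']}') "
        f"LOCATION '{part['location']}'"
    )


# Example: an event from 2015-10-15 13:05 UTC on a hypothetical topic.
example_ms = int(datetime(2015, 10, 15, 13, 5, tzinfo=timezone.utc).timestamp() * 1000)
part = hourly_partition("clickstream", example_ms)
ddl = add_partition_ddl("clickstream_events", part)
```

Running such a statement after every hourly load is one way to keep Hive tables in step with the files on HDFS without manual work.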

Why Spark Streaming?

● pros:

  ○ momentum

  ○ good integration with YARN - better resource utilization, easy scaling

  ○ good integration with Kafka

  ○ reuse batch Spark code

● cons:

  ○ micro-batching

  ○ as complex as Spark
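
The “micro-batching” drawback can be seen in a toy model: events are not handled one by one but grouped into fixed-length intervals, so an event may wait up to one interval before it is processed. The sketch below is plain Python, not the Spark Streaming API. The function name is hypothetical, and the timestamps stand in for arrival times.

```python
from collections import defaultdict


def micro_batches(events, interval_ms):
    """Group (arrival_ms, payload) pairs into fixed-length interval batches."""
    batches = defaultdict(list)
    for arrival_ms, payload in events:
        # Every event in the same interval lands in the same batch.
        batch_start = arrival_ms // interval_ms * interval_ms
        batches[batch_start].append(payload)
    return [batches[start] for start in sorted(batches)]


events = [(0, "a"), (400, "b"), (1000, "c"), (2500, "d")]
batches = micro_batches(events, interval_ms=1000)
```

Here `"a"` and `"b"` end up in one batch even though they arrived 400 ms apart. That is why the data is available “almost” instantly rather than strictly event by event.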

Key take-aways

● Kafka, Avro, Spark - solid building blocks

● Use a central schema repository

Q/A?

Thank you!

http://github.com/allegro

http://allegro.tech