TRANSCRIPT
A Real-Time Data Ingestion System
Or: How I Learned to Stop Worrying and Love Avro
Maciej Arciuch
Allegro.pl
● biggest online auction website in Poland
● sites in other countries
● “Polish eBay” (but better!)
Clickstream at Allegro.pl
● how do our users behave?
● ~400 M raw clickstream events daily
● collected at the front-end
● web and mobile devices
● valuable source of information
Legacy system
● HDFS, Flume and MapReduce
● Main issues:
○ batch processing only - data available hourly or daily at best
○ no single, well-defined data format
○ how to make data more accessible to others?
How to do it better? (1)
● stream processing: Spark Streaming and Kafka - data available “almost” instantly (see the sketch below)
● new applications:
○ security
○ recommendations & search
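A minimal sketch of such a job, using the Kafka direct stream API from the Spark 1.x / Kafka 0.8 era; the broker addresses, topic name and batch interval below are assumptions for illustration, not the production setup:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickstreamJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-ingest")
    val ssc  = new StreamingContext(conf, Seconds(10))      // micro-batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("clickstream-raw")                // hypothetical topic name

    // Direct stream: one RDD partition per Kafka partition,
    // data available "almost" instantly.
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    events.map(_._2).count().print()                        // e.g. events per micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}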
How to do it better? (2)
Use Avro
● mature software, good support in Hadoop ecosystem
● space-efficient
● schema: structure + doc placeholder
● the same format for stream and batch processing
● backward/forward compatibility control
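To make “structure + doc placeholder” concrete, here is a sketch of a small Avro record schema and its binary encoding; the ClickEvent schema and its fields are invented for illustration, not the actual Allegro schema:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroExample {
  // Structure and documentation live together in the schema.
  val schemaJson =
    """{
      |  "type": "record",
      |  "name": "ClickEvent",
      |  "namespace": "pl.allegro.clickstream",
      |  "fields": [
      |    {"name": "userId",    "type": "string", "doc": "anonymised user id"},
      |    {"name": "url",       "type": "string", "doc": "visited URL"},
      |    {"name": "timestamp", "type": "long",   "doc": "event time, epoch millis"}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(schemaJson)

    val event: GenericRecord = new GenericData.Record(schema)
    event.put("userId", "u-123")
    event.put("url", "https://allegro.pl/item/42")
    event.put("timestamp", System.currentTimeMillis())

    // Binary encoding is what makes Avro space-efficient:
    // field names never travel on the wire, only values.
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(event, encoder)
    encoder.flush()
    println(s"encoded size: ${out.size()} bytes")
  }
}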
How to do it better? (3)
Create a central schema repository:
● single source of truth
● all elements of the system refer to the latest version
● validate backward/forward compatibility on commit
● immutable schemas
● propagate schema info to the Hive metastore, files, and HTML docs
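The on-commit compatibility check can be expressed with Avro's own validator API; a minimal sketch, with the repository and commit-hook wiring left out (canReadStrategy here checks that a reader using the new schema can still read data written with the old ones, i.e. backward compatibility):

import org.apache.avro.{Schema, SchemaValidationException, SchemaValidatorBuilder}
import scala.collection.JavaConverters._

object CompatibilityCheck {
  def isBackwardCompatible(newSchema: Schema, previousVersions: Seq[Schema]): Boolean = {
    val validator = new SchemaValidatorBuilder().canReadStrategy().validateAll()
    try {
      validator.validate(newSchema, previousVersions.asJava)
      true
    } catch {
      case _: SchemaValidationException => false // reject the commit
    }
  }
}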
How to do it better? (4)
New system:
● two separate Kafka instances (buffer and destination)
● if your processing infrastructure is down, you still collect data
● collectors only persist raw HTTP requests - no logic
● all logic lives in Spark Streaming
● dead letter queue - failed messages can be reprocessed (see the sketch below)
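A rough sketch of the dead-letter idea inside the streaming job: records that fail to parse are routed to a separate topic instead of being dropped. The topic names, the Event type, the tab-separated format and the sendToKafka callback are all hypothetical:

import org.apache.spark.streaming.dstream.DStream
import scala.util.{Failure, Success, Try}

object DeadLetterSketch {
  case class Event(userId: String, url: String)

  // Parsing can fail on malformed input; Try captures that.
  def parse(raw: String): Try[Event] = Try {
    val Array(user, url) = raw.split("\t", 2)
    Event(user, url)
  }

  def process(raw: DStream[String], sendToKafka: (String, String) => Unit): Unit =
    raw.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { line =>
          parse(line) match {
            case Success(event) => sendToKafka("clickstream-clean", event.toString)
            case Failure(_)     => sendToKafka("clickstream-dlq", line) // reprocess later
          }
        }
      }
    }
}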
How to do it better? (5)
New system:
● data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool)
● Hive tables and partitions created automatically (look for camus2hive on GitHub; see the sketch below)
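What this automation boils down to is registering each new hourly HDFS directory as a Hive partition; a sketch of generating the DDL (the table name, partition columns and path layout are assumptions, not camus2hive's actual configuration):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object HivePartitionDdl {
  // Build an ALTER TABLE statement for one hourly batch landed by Camus.
  def addPartitionDdl(table: String, dt: LocalDateTime): String = {
    val day  = dt.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))
    val hour = dt.format(DateTimeFormatter.ofPattern("HH"))
    val path = dt.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"))
    s"""ALTER TABLE $table ADD IF NOT EXISTS
       |PARTITION (dt='$day', hour='$hour')
       |LOCATION '/data/clickstream/$path'""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(addPartitionDdl("clickstream_events", LocalDateTime.of(2015, 6, 1, 13, 0)))
}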
Why Spark Streaming?
● pros:
○ momentum
○ good integration with YARN - better resource utilization, easy scaling
○ good integration with Kafka
○ reuse batch Spark code (see the sketch after this list)
● cons:
○ micro-batching
○ as complex as Spark
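The “reuse batch Spark code” point fits in a few lines: the same RDD-level function serves the batch job directly and the streaming job through DStream.transform. The Event type and the per-user counting are invented for illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object SharedLogic {
  case class Event(userId: String, url: String)

  // Ordinary batch transformation: pure RDD in, RDD out.
  def countsPerUser(events: RDD[Event]): RDD[(String, Long)] =
    events.map(e => (e.userId, 1L)).reduceByKey(_ + _)

  // The batch job calls countsPerUser(rdd) directly; the streaming job
  // reuses the exact same function on every micro-batch.
  def streaming(events: DStream[Event]): DStream[(String, Long)] =
    events.transform(countsPerUser _)
}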