TRANSCRIPT
A Real-Time Data Ingestion System
Or: How I Learned to Stop Worrying and Love Avro
Maciej Arciuch
Allegro.pl
● biggest online auction website in Poland
● sites in other countries
● “Polish eBay” (but better!)
Clickstream at Allegro.pl
● how do our users behave?
● ~400 M raw clickstream events daily
● collected at the front-end
● web and mobile devices
● valuable source of information
Legacy system
● HDFS, Flume and MapReduce
● Main issues:
○ batch processing only - data available hourly or daily at best
○ no single, well-defined data format
○ how to make data more accessible to others?
How to do it better? (1)
● stream processing: Spark Streaming and Kafka - data available “almost” instantly (see the sketch below)
● new applications:
○ security
○ recommendations & search
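A minimal sketch of such a job, using the Kafka direct stream API from the Spark 1.x / Kafka 0.8 era; the broker addresses, topic name and batch interval below are assumptions for illustration, not the production setup:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickstreamJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-ingest")
    val ssc  = new StreamingContext(conf, Seconds(10))      // micro-batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("clickstream-raw")                // hypothetical topic name

    // Direct stream: one RDD partition per Kafka partition,
    // data available "almost" instantly.
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    events.map(_._2).count().print()                        // e.g. events per micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}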
How to do it better? (2)
Use Avro
● mature software, good support in Hadoop ecosystem
● space-efficient
● schema: structure + doc placeholder
● the same format for stream and batch processing
● backward/forward compatibility control
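To make “structure + doc placeholder” concrete, here is a sketch of a small Avro record schema and its binary encoding; the ClickEvent schema and its fields are invented for illustration, not the actual Allegro schema:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroExample {
  // Structure and documentation live together in the schema.
  val schemaJson =
    """{
      |  "type": "record",
      |  "name": "ClickEvent",
      |  "namespace": "pl.allegro.clickstream",
      |  "fields": [
      |    {"name": "userId",    "type": "string", "doc": "anonymised user id"},
      |    {"name": "url",       "type": "string", "doc": "visited URL"},
      |    {"name": "timestamp", "type": "long",   "doc": "event time, epoch millis"}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(schemaJson)

    val event: GenericRecord = new GenericData.Record(schema)
    event.put("userId", "u-123")
    event.put("url", "https://allegro.pl/item/42")
    event.put("timestamp", System.currentTimeMillis())

    // Binary encoding is what makes Avro space-efficient:
    // field names never travel on the wire, only values.
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(event, encoder)
    encoder.flush()
    println(s"encoded size: ${out.size()} bytes")
  }
}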
How to do it better? (3)
Create a central schema repository:
● single source of truth
● all elements of the system refer to the latest version
● validate backward/forward compatibility on commit
● immutable schemas
● propagate schema info to the Hive metastore, files, and HTML docs
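The on-commit compatibility check can be expressed with Avro's own validator API; a minimal sketch, with the repository and commit-hook wiring left out (canReadStrategy here checks that a reader using the new schema can still read data written with the old ones, i.e. backward compatibility):

import org.apache.avro.{Schema, SchemaValidationException, SchemaValidatorBuilder}
import scala.collection.JavaConverters._

object CompatibilityCheck {
  def isBackwardCompatible(newSchema: Schema, previousVersions: Seq[Schema]): Boolean = {
    val validator = new SchemaValidatorBuilder().canReadStrategy().validateAll()
    try {
      validator.validate(newSchema, previousVersions.asJava)
      true
    } catch {
      case _: SchemaValidationException => false // reject the commit
    }
  }
}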
How to do it better? (4)
New system:
● two separate Kafka instances (buffer and destination)
● if your processing infrastructure is down, you still collect data
● collectors only persist raw HTTP requests - no logic
● all logic lives in Spark Streaming
● dead letter queue - failed messages can be reprocessed (see the sketch below)
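A rough sketch of the dead-letter idea inside the streaming job: records that fail to parse are routed to a separate topic instead of being dropped. The topic names, the Event type, the tab-separated format and the sendToKafka callback are all hypothetical:

import org.apache.spark.streaming.dstream.DStream
import scala.util.{Failure, Success, Try}

object DeadLetterSketch {
  case class Event(userId: String, url: String)

  // Parsing can fail on malformed input; Try captures that.
  def parse(raw: String): Try[Event] = Try {
    val Array(user, url) = raw.split("\t", 2)
    Event(user, url)
  }

  def process(raw: DStream[String], sendToKafka: (String, String) => Unit): Unit =
    raw.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { line =>
          parse(line) match {
            case Success(event) => sendToKafka("clickstream-clean", event.toString)
            case Failure(_)     => sendToKafka("clickstream-dlq", line) // reprocess later
          }
        }
      }
    }
}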
How to do it better? (5)
New system:
● data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool)
● Hive tables and partitions created automatically (look for camus2hive on GitHub; see the sketch below)
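What this automation boils down to is registering each new hourly HDFS directory as a Hive partition; a sketch of generating the DDL (the table name, partition columns and path layout are assumptions, not camus2hive's actual configuration):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object HivePartitionDdl {
  // Build an ALTER TABLE statement for one hourly batch landed by Camus.
  def addPartitionDdl(table: String, dt: LocalDateTime): String = {
    val day  = dt.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))
    val hour = dt.format(DateTimeFormatter.ofPattern("HH"))
    val path = dt.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"))
    s"""ALTER TABLE $table ADD IF NOT EXISTS
       |PARTITION (dt='$day', hour='$hour')
       |LOCATION '/data/clickstream/$path'""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(addPartitionDdl("clickstream_events", LocalDateTime.of(2015, 6, 1, 13, 0)))
}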
Why Spark Streaming?
● pros:
○ momentum
○ good integration with YARN - better resource utilization, easy scaling
○ good integration with Kafka
○ reuse batch Spark code (see the sketch after this list)
● cons:
○ micro-batching
○ as complex as Spark
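The “reuse batch Spark code” point fits in a few lines: the same RDD-level function serves the batch job directly and the streaming job through DStream.transform. The Event type and the per-user counting are invented for illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object SharedLogic {
  case class Event(userId: String, url: String)

  // Ordinary batch transformation: pure RDD in, RDD out.
  def countsPerUser(events: RDD[Event]): RDD[(String, Long)] =
    events.map(e => (e.userId, 1L)).reduceByKey(_ + _)

  // The batch job calls countsPerUser(rdd) directly; the streaming job
  // reuses the exact same function on every micro-batch.
  def streaming(events: DStream[Event]): DStream[(String, Long)] =
    events.transform(countsPerUser _)
}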