Meetup: real-time data collection


Posted on 13-Nov-2014




  • 1. Conduit: Real-Time Data Collection at Scale
       Sharad Agarwal, Amareshwari & Inder Singh

  • 2. Agenda
       - About InMobi
       - Data collection challenges
       - Goals
       - Design / tech-stack deep dive
       - Publisher/consumer eco-system

  • 3. As you grow to multi-region
       [Diagram: web applications, other applications and consumer applications spread across data centers]
       - High-volume, complex data flows across the WAN

  • 4. Data Collection Challenges
       - Ad-hoc log aggregation
       - Duplicate data transfer
       - Tightly coupled, point-to-point flows
       - No reliability guarantees
       - Network glitches lead to huge backlogs
       - High peak-bandwidth requirement; transfers in bursts
       - No support (or different data paths) for the real-time use case

  • 5. Goals
       - Collect event data from distributed sub-systems in a reliable, efficient, scalable and uniform way, for batch as well as near-real-time consumption
       - Decouple data consumers from producers
       - Save network bandwidth
       - Reduce peak network requirements caused by bulk data transfers in spurts
       - Minimize duplicate data transfers across the WAN

  • 6. [Diagram: producers and consumers in DC1, DC2 and DC3; local streams A_dc1, B_dc1, B_dc2, B_dc3; control flow and data flow for the merged streams A and B across data centers]

  • 7. How to Achieve?
       Use an existing data collection/relay technology:
       - Kafka: pull semantics; data stored on Kafka brokers
       - Scribe: push semantics; data typically written to HDFS
       - Flume: push semantics, similar to Scribe; being rewritten as Flume NG; promising but in a nascent stage

  • 8. [Architecture diagram: producers powered by Scribe, primary/secondary collectors with a ZK cluster and worker for HA, DISTCP into HDFS, streaming and batch consumers across the WAN]

  • 9. Producer
       - Producers publish messages using the Publisher API
       - Transparent to the underlying publishing technology (Scribe, Flume, etc.)

  • 10. Consumer
       - Batch: data is published in the HDFS cluster in minute directories:
           ../streams/<stream>/YYYY/MM/DD/HR/MN
           ../streams_local/<stream>/YYYY/MM/DD/HR/MN
       - Streaming: streaming consumer API through an iterator interface
           - Streams messages directly from HDFS
           - Streaming from multiple clusters in parallel
           - Checkpoint the stream at any time
           - Kafka-like static consumer groups for load balancing
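The per-minute directory layout described on the Consumer slide can be sketched as a small path helper. This is a hypothetical illustration only: the `minute_dir` function, the base path and the stream name are assumptions, not part of Conduit's actual API.

```python
from datetime import datetime, timezone

def minute_dir(base, stream, ts):
    """Build the per-minute HDFS directory for a stream, following the
    ../streams/<stream>/YYYY/MM/DD/HR/MN layout from the slide."""
    return "{}/{}/{:%Y/%m/%d/%H/%M}".format(base, stream, ts)

ts = datetime(2014, 11, 13, 9, 5, tzinfo=timezone.utc)
minute_dir("/conduit/streams", "clicks", ts)
# -> '/conduit/streams/clicks/2014/11/13/09/05'
```

A batch consumer can then enumerate one such directory per minute of the window it wants to process, while `streams_local` holds the per-datacenter copy before merging.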
  • 11. Client library
       [Diagram: producers call publish() through the client library; consumers call next(); data flows across the WAN]

  • 12. Salient Features
       - Data compression
       - Data merging
       - Mirroring
       - Streaming consumer API for low-latency data transfers
       - End-to-end audit for SLA and reliability

  • 13. Open Source
       - Conduit
       - Pintail: soon to follow

  • 14. Thank you!
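The publish/next decoupling on the client-library slide can be sketched with an in-memory stand-in for the transport. Everything here is an assumption for illustration: the class names, the in-memory `transport` dict and the integer checkpoint token are not Conduit's real interfaces; they only mirror the publish(), next() and checkpoint operations named on the slides.

```python
import collections

class Publisher:
    """Hypothetical Conduit-style publisher: producers only call
    publish(stream, message); the transport (Scribe, Flume, ...) is hidden."""
    def __init__(self, transport):
        self._transport = transport  # in-memory dict of lists for this sketch

    def publish(self, stream, message):
        self._transport[stream].append(message)

class StreamingConsumer:
    """Hypothetical iterator-style consumer that can checkpoint its
    position at any time and resume from that checkpoint later."""
    def __init__(self, transport, stream, checkpoint=0):
        self._msgs = transport[stream]
        self._pos = checkpoint

    def next(self):
        msg = self._msgs[self._pos]
        self._pos += 1
        return msg

    def checkpoint(self):
        return self._pos  # opaque resume token in a real system

transport = collections.defaultdict(list)
p = Publisher(transport)
p.publish("clicks", "m1")
p.publish("clicks", "m2")

c = StreamingConsumer(transport, "clicks")
first = c.next()        # -> 'm1'
cp = c.checkpoint()     # token taken after one message
resumed = StreamingConsumer(transport, "clicks", checkpoint=cp)
resumed.next()          # -> 'm2'
```

The point of the sketch is the decoupling: the producer never sees consumers, and a consumer restarted from its checkpoint continues exactly where it left off.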