SF Hadoop Users Group August 2014 Meetup Slides
DESCRIPTION
Slides for Hadoop Users Group Meetup on 13th August 2014
TRANSCRIPT
![Page 1: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/1.jpg)
Hadoop at Lookout
Aug 13, 2014
Yash Ranadive
@yashranadive
Thursday, August 14, 14
![Page 2: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/2.jpg)
BIO
• Data Engineer
• From Mumbai, India
• Lived in 7 different cities in the US
• @yashranadive
• etl.svbtle.com
![Page 3: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/3.jpg)
AGENDA
• What we do @Lookout
• Data warehouse
• Evolution from monolithic to micro-services
• Protocol Buffers
• Areas we are exploring
![Page 4: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/4.jpg)
WHAT WE DO @LOOKOUT
![Page 5: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/5.jpg)
Over 50 million registered users
![Page 6: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/6.jpg)
DATA TEAM
• 3 data engineers
• 6 data analysts
• Hadoop
  • 64 hosts
  • 300 TB capacity
![Page 7: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/7.jpg)
DATA WAREHOUSE
• Internal and external data sources
• Chunker, Mudskipper
• Warehouse
  • MySQL star schema warehouse
  • HDFS: Hive, HBase, Impala
• R, Hue, Shiny, Tableau, custom apps
![Page 8: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/8.jpg)
FROM MONOLITHIC TO MICROSERVICES
![Page 9: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/9.jpg)
MONOLITHIC APPLICATION
[Diagram: Mobile/Web Clients -> HTTP -> Rails application (Routing, Controller, Views, ORM) -> Database (Tables)]
![Page 10: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/10.jpg)
DATA INGESTION - MONOLITHIC
[Diagram: Application -> master_db -> (MySQL replication) -> slave_db -> ETL / ELT -> Data Warehouse (MySQL, Hive) -> Reporting; External Sources also feed the Data Warehouse]
Ingestion is batch-oriented
![Page 11: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/11.jpg)
PROBLEM
• Rails offers fast time to market (TTM) but is challenging to scale
  • One code base
  • Slower deployments
  • Too complex and large to manage
• Solution
  • Microservices / service-oriented architecture
  • Break the app out into smaller services
![Page 12: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/12.jpg)
MICROSERVICES ARCHITECTURE
[Diagram: Mobile/Web Clients -> HTTP -> Rails application (Routing, Controller, Views, ORM) -> Database (Tables), now alongside separate services: Settings Service, PhotoBackup]
We frequently add new services
![Page 13: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/13.jpg)
DATA INGESTION - MICROSERVICES
[Diagram: Application -> master_db -> (MySQL replication) -> slave_db -> ETL / ELT -> Data Warehouse (MySQL, Hive) -> Reporting; External Sources also feed the Data Warehouse; Settings Service, Backup Service, and Locate Service -> Messaging Layer -> Consumer]
![Page 14: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/14.jpg)
DATA INGESTION - MONOLITHIC VS MICROSERVICES
Monolithic - Snapshot of a point in time
select * from user_settings;
id | setting_id  | user_id | modified_at
===|=============|=========|===============
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z
Microservices - Events
{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}
{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}
...
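One way to see the difference between the two shapes of data: a snapshot row already holds the current state, while events have to be folded in order to recover it. The following is a minimal sketch, assuming the event fields shown on the slide. The `current_settings` helper and the third event (guid 4) are hypothetical, added only for illustration, and are not part of Lookout's pipeline.

```python
from collections import defaultdict

# The first two events mirror the slide; the third (guid 4) is made up
# so that the fold has something to overwrite.
events = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup", "user_id": "2629",
     "timestamp": "20140709T0400Z"},
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]


def current_settings(events):
    """Fold modify_setting events, in timestamp order, into a per-user snapshot."""
    state = defaultdict(dict)
    # Fixed-width timestamps sort correctly as plain strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] == "modify_setting":
            state[event["user_id"]][event["setting_id"]] = event["setting_status"]
    return dict(state)


snapshot = current_settings(events)
```

The snapshot can always be rebuilt from the events, but the history (the setting was ON for five minutes) cannot be rebuilt from the snapshot.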
![Page 15: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/15.jpg)
DESIGN
• We wanted to create an always-on event ingestion framework that:
  • Would scale workers on demand
  • Would be easy to monitor
![Page 16: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/16.jpg)
FIRST STAB - WORKER
Service -> ActiveMQ -> Ruby Worker -> Hive
• Upstart script that daemonized the Ruby process
• Monitoring using Zenoss
• Very easy to set up
• Mapping files for JSON -> CSV
• Ruby is terse and clean
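The slides do not show what a mapping file looked like. As a rough sketch of the idea, assume a mapping is an ordered list of (CSV column, JSON key) pairs per event type. The `MAPPING` structure, the column names, and `json_to_csv_row` are all hypothetical, and the original worker was written in Ruby; Python is used here only for illustration.

```python
import csv
import io
import json

# Hypothetical mapping: for each event type, the ordered CSV columns
# and the JSON key each column is read from.
MAPPING = {
    "modify_setting": [
        ("guid", "guid"),
        ("user_id", "user_id"),
        ("setting", "setting_id"),
        ("status", "setting_status"),
        ("ts", "timestamp"),
    ],
}


def json_to_csv_row(raw, mapping=MAPPING):
    """Turn one JSON event into one CSV line, using the mapping for its event type."""
    event = json.loads(raw)
    columns = mapping[event["event_type"]]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Missing keys become empty cells rather than raising.
    writer.writerow([event.get(json_key, "") for _, json_key in columns])
    return buf.getvalue()


raw = ('{"guid": 1, "event_type": "modify_setting", "setting_id": "backup", '
       '"setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"}')
row = json_to_csv_row(raw)
```

Keeping the column order in data rather than in code means a new event type needs a new mapping entry, not a new worker.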
![Page 17: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/17.jpg)
PROBLEMS
• ActiveMQ
  • ActiveMQ did not scale well, even with multiple machines in the AMQ cluster
  • ActiveMQ creates a separate queue for every consumer of the topic
• Monitoring using Zenoss is not ideal, especially for multi-process consumers
• The worker ran on a single machine, so it was not fault tolerant
![Page 18: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/18.jpg)
CURRENT ARCHITECTURE - WORKER
Service -> Kafka -> Storm -> Hive
Service -> ActiveMQ
• Monitoring using Storm’s Thrift API
• Scaling the number of workers is easy
• Kafka has better scalability than ActiveMQ
![Page 19: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/19.jpg)
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm (Kafka Spout and ActiveMQ Spout -> Processing Bolt -> storm-hdfs bolt) -> HDFS (Landing Directory -> Hive Directory)]
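The slide does not explain why there are two directories. One common reason, assumed here, is to write files where queries do not look and move them into the queried directory only once they are complete. The sketch below shows that write-then-rename pattern on a local filesystem as a stand-in for HDFS. The directory names, file name, and `write_then_publish` helper are hypothetical.

```python
import os
import tempfile


def write_then_publish(lines, landing_dir, hive_dir, name):
    """Write a file in landing_dir, then move it into hive_dir once it is complete."""
    landing_path = os.path.join(landing_dir, name)
    with open(landing_path, "w") as f:
        for line in lines:
            f.write(line + "\n")
    final_path = os.path.join(hive_dir, name)
    # A rename within one filesystem is atomic on POSIX, so a reader of
    # hive_dir sees either no file or the whole file, never a partial one.
    os.replace(landing_path, final_path)
    return final_path


base = tempfile.mkdtemp()
landing = os.path.join(base, "landing")
hive = os.path.join(base, "hive")
os.makedirs(landing)
os.makedirs(hive)

published = write_then_publish(
    ["1\t2629\tbackup\tON"], landing, hive, "events-0001.tsv")
```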
![Page 20: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/20.jpg)
JSON PROBLEMS
• Problems with JSON
  • No predefined schema
  • No enforcement of backward compatibility
• Solution
  • Protocol Buffers (also Avro/Thrift)
![Page 21: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/21.jpg)
PROTOBUFS
• What?
  • Way of encoding structured data
  • Binary
• Why?
  • Schema
  • Backward compatibility
  • Smaller in size than JSON
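To make the size claim concrete, the sketch below hand-encodes one of the events from the earlier slide in the protobuf wire format and compares it with the same event as JSON. It is illustrative only: the field numbers (1 to 4) are an assumed schema, not Lookout's, and real code would use classes generated by `protoc` rather than these hypothetical helpers.

```python
import json


def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_field(number, value):
    """Tag with wire type 0 (varint), then the value."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


def string_field(number, text):
    """Tag with wire type 2 (length-delimited), then length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


event = {"guid": 3, "event_type": "start_backup",
         "user_id": "2629", "timestamp": "20140709T0400Z"}

# Assumed field numbers: 1=guid, 2=event_type, 3=user_id, 4=timestamp.
proto_bytes = (varint_field(1, event["guid"])
               + string_field(2, event["event_type"])
               + string_field(3, event["user_id"])
               + string_field(4, event["timestamp"]))
json_bytes = json.dumps(event).encode("utf-8")
```

The binary form is 38 bytes here because field names are replaced by one-byte tags; the JSON form repeats every key name in every message.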
![Page 22: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/22.jpg)
VERSIONING
• Backward-compatible changes only
[Diagram: Producer (.proto, version 1.4) -> Queue -> Consumer (.proto, version 1.1)]
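A producer can run ahead of a consumer because protobuf parsers skip field numbers they do not recognize. The sketch below shows that behaviour with a tiny hand-written decoder. The two schemas, the field numbers, and the `decode` helper are hypothetical, and only wire types 0 (varint) and 2 (length-delimited) are handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf, known_fields):
    """Decode a message, keeping known field numbers and skipping the rest."""
    pos = 0
    message = {}
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if number in known_fields:
            message[known_fields[number]] = value
    return message


# Hypothetical schemas: version 1.4 added field 3 to what version 1.1 knew.
V1_1 = {1: "guid", 2: "event_type"}
V1_4 = {1: "guid", 2: "event_type", 3: "setting_status"}

# Bytes a 1.4 producer would emit for
# guid=1, event_type="modify_setting", setting_status="ON".
payload = b"\x08\x01" + b"\x12\x0emodify_setting" + b"\x1a\x02ON"

old_view = decode(payload, V1_1)
new_view = decode(payload, V1_4)
```

Adding a new field number is safe for old consumers; changing the number or type of an existing field is not, which is why only backward-compatible changes are allowed.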
![Page 23: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/23.jpg)
SHARING PROTOBUF SCHEMAS
[Diagram: Producers -> push Java jars and Ruby gems -> Artifactory (schema repo) -> pull Java jars -> Data Team Storm project]
![Page 24: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/24.jpg)
BUT HOW DO YOU STORE PROTOBUFS IN HDFS?
![Page 25: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/25.jpg)
HOW WE STORE PROTOBUFS
• Store raw version
  • Raw dump of Kafka topic into HDFS
• Convert them to a tuple using Storm
  • Inflate, then convert to TSV
• Can query raw protobufs directly from Hive, but we don’t yet
  • elephant-bird (difficult to get it working)
![Page 26: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/26.jpg)
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm (Kafka Spout and ActiveMQ Spout -> Deserialize Protobuf -> storm-hdfs bolt) -> HDFS (Landing Directory -> Hive Directory)]
![Page 27: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/27.jpg)
AREAS WE ARE EXPLORING
![Page 28: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/28.jpg)
SPARK
• ETL
  • Word count: ~5 lines of Scala code vs. 58 lines of Java MapReduce code
  • Spark Streaming can achieve results similar to Storm's through micro-batching
    http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
• Machine Learning
  • Online learning using MLlib
  • Logistic Regression and SVM
![Page 29: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/29.jpg)
H2O
• In-memory machine learning
• Tight integration with R
• Preferred by Data Scientists
![Page 30: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/30.jpg)
OPEN SOURCE PROJECTS
• Currently open sourced
  • Pipefish - write from MySQL to HDFS: github.com/lookout/pipefish
• Future
  • Mudskipper - capture change-data events from MySQL binlogs
  • Chunker - download MySQL table data in chunks
![Page 31: SF Hadoop Users Group August 2014 Meetup Slides](https://reader033.vdocuments.site/reader033/viewer/2022061218/54b760af4a7959f9168b46c9/html5/thumbnails/31.jpg)
Questions