sf hadoop users group august 2014 meetup slides

Hadoop at Lookout

Aug 13, 2014

Yash Ranadive

@yashranadive

Thursday, August 14, 14

BIO

• Data Engineer

• From Mumbai, India

• Lived in 7 different cities in US

• @yashranadive

• etl.svbtle.com


AGENDA

• What we do @Lookout

• Data warehouse

• Evolution from monolithic to micro-services

• Protocol Buffers

• Areas we are exploring


WHAT WE DO @LOOKOUT


Over 50 million registered users


DATA TEAM

• 3 Data Engineers

• 6 data analysts

• Hadoop

• 64 hosts

• 300 TB capacity


DATA WAREHOUSE

INTERNAL AND EXTERNAL DATA SOURCES

MySQL Star Schema

Warehouse

HDFS

HIVE HBase Impala

Chunker

Mudskipper

R Hue Shiny Tableau Custom Apps

WAREHOUSE


FROM MONOLITHIC TO MICROSERVICES


MONOLITHIC APPLICATION

Routing

Controller

Mobile/Web Clients

Database

RAILS APPLICATION

HTTP

ORM

Views

Tables


DATA INGESTION - MONOLITHIC

Application master_db slave_db

Data Warehouse

MySQL Hive

ETL

ELT

MySQL

Replication

External Sources

Reporting

Ingestion is batch-oriented


PROBLEM

• Rails has a fast time to market (TTM) but is challenging to scale

• One code base

• Slower Deployments

• Too complex and large to manage

• Solution

• Microservices / service oriented architecture

• Break out the app into smaller services


MICROSERVICES ARCHITECTURE

Routing

Controller

Mobile/Web Clients

Database

RAILS APPLICATION

HTTP

ORM

Views

Tables

Settings Service

PhotoBackup

We frequently add new services


DATA INGESTION - MICROSERVICES

Application master_db slave_db

Data Warehouse

MySQL Hive

ETL

ELT

MySQL

Replication

External Sources

Reporting

Settings Service

Backup Service

Locate Service

Messaging Layer

Consumer


DATA INGESTION - MONOLITHIC VS MICROSERVICES

select * from user_settings;

id | setting_id  | user_id | modified_at
================================================
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z

{"guid": 1, "event_type": "modify_setting", "setting_id": "backup", "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"}

{"guid": 3, "event_type": "start_backup", "user_id": "2629", "timestamp": "20140709T0400Z"}

...

Monolithic - Snapshot of a point in time

Microservices - Events
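
The contrast above can be sketched in code: replaying an event stream rebuilds the kind of point-in-time snapshot the monolithic table gave, while the events themselves keep the history. This is a minimal sketch, not Lookout's implementation. Field names follow the sample events on this slide; `apply_events` is a hypothetical helper, and the "OFF" event is invented for illustration.

```python
def apply_events(events):
    """Fold modify_setting events into a {(user_id, setting_id): status} snapshot."""
    snapshot = {}
    # Timestamps in this fixed format sort correctly as plain strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "modify_setting":
            continue  # other events (e.g. start_backup) do not change settings
        key = (event["user_id"], event["setting_id"])
        snapshot[key] = event["setting_status"]
    return snapshot


events = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup", "user_id": "2629",
     "timestamp": "20140709T0400Z"},
    # Hypothetical later event, added for illustration only.
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]

snapshot = apply_events(events)
print(snapshot)
```

Replaying only a prefix of the events gives the snapshot as of an earlier time, which a single snapshot table cannot do.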


DESIGN

• We wanted to create an always-on event ingestion framework that:

• Would scale workers on demand

• Would be easy to monitor


FIRST STAB - WORKER

Service ActiveMQ Ruby Worker HIVE

• Upstart script that daemonized a Ruby process

• Monitoring using Zenoss

• Very easy to set up

• Mapping Files for JSON -> CSV

• Ruby is terse and clean
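
The slides do not show what a mapping file looked like, so the following is only a guess at the idea: an ordered list that says which JSON key fills each CSV column. The original worker was written in Ruby; this sketch uses Python, and `MAPPING` and `json_to_csv_row` are hypothetical names.

```python
import csv
import io
import json

# Hypothetical mapping: ordered CSV columns and the JSON key each one reads.
MAPPING = [
    ("guid", "guid"),
    ("event_type", "event_type"),
    ("user_id", "user_id"),
    ("event_time", "timestamp"),
]


def json_to_csv_row(message, mapping=MAPPING):
    """Convert one JSON message into a single CSV line; missing keys become empty."""
    record = json.loads(message)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([record.get(key, "") for _, key in mapping])
    return out.getvalue()


message = ('{"guid": 3, "event_type": "start_backup", '
           '"user_id": "2629", "timestamp": "20140709T0400Z"}')
print(json_to_csv_row(message), end="")
```

Keeping the column order in data rather than in code means a new event type only needs a new mapping, not a new worker.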


PROBLEMS

• ActiveMQ

• ActiveMQ did not scale well - even with multiple machines in the AMQ cluster

• ActiveMQ creates a separate queue for every consumer of the topic

• Monitoring using Zenoss is not ideal especially for multi-process consumers

• The worker ran on a single machine - not fault tolerant


CURRENT ARCHITECTURE - WORKER

Service Kafka Storm HIVE

• Monitoring using Storm’s thrift API

• Scaling number of workers is easy

• Kafka has better scalability than ActiveMQ
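
One commonly cited reason for the difference is that Kafka keeps a single append-only log per partition and lets each consumer track its own read offset, instead of copying messages into a queue per consumer. The toy class below illustrates that idea only; it is not the Kafka API, and `SharedLog` and the consumer names are hypothetical.

```python
class SharedLog:
    """Append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self.messages = []
        self.offsets = {}

    def append(self, message):
        self.messages.append(message)

    def poll(self, consumer, max_messages=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch


log = SharedLog()
for name in ["modify_setting", "start_backup", "locate"]:
    log.append(name)

first = log.poll("hive_loader", max_messages=2)
second = log.poll("monitor")
print(first, second, len(log.messages))
```

Adding a consumer costs one integer of state here, not another copy of every message.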

Service ActiveMQ


Storm

STORM TOPOLOGY

Service Kafka HDFS

Kafka Spout

ActiveMQ Spout

Processing Bolt

Storm-hdfs bolt

Landing Directory

Hive Directory


JSON PROBLEMS

• Problems with JSON

• No predefined schema

• No enforcement of backward compatibility

• Solution

• Protocol Buffers (also Avro/Thrift)


PROTOBUFS

• What?

• Way of encoding structured data

• Binary

• Why?

• Schema

• Backward compatibility

• Smaller in size than JSON
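
To make the size claim concrete, the sketch below hand-encodes one of the sample events in the protobuf wire format (varint field keys, length-delimited strings) and compares it with the JSON text. In practice you would use generated protobuf classes. The field numbers in `FIELD_NUMBERS` are assumptions; the deck does not show the real `.proto` files.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_field(field_number, value):
    """int -> wire type 0 (varint); str -> wire type 2 (length-delimited)."""
    if isinstance(value, int):
        return encode_varint((field_number << 3) | 0) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


event = {"guid": 3, "event_type": "start_backup",
         "user_id": "2629", "timestamp": "20140709T0400Z"}
# Hypothetical field numbers; a real schema would define these in a .proto file.
FIELD_NUMBERS = {"guid": 1, "event_type": 2, "user_id": 3, "timestamp": 4}

encoded = b"".join(encode_field(FIELD_NUMBERS[k], v) for k, v in event.items())
as_json = json.dumps(event).encode("utf-8")
print(len(encoded), len(as_json))
```

The saving comes mostly from field names: JSON repeats them in every message, while the binary form carries only a small field number.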


VERSIONING

• Backward-compatible changes only

.proto .proto

Version 1.4 Version 1.1

Producer Queue Consumer
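
The reason a consumer on an older schema can read messages from a newer producer is that every field on the wire carries its field number, so a reader can skip numbers it does not know. The hand-rolled decoder below illustrates this for two wire types only. Real protobuf libraries handle this for you; the field numbers and names are hypothetical.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known(buf, known_fields):
    """Decode varint and length-delimited fields, skipping unknown field numbers."""
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length].decode("utf-8"), pos + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if field_number in known_fields:
            record[known_fields[field_number]] = value
    return record


# Written by a newer producer: field 1 = guid, field 2 = user_id,
# field 3 = a field added in a later schema version (hypothetical).
message = b"\x08\x03" + b"\x12\x042629" + b"\x1a\x02ON"

V1_1_FIELDS = {1: "guid", 2: "user_id"}               # older consumer schema
V1_4_FIELDS = {1: "guid", 2: "user_id", 3: "status"}  # newer schema

print(decode_known(message, V1_1_FIELDS))
print(decode_known(message, V1_4_FIELDS))
```

This only works if changes stay backward compatible: new fields get new numbers, and existing numbers are never reused with a different meaning.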


SHARING PROTOBUF SCHEMAS

Artifactory (Schema Repo)

Producers

Data Team Storm Project

Push

Java jars

Ruby gems

Pull

Java jars


BUT HOW DO YOU STORE PROTOBUFS IN HDFS?


HOW WE STORE PROTOBUFS

• Store raw version

• Raw dump of Kafka topic into HDFS

• Convert them to a tuple using Storm

• Inflate then convert to TSV

• Can query raw protobufs directly from HIVE but we don’t yet

• elephant-bird (difficult to get it working)
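
The "convert to TSV" step can be sketched as below, assuming the protobuf has already been deserialized into a dict. The column list and `to_tsv_line` are hypothetical. Writing `\N` for missing values follows Hive's default NULL marker for text tables, and replacing embedded tabs and newlines is one possible policy, not necessarily the one used at Lookout.

```python
def to_tsv_line(record, columns):
    """Render one decoded record as a tab-separated line in a fixed column order."""
    cells = []
    for column in columns:
        value = record.get(column)
        if value is None:
            cells.append("\\N")  # Hive's default NULL marker for text tables
        else:
            # Tabs and newlines inside a value would break the row, so replace them.
            cells.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells) + "\n"


COLUMNS = ["guid", "event_type", "user_id", "timestamp"]  # hypothetical table layout
record = {"guid": 3, "event_type": "start_backup", "user_id": "2629"}
print(to_tsv_line(record, COLUMNS), end="")
```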


Storm

STORM TOPOLOGY

Service Kafka HDFS

Kafka Spout

ActiveMQ Spout

Deserialize Protobuf

Storm-hdfs bolt

Landing Directory

Hive Directory


AREAS WE ARE EXPLORING


SPARK

• ETL

• Wordcount ~5 lines of Scala code vs. 58 lines of Java MapReduce code

• Spark Streaming can achieve results similar to Storm's through micro-batching
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

• Machine Learning

• Online learning using MLlib

• Logistic Regression and SVM
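
"Online learning" here means updating a model one example at a time as data streams in, rather than retraining on a full batch. The sketch below shows the idea for logistic regression with plain stochastic gradient descent. It is not the MLlib API, the class name is hypothetical, and the toy data is invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def update(self, x, y):
        """One gradient step on the log-loss for a single (x, y), y in {0, 1}."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=1)
# Toy stream (invented data): the label is 1 when the feature is positive.
stream = [([1.0], 1), ([-1.0], 0), ([2.0], 1), ([-2.0], 0)] * 50
for x, y in stream:
    model.update(x, y)

print(model.predict_proba([1.5]), model.predict_proba([-1.5]))
```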



H2O

• In-memory machine learning

• Tight integration with R

• Preferred by Data Scientists


OPEN SOURCE PROJECTS

• Currently open sourced

• Pipefish - write from MySQL to HDFS
github.com/lookout/pipefish

• Future

• Mudskipper - capture change-data events from MySQL binlogs.

• Chunker - download MySQL table data in chunks
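
Chunker was not yet open sourced at the time of the talk, so its design is not shown here. One common way to download a table in chunks is to page by primary key, sketched below. It uses an in-memory SQLite database as a stand-in for MySQL so the example is self-contained; `read_in_chunks` is a hypothetical helper, and the rows are taken from the `user_settings` sample earlier in the deck.

```python
import sqlite3


def read_in_chunks(conn, table, key_column, chunk_size):
    """Yield lists of rows, paging by key so each query stays small.

    Table and column names are interpolated into SQL, so they must be trusted.
    Assumes the key is the first column of the table.
    """
    last_key = None
    while True:
        if last_key is None:
            query = f"SELECT * FROM {table} ORDER BY {key_column} LIMIT ?"
            params = (chunk_size,)
        else:
            query = (f"SELECT * FROM {table} WHERE {key_column} > ? "
                     f"ORDER BY {key_column} LIMIT ?")
            params = (last_key, chunk_size)
        rows = conn.execute(query, params).fetchall()
        if not rows:
            return
        yield rows
        last_key = rows[-1][0]


conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
conn.execute("CREATE TABLE user_settings "
             "(id INTEGER PRIMARY KEY, setting_id TEXT, user_id INTEGER, modified_at TEXT)")
conn.executemany("INSERT INTO user_settings VALUES (?, ?, ?, ?)", [
    (1, "backup", 2629, "20140709T0400Z"),
    (3, "locate", 2682, "20140709T0402Z"),
    (8, "wipe", 2629, "20140709T0403Z"),
    (9, "theft_alert", 2629, "20140709T0407Z"),
])

chunks = list(read_in_chunks(conn, "user_settings", "id", chunk_size=3))
print([len(chunk) for chunk in chunks])
```

Paging on `id > last_key` avoids the growing cost of `OFFSET` on large tables, since each query can start from an index position.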



Questions

