Achieve Big Data Analytic Platform with Lambda Architecture on Cloud
TRANSCRIPT
Achieve Big Data Analytic Platform with Lambda Architecture on Cloud
Scott Miao & SPN Infra., Trend Micro
9/10/2016
Who am I
• Scott Miao
• RD, SPN, Trend Micro
• Hadoop ecosystem for about 6 years
• AWS for Big Data for about 3 years
• Expertise in HDFS/MR/HBase/AWS EMR
• @takeshimiao
• @slideshare
Agenda
• Why go on Cloud
• Common Cloud Services in Trend
• Lambda Architecture on Cloud
• Serving Layer as-a-Service
• What we learned
Why go on Cloud
Data volume increases 1.5 ~ 2x every year
Growth becomes 2x
Return on Investment
• On traditional infra., we put a lot of effort into service operations
• On the Cloud, we can leverage its elasticity to automate our services
• More focus on innovation!
[Chart: Revenue vs. Cost over Time]
Why AWS ?
AWS is a leader among IaaS platforms
Source: Gartner (May 2015)
https://www.gartner.com/doc/reprints?id=1-2G2O5FC&ct=150519&st=sb
AWS Evaluation
• Cost acceptable
• Functionalities satisfied
• Performance satisfied
Common Cloud Services in Trend
ANALYTIC ENGINE + CLOUD STORAGE
Common Services on the Cloud
• Cloud CI/CD
• Common Auth
• Analytic Engine
• Cloud Storage
• AE + CS
Analytic Engine
• Computation service for Trenders
• Based on AWS EMR
• Simple RESTful API calls
• Computing on demand
  • Short-lived
  • Long-running
• No operation effort
• Pay by computing resources
Cloud Storage
• Storage service for Trenders
• Based on AWS S3
• Simple RESTful API calls
• Share data with everyone in one place
• Metadata search for files
• No operation effort
• Pay by storage size used
Analytic Engine is a…
A common Big Data computation service on the Cloud (AWS)
Major Features in a nutshell
[Diagram: AE major features]
• submitJob API call to AE; AE calls EMR createCluster
• Input from CS: cs paths, CS metadata search; Pig UDFs supported
• Output to CS with metadata
• Visibility UIs: cost visibility (AWS Cost Explorer), client logs (SumoLogic), cluster info (Proxy Gateway)
• Fully HA, fully automated, auto recovery
Supported use cases
1. User creates a cluster
2. User can create multiple clusters as needed
3. User submits a job to a target cluster to run
4. AE delivers the job to a secondary cluster if the target cluster is down
5. Users from a different group are not allowed to submit jobs to the cluster(s)
6. Users from a different group are not allowed to delete a cluster
7. Only users from the same group are allowed to delete a cluster
8. User wants to know what their current cost is
9. User wants to troubleshoot his/her submitted job
10. User wants to observe his/her cluster status
1. User invokes submitJob
2. Auth service checks the user's credentials
3. AE knows the user's name and group
4. AE matches the job and delivers it to the target cluster
5. AE pulls data from CS
6. Job runs on the target cluster
7. AE outputs the result to CS
8. AE sends a message to an SNS topic if the user specified one
Usecase#3 – User submits job to target cluster to run (1/4)
[Diagram: submitJob flow. (1) A user invokes submitJob against AE SaaS with clusterCriteria [['sched:adhoc', 'env:prod'], ['env:prod']]. (2) The Auth Service validates the user as a member of the SPN group. (3)(4) AE matches the criteria against cluster tags: the EMR cluster tagged group:SPN, 'sched:adhoc', 'env:prod' matches the first criterion, while the one tagged group:SPN, 'sched:routine', 'env:prod' does not. (5) AE pulls input from Cloud Storage, (6) the job runs on the matched cluster, (7) the result is written back to Cloud Storage, and (8) a notification is sent.]
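The tag-matching step in this flow can be sketched in code. This is a hypothetical illustration of Genie-style matching, not the real AE implementation: clusterCriteria is an ordered list of tag sets, and the first cluster whose tags contain every tag in a criterion wins.

```python
def match_cluster(cluster_criteria, clusters):
    """Return the first cluster satisfying the highest-priority criterion."""
    for criterion in cluster_criteria:            # try criteria in priority order
        for cluster in clusters:
            if set(criterion) <= set(cluster["tags"]):
                return cluster
    return None                                   # no cluster matched any criterion

# Illustrative clusters mirroring the diagram above
clusters = [
    {"name": "routine-cluster", "tags": ["group:SPN", "sched:routine", "env:prod"]},
    {"name": "adhoc-cluster",   "tags": ["group:SPN", "sched:adhoc",   "env:prod"]},
]
criteria = [["sched:adhoc", "env:prod"], ["env:prod"]]
print(match_cluster(criteria, clusters)["name"])  # adhoc-cluster
```

If no cluster matches the first criterion, the second, looser criterion ["env:prod"] acts as a fallback, which is how a job can be delivered to a secondary cluster.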
Usecase#3 – User submits job to target cluster to run (2/4)
• Sample payload of submitJob API
{
  "clusterCriterias": [
    { "tags": [ "sched:adhoc", "env:prod" ] },
    { "tags": [ "env:prod" ] }
  ],
  "commandArgs": "$inputPaths $outputPaths", // see below
Usecase#3 – User submits job to target cluster to run (3/4)
2
  // see previous
  "fileDependencies": "s3://path/to/my/main.sh,s3://path/to/my/test.pig",
  "inputPaths": [
    "cs://path/to/my/input/data"
    // or you can use metadata search for input data
    // "csq://first_entry_date:['2016-05-30T09:00:000Z','2016-05-30T09:01:000Z'}"
  ],
  "name": "SubmitJob_pig_cs_to_cs_csq",
  "outputPaths": [ "cs://path/to/my/output/result" ],
  "tags": [ "env:my-test" ],
  "notifyTo": "arn:aws:sns:us-east-1:123456789123:my-sns"
}
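A client would build and serialize this payload before POSTing it to the AE RESTful API. A minimal sketch, assuming only the fields shown on the slide (the endpoint and auth mechanism are not shown here):

```python
import json

# Build the submitJob payload from the sample above; the JSON body produced
# here is what would be POSTed to the AE API.
payload = {
    "clusterCriterias": [
        {"tags": ["sched:adhoc", "env:prod"]},   # preferred: ad-hoc prod cluster
        {"tags": ["env:prod"]},                  # fallback: any prod cluster
    ],
    "commandArgs": "$inputPaths $outputPaths",
    "fileDependencies": "s3://path/to/my/main.sh,s3://path/to/my/test.pig",
    "inputPaths": ["cs://path/to/my/input/data"],
    "name": "SubmitJob_pig_cs_to_cs_csq",
    "outputPaths": ["cs://path/to/my/output/result"],
    "tags": ["env:my-test"],
    "notifyTo": "arn:aws:sns:us-east-1:123456789123:my-sns",
}
body = json.dumps(payload)  # request body for the submitJob call
```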
Usecase#3 – User submits job to target cluster to run (4/4)
• All existing job types used on-premises are supported
  • Pure MR
  • Pig and UDFs
  • Hadoop streaming – Python, Ruby, etc.
Usecase#8 – User wants to know what their current cost is (1/2)
• Billing & Cost Management -> Cost Explorer -> Launch Cost Explorer
• Filter by tags: "sys = ae" and "comp = emr" and "other = <your-cluster-name>"
• Group by Service
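The same tag filter can be expressed programmatically. A hedged sketch using the modern AWS Cost Explorer API via boto3 (which postdates this talk); the date range and cluster name are placeholders:

```python
def build_cost_filter(cluster_name):
    """Build a Cost Explorer filter: sys=ae AND comp=emr AND other=<cluster>."""
    return {
        "And": [
            {"Tags": {"Key": "sys",   "Values": ["ae"]}},
            {"Tags": {"Key": "comp",  "Values": ["emr"]}},
            {"Tags": {"Key": "other", "Values": [cluster_name]}},
        ]
    }

# The filter would be passed to the API roughly like this:
# ce = boto3.client("ce")
# ce.get_cost_and_usage(
#     TimePeriod={"Start": "2016-09-01", "End": "2016-09-30"},
#     Granularity="DAILY",
#     Metrics=["UnblendedCost"],
#     GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],  # "Group by Service"
#     Filter=build_cost_filter("spn-stg"),
# )
```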
Usecase#8 – User wants to know what their current cost is (2/2) - Billing and Cost Analysis
• Attach tags to your AWS resources

Tag Key   Tag Value (sample)   Description
name      aesaas-s-11-api      *optional* for AWS Cost Explorer
stack     aesaas-s-11          *optional* for AWS Cost Explorer
service   aesaas               *optional* for AWS Cost Explorer
owner     spn                  *required* the bill is under whose budget
env       prod|stg|dev         *required* environment type
sys       ae                   *required* the system name
comp      api-server|emr       *required* the subcomponent name
other     spn-stg              *optional* a tag free for other usage
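Applying the required tags from this table can be sketched with boto3's EC2 `create_tags` call. The helper function and resource ID are illustrative, not part of the talk:

```python
def build_tags(owner, env, sys_name, comp, other=None):
    """Assemble the required (and one optional) tags in the shape boto3 expects."""
    tags = [
        {"Key": "owner", "Value": owner},     # *required* whose budget
        {"Key": "env",   "Value": env},       # *required* prod|stg|dev
        {"Key": "sys",   "Value": sys_name},  # *required* the system name
        {"Key": "comp",  "Value": comp},      # *required* the subcomponent name
    ]
    if other:
        tags.append({"Key": "other", "Value": other})  # *optional* free usage
    return tags

# ec2 = boto3.client("ec2")
# ec2.create_tags(Resources=["i-0123456789abcdef0"],
#                 Tags=build_tags("spn", "stg", "ae", "emr", other="spn-stg"))
```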
Why do we use AE instead of EMR directly?
• Abstraction
  • Avoid lock-in
  • Hide implementation details behind the scenes
• AWS EMR was not designed for long-running jobs
  • >= AMI-3.1.1 – 256 ACTIVE or PENDING jobs (STEPs)
  • < AMI-3.1.1 – 256 jobs in total
• Better integration with other common services
• Keep our hands off AWS-native code
• Centralized Authentication & Authorization
  • Leverage our internal LDAP server
  • No AWS tokens for users
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/AddingStepstoaJobFlow.html
Lambda Architecture on Cloud
Next Phase
[Roadmap: Cloud Infra. → AE-v1.0 → AE + CS (v1.1~) → Lambda arch.]
What is Lambda (λ) Architecture
[Diagram: Data Ingestion feeds both the Batch Layer (Master Dataset → Batch Processing → Batch View) and the Speed Layer (Streaming Processing → Real-Time View); the Serving Layer merges the Batch View and Real-Time View into a Merged View exposed through a Data Access API. The Batch Layer and Serving Layer are the parts offered as-a-Service.]
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods
https://en.wikipedia.org/wiki/Lambda_architecture
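The serving layer's "merged view" idea can be shown with a toy example: a precomputed batch view is combined with a real-time view that covers only events since the last batch run. The keys and counts below are illustrative, not data from the talk:

```python
def merged_view(batch_view, realtime_view):
    """Merge per-key counts from the batch layer and the speed layer."""
    merged = dict(batch_view)                 # start from the batch view
    for key, count in realtime_view.items():  # overlay recent stream results
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view    = {"malware-A": 1000, "malware-B": 40}  # from batch processing
realtime_view = {"malware-A": 7, "malware-C": 3}      # from stream processing
print(merged_view(batch_view, realtime_view))
```

Queries read the merged view, so results stay fresh between batch runs while the batch layer keeps recomputing from the immutable master dataset.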
Serving Layer as-a-Service
METADATA STORE
Goals
Help everyone easily access metadata shared by several teams
• Access data in one place
• Avoid storage duplication
• Share immediately with all
• Provide unified intelligence
Common metadata storage for several services
• Abstracted to hide infra & ops
• Customized for different needs
(on AWS)
Usecase
• Store all threat entities in one place from the moment they are born
  – Every team can leverage contributions from other teams at a very early stage
Features
Metadata Store Service
• Random Writes
• Bulk Writes
• Sync Query
• Async Query
• Automatic Provision
• Customizable Schema
• Unified Intelligence
• Threat Monitor
Borrowing an idea from the Star Schema
• A schema design widely used in data warehousing
• Fact table: historical data – measurements or metrics for a specific event
• Dimension tables: descriptive attributes – characteristics to describe and select the fact data
Basic Idea
• Refer to the Star Schema design
  – Fact table
    • Put all records into this table (Single Source of Truth)
    • Affordable random and bulk-load writes
    • Fast random reads by rowkey
  – Dimension table
    • Fast and flexible information discovery
    • Get rowkeys of records stored in the Fact table
    • Then retrieve the records by rowkey
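The read path described above can be sketched as: query a dimension index for rowkeys, then fetch full records from the fact table by rowkey. In the real design the fact table is HBase or DynamoDB and the index is Solr or Elasticsearch; plain dicts stand in for them here, and all record contents are made up:

```python
fact_table = {  # rowkey -> full record (single source of truth)
    "rk-001": {"entity": "bad-domain.example", "type": "domain", "score": 90},
    "rk-002": {"entity": "203.0.113.7",        "type": "ip",     "score": 55},
}
dimension_index = {  # attribute value -> rowkeys (flexible discovery)
    ("type", "domain"): ["rk-001"],
    ("type", "ip"):     ["rk-002"],
}

def query(attr, value):
    """Dimension lookup first, then random reads on the fact table by rowkey."""
    rowkeys = dimension_index.get((attr, value), [])
    return [fact_table[rk] for rk in rowkeys]

print(query("type", "domain"))
```

The split keeps writes cheap (one table to load) while still supporting fast, flexible reads through whichever index engine fits the access pattern.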
Reference Implementation – Part 1
• This Star Schema concept can be fulfilled by different implementations
• A famous one is HBase + Indexer + Solr
http://www.hadoopsphere.com/2013/11/the-evolving-hbase-ecosystem.html
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
Reference Implementation – Part 2
http://www.slideshare.net/AmazonWebServices/bdt310-big-data-architectural-patterns-and-best-practices-on-aws #p57
[Diagram: Random and bulk writes land in Fact Tables (DynamoDB); DynamoDB Streams feed a Propagator that, driven by Propagation Rules, propagates data (eventually consistently) to Dimension Tables with different schemas and engines: Elasticsearch, MySQL (RDS), and DynamoDB.]
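The propagator's job can be sketched as: consume stream records from the fact table and route each new item to dimension stores according to propagation rules. The record shape loosely follows DynamoDB Streams events, but the store names, rule format, and data are all illustrative:

```python
def propagate(stream_records, rules, dimension_stores):
    """Route fact-table inserts to dimension stores per the propagation rules."""
    for record in stream_records:
        if record["eventName"] != "INSERT":   # only propagate new items here
            continue
        item = record["dynamodb"]["NewImage"]
        for rule in rules:
            if rule["when"](item):            # propagation rule predicate
                dimension_stores[rule["target"]].append(item)

stores = {"elasticsearch": [], "mysql": []}
rules = [
    {"when": lambda it: it["type"] == "domain", "target": "elasticsearch"},
    {"when": lambda it: "score" in it,          "target": "mysql"},
]
records = [{"eventName": "INSERT",
            "dynamodb": {"NewImage": {"type": "domain", "score": 90}}}]
propagate(records, rules, stores)  # both rules match this record
```

Because propagation happens off the stream, dimension tables lag the fact table slightly, which is exactly the "(Eventually Consistent)" note in the diagram.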
What we learned
FROM BIG DATA ON CLOUD
Pros & Cons

Aspect                 IDC                                     AWS
Data Capacity          Limited by physical rack space          No limitation within a reasonable amount
Computation Capacity   Limited by physical rack space          No limitation within a reasonable amount
DevOps                 Hard, on physical machines / a VM farm  Easy, since code is everything (CI/CD)
Scalability            Hard, on physical machines / a VM farm  Easy, relying on AWS ELB and Auto Scaling groups
Pros & Cons

Aspect             IDC                                     AWS
Disaster Recovery  Hard, on physical machines / a VM farm  Easy, since code is everything
Data Location      Limited to the IDC location             Varied and easy thanks to AWS's multiple regions
Cost               Implied in Total Cost of Ownership      Acceptable cost with Cost-Conscious Design

Some more details…
We Are Hiring!
Backup
AE SaaS Architecture Design
High Level Architecture Design

[Diagram: In AWS Oregon (us-west-2), the SPN VPC spans AZa/AZb/AZc. AE API servers sit behind a private ELB in a multi-AZ Auto Scaling group, backed by RDS and registered in Eureka. EMR clusters with workers run in time-based Auto Scaling groups and read cross-account S3 buckets. Internal services reach AE over HTTPS via VPC peering; users reach AE and Cloud Storage over HTTPS/HTTP Basic from the Internet; Amazon SNS delivers notifications. The IDC (SJC1) connects via VPN and peering, hosting a CI slave and a Splunk forwarder that ships logs to Splunk.]
What is Netflix Genie
• A practice from Netflix
• A Hadoop client to submit jobs to EMR
• Flexible data model design to adapt to different kinds of clusters
• Flexible job/cluster matching design (based on tags)
• Cloud characteristics built into the design
  – e.g. auto-scaling, load balancing, etc.
• Its goal is plain & simple
• We use it as an internal component

https://github.com/Netflix/genie/wiki
What is Netflix Eureka
• A RESTful service built by Netflix
• A critical component for Genie to do load balancing and failover

[Diagram: Genie load-balancing across multiple API instances registered in Eureka]
Confidential | Copyright 2016 TrendMicro Inc.
AWS EMR (Elastic MapReduce)
http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013
http://www.slideshare.net/AmazonWebServices/deep-dive-amazon-elastic-map-reduce?from_action=save
Lessons Learned on AWS: Details
Different types of Auto Scaling group

• OpsWorks (provision via CloudFormation AWS::OpsWorks::Instance.AutoScalingType; configure via Chef recipes)
  – 24/7
    • Manual creation/deletion
    • Configure one instance per AZ
  – Time-based
    • Can specify time slot(s) in hour units, every day or on any day of the week
    • Configure one instance per AZ
  – Load-based
    • Can specify CPU/MEM/workload averages based on an OpsWorks layer
    • UP: when to increase instances; DOWN: when to decrease instances
    • No max./min. instance count setting
    • Configure one instance per AZ
• EC2 (provision via CloudFormation AWS::AutoScaling::AutoScalingGroup and AWS::AutoScaling::LaunchConfiguration; configure via user-data)
  – Can set max./min. number of instances
  – Multi-AZ support
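The "time-based" idea can also be expressed against the EC2 Auto Scaling API (boto3 `autoscaling`, `put_scheduled_update_group_action`). A hedged sketch; the group name, schedule, and sizes below are placeholders, not values from the talk:

```python
def scheduled_action_params(group, name, cron, min_size, max_size, desired):
    """Build parameters for a recurring (cron-based) scheduled scaling action."""
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "Recurrence": cron,            # cron expression, in UTC
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }

# asg = boto3.client("autoscaling")
# asg.put_scheduled_update_group_action(
#     **scheduled_action_params("ae-workers", "scale-up-morning",
#                               "0 8 * * MON-FRI", 2, 10, 4))
```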
ELB + Auto-Scaling Group
• ELB
  – Health check: determines the route for incoming requests
• Auto Scaling groups
  – Monitor EC2 instances via CloudWatch
  – If an EC2 instance is abnormal, terminate it and start a new one
• ELB + Auto Scaling group
  – Automatically attach/detach EC2 instance(s) to the ELB when the Auto Scaling group launches/terminates EC2 instances
http://docs.aws.amazon.com/autoscaling/latest/userguide/autoscaling-load-balancer.html
Auto Recovery based on Monit
• OpsWorks already uses Monit for auto recovery
  – Leverage the Monit agent on EC2
  – We already have practices from on-premises
[Diagram: API servers across AZ1 and AZ2, each monitored by Monit]
https://mmonit.com/monit/
Auto Scaling group
• Instance check by CloudWatch
• Process check by Monit
  • No process – restart the process
  • Process health check failed – terminate the EC2 instance
  • EC2 terminated! The Auto Scaling group launches a new EC2 instance