Achieve Big Data Analytic Platform with Lambda Architecture on Cloud
TRANSCRIPT
Achieve Big Data Analytic Platform with Lambda Architecture on Cloud
Scott Miao & SPN Infra., Trend Micro
9/10/2016
Who am I
• Scott Miao
• RD, SPN, Trend Micro
• Hadoop ecosystem for about 6 years
• AWS for Big Data for about 3 years
• Expertise in HDFS/MR/HBase/AWS EMR
• @takeshimiao
• @slideshare
Agenda
• Why go on Cloud
• Common Cloud Services in Trend
• Lambda Architecture on Cloud
• Serving Layer as-a-Service
• What we learned
Why go on Cloud
Data volume increases 1.5 ~ 2x every year
Growth becomes 2x
Return on Investment
• On traditional infra., we put a lot of effort into service operations
• On the Cloud, we can leverage its elasticity to automate our services
• More focus on innovation!
[Chart: Revenue vs. Cost over Time]
Why AWS ?
AWS is a leader among IaaS platforms
Source: Gartner (May 2015)
https://www.gartner.com/doc/reprints?id=1-2G2O5FC&ct=150519&st=sb
AWS Evaluation
• Cost acceptable
• Functionalities satisfied
• Performance satisfied
Common Cloud Services in Trend
ANALYTIC ENGINE + CLOUD STORAGE
Common Services on the Cloud
• Cloud CI/CD
• Common Auth
• Analytic Engine
• Cloud Storage
• AE + CS
Analytic Engine
• Computation service for Trenders
• Based on AWS EMR
• Simple RESTful API calls
• Computing on demand
  • Short-lived
  • Long-running
• No operation effort
• Pay by computing resources
Cloud Storage
• Storage service for Trenders
• Based on AWS S3
• Simple RESTful API calls
• Share data with everyone in one place
• Metadata search for files
• No operation effort
• Pay by storage size used
Analytic Engine is a…
A common Big Data computation service on the Cloud (AWS)
Major Features in a nutshell
[Diagram: AE major features]
• submitJob API call to AE; AE calls EMR createCluster
• Input from CS: cs paths, CS metadata search; Pig UDFs supported
• Output to CS with metadata
• Visibility UIs: cost visibility (AWS Cost Explorer), client logs (SumoLogic), cluster info (Proxy Gateway)
• Fully HA, fully automated, auto recovery
Supported use cases
1. User creates a cluster
2. User can create multiple clusters as needed
3. User submits a job to a target cluster to run
4. AE delivers the job to a secondary cluster if the target cluster is down
5. Users from a different group are not allowed to submit jobs to the cluster(s)
6. Users from a different group are not allowed to delete a cluster
7. Only users from the same group are allowed to delete a cluster
8. User wants to know what their current cost is
9. User wants to troubleshoot his/her submitted job
10. User wants to observe his/her cluster status
1. User invokes submitJob
2. Auth service checks the user's credentials
3. AE knows the user's name and group
4. AE matches the job and delivers it to the target cluster
5. AE pulls data from CS
6. Job runs on the target cluster
7. AE outputs the result to CS
8. AE sends a message to an SNS topic if the user specified one
Usecase#3 – User submits job to target cluster to run (1/4)
[Diagram: submitJob flow. (1) A user invokes submitJob against AE SaaS with clusterCriteria [['sched:adhoc', 'env:prod'], ['env:prod']]. (2) The Auth Service validates the user as a member of the SPN group. (3)(4) AE matches the criteria against cluster tags: the EMR cluster tagged group:SPN, 'sched:adhoc', 'env:prod' matches the first criterion, while the one tagged group:SPN, 'sched:routine', 'env:prod' does not. (5) AE pulls input from Cloud Storage, (6) the job runs on the matched cluster, (7) the result is written back to Cloud Storage, and (8) a notification is sent.]
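The tag-matching step in this flow can be sketched in code. This is a hypothetical illustration of Genie-style matching, not the real AE implementation: clusterCriteria is an ordered list of tag sets, and the first cluster whose tags contain every tag in a criterion wins.

```python
def match_cluster(cluster_criteria, clusters):
    """Return the first cluster satisfying the highest-priority criterion."""
    for criterion in cluster_criteria:            # try criteria in priority order
        for cluster in clusters:
            if set(criterion) <= set(cluster["tags"]):
                return cluster
    return None                                   # no cluster matched any criterion

# Illustrative clusters mirroring the diagram above
clusters = [
    {"name": "routine-cluster", "tags": ["group:SPN", "sched:routine", "env:prod"]},
    {"name": "adhoc-cluster",   "tags": ["group:SPN", "sched:adhoc",   "env:prod"]},
]
criteria = [["sched:adhoc", "env:prod"], ["env:prod"]]
print(match_cluster(criteria, clusters)["name"])  # adhoc-cluster
```

If no cluster matches the first criterion, the second, looser criterion ["env:prod"] acts as a fallback, which is how a job can be delivered to a secondary cluster.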
Usecase#3 – User submits job to target cluster to run (2/4)
• Sample payload of submitJob API
{
  "clusterCriterias": [
    { "tags": [ "sched:adhoc", "env:prod" ] },
    { "tags": [ "env:prod" ] }
  ],
  "commandArgs": "$inputPaths $outputPaths", // see below
Usecase#3 – User submits job to target cluster to run (3/4)
2
  // see previous
  "fileDependencies": "s3://path/to/my/main.sh,s3://path/to/my/test.pig",
  "inputPaths": [
    "cs://path/to/my/input/data"
    // or you can use metadata search for input data
    // "csq://first_entry_date:['2016-05-30T09:00:000Z','2016-05-30T09:01:000Z'}"
  ],
  "name": "SubmitJob_pig_cs_to_cs_csq",
  "outputPaths": [ "cs://path/to/my/output/result" ],
  "tags": [ "env:my-test" ],
  "notifyTo": "arn:aws:sns:us-east-1:123456789123:my-sns"
}
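A client would build and serialize this payload before POSTing it to the AE RESTful API. A minimal sketch, assuming only the fields shown on the slide (the endpoint and auth mechanism are not shown here):

```python
import json

# Build the submitJob payload from the sample above; the JSON body produced
# here is what would be POSTed to the AE API.
payload = {
    "clusterCriterias": [
        {"tags": ["sched:adhoc", "env:prod"]},   # preferred: ad-hoc prod cluster
        {"tags": ["env:prod"]},                  # fallback: any prod cluster
    ],
    "commandArgs": "$inputPaths $outputPaths",
    "fileDependencies": "s3://path/to/my/main.sh,s3://path/to/my/test.pig",
    "inputPaths": ["cs://path/to/my/input/data"],
    "name": "SubmitJob_pig_cs_to_cs_csq",
    "outputPaths": ["cs://path/to/my/output/result"],
    "tags": ["env:my-test"],
    "notifyTo": "arn:aws:sns:us-east-1:123456789123:my-sns",
}
body = json.dumps(payload)  # request body for the submitJob call
```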
Usecase#3 – User submits job to target cluster to run (4/4)
• All existing job types used on-premises are supported
  • Pure MR
  • Pig and UDFs
  • Hadoop streaming – Python, Ruby, etc.
Usecase#8 – User wants to know what their current cost is (1/2)
• Billing & Cost Management -> Cost Explorer -> Launch Cost Explorer
• Filter by tags: "sys = ae" and "comp = emr" and "other = <your-cluster-name>"
• Group by Service
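The same tag filter can be expressed programmatically. A hedged sketch using the modern AWS Cost Explorer API via boto3 (which postdates this talk); the date range and cluster name are placeholders:

```python
def build_cost_filter(cluster_name):
    """Build a Cost Explorer filter: sys=ae AND comp=emr AND other=<cluster>."""
    return {
        "And": [
            {"Tags": {"Key": "sys",   "Values": ["ae"]}},
            {"Tags": {"Key": "comp",  "Values": ["emr"]}},
            {"Tags": {"Key": "other", "Values": [cluster_name]}},
        ]
    }

# The filter would be passed to the API roughly like this:
# ce = boto3.client("ce")
# ce.get_cost_and_usage(
#     TimePeriod={"Start": "2016-09-01", "End": "2016-09-30"},
#     Granularity="DAILY",
#     Metrics=["UnblendedCost"],
#     GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],  # "Group by Service"
#     Filter=build_cost_filter("spn-stg"),
# )
```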
Usecase#8 – User wants to know what their current cost is (2/2) - Billing and Cost Analysis
• Attach tags to your AWS resources

Tag Key   Tag Value (sample)   Description
name      aesaas-s-11-api      *optional* for AWS Cost Explorer
stack     aesaas-s-11          *optional* for AWS Cost Explorer
service   aesaas               *optional* for AWS Cost Explorer
owner     spn                  *required* the bill is under whose budget
env       prod|stg|dev         *required* environment type
sys       ae                   *required* the system name
comp      api-server|emr       *required* the subcomponent name
other     spn-stg              *optional* a tag free for other usage
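Applying the required tags from this table can be sketched with boto3's EC2 `create_tags` call. The helper function and resource ID are illustrative, not part of the talk:

```python
def build_tags(owner, env, sys_name, comp, other=None):
    """Assemble the required (and one optional) tags in the shape boto3 expects."""
    tags = [
        {"Key": "owner", "Value": owner},     # *required* whose budget
        {"Key": "env",   "Value": env},       # *required* prod|stg|dev
        {"Key": "sys",   "Value": sys_name},  # *required* the system name
        {"Key": "comp",  "Value": comp},      # *required* the subcomponent name
    ]
    if other:
        tags.append({"Key": "other", "Value": other})  # *optional* free usage
    return tags

# ec2 = boto3.client("ec2")
# ec2.create_tags(Resources=["i-0123456789abcdef0"],
#                 Tags=build_tags("spn", "stg", "ae", "emr", other="spn-stg"))
```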
Why do we use AE instead of EMR directly?
• Abstraction
  • Avoid lock-in
  • Hide implementation details behind the scenes
• AWS EMR was not designed for long-running jobs
  • >= AMI-3.1.1 – 256 ACTIVE or PENDING jobs (STEPs)
  • < AMI-3.1.1 – 256 jobs in total
• Better integration with other common services
• Keep our hands off AWS-native code
• Centralized Authentication & Authorization
  • Leverage our internal LDAP server
  • No AWS tokens for users
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/AddingStepstoaJobFlow.html
Lambda Architecture on Cloud
Next Phase
[Roadmap: Cloud Infra. → AE-v1.0 → AE + CS (v1.1~) → Lambda arch.]
What is Lambda (λ) Architecture
[Diagram: Data Ingestion feeds both the Batch Layer (Master Dataset → Batch Processing → Batch View) and the Speed Layer (Streaming Processing → Real-Time View); the Serving Layer merges the Batch View and Real-Time View into a Merged View exposed through a Data Access API. The Batch Layer and Serving Layer are the parts offered as-a-Service.]
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods
https://en.wikipedia.org/wiki/Lambda_architecture
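The serving layer's "merged view" idea can be shown with a toy example: a precomputed batch view is combined with a real-time view that covers only events since the last batch run. The keys and counts below are illustrative, not data from the talk:

```python
def merged_view(batch_view, realtime_view):
    """Merge per-key counts from the batch layer and the speed layer."""
    merged = dict(batch_view)                 # start from the batch view
    for key, count in realtime_view.items():  # overlay recent stream results
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view    = {"malware-A": 1000, "malware-B": 40}  # from batch processing
realtime_view = {"malware-A": 7, "malware-C": 3}      # from stream processing
print(merged_view(batch_view, realtime_view))
```

Queries read the merged view, so results stay fresh between batch runs while the batch layer keeps recomputing from the immutable master dataset.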
Serving Layer as-a-Service
METADATA STORE
Goals
Help everyone easily access metadata shared by several teams
• Access data in one place
• Avoid storage duplication
• Share immediately with all
• Provide unified intelligence
Common metadata storage for several services
• Abstracted to hide infra & ops
• Customized for different needs
(on AWS)
Usecase
• Store all threat entities in one place from the moment they are born
  – Every team can leverage contributions from other teams at a very early stage
Features
Metadata Store Service
• Random Writes
• Bulk Writes
• Sync Query
• Async Query
• Automatic Provision
• Customizable Schema
• Unified Intelligence
• Threat Monitor
Borrowing an idea from the Star Schema
• A schema design widely used in data warehousing
• Fact table: historical data – measurements or metrics for a specific event
• Dimension tables: descriptive attributes – characteristics to describe and select the fact data
Basic Idea
• Refer to the Star Schema design
  – Fact table
    • Put all records into this table (Single Source of Truth)
    • Affordable random and bulk-load writes
    • Fast random reads by rowkey
  – Dimension table
    • Fast and flexible information discovery
    • Get rowkeys of records stored in the Fact table
    • Then retrieve the records by rowkey
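The read path described above can be sketched as: query a dimension index for rowkeys, then fetch full records from the fact table by rowkey. In the real design the fact table is HBase or DynamoDB and the index is Solr or Elasticsearch; plain dicts stand in for them here, and all record contents are made up:

```python
fact_table = {  # rowkey -> full record (single source of truth)
    "rk-001": {"entity": "bad-domain.example", "type": "domain", "score": 90},
    "rk-002": {"entity": "203.0.113.7",        "type": "ip",     "score": 55},
}
dimension_index = {  # attribute value -> rowkeys (flexible discovery)
    ("type", "domain"): ["rk-001"],
    ("type", "ip"):     ["rk-002"],
}

def query(attr, value):
    """Dimension lookup first, then random reads on the fact table by rowkey."""
    rowkeys = dimension_index.get((attr, value), [])
    return [fact_table[rk] for rk in rowkeys]

print(query("type", "domain"))
```

The split keeps writes cheap (one table to load) while still supporting fast, flexible reads through whichever index engine fits the access pattern.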
Reference Implementation – Part 1
• This Star Schema concept can be fulfilled by different implementations
• A famous one is HBase + Indexer + Solr
http://www.hadoopsphere.com/2013/11/the-evolving-hbase-ecosystem.html
https://community.hortonworks.com/articles/1181/hbase-indexing-to-solr-with-hdp-search-in-hdp-23.html
Reference Implementation – Part 2
http://www.slideshare.net/AmazonWebServices/bdt310-big-data-architectural-patterns-and-best-practices-on-aws #p57
[Diagram: Random and bulk writes land in Fact Tables (DynamoDB); DynamoDB Streams feed a Propagator that, driven by Propagation Rules, propagates data (eventually consistently) to Dimension Tables with different schemas and engines: Elasticsearch, MySQL (RDS), and DynamoDB.]
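The propagator's job can be sketched as: consume stream records from the fact table and route each new item to dimension stores according to propagation rules. The record shape loosely follows DynamoDB Streams events, but the store names, rule format, and data are all illustrative:

```python
def propagate(stream_records, rules, dimension_stores):
    """Route fact-table inserts to dimension stores per the propagation rules."""
    for record in stream_records:
        if record["eventName"] != "INSERT":   # only propagate new items here
            continue
        item = record["dynamodb"]["NewImage"]
        for rule in rules:
            if rule["when"](item):            # propagation rule predicate
                dimension_stores[rule["target"]].append(item)

stores = {"elasticsearch": [], "mysql": []}
rules = [
    {"when": lambda it: it["type"] == "domain", "target": "elasticsearch"},
    {"when": lambda it: "score" in it,          "target": "mysql"},
]
records = [{"eventName": "INSERT",
            "dynamodb": {"NewImage": {"type": "domain", "score": 90}}}]
propagate(records, rules, stores)  # both rules match this record
```

Because propagation happens off the stream, dimension tables lag the fact table slightly, which is exactly the "(Eventually Consistent)" note in the diagram.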
What we learned
FROM BIG DATA ON CLOUD
Pros & Cons

Aspect                 IDC                                     AWS
Data Capacity          Limited by physical rack space          No limitation within a reasonable amount
Computation Capacity   Limited by physical rack space          No limitation within a reasonable amount
DevOps                 Hard, on physical machines / a VM farm  Easy, since code is everything (CI/CD)
Scalability            Hard, on physical machines / a VM farm  Easy, relying on AWS ELB and Auto Scaling groups
Pros & Cons

Aspect             IDC                                     AWS
Disaster Recovery  Hard, on physical machines / a VM farm  Easy, since code is everything
Data Location      Limited to the IDC location             Varied and easy thanks to AWS's multiple regions
Cost               Implied in Total Cost of Ownership      Acceptable cost with Cost-Conscious Design

Some more details…
We Are Hiring!
Backup
AE SaaS Architecture Design
High Level Architecture Design

[Diagram: In AWS Oregon (us-west-2), the SPN VPC spans AZa/AZb/AZc. AE API servers sit behind a private ELB in a multi-AZ Auto Scaling group, backed by RDS and registered in Eureka. EMR clusters with workers run in time-based Auto Scaling groups and read cross-account S3 buckets. Internal services reach AE over HTTPS via VPC peering; users reach AE and Cloud Storage over HTTPS/HTTP Basic from the Internet; Amazon SNS delivers notifications. The IDC (SJC1) connects via VPN and peering, hosting a CI slave and a Splunk forwarder that ships logs to Splunk.]
What is Netflix Genie
• A practice from Netflix
• A Hadoop client to submit jobs to EMR
• Flexible data model design to adapt to different kinds of clusters
• Flexible job/cluster matching design (based on tags)
• Cloud characteristics built into the design
  – e.g. auto-scaling, load balancing, etc.
• Its goal is plain & simple
• We use it as an internal component

https://github.com/Netflix/genie/wiki
What is Netflix Eureka
• A RESTful service built by Netflix
• A critical component for Genie to do load balancing and failover

[Diagram: Genie load-balancing across multiple API instances registered in Eureka]
Confidential | Copyright 2016 TrendMicro Inc.
AWS EMR (Elastic MapReduce)
http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013
http://www.slideshare.net/AmazonWebServices/deep-dive-amazon-elastic-map-reduce?from_action=save
Lessons Learned on AWS: Details
Different types of Auto Scaling group

• OpsWorks (provision via CloudFormation AWS::OpsWorks::Instance.AutoScalingType; configure via Chef recipes)
  – 24/7
    • Manual creation/deletion
    • Configure one instance per AZ
  – Time-based
    • Can specify time slot(s) in hour units, every day or on any day of the week
    • Configure one instance per AZ
  – Load-based
    • Can specify CPU/MEM/workload averages based on an OpsWorks layer
    • UP: when to increase instances; DOWN: when to decrease instances
    • No max./min. instance count setting
    • Configure one instance per AZ
• EC2 (provision via CloudFormation AWS::AutoScaling::AutoScalingGroup and AWS::AutoScaling::LaunchConfiguration; configure via user-data)
  – Can set max./min. number of instances
  – Multi-AZ support
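The "time-based" idea can also be expressed against the EC2 Auto Scaling API (boto3 `autoscaling`, `put_scheduled_update_group_action`). A hedged sketch; the group name, schedule, and sizes below are placeholders, not values from the talk:

```python
def scheduled_action_params(group, name, cron, min_size, max_size, desired):
    """Build parameters for a recurring (cron-based) scheduled scaling action."""
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "Recurrence": cron,            # cron expression, in UTC
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }

# asg = boto3.client("autoscaling")
# asg.put_scheduled_update_group_action(
#     **scheduled_action_params("ae-workers", "scale-up-morning",
#                               "0 8 * * MON-FRI", 2, 10, 4))
```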
ELB + Auto-Scaling Group
• ELB
  – Health check: determines the route for incoming requests
• Auto Scaling groups
  – Monitor EC2 instances via CloudWatch
  – If an EC2 instance is abnormal, terminate it and start a new one
• ELB + Auto Scaling group
  – Automatically attach/detach EC2 instance(s) to the ELB when the Auto Scaling group launches/terminates EC2 instances
http://docs.aws.amazon.com/autoscaling/latest/userguide/autoscaling-load-balancer.html
Auto Recovery based on Monit
• OpsWorks already uses Monit for auto recovery
  – Leverage the Monit agent on EC2
  – We already have practices from on-premises
[Diagram: API servers across AZ1 and AZ2, each monitored by Monit]
https://mmonit.com/monit/
Auto Scaling group
• Instance check by CloudWatch
• Process check by Monit
  • No process – restart the process
  • Process health check failed – terminate the EC2 instance
  • EC2 terminated! The Auto Scaling group launches a new EC2 instance