© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Scott Miao, SPN, Trend Micro
2016/5/20
Analytic Engine
A common Big Data computation service on AWS
Who am I
• Scott Miao
• RD, SPN, Trend Micro
• Worked on the Hadoop ecosystem for about 6 years
• Worked on AWS for Big Data for about 3 years
• Expertise in HDFS/MR/HBase
• Speaker at some Hadoop-related conferences
• @takeshi.miao
Agenda
• What problems did we suffer?
• Why AWS?
• Analytic Engine
• The benefits AWS brings to AE
• AE roadmap on AWS
Hadoop Expansion
Data volume increases 1.5–2x every year
Data center issues
• network bottlenecks
• server depreciation
Growth becomes 2x
Return on Investment
• On traditional infrastructure, we put a lot of effort into service operations
• On the Cloud, we can leverage its elasticity to automate our services
• More focus on innovation!
[Chart: Revenue and Cost (Money) over Time]
AWS is a leader of IaaS platform
https://www.gartner.com/doc/reprints?id=1-2G2O5FC&ct=150519&st=sb
Source: Gartner (May 2015)
High Level Architecture
[Diagram] RDs, Researchers, and Services call two common cloud services in Trend through RESTful APIs:
• Analytic Engine (AE) – the common computation service; drives AWS EMR via createCluster / submitJob / deleteCluster
• CloudStorage (CS) – the common storage service; EMR clusters read their input from and write their output to CS
Analytic Engine
• Computation service for Trenders
• Based on AWS EMR
• Simple RESTful API calls
• Computing on demand
• Short-lived
• Long-running
• No operation effort
• Pay by computing resources
Cloud Storage
• Storage service for Trenders
• Based on AWS S3
• Simple RESTful API calls
• Share data with everyone in one place
• Metadata search for files
• No operation effort
• Pay by storage size used
Why do we use AE instead of EMR directly?
• Abstraction
• Avoid vendor lock-in
• Hide implementation details behind the scenes
• AWS EMR was not designed for long-running jobs
• >= AMI-3.1.1 – at most 256 ACTIVE or PENDING jobs (STEPs)
• < AMI-3.1.1 – 256 jobs in total
• Better integration with other common services
• Keep our hands off AWS-native code
• Centralized Authentication & Authorization
• No AWS/SSH keys for users
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/AddingStepstoaJobFlow.html
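The step limit above is one reason AE manages job submission itself rather than forwarding everything straight to EMR. A minimal sketch of such a guard, assuming the documented per-cluster limit; the function name and its use are illustrative, not AE's actual implementation:

```python
# On EMR AMI >= 3.1.1, at most 256 steps may be ACTIVE or PENDING
# on a cluster at once; a middle layer like AE can check headroom
# before handing a step to EMR and queue the job otherwise.
EMR_MAX_ACTIVE_OR_PENDING_STEPS = 256

def cluster_has_headroom(active_or_pending: int) -> bool:
    """True if one more step can be submitted to the cluster right now."""
    return active_or_pending < EMR_MAX_ACTIVE_OR_PENDING_STEPS
```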
Common use cases for AE
• User creates a cluster
• User can create multiple clusters
• User submits a job to a target cluster
• AE delivers the job to a secondary cluster if needed
• User wants to know their cost
Usecase#1 – User creates a cluster
[Diagram: users → AE → EMR]
1. User invokes createCluster
2. AE launches an EMR cluster for the user, with tags attached
   tag: ‘sched:routine’, ‘env:prod’, m3.xlarge * 10
User: “It is a RESTful API, so I can use any client I am familiar with!”
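Since any HTTP client works, a createCluster request body might look like the sketch below. The field names (clusterName, tags, instanceType, instanceCount) are assumptions for illustration; only the tag values and instance sizing come from the slides.

```python
import json

# Hypothetical createCluster request body for AE's RESTful API.
payload = {
    "clusterName": "routine-cluster",              # hypothetical field
    "tags": {"sched": "routine", "env": "prod"},   # tags from the slide
    "instanceType": "m3.xlarge",
    "instanceCount": 10,
}
body = json.dumps(payload)  # what an HTTP client would POST to AE
```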
Usecase#2 – User can create multiple clusters as needed
[Diagram: users → AE → EMR]
1. User invokes createCluster
2. AE launches another new EMR cluster for the user, with tags attached
3. User can create as many clusters as he/she needs
   Existing cluster tag: ‘sched:routine’, ‘env:prod’, m3.xlarge * 10
   New cluster tag: ‘sched:adhoc’, ‘env:prod’, c3.4xlarge * 20
Usecase#3 – User submits a job to a target cluster
[Diagram: users → AE → EMR clusters (tagged ‘sched:routine’, ‘env:prod’ and ‘sched:adhoc’, ‘env:prod’) ↔ CS]
1. User invokes submitJob with clusterCriteria: [[‘sched:adhoc’, ‘env:prod’], [‘env:prod’]]
2. AE matches the job and delivers it to the target cluster
3. AE submits the job
4. EMR pulls input data from CS
5. The job runs on the target cluster
6. EMR outputs the result to CS
7. AE sends a message to an SNS Topic if the user specified one
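The tag-based matching can be sketched as follows: clusterCriteria is an ordered list of tag sets, AE tries the first set, and if no live cluster satisfies it, falls back to the next set, which is also how a job lands on a secondary cluster when the preferred one is down. Function and field names here are illustrative assumptions, not AE's actual code.

```python
# Genie-style job-to-cluster matching on tags (illustrative sketch).
def match_cluster(cluster_criteria, clusters):
    """Return the first live cluster whose tags cover a criteria set."""
    for wanted in cluster_criteria:          # ordered by preference
        for cluster in clusters:
            if cluster["up"] and set(wanted) <= set(cluster["tags"]):
                return cluster
    return None

clusters = [
    {"id": "j-AAAA", "tags": ["sched:routine", "env:prod"], "up": True},
    {"id": "j-BBBB", "tags": ["sched:adhoc", "env:prod"], "up": False},
]
criteria = [["sched:adhoc", "env:prod"], ["env:prod"]]

# The adhoc cluster is down, so the job falls back to the routine
# prod cluster matched by the second, looser criteria set.
chosen = match_cluster(criteria, clusters)
```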
Usecase#4 – AE delivers the job to a secondary cluster if the target cluster is down
[Diagram: users → AE → EMR clusters (tagged ‘sched:routine’, ‘env:prod’ and ‘sched:adhoc’, ‘env:prod’) ↔ CS]
1. User invokes submitJob with clusterCriteria: [[‘sched:adhoc’, ‘env:prod’], [‘env:prod’]]
2. AE matches the job and delivers it to the secondary cluster
3. AE submits the job
4. EMR pulls input data from CS
5. The job runs on the secondary cluster
6. EMR outputs the result to CS
Usecase#5 – User wants to know their current cost
Billing & Cost Management → Cost Explorer → Launch Cost Explorer
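The slide points at the Cost Explorer console; because every cluster carries sched/env tags, costs can be grouped per tag. As a sketch, these are request parameters one might pass to the AWS Cost Explorer API (boto3 `ce` client, `get_cost_and_usage`; the API itself was released after this talk). The dates and tag key are illustrative.

```python
# Illustrative Cost Explorer query: monthly unblended cost grouped by
# the hypothetical "sched" cost-allocation tag used on AE clusters.
params = {
    "TimePeriod": {"Start": "2016-04-01", "End": "2016-05-01"},
    "Granularity": "MONTHLY",
    "Metrics": ["UnblendedCost"],
    "GroupBy": [{"Type": "TAG", "Key": "sched"}],
}
# e.g. boto3.client("ce").get_cost_and_usage(**params)  -- needs AWS creds
```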
Middle Level Architecture
[Diagram, Oregon (us-west-2)]
• AE API servers run in Auto Scaling groups across multiple AZs (AZa/AZb/AZc), behind an internal ELB, with Multi-AZ RDS and Eureka for service discovery
• Workers run in their own Auto Scaling group
• Users and services in the IDC reach AE over the Internet via HTTPS/HTTP Basic or VPN; VPCs are connected by peering
• EMR clusters exchange input/output with cross-account S3 buckets and Cloud Storage (HTTPS/HTTP Basic)
• Amazon SNS delivers job notifications
Pros & Cons
• Data Capacity: IDC – limited by physical rack space; AWS – no limit for reasonable amounts
• Computation Capacity: IDC – limited by physical rack space; AWS – no limit for reasonable amounts
• DevOps: IDC – hard, runs on physical machine/VM farms; AWS – easy, everything is code (Continuous Deployment)
• Scalability: IDC – hard, runs on physical machine/VM farms; AWS – easy, relies on ELB and Auto Scaling groups
Pros & Cons
• Disaster Recovery: IDC – hard, runs on physical machine/VM farms; AWS – easy, everything is code
• Data Location: IDC – limited to the IDC location; AWS – varied and easy thanks to multiple regions
• Cost: IDC – implied in Total Cost of Ownership; AWS – acceptable with a Cloud-optimized design
Middle Level Architecture – provisioning and job flow (Oregon, us-west-2)
1. Pre-built infrastructure by AWS CloudFormation
2. User permissions granted
3. Pre-launched RDS
4. AE SaaS provisioned by CI/CD
5. Users access via VPN; firewall opened for Trend
6. Input from CS or cross-account S3 buckets
7. Computation in the AWS EMR cluster
8. Output to CS or S3
9. Job-end message to Amazon SNS (optional)
What is Netflix Genie
• A practice from Netflix
• A Hadoop client to submit different kinds of jobs
• Flexible data model design to adapt to different kinds of clusters
• Flexible job/cluster matching design (based on tags)
• Cloud characteristics built into the design
• e.g. auto-scaling, load balancing, etc.
• Its goal is plain & simple
• We use it as an internal component
https://github.com/Netflix/genie/wiki