High Performance Computing Implementation on AWS
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pawan Agnihotri – Global Financial Services Solutions Architect
March 23, 2017
High Performance Computing
Implementation on AWS
Risk Management for Financial Services
Risk management is essential to the operations of all
financial services institutions (FSIs).
Types of risk that need to be tested include credit, market,
foreign exchange, liquidity, volatility, and inflation.
Regulatory bodies are requiring FSIs to perform higher levels
of stress testing to maintain adequate capital ratios.
Models
Banks use different models for risk
analysis. Some examples include:
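• CCAR (Comprehensive Capital Analysis and Review)
• CCR (Counterparty Credit Risk)
• VaR (Value at Risk)
• CVaR (Conditional Value at Risk)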
To run these simulations, global FSIs
need large amounts of compute
resources.
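To illustrate why these models are compute-hungry, here is a minimal Monte Carlo sketch of VaR and CVaR. It is not taken from the presentation: the normal return model, the portfolio value, and the function names `simulate_losses` and `var_cvar` are all illustrative assumptions. Production risk engines reprice whole portfolios across many scenarios, which is what drives the demand for grid capacity.

```python
import random


def simulate_losses(n_paths, mean_return, volatility, portfolio_value, seed=0):
    """Draw one-period portfolio losses from a normal return model (an assumption)."""
    rng = random.Random(seed)
    # A positive number is a loss; a negative number is a gain.
    return [-portfolio_value * rng.gauss(mean_return, volatility)
            for _ in range(n_paths)]


def var_cvar(losses, confidence=0.99):
    """Return (VaR, CVaR) at the given confidence level from simulated losses."""
    ordered = sorted(losses)
    # Index of the loss that sits at the confidence quantile.
    cutoff = min(int(confidence * len(ordered)), len(ordered) - 1)
    var = ordered[cutoff]
    # CVaR is the average loss in the tail at or beyond VaR.
    tail = ordered[cutoff:]
    cvar = sum(tail) / len(tail)
    return var, cvar


# Hypothetical inputs, chosen only for illustration.
losses = simulate_losses(n_paths=100_000, mean_return=0.0005,
                         volatility=0.02, portfolio_value=1_000_000)
var_99, cvar_99 = var_cvar(losses, confidence=0.99)
print(f"99% VaR:  {var_99:,.0f}")
print(f"99% CVaR: {cvar_99:,.0f}")
```

Each simulated path is independent of the others, so the work can be spread across as many cores as are available.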
The Challenges
• Datacenter capacity is limited, resulting in simulation backlogs or inadequate risk calculations.
• Financial instruments require flexible compute resources for development and testing.
• Limited capacity results in long run times for simulations.
• Regulatory and market fluctuations require flexible compute capabilities.
• Large upfront investments and ongoing maintenance are required to run on-premises grids.
• Standardized hardware offers limited grid and compute types.
What Is Needed for a Solution
• Security of data (environment isolation and encryption of data at rest).
• Capacity on demand.
• Large amounts of storage capacity for data.
• Availability of different compute types.
Schedule
impact!
The Cluster as Seen by the Application User
Security
AWS Compliance
Key Certifications and Assurance Programs
Security & Compliance
• Access Control
• Identity Management
• Key Management & Storage
• Monitoring & Logs
• Assessment and Reporting
• Resource & Usage Auditing
• Configuration Compliance
• Web Application Firewall
AWS Security: Deep Set of Cloud Security Tools
• Encryption: Key Management Service, CloudHSM, Server-side Encryption
• Networking: Virtual Private Cloud, Web Application Firewall
• Compliance: Config, CloudTrail, Inspector, Service Catalog
• Identity: IAM, Active Directory Integration, SAML Federation
Compute Performance
Performance Factors: Compute Capacity
Performance Factors: Networks
AWS proprietary 10Gb networking
• Highest performance in .8xlarge instance sizes
• Full bisection bandwidth in placement groups
Enhanced networking
• Available on D2, C3, C4, M4, R3, I2
• Over 1M PPS performance, reduced instance-to-instance latencies, consistent performance
Performance Factors: Storage
Locally attached or “instance storage”
Amazon EBS General Purpose (SSD) volumes
Amazon EBS Provisioned IOPS (SSD) volumes
Amazon EBS Magnetic volumes
Amazon S3 and Amazon Glacier for object storage
Performance Factors: CPU
Intel Xeon E5-2670 (Sandy Bridge) CPUs
• Available on M3, CC2, CR1, and G2 instance types
Intel Xeon E5-2680 v2 (Ivy Bridge) CPUs
• Available on C3, R3, and I2 instance types
• 2.8 GHz in C3, Turbo enabled up to 3.6 GHz
• Supports Enhanced Advanced Vector Extensions (AVX) instructions
Intel Xeon E5-2666 v3 (Haswell – AVX2) CPUs
• Available on C4, D2, and M4 instance types
• 2.9 GHz in C4, Turbo enabled up to 3.5 GHz (with Intel Turbo Boost)
• Supports AVX2 instructions
EC2 Instances: Types and Sizes
c4.large
• Instance family: c
• Instance generation: 4
• Instance size: large
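The naming scheme can be made concrete with a small parser. This is a sketch only: `parse_instance_type` is a hypothetical helper, and its pattern covers just the simple `<family><generation>.<size>` names that appear in these slides (for example c4.large, p2.16xlarge, cc2.8xlarge), not every name AWS has used.

```python
import re

# Letters (family), digits (generation), a dot, then the size.
_INSTANCE_TYPE = re.compile(r"([a-z]+)(\d+)\.([a-z0-9]+)")


def parse_instance_type(name):
    """Split an EC2 instance type name into family, generation, and size."""
    match = _INSTANCE_TYPE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, size = match.groups()
    return {"family": family, "generation": int(generation), "size": size}


for example in ("c4.large", "p2.16xlarge", "cc2.8xlarge"):
    print(example, parse_instance_type(example))
```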
New GPU Instance Types: P2
New EC2 GPU instance type, specifically for accelerated computing:
• Offers up to eight NVIDIA Tesla K80 accelerators
The 16xlarge size provides:
• Combined 192 GB of GPU memory
• 40 thousand CUDA cores
• 70 teraflops of single-precision floating-point performance
• Over 23 teraflops of double-precision floating-point performance
Target workloads:
• Deep learning, computational fluid dynamics, computational finance, seismic analysis, molecular modeling, genomics, rendering
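The aggregate p2.16xlarge figures can be broken down per GPU with simple arithmetic. The sketch below assumes the 16-GPU count listed for p2.16xlarge in the sizes table (each K80 accelerator carries two GPUs). The slide's "40 thousand" and "over 23 teraflops" are rounded, so the results are approximate.

```python
# Aggregate figures quoted for the p2.16xlarge size.
GPUS = 16                      # assumption: GPU count from the P2 sizes table
TOTAL_GPU_MEMORY_GB = 192
TOTAL_CUDA_CORES = 40_000      # "40 thousand" on the slide, a rounded figure
SINGLE_PRECISION_TFLOPS = 70
DOUBLE_PRECISION_TFLOPS = 23   # the slide says "over 23"

per_gpu = {
    "memory_gb": TOTAL_GPU_MEMORY_GB / GPUS,
    "cuda_cores": TOTAL_CUDA_CORES / GPUS,
    "sp_tflops": SINGLE_PRECISION_TFLOPS / GPUS,
    "dp_tflops": DOUBLE_PRECISION_TFLOPS / GPUS,
}

for name, value in per_gpu.items():
    print(f"{name}: {value:g}")
```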
P2 Instance Types
Available in three sizes:
Instance Size   GPUs   P2P   vCPUs   Memory (GiB)   Network Bandwidth*
p2.xlarge       1      -     4       61             1.25 Gbps
p2.8xlarge      8      Y     32      488            10 Gbps
p2.16xlarge     16     Y     64      976            20 Gbps
*In a placement group
Grid Reference Architecture
[Architecture diagram; labeled components:]
• Corporate data center and AWS cloud
• Virtual private cloud (10.40.0.0/16)
• Subnet (10.40.10.0/20) with placement group
• Nodes: MSS node, scheduler node, compute nodes, metadata servers, datanode servers
• Storage: Amazon S3, EFS
• Supporting services: IAM role, Amazon CloudWatch, AWS CloudFormation, AWS CloudTrail, AWS Config, AWS KMS
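The two address ranges in the diagram can be sanity-checked with Python's standard `ipaddress` module. This sketch assumes that 10.40.0.0/16 is the VPC range and 10.40.10.0/20 is the subnet range, which is the usual reading but is not stated outright on the slide. Note that 10.40.10.0 is not on a /20 boundary, so the range is parsed with `strict=False` to get the network that contains it.

```python
import ipaddress

# Assumed roles: the /16 is the VPC, the /20 is the subnet.
vpc = ipaddress.ip_network("10.40.0.0/16")

# strict=False masks off the host bits instead of raising ValueError.
subnet = ipaddress.ip_network("10.40.10.0/20", strict=False)

print("Normalized subnet:", subnet)
print("Subnet inside VPC:", subnet.subnet_of(vpc))
print("Addresses in subnet:", subnet.num_addresses)
```

A /20 holds 4,096 addresses, which caps how many compute nodes a single subnet of that size can contain.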
Grid Operation
Amazon S3
The Old Way: Low Utilization, High Costs
Typical cluster utilization rates are low due to the need to deploy for peak times.
[Chart: cost vs. time; labels: server acquisition, total servers deployed, actual demand for computing, unused IT resources]
The Cloud Way: Scalability When Needed
Reduced time, project acceleration
Scale higher to reduce time-to-results: shorter wait times, greater agility, faster innovation cycles
[Chart: previous peak (31K cores), new peak (62K cores)]
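The time-to-results argument can be sketched with ideal scaling: for a perfectly parallel workload, the total core-hours stay fixed, so doubling the core count halves the wall-clock time. The job size below is hypothetical, and real workloads scale less than ideally because of scheduling, data movement, and serial steps.

```python
def ideal_runtime_hours(core_hours, cores):
    """Wall-clock hours for a perfectly parallel job (an idealized assumption)."""
    return core_hours / cores


JOB_CORE_HOURS = 310_000  # hypothetical job size, for illustration only

for label, cores in (("previous peak", 31_000), ("new peak", 62_000)):
    hours = ideal_runtime_hours(JOB_CORE_HOURS, cores)
    print(f"{label}: {cores:,} cores -> {hours:g} hours")
```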
Scenarios for 1 Petaflop Cluster
Assumptions:
• 1 Petaflop total computing on AWS
• 636 Gigaflops for each m4.10xlarge instance
• 1572 total m4.10xlarge instances
• 31,447 total Xeon cores (E5-2676 v3 Intel Haswell)
• 251TB total RAM (8GB RAM per core)
• EC2 instance type selected for modeling purposes is m4.10xlarge. Other instance types and sizes are available, and may be recommended for cost optimization or to optimize for specific workloads
• Utilization for comparison purposes is assumed to be 60%
• Storage (1000TB) modeled as a blend of S3 object storage, Glacier, EFS, and EBS
• Persistent (head) nodes and license server nodes are assumed to be 100% utilized
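The sizing assumptions for the 1 petaflop cluster can be reproduced with a few lines of arithmetic. The sketch assumes 20 physical cores per m4.10xlarge (40 vCPUs with two hardware threads per core), which is not stated on the slide. The slide's core count matches the unrounded instance count rather than 1572 whole instances.

```python
PETAFLOP_IN_GFLOPS = 1_000_000
GFLOPS_PER_INSTANCE = 636        # per m4.10xlarge, from the slide
CORES_PER_INSTANCE = 20          # assumption: 40 vCPUs = 20 physical cores
GB_RAM_PER_CORE = 8              # from the slide

# Fractional instance count needed to reach 1 petaflop.
instances = PETAFLOP_IN_GFLOPS / GFLOPS_PER_INSTANCE
cores = instances * CORES_PER_INSTANCE
ram_tb = cores * GB_RAM_PER_CORE / 1000

print(f"Instances: {instances:.1f}")
print(f"Cores:     {cores:,.0f}")
print(f"RAM (TB):  {ram_tb:.1f}")
```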
1. Scenario 1 (50% Reserved Instances, 50% Spot)
2. Scenario 2 (25% Reserved Instances, 75% Spot)
Scenarios for 1 Petaflop Peak Core Cluster
• Scenario 1: 31,447 cores, 50% Reserved Instances, 50% Spot
• Scenario 2: 31,447 cores, 25% Reserved Instances, 75% Spot
Cost Structure 1 – 50% RI, 50% Spot
31,447 cores: 50% Reserved Instances, 50% Spot
Summary
Total Compute Cost: $0.025 per core, per hour
Cost Structure 2 – 25% RI, 75% Spot
31,447 cores: 25% Reserved Instances, 75% Spot
Summary
Total Compute Cost: $0.02 per core, per hour
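The two per-core rates can be turned into an hourly figure for the full cluster. This sketch multiplies each quoted rate by the 31,447-core count and nothing more. It does not apply the 60% utilization assumption, because the slides do not say whether the per-core rates already reflect it, and it excludes storage and persistent nodes.

```python
CORES = 31_447

# Per-core, per-hour compute rates quoted in the two cost summaries.
RATES = {
    "50% RI / 50% Spot": 0.025,
    "25% RI / 75% Spot": 0.02,
}

# Hourly compute cost with every core running.
hourly = {name: CORES * rate for name, rate in RATES.items()}

for name, cost in hourly.items():
    print(f"{name}: ${cost:,.2f} per hour")

saving = 1 - RATES["25% RI / 75% Spot"] / RATES["50% RI / 50% Spot"]
print(f"Scenario 2 is {saving:.0%} cheaper per core-hour")
```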