© 2006, Cisco Systems, Inc. All rights reserved.
© 2008 Cisco Systems, Inc. All rights reserved. Cisco Public. BRKAPP-2011
Scaling Applications in a Clustered Environment
BRKAPP-2011
Agenda
Cluster Basics
Database Clusters: Oracle RAC Implementation
Financial Clusters: Market Feed/Algorithmic Trading, Compute Cluster
High Performance Computing Clusters: HPC Applications, Parallel Applications, Messaging
Data Delivery: NAS, Clustered NAS, Block/File Parallel File Systems, Object-Based Parallel File Systems
Three Facets of Latency
Cluster Basics
What Is Clustering?
Cluster: two or more interconnected computers that provide:
Application High Availability
Load Distribution
Distributed and High Performance Computing (HPC)
Clustering can be implemented at different levels of the system:
Storage Abstraction: shared disk, mirrored disk, and shared nothing
Operating systems: UNIX/Linux server clusters, Microsoft clustering
APIs: PVM, MPI, DAPL
Applications (includes Database)—Three Major Categories
Compute intensive
I/O intensive
Transaction intensive
Defining Clustered Servers
[Diagram: database clusters on Ethernet (IBM DB2 Parallel, Oracle RAC) over an IP infrastructure, and database clusters on InfiniBand (IBM DB2 Parallel, MySQL Cluster) over a server switch fabric, alongside application servers, a multi-protocol gateway, and a storage network. Categories shown: Database, Financial Trading and Compute, High Performance Computing.]
Defining Clustered Servers
[Diagram: processor farms 1 through 6 attached to a GRID/HPC computing fabric with high bandwidth, low latency, IP multicast, security services, and bandwidth control; a master node, management network, and I/O network; fabric-hosted and fabric-assisted applications, including the Application Control Engine (SSL/IPSec VPN, server load balancing, application message services, security services); storage via NFS/PVFS/NAS, SAN, InfiniBand-attached storage, storage virtualization, and data replication services. Categories shown: Database, Financial Trading and Compute, High Performance Computing.]
Database Cluster Oracle Implementation
Oracle RAC in the Data Center
Clustered Implementation (RAC)
Latency sensitivity for inter-process communications
Bandwidth sensitivity for data delivery
Interconnect density and bandwidth: 10 Gbps solutions, either 10G Ethernet or InfiniBand
Oracle RAC
Development started in 2002, Oracle 9 RAC
Implementations supporting Ethernet and InfiniBand
Offload implementations for user-space UDP:
Ethernet: Solarflare (formerly Level 5) NIC
InfiniBand: DAPL (uDAPL) chosen due to its network-independent model
Required changing IPC communications infrastructure for IB
Eventually discarded due to massive internal code change requirements
Oracle RAC Optimization
DB IPC communication acceleration
DB to App tier potential acceleration with a 10 Gig class network
Oracle 11i AS actually has the ability to leverage SDP in Asynchronous I/O mode (RDMA) with IB and using iWARP for IB and Ethernet with OFED 1.2
Oracle 10g uses UDP—IB will use IPoIB-CM
Oracle 11g RAC: RDS standard within OFED 1.3
Basic Multi-Tier Oracle Environment
[Diagram: Web, App, and DB tiers]
Oracle Bottlenecks
[Diagram: Web, App, and DB tiers]
Storage IOPs
DB IPC
App/DB IPC
Blade Servers
Blade servers are the Gillette of the server world:
Buy one chassis
Plug in your blades
When they get dull, just swap them out for sharp ones
Any “dumb” datacenter tech can swap a blade
Datacenter in a box: Ethernet and high speed connections in one box
Most blades are limited to two high speed ports of the same type (usually just a single host adapter)
Basic Multi-Tier Oracle Environment with Blade Servers
[Diagram: Web, App, and DB tiers on blade servers; the interconnect choice is the open question]
Blade Servers
Ethernet and Fibre Channel; Ethernet and InfiniBand
Ethernet/Fiber Channel and Blades?
[Diagram: Web, App, and DB tiers]
10GbE TOE, 10GbE iWARP
Ethernet/InfiniBand and Blades?
[Diagram: Web, App, and DB tiers]
Sample Design—Blade Servers Accelerated IPC
Blade-servers using 10GbE for IPC and Storage access
Use Multi-Fabric I/O for Storage Access
Could use MFIO technology for App tier access as well; not common due to the availability of low-cost Ethernet interfaces
Optimized IPC and I/O for Oracle and Blade Servers
[Diagram: Web, App, and DB tiers]
Optimized IPC Multi-Tier Oracle Environment
[Diagram: Web, App, and DB tiers]
Fully-Optimized Multi-Tier Oracle Environment
[Diagram: Web, App, and DB tiers]
Financial Trading and Compute Clusters
Financial Trading and Compute Clusters
Algorithmic trading: up to 100s of machines
End-to-end latency is king, but not just low latency: latency deviation is just as critical
Compute machines for pricing and risk analysis: 10,000s to 100,000s of machines
Two Key Areas of Cluster Computing in the Financial Banking World
“In any gun fight, it’s not enough just to shoot fast or to shoot straight. Survival depends on being able to do both… The lone gunslinger of the open-outcry trading floors is rapidly being replaced by ultra-fast, computerized trading systems which are more akin to robots with machine guns.”
IBM Report, "Tackling Latency: The Algorithmic Arms Race"
Algorithmic Trading
Deterministic Performance
#1 problem in financial trading environments
Financials don’t care about MIN(latency) or AVG(latency), but STDDEV(latency) at the application level
A single frame dropped in a switch or adapter causes significant impact on performance
TCP NACK delayed by up to 125 ms with most NICs with interrupt throttling enabled
TCP window shortened
TCP retransmit timeout: 500 ms in the standard, usually 200 ms in implementations
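The STDDEV-over-AVG point can be made numerically. A minimal sketch with invented latency samples (microseconds), where a single dropped frame triggers a 200 ms retransmit: the minimum still looks excellent, while the deviation exposes the real risk.

```python
# Invented one-way latency samples in microseconds: nine normal
# deliveries plus one delayed by a dropped frame and a ~200 ms
# retransmit timeout (values are illustrative only).
import statistics

samples_us = [21, 22, 20, 23, 21, 22, 200_000, 21, 22, 20]

print(min(samples_us))               # best case still looks excellent
print(statistics.mean(samples_us))   # average is wrecked by one drop
print(statistics.stdev(samples_us))  # deviation exposes the real risk
```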
Why Is Latency/Performance a Problem?
Response to changing market conditions is delayed by system latency, which creates a significant loss of opportunity for trade execution and affects trading strategies
[Diagram: trade price flows from exchange systems through a market data supplier and distribution platform to the trading engine and risk software, and executed trades flow back to the exchange systems; latency is introduced by the exchange and supplier at each hop.]
Most trading houses' systems are no different
The goal of low latency is to provide the required level of capacity to support current and future market volumes while minimizing latency
Traffic Growth Next 12 Months
[Chart: gigabits per day received and sent, with linear trend lines, from 08/11/2006 through 08/11/2008; y-axis from 0 to 4,500. The growth projection points to CPU problems.]
These are estimates of average data rates, if no changes are made to the environment
The Trading Challenge
Market Data—Algorithmic Trading
Financial Compute Cluster
Data intensive
Latency insensitive
Scatter-Gather type work
Post trade analysis
Feed back in to risk engine for algorithmic trading
As successful trades increase, post trade analysis and feedback mechanisms increase
Parametric Computing
High-Performance Computing Clusters
HPC Network Communication
Access Network
Management Network
IPC Network
Storage Network
Access Network (Public)
Communications to/from external resources
Security
QoS
Availability
Management Network (Private)
Communications between master and slave nodes
Heartbeat
Small to medium-sized HPC clusters commonly consolidate Access and Management Networks
IPC Network (MPI)
Communications between nodes during run time
IPC Network and Management Network may be the same physical network
Storage Network
Access to stored data
NAS—file-level access
SAN—block-level access
HPC Network Consolidation
[Diagram: user access, management, storage, and IPC consolidated onto a single fabric]
Network Design Considerations
HPC Cluster Components
Applications
Communication Libraries
Device Driver
OS and Kernel
CPU and Bus Technologies
Network Interfaces (NICs, HCAs)
Interconnect Network(s)
Storage
Languages and Compilers
File Systems
[The stack spans the physical layer through middleware and scheduling up to the users.]
Two Key Concepts—Terminology
Capacity: the ability to provide predictable (and large) computation throughput for experimentation, production runs, and testing
Capability: the ability to provide peak power for a specific amount of time so as to solve a problem within a guaranteed time window
The two require different architectural solutions, but in practice the same infrastructure must deliver both; this leads naturally to concepts like virtualization, grid computing, and dynamic provisioning
What Do We Need to Know?
Application characteristics
Cluster size
Network/switch characteristics
Node configuration
Node communication considerations
Node interconnect
Application Characteristics
Is the application latency sensitive? Bandwidth sensitive?
Real time or batch? Response time requirements?
Storage: DAS, NAS, or SAN; physical attachment: FC, Ethernet, or IB
File system: Parallel Virtual File System, Lustre, NFS
Analyze Application(s)
[Diagram: application requirements drive the design decisions: network traffic (high/low), bandwidth (high/low), latency (high/low), CPU architecture (AMD/Intel), operating system (Windows/Linux/Solaris), storage system and file system (NAS/SAN), and form factor.]
Application Mix
Parallel: tightly coupled
Loosely coupled
Parametric
Serial
Job Mix
Running multiple applications in parallel
Running multiple copies of the same non-parallel application with different inputs—parametric execution
Parametric execution is widely used in HPC and accounts for more than 70% of cluster usage
Running multiple serial applications on one node, or one core per serial application run
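Parametric execution as described above can be sketched in a few lines: the same serial application run many times with different inputs, one task per core. Here `simulate` is a hypothetical stand-in for a real pricing or risk model.

```python
# Parametric execution sketch: one serial function, many inputs,
# distributed across local cores. `simulate` is a placeholder for
# a real serial application.
from multiprocessing import Pool

def simulate(seed: int) -> int:
    """Stand-in for one serial run of the model with a given input."""
    return seed * seed  # placeholder computation

if __name__ == "__main__":
    inputs = list(range(8))          # eight parameter sets
    with Pool(processes=4) as pool:  # four cores, eight independent runs
        results = pool.map(simulate, inputs)
    print(results)
```

A cluster scheduler does the same thing at larger scale: each parameter set becomes an independent job, which is why parametric workloads tolerate latency so well.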
Determine Cluster Size
How many nodes can the application support?
How many concurrent users?
How large are their projects?
How much speedup can the application achieve?
Load Balancing
High Availability
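The speedup question has a classic first-order answer in Amdahl's law (a standard result, not from the slides): the serial fraction of the application caps how many nodes are worth buying.

```python
# Amdahl's law sketch: upper bound on speedup when only part of
# the work parallelizes across nodes.
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)

# With 95% parallel work, 64 nodes give roughly 15x, and no node
# count can ever reach 20x.
print(amdahl_speedup(0.95, 64))
print(amdahl_speedup(0.95, 1_000_000))
```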
Determine Network Architecture
Ethernet only (IPC, Mgmt, and Access): most popular choice for smaller clusters; 1:1, 2:1, or 4:1 blocking (depending upon the app)
Ethernet (Mgmt/Access) with InfiniBand (IPC): most common in larger clusters; 1:1 or 2:1 typical oversubscription in the IB IPC fabric, 8:1 to 16:1 typical oversubscription in Ethernet to access nodes
Ethernet (Mgmt) with InfiniBand (IPC and Access): used for IB-attached storage; 1:1, 2:1, or 4:1 typical oversubscription in the IB fabric, 16:1 or higher oversubscription for Ethernet management
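The oversubscription ratios quoted above are simple arithmetic: worst-case host demand on an edge switch divided by its uplink capacity. A minimal sketch with illustrative port counts:

```python
# Oversubscription sketch: worst-case host demand divided by uplink
# capacity on an edge switch. Port counts below are illustrative.
def oversubscription(host_ports: int, host_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    return (host_ports * host_gbps) / (uplinks * uplink_gbps)

print(oversubscription(24, 1.0, 2, 10.0))   # 24 GigE hosts, 2x10GbE up
print(oversubscription(32, 10.0, 2, 10.0))  # a 16:1 Ethernet access tier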
Determine Network Topology
Fat Tree: based on a non-blocking architecture and equivalent-sized non-blocking switch "building blocks", with core/spine and leaf/edge layers
Star: sometimes combined with the Fat Tree architecture to provide a hybrid network
If beyond a single switch, use a Fat Tree/Clos style network design
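As a rough sizing sketch (standard two-tier Clos arithmetic, not a figure from the slides): a non-blocking two-tier fat tree built from identical k-port building blocks dedicates half of each leaf's ports to hosts and half to spine uplinks.

```python
# Two-tier non-blocking fat tree sizing from k-port building blocks:
# k/2 spine switches with k ports each terminate k leaves, and each
# leaf splits its ports evenly between hosts and uplinks.
def two_tier_nonblocking_hosts(k: int) -> int:
    leaves = k
    hosts_per_leaf = k // 2
    return leaves * hosts_per_leaf

print(two_tier_nonblocking_hosts(24))  # 24-port blocks: 288 hosts
print(two_tier_nonblocking_hosts(48))  # 48-port blocks: 1152 hosts
```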
Calculate Cluster Design
[Diagram: compute nodes at n Gbps and I/O nodes at m Gbps; size the minimum bisection for IPC traffic plus management and I/O traffic.]
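The minimum bisection in the diagram above reduces to simple arithmetic: in the worst case, half the compute nodes stream across the bisection at once, and an accepted blocking ratio relaxes the requirement. Values below are illustrative.

```python
# Minimum bisection bandwidth sketch: half the nodes talking across
# the bisection at full link rate, divided by the accepted blocking
# ratio. All values are illustrative.
def min_bisection_gbps(compute_nodes: int, link_gbps: float,
                       blocking_ratio: float = 1.0) -> float:
    return (compute_nodes / 2) * link_gbps / blocking_ratio

print(min_bisection_gbps(128, 10.0))       # non-blocking: 640 Gbps
print(min_bisection_gbps(128, 10.0, 2.0))  # 2:1 blocking: 320 Gbps
```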
High-Performance Computing Solution
Interprocessor Communication (IPC) Network: low latency and high bandwidth with standard open source MPI over an InfiniBand network; Cisco Catalyst® switching for TCP-based applications, which can benefit from policing, QoS, and multicast
Management and I/O Network: used for job scheduling and network monitoring; TCP or UDP based, benefits from Quality of Service and multicast; NetFlow reporting, NSF/SSO for high availability
Storage Network: NAS or iSCSI over an Ethernet fabric; IB-attached storage for lower storage overhead; Fibre Channel storage with data replication and integrated applications
Data Delivery
Storage Access Protocols and Technologies
Storage Type         | Block or File Access | Server Access                  | Back-End Storage Access
SAN                  | Block                | FC or InfiniBand gateway       | Fibre Channel, InfiniBand, or iSCSI
Parallel File System | Object               | Ethernet or InfiniBand         | DAS, Fibre Channel, or iSCSI
Parallel File System | File/Block           | Ethernet or InfiniBand         | DAS or Fibre Channel
Cluster NAS          | File                 | Ethernet or InfiniBand gateway | SCSI or Fibre Channel
NAS                  | File                 | Ethernet or InfiniBand gateway | SCSI or Fibre Channel
Network Attached Storage (NAS)
Attaches via connections to the network using Gigabit and 10Gigabit Ethernet
There are a few NAS vendors using IB as the interconnect of choice
Primarily using NFS (only standards-based file systems in this space)
Performs well for small clusters but does not scale well
Single point of access and single point of failure
Clustered NAS
Attaches via connections to the network using Gigabit and 10Gigabit Ethernet
Where a traditional NAS or NFS solution uses a single filer or server, a clustered NAS solution uses several filer heads, with storage connected directly to the heads or via some type of storage network (Fibre Channel)
Each of the filer heads can only access the storage assigned to it and not the storage assigned to other filers
Clustered NAS
Access is limited to assigned storage
All filers have knowledge of the location of data, regardless of which filer and storage the data is located on
Depending on the implementation, data access occurs either via a process that moves data from one filer to another, or via an NFS gateway process with a parallel file system on the back end
Parallel File Systems
Attached via connections to the network using Gigabit Ethernet, 10Gigabit Ethernet and InfiniBand
Provides multiple or parallel access to storage nodes also known as I/O nodes
PFS nodes have access to direct attached storage
Implementations are file/block-based and/or object-based
Parallel File Systems
For file/block-based systems, the metadata service is one of the key bottlenecks to scalability
Example: file write requests are made to the metadata server, which allocates the block(s); the compute node then sends the data to the metadata server, which sends the data to the file system and then to disk
Metadata services are either a dedicated or a shared/clustered implementation
Parallel File Systems—Object-Based
Metadata services are used but have limited functionality
I/O nodes use protocols to manage the location of data as the nodes are not just storage bricks
If we follow the process as we did in the file/block-based solution:
The compute node will first contact the metadata service regarding a file operation
The metadata service then contacts the storage devices and, based upon the protocols in the file system, identifies where the object can be stored
The metadata server then passes back a list of the storage devices that can be used for the file operation
Parallel File Systems—Object-Based
Unlike the file/block-based solution, the metadata service is removed from the file operation: the node writes the data directly to an I/O node and then on to storage
The metadata service monitors the file operations so that the location of the data stays current within the metadata records
Performance of these systems can and will vary based upon any number of variables; the choice of network architecture, interconnect, and switch fabric can and will have a significant impact on performance
Cluster Performance Design and Latency
Latency
Latency is the time taken for a packet of data to be delivered:
1. The time for encoding the packet for transmission and transmitting it
2. The time for that serial data to traverse the network equipment between the nodes
3. The time to get the data off the circuit
This is also known as "one-way latency". A minimum bound on latency is determined by the distance between communicating devices and the speed at which the signal propagates in the circuits (typically 70–95% of the speed of light). Actual latency is much higher, due to packet processing in networking equipment and other traffic.
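The three components can be combined into a back-of-the-envelope calculation. A minimal sketch, assuming signal propagation at 70% of c and illustrative frame size, distance, and per-hop processing values:

```python
# One-way latency sketch: serialization + propagation + per-device
# processing. All input values are illustrative.
def one_way_latency_us(frame_bytes: int, link_gbps: float,
                       fiber_m: float, hops: int, per_hop_us: float) -> float:
    c = 299_792_458.0                                     # m/s in vacuum
    serialize = frame_bytes * 8 / (link_gbps * 1e9) * 1e6
    propagate = fiber_m / (0.70 * c) * 1e6                # signal at 0.7c
    return serialize + propagate + hops * per_hop_us

# 1500-byte frame, 10 Gbps, 100 m of fiber, two switch hops at 3 us each:
# the switch hops dominate at datacenter distances.
print(round(one_way_latency_us(1500, 10.0, 100.0, 2, 3.0), 2))
```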
Latency
Who is the end user of the data?
Is it a core in a parallel application?
Is it another application?
Is it an end user, with a 3+ ms one-way latency cost from the data center?
It takes a batsman in cricket 400 ms to decide where and how to hit the ball once the bowler releases it
It takes a normal human 250 ms just to recognize that data has been delivered to their screen, not to mention the in-host latency and deciding what to do with it
Sources of Latency in Network Today
Problem Needs to Be Solved End-to-End: Applications, NIC, Blade, Leaf and Core Switches
[Diagram: ~10 μs in each host's application networking stack, ~3 μs per blade switch or ToR hop, ~7 μs in the core, variable adapter latency; ping/pong latency 25–30 μs end to end.]
Latency Effects in an Ethernet World
End-to-end latency: 100 μs E2E at 1GbE reduces throughput by 15–20%
100 μs E2E at 10GbE reduces throughput by 20–25%
Thank our good friend TCP for that
Who cares about throughput? Storage-heavy applications
Load/unload operations
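TCP's role here follows from its window mechanics: a single flow cannot move more than one window of data per round trip, independent of link speed. A minimal sketch with an assumed 64 KB window:

```python
# TCP throughput ceiling sketch: one window per round trip caps a
# single flow, regardless of how fast the link is.
def tcp_ceiling_gbps(window_bytes: int, rtt_us: float) -> float:
    return window_bytes * 8 / (rtt_us * 1e-6) / 1e9

# An assumed 64 KB window over a 100 us round trip caps the flow
# near 5.2 Gbps, so a 10GbE path runs roughly half idle.
print(tcp_ceiling_gbps(64 * 1024, 100.0))
```

Larger windows or lower round-trip times recover the lost throughput, which is why end-to-end latency matters even for bandwidth-bound storage traffic.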
Traditional Server I/O Architecture
Bus-based architecture with an I/O memory pool
Access to I/O resources handled by the BIOS
A data packet is typically copied three to four times: CPU interrupts, bus bandwidth, and memory bus all constrain performance
[Diagram: NIC, CPU, application memory, network, and I/O memory pool, with numbered copy steps.]
Adapter and Protocol Considerations
Fundamental part of any solution
Great advantage of Ethernet: a highly competitive and open marketplace
On-loading vs. Off-loading camp, …
Linux vs. Windows vs. Solaris
iWARP (RDMA, RDDP, DDP): single-sided offload with zero-copy kernel bypass
TCP over lossless Ethernet is just the beginning; alternative protocols are being considered
Kernel Bypass/Zero Copy Architecture
[Diagram: a bypass-capable adapter moves data directly between application memory and the network, eliminating the intermediate I/O memory pool copies.]
Low Latency Performance Comparison
[Table: latency (μs), bandwidth (MB/s), and CPU utilization for the Sockets API (TCP over Gigabit Ethernet, 10GE, and 10GE LLE; SDP over SDR and DDR IB; IPoIB) and for MPI (MVAPICH and OMPI over 10G LLE and SDR/DDR IB, OFED 1.2). Latencies range from 35.3 μs for Gigabit Ethernet TCP down to roughly 3.3 μs for DDR IB MPI; bandwidths range from ~118 MB/s on Gigabit Ethernet to ~1354 MB/s on DDR IB.]
Switch Architecture Value
[Diagram: with Nexus 5000 switches, ~3.2 μs per switch hop and ~3.9 μs per host kernel/NIC path yield ~11 μs end-to-end latency for a data packet; with traditional switches, ~20 μs per switch hop and ~25 μs per host yield ~70 μs end to end.]
Jitter: Delay Variation
Jitter is the #1 problem for HPC applications:
Latency standard deviation (jitter) is a bigger issue than average latency
Jitter and latency limit cluster size
A single frame dropped in a switch or adapter causes significant impact on application performance
TCP windowing is a major source of jitter
Ideal solution for jitter reduction:
Cut-through architecture
Line-rate processing and forwarding
PFC and congestion management control jitter at the source
No-drop and delay-drop classes of service; delay-drop is best suited for TCP-based workloads
Three Facets of Latency
Latency
Compute Latency: core-to-core message latency
Application Latency: latency between multi-tiered applications
Data Latency: data load and unload times
Three Facets of Latency
Compute Latency
[Diagram: a message travels core to core across kernel, NIC, and switch hops; end-to-end latency.]
Compute Latency
Core to core messaging for Inter-Processor Communication
Medium to large node count job distribution
Data intensive
Compute and data latencies impact overall wall clock time and scalability
Compute Latency
60–70% of message latency is in the host
Offload technologies reduce in-host latency: kernel bypass and zero copy
Balance between time communicating vs. time computing—node efficiency
Compute Latency
InfiniBand has been the interconnect and fabric of choice
10GbE is moving forward with iWARP (RDMA/RDDP); port costs, port densities, physical power draw, and heating issues remain to be overcome
Application Latency
[Diagram: messages pass from tier to tier, each hop traversing application, kernel, NIC, and switch on both sides.]
Application Latency
Multi-tiered applications
A few application environments can be a mix of all three areas of latency (e.g., Oracle RAC)
Low latency is not a significant requirement outside of market data and algorithmic trading applications
Higher latencies at the initial tiers will be exacerbated as secondary and tertiary application tiers act on higher tier output
TCP and/or UDP traffic
Unicast or Multicast
Data Latency
[Diagram: a read/write data request travels from the application through kernel, NIC, and switch to a parallel storage head, then through controller HBAs, Fibre Channel switches, and array controllers to the drives; the file returns along the same path.]
Data Latency
Data delivery to computational clusters: once on the network, data flows are limited by the slowest link
Cluster to parallel file systems: sustained data flows to/from disk are limited by the slowest link
Peta-Scale File Systems pushing 2000 ports
Large compute farms
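The slowest-link rule above is simple to state in code; the helper name and the example link speeds are illustrative:

```python
def sustained_mbps(path_mbps):
    """A sustained data flow is capped by the slowest link on its path."""
    return min(path_mbps)

# 10GbE storage target and 10GbE core, but a GbE edge to the compute node:
print(sustained_mbps([1250, 1250, 118]))  # 118: the GbE edge caps the flow
```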
Low Latency and Data Delivery
Data delivery through NAS and parallel file system interconnects will drive more 10GbE interconnects than node connects until prices are at or near IB pricing
Speed matching is critical for maximum data delivery
10GbE targets with 1GbE initiators limit throughput by more than 50% versus 10GbE–10GbE
IB shows higher throughput due to initiator/target speed matching and RDMA effects (TCP offload)
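Speed matching can be reasoned about from endpoint line rates; this sketch uses approximate practical payload rates (assumed values, not figures measured in this session):

```python
# Approximate practical payload rates in MBps (assumptions for illustration).
LINE_RATE_MBPS = {"GbE": 118, "10GbE": 1250, "IB SDR": 1000, "IB DDR": 2000}

def ceiling_mbps(initiator: str, target: str) -> int:
    """Throughput can never exceed the slower endpoint's line rate;
    protocol overhead (TCP without offload, IPoIB, etc.) lowers it further."""
    return min(LINE_RATE_MBPS[initiator], LINE_RATE_MBPS[target])

print(ceiling_mbps("GbE", "10GbE"))    # 118: the GbE initiator caps the flow
print(ceiling_mbps("10GbE", "10GbE"))  # 1250: matched speeds
```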
Data Latency—Seismic Processing
Large data sets
Medium to large node count distribution
Loosely coupled to parametric processing
TCP/UDP-based transport
Data load/unload times impact overall wall clock time
Connect speed of storage targets and compute nodes: GbE, 10GbE, InfiniBand, FC
Storage structure: NFS, Parallel NFS, NAS, clustered NAS, parallel file systems
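The impact of data load/unload times on overall wall clock can be estimated with a back-of-the-envelope helper; the function and the dataset sizes below are hypothetical:

```python
def wall_clock_hours(load_tb, unload_tb, link_mbps, compute_hours):
    """Wall clock = data load + compute + result unload at a given
    sustained link rate in MBps."""
    transfer_mb = (load_tb + unload_tb) * 1024 * 1024
    return compute_hours + transfer_mb / link_mbps / 3600

# 20 TB survey in, 2 TB of results out, 10 hours of pure compute:
gbe = wall_clock_hours(20, 2, 118, 10)      # transfer dominates on GbE
ten_gbe = wall_clock_hours(20, 2, 575, 10)  # 10GbE shrinks the transfer share
print(f"GbE {gbe:.1f} h vs 10GbE {ten_gbe:.1f} h")
```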
Low Latency and Data Delivery
Application wall clock time changes by moving to a 10GbE- or IB-connected parallel file system:
Dark Matter application (GbE to IB): 35-hour run time with data delivered over GbE; four-hour run time with data delivered over SDR IB
Oil and gas seismic processing, 120,000+ cores of oil and gas exploration (GbE to 10GbE): small jobs see a 2x reduction in wall clock; large jobs a 16x reduction
Parallel file systems scale to large numbers of systems; the issue is how to deliver tens of gigabits/s of I/O to a large number of clusters
DOD/DOE labs share peta-scale storage systems across clusters with 10GbE, using gateways (PCs with 10GbE/IB) to reach their IB clusters
Data Delivery in HPC
Large data repositories in the peta-scale range: NFS filers and parallel file systems; direct-attached storage and FC SAN with a parallel file system
GbE is not scaling to meet the higher throughput and lower wall clock times required in research and business
PFS and large data sets are driving 10GbE and IB for I/O node interconnect; data latency impacts more applications than compute latency
The compute-latency benefit of a low-latency, high-bandwidth fabric affects <30% of many applications
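The "<30% of many applications" point is essentially Amdahl's law applied to the fabric; a minimal sketch with assumed numbers:

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when only `fraction` of runtime is
    accelerated by `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Even a 10x lower-latency fabric that touches only 30% of runtime
# buys well under 1.5x overall:
print(round(amdahl_speedup(0.30, 10.0), 2))  # 1.37
```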
Data Throughput Performance
Data Throughput per I/O Node or I/O Connect

  Throughput       Target Speed                 Initiator Speed
  920–1000 MBps    IB DDR SDP                   IB DDR SDP
  525–600 MBps     IB DDR IPoIB (OFED 1.2)      IB DDR IPoIB-CM (OFED 1.2)
  350–375 MBps     IB DDR IPoIB                 IB DDR IPoIB
  590–625 MBps     IB SDR SDP                   IB SDR SDP
  525–575 MBps     IB SDR IPoIB (OFED 1.2)      IB SDR IPoIB-CM (OFED 1.2)
  350–375 MBps     IB SDR IPoIB                 IB SDR IPoIB
  550–600 MBps     10 Gigabit Ethernet          10 Gigabit Ethernet
  325–450 MBps     10 Gigabit Ethernet          Gigabit Ethernet
  112–118 MBps     Gigabit Ethernet             Gigabit Ethernet
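To translate per-connect throughput into wall-clock terms, here is a small conversion sketch using mid-range figures from the table above (the 100 TB dataset size is an arbitrary example):

```python
def hours_to_move(dataset_tb, mbps):
    """Time to move a dataset at a sustained per-connect rate in MBps."""
    return dataset_tb * 1024 * 1024 / mbps / 3600

# Mid-range rates per the throughput table:
for label, mbps in [("GbE-GbE", 115), ("10GbE-10GbE", 575), ("IB DDR SDP", 960)]:
    print(f"{label:12s} {hours_to_move(100, mbps):6.1f} h")
```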
Q and A
Recommended Reading
Continue your Cisco Live learning experience with further reading from Cisco Press®
Check the Recommended Reading flyer for suggested books
Available Onsite at the Cisco Company Store
Complete Your Online Session Evaluation
Give us your feedback and you could win fabulous prizes; winners announced daily
Receive 20 Passport points for each session evaluation you complete
Complete your session evaluation online now (open a browser through our wireless network to access our portal) or visit one of the Internet stations throughout the Convention Center
Don’t forget to activate your Cisco Live virtual account for access to all session material on-demand and return for our live virtual event in October 2008
Go to the Collaboration Zone in World of Solutions or visit www.cisco-live.com
Latency Fundamentals
What matters is the application-to-application latency and jitter
Driver/Kernel software
Adapter
Network components
Latencies of 1GbE switches can be quite high (>20 μs)
Store and forward
Multiple hops
Line serialization delay
Protocol processing, context switching, and copying dominate latency
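The components listed above (host processing, store-and-forward switching, hops, serialization) can be summed into a rough latency budget; the helper and its parameter values are illustrative assumptions:

```python
def one_way_latency_us(frame_bytes, link_gbps, hops, switch_us, host_us):
    """Host stacks on both ends, per-hop switch latency, and
    serialization delay on each of the hops+1 links along the path."""
    serialization_us = frame_bytes * 8 / (link_gbps * 1000)
    return 2 * host_us + hops * switch_us + (hops + 1) * serialization_us

# 1500B frame, one store-and-forward GbE switch (~20 us) vs. one
# low-latency 10GbE switch (~3.2 us), assuming ~5 us of host stack per side:
print(round(one_way_latency_us(1500, 1, 1, 20.0, 5.0), 1))   # 54.0 us
print(round(one_way_latency_us(1500, 10, 1, 3.2, 5.0), 1))   # 15.6 us
```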
[Diagram: data packet path from application through kernel and NIC, across a Nexus 5000 switch (3.2 μs), to the receiving NIC, kernel, and application; end-to-end latency ~11 μs]