Scalability and Reliability in the Cloud
DESCRIPTION
From AT&T Bootstrap Week: This session focuses on architecture and design concepts that ensure scalability and maximize reliability for server-based applications running in a cloud environment. The session discusses techniques for achieving scalability and reliability, and tradeoffs such as time vs. cost, based on the needs of different types of applications.
TRANSCRIPT
![Page 1: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/1.jpg)
HIGH SCALABILITY AND RELIABILITY IN THE CLOUD
GREG THOMPSON
HEAD OF ARCHITECTURE, APPS ENABLEMENT
ALCATEL-LUCENT
[email protected]
@gmthomps
![Page 2: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/2.jpg)
About This Session
- Target audience is backend application developers deploying infrastructure into a cloud environment
- Will cover concepts for scalability and reliability, with the goal of helping application developers understand some key considerations when designing and building the backend
![Page 3: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/3.jpg)
Design Time Decisions
- When first building your application backend, consider a few important questions
  - How fast should the application be recovered if a failure occurs?
  - What kind of downtime is acceptable?
  - Is the application maintaining stateful data?
  - What kind of information needs to be shared across multiple instances?
![Page 4: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/4.jpg)
Scalability
![Page 5: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/5.jpg)
What is Scalability?
- Scalability describes how the application handles increased traffic volume
![Page 6: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/6.jpg)
Scalability – Factors to Consider
- Horizontal vs. Vertical
- Stateless vs. Stateful
- Understanding Limitations
- Connection Management
- Segmentation of traffic
- Segmentation of responsibility (distributed arch)
- Clustering
- Messaging
![Page 7: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/7.jpg)
What Type of Scalability? Vertical vs. Horizontal
- Vertical
  - Scaling up a single node
  - Physical limitations – instances are very powerful but still have finite limits
  - Resources such as number of sockets can only go so high
- Horizontal
  - Scaling out across multiple nodes
  - Ability to distribute traffic over a number of nodes
  - Allows for more flexibility over time
![Page 8: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/8.jpg)
Will the App Maintain State? Stateless Applications
- Application does not persist information about transactions
- Each transaction is independent and atomic

[Diagram: an Application receiving a Request and returning a Response]
![Page 9: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/9.jpg)
Will the App Maintain State? Stateful Applications
- Application needs to maintain data about transactions in progress
- Requires storage
- Persistence may also be required depending on the reliability model

[Diagram: an Application backed by a DB, serving a First Request and a Subsequent Request]
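
As a rough illustration of the stateful case, the sketch below keeps transaction state in a shared store so that a subsequent request can land on a different instance. `SharedStore` and `AppInstance` are hypothetical names invented for this example, and the in-memory dictionary only stands in for the DB on the slide.

```python
class SharedStore:
    """Hypothetical stand-in for the DB; a real system would use an external database or cache."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class AppInstance:
    """One application node that holds no transaction state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, txn_id, step):
        # Load the state of the transaction in progress, extend it, write it back.
        steps = (self.store.get(txn_id) or []) + [step]
        self.store.put(txn_id, steps)
        return {"served_by": self.name, "steps": steps}


store = SharedStore()
node_a = AppInstance("node-a", store)
node_b = AppInstance("node-b", store)

first = node_a.handle("txn-1", "step-1")   # first request
second = node_b.handle("txn-1", "step-2")  # subsequent request, different node
```

Because the state lives outside the instances, either node can continue the transaction, which is what makes horizontal scaling of a stateful application practical.
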
![Page 10: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/10.jpg)
Understanding Limitations
- Thorough testing is key to understanding bottlenecks
- Test real-world scenarios, including latency
- Push the system to the max to understand how it behaves
![Page 11: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/11.jpg)
Connection Management: Mobile Device Connections
- Mobile devices don’t always behave like you expect
- Connectivity is often very dynamic
  - Devices move between 4G/3G/2G/no G/Wi-Fi
  - Not all TCP events will get reported and sockets can remain open
- If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
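
One common way to defuse this is an application-level idle timeout, so that sockets whose close was never reported do not accumulate. The sketch below is a minimal version of that idea; `IdleConnectionTracker`, the 30-second timeout, and the connection IDs are assumptions made for illustration, and the clock is injectable only so the behavior can be shown without waiting.

```python
import time


class IdleConnectionTracker:
    """Tracks last activity per connection and reports the ones that went silent."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        self._last_seen = {}

    def touch(self, conn_id):
        # Call on every read/write or heartbeat from the device.
        self._last_seen[conn_id] = self.clock()

    def reap(self):
        # Return (and forget) connections idle longer than the timeout;
        # the caller is expected to close the underlying sockets.
        now = self.clock()
        stale = [c for c, t in self._last_seen.items() if now - t > self.idle_timeout_s]
        for conn_id in stale:
            del self._last_seen[conn_id]
        return stale

    def open_count(self):
        return len(self._last_seen)


fake_now = [0.0]  # controllable clock for the demonstration
tracker = IdleConnectionTracker(idle_timeout_s=30, clock=lambda: fake_now[0])
tracker.touch("phone-1")
fake_now[0] = 20.0
tracker.touch("phone-2")
fake_now[0] = 45.0
reaped = tracker.reap()  # phone-1 has been silent for 45 s, phone-2 for 25 s
```

Enabling TCP keepalive on the server sockets is a complementary option, but an application-level timeout keeps the policy under the application's control.
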
![Page 12: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/12.jpg)
Segmenting Traffic
- Once the application is able to be scaled out, traffic can be segmented in different ways
  - Location (e.g. east coast vs. west coast)
  - Pre-assigned criteria - User ID, IP, or other dynamic criteria
  - Load Balanced
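
The location and user-ID criteria can be combined, as in the sketch below: the region picks a pool of nodes and a hash of the user ID picks a node within it. The region names, node names, and `pick_node` are hypothetical and chosen only for this example.

```python
import hashlib

# Hypothetical node pools per location.
NODES_BY_REGION = {
    "east": ["east-1", "east-2"],
    "west": ["west-1", "west-2"],
}


def pick_node(user_id, region):
    """Route by location first, then by a stable hash of the user ID."""
    nodes = NODES_BY_REGION[region]
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


node = pick_node("user-42", "east")
```

The same user always reaches the same node, which helps when per-user state is cached locally. Plain modulo hashing reassigns many users when the node count changes; consistent hashing is a frequently used alternative for that situation.
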
![Page 13: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/13.jpg)
Segmenting Responsibility
- Segmenting responsibility allows for a distributed architecture
  - Each component can be scaled independently
  - Allows for more flexibility in scaling
  - Adds more complexity and potential messaging overhead
![Page 14: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/14.jpg)
Clustering
- Clustering is the concept of having a group of nodes working together to provide the same capability
  - Nodes typically co-located
  - Common data shared as needed across the cluster
  - Communication may be needed between nodes

[Diagram: four App Nodes connected to Shared Data]
![Page 15: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/15.jpg)
Messaging
- Once a clustered and/or distributed architecture is used, messaging will be needed between various components and/or nodes
- Types of Messaging
  - JMS
  - Open Source MQ packages
  - Custom Designed
  - Use of APIs
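
The pattern behind all of these options is the same: one component puts a message on a queue and another consumes it at its own pace. The sketch below shows that decoupling with Python's standard-library `queue.Queue` as an in-process stand-in for a real broker; the message names and the `worker` function are made up for illustration.

```python
import queue
import threading


def worker(inbox, results):
    """Consumer component: processes messages until it sees the None sentinel."""
    while True:
        msg = inbox.get()
        if msg is None:
            inbox.task_done()
            break
        results.append(msg.upper())  # placeholder for real processing
        inbox.task_done()


inbox = queue.Queue()
results = []
consumer = threading.Thread(target=worker, args=(inbox, results))
consumer.start()

# Producer component: hands off messages without waiting for processing.
for msg in ["order-created", "order-paid"]:
    inbox.put(msg)
inbox.put(None)  # tell the consumer to stop
consumer.join()
```

With a networked broker the producer and consumer would run on different nodes, but the hand-off and the ordering guarantees of a single queue look much the same.
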
![Page 16: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/16.jpg)
Example of Scaled Architecture

[Diagram: Site 1 and Site 2, each with two Load Balancers, two Web Servers, two instances each of Component 1 and Component 2, and a Database]
![Page 17: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/17.jpg)
Reliability/Availability
![Page 18: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/18.jpg)
What is Reliability/Availability?
- Availability is typically measured by the amount of downtime your application has in a given year
  - Unplanned downtime and planned downtime are both considered
- Reliability is described by the likelihood of failure based on actual measurements
- We’ll focus more on Availability
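
To make the downtime-per-year measure concrete, the sketch below converts an availability percentage into allowed minutes of downtime in a 365-day year. The function name and the sample percentages are illustrative and not taken from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct):
    """Minutes of planned plus unplanned downtime allowed at a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


three_nines = downtime_minutes_per_year(99.9)    # about 525.6 minutes (~8.8 hours)
four_nines = downtime_minutes_per_year(99.99)    # about 52.6 minutes
five_nines = downtime_minutes_per_year(99.999)   # about 5.3 minutes
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the cost vs. need question matters so much.
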
![Page 19: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/19.jpg)
Reliability/Availability: Factors to Consider
- Cost vs. Need
- Problem detection
- Automation for recovery
- Active/standby, active/active, hot standby vs. cold standby
- Local and Geo-redundancy
- Multi-zone, multi-cloud
- Test Until You Break the System
![Page 20: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/20.jpg)
Reliability Requirements
- Cost Considerations
  - Number of instances
  - Bandwidth requirements between sites
  - Complexity of software
  - Monitoring
- Need
  - User Experience
  - Customer requirements
  - Negative Publicity
![Page 21: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/21.jpg)
Problem Detection
- Effective monitoring of the application is key to minimizing downtime
  - Event reporting in the software
  - External monitoring – test for successful behavior
  - Auto detection and alerting to minimize cost of operations personnel
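
External monitoring with automatic alerting can be sketched as a probe that is run repeatedly, with an alert raised once a number of consecutive checks fail. Everything named here (`monitor`, `probe`, the threshold of 2) is an assumption for illustration; a real probe would exercise the application's actual behavior, for example by issuing a test request.

```python
def monitor(probe, checks, failure_threshold):
    """Run `probe` `checks` times; record the check index at which an alert fires.

    An alert fires when the number of consecutive failures reaches the threshold,
    so a single transient failure does not page anyone.
    """
    consecutive = 0
    alerts = []
    for i in range(checks):
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        consecutive = 0 if ok else consecutive + 1
        if consecutive == failure_threshold:
            alerts.append(i)
    return alerts


# Simulated probe outcomes: one success, three failures, then recovery.
outcomes = iter([True, False, False, False, True])
alerts = monitor(lambda: next(outcomes), checks=5, failure_threshold=2)
```
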
![Page 22: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/22.jpg)
Automation for Recovery
- Recovering a failed component quickly increases reliability
  - Automatic detection and automatic recovery
  - Automated installation is key for minimizing setup time during recovery
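
The effect of recovery time can be estimated with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to recover. The numbers below are hypothetical and only show the direction of the effect.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate, different recovery times (illustrative values only).
manual_recovery = availability(mtbf_hours=1000, mttr_hours=4)       # operator rebuilds the node
automated_recovery = availability(mtbf_hours=1000, mttr_hours=0.1)  # scripted detection and reinstall
```

With the failure rate unchanged, cutting recovery from hours to minutes moves the result from roughly 99.6% to roughly 99.99%.
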
![Page 23: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/23.jpg)
Availability Models
- N = number of nodes required for normal processing
- N+1 = one additional node to provide redundancy in case of failure
- N+K = K nodes provide additional redundancy

[Diagram: node groups illustrating N, N+1, and N+K]
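
A small worked example may help with the N, N+1, and N+K notation. The sketch below derives N from a peak load and a per-node capacity, then checks whether a deployment still has N nodes after some failures. The load figures and function names are assumptions for illustration.

```python
import math


def nodes_required(peak_load, per_node_capacity):
    """N: the number of nodes needed for normal processing."""
    return math.ceil(peak_load / per_node_capacity)


def survives(total_nodes, n_required, failures):
    """True if at least N nodes remain after the given number of failures."""
    return total_nodes - failures >= n_required


n = nodes_required(peak_load=9000, per_node_capacity=2000)  # 4.5 rounds up to 5

n_plus_1 = n + 1
n_plus_k = n + 2  # K = 2

one_failure_ok = survives(n_plus_1, n, failures=1)
two_failures_ok = survives(n_plus_1, n, failures=2)
two_failures_ok_with_k = survives(n_plus_k, n, failures=2)
```

N+1 absorbs a single failure but not two simultaneous ones; choosing K is a cost vs. need decision.
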
![Page 24: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/24.jpg)
Redundancy Models
- Active/Cold Standby
  - Backup site is booted up when needed
- Active/Hot Standby
  - Backup site is running and ready to take over
- Active/Active
  - Both sites active and processing traffic

[Diagram: three site pairs labeled Active and Cold Standby, Active and Active Standby, and Active and Active]
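
For the active/standby models, the routing decision reduces to "use the highest-priority site that is currently healthy". The sketch below shows only that decision; `choose_site` and the site names are hypothetical, and how health is determined (and how long a cold standby takes to boot) is outside its scope.

```python
def choose_site(priority_order, healthy_sites):
    """Return the first site in priority order that is healthy, or None if none are."""
    for site in priority_order:
        if site in healthy_sites:
            return site
    return None


order = ["site-1", "site-2"]  # site-1 is active, site-2 is the standby

normal = choose_site(order, {"site-1", "site-2"})
after_failover = choose_site(order, {"site-2"})
total_outage = choose_site(order, set())
```

In an active/active setup both sites would receive traffic at all times, so the equivalent logic is to remove an unhealthy site from the load-balancing pool rather than to switch over.
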
![Page 25: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/25.jpg)
Local and Geo-Redundancy
- Local
  - Backup instances are available within the same location
  - Use of availability zones within a region is very similar
- Geographic
  - Backup instances are available in another geographic location
  - Typically in a separate region to account for events such as natural disasters
![Page 26: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/26.jpg)
Availability to the Max
- Multi-Zone/Multi-Region
  - Multi-zone typically provides instances running in different physical locations, but in the same region
  - Multi-region provides different geographic regions of availability
- Multi-Cloud
  - If your application requires the maximum possible availability
  - Run in different cloud providers in different regions
![Page 27: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/27.jpg)
Test Until You Break the System
- Push the system to the max and observe the breaking points
- Fix the problem, repeat
- The best way to find the problems that cause unplanned downtime is to test thoroughly with a mindset to break the system
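
The push-until-it-breaks loop can be sketched as a simple load ramp: increase the load step by step and record the last level that passed and the first that failed. `find_breaking_point` and the simulated system below are illustrative assumptions; in practice `handle_load` would drive a real load generator and check error rates and latency.

```python
def find_breaking_point(handle_load, start, step, limit):
    """Ramp load from `start` in increments of `step` up to `limit`.

    Returns (last_passing_load, first_failing_load); the second value is None
    if the system held up through the whole ramp.
    """
    last_ok = None
    load = start
    while load <= limit:
        if not handle_load(load):
            return last_ok, load
        last_ok = load
        load += step
    return last_ok, None


def simulated_system(requests_per_second):
    # Stand-in for a real test run; pretends the system fails above 450 req/s.
    return requests_per_second <= 450


last_ok, first_failure = find_breaking_point(simulated_system, start=100, step=100, limit=1000)
```

After fixing whatever broke at the failing level, rerun the ramp to find the next bottleneck.
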
![Page 28: Scalability and Reliability in the Cloud](https://reader034.vdocuments.site/reader034/viewer/2022051609/547999aeb4af9fb4158b488b/html5/thumbnails/28.jpg)
Q&A