071310 sun d_0930_feldman_stephen

Deploying a Highly Performing, Scalable and Available

Blackboard Solution

Steve Feldman, [email protected]

performance* The amount of useful work accomplished by a computer system compared to the time and resources used.

Alternative Definition: Response time plus latency.

scalability* The ability for a distributed system to expand by accommodating greater levels of load while maintaining similar levels of performance.

availability* The capability to service a functional request without issue under conditions of desired performance and workload scalability.

What We’ll Cover

•  In the beginning there was Performance…

•  Which came first…Scalability or the Egg?

•  How much availability do I really need?

•  2010 BbWorld conference theme…deploy for performance.

•  Continuous measurements are absolutely critical.

•  Collaborative monitoring solutions with Quest Software

The Driver

The Online Momentum Shift

•  66% of degree-granting post-secondary institutions in the US offer online, hybrid/blended online and other distance education courses.1

•  Over 4.6 million students were taking at least one online course during the fall 2008 term; a 17 percent increase over the number reported the previous year.2

•  The 17 percent growth rate for online enrollments far exceeds the 1.2 percent growth of the overall higher education student population.

•  By 2020, 50% of high school students will take an online course.1


Communities are Getting Larger

•  State and County Initiatives

•  Consortium Programs and strategic alliances between institutions.

•  Content distribution networks

•  New sources of revenue to reach markets and students that were not historically accessible
    –  Non-traditional students are being marketed to

Stakes are Getting Higher

•  Competition for funding by government

•  Competition for revenue by students

•  Learning modality changing with each technological innovation

•  User expectations and online behavior changing constantly

•  Hours of availability fighting toward mission critical
    –  VLEs are often identified as 24x7 mission-critical systems, but the resources to support them are more like 8x5

Areas of Consideration Tied to Online Learning

•  Educational Continuity

•  High Availability, Scalability and Disaster Recovery

•  Expectations and Impatience

•  The Cost of Doing Business
    –  Hidden costs
    –  Justifiable costs (things you should bring back to work)
    –  Costs to plan for the future

Budget more than money…

Budget your time and resources!

The Blackboard Profile Shift

•  Existing customers approached Bb about hybrid eLearning modalities:
    –  Creative growth opportunities
    –  Accessible communities
    –  Competitive offerings

•  New customers approached Bb about 100% online programs.
    –  Struggling with competitor systems and/or home-grown offerings.

•  Customers wanted “proven” solutions not just from Bb, but from recognized vendors.

So what did we do about it?

Performance Matters

What is Performance?

•  Simple Definition:
    –  Performance = Response Time + Latency

•  Performance is quantifiable and measurable

•  Performance is also perception

•  Mostly recognized from a cognitive perspective
    –  Instantaneous
    –  Immediate
    –  Continuous
    –  Captive

Realistic Views of Performance

•  Should all my pages respond the same?

•  Will my response times vary because of the browser I use?

•  Is it acceptable for the application to respond differently for the same exact page request at different times of the day?

•  As the administrator, am I responsible only for time to first byte, or for end-to-end response time?

•  Do I understand the expectations of my users? Are they satisfied with the response times they are receiving?

•  Do I understand the patterns of my users?

Realistic Approaches to Achieve Performance

•  Eliminate interface and resource contention.
    –  Better to have more capacity than queuing

•  Know your user behavior.

•  Optimize for saturated and low-bandwidth network conditions.
    –  Enable Compression
    –  Optimize Images
    –  Cache Static Content

•  Large JVM memory allocations are not a bad thing, but rather something to expect with Java-based applications.
    –  Large JVM (4GB to 16GB) with aggressive options you understand.

•  Two keys to the database
    –  Continuous maintenance
    –  Understand the key queries and how the CBO (cost-based optimizer) handles them
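As a rough illustration of the "Enable Compression" point, the sketch below gzip-compresses a synthetic, repetitive HTML payload with Python's standard gzip module and reports the fraction of bytes saved. The payload and the function name `compression_savings` are made up for illustration; real pages will compress differently.

```python
import gzip


def compression_savings(payload: bytes, level: int = 6) -> float:
    """Return the fraction of bytes saved by gzip-compressing payload."""
    if not payload:
        return 0.0
    compressed = gzip.compress(payload, compresslevel=level)
    return 1.0 - len(compressed) / len(payload)


# Hypothetical page body: markup is highly repetitive, so it compresses well.
page = b"<tr><td class='grade'>A</td><td class='course'>BIO-101</td></tr>\n" * 500

if __name__ == "__main__":
    print(f"original bytes: {len(page)}")
    print(f"bytes saved by gzip: {compression_savings(page):.0%}")
```

Already-compressed content such as JPEG or PNG images gains little from gzip, which is why image optimization is listed as a separate step.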

Scalable Deployments


Flexible and Scalable Application Deployment

•  An ideal deployment will contain…

–  Availability at every edge of the application environment
    •  Strategy: Physical distribution of load-balanced systems
    •  Strategy: Minimum DB recovery, not necessarily 0 downtime

–  Consumption of every possible machine resource
    •  Strategy: Virtualization provisioning

–  Techniques for improving user experience
    •  Strategy: Techniques and tools for achieving page-level SLAs

–  Large addressable memory spaces
    •  Strategy: 64-bit and large OS process space allocations

Flexible Deployments

•  Emphasis on adoption of virtualization technologies
    –  Virtualization technology transparent to guest OS and application.
    –  Why: Take advantage of CPU and Memory expansion

•  Emphasis on fast provisioning
    –  Provisioning technology such as Dell AIM, VMware deployment technology and XenServer deployment technology
    –  Why: These are solved problems that minimize human error and speed up deployment.

•  Emphasis on diskless systems
    –  Hardware is just “rented” space for CPU, Memory and Network.
    –  Why: Network and storage are now so fast, why be dependent on “wired” solutions?

Reliable Deployments

•  Emphasis on distributed computing
    –  De-emphasize clustering and push heavy load-balancing and virtualization.
    –  Clustering predates our 64-bit offering

•  Emphasis on active/passive availability solutions at the DB
    –  SQL Server cluster and Oracle RAC One
    –  Availability != Scalability

•  Emphasis on diskless hardware systems using enterprise network boot storage

•  Fault tolerance is expensive and has to be strategically identified…
    –  Focus on sources of greatest probability of failure

Responsive Deployments

•  Large 64-bit address space…
    –  It’s cheaper today than 4 years ago
    –  Technology is heading in this direction
    –  It’s not a bad thing…

•  Plentiful CPU worker threads…
    –  Use only what you need
    –  Take advantage of hyperthreading and MT technology
    –  Partition via virtualization

•  Many bigger…distributed environments

•  Continuous maintenance
    –  If you want your systems to remain fast, you have to “service” the roads. Lots of litter and potholes out there.

Efficient Deployments

•  Emphasis on blade and rack mount systems
    –  Space management
    –  Power and Conditioning Control

•  Emphasis on virtualization
    –  Efficient utilization of CPU and Memory resources that are quickly exploding

•  Shared enterprise networked storage
    –  Capable of OS/VM boot partitions
    –  Application Binary Installation
    –  Non-Relational File Content
    –  Relational Content

•  Network optimized
    –  Compression, HTTP optimization
    –  Be wary of proxies that disable Gzip compression

Adaptive Deployments

•  Pooled resources via virtualization and consolidated storage

•  Deployment/Provisioning considerations
    –  Dell AIM (Recent acquisition of Scalent)
    –  VMware Provisioning Software (vCenter/vMotion)

Flexible and Scalable Application Deployment

•  An ideal deployment will contain…

–  Minimum Storage Recovery Time
    •  Strategy: Enterprise storage with Snapshot capabilities

–  Advanced monitoring for operations and planning
    •  Strategy: Measurement tools and analytics

–  Automation…Automation…Automation
    •  Strategy: Investment in repeatable, reliable automated processes.

Deployment: Resource Utilization

•  Moore’s law is in full effect
    –  CPUs are getting faster with more cores
    –  Memory is in abundance and cheap
    –  Storage is grossly abundant

•  Massive systems can be obtained at low cost, but cannot be saturated in stand-alone configurations.

•  Virtualization offers the opportunity…
    –  Deploy with availability in mind
    –  Saturate system resources

Deployment: Large Address Space

•  As of Blackboard Learn™ Release 9.1 all supported/certified configurations include a 64-bit option.

•  Pushing more processing to client and DB over the last few releases, but major memory management technique is to use more application caches.
    –  Memory stays persistent longer
    –  Less wasteful from a creation/destruction perspective, but puts greater demands on larger spaces.

•  Most of our application testing focused on 4GB and 8GB JVM deployments on 6GB and 10GB OS spaces.
    –  Limited testing at 16GB and 32GB

The Need for Availability

What is Availability?

•  High-availability offerings mask the effects of a system failure in order to minimize the impact on access to, and functional use of, a system for a community of users.

•  Simple Definition:
    –  Percentage of time the system is in its operational state.

•  You will often hear the concept of 3x9’s, 4x9’s or even 5x9’s
    –  Planned versus Unplanned

•  Availability = (Total Units of Time – Downtime) / Total Units of Time
    –  8760 hours in a year
    –  Downtime = 10 hours
    –  Availability = (8760 – 10)/8760 ≈ 99.89%

Quick View into Availability Statistics

Availability Percentage Model    Unexpected Downtime per Year
90%                              36.5 days
95%                              18.25 days
98%                              7.30 days
99%                              3.65 days
99.5%                            1.83 days
99.8%                            17.52 hours
99.9%                            8.76 hours
99.95%                           4.38 hours
99.99%                           52.6 minutes
99.999%                          5.26 minutes
99.9999%                         31.5 seconds
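The availability formula and the downtime figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming a 365-day year (8760 hours), as on the slide; the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the figure used on the slide


def availability(total_hours: float, downtime_hours: float) -> float:
    """Availability = (total time - downtime) / total time, as a fraction."""
    return (total_hours - downtime_hours) / total_hours


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year allowed at a given availability percentage."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"10 hours down per year: {availability(HOURS_PER_YEAR, 10):.4%}")
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% allows {downtime_hours_per_year(pct):.3f} hours/year")
```

Working backwards from a target such as 99.9% gives a yearly downtime budget that planned maintenance and unplanned incidents both draw from, if both are counted against the target.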

Realistic Views of Availability

•  If the application is not functioning as expected, but you can login, is it available?
    –  Perception versus Reality
    –  If it’s slow, do my users feel just as bad as if they received an error?

•  How do you plan for the unexpected?
    –  Practice really does make perfect

•  Do I treat the calendar from a date and time perspective differently from an availability perspective?
    –  Will my users cause problems if I take the site down during low usage periods/dates?
    –  Will the users even know that something happened?
    –  Can I recover fast enough?

Realistic Approaches to Achieve Availability

•  Strategically picking redundancy in the architecture.
    –  Servers and storage make sense to a degree
    –  Monitoring makes sense
    –  Do advanced clustering architectures really make a difference?
    –  Do the costs of a dedicated DR facility and site make sense?

•  Choosing the right initiatives based on the resources available to manage
    –  Don’t set your administrators up to fail.
    –  If you don’t have the capabilities on-site, don’t be skeptical of outsourcing the problem.

•  Balance costs over goals
    –  Choose the right places to put your pennies.
    –  Make the business drive the decision…it’s their money!

Deployment: Availability

•  VLEs are different beasts today than in the past.
    –  Communities are bigger
    –  Sessions last longer
    –  Content is richer
    –  Key point: Adoption is greater and users expect their sites up 24 x 7 x 365

•  Architecture is designed for many parallel instances of the product scaled in a horizontal fashion.
    –  Distributed physical deployments
    –  Virtualization is a key element

•  Database failover more important than horizontal database scalability.
    –  Emphasis on vertical database scalability
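One way to see why many parallel instances help availability is the standard redundancy calculation sketched below. It rests on strong assumptions that are not from the slides: node failures are independent, any single surviving node can carry the load, and the load balancer, storage and database are left out of the picture. The function name is hypothetical.

```python
def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability of a tier of redundant nodes where any one node can serve.

    Assumes independent failures: the tier is down only when every node is down.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} node(s) at 99% each: {parallel_availability(0.99, n):.6f}")
```

Under these assumptions the application tier quickly stops being the weakest link, which is consistent with the slide's emphasis on database failover.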

Flexible and Scalable Application Deployment

Pros and Cons of SQL Server Clustering

Pros of Clustering

•  Reduces overall downtime for both planned and unplanned situations

•  Easy to Configure and Manage

•  Simplifies management of patches and upgrades

•  Mean time to Recovery is sub-5 seconds in most situations

Cons of Clustering

•  Does not account for Active/Active for Monolithic Applications

•  Differences between SQL 2005 and 2008

•  More expensive than alternative failover approaches

•  Requires more dedicated DBA personnel on-board

Pros and Cons of Oracle RAC

Pros of RAC

•  Reduces overall downtime for both planned and unplanned situations

•  Can improve overall scalability with increased parallel nodes able to handle concurrent and competing requests.

•  Seamless integration with applications like Blackboard, making it easier to stand up an enterprise application in a RAC environment.

•  Has option for Active/Active and Active/Passive for monolithic schemas.

Cons of RAC

•  Very pro-Oracle utilities and licensing, which can make RAC beyond expensive.

•  Performance can suffer dramatically due to basic configuration challenges and the ultimate complexity of RAC.

•  Notion that developers do not have to programmatically account for RAC is not true. Certain SQL operations can be harmful in a RAC environment.

•  Requires more dedicated DBA personnel on-board

Deployment: Storage MTTR

•  Reference architecture pushes for “diskless” boots in which an iSCSI or NFS partition resides on an enterprise storage system.

•  Both OS/VM partition and data partition served up from remote storage deployment designed for performance and scalability.
    –  Make your hardware work from a CPU, Memory and Network perspective…save the Disk for the experts.

•  Consider scenarios for reducing “Mean Time to Recovery or Repair”
    –  Snapshot technology offering minutes for recovery

Monitoring

Deployment: Advanced Monitoring

•  Measurement is the secret sauce for successful deployments.
    –  Most reliable and scalable deployments measure beyond the server infrastructure

•  Different types of measurements
    –  System/Environmental measurements
    –  Business measurements
    –  Synthetic measurements

•  Collecting is only part of the prize
    –  Need to analyze the data to drive business decisions from the data.

Lifecycle of Measurement

Define Metrics: Goal Setting

Implement Instrumentation: Begin Measuring

Prepare Reporting: Generate Reports

Share Results with Stakeholders: Distribute Reports

Align to KPI/ROI: Convince Stakeholders

Recommend Changes: Show Business Value

Reset Expectations: New Initiatives

Different Types of Monitoring

Synthetic Monitoring

Real User Monitoring

Performance Forensic Monitoring

What is Synthetic Monitoring?

•  Automated monitoring technique to measure the functional behavior of a system, sub-system or component.

•  Typically a scheduled activity used to measure the availability, responsiveness and functional attributes of a common application scenario.

•  Can be executed from any access point to the system in question, both internal and external.

•  Also considered “Active” Monitoring of a system

•  Not intended to supply load, but rather perform sampling of performance and availability

•  Two methods: –  HTTP Simulation or Real Browser Emulation

Tools for Synthetic Transactions

•  You can really use any form of HTTP emulation tool like JMeter, Grinder, MSTS, LoadRunner, SilkPerformer, SOASTA, etc…

•  Some monitoring software systems like Foglight, SiteScope, Nagios, CA IntroScope, Argent Defender

•  External services: Keynote, Gomez (Compuware), WebMetrics, AlertSite, Pingdom, SiteUpTime

•  Browser based solution: Selenium

Strategies for Synthetic Transactions

•  Site and Host Ping Tests should run on a multi-second basis (15s to 30s)

•  Common, yet critical paths targeting functional systems for availability should run on a continuous interval (x < 5 minutes).

•  Complicated paths focusing on performance and availability should run every 30 to 60 minutes.

•  Repeat tests when the desired SLA or outcome is not achieved
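The interval guidance above could be encoded roughly as follows. This is a sketch only: the check names, the always-passing placeholder probes and the retry counts are invented for illustration, and a real deployment would normally rely on the scheduler built into its monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SyntheticCheck:
    name: str
    interval_s: int             # how often the check should run
    probe: Callable[[], bool]   # returns True when the expected outcome was met
    retries: int = 1            # immediate re-tests after a failed probe


def due_checks(checks: List[SyntheticCheck], last_run: Dict[str, float],
               now: float) -> List[SyntheticCheck]:
    """Return checks that never ran or whose interval has elapsed at `now`."""
    return [c for c in checks
            if c.name not in last_run or now - last_run[c.name] >= c.interval_s]


def run_check(check: SyntheticCheck) -> bool:
    """Run a probe, repeating it up to `retries` extra times on failure."""
    for _ in range(1 + check.retries):
        if check.probe():
            return True
    return False


# Intervals follow the slide; the probes are placeholders that always pass.
CHECKS = [
    SyntheticCheck("host-ping", 30, lambda: True),
    SyntheticCheck("login-and-open-course", 4 * 60, lambda: True),
    SyntheticCheck("full-assessment-workflow", 60 * 60, lambda: True, retries=2),
]
```

Passing the clock in as `now` keeps the scheduling logic easy to exercise without waiting for real time to pass.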

Why Synthetic Transactions are Critical

•  Knowing is half the battle…

•  Organic growth of transaction data available for comprehensive analytics.
    –  Am I meeting my SLAs?
    –  Are my users experiencing response challenges?
    –  Do I experience issues everywhere or just specific parts of my system, sub-system or component?

•  I could use Real User Transactions, but is it really fair to compare?
    –  Continuous baseline comparison test about a “known” or expected experience.

•  Probably the most important…we monitor to protect our community!
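The "Am I meeting my SLAs?" question can be answered directly from collected synthetic samples. The sketch below assumes an SLA phrased as "the 95th percentile response time stays under a threshold"; that phrasing, the nearest-rank percentile method and the function names are assumptions for illustration, not something the slides prescribe.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of response-time samples (pct in 0..100)."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def meets_sla(samples: Sequence[float], threshold_s: float,
              pct: float = 95.0) -> bool:
    """True when the pct-th percentile response time is within the threshold."""
    return percentile(samples, pct) <= threshold_s
```

A percentile is used rather than a mean because a handful of very slow requests can hide behind a healthy-looking average.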

What is Real User Experience Monitoring?

•  Passive web monitoring that observes web traffic to measure the user experience.

•  Provides both quality of service and responsiveness metrics in order to gauge service levels of performance and availability.

•  Typically a continuous activity watching silently in a parallel channel or as a pass through channel.

•  Able to capture characteristics about the entire HTTP stream to be used for forensics and user incidents.

•  Most vendors package as an appliance, but beginning to see the rise of “virtual” appliances.

•  Synthetic monitoring is just not enough…

Tools for RUM Monitoring

•  Dominated by commercial vendors who have a niche in web performance and/or application performance management.
    –  Quest FxM
    –  Coradiant TrueSight
    –  Oracle Real User Experience Insight
    –  Tealeaf
    –  CA/NetQoS

•  Rise in new tools coming from network equipment vendors like Cisco, Opnet and Citrix/NetScaler

Strategies for RUM Monitoring

•  Identify areas of dense usage in order to highlight performance, availability and functional experience in the most common components of the system.

•  Start with a wide lens of traffic watching and slowly narrow the area of focus to minimize the “purge” of data.

•  The “purge” of data is going to happen, so be prepared to move the data out of the system into an alternative repository.
    –  Some of the vendors have already solved this problem via an Enterprise Data Warehouse (e.g. Coradiant BI)

•  Most of these tools can show
    –  Time to First Byte, Host Latency, Network Latency and E2E

•  Avoid the trap of focusing on Time to First Byte
    –  You are serving an entire application from client to server
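To make the difference between time to first byte and end-to-end time concrete, here is a minimal sketch that times a single HTTP GET with Python's standard http.client. It is an approximation under stated assumptions: "first byte" is taken as the moment the status line and headers have been read, and "end to end" covers only this one response body, not the images, scripts or rendering a real browser would add. The function name is hypothetical.

```python
import http.client
import time
from typing import Tuple


def time_request(host: str, port: int, path: str = "/") -> Tuple[float, float]:
    """Return (time_to_first_byte_s, end_to_end_s) for one HTTP GET."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        start = time.perf_counter()
        conn.request("GET", path)
        response = conn.getresponse()   # returns once status line and headers are read
        first_byte = time.perf_counter()
        response.read()                 # drain the full response body
        done = time.perf_counter()
    finally:
        conn.close()
    return first_byte - start, done - start
```

The gap between the two numbers grows with payload size and network conditions, which is the part a first-byte-only view leaves out.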

Why RUM Monitoring is Important

•  Critical data for use in solving forensics issues.

•  Closest data point to informing the implementation team about the “real” user experience without talking to the user (passive watching).

•  Captures both functional and performance characteristics about the user’s session experience.

•  Provides insight into user’s clickstream, but does not aggregate clickstream behavior.

•  Covers the full pipeline from host to network to client.

What is Performance Forensic Monitoring?

•  Deliberate instrumentation approach to capture performance characteristics about an application deployment.

•  Measures resource and interface statistics not typically visible from the application directly.

•  Provides data points about application code execution that can be tied down to both the user and/or the application component.

•  Can’t measure everything, but can sample consistently.
    –  Certain data points can be captured on a continuous basis such as Java/J2EE container statistics

Tools for Forensic Monitoring

•  Recommended tool sets tie the PFM tool with the RUM tool.
    –  Foglight FxM seamless integration with Foglight Application Cartridges and Database Performance Analysis
    –  Coradiant TrueSight integration with Dynatrace APM (Coradiant AV)
    –  CA NetQoS integration with CA Wily IntroScope
    –  Oracle RUE Insight with Oracle Enterprise Manager for Applications and Databases.

•  Limited supply of open source tools that can perform a fraction of the functionality.
    –  No known integrations with RUM tools
    –  Point based tools per container (not aggregators)
    –  Example tools: JConsole, Java VisualVM

Strategies for Forensic Monitoring

•  Measure the essentials such as container interfaces and resources.

•  Most vendors have rule agents to begin sampling with a greater degree of instrumentation when certain rules are broken.

•  Retain statistics for extended periods of time (greater than 1 year) for annual, monthly, weekly, daily and hourly comparison purposes.

•  Construct trending thresholds for alert purposes to invoke a planning exercise in advance of an incident.
    –  Yes, application forensics can be used for trending purposes for events in the future as they are based on events in the past as points of reference.
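A trending threshold can be as simple as fitting a straight line to retained samples and estimating when it will cross a limit. The sketch below uses an ordinary least-squares fit over equally spaced samples and handles only upward trends (for example, steadily growing heap or storage usage). The linear model and the function name are assumptions for illustration; commercial tools may use more sophisticated models.

```python
from typing import Optional, Sequence


def periods_until_threshold(values: Sequence[float],
                            threshold: float) -> Optional[float]:
    """Estimate how many sampling periods after the last sample a
    least-squares trend line reaches `threshold`.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line is already at or beyond the threshold.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (n - 1))
```

For example, weekly peak usage of 50, 55, 60, 65 and 70 percent against a 90 percent limit gives an estimate of four more weeks, enough lead time to start a planning exercise before an incident.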

Why Forensic Monitoring is Important

•  Most obvious is for explaining why an incident occurred, when it occurred and to whom it occurred.

•  Unlocks the black box of the application container.
    –  Provide feedback to the vendor about application design issues.
    –  Provide guidance for capacity and configuration changes to the environment.

•  Some vendors provide an entire pipeline request model from the client to the container to the database.
    –  Great for schools that are leveraging home-grown B2s or non-Bb developed B2s that have not gone through full-fledged performance and scalability testing.

Managing an HA/DR environment

Bryan Wynns, [email protected]

What We’ll Cover

•  Monitoring the environment to increase performance and reduce failovers

•  Solutions to help implement an HA/DR environment

•  Managing your virtual environment

Foglight Application Performance Monitoring

[Diagram labels: Web Servers; Application Servers; Databases; Experience Monitor & Viewer; Foglight Management Server; Quest Collectors; Synthetic Transactions]

Heterogeneous Systems Management System Center + Quest (QMX)

Network Devices: Cisco, Juniper, etc.

Databases: Oracle, etc.

Applications: Apache, Blackberry, etc.

Operating Systems

Third-Party Frameworks: Connectors

Mainframes: AS400 & z/OS

Storage: EMC & NetApp

Quest and Microsoft - Managing the Enterprise

Comprehensive Heterogeneous Systems Management

Monitoring & Diagnostics

Change & Configuration

Backup & Recovery

Provisioning & Virtualization

Quest System Center Solutions Product Alignment to Microsoft System Center Capabilities

QMX – Operations Manager 2005 Edition

QMX – Operations Manager 2007 Edition

QMPs: Oracle, .NET, AS400, z/OS, Cisco

QMCs: Ex.Patrol, NetCool, MOM, Foglight

QRX – Audit Collection Services

QMX - SCCM 2007 Edition

QMX - Configuration Manager 2007 Edition

QMX for Device Management – SCCM 2007 Edition

QMX for Device Management – Configuration Manager 2007 Edition

MSI Studio

DPM + Recovery Manager for AD, Exchange & SharePoint

QMX for DPM (Delivered via Prof Services)

VMM + Virtual Access Suite (from Provision Networks) = Advanced VDI

Simplified Management of System Center using Windows PowerShell and Quest PowerGUI

QMX – Quest Management Xtensions

QMP – Quest Management Packs

QMC – Quest Management Connectors

QRX – Quest Reporting Xtensions

SharePlex for Oracle High Availability / Disaster Recovery

•  Provides an alternate copy of production data for failover in the event of maintenance or downtime

•  Ensures production databases are available 24x7

•  Avoids loss in revenues or end-user satisfaction due to loss of critical data

[Diagram: SharePlex for Oracle High Availability / Disaster Recovery. Labels: Capture; Read; Export; Import; Post; Capture Queue; Export Queue; Post Queue; SQL; Redo-Logs]

Quest Virtualization Solutions

Vizioncore vRanger Pro vReplicator vOptimizer

Provision Networks Virtual Access Suite

Vizioncore vConverter

Quest Foglight for

Virtualization

Vizioncore vFoglight

Quest InTrust QMX for SMS

ScriptLogic Desktop Authority

Vintela Access Suite

Select platforms    Convert    Protect    Monitor    Optimize

VMWARE

MS HYPER-V

CITRIX XEN

Automate

7/19/10

Quest Virtualization Solutions

vConverter

Used To:

•  Help companies begin first steps of virtualization – Physical to Virtual (P2V) conversions

•  Convert physical servers to VMware, Microsoft, Citrix, or Virtual Iron virtual machines (VMs)

•  Assist organizations in deploying multi-vendor hypervisor solutions – Virtual to Virtual (V2V) conversions

•  Provide Disaster Recovery (DR) protection for old physical servers – allow quick recovery to as close to point of failure as possible

vRanger Pro: Image-Based Backup & Recovery

Image-Based Backup and Recovery

•  Backup-Once-Recover-Any (files, email objects, OS, patches, registry, applications, and recovery agents)

•  Full VM recovery times are faster than traditional backup and recovery solutions

Comprehensive Backup and Recovery for Virtual Infrastructures

•  Object-level restore (OLR) for Microsoft Exchange email objects offers faster, more flexible recovery options

vReplicator: Replicate VMs

•  DR & BC with shorter RTO/RPO

•  VM replication with minimal data and overhead

•  Asynchronous Data Transfer to reduce impact

•  Affordable and Easy to Use

vFoglight: Manage VM Performance

•  Performance Monitoring

•  Capacity Planning

•  Chargeback

•  Service Management

vOptimizer Pro: Optimize VM Storage

•  Find and reclaim over-allocated VM storage

•  Prevent VM outages due to unavailable storage

•  Reduce time spent monitoring & managing VM storage

•  Automated 64K alignment improves VM performance

vControl

Self Service Provisioning and VM Management

Self-Service Request, Approval & Provisioning

•  Self-Service VM Request Portal

•  VM Approval & Fulfillment System

VM Management, Visibility & Control

•  VM Management Console

•  Extensible Workflow Engine

Quest Virtualization Solutions: Business Continuity

Use Cases for vRanger Pro

•  Complement existing file-level backups
    –  Image servers weekly to complement nightly file-level backup solutions for faster restore

•  Complete backup solution
    –  Image servers weekly with nightly differentials and file-level restore for most complete and cost-effective backup solution

•  Offsite Disaster Recovery (DR)
    –  Image complete servers offsite for low-cost, comprehensive DR plan

vConverter

•  Feature: Powerful and easy-to-use graphical interface
    –  Benefit: Simplify P2V migrations; manage more conversions with fewer people and less effort

•  Feature: Automate many P2V and V2V pre- and post-conversion tasks
    –  Benefit: Speed up P2V & V2V conversions – reduce manual, time-consuming, and error-prone tasks

•  Feature: High-speed file based or block level transfers
    –  Benefit: Complete conversions faster while using fewer critical system resources

•  Feature: Synchronized cutover for large conversion projects
    –  Benefit: Pre-plan and schedule large P2V and V2V conversions – reduce server outages

•  Feature: Generate DR image of physical servers to virtual machines
    –  Benefit: Streamline and simplify recovery of old physical servers that fail

•  Feature: Provide continuous DR protection for physical servers
    –  Benefit: Ensure recovery of physical servers to as close to the time of failure as possible

Quest Virtualization Solutions: Business Continuity

Use Cases for vReplicator

•  Quick Recovery of Servers
    –  Recover critical servers where data recovery from the day before is not acceptable – need at least last 2 hours of data

•  Cost-Effective Offsite Solution
    –  Organizations that do not have the budget or personnel to support traditional HA / DR solutions

•  Complement to SAN Replication
    –  Organizations that want replication at departmental or group level or that have selected virtual machines they need to replicate without using SAN

Chance to win a prize….

Please provide feedback for this session by emailing [email protected].

The subject of the email should be the title of this session:

Deploying a Highly Performing, Scalable and Available Blackboard Solution

