RAC Operational Best Practices

Upload: devjeet

Post on 02-Jun-2018


  • 8/11/2019 RAC Operational Best Practices

    1/25

    RAC & ASM Best Practices: You Probably Need More than Just RAC

    Kirk McGowan

    Technical Director, RAC Pack

    Oracle Server Technologies

    Cluster and Parallel Storage Development


    Agenda: Operational Best Practices (IT MGMT 101)

    Background

    Requirements

    Why RAC Implementations Fail

    Case Study

    Criticality of IT Service Management (ITIL) process

    Best Practices

    People, Process, AND Technology


    Why do people buy RAC?

    Low cost scalability

    Cost reduction, consolidation, infrastructure that

    can grow with the business

    High Availability

    Growing expectations for uninterrupted service.


    Why do RAC Implementations fail?

    RAC (scale-out clustering) is new technology
    - Insufficient budget and effort are put towards filling the knowledge gap

    HA is difficult to do, and cannot be done with technology alone
    - Operational processes and discipline are critical success factors, but are not addressed sufficiently


    Case Study

    Based on true stories. Any resemblance, in

    full or in part, to your own experiences is

    intentional and expected.

    Names have been changed to protect the

    innocent


    Case Study

    Background
    - 8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
    - Oracle expertise (Development) engaged to help flatten the tech learning curve
    - Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
    - Many technology issues encountered across the stack, and resolved over the 8-12 month implementation: hw, OS, storage, network, rdbms, cluster, and application


    Case Study

    Situation
    - New mission-critical deployment using the same technology stack
    - Distinct architecture, applications development teams, and operations teams
    - Large staff turnover

    Major escalation, post production
    - CIO: "Oracle products do not meet our business requirements"
      - RAC is unstable
      - DG doesn't handle the workload
      - JDBC connections don't fail over


    Case Study

    Operational Issues

    Requirements, aka SLOs, were not defined
    - E.g. claim of 20s failover time; application logic included 80s failover time; cluster failure detection time alone set to 120s

    Inadequate test environments
    - Problems encountered first in production, including the fact that SLOs could not be met

    Inadequate change control
    - Lessons learned in previous deployments were not applied to the new deployment: rediscovery of the same problems
    - Some changes implemented in test, but never rolled into production: recurring problems (outages) in production
    - No process for confirming a change actually fixes the problem prior to implementing in production
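    The failover mismatch in this case study (a 20s claim against 80s and 120s settings) is the kind of inconsistency a simple pre-production check can surface. A minimal sketch, assuming a hypothetical list of timeout settings; the class and function names are illustrative and not part of any Oracle tool, and only the 20s, 80s, and 120s figures come from the slide:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutSetting:
    """One configured timeout that contributes to end-to-end failover."""
    layer: str    # hypothetical label for where the setting lives
    seconds: int


def settings_exceeding_slo(slo_seconds, settings):
    """Return every setting that, on its own, already exceeds the failover SLO."""
    return [s for s in settings if s.seconds > slo_seconds]


# Figures from the case study: 20s claimed, 80s in application logic,
# 120s for cluster failure detection.
claimed_failover_slo = 20
configured = [
    TimeoutSetting("application failover logic", 80),
    TimeoutSetting("cluster failure detection", 120),
]

for s in settings_exceeding_slo(claimed_failover_slo, configured):
    print(f"{s.layer}: {s.seconds}s exceeds the {claimed_failover_slo}s SLO")
```

    Any single setting above the SLO makes the SLO unreachable regardless of the rest of the stack, which is why the check compares each setting individually rather than summing them.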


    Case Study

    More Operational Issues

    Poor knowledge transfer between internal teams
    - Configuration recommendations, patches, and fixes identified in previous deployments were not communicated.
    - Evictions are a symptom, not the problem.

    Inadequate system monitoring
    - OS level statistics (CPU, IO, memory) were not being captured.
    - Impossible to perform RCA on many problems without the ability to correlate cluster / database symptoms with system level activity.

    Inadequate support procedures
    - Inconsistent data capture
    - No on-site vendor support consistent with criticality of system
    - No operations manual
      - Managing and responding to outages
      - Responding and restoring service after outages


    Overview of Operational Process Requirements

    What are ITIL Guidelines?

    ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.


    IT Service Management

    IT Service Management = Service Delivery

    + Service Support

    Service Delivery: partially concerned with

    setting up agreements and monitoring the

    targets within these agreements.

    Service Support: processes can be viewed

    as delivering services as laid down in these agreements.


    Provisioning of IT Service Mgmt

    In all organizations, must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
    - People with the right skills, appropriate training and the right service culture
    - Effective and efficient Service Management processes
    - Good IT Infrastructure in terms of tools and technology

    Unless People, Processes and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.


    Service Delivery

    Financial Management

    Service Level Management
    - Severity/priority definitions, e.g. Sev1, Sev2, Sev3, Sev4
    - Response time guidelines
    - SLAs

    Capacity Management

    IT Service Continuity Management

    Availability Management


    Service Support

    Incident Management
    - Incident documentation & reporting, incident handling, escalation procedures

    Problem Management
    - RCAs, QA & process improvement

    Configuration Management
    - Standard configs, gold images, CEMLIs

    Change Management
    - Risk assessment, backout, sw maintenance, decommission

    Release Management
    - New deployments, upgrades, emergency release, component release


    BP: Set & Manage Expectations

    Why is this important?
    - Expectations with RAC are different at the outset
    - HA is as much (if not more so) about the processes and procedures as it is about the technology
    - No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs

    Must communicate what the technology can AND can't do

    Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met.

    HA isn't cheap!


    BP: Clearly define SLOs

    Sufficiently granular
    - Cannot architect, design, OR manage a system without clearly understanding the SLOs
    - 24x7 is NOT an SLO

    Define HA/recovery time objectives, throughput, response time, data loss, etc.
    - Need to be established with an understanding of the cost of downtime for the system
    - RTO and RPO are key availability metrics
    - Response time and throughput are key performance metrics

    Must address different failure conditions
    - Planned vs unplanned
    - Localized vs site-wide

    Must be linked to the business requirements
    - Response time and resolution time

    Must be realistic


    Manage to the SLOs

    Definitions of problem severity levels

    Documented targets for both incident response time and resolution time, based on severity

    Classification of applications w.r.t. business criticality

    Establish SLA with business
    - Negotiated response and resolution times
    - Definition of metrics
      - E.g. "Application Availability shall be measured using the following formula: Total Minutes In A Calendar Month minus Unscheduled Outage Minutes minus Scheduled Outage Minutes in such month, divided by Total Minutes In A Calendar Month"
    - Negotiated SLOs
      - Effectively documents expectations between IT and business
    - Incident log: date, time, description, duration, resolution
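    The availability formula quoted on this slide can be computed directly. A minimal sketch, assuming outage minutes are already totalled per calendar month; the function name and the sample outage figures are hypothetical:

```python
import calendar


def application_availability(year, month, unscheduled_outage_min, scheduled_outage_min):
    """Availability for one calendar month, as a fraction between 0 and 1.

    (total minutes - unscheduled outage - scheduled outage) / total minutes
    """
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    up_minutes = total_minutes - unscheduled_outage_min - scheduled_outage_min
    return up_minutes / total_minutes


# Hypothetical month: 30 unscheduled and 120 scheduled outage minutes in June 2018.
availability = application_availability(2018, 6, 30, 120)
print(f"{availability:.4%}")
```

    Note that this definition counts scheduled outages against availability, so planned maintenance windows have to fit inside the negotiated target.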


    Example Resolution Time Matrix

    Severity 1, Priority 1 and 2 SRs: < 1 hour
    Severity 1, Priority 3 SRs: < 13 hours
    Severity 2, Priority 1 SRs: < 14 hours
    Severity 2 SRs: < 132 hours
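    A matrix like this is easiest to manage to when it is encoded as data rather than kept in a document. A minimal sketch, with the hour values copied as they appear in the matrix above; the dictionary layout, the fallback row for "Severity 2 SRs", and the function name are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Resolution targets in hours, keyed by (severity, priority).
# Priority None is the fallback row for a severity that has no
# priority-specific entry.
RESOLUTION_TARGET_HOURS = {
    (1, 1): 1,
    (1, 2): 1,
    (1, 3): 13,
    (2, 1): 14,
    (2, None): 132,
}


def resolution_deadline(opened_at, severity, priority):
    """Return the time by which an SR should be resolved, or None if no target is defined."""
    hours = RESOLUTION_TARGET_HOURS.get((severity, priority))
    if hours is None:
        hours = RESOLUTION_TARGET_HOURS.get((severity, None))
    if hours is None:
        return None
    return opened_at + timedelta(hours=hours)


opened = datetime(2018, 6, 2, 9, 0)
print(resolution_deadline(opened, 1, 3))
```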


    BP: TEST, TEST, TEST

    Testing is a shared responsibility

    Functional, destructive, and stress testing

    Test environments must be representative of production
    - Both in terms of configuration and capacity
    - Separate from production
    - Building a test harness to mimic production workload is a necessary, but non-trivial, effort

    Ideally, problems would never be encountered first in production
    - If they are, the first question should be: Why didn't we catch the problem in test?
      - Exceeding some threshold
      - Unique timing or race condition
    - What can we do so we catch this type of problem in the future?
      - Build a test case that can be reused as part of pre-production testing.


    BP: Define, document, and adhere to Change Control Processes

    This amounts to self discipline

    Applies to all changes at all levels of the tech stack
    - Hw changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload

    If no changes are introduced, system will reach a steady state, and function forever.

    A well designed system will be able to tolerate some fluctuations and faults.

    A well managed system will meet service levels
    - If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem. I.e. rediscovery should not happen.
    - Ensure fixes are applied across all nodes in a cluster, and all environments to which the fix applies.
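    Checking that a fix has reached every node and environment is a set comparison once the applied fixes are collected per node. A minimal sketch, assuming the inventory has already been gathered by whatever means the site uses; the node names and fix identifiers below are placeholders, not real patch numbers:

```python
def missing_fixes(required_fixes, applied_by_node):
    """Map each node to the required fixes it lacks; fully patched nodes are omitted."""
    required = set(required_fixes)
    gaps = {}
    for node, applied in applied_by_node.items():
        missing = required - set(applied)
        if missing:
            gaps[node] = sorted(missing)
    return gaps


# Hypothetical inventory covering both production and test nodes.
inventory = {
    "prod-node1": ["FIX-A", "FIX-B"],
    "prod-node2": ["FIX-A"],
    "test-node1": ["FIX-A", "FIX-B"],
}
print(missing_fixes(["FIX-A", "FIX-B"], inventory))
```

    An empty result means every listed node carries every required fix; anything else is a change management gap to close before it becomes a rediscovered problem.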


    BP: Monitor your system

    Define key metrics and monitor them actively
    - Establish a (performance) baseline

    Learn how to use Oracle-provided tools
    - RDA (+ RACDDT)
    - AWR/ADDM
    - Active Session History
    - OSWatcher

    Coordinate monitoring and collection of OS level stats as well as db-level stats
    - Problems observed at one layer are often just symptoms of problems that exist at a different layer
    - Don't jump to conclusions
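    Two of the ideas on this slide, a baseline and cross-layer correlation, can be sketched in a few lines. This is an illustration only, assuming OS samples are available as (timestamp, value) pairs; the mean-plus-k-standard-deviations threshold, the 5-minute window, and all sample data are assumptions, and none of this reflects the output format of the Oracle tools listed above:

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Upper threshold from baseline samples: mean plus k standard deviations."""
    return mean(samples) + k * stdev(samples)


def os_samples_near(event_time, os_samples, window=timedelta(minutes=5)):
    """Return the (timestamp, value) OS samples within `window` of a cluster/db event."""
    return [(t, v) for t, v in os_samples if abs(t - event_time) <= window]


# Hypothetical baseline: CPU utilisation (%) captured during normal operation.
baseline_cpu = [40, 42, 38, 41, 39]
threshold = baseline_threshold(baseline_cpu)

# Hypothetical OS samples around a hypothetical node eviction at 10:00.
eviction_time = datetime(2018, 6, 2, 10, 0)
cpu_samples = [
    (datetime(2018, 6, 2, 9, 50), 41),
    (datetime(2018, 6, 2, 9, 58), 97),
    (datetime(2018, 6, 2, 10, 3), 95),
    (datetime(2018, 6, 2, 10, 20), 40),
]

for t, v in os_samples_near(eviction_time, cpu_samples):
    flag = "above baseline" if v > threshold else "within baseline"
    print(f"{t:%H:%M} cpu={v}% ({flag})")
```

    Without the OS-level samples, the eviction would be the only visible event; with them, it can be lined up against system activity at the same time.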


    Summary

    Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
    - Address these, and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain

    Additional areas of challenge
    - Configuration Management: initial install and config, standardized gold image deployment
    - Incident Management: diagnosing cluster-related problems