8/11/2019 RAC Operational Best Practices

1/25

RAC & ASM Best Practices: You Probably Need More than Just RAC

Kirk McGowan

Technical Director, RAC Pack, Oracle Server Technologies

Cluster and Parallel Storage Development


2/25

Agenda: Operational Best Practices (IT MGMT 101)

Background

Requirements

Why RAC Implementations Fail

Case Study

Criticality of IT Service Management (ITIL) process

Best Practices

People, Process, AND Technology


3/25

Why do people buy RAC?

Low cost scalability
- Cost reduction, consolidation, infrastructure that can grow with the business

High Availability
- Growing expectations for uninterrupted service


4/25

Why do RAC Implementations Fail?

RAC, scale-out clustering is new technology
- Insufficient budget and effort are put towards filling the knowledge gap

HA is difficult to do, and cannot be done with technology alone
- Operational processes and discipline are critical success factors, but are not addressed sufficiently


5/25

Case Study

Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected.

Names have been changed to protect the innocent.


6/25

Case Study

Background

8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks

Oracle expertise (Development) engaged to help flatten the tech learning curve

Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort

Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
- Hw, OS, storage, network, RDBMS, cluster, and application


7/25

Case Study

Situation

New mission-critical deployment using the same technology stack
- Distinct architecture, application development teams, and operations teams
- Large staff turnover

Major escalation, post production
- CIO: "Oracle products do not meet our business requirements"
  - RAC is unstable
  - DG doesn't handle the workload
  - JDBC connections don't fail over


8/25

Case Study

Operational Issues

Requirements, aka SLOs, were not defined
- e.g. Claim of 20s failover time; application logic included 80s failover time; cluster failure detection time alone set to 120s

Inadequate test environments
- Problems encountered first in production, including the fact that SLOs could not be met

Inadequate change control
- Lessons learned in previous deployments were not applied to the new deployment: rediscovery of the same problems
- Some changes implemented in test, but never rolled into production: recurring problems (outages) in production
- No process for confirming a change actually fixes the problem prior to implementing in production
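The failover mismatch in the example above (20s claimed, 80s in application logic, 120s for cluster failure detection) is the kind of gap a simple scripted check can expose before production. A minimal sketch in Python: the three numbers come from the slide, while the dictionary keys and the function name are hypothetical, chosen only for illustration.

```python
# Values are taken from the slide's example; all names are hypothetical.
FAILOVER_SLO_SECONDS = 20

configured_timeouts = {
    "application_failover_logic": 80,
    "cluster_failure_detection": 120,
}


def components_violating_slo(slo_seconds, timeouts):
    """Return the components whose configured time alone exceeds the SLO."""
    return {name: secs for name, secs in timeouts.items() if secs > slo_seconds}


violations = components_violating_slo(FAILOVER_SLO_SECONDS, configured_timeouts)
for name, secs in sorted(violations.items()):
    print(f"{name}: {secs}s configured, but the failover SLO is {FAILOVER_SLO_SECONDS}s")
```

Any single component that exceeds the target makes the claimed failover time unreachable, whatever the rest of the stack does.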


9/25

Case Study

More Operational Issues

Poor knowledge transfer between internal teams
- Configuration recommendations, patches, fixes identified in previous deployments were not communicated
- Evictions are a symptom, not the problem

Inadequate system monitoring
- OS-level statistics (CPU, IO, memory) were not being captured
- Impossible to perform RCA on many problems without the ability to correlate cluster / database symptoms with system-level activity

Inadequate support procedures
- Inconsistent data capture
- No on-site vendor support consistent with criticality of system
- No operations manual
  - Managing and responding to outages
  - Responding and restoring service after outages
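To illustrate why captured OS-level statistics matter for RCA, here is a minimal Python sketch that pulls the OS samples recorded in the minutes before a cluster event such as an eviction. The sample data, field names, and the 5-minute window are all invented for illustration; real input would come from whatever collector the site runs.

```python
from datetime import datetime, timedelta


def samples_before_event(event_time, os_samples, window=timedelta(minutes=5)):
    """Return the OS samples taken in the window leading up to a cluster event."""
    start = event_time - window
    return [s for s in os_samples if start <= s["time"] <= event_time]


# Hypothetical OS samples (made-up values, not real measurements).
os_samples = [
    {"time": datetime(2020, 1, 1, 2, 50), "cpu_pct": 35, "io_wait_pct": 2},
    {"time": datetime(2020, 1, 1, 2, 57), "cpu_pct": 97, "io_wait_pct": 40},
    {"time": datetime(2020, 1, 1, 2, 59), "cpu_pct": 99, "io_wait_pct": 55},
]

# Hypothetical time at which a node eviction was logged.
eviction_time = datetime(2020, 1, 1, 3, 0)

context = samples_before_event(eviction_time, os_samples)
for sample in context:
    print(sample["time"].isoformat(), sample["cpu_pct"], sample["io_wait_pct"])
```

If no samples exist for the window, the eviction can only be described, not explained.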


10/25

Overview of Operational Process Requirements

What are ITIL Guidelines?

"ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems."


11/25

IT Service Management

IT Service Management = Service Delivery + Service Support

Service Delivery: partially concerned with setting up agreements and monitoring the targets within these agreements.

Service Support: processes can be viewed as delivering services as laid down in these agreements.


12/25

Provisioning of IT Service Mgmt

In all organizations, must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
- People with the right skills, appropriate training and the right service culture
- Effective and efficient Service Management processes
- Good IT Infrastructure in terms of tools and technology

Unless People, Processes and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.


13/25

Service Delivery

Financial Management

Service Level Management
- Severity/priority definitions, e.g. Sev1, Sev2, Sev3, Sev4
- Response time guidelines
- SLAs

Capacity Management

IT Service Continuity Management

Availability Management


14/25

Service Support

Incident Management
- Incident documentation & reporting, incident handling, escalation procedures

Problem Management
- RCAs, QA & process improvement

Configuration Management
- Standard configs, gold images, CEMLIs

Change Management
- Risk assessment, backout, sw maintenance, decommission

Release Management
- New deployments, upgrades, emergency release, component release


15/25

BP: Set & Manage Expectations

Why is this important?
- Expectations with RAC are different at the outset
- HA is as much (if not more so) about the processes and procedures as it is about the technology
- No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs

Must communicate what the technology can AND can't do

Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met

HA isn't cheap!


16/25

BP: Clearly define SLOs

Sufficiently granular
- Cannot architect, design, OR manage a system without clearly understanding the SLOs
- 24x7 is NOT an SLO

Define HA/recovery time objectives, throughput, response time, data loss, etc.
- Need to be established with an understanding of the cost of downtime for the system
- RTO and RPO are key availability metrics
- Response time and throughput are key performance metrics

Must address different failure conditions
- Planned vs unplanned
- Localized vs site-wide

Must be linked to the business requirements
- Response time and resolution time

Must be realistic
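One way to make "24x7 is NOT an SLO" concrete is to record RTO and RPO per failure condition and check that no condition is left undefined. A minimal Python sketch: the class, the condition labels, and every numeric target are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AvailabilitySLO:
    """Recovery objectives for one failure condition (hypothetical structure)."""
    rto_seconds: int  # recovery time objective
    rpo_seconds: int  # recovery point objective (tolerated data loss)


# The failure conditions named on the slide.
OUTAGE_KINDS = ("planned", "unplanned")
OUTAGE_SCOPES = ("localized", "site-wide")


def undefined_conditions(slos):
    """Return every (kind, scope) pair that has no SLO recorded."""
    return [c for c in product(OUTAGE_KINDS, OUTAGE_SCOPES) if c not in slos]


# Made-up targets, for illustration only.
slos = {
    ("unplanned", "localized"): AvailabilitySLO(rto_seconds=60, rpo_seconds=0),
    ("unplanned", "site-wide"): AvailabilitySLO(rto_seconds=3600, rpo_seconds=300),
}

gaps = undefined_conditions(slos)
print("Failure conditions without an SLO:", gaps)
```

Here the planned-outage conditions show up as gaps, which is the conversation to have with the business before the design starts.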


17/25

Manage to the SLOs

Definitions of problem severity levels

Documented targets for both incident response time and resolution time, based on severity

Classification of applications w.r.t. business criticality

Establish SLA with business
- Negotiated response and resolution times
- Definition of metrics
  - E.g. "Application Availability shall be measured using the following formula: Total Minutes In A Calendar Month minus Unscheduled Outage Minutes minus Scheduled Outage Minutes in such month, divided by Total Minutes In A Calendar Month"
- Negotiated SLOs

Effectively documents expectations between IT and business

Incident log: date, time, description, duration, resolution
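The availability formula quoted above translates directly into code. A minimal Python sketch, assuming outage durations are already totaled in minutes (for example from the incident log); the function names and the sample figures are illustrative.

```python
import calendar


def total_minutes_in_month(year, month):
    """Total minutes in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def application_availability(year, month, unscheduled_minutes, scheduled_minutes):
    """(Total - unscheduled outage - scheduled outage) / total, per the SLA formula."""
    total = total_minutes_in_month(year, month)
    return (total - unscheduled_minutes - scheduled_minutes) / total


# Made-up example: 30 unscheduled and 120 scheduled outage minutes in a 30-day month.
availability = application_availability(2019, 6, 30, 120)
print(f"Availability: {availability:.4%}")
```

Note that under this definition scheduled outages count against availability, so planned maintenance windows need to be part of the negotiated target.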


18/25

Example Resolution Time Matrix

Severity 1, Priority 1 and 2 SRs: < 1 hour

Severity 1, Priority 3 SRs: < 13 hours

Severity 2, Priority 1 SRs: < 14 hours

Severity 2 SRs: < 132 hours
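A matrix like this is easiest to manage to when it is encoded as a lookup that incident tooling can query. A minimal Python sketch using the figures from the slide; reading the last row as the target for all remaining Severity 2 SRs is an assumption, and the function names are hypothetical.

```python
# (severity, priority) -> resolution target in hours, from the example matrix.
RESOLUTION_TARGET_HOURS = {
    (1, 1): 1,
    (1, 2): 1,
    (1, 3): 13,
    (2, 1): 14,
}

# Assumption: the "Severity 2 SRs" row covers Severity 2 SRs of any other priority.
SEV2_DEFAULT_HOURS = 132


def resolution_target_hours(severity, priority):
    """Look up the resolution target for a service request (SR)."""
    key = (severity, priority)
    if key in RESOLUTION_TARGET_HOURS:
        return RESOLUTION_TARGET_HOURS[key]
    if severity == 2:
        return SEV2_DEFAULT_HOURS
    raise ValueError(f"No resolution target defined for severity {severity}, priority {priority}")


def target_breached(severity, priority, hours_to_resolve):
    """Targets are strict upper bounds ("< N hours"), so reaching N is a breach."""
    return hours_to_resolve >= resolution_target_hours(severity, priority)


print(resolution_target_hours(1, 3))
print(target_breached(2, 1, 20))
```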


20/25

BP: TEST, TEST, TEST

Testing is a shared responsibility

Functional, destructive, and stress testing

Test environments must be representative of production
- Both in terms of configuration and capacity
- Separate from production
- Building a test harness to mimic production workload is a necessary, but non-trivial, effort

Ideally, problems would never be encountered first in production
- If they are, the first question should be: Why didn't we catch the problem in test?
  - Exceeding some threshold
  - Unique timing or race condition
- What can we do so we catch this type of problem in the future?
  - Build a test case that can be reused as part of pre-production testing


21/25

BP: Define, document, and adhere to Change Control Processes

This amounts to self-discipline

Applies to all changes at all levels of the tech stack
- Hw changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload

If no changes are introduced, the system will reach a steady state and function forever

A well-designed system will be able to tolerate some fluctuations and faults

A well-managed system will meet service levels
- If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem. I.e. rediscovery should not happen.
- Ensure fixes are applied across all nodes in a cluster, and all environments to which the fix applies
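Checking that a fix reached every node and environment is a set comparison once the per-node inventories are collected. A minimal Python sketch: the fix identifiers, node names, and hand-written inventories are placeholders; in practice the inventory would come from the site's own patch inventory tooling.

```python
def missing_fixes(required_fixes, installed_by_node):
    """Map each node to the required fixes it lacks; fully patched nodes are omitted."""
    required = set(required_fixes)
    report = {}
    for node, installed in installed_by_node.items():
        missing = required - set(installed)
        if missing:
            report[node] = sorted(missing)
    return report


# Placeholder identifiers, not real patch numbers.
required_fixes = ["FIX-A", "FIX-B"]

# Hypothetical inventories for a production cluster and a test cluster.
installed_by_node = {
    "prod-node1": ["FIX-A", "FIX-B"],
    "prod-node2": ["FIX-A"],
    "test-node1": ["FIX-A", "FIX-B"],
    "test-node2": ["FIX-B"],
}

report = missing_fixes(required_fixes, installed_by_node)
for node, fixes in sorted(report.items()):
    print(f"{node} is missing: {', '.join(fixes)}")
```

Running a check like this as part of change control catches the "fixed in test, never rolled into production" case before it becomes another outage.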


23/25

BP: Monitor your system

Define key metrics and monitor them actively
- Establish a (performance) baseline

Learn how to use Oracle-provided tools
- RDA (+ RACDDT)
- AWR/ADDM
- Active Session History
- OSWatcher

Coordinate monitoring and collection of OS-level stats as well as db-level stats
- Problems observed at one layer are often just symptoms of problems that exist at a different layer
- Don't jump to conclusions
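A baseline is only useful if current readings are compared against it. A minimal Python sketch that flags a metric when it strays more than a chosen number of standard deviations from its baseline; the three-sigma threshold and the sample values are assumptions for illustration, not a recommendation from the slides.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline_samples, current_value, threshold_sigmas=3.0):
    """True if current_value lies more than threshold_sigmas from the baseline mean."""
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)  # requires at least two baseline samples
    if sigma == 0:
        return current_value != mu
    return abs(current_value - mu) > threshold_sigmas * sigma


# Made-up baseline of CPU utilization (%) captured under normal load.
cpu_baseline = [40, 42, 38, 41, 39]

print(deviates_from_baseline(cpu_baseline, 43))
print(deviates_from_baseline(cpu_baseline, 90))
```

The same comparison applies to db-level metrics, which makes it easier to see whether a database symptom coincides with an OS-level deviation.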


25/25

Summary

Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
- Address these, and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain

Additional areas of challenge
- Configuration Management: initial install and config, standardized gold image deployment
- Incident Management: diagnosing cluster-related problems
