Reliability

Week 11 - Lecture 2

What do we mean by reliability?

• Correctness – system/application does what it has to do correctly.

• Availability – system/application is available within the agreed time frame.

• Consistency – system/application provides much the same response time on each occasion.

Service Level Agreement

• Reliability and performance requirements are usually built into a Service Level Agreement (SLA)

• An SLA defines the level of service the organisation and the users can expect from the DIS

• It is negotiated between the organisation and the service provider, be that the internal IT dept or an outside body
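An availability target in an SLA is often easier to negotiate once it is translated into permitted downtime. The sketch below is illustrative only: the function name and the three percentage targets are assumptions, not figures from this lecture.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at the given availability target."""
    return period_hours * (1 - availability_pct / 100)


# Hypothetical SLA targets, chosen only to show how quickly the budget shrinks.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target, HOURS_PER_YEAR)
    print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the target is worth negotiating carefully.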

All components affect reliability

• Any component can affect the reliability of the whole system, but each component can affect different aspects: correctness, availability and consistency

• We will look at:

• Application software

• System software – O/S, DBMS & Middleware

• Server hardware

• Network

• Storage

• Change management and Problem management

Application Software

• Application software can affect availability for a few, some or all customers in the event of a failure.

• Main area for bugs – particularly if developed in-house or modified.

• Can affect correctness and consistency if changes to application software are not rigorously tested.

System software (DBMS, O/S, etc)

• System software failures generally affect availability for all customers on a server.

• Operating at high utilisation (90-95% capacity) can affect reliability. Parts of the system that are not often used can become active (eg queuing logic).

Server hardware

• Hardware failure will affect availability for all users on the server.

• One server supporting an application/database provides a Single Point of Failure (to be avoided).

• Server problems can affect consistency (eg failure of one processor in a multi-processor server will affect performance).

Networks - LAN

• LAN failures will affect availability for a few or many users.

• Changes to routers, switches or cabling can affect availability.

• LAN component failures/changes generally affect availability and consistency.

Networks - WAN

• It is a purchased service, controlled by an external company.

• WAN failure will generally affect all users (eg ISP failure will affect all access to the Internet)

• It requires:

• Careful selection of supplier

• Sufficient capacity for peak loads

• Carefully negotiated SLA

• Capable network management

Planning for Reliability

• Managing problems and changes.

• Planning for application and system software reliability

• Planning for hardware reliability

• Planning for disaster recovery

Managing Problems/Changes

• The cause of all problems MUST be determined and then resolved (or they will simply return again and again to affect availability)

• All application and system software changes MUST:

– be reviewed by a committee before implementation

– have been thoroughly tested

– have a back-out plan

– be APPROVED by all affected parties

– be implemented outside normal availability periods
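The conditions above can be treated as a gate that every change must pass. The following is a minimal sketch of such a gate. The `ChangeRequest` class and its field names are hypothetical, invented for illustration. They are not part of any real change-management tool.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical record of one software change; one flag per condition."""
    reviewed_by_committee: bool
    thoroughly_tested: bool
    has_back_out_plan: bool
    approved_by_all_affected_parties: bool
    scheduled_outside_availability_period: bool


def unmet_conditions(change: ChangeRequest) -> list[str]:
    """Return the names of the conditions the change has not yet satisfied."""
    return [name for name, satisfied in vars(change).items() if not satisfied]


change = ChangeRequest(True, True, False, True, True)
missing = unmet_conditions(change)
print("Ready to implement" if not missing else f"Blocked by: {missing}")
```

A change goes ahead only when the returned list is empty.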

Planning System Reliability

• Server selection and operating system must fit the scale of the operation.

• A regular system software update plan should be followed to fix bugs and implement new features.

• The update plan should be fully investigated:

– update may introduce new bugs

– may cause problems for applications

– may introduce performance problems

Planning Application Reliability

• Starts in design – how the objects and components are packaged and the interfaces designed

• Software package selection must place high weight on reliability factors (availability etc.)

• Implementations need formal processes:

• Test plans

• Testing techniques

• Test scripts

Planning for Hardware Reliability

• Build in redundancy, avoid single points of failure (even within hardware items).

• Use servers with multiple processors and hot-swap capability. Use server clusters if appropriate.

• Build redundancy and alternate routes into the network. Unlike the WAN, the LAN is under the organisation's own control.

• Disks have many mechanical parts and will fail often. Use RAID or redundancy whenever possible.
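The effect of redundancy on availability can be estimated with simple probability. The sketch below assumes that components fail independently, which real systems only approximate. The 0.99 availability figure and both function names are illustrative assumptions.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies where any one working copy is enough.

    Assumes independent failures (a simplification).
    """
    return 1 - (1 - component_availability) ** copies


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be working."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# One server at a hypothetical 99% versus a redundant pair of the same servers.
print(f"single server:  {parallel_availability(0.99, 1):.4f}")
print(f"redundant pair: {parallel_availability(0.99, 2):.4f}")

# Server, network and storage each at 99%, with no redundancy anywhere.
print(f"three in series: {series_availability(0.99, 0.99, 0.99):.4f}")
```

Components in series drag overall availability below that of any single part, while redundant copies raise it. That is the reasoning behind avoiding single points of failure.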

RAID

• Redundant Arrays of Independent Disks

• Groups of drives are linked to a special controller

• They appear as a single logical drive

• Take advantage of multiple physical drives to store data redundantly

• Six different RAID approaches numbered 0 to 5

0 Data striping, block oriented

– No redundancy, no protection from disk loss

– Reads and writes for contiguous blocks overlap, giving improved performance

– No space overhead

1 Disk mirroring, all data written to two disks

– Full data protection

– Improved read access

– Doubles disk space required

– Easy to implement, easy to recover

5 Data striping, block oriented, distributed parity

– Full error protection, but slower to recover than 1

– Slow write, good read performance

– 25% overhead in disk space
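Parity-based protection of the kind used in RAID 5 can be illustrated with XOR: the parity block is the XOR of the data blocks in a stripe, so any one lost block can be rebuilt from the survivors. This is a toy sketch in memory, not a description of how any particular controller works. The block contents and function names are made up for illustration.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def parity_overhead(drive_count: int) -> float:
    """Fraction of total capacity used for parity when one drive's worth is parity."""
    return 1 / drive_count


# One stripe across three data drives, plus its parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuilding it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

The rebuild has to read every surviving drive, which is consistent with RAID 5 recovering more slowly than a mirror. The 25% overhead figure corresponds to `parity_overhead(4)`, that is, a four-drive array. The drive count is an assumption here, as the slide does not state one.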

Planning for Business Continuance(or Disaster/Recovery)

• Planning to continue business in the event of a disaster is a design job. Examples: 1993 and 9/11.

• Consider all scenarios, plan recovery approach, test & document.

• Common causes are fires (Sydney), floods (Brisbane) or back-hoes.

• Test recovery regularly (every 3-6 months)

Performance

Week 11 - Lecture 2

Why is Performance Important?

• DIS systems have potential for performance issues

• New systems almost always require performance tuning

• DIS performance affects user productivity

• Performance is a measure of value for money

A simple test

• In most systems, what is likely to be the highest priority for users?

– Improved functionality

– Improved reliability

– Improved performance

Performance Measures

• Response time - time taken to complete a task or transaction

• Throughput - the amount of work (transactions) that can be completed in a set time period (sec or hour)

• The relationship between the two is generally inverse (although not always)
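One way to see why the relationship is not always inverse is to account for how many transactions are in progress at once. The sketch below uses Little's Law (throughput = transactions in progress / response time), which is a standard steady-state result and not something stated in these slides. The numbers are hypothetical.

```python
def throughput_per_second(concurrent_transactions: int, response_time_s: float) -> float:
    """Steady-state throughput by Little's Law: work in progress / time per item."""
    return concurrent_transactions / response_time_s


# A system handling one transaction at a time, each taking 0.5 s.
serial = throughput_per_second(1, 0.5)

# A system where each transaction is slower (1.0 s) but ten run concurrently.
concurrent = throughput_per_second(10, 1.0)

print(f"serial:     {serial:.1f} transactions/s")
print(f"concurrent: {concurrent:.1f} transactions/s")
```

The concurrent system has the slower response time but the higher throughput, so the two measures have to be tracked separately.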

Concurrency is the answer

[Figure: timelines plotted against Time, contrasting "Slow response time, high throughput" with "Fast response time, low throughput"]

A user requires consistency, then speed.

• A user wants a transaction to run consistently. The faster, the better.

• A user sees response time at the PC or terminal.

• A user is not concerned with the entire infrastructure that supports a transaction.

• IT staff see response time only in their domain of responsibility (server, database, network etc)
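Consistency and speed can be measured separately: the mean response time describes speed, and the spread around the mean describes consistency. The sketch below uses the standard deviation as the spread measure, which is one reasonable choice among several. Both sets of response times are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical response times in seconds for the same transaction on two systems.
steady_system = [1.0, 1.0, 1.0, 1.0]
erratic_system = [0.2, 0.2, 0.2, 3.0]

for name, samples in (("steady", steady_system), ("erratic", erratic_system)):
    print(f"{name}: mean {mean(samples):.2f} s, spread {pstdev(samples):.2f} s")
```

The erratic system is faster on average, yet its occasional 3-second response is what a user will notice and remember.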

Difficult to measure total response time

• How do you add together web server + application server + database server + network?

• Do you get statistics from each group? Will each group maintain statistics in the same format?

• You need to measure total response time and response in each area (server, database etc).

• New network monitors may be able to provide statistics closer to what you need
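Once each group reports timings in a common format, the total and the per-area breakdown are straightforward to combine. The sketch below assumes each tier reports an average time in milliseconds for the same transaction. The tier names follow the slide, but the figures and the `summarise` function are hypothetical.

```python
# Hypothetical average time spent in each area for one transaction (milliseconds).
tier_times_ms = {
    "web server": 40.0,
    "application server": 120.0,
    "database server": 250.0,
    "network": 90.0,
}


def summarise(tier_times):
    """Return total response time, each tier's share of it, and the slowest tier."""
    total = sum(tier_times.values())
    shares = {tier: elapsed / total for tier, elapsed in tier_times.items()}
    slowest = max(tier_times, key=tier_times.get)
    return total, shares, slowest


total, shares, slowest = summarise(tier_times_ms)
print(f"total response time: {total:.0f} ms")
print(f"slowest area: {slowest} ({shares[slowest]:.0%} of total)")
```

Simple addition only holds when the tiers are visited one after another. Where work overlaps, the total has to be measured end to end at the user's terminal.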

Improving performance

• You can add more resources (faster servers, faster disks, networks etc) to improve response time and throughput.

• However, performance improvements may not be proportional to the additional resources.

• A 100% increase in resources may only bring, say, a 70% performance improvement. This is the issue of scalability.
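The 100% versus 70% example can be expressed as a scaling efficiency. The sketch below is simple arithmetic on the slide's figures. The function names and the baseline of 1000 transactions per hour are assumptions for illustration.

```python
def scaling_efficiency(resource_increase_pct: float, performance_increase_pct: float) -> float:
    """Performance gained per unit of resource added (1.0 would be perfect scaling)."""
    return performance_increase_pct / resource_increase_pct


def projected_throughput(baseline: float, performance_increase_pct: float) -> float:
    """Throughput after applying a percentage performance improvement."""
    return baseline * (1 + performance_increase_pct / 100)


efficiency = scaling_efficiency(100, 70)
after_upgrade = projected_throughput(1000, 70)  # hypothetical 1000 transactions/hour

print(f"scaling efficiency: {efficiency:.0%}")
print(f"throughput after doubling resources: {after_upgrade:.0f} per hour, not 2000")
```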

Monitoring Performance

• Performance is a process, not a task.

• Performance should be constantly monitored. The cost of monitoring must be weighed against "do nothing".

• Performance tuning should be carried out to correct performance problems.
