Reliability: Week 11, Lecture 2
Post on 20-Dec-2015
What do we mean by reliability?
• Correctness – system/application does what it has to do correctly.
• Availability – the system is available within the agreed time frame.
• Consistency – the system provides much the same response time on each occasion.
Service Level Agreement
• Reliability and performance requirements are usually built into a Service Level Agreement (SLA)
• An SLA defines the level of service the organisation and the users can expect from the DIS
• It is negotiated between the organisation and the service provider, be that the internal IT dept or an outside body
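An availability target in an SLA translates directly into an amount of permitted downtime. The following is a minimal sketch of that arithmetic, assuming a 30-day month as the measurement period; the function names and the target figures are illustrative and do not come from the lecture.

```python
def allowed_downtime_minutes(target_percent, period_hours=30 * 24):
    """Minutes of downtime the target permits over the period (assumed 30 days)."""
    return period_hours * 60 * (1 - target_percent / 100)


def measured_availability(downtime_minutes, period_hours=30 * 24):
    """Availability actually achieved, as a percentage of the period."""
    return 100 * (1 - downtime_minutes / (period_hours * 60))


if __name__ == "__main__":
    # Hypothetical targets, for illustration only.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% target allows {minutes:.1f} minutes of downtime per 30 days")
```

Comparing `measured_availability` for a month against the agreed target is one simple way to check whether the SLA was met.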
All components affect reliability
• Any component can affect the reliability of the whole system, but each component can affect different aspects: correctness, availability and consistency
• We will look at:
• Application software
• System software – O/S, DBMS & Middleware
• Server hardware
• Network
• Storage
• Change management and Problem management
Application Software
• Application software can affect availability for a few, some or all customers in the event of a failure.
• Main area for bugs – particularly if developed in-house or modified.
• Can affect correctness and consistency if changes to application software are not rigorously tested.
System software (DBMS, O/S, etc)
• System software failures generally affect availability for all customers on a server.
• Operating at high utilisation (90-95% of capacity) can affect reliability. Parts of the system that are not often used can become active (e.g. queuing logic).
Server hardware
• Hardware failure will affect availability for all users on the server.
• One server supporting an application/database is a single point of failure (to be avoided).
• Server problems can affect consistency (e.g. failure of one processor in a multi-processor server will affect performance).
Networks - LAN
• LAN failures will affect availability for a few or many users.
• Changes to routers, switches or cabling can affect availability.
• LAN component failures/changes generally affect availability and consistency.
Networks - WAN
• It is a purchased service, controlled by an external company.
• WAN failure will generally affect all users (e.g. ISP failure will affect all access to the Internet)
• It requires
• Careful selection of supplier
• Sufficient capacity for peak loads
• Carefully negotiated SLA
• Capable network management
Planning for Reliability
• Managing problems and changes.
• Planning for application and system software reliability
• Planning for hardware reliability
• Planning for disaster recovery
Managing Problems/Changes
• The cause of all problems MUST be determined and then resolved (or they will simply return again and again to affect availability)
• All application and system software changes MUST
– be reviewed by a committee before implementation
– have been thoroughly tested
– have a back-out plan
– be APPROVED by all affected parties
– be implemented outside normal availability periods
Planning System Reliability
• Server selection and operating system must fit the scale of the operation.
• A regular system software update plan should be followed to fix bugs and implement new features.
• Each update in the plan should be fully investigated
– update may introduce new bugs
– may cause problems for applications
– may introduce performance problems
Planning Application Reliability
• Starts in design – how the objects and components are packaged and the interfaces designed
• Software package selection must place high weight on reliability factors (availability etc.)
• Implementations need formal processes
• Test plans
• Testing techniques
• Test scripts
Planning for Hardware Reliability
• Build in redundancy, avoid single points of failure (even within hardware items).
• Use servers with multiple processors and hot-swap capability. Use server clusters if appropriate.
• Build redundancy and alternate routes into the network. The LAN, unlike the WAN, is under your own control.
• Disks have many mechanical parts and will fail often. Use RAID or redundancy whenever possible
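The effect of redundancy on availability can be sketched numerically. The example below assumes components fail independently of one another (a simplification that real systems often violate) and uses made-up availability figures; the function names are illustrative.

```python
def series_availability(availabilities):
    """Availability when every component is required (each is a single point of failure)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(a, copies):
    """Availability when at least one of `copies` identical components must work."""
    return 1 - (1 - a) ** copies


if __name__ == "__main__":
    # Hypothetical figures: each component is available 99% of the time.
    print(series_availability([0.99, 0.99, 0.99]))  # server + network + storage, all required
    print(redundant_availability(0.99, 2))          # two servers, either one is enough
```

Under these assumptions, chaining required components lowers overall availability, while duplicating a component raises it.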
RAID
• Redundant Array of Independent Disks
• Groups of drives are linked to a special controller
• They appear as a single logical drive
• Take advantage of multiple physical drives to store data redundantly
• Six different RAID approaches numbered 0 to 5
• RAID 0 – data striping, block oriented
– No redundancy – no protection from disk loss
– Reads and writes for contiguous blocks overlap, giving improved performance
– No space overhead
• RAID 1 – disk mirroring – all data written to two disks
– Full data protection
– Improved read access
– Doubles disk space required
– Easy to implement, easy to recover
• RAID 5 – data striping, block oriented, distributed parity
– Full error protection, but slower to recover than RAID 1
– Slow write, good read performance
– 25% overhead in disk space
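The parity idea behind RAID 5 can be illustrated with byte-wise XOR: the parity block is the XOR of the data blocks, so any one lost block can be rebuilt from the survivors. This is a simplified sketch only; a real RAID 5 controller also rotates the parity block across the drives, and the block contents here are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuild it from the rest plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Rebuilding requires reading every surviving drive, which is consistent with RAID 5 being slower to recover than a mirror.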
Planning for Business Continuance (or Disaster/Recovery)
• Planning to continue business in the event of a disaster is a design job. Examples: 1993 and 9/11.
• Consider all scenarios, plan recovery approach, test & document.
• Common causes are fires (Sydney), floods (Brisbane) or back-hoes.
• Test recovery regularly (every 3-6 months)
Why is Performance Important?
• DIS systems have potential for performance issues
• New systems almost always require performance tuning
• DIS performance affects user productivity
• Performance is a measure of value for money
A simple test
• In most systems, what is likely to be the highest priority for users?
– Improved functionality
– Improved reliability
– Improved performance
Performance Measures
• Response time - time taken to complete a task or transaction
• Throughput - the amount of work (transactions) that can be completed in a set time period (sec or hour)
• The relationship between the two is generally inverse (although not always)
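Both measures can be derived from the same transaction log. The sketch below assumes each transaction is recorded as a (start, end) pair in seconds; the function name and the sample timings are illustrative only.

```python
def summarise(transactions):
    """transactions: list of (start_seconds, end_seconds) pairs."""
    durations = [end - start for start, end in transactions]
    first_start = min(start for start, _ in transactions)
    last_end = max(end for _, end in transactions)
    return {
        "mean_response_s": sum(durations) / len(durations),
        "worst_response_s": max(durations),
        "throughput_per_s": len(transactions) / (last_end - first_start),
    }


# Hypothetical sample: four transactions completed within a two-second window.
sample = [(0.0, 0.5), (0.2, 1.0), (1.0, 1.5), (1.5, 2.0)]
stats = summarise(sample)
```

Reporting the worst response time alongside the mean gives a rough view of consistency as well as speed.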
A user requires consistency, then speed.
• A user wants a transaction to run consistently. The faster, the better.
• A user sees response time at the PC or terminal.
• A user is not concerned with the entire infrastructure that supports a transaction.
• IT staff see response time only in their domain of responsibility (server, database, network etc)
Difficult to measure total response time
• How do you add together web server + application server + database server + network?
• Do you get statistics from each group? Will each group maintain statistics in the same format?
• You need to measure total response time and response in each area (server, database etc).
• New network monitors may be able to provide statistics closer to what you need
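Once each group reports its figures in a common unit, the total and the share per area are simple to compute. The sketch below assumes the tiers handle a transaction one after another, so their times add; the tier timings are made-up values in milliseconds.

```python
def breakdown(tier_times_ms):
    """Return total response time and each tier's share of it."""
    total = sum(tier_times_ms.values())
    shares = {tier: t / total for tier, t in tier_times_ms.items()}
    return total, shares


# Hypothetical per-tier timings for one transaction, in milliseconds.
tier_times_ms = {
    "web server": 40,
    "application server": 120,
    "database server": 200,
    "network": 90,
}
total_ms, shares = breakdown(tier_times_ms)
slowest = max(shares, key=shares.get)
```

The tier with the largest share is the natural first candidate for tuning.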
Improving performance
• You can add more resources (faster servers, faster disks, networks etc) to improve response time and throughput.
• However, performance improvements may not be proportional to the additional resources.
• A 100% increase in resources may bring only, say, a 70% performance improvement. This is a question of scalability.
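The arithmetic in the last point can be made explicit. This is a small sketch in which "scaling efficiency" is taken to mean performance gain divided by resource increase, a definition assumed here for illustration; the baseline throughput figure is hypothetical.

```python
def scaling_efficiency(resource_increase_pct, performance_gain_pct):
    """Fraction of the added resources that shows up as extra performance."""
    return performance_gain_pct / resource_increase_pct


def projected_throughput(base_throughput, performance_gain_pct):
    """Throughput after applying the observed performance gain."""
    return base_throughput * (1 + performance_gain_pct / 100)


efficiency = scaling_efficiency(100, 70)   # the lecture's example figures
new_rate = projected_throughput(200, 70)   # hypothetical baseline of 200 transactions/hour
```

An efficiency below 1.0 means each further increment of hardware buys less than a proportional improvement.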