Problems in High Availability
Availability is the measure of how much of the time a system is operational when it is required. In other words, availability is the ratio of time in service (available for service) to total time. It can be computed as MTTF / (MTTF + MTTR), where MTTF is the Mean Time To Failure and MTTR is the Mean Time To Repair (or Recover). When a user tries to connect to a server and the server does not respond, the system is said to be unavailable. Different systems have different availability requirements, and as systems grow larger it becomes harder to make them highly available.
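As a quick worked example of the formula above (the MTTF and MTTR figures here are hypothetical, chosen only for illustration):

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is in service."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical figures: a server that runs 2000 hours between failures
# and takes 2 hours to repair once it has failed.
a = availability(2000, 2)
print(round(a, 5))               # roughly 0.999, i.e. "three nines"
print(round((1 - a) * 8760, 1))  # expected downtime per year, in hours
```

Note that availability says nothing about how the downtime is distributed: one long outage and many short ones can yield the same ratio.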
Problems in High Availability:
The following problems can cause a system to go down or become unavailable: software failures, planned downtime, careless mistakes, hardware failures, and the environment in which the system is deployed. Each is described below.
Any software can contain faults or bugs introduced by programmers' mistakes. These bugs reside in the software and are activated only when a particular input pattern exercises the faulty code. The classic strategy for dealing with bugs is to find and remove them before deployment, because fixing a bug in operation is far more costly than fixing it during development and testing.
Software bugs are commonly divided into two types: Bohrbugs and Heisenbugs. Bohrbugs manifest consistently under the same circumstances and can therefore be reproduced. Heisenbugs, in contrast, are triggered only when a special set of events occurs in a particular order; they are hard to reproduce, which is why programmers and testers cannot find them easily.
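A race condition is a classic source of Heisenbugs: whether the fault manifests depends on thread scheduling rather than on the input. The sketch below (illustrative names, standard-library threading only) contrasts an unlocked read-modify-write that may silently lose updates with a locked version whose result is deterministic:

```python
import threading

def unsafe_add(counter, n):
    # Read-modify-write without a lock: a thread can be preempted
    # between the read and the write, silently losing increments.
    # Whether that happens depends on scheduling -- a Heisenbug.
    for _ in range(n):
        v = counter["value"]
        counter["value"] = v + 1

def safe_add(counter, n, lock):
    # The same loop under a lock always yields the correct total.
    for _ in range(n):
        with lock:
            counter["value"] += 1

def run(worker, *args):
    threads = [threading.Thread(target=worker, args=args) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

racy = {"value": 0}
run(unsafe_add, racy, 100_000)   # result varies from run to run

locked = {"value": 0}
run(safe_add, locked, 100_000, threading.Lock())
print(locked["value"])           # always 400000
```

The unlocked version may happen to produce the right answer on any given run, which is exactly what makes such bugs hard to catch in testing.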
A hardware failure occurs when a physical component of the system stops working. Components such as storage devices, network devices, or CPUs can fail during operation, either singly or in combination. Hardware failures mostly originate in design flaws, manufacturing defects, or wear-out.
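Since components can fail singly or in combination, a common back-of-the-envelope model, under the simplifying assumption that components fail independently, multiplies per-component availabilities for components in series (all must work) and multiplies unavailabilities for redundant components in parallel (the figures below are hypothetical):

```python
def series(*avails):
    # All components must be up at once: availabilities multiply.
    a = 1.0
    for x in avails:
        a *= x
    return a

def parallel(*avails):
    # The group is down only if every replica is down:
    # unavailabilities multiply.
    u = 1.0
    for x in avails:
        u *= (1.0 - x)
    return 1.0 - u

# Hypothetical per-component availabilities: CPU, disk, network.
single = series(0.999, 0.99, 0.995)
print(round(single, 4))   # the whole chain is weaker than any one part

# Mirroring the disk (two disks in parallel) improves the disk term.
mirrored = series(0.999, parallel(0.99, 0.99), 0.995)
print(round(mirrored, 4))
```

The independence assumption is optimistic: correlated failures (a shared power feed, a bad firmware batch) make real systems less available than this model predicts.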
Software and hardware are not the only causes of system unavailability. Power plays an important role in high availability: without a proper backup power system, the system can go down during a power failure. A power outage can also stop the cooling in a data center, and the resulting heat can cause hardware to stop working.
A system can also be unavailable because of maintenance or operational errors. A poor maintenance plan may make the system unavailable during crucial hours. Maintenance should follow a proper schedule and be carried out when the load on the system is minimal.
A system can also go down because of human error, whether from inexperience or poor planning. For example, an administrator who wants to make changes to the system may stop the network services instead of the intended service.
According to statistics, 40% of total downtime is caused by software failures, 30% by planned maintenance or upgrades, 15% by careless mistakes by people, 10% by hardware failures, and 5% by the environment.