Solutions to the Problems in High Availability

Introduction

A system is called available if a user requests a service and receives a proper response, with the desired job completed on the server. Availability is also defined as the ratio of the time the system is in service to the total time it is expected to be in service [1]. Different systems have different availability requirements, and critical systems have very strict ones. If a user tries to access the system and does not get a proper response, the system is called unavailable. There can be many reasons for unavailability, such as software, power or hardware failures [2].
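The ratio above can be sketched in a few lines of Python. This is only an illustration: the function names and the 8760-hour year are assumptions made for the example, not something taken from the cited sources.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed time during which the system was in service."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def downtime_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Hours of downtime per year implied by an availability figure."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    # Hypothetical measurements: 999 hours in service, 1 hour out of service.
    print(f"availability: {availability(999, 1):.3%}")
    print(f"yearly downtime at that level: {downtime_per_year(0.999):.2f} h")
```

Expressing a target as yearly downtime makes it easier to compare the availability requirements of different systems.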

Solutions in High Availability:

Here are the main causes of system unavailability and ways to address them.

Software Failure:

Software failure is one of the major causes of system unavailability. Software fails due to unhandled errors in programs [3]. These errors reside in the program and are triggered when an external input interacts with the affected part of the code. Software bugs can be divided into two categories: Bohrbugs and Heisenbugs [4]. Bohrbugs can be reproduced reliably, so developers or testers can detect and remove them. Heisenbugs are hard to reproduce, which makes them difficult to find and remove during software development.

Because Heisenbugs behave non-deterministically, repeating the same steps often does not trigger the failure again, so restarting the application can solve the problem. This restart technique can be implemented with checkpoints. Checkpoints save a snapshot of the system state at regular intervals during execution, and when the system restarts it restores the most recent saved state.
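A minimal sketch of checkpoint and restart is given below. It assumes the state fits in a small JSON file and that a transient fault shows up as a RuntimeError. The names save_checkpoint, load_checkpoint and run_with_restart are hypothetical and chosen only for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def run_with_restart(items, path, step, max_restarts=3):
    """Process items in order, restoring from the checkpoint after a failure."""
    restarts = 0
    while True:
        state = load_checkpoint(path)
        try:
            for i in range(state["next_index"], len(items)):
                state["total"] += step(items[i])
                state["next_index"] = i + 1
                save_checkpoint(path, state)
            return state["total"]
        except RuntimeError:
            # Assumed transient (Heisenbug-like) fault: retry from the snapshot.
            restarts += 1
            if restarts > max_restarts:
                raise
```

Because the checkpoint is written after every item, a restart repeats at most the one item that was in progress when the failure occurred. The limit on restarts keeps a reproducible bug (a Bohrbug) from looping forever.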

Another approach for software components is to use redundant components when developing large-scale applications. These redundant components act as backups, and if one component fails another can replace it. Software redundancy prevents unavailability by detecting a failing component and replacing it, ideally before it actually fails.
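The idea of a backup component taking over can be sketched as follows. The Replica and FailoverGroup classes are hypothetical, and the healthy flag stands in for whatever failure detection a real system would use.

```python
class Replica:
    """A hypothetical service replica; `healthy` stands in for a real health probe."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverGroup:
    """Send each request to the first replica that answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request):
        errors = []
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError as exc:
                # This replica failed; remember why and try the next one.
                errors.append(str(exc))
        raise RuntimeError("all replicas failed: " + "; ".join(errors))


if __name__ == "__main__":
    primary, backup = Replica("primary"), Replica("backup")
    group = FailoverGroup([primary, backup])
    print(group.handle("request-1"))
    primary.healthy = False  # simulate a failure of the primary
    print(group.handle("request-2"))
```

The caller only talks to the group, so the switch from the failed component to its backup stays invisible to the user as long as at least one replica is working.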

Hardware Failure:

When a system is down because a physical component has failed, it is called a hardware failure. We can overcome hardware failure with hardware redundancy, which prevents unavailability by detecting a failing component before it actually fails and by bypassing a failure when it does occur. For this we can use server-class hardware. Server-class hardware monitors the server's components for failure, and when a component fails it notifies the administrator and switches to a redundant component so that the server keeps working during the failure [5].

Other solutions can also be used to prevent hardware failure. One of them is to apply fault-tolerant design when designing hardware components. Fault-tolerant design can be implemented through modularity, fail-fast behavior and independent failure modes. Modularity is the decomposition of the whole system into independent components, so that a failure affects only the module concerned instead of the whole system. Fail-fast means that each module either works correctly or stops immediately, so a fault is detected early rather than spreading. Independent failure modes mean that the modules are designed so that the failure of one does not cause the others to fail. The overall idea is that each module works on its own, so that if a single module fails the other components keep working without interruption.
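The paragraph above is about hardware design, but the same principles can be illustrated with a small software analogue. In the sketch below the module names (billing, shipping) and the ModuleFailure exception are invented for the example. Each module stops at once when it detects a problem, and a failure in one module does not stop the other.

```python
class ModuleFailure(Exception):
    """Raised by a module that detects it can no longer work correctly."""


def billing(order):
    # Fail fast: stop immediately on an invalid amount instead of guessing.
    if order["amount"] < 0:
        raise ModuleFailure("billing: negative amount")
    return "billed"


def shipping(order):
    # Fail fast: stop immediately if the address is missing.
    if not order.get("address"):
        raise ModuleFailure("shipping: missing address")
    return "shipped"


def run_modules(modules, order):
    """Run each module independently and record which ones failed."""
    results = {}
    for name, module in modules.items():
        try:
            results[name] = module(order)
        except ModuleFailure as exc:
            # The failure is contained here; the remaining modules still run.
            results[name] = f"failed ({exc})"
    return results


if __name__ == "__main__":
    order = {"amount": -5, "address": "12 Example Street"}
    print(run_modules({"billing": billing, "shipping": shipping}, order))
```

Here the billing module fails on the bad amount, while the shipping module still completes its work.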

Power Failure:

Proper backup power systems should be installed with the servers so that they take over in case of a power failure. A UPS and an alternative power source should be installed to overcome this failure.

Maintenance Issues:

A system can be unavailable because of a poorly planned maintenance schedule. For example, if maintenance is done during peak hours, the majority of users suffer. There must be a proper maintenance plan: maintenance should be done when there is minimal load on the system, and users should be notified in advance so that anybody who wants to use the system during that period can choose an alternative time slot.
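One way to find the period of minimal load is to look at historical request counts per hour. The sketch below assumes such counts are available as a list with one entry per hour of the day. The function name quietest_window and the sample numbers are made up for illustration.

```python
def quietest_window(hourly_load, window_hours):
    """Return (start_hour, total_load) of the least-loaded contiguous window.

    The window may wrap past the end of the list, so with 24 entries
    hour 23 is followed by hour 0.
    """
    n = len(hourly_load)
    if not 0 < window_hours <= n:
        raise ValueError("window must fit within the measured period")
    best_start, best_total = 0, None
    for start in range(n):
        total = sum(hourly_load[(start + k) % n] for k in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


if __name__ == "__main__":
    # Hypothetical requests per hour, starting at midnight.
    load = [40, 25, 10, 8, 12, 30, 90, 200, 450, 600, 640, 620,
            580, 610, 590, 560, 500, 420, 350, 280, 200, 150, 100, 60]
    start, total = quietest_window(load, window_hours=2)
    print(f"schedule maintenance at {start:02d}:00 (load {total})")
```

The start hour found this way is also the time to announce to users in advance.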

Human Mistakes:

A system can be unavailable because of a human mistake. For example, an administrator stops the wrong service and as a result the whole system is not accessible. To overcome this problem, proper training and expertise are required before dealing with critical components of the system [6].

References:

[1]        H. Aziz, “High Availability,” lecture slides, Server Architecture course, 2011.

[2]        J. Gray, “Why Do Computers Stop And What Can Be Done About It?,” 1985.

[3]        J.-C. Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology,” in Twenty-Fifth International Symposium on Fault-Tolerant Computing, “Highlights from Twenty-Five Years,” 1995, p. 2.

[4]        M. Grottke and K. S. Trivedi, “Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate,” Computer, vol. 40, no. 2, pp. 107-109, 2007.

[5]        “Preventing Downtime with Redundant Components.” [Online]. Available: http://technet.microsoft.com/en-us/library/cc917700.aspx. [Accessed: 07-May-2011].

[6]        A. Wood, “Predicting client/server availability,” Computer, vol. 28, no. 4, p. 41, 1995.
