Tuesday 28 June 2011
Usually only about 20% of failures have a technological cause. In the other 80% of cases, human error is the reason. For instance, a system administrator accidentally pulls the wrong cable or enters an incorrect command. Users sometimes delete important (system) files.
Of course it helps to have highly qualified and trained personnel with a healthy sense of responsibility. To err is human, however, and there is no cure for it. End users can introduce downtime by misusing the system. When a user, for instance, starts generating 10 very large reports at the same time, the performance of the system could suffer to such a degree that the system is in effect unavailable to other users.
Also, when a user forgets a password (and perhaps enters an incorrect password more than 5 times), he is locked out and the system is unavailable to him. If that person has a very responsible job, for instance approving some steps in a business process, being locked out could mean that the business process is unavailable to other users as well.
Most unavailability issues, however, are the result of actions by system managers. Some typical actions (or the lack thereof) are:
- Performing a test in the production environment (hopefully by accident - testing in production is of course not recommended at all)
- Switching off a wrong component (not the defective server that needs repair, but the one still operating)
- Swapping a good working disk in a RAID set instead of the defective one
- Restoring the wrong back-up tape to production
- Accidentally removing files (mail folders, configuration files, database files)
- Incorrect changes to configuration files (for instance the routing table of a network router, or a change in the Windows registry)
- Tripping over cables, creating a broken or disconnected cable
- Incorrect labeling of cables, later leading to errors when a change is performed
- Stopping an incorrect virtual machine (the one in production instead of the one in the test environment)
- Making a typo in a system command (in UNIX: sudo rm -rf / *.back instead of sudo rm -rf /*.back where one space too many leads to a complete erasure of a hard disk - did you notice the difference?)
- Insufficient testing, for instance the fall-back procedure to move operations from the primary datacenter to the secondary was never tested, and failed when it was most needed
- A system manager or architect made a mistake in the design of the infrastructure, leading to downtime (we thought the Windows cluster was designed in a good way, but when one of the cluster nodes failed, we found that the complete cluster went down)
Many of these mistakes can be avoided by using proper systems management procedures, such as having a standard template for creating new servers, using formal deployment strategies with the appropriate tools, and using administrative accounts only when absolutely needed.
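As an illustration of how a procedure can catch the rm typo from the list above before it does damage, here is a minimal sketch of a pre-flight check. It is not a standard tool: the function name `protected_targets` and the `PROTECTED_PATHS` set are made up for this example, and a real environment would need its own list of paths.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical set of paths that should never be the target of a recursive delete.
PROTECTED_PATHS = {"/", "/etc", "/usr", "/var", "/home", "/boot"}


def protected_targets(command):
    """Return the arguments of `command` that name a protected path.

    The command is split the way a POSIX shell splits it, before any
    wildcard expansion, so "/ *.back" yields two arguments: "/" and "*.back".
    """
    hits = []
    for token in shlex.split(command):
        if token.startswith("-"):
            continue  # skip options such as -rf
        # Normalise trailing slashes, e.g. "/etc/" becomes "/etc".
        if str(PurePosixPath(token)) in PROTECTED_PATHS:
            hits.append(token)
    return hits


# The typo: the extra space makes "/" an argument of its own.
print(protected_targets("sudo rm -rf / *.back"))
# The intended command: only files matching /*.back are targeted.
print(protected_targets("sudo rm -rf /*.back"))
```

The first call reports "/" as a protected target, while the second reports nothing. A wrapper script could refuse to run, or ask for explicit confirmation, whenever the returned list is not empty.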
In some UNIX environments, a user who starts working under an administrative account (root) automatically gets the following message:
We trust you have received the usual lecture from the local System Administrator.
It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
I think this message makes people aware, leading to fewer mistakes.
Tuesday 14 June 2011
Although many measures can be taken to provide high availability, the availability of the IT infrastructure can never be guaranteed in all situations. In case of a disaster, the infrastructure could become unavailable, in some cases for a longer period of time. Business continuity is about identifying threats an organization faces and providing an effective response.
Business Continuity Management (BCM) and Disaster Recovery Planning (DRP) are processes to handle the effect of disasters. The Recovery Time Objective (RTO) and the Recovery Point Objective (RPO) determine the requirements for DRP.
Business Continuity Management (BCM)
BCM is not about IT alone. It includes managing business processes, and the availability of people and work places in disaster situations. It includes disaster recovery, business recovery, crisis management, incident management, emergency management, product recall, and contingency planning.
A Business Continuity Plan (BCP) describes the measures to be taken when a critical incident occurs in order to continue running critical operations, and to halt non-critical processes. The BS 25999 standard provides guidelines on how to implement BCM.
Disaster Recovery Planning (DRP)
Disaster recovery planning contains a set of measures to take in case of a disaster, when (parts of) the IT infrastructure must be accommodated in an alternative location. The IT disaster recovery standard BS 25777 can be used to implement DRP.
DRP assesses the risk of failing IT systems and provides solutions. A typical DRP solution is the use of fall-back facilities and having a Computer Emergency Response Team (CERT) in place. A CERT is usually a team of systems managers and senior management that decides how to handle a certain crisis once it becomes reality.
The steps that need to be taken to recover from a disaster depend heavily on the type of disaster. The organization's building could be damaged or destroyed (for instance by a fire), and people may even have been injured or killed.
One of the first worries is of course to save people. But after that, procedures must be followed to restore IT operations as soon as possible. A new (temporary) building might be needed, temporary staff might be needed, and new equipment must be installed or hired.
After that, steps must be taken to get the systems up and running again and to have the data restored. Connections to the outside world must be established (not only to the Internet, but also to business partners) and business processes must be initiated again.
RTO and RPO
Two important objectives of DRP are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). The next figure shows the difference.
Figure: RTO and RPO
The RTO is the duration of time within which a business process must be restored after a disaster, in order to avoid unacceptable consequences (like bankruptcy). So the RTO determines the maximum time it may take for a failure to be repaired.
RTO is basically the same concept as MTTR, but for a complete business process instead of one component. Measures like failover and fall-back must be taken in order to fulfill the RTO requirements.
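To make the RTO check concrete, here is a small sketch that adds up estimated recovery steps and compares the total with the RTO. The step names and durations are hypothetical, and the sketch assumes the steps are carried out one after another; steps that can run in parallel would shorten the total.

```python
from datetime import timedelta


def total_recovery_time(steps):
    """Add up the estimated durations of recovery steps that run one after another."""
    return sum(steps.values(), timedelta())


def meets_rto(steps, rto):
    """Return True if the summed step durations fit within the RTO."""
    return total_recovery_time(steps) <= rto


# Hypothetical estimates for one business process.
steps = {
    "arrange temporary location": timedelta(hours=8),
    "install replacement equipment": timedelta(hours=12),
    "restore systems and data": timedelta(hours=6),
    "re-establish external connections": timedelta(hours=2),
}
rto = timedelta(hours=24)

print(total_recovery_time(steps), meets_rto(steps, rto))
```

With these example figures the steps add up to 28 hours, so an RTO of 24 hours is not met and additional measures (such as a prepared fall-back site) would be needed.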
The RPO is the point in time to which data must be recovered considering some "acceptable loss" in a disaster situation. It describes the amount of data loss a business is willing to accept in case of a disaster, measured in time. For instance, when each day a backup is made of all data, and a disaster destroys all data, the maximum RPO is 24 hours – the maximum amount of data lost between the last backup and the occurrence of the disaster.
To lower the RPO, a different back-up regime should be implemented.
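The relation between the back-up interval and the RPO can be sketched in a few lines. This is a simplified model that assumes back-ups complete instantly and survive the disaster; the function names and the dates are only for illustration.

```python
from datetime import datetime, timedelta


def data_loss(last_backup, disaster):
    """Time span of data lost: everything written after the last back-up."""
    return disaster - last_backup


def meets_rpo(backup_interval, rpo):
    """In the worst case the disaster strikes just before the next back-up,
    so the loss equals the full back-up interval."""
    return backup_interval <= rpo


# Example: nightly back-up at 23:00, disaster the next day at 15:30.
lost = data_loss(datetime(2011, 6, 13, 23, 0), datetime(2011, 6, 14, 15, 30))
print(lost)

# A daily back-up satisfies an RPO of 24 hours, but not an RPO of 4 hours.
print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))
```

In this model, meeting an RPO of 4 hours requires a back-up (or replication) interval of at most 4 hours.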