Wednesday 27 July 2011
Fall-back is the manual switch-over to an identical standby computer system in a different location.
There are three basic forms of fall-back services, with some variations:
- Hot site
- Warm site
- Cold site
A hot site is a fully configured computer facility with electrical power, heating, ventilation, and air conditioning (HVAC), and functioning file/print servers and workstations. The applications that are needed to sustain remote transaction processing are installed on the servers and workstations and are kept up-to-date to mirror the production system.
Theoretically, personnel and/or operators should be able to walk in and, with a data restoration of modified files from the last backup, begin full operations in a very short time.
If the site participates in remote journaling, that is, mirroring transaction processing with a high-speed data line to the hot site, even the backup time may be reduced or eliminated.
This type of site requires constant maintenance of the hardware, software, data, and applications to be sure the site accurately mirrors the state of the production site. This adds administrative overhead and can be a strain on resources, especially if a dedicated disaster recovery maintenance team does not exist.
A warm site can best be described as a cross between a hot site and a cold site. Like a hot site, the warm site is a readily available computer facility with electrical power, HVAC, and computers, but the applications may not be installed or configured.
It may have file/print servers, but not a full complement of workstations. External communication links and other data elements that commonly take a long time to order and install will be present, however. To enable remote processing at this type of site, workstations will have to be delivered quickly and applications and their data will need to be restored from backup media.
A cold site differs from the other two in that it is ready for equipment to be brought in during an emergency, but no computer hardware (servers or workstations) resides at the site.
The cold site is a room with electrical power and HVAC, but computers must be brought on-site if needed, and communications links may be ready or not. File and print servers have to be brought in, as well as all workstations, and applications will need to be installed and current data restored from backups.
If an organization has very little budget for an alternative backup processing site, the cold site may be better than nothing.
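As a rough summary of the three site types, the work that remains before processing can resume can be written down as a small lookup table. This is only a sketch of the descriptions above: the names RECOVERY_STEPS and steps_before_operation are made up for illustration, and a real recovery plan would contain far more detail.

```python
# Work that remains at each site type before processing can resume,
# taken from the descriptions of hot, warm, and cold sites above.
# The dictionary and function names are hypothetical.
RECOVERY_STEPS = {
    "hot": [
        "restore files modified since the last backup",
    ],
    "warm": [
        "deliver the missing workstations",
        "install and configure applications",
        "restore application data from backup media",
    ],
    "cold": [
        "bring in file and print servers",
        "bring in workstations",
        "set up communication links if they are not ready",
        "install applications",
        "restore current data from backups",
    ],
}


def steps_before_operation(site_type):
    """Return the list of remaining recovery steps for a site type."""
    return RECOVERY_STEPS[site_type.lower()]


for site in ("hot", "warm", "cold"):
    print(site, "site:", len(steps_before_operation(site)), "step(s) remaining")
```

The fewer steps remain, the faster the switch-over, but the more has to be kept up-to-date at the standby site in normal times.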
In rare cases, an organization may contract with a service bureau to fully provide all alternate backup processing services. The big advantages of this type of arrangement are the quick response and availability of the service bureau, the possibility of testing, and the fact that the service bureau may be available for more than backup.
The disadvantages of this type of setup are primarily the expense and resource contention during a large emergency.
Tuesday 12 July 2011
Infrastructure components or the combination thereof all fail at some moment in time. There are several reasons for failure, as described below.
Unavailability can arise from physical defects of parts in the infrastructure. Everything breaks down eventually, but mechanical parts are most likely to break first. Some examples of mechanical parts are:
- Fans for cooling equipment. Fans usually have a limited lifespan. They usually break because of dust in the bearings, which makes the motor work harder until it breaks.
- Disk drives. Disk drives contain two moving parts: the motor spinning the platters and the linear motor that moves the read/write heads.
- Tapes and tape drives. Tapes themselves are very vulnerable to defects, as the tape is spun on and off the reels all the time. Tape drives and especially tape robots contain very sensitive mechanical parts that can easily break.
Apart from mechanical failures due to normal use, parts also break because of external factors like ambient temperature, vibrations, and aging. Most parts favor stable temperatures. When the temperature in, for instance, a datacenter fluctuates, parts expand and shrink, leading to contact problems in, for instance, connectors, printed circuit board connections, or solder joints.
This effect also occurs when parts are exposed to vibrations and when parts are switched on and off frequently.
Some parts also age over time. Not only mechanical parts wear out, but also some electronic parts like large capacitors that contain fluids and transformers that vibrate due to humming. Solder joints also age over time, just like on/off switches that are used regularly.
Cables tend to fail too. The best example of this is SCSI flat cables. When confronted with an intermittent SCSI error, always replace the cable first. But not only flat cables have problems. Network cables also tend to fail over time, especially when they are moved around a lot. Another type of cable that is highly sensitive to mechanical stress is fiber optic cable.
Some systems like PC system boards and external disk caches are equipped with batteries. Batteries, even rechargeable batteries, are known to fail often.
Other components that typically fail are the oscillators used on system boards. These oscillators are in effect also mechanical parts and prone to failure.
The best solution to this problem is to implement resilience to avoid Single Points of Failure (SPOFs), as described in one of the next sections.
Environmental issues can cause downtime as well. Issues with power and cooling, and external factors like fire and flooding can cause entire datacenters to fail.
Power can fail for a short or long time, and can have voltage drops or spikes. Power outages can cause downtime, and power spikes can cause power supplies to fail. The effect of these power issues can be eliminated by using an Uninterruptible Power Supply (UPS).
Cooling issues include failure of the air conditioning system, leading to high temperatures in the datacenter. When the temperature rises too much, systems must be shut down to avoid damage.
Some external factors that can lead to unavailability are:
- Earthquakes - Not much can be done about this, apart from the building quality of the datacenter
- Flooding - In parts of the world where sea level is higher than land level (I live in The Netherlands at 6 metres below sea level) or where rivers can easily overflow, flooding can occur. This is why I always advise locating a datacenter on at least the second floor of a building.
- Fire - Proper fire extinguishing systems and fire prevention systems can help avoid or minimize downtime
- Smoke - Most fire-related downtime is due to smoke, not fire. Even when a fire is not in the datacenter but in some other part of the building, smoke can be a good reason to shut down the entire IT infrastructure, since smoke gets sucked into components by their fans and causes damage to the components.
- Terrorist attacks - There were datacenters located in the World Trade Center in New York. The 9/11 attacks caused severe downtime.
Software can cause downtime through bugs, including errors in file systems and operating systems. After human errors, software bugs are the number one reason for unavailability.
Because of the complexity of most software, it is nearly impossible (and very costly) to create bug-free software. Software bugs in applications can stop an entire system (like the infamous Blue Screen of Death on Windows systems), or create downtime in other ways. Since operating systems are software too, they contain bugs that can lead to corrupted file systems, network failures, or other sources of unavailability.
In most cases the failure rate of a component follows a so-called bathtub curve. A component failure is most likely when the component is new. In the first month of use the chance of a component failure is relatively high. Sometimes a component doesn't even work when unpacked for the first time, before it is used at all. This is what we call a DOA component - Dead On Arrival.
When the component still works after the first month, it is highly likely that it will continue working without failure until the end of its technical life cycle. This is the other end of the bathtub - the chance of failure rises enormously at the end of the life cycle of a component.
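To get a feel for the shape of the bathtub curve, it can be sketched as the sum of three failure rates: one that drops quickly (early failures), one that stays flat (random failures), and one that climbs steeply (wear-out). The sketch below uses Weibull hazard functions for the falling and rising parts, which is a common modelling choice but an assumption on my part. All parameter values, including the six-year life of 72 months, are invented for illustration and are not measurements.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate at age t (t > 0).

    shape < 1 gives a falling rate, shape > 1 a rising rate.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_months):
    """Illustrative failure rate per month at a given age in months."""
    early_failures = weibull_hazard(t_months, shape=0.5, scale=200.0)
    random_failures = 0.001  # constant background rate (assumed)
    wear_out = weibull_hazard(t_months, shape=6.0, scale=72.0)
    return early_failures + random_failures + wear_out


early = bathtub_hazard(0.5)   # two weeks old
middle = bathtub_hazard(24)   # two years old
late = bathtub_hazard(72)     # six years old

for label, rate in (("early", early), ("middle", middle), ("late", late)):
    print(f"{label:>6}: {rate:.4f} failures per month")
```

With these made-up numbers the rate is high at the start, lowest in the middle years, and highest at the end, which is the bathtub shape described above.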
Complexity of the infrastructure
Adding more components to an overall system design can undermine high availability, even if the extra components are needed to achieve high availability. This sounds paradoxical, but in practice I have seen such situations. Complex systems inherently have more potential failure points and are more difficult to implement correctly. A complex system is also harder to manage: more knowledge is needed to maintain it, and errors are easily made.
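A small calculation shows why extra components can hurt. When every component in a chain must work, the availability of the chain is the product of the individual availabilities, assuming the components fail independently. The sketch below applies this to hypothetical numbers: a component that is up 99.9% of the time, and a failover mechanism that itself works only 99.8% of the time. Both figures are assumptions picked for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must work."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of redundant components when one working unit is enough."""
    return 1 - prod(1 - a for a in availabilities)


single = 0.999  # one component, assumed 99.9% available

# Five such components in a row are less available than one.
chain = series_availability([single] * 5)

# A redundant pair looks excellent on paper ...
pair = parallel_availability([single, single])

# ... but the failover mechanism sits in series with the pair.
failover = 0.998  # assumed availability of the failover mechanism
pair_with_failover = series_availability([pair, failover])

print(f"single component:        {single:.6f}")
print(f"five in series:          {chain:.6f}")
print(f"redundant pair:          {pair:.6f}")
print(f"pair including failover: {pair_with_failover:.6f}")
```

With these numbers the redundant pair including its failover mechanism ends up less available than the single component on its own.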
Sometimes it is better to have an extra spare system than to have complex system redundancy in place. When a workstation fails, most people can work on another machine, and the defective machine can be swapped in 15 minutes. This is probably a better choice than implementing high availability measures in the workstation like dual network cards, dual connections to dual network switches that can fail over, failover drivers for the network card in the workstation, dual power supplies in the workstation fed via two separate cables and power outlets on two fuse boxes, etc. You get the point.
The same goes for high availability measures on other levels. I once had a very unstable set of redundant ATM (Asynchronous Transfer Mode) network switches in the core of a network. I could not get the systems to perform failover well, leading to many periods of downtime of a few minutes each. When I removed the redundancy in the network, the network did not fail again for a year. The leftover switches were loaded with a working configuration and put in the closet. If the core switch failed, we could swap it in 10 minutes (which, given that this would not happen more than once a year, and probably less often, led to an availability of at least 99.995%).
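The 99.995% figure can be checked with a few lines of arithmetic. The sketch below assumes a year of 365 days and counts only the swap time as downtime. The function names are my own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_percent(downtime_minutes_per_year):
    """Availability in percent for a given amount of downtime per year."""
    return 100.0 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)


def allowed_downtime_minutes(availability_pct):
    """Downtime per year, in minutes, that a given availability allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)


one_swap = availability_percent(10)        # one 10-minute swap per year
budget = allowed_downtime_minutes(99.995)  # downtime allowed at 99.995%

print(f"one 10-minute swap per year: {one_swap:.4f}% available")
print(f"99.995% allows {budget:.2f} minutes of downtime per year")
```

One swap of 10 minutes per year gives roughly 99.998%, and 99.995% leaves room for about 26 minutes of downtime per year, so "at least 99.995%" is a safe statement even if two swaps were needed in a year.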