Availability - Fall-back

27 July 11 - 16:25

Fall-back is the manual switch-over to an identical standby computer system in a different location.

There are three basic forms of fall-back services, with some variations:

  • Hot site
  • Warm site
  • Cold site

Hot site

A hot site is a fully configured computer facility with electrical power, heating, ventilation, and air conditioning (HVAC), and functioning file/print servers and workstations. The applications that are needed to sustain remote transaction processing are installed on the servers and workstations and are kept up-to-date to mirror the production system.

Theoretically, personnel and/or operators should be able to walk in and, with a data restoration of modified files from the last backup, begin full operations in a very short time.

If the site participates in remote journaling, that is, mirroring transaction processing with a high-speed data line to the hot site, even the backup time may be reduced or eliminated.

This type of site requires constant maintenance of the hardware, software, data, and applications to be sure the site accurately mirrors the state of the production site. This adds administrative overhead and can be a strain on resources, especially if a dedicated disaster recovery maintenance team does not exist.

Warm site

A warm site is best described as a cross between a hot site and a cold site. Like a hot site, the warm site is a readily available computer facility with electrical power, HVAC and computers, but the applications may not be installed or configured.

It may have file/print servers, but not a full complement of workstations. External communication links and other data elements that commonly take a long time to order and install will be present, however. To enable remote processing at this type of site, workstations will have to be delivered quickly and applications and their data will need to be restored from backup media.

Cold site

A cold site differs from the other two in that it is ready for equipment to be brought in during an emergency, but no computer hardware (servers or workstations) resides at the site.

The cold site is a room with electrical power and HVAC, but computers must be brought on-site if needed, and communications links may be ready or not. File and print servers have to be brought in, as well as all workstations, and applications will need to be installed and current data restored from backups.

If an organization has very little budget for an alternative backup processing site, the cold site may be better than nothing.
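The differences between the three site types can be summarised in a small data structure. The sketch below is only an illustration: the class name `FallbackSite`, its fields and the `recovery_steps` helper are hypothetical, and the true/false values simplify the descriptions above (a warm site, for instance, may have some workstations, and a hot site with remote journaling may need no restore at all).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FallbackSite:
    """Simplified, hypothetical description of a fall-back site."""
    name: str
    power_and_hvac: bool
    servers_present: bool
    workstations_present: bool
    applications_installed: bool


SITES = [
    FallbackSite("hot", True, True, True, True),
    FallbackSite("warm", True, True, False, False),
    FallbackSite("cold", True, False, False, False),
]


def recovery_steps(site: FallbackSite) -> list[str]:
    """List the work still to be done before the site can take over."""
    steps = []
    if not site.servers_present:
        steps.append("bring in servers")
    if not site.workstations_present:
        steps.append("bring in workstations")
    if not site.applications_installed:
        steps.append("install and configure applications")
    # Every site type in this simplified model still needs recent data.
    steps.append("restore current data from backup")
    return steps


for site in SITES:
    print(f"{site.name} site: {', '.join(recovery_steps(site))}")
```

The longer the list of steps, the cheaper the site is to maintain and the longer it takes to resume operations.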

Service Bureaus

In rare cases, an organization may contract with a service bureau to fully provide all alternate backup processing services. The big advantages of this type of arrangement are the quick response and availability of the service bureau, the possibility of testing, and the fact that the service bureau may be available for more than backup.

The disadvantages of this type of setup are primarily the expense and resource contention during a large emergency.

Reliability of infrastructure components

12 July 11 - 16:04

Infrastructure components or the combination thereof all fail at some moment in time. There are several reasons for failure, as described below.    

Physical defects 

Unavailability can of course arise from physical defects of parts in the infrastructure. Everything breaks down eventually, but mechanical parts are most likely to break first. Some examples of mechanical parts are:

  • Fans for cooling equipment. Fans usually have a limited lifespan. They usually break because of dust in the bearing, which makes the motor work harder until it breaks.
  • Disk drives. Disk drives contain two moving parts: the motor spinning the platters and the linear motor that moves the read/write heads.
  • Tapes and tape drives. Tapes themselves are very vulnerable to defects, as the tape is spun on and off the reels all the time. Tape drives, and especially tape robots, contain very sensitive mechanical parts that can easily break.

Apart from mechanical failures due to normal use, parts also break because of external factors like ambient temperature, vibrations and aging. Most parts favor stable temperatures. When the temperature in, for instance, a datacenter fluctuates, parts expand and shrink, leading to contact problems in connectors, printed circuit board connections or solder joints.

This effect also occurs when parts are exposed to vibrations and when parts are switched on and off frequently.

Some parts also age over time. Not only mechanical parts wear out, but also some electronic parts like large capacitors that contain fluids and transformers that vibrate due to humming. Solder joints also age over time, just like on/off switches that are used regularly.

Cables tend to fail too. The best example of this is SCSI flat cables. When confronted with an intermittent SCSI error, always replace the cable first. But not only flat cables have problems. Network cables also tend to fail over time, especially when they are moved around a lot. Another type of cable that is highly sensitive to mechanical stress is fiber optic cable.

Some systems, like PC system boards and external disk caches, are equipped with batteries. Batteries, even rechargeable batteries, are known to fail often.

Other components that typically fail are the oscillators used on system boards. These oscillators are in effect also mechanical parts and are prone to failure.

The best solution to this problem is to implement resilience to avoid Single Points Of Failure (SPOFs), as described in one of the next sections.

Environmental issues

Environmental issues can cause downtime as well. Issues with power and cooling, and external factors like fire and flooding can cause entire datacenters to fail.

Power can fail for a short or long time, and can have voltage drops or spikes. Power outages can cause downtime, and power spikes can cause power supplies to fail. The effect of these power issues can be eliminated by using an Uninterruptible Power Supply (UPS).

A cooling issue can be a failure of the air conditioning system, leading to high temperatures in the datacenter. When the temperature rises too much, systems must be shut down to avoid damage.

Some external factors that can lead to unavailability are:

  • Earthquakes - Not much can be done about this, apart from the building quality of the datacenter
  • Flooding - Flooding can occur in parts of the world where sea level is higher than land level (I live in The Netherlands at 6 metres below sea level) or where rivers can easily overflow. This is why I always advise locating a datacenter on at least the second floor of a building.
  • Fire - Proper fire extinction systems and fire prevention systems can help avoid or minimize downtime
  • Smoke - Most fire-related downtime is due to smoke, not fire. Even when a fire is not in the datacenter but in some other part of the building, smoke can be a good reason to shut down the entire IT infrastructure, since smoke gets sucked into components by their fans and damages them.
  • Terrorist attacks - There were datacenters located in the World Trade Center in New York. The 9/11 attacks caused severe downtime.

Software

Downtime can also be caused by software bugs, including errors in file systems and operating systems. After human errors, software bugs are the number one reason for unavailability.

Because of the complexity of most software it is nearly impossible (and very costly) to create bug-free software. Software bugs in applications can stop an entire system (like the infamous Blue Screen of Death on Windows systems), or create downtime in other ways. Since operating systems are software too, they contain bugs as well, which can lead to corrupted file systems, network failures or other sources of unavailability.

Bathtub curve

In most cases the failure rate of a component over its lifetime follows a so-called bathtub curve. A component failure is most likely when the component is new. In the first month of use the chance of a component failure is relatively high. Sometimes a component doesn't even work when unpacked for the first time, before it is used at all. This is what we call a DOA component - Dead On Arrival.

When the component still works after the first month, it is highly likely that it will continue working without failure until the end of its technical life cycle. This is the other end of the bathtub - the chance of failure rises enormously at the end of the life cycle of a component.
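The shape of the bathtub curve can be made concrete with a small numeric sketch. The model below is an assumption for illustration, not something measured: it adds a falling Weibull hazard (early failures), a constant background rate and a rising Weibull hazard (wear-out). All parameter values, including the six-year life span, are made up.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_months: float) -> float:
    """Illustrative failure rate per month at age t_months (must be > 0)."""
    # Early failures: shape < 1, so the rate falls as the component ages.
    early = weibull_hazard(t_months, shape=0.5, scale=200.0)
    # Constant background rate during normal use (hypothetical value).
    background = 0.002
    # Wear-out: shape > 1, so the rate rises sharply near end of life.
    wear_out = weibull_hazard(t_months, shape=6.0, scale=72.0)
    return early + background + wear_out


for age in (0.5, 1, 12, 36, 60, 72):
    print(f"{age:>5} months: {bathtub_hazard(age):.4f} failures/month")
```

With these made-up parameters the rate is high in the first weeks, lowest in the middle years and high again near the end of life, which is the bathtub shape.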

Complexity of the infrastructure

Adding more components to an overall system design can undermine high availability, even if the extra components are needed to achieve high availability. This sounds paradoxical, but in practice I have seen such situations. Complex systems inherently have more potential failure points and are more difficult to implement correctly. A complex system is also harder to manage: more knowledge is needed to maintain it, and errors are easily made.
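A simplified calculation shows how this can happen. The sketch below assumes independent failures and treats the failover mechanism as an extra component that must itself work. Both assumptions, and all availability figures, are hypothetical.

```python
def series(*availabilities: float) -> float:
    """Availability of components that must all work (independent failures assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where one working unit is enough."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


switch = 0.999          # hypothetical availability of a single switch
failover_logic = 0.995  # hypothetical availability of the failover mechanism

single = switch
redundant = series(parallel(switch, switch), failover_logic)
print(f"single: {single:.4%}  redundant pair with failover: {redundant:.4%}")
```

With these numbers the redundant pair is less available than the single switch, because the unreliable failover mechanism costs more than the second switch gains. With a perfect failover mechanism the pair would of course be better.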

Sometimes it is better to have an extra spare system than to have complex system redundancy in place. When a workstation fails, most people can work on another machine, and the defective machine can be swapped in 15 minutes. This is probably a better choice than implementing high availability measures in the workstation, like dual network cards, dual connections to dual network switches that can fail over, failover drivers for the network card in the workstation, dual power supplies in the workstation fed via two separate cables and power outlets on two fuse boxes, etc. You get the point.

The same goes for high availability measures on other levels. I once had a very unstable set of redundant ATM (Asynchronous Transfer Mode) network switches in the core of a network. I could not get the systems to perform failover well, leading to many periods of downtime of a few minutes each. When I removed the redundancy, the network did not fail again for a year. The leftover switches were loaded with a working configuration and put in the closet. If the core switch failed, we could swap it in 10 minutes. Given that this would not happen more than once a year, and probably less often, this led to an availability of at least 99.995%.
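The availability figure in the switch example can be checked with a little arithmetic. The sketch below assumes a 365-day year. The function names are only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_percent(downtime_minutes_per_year: float) -> float:
    """Return availability as a percentage for a given yearly downtime."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_pct / 100.0)


# One 10-minute switch swap per year, as in the example above.
one_swap = availability_percent(10)
# Downtime budget implied by a 99.995% target.
budget = allowed_downtime_minutes(99.995)
print(f"{one_swap:.4f}% available; 99.995% allows {budget:.1f} minutes per year")
```

One 10-minute swap per year gives roughly 99.998% availability. A 99.995% target allows about 26 minutes of downtime per year, so it would even tolerate two such swaps.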




About Sjaak Laan

Sjaak Laan

I am 46 years old and married to Angelina. We have 3 children, aged 13, 8 and 6. We live in The Netherlands, in a place called Drachten.

I work for Logica as Principal IT Architect. I have 20 years of IT experience.

I own the following certificates:

ITAC Master Certified IT Architect

CISSP (Certified Information Systems Security Professional)

TOGAF Certified Architect



I am a member of the:


I manage my business contacts using LinkedIn.


I can be reached through sjaak.laan [ a t ] gmail [dot] com.

This site states my opinion only, and not necessarily the opinion of my employer or of the clients I work for.