99.999% availability

I sometimes hear the strangest figures when availability is discussed.

"The system must be available 95% of the time" or "The system shall never fail" or "We will only accept 99.999% uptime (5 nines)".

Usually these figures are not based on any calculation, and the people quoting them have no idea what it costs to reach them.

To make things clear: All hardware will break. The question is not if something breaks, but when.

Some calculations:

There are 24*365=8760 hours in one year. 1% of this is 87.6 hours. A system with an availability of 95% can be unavailable for 438 hours per year. This means more than 18 full days per year!

At the other end of the spectrum is the 99.999% demand. Here a system may only be unavailable for about 5 minutes per year, including any repair time! The 99.999% (five nines) is a popular number these days.
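The arithmetic above can be reproduced for any availability target. The following is a minimal Python sketch; the function name allowed_downtime_hours is made up for this illustration, and a non-leap year of 8760 hours is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a system may be down at the given availability."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


# Print the yearly downtime budget for a few commonly quoted targets.
for target in (95.0, 99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

For 95% this gives the 438 hours mentioned above; for five nines it gives a little over 5 minutes.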

Availability can be calculated from the MTBF and the MTTR: it is the MTBF divided by the sum of the MTBF and the MTTR.

MTBF

For hardware, usually an MTBF (Mean Time Between Failures) is stated. A Seagate Cheetah hard disk, for instance, has an MTBF of 1,200,000 hours. This means that on average the hard disk will fail once every 136 years or so. A system is built from many components, each with its own MTBF. Imagine a disk cabinet with 64 disks (this is not unusual in a SAN). In such a setup, one of these disks will fail roughly every 2 years, even with the large MTBF of the Seagate disks.
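The disk cabinet figure follows from dividing the MTBF of one disk by the number of disks. Below is a rough Python sketch of that calculation. It assumes identical disks that fail independently at a constant rate, which is a simplification, and the function name years_between_failures is invented for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming a non-leap year


def years_between_failures(mtbf_hours: float, units: int = 1) -> float:
    """Average years between failures among `units` identical parts.

    Assumption: the parts fail independently and at a constant rate,
    so the combined MTBF is the single-part MTBF divided by the count.
    """
    return mtbf_hours / units / HOURS_PER_YEAR


cabinet = years_between_failures(1_200_000, units=64)
print(f"64-disk cabinet: one disk failure every {cabinet:.1f} years on average")
```

With 64 disks the combined MTBF drops to 18,750 hours, which is a little over 2 years.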

While disks are the components that fail the most (because they contain many moving parts), other components of a system also have an MTBF: for instance servers (mainly the fans in the power supplies), routers, switches, and even cabling.

The MTBF figure is mainly a marketing instrument. How can Seagate prove that their disks will actually fail on average every 136 years? Usually this is done using simulations and tests under stress conditions.

MTTR

Apart from MTBF, there is MTTR: Mean Time To Repair. This is the time needed to fix or replace a broken system (or part of it). Usually the MTTR is kept low by having a service contract with the supplier of the part. Sometimes spare parts can be kept on-site to keep the MTTR low.
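With both figures known, availability is commonly estimated as MTBF / (MTBF + MTTR), the fraction of time a component is up. The sketch below shows how strongly the repair time influences the result. The MTBF of 10,000 hours and the three MTTR values are hypothetical numbers chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component with an MTBF of 10,000 hours and three
# hypothetical repair times (on-site spare, next-day service, slow delivery).
for mttr in (4, 24, 72):
    print(f"MTTR {mttr:>2} h -> {availability(10_000, mttr) * 100:.3f}% available")
```

The same component becomes noticeably less available as the repair time grows, which is why service contracts and on-site spares matter.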

Software

Besides hardware, systems contain software. Usually the MTBF and MTTR for software components cannot be calculated easily. No programmer will state the MTBF of the software she wrote. Who knows the MTBF of Windows? Of Linux? SAP? Your in-house developed software?

The human aspect

Usually only about 20% of failures have a technical cause. In 80% of the cases, human error is the reason. For instance, a system administrator accidentally pulls the wrong cable or enters an incorrect command. Users sometimes delete important (system) files.

Of course it helps to have highly qualified and trained personnel, with a healthy sense of responsibility. To err is human, however, and there is no MTBF to be calculated here.

Conclusion

As stated above, availability figures of a system are very hard to guarantee. MTBF and MTTR are either unknown, cannot be calculated, or are exaggerated.

Availability can only be reported on afterwards, when a system has run for some years. With that hindsight, new systems can be designed that will probably have a higher availability.

Of course, in recent years much knowledge has been gained on how to design highly available systems, for instance by using clustering, failover, redundancy, structured programming, avoiding Single Points of Failure (SPOFs) and implementing proper system management.
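Why redundancy helps and Single Points of Failure hurt can be illustrated with the standard series and parallel availability formulas. This is only a sketch: it assumes that components fail independently of each other, which is often not true in practice, and the 99% figures are hypothetical.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are needed: each one is a single point of failure."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components: the system is down only if all of them are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures: three components in a chain, each 99% available.
chain = series(0.99, 0.99, 0.99)
# The same chain when every component is duplicated.
redundant_chain = series(*(parallel(0.99, 0.99) for _ in range(3)))
print(f"chain without redundancy: {chain * 100:.2f}%")
print(f"chain with duplicated components: {redundant_chain * 100:.2f}%")
```

Under these assumptions the plain chain reaches about 97.03%, while the duplicated chain reaches about 99.97%.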

IT architects (or security architects for that matter) are responsible for giving availability the attention it deserves. Because the cost of unavailability can be very high, a good match between IT and business is crucial.


This entry was posted on Friday 27 October 2006

Disclaimer

The postings on this site are my opinions and do not necessarily represent CGI’s strategies, views or opinions.

 

Copyright Sjaak Laan