Human factors in availability of systems

Usually only 20% of failures are caused by technology. In the other 80% of cases, human error is the reason. For instance, a system administrator accidentally pulls the wrong cable or enters an incorrect command. Users sometimes delete important (system) files.

Of course it helps to have highly qualified and trained personnel with a healthy sense of responsibility. To err is human, however, and there is no cure for it. End users can also cause downtime by misusing the system. When a user starts generating 10 very large reports at the same time, for instance, performance can suffer to such a degree that the system is effectively unavailable to other users.
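
One way to soften the effect of the report example above is to cap how many reports a single user may run at once. The sketch below is only an illustration: the class name ReportLimiter, its methods, and the limit of two concurrent reports are assumptions, not part of any particular product.

```python
import threading


class ReportLimiter:
    """Caps how many reports one user may generate at the same time.

    Hypothetical helper for illustration; the default limit is an assumption.
    """

    def __init__(self, max_per_user=2):
        self.max_per_user = max_per_user
        self._running = {}              # user -> number of reports in progress
        self._lock = threading.Lock()   # protects _running across threads

    def try_start(self, user):
        """Return True and reserve a slot, or False if the user is at the limit."""
        with self._lock:
            running = self._running.get(user, 0)
            if running >= self.max_per_user:
                return False
            self._running[user] = running + 1
            return True

    def finish(self, user):
        """Release a slot when a report has completed."""
        with self._lock:
            running = self._running.get(user, 0)
            if running > 0:
                self._running[user] = running - 1
```

With such a limit, the user who requests 10 reports gets two started and the other eight refused (or queued by the caller), and other users keep a responsive system.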

Also, when a user forgets a password (and perhaps enters an incorrect password more than five times), the account is locked and the system is unavailable to that user. If that person has a very responsible job, for instance approving steps in a business process, being locked out could mean that the business process is unavailable to other users as well.
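
The impact of such a lockout can be bounded by letting the lock expire on its own. The following is a minimal sketch, assuming a lock after more than five consecutive failures and a 15 minute lockout period; the class LoginGuard and both numbers are hypothetical choices, not a recommendation for a specific system.

```python
import time


class LoginGuard:
    """Locks an account after too many failed logins, for a limited time.

    Hypothetical helper; thresholds are assumptions for illustration.
    """

    def __init__(self, max_failures=5, lockout_seconds=900, clock=time.monotonic):
        self.max_failures = max_failures
        self.lockout_seconds = lockout_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._failures = {}          # user -> consecutive failed attempts
        self._locked_until = {}      # user -> time at which the lock expires

    def is_locked(self, user):
        until = self._locked_until.get(user)
        if until is None:
            return False
        if self._clock() >= until:
            # The lock has expired: remove it and start counting again.
            del self._locked_until[user]
            self._failures[user] = 0
            return False
        return True

    def record_failure(self, user):
        if self.is_locked(user):
            return
        count = self._failures.get(user, 0) + 1
        self._failures[user] = count
        # Lock only when the user has failed MORE than max_failures times.
        if count > self.max_failures:
            self._locked_until[user] = self._clock() + self.lockout_seconds

    def record_success(self, user):
        self._failures[user] = 0
```

A time-limited lock means the approver in the example is back at work after the lockout period, even if nobody is available to reset the account by hand.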

Most unavailability issues, however, are the result of actions by system managers. Some typical actions (or the lack thereof) are:

  • Performing a test in the production environment (hopefully by accident - testing in production is of course not recommended at all)
  • Switching off a wrong component (not the defective server that needs repair, but the one still operating)
  • Swapping a good working disk in a RAID set instead of the defective one
  • Restoring the wrong backup tape to production
  • Accidentally removing files (mail folders, configuration files, database files)
  • Incorrect changes to configuration files (for instance the routing table of a network router, or a change in the Windows registry)
  • Tripping over a cable, breaking or disconnecting it
  • Incorrect labeling of cables, later leading to errors when a change is performed
  • Stopping an incorrect virtual machine (the one in production instead of the one in the test environment)
  • Making a typo in a system command (in UNIX: sudo rm -rf / *.back instead of sudo rm -rf /*.back where one space too many leads to a complete erasure of a hard disk - did you notice the difference?)
  • Insufficient testing, for instance the fall-back procedure to move operations from the primary datacenter to the secondary was never tested, and failed when it was most needed
  • A system manager or architect made a mistake in the design of the infrastructure, leading to downtime (we thought the Windows cluster was designed in a good way, but when one of the cluster nodes failed, we found that the complete cluster went down)

Many of these mistakes can be avoided by using proper system management procedures, such as having a standard template for creating new servers, using formal deployment strategies with the appropriate tools, and using administrative accounts only when absolutely needed.

In some UNIX environments, a user who starts working with administrative (root) privileges, for instance the first time sudo is used, automatically gets the following message:

We assume you have received the usual lecture from the local System Administrator. 
It usually boils down to these three things: 
#1) Respect the privacy of others. 
#2) Think before you type. 
#3) With great power comes great responsibility. 

I think this message makes people aware, leading to fewer mistakes.


This entry was posted on Tuesday 28 June 2011



Copyright Sjaak Laan