Server Reliability
In the world before 9/11, there was a glut of host server facilities ("Internet hotels"). Many operators sold their facilities at bargain-basement prices as they exited the business (voluntarily or through bankruptcy), while others mothballed installations. Developers who had converted buildings for such uses reverted them to their original uses, often warehousing.
Now, five years later, the servers are back, a little smarter but just as hungry - for power. The Internet is bigger than ever, and many data-intensive users are expanding their operations after many years of stagnation.
There is also a greater understanding of the reliability requirements for power as well as capacity. This Bulletin focuses on the reliability aspect.
Several terms have become common reference parameters for defining the level of reliability of data-intensive sites, including:
  • The "Nines (9's)" - Simply, this represents the online availability of a load expressed as a percentage. Typically 4 (99.99%), 5 (99.999%) or 6 (99.9999%) nines are the benchmark parameters. Put the other way, the tolerated annual downtime (out of 8,760 hours per year) is about 53 minutes, 5.3 minutes or 0.53 minutes (roughly 32 seconds), respectively.
  • The "Tiers" - This term parallels the 9's but applies to the entire installation, whereas the 9's are more often applied to systems within a facility. The tiers range from 1 to 4, with Tier 1 being a normal installation with no alternative paths or enhancements and Tier 4 providing six 9's of reliability.
  • The "N's" - This term refers to the configuration of systems and the degree of redundancy planned. N+1 defines a system in which one component is installed beyond the number required to support the system load. (In the past this was called a standby component.) 2N means there is full duplication of a system or installation, so that upon failure of one system, the second can completely take over. Testing your algebra: 2(N+1) means there are two complete systems, each with an extra (standby) component.
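The arithmetic behind the 9's and the N's can be sketched in a few lines of Python. This is only an illustration: the function names and the example load of four components are assumptions chosen for the sketch, not part of any standard.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, the figure used in the text


def annual_downtime_minutes(nines: int) -> float:
    """Tolerated downtime per year, in minutes, for an availability of `nines` 9's."""
    unavailability = 10 ** (-nines)  # e.g. 4 nines -> 0.0001 (that is, 0.01%)
    return HOURS_PER_YEAR * 60 * unavailability


def installed_components(n: int, scheme: str) -> int:
    """Components installed when `n` are needed to carry the load (hypothetical helper)."""
    counts = {
        "N": n,                  # no redundancy
        "N+1": n + 1,            # one standby component
        "2N": 2 * n,             # full duplication
        "2(N+1)": 2 * (n + 1),   # two complete systems, each with a standby
    }
    return counts[scheme]


for nines in (4, 5, 6):
    print(f"{nines} nines: {annual_downtime_minutes(nines):.2f} minutes per year")

# Assumed example: a load that needs 4 components to run.
for scheme in ("N", "N+1", "2N", "2(N+1)"):
    print(f"{scheme}: {installed_components(4, scheme)} components installed")
```

For the assumed four-component load, 2(N+1) works out to ten installed components, which gives a feel for why the higher configurations drive up construction, space and energy costs.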
With newer servers, such as blade servers, demanding more and more power (from 8 kW to as high as 20 kW per rack), achieving high reliability in a large installation becomes expensive and complex. Costs include the obvious initial construction costs, the space used to house the equipment, the energy used by part-loaded equipment, and maintenance, which needs to include periodic system tests to ensure continued reliability.
We often frame reliability in terms of power reliability, but for a data environment to maintain high reliability the following issues must also be addressed: HVAC systems must operate over a wide range of conditions, inside and out; fire protection systems must respond quickly to isolate events to the smallest possible area; the building envelope must be secure against breach (natural and human); and the potential for the "human factor" to affect operational integrity must be minimized.
The decision on the reliability level for a project is critical. Unanticipated downtime can cost millions, and usually the damage is done in the first moments of an event, with continuing duration having a lesser impact.
There are not enough 9's to guarantee operation with zero failure potential. Statistical analysis under the 9's approach can validate equipment reliability. Careful supervision of construction ensures initial operational integrity. Comprehensive commissioning at startup and stringent periodic testing keep reliability high, while well-trained operating personnel help to avoid human-factor events. But there always remains the possibility that something, someone, or a combination of conditions can result in an outage.
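A rough sketch of the statistics shows why no number of 9's reaches zero failure potential. It uses the textbook series and parallel availability formulas and assumes that failures are independent, which is a strong assumption: common-mode events and the human factor violate it. The function names and the 99.9% and 99.99% figures are hypothetical values chosen for illustration.

```python
def parallel_availability(a: float, paths: int) -> float:
    """Availability of `paths` fully redundant paths, each with availability `a`.

    Assumes independent failures: the load is lost only if every path is down.
    """
    return 1.0 - (1.0 - a) ** paths


def series_availability(*parts: float) -> float:
    """Availability of components in series: all must be up for the load to be up."""
    result = 1.0
    for p in parts:
        result *= p
    return result


single_path = 0.999  # hypothetical three-9's path
duplicated = parallel_availability(single_path, 2)  # a 2N-style arrangement
print(f"one path:  {single_path:.6f}")
print(f"two paths: {duplicated:.6f}")

# Two hypothetical 99.99% systems that the load depends on in series
# (for example power and cooling) are together less available than either alone.
print(f"series:    {series_availability(0.9999, 0.9999):.8f}")
```

Under these assumptions, duplicating a three-9's path yields roughly six 9's, yet the result is still strictly below 1, and any dependence between the two paths makes the real figure lower.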
Understanding a firm's operation, its ability to sustain an outage, the long-term funding needs of a high-reliability system, the quality and training of operating personnel, and the initial funds available to construct a facility are all important components in setting the reliability standard for a project. Tier IV, six 9's and 2(N+1) are laudable goals, but they may not be suitable for many firms that are not prepared for the long-term commitments involved in sustaining such performance levels, especially when guarantees cannot be given.