Designing a resilient, high availability network is a process that requires a comprehensive and structured approach to ensure long-term success.
September 1, 2005
Never before have the health and reliability of data networks been so crucial to the viability of businesses. The stakes will only rise as data traffic continues to grow exponentially and companies keep building more network-based mission-critical applications.
Virtually all enterprise applications and services are networked, and many essential business operations now run over the Internet as well as over internal corporate networks. E-commerce, Voice over IP (VoIP), business-to-business applications and automated teller machines (ATMs) are just a few examples of applications that are expected to run 24 hours a day, 365 days a year.
An interruption can trigger a loss of revenue, potential penalties, customer dissatisfaction, market share loss and more.
Any data communication network is composed of a collection of hardware, software and communication lines, each of which can eventually break down. The bottom line is that it is often only a matter of time until a hardware or software defect, compounded by the human error factor, affects network availability.
Complex state of affairs
In fact, according to one survey, major network outages are far more prevalent than we might expect.
A typical large IP network achieves between 99.95 and 99.99% availability, corresponding to about 50 to 260 minutes of outage per customer annually. A review of service level agreements (SLAs) found that even the largest IP networks provided only marginal guarantees of network availability, with most not counting an outage unless it was unscheduled and lasted over one hour.
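The relationship between an availability percentage and annual outage time is simple arithmetic. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative and not taken from any survey or vendor tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the expected minutes of outage per year for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# The two ends of the range quoted for a typical large IP network.
for pct in (99.95, 99.99):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} minutes per year")
```

Under these assumptions, 99.99% works out to roughly 53 minutes a year and 99.95% to roughly 263 minutes, in line with the range quoted above.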
The associated penalties were lenient, typically offering one day’s service credit for each hour of confirmed outage.
However, the cost of that network downtime to customers is significant, with averages ranging from US$100,000 per hour in the transportation, retail and e-commerce industries up to US$4.5 million in the brokerage industry.
Those costs will only rise as traffic and dependency on networks continue to mount. So how does an organization guard against network failure, and how can it achieve the elusive goal of High Availability (HA)?
Let’s look first at the principal causes of network outages and how they can be avoided or minimized.
Network outages can be caused by anything from scheduled changes and upgrades, which are simple and easy to control, to unanticipated environmental impacts and malicious attacks. An estimated 31% of network downtime is due to self-inflicted errors. Here are the main causes and basic ways to avoid or minimize them:
*Routine changes. Line card insertion or removal, software upgrades and similar routine changes should not necessitate taking all operations of a network device out of service. If they do, the device should have a built-in backup.
*Hardware failures. The impact of these failures is minimized through redundancy, either at the device or at the network level.
*Software bugs. Software architecture should minimize bugs and prevent any single bug from causing an outage, for example by containing it so that it cannot propagate.
*Hardware misconfiguration. Errors caused by misconfiguration should not be allowed to propagate through the network and affect other elements. Automatic configuration scripts that require minimal human intervention will reduce these occurrences.
*Human errors. Automation and simplification of the user interface help reduce the chances of human error. When an error does occur, the network should be intelligent enough to identify the error and resolve or compensate for it as quickly as possible, while alerting the proper operator.
*Network attacks. Even without a malfunction in the network itself, outages can be caused by attacks from malicious users or stray devices. The IP network should be able to identify the onset of an attack and defend against it.
*Facility outages. Network elements should be able to survive localized environmental problems. Sufficient backup power and temperature/humidity control should be provisioned.
*Physical link failures. A loss of transport resources, due to a fibre cut for example, is best addressed by redundancy and reduced convergence time.
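The effect of the redundancy recommended for hardware and link failures can be estimated with the standard parallel-availability formula. This is a rough sketch only: it assumes the redundant units fail independently and that failover is instantaneous, neither of which holds fully in a real network (shared power, a common software bug or slow convergence all reduce the benefit). The function name and sample figures are hypothetical.

```python
def redundant_availability(unit_availability: float, copies: int) -> float:
    """Availability of `copies` identical units in parallel.

    Assumes independent failures: service is lost only when every copy
    is down at the same time.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


# Hypothetical example: a single link that is up 99.9% of the time,
# compared with two such links on diverse paths.
single = redundant_availability(0.999, 1)
paired = redundant_availability(0.999, 2)
print(f"single link: {single:.6f}, redundant pair: {paired:.6f}")
```

The idealized result for the pair is an upper bound. The gap between it and what is observed in practice is largely a matter of convergence time and shared failure modes.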
Achieving High Availability requires going deeper than these basic measures, and it begins with fully understanding what HA means.
HA comprises four critical components: reliability, the ability to perform under stated conditions for a stated period of time; recoverability, the ability to easily bypass and recover from a component failure; serviceability, the ability to perform effective problem determination, diagnosis, and repair; and manageability, the ability to create and maintain an environment that limits the negative impact people may have on the system.
A network can only be HA if all these factors are fully realized.
Establishing relevant metrics in tune with the organization’s business objectives is an essential aspect of achieving HA. Availability is an overall metric, measured network-wide, end-to-end. It also must be measured from the customer’s point of view: it is the service, not the network, that must work continuously.
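One way to see why availability has to be measured end-to-end is to multiply the availability of every element a service depends on. The sketch below assumes the elements sit in series and fail independently; the element names and figures are hypothetical, chosen only for illustration.

```python
from math import prod


def end_to_end_availability(component_availabilities: list[float]) -> float:
    """Availability of a service that needs every listed component to be up.

    Assumes the components are in series and fail independently.
    """
    if not component_availabilities:
        raise ValueError("at least one component is required")
    if any(not 0.0 <= a <= 1.0 for a in component_availabilities):
        raise ValueError("each availability must be between 0 and 1")
    return prod(component_availabilities)


# Hypothetical service path: access link, core network, server farm.
path = {"access link": 0.9995, "core network": 0.9999, "server farm": 0.9995}
service = end_to_end_availability(list(path.values()))
print(f"end-to-end availability: {service:.4%}")
```

The service figure is always lower than that of the weakest element on the path, which is why a network that meets its own target can still fall short from the customer’s point of view.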
Designing a resilient, high availability network is a process that requires a comprehensive and structured approach to ensure long-term success. Cisco has defined a lifecycle-based services approach that is based on the technology lifecycle of Prepare, Plan, Design, Implement, Operate and Optimize (PPDIOO), with each of these phases contributing to the success of the solution and to achieving HA.
Preparation: Business requirements and growth plans are analyzed to formulate the resiliency and scalability network requirements.
Planning: Key information impacting the detailed design is collected and analyzed.
Design: Final network architecture details of the resilient solution are considered, tested and integrated into the design.
Implementation: A production environment requires an experience-based migration plan that enables uninterrupted additions and upgrades.
Operate and optimize: Changing business requirements necessitate incremental improvement and optimization that also mitigate the risk of data or application access loss.
Lifecycle Services can minimize many of the causes of downtime through activities such as design change, software version assistance, deployment preparation, performance audit and optimization, and knowledge transfer.
Achieving the promise
Is the vision and goal of achieving true high availability possible? With a comprehensive planning model, a committed service provider and the right monitoring tools, HA is certainly within any organization’s grasp.
In getting on the road to achieving HA, the Yankee Group recommends that companies adopt a network lifecycle process as soon as possible, strive for five nines (99.999% availability, or roughly 5.26 minutes of downtime annually), understand their strengths and weaknesses, and calculate the costs of downtime, repair and support.
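A first-pass calculation of the cost of downtime can be as simple as multiplying expected outage hours by an hourly loss figure. The sketch below is illustrative only: it assumes a flat cost per hour and a 365-day year, and it uses the US$100,000 per hour figure cited earlier for transportation, retail and e-commerce as a sample input. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Estimate the yearly cost of outages at a flat hourly loss rate."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    if cost_per_hour < 0:
        raise ValueError("cost_per_hour must not be negative")
    downtime_hours = (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour


# Compare a typical large IP network with the five-nines target.
for pct in (99.95, 99.999):
    cost = annual_downtime_cost(pct, 100_000)
    print(f"{pct}% availability: about US${cost:,.0f} per year")
```

Under these assumptions, moving from 99.95% to five nines cuts the estimated annual loss from about US$438,000 to under US$9,000. That difference is the amount to weigh against the cost of repair and support.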
The importance of basing these activities on proper metrics that support a company’s business objectives cannot be overstressed.
Successfully maintaining HA means managing risk. The following are areas to consider in reducing the outage risk as much as possible:
Reduce outage frequency – Look for ways to prevent outages from happening to a critical component, thereby increasing its reliability.
Minimize outage duration – If outages cannot be entirely avoided, find ways to recover from them immediately, thereby improving recoverability. If recovery is impossible, ensure the component can be immediately repaired. In other words, improve serviceability.
Minimize outage scope – Minimize the parts of a system impacted by an outage.
Prevent future outages – Identify sources of degradation and compensate.
Implement strong security – Defend against attacks, malicious or accidental.
Minimize scheduled downtime – Allow maintenance and upgrades with service continuity.
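The first two items, outage frequency and outage duration, correspond to the two terms in the standard steady-state availability formula, MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The following is a minimal sketch with hypothetical figures, not measurements from any network.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 10,000 hours, takes 4 hours to restore.
baseline = steady_state_availability(10_000, 4)
# Reduce outage frequency: failures are half as common.
fewer_outages = steady_state_availability(20_000, 4)
# Minimize outage duration: recovery takes 1 hour instead of 4.
shorter_outages = steady_state_availability(10_000, 1)

print(f"baseline:        {baseline:.4%}")
print(f"fewer outages:   {fewer_outages:.4%}")
print(f"shorter outages: {shorter_outages:.4%}")
```

Both levers raise availability. Which one is cheaper to pull depends on the component, which is one reason to track frequency and duration as separate metrics.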
In successfully maintaining HA, an organization needs to look beyond service level agreements (SLAs), which stipulate contractual objectives including performance, capacity, failure or downtime, and recovery metrics.
Focusing on penalties and remedies can be counterproductive and promote a protectionist mentality, reducing collaboration, innovation and synergy.
Rather, organizations should focus on defining the correct metrics, as improperly defined metrics can drive the wrong behaviour and actually work against achieving HA.
A more encompassing lifecycle network management approach — such as Cisco’s planning, design, implementation, operation, and optimization (PDIOO) methodology — based on an organization’s unique business strategies and goals is essential.
Availability is not purchased, it is built. No matter how advanced the equipment, high availability also requires sound design and solid operational practices.
Gabriel Soreanu is the Customer Advocacy Support Manager for Cisco Systems Canada.