Network World
Thursday, January 8, 2009

Jeff Doyle on IP Routing

Cisco Subnet

High Availability Does Not Always Mean High Cost

“High availability” has been a technical and marketing buzzword for a number of years, and lately infrastructure equipment vendors have made “HA” a feature set. In that regard HA has come to mean a combination of hardware and software that reduces device downtime. In this age of “five nines” reliability and stringent Service Level Agreements, pretty much any downtime is unacceptable: If a device is out of service for more than about 315 seconds in a year, it is below the 99.999% threshold.
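The 315-second figure follows directly from the arithmetic. A minimal sketch, assuming a 365-day year (leap years ignored); the function name is only illustrative:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds; leap years ignored


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the yearly downtime allowed at a given availability percentage."""
    return SECONDS_PER_YEAR * (100.0 - availability_percent) / 100.0


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_seconds(target)
    print(f"{target}% availability allows about {budget:.0f} seconds of downtime per year")
```

At 99.999% the budget works out to roughly 315 seconds, or a little over five minutes per year.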

The biggest hardware vulnerability is the power supply. Heat makes this the single most failure-prone component of any router or switch. The cooling fans are a close second, because they have moving parts. Therefore you should expect any mid-range or higher-end router or switch to have redundant power supplies and fans.

These are reasonably simple. Redundant power supplies are both on (and hopefully connected to separate electrical circuits!) and supplying power to the system, so if one fails the other continues powering the system.  Fan redundancy is usually just a matter of putting enough fans in the system so that if one fails there are enough remaining to provide sufficient cooling. So the cost redundant power supplies and fans add to a device is mostly just the cost of the components themselves.

As you approach high-end equipment, you start finding redundant control and forwarding planes. These components are far more expensive than power supplies and fans, and so making them redundant will make the cost of a networking device soar.

Let’s look at a control plane: The Route Processor (RP) on a typical Cisco router or Routing Engine (RE) on a typical Juniper router. At its most basic implementation, redundant control planes means that one RP or RE runs in active mode while the other is in standby. If the active one fails, operational intervention is required to switch over to the backup. Downtime is reduced because you do not have to wait for an on-site technician to replace the failed component. It’s certainly not ideal, though, because the system is still down while someone in operations detects the failure and performs the switchover.

An automatic switchover on failure sounds like it can significantly reduce recovery time, but it can also open a can of worms: How do you define a failure? A cold piece of circuitry with no electrons moving through it certainly meets the criteria. (As a Munchkin would say, it’s not only merely dead, it’s really quite sincerely dead.) But what about a control plane that is still performing almost all its duties but is, say, increasing its OSPF sequence number by a large value every time it increments, bringing it quickly to its max value and the consequent need for OSPF to reset itself (bringing all its adjacencies down in the process)? Is this a control plane failure meriting a switchover to the backup, or just a protocol failure that, while service impacting, is not as disruptive as a full control plane switchover? And what about a software bug that causes a control plane failure? If the bug is in one processor, it is probably in the other also. Do you allow them to continually fail and switch to the other, only to fail and switch back to the first, to fail again, endlessly until someone intervenes? You might as well not even have a backup control plane in such a situation. How do you set rules around when a switchover is helpful and when it is not? How do you determine the thresholds of failure? How do you ensure that a system does not go into a perpetual flip-flop between control planes?
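One common way to keep a system from flip-flopping is to cap the number of automatic switchovers allowed inside a time window and hand control to a human after that. The sketch below is only an illustration of that idea, not any vendor's implementation; the class name, the limit of two switchovers, and the ten-minute window are all assumptions.

```python
from collections import deque


class SwitchoverGuard:
    """Hypothetical policy: allow at most `max_switchovers` per sliding window."""

    def __init__(self, max_switchovers: int = 2, window_seconds: float = 600.0):
        self.max_switchovers = max_switchovers
        self.window_seconds = window_seconds
        self._history = deque()  # timestamps of switchovers that were allowed

    def allow(self, now: float) -> bool:
        """Return True if an automatic switchover at time `now` is permitted."""
        # Forget switchovers that have aged out of the window.
        while self._history and now - self._history[0] >= self.window_seconds:
            self._history.popleft()
        if len(self._history) >= self.max_switchovers:
            return False  # too many recent switchovers: stop and call a human
        self._history.append(now)
        return True


guard = SwitchoverGuard()
for t in (0.0, 100.0, 200.0):
    verdict = "switch over" if guard.allow(t) else "hold and alert operations"
    print(f"t={t:.0f}s: {verdict}")
```

The guard answers only the flip-flop question. Deciding what counts as a failure in the first place is a separate, harder policy problem.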

Perhaps even more important, how do you design a failure detection mechanism with sufficient reliability that it does not mistakenly declare a failure and switch over to the backup control plane when the primary was working just fine? You’d better make your choices well, because 315 seconds of downtime per year can get used up very quickly.
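A simple guard against false positives is to require several consecutive missed keepalives before declaring the active control plane dead. This is a minimal sketch under that assumption; the class name and the threshold of three are hypothetical, and real platforms use far richer health checks.

```python
class HeartbeatMonitor:
    """Hypothetical detector: declare failure only after N consecutive misses."""

    def __init__(self, miss_threshold: int = 3):
        self.miss_threshold = miss_threshold
        self.consecutive_misses = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one polling interval; return True if failure is declared."""
        if heartbeat_received:
            self.consecutive_misses = 0  # any heartbeat resets the count
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.miss_threshold


monitor = HeartbeatMonitor()
# Two isolated misses, a recovery, then three misses in a row.
samples = [False, False, True, False, False, False]
print([monitor.record(s) for s in samples])
```

The trade-off is visible in the threshold: a higher value means fewer spurious switchovers but a slower reaction to a real failure, and both draw on the same 315-second budget.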

Another focus of using redundant control planes to increase system availability is the reduction of intentional downtime. Operating system software must be upgraded now and then to increase security, get rid of a bug, add a new feature, or simply keep versions current. Traditionally, a software upgrade meant loading the new image and then restarting the system. Right there, even if everything goes well, you probably use up your five minutes of allowable yearly downtime.

This situation is the driver behind in-service software upgrades (ISSU), the capability to upgrade software without taking the system out of service. The key to making ISSU work is to have redundant control planes, both of which are separate physical entities from the forwarding plane. Rather than sitting in a simple standby mode, the backup control plane “pays attention” to what the active control plane is doing. Copies of the various databases and states used by the active control plane are kept on the standby. To perform a software upgrade you first switch over to the standby control plane. Because it has been tracking databases and states, it should be able to take control of the system much faster than if it had to come up from a passive mode. Then you perform the upgrade on the previously active control plane and restart it. When that component is back up and stable, and has again synchronized its databases and states, you switch back to that control plane. You can then upgrade and restart the backup.
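The order of operations is easier to see as a small simulation. This is a toy model of the sequence just described, not a real router API; every name in it (ControlPlane, switch_over, upgrade, issu) and the version strings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControlPlane:
    name: str
    version: str
    active: bool
    synced: bool = True  # has a current copy of databases and states


def switch_over(current: ControlPlane, standby: ControlPlane, log: list) -> None:
    """Make the standby active; refuse if its state copy is stale."""
    if not standby.synced:
        raise RuntimeError(f"{standby.name} is not synchronized")
    current.active, standby.active = False, True
    log.append(f"switchover: {standby.name} is now active")


def upgrade(cp: ControlPlane, new_version: str, log: list) -> None:
    """Upgrade and restart a control plane that is NOT carrying the system."""
    if cp.active:
        raise RuntimeError(f"refusing to restart active control plane {cp.name}")
    cp.version, cp.synced = new_version, False  # the restart discards its state
    log.append(f"upgrade: {cp.name} restarted on {new_version}")
    cp.synced = True  # modelled as instant; a real system must wait for this
    log.append(f"sync: {cp.name} resynchronized with the active control plane")


def issu(primary: ControlPlane, backup: ControlPlane, new_version: str) -> list:
    log = []
    switch_over(primary, backup, log)   # 1. move control to the standby
    upgrade(primary, new_version, log)  # 2. upgrade the previously active unit
    switch_over(backup, primary, log)   # 3. switch back once it is stable
    upgrade(backup, new_version, log)   # 4. upgrade the backup
    return log


rp0 = ControlPlane("RP0", "v1", active=True)
rp1 = ControlPlane("RP1", "v1", active=False)
for step in issu(rp0, rp1, "v2"):
    print(step)
```

The two guards carry the important constraints: never restart the control plane that is currently running the system, and never switch to a standby whose state is stale.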

While this basic version of ISSU can reduce the amount of downtime needed for a software upgrade, it doesn’t eliminate it entirely. And in fact the switchover can still be severely disruptive to the network. For example, even though the backup control plane has copies of interface states, neighbor states, routing tables, and so on, when the switchover first happens routing adjacencies are broken. When the standby becomes active it must follow protocol procedures to bring the adjacencies back up. While that is happening, routing protocol neighbors will detect that the node is down and tell their own neighbors, causing topological changes throughout the network. Then when the new control plane has established its adjacencies, the neighbors again tell their neighbors and there is a second topology recalculation in the network.

This problem can be solved too, by implementing software that homes routing adjacencies to the active control plane and at the same time keeps the standby aware of the adjacencies. When a switchover occurs, the standby can immediately take over the existing adjacencies without the neighbors being aware that anything changed.

I’ve posted before on mechanisms designed to prevent these kinds of routing protocol disruptions.

But while this “non-stop routing” capability is easy to describe, it is quite complex to implement. Router vendors developing such solutions can sink as much funding into an ISSU/NSR project as they might spend on development of a new hardware platform. Those costs are of course passed on to you.

And that’s the point (at long last) of this post. Vendors invest millions in the development of these kinds of complex HA solutions because their customers demand them. Yet those same customers are often negligent in implementing the simplest, cheapest procedural rules for preventing network outages. I’m amazed at how often network operators are willing to accept tens of thousands of dollars in added cost to get HA features in their routers but do not have, or do not enforce, configuration standards. Or clear change management procedures. Or even multiple layers of configuration permissions.

And modeling network changes before performing them on the production network? That’s the exception rather than the rule.

Don’t get me wrong, I think redundancy and features like ISSU/NSR are essential to any network that must meet stringent SLAs. It’s just that the most prevalent source of outages – simple human error – gets the least attention and is the easiest and cheapest problem to remedy.

 


Dear Jeff,
While reading your post, I was thinking about new solutions and protocols. But at the end I got your point, and I really believe that service providers must pay more attention to downtime caused by human error and poor change management than to new protocols and solutions. Of course new solutions are a must, but they can never have higher priority than the other points you mentioned at the end of your post:

modeling network changes before performing them on the production network?

Today I spent eight hours of my weekend performing a fallback for one of our Metro Ethernet customers, due to a change that was neither well planned nor well implemented by the change management department, from the L3 config to the L1 circuit provider.

Thanks for the nice post. Sometimes your posts touch on my real-life working and thinking, exactly like the Itch to Teach post.

Thanks & B.Regards
Ahmed Elhoussiny, CCIE # 21988


Hi Ahmed,

I like your example of the fallback you had to perform for a Metro-E customer. It's an excellent example of what I was trying to emphasize. From your story, your employer had to pay you for 8 hours of work, which is compounded by losing the time you could have spent on other efforts, plus the probable cost of aggravation to the customer (which is never cheap). And in addition, at some point the change has to be performed all over again, with its attendant costs.

All of which could have been prevented by good (and much less expensive) planning and change management up front.  

--Jeff 

Reasons for using redundancy


I think that there are three big reasons why companies tend to use hardware redundancy more than they use configuration standards and change management.

One is that it is easy to prove that you have done the right thing by adding hardware redundancy. For example everybody knows that processors can crash, and you can simulate this by pulling out a controller card and watching the router keep on routing. It's a lot harder to be confident that your procedures are such that they improve uptime, and it's a lot harder to perform any meaningful tests to demonstrate this.

Another reason is that psychologically it is a lot easier to think of a human error as a one-off event which is unlikely to happen again (however untrue this is) than it is to think that a hardware or software problem will not be repeated.

In addition to these, when you are showing your racks of shiny Cisco kit to a customer (or potential customer) they will be a lot more reassured by the fact that you have got two of everything than they would be if they were told that you had good and reliable change management.

Re: Reasons for Using Redundancy


Hi Daniel,

Great comment!

And by the way, I highly recommend that everyone read Jimmy Ray Purser's post on MTBF (www.networkworld.com/community/node/35386) which is related to this topic. 

I was at a customer site just the other day and they had, as many SPs and vendors have, a display area in which several racks of equipment were on display behind a glass wall: All pretty twinkling lights, spotless gear (no missing faceplates or used coffee cups in sight!), and perfectly groomed cabling. In other words, racks of equipment that actual humans hardly ever touch.

You're absolutely right that being able to show a customer that you have two of everything is psychologically more reassuring than talking about your change management procedures (say "change management" in a crowded room and watch a significant number of eyelids start to droop).

There's also a potential trap in talking to customers about any policies designed to prevent human error: If not done carefully, it can raise in the customer's mind questions about whether your operations personnel are competent. A bit like banks not wanting to talk about how much money is misplaced due to clerical errors, service providers tend to not like talking about outages due to operational errors.

The challenge for service providers (including IT groups that are in effect service providers for individual organizations) is how to translate safe operational practices into something sexy. SLAs backed up by financial guarantees can get people's attention, but not many providers are confident enough to go very far down that road.

Simple numbers can make people sit up and listen, though. Being able to provide impressive statistics on service availability reflects HA features in the network, good network design, and proven operational best practices.

--Jeff 


About Jeff Doyle

Jeff Doyle is president of Jeff Doyle and Associates, an IP network consultancy. Jeff is the author of Routing TCP/IP, Volumes I and II, and of OSPF and IS-IS: Choosing an IGP for Large-Scale Networks. He is a frequent speaker on IPv6, MPLS, and large-scale routing.


The opinions expressed in this Weblog are those of the writer and may not represent the opinions of Network World.
