


RFC:  816



                      FAULT ISOLATION AND RECOVERY

                             David D. Clark
                  MIT Laboratory for Computer Science
               Computer Systems and Communications Group
                               July, 1982


     1.  Introduction


     Occasionally, a network or a gateway will go down, and the sequence

of  hops  which the packet takes from source to destination must change.

Fault isolation is that action which  hosts  and  gateways  collectively

take  to  determine  that  something  is  wrong;  fault  recovery is the

identification and selection of an alternative route which will serve to

reconnect the source to the destination.  In fact, the gateways  perform

most  of  the  functions  of  fault  isolation and recovery.  There are,

however, a few actions which hosts must take if they wish to  provide  a

reasonable  level  of  service.   This document describes the portion of

fault isolation and recovery which is the responsibility of the host.


     2.  What Gateways Do


     Gateways collectively implement an algorithm which  identifies  the

best  route  between  all pairs of networks.  They do this by exchanging

packets  which  contain  each  gateway's  latest   opinion   about   the

operational status of its neighbor networks and gateways.  Assuming that

this  algorithm is operating properly, one can expect the gateways to go

through a period of confusion immediately after some network or  gateway

has  failed,  but  one  can assume that once a period of negotiation has

passed, the gateways are equipped with a consistent and correct model of

the connectivity of the internet.  At present this period of negotiation

may actually take several minutes, and many TCP implementations time out

within that period, but it is a design goal of  the  eventual  algorithm

that  the  gateway  should  be  able to reconstruct the topology quickly

enough that a TCP connection should be able to survive a failure of  the

route.


     3.  Host Algorithm for Fault Recovery


     Since  the gateways always attempt to have a consistent and correct

model of the internetwork topology, the host strategy for fault recovery

is very simple.  Whenever the host feels that  something  is  wrong,  it

asks the gateway for advice, and, assuming the advice is forthcoming, it

believes  the  advice  completely.  The advice will be wrong only during

the transient  period  of  negotiation,  which  immediately  follows  an

outage, but will otherwise be reliably correct.


     In  fact,  it  is  never  necessary  for a host to explicitly ask a

gateway for advice, because the gateway will provide it as  appropriate.

When  a  host  sends  a datagram to some distant net, the host should be

prepared to receive back either  of  two  advisory  messages  which  the

gateway  may  send.    The  ICMP  "redirect"  message indicates that the

gateway to which the host sent the  datagram  is  no  longer  the  best

gateway  to  reach the net in question.  The gateway will have forwarded

the datagram, but the host should revise its routing  table  to  have  a

different  immediate  address  for  this  net.    The  ICMP "destination

unreachable"  message  indicates  that  as  a result of an outage, it is

currently impossible to reach the addressed net or host in  any  manner.

On  receipt  of  this  message, a host can either abandon the connection

immediately without any further retransmission, or resend slowly to  see

if the fault is corrected in reasonable time.
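
The host's reaction to these two advisory messages might be sketched as follows. This is only an illustration: the names RoutingTable, on_redirect and on_unreachable, and the net and gateway labels, are hypothetical and are not taken from any specification.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    """Hypothetical per-host table: destination net -> first-hop gateway."""
    next_hop: dict = field(default_factory=dict)    # net -> gateway address
    unreachable: set = field(default_factory=set)   # nets reported unreachable

    def on_redirect(self, net, better_gateway):
        # The gateway has already forwarded the datagram; the host only
        # needs to use the better first hop for later datagrams.
        self.next_hop[net] = better_gateway
        self.unreachable.discard(net)

    def on_unreachable(self, net):
        # Advice only: record it so that higher layers can choose to
        # abandon the connection or to resend slowly.
        self.unreachable.add(net)


table = RoutingTable({"net-10": "gw-A"})
table.on_redirect("net-10", "gw-B")    # gw-A says gw-B is the better route
table.on_unreachable("net-18")         # some gateway reports net-18 unreachable
```

Note that the sketch records "destination unreachable" rather than acting on it, since the choice between abandoning and resending slowly is left to the layers above.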


     If  a  host  could assume that these two ICMP messages would always

arrive when something was amiss in the network, then no other action  on

the  part  of the host would be required in order to maintain its tables in

an optimal condition.  Unfortunately, there are two circumstances  under

which  the  messages  will  not  arrive  properly.    First,  during the

transient following a failure, error messages may  arrive  that  do  not

correctly  represent  the  state of the world.  Thus, hosts must take an

isolated error message with some scepticism.  (This transient period  is

discussed  more  fully  below.)    Second,  if the host has been sending

datagrams to a particular gateway, and that gateway itself crashes, then

all the other gateways in the internet will  reconstruct  the  topology,

but  the  gateway  in  question will still be down, and therefore cannot

provide any advice back to the host.  As long as the host  continues  to

direct  datagrams at this dead gateway, the datagrams will simply vanish

off the face of the earth, and nothing will come back in return.   Hosts

must detect this failure.


     If some gateway many hops away fails, this is not of concern to the

host, for then the discovery of the failure is the responsibility of the

immediate  neighbor gateways, which will perform this action in a manner

invisible to the host.  The  problem  only  arises  if  the  very  first

gateway, the one to which the host is immediately sending the datagrams,

fails.   We thus identify one single task which the host must perform as

its part of fault isolation in the internet:  the  host  must  use  some

strategy  to  detect  that a gateway to which it is sending datagrams is

dead.


     Let us  assume  for  the  moment  that  the  host  implements  some

algorithm  to  detect  failed  gateways; we will return later to discuss

what this algorithm might be.  First, let  us  consider  what  the  host

should  do  when it has determined that a gateway is down. In fact, with

the exception of one small problem, the action the host should  take  is

extremely  simple.    The host should select some other gateway, and try

sending the datagram to it.  Assuming that  gateway  is  up,  this  will

either  produce  correct  results, or some ICMP advice.  Since we assume

that, ignoring temporary periods immediately following  an  outage,  any

gateway  is capable of giving correct advice, once the host has received

advice from any gateway, that host is in as good a condition as  it  can

hope to be.


     There is always the unpleasant possibility that when the host tries

a different gateway, that gateway too will be down.  Therefore, whatever

algorithm  the  host  uses to detect a dead gateway must continuously be

applied, as the host tries every gateway in turn that it knows about.
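
Trying every known gateway in turn might look like the sketch below. The function name pick_live_gateway and the is_dead callback are assumptions of this example; is_dead stands in for whichever detection algorithm the host actually uses.

```python
def pick_live_gateway(gateways, current, is_dead):
    """Return the next gateway after `current` not judged dead, else None.

    `gateways` is the host's list of neighbor gateways; `is_dead` is a
    hypothetical callable standing in for the host's detection algorithm.
    """
    start = gateways.index(current)
    # Walk the list once, starting just after the gateway believed dead.
    for offset in range(1, len(gateways)):
        candidate = gateways[(start + offset) % len(gateways)]
        if not is_dead(candidate):
            return candidate
    return None  # every other known gateway also appears to be down


dead = {"gw-A", "gw-B"}
chosen = pick_live_gateway(["gw-A", "gw-B", "gw-C"], "gw-A", lambda g: g in dead)
```

Here the detection test is applied again to each candidate, so a second dead gateway is skipped in the same way as the first.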


     The only difficult part of this algorithm is to specify  the  means

by which the host maintains the table of all of the gateways to which it

has  immediate  access.    Currently,  the specification of the internet

protocol does not define any message by which a host can  ask  to  be

supplied  with  such a table.  The reason is that different networks may

provide very different mechanisms by which this table can be filled  in.

For  example,  if  the  net is a broadcast net, such as an ethernet or a

ringnet, every gateway may simply broadcast such a table  from  time  to

time,  and  the  host  need do nothing but listen to obtain the required

information.  Alternatively, the network may provide  the  mechanism  of

logical  addressing,  by  which  a whole set of machines can be provided

with a single group  address,  to  which  a  request  can  be  sent  for

assistance.   Failing those two schemes, the host can build up its table

of neighbor gateways by remembering all the gateways from which  it  has

ever received a message.  Finally, in certain cases, it may be necessary

for  this  table,  or  at  least the initial entries in the table, to be

constructed manually by a manager or operator at the  site.    In  cases

where  the  network  in question provides absolutely no support for this

kind of host query, at least some manual intervention will  be  required

to  get  started,  so  that  the  host  can  find out about at least one

gateway.
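
A table seeded by hand and then extended by remembering every gateway heard from could be kept roughly as follows. The class GatewayTable and its methods are hypothetical names chosen for this sketch.

```python
class GatewayTable:
    """Hypothetical table of neighbor gateways, kept in first-seen order."""

    def __init__(self, manual_entries=()):
        # Entries configured by an operator so that the host can get started.
        self._gateways = list(dict.fromkeys(manual_entries))

    def learn(self, gateway):
        # Remember any gateway from which a message (a broadcast, a
        # redirect, and so on) has been received.
        if gateway not in self._gateways:
            self._gateways.append(gateway)

    def all(self):
        # Return a copy so that callers cannot disturb the table.
        return list(self._gateways)


gw_table = GatewayTable(manual_entries=["gw-A"])
gw_table.learn("gw-B")   # for instance, named in a redirect
gw_table.learn("gw-A")   # already known; no duplicate is added
```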


     4.  Host Algorithms for Fault Isolation


     We now return to the question raised above.  What  strategy  should

the  host use to detect that it is talking to a dead gateway, so that it

can know to switch to some other gateway in the list?  In fact, there are

several algorithms which can be used.   All  are  reasonably  simple  to

implement, but they have very different implications for the overhead on

the  host, the gateway, and the network.  Thus, to a certain extent, the

algorithm picked must depend on the details of the network  and  of  the

host.



1. NETWORK LEVEL DETECTION

Many networks, particularly the Arpanet, perform precisely the required function internal to the network. If a host sends a datagram to a dead gateway on the Arpanet, the network will return a "host dead" message, which is precisely the information the host needs to know in order to switch to another gateway. Some early implementations of Internet on the Arpanet threw these messages away. That is an exceedingly poor idea.

2. CONTINUOUS POLLING

The ICMP protocol provides an echo mechanism by which a host may solicit a response from a gateway. A host could simply send this message at a reasonable rate, to assure itself continuously that the gateway was still up. This works, but, since the message must be sent fairly often to detect a fault in a reasonable time, it can imply an unbearable overhead on the host itself, the network, and the gateway. This strategy is prohibited except where a specific analysis has indicated that the overhead is tolerable.

3. TRIGGERED POLLING

If the use of polling could be restricted to only those times when something seemed to be wrong, then the overhead would be bearable. Provided that one can get the proper advice from one's higher level protocols, it is possible to implement such a strategy. For example, one could program the TCP level so that whenever it retransmitted a segment more than once, it sent a hint down to the IP layer which triggered polling. This strategy does not have excessive overhead, but does have the problem that the host may be somewhat slow to respond to an error, since only after polling has started will the host be able to confirm that something has gone wrong, and by then the TCP above may have already timed out.

Both forms of polling suffer from a minor flaw. Hosts as well as gateways respond to ICMP echo messages. Thus, polling cannot be used to detect the error that a foreign address thought to be a gateway is actually a host. Such a confusion can arise if the physical addresses of machines are rearranged.
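
The triggered polling strategy might be sketched as below. The class TriggeredPoller, the send_echo callback (assumed to return True when the gateway answers an ICMP echo) and the threshold of two retransmissions are all assumptions of this example, not part of any specification.

```python
class TriggeredPoller:
    """Hypothetical IP-layer helper: poll a gateway only on a hint from TCP."""

    def __init__(self, send_echo, retransmit_threshold=2):
        self.send_echo = send_echo          # callable(gateway) -> bool (assumed)
        self.retransmit_threshold = retransmit_threshold
        self.retransmits = {}               # segment id -> retransmission count

    def on_retransmit(self, segment_id, gateway):
        """Called by TCP on each retransmission; True if the gateway looks dead."""
        count = self.retransmits.get(segment_id, 0) + 1
        self.retransmits[segment_id] = count
        if count < self.retransmit_threshold:
            return False                    # first retransmission: do not poll yet
        return not self.send_echo(gateway)  # poll only now; no reply means dead

    def on_ack(self, segment_id):
        # The segment got through, so forget its retransmission history.
        self.retransmits.pop(segment_id, None)


polled = []


def fake_echo(gateway):
    polled.append(gateway)
    return False                            # pretend the gateway never answers


poller = TriggeredPoller(fake_echo)
first = poller.on_retransmit("seg-1", "gw-A")    # no echo is sent
second = poller.on_retransmit("seg-1", "gw-A")   # echo sent, no reply
```

No echo is sent until the second retransmission, which keeps the overhead low at the cost of the slower reaction noted above.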

4. TRIGGERED RESELECTION

There is a strategy which makes use of a hint from a higher level, as did the previous strategy, but which avoids polling altogether. Whenever a higher level complains that the service seems to be defective, the Internet layer can pick the next gateway from the list of available gateways, and switch to it. Assuming that this gateway is up, no real harm can come of this decision, even if it was wrong, for the worst that will happen is a redirect message which instructs the host to return to the gateway originally being used. If, on the other hand, the original gateway was indeed down, then this immediately provides a new route, so the period of time until recovery is shortened. This last strategy seems particularly clever, and is probably the most generally suitable for those cases where the network itself does not provide fault isolation. (Regrettably, I have forgotten who suggested this idea to me. It is not my invention.)


     5.  Higher Level Fault Detection


The previous discussion has concentrated on fault detection and recovery at the IP layer. This section considers what the higher layers such as TCP should do.

TCP has a single fault recovery action; it repeatedly retransmits a segment until either it gets an acknowledgement or its connection timer expires. As discussed above, it may use retransmission as an event to trigger a request for fault recovery to the IP layer. In the other direction, information may flow up from IP, reporting such things as ICMP Destination Unreachable or error messages from the attached network. The only subtle question about TCP and faults is what TCP should do when such an error message arrives or its connection timer expires.

The TCP specification discusses the timer. In the description of the open call, the timeout is described as an optional value that the client of TCP may specify; if any segment remains unacknowledged for this period, TCP should abort the connection. The default for the timeout is 30 seconds.
Early TCPs were often implemented with a fixed timeout interval, but this did not work well in practice, as the following discussion may suggest.

Clients of TCP can be divided into two classes: those running on immediate behalf of a human, such as Telnet, and those supporting a program, such as a mail sender. Humans require a sophisticated response to errors. Depending on exactly what went wrong, they may want to abandon the connection at once, or wait for a long time to see if things get better. Programs do not have this human impatience, but also lack the power to make complex decisions based on details of the exact error condition. For them, a simple timeout is reasonable.

Based on these considerations, at least two modes of operation are needed in TCP. One, for programs, abandons the connection without exception if the TCP timer expires. The other mode, suitable for people, never abandons the connection on its own initiative, but reports to the layer above when the timer expires. Thus, the human user can see error messages coming from all the relevant layers, TCP and ICMP, and can request TCP to abort as appropriate. This second mode requires that TCP be able to send an asynchronous message up to its client to report the timeout, and it requires that error messages arriving at lower layers similarly flow up through TCP.

At levels above TCP, fault detection is also required. Either of the following can happen. First, the foreign client of TCP can fail, even though TCP is still running, so data is still acknowledged and the timer never expires. Alternatively, the communication path can fail, without the TCP timer going off, because the local client has no data to send. Both of these have caused trouble.

Sending mail provides an example of the first case. When sending mail using SMTP, there is an SMTP level acknowledgement that is returned when a piece of mail is successfully delivered.
Several early mail receiving programs would crash just at the point where they had received all of the mail text (so TCP did not detect a timeout due to outstanding unacknowledged data) but before the mail was acknowledged at the SMTP level. This failure would cause early mail senders to wait forever for the SMTP level acknowledgement. The obvious cure was to set a timer at the SMTP level, but the first attempt to do this did not work, for there was no simple way to select the timer interval. If the interval selected was short, it expired in normal operation when sending a large file to a slow host. An interval of many minutes was needed to prevent false timeouts, but that meant that failures were detected only very slowly. The current solution in several mailers is to pick a timeout interval proportional to the size of the message.

Server telnet provides an example of the other kind of failure. It can easily happen that the communications link can fail while there is no traffic flowing, perhaps because the user is thinking. Eventually, the user will attempt to type something, at which time he will discover that the connection is dead and abort it. But the host end of the connection, having nothing to send, will not discover anything wrong, and will remain waiting forever. In some systems there is no way for a user in a different process to destroy or take over such a hanging process, so there is no way to recover.

One solution to this would be to have the host server telnet query the user end now and then, to see if it is still up. (Telnet does not have an explicit query feature, but the host could negotiate some unimportant option, which should produce either agreement or disagreement in return.) The only problem with this is that a reasonable sample interval, if applied to every user on a large system, can generate an unacceptable amount of traffic and system overhead.
A smart server telnet would use this query only when something seems wrong, perhaps when there had been no user activity for some time.

In both these cases, the general conclusion is that client level error detection is needed, and that the details of the mechanism are very dependent on the application. Application programmers must be made aware of the problem of failures, and must understand that error detection at the TCP or lower level cannot solve the whole problem for them.


     6.  Knowing When to Give Up


It is not obvious, when error messages such as ICMP Destination Unreachable arrive, whether TCP should abandon the connection. The reason that error messages are difficult to interpret is that, as discussed above, after a failure of a gateway or network, there is a transient period during which the gateways may have incorrect information, so that irrelevant or incorrect error messages may sometimes return. An isolated ICMP Destination Unreachable may arrive at a host, for example, if a packet is sent during the period when the gateways are trying to find a new route. To abandon a TCP connection based on such a message arriving would be to ignore the valuable feature of the Internet that for many internal failures it reconstructs its function without any disruption of the end points.

But if failure messages do not imply a failure, what are they for? In fact, error messages serve several important purposes. First, if they arrive in response to opening a new connection, they probably are caused by opening the connection improperly (e.g., to a non-existent address) rather than by a transient network failure. Second, they provide valuable information, after the TCP timeout has occurred, as to the probable cause of the failure. Finally, certain messages, such as ICMP Parameter Problem, imply a possible implementation problem. In general, error messages give valuable information about what went wrong, but are not to be taken as absolutely reliable.
A general alerting mechanism, such as the TCP timeout discussed above, provides a good indication that whatever is wrong is a serious condition, but without the advisory messages to augment the timer, there is no way for the client to know how to respond to the error. The combination of the timer and the advice from the error messages provides a reasonable set of facts for the client layer to have. It is important that error messages from all layers be passed up to the client module in a useful and consistent way.
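
One way to combine the timer with the advisory messages is sketched below. The function tcp_reaction, its event names and its three outcomes are hypothetical. In particular, reporting an error that arrives while the connection is being opened is an assumption of this sketch, since the text above says only what such an error probably means.

```python
def tcp_reaction(event, *, connection_established, client_is_program):
    """Return 'abort', 'report' or 'note' for a TCP-level event.

    `event` is 'timeout' (the connection timer expired) or 'icmp_error'
    (for example, Destination Unreachable passed up from IP).
    """
    if event == "timeout":
        # A program gets a simple abort; a human is told and decides.
        return "abort" if client_is_program else "report"
    if event == "icmp_error":
        if not connection_established:
            # While opening, an error most likely means a bad address.
            return "report"
        # Mid-connection the error may be transient: keep it as advice
        # to explain a later timeout, but do not abandon the connection.
        return "note"
    raise ValueError(f"unknown event: {event}")


mail_sender = tcp_reaction("timeout", connection_established=True,
                           client_is_program=True)
telnet_user = tcp_reaction("timeout", connection_established=True,
                           client_is_program=False)
stray_error = tcp_reaction("icmp_error", connection_established=True,
                           client_is_program=False)
```

Under these assumptions an isolated error on an established connection never aborts it on its own; only the timer, or the human user, ends the connection.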