Contents:

ABSTRACT

INTRODUCTION

Link State Routing

Hierarchical Names

Hierarchical Routing

Goals

MAIN IDEAS

Restricting Topology Update Propagation

Pruning the Routing Tree

Pseudolinks

ALGORITHM

Routing

Calculating the Routing Tables

Maintaining the Topology Database

Additional Features

EXPERIENCE

DISCUSSION

RELATED WORK

FURTHER WORK

CONCLUSION

REFERENCES


HIERARCHICAL NETWORK ROUTING

P. Lauder
R. J. Kummerfeld
A. Fekete

Software Systems Research Group
Department of Computer Science
University of Sydney

Authors' addresses and acknowledgement...

ABSTRACT

A disadvantage of Link State routing schemes is that exact shortest path calculations require a complete topology, which can overload the capacity of small nodes in a large network. Area routing schemes (when destination names are structured corresponding to the network topology) allow nodes to reduce the size of routing tables, by recording only one entry for an entire region rather than one for each node in the region. We describe a general hierarchical routing scheme that allows all nodes to participate in a distributed routing network, using close to optimal paths, with short routing tables, and a reduction of topology information for minor nodes.

INTRODUCTION

The fundamental problem of data communication is to transmit a datagram from one node (machine or sub-network), called the origin, to another called the destination. This can be done relatively easily if a direct connection exists from one node to the other. However, the large number of nodes in the world makes it impractical to organize them in a completely connected way (when each node would require a direct link to every other). Instead each node is connected to only a few others, called its neighbors, and a datagram may need to be passed from the origin to another node (a neighbor of the origin) and then to another (a neighbor of the first neighbor), and so on through many intermediate nodes, each hop being from one node to a neighbor, before finally reaching the destination. The problem of routing is to design an algorithm which will enable any node to choose (based on the destination of the datagram) to which neighbor to pass the datagram, in such a way that the different nodes' choices combine to bring the datagram eventually to the destination, and to minimize the total cost of the delivery (expressed as the sum of the costs incurred on each hop). This paper describes a method of routing that has been implemented by the first author, and is currently in use in the Australian Computer Science Network (ACSnet). This network involves some 1100 nodes (some of which run earlier versions of the software). Experience over the past two years indicates that the algorithm is reliable and efficient.

Link State Routing

The algorithm of this paper belongs to the family of Link State Routing algorithms, the most famous of which is the `new' Arpanet routing algorithm [McQuillan 80]. The details of these vary, but the key ideas are given in the following algorithm: Each node maintains a topology database which is a representation of the state of the network. Usually the database is considered as a graph formed by the nodes, with each direct connection as an edge having a weighting that reflects the cost of using that link in a hop. Based on this graph, the node calculates the cheapest route through the entire graph from itself to each destination (this is the `shortest paths problem' from graph theory), and the node constructs a routing table showing, for each possible destination, which neighbor is the first point in the cheapest route to that machine. When any datagram is to be sent (either originating at the node in question, or arriving there from a neighbor), the datagram is passed on to the neighbor given in the routing table as appropriate for the destination. When the network changes (either in topology, as when a new link is added or an existing connection is broken or a new node joins the network, or else in weights, as when the cost of using a link changes), then the information about the change is broadcast from the point where the change occurs to all other nodes, which update their databases accordingly. Thus it is possible that at some time, the various machines will not have identical databases, if information about recent changes has not propagated through the whole network. During this period, datagrams may not be routed in the best way, and may even travel in loops. However, over an extended period without changes, the databases will each converge to a true representation of the topology, and therefore datagrams will after that time be sent along the cheapest route from origin to destination.

Hierarchical Names

In a network, each machine has to be distinguishable from every other. This is done by giving each a unique name. In order to maintain uniqueness, one could imagine having a single authority that issues the names for the whole network. However this approach clearly does not scale well. Instead, names are organized hierarchically. For example the name of a machine might be cluster.cs.su.oz.au. Here each component of the name refers to a grouping of machines, usually by organization. In this case cs refers to the Department of Computer Science, su refers to the University of Sydney, oz refers to the collective of universities, and au refers to Australia. Each group has an authority which ensures that no two contained organizations (at the level immediately below) are referred to by the same symbol. Thus within the University of Sydney the Celtic Studies Department must pick a reference other than cs. However, it is perfectly acceptable for the Physics Department also to choose the reference cluster for a machine (named cluster.phys.su.oz.au) or indeed for cs to refer to Celtic Studies in the University of Tasmania (where a machine might well be named cluster.cs.utas.oz.au). Thus, unique references for the components of each single organization produce global uniqueness of the complete names. We refer to the collection of machines which share a common suffix for their name as a region and say that the name of the region is that suffix (for example cs.su.oz.au is the name of the collection of machines in the University of Sydney's Computer Science Department, and mu.oz.au is the name of the collection of machines in Melbourne University). For uniformity, we say that a single machine forms by itself a region of the lowest level. We say that one region is a child of another if the name of the first consists of a single symbol followed by the name of the second. (We also say that the second is the parent of the first).
More generally if the name of one region is a suffix of the name of another, we say that the first is an ancestor of the second, and the second a descendant of the first. We can organize the names of all the regions in the network into a tree based on this relationship (the root of the tree is the region consisting of the entire network, which has the empty name).
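The suffix relationships just defined can be sketched in Python; these helpers are illustrative (the names are our own, not part of the implementation described in this paper), and treat the empty name as the root region.

```python
# Hypothetical helpers for the region relationships defined above.
# Names are dotted strings; a region's name is a suffix (on component
# boundaries) of the names of all machines and regions it contains.

def components(name):
    """Split a dotted name; the empty name (the whole network) has none."""
    return name.split(".") if name else []

def is_ancestor(region, other):
    """True if `region` is an ancestor of (or equal to) `other`."""
    r, o = components(region), components(other)
    return len(r) <= len(o) and o[len(o) - len(r):] == r

def parent(region):
    """Drop the leading component, giving the parent region."""
    return ".".join(components(region)[1:])
```

Note that the comparison must respect component boundaries: a plain string-suffix test would wrongly make cs.utas.oz.au an ancestor of names ending in ics.utas.oz.au.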

Hierarchical Routing

The usual pattern of an organization like an academic department is to be based in one geographic location. It is natural to connect machines in a department by cheap links (since all the machines are physically close) and then to link two different departments by building more expensive links between a single pair consisting of one machine from each group. This pattern can be repeated at higher levels, with different universities linked by choosing one major hub in each university, and connecting those hubs by a single high-bandwidth link. In this situation, the shortest route from cluster.cs.su.oz.au to any machine in Melbourne University would be the same at first: within cs.su.oz.au to the Departmental hub, then to the University of Sydney hub, and from there to the Melbourne University hub. Only once the datagram has reached some machine in Melbourne University would the specific destination within that organization become a controlling factor. This leads to the idea of hierarchical routing, that is to allow a machine to make routing decisions based not on the whole destination name, but only on some suffix of that name (usually the suffix that is the name of a region that is a child of the smallest common ancestor[2]  of the routing machine and the destination). This approach was introduced by Kleinrock and Kamoun [Kleinrock 77] who showed that under certain (unrealistic) assumptions, this allowed considerable reductions in the length of the routing table stored at a node, while not causing significant increases in the cost of the paths taken. The algorithm of this paper is based on the idea of hierarchical routing, but is affected by a number of pragmatic considerations, that make it somewhat more complicated than the original scheme of Kleinrock and Kamoun.

It is important to distinguish general hierarchical routing from the routing in the Internet. Machines in the Internet are referred to by users with hierarchical names (such as welch.lcs.mit.edu), but these are resolved by the origin (using a hierarchy of domain name servers) into addresses that have only a two-level structure (network-number.host-number). Internet Routing is done entirely using the network-number until the correct network is reached, when the host-number is used by the network's local routing mechanism.[3]  The space of network-numbers is completely flat, with no requirement for example that different networks within an organization have related identifiers, and the routing decisions cannot in any way use the fact that a route is known to one machine in mit.edu to deduce the route for another machine in mit.edu if the second is on a different network. In contrast, the algorithm described in this paper uses the hierarchically structured names in making routing decisions at every machine along the route.

Goals

The routing mechanism described here has as its primary goal the successful routing of datagrams from origin to destination. It is intended that the route used be close to the lowest in cost, although it is reasonable to sacrifice perfect shortest-path routes (and thereby waste bandwidth on the network) in order to allow great reductions in the memory requirements in most machines (for storing routing tables and database) and to reduce the bandwidth consumed by propagating database updates. It is important that the algorithm allow for the possibility that links may be uni-directional, or even bi-directional with different costs in the two directions. We further need to allow distributed management, and to minimise the amount of effort required to join the network. We must allow small capacity nodes to participate in a very large network.

Pragmatic considerations provide some other constraints on the design. It is essential that the system allow for regions that are not internally connected, that is, where the only routes between two machines in the region involve machines that are not in the region. This is not common (as we noted, regions as organizations are usually based in one geographical area, and then are likely to use only local links to exchange datagrams), but it does occur with some organizations having branches in several cities[4]  and it must not cause routing to fail.

The mechanism should also route successfully even when knowledge of recent topology changes has not propagated throughout the network. This is important since the frequency of topology changes will be similar to that in other networks, but propagation times for datagrams should be allowed to be variable and sometimes very large, due to geographical isolation or low traffic levels which can make intermittent dial-up connections[5]  economically appropriate between some sites.

It is also important that a machine can join the network and soon begin to originate datagrams, without waiting for infrequent broadcasts of topology changes to build its database. Similarly knowledge of a new region (and a route to it, even if not the best route) should be spread rapidly, so that it may begin receiving datagrams.

Some organizations wish to join only a part of the network. For example, a small start-up computing company (forming a region start.com.au) which is situated near Sydney might wish to exchange mail with the region cs.su.oz.au, but not with the rest of the network. It is desirable to have an inbuilt mechanism to allow this, rather than relying on ad-hoc tactics such as editing the routing tables! A related concern is that each machine should make explicit the level of involvement that is appropriate for it, so that for example a university can choose one or two machines to act as hubs (and to participate in routing for datagrams to outside the university), and the rest of the machines can be nominated as purely local to the university, or even as local to a department. Local machines should not need to store in their database information outside their area of concern.

MAIN IDEAS

In this section we discuss some of the key concepts underlying the routing algorithm we describe. These concepts are not necessarily implemented exactly as described here, for they interact with one another and also some extra features are treated, some optimisations are performed, and some ugly hacks are added. Nonetheless, we believe that the concepts are in themselves valuable contributions to the study of hierarchical routing, and also that they will help the reader understand the detailed algorithm when it is presented.

Restricting Topology Update Propagation

As discussed above, each node is required to specify the level of the hierarchy at which it expects to participate in routing decisions. This is done by naming a region that is ancestral to the node itself as its region of routing interest, or visibility.

For example, the Computer Science hub basser.cs.su.oz.au participates in routing within the University of Sydney, and therefore is declared visible in the region su.oz.au. The machine cluster.cs.su.oz.au on the other hand is interested only in activity internal to the Department and so is declared visible in cs.su.oz.au. Information about the links between a node and its neighbors ought to be broadcast only within its region of visibility, rather than to the whole network as in the simple Link State Routing scheme. Thus a change in the connections at basser.cs.su.oz.au will not be reported to munnari.cs.mu.oz.au, since the latter is not in su.oz.au, and a change in the state of cluster.cs.su.oz.au will not be reported to summer.maths.su.oz.au. Indeed topology updates are even more restricted, since within the region of visibility they are not propagated to machines whose own visibility is so small as not to include the changed node. Thus if summer.maths.su.oz.au is visible only in maths.su.oz.au, it does not receive updates from basser.cs.su.oz.au, despite being within su.oz.au.
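The two-sided propagation rule just illustrated can be stated as a predicate: an update about a changed node reaches a receiver only if the receiver lies within the changed node's region of visibility, and the receiver's own visibility includes the changed node. The following sketch (our own formulation, not code from the implementation) makes this precise.

```python
# Illustrative sketch of the update-propagation rule. Regions are dotted
# suffixes; the empty region name denotes the whole network.

def is_suffix(region, name):
    """True if `region` names an ancestor of `name` (suffix on dot boundaries)."""
    return name == region or name.endswith("." + region) or region == ""

def receives_update(recv, recv_vis, changed, changed_vis):
    # Receiver must lie in the changed node's region of visibility, AND
    # the receiver's own visibility must be wide enough to include the
    # changed node.
    return is_suffix(changed_vis, recv) and is_suffix(recv_vis, changed)
```

With the example above: summer.maths.su.oz.au (visible in maths.su.oz.au) fails the second condition for updates from basser.cs.su.oz.au, even though it lies within su.oz.au.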

These restrictions imply that arbitrary choice of visibility can cause routing to fail even in a network where a path exists between any two nodes. The fundamental condition for successful routing is that each internally connected part of a region (say d.e.f) should contain at least one node (say a.b.c.d.e.f) that is visible beyond that region (i.e. the visibility of a.b.c.d.e.f is either e.f or f or the whole world), and that this node should form a connected subgraph together with the corresponding nodes (with visibility at least d.e.f) from each connected part of each child region (such as c1.d.e.f, or c2.d.e.f).

This restriction on choices is regarded as a feature, since creative violation allows us to provide restricted service to organizations that want it: if an organization x.y.z.f connects directly only to a node a1.b1.c1.d.e.f whose visibility is only d.e.f, then it will be able to communicate with nodes in d.e.f but not with the rest of e.f.

Pruning the Routing Tree

In hierarchical routing algorithms, the routing table contains entries giving the next hop towards regions as well as towards machines. In the simplest scheme, each destination lies inside exactly one region (possibly the machine itself) for which a next hop is listed, and this region is the child of the smallest region that is an ancestor of both the router and the destination. Thus datagrams moving towards all the nodes in that region follow the same route initially. This is usually what is wanted, but it is inappropriate for regions that are not internally connected. For example, the next hop for a datagram from the Sydney University hub to a destination in csiro.oz.au might depend on whether the destination is in syd.csiro.oz.au or in melb.csiro.oz.au. This leads to the idea of developing the format of the routing table dynamically (driven by common next hops) rather than statically (based purely on the level of shared suffix in a name). Thus the routing algorithm ought to calculate routes to destinations it is aware of within each region, and if they are all the same, then they can be summarized in the routing table by a single entry. Thus one imagines that what is calculated is a table containing a next hop for every member of the tree of names, and then the entries for a subtree (provided all have the same next hop) can be pruned to a single entry for the root of that subtree. This can sometimes lead to more efficient routing tables than the static method. For example, a node with only a single neighbor could collapse its routing table to a single entry.

Pseudolinks

At some stages during the execution of the routing algorithm, a machine, say X, may discover that another machine Y knows a route (or at least a next hop) for a destination Z[6]  even though no route to Z exists through links known to X. In this case, the database entry for Y is altered by inserting a uni-directional `pseudolink' of maximal cost from Y to Z. Z is given a default visible region equal to itself to prevent it being used to resolve routes to higher regions. This may allow datagrams destined for Z to be successfully routed via Y, until enough details arrive to calculate a complete route to Z.

ALGORITHM

We now describe the algorithm in detail, showing how routing decisions are made based on a routing table, how the routing table is calculated from the topology database, and how the topology database is maintained. The actual code incorporates a few inessential variations from the algorithm described, and these are discussed at the end of this section.

The algorithm is based on the provision at each node by link-level communications software of a data structure that records, for each direct connection outward from the node, the following information: the name of the node to which the link goes and the cost of using that link. We will refer to this structure, together with the node's own name and its region of visibility, as the local data for the node.

Routing

Each node X calculates (as described below) a routing table. This is a list of pairs, each consisting of a name (of a machine or of a larger region) together with the name of a neighbor of X. When a datagram must be sent from X (either because it originated there, or because it was received by X but has destination different from X) then the routing table is scanned sequentially until the first entry is found where the region named includes the destination of the datagram.[7]  The datagram is then transmitted over the link from X to the neighbor mentioned in that entry. If this neighbor is not itself the destination, then the routing process will be repeated at that neighbor.

Because the table is scanned in order, it is important that for any region its entry (if any) should precede the entry for its parent region. This allows default behavior, where a particular next hop is used for all destinations within a region except those which lie in subregions that are explicitly mentioned in the table. For example, if the routing table contains an entry for phys.su.oz.au with next hop cluster.phys.su.oz.au and a later entry for su.oz.au with next hop extro.su.oz.au, then datagrams for machines in the Physics Department would be sent to a (directly connected) machine in that department, and datagrams for unknown departments within Sydney University would be sent to the main routing hub for the campus.
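The lookup just described can be sketched as follows. This is an illustrative rendering (the table contents are the example from the text, and the helper names are ours): the table is an ordered list of (region, next hop) pairs, and correctness depends on subregion entries preceding their parents.

```python
# Minimal sketch of the sequential routing-table scan described above.

def region_includes(region, dest):
    """True if `dest` lies within `region` (dotted-suffix match)."""
    return dest == region or dest.endswith("." + region) or region == ""

def next_hop(table, dest):
    for region, hop in table:          # first match wins, so entries for
        if region_includes(region, dest):  # subregions must precede parents
            return hop
    return None

table = [
    ("phys.su.oz.au", "cluster.phys.su.oz.au"),
    ("su.oz.au", "extro.su.oz.au"),
]
```

Reversing the two entries would send Physics traffic to the campus hub, which is why the ordering constraint matters.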

Broadcasts   The algorithm allows for datagrams destined for a region rather than a single machine. This is used in propagating topology updates, as described below, but is also available to users, who can provide a destination like *.su.oz.au. A broadcast to a region is to be delivered to all the machines within that region whose visibility includes the region. A broadcast routing table is calculated, based on the usual routing table. For each destination region it contains a list of next hop neighbors, namely all those that appear in the usual routing table for any node within the destination region whose region of visibility includes the destination region. Each broadcast datagram is given a unique identifier at the origin (consisting of the origin and a sequence number) and an expiry time. It also contains a field for a link history, which is the list of all machines through which the datagram has travelled on its way to its present location. Each node maintains a broadcast database recording the identifiers of broadcast datagrams processed and their expiry times.

When a broadcast datagram is received by a node, its identifier is checked in the broadcast database. If the identifier is already recorded, or if its expiry time has passed, it is discarded. Otherwise the datagram's link history is altered by appending the current node, and then its identifier and expiry time are recorded in the broadcast database, and the datagram is forwarded on all links to those neighbors recorded for the destination region in the broadcast routing table, except for those neighbors whose name already appears in the link history. Periodically the broadcast database is purged of expired entries. Thus this algorithm is a variant of the one called `hot potato forwarding algorithm' discussed by Dalal and Metcalfe [Dalal 78].[8] 
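The per-datagram broadcast step can be sketched as below. The datagram structure and field names are invented for illustration; the logic follows the description above: discard duplicates and expired datagrams, append the current node to the link history, then forward on all broadcast-table links except those leading back into the history.

```python
# Sketch of duplicate-suppressing broadcast forwarding.
# seen:        dict mapping datagram identifier -> expiry time
# bcast_table: dict mapping destination region -> list of next-hop neighbors

def forward_broadcast(dgram, node, seen, bcast_table, now):
    ident = dgram["id"]                 # (origin, sequence number)
    if ident in seen or dgram["expiry"] <= now:
        return []                       # duplicate or expired: discard
    seen[ident] = dgram["expiry"]       # record before forwarding
    dgram["history"].append(node)
    return [n for n in bcast_table.get(dgram["dest"], [])
            if n not in dgram["history"]]   # never back along the history
```

A periodic sweep over `seen`, deleting entries whose expiry time has passed, implements the purge mentioned above.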

Multicasts   The routing algorithm even allows datagrams to be sent to a set of destinations rather than a single destination. This could of course be done simply by replicating the datagram and sending it separately to each destination. Instead, our mechanism reduces costs as follows: when a datagram has multiple destinations, a node forwards it to any neighbor that is given by the routing tables as a next hop[9]  for any of the listed destinations, but before doing so, the destination list in the datagram is modified to remove destinations for which the given neighbor is not an appropriate next hop.
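The cost-reducing split can be expressed compactly: group the destination list by next hop, so that each neighbor receives one copy carrying only the destinations routed through it. The sketch below assumes a `next_hop` lookup function over the ordinary routing table.

```python
# Sketch of the multicast split: one forwarded copy per distinct next hop,
# each copy's destination list restricted to that neighbor's destinations.

def split_multicast(dests, next_hop):
    copies = {}                          # neighbor -> destinations via it
    for d in dests:
        hop = next_hop(d)
        if hop is not None:              # unroutable destinations are dropped
            copies.setdefault(hop, []).append(d)
    return copies
```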

Multicast Broadcasts   The routing algorithm also provides a mechanism to reach every machine within a region, irrespective of visibility. This is done by allowing broadcast addresses within multicast sets, and treating each broadcast address separately, as above. For instance this allows addresses of the form: *.cs.su.oz.au,*.su.oz.au.

Calculating the Routing Tables

The routing tables are calculated from the topology database, which is a set consisting of local data for a number of machines. That is, each entry in the topology database consists of a machine name, its visibility and a set of neighbors each with the cost of the link from the named machine to that neighbor.

The calculation proceeds (conceptually) by forming a weighted directed graph consisting of (as nodes) all names mentioned in the database together with all their suffixes (that is, every name whose existence is deducible from the database). A directed edge is placed from any node to each of its neighbors (as recorded in the topology database) with the edge's weight being the cost recorded for that link. In addition a directed edge (with weight 0) is placed from any machine to each ancestral region included in the region of visibility of the machine involved.

The usual priority-first search strategy (described for example in Chapter 31 of [Sedgewick 88]) enables the calculation of the shortest path from the node representing the machine at which the calculation is done to all other nodes of the graph. It is straightforward to maintain during this algorithm for each node not only the cost of the shortest path found so far, but also the first hop neighbor along that path. Thus at termination, the shortest path algorithm has computed for each region (including each known machine) the next hop on the shortest path known that leads to the region and does not use intermediate nodes for forwarding beyond their own declared region of visibility.
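A sketch of this search, with the first-hop bookkeeping, is given below. The graph is assumed to be a dict mapping each name to its outgoing (neighbor, cost) edges, already including the zero-cost edges from machines up to their visible ancestor regions; this is our own minimal rendering, not the implementation's code.

```python
import heapq

# Priority-first (Dijkstra) search that records, for each reachable name,
# both the cheapest cost and the first-hop neighbor on that cheapest path.

def shortest_first_hops(graph, source):
    dist = {source: 0}
    first = {}                           # destination -> first-hop neighbor
    pq = [(0, source, None)]             # (cost so far, node, first hop)
    while pq:
        d, node, hop = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue                     # stale queue entry
        if hop is not None:
            first[node] = hop
        for nbr, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # first hop is inherited along the path; the source's own
                # neighbors are their own first hop
                heapq.heappush(pq, (nd, nbr, hop if hop is not None else nbr))
    return dist, first
```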

From the information just described, the routing table is calculated.[10]  First we conceptually form the name hierarchy tree containing all known region names, each labeled by the next hop on the shortest known path to that region. Then we prune the tree as follows: suppose X is the name of the machine at which the calculation is being performed, and we are considering whether to delete a node Y, given the next hop Z to its parent. We first recursively try to delete each of its children, given the next hop to Y. If all Y's children are deleted successfully, and Y itself is neither a direct neighbor of X nor a region ancestral to X that is within X's region of visibility, and the next hop to Y is equal to Z, then Y is deleted successfully. We apply this procedure to each top-level region known, which results in a pruned tree of region names, each labelled with the appropriate next hop. The routing table is simply a post-order traversal of this tree.
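The pruning pass and the final traversal can be sketched as follows. The tree representation is invented for illustration: a dict mapping region name to (next hop, children), with `keep_always` holding the names that are never pruned (X's direct neighbors and the regions ancestral to X within its visibility).

```python
# Sketch of routing-table construction by pruning the labeled name tree.

def prune(name, node, parent_hop, keep_always):
    """Try to delete this subtree; return True if the caller should remove it."""
    hop, children = node
    doomed = [c for c, n in list(children.items())
              if prune(c, n, hop, keep_always)]
    for c in doomed:
        del children[c]
    # Deletable only if all children were deleted, the node is not protected,
    # and its next hop matches the parent's (so the parent entry suffices).
    return not children and name not in keep_always and hop == parent_hop

def routing_table(tree, keep_always):
    for name in list(tree):              # top-level regions have no parent hop
        if prune(name, tree[name], None, keep_always):
            del tree[name]
    out = []
    def walk(children):
        for name, (hop, kids) in children.items():
            walk(kids)                   # post-order: subregion entries come
            out.append((name, hop))      # before their parents in the table
    walk(tree)
    return out
```

The post-order traversal is what guarantees the ordering property required by the sequential scan: every surviving subregion entry precedes its parent's entry.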

Maintaining the Topology Database

A node X maintains its topology database by receiving updates from other nodes as described below. The update may consist of a new version of the local data for the other node, or a new version of its computed routing table.

When a version of the local state of Y is received, X first checks that it is actually more recent than existing information (this is determined using the identifiers on broadcasts) and that Y is already mentioned in the database,[11] and then enters the information in the database, either by overwriting existing local state for Y, or if none is present, as a new entry.

When a version of Y's routing table arrives, X checks that it is more recent than existing information and that Y is already mentioned in the database. X then considers each region named in the table; if the region was not previously known to be reachable from X, the database entry of Y is modified by appending a pseudolink from Y to that region, marked as such (and treated by the routing table calculation as having maximal cost).

Whenever a node's database changes and a new routing table is calculated, the database is afterwards cleaned by removing pseudolinks that are no longer needed (because the endpoint is now reachable via a route of genuine links); other integrity checks are also performed.
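Pseudolink maintenance can be sketched as two operations: absorbing a neighbor's routing table (adding pseudolinks for otherwise-unreachable regions), and cleaning after recalculation. The database representation and the sentinel cost are our own illustrative choices.

```python
MAX_COST = 10**9   # assumed sentinel standing in for "maximal cost"

# db maps node name -> list of (endpoint, cost, kind) links,
# where kind is "real" or "pseudo".

def absorb_routing_table(db, y, regions, reachable):
    """Add a pseudolink Y -> region for each region X cannot otherwise reach."""
    links = db.setdefault(y, [])
    for region in regions:
        if region not in reachable:
            links.append((region, MAX_COST, "pseudo"))

def clean_pseudolinks(db, reachable_by_real_links):
    """Drop pseudolinks whose endpoints are now reachable via genuine links."""
    for links in db.values():
        links[:] = [l for l in links
                    if l[2] != "pseudo" or l[0] not in reachable_by_real_links]
```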

Each node V sends out new versions of its topology in the following way: Whenever the local data for V changes, the new information is sent in a multicast broadcast to each region that contains V and is within its region of visibility, and also to every direct neighbor of V. Thus if basser.cs.su.oz.au is visible in su.oz.au, its update is sent to *.cs.su.oz.au and to *.su.oz.au, as well as to each neighbor. In addition if there is a change in V's routing table caused by the appearance of a region ancestral to the region of visibility of V, then the new routing table is sent in a broadcast to the region that contains V and is a child of V's region of visibility.[12]  Thus in the example just mentioned, any change in the routing table concerning a region ancestral to su.oz.au (such as the appearance of the region oz.au) will cause the new table to be broadcast to *.cs.su.oz.au.
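The set of destinations for a local-data update, as described in the basser.cs.su.oz.au example, can be computed by walking up the name from the node to its region of visibility; this helper is an illustrative sketch, not part of the described implementation.

```python
# Sketch: where a node's local-data update is sent. One broadcast address
# for each region between the node and its region of visibility
# (inclusive), plus every direct neighbor.

def update_destinations(node, visibility, neighbors):
    dests = list(neighbors)
    parts = node.split(".")
    for i in range(1, len(parts)):
        region = ".".join(parts[i:])
        dests.append("*." + region)
        if region == visibility:        # stop at the region of visibility
            break
    return dests
```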

Also when a new machine joins the network, it explicitly requests the routing table from each of its neighbors, to allow it rapidly to establish enough pseudolinks in its database for routing to occur to all destinations.

Additional Features

The implemented system has some additional features that were ignored above as inessential. For example, each link has two separate cost measures associated with it (one being delay and the other financial cost). Each datagram is marked at the origin as desiring either fastest or cheapest transmission. Every node calculates two routing tables, and uses the appropriate one for each datagram.

For nodes that have low visibility but in fact have enough memory capacity to store a larger database than that for their visible region, the algorithm allows a node to request copies of the entire topology database from major hubs, and thus improve the quality of the information available to it when calculating shortest paths.

Further minor hacks are that local data is actually broadcast when there is a significant change in delay along incoming links, and that the topology database is edited to include provisionally the reverse of any bidirectional link when one direction is first discovered, in the expectation that the correct cost for the reverse direction will soon be discovered, but that in the meantime, the reverse cost is a good approximation.

We also wish nodes to be able to restrict the destinations of datagrams travelling over a link, to allow what can be considered as `private' links. Each link inward to a node may have a region restriction placed on it, which prevents that link being used to carry datagrams whose ultimate destination is not within the region mentioned. This restriction is recorded in the local data, and propagated through the topology databases of other nodes to avoid routing loops. The calculation of shortest paths at each node must respect this restriction.

We allow a limited form of source-based routing called `explicit routing', in which an address can list a series of destinations, each of which will be visited in order. However, the choice of route between destinations listed is under the control of the routing algorithm at each node.

The routing calculation is performed by building a priority queue of links to be traversed. Because the time for the routing calculation is dominated by the insertion time for the priority queue, we use a probabilistic skip list [Pugh 90] so insertion time is proportional to log e (e is the number of edges). Since an edge is inserted into the queue at most e times, the time is proportional to e log e. Typical networks are often sparse, so that the number of edges e is proportional to the number of nodes n, and the running time is more typically n log n. The worst case of a fully connected network would be n² log n.[13] 

When resolving addresses, we don't search the routing tables sequentially. Instead they are searched using a binary chop for an exact match on the address. If no match is found, the first domain is dropped, and the next higher region used. This process proceeds until a region is found for which a route is known, yielding a lookup time usually proportional to log n, rather than n.
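This faster lookup can be sketched with a sorted list of region names and a binary search; the code is an illustrative rendering of the strategy, not the implementation's. Because the search proceeds from the full address toward ever shorter suffixes, the most specific matching region still wins, preserving the precedence that the sequential scan obtained by ordering.

```python
import bisect

# Sketch of the binary-chop lookup: exact-match binary search on a sorted
# list of region names, retrying with the enclosing region on each miss.

def lookup(sorted_regions, hops, dest):
    name = dest
    while name:
        i = bisect.bisect_left(sorted_regions, name)
        if i < len(sorted_regions) and sorted_regions[i] == name:
            return hops[name]
        name = name.partition(".")[2]   # drop the first domain component
    return None
```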

EXPERIENCE

The scheme described above is in use in a message-oriented wide-area network containing over one thousand nodes separated by several thousand kilometres. The areas are based on X.400 domain names, which are: country, private management domain (PRMD), organisation, and department. Major routing nodes at the PRMD level of visibility receive about 80 broadcast topology updates per week, composed mostly of local organisation-level changes, with a few announcements of new organisation areas joining the net, or changed connections between organisations. Broadcast topology messages have a timeout of one week.[14] 

A new node joins the network by configuring a local topology description file with its address, visible region, and details about its links. The links have parameters for cost and expected delay, and address restrictions if desired. The administrator's e-mail, telephone, and postal addresses are also added. When the network starts, the local details are automatically propagated within the visible region, and the nearest neighbors supply sufficient topology information for the new node to route messages to the rest of the network.

The topology data-base at a typical major routing node with 23 links is about 100 kilobytes (Kb), generating routing tables of 40 Kb containing 400 regions.[15]  The local state information message is 3 Kb. A shortest path calculation takes about 0.1 seconds on a 10-mips CPU, and is run once for the fastest routes, and once for the cheapest. (The routing update time is dominated by the I/O overheads which consume about 1.3 seconds.)

The node routes about 3000 messages per day of average size about 10 Kb. These messages consist of mail (~50%), news (~30%), files (~10%), and topology updates (~0.3%), with the rest split evenly between other message types.

By contrast a typical leaf node (i.e. a node visible only in its parent region) has a topology data-base of about 15 Kb, generating routing tables of about 8 Kb containing 19 regions, with negligible CPU overheads. The local state information datagram is 0.5 Kb.[16] 

About one looping message is detected per week, and has to be re-routed by operator intervention. These all appear to be due to inconsistencies in routing decisions between the new algorithm and the old, still in use in parts of the network.

DISCUSSION

In this section we make several miscellaneous observations about our algorithm.

By comparison with schemes in which a node's routing table is collapsed at statically (syntactically) determined levels, our method of tree pruning via common shortest paths allows shorter paths to be found. Thus an organisation with widely dispersed locations, where the shortest paths to different nodes from an external origin may take quite different directions, need not receive all traffic through a single route. Indeed, the reason our algorithm does not always find perfect shortest paths is that topology databases are incomplete (because of visibility restrictions on update propagation), a restriction required to prevent small nodes from being swamped by topology information.

The use of routing tables derived from pruned topology trees requires delayed resolution of addresses, since any one node does not contain enough information in its routing tables to determine the correctness of an arbitrary address. This results in illegally addressed datagrams being transmitted unnecessarily, but this is only a problem if the proportion of illegal addresses to legal addresses is significant.

The broadcast scheme we use is more resilient to changes in network topology (particularly those occurring during the propagation of the broadcast) than other algorithms such as Dalal and Metcalfe's `reverse path forwarding' [Dalal 78]. The overheads of our scheme involve the space in each datagram required for the complete link history, and the database of broadcast identifiers. Our experience shows that these are well worthwhile, in view of the added resiliency.

The low overheads of the routing calculation, together with the measured infrequency of routing updates, lead us to believe that this algorithm has applicability at the network layer, as well as the application layer.

RELATED WORK

Here we mention some other work involving hierarchical routing techniques. We do not attempt to evaluate these algorithms, but merely indicate the principal differences between our method and each.

Kleinrock and Kamoun [Kleinrock 77] present an algorithm that uses a static determination (based on the level of common ancestry in the naming tree) of which regions are mentioned in the routing table, while our algorithm makes this decision based on current topology information. Their method requires each region to be internally connected, whereas ours does not. Finally, their algorithm is based on Distance Vector routing rather than Link State routing.

Perlman [Perlman 85] considers a system where each level of the hierarchy has its own routing algorithm, and where nodes at a given level need not form a connected subgraph, in contrast to our system where the same algorithm is used everywhere, and each level must be connected. Thus in Perlman's system, a datagram being transmitted between regions is repeatedly repackaged with a separate level 1 header for each stage of travel between level 2 routers, while our system does all routing from the same header.

The Landmark algorithm [Tsuchiya 88] uses hierarchical addresses that change dynamically as the network's topology changes, unlike our system where a node's name is fixed. The system is based on Distance Vector routing. In our algorithm, datagrams usually pass through intermediate nodes that are hubs of increasing importance, but in Landmark, datagrams usually do not actually reach the widely visible landmarks on their way.

The Pathalias system [Bellovin 86] is used for far more complicated address structures (combining disparate network address syntaxes). It is a source routing mechanism, in which a complete route is chosen at the source and subsequent nodes merely forward the datagram on the link previously chosen. In contrast, our algorithm makes independent routing decisions at each node.

FURTHER WORK

The algorithm is working well, but there are some variations that are being investigated. We allow `default' entries in the routing tables. This provides pattern matching for addresses that haven't matched any `normal' entry so that they may be forwarded to some other node. This is clearly unnecessary for dealing with normal addresses that are on the network, but we are using this `feature' for dealing with addresses known to exist, but which aren't members of the network at present, for example nodes in the USA with addresses ending in .gov or .edu.

It appears to be possible for `pseudolinks' to be retained even after the real routing information from which they arose has disappeared from the net. In view of the observed low topology broadcast rate, we intend to investigate deleting pseudolinks by timeout, at the expense of enforcing a certain minimum topology broadcast rate.

We also intend to perform simulation studies to determine by how much the routes found exceed the cheapest possible if complete topologies were maintained at every node. This `overhead', which is caused by the visibility restriction in broadcasting database updates, seems reasonable in the existing network, but it would be interesting to see how different sorts of topology affect this aspect of performance.

One could also study the effects of varying some of the rules governing propagation of topology changes, on the performance of the algorithm. For example, one might broadcast local data from a node to all others within its region of visibility, rather than only to those that are mutually visible. This increases the size of topology databases, but might allow more efficient route calculation. It would be good to understand more precisely the trade-off involved.

We also intend to try automatic loop resolution. The loop detection program could be modified to re-run the routing algorithm with temporary increases in the weights of all the links in the loop (each message carries a complete link traversal history in its header). If the result were a different path, an `explicit' address could be used to force a `source-based' alternate route.
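The proposed resolution step can be sketched as follows. This is a speculative illustration, not the implementation: the function names and the penalty factor are hypothetical, and a plain Dijkstra computation stands in for the real routing calculation. The links recorded in the looped message's traversal history are weighted up, the routes are recomputed, and a changed first hop indicates an alternate route worth forcing:

```python
import heapq

def first_hop(links, source, destination):
    """Return the first hop on a cheapest path, or None (plain Dijkstra).

    links: dict mapping node -> list of (neighbor, cost) pairs.
    """
    best = {source: (0, None)}
    queue = [(0, source, None)]
    while queue:
        cost, node, hop = heapq.heappop(queue)
        if cost > best[node][0]:
            continue                       # stale queue entry
        for nbr, c in links.get(node, []):
            h = nbr if hop is None else hop
            if nbr not in best or cost + c < best[nbr][0]:
                best[nbr] = (cost + c, h)
                heapq.heappush(queue, (cost + c, nbr, h))
    return best[destination][1] if destination in best else None

def reroute_after_loop(links, source, destination, loop_links, penalty=10):
    """Re-run routing with the looped links temporarily weighted up.

    loop_links: set of (node, neighbor) pairs taken from the message's
    link traversal history.  A different first hop in the result
    suggests an alternate route that an explicit address could force.
    """
    penalised = {
        node: [(nbr, cost * penalty if (node, nbr) in loop_links else cost)
               for nbr, cost in nbrs]
        for node, nbrs in links.items()}
    return first_hop(penalised, source, destination)
```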

Finally, a rigorous proof of correctness for the algorithm would be an interesting exercise in formal techniques.

CONCLUSION

We have presented an algorithm that provides routing for messages using hierarchical addresses. The algorithm relies on some key ideas: restricting topology updates to nodes of sufficient importance in the region where the change occurs; pruning the routing table in a dynamic fashion, so that a region where all nodes have the same next destination is represented only once; and broadcasting topology updates by forwarding new information to all neighbours that are the next destination for any ultimate destination. The algorithm has been implemented and works well. We believe that the algorithm is valid for network-level as well as application-level routing.

REFERENCES

S. M. Bellovin and P. Honeyman [1986], `PATHALIAS or The Care and Feeding of Relative Addresses', Summer USENIX Conf Proc (1986).

Y. K. Dalal and R. M. Metcalfe [1978], `Reverse path forwarding of broadcast datagrams', Comm of the ACM, 21(12):1040-1048 (1978).

L. Kleinrock and F. Kamoun [1977], `Hierarchical routing for large networks', Computer Networks 1:155-174 (1977).

J. McQuillan, I. Richer, and E. Rosen [1980], `The new routing algorithm for the ARPANET', IEEE Trans. on Comm., 28(5):711-719 (1980).

R. Perlman [1985], `Hierarchical networks and the subnetwork partition problem', Computer Networks and ISDN Systems 9:297-303 (1985).

W. Pugh [1990], `Skip lists: a probabilistic alternative to balanced trees', Comm of the ACM, 33(6):668-676 (1990).

R. Sedgewick [1988], `Algorithms (2nd ed)', Addison-Wesley (1988).

P. Tsuchiya [1988], `The Landmark hierarchy', Proc SIGCOMM 1988, 35-42.


Footnotes

Authors' addresses and acknowledgement

Authors' address:
Department of Computer Science
Sydney University
NSW 2006, Australia

E-mail addresses:
Fekete: fekete@cs.su.oz.au
Kummerfeld: bob@cs.su.oz.au
Lauder: piers@cs.su.oz.au

We gratefully acknowledge support for this work from the Commonwealth Scientific and Industrial Research Organisation, Telecom Research Laboratories, and from the Research Foundation for Information Technology in the University of Sydney.


Footnote 2
The smallest common ancestor's name is the longest common suffix of the router and destination names.

Footnote 3
If subnetting is used, the host-number may itself be of the form subnet-number.machine.

Footnote 4
The CSIRO (Commonwealth Scientific and Industrial Research Organization) is the principal example, with offices in Sydney and Melbourne relying on intercity connections between the main universities, in preference to building a dedicated CSIRO link.

Footnote 5
A single hop connection that is active only once every few hours should be allowable, so that end-to-end datagram delivery may take many hours.

Footnote 6
Z may be a machine or a region.

Footnote 7
The region contains the destination exactly if the region's name is a suffix of the destination's name.

Footnote 8
The Dalal and Metcalfe algorithm is not the same as `forward on single shortest queue' which is the usual `hot potato' routing.

Footnote 9
We say `a next hop' rather than `the next hop' because broadcast addresses may have several neighbors listed.

Footnote 10
This is redone every time the topology database changes.

Footnote 11
If Y is not already mentioned either as a node for which an entry occurs, or as a neighbor of a node with an entry, then the update is discarded, and information about Y will arrive later from nodes that can reach it. After that has occurred, information about Y's outgoing links will be allowed into the database.

Footnote 12
The new information might alter the routing table for these nodes, and thus cause further broadcasts.

Footnote 13
In practice, even highly connected networks have faster running times than n² log n, as links are unevenly weighted.

Footnote 14
This implies that the broadcast data-base need not be very large.

Footnote 15
About 23Kb of this is overhead due to the syntax of X.400 names.

Footnote 16
The size is dominated by the site administrator's details.