This paper describes a protocol designed specifically to transport variable-size messages over circuits with differing error and bandwidth characteristics, while using standard operating system services in its implementation. High throughput is achieved by allowing infinite `window' sizes and multiple message streams, and by careful collaboration with the operating system.
INTRODUCTION

Previous experience in implementing a message transport system using a standard `windowing' protocol suggested that higher throughput could be achieved by a protocol tailored specifically to message transport over generally reliable circuits. It seemed that protocol windows had to be several seconds in size to obtain high throughput, because window acknowledgement handshakes were the main cause of interruption. The interaction between operating system behaviour and protocol behaviour often impeded throughput where window management was concerned, and the reverse flow of window acknowledgements stole valuable bandwidth from bi-directional message transport.
Most circuits encountered had very low error rates, and were either slower than the file-systems at either end, or were built on sub-networks that already provided flow-control. Error detection was always necessary, but error correction was needed only rarely, so a higher overhead on error correction would be acceptable if it led to higher throughput. It seemed that a protocol whose window size was larger than the message size would effectively reduce the need for protocol acknowledgements in the reverse direction, except when the occasional error had to be corrected. It would also allow an implementation better suited to operating system characteristics, in that timing constraints imposed by protocol handshakes could be relaxed.
This paper attempts to explain the design criteria and subsequent implementation, together with the results from observing message transport behaviour over some typical circuits. It will concentrate on the implementation issues, rather than the protocol itself, as the greatest improvements to throughput came from understanding how to integrate a user-level implementation of a message transport protocol with the services available from a multi-user operating system.
MESSAGE TRANSPORT

The service to be provided is the transport of whole messages over a wide variety of unreliable circuits between two computers, each running a multi-user operating system. The messages may range in size from a few bytes to many megabytes, and each message has a transmission priority attached, so that higher-priority messages may be transmitted ahead of lower-priority ones. Outward-bound messages are supplied by some external routing process, which places them in a queue for transmission along a particular circuit. Inward-bound messages must be placed in a queue for the routing process after reception. Since there are queues at both ends of a circuit, messages may flow in both directions at once.
The transport service in turn makes use of standard operating system services, such as file-systems for message storage, inter-process communication, and access to devices and sub-protocols for data transfer.
An overall requirement for message transport is high speed: messages must be delivered as quickly as possible. However, the main determinants for the design of a suitable protocol for message transport are the circuit characteristics and the operating system constraints.
Circuits

So many circuit types are in use that it might be considered difficult to design a protocol to take best advantage of them all. However, from the point of view of message transport, circuit characteristics fall into the following categories:
Full vs. half-duplex. Some circuits can only operate at full speed in one direction at a time, either because they are inherently uni-directional, such as baseband or ring networks (only one packet of data at a time), or because they utilise the full capacity of a circuit to maximise throughput in a single direction, such as some modems.
This implies that the protocol must work well over a half-duplex circuit, so the basic operation of the protocol should be uni-directional, with a minimum amount of information flowing in the opposite direction. On the other hand, the protocol must also be able to transport messages in two directions at once, to maximise throughput over full-duplex circuits.
Delays. Some circuits impose delays on information transfer that are large with respect to the throughput, such as a circuit operating via satellite. Delays are introduced by `turn-around' in half-duplex carriers, or by circuits supported by operating system processes, where context-switching latency intervenes. Delays are also introduced by circuits that are themselves implemented on top of packet-switching networks.
Delays affect protocol performance where handshakes are necessary, such as those required to regulate throughput using `flow control'. Flow control is performed by allowing only one `window' of data to pass through a circuit before a protocol handshake must occur to permit another window to follow it. The larger the window size, the greater the tolerance to delay.
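This tolerance is simply the bandwidth-delay product: a window smaller than the data the circuit can hold in one round trip leaves the sender stalled. The sketch below illustrates the arithmetic; the circuit figures are hypothetical and not taken from the paper's tests.

```python
# Minimum window needed to keep a circuit busy across a round-trip delay.
# The figures below are illustrative, not measurements from this paper.

def min_window_bytes(throughput_bps: float, round_trip_s: float) -> float:
    """A window smaller than bandwidth x round-trip delay stalls the sender."""
    return throughput_bps / 8 * round_trip_s

# A 64 kbit/s link with a ~0.5 s satellite round trip needs a 4000-byte window:
print(min_window_bytes(64_000, 0.5))
```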
Errors. Errors are of three types: corrupted bits, added or lost bits, and failure of the circuit itself. Bit errors are further characterised by their occurrence as single isolated cases, or in bursts. Error bursts may align with byte or packet boundaries depending on circuit characteristics. Some circuits may fail in a way that causes `circuit-echo', whereby bits transmitted on the output are immediately reflected on the input side.
It is also possible that a circuit duplicates or reorders bits, and although these can be treated as corrupted data, the protocol should handle the special case where whole packets are reordered or duplicated.
The protocol must detect and correct errors: altered-data errors are best detected by partitioning the data into small packets, each with a cyclic redundancy check. The packet size should be flexible to cope with different error rates, since the smaller the packet, the greater its chance of successful transmission, but the greater the overhead (ratio of header information to data). Under changing error conditions the packet size should be adjustable to minimise overheads.
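This trade-off can be sketched numerically. The 11-byte header matches the figure given later in the tests section; the bit-error rates are illustrative assumptions.

```python
# Trade-off between packet size and goodput under a given bit-error rate.
# Smaller packets survive errors more often but pay more header overhead.
# The bit-error rates here are illustrative assumptions.

def effective_throughput(packet_bytes, header_bytes, bit_error_rate):
    """Expected fraction of circuit bandwidth delivering message data."""
    bits = packet_bytes * 8
    p_ok = (1 - bit_error_rate) ** bits          # packet arrives intact
    data = packet_bytes - header_bytes
    return p_ok * data / packet_bytes            # goodput fraction

# On a clean circuit, bigger packets win; on a noisy one, smaller packets win.
for size in (64, 256, 1024, 4096):
    print(size, round(effective_throughput(size, 11, 1e-5), 3))
```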
Circuit failure implies that messages must be restartable after the circuit is re-established. It is surprising how few message transport protocols actually implement message restart.
Large speed range. The protocol will need to extract the maximum throughput from circuits ranging in speed from a few bytes/second to gigabytes/second.
Apart from the different error syndromes of slow vs. fast circuits, which affect the optimum packet size, very high-speed circuits require the use of very large packets for optimum throughput (since it is often the packet throughput that limits data throughput).
Circuit establishment. One could of course use one circuit invocation per message, but very often the circuit itself is a scarce resource whose utilisation must be maximised (such as a circuit provided by a telecommunications modem). Even if circuits are plentiful, very often circuit establishment is expensive, in either time, or specific establishment-related costs.
Operating System Constraints

The message transport system must run on various computer platforms, which implies that the transport protocol must be implementable by user-level processes to ensure maximum portability. Protocols implemented by processes running at the user level of an operating system encounter special problems.
Context switch delays. These are caused, at the lowest level, by the overheads of switching between different processes, and at the highest level by overloaded memory resources forcing the transfer of processes out of main memory. The delays introduced by these overheads can range over many orders of magnitude, from micro-seconds to seconds. Such delays imply that the protocol cannot depend on precise timing, and should be tolerant of wide variations in response time.
Synchronous I/O. The common form of I/O provided by operating systems is synchronous, where one I/O operation must complete before the same process may begin another. Asynchronous I/O is often possible, but the mechanisms supporting it often introduce additional overheads that slow throughput.
The protocol must handle a minimum of three I/O streams: one message data source/sink; one data-stream from the circuit; and one data-stream to the circuit. The stream whose bandwidth must be utilised fully is the circuit itself, so the most important streams to de-couple are those to and from the circuit: output must not be blocked while waiting for input, and vice-versa.
File-system throughput. The file-system is often a bottle-neck for message transport performance, and great care must be taken to design a fast interface. Some sort of buffering scheme must be implemented to allow the file-system interface to be de-coupled from the circuit.
Flow control. In order to maximise throughput of the circuit, there must be some way of ensuring that the circuit doesn't provide data to the receiver faster than it can process it. The benefit of `sliding-window' protocols is precisely this, that a window need not be acknowledged until the receiver is ready to receive a new one.
However, most circuits used for message transport fall into one of two categories with respect to flow-control: they are either slow enough that a receiver can always be expected to cope (such as a telephone modem), or they are implemented on top of some sub-protocol which already provides flow-control (such as TCP/IP or X.25). If message transport is restricted to these two categories of circuit, flow-control becomes unnecessary.
GOALS

Bearing in mind the constraints listed above, these are the goals for the message transport protocol:
DESIGN

The most important elements in the preceding discussion are the lack of precise timing available to an implementation, the knowledge that it should be possible to improve throughput by dispensing with `flow-control' windows, the belief that receivers will usually be able to accept data from a circuit, and that errors will be the exception rather than the rule. The solution is a `window' at least as large as a message. This will allow a transmitter to send all the message data into a circuit without waiting for the receiver to respond. It will also minimise the amount of control data that must flow in the opposite direction. So it will be possible to send messages in both directions at once without impeding message data with control data.
Packets

In order to allow for error detection and correction, the transmitted message is broken into packets, of a size that must be adjustable to suit error rates and throughput.
The protocol uses packets to transmit message data and control information. Each packet has a header containing an address, a type, a channel, a direction, a size, and a CRC, and an optional data part, with an optional data CRC. The header CRC is present to help identify a packet header in the presence of errors. The data CRC is optional so that in order to minimise CPU overheads it can be dispensed with where the circuit is already sufficiently reliable.
The main packet type is `Message-Data' (DATA), the other types are for control, as described below.
The address in a DATA packet header is the address within the message of the first byte in the data. This effectively gives a protocol `window' size equal to the size of the message, and therefore permits infinite `write-ahead'. An infinite window size allows a circuit to re-order packets without loss, and allows out-of-order packet duplication to be detected. The address in non-DATA packets has other meanings. The size is always the size of the data part of a packet.
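A minimal sketch of such a header follows. The field widths, byte order, and CRC polynomial are assumptions for illustration (the paper does not give the exact layout); the channel occupies 2 bits and the direction 1 bit, as described in later sections.

```python
# Sketch of a packet header with the fields named in the text: address, type,
# channel (2 bits), direction (1 bit), size, and a header CRC.  Field widths
# and the CRC choice are assumptions, not the paper's actual wire format.
import struct
import zlib

HDR = struct.Struct(">IBBIH")  # address, type, channel|direction, size, CRC

def pack_header(address, ptype, channel, direction, size):
    flags = (channel & 0x3) | ((direction & 0x1) << 2)
    body = struct.pack(">IBBI", address, ptype, flags, size)
    crc16 = zlib.crc32(body) & 0xFFFF            # stand-in 16-bit header CRC
    return body + struct.pack(">H", crc16)

def unpack_header(raw):
    address, ptype, flags, size, crc16 = HDR.unpack(raw)
    if zlib.crc32(raw[:-2]) & 0xFFFF != crc16:
        return None                              # corrupt header: discard
    return address, ptype, flags & 0x3, (flags >> 2) & 0x1, size

hdr = pack_header(address=4096, ptype=1, channel=2, direction=0, size=512)
assert unpack_header(hdr) == (4096, 1, 2, 0, 512)
```

The header CRC lets a receiver resynchronise on packet boundaries after errors, since a header that fails its own CRC is simply skipped.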
Errors

Transmission errors cause packets to be lost. A cyclic redundancy check (CRC) is considered sufficient error detection, and since errors are not expected to dominate transmission, error correction may be achieved by re-transmission. As any CRC error causes the attached packet to be discarded, the main mechanism for detecting errors is for the packet receiver to notice and signal gaps in the expected ordered sequence. For DATA packets the sequence is the range address to address+size. Whenever the receiver notices a gap, it is immediately signaled back to the transmitter via a `Negative-Acknowledgement' packet (NAK), whose address is the start of the gap, and whose data contains the size of the gap. Re-ordered packets will generate unnecessary NAKs, but the subsequent duplications can be discarded.
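The gap-detection rule can be sketched as follows; the function name and return shape are illustrative, not the paper's code.

```python
# Receiver-side gap detection sketch: DATA packets carry a message address,
# and any jump past the expected address is signalled back at once as a NAK
# giving the start and size of the gap.  Names here are illustrative.

def detect_gap(expected_address, packet_address, packet_size):
    """Return (next_expected, nak), where nak is (gap_start, gap_size) or None."""
    if packet_address > expected_address:
        nak = (expected_address, packet_address - expected_address)
        return packet_address + packet_size, nak   # accept data, NAK the gap
    if packet_address < expected_address:
        return expected_address, None              # duplicate: discard quietly
    return expected_address + packet_size, None    # in-order: no action

exp, nak = detect_gap(0, 0, 512)        # in order: no NAK
assert (exp, nak) == (512, None)
exp, nak = detect_gap(512, 1024, 512)   # a packet was lost: NAK [512, 1024)
assert (exp, nak) == (1536, (512, 512))
```

As the text notes, a merely re-ordered packet triggers the same NAK; the harmless duplicate data it provokes is discarded on arrival.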
All non-DATA packet types have an expected response, and errors are handled by timeouts.
Circuit echo is detected by having an additional bit in the header to detect direction. This bit has a value determined by a lexical comparison of circuit endpoint addresses, which are known at circuit establishment time. Any packet arriving with a bit with the same value as transmitted packets is immediately discarded.
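A sketch of this rule, with hypothetical endpoint names:

```python
# Circuit-echo detection sketch: the direction bit's value is derived from a
# lexical comparison of the two endpoint addresses known at establishment
# time, so both ends agree on which value each side transmits.  An arriving
# packet bearing our own value must be an echo and is discarded.

def direction_bit(local_addr: str, remote_addr: str) -> int:
    """Each end transmits with its own bit value; the two ends always differ."""
    return 1 if local_addr < remote_addr else 0

def is_echo(received_bit: int, local_addr: str, remote_addr: str) -> bool:
    return received_bit == direction_bit(local_addr, remote_addr)

a, b = "siteA", "siteB"                             # hypothetical endpoints
assert direction_bit(a, b) != direction_bit(b, a)   # the ends never collide
```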
Circuit failure is detected either by a circuit failure signal if provided, or by the absence of any correctly received packets within a certain timeout period (quiescent circuits have keep-alive packets transmitted periodically).
Message Restart

The need for this capability requires that message transport status information be maintained. This information need only be maintained at the receiving end, and need only consist of a vector of `gaps' (initially set to the whole message). On re-establishment after a circuit failure, the transmitter can ask the receiver for the list of gaps, and proceed to transmit the data required to fill them. Messages are marked at each end of a link by a unique identifier, which is used to negotiate the restart parameters.
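Maintaining the gap vector amounts to interval subtraction: a sketch, with illustrative names, assuming gaps are kept sorted as (start, size) pairs.

```python
# Sketch of the receiver's restart status: a vector of (start, size) gaps,
# initially covering the whole message, with each arriving DATA packet
# carving its byte range out of whichever gaps it overlaps.

def init_gaps(message_size):
    return [(0, message_size)]          # one gap: the whole message

def fill(gaps, addr, size):
    """Remove the received range [addr, addr+size) from the gap vector."""
    out = []
    for g_start, g_size in gaps:
        g_end = g_start + g_size
        lo, hi = max(g_start, addr), min(g_end, addr + size)
        if lo >= hi:                    # no overlap: gap unchanged
            out.append((g_start, g_size))
            continue
        if g_start < lo:                # keep the unfilled head of the gap
            out.append((g_start, lo - g_start))
        if hi < g_end:                  # keep the unfilled tail of the gap
            out.append((hi, g_end - hi))
    return out

gaps = init_gaps(1000)
gaps = fill(gaps, 0, 200)       # first 200 bytes arrive
gaps = fill(gaps, 400, 100)     # an out-of-order packet splits a gap
assert gaps == [(200, 200), (500, 500)]
```

An empty vector means the whole message has been received, which is exactly the condition the receiver tests before acknowledging an EOM.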
Message Multiplexing

This capability allows more than one message to be `in flight' in the same direction at the same time. There are two reasons for allowing message multiplexing over the same circuit. The first is that messages have individual priorities, so that higher-priority messages may overtake lower ones. For this to be effective, there must be some easy way to stall a low-priority message in transmission, and allow a higher-priority message to monopolise the available bandwidth instead. The second reason is throughput. As each message transmission terminates, the message must be passed over to the routing process. While one message is stalled waiting for the termination to complete, a second message may use the idle bandwidth.
With these aims four messages are multiplexed at once in four separate message `channels'. This entails the addition of 2 bits to the packet header to define with which channel a packet is associated.
Two Processes

The protocol is implemented by two processes, one to handle input from the circuit, the other to handle output. These enable the reading and writing of a circuit to be de-coupled. There is minimal negotiation between the two processes: during message startup, termination, and during error correction. However this interface needs a very low bandwidth compared with message transmission (no message data flows over it).
In figure 2 below, two nodes, A and B, communicate over a full-duplex circuit. The circuit is shown divided into two groups of multiplexed message flows, one for each direction. Messages flowing, for example, from `transmitter' B to `receiver' A have protocol acknowledgements flowing in the opposite direction via `transmitter' A and `receiver' B (thicker arcs). The thinner arcs show the protocol acknowledgement flow from `receiver' B to `transmitter' A.
Transmitter. Assuming that circuits are mostly reliable, the absolute minimum reverse information that must flow to a transmitter process is a packet to say that a new message may be transmitted, and another to indicate the message has successfully completed transmission.
The transmitter sends one packet signifying a new message, and must then wait for the receiver to indicate it is prepared to accept the new message. This packet is called the `Start-Of-Message' packet (SOM), which is acknowledged by a `Start-Of-Message-Accept' (SOMA). As the receiver will usually accept a new message, the transmitter may also start transmitting a limited number of data packets ahead of reception of the SOMA packet, the number depending on measured latency, perhaps several seconds' worth. (This is also known as `write-ahead'.)
Data for packets is obtained by reading the contents of a message from the local file-system, and since the most efficient file-system block size will often be different from the DATA packet size, a buffer must be used between file-system and packet assembly.
The only major obstacle to throughput is the termination handshake, since the last packet sent must be the `End-Of-Message' packet (EOM), and the transmitter must then wait for its acknowledgement, the `End-Of-Message-Accept' packet (EOMA). This hiatus is another reason for allowing message multiplexing, so that the transmitter may continue sending data for a second message while awaiting reception of the EOMA for the first.
The transmitter assumes that any new message may be being `restarted' and hence the SOM packet contains a unique identifier (ID) for each message, and its `address' represents the size of the message. In the particular case where the circuit has just been initialised, local status information from any previous instantiation of the transmitter is examined to see if any current messages were in the process of being transmitted, and if so, `write-ahead' is disabled for those messages in the expectation of a successful `restart' negotiation.
Receiver. The receiver process attempts to be always ready to accept another message, and writes data received for a message as fast as possible to the file-system where it is held. This implies that attention must be paid to designing the fastest possible file-system interface, and to optimising the hand-over procedure that signals message reception to the message router. The receiver must also keep status information for a message during reception. This status information defines what data has been received, and what data is missing. Any errors in reception are signaled back to the remote transmitter with NAK packets, and the missing-data status is time-stamped to allow for a NAK being lost. The receiver periodically scans the status for missing-data information that has aged sufficiently to warrant re-transmission. On receiving an EOM packet, the receiver will refuse to generate an EOMA in response until all gaps have been filled. The EOM triggers a rescan of the missing-data status to re-transmit any timed-out NAKs. To allow for lost EOMA packets, the transmitter must periodically retransmit EOM packets until the EOMA has been received.
Packets to be transmitted to the remote transmitter must first be passed to the local transmitter for transmission over the circuit. This need is the only cause for possible I/O blocking, and this interface must be designed to avoid delays if possible. Packets received from the remote receiver process destined for the local transmitter must also be passed over this interface. For this reason there are in fact two interfaces between the two processes, one for packets destined to be passed from the receiver to the circuit for transmission, and the other for `local' packets destined for the transmitter, either generated locally by the receiver to control the transmitter, or generated by the remote receiver in response to message reception.
The receiver assumes that any new message may be being restarted, and examines the ID in a SOM packet to match it against the status information for any message currently being received, either from this instantiation, or from an earlier one. The SOMA returned in response will either have an address of `0', indicating a new message, or contain a non-zero `restart address'. Data before the restart address need not be re-transmitted. This simple scheme does not allow for `gaps' in the received data, but instead can only avoid re-transmitting data up to the first `gap'. A more complex handshake to exchange the `gap' vector has not yet been implemented.
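Under this scheme the restart address computed from the gap vector is simply the start of the first gap; data received beyond later gaps is re-transmitted anyway. A sketch, assuming the vector is kept sorted:

```python
# Restart-address sketch: the SOMA carries a single restart address, so
# recovery resumes at the start of the first gap.  A brand-new message has
# one gap covering everything, so its restart address is naturally 0.

def restart_address(gaps):
    """gaps: sorted, non-empty list of (start, size) pairs of missing data."""
    return gaps[0][0]

assert restart_address([(0, 1000)]) == 0      # new message: start from scratch
assert restart_address([(200, 300)]) == 200   # resume at the first gap
```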
The size of the message indicated in a SOM packet address is also used by the receiver to prevent space exhaustion. If the file-system doesn't contain enough space to hold the entire message, then the SOMA is withheld until the space is freed up.
Control

Apart from the six packet types mentioned above, there are two other channel-independent packet types, handling initial circuit synchronisation, circuit termination, and half-duplex transmission turn-around. The circuit termination packet type has only one function, but the other packet type, CTRL, handles all remaining control functions by using the channel number as a `sub-type'.
Synchronisation. This version of the CTRL packet is known as SYNC. These packets have the name of the source in the data part, and are transmitted periodically by the transmitter immediately after circuit establishment until the transmitter sees a local SYNC packet from the local receiver, after which it marks the circuit active and proceeds to transmit messages. On reception, a receiver determines if the SYNC packet source is the same as the remote end, and if so echoes it back to the remote receiver, after marking the other end as active. When a receiver sees one of its own SYNC packets echoed back, it marks itself as active, and discards it. When both ends are marked active, the receiver sends a SYNC packet to the local transmitter, and message transmission can begin.
The SYNC packets are also used to validate the link, in that the source address must either match the local address, or match the expected remote address.
Turn-around. This version of the CTRL packet is known as `Message-Queue-Empty', or MQE. When the protocol is operating over a half-duplex circuit, it transmits messages alternately, first in one direction and then in the other. For this purpose only the transmitter holding the `transmit token' may transmit message data. A transmitter prepared to relinquish the `transmit token' generates an MQE to pass the token to the remote transmitter, but no more frequently than a certain period. Any transmitter receiving an MQE may opt to start transmitting messages, until a suitable moment to relinquish the token occurs. To cater for lost MQE packets, the transmitter not currently holding the token periodically retransmits an MQE.
In full-duplex mode, an MQE packet merely signals that there are no messages to send.
Keep-alive. Otherwise known as IDLE packets, these control packets are transmitted by the receiver to the remote transmitter periodically in order to allow the receiver to recognise that the circuit is still alive. They also contain the recent received-throughput measurement to allow the remote transmitter to adjust packet sizes appropriately.
Termination. This packet type (EOT) shuts down the circuit when received, and is transmitted when either end decides to terminate. This is most useful when the message transport system is being operated in `batch' mode to transport a queue of messages and then terminate after the queues at both ends have been emptied. Termination occurs when either end has both received an MQE from the remote transmitter, and its own queue is empty.
IMPLEMENTATION

The implementation so far has been made only on UNIX platforms. The message transport system is encapsulated in one `daemon' program, which forks on startup to provide the receiver and transmitter processes. The program is invoked with one file descriptor set up for a pre-established circuit to the remote site. The receiver only ever reads from this file, whereas the transmitter only ever writes to it. (Virtual circuit establishment is carried out by a separate program, which then invokes the transport daemon.)
Inter-process Communication

Two `pipes' are used to carry the two types of communication between the processes: one for packets from the receiver to be processed by the transmitter, and one for packets from the receiver to be transmitted on the circuit. The transmitter makes use of the operating system fstat system call to determine when there is data available to be read from the pipes. The transmitter polls the pipes in between transmitting message data, but in order to prevent busy loops while waiting for data to appear when there are no messages to transmit, the receiver also signals the transmitter whenever packets are written to the pipes (which allows the transmitter to sleep when idle). The signals are sent via the kill system call.
Buffer Management

A buffering scheme is used between the message data in the disk file, and the packets on the circuit. Files are accessed using the buffer size natural to the file-system, from which packets are assembled/disassembled as appropriate. (Error correction causes buffer re-alignment as necessary.) For high-speed circuits, it is important to align the data in packets so that data-copying does not occur on odd boundaries, which tends to be much slower. If packet sizes are larger than the natural file-system size, then buffering is disabled.
Parameterisation

Various parameters are passed to the transport daemon at circuit initialisation time. These inform it of the expected initial throughput, whether the circuit is half-duplex (and which direction owns the `transmit token'), and whether the circuit is sufficiently reliable to dispense with the DATA CRC. The DATA CRC used depends on the size of the data: packet sizes greater than 1024 bytes use a 32-bit CRC, smaller ones a 16-bit CRC.
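The size-dependent CRC choice can be sketched as follows. The paper does not name the polynomials used; zlib's CRC-32 and CRC-16/CCITT stand in here as illustrative assumptions.

```python
# Sketch of the data-CRC choice described above: a 32-bit CRC for packets
# larger than 1024 bytes, a 16-bit one otherwise.  zlib.crc32 stands in for
# the 32-bit check; the 16-bit variant here is CRC-16/CCITT, an assumption.
import zlib

def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT (poly 0x1021, MSB first, initial value 0xFFFF)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def data_crc(payload: bytes, packet_size: int) -> int:
    if packet_size > 1024:
        return zlib.crc32(payload) & 0xFFFFFFFF   # 32-bit CRC for big packets
    return crc16_ccitt(payload)                   # 16-bit CRC for small ones

assert data_crc(b"x" * 128, 128) <= 0xFFFF        # small packet: 16-bit range
```

The rationale is that a 16-bit check is cheaper to compute but its undetected-error probability grows with packet size, so larger packets warrant the wider CRC.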
OTHER MESSAGE TRANSPORT SYSTEMS

Other systems also provide message transport.
FTP

`Ftp' is a TCP/IP-specific file transfer protocol that also allows limited user interaction [Postel 85]. It depends on two circuits provided by an underlying TCP/IP network, one for control, and a second on which to transfer data. Data can be transferred one message at a time, in one direction at a time, and is essentially just copied over the circuit with no error detection other than that provided by TCP. Apart from implementation constraints, it ought to be optimally fast at uni-directional data transfer. If a circuit breaks during message transfer, the transfer must be restarted from the beginning: although the protocol defines a restart procedure, no implementation known to the author provides it.
UUCP

UUCP is the original message transport system that encouraged the widespread adoption of inter-site networking in the UNIX world. There are several transport protocols defined for use with UUCP, which have been adapted over time to cope with changing circuit providers. The standard one uses a small windowed-packet scheme [Chesson 79], with error detection and correction by retransmission (after a 25-second timeout). (At least one modem manufacturer has adapted their modems to intercept and emulate the UUCP flow-control packets so that throughput is not blocked by windowing.) Other packet protocols exist to take advantage of error-corrected circuits such as X.25 and TCP/IP. However, all circuits are used in a half-duplex fashion (for uni-directional message transfer), with no message restart after circuit failure.
Unreliable Circuit Throughput

Tests were run transferring messages in both directions at once over various modem circuits. The messages were ~100 Kbytes of compressed data each, so that message control packets contributed a negligible amount to the overheads (85 bytes/message). Packet sizes are chosen by the transport daemon so that a packet occupies the circuit for ~1 second under good conditions. (Packet sizes are reduced if throughput drops due to errors.) Packet headers are 11 bytes. Each test lasted more than 10 minutes.
TCP/IP Throughput

Tests were run transferring a 9492560-byte file three times between two MIPS 3030 CPUs connected via two IP bridges, using both the transport daemon (with data CRC disabled) and the `ftp' program (in `binary' mode). The `ftp' times are those it reports at the end of a transfer. The transport daemon times are those reported by the router process on reception, which include the time between the source router process queuing the message for transmission and the destination router process queuing the message for reception.
These results are surprising, since the transport daemon encapsulates data in 8192-byte packets (with 9-byte headers), whereas `ftp' just copies data over a data `socket'. The transport daemon also reads the message data and header from separate files. So it would be reasonable to assume that the transport daemon should be at best marginally slower than `ftp'.
Message Restart

A more complex protocol to exchange the `gap' vector could be implemented for complete recovery of partially-transmitted messages after circuit failure.
Flow Control

It would be useful if this message transport system could make use of circuits that do require flow-control (such as UDP/IP, and GIGABIT links with no transport protocol). An investigation of whether a flow-control scheme can be layered on top of the transport protocol is in progress.
Throughput Ceilings

As this transport system maximises throughput, it can dominate other traffic using circuits established over the same carrier. While the issue of `circuit priority' ought to be addressed by the circuit provider, it very often isn't. (For instance, provision exists in IP networks for different packet priorities, but this isn't implemented on any IP network available to the author.) It would therefore be useful to be able to put ceilings on throughput, and this should be achievable as a side-effect of the work on flow-control.
SUMMARY

The protocol described will transport messages between two nodes while extracting maximum throughput from the available circuit bandwidth. It also detects and corrects errors, allows message restart after circuit failure and re-establishment, allows higher-priority messages to overtake lower ones, supports uni- and bi-directional message transfer, and adapts to changing circuit conditions.
REFERENCES

[Postel 85] J. Postel and J. Reynolds, `File Transfer Protocol', RFC 959, DARPA, October 1985.

[Chesson 79] G.L. Chesson, `Packet Driver Protocol', AT&T Bell Laboratories, 1979.