Journal of Information Science and Engineering, Vol.19 No.3, pp.503-516 (May 2003)


Robust TCP Connections for Fault Tolerant Computing*

Richard Ekwall, Peter Urban and Andre Schiper
Ecole Polytechnique Federale de Lausanne (EPFL)
School of Computer and Communication Sciences
Distributed Systems Laboratory
CH-1015 Lausanne, Switzerland
E-mail: {nilsrichard.ekwall, peter.urban, andre.schiper}@epfl.ch

When processes on two different machines communicate, they most often do so using TCP. While TCP is appropriate for a wide range of applications, it has shortcomings in some application areas, one of which is fault-tolerant distributed computing. For some applications in this area, TCP does not handle link failures adequately: it breaks the connection if connectivity is lost for some duration (typically minutes), which is sometimes undesirable. This paper proposes robust TCP connections as a solution to the problem of broken TCP connections. It presents a session layer protocol on top of TCP that ensures reconnection and provides exactly-once delivery for all transmitted data. A prototype has been implemented as a Java library. The prototype incurs less than 10% overhead relative to TCP sockets on the most important performance figures.
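The abstract does not spell out how reconnection and exactly-once delivery are achieved, so the following is only a minimal sketch of one common way to get that behaviour: the sender numbers every message and buffers it until it is acknowledged, and after a reconnection the receiver reports the next sequence number it expects so that the sender can retransmit the rest. All names here (RobustSessionSketch, Frame, Sender, Receiver, resumeFrom) are hypothetical and are not taken from the authors' library. The sketch simulates the exchange in memory and opens no sockets.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: a generic sequence-number scheme, not the paper's protocol.
public class RobustSessionSketch {

    /** A payload tagged with a session-level sequence number (assumed framing). */
    static final class Frame {
        final long seq;
        final String payload;

        Frame(long seq, String payload) {
            this.seq = seq;
            this.payload = payload;
        }
    }

    /** Sender side: keeps every frame until the peer confirms it. */
    static final class Sender {
        private long nextSeq = 0;
        private final Deque<Frame> unacked = new ArrayDeque<>();

        Frame send(String payload) {
            Frame frame = new Frame(nextSeq++, payload);
            unacked.addLast(frame); // buffered in case the connection breaks
            return frame;
        }

        /** Discards frames the peer has already delivered (seq < nextExpected). */
        void acknowledge(long nextExpected) {
            while (!unacked.isEmpty() && unacked.peekFirst().seq < nextExpected) {
                unacked.removeFirst();
            }
        }

        /** Called after reconnecting: returns the frames that must be resent. */
        List<Frame> resumeFrom(long nextExpected) {
            acknowledge(nextExpected);
            return new ArrayList<>(unacked);
        }
    }

    /** Receiver side: delivers each sequence number exactly once, in order. */
    static final class Receiver {
        private long nextExpected = 0;
        private final List<String> delivered = new ArrayList<>();

        void receive(Frame frame) {
            if (frame.seq != nextExpected) {
                return; // duplicate or out-of-order frame: ignore it
            }
            delivered.add(frame.payload);
            nextExpected++;
        }

        long nextExpected() {
            return nextExpected;
        }

        List<String> delivered() {
            return delivered;
        }
    }

    public static void main(String[] args) {
        Sender sender = new Sender();
        Receiver receiver = new Receiver();

        Frame a = sender.send("a");
        Frame b = sender.send("b");
        sender.send("c"); // buffered, but lost when the connection breaks

        receiver.receive(a);
        receiver.receive(b);

        // Simulated reconnection: the receiver reports what it expects next,
        // and the sender retransmits everything from that point on.
        for (Frame frame : sender.resumeFrom(receiver.nextExpected())) {
            receiver.receive(frame);
        }

        receiver.receive(b); // a late duplicate is silently dropped

        System.out.println(receiver.delivered()); // prints [a, b, c]
    }
}
```

In this sketch the receiver's in-order check relies on the underlying TCP stream preserving order within a single connection, so only duplicates across reconnections need to be filtered. A real implementation would also have to handle acknowledgements during normal operation, buffer limits, and how the two endpoints find each other again after a failure.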

Keywords: session layer protocol, TCP, performance, fault-tolerant distributed computing, quasi-reliable channels, Java

Received May 15, 2002; accepted July 25, 2002.
Communicated by Biing-Feng Wang, Stephan Olariu and Gen-Huey Chen.
*Research supported by a grant from the CSEM Swiss Center for Electronics and Microtechnology, Inc., Neuchatel and by OFES under contract number 01.0537-1 as part of the IST REMUNE project (number 65002). A preliminary version of the paper was presented at the 2002 International Conference on Parallel and Distributed Systems, Chungli, Taiwan.