Networking Forums

TCP Keepalives Problem on Linux

 
 
olssons@gmail.com
Guest
Posts: n/a

 
      05-22-2006, 10:19 PM
Hello,
I am having a problem with TCP Keepalives on Linux. The same tests
below work as expected on Tru64 (4.0F).

I have 2 hosts, Host-A and Host-B that are connected via TCP
connection. Host-A is the client and Host-B is the server.

First, here are my /etc/sysctl.conf parameters for TCP Keepalives:
net.ipv4.tcp_keepalive_intvl = 3
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 10

My expectation is this:
When no data is received over a connection for 10 seconds, the TCP
stack will send a probe to the other end of the connection asking for
an ACK of the last segment sent. In 3 seconds, if nothing is received
then a second probe is sent. If nothing is received in 3 more seconds,
the connection is closed by the TCP stack. Linux select() will notify
of "read" activity for the connection's file descriptor and attempting
to read it returns ETIMEDOUT.
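
As a rough illustration of the timeline expected above, the three sysctl values can be turned into probe and drop times. This is only a sketch of the arithmetic; the function name keepalive_timeline is made up for the example, and it assumes the idle timer starts at the last activity on the connection.

```python
def keepalive_timeline(idle_s, interval_s, probes):
    """Return (probe_times, drop_time) in seconds after the last activity.

    idle_s      ~ net.ipv4.tcp_keepalive_time
    interval_s  ~ net.ipv4.tcp_keepalive_intvl
    probes      ~ net.ipv4.tcp_keepalive_probes
    """
    # First probe after the idle period, then one probe per interval.
    probe_times = [idle_s + i * interval_s for i in range(probes)]
    # The connection is given up one interval after the last unanswered probe.
    drop_time = idle_s + probes * interval_s
    return probe_times, drop_time


probe_times, drop_time = keepalive_timeline(idle_s=10, interval_s=3, probes=2)
print(probe_times, drop_time)  # [10, 13] 16
```

With the settings listed above, that gives probes at 10 and 13 seconds and a drop at 16 seconds of silence.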

What I have found is this:
If, while the 2 connected hosts are physically disconnected (i.e. a
cable pull), the client (Host-A) attempts to send *any* data on the
connection (even 1 byte), TCP keepalives stop functioning and will
never cause the connection to drop... even if no more data is ever
sent or received on the connection.

Both hosts run the same version of Linux (Red Hat ES3).
Here is uname -a for one of them:
Linux o1cwp05 2.4.21-32.ELsmp #1 SMP Fri Apr 15 21:17:59 EDT 2005 i686 i686 i386 GNU/Linux

This "example 1" does NOT work as expected:
HOST-A and HOST-B have an established TCP connection
HOST-B is disconnected from the network (unplug ethernet cable for
example)
HOST-A attempts to send some data on the connection
TCP Keepalives never kill the connection even if no data is sent or
received ever again

This "example 2" does work as expected:
HOST-A and HOST-B have an established TCP connection
HOST-B is disconnected from the network (unplug ethernet cable for
example)
HOST-A DOES NOT attempt to send any data on the connection
After at most 16 seconds, the connection is closed and ETIMEDOUT error
occurs as expected

To recreate this problem in my lab, I used W. Richard Stevens' "sock"
program provided on his website
http://www.kohala.com/start/unpv12e.html (see the link for the source
code halfway down).

On Host-B, I ran a TCP server on port 4999 (sock -K -s 10.10.60.25 4999)
On Host-A, I connected to the server on Host-B (sock -K 10.10.60.25 4999)

In the data below, o1fdp01 is Host-A and o1cwp05 is Host-B

Here is the tcpdump output of the run where TCP keepalives never
worked:
tcpdump -vvv port 4999
tcpdump: listening on eth0
20:40:04.325315 o1fdp01.59689 > o1cwp05.4999: S [tcp sum ok] 3927728606:3927728606(0) win 5840 <mss 1460,sackOK,timestamp 10860835 0,nop,wscale 0> (DF) (ttl 64, id 55549, len 60)
20:40:04.325402 o1cwp05.4999 > o1fdp01.59689: S [tcp sum ok] 2921424471:2921424471(0) ack 3927728607 win 5792 <mss 1460,sackOK,timestamp 1354048 10860835,nop,wscale 0> (DF) (ttl 64, id 0, len 60)
20:40:04.325415 o1fdp01.59689 > o1cwp05.4999: . [tcp sum ok] 1:1(0) ack 1 win 5840 <nop,nop,timestamp 10860835 1354048> (DF) (ttl 64, id 55550, len 52)
20:40:14.322837 o1fdp01.59689 > o1cwp05.4999: . [tcp sum ok] 0:0(0) ack 1 win 5840 <nop,nop,timestamp 10861835 1354048> (DF) (ttl 64, id 55551, len 52)
20:40:14.322918 o1cwp05.4999 > o1fdp01.59689: . [tcp sum ok] 1:1(0) ack 1 win 5792 <nop,nop,timestamp 1355047 10860835> (DF) (ttl 64, id 53016, len 52)
20:40:14.323282 o1cwp05.4999 > o1fdp01.59689: . [tcp sum ok] 0:0(0) ack 1 win 5792 <nop,nop,timestamp 1355048 10860835> (DF) (ttl 64, id 53017, len 52)
20:40:14.323287 o1fdp01.59689 > o1cwp05.4999: . [tcp sum ok] 1:1(0) ack 1 win 5840 <nop,nop,timestamp 10861835 1355047> (DF) (ttl 64, id 55552, len 52)
20:40:24.323263 o1fdp01.59689 > o1cwp05.4999: . [tcp sum ok] 0:0(0) ack 1 win 5840 <nop,nop,timestamp 10862835 1355047> (DF) (ttl 64, id 55553, len 52)
20:40:25.785665 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum 4770!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10862981 1355047> (DF) (ttl 64, id 55554, len 53)
20:40:25.993337 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum 3270!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10863002 1355047> (DF) (ttl 64, id 55555, len 53)
20:40:26.413378 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum 870!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10863044 1355047> (DF) (ttl 64, id 55556, len 53)
20:40:27.253387 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum b46f!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10863128 1355047> (DF) (ttl 64, id 55557, len 53)
20:40:28.933463 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum c6f!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10863296 1355047> (DF) (ttl 64, id 55558, len 53)
20:40:32.293611 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum bc6d!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10863632 1355047> (DF) (ttl 64, id 55559, len 53)
20:40:39.013893 o1fdp01.59689 > o1cwp05.4999: P [bad tcp cksum 1c6b!] 1:2(1) ack 1 win 5840 <nop,nop,timestamp 10864304 1355047> (DF) (ttl 64, id 55560, len 53)

...... after this, it stops sending any data at all, but the connection
is not closed as expected

The case below works and exhibits textbook behavior (note last entry R
closing connection)
tcpdump -vvv port 4999
tcpdump: listening on eth0
20:43:01.729986 o1fdp01.59837 > o1cwp05.4999: . [tcp sum ok] 4122167661:4122167661(0) ack 3110189858 win 5840 <nop,nop,timestamp 10878575 1370789> (DF) (ttl 64, id 46060, len 52)
20:43:01.730062 o1cwp05.4999 > o1fdp01.59837: . [tcp sum ok] 1:1(0) ack 1 win 5792 <nop,nop,timestamp 1371788 10877575> (DF) (ttl 64, id 20871, len 52)
20:43:01.738698 o1cwp05.4999 > o1fdp01.59837: . [tcp sum ok] 0:0(0) ack 1 win 5792 <nop,nop,timestamp 1371789 10877575> (DF) (ttl 64, id 20872, len 52)
20:43:01.738703 o1fdp01.59837 > o1cwp05.4999: . [tcp sum ok] 1:1(0) ack 1 win 5840 <nop,nop,timestamp 10878575 1371788> (DF) (ttl 64, id 46061, len 52)
20:43:11.730419 o1fdp01.59837 > o1cwp05.4999: . [tcp sum ok] 0:0(0) ack 1 win 5840 <nop,nop,timestamp 10879575 1371788> (DF) (ttl 64, id 46062, len 52)
20:43:14.730542 o1fdp01.59837 > o1cwp05.4999: . [tcp sum ok] 0:0(0) ack 1 win 5840 <nop,nop,timestamp 10879875 1371788> (DF) (ttl 64, id 46063, len 52)
20:43:17.730668 o1fdp01.59837 > o1cwp05.4999: R [tcp sum ok] 1:1(0) ack 1 win 5840 <nop,nop,timestamp 10880175 1371788> (DF) (ttl 64, id 46064, len 52)

I appreciate any insight anyone can give. I'm hoping there is just a
TCP option I'm missing or something.

Thanks,
Sten

 
 
 
 
 
Phil Frisbie, Jr.
Guest
Posts: n/a

 
      05-22-2006, 10:45 PM
(E-Mail Removed) wrote:

> Hello,
> I am having a problem with TCP Keepalives on Linux. The same tests
> below works as expected on Tru64 (4.0F).

<snip>
> I appreciate any insight anyone can give. I'm hoping there is just a
> TCP option I'm missing or something.


First let me ask why are you (mis)using TCP Keepalives this way?

There is a reason the default time-out is set to 7200 seconds: TCP Keepalives
were deprecated even before the TCP protocol was adopted! Even though it was
realized they would not be useful, they were not removed but rather effectively
disabled by the long time-out.

If you need to quickly detect a lost connection then the place to do it is to
add a keepalive message to your own application protocol. The changes you are
making to your system to misuse TCP Keepalives affect ALL TCP connections.

--
Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com
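
If the concern is that the sysctl values apply system-wide, Linux also lets the same three values be set on a single socket. The sketch below assumes a Linux host where Python exposes TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; the helper name enable_keepalive is hypothetical, and options that the platform does not expose are simply skipped.

```python
import socket


def enable_keepalive(sock, idle_s=10, interval_s=3, probes=2):
    """Turn on keepalives for one socket and, where supported, tune them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names are Linux names; other platforms may lack some of
    # them, so each one is applied only if the socket module exposes it.
    per_socket = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # time between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in per_socket:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Set this way, the values touch only the socket they are applied to and leave the system defaults alone.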
 
 
Rick Jones
Guest
Posts: n/a

 
      05-22-2006, 10:47 PM
(E-Mail Removed) wrote:
> What I have found is this:


> If during the period when 2 connection hosts are physically
> disconnected (ie cable pull), the client (Host-A) attempts to send
> *any* data on the connection (even 1 byte), TCP keepalives stop
> functioning and will never cause the connection to drop... even if
> no more data is ever sent or received on the connection.


By definition, from the perspective of HostA when that single byte is
sent and remains unACKed, the connection remains "active" rather than
idle so the keepalives should not start. The connection will be
timed-out based on the "normal" retransmission mechanisms.

Only when/if there is no outstanding data on a connection should a
keepalive timer fire.

rick jones
--
The glass is neither half-empty nor half-full. The glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
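
The retransmission behaviour described above can be checked against the failing capture in the first post. The sketch below takes the timestamps of the seven 1-byte segments (seq 1:2) from that tcpdump output and computes the gaps between them. Only the timestamps come from the thread; the helper name gaps is made up for the example.

```python
from datetime import datetime

# Timestamps of the seven 1-byte segments (seq 1:2) in the failing capture.
SEND_TIMES = [
    "20:40:25.785665", "20:40:25.993337", "20:40:26.413378",
    "20:40:27.253387", "20:40:28.933463", "20:40:32.293611",
    "20:40:39.013893",
]


def gaps(stamps):
    """Return the time in seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


intervals = gaps(SEND_TIMES)
# Ratio of each gap to the previous one; values near 2 indicate doubling.
ratios = [later / earlier for earlier, later in zip(intervals, intervals[1:])]

print([round(g, 2) for g in intervals])  # [0.21, 0.42, 0.84, 1.68, 3.36, 6.72]
print([round(r, 2) for r in ratios])
```

Each gap is about twice the previous one, which is the pattern of an exponentially backed-off retransmission timer rather than a keepalive probe schedule. If the doubling continued, the next retransmission would be expected roughly 13.4 seconds after the last packet shown, so a capture stopped shortly after 20:40:39 could look as though the host had gone quiet.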
 
 
olssons@gmail.com
Guest
Posts: n/a

 
      05-23-2006, 02:09 AM
Rick Jones wrote:
> By definition, from the perspective of HostA when that single byte is
> sent and remains unACKed, the connection remains "active" rather than
> idle so the keepalives should not start. The connection will be
> timed-out based on the "normal" retransmission mechanisms.
>
> Only when/if there is no outstanding data on a connection should a
> keepalive timer fire.


Rick,
Thanks for this information. It is useful for understanding what Linux
is doing under the covers.

It seems I will need to write some additional code. Fortunately the
existing application protocol I am using has a message that requires a
response. I can time out that response by writing some extra code and
fix this specific problem. Unfortunately there are lots of other
protocols that are also broken given this information.

The code I'm working with was originally developed on Tru64 and Tru64
doesn't implement their keepalive processing in the same way as Linux.
If data is not received on Tru64, the connection is considered idle and
keepalives begin. If data is not received after a given number of
keepalive probes, the connection is closed. This seems to make sense
and I wonder what it hurts for Linux to do something similar... unless
there are legitimate cases where this logic fails.

It seems like keepalive processing for Linux should be for data
received from the perspective of a host. If data is never received,
the connection should be closed after the timeout periods (assuming
keepalive is set and probes have been sent). Otherwise a host can
inadvertently prohibit keepalive processing from taking place by
attempting to send data to the other host (the one that isn't
responding).
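
For reference, later Linux kernels (2.6.37 and newer, so not the 2.4.21 kernel discussed in this thread) provide a per-socket TCP_USER_TIMEOUT option that bounds how long transmitted data may remain unACKed before the connection is dropped. The sketch below assumes a Python build that exposes socket.TCP_USER_TIMEOUT; the helper name set_user_timeout is hypothetical, and the 16000 ms value simply mirrors the 16 seconds used in the tests above.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Ask the kernel to give up if sent data stays unACKed for timeout_ms.

    Returns False when the option is not available on this platform.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return True


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
supported = set_user_timeout(sock, 16000)
print("TCP_USER_TIMEOUT applied:", supported)
sock.close()
```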

Thanks again,
Sten

 
 
olssons@gmail.com
Guest
Posts: n/a

 
      05-23-2006, 02:21 AM
Phil Frisbie, Jr. wrote:
> (E-Mail Removed) wrote:
>
> > Hello,
> > I am having a problem with TCP Keepalives on Linux. The same tests
> > below works as expected on Tru64 (4.0F).

> <snip>
> > I appreciate any insight anyone can give. I'm hoping there is just a
> > TCP option I'm missing or something.

>
> First let me ask why are you (mis)using TCP Keepalives this way?

Well, I can think of a few offhand... plus the standard one, "it was
that way before I got here"

Here are some other reasons I can think of.

It is a system-wide way of doing keepalives for TCP connections.
Otherwise each protocol needs a separate message to handle this
(assuming they care if one end goes down). Instead of having one place
to write the common code, you need to have multiple places. This means
more chance for bugs and problems overall.... plus more development
effort.

Another thing to think about is that some protocols can not be easily
modified (customer provided). Since you don't have the source and/or
can't change the other end of the connection, it is difficult to
implement an additional keepalive message.

However, as has been pointed out, Linux apparently doesn't support
what I'm trying to do via the keepalive mechanism, so it is kind of a
moot point.

Anyway, thanks a lot for your time. I appreciate it.

Thanks,
Sten

 
 
Andrei Korostelev
Guest
Posts: n/a

 
      06-01-2006, 09:43 PM
> I appreciate any insight anyone can give. I'm hoping there is just a
> TCP option I'm missing or something.


Additionally, please have a look at the Keep-Alives specification in
section 4.2.3.6 of RFC 1122 at http://rfc.net/rfc1122.html#p101.
Note that RFC 1122 does not require TCP implementations to support
keep-alives.

 
 
Rick Jones
Guest
Posts: n/a

 
      06-01-2006, 11:37 PM
(E-Mail Removed) wrote:
> The code I'm working with was originally developed on Tru64 and Tru64
> doesn't implement their keepalive processing in the same way as Linux.


Not that Linux is always right, but arguably, Tru64 may be incorrect
in how it implements keepalive - certainly if it starts to send
keepalive probes while there is unACKed data in that direction on the
connection.

> If data is not received on Tru64, the connection is considered idle and
> keepalives begin.


Now that sounds different from what was described previously, at least
on the receiving side - if the receiving side does not have data
outstanding, then it is indeed "idle" and the keepalive timer should
expire at some suitable point.

> If data is not received after a given number of keepalive probes,
> the connection is closed. This seems to make sense and I wonder
> what it hurts for Linux to do something similar... unless there are
> legitimate cases where this logic fails.


> It seems like keepalive processing for Linux should be for data
> received from the perspective of a host. If data is never received,
> the connection should be closed after the timeout periods (assuming
> keepalive is set and probes have been sent). Otherwise a host can
> inadvertently prohibit keepalive processing from taking place by
> attempting to send data to the other host (the one that isn't
> responding).


Indeed, if no data is received, and keepalives were enabled, I would
expect that the connection would "die" from keepalive timeout. Again,
that is different from whether or not keepalives should start on a
system that is in the midst of actively retransmitting data.

rick jones
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
 