Networking Forums

tcp close problems under heavy load on 2.6.18

Jean-Francois Smigielski
03-06-2007, 09:18 PM
Hi,

I ran into a problem with TCP connections on Linux 2.6.18 (on both clients
and servers).

I have an echo service that runs as 1, 2, 3 or 4
processes listening on the same IP/port. This service accepts tens of
thousands of simultaneous connections. Each client process opens thousands
of connections to the service, writes some data, reads the answer, waits,
closes, and then opens, writes, ...
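A minimal C sketch of one client cycle as described (open, write, read the answer, close), for illustration only. It assumes blocking sockets and a throwaway single-connection echo server on 127.0.0.1 instead of the real multi-process service; make_listener, serve_one and client_cycle are hypothetical names. The real clients use non-blocking sockets and keep thousands of connections open, which this does not try to reproduce.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: listening socket on 127.0.0.1 with a kernel-chosen
   port. On success the actual bound address is stored in *addr. */
static int make_listener(struct sockaddr_in *addr)
{
    socklen_t alen = sizeof *addr;
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;                          /* 0: let the kernel pick */

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    if (bind(lfd, (struct sockaddr *)addr, sizeof *addr) < 0 ||
        listen(lfd, 16) < 0 ||
        getsockname(lfd, (struct sockaddr *)addr, &alen) < 0) {
        close(lfd);
        return -1;
    }
    return lfd;
}

/* Hypothetical stand-in for the echo service: serve one connection. */
static void serve_one(int lfd)
{
    char tmp[256];
    int cfd = accept(lfd, NULL, NULL);
    if (cfd < 0)
        return;
    ssize_t n = read(cfd, tmp, sizeof tmp);
    if (n > 0) {
        ssize_t w = write(cfd, tmp, (size_t)n);  /* echo it back */
        (void)w;
    }
    close(cfd);
}

/* One client cycle: open, write, read the answer, close.
   The real client repeats this in a loop over thousands of sockets. */
static ssize_t client_cycle(const struct sockaddr_in *addr, const char *msg,
                            char *buf, size_t buflen)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    ssize_t n = -1;
    size_t len = strlen(msg);
    if (connect(fd, (const struct sockaddr *)addr, sizeof *addr) == 0 &&
        write(fd, msg, len) == (ssize_t)len)
        n = read(fd, buf, buflen);               /* short reply: one read */
    close(fd);
    return n;
}
```

To try it, fork a child that calls serve_one(lfd) and have the parent call client_cycle(&addr, "ping", buf, sizeof buf - 1); the echoed "ping" should come back.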

Both client and server sockets are non-blocking and use the SO_LINGER
option to avoid leaving a lot of sockets in the TIME_WAIT state. I started
with a linger timeout of 0.
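For reference, a zero linger timeout is set with the standard struct linger and setsockopt(SO_LINGER). The sketch below (set_abortive_close and demo_abortive_close are hypothetical names) also shows the consequence on loopback, assuming Linux and blocking sockets: the closing side sends a RST instead of a FIN, and the peer's recv() fails with ECONNRESET.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* SO_LINGER on with a zero timeout: close() discards unsent data and
   sends a RST instead of a FIN, so this side never enters TIME_WAIT. */
static int set_abortive_close(int fd)
{
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}

/* Loopback demo: the client closes abortively, then report the errno the
   server side gets from recv(). Returns -1 if setup fails. */
static int demo_abortive_close(void)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct pollfd p;
    char c;
    int sfd = -1, result = -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0 ||
        bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        (sfd = accept(lfd, NULL, NULL)) < 0 ||
        set_abortive_close(cfd) < 0)
        goto out;

    close(cfd);                      /* abortive close: a RST goes out */
    cfd = -1;

    p.fd = sfd;
    p.events = POLLIN;
    p.revents = 0;
    if (poll(&p, 1, 1000) > 0 && recv(sfd, &c, 1, 0) < 0)
        result = errno;              /* expected: ECONNRESET */
    else
        result = 0;                  /* no error reported */
out:
    if (cfd >= 0) close(cfd);
    if (sfd >= 0) close(sfd);
    if (lfd >= 0) close(lfd);
    return result;
}
```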

If I kill the client processes on a host, thus killing thousands of
connections at once, I should observe many TCP RST-flagged packets, at
least one for every socket. But only some of those packets are sent: RSTs
go out for only half of the original number of sockets. This happens with
more than 4,000 client sockets.

If I increase the linger delay, I observe that some of the connections
terminate with a FIN-flagged TCP packet (those that close within the
timeout), others (those that hit the timeout) with a RST-flagged packet,
and a smaller fraction of the sockets still send no close packet at all.
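A sketch of the non-zero case, assuming Linux and a blocking socket: set_linger_secs (hypothetical) enables SO_LINGER with a timeout in seconds and reads back what the kernel stored, and demo_linger_close (also hypothetical) shows that when the FIN is acknowledged within the timeout the peer simply sees end-of-file. On a blocking socket, close() can wait up to that many seconds for the close to complete. The sketch does not reproduce the timeout-expired or nothing-sent cases described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn on SO_LINGER with a timeout of secs seconds and read back what the
   kernel stored. Returns the stored timeout, or -1 on error. */
static int set_linger_secs(int fd, int secs)
{
    struct linger lg = { .l_onoff = 1, .l_linger = secs };
    struct linger got;
    socklen_t len = sizeof got;
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_LINGER, &got, &len) < 0)
        return -1;
    return got.l_onoff ? got.l_linger : -1;
}

/* Loopback demo: close() with a positive linger while the peer is alive.
   The close completes within the timeout, so the peer's recv() reports
   end-of-file (0). Returns recv()'s result, or -2 on setup error. */
static ssize_t demo_linger_close(int secs)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct pollfd p;
    char c;
    int sfd = -1;
    ssize_t result = -2;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0 ||
        bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        (sfd = accept(lfd, NULL, NULL)) < 0 ||
        set_linger_secs(cfd, secs) != secs)
        goto out;

    close(cfd);            /* may block up to secs seconds; sends a FIN */
    cfd = -1;

    p.fd = sfd;
    p.events = POLLIN;
    p.revents = 0;
    if (poll(&p, 1, 1000) > 0)
        result = recv(sfd, &c, 1, 0);   /* 0 means a FIN (EOF) arrived */
out:
    if (cfd >= 0) close(cfd);
    if (sfd >= 0) close(sfd);
    if (lfd >= 0) close(lfd);
    return result;
}
```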

With a large enough timeout, the phenomenon disappears and all the
sockets are closed cleanly.

The observed effect on the server is obvious: all the badly closed sockets
remain in the ESTABLISHED state, since the server only responds to received
data... A tcpdump shows that the close packets never leave the client
host. On the client host, no kernel error message is logged.

Is this a known problem? Is there a proper workaround or a fix?
Did I miss something?


Thanks a lot,

JF Smigielski,

Rick Jones
03-06-2007, 10:20 PM
Jean-Francois Smigielski wrote:
> I ran into a problem with TCP connections on Linux 2.6.18 (on both
> clients and servers).


> I have an echo service that runs as 1, 2, 3 or 4 processes
> listening on the same IP/port. This service accepts tens of
> thousands of simultaneous connections. Each client process opens
> thousands of connections to the service, writes some data, reads
> the answer, waits, closes, and then opens, writes, ...


> Both client and server sockets are non-blocking and use the
> SO_LINGER option to avoid leaving a lot of sockets in the TIME_WAIT
> state. I started with a linger timeout of 0.


I thought it was generally agreed that deliberately causing abortive
closes that way was a "bad thing" - for example, RSTs are not
retransmitted, so you could leave the remote in ESTABLISHED etc. for a
very long time... And TIME_WAIT is there for a reason - to protect
against the accidental acceptance of old segments from a TCP
connection of the same name.
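One common alternative to the abortive close, sketched here as an assumption rather than anything proposed in the thread: shut down the sending side with shutdown(SHUT_WR), drain the socket until the peer's FIN arrives (recv() returns 0), and only then close(). Unlike a RST, the FIN is retransmitted until it is acknowledged, but the side that closes first still ends up in TIME_WAIT, which is the trade-off described above. graceful_close and loopback_pair are hypothetical helpers; the poll() timeout keeps a silent peer from blocking the close forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Graceful close: send our FIN with shutdown(SHUT_WR), read and discard
   whatever the peer still sends until its FIN arrives (recv() == 0), then
   close(). Gives up if a single wait exceeds timeout_ms.
   Returns 0 on a clean EOF, -1 otherwise; the fd is closed either way. */
static int graceful_close(int fd, int timeout_ms)
{
    char buf[512];
    int rc = -1;
    if (shutdown(fd, SHUT_WR) == 0) {
        for (;;) {
            struct pollfd p = { .fd = fd, .events = POLLIN };
            if (poll(&p, 1, timeout_ms) <= 0)
                break;                        /* timeout or poll error */
            ssize_t n = recv(fd, buf, sizeof buf, 0);
            if (n == 0) { rc = 0; break; }    /* peer's FIN seen */
            if (n < 0 && errno != EINTR && errno != EAGAIN)
                break;                        /* e.g. ECONNRESET */
        }
    }
    close(fd);
    return rc;
}

/* Hypothetical helper: a connected client/server pair on loopback. */
static int loopback_pair(int *client, int *server)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    int rc = -1;
    if (lfd >= 0 && cfd >= 0 &&
        bind(lfd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        listen(lfd, 1) == 0 &&
        getsockname(lfd, (struct sockaddr *)&addr, &alen) == 0 &&
        connect(cfd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        (*server = accept(lfd, NULL, NULL)) >= 0) {
        *client = cfd;
        rc = 0;
    } else if (cfd >= 0) {
        close(cfd);
    }
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```

For example, after loopback_pair(&c, &s), a peer that sends "bye" and closes lets graceful_close(c, 1000) drain the data and return 0.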

> If I kill the client processes on a host, thus killing thousands of
> connections at once, I should observe many TCP RST-flagged
> packets, at least one for every socket. But only some of those
> packets are sent: RSTs go out for only half of the original number
> of sockets. This happens with more than 4,000 client sockets.


Are you certain that your packet sniffer actually saw all the packets?
Even pcap reporting zero drops doesn't necessarily mean it saw all the
traffic.

If you were tracing on the server: back on the client, a sudden spike
of 4000 RSTs going out at once might have filled the driver/NIC's
transmit queues, and so some of them may have been dropped, never to be
seen again... If you were tracing on the client, it is possible that
those drops happened before the promiscuous tap (I'm not certain of
that, just speculating).
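One place such transmit-side drops may become visible on Linux is the per-interface counters under /sys/class/net/<interface>/statistics/ (for example tx_dropped). The sketch below reads one such counter; the interface name eth0 and the helper names read_counter and iface_counter are assumptions, and these driver-level counters do not necessarily cover every point where a packet can be dropped (queueing-discipline drops, for instance, are reported separately).

```c
#include <stdio.h>

/* Read a single integer counter from a text file such as
   /sys/class/net/eth0/statistics/tx_dropped. Returns -1 on any error. */
static long long read_counter(const char *path)
{
    long long value;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fscanf(f, "%lld", &value) == 1;
    fclose(f);
    return ok ? value : -1;
}

/* Read the named statistics counter of a network interface,
   e.g. iface_counter("eth0", "tx_dropped"). Returns -1 on any error. */
static long long iface_counter(const char *iface, const char *name)
{
    char path[256];
    int len = snprintf(path, sizeof path,
                       "/sys/class/net/%s/statistics/%s", iface, name);
    if (len < 0 || (size_t)len >= sizeof path)
        return -1;                       /* path would be truncated */
    return read_counter(path);
}
```

Reading tx_dropped before and after killing the clients and comparing the two values would show whether the driver admits to dropping anything at that level.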

> The observed effect on the server is obvious: all the badly closed
> sockets remain in the ESTABLISHED state, since the server only responds
> to received data...


Ah, so you are seeing firsthand one of the reasons an abortive close
of a TCP connection is considered a Bad Thing.
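To quantify the leftover connections on the server, one option on Linux is to count the entries in /proc/net/tcp whose st column is 01 (ESTABLISHED). The hypothetical count_established helper below assumes that file layout (a header line, then one line per socket with the state as the fourth whitespace-separated field, in hex) and covers IPv4 only; IPv6 sockets are listed in /proc/net/tcp6.

```c
#include <stdio.h>

/* Count sockets in the TCP ESTABLISHED state (st == 01) in a file laid out
   like Linux's /proc/net/tcp. Returns -1 if the file cannot be opened. */
static int count_established(const char *path)
{
    char line[512];
    int count = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) == NULL) {   /* skip the header line */
        fclose(f);
        return 0;
    }
    while (fgets(line, sizeof line, f) != NULL) {
        unsigned st;
        /* fields: "sl:" local_address rem_address st ... */
        if (sscanf(line, "%*s %*s %*s %x", &st) == 1 && st == 0x01)
            count++;
    }
    fclose(f);
    return count;
}
```

Calling count_established("/proc/net/tcp") on the server after killing the clients would show how many connections were left behind.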

rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Jean-Francois Smigielski
03-07-2007, 07:23 AM
On 2007-03-06, Rick Jones wrote:
>
> I thought it was generally agreed that deliberately causing abortive
> closes that way was a "bad thing" - for example, RSTs are not
> retransmitted, so you could leave the remote in ESTABLISHED etc. for a
> very long time... And TIME_WAIT is there for a reason - to protect
> against the accidental acceptance of old segments from a TCP
> connection of the same name.


Catching the signals to close the sockets cleanly before exiting has
the same effect.

Calling close(int s) on the sockets seems to close the connections the
same way (with FIN packets within the timeout, RST packets after the
timeout, and nothing at all for the rest).
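A minimal sketch of the catch-then-close approach, assuming POSIX sigaction(): the handler only sets a volatile sig_atomic_t flag, and the main loop closes every socket once the flag is set, so each close() follows whatever SO_LINGER setting the socket carries. stop_requested, on_stop, install_stop_handlers and close_all are hypothetical names. SIGKILL (kill -9) cannot be caught, so this only applies to signals such as SIGTERM or SIGINT.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested = 0;

/* Async-signal-safe handler: it only records that a stop was requested. */
static void on_stop(int sig)
{
    (void)sig;
    stop_requested = 1;
}

/* Install on_stop for SIGINT and SIGTERM. Returns 0 on success. */
static int install_stop_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_stop;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGINT, &sa, NULL) < 0)
        return -1;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Called from the main loop once stop_requested is set: close every open
   descriptor in fds and mark it as -1. Returns how many were closed. */
static int close_all(int *fds, size_t n)
{
    int closed = 0;
    for (size_t i = 0; i < n; i++) {
        if (fds[i] >= 0 && close(fds[i]) == 0)
            closed++;
        fds[i] = -1;
    }
    return closed;
}
```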

This agrees with the documentation of the SO_LINGER option.

>> If I kill the client processes on a host, thus killing thousands of
>> connections at once, I should observe many TCP RST-flagged
>> packets, at least one for every socket. But only some of those
>> packets are sent: RSTs go out for only half of the original number
>> of sockets. This happens with more than 4,000 client sockets.

>
> Are you certain that your packet sniffer actually saw all the packets?
> Even pcap reporting zero drops doesn't necessarily mean it saw all the
> traffic.
>
> If you were tracing on the server: back on the client, a sudden spike
> of 4000 RSTs going out at once might have filled the driver/NIC's
> transmit queues, and so some of them may have been dropped, never to be
> seen again... If you were tracing on the client, it is possible that
> those drops happened before the promiscuous tap (I'm not certain of
> that, just speculating).
>


Hmmm, no, I am not absolutely sure that my sniffer did not lose any
packets. I usually rely on tcpdump. But I made the capture on both
the client and the server hosts, and the result was the same.

I also suspected the network driver. Maybe I should mail one of
the official Linux kernel mailing lists.


Thank you for your attention,

JF Smigielski.
