Networking Forums

I need a Linux TCP stack guru

 
 
Patrick Klos

 
      03-14-2006, 01:09 AM
I am looking for someone who knows the internals of the TCP implementation
on Linux (2.6.10 or thereabouts). Here's a brief overview of the issue I'm
trying to resolve:

Background:
I'm trying to optimize transfers over a local GigE connection. The Linux
machine (MIPS) is supposed to send 500K+ of data using a single send()
function from the test application. The socket buffer size is set to more
than 1MB. Nagle is disabled (not that it should matter in this case). I've
essentially disabled congestion control by initializing tcp_cwnd to something
like 128. I've done everything I can think of to make sure the kernel and/or
TCP stack have no reason to do anything but send this chunk of TCP data as
fast as possible.
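
As a rough illustration of the socket setup described above, the sketch below requests a large send buffer and disables Nagle. It uses Python's standard `socket` module rather than the original test application, which is not shown in this thread, and the 1 MiB buffer size is an assumption. The kernel may clamp or adjust the requested buffer size, so the helper reads back the values actually in effect.

```python
import socket


def make_bulk_sender_socket(sndbuf_bytes=1 << 20):
    """Create a TCP socket configured roughly as described in the post."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask for a large send buffer; the kernel may clamp or adjust this value.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    # Disable Nagle's algorithm so small trailing segments are not held back.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def effective_options(sock):
    """Read back what the kernel actually applied."""
    return {
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
    }


if __name__ == "__main__":
    s = make_bulk_sender_socket()
    print(effective_options(s))
    s.close()
```

Comparing the reported `sndbuf` with the requested size is a quick way to check whether a system-wide limit is capping the buffer below what the application asked for.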

Problem:
Whenever the Linux TCP stack receives a packet from the peer indicating a
larger window size, it seems to cause a delay of about 350 microseconds
before additional TCP processing occurs on this connection. This occurs
BEFORE the peer's window ever gets too small for the Linux machine to
stop filling it, so it's not that the window closed and Linux had to stop
sending data to the peer.

Analysis:
Doing the math, this chunk should transfer in under 5 milliseconds
(really, closer to 4 msec). Instead, it's taking around 20 msec.
There are 41 of these window opening delay events in my test transfer, adding
at least 15 msec to the transfer time.
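
The arithmetic above can be sketched as follows. The per-packet overhead of 78 bytes (Ethernet preamble, header, FCS and inter-frame gap, plus IPv4 and TCP headers without options), the 1460-byte MSS, and the reading of "500K" as 500 * 1024 bytes are all assumptions; only the 41 events and the roughly 350 microsecond delay come from the trace described here.

```python
import math


def wire_time_ms(payload_bytes, mss=1460, overhead_per_packet=78,
                 link_bps=1_000_000_000):
    """Best-case time to push payload_bytes onto the wire, in milliseconds."""
    packets = math.ceil(payload_bytes / mss)
    total_bits = (payload_bytes + packets * overhead_per_packet) * 8
    return total_bits / link_bps * 1000.0


def stall_time_ms(events, stall_us):
    """Total time lost to a number of fixed-length stalls, in milliseconds."""
    return events * stall_us / 1000.0


if __name__ == "__main__":
    ideal = wire_time_ms(500 * 1024)   # assumed payload size
    stalls = stall_time_ms(41, 350)    # figures quoted in the post
    print(f"ideal wire time:     {ideal:.2f} ms")
    print(f"time lost to stalls: {stalls:.2f} ms")
    print(f"estimated total:     {ideal + stalls:.2f} ms")
```

Under these assumptions the ideal wire time comes out a little over 4 msec and the stalls account for roughly 14 msec, which together land close to the observed 20 msec.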

I don't know if I've explained this as clearly as I'd like. I could really
use a quick chat with someone who knows the workings of the Linux stack
inside and out (especially with regards to congestion control and ACK/
window processing).

Patrick
========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========
Patrick Klos Email: (E-Mail Removed)
Klos Technologies, Inc. Web: http://www.klos.com/
==================== http://www.loving-long-island.com/ ====================
 
Grant

 
      03-14-2006, 01:41 AM
On Tue, 14 Mar 2006 02:09:20 +0000 (UTC), (E-Mail Removed) (Patrick Klos) wrote:

>I am looking for someone who knows the internals of the TCP implementation
>on Linux (2.6.10 or thereabouts). Here's a brief overview of the issue I'm


Tried the mailing lists for linux-kernel and friends?

Perhaps start with http://vger.kernel.org/vger-lists.html then search
for other sub-system lists hosted elsewhere. This is outside my kernel
interest area. Also look to the Linux Test Project for benchmarking.

Why stay back at 2.6.10? 2.6.15.6 is latest stable, lots of bugfix,
performance fixes since then.

Grant.
--
Testing can show the presence of bugs, but not their absence.
-- Dijkstra
 
Patrick Klos

 
      03-14-2006, 05:20 PM
In article <(E-Mail Removed)>,
Grant <(E-Mail Removed)> wrote:
>On Tue, 14 Mar 2006 02:09:20 +0000 (UTC), (E-Mail Removed) (Patrick Klos) wrote:
>
>>I am looking for someone who knows the internals of the TCP implementation
>>on Linux (2.6.10 or thereabouts). Here's a brief overview of the issue I'm

>
>Tried the mailing lists for linux-kernel and friends?


Yes, I've poked around a bit. Seems this is an area seldom touched upon.

>Perhaps start with http://vger.kernel.org/vger-lists.html then search
>for other sub-system lists hosted elsewhere. Outside my kernel interest
>area Also look to linux testing project for benchmarking.


I'll check them out - thanks!

>Why stay back at 2.6.10? 2.6.15.6 is latest stable, lots of bugfix,
>performance fixes since then.


I don't get to choose the version - this company has everything working on
2.6.10 and doesn't think such a change is warranted at this time.

Thanks for the reply!

Patrick
 
Rick Jones

 
      03-15-2006, 12:13 AM
>>>I am looking for someone who knows the internals of the TCP
>>>implementation on Linux (2.6.10 or thereabouts).


>>Tried the mailing lists for linux-kernel and friends?


> Yes, I've poked around a bit. Seems this is an area seldom touched upon.


netdev is probably the one you want, although the folks in netdev may
or may not take kindly to someone effectively disabling slow-start -
although if you have a patch that adds a sysctl might as well share it
with them

>>Why stay back at 2.6.10? 2.6.15.6 is latest stable, lots of bugfix,
>>performance fixes since then.


> I don't get to choose the version - this company has everything
> working on 2.6.10 and doesn't think such a change is warranted at
> this time.


I can almost guarantee that the first request will be to try on a
newer version. If you can reproduce the problem on some "other"
hardware that you can then bump to a newer version and still show the
problem it will help your cause greatly. You would though likely have
to back-port any changes into the 2.6.10 bits in use at that company.

rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
 
Patrick Klos

 
      03-15-2006, 12:35 AM
In article <zSJRf.4710$(E-Mail Removed)>,
Rick Jones <(E-Mail Removed)> wrote:
>>>>I am looking for someone who knows the internals of the TCP
>>>>implementation on Linux (2.6.10 or thereabouts).

>
>>>Tried the mailing lists for linux-kernel and friends?

>
>> Yes, I've poked around a bit. Seems this is an area seldom touched upon.

>
>netdev is probably the one you want, although the folks in netdev may
>or may not take kindly to someone effectively disabling slow-start -


Hi Rick,

Thanks for the followup. Turns out I've made some good progress today.
At the suggestion of someone else, I checked into the possibility that
the ethernet driver was doing interrupt mitigation (a.k.a. interrupt
coalescing), and it turns out it was. Turning that off took my 500KB
transfer from 20 milliseconds down to just under 9 milliseconds. Not
completely what I was hoping for, but it's a start! :) Also, I won't
be able to keep interrupt coalescing totally turned off, but at least
I know that's one of the knobs I'll have to tune.

Even so, I will take a look at some of these other mailing lists and
resources, thanks!

>although if you have a patch that adds a sysctl might as well share it
>with them


I did add a sysctl (tcp_initial_cwnd) which I'd be glad to share, but
it's really not rocket science. Until I understand better how snd_cwnd
is supposed to be used and grow and shrink, I'd be reluctant to just
throw another knob into the mix. I was hoping someone who knew that
part of the TCP implementation could enlighten me?

>>>Why stay back at 2.6.10? 2.6.15.6 is latest stable, lots of bugfix,
>>>performance fixes since then.

>
>> I don't get to choose the version - this company has everything
>> working on 2.6.10 and doesn't think such a change is warranted at
>> this time.

>
>I can almost guarantee that the first request will be to try on a
>newer version.


Yes, it has been.

>If you can reproduce the problem on some "other"
>hardware that you can then bump to a newer version and still show the
>problem it will help your cause greatly.


Despite the previously mentioned observation about interrupt coalescing,
I tried a similar transfer on a different Linux box with a different
processor and got somewhat similar behavior after the peer's receive
window size increased. Still, I haven't been able to optimize the tests
on the x86 server to the point that I could make any real (fair) comparisons
(yet). I'll be analyzing that trace tomorrow to see what I can
learn from it.

>You would though likely have
>to back-port any changes into the 2.6.10 bits in use at that company.


That wouldn't be nearly as terrible as trying to upgrade the entire
kernel! ;^)

Forgive my ignorance, but where is the best place to find the newer
versions of Linux kernel code? Does each distribution maintain its
own set of sources? I don't know how that works?

Patrick
 
Grant

 
      03-15-2006, 12:54 AM
On Wed, 15 Mar 2006 01:35:09 +0000 (UTC), (E-Mail Removed) (Patrick Klos) wrote:

>Forgive my ignorance, but where is the best place to find the newer
>versions of Linux kernel code? Does each distribution maintain its
>own set of sources? I don't know how that works?


kernel.org. Be aware that some large distro' kernels are too 'bent'
for the standard 'vanilla' kernel (from kernel.org) to successfully
compile / install -- if you have such a distro, best to stay with their
modified kernel source. I use Slackware, it doesn't have these issues.

Also check the linux/Documentation/networking/ip-sysctl.txt in the
source (if you've not been there), some I use are:

# turn on the router, also sets defaults
echo 1 > /proc/sys/net/ipv4/ip_forward
# ISP drops ICMPs, so path MTU discovery cannot work
# note: 0 leaves discovery enabled; 1 would disable it
echo 0 > /proc/sys/net/ipv4/ip_no_pmtu_disc
# connection speed of 256/64kbps does not require window scaling
echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
# use timestamps for 256/64 kbps? FIXME: no measure
echo 1 > /proc/sys/net/ipv4/tcp_timestamps
# enable selective acknowledgments, also required for 'tcp_fack' - yes
echo 1 > /proc/sys/net/ipv4/tcp_sack
# enable FACK congestion avoidance and fast retransmission - yes
echo 1 > /proc/sys/net/ipv4/tcp_fack
# allows TCP to send "duplicate" SACKs - yes? FIXME no measure
echo 1 > /proc/sys/net/ipv4/tcp_dsack
# enable Explicit Congestion Notification - yes? FIXME no measure
echo 1 > /proc/sys/net/ipv4/tcp_ecn
# try for lower latency (default: 0)
# note: this trades throughput for low-latency
echo 0 > /proc/sys/net/ipv4/tcp_low_latency

This is copied from my firewall setup; mostly guesswork.
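
Before changing any of these, it can help to see what the running kernel currently has. The sketch below is one way to do that, assuming the usual `/proc/sys/net/ipv4` location; the helper name `read_sysctls` is made up for illustration. Some entries may not exist on a given kernel version, so missing files are reported as `None` rather than treated as errors.

```python
from pathlib import Path

# The TCP-related knobs from the list above.
TCP_KNOBS = [
    "tcp_window_scaling",
    "tcp_timestamps",
    "tcp_sack",
    "tcp_fack",
    "tcp_dsack",
    "tcp_ecn",
    "tcp_low_latency",
]


def read_sysctls(names, base="/proc/sys/net/ipv4"):
    """Return {name: value-as-string, or None if the entry is unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Entry missing on this kernel, or not readable by this user.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls(TCP_KNOBS).items():
        shown = value if value is not None else "(not available)"
        print(f"{name} = {shown}")
```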

Grant.
--
Memory fault -- brain fried
 
Grant

 
      03-15-2006, 01:05 AM
On Wed, 15 Mar 2006 02:49:39 GMT, Rick Jones <(E-Mail Removed)> wrote:

>I think that txqueuelen is an ifconfig option.


I set that to three to improve network control response. Network
control packets don't get backed up behind a queued data stream.
Let the senders queue outgoing data streams instead of the NIC.

Grant.
--
Memory fault -- brain fried
 
Rick Jones

 
      03-15-2006, 01:49 AM
Patrick Klos <(E-Mail Removed)> wrote:

> Thanks for the followup. Turns out I've made some good progress today.
> At the suggestion of someone else, I checked into the possibility that
> the ethernet driver was doing interrupt mitigation (a.k.a. interrupt
> coalescing), and it turns out it was. Turning that off took my 500KB
> transfer from 20 milliseconds down to just under 9 milliseconds. Not
> completely what I was hoping for, but it's a start! :) Also, I won't
> be able to keep interrupt coalescing totally turned off, but at least
> I know that's one of the knobs I'll have to tune.


Ah, interrupt coalescing - yep that could be an interesting one.
Still with a "large enough" window I wouldn't have expected the sender
to be stuck waiting for a window update to then be awaiting an
interrupt from the NIC to state that the ACK with the piggy-backed
window update arrived. Just what is the advertised window from the
receiver (modulo wscale)? Any idea how many packets are "in flight"
(from TCP's perspective) at that time? I wonder if perhaps the
interface transmit queue got filled and so flow-controlled the sending
TCP "internally?"

I think that txqueuelen is an ifconfig option.

>>although if you have a patch that adds a sysctl might as well share it
>>with them


> I did add a sysctl (tcp_initial_cwnd) which I'd be glad to share, but
> it's really not rocket science. Until I understand better how snd_cwnd
> is supposed to be used and grow and shrink, I'd be reluctant to just
> throw another knob into the mix. I was hoping someone who knew that
> part of the TCP implementation could enlighten me?


That would be netdev.

> Forgive my ignorance, but where is the best place to find the newer
> versions of Linux kernel code? Does each distribution maintain its
> own set of sources? I don't know how that works?


I'm still learning and so may have parts wrong, but basically
kernel.org is at the root of the tree as far as the linux kernel goes.
Distros can back-port stuff from kernel.org into their own trees and
may push fixes/changes from their own distros up to kernel.org.

rick jones
--
No need to believe in either side, or any side. There is no cause.
There's only yourself. The belief is in your own precision. - Jobert
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
 
Patrick Klos

 
      03-15-2006, 01:33 PM
In article <7hLRf.4715$(E-Mail Removed)>,
Rick Jones <(E-Mail Removed)> wrote:
>Ah, interrupt coalescing - yep that could be an interesting one.
>Still with a "large enough" window I wouldn't have expected the sender
>to be stuck waiting for a window update to then be awaiting an
>interrupt from the NIC to state that the ACK with the piggy-backed
>window update arrived.


True, and I'm still trying to figure that out. The sender never sent
enough data to the receiver to cause the window to close. At all times,
the receiver reported a window that was large enough for the sender to
continue sending if it wanted to.

Another odd thing is that these delay events occurred ONLY AFTER the
stack received a larger updated window from the peer!?! More packets
to look at...

>> I did add a sysctl (tcp_initial_cwnd) which I'd be glad to share, but
>> it's really not rocket science. Until I understand better how snd_cwnd
>> is supposed to be used and grow and shrink, I'd be reluctant to just
>> throw another knob into the mix. I was hoping someone who knew that
>> part of the TCP implementation could enlighten me?

>
>That would be netdev.


I'll check them out today.

>> Forgive my ignorance, but where is the best place to find the newer
>> versions of Linux kernel code? Does each distribution maintain its
>> own set of sources? I don't know how that works?

>
>I'm still learning and so may have parts wrong, but basically
>kernel.org is at the root of the tree as far as the linux kernel goes.
>Distros can back-port stuff from kernel.org into their own trees and
>may push fixes/changes from their own distros up to kernel.org.


Thanks again! I'll take a look at how much has changed between our version
and the latest stable version.

Patrick
 
Rick Jones

 
      03-15-2006, 05:22 PM
Patrick Klos <(E-Mail Removed)> wrote:
> True, and I'm still trying to figure that out. The sender never sent
> enough data to the receiver to cause the window to close. At all times,
> the receiver reported a window that was large enough for the sender to
> continue sending if it wanted to.


Ah... You do not have to see a window field of zero for the sender to
believe he has no more window into which he can send. You have to
take the ACK number in the window update, add the value in the window
field (possibly scaled), and compare the result with the highest
sequence number sent by the sender.
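
That comparison can be sketched as below. The function and parameter names are illustrative, not taken from the Linux source, and `snd_nxt` is assumed to be the next sequence number the sender would use, that is, one past the highest byte already sent. Sequence numbers wrap at 2^32, so the arithmetic is done modulo 2^32.

```python
SEQ_MOD = 1 << 32


def usable_window(ack, window_field, snd_nxt, wscale=0):
    """Bytes the sender may still transmit, from one trace line's values.

    ack          -- ACK number in the latest segment from the receiver
    window_field -- raw 16-bit window field from that segment
    snd_nxt      -- next sequence number the sender would send
    wscale       -- window scale shift negotiated for the receiver (0 if none)
    """
    # Right edge of the receiver's window in sequence space.
    right_edge = (ack + (window_field << wscale)) % SEQ_MOD
    remaining = (right_edge - snd_nxt) % SEQ_MOD
    # A "negative" difference shows up as a huge value after the modulo;
    # treat it as no usable window.
    if remaining >= 1 << 31:
        return 0
    return remaining


if __name__ == "__main__":
    # The window field is far from zero, yet the sender has nothing left to use:
    print(usable_window(ack=1000, window_field=65535, snd_nxt=66535))
    # Nothing in flight, so the whole advertised window is usable:
    print(usable_window(ack=1000, window_field=65535, snd_nxt=1000))
```

Running this against each ACK in the trace, with the sender's highest sequence number at that moment, shows whether the sender was window-limited even though the advertised window never reached zero.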

rick jones
--
portable adj, code that compiles under more than one compiler
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
 