Bizarre jumbo-frame TCP performance problems

 
 
Erik Walthinsen

 
      05-09-2004, 01:48 AM
I'm working on tuning the performance of a pair of dual Xeon 2.8GHz
machines (Supermicro 6013P-T) with crossover gigabit between the two
(onboard e1000's). The goal of the current exercise is to optimize
NFS performance for another set of machines, but I'm still on the
first step (further steps involve characterising and simulating the
target workload, which is quite unique).

Specifically, I'm trying to figure out why jumbo frames dramatically
*reduce* performance. I'm testing with both ttcp and netperf and
getting the same results.

Out-of-the-box, the machine boots with the following (2.4.26, not
entirely stock. results very similar with one running *stock*, will
reboot and retest with both running stock when the machine is next
idle):

net/core/rmem_default, wmem_default, rmem_max, wmem_max = 65536
net/ipv4/tcp_rmem is 4096 87380 174760
net/ipv4/tcp_wmem is 4096 16384 131072

netperf is giving me:

Recv    Send    Send
Socket  Socket  Message  Elapsed
Size    Size    Size     Time     Throughput
bytes   bytes   bytes    secs.    10^6bits/sec
87380   16384   16384    10.00    566.83
87380   16384   16384     9.99    601.52
87380   16384   16384     9.99    547.18

ttcp is giving me about 68MB/sec with my script that does 5 runs,
trims best and worst, and averages the remaining 3. This puts my ttcp
tests on par with netperf.
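
The averaging described above (five runs, drop the best and worst, average
the remaining three) can be sketched as follows. This is not the poster's
script, which isn't shown; the function names and the ttcp sample figures
are made up for illustration. The unit conversion uses the same factor of 8
that the post applies to its UDP figure (104.5MB/sec = 836Mbps).

```python
def trimmed_mean(runs):
    """Drop the single best and single worst run, average the rest."""
    if len(runs) < 3:
        raise ValueError("need at least three runs")
    kept = sorted(runs)[1:-1]
    return sum(kept) / len(kept)


def mbit_to_mbyte(mbit_per_s):
    """Convert 10^6 bits/sec to 10^6 bytes/sec."""
    return mbit_per_s / 8.0


# Hypothetical ttcp results in MB/sec from five runs (not from the thread).
ttcp_runs = [66.9, 68.4, 67.8, 71.2, 68.1]
print("ttcp trimmed mean: %.1f MB/sec" % trimmed_mean(ttcp_runs))

# The three netperf figures quoted above, in 10^6 bits/sec.
netperf_runs = [566.83, 601.52, 547.18]
netperf_mean = sum(netperf_runs) / len(netperf_runs)
print("netperf mean: %.1f MB/sec" % mbit_to_mbyte(netperf_mean))
```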

Now, I perform exactly two operations: ifconfig eth0 mtu 9000 on both
machines.

87380 16384 16384 10.00 393.70
87380 16384 16384 10.00 389.22
87380 16384 16384 10.00 395.94

ttcp comes in at around 48-50MB/sec, again consistent.

Interestingly, while running ttcp of 320MB total, tcpdump shows about
255,000 packets with MTU of 1500, and a total of about 33,000
interrupts for eth0. With the MTU at 9000, the packet count drops to
around 53,000, while the interrupt count stays the same at 33,000.
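
As a rough sanity check on those packet counts, the number of data segments
needed for a transfer can be estimated from the MTU. The sketch below
assumes IPv4 and TCP headers of 20 bytes each, 12 bytes of TCP timestamp
option per segment, and that "320MB" means 320 x 2^20 bytes; none of these
are stated in the post. ACKs and handshake packets are not counted, so
tcpdump totals will be higher.

```python
import math

IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12  # assumption: timestamps enabled on both ends


def payload_per_segment(mtu):
    """TCP payload bytes that fit in one packet of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER - TCP_TIMESTAMP_OPTION


def data_segments(total_bytes, mtu):
    """Full-sized segments needed to carry total_bytes of payload."""
    return math.ceil(total_bytes / payload_per_segment(mtu))


total = 320 * 2**20  # assumption: binary megabytes
for mtu in (1500, 9000):
    print("MTU %d: %d bytes/segment, ~%d data segments"
          % (mtu, payload_per_segment(mtu), data_segments(total, mtu)))
```

Under these assumptions the estimate is roughly 232,000 segments at MTU 1500
and 37,500 at MTU 9000.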

Another data point, checking the tcpdump output shows that with an MTU
of 1500, the TCP window size is 5840 for ~227,000 packets, and 34752
for ~30,000 packets. Switch the MTU to 9000 and we get 17920 for
~38,500 packets and 35792 for ~16,500 packets. FWIW ping reports RTTs
in the .150 to .250ms range, which from my understanding of TCP
requires a noticeably larger window even with such short RTTs.

Also, running ttcp in UDP mode results in around 104.5MB/sec (836Mbps)
no matter what the MTU is. This number is reported on both sending
and receiving sides of the connection, so packets are not getting
dropped half-way there and still counted.


Now, to fix the problem I started by doubling all the parameters in
tcp_rmem, to 8192 174760 349520. With an MTU of 1500, netperf shows:

174760 16384 16384 10.00 821.97
174760 16384 16384 9.99 799.36
174760 16384 16384 9.99 821.46

While with an MTU of 9000, we get:

174760 16384 16384 10.00 543.19
174760 16384 16384 10.00 537.96
174760 16384 16384 10.00 549.33

These numbers are only barely up to what MTU 1500 does *without*
tuning.


I'm at a loss to explain what's going on, since every single reference
I've found online implies that setting the MTU to 9000 will magically
improve performance. I also know more than enough about the internals
of TCP and the networking stack to know that this *should* be the
case. However, *something* is not working as it's supposed to.
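
For scale, the theoretical gain from jumbo frames on gigabit Ethernet is
modest. The sketch below estimates the best-case TCP payload rate, assuming
38 bytes of per-frame Ethernet overhead (preamble, header, FCS, inter-frame
gap), 40 bytes of IPv4 and TCP headers, and 12 bytes of TCP timestamp
option; the timestamp assumption in particular is not confirmed anywhere in
the thread.

```python
LINK_BITS_PER_SEC = 1_000_000_000
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20
TCP_OPTIONS = 12  # assumption: timestamps enabled


def max_goodput_mbyte_per_sec(mtu):
    """Upper bound on TCP payload throughput in 10^6 bytes/sec."""
    payload = mtu - IP_TCP_HEADERS - TCP_OPTIONS
    on_wire = mtu + ETHERNET_OVERHEAD
    return LINK_BITS_PER_SEC / 8 * payload / on_wire / 1e6


for mtu in (1500, 9000):
    print("MTU %d: at most %.1f MB/sec of TCP payload"
          % (mtu, max_goodput_mbyte_per_sec(mtu)))
```

That works out to a ceiling of roughly 118MB/sec versus 124MB/sec, about a
5% difference, so a drop of around 30% points at something other than
framing overhead.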

Any hints would be appreciated.
 
P Gentry

 
      05-09-2004, 02:23 PM
Erik Walthinsen wrote:
> I'm working on tuning the performance of a pair of dual Xeon 2.8GHz
> machines (Supermicro 6013P-T) with crossover gigabit between the two
> (onboard e1000's). The goal of the current exercise is to optimize
> NFS performance for another set of machines, but I'm still on the
> first step (further steps involve characterising and simulating the
> target workload, which is quite unique).
>
> Specifically, I'm trying to figure out why jumbo frames dramatically
> *reduce* performance. I'm testing with both ttcp and netperf and
> getting the same results.


[snip]

> These numbers are only barely up to what MTU 1500 does *without*
> tuning.
>
> I'm at a loss to explain what's going on, since every single reference
> I've found online implies that setting the MTU to 9000 will magically
> improve performance. I also know more than enough about the internals
> of TCP and the networking stack to know that this *should* be the
> case. However, *something* is not working as it's supposed to.
>
> Any hints would be appreciated.


Not much time this morning to think about the various possibilities
....

Setting MTU to 9000 is not enough. Your machines are capable of
handling a heavy interrupt load, so even at MTU=1500 you see
"acceptable" performance. I think you need a few more changes to get
a bit more performance.

Check here for a more complete list of perf tuning tips:
http://www.psc.edu/networking/perf_tune.html
http://www.psc.edu/networking/perf_tune.html#Linux

Example from above:
[quote]
the BDP can be calculated as follows:
1,000,000,000 bits     1 byte     70 seconds
------------------ * -------- * ------------ = 8,750,000 bytes = 8.75 MB
     1 second         8 bits       1,000
....
The following values would be reasonable for path with a large BDP:
echo 8388608 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
[end quote]

Amazing how much memory you really need with GigE ;-)

Also beware that if you plan to test specific network software on an
smp box that there may be issues re: cpu affinity.

hth,
prg
email above disabled
 
 
Erik Walthinsen

 
      05-10-2004, 04:53 AM
P Gentry wrote:
> Check here for a more complete list of perf tuning tips:
> http://www.psc.edu/networking/perf_tune.html
> http://www.psc.edu/networking/perf_tune.html#Linux

Been there (links are still purple).

> The following values would be reasonable for path with a large BDP:
> echo 8388608 > /proc/sys/net/core/wmem_max
> echo 8388608 > /proc/sys/net/core/rmem_max
> echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
> echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem

Done that. ;-) No change: 70MB/sec vs. 50MB/sec with no other
changes.

What really makes a difference, as in my original mail, is to force
both the minimum and default [rw]mem sizes up. That's what gets me
the last 25MB/sec in the 1500-byte case. But jumbo frames lag behind
no matter what ;-(

> Amazing how much memory you really need with GigE ;-)

Yes, but in the example given notice that the RTT is 70ms. Because
the machines are connected together by a 1-ft cable, RTT is near
wire-speed, from .150 to .250ms, about 1/350th of the example. That
puts the packets-in-flight at no more than 32KB (about 25KB actually)
if I'm doing my math right.
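
That arithmetic can be checked with a few lines. The sketch below computes
the bandwidth-delay product for the 70ms example from the tuning page and
for the RTTs measured here with ping; it assumes the full 1Gbit/s link rate
and ignores header overhead. The function name is just for illustration.

```python
def bdp_bytes(bits_per_sec, rtt_sec):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bits_per_sec * rtt_sec / 8


GIGABIT = 1_000_000_000

for label, rtt in (("PSC example", 0.070),
                   ("ping, low", 0.000150),
                   ("ping, high", 0.000250)):
    print("%-12s RTT %.3f ms -> %.0f bytes"
          % (label, rtt * 1000, bdp_bytes(GIGABIT, rtt)))
```

At 0.25ms that comes to about 31KB, and at 0.2ms to 25KB, which agrees with
the estimate above.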

> Also beware that if you plan to test specific network software on an
> smp box that there may be issues re: cpu affinity.

Yeah, every interrupt is going to CPU zero (of "4", dual HT Xeon).
However, I would expect that a) the interrupt count (rate? bursts?)
being the same in both cases would eliminate that, and/or b) the
presence of the ttcp/netperf processes on "random" processors would
have reared its head by now by causing wildly fluctuating numbers.

FWIW I rebuilt the e1000 driver on one of the machines with NAPI
(polled Rx), to no effect either way, either sending or receiving.

- Omega
aka Erik Walthinsen
 
 
P Gentry

 
      05-11-2004, 02:03 AM
Erik Walthinsen wrote:
> P Gentry wrote:

[snip]
>
> What really makes a difference, as in my original mail, is to force
> both the minimum and default [rw]mem sizes up. That's what gets me
> the last 25MB/sec in the 1500-byte case. But jumbo frames lag behind
> no matter what ;-(
>
> > Amazing how much memory you really need with GigE ;-)

> Yes, but in the example given notice that the RTT is 70ms. Because
> the machines are connected together by a 1-ft cable, RTT is near
> wire-speed, from .150 to .250ms, about 1/350th of the example. That
> puts the packets-in-flight at no more than 32KB (about 25KB actually)
> if I'm doing my math right.
>
> > Also beware that if you plan to test specific network software on an
> > smp box that there may be issues re: cpu affinity.

> Yeah, every interrupt is going to CPU zero (of "4", dual HT Xeon).
> However, I would expect that a) the interrupt count (rate? bursts?)
> being the same in both cases would eliminate that, and/or b) the
> presence of the ttcp/netperf processes on "random" processors would
> have reared its head by now by causing wildly fluctuating numbers.
>
> FWIW I rebuilt the e1000 driver on one of the machines with NAPI
> (polled Rx), to no effect either way, either sending or receiving.
>
> - Omega
> aka Erik Walthinsen


Sorry for the more-or-less pre-canned response, but just didn't have
time for more -- Mother's Day ;-) A bit more time today to look at
this ...

Be on notice that it's been several years since I've had "real"
hands-on with GigE and getting it up/tuned.

I try to keep up reasonably well since it's inevitable that GigE will
make it to at least some desktops/servers in our environment -- we're
currently using it on some backbones with Ciscos.

Posts like yours crop up now and then, but yours is the first with
some useful numbers -- thanks for posting the output as I don't have
access myself to a box that I can get numbers from, etc.

Besides the memory adjustments most of the other TCP variables are
already set to "good" values, IIRC. You might double check, but
nothing leaps out at me as a self-evident cause for your observations.
You could also compare 1500 MTU stats to 9000 MTU stats (especially
the TCPExts) via:
$ netstat -spc tcp

I wonder if this:
http://kerneltrap.org/node/view/2969 kernel 2.4.27-pre1
would help at all or if these are just "general" e1000 fixes?

After reading Intel's Readme, Release Notes, and Application Note as
well as the Napi paper I couldn't begin to guess what configuration
changes via the driver/tools would help. The interrupt timers look
interesting, but where do you begin?

That was for me the most baffling number you posted -- i.e., equal
interrupt counts with both MTUs. Hmmm .... the humidity must be
affecting my brain cells.

The Readme does rather vaguely and unhelpfully acknowledge
"Performance Degradation with Jumbo Frames" as a Known Issue. OK ...
and? I wonder if it's more than just a memory tweak that's needed in
your case.

Looking at how "versatile" the controller/interrupts are, I wonder if
APIC code is not playing well with the driver -- I know RH used to
have problems backporting APIC patches. But those problems usually
resulted in barely or wholly non-functional hardware/features.

The only other thing I ran across, in Intel's Open Software
Developer's Manual, was Frame Based Flow Control that generates the
ethernet pause frames when the nic's receive buffer is nearing full.
It is likely part of the auto-negotiation between two e1000's -- it's
available on a "dedicated link". See p.109 here:
http://sourceforge.net/project/showf...ckage_id=68544
I seem -- very hazily -- to recall having "problems" with this in the
past that resulted in unexplained behavior (since we weren't
looking/aware of it).

Sorry, I just don't seem able to come up with any useful ideas :-(
prg
email above disabled
 
 
Ralf Herrmann

 
      05-11-2004, 10:42 AM
Hi,

I'm just wondering...

Some weeks ago I made a mistake and posted something wrong in this NG.
From my (long-ago) lectures I had in mind that the max. size of
IP packets was 1536 bytes.
Well, this is false; someone corrected me and said that the max. size
is 64k. But he also said the max. Ethernet frame size is 1536 bytes.

So I guess I mixed some things up, because I think that is correct
for the Ethernet frame size.
I really don't know, but since e1000 is still Ethernet, I think
this might still be the case.

So this would go some way to explaining the number of interrupts,
which did not drop with the MTU set to 9000.
The large IP packets seem to be split into several Ethernet frames
and then transported separately.

Well, this should not lead to such a performance drop as shown
by your numbers (I don't know if there is some additional header when
splitting packets, but with 1500x6=9000 the possibly wasted bandwidth
should not be that much).
Anyway, it might contribute to this.

I think the performance drop might really come from reassembling the
IP packets. When the receiving host gets the Ethernet frames,
it might wait to report the IP packet until all parts have arrived.
Maybe the Ethernet frame buffer is not large enough to hold all
the parts, so that the driver module is presented with incomplete
IP packets or whatever.
However, something in between there might not be
implemented very well, so some extra delay is introduced in the end.

It would be interesting to investigate the e1000 a bit more.
But unfortunately I can't, because of a lack of time
and a lack of e1000 :-)

HTH

Ralf
 
 
Erik Walthinsen

 
      05-12-2004, 05:40 AM
P Gentry wrote:
> Besides the memory adjustments most of the other TCP variables are
> already set to "good" values, IIRC. You might double check, but
> nothing leaps out at me as a self-evident cause for your observations.
> You could also compare 1500 MTU stats to 9000 MTU stats (especially
> the TCPExts) via:
> $ netstat -spc tcp

I don't see any TCPExts number listed, but a diff before and after for
both sizes shows nothing that shouldn't be expected.
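
One way to do that kind of before/after comparison is to snapshot the
kernel's counters around each run and print only what changed. The sketch
below parses text in the /proc/net/netstat layout (a line of names followed
by a line of values, both starting with the same prefix). Whether a given
counter exists on a 2.4 kernel is an assumption, and the sample snapshots
are made up.

```python
def parse_counters(text):
    """Parse name/value line pairs like those in /proc/net/netstat."""
    counters = {}
    lines = [ln.split() for ln in text.strip().splitlines()]
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        for name, value in zip(names[1:], values[1:]):
            counters["%s.%s" % (prefix, name)] = int(value)
    return counters


def changed(before, after):
    """Return {counter: delta} for every counter that moved."""
    return {k: after[k] - before[k]
            for k in after if k in before and after[k] != before[k]}


# Made-up snapshots for illustration; on a real box, read /proc/net/netstat
# before and after the ttcp or netperf run.
before = parse_counters("TcpExt: TCPFastRetrans TCPTimeouts\nTcpExt: 3 1\n")
after = parse_counters("TcpExt: TCPFastRetrans TCPTimeouts\nTcpExt: 9 1\n")
for name, delta in sorted(changed(before, after).items()):
    print("%s +%d" % (name, delta))
```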

> I wonder if this:
> http://kerneltrap.org/node/view/2969 kernel 2.4.27-pre1
> would help at all or if these are just "general" e1000 fixes?

I applied the e1000 diffs from the pre2 kernel and have seen no
difference in behavior.

> After reading Intel's Readme, Release Notes, and Application Note as
> well as the Napi paper I couldn't begin to guess what configuration
> changes via the driver/tools would help. The interrupt timers look
> interesting, but where do you begin?

Exactly. I contemplated doing a brute-force search script, but when I
started to get a feel for how many piddly little options that *might*
affect performance there were, the test duration rapidly climbed into
the months.

> The Readme does rather vaguely and unhelpfully acknowledge
> "Performance Degradation with Jumbo Frames" as a Known Issue. OK ...
> and? I wonder if it's more than just a memory tweak that's needed in
> your case.

I would have *assumed* that this would work out of the box, that gig
was used enough by people that these problems would have been
resolved, or at least someone would document what to change...

> Looking at how "versatile" the controller/interrupts are, I wonder if
> APIC code is not playing well with the driver -- I know RH used to
> have problems backporting APIC patches. But those problems usually
> resulted in barely or wholly non-functional hardware/features.

The machines are running Debian Woody, and I build my own reasonably
stock kernels. These machines do have some patches, and for that
reason I'm going to attempt to reboot to a truly stock 2.4.26 right
now. See below.

> The only other thing I ran across, in Intel's Open Software
> Developer's Manual, was Frame Based Flow Control that generates the
> ethernet pause frames when the nic's receive buffer is nearing full.
> It is likely part of the auto-negotiation between two e1000's -- it's
> available on a "dedicated link". See p.109 here:
> http://sourceforge.net/project/showf...ckage_id=68544
> I seem -- very hazily -- to recall having "problems" with this in the
> past that resulted in unexplained behavior (since we weren't
> looking/aware of it).

I turned off auto/rx/tx pausing and performance dropped to almost
nothing.

UPDATE:

After putting a stock 2.4.26 kernel on both machines, the performance
difference for pure TCP goes away. I can get 112MB/sec with 1500-byte
MTU and 115MB/sec with 9000, *after* raising the rmem/wmem max's for
both core and ipv4 to 4MB, and quadrupling the min/default for ipv4
rmem/wmem.
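
Spelled out, and assuming the starting point was the out-of-the-box values
from the first post (the update doesn't say exactly which numbers were
used), that tuning would look something like this. The script only prints
the lines; it doesn't write to /proc.

```python
FOUR_MB = 4 * 1024 * 1024

# Out-of-the-box values quoted at the start of the thread: (min, default, max).
stock = {
    "net/ipv4/tcp_rmem": (4096, 87380, 174760),
    "net/ipv4/tcp_wmem": (4096, 16384, 131072),
}


def tuned(triple, max_bytes=FOUR_MB, factor=4):
    """Multiply min and default by factor, raise max to max_bytes."""
    minimum, default, _ = triple
    return (minimum * factor, default * factor, max_bytes)


for key in ("net/core/rmem_max", "net/core/wmem_max"):
    print("echo %d > /proc/sys/%s" % (FOUR_MB, key))
for key, triple in sorted(stock.items()):
    print('echo "%d %d %d" > /proc/sys/%s' % (tuned(triple) + (key,)))
```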

With that part done, it's on to NFS, where the problem comes right
back!

Previous experiments show that the following mount options are
reasonably optimal:

soft,intr,bg,tcp,async,rsize=8192,wsize=8192

With those mount options and an MTU of 1500, with the above rmem/wmem
tweaks, bonnie++ gets 47MB/sec sequential writes and 68MB/sec
sequential reads. Bonnie on the local machine gets 65MB/sec and
90MB/sec respectively, as it's on a 4-disk RAID-10 array. Switching
to 9000 drops that to 33MB/sec and 45MB/sec. The really bizarre part
is that doing so also *triples* the processor load while reading (but
not writing!).

I've done some packet traces and found some rather odd patterns. If
you look at http://pdxcolo.net/~omega/misc/jumbo/ you'll see the
following files:

daniel-1500.pcap.bz2    - pre-stock-2.4.26 ttcps
daniel-9000.pcap.bz2
rtt-1500.png            - screengrab of RTT from daniel-*.pcap
rtt-9000.png
throughput-1500.png     - screengrab of "throughput" from daniel-*.pcap
throughput-9000.png
nfs-write1500.pcap.bz2  - trace of a 256MB zerofile written on *stock* 2.4.26
nfs-write9000.pcap.bz2
nfs-write1500.png       - screengrab of "throughput" from nfs-write*.pcap
nfs-write9000.png

You'll see that there's a *very* different pattern going on depending
on the MTU, both before and after switching to pure stock kernels. I
haven't done a TTCP trace since switching, but I plan on doing so to
see if it retains the odd pattern, or shows an entirely new (third)
pattern to reflect the final achievement of normal TCP-only
throughputs.

My business partner is at Interop this week, I'll forward this to him
and remind him that he was going to try to corner someone from NetApp
and dig into a few NFS-related things ;-)

- Omega
aka Erik Walthinsen
http://pdxcolo.net/
 