Steve Wampler <(E-Mail Removed)> wrote:
> Rick Jones wrote:
> > Not that I have any data myself, but are you looking for a
> > single stream across the bond, or multiple streams?
> Single stream, at 960MB/s for 4 hours/day (typical) with possible 8
> hour duration (rarer). The source is an as-yet-unbuilt camera
> system. (There's actually more than one, but we should be able to
> isolate the data flows.) The 960MB/s is too close to 10Gb/s for me to
> believe we can get by with one port - hence the interest in bonding.
Well, depending on the "oomph" you have on the sender and the
receiver, it is possible to achieve "link rate" with TCP over 10G
Ethernet even with a 1500 byte MTU by employing TSO - TCP Segmentation
Offload - on the sender and LRO - Large Receive Offload - on the
receiver. It gets even easier if you can use JumboFrames of 9000 bytes
or more.
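
To put rough numbers on "link rate", here is a back-of-the-envelope sketch of the best-case TCP payload rate on a single 10G port. It assumes standard Ethernet framing with no VLAN tag, IPv4 and TCP headers without options, and that the 960MB/s above means 960 * 10^6 bytes per second; the function name is mine, purely for illustration.

```python
# Per-frame overhead in bytes, assuming standard Ethernet framing (no VLAN tag).
ETH_HEADER = 14      # destination MAC, source MAC, EtherType
ETH_FCS = 4          # frame check sequence
PREAMBLE_SFD = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP = 12  # minimum idle time between frames
IP_HEADER = 20       # IPv4 header without options
TCP_HEADER = 20      # TCP header without options


def tcp_goodput_bytes_per_sec(link_bits_per_sec, mtu, tcp_options=0):
    """Best-case TCP payload rate for back-to-back full-sized frames."""
    payload = mtu - IP_HEADER - TCP_HEADER - tcp_options
    on_wire = mtu + ETH_HEADER + ETH_FCS + PREAMBLE_SFD + INTERFRAME_GAP
    return link_bits_per_sec / 8 * payload / on_wire


if __name__ == "__main__":
    required = 960e6  # assumed meaning of "960MB/s"
    for mtu in (1500, 9000):
        best = tcp_goodput_bytes_per_sec(10e9, mtu)
        print(f"MTU {mtu}: best case {best / 1e6:.0f} MB/s, "
              f"960 MB/s is {required / best:.0%} of that")
```

Under those assumptions a single stream at 960MB/s needs roughly four fifths of what one port can carry at best, so the headroom is thin but not zero. Passing `tcp_options=12` accounts for TCP timestamps if they are enabled.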
> > Are you going back-to-back with 10G between systems or will there
> > be a switch in between?
> We'd prefer to have a switch, if possible at those rates. The
> cameras will be on a rotating platform with the target systems well
> off the platform, so having to switch fibers to switch back-ends
> between cameras isn't very attractive.
I'm not sure if any commercially available switches offer a mode-rr
(round robin) setting. Some use MAC addresses for picking a link in
the bond/trunk/team/aggregate, some can use IP address, some can use
TCP port numbers. But I'm not sure if any do round-robin.
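
A small sketch of why the link-selection policy matters for a single stream. The CRC32 hash is only a stand-in (real switches use vendor-specific hash functions), and the addresses and ports are hypothetical.

```python
import zlib
from itertools import count


def pick_link_by_hash(flow_key, n_links):
    """Hash-based policy: every packet of a given flow maps to the same link."""
    return zlib.crc32(repr(flow_key).encode()) % n_links


def make_round_robin(n_links):
    """Round-robin policy: successive packets rotate across the links."""
    counter = count()

    def pick(_flow_key):
        return next(counter) % n_links

    return pick


# Hypothetical flow: (source IP, destination IP, source port, destination port).
flow = ("10.0.0.1", "10.0.0.2", 40000, 5001)

links_used_by_hash = {pick_link_by_hash(flow, 2) for _ in range(1000)}
rr = make_round_robin(2)
links_used_by_rr = {rr(flow) for _ in range(1000)}

print("hash policy uses links:", sorted(links_used_by_hash))
print("round-robin uses links:", sorted(links_used_by_rr))
```

Whatever fields go into the hash (MAC, IP, port), one stream has one set of values, so it lands on one link. Only a per-packet policy like round-robin spreads a single stream, and that is where the reordering comes from.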
> > If you want a single stream to try to take advantage of multiple
> > links in a bond you are pretty much limited to mode-rr, and so at
> > the very real risk (certainty IMO) of reordered traffic at the
> > receiver. I suspect that will affect the receiver's ability to
> > effectively employ Large Receive Offload. The out of order
> > traffic will result in an increased ACK load, and if the out of order
> > is "enough" out of order it can trigger spurious fast
> > retransmissions. Further, while the bonding software on the Linux
> > host will control how traffic is spread on outbound, it is the
> > _switch_ which controls how traffic is spread on inbound, and if
> > the switch does not have a mode-rr equivalent, you might get 2
> > links on transmit but only one link on receive.
> Thanks - that's extremely useful! (or it will be as soon as I get a
> translation back into English.)
TCP will "work" when its segments arrive out of order, but for every
out-of-order segment a TCP receiver will generate an immediate ACK.
That ACK will have the sequence number of the first "missing" TCP
segment. That means that both the receiving and sending TCPs will
spend more CPU cycles in ACK processing.
A sending TCP has a heuristic called "fast retransmit" which works
based on the ass-u-me-ption that traffic is rarely reordered, so if
traffic arrives out of order at a receiver it implies some traffic was
lost. By default, if a sending TCP receives three duplicate ACKs
(ACKs saying the same sequence number is the next expected) the
sending TCP will assume that segment was lost and retransmit it.
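
Here is a toy simulation of that duplicate-ACK counting, as a sketch only. Segments are numbered 0, 1, 2, ... instead of using byte sequence numbers, every arrival is ACKed immediately (real receivers may delay ACKs for in-order data), and both function names are mine.

```python
def receiver_acks(arrival_order):
    """Return the cumulative ACK sent after each arriving segment.

    An ACK carries the number of the next segment the receiver expects.
    """
    received = set()
    next_expected = 0
    acks = []
    for seg in arrival_order:
        received.add(seg)
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks


def fast_retransmits(acks, dupack_threshold=3):
    """Return the segments a sender would fast-retransmit for this ACK stream."""
    retransmitted = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dupack_threshold:
                retransmitted.append(ack)
        else:
            last_ack, dup_count = ack, 0
    return retransmitted


in_order = [0, 1, 2, 3, 4, 5]
mild = [0, 2, 1, 3, 4, 5]    # segment 1 arrives one position late
heavy = [0, 2, 3, 4, 1, 5]   # segment 1 arrives three positions late

for name, order in (("in order", in_order), ("mild", mild), ("heavy", heavy)):
    acks = receiver_acks(order)
    print(name, "acks:", acks, "fast retransmits:", fast_retransmits(acks))
```

In the "heavy" case nothing was lost, yet segment 1 gets retransmitted because three segments overtook it. That is the spurious fast retransmission.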
Sending TCPs also maintain an idea of how much traffic they can send
at one time without triggering packet loss in the network. That is
called the congestion window. When a sending TCP has to retransmit it
will adjust its congestion window downwards - sometimes considerably.
So, lots of traffic reordering can result in spurious fast
retransmissions, which can result in smaller congestion windows which
can result in lower performance.
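
To give a feel for that chain of consequences, here is a deliberately crude model: the window grows by one segment per round trip and is halved on every fast retransmit. It is not the congestion control algorithm of any particular stack, and the window sizes and intervals are made-up numbers.

```python
def average_cwnd(rtts, spurious_every=None, start_cwnd=100, max_cwnd=100):
    """Average congestion window, in segments, over `rtts` round trips.

    Toy model: +1 segment per round trip (capped at max_cwnd), halved
    whenever a (spurious) fast retransmit occurs.
    """
    cwnd = start_cwnd
    total = 0
    for rtt in range(1, rtts + 1):
        total += cwnd
        if spurious_every and rtt % spurious_every == 0:
            cwnd = max(cwnd // 2, 1)   # back off after a fast retransmit
        else:
            cwnd = min(cwnd + 1, max_cwnd)
    return total / rtts


print("no reordering:         ", average_cwnd(1000))
print("spurious every 100 RTT:", average_cwnd(1000, spurious_every=100))
print("spurious every 10 RTT: ", average_cwnd(1000, spurious_every=10))
```

The more often the sender is fooled into retransmitting, the less time the window spends anywhere near the size needed to fill the link.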
The Linux TCP stack on the sending side can have its sensitivity to
duplicate ACKs "tuned" to the point of effectively eliminating fast
retransmissions. Of course, then if there *is* a lost segment one
might end up waiting for a retransmission timeout, and that is really
bad news. Enabling Selective ACKnowledgement may help.
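
For checking the current settings on a Linux box, a minimal sketch follows. I am assuming the relevant knobs are the net.ipv4.tcp_reordering and net.ipv4.tcp_sack sysctls; the helper name is mine, and it returns None on systems where the files are absent.

```python
from pathlib import Path


def read_tcp_sysctl(name, root="/proc/sys/net/ipv4"):
    """Return the integer value of a net.ipv4 sysctl, or None if unavailable."""
    try:
        return int((Path(root) / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


# Assumed knobs: initial reordering tolerance and selective ACK on/off.
for knob in ("tcp_reordering", "tcp_sack"):
    print(f"net.ipv4.{knob} = {read_tcp_sysctl(knob)}")
```

This only reads the values. Changing them requires root, and any new value should be tested against the real traffic pattern before being relied upon.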
Similarly, I think that many of the LRO schemes in NICs make use of
the "traffic is rarely reordered" assumption. So, when traffic
arrives out of order the NIC is not able to aggregate as many smaller
segments into one larger one to give to the host. So the receiving
host has more per-packet work to do because it is receiving more
packets.
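
A sketch of that effect, assuming an LRO scheme that only merges a segment when it directly follows the previous one. Real NICs differ in the details, and the limit of 8 segments per aggregate is a made-up number.

```python
def lro_aggregate(arrival_order, max_segments=8):
    """Coalesce runs of consecutively numbered segments into aggregates.

    An out-of-order arrival, or reaching max_segments, closes the current
    aggregate and starts a new one.
    """
    aggregates = []
    current = []
    for seg in arrival_order:
        if current and seg == current[-1] + 1 and len(current) < max_segments:
            current.append(seg)
        else:
            if current:
                aggregates.append(current)
            current = [seg]
    if current:
        aggregates.append(current)
    return aggregates


in_order = list(range(16))
# Each pair of segments swapped, as might happen with skew between two links.
swapped = [s for pair in zip(range(1, 16, 2), range(0, 16, 2)) for s in pair]

print("in order:", len(lro_aggregate(in_order)), "packets handed to the host")
print("swapped: ", len(lro_aggregate(swapped)), "packets handed to the host")
```

With in-order arrival the host sees a couple of large aggregates; with pairwise swapping it sees every segment individually, which is the extra per-packet work described above.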
The above may not be modern English, but perhaps it isn't any worse
than Middle English now:
Whan that Aprill, with his shoures soote
The droghte of March hath perced to the roote
http://www.librarius.com/cantales.htm
rick jones
I've heard occasional talk about 40 and 100Gbit Ethernet - not sure if
any of it is far enough along for an "observatory special" though.
There may be something that fast or faster in the telco space. In
either case we are probably talking some serious dollars though.
--
portable adj, code that compiles under more than one compiler
these opinions are mine, all mine; HP might not want them anyway...

feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...