We are seeing unexplained delays and batching of TCP sends in network traces
of our client/server app, even though we have Nagle disabled. Has anyone else
experienced this, and does anyone have an explanation, workaround, or fix?
VERBOSE BACKGROUND:
We have a client/server app which requires all sends be pushed out onto the
wire as quickly as possible. From the beginning, we've had Nagle disabled
and have sniffed the wire to observe the appropriate and expected behavior.
That is, with Nagle disabled, we nearly always see a 1:1 mapping between
application socket writes and TCP frames hitting the wire.
Our client runs on Windows Server 2003 and Windows Server 2008 hosts. All
client machines are up-to-date with the latest NIC drivers and Windows
Updates (as of yesterday). Hardware is off-the-shelf Dell/Intel and NICs are
Intel. All are 32-bit, not that it should matter. The network connecting
clients and servers is essentially a LAN (which doesn't matter too much
anyway, because the send delays we're now seeing are present at the client
NIC, i.e., before entering the network switches for the first time).
Recently, some changes were made to the hardware and software on the server
side (maintained by a 3rd party), but we're not privy to exactly what changes
were made nor what the current config is of those server boxes.
The behavior we're now seeing is that when our client app makes multiple,
near-simultaneous calls to Socket.Write(), we no longer see the pattern
we're used to (a number of frames corresponding almost always to exactly
the number of times Socket.Write() was called in the client app). Instead,
we see a single frame sent initially (corresponding to a single call to
Socket.Write()), followed by a pause on the order of 10 - 25 MILLIseconds
(ms) or more. The pause ends when the server's response (payload) arrives at
the client, at which point the client INSTANTLY (same MICROsecond (us)) sends
the remaining payload out onto the wire as a single frame.
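
For anyone trying to reproduce this, one way to line up application writes
with the capture is to timestamp each write call as it returns. The sketch
below is in Python rather than our client's language; timed_writes is a
hypothetical helper, and sock.sendall() merely stands in for the
Socket.Write() calls described above.

```python
import socket
import time


def timed_writes(sock, payloads):
    """Send each payload with its own call and note when each call returned.

    Returns a list of (elapsed_seconds, bytes_sent) tuples, where
    elapsed_seconds is measured from just before the first write.
    """
    records = []
    start = time.perf_counter()
    for payload in payloads:
        sock.sendall(payload)  # one application-level write per payload
        records.append((time.perf_counter() - start, len(payload)))
    return records
```

If the write calls all return within microseconds of each other while the
capture shows the later frames held back for milliseconds, the delay is
presumably being introduced below the application, not by the app's own
write pattern.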
On the surface, this is reminiscent of Nagle being enabled. However, we've
confirmed through additional network traces and in our client application
itself that Nagle is in fact properly disabled and therefore it should not be
the cause of our delay in sending the remaining data.
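
To be explicit about what "confirmed in our client application" means, the
check amounts to reading the TCP_NODELAY option back from the socket after
setting it. A minimal sketch in Python (not our actual client code; the
helper names make_nodelay_socket and nagle_disabled are made up for
illustration):

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock):
    """Return True if TCP_NODELAY reads back as set on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

Reading the option back only shows what the socket reports; it says nothing
about other layers of the stack or the NIC, which is why we also checked the
traces.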
So, we're left w/o an explanation for why the client machine is sending the
first payload only and then pausing and not resuming again until it receives
the next communication from its peer (which happens to take a large-ish
number of ms in this case), by which time the remaining payloads have all
been coalesced into a single frame and hit the wire at once.
It has the feel that perhaps there's some sort of congestion avoidance
protocol in play here, but we can't explain it from the network traces nor
from the MS docs on Windows networking internals.
We've tried a variety of network configuration changes on both the Windows
Server 2003 and Server 2008 hosts, but are unable to modify the behavior in
the least. Among the things we've tried (not an exhaustive list):
- Disable Autotuning (Win2k8)
- Disable RSS (Win2k8)
- Unbind/Remove QOS Packet Scheduling Service
- Disable CTCP (Win2k8)
- Set TcpDelAckTicks=0
- Set TcpAckFrequency=1 or 2 (although this should have no effect on TCP
  sends)
- Disable/Enable Flow Control between the NIC and switch
- Disable Interrupt moderation on the NIC
- Disable Inter-Frame spacing on the NIC (default)
OUTSTANDING ISSUE:
Our suspicion is that there's some sort of dynamic tuning going on that
began with all the recent updates on both the client and server side, but so
far no one can pinpoint exactly what that might be.
We've run out of ideas of what to try on the client side that could make an
impact on these apparent delayed sends. Can anyone provide some insight into
what might be the underlying cause of these delays and subsequent batching
and how we might get back to sending "immediately" upon calling
Socket.Write() (no matter what the server is doing on its side)?