We have a client running W2K SP4, connecting to a server running W2K3 SP1.
Occasionally (very often during busy periods) TCP connections are reset
during the initial TCP handshake. It appears to be due to a bad packet
coming from the W2K3 box (ACK instead of SYN,ACK).
Outline:
Host A: Windows 2000 SP4
Host B: Windows 2003 SP1
Host A is the initiator. Its “source” ports vary, but it always
connects to port 4000 on Host B. The application running on Host A is not
very efficient: each time it requests data from B, it opens a new
connection (after the prior request has completed).
The traces taken on both sides are identical – there are no apparent
network-induced delays and no apparent packet loss. I have only
included the traces from one side, but packet-for-packet they were identical
once you discount the spurious TCP checksum errors (an artifact of checksum
offloading to the NIC).
The three-second delay in Scenario 2 below introduces the potential
for huge performance problems on a busy system, since this application seems
to queue its requests – one connection is established at a time, and each
waits for the prior one to complete.
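To put a rough number on that, here is a back-of-the-envelope sketch. The 50 ms base request time and the 5% failure rate are made-up figures for illustration only (I have not measured either); the only value taken from the traces is the roughly 3 second retry delay.

```python
def mean_request_time(base_seconds, reset_fraction, retry_delay_seconds=3.0):
    """Average time per request when some fraction of connects hit the
    reset-and-retry path and pay the extra delay once."""
    return base_seconds + reset_fraction * retry_delay_seconds


def requests_per_minute(base_seconds, reset_fraction, retry_delay_seconds=3.0):
    """Throughput of a strictly serial queue (one connection at a time)."""
    return 60.0 / mean_request_time(base_seconds, reset_fraction, retry_delay_seconds)


if __name__ == "__main__":
    # Hypothetical inputs: 50 ms per request, 5% of handshakes reset.
    print(round(requests_per_minute(0.05, 0.0)))   # no resets
    print(round(requests_per_minute(0.05, 0.05)))  # 5% resets
```

With those assumed inputs the serial queue drops from about 1200 to about 300 requests per minute, which is why even a small percentage of bad handshakes hurts so much here.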
From what I’ve read, during the three-way TCP handshake the MSS should be set
and the TCP window should be 16k. The second packet in these “bad”
conversations defies both of those parameters, in addition to not sending a
SYN back to the initiator.
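The checks I am applying to the second packet can be written down as a small sketch. The function name and argument layout are my own invention, and the 16384 window is simply what the good trace shows from this server, not something TCP itself requires.

```python
def check_handshake_reply(flags, window, mss=None, ack=None,
                          expected_ack=1, expected_window=16384):
    """Return a list of reasons the second handshake packet looks wrong.

    An empty list means it looks like the normal SYN,ACK in the good trace.
    expected_window is an observation from that trace, not a protocol rule.
    """
    problems = []
    flag_set = set(flags)
    if "SYN" not in flag_set:
        problems.append("SYN flag missing")
    if "ACK" not in flag_set:
        problems.append("ACK flag missing")
    if mss is None:
        problems.append("no MSS option")
    if window != expected_window:
        problems.append("window differs from the usual value")
    if ack is not None and ack != expected_ack:
        problems.append("unexpected Ack number")
    return problems


if __name__ == "__main__":
    # Values copied from packet 2 of the good and bad conversations.
    print(check_handshake_reply(["SYN", "ACK"], 16384, mss=1460, ack=1))
    print(check_handshake_reply(["ACK"], 64619, mss=None, ack=3068662333))
```

The good packet passes every check; the bad one fails on the SYN flag, the MSS option, the window, and the Ack number.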
Scenario 1 - Normal:
Host A application opens communication path to Host B. Handshake completes.
Data Transfers. Connection terminated normally.
Packet 1. A > B SYN
Packet 2. B > A SYN,ACK
Packet 3. A > B ACK
Packet 4. A > B PSH,ACK
Packet 5. B > A ACK
Packet 6. A > B PSH,ACK
Packet 7. B > A ACK
Packet 8. B > A PSH,ACK
Packet 9. A > B ACK
Packet 10. A > B FIN,ACK
Packet 11. B > A ACK
Scenario 2 - Problem (happens occasionally – seemingly at random intervals,
more often on busy system):
Host A application attempts opening of communication path to B. Handshake
fails, starts over, then works normally.
Packet 1. A > B SYN
Packet 2. B > A ACK (Packet 2 does not issue a SYN)
Packet 3. A > B RST (Followed by 3+ second delay)
Packet 4. A > B SYN (A connects from same port as Packet 1)
Packet 5. B > A SYN,ACK
Packet 6. A > B ACK
Packet 7. A > B PSH,ACK
Packet 8. B > A ACK
Packet 9. A > B PSH,ACK
Packet 10. B > A ACK
Packet 11. B > A PSH,ACK
Packet 12. A > B ACK
Packet 13. A > B FIN,ACK
Packet 14. B > A ACK
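For anyone wanting to count how often this happens in a longer capture, the pattern is simple enough to spot mechanically. This is only a sketch: it assumes the capture has already been reduced to (sender, receiver, flags) tuples for a single conversation, which is my own simplification and not the format of any particular tool.

```python
def find_failed_handshakes(packets):
    """Return the indexes of SYNs that were answered by a bare ACK and
    then reset by the initiator (the Scenario 2 pattern).

    packets: list of (sender, receiver, flags), flags like "SYN" or "SYN,ACK".
    """
    hits = []
    for i in range(len(packets) - 2):
        syn, reply, reset = (tuple(p) for p in packets[i:i + 3])
        client, server = syn[0], syn[1]
        if (syn[2] == "SYN"
                and reply == (server, client, "ACK")
                and reset == (client, server, "RST")):
            hits.append(i)
    return hits


if __name__ == "__main__":
    scenario_1 = [("A", "B", "SYN"), ("B", "A", "SYN,ACK"), ("A", "B", "ACK")]
    scenario_2 = [("A", "B", "SYN"), ("B", "A", "ACK"), ("A", "B", "RST"),
                  ("A", "B", "SYN"), ("B", "A", "SYN,ACK"), ("A", "B", "ACK")]
    print(find_failed_handshakes(scenario_1))
    print(find_failed_handshakes(scenario_2))
```

Scenario 1 yields no hits, and Scenario 2 yields a single hit at index 0 (the first SYN).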
More detailed analysis:
Packet 1, in both scenarios, is identical except for source port number.
Scenario 1) TCP 2407 > 4000 [SYN] Seq=0 Len=0 MSS=1460
Scenario 2) TCP 2413 > 4000 [SYN] Seq=0 Len=0 MSS=1460
Packet 2 is very different. Note that an ACK is sent instead of SYN,ACK –
but also there is no MSS, the TCP window size is much larger, and the Ack
number is much larger.
1) TCP 4000 > 2407 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
2) TCP 4000 > 2413 [ACK] Seq=0 Ack=3068662333 Win=64619 Len=0
Packet 3, of course, is going to be different, as a result of 2...
1) TCP 2407 > 4000 [ACK] Seq=1 Ack=1 Win=17520 Len=0
2) TCP 2413 > 4000 [RST] Seq=3068662333 Len=0
Packet 4, in Scenario 2, is a replay of Packet 1, using the same port number
as before (after 3+ seconds have passed):
TCP 2413 > 4000 [SYN] Seq=0 Len=0 MSS=1460
Packet 5, Scenario 2, is a “normal” SYN,ACK:
TCP 4000 > 2413 [SYN, ACK] Seq=2628930846 Ack=1 Win=16384 Len=0 MSS=1460
And Packet 6, Scenario 2, finishes the handshake as normal:
TCP 2413 > 4000 [ACK] Seq=1 Ack=2628930847 Win=17520 Len=0
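For what it's worth, Host A's side of this looks correct to me. As I read the SYN-SENT rules in RFC 793, an incoming ACK whose Ack number does not cover our SYN must be answered with a RST whose sequence number is that Ack number, which is exactly what Packet 3 of Scenario 2 shows. The sketch below models only that one rule; the function names are mine, and it treats the relative numbers shown in the trace as if they were the real ones.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def ack_is_acceptable(iss, snd_nxt, seg_ack):
    """True when ISS < SEG.ACK <= SND.NXT, compared modulo 2**32."""
    offset = (seg_ack - iss) % MOD
    limit = (snd_nxt - iss) % MOD
    return 0 < offset <= limit


def syn_sent_reply(iss, snd_nxt, seg_flags, seg_ack):
    """Simplified model of what a host in SYN-SENT sends back.

    Returns (flag, sequence_number) for the reply, or None if nothing is sent.
    """
    flags = set(seg_flags)
    if "RST" in flags:
        return None  # a reset is never answered with another segment
    if "ACK" in flags and not ack_is_acceptable(iss, snd_nxt, seg_ack):
        return ("RST", seg_ack)  # reset carries SEQ = SEG.ACK
    if "SYN" in flags:
        return ("ACK", snd_nxt)  # normal third step of the handshake
    return None


if __name__ == "__main__":
    # Scenario 1, packet 2: [SYN, ACK] Ack=1
    print(syn_sent_reply(0, 1, ["SYN", "ACK"], 1))
    # Scenario 2, packet 2: [ACK] Ack=3068662333
    print(syn_sent_reply(0, 1, ["ACK"], 3068662333))
```

The first call gives an ACK with Seq=1 and the second gives a RST with Seq=3068662333, matching Packet 3 in each scenario. So the open question is why Host B answers a fresh SYN with a bare ACK in the first place.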
--- So, I know numerous TCP fixes are included in W2K3 SP2, but none of them
sounded like an exact fit (I have already poked around a bit in the
KB). Any other thoughts as to where to look for potential problems, other
than just telling them to upgrade to SP2?