antoine wrote:
>> So someone else is providing you with Solaris and Linux binaries I
>> presume?
> Yes, this is a vendor solution that we've purchased. a stock trading
> engine that is receiving order requests from my own in-house trading
> application: socket connection, simple text-based API. I have done
> all network optimization at the software level - through java - on
> MY app.
Java...

So, your app is the client, yes?
> the order engine is then connected to another "system" in the order
> sending process (there are a few like this one after the other
> before it goes to the market).
> the goal is of course to minimize the time it takes to reach the
> market...
> on solaris, optimizing my app at software level, and changing the
> parameters on the server improved performances quite well, but now
> that we're migrating to linux (the server), I'm back with not as
> good performances, and an intuition the network optimization has at
> least a little to do with it...
>> My intuition may be oversaturated with 18 years of experience, but
>> it is beginning to sound like you are having to kludge around a
>> poorly written application. It isn't perhaps trying to write
>> requests or responses to the socket in multiple write calls is it?
>> A truss (solaris) or strace (linux) of the application, perhaps
>> combined with a tcpdump trace could be very helpful there...
> I think it's exactly the way the app is working, but I don't have
> any way to modify it, and I BELIEVE it's more a feature than a bug:
> - the client app is sending an order request
> - the server is FIRST replying with a message that says "I've heard
> you, I'm going to handle the request"
> - the server then sends another reply saying something like "I've
> handled your request here, it's sent somewhere else"
Well, I've often said that "99 times out of 10" setting TCP_NODELAY is
a kludge, but if the above is accurate, it would be one of those 100th
out of 10 situations.
I was more concerned about step two, say: that when the server says
"I hear you", it might be sending that one message with more than one
send call.
> in the same way, when there are executions on an order, each
> execution message might not be sent "as soon as available"...
> ALSO, as I'm very often sending several requests at the same time,
> the server is replying to each request in different messages, that
> are aggregated by Nagle algorithm, so that the FIRST reply is
> delayed (it's waiting for more data to send), unless I disable
> Nagle...
I think your understanding of Nagle is a little off. It is supposed
to work this way:
1) is this user's send(), plus any queued, unsent data >= the MSS for
the connection? if yes, send immediately (modulo things like
congestion window). if no, go to step 2
2) Is this connection otherwise idle - do we have no unACKed data
outstanding to the other side? if yes, send immediately (again modulo
stuff like congestion window) otherwise go to step 3
3) wait for either
a) more sends from the user to get >= MSS
b) ACK's from the remote to make the connection "idle"
c) the retransmission timer to expire
This suggests that the first reply from the server will not be
delayed, it is the second reply from the server which will be delayed.
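
To make the three steps concrete, here is a minimal sketch of the decision as a toy Python function. The function name, its parameters and the message sizes are made up for illustration, and the congestion window is ignored entirely; it is not how any particular stack implements Nagle.

```python
def nagle_decision(send_len, queued_len, unacked_len, mss):
    """Return "send" or "wait" for one user send(), per the three steps above.

    Toy model only: congestion-window limits are deliberately ignored.
    """
    # Step 1: is there enough data for a full-sized segment?
    if send_len + queued_len >= mss:
        return "send"
    # Step 2: is the connection idle (no unACKed data outstanding)?
    if unacked_len == 0:
        return "send"
    # Step 3: hold the data until more sends arrive, an ACK comes back,
    # or the retransmission timer expires.
    return "wait"


# Two small server replies on a 1460-byte MSS (hypothetical sizes).
mss = 1460
first = nagle_decision(send_len=40, queued_len=0, unacked_len=0, mss=mss)
second = nagle_decision(send_len=120, queued_len=0, unacked_len=40, mss=mss)
print(first, second)  # send wait
```

The first reply finds the connection idle and goes straight out; the second finds the first still unACKed and sits in step 3.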
Soooo, if I've understood correctly, a single transaction at the app
level would look like:
Client                          Server
Request ->
                                <- Server "I hear you"
                                <- Server "response"
Now, in a mostly perfect world, that would be the same picture at the
TCP level:
Client                                                 Server
Request + TCP ACK of previous Server stuff ->
        <- Server "I hear you" + TCP ACK of Request piggyback
        <- Server "response" + TCP ACK of Request piggyback
But with Nagle enabled that server response is indeed going to be
delayed awaiting the delayed ACK from the client's TCP stack:
Client                                                 Server
Request + TCP ACK of previous Server stuff ->
        <- Server "I hear you" + TCP ACK of Request piggyback
TCP ACK of IHY ->
        <- Server "response" + TCP ACK of Request piggyback
However, if you have set the deferred ack max to zero on both ends and
it enables immediate ACK like I think it does, what you are really
going to see on the wire is:
Client                          Server
ACK of prev server data ->
Request ->
                                <- TCP ACK of Req
                                <- Server IHY
ACK of IHY ->
                                <- Server Response
ACK of Server Rsp ->
So, what would have been simply three segments on the wire is actually
6 segments on the wire, and in broad handwaving terms since all of
those are small, the TCP level CPU utilization is 2X what it would
have otherwise been.
If you just set TCP_NODELAY (disable Nagle) on both ends it should
become a three-segment exchange. I am surprised that it was necessary
to set both naglim_def and deferred_ack.
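
For reference, setting the option is a one-liner at the sockets level. A minimal sketch in Python follows; the helper name disable_nagle is made up for illustration. On your Java client the equivalent call would be Socket.setTcpNoDelay(true).

```python
import socket


def disable_nagle(sock):
    """Set TCP_NODELAY on a TCP socket and report whether it took effect."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Read the option back; a nonzero value means Nagle is disabled.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


# The option can be set before the socket is ever connected.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(disable_nagle(s))
```

Note that this is per-connection and has to be done by whichever process owns the socket, which is why the server side needs the vendor's cooperation.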
Given the behaviour you have described, the server software vendor
does indeed have a bug if they offer no way to set TCP_NODELAY on the
connection. Hold their feet to the fire to get one.
Meanwhile, are the message sizes pretty much predictable? Notice that
the Nagle algorithm takes the connection MSS (Maximum Segment Size)
into consideration. If the server response is the larger of the
messages, you could set the PathMTU on the server to be that plus 20
or so bytes and then the send of server response would indeed be >=
MSS and would go out immediately even without setting TCP_NODELAY.
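
The arithmetic behind that trick, as a rough sketch: it assumes IPv4 with 20-byte IP and TCP headers and no TCP options, and the 200-byte message size is purely hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options (e.g. no timestamps)


def mss_for_mtu(mtu):
    """MSS implied by a given path MTU under the assumptions above."""
    return mtu - IPV4_HEADER - TCP_HEADER


def send_is_immediate(msg_len, mtu):
    """True if a single send of msg_len bytes is >= MSS (Nagle step 1)."""
    return msg_len >= mss_for_mtu(mtu)


msg_len = 200  # hypothetical size of the server response
print(send_is_immediate(msg_len, 1500))          # False: MSS is 1460
print(send_is_immediate(msg_len, msg_len + 20))  # True: MSS is 180
```

Whether a given stack will accept an MTU that small is a separate question and worth checking before relying on it.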
Now, if the server app vendor drags their feet (normally the issue is
getting those sorts of folks to _remove_ a bogus TCP_NODELAY setting),
one could in theory write a small shim library which intercepted,
say, the accept() or connect() call (depending on which way the server
worked) and added a setsockopt() call to set TCP_NODELAY.
Now, since you mention multiple requests outstanding at a time, keep
in mind that setting TCP_NODELAY is an explicit tradeoff of perceived
latency against aggregate throughput. The semantics of stock trading
may require that, I suppose.
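
To put a rough number on that tradeoff, here is a toy model that counts segments for a burst of small sends issued before any ACK comes back. The function and the sizes are made up for illustration; it ignores the congestion window, timers and everything else a real stack does.

```python
def count_segments(send_sizes, mss, nodelay):
    """Count segments for a burst of sends issued before any ACK returns."""
    if nodelay:
        # Every send goes out at once, split only if larger than the MSS.
        return sum(-(-size // mss) for size in send_sizes)
    segments = 0
    queued = 0
    unacked = False
    for size in send_sizes:
        queued += size
        # Full-sized segments always go out (modeling choice: the
        # sub-MSS remainder stays queued).
        while queued >= mss:
            segments += 1
            queued -= mss
            unacked = True
        # A small send goes out only if the connection is idle.
        if queued and not unacked:
            segments += 1
            queued = 0
            unacked = True
    if queued:
        segments += 1  # the held data is flushed when the ACK arrives
    return segments


burst = [100] * 10  # ten hypothetical 100-byte messages
print(count_segments(burst, 1460, nodelay=True))   # 10
print(count_segments(burst, 1460, nodelay=False))  # 2
```

Ten segments instead of two is the throughput cost; the latency benefit is that none of the ten messages waits for an ACK.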
rick jones
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway...

feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...