Linux TCP - unexpected retransmissions

Discussion in 'Linux Networking' started by Francois, May 28, 2007.

  1. Francois

    Francois Guest

    This may not be the proper newsgroup but any help would be greatly
    appreciated.

    We are working on an embedded system that has a number of PowerQUICC
    processors running Linux. During normal operation, processors exchange
    small messages (< 100 bytes) using TCP. We have a response time
    requirement of about 100 milliseconds and we observed that sometimes
    we have a long latency in transporting messages between nodes of the
    system (e.g., > 200 milliseconds across an Ethernet link), resulting in
    response times exceeding our requirement. This latency occurs randomly
    at different places and on different interface types. We set the
    socket TCP_NODELAY option, and tried different settings (proc file ipv4
    options) and test programs to isolate the root cause of the latency,
    with no success.
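
    For reference, a minimal sketch of how the no-delay option is usually
    set and verified, written in Python for brevity (the original
    application is presumably C, where the equivalent call is setsockopt
    with IPPROTO_TCP and TCP_NODELAY). The helper names are hypothetical.
    Note that this option only disables Nagle's algorithm on the sending
    side; it does not change how quickly the peer acknowledges data.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY is a TCP-level option, so the level is IPPROTO_TCP,
    # not SOL_SOCKET.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Read the option back to confirm that it actually took effect."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

    Reading the option back is a cheap way to rule out the case where it
    was set on the wrong socket (for example on the listening socket only
    on a platform where accepted sockets do not inherit it).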

    We can reproduce the latency using a small application where two
    PowerQUICC cards randomly send each other bursts of messages across an
    Ethernet link. For this test, we are using the 2.6.16 kernel. We used a
    sniffer to capture data across the Ethernet link and found that
    sometimes, when both TCPs send each other messages at about the same
    time (segments 5 and 6 below), for unknown reasons the second TCP does
    not ack the message from the first TCP and a retransmission occurs
    (segment 8). We also observed that retransmissions sometimes occur
    when one TCP is busy transmitting many messages (segment 38 contains
    many application messages) while a message is being sent to it; again,
    for unknown reasons, that TCP does not ack the message, thus forcing a
    retransmission (segment 40).

    netstat reports TCP segments being retransmitted but no errors at the
    interface level. We have no reason to believe that segments are
    dropped at the physical layer. We suspect that segments are dropped at
    the TCP layer but we don't know why/where. Any ideas?
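
    One way to double-check the "no errors at the interface level"
    observation is to read the per-interface counters in /proc/net/dev
    directly. The following is a sketch in Python under the assumption
    that the file has its usual layout (two header lines, then one line
    per interface with 8 receive and 8 transmit counters); the function
    names are hypothetical.

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed")


def interface_counters(dev_text: str) -> dict:
    """Parse the text of /proc/net/dev into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    for line in dev_text.splitlines()[2:]:      # skip the two header lines
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        # Split on ":" first: older kernels may print "eth0:5000" with no
        # space between the interface name and the first counter.
        values = [int(v) for v in rest.split()]
        if len(values) < 16:
            continue
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


def interfaces_with_problems(stats: dict) -> dict:
    """Return only the non-zero error/drop style counters, per interface."""
    bad = {}
    for name, s in stats.items():
        problems = {f"rx_{k}": s["rx"][k]
                    for k in ("errs", "drop", "fifo", "frame") if s["rx"][k]}
        problems.update({f"tx_{k}": s["tx"][k]
                         for k in ("errs", "drop", "fifo", "colls", "carrier")
                         if s["tx"][k]})
        if problems:
            bad[name] = problems
    return bad
```

    On a live system the input would be open("/proc/net/dev").read(),
    sampled before and after reproducing the latency. Driver-specific
    counters (the kind ethtool reports) are not covered by this file.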


    Here is the trace, with relative sequence numbers, where we captured
    three instances of a retransmission.
    1 0.000000 TCP 4124 > 9000 [PSH, ACK] Seq=0 Ack=0 Win=9902 Len=84 TSV=15025917 TSER=16502810
    2 0.039817 TCP 9000 > 4124 [ACK] Seq=0 Ack=84 Win=2896 Len=0 TSV=16502926 TSER=15025917
    3 0.080062 TCP 9000 > 4124 [PSH, ACK] Seq=0 Ack=84 Win=2896 Len=8 TSV=16502936 TSER=15025917
    4 0.080103 TCP 4124 > 9000 [ACK] Seq=84 Ack=8 Win=9902 Len=0 TSV=15025937 TSER=16502936
    5 0.583935 TCP 4124 > 9000 [PSH, ACK] Seq=84 Ack=8 Win=9902 Len=8 TSV=15026063 TSER=16502936
    6 0.583940 TCP 9000 > 4124 [PSH, ACK] Seq=8 Ack=84 Win=2896 Len=8 TSV=16503062 TSER=15025937
    7 0.583985 TCP 4124 > 9000 [ACK] Seq=92 Ack=16 Win=9902 Len=0 TSV=15026063 TSER=16503062
    8 0.795861 TCP [TCP Retransmission] 4124 > 9000 [PSH, ACK] Seq=84 Ack=16 Win=9902 Len=8 TSV=15026116 TSER=16503062
    9 0.796059 TCP 9000 > 4124 [ACK] Seq=16 Ack=92 Win=2896 Len=0 TSV=16503115 TSER=15026116
    10 0.797151 TCP 9000 > 4124 [PSH, ACK] Seq=16 Ack=92 Win=2896 Len=8 TSV=16503115
    11 0.797194 TCP 4124 > 9000 [ACK] Seq=92 Ack=24 Win=9902 Len=0 TSV=15026116 TSER=16503115
    12 1.088260 TCP 4124 > 9000 [PSH, ACK] Seq=92 Ack=24 Win=9902 Len=8 TSV=15026189

    16 6.127280 TCP 4124 > 9000 [PSH, ACK] Seq=324 Ack=2656 Win=9902 Len=8 TSV=15027449
    17 6.127289 TCP 9000 > 4124 [PSH, ACK] Seq=2656 Ack=324 Win=2896 Len=8 TSV=16504448
    18 6.127334 TCP 4124 > 9000 [ACK] Seq=332 Ack=2664 Win=9902 Len=0 TSV=15027449 TSER=16504448
    19 6.127865 TCP 9000 > 4124 [PSH, ACK] Seq=2664 Ack=332 Win=2896 Len=8 TSV=16504448
    20 6.127907 TCP 4124 > 9000 [ACK] Seq=332 Ack=2672 Win=9902 Len=0 TSV=15027449 TSER=16504448
    21 6.631221 TCP 4124 > 9000 [PSH, ACK] Seq=332 Ack=2672 Win=9902 Len=8 TSV=15027575
    22 6.631226 TCP 9000 > 4124 [PSH, ACK] Seq=2672 Ack=332 Win=2896 Len=8 TSV=16504574
    23 6.631260 TCP 4124 > 9000 [ACK] Seq=340 Ack=2680 Win=9902 Len=0 TSV=15027575 TSER=16504574
    24 6.839618 TCP [TCP Retransmission] 4124 > 9000 [PSH, ACK] Seq=332 Ack=2680 Win=9902 Len=8 TSV=15027627 TSER=16504574
    25 6.840379 TCP 9000 > 4124 [PSH, ACK] Seq=2680 Ack=340 Win=2896 Len=8 TSV=16504626
    26 6.840433 TCP 4124 > 9000 [ACK] Seq=340 Ack=2688 Win=9902 Len=0 TSV=15027627 TSER=16504626
    27 7.136158 TCP 4124 > 9000 [PSH, ACK] Seq=340 Ack=2688 Win=9902 Len=8 TSV=15027701
    28 7.136163 TCP 9000 > 4124 [PSH, ACK] Seq=2688 Ack=348 Win=2896 Len=8 TSV=16504700
    29 7.136164 TCP 4124 > 9000 [ACK] Seq=348 Ack=2696 Win=9902 Len=0 TSV=15027701 TSER=16504700

    31 1106.230079 TCP 9000 > 4124 [PSH, ACK] Seq=470416 Ack=58388 Win=2896 Len=84 TSV=16779507
    32 1106.230121 TCP 4124 > 9000 [ACK] Seq=58388 Ack=470500 Win=14942 Len=0 TSV=15302506
    33 1106.230402 TCP 9000 > 4124 [PSH, ACK] Seq=470500 Ack=58388 Win=2896 Len=84 TSV=16779507
    34 1106.230445 TCP 4124 > 9000 [ACK] Seq=58388 Ack=470584 Win=14942 Len=0 TSV=15302506
    35 1106.230716 TCP 9000 > 4124 [PSH, ACK] Seq=470584 Ack=58388 Win=2896 Len=84 TSV=16779507
    36 1106.230759 TCP 4124 > 9000 [ACK] Seq=58388 Ack=470668 Win=14942 Len=0 TSV=15302506
    37 1106.232746 TCP 4124 > 9000 [PSH, ACK] Seq=58388 Ack=470668 Win=14942 Len=8 TSV=15302507
    38 1106.232809 TCP 9000 > 4124 [PSH, ACK] Seq=470668 Ack=58388 Win=2896 Len=588 TSV=16779507
    39 1106.272712 TCP 4124 > 9000 [ACK] Seq=58396 Ack=471256 Win=14942 Len=0 TSV=15302517
    40 1106.440704 TCP [TCP Retransmission] 4124 > 9000 [PSH, ACK] Seq=58388 Ack=471256 Win=14942 Len=8 TSV=15302559 TSER=16779507
    41 1106.443387 TCP 9000 > 4124 [PSH, ACK] Seq=471256 Ack=58396 Win=2896 Len=8 TSV=16779560
    42 1106.443391 TCP 4124 > 9000 [ACK] Seq=58396 Ack=471264 Win=14942 Len=0 TSV=15302559
    43 1106.736707 TCP 4124 > 9000 [PSH, ACK] Seq=58396 Ack=471264 Win=14942 Len=8 TSV=15302633
    44 1106.737143 TCP 9000 > 4124 [PSH, ACK] Seq=471264 Ack=58404 Win=2896 Len=8 TSV=16779633
    45 1106.737196 TCP 4124 > 9000 [ACK] Seq=58404 Ack=471272 Win=14942 Len=0 TSV=15302633
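
    The three retransmissions in the trace above can be compared by
    subtracting the capture time of each original segment from that of
    its retransmission. The sketch below does only that arithmetic; the
    timestamps are copied from segments 5/8, 21/24 and 37/40, and the
    names are hypothetical. All three gaps come out slightly above
    200 ms. Linux's minimum retransmission timeout is 200 ms, so this
    would be consistent with the sender's retransmission timer firing,
    although the trace by itself does not show why the first copy went
    unacknowledged.

```python
# (original segment, retransmitted segment, original time, retransmit time)
# Times are the relative capture timestamps from the trace, in seconds.
RETRANSMISSIONS = [
    (5, 8, 0.583935, 0.795861),
    (21, 24, 6.631221, 6.839618),
    (37, 40, 1106.232746, 1106.440704),
]


def retransmit_delays_ms(events):
    """Return the delay before each retransmission, in milliseconds."""
    return [round((t_retx - t_orig) * 1000.0, 1)
            for _orig, _retx, t_orig, t_retx in events]


if __name__ == "__main__":
    for (orig, retx, _, _), delay in zip(RETRANSMISSIONS,
                                         retransmit_delays_ms(RETRANSMISSIONS)):
        print(f"segment {orig} resent as segment {retx} after {delay} ms")
```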
    Francois, May 28, 2007

  2. Did you try replacing whatever was in the middle (hub/switch/crossover
    cable/...)? I know you said you don't suspect the link layer, but a
    little paranoia never hurts.

    Did you try using well-tested network cards? The machine I'm using to
    write this has a built-in NIC that started mysteriously dropping packets
    when I installed FC5. Switching to a well-debugged card/driver made the
    problem go away.
    Allen McIntosh, May 29, 2007

  3. Francois

    Francois Guest

    Our system is composed of a number of embedded PowerQUICC processors
    (VME) located within a number of shelves. Processors communicate using
    point-to-point Ethernet links, or through the VME backplane. There is
    no hub or switch between them (except when we use a sniffer for
    testing purposes). We tried different cables, cards, shelves, etc., to
    isolate the root cause of this latency with no success.

    After browsing the Linux code for a while (I wish I understood it
    better), we realized that the TCP stack optimizes performance by
    separating the processing of events between user and kernel space. We
    suspect that under certain conditions (heavy burst of messages, or
    messages arriving at the same time), the stack drops or postpones
    processing of events (holding locks, buffering) causing timers to
    trigger retransmissions.

    Francois, May 29, 2007
  4. Rick Jones

    Rick Jones Guest

    ISTR there is a sysctl which controls some of that decision making -
    net.ipv4.tcp_low_latency . Maybe that will help, maybe not.
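
    A sysctl name maps to a file under /proc/sys by turning the dots
    into path separators, so the current value can be inspected without
    the sysctl tool. A small sketch in Python, with hypothetical helper
    names; it assumes no component of the name itself contains a dot
    (which can happen with per-interface entries) and returns None when
    the running kernel does not expose the knob.

```python
from pathlib import Path
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the sysctl's current value as text, or None if unavailable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        # Missing file (kernel lacks the knob) or insufficient permissions.
        return None


if __name__ == "__main__":
    print("net.ipv4.tcp_low_latency =", read_sysctl("net.ipv4.tcp_low_latency"))
```

    Changing the value means writing to the same file as root (or using
    sysctl -w); whether it helps in this case is exactly what would need
    to be measured.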

    Quite frankly, TCP isn't exactly the right protocol for firm/hard
    realtime requirements, as you have learned from experience with lost
    traffic and retransmissions. There isn't really a "perfect" protocol
    for such things though (IMO).

    rick jones
    Rick Jones, May 29, 2007
  5. Tim S

    Tim S Guest

    There's InfiniBand (which I know little about, apart from the fact that it
    exists). I dare say it would be an expensive option and totally OTT for the
    OP's application.

    However, I do wonder if the OP has considered dumping IP and just throwing
    raw Ethernet frames around? Hard to say whether it would be better or not -
    depends on the hardware setup, but it's worth a thought.


    Tim S, May 29, 2007
  6. Rick Jones

    Rick Jones Guest

    One of those damned if you do, damned if you don't things I suspect.
    One could go with direct Ethernet, but then one has to segment
    oneself, as well as deal with lost traffic. One does have the
    advantage of being able to use one's own retransmission timeouts.
    Having done that though, some months later someone will want to be able
    to run the application between two sites, without any bridging
    available and then the lack of routing (since we've ditched IP) will
    come back to haunt.

    Also, with direct Ethernet, there are only so many Ethertypes/SAPs one
    can use which may make multiple "connections" a bit difficult. The
    author might have to write her own connection multiplex/demultiplex.
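
    To make the multiplexing point concrete, here is a sketch in Python
    of what a home-grown demultiplexing header on top of a single
    Ethertype might look like. Everything about the layout is an
    assumption made for illustration: a 2-byte channel id plus a 2-byte
    sequence number following the standard 14-byte Ethernet header, and
    0x88B5 (one of the Ethertypes IEEE reserves for local experimental
    use) as the frame type.

```python
import struct

ETHERTYPE_LOCAL_EXPERIMENTAL = 0x88B5   # IEEE "Local Experimental Ethertype 1"
ETH_HEADER = struct.Struct("!6s6sH")    # destination MAC, source MAC, Ethertype
MUX_HEADER = struct.Struct("!HH")       # hypothetical: channel id, sequence number


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return raw


def build_frame(dst: str, src: str, channel: int, seq: int,
                payload: bytes) -> bytes:
    """Build an Ethernet frame (without FCS) carrying one multiplexed message."""
    return (ETH_HEADER.pack(mac_to_bytes(dst), mac_to_bytes(src),
                            ETHERTYPE_LOCAL_EXPERIMENTAL)
            + MUX_HEADER.pack(channel, seq)
            + payload)


def parse_frame(frame: bytes):
    """Return (channel, seq, payload), or None if the frame is not ours."""
    if len(frame) < ETH_HEADER.size + MUX_HEADER.size:
        return None
    _dst, _src, ethertype = ETH_HEADER.unpack_from(frame, 0)
    if ethertype != ETHERTYPE_LOCAL_EXPERIMENTAL:
        return None
    channel, seq = MUX_HEADER.unpack_from(frame, ETH_HEADER.size)
    # Short frames are padded on the wire to the Ethernet minimum, so a real
    # header would also need a length field to strip that padding.
    return channel, seq, frame[ETH_HEADER.size + MUX_HEADER.size:]
```

    On Linux such frames would typically be sent and received through an
    AF_PACKET socket, which needs elevated privileges. Loss detection,
    retransmission and segmentation of large messages would still have
    to be built on top, as noted above.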

    rick jones
    Rick Jones, May 29, 2007
  7. Tim S

    Tim S Guest

    Yes - I should clarify. I've seriously considered using plain ethernet for a
    point-to-point link where one half of the link is hosted by a fairly dumb
    embedded system (too dumb to run a "proper" OS, but highly specialised for
    its task) and where the link's purpose is to feed data to a more
    intelligent but less specialised embedded board.


    Tim S, May 29, 2007
  8. Dan N

    Dan N Guest

    That sounds like a reasonable explanation to me. Or the link layer drops
    data because of timing constraints and/or limited resources, so the TCP
    stack never sees it.

    Others have suggested using a link layer protocol only, but what about using
    UDP?

    Dan N, May 30, 2007
  9. Francois

    Francois Guest

    We have considered using UDP. Although feasible, it would be a
    significant amount of work, not so much to implement as to prove
    correct. Rightly or wrongly, we made a number of assumptions early
    on in the design that were driven by the fact that we used TCP; thus
    there would be a need to implement additional services on top of UDP
    and prove their correctness.
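
    As a rough illustration of the kind of "additional services" this
    implies, the sketch below (Python, no real sockets) tracks
    unacknowledged datagrams and decides when to resend them with an
    application-chosen timeout. The 5-byte header, the 20 ms timeout
    and all names are hypothetical. Ordering, duplicate suppression at
    the receiver and flow control are deliberately left out, and they
    are a large part of what would have to be proven correct.

```python
import struct

HEADER = struct.Struct("!BI")   # hypothetical header: type, sequence number
DATA, ACK = 0, 1


class RetransmitQueue:
    """Tracks unacknowledged datagrams and decides when to resend them."""

    def __init__(self, timeout_s: float = 0.02, max_tries: int = 5):
        self.timeout_s = timeout_s
        self.max_tries = max_tries
        self.next_seq = 0
        self.pending = {}           # seq -> [datagram, deadline, tries]

    def send(self, payload: bytes, now: float) -> bytes:
        """Wrap payload in a DATA datagram and remember it until acked."""
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) & 0xFFFFFFFF
        datagram = HEADER.pack(DATA, seq) + payload
        self.pending[seq] = [datagram, now + self.timeout_s, 1]
        return datagram

    def on_ack(self, datagram: bytes) -> None:
        """Forget the datagram that an incoming ACK refers to."""
        kind, seq = HEADER.unpack_from(datagram)
        if kind == ACK:
            self.pending.pop(seq, None)

    def due(self, now: float) -> list:
        """Return datagrams to resend now; raise once one exhausts its tries."""
        resend = []
        for seq, entry in self.pending.items():
            datagram, deadline, tries = entry
            if now < deadline:
                continue
            if tries >= self.max_tries:
                raise TimeoutError(f"seq {seq} unacknowledged after {tries} tries")
            entry[1] = now + self.timeout_s
            entry[2] = tries + 1
            resend.append(datagram)
        return resend


def make_ack(data_datagram: bytes) -> bytes:
    """Build the ACK a receiver would send back for a DATA datagram."""
    _kind, seq = HEADER.unpack_from(data_datagram)
    return HEADER.pack(ACK, seq)
```

    The caller would pass time.monotonic() as now, send whatever due()
    returns over the UDP socket, and feed received ACKs to on_ack().
    Passing the clock in explicitly keeps the timing logic testable
    without a network.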

    We first wanted to isolate the root cause of this latency. As
    described above, we suspect the problem is related to the TCP stack but we
    have not proven this yet. We were hoping someone on the net would
    confirm that either the current design of the Linux TCP stack could
    result in such behaviour, or that this is a bug, and even better point us
    towards a fix.

    Francois, May 30, 2007
  10. Rick Jones

    Rick Jones Guest

    If you have already checked all the stats available in Linux (netstat
    -s and ethtool) and they are indeed clean, and then have checked the
    stats on the switches (for those situations where switches were used),
    and a tcpdump trace, or perhaps better still some external packet
    sniffing with a sufficiently powerful third system (and perhaps a
    hub) shows actual symptoms of packet loss, then it would seem that you
    have encountered a situation where there are points in the stack which
    can drop packets, but not increment a stat.

    That would be a bug.

    You may need to start perusing the source of the entire path looking
    for places where this might be the case. You would then need to
    kludge-in some counters of your own (perhaps just simple printk's even
    as a start) to see what might be going-on. If you get your Linux bits
    from a commercial source, you could fire-up your support contract and
    start getting them to do some of that - the source code perusal and
    perhaps quick and dirty counters at least.
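
    Before adding new counters, it may be worth confirming which of the
    existing ones move during a reproduction. netstat -s takes its
    numbers from /proc/net/snmp and /proc/net/netstat, which both
    consist of pairs of lines: a header line of counter names followed
    by a line of values, each starting with the same prefix. The sketch
    below (Python, hypothetical function names) parses two snapshots of
    such text and reports only the counters that changed.

```python
def parse_proc_net_table(text: str) -> dict:
    """Parse /proc/net/snmp or /proc/net/netstat text into {"Prefix.Name": int}."""
    counters = {}
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: names first, then values, with a shared "Prefix:".
    for names, values in zip(lines[0::2], lines[1::2]):
        if names[0] != values[0] or len(names) != len(values):
            raise ValueError(f"mismatched header/value pair for {names[0]}")
        prefix = names[0].rstrip(":")
        for name, value in zip(names[1:], values[1:]):
            counters[f"{prefix}.{name}"] = int(value)
    return counters


def changed_counters(before: dict, after: dict) -> dict:
    """Return {counter: delta} for every counter whose value differs."""
    return {key: after[key] - before.get(key, 0)
            for key in after
            if after[key] != before.get(key, 0)}


def snapshot(paths=("/proc/net/snmp", "/proc/net/netstat")) -> dict:
    """Read and merge the counter files of the running system."""
    merged = {}
    for path in paths:
        with open(path) as f:
            merged.update(parse_proc_net_table(f.read()))
    return merged
```

    Taking one snapshot() before the test and one after a
    retransmission has been observed, then printing changed_counters(),
    shows whether anything besides the retransmission counters moved.
    If nothing else did, that supports the theory of a drop that no
    stat accounts for.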

    rick jones
    Rick Jones, May 30, 2007
