syslog server, RH ES 4, large amounts of UDP loss. please help

Discussion in 'Linux Networking' started by guser, Aug 2, 2006.

  1. guser

    guser Guest

    I am having trouble pinpointing why the system is dropping so many UDP
    packets, I could really use some help here as I was not expecting to
    see such a high number of packet errors (so far we have 3 systems
    sending remote syslog to this system which is equating to about
    1MB/minute. I have hundreds of systems to eventually configure and can
    build more syslog servers but this is going to be hard to justify if 1
    system is having trouble with the load from 3
    servers).

    If there is any other information I can provide to help answer this I
    will do my best.

    thanks,

    netstat -su
    Udp:
    65321 packets received
    21 packets to unknown port received.
    28401 packet receive errors
    21829 packets sent

    System: Dell Precision 650
    3190.755 MHz processor
    Memory: 1032492k/1048020k available
    e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
    e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex

    Install was minimal. Then I turned everything off via chkconfig
    excluding syslog (reconfigured for remote syslog accepting) and sshd.
    iptables (used fwbuilder) is configured to block all in/outbound
    requests by default. The holes are poked in for ssh and syslog,
    outgoing for dns queries, and ntp.

    my sysctl.conf has the following mods which was put together after
    looking at these and a few other docs
    http://www.29west.com/docs/THPM/udp-buffer-sizing.html
    http://dsd.lbl.gov/TCP-tuning/linux.html

    net.core.rmem_max = 33554432
    net.core.wmem_max = 33554432
    net.core.rmem_default = 65536
    net.core.wmem_default = 65536
    net.ipv4.tcp_rmem = 4096 87380 33554432
    net.ipv4.tcp_wmem = 4096 65536 33554432
    net.ipv4.tcp_mem = 33554432 33554432 33554432


    netstat -i
    Kernel Interface table
    Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR
    TX-DRP TX-OVR Flg
    eth0 1500 0 74924 0 0 0 18204 0
    0 0 BMRU
    lo 16436 0 0 0 0 0 0 0
    0 0 LRU


    vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system--
    ----cpu----
    r b swpd free buff cache si so bi bo in cs us
    sy id wa
    0 1 0 885752 39428 74192 0 0 37 841 1215 366 0
    1 43 56
    0 1 0 885696 39428 74192 0 0 0 1156 1249 442 0
    0 25 75
    0 1 0 885640 39436 74184 0 0 0 1464 1295 545 1
    1 8 90
    0 0 0 885576 39436 74184 0 0 0 252 1046 104 0
    0 84 16
    0 1 0 885548 39436 74444 0 0 0 920 1331 422 0
    1 43 56
    0 0 0 885492 39436 74444 0 0 0 1232 1292 543 0
    0 22 78
    0 1 0 885464 39436 74444 0 0 0 900 1219 351 0
    0 41 59
    0 0 0 885436 39436 74444 0 0 0 1140 1205 434 0
    1 25 74
    0 0 0 885436 39436 74444 0 0 0 0 1022 25 0
    0 100 0
    1 1 0 885408 39444 74436 0 0 0 1012 1316 436 0
    0 38 62
    0 1 0 885352 39444 74436 0 0 0 1092 1273 495 1
    1 32 66
    0 1 0 885324 39444 74436 0 0 0 940 1211 359 0
    0 40 60
    0 0 0 885296 39444 74436 0 0 0 1088 1216 434 0
    1 27 72
    0 0 0 885296 39444 74436 0 0 0 0 1005 7 0
    0 100 0
    0 1 0 885268 39452 74688 0 0 0 972 1364 430 0
    0 36 64
    0 0 0 885212 39452 74688 0 0 0 1448 1376 624 0
    1 14 85
    0 1 0 885184 39452 74688 0 0 0 1028 1238 392 0
    1 37 62
    0 1 0 885128 39452 74688 0 0 0 1424 1301 509 0
    1 6 93
    0 0 0 885100 39452 74688 0 0 0 636 1116 260 0
    0 51 49
    0 1 0 885072 39452 74688 0 0 0 796 1208 344 0
    0 44 56

    sar -n EDEV 2 10
    Linux 2.6.9-34.0.2.ELsmp (systemname) 08/02/2006

    03:13:39 PM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s
    txcarr/s rxfram/s rxfifo/s txfifo/s
    03:13:41 PM lo 0.00 0.00 0.00 0.00 0.00
    0.00 0.00 0.00 0.00
    03:13:41 PM eth0 0.00 0.00 0.00 0.00 0.00
    0.00 0.00 0.00 0.00
    03:13:41 PM sit0 0.00 0.00 0.00 0.00 0.00
    0.00 0.00 0.00 0.00

    [note all the other 9 entries were the same as above for the sar output
    command]

    Some netstat -na --inet | grep ':514' output (removed any with null
    result return)
    udp 30576 0 0.0.0.0:514
    udp 39984 0 0.0.0.0:514
    udp 7056 0 0.0.0.0:514 0.0.0.0:*
    udp 63504 0 0.0.0.0:514 0.0.0.0:*
     
    guser, Aug 2, 2006
    #1
    1. Advertisements

  2. guser

    Moe Trin Guest

    That's one sh!tload of data - what the heck are you logging that generates
    that much traffic? Not withstanding that, there shouldn't be that
    much of a problem - as even with each log entry being a line of text (say
    60 characters), that's 16,667 packets a minute - and well within the
    network capability, and _should_ be well within the capability of the
    daemon. (Our log daemon sees barely twice that traffic in a day!)
    If you have 300 servers, and they're all spewing at that data rate, you
    are going to have an interesting disk storage problem - that's 144 Gigs
    a day.
    What kind of receive errors? (ifconfig output) Your 'netstat -i' output
    doesn't show a problem, so I might suggest using tcpdump to have a look at
    the packets. The '21 packets to unknown port' is what I see with windoze
    crap sending to port 0.
    That _should_ be fine
    IF YOU CAN, what does it look like with the firewall stopped?

    Old guy
     
    Moe Trin, Aug 4, 2006
    #2
    1. Advertisements

  3. guser

    guser Guest

    Thanks for the reply,

    With the firewall disabled I still saw the errors. As for storage, I
    have logrotate setup to rotate and gzip the log files once they reaches
    500MB which gets the files to about 33MB on average. Plus this is just
    the test system. Once we get it tuned we have another set of systems
    with alot more storage.

    I found one problem that reduced the udp errors which gave me a lead to
    try something else. The problem was when I setup the local facility to
    dump all the syslog data, I forgot to also add local5.none to the
    /var/log/messages entry. Thus the syslog daemon was having to make two
    writes per event which caused extra overhead. That change made a
    significant difference in udp errors when running netstat -su.

    So I called the Dell linux support and got someone who really new their
    stuff and had me download the latest e1000 linux driver from Intel
    which was released around 6/2006. That helped some more.

    Reading the readme files on the Intel driver led me to find a way to
    increase the RX ring parameter on the NIC. So I increased that from 256
    to 512. The rate of error has dropped alot but then again this is all
    overnight so the logging rate slows down

    netstat -su
    Udp:
    1839508 packets received
    26 packets to unknown port received.
    1510 packet receive errors
    789415 packets sent

    What I am now wondering is what netstat -su is really reporting as
    errors. When I run ethtool -S eth0 I get a different set of stats
    altogether. Below I see no recieve errors, but netstat reports recieve
    errors. So is netstat looking at the errors from some kernel counter
    while ethtool looks at the errors from the nic?

    NIC statistics:
    rx_packets: 3416841
    tx_packets: 4004531
    rx_bytes: 846988732
    tx_bytes: 96819577
    rx_errors: 0
    tx_errors: 0
    tx_dropped: 0
    multicast: 907
    collisions: 0
    rx_length_errors: 0
    rx_over_errors: 0
    rx_crc_errors: 0
    rx_frame_errors: 0
    rx_no_buffer_count: 0
    rx_missed_errors: 0
    tx_aborted_errors: 0
    tx_carrier_errors: 0
    tx_fifo_errors: 0
    tx_heartbeat_errors: 0
    tx_window_errors: 0
    tx_abort_late_coll: 0
    tx_deferred_ok: 0
    tx_single_coll_ok: 0
    tx_multi_coll_ok: 0
    tx_timeout_count: 0
    rx_long_length_errors: 0
    rx_short_length_errors: 0
    rx_align_errors: 0
    tx_tcp_seg_good: 1391641
    tx_tcp_seg_failed: 0
    rx_flow_control_xon: 0
    rx_flow_control_xoff: 0
    tx_flow_control_xon: 0
    tx_flow_control_xoff: 0
    rx_long_byte_count: 846988732
    rx_csum_offload_good: 3398423
    rx_csum_offload_errors: 0
    rx_header_split: 0
    alloc_rx_buff_failed: 0

    Thanks again.
     
    guser, Aug 4, 2006
    #3
  4. guser

    Moe Trin Guest

    OK - that was a wild chance anyway
    I hope you have an efficient log parsing program - that's an awful lot
    of data. Our log server barely sees 2 Megabytes a day, which equates to
    roughly 560 pages of text.
    [snip several things that cascade to overwhelm the stack and daemon]
    Is this net publicly reachable? That could be windoze crap.
    Better - but still shouldn't be happening
    Over what period of time?
    I don't know - the only sure way to find out is going to be to look at the
    kernel and ethtool source code. As a _guess_ I'd suggest that the NIC
    stats are only looking at the Ethernet level errors in the stack. Thus,
    the packets are being lifted out of the hardware fast enough to avoid
    overruns, they aren't short, partials, or have _overall_ CRC errors. I'll
    come back to this below. The netstat errors relate to IP and TCP/UDP/ICMP
    errors - higher in the stack. Now, if the Ethernet errors are nil, and
    the UDP errors exist, then either the sending station is creating crap
    data (that is cleanly being carried over the Ethernet without errors), or
    the higher levels on the receiving system stack are tripping over themselves.
    This _could_ be a scheduling problem - how busy is the receiving system?
    What is the load averages? One presumes you aren't running X on this box -
    does the error ratio improve if you either 'nice' the daemon to a higher
    priority, OR reduce the number of packets coming in?
    846988732 / 3416841 = 246 bytes average
    96819577 / 4004531 = 24 bytes average

    How long? (3416841 + 4004531 ) / 8 hours = 260 packets/second.
    (846988732 + 96819577 ) / 8 hours = 32771 bytes/second = 262 kilobits/second
    which should be well within a 10BaseT network, never mind a Gigabit card.

    I don't see an obvious network problem... I dunno, I'd look at the ps
    output to see if your logging daemon is getting enough time. Like I say,
    ours doesn't see anywhere near that much traffic.

    Old guy
     
    Moe Trin, Aug 5, 2006
    #4
  5. guser

    guser Guest

    Right now log parsing is not yet done until the requirements are all
    defined. I will probably use SEC.
    It is not publicly reachable but the internal network it resides on has
    a whole lot of windows systems. I thought that with iptables blocking
    all inbound udp except syslog that tools like netstat would not record
    that data. It could be I was mistaken in that and the udp error stats
    are being recorded pre iptables filtering.
    The bit rate according to mrtg is:

    Avg In bits/sec 59396.94
    Avg Out bits/sec 58998.77
    Max In bits/sec 541861.30
    Max Out bits/sec 5102595
    IFace Speed 100mps

    The max rate is due to when the remote backups kick off.
    The boxes response times avg 250ms
    CPU utilization is close to 0 till the backups hit then it nears 1.
    Memory utilization is interesting. Looks like syslog may have a memory
    leak:
    starts around 161MB at 4am, increases at a steady rate to about 390mb
    and goes back down again to about 161mb at 4am when logrotate kicks
    off.
    Virtual memory is a constant 163.84 kb. Allocated but apparently never
    needed for use.

    This is a minimal install. Nothing but the base requirements for
    install RHES4.

    I reniced the syslog process to -10. See if this helps. If it does ill
    need to change the syslog init script to run syslogd at a higher
    priority.

    thanks again for the help
     
    guser, Aug 7, 2006
    #5
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.