Networking Forums

Networking Forums > Computer Networking > Linux Networking > syslog server, RH ES 4, large amounts of UDP loss. please help

Reply
Thread Tools Display Modes

syslog server, RH ES 4, large amounts of UDP loss. please help

 
 
guser@packetstorm.org
Guest
Posts: n/a

 
      08-02-2006, 08:29 PM
I am having trouble pinpointing why the system is dropping so many UDP
packets, I could really use some help here as I was not expecting to
see such a high number of packet errors (so far we have 3 systems
sending remote syslog to this system which is equating to about
1MB/minute. I have hundreds of systems to eventually configure and can
build more syslog servers but this is going to be hard to justify if 1
system is having trouble with the load from 3
servers).

If there is any other information I can provide to help answer this I
will do my best.

thanks,

netstat -su
Udp:
65321 packets received
21 packets to unknown port received.
28401 packet receive errors
21829 packets sent

System: Dell Precision 650
3190.755 MHz processor
Memory: 1032492k/1048020k available
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex

Install was minimal. Then I turned everything off via chkconfig
excluding syslog (reconfigured for remote syslog accepting) and sshd.
iptables (used fwbuilder) is configured to block all in/outbound
requests by default. The holes are poked in for ssh and syslog,
outgoing for dns queries, and ntp.

my sysctl.conf has the following mods which was put together after
looking at these and a few other docs
http://www.29west.com/docs/THPM/udp-buffer-sizing.html
http://dsd.lbl.gov/TCP-tuning/linux.html

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_mem = 33554432 33554432 33554432


netstat -i
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR
TX-DRP TX-OVR Flg
eth0 1500 0 74924 0 0 0 18204 0
0 0 BMRU
lo 16436 0 0 0 0 0 0 0
0 0 LRU


vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system--
----cpu----
r b swpd free buff cache si so bi bo in cs us
sy id wa
0 1 0 885752 39428 74192 0 0 37 841 1215 366 0
1 43 56
0 1 0 885696 39428 74192 0 0 0 1156 1249 442 0
0 25 75
0 1 0 885640 39436 74184 0 0 0 1464 1295 545 1
1 8 90
0 0 0 885576 39436 74184 0 0 0 252 1046 104 0
0 84 16
0 1 0 885548 39436 74444 0 0 0 920 1331 422 0
1 43 56
0 0 0 885492 39436 74444 0 0 0 1232 1292 543 0
0 22 78
0 1 0 885464 39436 74444 0 0 0 900 1219 351 0
0 41 59
0 0 0 885436 39436 74444 0 0 0 1140 1205 434 0
1 25 74
0 0 0 885436 39436 74444 0 0 0 0 1022 25 0
0 100 0
1 1 0 885408 39444 74436 0 0 0 1012 1316 436 0
0 38 62
0 1 0 885352 39444 74436 0 0 0 1092 1273 495 1
1 32 66
0 1 0 885324 39444 74436 0 0 0 940 1211 359 0
0 40 60
0 0 0 885296 39444 74436 0 0 0 1088 1216 434 0
1 27 72
0 0 0 885296 39444 74436 0 0 0 0 1005 7 0
0 100 0
0 1 0 885268 39452 74688 0 0 0 972 1364 430 0
0 36 64
0 0 0 885212 39452 74688 0 0 0 1448 1376 624 0
1 14 85
0 1 0 885184 39452 74688 0 0 0 1028 1238 392 0
1 37 62
0 1 0 885128 39452 74688 0 0 0 1424 1301 509 0
1 6 93
0 0 0 885100 39452 74688 0 0 0 636 1116 260 0
0 51 49
0 1 0 885072 39452 74688 0 0 0 796 1208 344 0
0 44 56

sar -n EDEV 2 10
Linux 2.6.9-34.0.2.ELsmp (systemname) 08/02/2006

03:13:39 PM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s
txcarr/s rxfram/s rxfifo/s txfifo/s
03:13:41 PM lo 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
03:13:41 PM eth0 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
03:13:41 PM sit0 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00

[note all the other 9 entries were the same as above for the sar output
command]

Some netstat -na --inet | grep ':514' output (removed any with null
result return)
udp 30576 0 0.0.0.0:514
udp 39984 0 0.0.0.0:514
udp 7056 0 0.0.0.0:514 0.0.0.0:*
udp 63504 0 0.0.0.0:514 0.0.0.0:*

 
Reply With Quote
 
 
 
 
Moe Trin
Guest
Posts: n/a

 
      08-04-2006, 12:30 AM
On 2 Aug 2006, in the Usenet newsgroup comp.os.linux.networking, in article
<(E-Mail Removed) .com>, (E-Mail Removed)
wrote:

>I am having trouble pinpointing why the system is dropping so many UDP
>packets, I could really use some help here as I was not expecting to
>see such a high number of packet errors (so far we have 3 systems
>sending remote syslog to this system which is equating to about
>1MB/minute.


That's one sh!tload of data - what the heck are you logging that generates
that much traffic? Not withstanding that, there shouldn't be that
much of a problem - as even with each log entry being a line of text (say
60 characters), that's 16,667 packets a minute - and well within the
network capability, and _should_ be well within the capability of the
daemon. (Our log daemon sees barely twice that traffic in a day!)

>I have hundreds of systems to eventually configure and can build more
>syslog servers but this is going to be hard to justify if 1 system is
>having trouble with the load from 3 servers).


If you have 300 servers, and they're all spewing at that data rate, you
are going to have an interesting disk storage problem - that's 144 Gigs
a day.

>netstat -su
>Udp:
> 65321 packets received
> 21 packets to unknown port received.
> 28401 packet receive errors
> 21829 packets sent


What kind of receive errors? (ifconfig output) Your 'netstat -i' output
doesn't show a problem, so I might suggest using tcpdump to have a look at
the packets. The '21 packets to unknown port' is what I see with windoze
crap sending to port 0.

>e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection


That _should_ be fine

>Install was minimal. Then I turned everything off via chkconfig
>excluding syslog (reconfigured for remote syslog accepting) and sshd.
>iptables (used fwbuilder) is configured to block all in/outbound
>requests by default. The holes are poked in for ssh and syslog,
>outgoing for dns queries, and ntp.


IF YOU CAN, what does it look like with the firewall stopped?

Old guy
 
Reply With Quote
 
guser@packetstorm.org
Guest
Posts: n/a

 
      08-04-2006, 12:32 PM
Thanks for the reply,

With the firewall disabled I still saw the errors. As for storage, I
have logrotate setup to rotate and gzip the log files once they reaches
500MB which gets the files to about 33MB on average. Plus this is just
the test system. Once we get it tuned we have another set of systems
with alot more storage.

I found one problem that reduced the udp errors which gave me a lead to
try something else. The problem was when I setup the local facility to
dump all the syslog data, I forgot to also add local5.none to the
/var/log/messages entry. Thus the syslog daemon was having to make two
writes per event which caused extra overhead. That change made a
significant difference in udp errors when running netstat -su.

So I called the Dell linux support and got someone who really new their
stuff and had me download the latest e1000 linux driver from Intel
which was released around 6/2006. That helped some more.

Reading the readme files on the Intel driver led me to find a way to
increase the RX ring parameter on the NIC. So I increased that from 256
to 512. The rate of error has dropped alot but then again this is all
overnight so the logging rate slows down

netstat -su
Udp:
1839508 packets received
26 packets to unknown port received.
1510 packet receive errors
789415 packets sent

What I am now wondering is what netstat -su is really reporting as
errors. When I run ethtool -S eth0 I get a different set of stats
altogether. Below I see no recieve errors, but netstat reports recieve
errors. So is netstat looking at the errors from some kernel counter
while ethtool looks at the errors from the nic?

NIC statistics:
rx_packets: 3416841
tx_packets: 4004531
rx_bytes: 846988732
tx_bytes: 96819577
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 907
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 1391641
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 846988732
rx_csum_offload_good: 3398423
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0

Thanks again.




Moe Trin wrote:
> On 2 Aug 2006, in the Usenet newsgroup comp.os.linux.networking, in article
> <(E-Mail Removed) .com>, (E-Mail Removed)
> wrote:
>
> >I am having trouble pinpointing why the system is dropping so many UDP
> >packets, I could really use some help here as I was not expecting to
> >see such a high number of packet errors (so far we have 3 systems
> >sending remote syslog to this system which is equating to about
> >1MB/minute.

>
> That's one sh!tload of data - what the heck are you logging that generates
> that much traffic? Not withstanding that, there shouldn't be that
> much of a problem - as even with each log entry being a line of text (say
> 60 characters), that's 16,667 packets a minute - and well within the
> network capability, and _should_ be well within the capability of the
> daemon. (Our log daemon sees barely twice that traffic in a day!)
>
> >I have hundreds of systems to eventually configure and can build more
> >syslog servers but this is going to be hard to justify if 1 system is
> >having trouble with the load from 3 servers).

>
> If you have 300 servers, and they're all spewing at that data rate, you
> are going to have an interesting disk storage problem - that's 144 Gigs
> a day.
>
> >netstat -su
> >Udp:
> > 65321 packets received
> > 21 packets to unknown port received.
> > 28401 packet receive errors
> > 21829 packets sent

>
> What kind of receive errors? (ifconfig output) Your 'netstat -i' output
> doesn't show a problem, so I might suggest using tcpdump to have a look at
> the packets. The '21 packets to unknown port' is what I see with windoze
> crap sending to port 0.
>
> >e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

>
> That _should_ be fine
>
> >Install was minimal. Then I turned everything off via chkconfig
> >excluding syslog (reconfigured for remote syslog accepting) and sshd.
> >iptables (used fwbuilder) is configured to block all in/outbound
> >requests by default. The holes are poked in for ssh and syslog,
> >outgoing for dns queries, and ntp.

>
> IF YOU CAN, what does it look like with the firewall stopped?
>
> Old guy


 
Reply With Quote
 
Moe Trin
Guest
Posts: n/a

 
      08-05-2006, 01:50 AM
On 4 Aug 2006, in the Usenet newsgroup comp.os.linux.networking, in article
<(E-Mail Removed) om>, (E-Mail Removed)
wrote:

>With the firewall disabled I still saw the errors.


OK - that was a wild chance anyway

>As for storage, I have logrotate setup to rotate and gzip the log files
>once they reaches 500MB which gets the files to about 33MB on average.
>Plus this is just the test system. Once we get it tuned we have another
>set of systems with alot more storage.


I hope you have an efficient log parsing program - that's an awful lot
of data. Our log server barely sees 2 Megabytes a day, which equates to
roughly 560 pages of text.

>I found one problem that reduced the udp errors which gave me a lead to
>try something else.


[snip several things that cascade to overwhelm the stack and daemon]

>The rate of error has dropped alot but then again this is all
>overnight so the logging rate slows down
>
>netstat -su
>Udp:
> 1839508 packets received
> 26 packets to unknown port received.


Is this net publicly reachable? That could be windoze crap.

> 1510 packet receive errors


Better - but still shouldn't be happening

> 789415 packets sent


Over what period of time?

>What I am now wondering is what netstat -su is really reporting as
>errors. When I run ethtool -S eth0 I get a different set of stats
>altogether. Below I see no recieve errors, but netstat reports recieve
>errors. So is netstat looking at the errors from some kernel counter
>while ethtool looks at the errors from the nic?


I don't know - the only sure way to find out is going to be to look at the
kernel and ethtool source code. As a _guess_ I'd suggest that the NIC
stats are only looking at the Ethernet level errors in the stack. Thus,
the packets are being lifted out of the hardware fast enough to avoid
overruns, they aren't short, partials, or have _overall_ CRC errors. I'll
come back to this below. The netstat errors relate to IP and TCP/UDP/ICMP
errors - higher in the stack. Now, if the Ethernet errors are nil, and
the UDP errors exist, then either the sending station is creating crap
data (that is cleanly being carried over the Ethernet without errors), or
the higher levels on the receiving system stack are tripping over themselves.
This _could_ be a scheduling problem - how busy is the receiving system?
What is the load averages? One presumes you aren't running X on this box -
does the error ratio improve if you either 'nice' the daemon to a higher
priority, OR reduce the number of packets coming in?

>NIC statistics:
> rx_packets: 3416841
> tx_packets: 4004531
> rx_bytes: 846988732
> tx_bytes: 96819577


846988732 / 3416841 = 246 bytes average
96819577 / 4004531 = 24 bytes average

How long? (3416841 + 4004531 ) / 8 hours = 260 packets/second.
(846988732 + 96819577 ) / 8 hours = 32771 bytes/second = 262 kilobits/second
which should be well within a 10BaseT network, never mind a Gigabit card.

I don't see an obvious network problem... I dunno, I'd look at the ps
output to see if your logging daemon is getting enough time. Like I say,
ours doesn't see anywhere near that much traffic.

Old guy
 
Reply With Quote
 
guser@packetstorm.org
Guest
Posts: n/a

 
      08-07-2006, 01:01 PM

Moe Trin wrote:
> On 4 Aug 2006, in the Usenet newsgroup comp.os.linux.networking, in article
> <(E-Mail Removed) om>, (E-Mail Removed)
> wrote:
>
> >With the firewall disabled I still saw the errors.

>
> OK - that was a wild chance anyway
>
> >As for storage, I have logrotate setup to rotate and gzip the log files
> >once they reaches 500MB which gets the files to about 33MB on average.
> >Plus this is just the test system. Once we get it tuned we have another
> >set of systems with alot more storage.

>
> I hope you have an efficient log parsing program - that's an awful lot
> of data. Our log server barely sees 2 Megabytes a day, which equates to
> roughly 560 pages of text.


Right now log parsing is not yet done until the requirements are all
defined. I will probably use SEC.

>
> >I found one problem that reduced the udp errors which gave me a lead to
> >try something else.

>
> [snip several things that cascade to overwhelm the stack and daemon]
>
> >The rate of error has dropped alot but then again this is all
> >overnight so the logging rate slows down
> >
> >netstat -su
> >Udp:
> > 1839508 packets received
> > 26 packets to unknown port received.

>
> Is this net publicly reachable? That could be windoze crap.
>


It is not publicly reachable but the internal network it resides on has
a whole lot of windows systems. I thought that with iptables blocking
all inbound udp except syslog that tools like netstat would not record
that data. It could be I was mistaken in that and the udp error stats
are being recorded pre iptables filtering.

> > 1510 packet receive errors

>
> Better - but still shouldn't be happening
>
> > 789415 packets sent

>
> Over what period of time?


The bit rate according to mrtg is:

Avg In bits/sec 59396.94
Avg Out bits/sec 58998.77
Max In bits/sec 541861.30
Max Out bits/sec 5102595
IFace Speed 100mps

The max rate is due to when the remote backups kick off.

>
> >What I am now wondering is what netstat -su is really reporting as
> >errors. When I run ethtool -S eth0 I get a different set of stats
> >altogether. Below I see no recieve errors, but netstat reports recieve
> >errors. So is netstat looking at the errors from some kernel counter
> >while ethtool looks at the errors from the nic?

>
> I don't know - the only sure way to find out is going to be to look at the
> kernel and ethtool source code. As a _guess_ I'd suggest that the NIC
> stats are only looking at the Ethernet level errors in the stack. Thus,
> the packets are being lifted out of the hardware fast enough to avoid
> overruns, they aren't short, partials, or have _overall_ CRC errors. I'll
> come back to this below. The netstat errors relate to IP and TCP/UDP/ICMP
> errors - higher in the stack. Now, if the Ethernet errors are nil, and
> the UDP errors exist, then either the sending station is creating crap
> data (that is cleanly being carried over the Ethernet without errors), or
> the higher levels on the receiving system stack are tripping over themselves.
> This _could_ be a scheduling problem - how busy is the receiving system?
> What is the load averages? One presumes you aren't running X on this box -
> does the error ratio improve if you either 'nice' the daemon to a higher
> priority, OR reduce the number of packets coming in?


The boxes response times avg 250ms
CPU utilization is close to 0 till the backups hit then it nears 1.
Memory utilization is interesting. Looks like syslog may have a memory
leak:
starts around 161MB at 4am, increases at a steady rate to about 390mb
and goes back down again to about 161mb at 4am when logrotate kicks
off.
Virtual memory is a constant 163.84 kb. Allocated but apparently never
needed for use.

This is a minimal install. Nothing but the base requirements for
install RHES4.

I reniced the syslog process to -10. See if this helps. If it does ill
need to change the syslog init script to run syslogd at a higher
priority.

thanks again for the help

>
> >NIC statistics:
> > rx_packets: 3416841
> > tx_packets: 4004531
> > rx_bytes: 846988732
> > tx_bytes: 96819577

>
> 846988732 / 3416841 = 246 bytes average
> 96819577 / 4004531 = 24 bytes average
>
> How long? (3416841 + 4004531 ) / 8 hours = 260 packets/second.
> (846988732 + 96819577 ) / 8 hours = 32771 bytes/second = 262 kilobits/second
> which should be well within a 10BaseT network, never mind a Gigabit card.
>
> I don't see an obvious network problem... I dunno, I'd look at the ps
> output to see if your logging daemon is getting enough time. Like I say,
> ours doesn't see anywhere near that much traffic.
>
> Old guy


 
Reply With Quote
 
 
 
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
64-bit W2K3 Server - network connectivity loss Russell Cooper Windows Networking 1 04-11-2008 08:08 AM
Seeing large amounts of data on udp recv-q, only sending 1 byte of data ssussman@starentnetworks.com Linux Networking 3 05-25-2007 03:44 PM
free syslog server joseph Windows Networking 1 12-04-2006 01:00 AM
Syslog parser wanted to replace Kiwi Syslog (win32) Jurgen.Turrekens@gmail.com Linux Networking 0 01-19-2006 01:08 PM
Windoze > Linux Syslog server KM Linux Networking 2 06-30-2005 08:25 AM



1 2 3 4 5 6 7 8 9 10 11