I posted the message below to linux.kernel, but I realize that it may be
more appropriate for this forum...
-----
We have an Intel SE7501WV2A system running Linux kernel 2.4.32 that is
crashing every 1-4 days. The system has two on-board Intel
PRO/1000 MT Server Network Connections (Intel 82546EB Controller) that
are both in use. We did extensive hardware diagnostics on the machine,
and came up with no hardware errors. We got a similarly configured
Dell Precision WorkStation 450, which has an Intel 7505 chipset, and
used that as a temporary replacement for the 7501 box while we were
doing hardware testing. To our surprise, the Dell Precision box
crashed as well. We allowed the problem to happen a few more times on
the Dell Precision box just so that we could be sure.
We have a totally different Dell PowerEdge 1750 box (with Dell Intel
ServerWorks GC LE chipset) that is configured almost identically to the
above two machines, but does not have the Intel e1000 gigabit on-board.
Instead, that system has dual on-board Broadcom gigabit using the tg3
driver. This box, which has the same role as the original box (user
time sharing server), is not crashing on us at all, and more often than
not has a much higher load than the 7501 box.
When I say crash, I mean that logins to the box hang and the console
displays a black screen. Interestingly enough, however, the machine
remains pingable, and an nmap of the machine reveals the ports that it
provides services on. The machine will answer on those ports, but the
services are not available. We also write the output of "ps" to a file
once every minute to try to help us solve this problem, and that
activity stops as well. Activity prior to the crash is generally
minimal. A serial console displays nothing until the machine is
rebooted.
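A minimal sketch of this kind of once-a-minute process logging, assuming
a Python 3 interpreter is available on the box. The function names, the
"ps" arguments, and the log path are hypothetical, not what is actually
running on our server. The one point it tries to illustrate is forcing
each snapshot to disk, so that the last entry before a hang is not lost
in the page cache:

```python
import os
import subprocess
import time


def snapshot(log_path, command=("ps", "auxww")):
    """Append one timestamped run of `command` to log_path and sync it to disk."""
    output = subprocess.run(command, capture_output=True, text=True).stdout
    with open(log_path, "a") as log:
        log.write("==== %s ====\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
        log.write(output)
        log.flush()
        # fsync so the snapshot reaches the disk even if the box hangs next
        os.fsync(log.fileno())
    return output


def run_forever(log_path, interval=60):
    """Take a snapshot every `interval` seconds until the process is killed."""
    while True:
        snapshot(log_path)
        time.sleep(interval)
```

Calling run_forever("/var/log/ps-snapshots.log") (a hypothetical path)
would reproduce the once-a-minute logging; the timestamp of the last
entry then gives a rough bound on when the machine stopped scheduling
user processes.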
The one thing the crashing machines have in common is that
they both have the Intel on-board gigabit controller. I saw a few
posts on the web from people talking about having the same symptoms
(crashing Linux) when using, in particular, the e1000 module with
on-board Intel NICs. However, in the few cases that I found, the users
claimed that by upgrading the e1000 module, their mysterious crashes
went away. The version of the e1000 driver that comes with Linux
2.4.32 is an older version - 5.7.6-k1-NAPI. I've compiled 6.3.9-NAPI
from sourceforge, and put that in place on our server after the last
crash. Our server lasted 4 days, and then crashed again. After that,
it only lasted an additional 2 days and crashed again.
I cannot guarantee that the problem is related to the e1000 module.
I'm just very suspicious of it. I have no way of getting into
the system when it "crashes". The ordeal is rather frustrating!
There also doesn't seem to be much in Linux in terms of generating
kernel dumps. I enabled the sysrq sequence on my kernel, and was
hoping to be able to use the "c" command to crash the kernel and get a
dump that I could use to check out what is going on, but pretty much
every kernel dump facility that I've read about seems to work only with
2.6 or with older versions of 2.4! Further, the "c" option that I keep
reading about doesn't seem to exist in 2.4.32 or 2.6... (I think it
could be a Red Hat mod?)
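For anyone wanting to check the same setting, here is a small sketch of
reading and enabling the sysrq switch at runtime through
/proc/sys/kernel/sysrq. It assumes the kernel was built with
CONFIG_MAGIC_SYSRQ and that the script runs as root; the helper names
are made up for illustration, and it does not add a "c" command to a
kernel that lacks one:

```python
SYSRQ_PATH = "/proc/sys/kernel/sysrq"


def sysrq_enabled(path=SYSRQ_PATH):
    """Return True if the sysrq switch holds anything other than 0."""
    with open(path) as f:
        return f.read().strip() != "0"


def enable_sysrq(path=SYSRQ_PATH):
    """Turn the magic sysrq key on by writing 1 to the switch file."""
    with open(path, "w") as f:
        f.write("1\n")
```

The path is a parameter only so that the helpers can be tried against
an ordinary file before touching the real /proc entry.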
Anyhow -- does anyone have any ideas on how we might go about
diagnosing this problem? I have contacted Intel via mail for
suggestions and haven't heard anything back yet.
Thanks,
Jason.