Networking Forums

will gigabit ethernet help autonegotiation problems go away?

 
 
Michael Thomas

 
      02-28-2006, 10:25 PM
We have some new SuperMicro X6DVA-EG motherboards that have Intel
Corporation 82541GI/PI Gigabit Ethernet Controllers built in.

We noticed that these servers were having various connection problems
after some uptime. Large screen outputs would cause the connection to
hang in the middle, etc.

Our servers are running RHELv4, so we worked with Redhat, and they
suggested using ethtool to force the speed to 100 and duplex to full
(and autonegotiation off).

This worked, and all our connection problems went away. However, since
our switch is an unmanaged Netgear switch (FS516), it defaults down to
half-duplex when one side is not set to autonegotiation.

Redhat thinks that by us upgrading to a switch capable of gigabit
ethernet, the problem will go away with these particular NIC's (which
are supposedly 10/100/1000 capable).

Personally, I would think this is either a hardware or a driver issue
that needs to be resolved. However, I'm wondering if there is any
basis for that argument that going to gigabit ethernet will resolve
this.

If we switched to an unmanaged Netgear switch capable of 10/100/1000
(like JGS524), why would the autonegotiation on that switch be any
different?

Wouldn't we still encounter the same problems?

Thank you in advance for any advice anyone can provide!

-Michael Thomas

 
David Schwartz

 
      02-28-2006, 10:37 PM

"Michael Thomas" wrote:

> If we switched to an unmanaged Netgear switch capable of 10/100/1000
> (like JGS524), why would the autonegotiation on that switch be any
> different?
>
> Wouldn't we still encounter the same problems?


I would be very surprised if you encountered negotiation problems
connecting two gigabit devices together. Remember, when ethernet was first
designed, negotiation didn't exist -- there was nothing to negotiate.
However, with gigabit, not only is negotiation part of the specification
from the very beginning, but there's really nothing to negotiate. There is
no half-duplex gigabit.

DS


 
Rick Jones

 
      03-01-2006, 01:33 AM
Michael Thomas wrote:
> Our servers are running RHELv4, so we worked with Redhat, and they
> suggested using ethtool to force the speed to 100 and duplex to full
> (and autonegotiation off).


> This worked, and all our connection problems went away. However,
> since our switch is an unmanaged Netgear switch (FS516), it defaults
> down to half-duplex when one side is not set to autonegotiation.


Um, then no, your connection problems didn't really go away. You now
have a duplex mismatch between your switch and your NIC. You should
"never" hardcode one side and not the other.

How Autoneg is supposed to work:

When both sides of the link are set to autoneg, they will "negotiate"
the duplex setting and select full duplex if both sides can do
full-duplex.

If one side is hardcoded and not using autoneg, the autoneg process
will "fail" and the side trying to autoneg is required by spec to use
half-duplex mode.

If one side is using half-duplex, and the other is using full-duplex,
sorrow and woe is the usual result.

So, the following table shows what will happen given various settings
on each side:

        Auto        Half        Full
Auto    Happiness   Lucky       Sorrow
Half    Lucky       Happiness   Sorrow
Full    Sorrow      Sorrow      Happiness

Happiness means that there is a good shot of everything going well.
Lucky means that things will likely go well, but not because you did
anything correctly. Sorrow means that there _will_ be a duplex
mismatch.
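
The table can also be written down as a small lookup. This is only a minimal Python sketch of the rules described above, assuming both ends are capable of full duplex; the function names are made up for illustration and do not come from any real tool.

```python
def effective_duplex(local, peer):
    """Duplex a port ends up using, given its own setting and the peer's."""
    if local != "auto":
        return local      # a hardcoded port uses what it was told
    if peer == "auto":
        return "full"     # assumption: both ends can do full duplex
    return "half"         # autoneg "failed", so fall back to half duplex


def link_outcome(a, b):
    """Return the table entry for settings a and b ('auto', 'half', 'full')."""
    if effective_duplex(a, b) != effective_duplex(b, a):
        return "Sorrow"   # duplex mismatch
    if (a == "auto") != (b == "auto"):
        return "Lucky"    # it matches, but only thanks to the fallback
    return "Happiness"
```

For example, `link_outcome("auto", "full")` gives "Sorrow", which is the case of a forced full-duplex NIC on an autonegotiating switch port.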

When there is a duplex mismatch, on the side running half-duplex you
will see various errors and probably a number of late collisions. On
the side running full-duplex you will see things like FCS errors.
Note that those errors are not necessarily conclusive, they are simply
indicators.

Further, it is important to keep in mind that a "clean" ping (or the
like - eg "linkloop") test result is inconclusive here - a duplex
mismatch causes lost traffic _only_ when both sides of the link try to
speak at the same time. A typical ping test, being synchronous, one at
a time request/response, never tries to have both sides talking at the
same time.
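
A probe that makes both ends transmit at once is more telling than ping. Here is a minimal sketch in Python, assuming plain TCP between two hosts you control. The function names (`serve_once`, `run_client`) and the chunk size are made up for illustration, and there are no timeouts or logging, so a hang would simply show up as a call that never returns.

```python
import socket
import threading

CHUNK = 64 * 1024  # arbitrary transfer unit for this sketch


def _pump(sock, total_bytes):
    """Send total_bytes while a helper thread receives the peer's data."""
    received = [0]

    def drain():
        # Read until the peer's total_bytes have arrived or it closes.
        while received[0] < total_bytes:
            data = sock.recv(CHUNK)
            if not data:
                break
            received[0] += len(data)

    reader = threading.Thread(target=drain)
    reader.start()
    payload = b"x" * CHUNK
    sent = 0
    while sent < total_bytes:
        n = min(CHUNK, total_bytes - sent)
        sock.sendall(payload[:n])
        sent += n
    reader.join()
    return sent, received[0]


def serve_once(listener, total_bytes, result):
    """Accept one connection and exchange total_bytes in both directions."""
    conn, _ = listener.accept()
    with conn:
        result.append(_pump(conn, total_bytes))


def run_client(host, port, total_bytes):
    """Connect to the server and exchange total_bytes in both directions."""
    with socket.create_connection((host, port)) as sock:
        return _pump(sock, total_bytes)
```

Run `serve_once` on one machine with a bound, listening socket and `run_client` on the other, then compare the error and collision counters on both NICs before and after. Both ends send and receive at the same time, which is the situation a synchronous ping never creates.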

> Redhat thinks that by us upgrading to a switch capable of gigabit
> ethernet, the problem will go away with these particular NIC's (which
> are supposedly 10/100/1000 capable).


If you can try a different switch, and it happens to work, then go
with it.

> Personally, I would think this is either a hardware or a driver issue
> that needs to be resolved. However, I'm wondering if there is any
> basis for that argument that going to gigabit ethernet will resolve
> this.


> If we switched to an unmanaged Netgear switch capable of 10/100/1000
> (like JGS524), why would the autonegotiation on that switch be any
> different?


> Wouldn't we still encounter the same problems?


Not necessarily. Auto-neg is a required part of the gigabit ethernet
standard.

rick jones
--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
 
Michael Thomas

 
      03-01-2006, 01:33 AM
Rick Jones wrote:
> When there is a duplex mismatch, on the side running half-duplex you
> will see various errors and probably a number of late collisions. On
> the side running full-duplex you will see things like FCS errors.
> Note that those errors are not necessarily conclusive, they are simply
> indicators.


Hello,

Understood. We definitely don't want to stay in the current state. We
realize right now we're in the lucky state that things are working, and
we're not exactly comfortable being at half duplex on the switch.

> > Redhat thinks that by us upgrading to a switch capable of gigabit
> > ethernet, the problem will go away with these particular NIC's (which
> > are supposedly 10/100/1000 capable).

>
> If you can try a different switch, and it happens to work, then go
> with it.


This would be a purchase, rather than something available on hand. So,
we'd like to have some sort of reasonable guess that it would solve our
problems before we go out and buy it. Also, these machines are spread
across 4 switches, so we'd probably be replacing 4 of them.

We've noticed that we have machines with the same NIC's on the same
switches, and only the ones with RHELv4 are having problems... none of
the ones with RHELv3 are.

We've brought this to Redhat's attention, and they don't seem to have
an answer.

We've gone out and upgraded the e1000 driver to the newest release,
6.3.9, but it made no difference. The connection hangs still happen.

Does it at this point seem reasonable, then, to go out and replace the
switches with gigabit capable ones, so that they are likely to work
well with our RHELv4 machines?

To me, it seems like there's something in RHELv4 that is causing a
problem, and I'm a bit afraid to go and make these purchases assuming
that the problem is limited to just negotiating with a 10/100T switch,
and not something that will continue to be a problem.

Of course, I know that this is my problem. =P But, I'm just hoping
that either someone has had similar problems and found a solution
(maybe we should just downgrade all the machines to RHELv3, sigh) or
thinks that replacing the switches should be the way to go, to
compensate for RHELv4's problems. In which case, we'll bite the
bullet. =P

Thanks again!
-Michael

 
prg

 
      03-01-2006, 02:12 AM

Michael Thomas wrote:

[snip -- sorry]

>
> We've noticed that we have machines with the same NIC's on the same
> switches, and only the ones with RHELv4 are having problems... none of
> the ones with RHELv3 are.


Can't say for sure but ... you may want to look at this:
OS Compatibility Chart for E7320 Motherboards
http://www.supermicro.com/support/re...patibility.cfm

> We've brought this to Redhat's attention, and they don't seem to have
> an answer.
>
> We've gone out and upgraded the e1000 driver to the newest release,
> 6.3.9, but it made no difference. The connection hangs still happen.
>
> Does it at this point seem reasonable, then, to go out and replace the
> switches with gigabit capable ones, so that they are likely to work
> well with our RHELv4 machines?


I would investigate the above chart and contact SuperMicro for any
advice before plunking down $. Is RHELv3 ES a possibility? Centos'
version?

> To me, it seems like there's something in RHELv4 that is causing a
> problem, ...


As near as I can guess from the chart it seems these mobos don't play
well with 2.6 kernels. It's likely a combination of hardware/bios
issues that are not correctable or they would have, you guessed,
corrected them :-)

> ... and I'm a bit afraid to go and make these purchases assuming
> that the problem is limited to just negotiating with a 10/100T switch,
> and not something that is continue to be a problem.
>
> Of course, I know that this is my problem. =P But, I'm just hoping
> that either someone has had similar problems and found a solution
> (maybe we should just downgrade all the machines to RHELv3, sigh) or
> thinks that replacing the switches should be the way to go, to
> compensate for RHELv4's problems. In which case, we'll bite the
> bullet. =P


I have a pretty good notion that downgrading is your best option and
may be your only one. They are not rated for FC2, but are rated for
Suse9. My guess is that RHEL4 won't fly.

Even with RHEL3 ES (or any Linux) the SATA controller is not marked
supported for this mobo version. But in RHEL 2.1 AS it is. This must
be some older hardware re: Linux kernels.

Perhaps with a bios update and anything else offered by SuperMicro you
might try connecting two of the RHEL4 boxes with a cat6 x-over cable
and see if they will work with each other. If not, then I would say
it's likely they _won't_ work with other hardware :-(

not much help I'm afraid,
prg

 
Allen McIntosh

 
      03-01-2006, 02:15 AM
> This would be a purchase, rather than something available on hand. So,
> we'd like to have some sort of reasonable guess that it would solve our
> problems before we go out and buy it. Also, these machines are spread
> across 4 switches, so we'd probably be replacing 4 of them.


Can you get your hands on a non-Netgear "hub"? (Quotes because most of
these things seem more like switches these days...) I'm thinking
something with 4 ports from 3COM or Linksys or ..., the advantage being
that such things cost minimal $$ at your corner store :-) If so, try
putting it between the Netgear hardware and one of the machines having
trouble.

Also, what sort of shape are the cables in? I doubt that's the problem,
especially since it has bitten more than one machine, but you never know...

> We've noticed that we have machines with the same NIC's on the same
> switches, and only the ones with RHELv4 are having problems... none of
> the ones with RHELv3 are.
> We've brought this to Redhat's attention, and they don't seem to have
> an answer.
> We've gone out and upgraded the e1000 driver to the newest release,
> 6.3.9, but it made no difference. The connection hangs still happen.


Sure sounds like a kernel misfeature to me. You could go lurk on the
kernel mailing list for a while and see if anything comes up.

If you can take a server out of service for a while, try booting a
Linux-on-a-CD system (Knoppix or Ubuntu or ...) and see if the
autonegotiation mismatch persists.
 
Rick Jones

 
      03-01-2006, 05:27 PM
Michael Thomas wrote:
> We've noticed that we have machines with the same NIC's on the same
> switches, and only the ones with RHELv4 are having problems... none
> of the ones with RHELv3 are.


What happens if you set the "arp ignore" option on all the interfaces?
While it may not be related to your problem, in the past I have had
"issues" with multiple NICs connected to the same switch, configured
into different subnets where ARP would be more than happy to reply
with the MAC of the "other" NIC in the system.

I forget the precise name, but "sysctl -a | grep arp" will doubtless
find it.
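
If you would rather see the values per interface than grep through sysctl output, something like the following should do. It is a minimal Python sketch, assuming the per-interface settings live under /proc/sys/net/ipv4/conf as they do on the Linux kernels I am aware of; the function name is made up for illustration.

```python
import os


def arp_settings(conf_root="/proc/sys/net/ipv4/conf"):
    """Map each interface to its arp_* sysctl values, read as strings."""
    settings = {}
    for iface in sorted(os.listdir(conf_root)):
        iface_dir = os.path.join(conf_root, iface)
        if not os.path.isdir(iface_dir):
            continue
        for name in sorted(os.listdir(iface_dir)):
            if name.startswith("arp_"):
                with open(os.path.join(iface_dir, name)) as fh:
                    settings.setdefault(iface, {})[name] = fh.read().strip()
    return settings
```

Calling `arp_settings()` on an affected box and on a healthy one makes it easy to diff the two.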

rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway...
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
 
Michael Thomas

 
      03-01-2006, 09:20 PM

Allen McIntosh wrote:
> Also, what sort of shape are the cables in? I doubt that's the problem,
> especially since it has bitten more than one machine, but you never know...
>
> > We've noticed that we have machines with the same NIC's on the same
> > switches, and only the ones with RHELv4 are having problems... none of
> > the ones with RHELv3 are.
> > We've brought this to Redhat's attention, and they don't seem to have
> > an answer.
> > We've gone out and upgraded the e1000 driver to the newest release,
> > 6.3.9, but it made no difference. The connection hangs still happen.

>
> Sure sounds like a kernel misfeature to me. You could go lurk on the
> kernel mailing list for a while and see if anything comes up.


Hello,

An update. Cables look to be good. We tried connecting the machines
directly by a crossover cable, and everything looked good, connected at
gigabit speed. So, we ended up replacing the 4 switches with 4 Netgear
JGS524 switches (probably should have used something besides Netgear...
just to rule out a Netgear incompatibility, but it was the best we
could do at the time).

Everything seemed great... for a while. It took much longer for
connection problems to surface, but they did. So, they may have
surfaced when the machines were connected directly, but we didn't give
them enough time. I won't be physically next to the machines for a
while to test that again, but will as soon as I can.

This is pretty frustrating, because now that the machines are connected
by gigabit ethernet and the problems surface much less frequently, it
is much harder to test.

Is there some tool out there that will maintain a connection between
two machines, and log any errors it has? Maybe we can configure two
instances, one to send a continual stream of data back and forth, and
one that maintains a connection, but sends large bursts of data every
30 minutes or so?

Our connection problems definitely seem to happen only after the
connection has been maintained for a while. We start seeing hangs.
But we don't see them at all right when the connections first form.
(There is no firewall in between the machines. And we didn't see any
hangs at all for extended periods of time when we forced the NICs out
of autonegotiation... though it left us at half duplex.)

Thanks again for everyone's help!

-Michael

 
Robert

 
      03-02-2006, 12:36 AM
On Tue, 28 Feb 2006 15:25:13 -0800, Michael Thomas wrote:

> We have some new SuperMicro X6DVA-EG motherboards that have Intel
> Corporation 82541GI/PI Gigabit Ethernet Controllers built in.
>
> We noticed that these servers were having various connection problems
> after some uptime. Large screen outputs would cause the connection to
> hang in the middle, etc.


Are you sure the network is hanging? How have you tested this? I hear
all day long how it's a network problem only to find out that it was/is
an app or user problem.

> Our servers are running RHELv4, so we worked with Redhat, and they
> suggested using ethtool to force the speed to 100 and duplex to full
> (and autonegotiation off).


Which isn't a bad idea. I cannot understand for the life of me why
everyone thinks autoneg. is the best way to go. Anything that doesn't
move around should be locked down to the speed and duplex wanted. This
will solve most of your network issues.

Autoneg is only good on devices that move from connection to connection
like a Laptop.

> This worked, and all our connection problems went away. However, since
> our switch is an unmanaged Netgear switch (FS516), it defaults down to
> half-duplex when one side is not set to autonegotiation.


This is normal because it cannot autoneg. a duplex setting so it defaults
to half.

> Personally, I would think this is either a hardware or a driver issue
> that needs to be resolved. However, I'm wondering if there is any
> basis for that argument that going to gigabit ethernet will resolve
> this.


OK, what steps have been taken to track down this issue?

What does ifconfig <interface> show you?
Do you see any errors or collisions?

What does mii-tool <interface> show you?

What do the log files show you, both system and switch?


> If we switched to an unmanaged Netgear switch capable of 10/100/1000
> (like JGS524), why would the autonegotiation on that switch be any
> different?


Autoneg is Autoneg.

> Wouldn't we still encounter the same problems?


Depends on what is really causing the problem. Guessing that it's a
network issue isn't going to help you. You need to track it down, and
only then will you be able to fix it.


--

Regards
Robert

Smile... it increases your face value!


 
Michael Thomas

 
      03-02-2006, 01:10 AM

Robert wrote:
> On Tue, 28 Feb 2006 15:25:13 -0800, Michael Thomas wrote:
> > We noticed that these servers were having various connection problems
> > after some uptime. Large screen outputs would cause the connection to
> > hang in the middle, etc.

>
> Are you sure the network is hanging? How have you tested this? I hear
> all day long how it's a network problem only to find out that it was/is
> an app or user problem.


Hi Robert,

We believe it's a network issue, because when it was in half duplex
mode, all problems disappeared. There were no more hangs, and no more
errors.

We experienced problems in two scenarios. The first was that we were
receiving mod_jk errors from a server running Apache connecting to a
server running Tomcat. It worked 99% of the time, but occasionally
it would give errors that it couldn't connect to Tomcat, would retry,
and would connect successfully on retry. We spent 5 days reconfiguring
and reconfiguring Apache, Tomcat, and the mod_jk connector that
connects them. We were positive that it had to be an issue with either
those programs, or our app running on Tomcat.

Then we noticed other problems. When we ssh'd from the web server to
the db server, and ran a query that produced a large output, the output
would sometimes freeze half way through, and then about 5-10 minutes
later, the connection from the web server to the db server would
disconnect. This would only happen after the connection had been
somewhat idle for a while, before running the query again. When the
connection was first made, the output from the query would always come
through completely.

We then tried opening connections from several groups of differently
configured servers to the same db server, running the same query, and
getting the same output. Without fail, it was always the same batch of
machines that had the connection hangs in the middle of the query
output... they all had similar gigabit capable NIC's, and were all
running RHELv4.

None of the machines ever hung in the query output with RHELv3, even
after being idle for a full 24 hours.

Then, when we turned auto-negotiation off, and ended up in half-duplex
mode on the switch, all problems disappeared. None of the machines
ever hung. There were no more errors in the mod_jk logs.

This is why we came to the conclusion that there is some sort of
networking issue. All problems disappeared when the network changes
were made.

> > Our servers are running RHELv4, so we worked with Redhat, and they
> > suggested using ethtool to force the speed to 100 and duplex to full
> > (and autonegotiation off).

>
> Which isn't a bad idea. I cannot understand for the life of me why
> everyone thinks autoneg. is the best way to go. Anything that doesn't
> move around should be locked down to speed and duplex wanted. This will
> solve most of your network issue.
>
> Autoneg is only good on devices that move from connection to connection
> like a Laptop.


We want autonegotiation so we can communicate with our unmanaged
switches at full duplex.


> This is normal because it cannot autoneg. a duplex setting so it defaults
> to half.


Right. This seems like a good reason to leave autonegotiation on for
the NIC's, when you're limited by the switch in this way.


> OK, what steps have been taken to track down this issue?


I have listed most of them in the other posts in this thread. We are
also in the process of putting a non-Netgear switch in between the
machines, to see if that resolves any problems.


> What does ifconfig <interface> show you?
> Do you see any errors or collisions?


No errors or collisions. No messages in /var/log/messages either.
We've also done dmesg -n 8 in an attempt to log more error messages.


> What does mii-tool <interface> show you?


Here's the output now that we've switched to the gigabit switches.
Web01 connects via eth1 to app01's eth0.

[root@web01 access]# mii-tool -v
eth0: negotiated 100baseTx-FD flow-control, link ok
  product info: vendor 00:aa:00, model 56 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
eth1: negotiated 100baseTx-FD flow-control, link ok
  product info: vendor 00:aa:00, model 56 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

[root@app01 ~]# mii-tool -v
eth0: negotiated 100baseTx-FD flow-control, link ok
  product info: vendor 00:aa:00, model 56 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
eth1: no link
  product info: vendor 00:aa:00, model 56 rev 0
  basic mode:   autonegotiation enabled
  basic status: no link
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
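
For checking many machines, the per-interface summary lines of that output can be pulled out with a short script. This is a minimal Python sketch that assumes the line format shown above; the function name and regular expression are made up for illustration. Note that mii-tool may not report gigabit modes at all, so `ethtool <interface>` is probably the better source for the actual negotiated speed.

```python
import re

# Matches "eth0: negotiated 100baseTx-FD flow-control, link ok"
# and "eth1: no link"; detail lines such as "basic status: no link" do not match.
SUMMARY = re.compile(r"^(\w+): (?:negotiated (\S+).*, link ok|no link)$")


def parse_mii_tool(text):
    """Map interface name to negotiated mode, or None when there is no link."""
    links = {}
    for line in text.splitlines():
        match = SUMMARY.match(line.strip())
        if match:
            links[match.group(1)] = match.group(2)
    return links
```

Any interface whose mode ends in "-HD" is running half duplex and deserves a closer look.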


> What do the log files show you, both system and switch?


It's an unmanaged switch. There are no log files for it,
unfortunately. There are no system log errors at all.


> Depends on what is really causing you the problem. By guessing its a
> network issue isn't going to help you. You need to track it down and only
> then will you be able to fix it.


Unfortunately, without knowing what the problem absolutely is, we have
to start with a guess. This is our best guess based on the evidence.

The primary piece of evidence is that everything worked fine,
completely error free, when the switch went into half-duplex mode and
autonegotiation was off on the cards.

We're working hard to pursue all routes. SuperMicro has offered to
replicate our setup in their lab to test, and we have support issues
open with Redhat.

But at the moment, it does seem like a networking issue is the primary
culprit (and that may be an incompatibility between the motherboard and
redhat, redhat and the NIC, etc., that is causing the problem).

Thanks,
-Michael

 