Hi everybody,
I am experiencing a serious issue with our new Win2003 Enterprise
Cluster. At random times, seemingly at higher workloads (we never
really see 20% utilization even at full tilt) our cluster stops
responding, we lose all our shares, and business grinds to a halt.
Now, we are running a Microsoft cluster, two servers, Top and Bot. The
cluster name is HAL. We need to run File Services for Macintosh
(sfmsrv.sys). Top is normally in control of the cluster. When we have
the issue, HAL just goes away; no Windows or Mac shares are available.
Also, Top becomes unavailable on the network; you can't even map to the
administrative shares on the box. You can get on Top's console, but
you cannot open the Cluster Manager, and you cannot open the Services
console to try to restart the cluster service or any other one for that
matter.
Now, you would think that the Bot server would take over, but that is
not the case. When you log in to Bot, you still cannot open the
Cluster Manager to fail the cluster over to Bot and get on with business.
The only way we can get HAL back online is to power down Top. Once Top
goes away, you can get Bot to take over and bring HAL back up.
The first error in Top's error log before all the errors that say
everyone is disconnected from the shares is