PROBLEM
-------
I was previously able to ssh from the master node (server) to a slave
node (oscarnode1), so ssh was working properly then. Now I can't log in
to oscarnode1, although ping to it works perfectly and I can still ssh
from server to all the other nodes.
This happened while I was running MPI programs (using MPICH) from the
master node on the slave nodes using "mpirun -np 8 allgatherv2".
Everything worked fine until my program appeared to consume too much
memory on oscarnode1 (as seen in /var/log/messages), forcing oscarnode1
to 'die'. Now, neither of the following works:
1) "ssh oscarnode1"
2) "mpirun -np 8 allgatherv2" (possibly caused by the 'dead'
oscarnode1)
output from /var/log/messages:
-----------
Sep 7 20:19:09 oscarnode1 sshd(pam_unix)[2000]: session opened for
user csyeo by (uid=0)
Sep 7 20:20:56 oscarnode1 sshd(pam_unix)[2018]: session opened for
user csyeo by (uid=0)
Sep 7 20:21:17 oscarnode1 kernel: Out of Memory: Killed process
2001(allgatherv2).
Sep 7 20:21:28 oscarnode1 sshd(pam_unix)[2000]: session closed for
user csyeo
Sep 7 20:21:42 oscarnode1 sshd(pam_unix)[2082]: session opened for
user csyeo by (uid=0)
-----------
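To check whether other nodes have logged the same kind of kill, the excerpt above can be scanned mechanically. The following is only a minimal sketch: the function name `find_oom_kills` and the regular expression are illustrative assumptions, and the pattern is built solely from the single "Out of Memory: Killed process" line shown above, so other kernel versions may word the message differently.

```python
import re

# Pattern modelled on the one kernel line quoted in the log excerpt, e.g.
#   "Sep 7 20:21:17 oscarnode1 kernel: Out of Memory: Killed process 2001(allgatherv2)."
# This is an assumption about the format, not a general syslog parser.
OOM_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"Out of Memory: Killed process\s+(?P<pid>\d+)\s*\((?P<name>[^)]*)\)"
)


def find_oom_kills(lines):
    """Return a list of (timestamp, host, pid, process_name) for OOM-kill lines."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            kills.append(
                (
                    match.group("stamp"),
                    match.group("host"),
                    int(match.group("pid")),
                    match.group("name"),
                )
            )
    return kills
```

It accepts any iterable of lines, so an open file handle on a copy of /var/log/messages can be passed directly, for example `find_oom_kills(open("messages.copy"))` (the file name here is hypothetical).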
When I run ssh with the -v option to debug, it stops responding after
printing the following:
-----------
[csyeo@server csyeo]$ ssh -v oscarnode1
OpenSSH_3.1p1, SSH protocols 1.5/2.0, OpenSSL 0x0090602f
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Rhosts Authentication disabled, originating port will not be
trusted.
debug1: restore_uid
debug1: ssh_connect: getuid 508 geteuid 0 anon 1
debug1: Connecting to oscarnode1 [192.168.1.1] port 22.
debug1: temporarily_use_uid: 508/100 (e=0)
debug1: restore_uid
debug1: temporarily_use_uid: 508/100 (e=0)
debug1: restore_uid
debug1: Connection established.
debug1: read PEM private key done: type DSA
debug1: read PEM private key done: type RSA
debug1: identity file /home/csyeo/.ssh/identity type 0
debug1: identity file /home/csyeo/.ssh/id_rsa type 1
debug1: identity file /home/csyeo/.ssh/id_dsa type 2
[not responding --- hang]
-----------
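The trace above stops right after "Connection established" and the identity file lines, which is the point where the client would normally wait for the server's identification string (the "Remote protocol version" line in a working session). That suggests the TCP connection succeeds but sshd on oscarnode1 never answers. A minimal sketch for testing this directly, without the ssh client, is given below; the function name `probe_ssh_banner` and its status labels are illustrative assumptions, and the 5 second default timeout is an arbitrary choice.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Classify how far a plain TCP connection to an SSH port gets.

    Returns a (status, detail) tuple where status is one of:
      "banner"      - the server sent data (detail is the first chunk, stripped)
      "no_banner"   - TCP connected, but nothing arrived within the timeout
      "closed"      - TCP connected, but the server closed or reset the link
      "unreachable" - the TCP connection itself failed (detail is the error)
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return ("unreachable", str(exc))
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            # Connected, yet the daemon stayed silent: matches the hang above.
            return ("no_banner", "")
        except OSError as exc:
            return ("closed", str(exc))
    if not data:
        return ("closed", "")
    return ("banner", data.decode("ascii", errors="replace").strip())
```

Calling `probe_ssh_banner("oscarnode1")` from the master node and comparing it with `probe_ssh_banner("oscarnode2")` would show whether the problem node accepts connections but stays silent. A "no_banner" result would be consistent with sshd (or the node as a whole) being stuck after the out-of-memory event, though it does not by itself prove the cause.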
A successful ssh to another slave node oscarnode2:
-----------
[csyeo@server csyeo]$ ssh -v oscarnode2
OpenSSH_3.1p1, SSH protocols 1.5/2.0, OpenSSL 0x0090602f
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Rhosts Authentication disabled, originating port will not be
trusted.
debug1: restore_uid
debug1: ssh_connect: getuid 508 geteuid 0 anon 1
debug1: Connecting to oscarnode2 [192.168.1.2] port 22.
debug1: temporarily_use_uid: 508/100 (e=0)
debug1: restore_uid
debug1: temporarily_use_uid: 508/100 (e=0)
debug1: restore_uid
debug1: Connection established.
debug1: read PEM private key done: type DSA
debug1: read PEM private key done: type RSA
debug1: identity file /home/csyeo/.ssh/identity type 0
debug1: identity file /home/csyeo/.ssh/id_rsa type 1
debug1: identity file /home/csyeo/.ssh/id_dsa type 2
debug1: Remote protocol version 1.99, remote software version
OpenSSH_3.1p1
debug1: match: OpenSSH_3.1p1 pat OpenSSH*
Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_3.1p1
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
-----------
OS version on all the nodes:
Red Hat Linux release 7.3 (Valhalla)
ssh version on all the nodes:
OpenSSH_3.1p1, SSH protocols 1.5/2.0, OpenSSL 0x0090602f
Many Thanks,
Yeo