Today at work, we migrated the first box to Solaris 11, and we hit the first bug as soon as we needed to log in to the server.
As these delays are quite common when sshd runs with its default configuration, I quickly added these lines to rule out the usual GSSAPI and DNS issues:
/etc/ssh/sshd_config:

LookupClientHostnames no
VerifyReverseMapping no
GSSAPIAuthentication no
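For the edited config to take effect, sshd has to be restarted; on Solaris 11 this goes through SMF rather than an init script. A minimal sketch, assuming the stock service instance name:

```shell
# Restart the SSH service so the edited /etc/ssh/sshd_config is re-read
# (svc:/network/ssh:default is the standard Solaris 11 instance).
svcadm restart svc:/network/ssh:default

# Confirm the service came back online.
svcs ssh
```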
However, these settings didn't fix the problem.
I added some verbosity to both the ssh client and the server and tracked the delay down to this stage of the connection:
On the client:
$ ssh -v -p 2222 s11box -l adminifm
OpenSSH_5.8p1-hpn13v10lpk, OpenSSL 1.0.0c 2 Dec 2010
debug1: Reading configuration data /home/wildcat/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to admblockum04 10.2.12.155 port 2222.
debug1: Connection established.
debug1: identity file /home/wildcat/.ssh/id_rsa type -1
debug1: identity file /home/wildcat/.ssh/id_rsa-cert type -1
debug1: identity file /home/wildcat/.ssh/id_dsa type 2
debug1: identity file /home/wildcat/.ssh/id_dsa-cert type -1
debug1: identity file /home/wildcat/.ssh/id_ecdsa type -1
debug1: identity file /home/wildcat/.ssh/id_ecdsa-cert type -1
debug1: Remote protocol version 2.0, remote software version Sun_SSH_2.0
debug1: no match: Sun_SSH_2.0
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.8p1-hpn13v10lpk
debug1: SSH2_MSG_KEXINIT sent
*** HANG ***
And on the server:
debug1: Reloading X.509 host keys to avoid PKCS#11 fork issues.
monitor debug1: reading the context from the child
debug1: use_engine is 'no'
debug1: list_hostkey_types: ssh-rsa,ssh-dss
debug1: My KEX proposal before adding the GSS KEX algorithm:
debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour128,arcfour256,arcfour
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour128,arcfour256,arcfour
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,zlib
debug2: kex_parse_kexinit: none,zlib
debug2: kex_parse_kexinit: de-DE,en-US,es-ES,fr-FR,it-IT,ja-JP,ko-KR,pt-BR,zh-CN,zh-TW,i-default
debug2: kex_parse_kexinit: de-DE,en-US,es-ES,fr-FR,it-IT,ja-JP,ko-KR,pt-BR,zh-CN,zh-TW,i-default
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
*** HANG ***
Attaching truss to the server process helps a lot:
17477: so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0, SOV_DEFAULT) = 3
17469: pollsys(0x080451C0, 1, 0x00000000, 0x00000000) (sleeping...)
17477: connect(3, 0x08047030, 16, SOV_DEFAULT) (sleeping...)
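For reference, a trace like the one above can be captured along these lines; the -f flag matters because sshd forks a per-connection child, and it is that child which issues the blocking connect(). The pgrep pattern here is only an illustration for this setup:

```shell
# Attach truss to the debug sshd listening on port 2222 and follow
# forks (-f) so the per-connection child is traced as well.
truss -f -p $(pgrep -f 'sshd.*-p 2222')
```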
The lock happens just after the connect() syscall. We can now check the pfiles output of this process, together with netstat, to identify which connection is failing to be established:
root@admblockum04:/local/home/ucc# pfiles 16585
16585: /usr/lib/ssh/sshd -f /etc/ssh/sshd_config -p 2222 -ddd -D
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:532,0 ino:3502445063 uid:60004 gid:7 rdev:221,8
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /dev/pts/8
      offset:43267
   1: S_IFCHR mode:0620 dev:532,0 ino:3502445063 uid:60004 gid:7 rdev:221,8
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /dev/pts/8
      offset:43267
   2: S_IFCHR mode:0620 dev:532,0 ino:3502445063 uid:60004 gid:7 rdev:221,8
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /dev/pts/8
      offset:43267
   3: S_IFSOCK mode:0666 dev:540,0 ino:34566 uid:0 gid:0 size:0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(49152),SO_RCVBUF(131072)
        sockname: AF_INET 127.0.0.1  port: 55867
        congestion control: newreno
   4: S_IFSOCK mode:0666 dev:540,0 ino:5714 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
        SOCK_STREAM
        SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(128872)
        sockname: AF_INET6 ::ffff:10.2.12.155  port: 2222
        peername: AF_INET6 ::ffff:10.2.60.1  port: 43575
        congestion control: newreno
   5: S_IFIFO mode:0000 dev:529,0 ino:70783 uid:0 gid:0 size:0
      O_RDWR
   6: S_IFIFO mode:0000 dev:529,0 ino:70783 uid:0 gid:0 size:0
      O_RDWR
   8: S_IFIFO mode:0000 dev:529,0 ino:70784 uid:0 gid:0 size:0
      O_RDWR FD_CLOEXEC
# netstat -an | grep 55867
127.0.0.1.55867      127.0.0.1.30003      0      0 131072      0 SYN_SENT
Port 30003 is the default port of the tcsd daemon, which manages hardware cryptography (through /dev/tpm). When there is no hardware crypto device, this daemon is disabled. It seems, though, that cryptoadm links the TPM crypto mechanisms by default, which causes ssh to try to reach this daemon anyway.
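To confirm the state of tcsd on the box, SMF can be queried directly; a quick check (the service name pattern is an assumption, grep keeps it tolerant):

```shell
# Look up the TCS daemon in the SMF service list; on a box without
# TPM hardware it is expected to show up as disabled, which is why
# sshd's connect() to port 30003 sits in SYN_SENT instead of succeeding.
svcs -a | grep tcsd
```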
Workaround found (just to confirm the slowness is caused by tcsd):
Run this command on the server:
# nc -e 'cat /dev/null' localhost 30003
then try to ssh to the box: it should now be fast.
Permanent workaround:
Simply remove the pkcs11_tpm provider from the crypto framework:
cryptoadm disable provider=/usr/lib/security/\$ISA/pkcs11_tpm.so mechanism=all
cryptoadm uninstall provider=/usr/lib/security/\$ISA/pkcs11_tpm.so
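Afterwards, you can check that the TPM provider is really gone from the framework before retesting the login (output varies per system; the echo is just a convenience):

```shell
# List the providers known to the Solaris crypto framework; the
# pkcs11_tpm.so entry should no longer appear.
cryptoadm list | grep -i tpm || echo "tpm provider removed"

# Re-test: the ssh handshake should now complete without the delay.
ssh -p 2222 -l adminifm s11box true
```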