Interesting NFS4 failures on some CentOS 7 clients

We’ve had a couple of clients which use a combination of:

  • NFS4 (no encryption yet, just simple NFS4 with idmapd set as domain
  • automounted file-servers on /net/<hostname>, with bind mounts to get these to appear in /home/<username>
  • CentOS 7
  • kernel 3.10.0-229.20.1.el7.x86_64 (newest at the moment)

..where for some reason they decided to lose the NFS server connection.

Normally you’d just restart a bunch of NFS daemons, along with a ‘umount -l’ of the stuck mounts and continue on your way. In this case, however, we got a kernel thread showing up in uninterruptible sleep
with a name of [<IP-of-NFS-server>-ma]. (Square brackets because kernel threads show up in ps output in that format)

Looking at /proc/<PID>/stack for this PID showed that it was trying to recover a failed NFS mount, and then was waiting for an RPC response
(stuck in the rpc_wait_bit_killable function, despite not being possible to kill).

Has anyone else seen this behaviour?

I’ve just had two servers with the exact same install and kernel version do this, one at about 1am this morning and then the other at about 1pm
this afternoon.
A third machine with an identical install but a slightly older kernel version hasn’t hit this problem, so I’m erring toward it being a kernel NFS4 bug.

(Have rebooted both compute servers as that was the only possible recovery method it seemed, and picked a previous kernel version for one of the hosts to see if the behaviour changes…)

Posted in NFS