Stale Mounts
A stale mount occurs when a CephFS volume remains mounted on a Kubernetes node, but access to it is lost due to a disrupted connection between the node and the Ceph cluster. While the mount point still exists (mount | grep ceph shows it), any operation on it (e.g., ls, cd, touch) fails, hangs, or returns a “Stale file handle” error. This happens because the kernel or CSI driver believes the mount is valid, but the underlying Ceph storage is unreachable or inconsistent.
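A quick way to tell a stale mount from a healthy one is to probe the mount point with a bounded timeout, so a hung filesystem stalls only the probe instead of your shell. This is a minimal sketch; the path is a stand-in for your PVC mount point.

```shell
#!/bin/sh
# Probe a path with a 5-second timeout: a stale CephFS mount makes
# stat hang, so the timeout expiring is a strong staleness signal.
probe_mount() {
  if timeout 5 stat "$1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "stale-or-unreachable"
  fi
}

probe_mount /tmp   # a healthy local path answers immediately
```

On a node you would loop this over every path reported by `mount | grep ceph`.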
The most common causes of stale mounts include Ceph cluster disruptions (e.g., MDS or MON failures), network instability (packet loss between nodes and Ceph), Kubernetes CSI plugin issues (failed volume detach/reattach), resource exhaustion (CPU, Memory, or I/O) and unclean pod termination (where a deleted pod doesn’t properly release its PVC mount). Additionally, kernel-level CephFS client bugs or long-standing mounts that outlive network reconnections can also lead to stale mounts, requiring manual intervention or node restarts.
Now that we know the permission issue that crops up when mounting PVCs actually stems from stale mounts, let us look at why the logs do not capture these events and how we can remedy it.
1. Why Don’t the Logs Show Anything About Stale Mounts or Permission Issues?
This is common with CephFS and Kubernetes CSI setups because:
Stale Mounts Don’t Always Generate Log Errors
The stale mount condition happens at the filesystem level. The cephfs kernel module may not report an explicit error unless the node tries to access a broken mount. In such cases, running:
mount | grep ceph
or
df -h | grep ceph
on the affected node might show a stuck mount.
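That check can be scripted: in `mount` output, the mount point is the third field and the filesystem type the fifth, so `awk` can pull out only the CephFS mounts. A captured sample line stands in for live output here.

```shell
#!/bin/sh
# Pull CephFS mount points out of `mount`-style output on stdin.
ceph_mounts() { awk '$5 == "ceph" {print $3}'; }

# Sample line as the kernel client reports it; on a node, pipe
# `mount` itself instead of echoing a capture.
sample="10.0.0.1:6789:/ on /var/lib/kubelet/pods/abc/volume type ceph (rw,noatime)"
echo "$sample" | ceph_mounts
```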
Permission Denied Issues Are Often a Side Effect
- The real issue isn’t permissions but a stuck mount point.
- Kubernetes only reports pod PVC mount failures as permission issues (e.g., ls not working due to stale file handles).
CSI Driver Logs Might Not Capture Kernel-Level Issues
- The CSI driver logs focus on attaching and detaching volumes, not kernel mount failures.
2. Understanding the mountOptions and Their Impact
Suggested Configuration
mountOptions:
- noatime
- nodiratime
- mds_namespace=myfs
- client_mount_timeout=30
- client_reconnect_stale=5
How This Will Affect Your Cluster
- New PVCs: This will apply only to new PVCs created after the change.
- Existing PVCs: Won’t be affected immediately, but any pod using an old PVC will still use the old mount settings.
- Cluster Impact: No downtime for running applications, but you may need to manually remount PVCs for changes to take full effect.
Line-by-Line Explanation of mountOptions
1. noatime (Disable Access Time Updates)
- What it does:
  - Prevents the system from updating the access time (atime) of files each time they are read.
- Why it helps:
  - Reduces unnecessary metadata writes to CephFS, improving performance.
  - Prevents atime updates from interfering with mount operations.
2. nodiratime (Disable Directory Access Time Updates)
- What it does:
  - Similar to noatime, but applies only to directories.
- Why it helps:
  - Reduces overhead when listing files in a mounted CephFS directory.
  - Improves performance for workloads with frequent directory accesses.
3. mds_namespace=myfs (Specify CephFS Namespace)
- What it does:
  - Ensures that the CephFS mount uses the correct namespace (myfs) in the Ceph metadata service (MDS).
- Why it helps:
  - Avoids conflicts if multiple CephFS namespaces exist.
  - Helps ensure the correct MDS handles the mount.
4. client_mount_timeout=30 (Reduce Mount Timeout)
- What it does:
- Specifies a 30-second timeout for CephFS client mount operations.
- Why it helps:
- If a mount request takes too long, it fails quickly instead of hanging indefinitely.
- Helps avoid “stuck” CephFS mounts that never recover.
5. client_reconnect_stale=5 (Auto-Reconnect for Stale Mounts)
- What it does:
- Forces the CephFS client to automatically reconnect to a stale mount within 5 seconds.
- Why it helps:
- Prevents nodes from holding on to broken mounts for too long.
- Ensures that PVCs recover quickly if Ceph temporarily loses connection.
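For context, this is roughly where mountOptions sits in the full StorageClass. The provisioner name and parameters below follow the usual Rook examples and are assumptions to adapt to your cluster:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com   # typical Rook operator value
parameters:
  clusterID: rook-ceph                        # adjust to your Rook namespace/cluster
  fsName: myfs
reclaimPolicy: Delete
mountOptions:
  - noatime
  - nodiratime
  - mds_namespace=myfs
  - client_mount_timeout=30
  - client_reconnect_stale=5
```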
3. How to Apply These Changes
Since this is a StorageClass setting, you need to update the existing rook-cephfs StorageClass.
Step 1: Check Your Existing StorageClass
kubectl get sc rook-cephfs -o yaml
Step 2: Patch the StorageClass
kubectl patch sc rook-cephfs -p '{"mountOptions":["noatime","nodiratime","mds_namespace=myfs","client_mount_timeout=30","client_reconnect_stale=5"]}'
Note: most StorageClass fields are immutable after creation, so if the API server rejects this patch, delete and recreate the StorageClass with the new mountOptions instead.
Step 3: How to Verify the Change
After applying the patch, check that the mountOptions are correctly applied:
kubectl get sc rook-cephfs -o yaml | grep -A 5 "mountOptions"
Expected output:
mountOptions:
- noatime
- nodiratime
- mds_namespace=myfs
- client_mount_timeout=30
- client_reconnect_stale=5
Do I Need to Restart Anything?
Ceph CSI Plugin: Restart the Ceph CSI pods to ensure new mount options take effect
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
Kubernetes will automatically restart the pods.
Check Mounts on Nodes: After restarting pods, verify that new PVCs are mounted with the correct options:
mount | grep ceph
You should see mds_namespace=myfs in the mount options.
Step 4: Restart Pods Using PVCs
Since existing mounts won’t change, restart affected pods to force them to use the new mount options.
kubectl delete pod --selector=app=<your-app> -n <namespace>
Step 5: Verify New Mounts
Run this on a worker node with newly scheduled pods:
mount | grep ceph
You should see the new mount options applied.
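This verification can also be scripted: check whether each ceph mount line carries one of the new options. A captured line stands in for `mount | grep ceph` output in this sketch.

```shell
#!/bin/sh
# Report whether a mount line already carries the new options.
has_new_opts() {
  case "$1" in
    *mds_namespace=myfs*) echo "applied" ;;
    *)                    echo "old options" ;;
  esac
}

new_line="10.0.0.1:6789:/ on /mnt/pvc type ceph (rw,noatime,nodiratime,mds_namespace=myfs)"
old_line="10.0.0.1:6789:/ on /mnt/pvc type ceph (rw)"
has_new_opts "$new_line"   # applied
has_new_opts "$old_line"   # old options
```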
4. Alternative: Apply Changes to Existing PVCs
Since mountOptions don’t apply to already mounted PVCs, you can manually unmount and remount them without restarting nodes.
Option 1: Force Unmount Stale Mounts
On an affected node:
sudo umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
Then restart the Ceph CSI plugin:
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
Option 2: Remount All PVCs
For an automated approach, drain and re-add the affected node:
kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <worker-node>
5. Will This Completely Prevent Future Stale Mounts?
These changes significantly reduce the chances of stale mounts but don’t eliminate them completely. Additional steps include:
- Monitoring mount status with mount | grep ceph
- Using kured for automatic rebooting of stuck nodes
- Setting up alerts to detect stuck Ceph mounts
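One way to build the alerting piece is a small script that counts stale CephFS mounts and prints a Prometheus-style gauge, e.g. for node-exporter's textfile collector. The metric name and the collector path in the comment are assumptions.

```shell
#!/bin/sh
# Count CephFS mounts (from `mount`-style input on stdin) that fail a
# timed stat probe, and emit a Prometheus-style gauge for alerting.
count_stale() {
  stale=0
  for m in $(awk '$5 == "ceph" {print $3}'); do
    timeout 5 stat "$m" >/dev/null 2>&1 || stale=$((stale + 1))
  done
  echo "cephfs_stale_mounts $stale"
}

# On a node: mount | count_stale > /var/lib/node_exporter/textfile/cephfs.prom
echo "" | count_stale   # no ceph mounts in the input -> reports 0
```

An alert rule can then fire whenever `cephfs_stale_mounts` is greater than zero.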
Final Thoughts
✅ What This Change Helps With
- Faster recovery from CephFS disconnections.
- Fewer permission errors caused by stale mounts.
- Reduced unnecessary metadata writes (improves performance).
- Faster failure detection when mounts break.
❌ What This Won’t Fix
- Hardware issues or node failures.
- Ceph cluster-wide failures (OSD crashes, high latency).
- Network disconnects lasting longer than the timeout values.
Common PVC Permission Errors Related to Stale Mounts in CephFS
When a stale CephFS mount occurs, Kubernetes PVC binding failures can manifest in different ways. Below are the typical error messages you might encounter, along with what they mean.
1️⃣ “Stale File Handle” Error
Error Message:
ls: cannot access '/var/lib/kubelet/pods/.../volumes/kubernetes.io~csi/...': Stale file handle
Why It Happens:
- This occurs when the CephFS client loses connection to the Ceph MDS (Metadata Server) but the mount point still exists.
- The Kubernetes pod tries to access the PVC, but CephFS does not recognize the session anymore.
- Common Causes:
- Ceph MDS crash or overload.
- Network failure between worker node and Ceph cluster.
- Kubernetes pod rescheduled without properly unmounting the volume.
🔍 How to Fix:
umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
2️⃣ “Permission Denied” When Accessing PVC
Error Message:
chmod: changing permissions of '/mnt/my-pvc': Permission denied
or
touch: cannot touch '/mnt/my-pvc/testfile': Permission denied
Why It Happens:
- The CephFS mount exists, but the permissions are incorrect.
- The Kubernetes pod cannot write to the PVC because the stale mount prevents proper UID/GID mapping.
- Common Causes:
- Stale CephFS mount.
- Kubernetes didn’t unmount the PVC properly before reattaching it to another pod.
- Ceph MDS lost track of the volume’s permissions.
🔍 How to Fix:
umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin💡 Check the CephFS MDS logs for permission issues:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mds stat
3️⃣ “Transport endpoint is not connected”
Error Message:
ls: cannot access '/mnt/my-pvc': Transport endpoint is not connected
Why It Happens:
- The CephFS client lost connection to the Ceph cluster, and the mount point is now broken.
- The Kubernetes node still thinks the mount exists, but CephFS no longer recognizes it.
- Common Causes:
- Network failure between the node and Ceph Monitors (MONs).
- CephFS client bug or kernel panic.
- Improper unmounting of the PVC.
🔍 How to Fix:
umount -lf /mnt/my-pvc
mount | grep ceph # Verify if the mount is gone
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
4️⃣ “Failed to Attach Volume” or “Unable to Mount Volume”
Error Message:
Warning FailedAttachVolume 10s attachdetach-controller AttachVolume.Attach failed for volume "pvc-xyz" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
or
Warning FailedMount 8s kubelet MountVolume.MountDevice failed for volume "pvc-xyz" : timeout waiting for CephFS mount
Why It Happens:
- Kubernetes tries to mount a stale CephFS PVC but times out because the mount is in an inconsistent state.
- Common Causes:
  - Ceph CSI driver failure (csi-cephfsplugin crash).
  - Persistent stale mount from a previous pod.
  - Ceph MDS session expired for the PVC.
🔍 How to Fix:
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
kubectl delete pod <affected-pod>
kubectl logs -n rook-ceph -l app=csi-cephfsplugin --tail=50
5️⃣ “I/O Error” on PVC
Error Message:
cp: error writing '/mnt/my-pvc/file.txt': Input/output error
or
dmesg | grep ceph
[ 1234.567890] ceph: I/O error on mountpoint /mnt/my-pvc
Why It Happens:
- The CephFS client failed to process an I/O request, possibly due to a MDS timeout or stale mount.
- Common Causes:
- CephFS client kernel module crashed.
- Node lost access to Ceph storage.
- Kubernetes pod is writing to a PVC whose backend storage is unreachable.
🔍 How to Fix:
umount -lf /mnt/my-pvc
mount | grep ceph # Verify if the mount is gone
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
6️⃣ “Pod Stuck in ContainerCreating”
Error Message:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 20s kubelet Unable to attach or mount volumes
Why It Happens:
- The pod is waiting indefinitely for a PVC that cannot be mounted due to a stale reference.
- Common Causes:
- Kubernetes is trying to mount a PVC that is already stale.
- The node has an orphaned CephFS mount that is preventing a new attach operation.
🔍 How to Fix:
kubectl get pods -o wide | grep <pvc-name>
kubectl delete pod <affected-pod>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
🛠 Final Fix: Automated Stale Mount Cleanup
Instead of manually checking for stale mounts, you can automate detection and cleanup using a cron job or Kubernetes DaemonSet.
📌 Automated Cleanup Script:
#!/bin/bash
# Check for stale CephFS mounts and unmount them
for MOUNT in $(mount | grep ceph | awk '{print $3}'); do
  # ls can hang forever on a stale mount, so bound the probe with a timeout
  if ! timeout 5 ls "$MOUNT" &>/dev/null; then
    echo "Unmounting stale CephFS mount: $MOUNT"
    umount -lf "$MOUNT"
  fi
done
Run every 5 minutes as a cron job (add this line via crontab -e):
*/5 * * * * /path/to/stale-mount-cleanup.sh
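If you prefer the DaemonSet route over cron, a sketch along these lines runs the cleanup loop on every node. The image, interval, and the mechanism for getting the script into the container (a ConfigMap here) are assumptions, and the pod needs privileges to unmount host paths:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: stale-mount-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: stale-mount-cleanup
  template:
    metadata:
      labels:
        app: stale-mount-cleanup
    spec:
      containers:
        - name: cleanup
          image: busybox:1.36
          command: ["sh", "-c", "while true; do /scripts/stale-mount-cleanup.sh; sleep 300; done"]
          securityContext:
            privileged: true          # required to umount host paths
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: Bidirectional
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
        - name: scripts
          configMap:
            name: stale-mount-cleanup   # assumed ConfigMap holding the script above
            defaultMode: 0755
```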
🚀 Summary: Common Stale Mount Errors and Fixes
| Error Message | Cause | Fix |
|---|---|---|
| Stale file handle | CephFS lost connection to MDS | umount -lf <path> & restart CSI plugin |
| Permission denied | Stale mount prevents proper UID/GID mapping | Unmount & restart CSI plugin |
| Transport endpoint is not connected | CephFS client lost connection | Unmount & restart CSI plugin |
| Failed to attach volume | Stale PVC reference | Delete pod & restart CSI plugin |
| I/O error | Kernel-level CephFS failure | Restart affected node |
| Pod stuck in ContainerCreating | PVC mount failure due to stale mount | Delete pod & restart CSI plugin |