Stale Mounts
A stale mount occurs when a CephFS volume remains mounted on a Kubernetes node, but access to it is lost due to a disrupted connection between the node and the Ceph cluster. While the mount point still exists (mount | grep ceph shows it), any operation on it (e.g., ls, cd, touch) fails, hangs, or returns a “Stale file handle” error. This happens because the kernel or CSI driver believes the mount is valid, but the underlying Ceph storage is unreachable or inconsistent.
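A quick way to tell a stale mount from a healthy one is to probe the mount point with a bounded timeout, so a hung filesystem stalls only the probe instead of your shell. This is a minimal sketch; the path is a stand-in for your PVC mount point.

```shell
#!/bin/sh
# Probe a path with a 5-second timeout: a stale CephFS mount makes
# stat hang, so the timeout expiring is a strong staleness signal.
probe_mount() {
  if timeout 5 stat "$1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "stale-or-unreachable"
  fi
}

probe_mount /tmp   # a healthy local path answers immediately
```

On a node you would loop this over every path reported by `mount | grep ceph`.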
The most common causes of stale mounts include Ceph cluster disruptions (e.g., MDS or MON failures), network instability (packet loss between nodes and Ceph), Kubernetes CSI plugin issues (failed volume detach/reattach), resource exhaustion (CPU, Memory, or I/O) and unclean pod termination (where a deleted pod doesn’t properly release its PVC mount). Additionally, kernel-level CephFS client bugs or long-standing mounts that outlive network reconnections can also lead to stale mounts, requiring manual intervention or node restarts.
Now that we know the permission issue that crops up when mounting PVCs actually stems from stale mounts, let us look at why the logs do not capture these events and how we can remedy it.
1. Why Don’t the Logs Show Anything About Stale Mounts or Permission Issues?
This is common with CephFS and Kubernetes CSI setups because:
Stale Mounts Don’t Always Generate Log Errors
The stale mount condition happens at the filesystem level. The cephfs kernel module may not report an explicit error unless the node tries to access a broken mount. In such cases, running:
mount | grep ceph
or
df -h | grep ceph
on the affected node might show a stuck mount.
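That check can be scripted: in `mount` output, the mount point is the third field and the filesystem type the fifth, so `awk` can pull out only the CephFS mounts. A captured sample line stands in for live output here.

```shell
#!/bin/sh
# Pull CephFS mount points out of `mount`-style output on stdin.
ceph_mounts() { awk '$5 == "ceph" {print $3}'; }

# Sample line as the kernel client reports it; on a node, pipe
# `mount` itself instead of echoing a capture.
sample="10.0.0.1:6789:/ on /var/lib/kubelet/pods/abc/volume type ceph (rw,noatime)"
echo "$sample" | ceph_mounts
```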
Permission Denied Issues Are Often a Side Effect
- The real issue isn’t permissions but a stuck mount point.
- Kubernetes only reports pod PVC mount failures as permission issues (e.g., ls not working due to stale file handles).
CSI Driver Logs Might Not Capture Kernel-Level Issues
- The CSI driver logs focus on attaching and detaching volumes, not kernel mount failures.
2. Understanding the mountOptions and Their Impact
Suggested Configuration
mountOptions:
- noatime
- nodiratime
- mds_namespace=myfs
- client_mount_timeout=30
- client_reconnect_stale=5
How This Will Affect Your Cluster
- New PVCs: This will apply only to new PVCs created after the change.
- Existing PVCs: Won’t be affected immediately, but any pod using an old PVC will still use the old mount settings.
- Cluster Impact: No downtime for running applications, but you may need to manually remount PVCs for changes to take full effect.
Line-by-Line Explanation of mountOptions
1. noatime (Disable Access Time Updates)
- What it does:
  - Prevents the system from updating the access time (atime) of files each time they are read.
- Why it helps:
  - Reduces unnecessary metadata writes to CephFS, improving performance.
  - Prevents atime updates from interfering with mount operations.
2. nodiratime (Disable Directory Access Time Updates)
- What it does:
  - Similar to noatime, but applies only to directories.
- Why it helps:
  - Reduces overhead when listing files in a mounted CephFS directory.
  - Improves performance for workloads with frequent directory accesses.
3. mds_namespace=myfs (Specify CephFS Namespace)
- What it does:
  - Ensures that the CephFS mount uses the correct namespace (myfs) in the Ceph metadata service (MDS).
- Why it helps:
  - Avoids conflicts if multiple CephFS namespaces exist.
  - Helps ensure the correct MDS handles the mount.
4. client_mount_timeout=30 (Reduce Mount Timeout)
- What it does:
- Specifies a 30-second timeout for CephFS client mount operations.
- Why it helps:
- If a mount request takes too long, it fails quickly instead of hanging indefinitely.
- Helps avoid “stuck” CephFS mounts that never recover.
5. client_reconnect_stale=5 (Auto-Reconnect for Stale Mounts)
- What it does:
- Forces the CephFS client to automatically reconnect to a stale mount within 5 seconds.
- Why it helps:
- Prevents nodes from holding on to broken mounts for too long.
- Ensures that PVCs recover quickly if Ceph temporarily loses connection.
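For context, this is roughly where mountOptions sits in the full StorageClass. The provisioner name and parameters below follow the usual Rook examples and are assumptions to adapt to your cluster:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com   # typical Rook operator value
parameters:
  clusterID: rook-ceph                        # adjust to your Rook namespace/cluster
  fsName: myfs
reclaimPolicy: Delete
mountOptions:
  - noatime
  - nodiratime
  - mds_namespace=myfs
  - client_mount_timeout=30
  - client_reconnect_stale=5
```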
3. How to Apply These Changes
Since this is a StorageClass setting, you need to update the existing rook-cephfs StorageClass.
Step 1: Check Your Existing StorageClass
kubectl get sc rook-cephfs -o yaml
Step 2: Patch the StorageClass
kubectl patch sc rook-cephfs -p '{"mountOptions":["noatime","nodiratime","mds_namespace=myfs","client_mount_timeout=30","client_reconnect_stale=5"]}'
Note: most StorageClass fields are immutable after creation, so if the API server rejects this patch, delete and recreate the StorageClass with the new mountOptions instead.
Step 3: How to Verify the Change
After applying the patch, check that the mountOptions are correctly applied:
kubectl get sc rook-cephfs -o yaml | grep -A 5 "mountOptions"
Expected output:
mountOptions:
- noatime
- nodiratime
- mds_namespace=myfs
- client_mount_timeout=30
- client_reconnect_stale=5
Do I Need to Restart Anything?
Ceph CSI Plugin: Restart the Ceph CSI pods to ensure new mount options take effect
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
Kubernetes will automatically restart the pods.
Check Mounts on Nodes: After restarting pods, verify that new PVCs are mounted with the correct options:
mount | grep ceph
You should see mds_namespace=myfs in the mount options.
Step 4: Restart Pods Using PVCs
Since existing mounts won’t change, restart affected pods to force them to use the new mount options.
kubectl delete pod --selector=app=<your-app> -n <namespace>
Step 5: Verify New Mounts
Run this on a worker node with newly scheduled pods:
mount | grep ceph
You should see the new mount options applied.
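This verification can also be scripted: check whether each ceph mount line carries one of the new options. A captured line stands in for `mount | grep ceph` output in this sketch.

```shell
#!/bin/sh
# Report whether a mount line already carries the new options.
has_new_opts() {
  case "$1" in
    *mds_namespace=myfs*) echo "applied" ;;
    *)                    echo "old options" ;;
  esac
}

new_line="10.0.0.1:6789:/ on /mnt/pvc type ceph (rw,noatime,nodiratime,mds_namespace=myfs)"
old_line="10.0.0.1:6789:/ on /mnt/pvc type ceph (rw)"
has_new_opts "$new_line"   # applied
has_new_opts "$old_line"   # old options
```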
4. Alternative: Apply Changes to Existing PVCs
Since mountOptions don’t apply to already mounted PVCs, you can manually unmount and remount them without restarting nodes.
Option 1: Force Unmount Stale Mounts
On an affected node:
sudo umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
Then restart the Ceph CSI plugin:
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
Option 2: Remount All PVCs
For an automated approach, drain and re-add the affected node:
kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <worker-node>
5. Will This Completely Prevent Future Stale Mounts?
These changes significantly reduce the chances of stale mounts but don’t eliminate them completely. Additional steps include:
- Monitoring mount status with mount | grep ceph
- Using kured for automatic rebooting of stuck nodes
- Setting up alerts to detect stuck Ceph mounts
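One way to build the alerting piece is a small script that counts stale CephFS mounts and prints a Prometheus-style gauge, e.g. for node-exporter's textfile collector. The metric name and the collector path in the comment are assumptions.

```shell
#!/bin/sh
# Count CephFS mounts (from `mount`-style input on stdin) that fail a
# timed stat probe, and emit a Prometheus-style gauge for alerting.
count_stale() {
  stale=0
  for m in $(awk '$5 == "ceph" {print $3}'); do
    timeout 5 stat "$m" >/dev/null 2>&1 || stale=$((stale + 1))
  done
  echo "cephfs_stale_mounts $stale"
}

# On a node: mount | count_stale > /var/lib/node_exporter/textfile/cephfs.prom
echo "" | count_stale   # no ceph mounts in the input -> reports 0
```

An alert rule can then fire whenever `cephfs_stale_mounts` is greater than zero.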
Final Thoughts
✅ What This Change Helps With
- Faster recovery from CephFS disconnections.
- Fewer permission errors caused by stale mounts.
- Reduced unnecessary metadata writes (improves performance).
- Faster failure detection when mounts break.
❌ What This Won’t Fix
- Hardware issues or node failures.
- Ceph cluster-wide failures (OSD crashes, high latency).
- Network disconnects lasting longer than the timeout values.
Common PVC Permission Errors Related to Stale Mounts in CephFS
When a stale CephFS mount occurs, Kubernetes PVC binding failures can manifest in different ways. Below are the typical error messages you might encounter, along with what they mean.
1️⃣ “Stale File Handle” Error
Error Message:
ls: cannot access '/var/lib/kubelet/pods/.../volumes/kubernetes.io~csi/...': Stale file handle
Why It Happens:
- This occurs when the CephFS client loses connection to the Ceph MDS (Metadata Server) but the mount point still exists.
- The Kubernetes pod tries to access the PVC, but CephFS does not recognize the session anymore.
- Common Causes:
- Ceph MDS crash or overload.
- Network failure between worker node and Ceph cluster.
- Kubernetes pod rescheduled without properly unmounting the volume.
🔍 How to Fix:
umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
2️⃣ “Permission Denied” When Accessing PVC
Error Message:
chmod: changing permissions of '/mnt/my-pvc': Permission denied
or
touch: cannot touch '/mnt/my-pvc/testfile': Permission denied
Why It Happens:
- The CephFS mount exists, but the permissions are incorrect.
- The Kubernetes pod cannot write to the PVC because the stale mount prevents proper UID/GID mapping.
- Common Causes:
- Stale CephFS mount.
- Kubernetes didn’t unmount the PVC properly before reattaching it to another pod.
- Ceph MDS lost track of the volume’s permissions.
🔍 How to Fix:
umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin💡 Check the CephFS MDS logs for permission issues:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mds stat
3️⃣ “Transport endpoint is not connected”
Error Message:
ls: cannot access '/mnt/my-pvc': Transport endpoint is not connected
Why It Happens:
- The CephFS client lost connection to the Ceph cluster, and the mount point is now broken.
- The Kubernetes node still thinks the mount exists, but CephFS no longer recognizes it.
- Common Causes:
- Network failure between the node and Ceph Monitors (MONs).
- CephFS client bug or kernel panic.
- Improper unmounting of the PVC.
🔍 How to Fix:
umount -lf /mnt/my-pvc
mount | grep ceph # Verify if the mount is gone
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
4️⃣ “Failed to Attach Volume” or “Unable to Mount Volume”
Error Message:
Warning FailedAttachVolume 10s attachdetach-controller AttachVolume.Attach failed for volume "pvc-xyz" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
or
Warning FailedMount 8s kubelet MountVolume.MountDevice failed for volume "pvc-xyz" : timeout waiting for CephFS mount
Why It Happens:
- Kubernetes tries to mount a stale CephFS PVC but times out because the mount is in an inconsistent state.
- Common Causes:
  - Ceph CSI driver failure (csi-cephfsplugin crash).
  - Persistent stale mount from a previous pod.
  - Ceph MDS session expired for the PVC.
🔍 How to Fix:
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
kubectl delete pod <affected-pod>
kubectl logs -n rook-ceph -l app=csi-cephfsplugin --tail=50
5️⃣ “I/O Error” on PVC
Error Message:
cp: error writing '/mnt/my-pvc/file.txt': Input/output error
or
dmesg | grep ceph
[ 1234.567890] ceph: I/O error on mountpoint /mnt/my-pvc
Why It Happens:
- The CephFS client failed to process an I/O request, possibly due to a MDS timeout or stale mount.
- Common Causes:
- CephFS client kernel module crashed.
- Node lost access to Ceph storage.
- Kubernetes pod is writing to a PVC whose backend storage is unreachable.
🔍 How to Fix:
umount -lf /mnt/my-pvc
mount | grep ceph # Verify if the mount is gone
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
6️⃣ “Pod Stuck in ContainerCreating”
Error Message:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 20s kubelet Unable to attach or mount volumes
Why It Happens:
- The pod is waiting indefinitely for a PVC that cannot be mounted due to a stale reference.
- Common Causes:
- Kubernetes is trying to mount a PVC that is already stale.
- The node has an orphaned CephFS mount that is preventing a new attach operation.
🔍 How to Fix:
kubectl get pods -o wide | grep <pvc-name>
kubectl delete pod <affected-pod>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
🛠 Final Fix: Automated Stale Mount Cleanup
Instead of manually checking for stale mounts, you can automate detection and cleanup using a cron job or Kubernetes DaemonSet.
📌 Automated Cleanup Script:
#!/bin/bash
# Check for stale CephFS mounts and unmount them
for MOUNT in $(mount | grep ceph | awk '{print $3}'); do
  # ls can hang forever on a stale mount, so bound the probe with a timeout
  if ! timeout 5 ls "$MOUNT" &>/dev/null; then
    echo "Unmounting stale CephFS mount: $MOUNT"
    umount -lf "$MOUNT"
  fi
done
Run every 5 minutes as a cron job (add this line via crontab -e):
*/5 * * * * /path/to/stale-mount-cleanup.sh
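If you prefer the DaemonSet route over cron, a sketch along these lines runs the cleanup loop on every node. The image, interval, and the mechanism for getting the script into the container (a ConfigMap here) are assumptions, and the pod needs privileges to unmount host paths:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: stale-mount-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: stale-mount-cleanup
  template:
    metadata:
      labels:
        app: stale-mount-cleanup
    spec:
      containers:
        - name: cleanup
          image: busybox:1.36
          command: ["sh", "-c", "while true; do /scripts/stale-mount-cleanup.sh; sleep 300; done"]
          securityContext:
            privileged: true          # required to umount host paths
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: Bidirectional
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
        - name: scripts
          configMap:
            name: stale-mount-cleanup   # assumed ConfigMap holding the script above
            defaultMode: 0755
```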
🚀 Summary: Common Stale Mount Errors and Fixes
| Error Message | Cause | Fix |
|---|---|---|
| Stale file handle | CephFS lost connection to MDS | umount -lf <path> & restart CSI plugin |
| Permission denied | Stale mount prevents proper UID/GID mapping | Unmount & restart CSI plugin |
| Transport endpoint is not connected | CephFS client lost connection | Unmount & restart CSI plugin |
| Failed to attach volume | Stale PVC reference | Delete pod & restart CSI plugin |
| I/O error | Kernel-level CephFS failure | Restart affected node |
| Pod stuck in ContainerCreating | PVC mount failure due to stale mount | Delete pod & restart CSI plugin |