A **stale mount** occurs when a CephFS volume remains mounted on a Kubernetes node, but access to it is lost due to a **disrupted connection between the node and the Ceph cluster**. While the mount point still exists (`mount | grep ceph` shows it), any operation on it (e.g., `ls`, `cd`, `touch`) **fails, hangs, or returns a "Stale file handle" error**. This happens because the kernel or CSI driver believes the mount is valid, but the underlying Ceph storage is **unreachable or inconsistent**.
The most common causes of stale mounts include **Ceph cluster disruptions** (e.g., MDS or MON failures), **network instability** (packet loss between nodes and Ceph), **Kubernetes CSI plugin issues** (failed volume detach/reattach), **resource exhaustion** (CPU, memory, or I/O), and **unclean pod termination** (where a deleted pod doesn’t properly release its PVC mount). Additionally, **kernel-level CephFS client bugs** or **long-standing mounts that outlive network reconnections** can also lead to stale mounts, requiring manual intervention or node restarts.
Now that we know the permission issue that crops up when mounting PVCs is actually caused by stale mounts, let us understand why the logs do not capture these events and how we can remedy that.
---
## **1. Why Don't the Logs Show Anything About Stale Mounts or Permission Issues?**
This is common with CephFS and Kubernetes CSI setups because:
1. **Stale Mounts Don’t Always Generate Log Errors**
- The **stale mount condition** happens at the filesystem level. The `cephfs` kernel module may not report an explicit error unless the node **tries to access a broken mount**.
- In such cases, running:
```bash
mount | grep ceph
```
or
```bash
df -h | grep ceph
```
on the affected node might show a stuck mount.
2. **Permission Denied Issues Are Often a Side Effect**
- The real issue isn't permissions but a **stuck mount point**.
- Kubernetes surfaces these **pod PVC mount failures** as permission issues (e.g., `ls` failing with a stale file handle).
3. **CSI Driver Logs Might Not Capture Kernel-Level Issues**
- The CSI driver logs focus on **attaching and detaching volumes**, not kernel mount failures.
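To see what each layer actually reports, you can compare the kernel log on the affected node with the CSI driver logs. These are generic commands, not specific to this setup; the `-c csi-cephfsplugin` container name is an assumption based on a typical Rook deployment.
```bash
# Kernel-level CephFS errors land in the kernel ring buffer, not in the CSI logs
dmesg -T | grep -i ceph | tail -n 20
# or, on systemd-based nodes
journalctl -k | grep -i ceph | tail -n 20

# The CSI plugin logs only cover attach/detach and mount RPC activity
kubectl logs -n rook-ceph -l app=csi-cephfsplugin -c csi-cephfsplugin --tail=50
```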
---
## **2. Understanding the `mountOptions` and Their Impact**
### **Suggested Configuration**
```yaml
mountOptions:
- noatime
- nodiratime
- mds_namespace=myfs
- client_mount_timeout=30
- client_reconnect_stale=5
```
### **How This Will Affect Your Cluster**
- **New PVCs:** This will apply **only to new PVCs** created after the change.
- **Existing PVCs:** Won’t be affected immediately, but any pod using an old PVC will still use the old mount settings.
- **Cluster Impact:** No downtime for running applications, but you may need to manually remount PVCs for changes to take full effect.
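For reference, this is roughly where `mountOptions` sits in a complete `rook-cephfs` StorageClass. The provisioner name, parameters, and secret names below follow a typical default Rook deployment and are assumptions; compare them against the output of `kubectl get sc rook-cephfs -o yaml` (Step 1 below) before reusing anything.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com   # assumption: default Rook CephFS provisioner
parameters:
  clusterID: rook-ceph                       # assumption: Rook operator namespace
  fsName: myfs
  pool: myfs-replicated                      # assumption: default data pool name
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - noatime
  - nodiratime
  - mds_namespace=myfs
  - client_mount_timeout=30
  - client_reconnect_stale=5
```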
---
### **Line-by-Line Explanation of `mountOptions`**
#### **1. `noatime` (Disable Access Time Updates)**
- **What it does:**
- Prevents the system from updating the **access time** (`atime`) of files each time they are read.
- **Why it helps:**
- Reduces **unnecessary metadata writes** to CephFS, improving performance.
- Prevents `atime` updates from interfering with mount operations.
#### **2. `nodiratime` (Disable Directory Access Time Updates)**
- **What it does:**
- Similar to `noatime`, but applies **only to directories**.
- **Why it helps:**
- Reduces overhead when listing files in a mounted CephFS directory.
- Improves performance for workloads with frequent directory accesses.
#### **3. `mds_namespace=myfs` (Specify CephFS Namespace)**
- **What it does:**
- Ensures that the CephFS mount uses the correct namespace (`myfs`) in the Ceph metadata service (MDS).
- **Why it helps:**
- Avoids conflicts if multiple CephFS namespaces exist.
- Helps ensure the correct MDS handles the mount.
#### **4. `client_mount_timeout=30` (Reduce Mount Timeout)**
- **What it does:**
- Specifies a 30-second timeout for CephFS client mount operations.
- **Why it helps:**
- If a mount request takes **too long**, it fails quickly instead of hanging indefinitely.
- Helps avoid "stuck" CephFS mounts that never recover.
#### **5. `client_reconnect_stale=5` (Auto-Reconnect for Stale Mounts)**
- **What it does:**
- Forces the CephFS client to **automatically reconnect** to a stale mount within 5 seconds.
- **Why it helps:**
- Prevents nodes from **holding on to broken mounts** for too long.
- Ensures that PVCs recover quickly if Ceph temporarily loses connection.
---
## **3. How to Apply These Changes**
Since this is a **StorageClass** setting, you need to update the existing `rook-cephfs` StorageClass.
### **Step 1: Check Your Existing StorageClass**
```bash
kubectl get sc rook-cephfs -o yaml
```
### **Step 2: Patch the StorageClass**
```bash
kubectl patch sc rook-cephfs -p '{"mountOptions":["noatime","nodiratime","mds_namespace=myfs","client_mount_timeout=30","client_reconnect_stale=5"]}'
```
### **Step 3: Verify the Change**
After applying the patch, check that the `mountOptions` are correctly applied:
```bash
kubectl get sc rook-cephfs -o yaml | grep -A 5 "mountOptions"
```
Expected output:
```yaml
mountOptions:
- noatime
- nodiratime
- mds_namespace=myfs
- client_mount_timeout=30
- client_reconnect_stale=5
```
#### **Do I Need to Restart Anything?**
1. **Ceph CSI Plugin**: Restart the Ceph CSI pods to ensure the new mount options take effect:
```bash
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
Kubernetes will automatically restart the pods.
2. **Check Mounts on Nodes**: After restarting pods, verify that new PVCs are mounted with the correct options:
```bash
mount | grep ceph
```
You should see `mds_namespace=myfs` among the mount options.
### **Step 4: Restart Pods Using PVCs**
Since existing mounts won’t change, restart affected pods to force them to use the new mount options.
```bash
kubectl delete pod --selector=app=<your-app> -n <namespace>
```
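If the affected workload is managed by a Deployment or StatefulSet, a rolling restart is a gentler alternative to deleting pods directly; the resource name here is a placeholder:
```bash
kubectl rollout restart deployment/<your-app> -n <namespace>
```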
### **Step 5: Verify New Mounts**
Run this on a worker node with newly scheduled pods:
```bash
mount | grep ceph
```
You should see the new mount options applied.
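The exact output depends on your monitors and volume paths, but a CephFS kernel mount with the new options should look roughly like this (all values below are placeholders):
```bash
<mon-1>:6789,<mon-2>:6789:/volumes/csi/<subvolume> on /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount type ceph (rw,noatime,nodiratime,mds_namespace=myfs,...)
```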
---
## **4. Alternative: Apply Changes to Existing PVCs**
Since `mountOptions` don’t apply to **already mounted** PVCs, you can manually unmount and remount them without restarting nodes.
### **Option 1: Force Unmount Stale Mounts**
On an affected node:
```bash
sudo umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
```
Then restart the Ceph CSI plugin:
```bash
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
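If you are not sure which path to pass to `umount`, listing the CephFS mounts on the node first helps; the PV name filter is a placeholder:
```bash
# List CephFS mounts with their target path, source, and options, then filter by PV name
findmnt -t ceph -o TARGET,SOURCE,OPTIONS | grep <pv-name>
```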
### **Option 2: Remount All PVCs**
For an automated approach, drain and re-add the affected node:
```bash
kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data   # --delete-emptydir-data is the current name for --delete-local-data
kubectl uncordon <worker-node>
```
---
## **5. Will This Completely Prevent Future Stale Mounts?**
These changes **significantly reduce** the chances of stale mounts but don’t eliminate them completely. **Additional steps** include:
- Monitoring mount status with `mount | grep ceph`
- Using `kured` for automatic rebooting of stuck nodes
- Setting up alerts to detect stuck Ceph mounts (still need to figure out the details; one possible check is sketched below)
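One possible approach to the last point (a sketch only, not tested against this cluster): probe each CephFS mount on the node and expose a count through node_exporter's textfile collector, then alert on the metric in Prometheus. The collector directory is an assumption and depends on how node_exporter is configured.
```bash
#!/bin/bash
# Count CephFS mounts that fail a quick probe; a stale mount usually hangs,
# so use a timeout instead of a plain ls/stat.
STALE=0
for m in $(mount -t ceph | awk '{print $3}'); do
  timeout 5 stat "$m" >/dev/null 2>&1 || STALE=$((STALE + 1))
done
# Assumed node_exporter textfile collector path; adjust to your setup.
echo "cephfs_stale_mounts $STALE" > /var/lib/node_exporter/textfile_collector/cephfs_stale_mounts.prom
```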
---
## **Final Thoughts**
### ✅ **What This Change Helps With**
- Faster recovery from CephFS disconnections.
- Fewer permission errors caused by stale mounts.
- Reduced unnecessary metadata writes (improves performance).
- Faster failure detection when mounts break.
### ❌ **What This Won’t Fix**
- Hardware issues or node failures.
- Ceph cluster-wide failures (OSD crashes, high latency).
- Network disconnects lasting longer than the timeout values.
---
## **Common PVC Permission Errors Related to Stale Mounts in CephFS**
When a **stale CephFS mount** occurs, Kubernetes **PVC binding failures** can manifest in different ways. Below are the **typical error messages** you might encounter, along with what they mean.
---
## **1️⃣ "Stale File Handle" Error**
### **Error Message:**
```bash
ls: cannot access '/var/lib/kubelet/pods/.../volumes/kubernetes.io~csi/...': Stale file handle
```
### **Why It Happens:**
- This occurs when the **CephFS client loses connection** to the Ceph MDS (Metadata Server) but the mount point still exists.
- The Kubernetes pod tries to **access the PVC, but CephFS does not recognize the session** anymore.
- **Common Causes:**
- Ceph MDS crash or overload.
- Network failure between worker node and Ceph cluster.
- Kubernetes pod rescheduled without properly unmounting the volume.
🔍 **How to Fix:**
```bash
umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
---
## **2️⃣ "Permission Denied" When Accessing PVC**
### **Error Message:**
```bash
chmod: changing permissions of '/mnt/my-pvc': Permission denied
```
or
```bash
touch: cannot touch '/mnt/my-pvc/testfile': Permission denied
```
### **Why It Happens:**
- The **CephFS mount exists, but the permissions are incorrect**.
- The Kubernetes pod cannot write to the PVC **because the stale mount prevents proper UID/GID mapping**.
- **Common Causes:**
- Stale CephFS mount.
- Kubernetes didn't unmount the PVC properly before reattaching it to another pod.
- Ceph MDS lost track of the volume's permissions.
🔍 **How to Fix:**
```bash
umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-mount-path>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
💡 **Check the CephFS MDS status for issues:**
```bash
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mds stat
```
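For a broader view from the same toolbox pod, these standard Ceph commands show overall filesystem and cluster health:
```bash
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
```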
---
## **3️⃣ "Transport endpoint is not connected"**
### **Error Message:**
```bash
ls: cannot access '/mnt/my-pvc': Transport endpoint is not connected
```
### **Why It Happens:**
- The CephFS client **lost connection to the Ceph cluster**, and the mount point is now broken.
- The Kubernetes node **still thinks the mount exists**, but **CephFS no longer recognizes it**.
- **Common Causes:**
- Network failure between the node and Ceph Monitors (MONs).
- CephFS client bug or kernel panic.
- Improper unmounting of the PVC.
🔍 **How to Fix:**
```bash
umount -lf /mnt/my-pvc
mount | grep ceph # Verify if the mount is gone
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
---
## **4️⃣ "Failed to Attach Volume" or "Unable to Mount Volume"**
### **Error Message:**
```bash
Warning FailedAttachVolume 10s attachdetach-controller AttachVolume.Attach failed for volume "pvc-xyz" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
```
or
```bash
Warning FailedMount 8s kubelet MountVolume.MountDevice failed for volume "pvc-xyz" : timeout waiting for CephFS mount
```
### **Why It Happens:**
- Kubernetes tries to mount a **stale CephFS PVC** but **times out because the mount is in an inconsistent state**.
- **Common Causes:**
- Ceph CSI driver failure (`csi-cephfsplugin` crash).
- Persistent stale mount from a previous pod.
- Ceph MDS session expired for the PVC.
🔍 **How to Fix:**
```bash
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
kubectl delete pod <affected-pod>
kubectl logs -n rook-ceph -l app=csi-cephfsplugin -c csi-cephfsplugin --tail=50   # the plugin container is usually named csi-cephfsplugin
```
---
## **5️⃣ "I/O Error" on PVC**
### **Error Message:**
```bash
cp: error writing '/mnt/my-pvc/file.txt': Input/output error
```
or
```bash
dmesg | grep ceph
[ 1234.567890] ceph: I/O error on mountpoint /mnt/my-pvc
```
### **Why It Happens:**
- The CephFS client **failed to process an I/O request**, possibly due to a **MDS timeout or stale mount**.
- **Common Causes:**
- CephFS client kernel module crashed.
- Node lost access to Ceph storage.
- Kubernetes pod is writing to a PVC **whose backend storage is unreachable**.
🔍 **How to Fix:**
```bash
umount -lf /mnt/my-pvc
mount | grep ceph # Verify if the mount is gone
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
---
## **6️⃣ "Pod Stuck in ContainerCreating"**
### **Error Message:**
```bash
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 20s kubelet Unable to attach or mount volumes
```
### **Why It Happens:**
- The pod is **waiting indefinitely** for a PVC that **cannot be mounted due to a stale reference**.
- **Common Causes:**
- Kubernetes is trying to **mount a PVC that is already stale**.
- The node has an **orphaned CephFS mount that is preventing a new attach operation**.
🔍 **How to Fix:**
```bash
kubectl describe pvc <pvc-name>   # the "Used By"/"Mounted By" field shows which pod holds the PVC
kubectl delete pod <affected-pod>
kubectl delete pod -n rook-ceph -l app=csi-cephfsplugin
```
---
### **🛠 Final Fix: Automated Stale Mount Cleanup**
Instead of manually checking for stale mounts, you can **automate detection and cleanup** using a **cron job or Kubernetes DaemonSet** (a DaemonSet sketch follows the cron example below).
📌 **Automated Cleanup Script:**
```bash
#!/bin/bash
# Check for stale CephFS mounts and lazily unmount them.
# A stale mount usually hangs rather than returning an error, so probe each
# mount point with a timeout instead of a plain `ls`.
for MOUNT in $(mount -t ceph | awk '{print $3}'); do
  if ! timeout 5 stat "$MOUNT" &>/dev/null; then
    echo "Unmounting stale CephFS mount: $MOUNT"
    umount -lf "$MOUNT"
  fi
done
```
- **Run every 5 minutes as a cron job**:
```bash
crontab -e                                      # open the root crontab
*/5 * * * * /path/to/stale-mount-cleanup.sh     # then add this line
```
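As an alternative to cron, the same script can run as a DaemonSet so every node cleans up its own mounts. The manifest below is a rough sketch under several assumptions: the script is shipped in a ConfigMap named `stale-mount-cleanup`, the image provides `bash`, `mount`, and `umount`, and unmounting host paths requires a privileged container with bidirectional mount propagation on `/var/lib/kubelet`. Test carefully before relying on it.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cephfs-stale-mount-cleanup
  namespace: kube-system              # assumption: any namespace you run node agents in
spec:
  selector:
    matchLabels:
      app: cephfs-stale-mount-cleanup
  template:
    metadata:
      labels:
        app: cephfs-stale-mount-cleanup
    spec:
      containers:
      - name: cleanup
        image: debian:stable-slim     # assumption: image with bash and util-linux
        securityContext:
          privileged: true            # required to unmount host paths
        command: ["/bin/bash", "-c"]
        args:
          - while true; do /scripts/stale-mount-cleanup.sh; sleep 300; done
        volumeMounts:
        - name: kubelet-dir
          mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional   # so umount inside the pod propagates to the host
        - name: cleanup-script
          mountPath: /scripts
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet
      - name: cleanup-script
        configMap:
          name: stale-mount-cleanup    # assumption: ConfigMap holding the script above
          defaultMode: 0755
```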
---
### **🚀 Summary: Common Stale Mount Errors and Fixes**
|**Error Message**|**Cause**|**Fix**|
|---|---|---|
|`Stale file handle`|CephFS lost connection to MDS|`umount -lf <path>` & restart CSI plugin|
|`Permission denied`|Stale mount prevents proper UID/GID mapping|Unmount & restart CSI plugin|
|`Transport endpoint is not connected`|CephFS client lost connection|Unmount & restart CSI plugin|
|`Failed to attach volume`|Stale PVC reference|Delete pod & restart CSI plugin|
|`I/O error`|Kernel-level CephFS failure|Unmount & restart CSI plugin; reboot the node if it persists|
|`Pod stuck in ContainerCreating`|PVC mount failure due to stale mount|Delete pod & restart CSI plugin|