How Kubernetes Handles Node Failure
This document details how Kubernetes handles node failure, provides a detailed troubleshooting section, and lists our recommendations for achieving high availability for critical applications.
What Happens to Pods When a Worker Node Fails in Kubernetes?
When a worker node in a Kubernetes cluster fails, Kubernetes detects the issue and handles it based on its built-in resilience mechanisms. Here’s what happens step-by-step:
1. Detection of Node Failure
- Kubernetes uses heartbeats (periodic signals) sent from the worker node to the control plane via the Kubelet.
- If the control plane does not receive a heartbeat within a certain time (the default `node-monitor-grace-period` is 40 seconds), the node is marked as NotReady.
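The detection timing above is configurable. As a sketch (defaults shown; verify against your Kubernetes version before changing anything), the kubelet-side settings look like:

```yaml
# KubeletConfiguration fragment -- defaults shown, adjust with care.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: 10s   # how often the kubelet reports node status
nodeLeaseDurationSeconds: 40     # duration of the node's heartbeat lease
```

The corresponding control-plane side is the kube-controller-manager flag `--node-monitor-grace-period` (default 40s) mentioned above.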
2. Impact on Pods
- Once the node is marked NotReady, the pods running on it become unreachable.
- The cluster waits for a grace period (default: 5 minutes, controlled by `pod-eviction-timeout`, or per pod via `tolerationSeconds` on clusters using taint-based evictions) to allow the node to recover.
- If the node does not recover:
  - Pods managed by ReplicaSets, Deployments, or StatefulSets are automatically rescheduled on healthy nodes if resources are available (StatefulSet pods are replaced only after the old pod is confirmed terminated).
  - DaemonSet pods are bound to their node and are not rescheduled elsewhere; they restart when the node recovers.
  - Standalone Pods (those not managed by a controller) are not rescheduled and require manual intervention.
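On clusters that use taint-based evictions, the eviction delay can be tuned per workload instead of cluster-wide. A minimal sketch, assuming the default NoExecute taints (the 60-second value is illustrative):

```yaml
# Pod spec fragment: evict this pod after 60s on a NotReady/unreachable
# node instead of the default 300s.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```

Shorter values speed up failover for critical workloads; longer values avoid churn during brief network blips.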
3. Storage Considerations
- Stateless Pods: These can be rescheduled easily since they don’t depend on persistent storage.
- Stateful Pods: Rescheduling depends on persistent volumes (PVs). We use Rook-Ceph with the `filesystem` storage class, which supports the ReadWriteMany (RWX) access mode, so persistent volumes can be mounted on multiple nodes and remain accessible even when a node fails. In addition, Ceph stores each PV with a replication factor of 3 (1 original + 2 copies), so the data itself survives the loss of a node, facilitating seamless failover.
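For reference, a PVC requesting RWX storage might look like the following sketch; the storage class name `rook-cephfs` is an assumption and should be replaced with the class name actually defined in the cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # illustrative name
spec:
  accessModes:
    - ReadWriteMany              # RWX: mountable from pods on multiple nodes
  storageClassName: rook-cephfs  # assumed Rook-Ceph filesystem class name
  resources:
    requests:
      storage: 10Gi
```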
4. Impact on Services
- If a failed pod is part of a Service, Kubernetes automatically removes the pod’s endpoint from the Service. This ensures traffic is not routed to unreachable pods.
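Endpoint membership also depends on pod readiness: a `readinessProbe` lets the Service drop a pod from rotation as soon as it stops responding, rather than waiting for eviction. A minimal sketch (path, port, and timings are illustrative):

```yaml
# Container spec fragment: the Service stops routing traffic to this
# pod as soon as the probe fails three times in a row.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```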
Troubleshooting Steps for Node or Pod Failures
If a node fails and its pods become unavailable, follow these troubleshooting steps:
Step 1: Check Node Health
Run the following command to check the status of the node:
```shell
kubectl get nodes
```

- Look for the node marked as NotReady.
To get more details about the node:
```shell
kubectl describe node <node-name>
```

- Look for issues like taints, resource pressure, or kubelet connectivity.
Step 2: Check Pod Status
Find the pods running on the node:
```shell
kubectl get pods --all-namespaces -o wide | grep <node-name>
```

Check the status of affected pods:
```shell
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
```

Look for events or errors like CrashLoopBackOff, ImagePullBackOff, or Evicted.
Step 3: Access the Pod Directly
If the pod is part of a Service, try accessing it using its NodePort:
```shell
curl http://<node-ip>:<nodeport> -u <username>:<password>
```

Example:

```shell
curl http://192.168.49.20:30080 -u admin:password123
```

Verify whether the issue is with pod networking or the application itself.
Step 4: Distinguish Between Node and Application Issues
- If the node is healthy but the pod is failing:
Check the application logs using:
```shell
kubectl logs <pod-name> -n <namespace>
```

Review application-specific credentials, endpoints, or configurations.
Step 5: Provide Application-Specific Details
Ensure you have:
- URL: The full endpoint of the application.
- Username and Password: Ensure these are valid and not expired.
Example:
```
URL: http://192.168.49.20:30080/login
Username: admin
Password: password123
```

Provide the Stribog support team with the complete URL, Worker Node Name, NodePort, Application Username, and Password (for further testing). Describe both the current behavior and the expected behavior.
Step 6: Check Resource Availability
Ensure the cluster has enough resources (CPU, memory) for pods to reschedule:
```shell
kubectl top nodes
kubectl top pods
```

If resources are insufficient, consider scaling up your cluster by adding nodes.
Additional Troubleshooting Tips
- Check Networking: Use tools like `ping` or `curl` to verify connectivity between pods, nodes, and services.
- Verify DNS: Ensure that DNS resolution is working for the cluster:

```shell
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default
```

- Examine Events: View Kubernetes events for insights:

```shell
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Steps to Prevent Downtime
1. Use Pod Anti-Affinity
- Why? Pod anti-affinity ensures that multiple replicas of a workload are distributed across different nodes, avoiding a single point of failure. This is particularly important for high-availability applications.
- How? Use the `replicas` field in your deployment to specify the desired number of pod replicas. Combine this with `podAntiAffinity` rules to spread the replicas across nodes.
Example configuration for a Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 3  # Number of pod replicas
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-app-container
          image: my-app-image:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
```

Explanation of Key Fields:
- `replicas`: Specifies the desired number of identical pods to run. In this example, 3 replicas are created.
- `affinity`: Includes a `podAntiAffinity` rule to ensure pods with the `app: my-app` label are scheduled on separate nodes.
- `topologyKey`: The `kubernetes.io/hostname` key ensures that pods are spread across different nodes based on their hostnames.
Why This is Important:
- If all replicas are placed on a single node and that node fails, your application will experience downtime.
- With pod anti-affinity, Kubernetes ensures that pods are distributed, improving the availability of your application even during node failures.
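Anti-affinity spreads the replicas; a PodDisruptionBudget additionally caps how many of them may be down at once during voluntary disruptions such as node drains. A minimal sketch matching the `my-app` Deployment above (the PDB name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb      # illustrative name
spec:
  minAvailable: 2       # keep at least 2 of the 3 replicas running
  selector:
    matchLabels:
      app: my-app
```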
2. Set Resource Requests and Limits
Define resource requirements to prevent overloading nodes:
```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"
```
3. Use a DaemonSet for Critical Applications
DaemonSets ensure that critical applications run on every node (e.g., a critical application like EC). Example:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: critical-app
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      containers:
        - name: critical-app
          image: my-critical-app-image
```
4. Regularly Test Failover Scenarios
- Simulate node failures together with the Stribog support team to validate your cluster's resilience. Always test failover scenarios on the Dev cluster, NOT on the production cluster.
By following these steps, you can minimize downtime and ensure smooth operation even during node failures. If you encounter issues, the troubleshooting process will help isolate and address the root cause effectively.