The following document explains how Kubernetes handles worker node failure, provides a detailed troubleshooting section, and outlines our recommendations for achieving high availability for critical applications.
---
### What Happens to Pods When a Worker Node Fails in Kubernetes?
When a worker node in a Kubernetes cluster fails, Kubernetes detects the issue and handles it based on its built-in resilience mechanisms. Here’s what happens step-by-step:
---
### 1. **Detection of Node Failure**
- Kubernetes uses **heartbeats** (periodic signals) sent from the worker node to the control plane via the **Kubelet**.
- If the control plane does not receive a heartbeat within a certain time (default `node-monitor-grace-period` is 40 seconds), the node is marked as **NotReady**.
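To see the heartbeat state the control plane has recorded for a node, you can inspect the node's conditions and its lease. This is a minimal sketch; `<node-name>` is a placeholder:
```bash
# Inspect the node's conditions, including the last heartbeat time seen by the control plane
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# Node leases are renewed by the kubelet (roughly every 10 seconds);
# a stale renewTime indicates missing heartbeats
kubectl get lease <node-name> -n kube-node-lease -o yaml
```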
---
### 2. **Impact on Pods**
- Once the node is marked **NotReady**, the pods running on it become unreachable.
- The cluster waits for a **grace period** (default: 5 minutes) to allow the node to recover. Historically this was set by the controller-manager flag `pod-eviction-timeout`; on current clusters it is enforced through taint-based eviction, where pods receive a default toleration of 300 seconds for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints.
- If the node does not recover:
  - Pods managed by **ReplicaSets** or **Deployments** are automatically recreated on healthy nodes if resources are available.
  - **StatefulSet** pods are replaced more cautiously: the replacement is only created once the old pod is confirmed terminated (for example, after the node object is deleted or the pod is force-deleted), to preserve at-most-one semantics.
  - **DaemonSet** pods are bound to their node; they are not rescheduled elsewhere and return when the node recovers.
  - **Standalone Pods** (those not managed by a controller) are not rescheduled and require manual intervention.
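If the default 5-minute eviction delay is too long for a critical workload, the toleration that controls it can be shortened per pod. Below is a minimal pod-spec fragment as a sketch; the 30-second value is only an example and should be tuned to your environment:
```yaml
# Pod template fragment: evict this pod ~30s after its node becomes not-ready/unreachable
# (overrides the default 300s tolerations injected by the DefaultTolerationSeconds admission plugin)
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
```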
---
### 3. **Storage Considerations**
- **Stateless Pods**: These can be rescheduled easily since they don’t depend on persistent storage.
- **Stateful Pods**: Rescheduling depends on persistent volumes (PVs). Since we use **Rook-Ceph** with the `filesystem` storage class, which supports the **ReadWriteMany (RWX)** access mode, persistent volumes can be mounted on multiple nodes and remain accessible when a node fails. In addition, Ceph stores the data with a replication factor of 3 (1 original + 2 copies), so the underlying data stays available even if the failed node hosted one of the replicas, which should allow seamless failover.
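For reference, this is a minimal sketch of a PVC requesting RWX storage from the Rook-Ceph filesystem storage class; the storage class name `rook-cephfs` and the requested size are assumptions and should match what is actually provisioned in the cluster:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
  namespace: default
spec:
  accessModes:
    - ReadWriteMany              # RWX: the volume can be mounted by pods on multiple nodes
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-cephfs  # assumed name of the CephFS-backed storage class
```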
---
### 4. **Impact on Services**
- If a failed pod is part of a **Service**, Kubernetes automatically removes the pod’s endpoint from the Service. This ensures traffic is not routed to unreachable pods.
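You can confirm that unreachable pods have been dropped from a Service by listing its endpoints (a quick check; substitute your Service name, namespace, and label selector):
```bash
# Only Ready pod IPs should appear here; pods on the failed node are removed once they are marked not ready
kubectl get endpoints <service-name> -n <namespace>

# Compare against the pods backing the Service
kubectl get pods -n <namespace> -l <service-selector> -o wide
```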
---
### Troubleshooting Steps for Node or Pod Failures
If a node fails and its pods become unavailable, follow these troubleshooting steps:
#### Step 1: **Check Node Health**
- Run the following command to check the status of the node:
```bash
kubectl get nodes
```
- Look for the node marked as **NotReady**.
- To get more details about the node:
```bash
kubectl describe node <node-name>
```
- Look for issues like taints, resource pressure, or kubelet connectivity.
---
#### Step 2: **Check Pod Status**
- Find the pods running on the node:
```bash
kubectl get pods --all-namespaces -o wide | grep <node-name>
```
- Check the status of affected pods:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
```
- Look for events or errors like **CrashLoopBackOff**, **ImagePullBackOff**, or **Evicted**.
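Two additional checks that often help at this step (placeholders as above): events scoped to the affected pod, and logs from the previous container instance if it has restarted:
```bash
# Events recorded for this specific pod (scheduling failures, evictions, image pull errors, ...)
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

# Logs from the previous container instance, useful for CrashLoopBackOff
kubectl logs <pod-name> -n <namespace> --previous
```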
---
#### Step 3: **Access the Pod Directly**
- If the pod is part of a **Service**, try accessing it using its **NodePort**:
```bash
curl http://<node-ip>:<nodeport> -u <username>:<password>
```
- Example:
```
curl http://192.168.49.20:30080 -u admin:password123
```
- Verify whether the issue is with pod networking or the application itself.
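If the NodePort is unreachable, bypassing the node network with a port-forward helps separate a networking problem from an application problem. A sketch, assuming the container listens on port 8080:
```bash
# Forward local port 8080 straight to the pod, bypassing the Service and NodePort path
kubectl port-forward pod/<pod-name> -n <namespace> 8080:8080

# In another terminal: if this works but the NodePort does not,
# the issue is likely networking rather than the application
curl http://localhost:8080 -u <username>:<password>
```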
---
#### Step 4: **Distinguish Between Node and Application Issues**
- If the node is healthy but the pod is failing:
- Check the application logs using:
```bash
kubectl logs <pod-name> -n <namespace>
```
- Review application-specific credentials, endpoints, or configurations.
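To verify the configuration the application actually sees, you can inspect the running container directly. A sketch; the config file path is illustrative and depends on the application:
```bash
# Environment variables visible to the application
kubectl exec -it <pod-name> -n <namespace> -- env | sort

# Mounted configuration files (path is application-specific)
kubectl exec -it <pod-name> -n <namespace> -- cat /path/to/app/config
```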
---
#### Step 5: **Provide Application-Specific Details**
- Ensure you have:
- **URL:** The full endpoint of the application.
- **Username and Password:** Ensure these are valid and not expired.
- Example:
```
URL: http://192.168.49.20:30080/login
Username: admin
Password: password123
```
- Provide the Stribog support team with the complete URL, worker node name, NodePort, application username, and password (for further testing). Describe both the current behavior and the expected behavior.
---
#### Step 6: **Check Resource Availability**
- Ensure the cluster has enough resources (CPU, memory) for pods to reschedule:
```bash
kubectl top nodes
kubectl top pods
```
- If resources are insufficient, consider scaling up your cluster by adding nodes.
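Beyond live usage, it helps to check how much capacity is already reserved by resource requests and whether any replacement pods are stuck in Pending for lack of resources:
```bash
# Requested vs allocatable resources on a node
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

# Pods that could not be scheduled (often due to insufficient CPU/memory)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```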
---
### Additional Troubleshooting Tips
- **Check Networking:** Use tools like `ping` or `curl` to verify connectivity between pods, nodes, and services (a sketch for running these checks from inside the cluster follows this list).
- **Verify DNS:** Ensure that DNS resolution is working for the cluster.
```bash
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default
```
- **Examine Events:** View Kubernetes events for insights:
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
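To run the connectivity checks above from inside the cluster (rather than from a node), a throwaway debug pod works well; the `busybox` image is just a convenient default:
```bash
# Start a temporary pod with basic network tools; it is removed when the shell exits
kubectl run net-debug --rm -it --image=busybox --restart=Never -- sh

# Inside the shell: test pod-to-service and pod-to-pod connectivity, e.g.
#   wget -qO- http://<service-name>.<namespace>.svc.cluster.local:<port>
#   ping <pod-ip>
```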
---
### Steps to Prevent Downtime
#### 1. **Use Pod Anti-Affinity**
- **Why?** Pod anti-affinity ensures that multiple replicas of a workload are distributed across different nodes, avoiding a single point of failure. This is particularly important for high-availability applications.
- **How?** Use the `replicas` field in your deployment to specify the desired number of pod replicas. Combine this with `podAntiAffinity` rules to spread the replicas across nodes.
Example configuration for a Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 3  # Number of pod replicas
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-app-container
          image: my-app-image:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
```
---
### Explanation of Key Fields:
1. **`replicas`:** Specifies the desired number of identical pods to run. In this example, `3` replicas are created.
2. **`affinity`:** Includes a `podAntiAffinity` rule to ensure pods with the `app: my-app` label are scheduled on separate nodes.
3. **`topologyKey`:** Setting this to `kubernetes.io/hostname` ensures that pods are spread across different nodes based on their hostnames.
---
### Why This is Important:
- If all replicas are placed on a single node and that node fails, your application will experience downtime.
- With pod anti-affinity, Kubernetes ensures that pods are distributed, improving the availability of your application even during node failures.
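After applying the Deployment, you can confirm that the anti-affinity rule is being honored by checking which node each replica landed on:
```bash
# Each replica should appear on a different node (NODE column)
kubectl get pods -l app=my-app -n default -o wide
```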
---
#### 2. **Set Resource Requests and Limits**
- Define resource requirements to prevent overloading nodes:
```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"
```
#### 3. **Use a DaemonSet for Critical Applications**
- DaemonSets ensure that a critical application (e.g., EC) runs on every node. Example:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: critical-app
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      containers:
        - name: critical-app
          image: my-critical-app-image
```
#### 4. **Regularly Test Failover Scenarios**
- Simulate node failures together with the Stribog support team to validate your cluster’s resilience. Always test failover scenarios on the Dev cluster, NOT on the production cluster.
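A controlled way to rehearse a node failure on the Dev cluster is to cordon and drain a worker node and watch the workloads reschedule. This is a sketch of the usual sequence; the drain flags may need adjusting for your workloads:
```bash
# Stop new pods from being scheduled onto the node
kubectl cordon <dev-node-name>

# Evict the existing pods so they reschedule elsewhere
kubectl drain <dev-node-name> --ignore-daemonsets --delete-emptydir-data

# Watch the replacements come up on other nodes
kubectl get pods --all-namespaces -o wide --watch

# Return the node to service afterwards
kubectl uncordon <dev-node-name>
```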
---
By following these steps, you can minimize downtime and ensure smooth operation even during node failures. If you encounter issues, the troubleshooting process will help isolate and address the root cause effectively.