Cluster Maintenance

This lab will cover worker node maintenance as well as cluster backup and restore. Node maintenance will enable you to take worker nodes offline gracefully and complete maintenance tasks (Kernel upgrade, Container runtime, hardware changes, etc). Backup and restore will reduce the time to recovery should you lose the cluster state, it can also help with migrating data from one cluster to another.

Kubectl drain can be used to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). By safely evicting pods with drain the pod’s containers will gracefully terminate and respect any termination policies you have specified.

Note: By default kubectl drain will ignore certain system pods on the node that cannot be killed (Daemonsets, etc);

When kubectl drain returns successfully, that indicates that all of the pods (except the ones excluded as described in the previous paragraph) have been safely evicted. It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.

Verify current node status.

kubectl get nodes -o wide
NAME       STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
master01   Ready    master   20h     v1.19.6   10.10.95.11   <none>        Ubuntu 20.04.1 LTS   5.4.0-48-generic   docker://19.3.14
master02   Ready    master   14m     v1.19.6   10.10.95.12   <none>        Ubuntu 20.04.1 LTS   5.4.0-48-generic   docker://19.3.14
master03   Ready    master   4m47s   v1.19.6   10.10.95.13   <none>        Ubuntu 20.04.1 LTS   5.4.0-48-generic   docker://19.3.14
worker01   Ready    <none>   19h     v1.19.6   10.10.95.21   <none>        Ubuntu 20.04.1 LTS   5.4.0-48-generic   docker://19.3.14
worker02   Ready    <none>   18h     v1.19.6   10.10.95.22   <none>        Ubuntu 20.04.1 LTS   5.4.0-48-generic   docker://19.3.14
worker03   Ready    <none>   18h     v1.19.6   10.10.95.23   <none>        Ubuntu 20.04.1 LTS   5.4.0-48-generic   docker://19.3.14

Let’s create a sample workload to see how it behaves when the node is drained.

kubectl create ns lab-drain
namespace/lab-drain created

Change default namespace to lab-drain:

kubectl config set-context --current --namespace lab-drain
kubectl create deployment lab-drain --image=nginx
deployment.apps/lab-drain created
kubectl scale deployment lab-drain --replicas=6
deployment.apps/lab-drain scaled

Verify that minimum 1 pod has been created on every workers

kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE           NOMINATED NODE   READINESS GATES
lab-drain-65f68f47fb-5h562   1/1     Running   0          24s   192.168.39.205   k8s-worker03   <none>           <none>
lab-drain-65f68f47fb-5ldlh   1/1     Running   0          24s   192.168.79.102   k8s-worker01   <none>           <none>
lab-drain-65f68f47fb-5zrlp   1/1     Running   0          24s   192.168.69.253   k8s-worker02   <none>           <none>
lab-drain-65f68f47fb-flfz9   1/1     Running   0          37s   192.168.39.253   k8s-worker03   <none>           <none>
lab-drain-65f68f47fb-h66xc   1/1     Running   0          24s   192.168.79.77    k8s-worker01   <none>           <none>
lab-drain-65f68f47fb-hvhpv   1/1     Running   0          24s   192.168.69.220   k8s-worker02   <none>           <none>

Cordon off k8s-worker01

kubectl cordon worker01
node/k8s-worker01 cordoned

Verify worker1 has been cordoned.

kubectl get nodes
NAME           STATUS                     ROLES    AGE    VERSION
k8s-master01   Ready                      master   239d   v1.19.2
k8s-master02   Ready                      master   239d   v1.19.2
k8s-master03   Ready                      master   239d   v1.19.2
k8s-worker01   Ready,SchedulingDisabled   <none>   95d    v1.19.2
k8s-worker02   Ready                      <none>   95d    v1.19.2
k8s-worker03   Ready                      <none>   95d    v1.19.2

The node should show SchedulingDisabled.

Drain the k8s-worker01 node:

kubectl drain worker01 --ignore-daemonsets --delete-local-data
node/k8s-worker01 already cordoned
evicting pod lab-drain/lab-drain-65f68f47fb-h66xc
evicting pod lab-drain/lab-drain-65f68f47fb-5ldlh
....

Verify that there are no more pods on worker1, and the pod that was there moved to other workers.

kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE     IP               NODE           NOMINATED NODE   READINESS GATES
lab-drain-65f68f47fb-5h562   1/1     Running   0          4m34s   192.168.39.205   k8s-worker03   <none>           <none>
lab-drain-65f68f47fb-5zrlp   1/1     Running   0          4m34s   192.168.69.253   k8s-worker02   <none>           <none>
lab-drain-65f68f47fb-cbptw   1/1     Running   0          102s    192.168.69.212   k8s-worker02   <none>           <none>
lab-drain-65f68f47fb-flfz9   1/1     Running   0          4m47s   192.168.39.253   k8s-worker03   <none>           <none>
lab-drain-65f68f47fb-hvhpv   1/1     Running   0          4m34s   192.168.69.220   k8s-worker02   <none>           <none>
lab-drain-65f68f47fb-l229s   1/1     Running   0          102s    192.168.39.192   k8s-worker03   <none>           <none>

Once that’s done, the node would be ready for maintenance. Let’s upgrade OS on k8s-worker01 to a newer bugfix version. Open a new terminal tab and connect to worker1.

ssh worker01
sudo apt update && sudo apt-get upgrade -y

Switch to the client terminal and verify that k8s-worker01 is back to ready status.

kubectl uncordon worker01
node/worker01 uncordoned
kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
lab-drain-65f68f47fb-5rdg2   1/1     Running   0          4m4s    172.16.1.154   worker03   <none>           <none>
lab-drain-65f68f47fb-75ds2   1/1     Running   0          5m33s   172.16.0.214   worker02   <none>           <none>
lab-drain-65f68f47fb-k82lx   1/1     Running   0          5m33s   172.16.1.152   worker03   <none>           <none>
lab-drain-65f68f47fb-mvn2l   1/1     Running   0          5m33s   172.16.1.151   worker03   <none>           <none>
lab-drain-65f68f47fb-pfmwn   1/1     Running   0          5m33s   172.16.0.215   worker02   <none>           <none>
lab-drain-65f68f47fb-w48gz   1/1     Running   0          4m4s    172.16.0.216   worker02   <none>           <none>

Existing pods have been moved to other nodes will stay put, if we scale to 1, and after scaling to 6 they will be distributed to k8s-worker01

kubectl scale deployment lab-drain --replicas=1
deployment.apps/lab-drain scaled
kubectl scale deployment lab-drain --replicas=6
deployment.apps/lab-drain scaled
kubectl get pods -o wide
lab-drain-65f68f47fb-9tkp4   1/1     Running   0          14s     172.16.1.155   worker03   <none>           <none>
lab-drain-65f68f47fb-gxx27   1/1     Running   0          14s     172.16.0.218   worker02   <none>           <none>
lab-drain-65f68f47fb-lz4jp   1/1     Running   0          14s     172.16.1.156   worker03   <none>           <none>
lab-drain-65f68f47fb-pfmwn   1/1     Running   0          6m12s   172.16.0.215   worker02   <none>           <none>
lab-drain-65f68f47fb-pjf5f   1/1     Running   0          14s     172.16.1.19    worker01   <none>           <none>
lab-drain-65f68f47fb-v8lhf   1/1     Running   0          14s     172.16.1.20    worker01   <none>           <none>

Clean up our lab-drain namespace.

kubectl delete ns lab-drain
namespace "lab-drain" deleted

Return to the default namespace

kubectl config set-context --current --namespace default