Making Nodes Drainable
In this lesson, we will re-run the chaos experiment after making our nodes drainable by scaling our cluster and the Istio components.
Taking a look at Istio Deployments#
Let’s take a look at Istio Deployments.
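A minimal sketch of that command, assuming Istio was installed into the istio-system Namespace:

```bash
# List the Deployments created by the Istio installation
# (assumes the istio-system Namespace).
kubectl --namespace istio-system \
    get deployments
```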
The output is as follows.
NAME READY UP-TO-DATE AVAILABLE AGE
istio-ingressgateway 1/1 1 1 12m
istiod 1/1 1 1 13m
prometheus 1/1 1 1 12m
We can see that there are two components, not counting prometheus. If we focus on the READY column, we can see that each of them is running a single replica.
The two Istio components each have a HorizontalPodAutoscaler (HPA) associated with them. The HPAs control how many replicas we’ll have, based on metrics like CPU and memory usage. What we need to do is set the minimum number of instances to 2.
Since the experiment revealed that istio-ingressgateway should have at least two replicas, that’s the one we’ll focus on. Later on, the experiment might reveal other issues. If it does, we’ll deal with them then.
Scaling the cluster#
Before we dive into scaling Istio, we are going to explore scaling the cluster itself. Increasing the number of replicas of the Istio components would be pointless, as a way to solve the problem of not being able to drain a node, if there is only one node in the cluster. We need the Gateway not only scaled but also distributed across different nodes of the cluster. Only then can we hope to drain a node successfully while the Gateway is running on it. That way, the experiment might shut down one replica while the others keep running somewhere else. Fortunately for us, Kubernetes always does its best to distribute instances of our apps across different nodes. As long as it can, it will not run multiple replicas on a single node.
So, our first action is to scale our cluster. However, scaling a cluster is not the same everywhere. Therefore, the commands to scale the cluster will differ depending on where you’re running it.
Defining an environment variable#
To begin, we’ll define an environment variable that will contain the name of our cluster.
Please replace [...] with the name of your cluster in the command that follows. If you’re using one of my Gists, the name should be chaos.
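A minimal sketch, assuming we store the name in a variable called CLUSTER_NAME:

```bash
# Replace [...] with the name of your cluster (e.g., chaos).
export CLUSTER_NAME=[...]
```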
From now on, the instructions will differ depending on the Kubernetes distribution you’re running.
Please follow the instructions corresponding to your Kubernetes flavor. If you’re not running your cluster inside one of the providers I’m using, you’ll have to figure out the equivalent commands yourself.
For GKE clusters#
Please execute the command that follows only if you are using Google Kubernetes Engine (GKE). Bear in mind that, if you have a newly created account, the command might fail due to insufficient quotas. If that happens, follow the instructions in the output to request a quota increase.
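A sketch of what that might look like, assuming a ZONE variable holds the zone you used when creating the cluster and that we want three nodes in the default node pool:

```bash
# Resize the default node pool of the GKE cluster to three nodes.
gcloud container clusters resize $CLUSTER_NAME \
    --zone $ZONE \
    --num-nodes 3
```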
For EKS clusters#
Please execute the commands that follow only if you are using Amazon’s Elastic Kubernetes Service (EKS).
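A sketch, assuming the cluster was created with eksctl and that a NODE_GROUP variable holds the name of your node group:

```bash
# Scale the EKS node group to three nodes.
eksctl scale nodegroup \
    --cluster $CLUSTER_NAME \
    --name $NODE_GROUP \
    --nodes 3 \
    --nodes-max 3
```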
For AKS clusters#
Please execute the commands that follow only if you are using Azure Kubernetes Service (AKS).
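A sketch, assuming a RESOURCE_GROUP variable holds the resource group the cluster was created in:

```bash
# Scale the AKS cluster to three nodes.
az aks scale \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --node-count 3
```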
Depending on where you’re running your Kubernetes cluster, the process can take anything from seconds to minutes. We’ll be able to continue once the cluster is scaled.
Checking the nodes to confirm#
To be on the safe side, we’ll confirm that everything is OK by outputting the nodes.
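Listing the nodes is enough for that:

```bash
# List the worker nodes and their status.
kubectl get nodes
```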
The output, in my case, is as follows.
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready <none> 60s v1.15.9-gke.22
gke-chaos-... Ready <none> 60s v1.15.9-gke.22
gke-chaos-... Ready <none> 3h28m v1.15.9-gke.22
Repeat the previous command if there are fewer than three nodes with the status Ready.
In my case, there are three worker nodes. That does not include the control plane, which is out of reach when using GKE (as I do) and most other managed Kubernetes clusters.
Scaling the Istio components#
Now that we have a cluster with multiple nodes, we can figure out how to scale Istio components. Or, to be more precise, we’ll change the minimum number of replicas defined in the HPA associated with the Gateway.
Let’s start by taking a quick look at what we have.
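A sketch of the command, assuming the HPAs live in the istio-system Namespace:

```bash
# List the HorizontalPodAutoscalers created by the Istio installation.
kubectl --namespace istio-system \
    get hpa
```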
The output is as follows.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
istio-ingressgateway Deployment/istio-ingressgateway 6%/80% 1 5 1 70s
istiod Deployment/istiod 0%/80% 1 5 1 62m
We can see that the minimum number of Pods for those HPAs is 1. If we change that value to 2, we should always have two replicas of each while still allowing the components to scale up to 5 replicas if needed.
Typically, we should create a new Istio manifest with all the changes we need. However, in the interest of moving fast, we’ll apply a patch to the HPA. As I mentioned already, this course is not about Istio, and I’ll assume that you’ll check the documentation if you don’t already know how to update it properly.
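One way to apply such a patch, assuming the HPA is named istio-ingressgateway, is sketched below; it changes only the minReplicas field.

```bash
# Ensure the Gateway's HPA never runs fewer than two replicas.
kubectl --namespace istio-system \
    patch hpa istio-ingressgateway \
    --patch '{"spec": {"minReplicas": 2}}'
```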
Now that we’ve modified Istio in our cluster, we’re going to take another look at the HPAs.
The output is as follows.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
istio-ingressgateway Deployment/istio-ingressgateway 7%/80% 2 5 2 13m
istiod Deployment/istiod 0%/80% 1 5 1 75m
We can see that istio-ingressgateway now has both the minimum and the actual number of Pods set to 2. If, in your case, the number of replicas is still 1, the second Pod is not yet up and running. If that’s the case, wait for a few moments and repeat the command that retrieves all the HPAs.
Now let’s take a quick look at the Pods in istio-system. We want to confirm that they’re running on different nodes.
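Adding --output wide to the Pod listing shows, among other things, the node each Pod is scheduled on:

```bash
# List the Pods together with the nodes they are running on.
kubectl --namespace istio-system \
    get pods --output wide
```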
The output is as follows.
NAME READY STATUS RESTARTS AGE IP NODE ...
istio-ingressgateway-... 1/1 Running 0 6m59s 10.52.1.3 gke-chaos-...-kqgk ...
istio-ingressgateway-... 1/1 Running 0 16m 10.52.1.2 gke-chaos-...-kjkz ...
istiod-... 1/1 Running 0 77m 10.52.0.14 gke-chaos-...-kjkz ...
prometheus-... 2/2 Running 0 77m 10.52.0.16 gke-chaos-...-kjkz ...
If we take a look at the istio-ingressgateway Pods, we can see that they are running on different nodes. A quick glance tells us that the replicas are distributed across the cluster, each running on a different server.
Re-running the chaos experiment and inspecting the output#
Let’s re-run our experiment and see what we are getting.
We are going to execute exactly the same experiment as before. Previously, it failed for an unexpected reason: not because of problems with go-demo-8, but because of Istio itself. Let’s see what we get now.
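As a sketch, assuming the experiment definition from the previous lesson is stored in chaos/node-drain.yaml (adjust the path to wherever you keep yours), re-running it with the Chaos Toolkit CLI looks like this:

```bash
# Re-run the node-drain experiment.
chaos run chaos/node-drain.yaml
```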
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we drain a node
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: drain-node
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] Rollback: uncordon-node
[... INFO] Action: uncordon-node
[... INFO] Experiment ended with status: completed
Once the node was drained, the experiment waited for one second. Then it confirmed that the steady-state hypothesis was still met; it passed. So, this time, the nodes seem to be drainable.
In the end, the experiment rolled back so that there is no permanent negative effect. We can confirm that by listing all the nodes.
In my case, the output is as follows.
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready <none> 5m25s v1.15.9-gke.22
gke-chaos-... Ready <none> 5m25s v1.15.9-gke.22
gke-chaos-... Ready <none> 3h32m v1.15.9-gke.22
The status of all the nodes is Ready. That’s awesome. Not only have we managed to make the nodes of our cluster drainable, but we also managed to roll back and make them all “normal” again by the end of the process.
In the next lesson, we will carry out an experiment to check how our cluster behaves if nodes get destroyed or get damaged.