Making Nodes Drainable
In this lesson, we will re-run the chaos experiment after making our nodes drainable by scaling our cluster and the Istio components.
Taking a look at Istio Deployments#
Let’s take a look at Istio Deployments.
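A minimal sketch of that command, assuming Istio was installed into the istio-system Namespace:

```bash
# List the Deployments created by the Istio installation
# (assumes the istio-system Namespace).
kubectl --namespace istio-system \
    get deployments
```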
The output is as follows.
NAME READY UP-TO-DATE AVAILABLE AGE
istio-ingressgateway 1/1 1 1 12m
istiod 1/1 1 1 13m
prometheus 1/1 1 1 12m
We can see that there are two components, not counting prometheus. If we focus on the READY column, we can see that each of them is running a single replica.
The two Istio components each have a HorizontalPodAutoscaler (HPA) associated with them. The HPAs control how many replicas we’ll have, based on metrics like CPU and memory usage. What we need to do is set the minimum number of instances to 2.
Since the experiment revealed that istio-ingressgateway should have at least two replicas, that’s the one we’ll focus on. Later on, the experiment might reveal other issues. If it does, we’ll deal with them then.
Scaling the cluster#
Before we dive into scaling Istio, we are going to explore scaling the cluster itself. Increasing the number of replicas of the Istio components would be pointless, as a way to solve the problem of not being able to drain a node, if there is only one node in the cluster. We need the Gateway not only scaled but also distributed across different nodes of the cluster. Only then can we hope to drain a node successfully while the Gateway is running on it. That way, the experiment might shut down one replica while the others keep running somewhere else. Fortunately for us, Kubernetes always does its best to distribute instances of our apps across different nodes. As long as it can, it will not run multiple replicas on a single node.
So, our first action is to scale our cluster. However, scaling a cluster is not the same everywhere. Therefore, the commands to scale the cluster will differ depending on where you’re running it.
Defining an environment variable#
To begin, we’ll define an environment variable that will contain the name of our cluster.
Please replace [...] with the name of your cluster in the command that follows. If you’re using one of my Gists, the name should be chaos.
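A minimal sketch, assuming we store the name in a variable called CLUSTER_NAME:

```bash
# Replace [...] with the name of your cluster (e.g., chaos).
export CLUSTER_NAME=[...]
```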
From now on, the instructions will differ depending on the Kubernetes distribution you’re running.
Please follow the instructions corresponding to your Kubernetes flavor. If you’re not running your cluster inside one of the providers I’m using, you’ll have to figure out the equivalent commands yourself.
For GKE clusters#
Please execute the command that follows only if you are using Google Kubernetes Engine (GKE). Bear in mind that, if you have a newly created account, the command might fail due to insufficient quotas. If that happens, follow the instructions in the output to request a quota increase.
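A sketch of what that might look like, assuming a ZONE variable holds the zone you used when creating the cluster and that we want three nodes in the default node pool:

```bash
# Resize the default node pool of the GKE cluster to three nodes.
gcloud container clusters resize $CLUSTER_NAME \
    --zone $ZONE \
    --num-nodes 3
```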
For EKS clusters#
Please execute the commands that follow only if you are using Amazon’s Elastic Kubernetes Service (EKS).
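A sketch, assuming the cluster was created with eksctl and that a NODE_GROUP variable holds the name of your node group:

```bash
# Scale the EKS node group to three nodes.
eksctl scale nodegroup \
    --cluster $CLUSTER_NAME \
    --name $NODE_GROUP \
    --nodes 3 \
    --nodes-max 3
```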
For AKS clusters#
Please execute the commands that follow only if you are using Azure Kubernetes Service (AKS).
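A sketch, assuming a RESOURCE_GROUP variable holds the resource group the cluster was created in:

```bash
# Scale the AKS cluster to three nodes.
az aks scale \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --node-count 3
```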
Depending on where you’re running your Kubernetes cluster, the process can take anything from seconds to minutes. We’ll be able to continue once the cluster is scaled.
Checking the nodes to confirm#
To be on the safe side, we’ll confirm that everything is OK by outputting the nodes.
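Listing the nodes is enough for that:

```bash
# List the worker nodes and their status.
kubectl get nodes
```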
The output, in my case, is as follows.
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready <none> 60s v1.15.9-gke.22
gke-chaos-... Ready <none> 60s v1.15.9-gke.22
gke-chaos-... Ready <none> 3h28m v1.15.9-gke.22
Repeat the previous command if there are fewer than three nodes with the status Ready.
In my case, there are three worker nodes. That does not include the control plane, which is out of reach when using GKE (as I do) and most other managed Kubernetes clusters.
Scaling the Istio components#
Now that we have a cluster with multiple nodes, we can figure out how to scale Istio components. Or, to be more precise, we’ll change the minimum number of replicas defined in the HPA associated with the Gateway.
Let’s start by taking a quick look at what we have.
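A sketch of the command, assuming the HPAs live in the istio-system Namespace:

```bash
# List the HorizontalPodAutoscalers created by the Istio installation.
kubectl --namespace istio-system \
    get hpa
```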
The output is as follows.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
istio-ingressgateway Deployment/istio-ingressgateway 6%/80% 1 5 1 70s
istiod Deployment/istiod 0%/80% 1 5 1 62m
We can see that the minimum number of Pods for those HPAs is 1. If we change that value to 2, we should always have two replicas of each while still allowing the components to scale up to 5 replicas if needed.
Typically, we should create a new Istio manifest with all the changes we need. However, in the interest of moving fast, we’ll apply a patch to the HPA. As I mentioned already, this course is not about Istio, and I’ll assume that you’ll check the documentation if you don’t already know how to update it properly.
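One way to apply such a patch, assuming the HPA is named istio-ingressgateway, is sketched below; it changes only the minReplicas field.

```bash
# Ensure the Gateway's HPA never runs fewer than two replicas.
kubectl --namespace istio-system \
    patch hpa istio-ingressgateway \
    --patch '{"spec": {"minReplicas": 2}}'
```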
Now that we’ve modified Istio in our cluster, we’re going to take another look at the HPAs.
The output is as follows.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
istio-ingressgateway Deployment/istio-ingressgateway 7%/80% 2 5 2 13m
istiod Deployment/istiod 0%/80% 1 5 1 75m
We can see that istio-ingressgateway now has both the minimum and the actual number of Pods set to 2. If, in your case, the number of replicas is still 1, the second Pod is not yet up and running. If that’s the case, wait for a few moments and repeat the command that retrieves all the HPAs.
Now let’s take a quick look at the Pods in istio-system. We want to confirm that they’re running on different nodes.
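Adding --output wide to the Pod listing shows, among other things, the node each Pod is scheduled on:

```bash
# List the Pods together with the nodes they are running on.
kubectl --namespace istio-system \
    get pods --output wide
```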
The output is as follows.
NAME READY STATUS RESTARTS AGE IP NODE ...
istio-ingressgateway-... 1/1 Running 0 6m59s 10.52.1.3 gke-chaos-...-kqgk ...
istio-ingressgateway-... 1/1 Running 0 16m 10.52.1.2 gke-chaos-...-kjkz ...
istiod-... 1/1 Running 0 77m 10.52.0.14 gke-chaos-...-kjkz ...
prometheus-... 2/2 Running 0 77m 10.52.0.16 gke-chaos-...-kjkz ...
If we take a look at the istio-ingressgateway Pods, we can see that they are running on different nodes. A quick glance tells us that the replicas are distributed across the cluster, each running on a different server.
Re-running the chaos experiment and inspecting the output#
Let’s re-run our experiment and see what we are getting.
We are going to execute exactly the same experiment as before. Previously, it failed for an unexpected reason: not because of problems with go-demo-8, but because of Istio itself. Let’s see what we get now.
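As a sketch, assuming the experiment definition from the previous lesson is stored in chaos/node-drain.yaml (adjust the path to wherever you keep yours), re-running it with the Chaos Toolkit CLI looks like this:

```bash
# Re-run the node-drain experiment.
chaos run chaos/node-drain.yaml
```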
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we drain a node
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: drain-node
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] Rollback: uncordon-node
[... INFO] Action: uncordon-node
[... INFO] Experiment ended with status: completed
Once the node was drained, the experiment waited for one second. Then it confirmed that the steady-state hypothesis was still met; it passed. So, this time, the nodes seem to be drainable.
In the end, the experiment rolled back so that there is no permanent negative effect. We can confirm that by listing all the nodes.
In my case, the output is as follows.
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready <none> 5m25s v1.15.9-gke.22
gke-chaos-... Ready <none> 5m25s v1.15.9-gke.22
gke-chaos-... Ready <none> 3h32m v1.15.9-gke.22
The status of all the nodes is Ready. That’s awesome. Not only have we managed to make the nodes of our cluster drainable, but we also managed to roll back and make them all “normal” again by the end of the process.
In the next lesson, we will carry out an experiment to check how our cluster behaves if nodes get destroyed or get damaged.