The reasoning behind this experiment#

We’re going to try to drain everything from a random worker node.

Why do you think we might want to do something like this? One possible reason is upgrades. The draining process is the same one we are likely to use when upgrading our Kubernetes cluster.

Upgrading a Kubernetes cluster usually involves a few steps. Typically, we’d drain a node, we’d shut it down, and we’d replace it with an upgraded version of the node. Alternatively, we might upgrade a node without shutting it down, but that would be more appropriate for bare-metal servers that cannot be destroyed and created at will. Further on, we’d repeat the steps. We’d drain a node, shut it down, and create a new one based on an upgraded version. This would continue over and over again, one node after another, until the whole cluster is upgraded. The process is often called rolling updates (or rolling upgrades), and it is employed by most Kubernetes distributions.

We want to make sure nothing goes wrong during or after a cluster upgrade. To do that, we’re going to design an experiment that performs the most critical step of the process. It will drain a random node, and we will validate whether our applications are just as healthy as before.

If you’re not familiar with the expression, draining means removing everything from a node.
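
If you have never drained a node manually, the command below is a rough sketch of what that looks like with kubectl. The node name is a placeholder, and the exact flags depend on your kubectl version. There is no need to run it now; our experiment will perform the equivalent step through an action.

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Kubernetes marks the node as unschedulable and evicts the Pods running on it so that they can be rescheduled onto other nodes, if there are any.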

Inspecting the definition of node-drain.yaml#

Let’s take a look at yet another definition of an experiment.
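
Assuming the definition is stored as chaos/node-drain.yaml, we can display it with the command that follows. Adjust the path if yours differs.

cat chaos/node-drain.yaml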

The output is as follows.

version: 1.0.0
title: What happens if we drain a node
description: All the instances are distributed among healthy nodes and the applications are healthy
tags:
- k8s
- deployment
- node
configuration:
  node_label:
      type: env
      key: NODE_LABEL
steady-state-hypothesis:
  title: Nodes are indestructible
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: drain-node
  provider:
    type: python
    func: drain_nodes
    module: chaosk8s.node.actions
    arguments:
      label_selector: ${node_label}
      count: 1
      pod_namespace: go-demo-8
      delete_pods_with_local_storage: true
  pauses:
    after: 1

The experiment has the typical version, title, description, and tags. We’re not going to go through them.

Further on, in the configuration section, there is the node_label variable that takes its value from the environment variable NODE_LABEL. I will explain why we need that variable in a moment. For now, just remember that there is a variable node_label.

Then we have the steady-state hypothesis. It has a single probe all-apps-are-healthy, which, as the name suggests, expects all applications to be healthy. We already used the same hypothesis, so there should be no need to explore it in more detail. It validates whether everything running in the go-demo-8 Namespace is healthy.

The only action is in the method section, and it uses the function drain_nodes. The name should explain what it does. The function itself is available in chaosk8s.node.actions module, which is available through the Kubernetes plugin that we installed at the very beginning.

The action has a couple of arguments. The label_selector is set to the value of the variable ${node_label}. That is how we tell the system which nodes are eligible for draining. Even though, most of the time, all your nodes are the same, that might not always be the case. Through that argument, we can select which nodes are eligible and which are not.

Further on, we have the count argument set to 1, meaning that only one node will be drained.

There is also pod_namespace. This one might sound odd since we are draining nodes, not Pods. Even though it might not be self-evident, this argument is instrumental. It tells the action to select a random node among those that have at least one Pod running in the specified Namespace. So, it will choose a random node among those where Pods from the go-demo-8 Namespace are running. That way, we can check what happens to the applications in that Namespace when one of the servers they are running on is gone.

Finally, we will pause for one second. That should be enough for us to validate whether our applications are healthy soon after one of the nodes is drained.

Before we run that experiment, let me go back to the node_label variable.

Describing the labels of nodes of the cluster#

We need to figure out what the labels of the nodes of our cluster are, and we can do that by describing them.
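
A command like the one that follows should do. It describes every node in the cluster, so the output might be long if you have more than one.

kubectl describe nodes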

The output, limited to the relevant parts, is as follows.

...
Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/fluentd-ds-ready=true
        beta.kubernetes.io/instance-type=n1-standard-4
        beta.kubernetes.io/os=linux
        cloud.google.com/gke-nodepool=default-pool
        cloud.google.com/gke-os-distribution=cos
        failure-domain.beta.kubernetes.io/region=us-east1
        failure-domain.beta.kubernetes.io/zone=us-east1-b
        kubernetes.io/arch=amd64
        kubernetes.io/hostname=gke-chaos-default-pool-2686de6b-nvzp
        kubernetes.io/os=linux
...

If you used one of my Gists to create a cluster, there should be only one node. However, if you are using your own cluster, there might be others. As a result, you might see descriptions of more than one node.

In my case, I can use the beta.kubernetes.io/os=linux label as a node selector. If you have the same one, you can use it as well. Otherwise, you need to find a label that identifies your nodes. This is why I did not hard-code the selector. I did not want to risk the possibility that, in your cluster, that label might be different or might not even exist.

Exporting the NODE_LABEL variable#

All in all, please make sure that you specify the correct label in the command that follows.
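
As an illustration, the command that follows exports the label from my cluster. Replace the value with whichever label identifies the nodes in yours.

export NODE_LABEL="beta.kubernetes.io/os=linux"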

We exported the NODE_LABEL variable with the label that identifies the nodes eligible for draining through experiments.

We are almost ready to run the experiment. Before we do, stop for a moment and think about what the result will be. Remember, we are running a cluster with a single node, at least if you used one of my Gists. What will the outcome be when we drain a single node? Will it be drained? The answer might seem obvious, but it probably isn’t. To make it easier to answer the question, you might want to take another look at the Gist you used to create the cluster, unless you rolled it out by yourself. Now, after you have examined the Gist, think again. You might even need to look at the things running inside the cluster and examine them in more detail.
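
If you would like a hint before committing to an answer, commands like the two that follow show how many nodes there are and which node each Pod is running on.

kubectl get nodes

kubectl get pods --all-namespaces --output wide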

Running chaos experiment and inspecting the output#

Let’s say that you predicted the output of the experiment. Let’s see whether you’re right.
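
Assuming the definition is stored as chaos/node-drain.yaml, we can run the experiment with the command that follows.

chaos run chaos/node-drain.yaml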

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we drain a node
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: drain-node
[... ERROR]   => failed: chaoslib.exceptions.ActivityFailed: Failed to evict pod istio-ingressgateway-8577f4c6f8-xwcpb: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget ingressgateway needs 1 healthy pods and has 1 currently"}]},"code":429}
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... ERROR]   => failed: chaoslib.exceptions.ActivityFailed: the system is unhealthy
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'all-apps-are-healthy' is not in the given t... this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered

According to the output, the initial probe passed. All applications in the go-demo-8 Namespace are healthy. That’s a good thing.

Why couldn’t we drain the node?#

Further on, the action was executed. It tried to drain a node, and it failed miserably.

It might have failed for a reason you did not expect.

We tried to drain a node hoping to see what effect that would produce on the applications in the go-demo-8 Namespace. Instead, we got an error stating that the node cannot be drained at all. Evicting the Pod of the istio-ingressgateway would violate its disruption budget.

The Gateway is configured to have a disruption budget of 1. That means that there must be at least one Pod running at any given moment. We, on the other hand, made the colossal mistake of deploying Istio without going into the details. As a result, we have a single replica of the Gateway.
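
If you want to confirm that, you can list the PodDisruptionBudgets. I am assuming Istio is running in the istio-system Namespace; the budget named ingressgateway from the error message should show a minimum of one available Pod and zero allowed disruptions.

kubectl --namespace istio-system get poddisruptionbudgets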

All in all, the Gateway, in its current form, has one replica. However, it has a disruption budget that prevents the system from removing a replica without guaranteeing that at least one Pod is always running. This is a good thing. Istio’s design decision is correct because the Gateway should be running at any given moment. The bad thing is what we did. We installed Istio, or at least I told you how to install Istio, in a terrible way. We should have known better, and we should have scaled that Istio component to at least two replicas. We will do this soon. But, before that, we have a bigger problem on our plate.

We are running a single-node cluster. Or, to be more precise, if you used my instructions from one of the Gists, you’re running a single-node cluster. It will do us no good to scale Istio components to multiple replicas if they’re all going to run on that single node. That would result in precisely the same failure. The system would still be unable to drain the node because doing so would mean shutting down all the replicas of the Istio components, and they are configured with a disruption budget of 1.


In the next lesson, we will fix the issue we just created by rolling it back.
