Deleting Worker Nodes

In this lesson, we will carry out an experiment that will delete a node in the cluster. This experiment can help us understand how our cluster behaves if nodes are destroyed or damaged.

After resolving a few problems, we are now able to drain nodes. We discovered those issues through experiments. As a result, we should be able to upgrade our cluster without doing something terribly wrong and, hopefully, without negatively affecting our applications.

Draining nodes is, most of the time, a voluntary action. We tend to drain our nodes when we choose to upgrade our cluster. The previous experiment was beneficial because we now have the confidence to upgrade the cluster without (much) fear. However, there is still something worse that can happen to our nodes.

More often than not, nodes will fail without our consent. They will not drain. They will get destroyed or damaged, they will go down, and they will be powered off. Bad things will happen to nodes, whether we like it or not.

Let’s see whether we can create an experiment that will validate how our cluster behaves when such things happen.

Inspecting the definition of node-delete.yaml and comparing it with node-uncordon.yaml#

As always, we’re going to take a look at yet another experiment.
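If you want to follow along, a minimal way to display the definition, assuming it is stored as node-delete.yaml in your current directory, is the command below.

# Print the experiment definition (the path is an assumption; adjust it to your setup)
cat node-delete.yaml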

The output is as follows.

version: 1.0.0
title: What happens if we delete a node
description: All the instances are distributed among healthy nodes and the applications are healthy
tags:
- k8s
- deployment
- node
configuration:
  node_label:
      type: env
      key: NODE_LABEL
steady-state-hypothesis:
  title: Nodes are indestructible
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: delete-node
  provider:
    type: python
    func: delete_nodes
    module: chaosk8s.node.actions
    arguments:
      label_selector: ${node_label}
      count: 1
      pod_namespace: go-demo-8
  pauses:
    after: 10

We can see that we replaced the drain-node action with delete-node. There are a few other changes as well, so let’s look at a diff against the previous definition to get a better picture of what really changed.
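If you’d like to produce a similar diff yourself, and assuming the previous definition is stored as node-uncordon.yaml next to node-delete.yaml, a command along these lines should do.

# Compare the previous definition with the new one (file names are assumptions)
diff node-uncordon.yaml node-delete.yaml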

2c2
< title: What happens if we drain a node
---
> title: What happens if we delete a node
26c26
<   name: drain-node
---
>   name: delete-node
29c29
<     func: drain_nodes
---
>     func: delete_nodes
35d34
<       delete_pods_with_local_storage: true
37,46c36
<     after: 1
< rollbacks:
< - type: action
<   name: uncordon-node
<   provider:
<     type: python
<     func: uncordon_node
<     module: chaosk8s.node.actions
<     arguments:
<       label_selector: ${node_label}
---
>     after: 10

We can see that the title and the name changed, but that’s not important. The significant difference is the function: it is delete_nodes instead of drain_nodes. All the other parts of the action stayed the same, except that the delete_pods_with_local_storage argument, which should be self-explanatory, is gone. In other words, aside from that one argument, draining and deleting a node take exactly the same arguments; only the function differs. The pause after the action also increased from one second to ten, giving the cluster a bit more time to react before we re-validate the steady-state hypothesis.

Another thing that changed is that we removed the rollbacks section. The previous rollback that un-cordons nodes is not there anymore. The reason is simple: we are not draining, and we are not cordoning nodes. We’re deleting a node, and there is no rollback from that. Or, to be more precise, there could be a rollback, but we’re not going to implement it. It is a situation similar to destroying a Pod, except that, this time, we’re not terminating a Pod. We’re killing a node, and our cluster should be able to recuperate from that. We’re not changing the desired state, so the cluster should recover from the loss of a node without us rolling back anything. At least, that should be the goal. All in all, there is no rollback.

Running the chaos experiment and inspecting the output#

Let’s run this experiment and see what we’re getting.
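Assuming the definition is saved as node-delete.yaml, and remembering that the node_label configuration entry reads the NODE_LABEL environment variable (presumably already exported in the previous lessons), running it with the Chaos Toolkit CLI might look like the snippet below. The label value is an assumption; use whatever label matches your worker nodes.

# The node_label entry in the experiment reads this variable (the value is an assumption)
export NODE_LABEL="kubernetes.io/os=linux"

# Execute the experiment with the Chaos Toolkit CLI
chaos run node-delete.yaml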

The output is as follows.

[2020-03-17 22:59:26 INFO] Validating the experiment's syntax
[2020-03-17 22:59:26 INFO] Experiment looks valid
[2020-03-17 22:59:26 INFO] Running experiment: What happens if we delete a node
[2020-03-17 22:59:26 INFO] Steady state hypothesis: Nodes are indestructible
[2020-03-17 22:59:26 INFO] Probe: all-apps-are-healthy
[2020-03-17 22:59:27 INFO] Steady state hypothesis is met!
[2020-03-17 22:59:27 INFO] Action: delete-node
[2020-03-17 22:59:29 INFO] Pausing after activity for 10s...
[2020-03-17 22:59:39 INFO] Steady state hypothesis: Nodes are indestructible
[2020-03-17 22:59:39 INFO] Probe: all-apps-are-healthy
[2020-03-17 22:59:39 INFO] Steady state hypothesis is met!
[2020-03-17 22:59:39 INFO] Let's rollback...
[2020-03-17 22:59:39 INFO] No declared rollbacks, let's move on.
[2020-03-17 22:59:39 INFO] Experiment ended with status: completed

In your case, the output might show that the experiment failed. If that’s the case, you were unlucky, and the node that hosted the go-demo-8 DB was destroyed. We already discussed the problems with the database, and we’re not going to revisit them here. You’ll have to imagine that it worked and that your output matches mine.

We can see, at least from my output, that the initial steady-state hypothesis was met and that the action to delete a node was executed. After that, the experiment paused for ten seconds before re-validating the steady state. The post-action steady-state hypothesis was successful, so we can conclude that the applications running in the go-demo-8 Namespace are still functioning. They are healthy. Actually, that’s not entirely true. There are other problems, but we are not going to discover them with this experiment. We’ll get to them in one of the next sections. For now, at least according to the experiment, everything was successful.

All in all, we confirmed that all apps were healthy, and then we destroyed a node and confirmed that they’re still healthy.

Checking the nodes to confirm#

Let’s take a look at the nodes.
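We can list them with kubectl.

kubectl get nodes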

The output, in my case, is as follows.

NAME          STATUS ROLES  AGE   VERSION
gke-chaos-... Ready  <none> 7m25s v1.15.9-gke.22
gke-chaos-... Ready  <none> 3h34m v1.15.9-gke.22

Now we have two nodes, while before we had three. The cluster is not recuperating from this on its own because we did not set up autoscaling, among quite a few other things. But the good news is that our app is still running.

There are a couple of things that you should note. First of all, we did not really destroy a node. We could have, but we didn’t. Instead, we removed the node from the Kubernetes cluster. So, from the Kubernetes perspective, the node is gone. However, the virtual machine is most likely still running, and you might want to delete that machine yourself. If you do wish to annihilate it, you can go to the console or use the CLI of your favorite cloud provider and delete it. I will not provide instructions for that.

Assignment#

What matters is that this is one way to see what happens when a node is deleted or terminated. To be more precise, we saw what happens when a node is, for one reason or another, no longer registered in Kubernetes. From the Kubernetes perspective, that node is gone. Whether it is physically running or not is a separate question. We do not know whether what was running on that node is still there, hidden from Kubernetes. That’s the assignment for you.

Figure out what happens with the stuff running on that node. Think of it as homework.

Before we go on, let’s check the pods and see whether they’re all running.
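We can list them with kubectl as well.

kubectl --namespace go-demo-8 get pods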

The output, in my case, is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    2/2   Running 2        3m35s
go-demo-8-...    2/2   Running 1        34s
go-demo-8-...    2/2   Running 0        33s
go-demo-8-db-... 2/2   Running 0        3m34s

We can see that the Pods in the go-demo-8 Namespace are indeed running. Two of them are, in my case, relatively young (33 and 34 seconds old), meaning that they were probably running on the node that was removed from Kubernetes and were rescheduled onto the two healthy nodes. So our application is fine, if we ignore the possibility that additional replicas might still be running on the node that went rogue.
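If you want to confirm where the Pods landed, the wide output adds a NODE column showing which node hosts each Pod.

# The NODE column shows which node each Pod is scheduled on
kubectl --namespace go-demo-8 get pods --output wide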

I have another assignment for you. Create an experiment that will destroy a node instead of removing it from Kubernetes. Create an action that will shut it down.

I will not provide instructions on how to do that for two reasons:

  1. First of all, you should have assignments instead of just copying and pasting commands.

  2. The second reason lies in the complexity of doing something like that without knowing which hosting provider you’re using. I cannot provide an experiment that works for everyone since it would have to differ from one provider to another.

Your hosting provider might already be supported through one of the Chaos Toolkit modules; in that case, you’d need to install the plugin for your specific provider. Or, you might be using a provider that is not supported by any of the plugins. If that’s the case, you can accomplish the same thing by telling Chaos Toolkit to execute a command that destroys a node. Later on, I’m going to show you how to run arbitrary commands.

We’re done for now. We saw that nothing bad happens to our cluster if we remove a node (ignoring the DB). However, if you remove one more node, the cluster might not have enough capacity to host everything we need, since we would be left with only one server. And if you remove the third node as well, that would be a real problem.

The primary issue is that our cluster is not scalable. At least, not if you created it through one of the Gists.


We will briefly discuss this in the next lesson.
