Uncordoning Worker Nodes
In this lesson, we will carry out another chaos experiment, which will also include a rollback block so that we can uncordon the node after the experiment.
The issue we just created#
There are a couple of issues that we need to fix to get out of the bad situation we're in right now. Before we start solving those, though, we need to deal with an even bigger problem that we created a few moments ago. I will demonstrate the issue by retrieving the nodes.
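The command itself is elided above; assuming `kubectl` is installed and its current context points at the cluster from the previous lessons (an assumption about your setup), retrieving the nodes looks like this:

```shell
# Assumes kubectl is installed and its context points at the course cluster;
# the guard keeps this sketch runnable even where no cluster is reachable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes || echo "no cluster reachable from this environment"
else
  echo "kubectl is not available in this environment"
fi
```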
The output, in my case, is as follows (yours will be different).
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready,SchedulingDisabled <none> 13m v1.15.9-gke.22
You can see that the status of our single node is `Ready,SchedulingDisabled`. The experiment that just failed tried to drain a node, and draining is a two-step process. First, the system disables scheduling on the node (cordons it) so that no new Pods are deployed there. Then, it drains the node by evicting everything running on it. The experiment managed the first step (it disabled scheduling) but failed on the second. As a result, we have a cluster where we cannot schedule anything new. Given our previous experience, we should have done better from the start.
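To make the two-step nature of draining concrete, here is a toy shell sketch (no real cluster or kubectl involved) that models a node's status, a drain whose eviction step fails, and the uncordon that repairs it:

```shell
# Toy model: show why a drain that fails halfway leaves the node in
# Ready,SchedulingDisabled, and why uncordoning repairs it.
status="Ready"

cordon()    { status="Ready,SchedulingDisabled"; }  # step 1 of a drain
uncordon()  { status="Ready"; }                     # re-enables scheduling
evict_all() { return 1; }                           # step 2; pretend eviction fails

drain() {
  cordon                 # disabling scheduling always succeeds
  evict_all || return 1  # eviction fails, but the cordon is not undone
}

drain || echo "drain failed; node status: $status"  # Ready,SchedulingDisabled
uncordon
echo "after uncordon: $status"                      # Ready
```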
We should have created a `rollbacks` section that would uncordon our nodes after the experiment, whether it succeeds or fails. Therefore, our first mission will be to create the `rollbacks` section that makes sure our cluster is in the correct state after the experiment. In other words, we'll roll back whatever damage we inflict on the cluster through that experiment.
Inspecting the definition of node-uncordon.yaml and comparing it with node-drain.yaml#
Let’s take a look at yet another definition.
You should see that there is a new section, `rollbacks`. But, as in most other cases, we're going to `diff` this definition with the previous one to make sure that we see all the changes we are making.
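Assuming both definitions sit in the working directory (an assumption about your checkout of the course files), the comparison would look like this:

```shell
# Assumes node-drain.yaml and node-uncordon.yaml are in the working
# directory; guarded so the sketch exits cleanly where they are missing.
if [ -f node-drain.yaml ] && [ -f node-uncordon.yaml ]; then
  diff node-drain.yaml node-uncordon.yaml || true  # diff exits 1 when files differ
else
  echo "experiment definitions not present in this environment"
fi
```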
The output is as follows.
37a38,46
> rollbacks:
> - type: action
> name: uncordon-node
> provider:
> type: python
> func: uncordon_node
> module: chaosk8s.node.actions
> arguments:
> label_selector: ${node_label}
We can see that the only addition is the `rollbacks` section, with an action that relies on the `uncordon_node` function. Uncordoning will undo the damage we might cause by draining a node. If you think of draining as cordoning (disabling scheduling) plus evicting, this rollback action uncordons the node and thereby re-enables scheduling on it. Kubernetes takes care of the rest by scheduling new Pods on that node.
The `rollbacks` section has a single argument, `label_selector`, that matches the label we're using in the draining action. It will uncordon all the nodes with that label.
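Reassembled from the diff output above, with its indentation restored, the `rollbacks` section of node-uncordon.yaml reads:

```yaml
rollbacks:
- type: action
  name: uncordon-node
  provider:
    type: python
    func: uncordon_node
    module: chaosk8s.node.actions
    arguments:
      label_selector: ${node_label}
```

Like the other actions in the experiment, it is a Python provider: Chaos Toolkit calls the `uncordon_node` function from the `chaosk8s.node.actions` module, passing along the `label_selector` argument.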
Running the chaos experiment and inspecting the output#
Now we can re-run the experiment.
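Assuming the Chaos Toolkit CLI is installed (for example, via `pip install chaostoolkit chaostoolkit-kubernetes`) and that node-uncordon.yaml is in the working directory, the re-run looks like this:

```shell
# Assumes the chaos CLI is installed and node-uncordon.yaml is present;
# guarded so the sketch exits cleanly where either is missing.
if command -v chaos >/dev/null 2>&1 && [ -f node-uncordon.yaml ]; then
  chaos run node-uncordon.yaml || true  # the experiment itself is expected to fail
else
  echo "chaos CLI or experiment definition not available in this environment"
fi
```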
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we drain a node
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... ERROR] => failed: chaoslib.exceptions.ActivityFailed: the system is unhealthy
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'all-apps-are-healthy' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] Rollback: uncordon-node
[... INFO] Action: uncordon-node
[... INFO] Experiment ended with status: failed
Just as before, the experiment still fails because we have not yet solved the problem of having too few nodes and too few replicas of our Istio components. The failure persists, and that's expected. What we gained is the rollback action that undoes the damage created by the experiment.
Checking the nodes to confirm the status#
We can confirm that easily by outputting the nodes of the cluster.
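As before, assuming `kubectl` still points at the cluster from the previous lessons, the check is:

```shell
# Assumes kubectl is installed and its context points at the course cluster;
# the guard keeps this sketch runnable even where no cluster is reachable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes || echo "no cluster reachable from this environment"
else
  echo "kubectl is not available in this environment"
fi
```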
The output, in my case, is as follows.
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready <none> 15m v1.15.9-gke.22
We can see that we still have the same single node but, this time, its status is `Ready`, while before it was `Ready,SchedulingDisabled`. We undid the damage created by the experiment's actions.
In this lesson, you saw how to uncordon a node after cordoning it and attempting to drain it. Now we can move on to the more exciting part and figure out how to fix the underlying problem. We cannot drain our nodes and, therefore, we cannot upgrade them, since upgrading in most cases means draining first and upgrading second.
In the next lesson, you will learn how to make the nodes drainable.