Uncordoning Worker Nodes
In this lesson, we will carry out another chaos experiment, which will also include a rollback block so that we can uncordon the node after the experiment.
The issue we just created#
There are a couple of issues that we need to fix to get out of the bad situation we're in right now. Before we start solving those, though, we need to deal with an even bigger problem that we created a few moments ago. I will demonstrate the issue by retrieving the nodes.
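The command itself is elided above; assuming `kubectl` is installed and its current context points at the cluster from the previous lessons (an assumption about your setup), retrieving the nodes looks like this:

```shell
# Assumes kubectl is installed and its context points at the course cluster;
# the guard keeps this sketch runnable even where no cluster is reachable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes || echo "no cluster reachable from this environment"
else
  echo "kubectl is not available in this environment"
fi
```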
The output, in my case, is as follows (yours will be different).
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready,SchedulingDisabled <none> 13m v1.15.9-gke.22
You can see that the status of our single node is `Ready,SchedulingDisabled`. The experiment that just failed tried to drain a node, and draining is a two-step process. First, the system disables scheduling on the node (cordons it) so that no new Pods are deployed there. Then, it drains the node by evicting everything running on it. The experiment managed the first step (it disabled scheduling) but failed on the second. As a result, we have a cluster where we cannot schedule anything new. Given our previous experience, we should have done better from the start.
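To make the two-step nature of draining concrete, here is a toy shell sketch (no real cluster or kubectl involved) that models a node's status, a drain whose eviction step fails, and the uncordon that repairs it:

```shell
# Toy model: show why a drain that fails halfway leaves the node in
# Ready,SchedulingDisabled, and why uncordoning repairs it.
status="Ready"

cordon()    { status="Ready,SchedulingDisabled"; }  # step 1 of a drain
uncordon()  { status="Ready"; }                     # re-enables scheduling
evict_all() { return 1; }                           # step 2; pretend eviction fails

drain() {
  cordon                 # disabling scheduling always succeeds
  evict_all || return 1  # eviction fails, but the cordon is not undone
}

drain || echo "drain failed; node status: $status"  # Ready,SchedulingDisabled
uncordon
echo "after uncordon: $status"                      # Ready
```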
We should have created a `rollbacks` section that would uncordon our nodes after the experiment, whether it succeeds or fails. Therefore, our first mission will be to create the `rollbacks` section that makes sure our cluster is in the correct state after the experiment. In other words, we'll roll back whatever damage we inflict on the cluster through that experiment.
Inspecting the definition of node-uncordon.yaml and comparing it with node-drain.yaml#
Let’s take a look at yet another definition.
You should see that there is a new section, `rollbacks`. But, as in most other cases, we're going to `diff` this definition with the previous one to make sure that we see all the changes we are making.
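Assuming both definitions sit in the working directory (an assumption about your checkout of the course files), the comparison would look like this:

```shell
# Assumes node-drain.yaml and node-uncordon.yaml are in the working
# directory; guarded so the sketch exits cleanly where they are missing.
if [ -f node-drain.yaml ] && [ -f node-uncordon.yaml ]; then
  diff node-drain.yaml node-uncordon.yaml || true  # diff exits 1 when files differ
else
  echo "experiment definitions not present in this environment"
fi
```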
The output is as follows.
37a38,46
> rollbacks:
> - type: action
> name: uncordon-node
> provider:
> type: python
> func: uncordon_node
> module: chaosk8s.node.actions
> arguments:
> label_selector: ${node_label}
We can see that the only addition is the `rollbacks` section, with an action that relies on the `uncordon_node` function. Uncordoning will undo the damage we might cause by draining a node. If you think of draining as cordoning (disabling scheduling) plus evicting, this rollback action uncordons the node and thereby re-enables scheduling on it. Kubernetes takes care of the rest by scheduling new Pods on that node.
The `rollbacks` section has a single argument, `label_selector`, that matches the label we're using in the draining action. It will uncordon all the nodes with that label.
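Reassembled from the diff output above, with its indentation restored, the `rollbacks` section of node-uncordon.yaml reads:

```yaml
rollbacks:
- type: action
  name: uncordon-node
  provider:
    type: python
    func: uncordon_node
    module: chaosk8s.node.actions
    arguments:
      label_selector: ${node_label}
```

Like the other actions in the experiment, it is a Python provider: Chaos Toolkit calls the `uncordon_node` function from the `chaosk8s.node.actions` module, passing along the `label_selector` argument.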
Running the chaos experiment and inspecting the output#
Now we can re-run the experiment.
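Assuming the Chaos Toolkit CLI is installed (for example, via `pip install chaostoolkit chaostoolkit-kubernetes`) and that node-uncordon.yaml is in the working directory, the re-run looks like this:

```shell
# Assumes the chaos CLI is installed and node-uncordon.yaml is present;
# guarded so the sketch exits cleanly where either is missing.
if command -v chaos >/dev/null 2>&1 && [ -f node-uncordon.yaml ]; then
  chaos run node-uncordon.yaml || true  # the experiment itself is expected to fail
else
  echo "chaos CLI or experiment definition not available in this environment"
fi
```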
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we drain a node
[... INFO] Steady state hypothesis: Nodes are indestructible
[... INFO] Probe: all-apps-are-healthy
[... ERROR] => failed: chaoslib.exceptions.ActivityFailed: the system is unhealthy
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'all-apps-are-healthy' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] Rollback: uncordon-node
[... INFO] Action: uncordon-node
[... INFO] Experiment ended with status: failed
Just as before, the experiment still fails because we have not yet solved the problem of having too few nodes and too few replicas of our Istio components. The failure persists, and that's expected. What we gained is the rollback action that undoes the damage created by the experiment.
Checking the nodes to confirm the status#
We can confirm that easily by outputting the nodes of the cluster.
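As before, assuming `kubectl` still points at the cluster from the previous lessons, the check is:

```shell
# Assumes kubectl is installed and its context points at the course cluster;
# the guard keeps this sketch runnable even where no cluster is reachable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes || echo "no cluster reachable from this environment"
else
  echo "kubectl is not available in this environment"
fi
```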
The output, in my case, is as follows.
NAME STATUS ROLES AGE VERSION
gke-chaos-... Ready <none> 15m v1.15.9-gke.22
We can see that we still have the same single node but, this time, its status is `Ready`, while before it was `Ready,SchedulingDisabled`. We undid the damage created by the experiment's actions.
In this lesson, you saw how to uncordon a node after cordoning it and attempting to drain it. Now we can move on to the more exciting part and figure out how to fix the underlying problem. We cannot drain our nodes and, therefore, we cannot upgrade them, since upgrading in most cases means draining first and upgrading second.
In the next lesson, you will learn how to make the nodes drainable.