Preparing for Termination of Nodes

In this lesson, we will update our ConfigMap with a new experiment, compare the new definition with the previous one, and explore a CronJob that we will later use for the experiment.

What can we do?#

Now, we know how to affect not only individual applications, but also random ones running in a Namespace, or even in the whole cluster. Next, we’ll explore how to randomize our experiments on the node level as well.

In the past, we terminated or disrupted nodes where a specific application was running. Next, we will try to figure out how to destroy a completely random node, without using any particular criteria. We’ll just do random stuff and see how it affects our cluster. If we’re lucky, such actions will not have any adverse effects. Or maybe they will. We’ll soon find out.

We couldn’t do this before because the steady-state hypothesis of our experiments would not have been enough, but now we have other ways to observe the system. If we destroy something (almost) completely random, any part of the system can be affected. We cannot use a Chaos Toolkit hypothesis to predict what the initial state should be, nor what the state after some destructive cluster-wide actions should be. We could try, but it would be too complicated, and we would be solving the problem with the wrong tool.

Now, we know that we can use Prometheus to store metrics and that we can monitor our system through dashboards like Grafana and Kiali. We could, and should, go further. For example, we should create alerts that will notify us when any part of the system is misbehaving.

Now, we are ready to go full throttle and run our experiments on the cluster level.

Inspecting the ConfigMap defined in experiments-node.yaml#

Let’s take a look at yet another YAML definition.

That definition should not contain anything truly new. Nevertheless, there are a few details worth explaining. To make it simpler, we’ll take a look at the diff between that and the previous definition. That will help us spot the differences between the two.
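If you’re following along with files on your own machine, a diff command similar to the one below will produce that output. The path of the previous definition is a placeholder, so point both arguments at wherever your old and new ConfigMap files live.

# The first path is a placeholder for the previous ConfigMap definition
diff <previous-definition>.yaml experiments-node.yaml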

The output is as follows.

40c40,68
< 
---
>   node.yaml: |
>     version: 1.0.0
>     title: What happens if we drain a node
>     description: All the instances are distributed among healthy nodes and the applications are healthy
>     tags:
>     - k8s
>     - node
>     method:
>     - type: action
>       name: drain-node
>       provider:
>         type: python
>         func: drain_nodes
>         module: chaosk8s.node.actions
>         arguments:
>           label_selector: beta.kubernetes.io/os=linux
>           count: 1
>           delete_pods_with_local_storage: true
>       pauses: 
>         after: 180
>     rollbacks:
>     - type: action
>       name: uncordon-node
>       provider:
>         type: python
>         func: uncordon_node
>         module: chaosk8s.node.actions
>         arguments:
>           label_selector: beta.kubernetes.io/os=linux

We can see that we have a completely new experiment called node.yaml, with the title, the description, and all the other things we normally have in experiments. It doesn’t have a steady-state hypothesis because we really don’t know what the state of the whole cluster should be. So we are skipping the steady state, but we are keeping the method, which will drain one of the nodes that match a certain label selector.

The only reason why we’re setting the label_selector to beta.kubernetes.io/os=linux is to avoid draining Windows nodes, if there are any. We don’t have them in the cluster (if you used the Gist I provided). Nevertheless, since you are not forced to use those Gists to create a cluster, I couldn’t be sure that your cluster is not mixed with Windows servers. Our focus is only on Linux.

To be on the safe side, describe one of your nodes, and confirm that the label indeed exists. If that’s not the case, please replace its value with whichever label is used in your case.
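A quick way to do that is with the commands below. The node name in the second command is a placeholder, so replace it with one of the names returned by the first.

# List all the nodes together with their labels
kubectl get nodes --show-labels

# Inspect a single node in detail
kubectl describe node <node-name>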

Further down, we can see that we also set delete_pods_with_local_storage to true. With it, we ensure that Pods with local storage will be deleted before the node is drained. Otherwise, the experiment would not be able to perform the action, since Kubernetes does not allow draining of nodes with local storage by default, given that such storage is tied to that specific node.

All in all, we’ll drain a random node (as long as it’s based on Linux). And then, we’re going to pause for 180 seconds. Finally, we’ll roll back by un-cordoning the nodes, and, that way, we’ll restore them to their original state.
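For comparison, what the experiment’s method and rollback do is roughly equivalent to the manual commands below. The node name is a placeholder, and the exact flags depend on your kubectl version (older releases use --delete-local-data instead of --delete-emptydir-data).

# Drain a node, evicting even the Pods that use local (emptyDir) storage
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# The experiment pauses for 180 seconds at this point

# Make the node schedulable again
kubectl uncordon <node-name>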

There’s one potentially important thing you should know before we proceed. In hindsight, I should have said it at the beginning of this section. Nevertheless, better late than never.

The experiment we are about to run will not work with Docker Desktop or Minikube. If that’s the Kubernetes distribution you’re running, you will not be able to run the experiment yourself, so follow along by observing the outputs presented here. Docker Desktop and Minikube have only one node, so draining it would mean draining the whole cluster, including the control plane.
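If you’re not sure how many nodes your cluster has, listing them will tell you. A single-node cluster is not a good candidate for this experiment.

kubectl get nodes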

Applying the new ConfigMap#

Let’s apply this definition and update our existing ConfigMap.
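Assuming the updated definition is stored in a file like experiments-node.yaml (the exact path is up to your setup), the update boils down to a kubectl apply.

# Adjust the path, and add --namespace if the ConfigMap does not live in your current Namespace
kubectl apply --filename experiments-node.yaml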

Inspecting the CronJob defined in periodic-node.yaml#

Next, we’re going to take a look at yet another CronJob.
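The definition is in periodic-node.yaml; where exactly that file lives depends on your setup, so adjust the path accordingly.

cat periodic-node.yaml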

The output, limited to the relevant parts, is as follows.

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nodes-chaos
spec:
  concurrencyPolicy: Forbid
  schedule: "*/5 * * * *"
  jobTemplate:
    ...
    spec:
      activeDeadlineSeconds: 600
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: health-instances-chaos
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: chaostoolkit
          restartPolicy: Never
          containers:
          - name: chaostoolkit
            image: vfarcic/chaostoolkit:1.4.1
            args:
            - --verbose
            - run
            - --journal-path
            - /results/node.json
            - /experiment/node.yaml
            env:
            - name: CHAOSTOOLKIT_IN_POD
              value: "true"
            volumeMounts:
            - name: experiments
              mountPath: /experiment
              readOnly: true
            - name: results
              mountPath: /results
              readOnly: false
            resources:
              limits:
                cpu: 20m
                memory: 64Mi
              requests:
                cpu: 20m
                memory: 64Mi
          volumes:
          - name: experiments
            configMap:
              name: chaostoolkit-experiments
          - name: results
            persistentVolumeClaim:
              claimName: chaos

---

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: chaos
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

We can see that it is, more or less, the same CronJob as the one we used before. There are only a few minor differences.

The schedule is now set to every five minutes, up from the two minutes we had before, because it takes a while to drain and, later on, uncordon a node, so each run of the experiment will take longer. The other difference is that, this time, we are running the experiment defined in node.yaml, which resides in the newly updated ConfigMap.

Apart from having a different schedule and running an experiment defined in a different file, that CronJob is exactly the same as the one we used before.


In the next lesson, we will terminate random nodes using the CronJob and observe the outcome.
