Preparing for Termination of Nodes

In this lesson, we will update our ConfigMap with a new experiment, compare the new definition with the previous one, and explore a CronJob that we will later use for the experiment.

What can we do?#

Now, we know how to affect not only individual applications, but also random ones running in a Namespace, or even in the whole cluster. Next, we’ll explore how to randomize our experiments on the node level as well.

In the past, we terminated or disrupted nodes where a specific application was running. Next, we will try to figure out how to destroy a completely random node, without using any particular criteria. We’ll just do random stuff and see how it affects our cluster. If we’re lucky, such actions will not have any adverse effects. Or maybe they will. We’ll soon find out.

We couldn’t do this before because the steady-state hypothesis of our experiments would not have been enough, but now we have other ways to observe the system. If we destroy something (almost) completely random, any part of the system can be affected. We cannot use a Chaos Toolkit hypothesis to predict what the initial state should be, nor what the state after some destructive cluster-wide actions should be. We could try, but it would be too complicated, and we would be solving the problem with the wrong tool.

Now, we know that we can use Prometheus to store metrics and that we can monitor our system through dashboards like Grafana and Kiali. We could, and should, go further. For example, we should create alerts that will notify us when any part of the system is misbehaving.

Now, we are ready to go full throttle and run our experiments on the cluster level.

Inspecting the ConfigMap defined in experiments-node.yaml#

Let’s take a look at yet another YAML definition.

That definition should not contain anything truly new. Nevertheless, there are a few details worth explaining. To make it simpler, we’ll take a look at the diff between that and the previous definition. That will help us spot the differences between the two.
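If you’re following along with files on your own machine, a diff command similar to the one below will produce that output. The path of the previous definition is a placeholder, so point both arguments at wherever your old and new ConfigMap files live.

# The first path is a placeholder for the previous ConfigMap definition
diff <previous-definition>.yaml experiments-node.yaml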

The output is as follows.

40c40,68
< 
---
>   node.yaml: |
>     version: 1.0.0
>     title: What happens if we drain a node
>     description: All the instances are distributed among healthy nodes and the applications are healthy
>     tags:
>     - k8s
>     - node
>     method:
>     - type: action
>       name: drain-node
>       provider:
>         type: python
>         func: drain_nodes
>         module: chaosk8s.node.actions
>         arguments:
>           label_selector: beta.kubernetes.io/os=linux
>           count: 1
>           delete_pods_with_local_storage: true
>       pauses: 
>         after: 180
>     rollbacks:
>     - type: action
>       name: uncordon-node
>       provider:
>         type: python
>         func: uncordon_node
>         module: chaosk8s.node.actions
>         arguments:
>           label_selector: beta.kubernetes.io/os=linux

We can see that we have a completely new experiment called node.yaml, with the title, the description, and all the other things we normally have in experiments. It doesn’t have a steady-state hypothesis because we really don’t know what the state of the whole cluster should be. So we are skipping the steady state, but we are keeping the method, which will drain one of the nodes that match a certain label selector.

The only reason why we’re setting the label_selector to beta.kubernetes.io/os=linux is to avoid draining Windows nodes, if there are any. We don’t have them in the cluster (if you used the Gist I provided). Nevertheless, since you are not forced to use those Gists to create a cluster, I couldn’t be sure that your cluster is not mixed with Windows servers. Our focus is only on Linux.

To be on the safe side, describe one of your nodes, and confirm that the label indeed exists. If that’s not the case, please replace its value with whichever label is used in your case.
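A quick way to do that is with the commands below. The node name in the second command is a placeholder, so replace it with one of the names returned by the first.

# List all the nodes together with their labels
kubectl get nodes --show-labels

# Inspect a single node in detail
kubectl describe node <node-name>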

Further down, we can see that we also set delete_pods_with_local_storage to true. With it, we ensure that Pods with local storage will be deleted before the node is drained. Otherwise, the experiment would not be able to perform the action, since Kubernetes does not allow draining of nodes with local storage by default, given that such storage is tied to that specific node.

All in all, we’ll drain a random node (as long as it’s based on Linux). And then, we’re going to pause for 180 seconds. Finally, we’ll roll back by un-cordoning the nodes, and, that way, we’ll restore them to their original state.
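For comparison, what the experiment’s method and rollback do is roughly equivalent to the manual commands below. The node name is a placeholder, and the exact flags depend on your kubectl version (older releases use --delete-local-data instead of --delete-emptydir-data).

# Drain a node, evicting even the Pods that use local (emptyDir) storage
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# The experiment pauses for 180 seconds at this point

# Make the node schedulable again
kubectl uncordon <node-name>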

There’s one potentially important thing you should know before we proceed. In hindsight, I should have said it at the beginning of this section. Nevertheless, better late than never.

The experiment we are about to run will not work with Docker Desktop or Minikube. If that’s the Kubernetes distribution you’re running, you will not be able to run the experiment yourself, so follow along by observing the outputs presented here. Docker Desktop and Minikube have only one node, so draining it would mean draining the whole cluster, including the control plane.
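If you’re not sure how many nodes your cluster has, listing them will tell you. A single-node cluster is not a good candidate for this experiment.

kubectl get nodes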

Applying the new ConfigMap#

Let’s apply this definition and update our existing ConfigMap.
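Assuming the updated definition is stored in a file like experiments-node.yaml (the exact path is up to your setup), the update boils down to a kubectl apply.

# Adjust the path, and add --namespace if the ConfigMap does not live in your current Namespace
kubectl apply --filename experiments-node.yaml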

Inspecting the CronJob defined in periodic-node.yaml#

Next, we’re going to take a look at yet another CronJob.
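The definition is in periodic-node.yaml; where exactly that file lives depends on your setup, so adjust the path accordingly.

cat periodic-node.yaml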

The output, limited to the relevant parts, is as follows.

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nodes-chaos
spec:
  concurrencyPolicy: Forbid
  schedule: "*/5 * * * *"
  jobTemplate:
    ...
    spec:
      activeDeadlineSeconds: 600
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: health-instances-chaos
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: chaostoolkit
          restartPolicy: Never
          containers:
          - name: chaostoolkit
            image: vfarcic/chaostoolkit:1.4.1
            args:
            - --verbose
            - run
            - --journal-path
            - /results/node.json
            - /experiment/node.yaml
            env:
            - name: CHAOSTOOLKIT_IN_POD
              value: "true"
            volumeMounts:
            - name: experiments
              mountPath: /experiment
              readOnly: true
            - name: results
              mountPath: /results
              readOnly: false
            resources:
              limits:
                cpu: 20m
                memory: 64Mi
              requests:
                cpu: 20m
                memory: 64Mi
          volumes:
          - name: experiments
            configMap:
              name: chaostoolkit-experiments
          - name: results
            persistentVolumeClaim:
              claimName: chaos

---

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: chaos
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

We can see that it is, more or less, the same CronJob as the one we used before. There are only a few minor differences.

The schedule is now set to every five minutes, up from the two minutes we had before, because it takes a while to drain and, later on, uncordon a node, so each run of the experiment will take longer. The other difference is that, this time, we are running the experiment defined in node.yaml, which resides in the newly updated ConfigMap.

Apart from having a different schedule and running an experiment defined in a different file, that CronJob is exactly the same as the one we used before.


In the next lesson, we will terminate random nodes using the CronJob and observe the outcome.
