Running Scheduled Experiments

In this lesson, we will see how to schedule our experiments periodically using CronJobs.

We'll cover the following

Testing for objectiveness
Inspecting the periodic.yaml file
Applying the definition
Retrieving the CronJobs
Checking the pods

In some cases, one-shot experiments are useful. You might want to trigger an experiment based on specific events (e.g., deployment of a new release), or as a scheduled exercise with “all hands on deck.” However, there are situations when you might want to run chaos experiments periodically. You might, for example, decide to execute them once a day at a specific hour, or you might even randomize that and run them at a random time of day.

Testing for objectiveness

We want to test the system and be objective. This might sound strange, but being objective with chaos engineering often means being, more or less, random. If we know when something potentially disruptive might happen, we might react differently than we would in an unexpected situation. We might be biased and schedule experiments for a time when we know there will be no adverse effect on the system. Instead, it might be a good idea to run experiments at random intervals during the day or the week so that we cannot easily predict what will happen. We often don’t control when “bad” things happen in production. Most of the time, we don’t know when a node will die, and we often cannot guess when a network will stop being responsive.

Similarly, we should try to add some randomness to the timing of the experiments. If we run them only when we deploy a new release, we might not discover adverse effects that appear hours later. We can partly mitigate that by running experiments periodically.
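One way to avoid a predictable schedule is to pick the time of day at random and express it as a five-field cron expression (minute, hour, day of month, month, day of week). The following is only a sketch of that idea. The function name random_daily_schedule is hypothetical, and how the resulting expression ends up in a schedule is left open.

```shell
# Hypothetical helper: print a cron expression that fires once a day
# at a randomly chosen minute and hour.
random_daily_schedule() {
  # Read two random bytes as an unsigned integer and reduce it to the valid range.
  minute=$(( $(od -An -N2 -tu2 /dev/urandom) % 60 ))
  hour=$(( $(od -An -N2 -tu2 /dev/urandom) % 24 ))
  printf '%d %d * * *\n' "$minute" "$hour"
}

# Print one randomly chosen daily schedule.
random_daily_schedule
```

Each call produces a different expression, so the schedule could be regenerated from time to time to keep the timing unpredictable.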

Inspecting the periodic.yaml file

Let’s see how we can create periodic execution of chaos experiments. To do that, we are going to take a look at yet another YAML file.

The content of the file is as follows.

---

# On Kubernetes 1.21 or newer, use batch/v1 here; batch/v1beta1 CronJob was removed in 1.25.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: go-demo-8-chaos
spec:
  concurrencyPolicy: Forbid
  schedule: "*/5 * * * *"
  jobTemplate:
    metadata:
      labels:
        app: go-demo-8-chaos
    spec:
      activeDeadlineSeconds: 600
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: go-demo-8-chaos
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: chaostoolkit
          restartPolicy: Never
          containers:
          - name: chaostoolkit
            image: vfarcic/chaostoolkit:1.4.1-2
            args:
            - --verbose
            - run
            - --journal-path
            - /results/journal-health-http.json
            - /experiment/health-http.yaml
            env:
            - name: CHAOSTOOLKIT_IN_POD
              value: "true"
            volumeMounts:
            - name: experiments
              mountPath: /experiment
              readOnly: true
            - name: results
              mountPath: /results
              readOnly: false
            resources:
              limits:
                cpu: 20m
                memory: 64Mi
              requests:
                cpu: 20m
                memory: 64Mi
          volumes:
          - name: experiments
            configMap:
              name: chaostoolkit-experiments
          - name: results
            persistentVolumeClaim:
              claimName: go-demo-8-chaos

---

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: go-demo-8-chaos
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

That YAML is slightly bigger than the previous one. The major difference is that this time we are not defining a Job. Instead, we have a CronJob, which will create Jobs at scheduled intervals.

If you take a closer look, you’ll notice that the CronJob is almost the same as the Job we used before. There are a few significant differences, though.

First of all, we probably don’t want to run the same Jobs concurrently. Running one copy of an experiment at a time should be enough. So, we set concurrencyPolicy to Forbid.

The schedule, in this case, is set to */5 * * * *. That means that a Job will run every five minutes, unless the previous one has not finished yet. In that case, the new run is skipped because starting it would violate the concurrencyPolicy.

If you’re not familiar with syntax like */5 * * * *, it is the standard crontab syntax available in (almost) every Linux distribution. The five fields are minute, hour, day of month, month, and day of week. You can find more info in the CRON expression section of the Cron entry in Wikipedia.
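To make the step syntax concrete, the sketch below lists the minutes of an hour that a minute field such as */5 selects. It is only an illustration of how the step is expanded. The function name step_minutes is hypothetical and not part of cron or Kubernetes.

```shell
# Hypothetical helper: list the minutes (0-59) selected by a cron
# step value in the minute field, for example 5 for */5.
step_minutes() {
  step=$1
  minute=0
  out=""
  while [ "$minute" -le 59 ]; do
    out="$out $minute"
    minute=$((minute + step))
  done
  # Drop the leading space before printing.
  echo "${out# }"
}

step_minutes 5   # prints: 0 5 10 15 20 25 30 35 40 45 50 55
```

With */5 in the minute field and * everywhere else, a Job is therefore due twelve times every hour, every day.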

In “real world” situations, running an experiment every five minutes might be too frequent. Something like once an hour, once a day, or even once a week is more appropriate. But, for the purpose of this demonstration, I want to make sure that you’re not waiting for too long to see the results of an experiment. So, we will run our experiments every five minutes.
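Once the CronJob exists in the cluster, a less frequent schedule could be set without editing the file. The following is a minimal sketch that assumes the go-demo-8 Namespace used in this lesson and the CronJob name from the definition. The helper name set_schedule is hypothetical.

```shell
NAMESPACE=go-demo-8        # assumption: the Namespace used in this lesson
CRONJOB=go-demo-8-chaos    # metadata.name from the definition

# Hypothetical helper: replace the schedule of the existing CronJob.
# $1 is a five-field cron expression.
set_schedule() {
  kubectl --namespace "$NAMESPACE" patch cronjob "$CRONJOB" \
    --patch "{\"spec\":{\"schedule\":\"$1\"}}"
}

# Examples (not executed here):
#   set_schedule "0 * * * *"   # at minute 0 of every hour
#   set_schedule "0 3 * * *"   # every day at 03:00
#   set_schedule "0 3 * * 1"   # every Monday at 03:00
```

Changing the schedule value in the YAML and applying it again would have the same effect and keeps the definition as the single source of truth.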

The jobTemplate, as the name suggests, defines the template that will be used to create Jobs. Whatever is inside is almost the same as what we had in the Job we created earlier.

The significant difference between now and then is that this time we’re using a CronJob to create Jobs at scheduled intervals. The rest is, more or less, the same. One more difference is that when we run one-shot experiments, we’re in control. We can tail the logs because we are observing what’s happening, or we’re letting pipelines decide what to do next.

However, in the case of scheduled periodic experiments, we probably want to store the results somewhere. We most likely want to write journal files that can be converted into reports later. For that, we are mounting an additional volume besides the ConfigMap. It is called results, and it references the PersistentVolumeClaim called go-demo-8-chaos.

If we go back to the args, we can see that we are setting the --journal-path argument to /results/journal-health-http.json. Since /results is the directory that will be mounted to an external drive, our journals will be persisted. Or, to be more precise, the last journal will be stored there. Given that the name of the journal is always the same, each new journal overwrites the previous one. We could randomize the name, but, for our purposes, the latest journal should be enough. Remember, we’ll run the experiments every five minutes, but, in a real-world situation, the frequency would be much lower, and you should have enough time to explore the journal of a failed experiment before the next execution starts.

Anyway, what matters is that we will not only have a CronJob that creates Jobs periodically, but we will also be storing the journal files of our experiments on an external drive. From there, you should be able to fetch those journals and convert them into PDFs, just as you did before.
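If keeping every journal mattered, one option would be to make the file name unique per run, for example by adding a UTC timestamp. The sketch below only shows how such a path could be generated. The function name journal_path is hypothetical, and wiring the result into the container's --journal-path argument is not covered here.

```shell
# Hypothetical helper: build a journal path that is unique per second.
# $1 is the experiment name, for example health-http.
journal_path() {
  printf '/results/journal-%s-%s.json\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Print a path such as /results/journal-health-http-<timestamp>.json
journal_path health-http
```

A timestamp also sorts chronologically, which makes it easy to find the latest journal in the directory.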

Applying the definition

Let’s apply that definition and see what we’ll get.
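Applying the definition could look like the sketch below. It assumes that the file is available locally as periodic.yaml (the actual path depends on where you keep it) and that the resources belong in the go-demo-8 Namespace. The wrapper name apply_periodic is hypothetical.

```shell
NAMESPACE=go-demo-8        # assumption: the Namespace used in this lesson
DEFINITION=periodic.yaml   # assumption: adjust to the real location of the file

# Hypothetical helper: create or update the CronJob and the
# PersistentVolumeClaim defined in the file.
apply_periodic() {
  kubectl --namespace "$NAMESPACE" apply --filename "$DEFINITION"
}
```

Calling apply_periodic sends both resources from the file to the cluster in one step.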

Retrieving the CronJobs

Next, we’ll retrieve the CronJobs and wait until the one we just created runs a Job.

The output, in my case, is as follows.

NAME            SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
go-demo-8-chaos */5 * * * * False   0      <none>        15s

If, in your case, LAST SCHEDULE is also set to <none>, you’ll need a bit of patience. Jobs are created every five minutes. Keep re-running the previous command until the value of the LAST SCHEDULE column changes from <none> to something else. The output should be similar to the one that follows.

NAME            SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
go-demo-8-chaos */5 * * * * False   1      6s            84s

In my case, the CronJob created a Job six seconds ago.
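Instead of re-running the command by hand, the waiting can be scripted. The sketch below polls the lastScheduleTime field of the CronJob status until it is set. It assumes the go-demo-8 Namespace, and the helper names are hypothetical. There is no timeout, so it keeps polling until a Job has been scheduled.

```shell
NAMESPACE=go-demo-8        # assumption: the Namespace used in this lesson
CRONJOB=go-demo-8-chaos    # metadata.name from the definition

# Print the time of the last scheduled run, or nothing if there was none yet.
last_schedule_time() {
  kubectl --namespace "$NAMESPACE" get cronjob "$CRONJOB" \
    --output jsonpath='{.status.lastScheduleTime}'
}

# Hypothetical helper: poll until the CronJob has scheduled its first Job.
# $1 is the number of seconds to sleep between checks (default 10).
wait_for_first_schedule() {
  interval=${1:-10}
  until [ -n "$(last_schedule_time)" ]; do
    sleep "$interval"
  done
  echo "Last scheduled at $(last_schedule_time)"
}
```

Running wait_for_first_schedule right after applying the definition should return within five minutes, given the schedule used in this lesson.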

Let’s retrieve the Jobs and see what we’ll get.

The output, in my case, is as follows.

NAME                COMPLETIONS DURATION AGE
go-demo-8-chaos-... 0/1         19s      19s

We can see that, in my case, there is only one Job because there was not enough time for a second one to run.

What matters is that this Job was created by the CronJob, and that it has not finished executing yet. If that’s the situation on your screen as well, you’ll need to repeat the previous command until the COMPLETIONS column is set to 1/1. Remember, it takes around a minute or two for the experiment to execute. After a while, the output of get jobs should be similar to the one that follows.

NAME                COMPLETIONS DURATION AGE
go-demo-8-chaos-... 1/1         59s      72s

The COMPLETIONS column now says 1/1. In my case, it took around a minute to execute it.
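Repeating the command until COMPLETIONS changes can also be replaced with kubectl wait. The sketch below selects the Jobs through the app: go-demo-8-chaos label set in the jobTemplate and assumes the go-demo-8 Namespace. The helper name wait_for_chaos_jobs and the default timeout are hypothetical choices.

```shell
NAMESPACE=go-demo-8   # assumption: the Namespace used in this lesson

# Hypothetical helper: block until the Jobs created by the CronJob complete.
# $1 is the timeout passed to kubectl wait (default 180s).
wait_for_chaos_jobs() {
  kubectl --namespace "$NAMESPACE" wait job \
    --selector=app=go-demo-8-chaos \
    --for=condition=complete \
    --timeout="${1:-180s}"
}
```

Keep in mind that a Job whose experiment fails never reaches the complete condition, so in that case the command returns an error once the timeout expires. It also reports an error if no matching Job exists yet.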

Checking the pods

Given that Jobs create Pods, we’ll retrieve all those in the go-demo-8 Namespace and see what we’ll get.

NAME                READY STATUS    RESTARTS AGE
go-demo-8-...       2/2   Running   2        14m
go-demo-8-...       2/2   Running   0        30s
go-demo-8-chaos-... 0/1   Completed 0        82s
go-demo-8-db-...    2/2   Running   0        14m
repeater-...        2/2   Running   0        14m
repeater-...        2/2   Running   0        14m

We can see that we have the Pods of our application, which are all Running. Alongside them, there is the go-demo-8-chaos Pod with the STATUS set to Completed. Since Jobs run the processes in their containers to completion and do not restart them when they finish, the READY column is set to 0/1. The processes in that Pod finished, and now none of its containers is running.
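To look only at the Pods that ran experiments, the list can be narrowed down with the app: go-demo-8-chaos label from the Pod template. The sketch below prints the name and the phase of each such Pod. It assumes the go-demo-8 Namespace, and the helper name chaos_pod_phases is hypothetical.

```shell
NAMESPACE=go-demo-8   # assumption: the Namespace used in this lesson

# Hypothetical helper: print the name and phase of each experiment Pod,
# one Pod per line, separated by a tab.
chaos_pod_phases() {
  kubectl --namespace "$NAMESPACE" get pods \
    --selector app=go-demo-8-chaos \
    --output jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}
```

A Pod that shows Completed in the STATUS column reports the phase Succeeded, while a Pod whose container exited with an error reports Failed.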

Let’s summarize.

We created a CronJob. Every five minutes, it creates a new Job, which, in turn, creates a Pod with a single container that runs our experiment. For now, there is only one container, but we could just as well add more.

In the next lesson, we’ll explore what happens if an experiment fails.
