Running Scheduled Experiments
In this lesson, we will see how to schedule our experiments periodically using CronJobs.
In some cases, one-shot experiments are useful. You might want to trigger an experiment based on specific events (e.g., deployment of a new release), or as a scheduled exercise with “all hands on deck.” However, there are situations when you might want to run chaos experiments periodically. You might, for example, decide to execute them once a day at a specific hour, or you might even choose to randomize that and run them at a random time of day.
Testing for objectiveness
We want to test the system and be objective. This might sound strange, but being objective with chaos engineering often means being, more or less, random. If we know when something potentially disruptive might happen, we might react differently than we would in an unexpected situation. We might also be biased and schedule experiments at a time when we know that there will be no adverse effect on the system. Instead, it might be a good idea to run experiments at random intervals during the day or the week so that we cannot easily predict what will happen. We often don’t control when “bad” things happen in production. Most of the time, we don’t know when a node will die, and we often cannot guess when a network will stop being responsive.
Similarly, we should try to add some randomness to the experiments themselves. If we run them only when we deploy a new release, we might not discover adverse effects that appear hours later. We can partly mitigate that by running experiments periodically.
Inspecting the periodic.yaml file
Let’s see how we can create periodic execution of chaos experiments. To do that, we are going to take a look at yet another YAML file.
The output is as follows.
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: go-demo-8-chaos
spec:
  concurrencyPolicy: Forbid
  schedule: "*/5 * * * *"
  jobTemplate:
    metadata:
      labels:
        app: go-demo-8-chaos
    spec:
      activeDeadlineSeconds: 600
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: go-demo-8-chaos
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: chaostoolkit
          restartPolicy: Never
          containers:
          - name: chaostoolkit
            image: vfarcic/chaostoolkit:1.4.1-2
            args:
            - --verbose
            - run
            - --journal-path
            - /results/journal-health-http.json
            - /experiment/health-http.yaml
            env:
            - name: CHAOSTOOLKIT_IN_POD
              value: "true"
            volumeMounts:
            - name: experiments
              mountPath: /experiment
              readOnly: true
            - name: results
              mountPath: /results
              readOnly: false
            resources:
              limits:
                cpu: 20m
                memory: 64Mi
              requests:
                cpu: 20m
                memory: 64Mi
          volumes:
          - name: experiments
            configMap:
              name: chaostoolkit-experiments
          - name: results
            persistentVolumeClaim:
              claimName: go-demo-8-chaos

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: go-demo-8-chaos
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
That YAML is slightly bigger than the previous one. The major difference is that this time we are not defining a Job. Instead, we have a CronJob, which will create Jobs at scheduled intervals.
If you take a closer look, you’ll notice that the CronJob is almost the same as the Job we used before. There are a few significant differences, though.
First of all, we probably don’t want to run the same Jobs concurrently. Running one copy of an experiment at a time should be enough. So, we set concurrencyPolicy to Forbid.
The schedule, in this case, is set to */5 * * * *. That means that a Job will be created every five minutes, unless the previous one has not finished yet. In that case, the new run is skipped, since running both at once would violate the concurrencyPolicy.
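For comparison, the Kubernetes CronJob API accepts three values for concurrencyPolicy. The fragment below is only a sketch of the relevant part of the spec. The comments summarize what each value does, and only Forbid comes from the definition we are using.

```yaml
spec:
  # Allow (the default): a new Job starts even if the previous one is still running.
  # Forbid: the new run is skipped while the previous Job is still running.
  # Replace: the running Job is cancelled and replaced by the new one.
  concurrencyPolicy: Forbid
  schedule: "*/5 * * * *"
```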
If you’re not familiar with syntax like */5 * * * *, it is the standard crontab syntax available in (almost) every Linux distribution. You can find more info in the CRON expression section of the Cron entry in Wikipedia.
In “real world” situations, running an experiment every five minutes might be too frequent. Something like once an hour, once a day, or even once a week is more appropriate. But, for the purpose of this demonstration, I want to make sure that you’re not waiting for too long to see the results of an experiment. So, we will run our experiments every five minutes.
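To make those alternatives concrete, here is a hedged sketch of a few schedule values. The specific hours and days are arbitrary examples chosen for illustration, not recommendations from this lesson. Only one schedule key can be present, so the alternatives are commented out.

```yaml
spec:
  # Every five minutes (the value used in this lesson).
  schedule: "*/5 * * * *"
  # Once an hour, at minute 0 (illustrative).
  # schedule: "0 * * * *"
  # Once a day, at 03:00 (illustrative).
  # schedule: "0 3 * * *"
  # Once a week, on Mondays at 03:00 (illustrative).
  # schedule: "0 3 * * 1"
```

The five fields are, from left to right, minute, hour, day of the month, month, and day of the week.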
The jobTemplate, as the name suggests, defines the template that will be used to create Jobs. Its content is almost the same as what we had in the Job we created earlier.
The significant difference between now and then is that this time we’re using a CronJob to create Jobs at scheduled intervals. The rest is, more or less, the same. There is one more difference, though. When we run one-shot experiments, we’re in control. We can tail the logs because we are observing what’s happening, or we’re letting pipelines decide what to do next. However, in the case of scheduled periodic experiments, we probably want to store the results somewhere. We most likely want to write journal files that can be converted into reports later. For that, we are mounting an additional volume besides the ConfigMap. It is called results, and it references the PersistentVolumeClaim called go-demo-8-chaos.
If we go back to the args, we can see that we are setting the --journal-path argument to /results/journal-health-http.json. Since /results is the directory where the external drive is mounted, our journals will be persisted. Or, to be more precise, the last journal will be stored there. Given that the name of the journal is always the same, each new one will overwrite the previous one. We could randomize that, but, for our purposes, the latest journal should be enough. Remember, we’ll run the experiments every five minutes but, in a real-world situation, the frequency would be much lower, and you should have enough time to explore the journal of a failed experiment before the next execution starts. Anyway, what matters is that we will not only have a CronJob that creates Jobs periodically, but we will also be storing the journal files of our experiments on an external drive. From there, you should be able to fetch those journals and convert them into PDFs, just as you did before.
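If overwriting the journal is not acceptable, one possible way to get a distinct file per run is to include the Pod name in the path. This is a sketch rather than part of the original definition. It assumes that the Downward API is used to expose the Pod name as an environment variable (named POD_NAME here, a name chosen for illustration), and it relies on Kubernetes expanding $(VAR_NAME) references in args. Only the relevant part of the container definition is shown.

```yaml
containers:
- name: chaostoolkit
  image: vfarcic/chaostoolkit:1.4.1-2
  args:
  - --verbose
  - run
  - --journal-path
  # $(POD_NAME) is expanded by Kubernetes before the container starts.
  - /results/journal-health-http-$(POD_NAME).json
  - /experiment/health-http.yaml
  env:
  - name: CHAOSTOOLKIT_IN_POD
    value: "true"
  # Assumption: POD_NAME is a hypothetical variable added for this sketch.
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```

Since every Job creates a Pod with a different name, each run would write its own journal. The trade-off is that files would accumulate on the volume, so something would need to clean them up.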
Applying the definition
Let’s apply that definition and see what we’ll get.
Retrieving the CronJobs
Next, we’ll retrieve the CronJobs and wait until the one we just created runs a Job.
The output, in my case, is as follows.
NAME            SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
go-demo-8-chaos */5 * * * * False   0      <none>        15s
If, in your case, the LAST SCHEDULE is also set to <none>, you’ll need a bit of patience. Jobs are created every five minutes. Keep re-running the previous command until the value of the LAST SCHEDULE column changes from <none> to something else. The output should be similar to the one that follows.
NAME            SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
go-demo-8-chaos */5 * * * * False   1      6s            84s
In my case, the CronJob created a Job six seconds ago.
Let’s retrieve the Jobs and see what we’ll get.
The output, in my case, is as follows.
NAME                COMPLETIONS DURATION AGE
go-demo-8-chaos-... 0/1         19s      19s
We can see that, in my case, there is only one Job because there was not enough time for a second one to be created.
What matters is that this Job was created by the CronJob, and that it has not finished executing. It is still running. If that’s the situation on your screen as well, you’ll need to repeat the previous command until the COMPLETIONS column is set to 1/1. Remember, it takes around a minute or two for the experiment to execute. After a while, the output of get jobs should be similar to the one that follows.
NAME                COMPLETIONS DURATION AGE
go-demo-8-chaos-... 1/1         59s      72s
The COMPLETIONS column now says 1/1. In my case, it took around a minute to execute it.
Checking the pods
Given that Jobs create Pods, we’ll retrieve all those in the go-demo-8 Namespace and see what we’ll get.
NAME                READY STATUS    RESTARTS AGE
go-demo-8-...       2/2   Running   2        14m
go-demo-8-...       2/2   Running   0        30s
go-demo-8-chaos-... 0/1   Completed 0        82s
go-demo-8-db-...    2/2   Running   0        14m
repeater-...        2/2   Running   0        14m
repeater-...        2/2   Running   0        14m
We can see that we have the Pods of our application, which are all Running. Alongside them, there is the go-demo-8-chaos Pod with the STATUS set to Completed. Since Jobs run the processes in their containers to completion and do not restart them when finished, the READY column is set to 0/1. The processes in that Pod finished, and now none of the containers is running.
Let’s summarize.
We created a CronJob. Every five minutes, it creates a new Job which, in turn, creates a Pod with a single container that runs our experiment. For now, there is only one container, but we could just as well add more.
In the next lesson, we’ll explore what happens if an experiment fails.