Running Scheduled Experiments

In this lesson, we will see how to schedule our experiments periodically using CronJobs.

In some cases, one-shot experiments are useful. You might want to trigger an experiment based on specific events (e.g., deployment of a new release), or as a scheduled exercise with “all hands on deck.” However, there are situations when you might want to run chaos experiments periodically. You might, for example, decide to execute them once a day at a specific hour, or you might even choose to run them periodically at a random time of day.

Testing for objectiveness

We want to test the system and be objective. This might sound strange, but being objective with chaos engineering often means being, more or less, random. If we know when something potentially disruptive might happen, we might react differently than we would in an unexpected situation. We might be biased and schedule experiments for a time when we know there will be no adverse effect on the system. Instead, it might be a good idea to run experiments at random intervals during the day or the week so that we cannot easily predict what will happen. We often don’t control when “bad” things happen in production. Most of the time, we don’t know when a node will die, and we often cannot guess when a network will stop being responsive.

Similarly, we should try to add some randomness to the timing of the experiments. If we run them only when we deploy a new release, we might not discover adverse effects that show up hours later. We can partly mitigate that by running experiments periodically.

Inspecting the periodic.yaml file

Let’s see how we can create periodic execution of chaos experiments. To do that, we are going to take a look at yet another YAML file, periodic.yaml.

Its contents are as follows.

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: go-demo-8-chaos
spec:
  concurrencyPolicy: Forbid
  schedule: "*/5 * * * *"
  jobTemplate:
    metadata:
      labels:
        app: go-demo-8-chaos
    spec:
      activeDeadlineSeconds: 600
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: go-demo-8-chaos
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: chaostoolkit
          restartPolicy: Never
          containers:
          - name: chaostoolkit
            image: vfarcic/chaostoolkit:1.4.1-2
            args:
            - --verbose
            - run
            - --journal-path
            - /results/journal-health-http.json
            - /experiment/health-http.yaml
            env:
            - name: CHAOSTOOLKIT_IN_POD
              value: "true"
            volumeMounts:
            - name: experiments
              mountPath: /experiment
              readOnly: true
            - name: results
              mountPath: /results
              readOnly: false
            resources:
              limits:
                cpu: 20m
                memory: 64Mi
              requests:
                cpu: 20m
                memory: 64Mi
          volumes:
          - name: experiments
            configMap:
              name: chaostoolkit-experiments
          - name: results
            persistentVolumeClaim:
              claimName: go-demo-8-chaos

---

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: go-demo-8-chaos
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

That YAML is slightly bigger than the previous one. The major difference is that this time we are not defining a Job. Instead, we have a CronJob, which will create Jobs at scheduled intervals.

If you take a closer look, you’ll notice that the CronJob is almost the same as the Job we used before. There are a few significant differences, though.

First of all, we probably don’t want to run the same Jobs concurrently. Running one copy of an experiment at a time should be enough. So, we set concurrencyPolicy to Forbid.

The schedule, in this case, is set to */5 * * * *. That means that a new Job will be created every five minutes, unless the previous one has not finished yet. In that case, the new run is skipped, since running both at once would violate the concurrencyPolicy.
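To make the interplay between the schedule and concurrencyPolicy more tangible, here is a minimal simulation of the Forbid rule. It is not how Kubernetes implements it, only a sketch of the observable behavior: a tick is skipped if the previous Job is still running. The function name simulate_forbid and the Job durations are made up for illustration. The seven-minute run is plausible here because activeDeadlineSeconds allows a Job to run for up to 600 seconds.

```python
from datetime import datetime, timedelta


def simulate_forbid(start, interval, durations, ticks):
    """Return (tick_time, action) pairs for a schedule with concurrencyPolicy: Forbid.

    durations lists how long each *started* Job runs, consumed in order.
    Hypothetical helper; it only mimics the skip-if-still-running rule.
    """
    events = []
    busy_until = start  # nothing is running before the first tick
    remaining = list(durations)
    for i in range(ticks):
        tick = start + i * interval
        if tick < busy_until:
            # The previous Job has not finished, so this run is skipped.
            events.append((tick, "skipped"))
        else:
            duration = remaining.pop(0) if remaining else timedelta(0)
            busy_until = tick + duration
            events.append((tick, "started"))
    return events


# Assumed durations: the second Job takes seven minutes, longer than the interval.
events = simulate_forbid(
    start=datetime(2024, 1, 1, 12, 0),
    interval=timedelta(minutes=5),
    durations=[timedelta(minutes=1), timedelta(minutes=7), timedelta(minutes=1)],
    ticks=4,
)
for tick, action in events:
    print(tick.strftime("%H:%M"), action)
```

With those assumed durations, the 12:10 tick is the one that gets skipped, because the Job started at 12:05 is still running.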

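If you’re not familiar with syntax like */5 * * * *, it is the standard crontab syntax available in (almost) every Linux distribution. The five fields are the minute, the hour, the day of the month, the month, and the day of the week, and */5 in the first field means “every fifth minute.” You can find more info in the CRON expression section of the Cron entry in Wikipedia.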
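If the crontab syntax is new to you, the sketch below might help. It is a simplified matcher that supports only *, */n, single numbers, ranges, and comma-separated lists, and it requires all five fields to match, which differs from real cron when both the day of the month and the day of the week are restricted. The function names (field_matches, matches, next_runs) are made up for this example, and this is not the parser Kubernetes uses.

```python
from datetime import datetime, timedelta


def field_matches(field, value, minimum):
    """Check one cron field against a value ('*', '*/n', 'a', 'a-b', 'a-b/n', lists)."""
    for part in field.split(","):
        span, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if span == "*":
            if (value - minimum) % step == 0:
                return True
            continue
        low_text, _, high_text = span.partition("-")
        low = int(low_text)
        high = int(high_text) if high_text else low
        if low <= value <= high and (value - low) % step == 0:
            return True
    return False


def matches(expression, moment):
    """Return True if the moment satisfies all five fields of the expression."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        field_matches(minute, moment.minute, 0)
        and field_matches(hour, moment.hour, 0)
        and field_matches(day, moment.day, 1)
        and field_matches(month, moment.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )


def next_runs(expression, after, count, limit_minutes=366 * 24 * 60):
    """List the next `count` matching minutes after `after`, searching up to a year."""
    moment = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    runs = []
    for _ in range(limit_minutes):
        if len(runs) == count:
            break
        if matches(expression, moment):
            runs.append(moment)
        moment += timedelta(minutes=1)
    return runs


now = datetime(2024, 1, 1, 12, 2)  # a Monday, chosen arbitrarily
for expression in ["*/5 * * * *", "0 * * * *", "0 3 * * *", "0 3 * * 1"]:
    print(expression, "->", [str(run) for run in next_runs(expression, now, 2)])
```

The last three expressions correspond to once an hour, once a day (at 03:00), and once a week (Mondays at 03:00).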

In “real world” situations, running an experiment every five minutes might be too frequent. Something like once an hour, once a day, or even once a week is more appropriate. But, for the purpose of this demonstration, I want to make sure that you’re not waiting for too long to see the results of an experiment. So, we will run our experiments every five minutes.

The jobTemplate, as the name suggests, defines the template that will be used to create Jobs. What is inside it is almost the same as what we had in the Job we created earlier.

The significant difference between now and then is that this time we’re using a CronJob to create Jobs at scheduled intervals. The rest is, more or less, the same. One other difference comes from how we consume the results. When we run one-shot experiments, we’re in control. We can tail the logs because we are observing what’s happening, or we’re letting pipelines decide what to do next. However, in the case of scheduled periodic experiments, we probably want to store the results somewhere. We most likely want to write journal files that can be converted into reports later. For that, we are mounting an additional volume besides the ConfigMap. It is called results, and it references the PersistentVolumeClaim called go-demo-8-chaos.

If we go back to the args, we can see that we are setting the --journal-path argument to /results/journal-health-http.json. Since /results is the directory that will be mounted to an external drive, our journals will be persisted. Or, to be more precise, the last journal will be stored there. Given that the name of the journal is always the same, each new journal overwrites the previous one. We could randomize the name, but, for our purposes, the latest journal should be enough. Remember, we’ll run the experiments every five minutes, but, in a real-world situation, the frequency would be much lower, and you should have enough time to explore the journal of a failed experiment before the next execution starts. Anyway, what matters is that we will not only have a CronJob that creates Jobs periodically, but that we will also be storing the journal files of our experiments on an external drive. From there, you should be able to fetch those journals and convert them into PDFs, just as you did before.
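If you would rather keep every journal instead of only the latest one, one option is to include a timestamp in the file name. The sketch below only builds the argument list. It assumes a small wrapper script would pass those arguments to the tool instead of the static args from the definition, and that wrapper is not part of this lesson. The functions journal_path and chaos_args are hypothetical names, and the UTC timestamp format is an arbitrary choice.

```python
from datetime import datetime, timezone


def journal_path(experiment, moment=None, directory="/results"):
    """Build a journal file name that is unique per run (to the second)."""
    moment = (moment or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return f"{directory}/journal-{experiment}-{stamp}.json"


def chaos_args(experiment, moment=None):
    """Mirror the args from the CronJob, with a timestamped --journal-path."""
    return [
        "--verbose",
        "run",
        "--journal-path",
        journal_path(experiment, moment),
        f"/experiment/{experiment}.yaml",
    ]


# A fixed moment is used here so that the result is reproducible.
example = chaos_args("health-http", datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc))
print(example)
```

Keep in mind that journals would then accumulate on the 1Gi volume, so you would probably want to clean up old ones from time to time.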

Applying the definition

Let’s apply that definition to the go-demo-8 Namespace with kubectl apply and see what we’ll get.

Retrieving the CronJobs

Next, we’ll retrieve the CronJobs with kubectl get cronjobs and wait until the one we just created runs a Job.

The output, in my case, is as follows.

NAME            SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
go-demo-8-chaos */5 * * * * False   0      <none>        15s

If, in your case, the LAST SCHEDULE is also set to <none>, you’ll need a bit of patience. Jobs are created every five minutes. Keep re-running the previous command until the value of the LAST SCHEDULE column changes from <none> to something else. The output should be similar to the one that follows.

NAME            SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
go-demo-8-chaos */5 * * * * False   1      6s            84s

In my case, the CronJob created a Job six seconds ago.

Let’s retrieve the Jobs with kubectl get jobs and see what we’ll get.

The output, in my case, is as follows.

NAME                COMPLETIONS DURATION AGE
go-demo-8-chaos-... 0/1         19s      19s

We can see that, in my case, there is only one Job because there was not enough time for a second one to run.

What matters is that this Job was created by the CronJob, and that it has not finished executing yet. It is still running. If that’s the situation on your screen as well, you’ll need to repeat the previous command until the COMPLETIONS column is set to 1/1. Remember, it takes around a minute or two for the experiment to execute. After a while, the output of get jobs should be similar to the one that follows.

NAME                COMPLETIONS DURATION AGE
go-demo-8-chaos-... 1/1         59s      72s

The COMPLETIONS column now says 1/1. In my case, it took around a minute to execute it.
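If you get tired of re-running the command by hand, the check can be scripted. The sketch below reads the JSON that kubectl get jobs --output json returns and rebuilds the COMPLETIONS value from spec.completions and status.succeeded. The helper names and the sample Job name go-demo-8-chaos-example are made up for illustration, and fetch_jobs assumes that kubectl is on your PATH and pointed at the right cluster.

```python
import json
import subprocess


def job_completions(jobs):
    """Map each Job name to a 'succeeded/desired' string, like the COMPLETIONS column."""
    result = {}
    for item in jobs.get("items", []):
        name = item["metadata"]["name"]
        desired = item.get("spec", {}).get("completions", 1)
        # status.succeeded is absent until at least one Pod has succeeded.
        succeeded = item.get("status", {}).get("succeeded", 0)
        result[name] = f"{succeeded}/{desired}"
    return result


def fetch_jobs(namespace):
    """Ask kubectl for the Jobs in a Namespace and parse the JSON it prints."""
    output = subprocess.run(
        ["kubectl", "--namespace", namespace, "get", "jobs", "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(output)


# Hand-written sample instead of a live cluster; the Job name is hypothetical.
sample = {
    "items": [
        {
            "metadata": {"name": "go-demo-8-chaos-example"},
            "spec": {"completions": 1},
            "status": {"succeeded": 1},
        }
    ]
}
summary = job_completions(sample)
print(summary)
```

Against a real cluster, you could call job_completions(fetch_jobs("go-demo-8")) in a loop with a short sleep until every value reads 1/1.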

Checking the pods

Given that Jobs create Pods, we’ll retrieve all the Pods in the go-demo-8 Namespace with kubectl get pods and see what we’ll get.

NAME                READY STATUS    RESTARTS AGE
go-demo-8-...       2/2   Running   2        14m
go-demo-8-...       2/2   Running   0        30s
go-demo-8-chaos-... 0/1   Completed 0        82s
go-demo-8-db-...    2/2   Running   0        14m
repeater-...        2/2   Running   0        14m
repeater-...        2/2   Running   0        14m

We can see that we have the Pods of our application, which are all Running. Alongside them, there is the go-demo-8-chaos Pod with the STATUS set to Completed. Since Jobs start processes in containers one by one and do not restart them when finished, the READY column is set to 0/1. The processes in that Pod finished, and now none of the containers is running.

Let’s summarize.

We created a scheduled CronJob. Every five minutes, a new Job is created, which, in turn, creates a Pod with a single container that runs our experiments. For now, there is only one container, but we could just as well add more.


In the next lesson, we’ll explore what happens if an experiment fails.
