Running One-Shot Experiments
In this lesson, we will learn about Kubernetes Jobs and apply one to our cluster to run a one-shot experiment. For now, we’re going to focus on one-shot experiments that we’ll run manually, through pipelines, or by any other means we see fit.
Inspecting the once.yaml file#
Let’s take a look at yet another Kubernetes YAML file, once.yaml. Its definition is as follows.
---
apiVersion: batch/v1
kind: Job
metadata:
  name: go-demo-8-chaos
spec:
  activeDeadlineSeconds: 600
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: go-demo-8-chaos
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      serviceAccountName: chaostoolkit
      restartPolicy: Never
      containers:
      - name: chaostoolkit
        image: vfarcic/chaostoolkit:1.4.1-2
        args:
        - --verbose
        - run
        - /experiment/health-http.yaml
        env:
        - name: CHAOSTOOLKIT_IN_POD
          value: "true"
        volumeMounts:
        - name: config
          mountPath: /experiment
          readOnly: true
        resources:
          limits:
            cpu: 20m
            memory: 64Mi
          requests:
            cpu: 20m
            memory: 64Mi
      volumes:
      - name: config
        configMap:
          name: chaostoolkit-experiments
Kubernetes jobs#
What matters is that we are defining a Kubernetes Job.
A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As Pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (Job) is complete. Deleting a Job will clean up the Pods it created.
We can see from that description (taken from the Kubernetes docs) that a Job is an excellent candidate for running something once while being able to track the status and, especially, the exit codes of the processes inside it. If we limit ourselves to core Kubernetes components, a Job is probably the most appropriate type of Kubernetes resource to use when we want to run something like an experiment. The processes inside the containers of a Job start, run, and finish. Unlike most other Kubernetes resources, Jobs are not restarted or recreated when they complete successfully or when their Pods are deleted. That’s one of the main differences compared to, let’s say, Deployments and StatefulSets.
On top of that, we can configure Jobs not to restart on failure. That’s what we’re doing in this definition by setting spec.template.spec.restartPolicy to Never. An experiment can succeed or fail and, no matter the outcome, the Pod created by the Job will run only once.
Most of that specification is not very exciting. What matters is that we set the serviceAccountName to chaostoolkit. That should give the Job sufficient permissions to do whatever we need it to do.
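That ServiceAccount was created earlier. If you’re setting things up from scratch, a minimal sketch of the required commands might look like the ones below. The go-demo-8 Namespace and the binding to cluster-admin are assumptions made here for simplicity; in a real cluster you’d likely want a much narrower role.
# Create the ServiceAccount the Job will run as (go-demo-8 is an assumed Namespace)
kubectl --namespace go-demo-8 create serviceaccount chaostoolkit

# Bind it to cluster-admin so experiments can manipulate cluster resources
# (an assumption for simplicity; narrow this down in production)
kubectl create clusterrolebinding chaostoolkit \
    --clusterrole cluster-admin \
    --serviceaccount go-demo-8:chaostoolkit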
We defined only one container. We could add more if we wanted to run multiple experiments but, for our purposes, one should be more than enough to demonstrate how experiments defined as Jobs work.
I encourage you to create your own container image containing Chaos Toolkit, the required plugins, and anything else you might need. But, to simplify things, we’ll use the one I created (vfarcic/chaostoolkit). We’ll discuss that image soon.
We can see from the args that we want the output to be verbose and that it should run the experiment defined in /experiment/health-http.yaml. If you’re wondering where that file comes from, the short answer is “from the ConfigMap”. We’ll discuss it in a moment.
Then we have the env variable CHAOSTOOLKIT_IN_POD set to true. Chaos Toolkit might need to behave slightly differently when running inside a Pod, so we’re using that variable to ensure it knows where it is.
Further on, we have volumeMounts. We are mounting something called config as the directory /experiment. That “something” is defined in the volumes section, and it references the ConfigMap chaostoolkit-experiments we created earlier. That way, all the entries from that ConfigMap will be available as files inside the /experiment directory in the container.
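If you’re wondering how such a ConfigMap might be created in the first place, a sketch follows. The file path is an assumption; point it at wherever your experiment definitions live.
# Create the ConfigMap from a local experiment definition (the path is an assumption);
# the file name becomes the key and, therefore, the file name inside /experiment
kubectl --namespace go-demo-8 create configmap chaostoolkit-experiments \
    --from-file health-http.yaml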
Finally, we also defined the resources so that Kubernetes knows how much memory and CPU we’re requesting, and what the limits should be.
Inspecting the definition of the image vfarcic/chaostoolkit#
Before we move on, let’s take a quick look at the definition of the image vfarcic/chaostoolkit.
If you are a Windows user, the open command might not work. If that’s the case, please copy the address from the command that follows, and paste it into your favorite browser.
This repository contains only a Dockerfile. Open it, and you’ll see the definition of the container image. It is as follows.
FROM python:3.8.1-alpine
LABEL maintainer="Viktor Farcic <viktor@farcic.com>"
RUN apk add --no-cache --virtual build-deps gcc g++ git libffi-dev linux-headers python3-dev musl-dev && \
    pip install --no-cache-dir -q -U pip && \
    pip install --no-cache-dir chaostoolkit && \
    pip install --no-cache-dir chaostoolkit-kubernetes && \
    pip install --no-cache-dir chaostoolkit-istio && \
    pip install --no-cache-dir chaostoolkit-slack && \
    pip install --no-cache-dir slackclient==1.3.2 && \
    apk del build-deps
ENTRYPOINT ["/usr/local/bin/chaos"]
CMD ["--help"]
I will assume that you are already familiar with the Dockerfile format and that you know it is used to define the instructions that tools use to build container images. There are quite a few builders that work with such definitions, including Docker and Kaniko.
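If you do decide to build your own image, a typical Docker invocation might look like the one that follows. The tag is only a placeholder; replace it with your own registry, name, and version.
# Build the image from the Dockerfile in the current directory (the tag is a placeholder)
docker image build --tag <your-registry>/chaostoolkit:1.4.1-2 .

# Push it so that the cluster can pull it
docker image push <your-registry>/chaostoolkit:1.4.1-2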
As you can see, the definition of the image (Dockerfile) is relatively simple and straightforward. It is based on python since that’s a requirement for Chaos Toolkit, and it installs the CLI and the plugins we’ll need. The entry point is the path to the chaos binary. By default, it’ll output the help if we do not override the command (CMD).
Applying the definition#
Now that we have explored the Job that will run the experiment, we are ready to apply it and see what happens.
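Assuming the definition is stored in once.yaml and that we’re working inside a Namespace called go-demo-8 (both the file path and the Namespace are assumptions that may differ in your setup), the command would be similar to the one that follows.
kubectl --namespace go-demo-8 apply --filename once.yaml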
We can see that the Job go-demo-8-chaos was created.
Checking the pods#
Next, we’ll take a look at the Pods in that Namespace. To be more specific, we’re going to retrieve all the Pods and filter them with the selector app=go-demo-8-chaos, since that’s the label we put in the template of the Pod that the Job will create.
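The command might look like this (the go-demo-8 Namespace is, again, an assumption).
kubectl --namespace go-demo-8 get pods --selector app=go-demo-8-chaos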
The output is as follows.
NAME                  READY   STATUS    RESTARTS   AGE
go-demo-8-chaos-...   1/1     Running   0          14s
We can see that, in my case, the Pod created by the Job is Running. If that’s what you see on your screen, please repeat the command until the STATUS is Completed. That should take around a minute. When it’s finished, the output should be similar to the one that follows.
NAME                  READY   STATUS      RESTARTS   AGE
go-demo-8-chaos-...   0/1     Completed   0          75s
The execution of the Pod created through the Job finished. It is Completed, meaning that the experiment was successful.
Inspecting the logs#
Let’s output the logs and confirm that the experiment was indeed successful.
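A command along the following lines should do (adjust the Namespace if yours differs).
kubectl --namespace go-demo-8 logs --selector app=go-demo-8-chaos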
Since we wanted to output the logs of only the Pod that executed the experiment, we used the selector as a filter.
I won’t present the output here since it is too big to fit into the lesson. You should see on your screen the events that we have already explored countless times before. The only substantial difference is that we ran the experiment from a container instead of a laptop. As a result, timestamps and log levels were added to each output entry.
We managed to run an experiment from inside a Kubernetes cluster in much the same way as we would run it from a laptop.
Deleting the Job#
Before we proceed, we’ll delete the Job. We won’t need it anymore since we are about to switch to the scheduled execution of experiments.
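Assuming the same go-demo-8 Namespace as before, deleting it can be as simple as the command that follows.
kubectl --namespace go-demo-8 delete job go-demo-8-chaos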
That’s it. The Job is no more.
As you saw, running one-shot experiments inside a Kubernetes cluster is straightforward. I encourage you to hook them into your continuous delivery pipelines. Really, all you have to do is create a Job.
In the next lesson, we’re going to take a look at how we can schedule our experiments to be executed periodically instead of running one-shot executions.