What is a steady-state?#

What we have done so far does not qualify as an experiment. We simply executed an action that destroyed a Pod. The best we can get from that is a satisfying feeling like “oh, look at me, I destroyed stuff.” However, the goal of chaos engineering is not destruction for its own sake. The objective is to find weak points in our clusters, applications, data centers, and other parts of our systems. Therefore, we typically start by defining a steady-state that is validated before and after actions.

We define what something should look like. If the state of that something matches our definition, we start destroying stuff by introducing some chaos into our cluster. After that, we take another look at the state and check whether it is still the same.

So, if the state is the same both before and after actions, we can conclude that our cluster is fault-tolerant and resilient, at least with respect to those actions, and that everything is just peachy. In the case of Chaos Toolkit, we accomplish this by defining a steady-state hypothesis.
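The before-and-after flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Chaos Toolkit is implemented. The names `run_experiment` and `system`, as well as the way the state is modeled as a simple counter, are assumptions made for this sketch.

```python
from typing import Callable


def run_experiment(probe: Callable[[], int], expected: int,
                   action: Callable[[], None]) -> str:
    """Validate the state, run the action, then validate the state again."""
    # Before: if the system is not in the expected state, stop right away.
    if probe() != expected:
        return "failed"
    # Introduce some chaos.
    action()
    # After: the state must still match what we defined.
    if probe() != expected:
        return "failed"
    return "completed"


# Hypothetical system: a dictionary standing in for a cluster.
system = {"pods": 1}

# An action that does nothing leaves the state untouched.
print(run_experiment(lambda: system["pods"], 1, lambda: None))
```

If the probe reports something other than the expected value at the start, the action is never executed, which is the behavior we want from a real experiment as well.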

Inspecting the definition of terminate-pod-ssh.yaml#

We’re going to take a look at yet another definition that specifies the state that will be validated before and after some actions. Let’s take a look.

Since it is difficult to see the differences from the file itself, we’re going to output the diff and see what was added compared to what we had before.

The output is as follows.

> steady-state-hypothesis:
>   title: Pod exists
>   probes:
>   - name: pod-exists
>     type: probe
>     tolerance: 1
>     provider:
>       type: python
>       func: count_pods
>       module: chaosk8s.pod.probes
>       arguments:
>         label_selector: app=go-demo-8
>         ns: go-demo-8

The new section, as you can see, is steady-state-hypothesis.

title#

It has a title, which is just informative, and it has some probes. In this case, there’s only one, but there can be more.

name#

The name of the probe is pod-exists. Just like the title, it is informative and serves no additional purpose.

type#

The type is probe, meaning that it will probe the system. It will validate whether the result is acceptable or not, and, as you will see soon, it will do that before and after actions.

tolerance#

The tolerance is set to 1. We’ll come back to it later.

provider#

Further on, we can see that the provider type is set to python. Get used to it. Almost all providers are based on Python.

func#

The function is count_pods, so, as the name implies, we’re going to count how many Pods match the given criteria.

module#

The module is chaosk8s.pod.probes. It comes from the same plugin we installed before.

arguments#

Finally, we have two arguments. The first one will select only Pods that have the matching label app=go-demo-8. The second is ns, short for Namespace. When combined, it means that the probe will count only the Pods with the matching label app=go-demo-8 and in the Namespace go-demo-8.

Now, if we go back to the tolerance field, we can see that it is set to 1. So, our experiment will expect to find precisely one Pod with the matching label and in the specified Namespace. Not more, not less.
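To make the combination of the two arguments and the tolerance more concrete, here is a minimal sketch. The real probe queries the Kubernetes API, while this stand-in filters a hypothetical in-memory list. The names `PODS`, `count_matching_pods`, and `within_tolerance` are made up for the illustration, and the selector parsing handles only a single `key=value` pair.

```python
# Hypothetical in-memory stand-in for the Pods in a cluster.
PODS = [
    {"name": "go-demo-8", "ns": "go-demo-8", "labels": {"app": "go-demo-8"}},
    {"name": "other", "ns": "default", "labels": {"app": "other"}},
]


def count_matching_pods(pods, label_selector: str, ns: str) -> int:
    """Count Pods in `ns` whose labels satisfy a single key=value selector."""
    key, value = label_selector.split("=", 1)
    return sum(
        1 for pod in pods
        if pod["ns"] == ns and pod["labels"].get(key) == value
    )


def within_tolerance(observed: int, tolerance: int) -> bool:
    """An integer tolerance passes only on an exact match."""
    return observed == tolerance


count = count_matching_pods(PODS, label_selector="app=go-demo-8", ns="go-demo-8")
print(count, within_tolerance(count, 1))
```

With zero or two matching Pods, `within_tolerance` returns `False`, which corresponds to the probe being outside the given tolerance.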

Running chaos experiment and inspecting the output#

Let’s run this chaos experiment and see what we’re getting.

The output is as follows (timestamps are removed for brevity).

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... CRITICAL] Steady state probe 'pod-exists' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: failed

We can see that there is a critical issue. Steady state probe 'pod-exists' is not in the given tolerance. That probe failed before we even started executing actions like those that we had before. That’s normal because we destroyed the Pod in the previous section. Now, there are no Pods in that Namespace, at least not those with that matching label.

So, our experiment failed at the very beginning. It confirmed that the initial state of what we are measuring does not match what we want it to be. The experiment failed, and we can see that by outputting the exit code of the previous command.

This time the output is 1, meaning that the experiment was indeed unsuccessful. It failed at the very beginning before it even tried to execute actions. It attempted to validate the initial state, which expects to have a single Pod with the matching label. The experiment failed in that first validation. The Pod is not there. The previous experiment terminated it, and we did not recreate it.
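The same exit code can be inspected from a script instead of the terminal. The sketch below does not run the experiment itself. It uses a child Python process that exits with 1 as a stand-in, and the `failed_run` name is chosen only for this example.

```python
import subprocess
import sys

# Stand-in for the experiment command: a child process that exits with 1,
# the code reported above for the failed experiment.
failed_run = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])

# returncode holds the same value the shell exposes as the exit code
# of the previous command.
if failed_run.returncode != 0:
    print(f"experiment failed with exit code {failed_run.returncode}")
```

A non-zero exit code is what allows automation, such as a pipeline step, to treat a failed experiment as a failure without parsing the log output.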

Recreating pods#

Now, let’s apply the terminate-pods/pod.yaml definition so that we recreate the Pod we destroyed with the first experiment. After that, we’ll be able to see what happens if we re-run the experiment with the steady-state-hypothesis.

Re-running the chaos experiment and inspecting the output#

We recreated our Pod, and now we’re going to re-run the same experiment.

The output is as follows (timestamps are removed for brevity).

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-pod
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed

This time everything is green. We can see that the probe pod-exists confirmed that the state is correct, and we can see that the action terminate-pod was executed. Further on, we can observe that the steady-state was re-evaluated. It confirmed that the Pod still exists. The Pod existed before the action, and the Pod existed after the action. Therefore, everything is green.

Now, that is kind of strange, isn’t it? We confirmed that the Pod exists. Then we destroyed it. Nevertheless, the experiment shows that the Pod still exists or, to be more precise, that it existed after that action that removed it. How can the Pod exist if we destroyed it?

Before we discuss this, let’s confirm that the experiment was really successful by outputting $?.

We can see that the exit code is indeed 0. This is indeed strange. The probe should have failed after the action that destroyed the Pod. The reason why it didn’t fail will be explained soon. For now, let’s recreate that Pod again.


In the next lesson, we’re going to try to figure out why our experiment didn’t fail. It wasn’t supposed to be successful.
