Validating Application Health

In this lesson, we will run experiments that validate whether all the instances of all the applications in the go-demo-8 Namespace are healthy.

Inspecting the definition of health.yaml#

Let’s take a look at yet another chaos experiment definition.
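
The file is named health.yaml. Assuming it lives in a chaos directory inside the repository you are working from (adjust the path to match your setup), we can display it with a command such as this.

cat chaos/health.yaml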

The output is as follows.

version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, a new instance should be created
tags:
- k8s
- pod
- deployment
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8

What do we have there?

The title asks what happens if we terminate an instance of the application, and the description states that if an instance of the application is terminated, a new instance should be created. Both should be self-explanatory and serve no practical purpose beyond informing us about the objective of the experiment.

“So far, that definition looks almost the same as what we were doing in the previous section.” If that’s what you’re thinking, you’re right. Or, to be more precise, they are very similar.

The reason why we’re doing this again is that there is an easier and faster way to do what we did before.

Instead of verifying conditions and states of specific Pods of our application, we are going to validate whether all the instances of all the apps in that namespace are healthy. Instead of going from one application to another and checking states and conditions, we’re just going to tell the experiment, “look, for you to be successful, everything in that namespace needs to be healthy.” We’re going to do that by using the function all_microservices_healthy. The module is chaosk8s.probes, and the argument is simply a namespace.

So, our conditions before and after actions are that everything in that namespace needs to be healthy.

Further on, we have an action in the method that will terminate a Pod matching that label_selector in that specific Namespace (ns). The Pod will be picked randomly among those with that label selector. Such randomness doesn’t make much sense right now since we have only one Pod, but the selection is going to be random nevertheless.

Predict the outcome yourself#

Before we run that experiment, please take another look at the definition. I want you to try to figure out whether the experiment will succeed or fail. And, if you think it will fail, try to guess what is missing and why. After all, our application should now be fault-tolerant. It should be healthy, right? Think about the experiment and try to predict the outcome.

Running chaos experiment and inspecting the output#

Now that you’ve done some thinking and guessed the outcome, let’s run the experiment.
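
Assuming the Chaos Toolkit CLI is installed and the definition is at chaos/health.yaml (again, adjust the path if yours differs), the command would be along these lines.

chaos run chaos/health.yaml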

The output is as follows (timestamps are removed for brevity).

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... ERROR]   => failed: chaoslib.exceptions.ActivityFailed: the system is unhealthy
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'all-apps-are-healthy' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered

We can see that the initial probe passed and that the action was executed. And then, the probe was executed again, and it failed. Why did it fail? What are we missing? We know that Kubernetes will recreate terminated Pods that are controlled by a Deployment. We validated that behavior in the previous section. So, why is it failing now? It should be relatively straightforward to figure out what’s missing. But before we go there, let’s take a look at the Pods. Are they really running?
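
We can list the Pods with kubectl. The only assumption here is that your current context points to the cluster where go-demo-8 is running.

kubectl --namespace go-demo-8 get pods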

The output is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    1/1   Running 0        9s
go-demo-8-db-... 1/1   Running 0        11m

Why did it fail?#

We can see that the database (go-demo-8-db) and the API (go-demo-8) are both running. In my case, the API, the one without the -db suffix, has been running for only 9 seconds. So, the application really is fault-tolerant (failed Pods are recreated), and yet the experiment failed. Why did it fail?

Inspecting the definition of health-pause.yaml#

Let’s take a look at yet another experiment definition.
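
As before, the command below assumes the file sits in the chaos directory.

cat chaos/health-pause.yaml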

The output is as follows.

version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, a new instance should be created
tags:
- k8s
- pod
- deployment
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 10

We were missing a pause. That was done intentionally since I want you to understand the importance of not only executing steps one after another but also giving the system appropriate time to recuperate when needed. If you expect your system to have healthy Pods immediately after destruction, then don’t add the pause. In our case, though, we’re setting the expectation that within 10 seconds of the destruction of an instance, a new one should be fully operational. That’s the expectation we have right now; yours could be different. We could expect the system to recuperate immediately, after one second, or after three days. With pauses, we define how long the system is given to recuperate from a potentially destructive action. In this case, we are setting it to 10 seconds.

So, let’s confirm that this is really the only change by outputting the differences between the old and the new definition.
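
A plain diff between the two files does the job, assuming both are in the chaos directory.

diff chaos/health.yaml chaos/health-pause.yaml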

The output is as follows.

>   pauses: 
>     after: 10

We can see that, indeed, the only change is the additional pause of 10 seconds after the action.

Running chaos experiment and inspecting the output#

Let’s run this experiment and see what we’re getting.
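
As with the previous experiment, the command below assumes the definition is at chaos/health-pause.yaml.

chaos run chaos/health-pause.yaml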

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 10s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed

The probe confirmed that, initially, everything was healthy and all the instances were operational. Then we performed the action that terminates a Pod, and we waited for ten seconds. You can observe that pause in your own output by comparing the timestamps of the Pausing after activity for 10s... and the subsequent Steady state hypothesis: The app is healthy events (they are removed here for brevity). After that, the probe confirmed that all the instances of the applications were healthy.

Let’s take another look at the Pods and confirm that’s really true.
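
The same kubectl command as before lists the Pods in the go-demo-8 Namespace.

kubectl --namespace go-demo-8 get pods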

The output is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    1/1   Running 0        2m19s
go-demo-8-db-... 1/1   Running 0        22m

As you can see, a new Pod was created and, in my case, it has already been running for over two minutes.


In the next lesson, we will carry out another set of chaos experiments to check whether our application remains accessible to customers at all times.
