Validating Application Health

In this lesson, we will run experiments that validate whether all the instances of all the applications in the go-demo-8 Namespace are healthy.

Inspecting the definition of health.yaml#

Let’s take a look at yet another chaos experiment definition.
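
The file is named health.yaml. Assuming it lives in a chaos directory inside the repository you are working from (adjust the path to match your setup), we can display it with a command such as this.

cat chaos/health.yaml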

The output is as follows.

version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, a new instance should be created
tags:
- k8s
- pod
- deployment
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8

What do we have there?

The title asks what happens if we terminate an instance of the application, and the description states that if an instance of the application is terminated, a new instance should be created. Both should be self-explanatory and serve no practical purpose beyond informing us about the objective of the experiment.

“So far, that definition looks almost the same as what we were doing in the previous section.” If that’s what you’re thinking, you’re right. Or, to be more precise, they are very similar.

The reason why we’re doing this again is that there is an easier and faster way to do what we did before.

Instead of verifying conditions and states of specific Pods of our application, we are going to validate whether all the instances of all the apps in that namespace are healthy. Instead of going from one application to another and checking states and conditions, we’re just going to tell the experiment, “look, for you to be successful, everything in that namespace needs to be healthy.” We’re going to do that by using the function all_microservices_healthy. The module is chaosk8s.probes, and the argument is simply a namespace.

So, our conditions before and after actions are that everything in that namespace needs to be healthy.

Further on, we have an action in the method that will terminate a Pod matching that label_selector in that specific Namespace (ns). The Pod will be picked randomly among those with that label selector. Such randomness doesn’t make much sense right now since we have only one Pod, but the selection is going to be random nevertheless.

Predict the outcome yourself#

Before we run that experiment, please take another look at the definition. I want you to try to figure out whether the experiment will succeed or fail. And, if you think it will fail, try to guess what is missing and why. After all, our application should now be fault-tolerant. It should be healthy, right? Think about the experiment and try to predict the outcome.

Running chaos experiment and inspecting the output#

Now that you’ve done some thinking and guessed the outcome, let’s run the experiment.
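
Assuming the Chaos Toolkit CLI is installed and the definition is at chaos/health.yaml (again, adjust the path if yours differs), the command would be along these lines.

chaos run chaos/health.yaml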

The output is as follows (timestamps are removed for brevity).

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... ERROR]   => failed: chaoslib.exceptions.ActivityFailed: the system is unhealthy
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'all-apps-are-healthy' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered

We can see that the initial probe passed and that the action was executed. And then, the probe was executed again, and it failed. Why did it fail? What are we missing? We know that Kubernetes will recreate terminated Pods that are controlled by a Deployment. We validated that behavior in the previous section. So, why is it failing now? It should be relatively straightforward to figure out what’s missing. But before we go there, let’s take a look at the Pods. Are they really running?
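
We can list the Pods with kubectl. The only assumption here is that your current context points to the cluster where go-demo-8 is running.

kubectl --namespace go-demo-8 get pods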

The output is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    1/1   Running 0        9s
go-demo-8-db-... 1/1   Running 0        11m

Why did it fail?#

We can see that the database (go-demo-8-db) and the API (go-demo-8) are both running. In my case, the API, the one without the -db suffix, has been running for only 9 seconds. So, the application really is fault-tolerant (failed Pods are recreated), and yet the experiment failed. Why did it fail?

Inspecting the definition of health-pause.yaml#

Let’s take a look at yet another experiment definition.
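
As before, the command below assumes the file sits in the chaos directory.

cat chaos/health-pause.yaml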

The output is as follows.

version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, a new instance should be created
tags:
- k8s
- pod
- deployment
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 10

We were missing a pause. That was done intentionally since I want you to understand the importance of not only executing steps one after another but also giving the system appropriate time to recuperate when needed. If you expect your system to have healthy Pods immediately after destruction, then don’t add the pause. In our case, though, we’re setting the expectation that within 10 seconds of the destruction of an instance, a new one should be fully operational. That’s the expectation we have right now; yours could be different. We could expect the system to recuperate immediately, after one second, or after three days. With pauses, we define how long the system is given to recuperate from a potentially destructive action. In this case, we are setting it to 10 seconds.

So, let’s confirm that this is really the only change by outputting the differences between the old and the new definition.
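
A plain diff between the two files does the job, assuming both are in the chaos directory.

diff chaos/health.yaml chaos/health-pause.yaml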

The output is as follows.

>   pauses: 
>     after: 10

We can see that, indeed, the only change is the additional pause of 10 seconds after the action.

Running chaos experiment and inspecting the output#

Let’s run this experiment and see what we’re getting.
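
As with the previous experiment, the command below assumes the definition is at chaos/health-pause.yaml.

chaos run chaos/health-pause.yaml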

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 10s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: all-apps-are-healthy
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed

The probe confirmed that, initially, everything was healthy and all the instances were operational. Then we performed the action that terminates a Pod, and we waited for ten seconds. You can observe that pause in your own output by comparing the timestamps of the Pausing after activity for 10s... and the subsequent Steady state hypothesis: The app is healthy events (they are removed here for brevity). After that, the probe confirmed that all the instances of the applications were healthy.

Let’s take another look at the Pods and confirm that’s really true.
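
The same kubectl command as before lists the Pods in the go-demo-8 Namespace.

kubectl --namespace go-demo-8 get pods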

The output is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    1/1   Running 0        2m19s
go-demo-8-db-... 1/1   Running 0        22m

As you can see, a new Pod was created and, in my case, it has already been running for over two minutes.


In the next lesson, we will carry out another set of chaos experiments to check whether our application remains accessible to customers at all times.
