Validating Application Availability
In this lesson, we will be running some chaos experiments to check if our application is highly available and remains accessible from outside.
We'll cover the following
- Inspecting the definition of health-http.yaml
- Checking the difference between health-pause.yaml and health-http.yaml
- Running the chaos experiment and inspecting the output
- Why is our application not highly available?
- Inspecting the definition of health/hpa.yaml
- Deploying the HorizontalPodAutoscaler
- Running the chaos experiment and inspecting the output
Validating whether all the Pods are healthy and running is useful. But that does not necessarily mean that our application is accessible. Maybe the Pods are running, and everything is fantastic and peachy, but our customers cannot access our application.
Let’s see how we can validate whether we can send HTTP requests to our application and whether we can continue doing that after an instance of our app is destroyed. What would that definition look like?
Before we dive into application availability, we have a tiny problem that needs to be addressed. I couldn’t define the address of our application in YAML because your IP is almost certainly different from mine. And neither of us is using a real domain because that would be too complicated to set up. That problem allows me to introduce you to yet another feature of Chaos Toolkit.
We are going to define a variable that can be injected into our definition.
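Before running the experiment, make sure that environment variable is set. The exact command depends on how you created your cluster and which Ingress controller you use, so treat the example below as an illustration only; the Namespace, the Service name, and the field (ip vs. hostname) are assumptions you will likely need to adjust.
# Example only: fetch the externally accessible address of an NGINX Ingress controller
export INGRESS_HOST=$(kubectl --namespace ingress-nginx \
    get service ingress-nginx-controller \
    --output jsonpath="{.status.loadBalancer.ingress[0].ip}")

echo $INGRESS_HOST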
Inspecting the definition of health-http.yaml
Let’s take a quick look at yet another YAML.
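If you’re following along with the repository that accompanies this chapter, displaying the file should look something like the command below (the path is an assumption; adjust it to wherever the file lives in your copy).
cat chaos/health-http.yaml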
The output is as follows.
version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, the applications as a whole should still be operational.
tags:
- k8s
- pod
- http
configuration:
  ingress_host:
    type: env
    key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: app-responds-to-requests
    type: probe
    tolerance: 200
    provider:
      type: http
      timeout: 3
      verify_tls: false
      url: http://${ingress_host}/demo/person
      headers:
        Host: go-demo-8.acme.com
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 2
What’s new in this definition, compared to the previous one, is that we added a configuration section and changed our steady-state-hypothesis. There are a few other changes as well. Some of those are cosmetic, while others are indeed important.
Checking the difference between health-pause.yaml and health-http.yaml
We’ll skip commenting on the contents of this file because it is hard to see what’s different when compared to what we had before. We’ll comment on the differences by executing a diff between the new and the old version of the definition.
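Assuming the files live in the same directory as before (the paths are an assumption; adjust them to your copy of the repository), the command could look as follows.
diff chaos/health-pause.yaml chaos/health-http.yaml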
The output is as follows.
3c3
< description: If an instance of the application is terminated, a new instance should be created
---
> description: If an instance of the application is terminated, the applications as a whole should still be operational.
7c7,11
< - deployment
---
> - http
> configuration:
>   ingress_host:
>     type: env
>     key: INGRESS_HOST
11c15
<   - name: all-apps-are-healthy
---
>   - name: app-responds-to-requests
13c17
<     tolerance: true
---
>     tolerance: 200
15,19c19,24
<       type: python
<       func: all_microservices_healthy
<       module: chaosk8s.probes
<       arguments:
<         ns: go-demo-8
---
>       type: http
>       timeout: 3
>       verify_tls: false
>       url: http://${ingress_host}/demo/person
>       headers:
>         Host: go-demo-8.acme.com
32c37
<     after: 10
---
>     after: 2
We can see that, this time, quite a lot of things changed. The description and the tags are different. Those are only informative, so there’s no reason to go through them. What matters is that we added a configuration section and that it has a variable called ingress_host.
Variables can be of different types and, in this case, the one we defined is an environment variable (env). The key is INGRESS_HOST. That means that if we set the environment variable INGRESS_HOST, its value will be assigned to the Chaos Toolkit variable ingress_host.
The name of the probe inside the steady-state-hypothesis changed, and the tolerance is now 200. Before, we were validating whether our application is healthy by checking whether the output is true or false. This time, however, we will verify whether our app responds with the 200 response code or with something else.
Further on, we can see that we changed the probe’s provider from the all_microservices_healthy Python function to an http request, and that the timeout is set to 3 seconds.
All in all, we expect our application to respond within three seconds. That’s unrealistically high, I would say. It should be much lower, like a hundred milliseconds. However, I couldn’t be a hundred percent sure that your networking is fast, so I defined a relatively high value to be on the safe side.
You should also note that we are not verifying TLS (verify_tls) because it would be hard to define certificates without having a proper domain. In “real life” situations, you would always validate TLS.
Now comes the vital part.
The url to which we’re going to send the request uses the ingress_host variable. We’ll be sending requests to our Ingress domain/IP and to the path /demo/person. Since our Ingress is configured to accept only requests addressed to go-demo-8.acme.com, we are adding that Host as one of the headers of the request.
Finally, we changed the pause. Instead of the 10 seconds we had before, we’re going to give it 2 seconds. After destroying a Pod, we’re going to wait for two seconds, and then we’re going to validate whether we can send a request to the application and whether it responds with 200. I wanted to ensure that the Pod is indeed destroyed; otherwise, we wouldn’t need such a pause.
Running the chaos experiment and inspecting the output
Let’s run this experiment and see what we’ll get.
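The invocation below uses the standard chaos run command from Chaos Toolkit; only the path to the definition is an assumption based on the layout we’ve been using so far.
chaos run chaos/health-http.yaml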
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 2s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
The initial probe passed. It validated that our application responds to requests before any actions were executed. The action that terminates a Pod was executed and, after that, we waited for two seconds. Further on, we can see that the post-action probe failed. Two seconds after destroying an instance of the application, we were unable to send a request to it. That’s very unfortunate, isn’t it? Why does that happen?
Why is our application not highly available?
I want you to stop here and think. Why is our application not highly available? It should be responsive and available at all times. If you destroy an instance of our application, it should continue serving our users. So, why is it not highly available? What is missing?
Did you figure it out? Kudos to you if you did. If you didn’t, I’ll provide an answer.
Our application is not highly available. It does not continue serving requests after a Pod (an instance) is destroyed because there is only one instance to begin with. Every application, when its architecture allows it, should run multiple instances as a way to prevent this type of situation. If, for example, we had three instances of our application and destroyed one of them, the other two should be able to continue serving requests while Kubernetes recreates the failed Pod. In other words, we need to increase the number of replicas of our application.
We could scale up in quite a few ways. We could just go to the definition of the Deployment and say that there should be two or three or four replicas of that application. But that’s a bad idea because it’s static. If we say three replicas, our app would always have three replicas. What we want is for the number of instances to go up and down, increasing and decreasing depending on memory or CPU utilization. We could even define more complicated criteria based on Prometheus metrics.
Inspecting the definition of health/hpa.yaml
We’re not going to go into the details of how to scale applications. That’s not the subject of this book. Instead, I’ll just say that we’re going to define a HorizontalPodAutoscaler (HPA). So, let’s take a look at yet another YAML.
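As before, displaying the file is enough (the path below is an assumption derived from the file name in the heading; adjust it if your copy of the repository differs).
cat k8s/health/hpa.yaml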
The output is as follows.
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: go-demo-8
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-demo-8
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 80
That definition specifies a HorizontalPodAutoscaler called go-demo-8. It is targeting the Deployment with the same name, go-demo-8. The minimum number of replicas will be 2, and the maximum number will be 6. In other words, our application will have anything between two and six instances, with the exact number depending on the metrics.
In this case, we have two basic metrics: the average CPU utilization should be around 80%, and the average memory usage should be around 80% as well. In most “real-world” cases, those two metrics would be insufficient. Nevertheless, they should be suitable for our example.
All in all, that HPA will make our application run with at least two replicas. That should, hopefully, make it highly available. If you’re unsure about the validity of that statement, try to guess what happens if we destroy one or multiple replicas of an application. The others should continue serving requests.
Deploying the HorizontalPodAutoscaler
Let’s deploy the HorizontalPodAutoscaler.
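Applying the definition we just inspected should be enough. Both the path and the Namespace below are assumptions; change them to match your setup.
kubectl --namespace go-demo-8 \
    apply --filename k8s/health/hpa.yaml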
To be on the safe side, we’ll retrieve HPAs and confirm that everything looks OK.
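The command below assumes the HPA was created in the go-demo-8 Namespace.
kubectl --namespace go-demo-8 get hpa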
The output is as follows.
NAME        REFERENCE              TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
go-demo-8   Deployment/go-demo-8   16%/80%, 0%/80%   2         6         2          51s
It might take a few moments until the HPA figures out what it needs to do. Keep repeating the get hpa command until the number of replicas increases to 2 (or more).
Now we’re ready to proceed. The HorizontalPodAutoscaler increased the number of replicas of our application. However, we are yet to see whether that was enough. Is our app now highly available?
Running the chaos experiment and inspecting the output
Just like everything else we do in this book, we are going to validate our theories and ideas by running chaos experiments. So let’s re-run the same experiment and see what happens. Remember, the experiment we are about to run (the same one as before) validates whether our application can serve requests after an instance of that application is destroyed. In other words, it checks whether the app is highly available.
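The command is the same one we ran before (with the path, as before, being an assumption about the repository layout).
chaos run chaos/health-http.yaml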
The output, without the timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 2s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed
We can see that the initial probe passed and that the action terminated a Pod. After that, the experiment waited for two seconds, just to make a hundred percent sure that the Pod was destroyed, and then re-ran the probe. It passed! The experiment completed successfully, which tells us that our application is indeed highly available.
In the next lesson, we will carry out a chaos experiment to check what happens if we destroy an instance of a dependency of our application.