Validating whether all the Pods are healthy and running is useful. But that does not necessarily mean that our application is accessible. Maybe the Pods are running, and everything is fantastic and peachy, but our customers cannot access our application.

Let’s see how we can validate whether we can send HTTP requests to our application and whether we can continue doing that after an instance of our app is destroyed. What would that definition look like?

Before we dive into application availability, we have a tiny problem that needs to be addressed. I couldn’t define the address of our application in YAML because your IP is almost certainly different from mine. And neither of us is using a real domain because that would be too complicated to set up. That problem allows me to introduce you to yet another feature of Chaos Toolkit.

We are going to define a variable that can be injected into our definition.

Inspecting the definition of health-http.yaml#

Let’s take a quick look at yet another YAML.
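The command below should display the definition. The file name (health-http.yaml) comes from this lesson, while the directory it lives in depends on how you organized the exercises, so treat the path as an assumption.

cat health-http.yaml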

The output is as follows.

version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, the applications as a whole should still be operational.
tags:
- k8s
- pod
- http
configuration:
  ingress_host:
      type: env
      key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: app-responds-to-requests
    type: probe
    tolerance: 200
    provider:
      type: http
      timeout: 3
      verify_tls: false
      url: http://${ingress_host}/demo/person
      headers:
        Host: go-demo-8.acme.com
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 2

The new parts of this definition, compared to the previous one, are the configuration section and the changed steady-state-hypothesis. There are a few other changes. Some of those are cosmetic, while others are indeed important.

Checking the difference between health-pause.yaml and health-http.yaml#

We’ll skip commenting on the contents of this file because it is hard to see what’s different when compared to what we had before. We’ll comment on the differences by executing a diff between the new and the old version of the definition.
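Assuming both definitions sit in the same directory (your layout might differ), the comparison can be produced with a command like this.

diff health-pause.yaml health-http.yaml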

The output is as follows.

3c3
< description: If an instance of the application is terminated, a new instance should be created
---
> description: If an instance of the application is terminated, the applications as a whole should still be operational.
7c7,11
< - deployment
---
> - http
> configuration:
>   ingress_host:
>       type: env
>       key: INGRESS_HOST
11c15
<   - name: all-apps-are-healthy
---
>   - name: app-responds-to-requests
13c17
<     tolerance: true
---
>     tolerance: 200
15,19c19,24
<       type: python
<       func: all_microservices_healthy
<       module: chaosk8s.probes
<       arguments:
<         ns: go-demo-8
---
>       type: http
>       timeout: 3
>       verify_tls: false
>       url: http://${ingress_host}/demo/person
>       headers:
>         Host: go-demo-8.acme.com
32c37
<     after: 10
---
>     after: 2

We can see that, this time, quite a lot of things changed. The description and the tags are different. Those are only informative, so there’s no reason to go through them.

What matters is that we added a configuration section and that it has a variable called ingress_host.

Variables can be of different types and, in this case, the one we defined is an environment variable (env). The key is INGRESS_HOST. That means that if we set the environment variable INGRESS_HOST, its value will be assigned to the Chaos Toolkit variable ingress_host.
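For example, if the Ingress of your cluster is reachable through the IP 1.2.3.4 (a made-up value; use whatever address or domain your cluster exposes), you would export the variable before running the experiment.

export INGRESS_HOST=1.2.3.4 # replace with the address of your Ingress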

The name of the steady-state hypothesis changed, and the tolerance is now 200. Before, we were validating whether our application is healthy by checking whether the output is true or false. This time, however, we will verify whether our app responds with the 200 response code or with something else.

Further on, we can see that the probe’s provider changed from the Python function all_microservices_healthy to an http provider. The timeout is set to 3 seconds.

All in all, we expect our application to respond within three seconds. That’s unrealistically high, I would say. It should be much lower, like a hundred milliseconds. However, I couldn’t be a hundred percent sure that your networking is fast, so I defined a relatively high value to be on the safe side.

You should also note that we are not verifying TLS (verify_tls) because it would be hard to define certificates without having a proper domain. In “real life” situations, you would always validate TLS.

Now comes the vital part.

The url to which we’re going to send the request uses the ingress_host variable. We’ll be sending requests to our Ingress domain/IP and to the path /demo/person. Since our Ingress is configured to accept only requests addressed to the host go-demo-8.acme.com, we are adding that Host as one of the headers of the request.
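If you’d like to see for yourself what the probe does, you can replicate it manually with curl, assuming you already exported INGRESS_HOST as described earlier.

curl -H "Host: go-demo-8.acme.com" "http://$INGRESS_HOST/demo/person"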

Finally, we’re changing the pause. Instead of 10 seconds we had before, we’re going to give it 2 seconds. After destroying a Pod, we’re going to wait for two seconds, and then we’re going to validate whether we can send a request to it and whether it responds with 200. I wanted to ensure that the Pod is indeed destroyed. Otherwise, we wouldn’t need such a pause.

Running chaos experiment and inspecting the output#

Let’s run this experiment and see what we’ll get.
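The experiment is executed with chaos run followed by the path to the definition; as before, the directory is an assumption.

chaos run health-http.yaml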

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 2s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered

The initial probe passed. It validated that our application responds to requests before the action was executed. The action that terminated a Pod was executed and, after that, we waited for two seconds. Further on, we can see that the post-action probe failed. Two seconds after destroying an instance of the application, we were unable to send a request to it. That’s very unfortunate, isn’t it? Why does that happen?

Why is our application not highly available?#

I want you to stop here and think. Why is our application not highly available? It should be responsive and available at all times. If you destroy an instance of our application, it should continue serving our users. So, why is it not highly available? What is missing?

Did you figure it out? Kudos to you if you did. If you didn’t, I’ll provide an answer.

Our application is not highly available. It does not continue serving requests after a Pod (an instance) is destroyed because there is only one instance. Every application, when its architecture allows, should run multiple instances as a way to prevent this type of situation. If, for example, we had three instances of our application and we destroyed one of them, the other two should be able to continue serving requests while Kubernetes recreates the failed Pod. In other words, we need to increase the number of replicas of our application.

We could scale up in quite a few ways. We could just go to the definition of the Deployment and say that there should be two, three, or four replicas of that application. But that’s a bad idea because it’s static. If we say three replicas, then our app will always have three replicas. What we want is for our application to scale up and down; it should increase and decrease the number of instances depending on their memory or CPU utilization. We could even define more complicated criteria based on Prometheus metrics.

Inspecting the definition of health/hpa.yaml#

We’re not going to go into details of how to scale applications. That’s not the subject of this book. Instead, I’ll just say that we’re going to define HorizontalPodAutoscaler (HPA). So, let’s take a look at yet another YAML.
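As with the previous definitions, we can display the file with cat. The path health/hpa.yaml matches the heading above, while the base directory is an assumption.

cat health/hpa.yaml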

The output is as follows.

---

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: go-demo-8
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-demo-8
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 80

That definition specifies a HorizontalPodAutoscaler called go-demo-8. It targets the Deployment with the same name (go-demo-8). The minimum number of replicas will be 2, and the maximum number will be 6. Our application will have anything between two and six instances. The exact number depends on the metrics.

In this case, we have two basic metrics. The average utilization of CPU should be around 80%, and the average usage of memory should be 80% as well. In most “real-world” cases, those two metrics would be insufficient. Nevertheless, they should be suitable for our example.

All in all, that HPA will make our application run at least two replicas. That should hopefully make it highly available. If you’re unsure about the validity of that statement, try to guess what happens if we destroy one or multiple replicas of an application. The others should continue serving requests.

Deploying the HorizontalPodAutoscaler#

Let’s deploy the HorizontalPodAutoscaler.
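Since the Deployment it targets runs in the go-demo-8 Namespace, we apply the HPA there as well; the file path is, again, an assumption.

kubectl --namespace go-demo-8 apply --filename health/hpa.yaml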

To be on the safe side, we’ll retrieve HPAs and confirm that everything looks OK.
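A plain kubectl get hpa in the go-demo-8 Namespace is enough for that.

kubectl --namespace go-demo-8 get hpa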

The output is as follows.

NAME      REFERENCE            TARGETS         MINPODS MAXPODS REPLICAS AGE
go-demo-8 Deployment/go-demo-8 16%/80%, 0%/80% 2       6       2        51s

It might take a few moments until the HPA figures out what it needs to do. Keep repeating the get hpa command until the number of replicas increases to 2 (or more).

Now we’re ready to proceed. The HorizontalPodAutoscaler increased the number of replicas of our application. However, we are yet to see whether that was enough. Is our app now highly available?

Running chaos experiment and inspecting the output#

Just like everything we do in this book, we are always going to validate our theories and ideas by running chaos experiments. So let’s re-run the same experiment and see what happens. Remember, the experiment we are about to run (the same one as before) validates whether our application can serve requests after an instance of that application is destroyed. In other words, it checks whether the app is highly available.
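The command is the same chaos run invocation we used earlier (with the path, as always, being an assumption).

chaos run health-http.yaml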

The output, without the timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 2s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed

We can see that the initial probe passed and that the action was executed to terminate a Pod. After that, it waited for two seconds just to make a hundred percent sure that the Pod is destroyed. Then the probe was run again. It passed! Everything was successful. Our experiment was a success. Our application is indeed highly available.


In the next lesson, we will carry out a chaos experiment to check what happens if we destroy an instance of a dependency of our application.
