Validating Application Availability
In this lesson, we will be running some chaos experiments to check if our application is highly available and remains accessible from outside.
We'll cover the following
- Inspecting the definition of health-http.yaml
- Checking the difference between health-pause.yaml and health-http.yaml
- Running the chaos experiment and inspecting the output
- Why is our application not highly available?
- Inspecting the definition of health/hpa.yaml
- Deploying the HorizontalPodAutoscaler
- Running the chaos experiment and inspecting the output
Validating whether all the Pods are healthy and running is useful. But that does not necessarily mean that our application is accessible. Maybe the Pods are running, and everything is fantastic and peachy, but our customers cannot access our application.
Let’s see how we can validate whether we can send HTTP requests to our application and whether we can continue doing that after an instance of our app is destroyed. What would that definition look like?
Before we dive into application availability, we have a tiny problem that needs to be addressed. I couldn’t define the address of our application in YAML because your IP is almost certainly different from mine. And neither of us is using a real domain because that would be too complicated to set up. That problem allows me to introduce you to yet another feature of Chaos Toolkit.
We are going to define a variable that can be injected into our definition.
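Before running the experiment, make sure that environment variable is set. The exact command depends on how you created your cluster and which Ingress controller you use, so treat the example below as an illustration only; the Namespace, the Service name, and the field (ip vs. hostname) are assumptions you will likely need to adjust.
# Example only: fetch the externally accessible address of an NGINX Ingress controller
export INGRESS_HOST=$(kubectl --namespace ingress-nginx \
    get service ingress-nginx-controller \
    --output jsonpath="{.status.loadBalancer.ingress[0].ip}")

echo $INGRESS_HOST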
Inspecting the definition of health-http.yaml
Let’s take a quick look at yet another YAML.
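If you’re following along with the repository that accompanies this chapter, displaying the file should look something like the command below (the path is an assumption; adjust it to wherever the file lives in your copy).
cat chaos/health-http.yaml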
The output is as follows.
version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, the applications as a whole should still be operational.
tags:
- k8s
- pod
- http
configuration:
  ingress_host:
    type: env
    key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: app-responds-to-requests
    type: probe
    tolerance: 200
    provider:
      type: http
      timeout: 3
      verify_tls: false
      url: http://${ingress_host}/demo/person
      headers:
        Host: go-demo-8.acme.com
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 2
What’s new in this definition, compared to the previous one, is that we added a configuration section and changed our steady-state-hypothesis. There are a few other changes as well. Some of those are cosmetic, while others are indeed important.
Checking the difference between health-pause.yaml and health-http.yaml
We’ll skip commenting on the contents of this file because it is hard to see what’s different when compared to what we had before. We’ll comment on the differences by executing a diff between the new and the old version of the definition.
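Assuming the files live in the same directory as before (the paths are an assumption; adjust them to your copy of the repository), the command could look as follows.
diff chaos/health-pause.yaml chaos/health-http.yaml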
The output is as follows.
3c3
< description: If an instance of the application is terminated, a new instance should be created
---
> description: If an instance of the application is terminated, the applications as a whole should still be operational.
7c7,11
< - deployment
---
> - http
> configuration:
>   ingress_host:
>     type: env
>     key: INGRESS_HOST
11c15
<   - name: all-apps-are-healthy
---
>   - name: app-responds-to-requests
13c17
<     tolerance: true
---
>     tolerance: 200
15,19c19,24
<       type: python
<       func: all_microservices_healthy
<       module: chaosk8s.probes
<       arguments:
<         ns: go-demo-8
---
>       type: http
>       timeout: 3
>       verify_tls: false
>       url: http://${ingress_host}/demo/person
>       headers:
>         Host: go-demo-8.acme.com
32c37
<     after: 10
---
>     after: 2
We can see that, this time, quite a lot of things changed. The description and the tags are different. Those are only informative, so there’s no reason to go through them. What matters is that we added a configuration section and that it has a variable called ingress_host.
Variables can be of different types and, in this case, the one we defined is an environment variable (env). The key is INGRESS_HOST. That means that if we set the environment variable INGRESS_HOST, its value will be assigned to the Chaos Toolkit variable ingress_host.
The name of the probe inside the steady-state-hypothesis changed, and the tolerance is now 200. Before, we were validating whether our application is healthy by checking whether the output is true or false. This time, however, we will verify whether our app responds with the 200 response code or with something else.
Further on, we can see that we changed the probe’s provider from the all_microservices_healthy Python function to an http request, and that the timeout is set to 3 seconds.
All in all, we expect our application to respond within three seconds. That’s unrealistically high, I would say. It should be much lower, like a hundred milliseconds. However, I couldn’t be a hundred percent sure that your networking is fast, so I defined a relatively high value to be on the safe side.
You should also note that we are not verifying TLS (verify_tls) because it would be hard to define certificates without having a proper domain. In “real life” situations, you would always validate TLS.
Now comes the vital part.
The url to which we’re going to send the request uses the ingress_host variable. We’ll be sending requests to our Ingress domain/IP and to the path /demo/person. Since our Ingress is configured to accept only requests addressed to go-demo-8.acme.com, we are adding that Host as one of the headers of the request.
Finally, we changed the pause. Instead of the 10 seconds we had before, we’re going to give it 2 seconds. After destroying a Pod, we’re going to wait for two seconds, and then we’re going to validate whether we can send a request to the application and whether it responds with 200. I wanted to ensure that the Pod is indeed destroyed; otherwise, we wouldn’t need such a pause.
Running the chaos experiment and inspecting the output
Let’s run this experiment and see what we’ll get.
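The invocation below uses the standard chaos run command from Chaos Toolkit; only the path to the definition is an assumption based on the layout we’ve been using so far.
chaos run chaos/health-http.yaml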
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 2s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
The initial probe passed. It validated that our application responds to requests before any actions were executed. The action that terminates a Pod was executed and, after that, we waited for two seconds. Further on, we can see that the post-action probe failed. Two seconds after destroying an instance of the application, we were unable to send a request to it. That’s very unfortunate, isn’t it? Why does that happen?
Why is our application not highly available?
I want you to stop here and think. Why is our application not highly available? It should be responsive and available at all times. If you destroy an instance of our application, it should continue serving our users. So, why is it not highly available? What is missing?
Did you figure it out? Kudos to you if you did. If you didn’t, I’ll provide an answer.
Our application is not highly available. It does not continue serving requests after a Pod (an instance) is destroyed because there is only one instance to begin with. Every application, when its architecture allows it, should run multiple instances as a way to prevent this type of situation. If, for example, we had three instances of our application and destroyed one of them, the other two should be able to continue serving requests while Kubernetes recreates the failed Pod. In other words, we need to increase the number of replicas of our application.
We could scale up in quite a few ways. We could just go to the definition of the Deployment and say that there should be two or three or four replicas of that application. But that’s a bad idea because it’s static. If we say three replicas, our app would always have three replicas. What we want is for the number of instances to go up and down, increasing and decreasing depending on memory or CPU utilization. We could even define more complicated criteria based on Prometheus metrics.
Inspecting the definition of health/hpa.yaml
We’re not going to go into the details of how to scale applications. That’s not the subject of this book. Instead, I’ll just say that we’re going to define a HorizontalPodAutoscaler (HPA). So, let’s take a look at yet another YAML.
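As before, displaying the file is enough (the path below is an assumption derived from the file name in the heading; adjust it if your copy of the repository differs).
cat k8s/health/hpa.yaml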
The output is as follows.
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: go-demo-8
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-demo-8
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 80
That definition specifies a HorizontalPodAutoscaler called go-demo-8. It is targeting the Deployment with the same name, go-demo-8. The minimum number of replicas will be 2, and the maximum number will be 6. In other words, our application will have anything between two and six instances, with the exact number depending on the metrics.
In this case, we have two basic metrics: the average CPU utilization should be around 80%, and the average memory usage should be around 80% as well. In most “real-world” cases, those two metrics would be insufficient. Nevertheless, they should be suitable for our example.
All in all, that HPA will make our application run with at least two replicas. That should, hopefully, make it highly available. If you’re unsure about the validity of that statement, try to guess what happens if we destroy one or multiple replicas of an application. The others should continue serving requests.
Deploying the HorizontalPodAutoscaler
Let’s deploy the HorizontalPodAutoscaler.
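Applying the definition we just inspected should be enough. Both the path and the Namespace below are assumptions; change them to match your setup.
kubectl --namespace go-demo-8 \
    apply --filename k8s/health/hpa.yaml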
To be on the safe side, we’ll retrieve HPAs and confirm that everything looks OK.
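The command below assumes the HPA was created in the go-demo-8 Namespace.
kubectl --namespace go-demo-8 get hpa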
The output is as follows.
NAME        REFERENCE              TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
go-demo-8   Deployment/go-demo-8   16%/80%, 0%/80%   2         6         2          51s
It might take a few moments until the HPA figures out what it needs to do. Keep repeating the get hpa command until the number of replicas increases to 2 (or more).
Now we’re ready to proceed. The HorizontalPodAutoscaler increased the number of replicas of our application. However, we are yet to see whether that was enough. Is our app now highly available?
Running the chaos experiment and inspecting the output
Just like everything else we do in this book, we are going to validate our theories and ideas by running chaos experiments. So let’s re-run the same experiment and see what happens. Remember, the experiment we are about to run (the same one as before) validates whether our application can serve requests after an instance of that application is destroyed. In other words, it checks whether the app is highly available.
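The command is the same one we ran before (with the path, as before, being an assumption about the repository layout).
chaos run chaos/health-http.yaml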
The output, without the timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate an instance of the application?
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-app-pod
[... INFO] Pausing after activity for 2s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed
We can see that the initial probe passed and that the action terminated a Pod. After that, the experiment waited for two seconds, just to make a hundred percent sure that the Pod was destroyed, and then re-ran the probe. It passed! The experiment completed successfully, which tells us that our application is indeed highly available.
In the next lesson, we will carry out a chaos experiment to check what happens if we destroy an instance of a dependency of our application.