Probing Phases and Conditions
In this lesson, we will check out the pod conditions and phases and learn how to probe them.
So far, we have only been checking whether Pods exist. That does not really serve much of a purpose on its own. Instead, we should validate whether the Pods are healthy. Knowing that a Pod exists does not do us much good if it is unhealthy or if the application inside it is hanging. What we really want to validate is whether the Pods exist and whether they are in the correct phase and meet the expected conditions.
Describing the pod and inspecting the conditions#
Before we go through that, let’s describe the Pod that we created and see the types, phases, and conditions it’s in.
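If you are running the examples in your own cluster, a command along these lines should produce it (the Pod name and the Namespace are assumed to be go-demo-8, as in the previous lessons; adjust them if yours differ).

kubectl --namespace go-demo-8 describe pod go-demo-8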
The output, limited to the relevant parts, is as follows.
...
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
...
If we take a closer look at the conditions section, we can see that there are two columns: the type and the status. This Pod is initialized, but it is not yet Ready. There are different conditions of that Pod that we might want to take into account when constructing our experiments. What is even more important is that we have not really checked the status of our Pod, neither through experiments nor with kubectl. Maybe that Pod was never really running. Perhaps it was always failing, no matter which experiments we were running, which is entirely possible because our experiments were not validating the readiness and the state of our Pod. They were just checking whether it exists. Maybe it did exist, but it was never functional.
Let’s introduce some additional arguments to our experiment to validate this.
Inspecting the definition of terminate-pod-phase.yaml#
We’ll take a look at a new YAML definition.
What matters more is the diff.
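If you still have the previous definition around, you can generate a similar diff yourself with something like the command below (both file names are assumptions based on this lesson; use whatever names you saved the definitions under).

diff terminate-pod.yaml terminate-pod-phase.yaml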
The output is as follows.
> - name: pod-in-phase
>   type: probe
>   tolerance: true
>   provider:
>     type: python
>     func: pods_in_phase
>     module: chaosk8s.pod.probes
>     arguments:
>       label_selector: app=go-demo-8
>       ns: go-demo-8
>       phase: Running
> - name: pod-in-conditions
>   type: probe
>   tolerance: true
>   provider:
>     type: python
>     func: pods_in_conditions
>     module: chaosk8s.pod.probes
>     arguments:
>       label_selector: app=go-demo-8
>       ns: go-demo-8
>       conditions:
>       - type: Ready
>         status: "True"
We added two new probes. The tolerance for the first one (pod-in-phase) is set to true, so we expect whatever we're probing to return that value. This time, the function is pods_in_phase. It's yet another function available in the module chaosk8s.pod.probes.

Next, we have a few arguments. The label_selector is set to app=go-demo-8, and ns is set to go-demo-8. These are the same arguments we used before, but this time we have a new one called phase, which is set to Running. We are validating whether there is at least one Pod with the matching label, inside a specific Namespace, and in the phase Running.

The second probe (pod-in-conditions) is similar to the first one, except that it uses the function pods_in_conditions. That's another function available out of the box in the Kubernetes plugin. The arguments are the same: label_selector and ns. The new thing here is the conditions section. In it, we specify that the type of condition should be Ready and the status should be "True".
All in all, we are adding two new probes. The first one is going to confirm that our Pod is indeed running, and the second will check whether it is ready to serve requests.
Running the chaos experiment and inspecting the output#
Let’s see what we get when we execute this experiment.
The output is as follows (timestamps are removed for brevity).
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Probe: pod-in-phase
[... INFO] Probe: pod-in-conditions
[... ERROR] => failed: chaoslib.exceptions.ActivityFailed: pod go-demo-8 does not match the following given condition: {'type': 'Ready', 'status': 'True'}
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'pod-in-conditions' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: failed
The experiment failed. One of the probes failed before the experiment even started executing actions. Something is wrong: the state of our Pod is not correct. It probably never was.
We have the ActivityFailed exception saying that the Pod does not match the following given condition: {'type': 'Ready', 'status': 'True'}. Our Pod is running. We know that because the previous probe (pod-in-phase) is checking whether it is running. So, the Pod is indeed running, but its Ready condition is not met.
Now, we know that there was something wrong with our Pod from the very beginning. We had no idea until now because we did not really examine the Pod itself. We just trusted that its existence was enough. “Oh, I created a Pod. Therefore, the Pod must be running.”
We can see from the experiment that the Pod is not running. Or, to be more precise, it is running, but it is not ready. Everything we have done so far was useless because the initial state was never really well defined.
Just as before, we will validate that the status of the experiment is indeed indicating an error. We’ll do that by retrieving the exit code of the last command.
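In Bash and similar shells, the exit code of the last command is available in the $? variable, so the check is a simple echo executed right after the chaos run command.

echo $?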
The output is 1, as we expected it to be. So, that's fine.
Checking the logs of the pod#
Now, let’s take a look at the Pod’s logs. Why is it failing, and why didn’t we notice that there is something wrong with it from the start?
The output is as follows.
2020/01/27 20:50:00 Starting the application
2020/01/27 20:50:00 Configuring DB go-demo-8-db
panic: no reachable servers
goroutine 1 [running]:
main.setupDb()
/Users/vfarcic/code/go-demo-8/main.go:74 +0x32e
main.main()
/Users/vfarcic/code/go-demo-8/main.go:52 +0x78
If you have read one of my other courses or books, you probably recognize that output. If you haven’t, I won’t bore you with the architecture of the application. Instead, I’ll just say that the app uses MongoDB, which we did not deploy. It tries to connect to it, and it fails. As a result, the container fails, and it is recreated by Kubernetes continuously. Or, at least until we fix it.
From the very beginning, the Pod was failing because it was trying to connect to the database, which does not exist. We just discovered that by running an experiment probe that failed in its initial set of validations.
Deploying MongoDB and fixing the issue#
Now that we know that the Pod was never really running because a database is missing, we’re going to fix it by deploying MongoDB. I will not go through the definition of the DB because it’s not relevant. The only thing that matters is that we will deploy it to fulfill the requirement of the application. Hopefully, that will fix the issue, and the Pod will be running this time.
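How exactly you deploy it depends on where you keep the definition. Assuming the database is described in a manifest called db.yaml (a hypothetical file name), applying it would look like this.

kubectl --namespace go-demo-8 apply --filename db.yaml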
Next, we’ll wait until the database is rolled out.
Retrieving the pods#
Now that the database is rolled out, we’re going to take a look at the Pods.
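Listing them is the usual kubectl command scoped to our Namespace.

kubectl --namespace go-demo-8 get pods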
The output is as follows.
NAME               READY   STATUS    RESTARTS   AGE
go-demo-8          1/1     Running   21         66m
go-demo-8-db-...   1/1     Running   0          7m17s
In your case, if the go-demo-8 Pod is not yet Running, please wait for a while. The next time it restarts, it should be working because it will be able to connect to the database.
The go-demo-8 Pod is now ready and running after restarting quite a few times (in my case, 21 times).
Please note that this is the first time we are retrieving Pods. We can see that go-demo-8 is now running. We would have saved ourselves a lot of pain if we had retrieved the Pods earlier. However, while that would help us in this “demo” environment, we cannot expect to validate things manually in a potentially large production environment. That's why we have tests, but that's not the subject of this course. Even if we don't have tests, we have chaos experiments to validate the desired state both before and after destructive actions.
Our application is now fully operational.
Re-running chaos experiment and inspecting the output#
Do you think that our chaos experiment will pass now? What do you think the result will be when we re-run the same experiment?
Let’s see.
The output is as follows (timestamps are removed for brevity).
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Probe: pod-in-phase
[... INFO] Probe: pod-in-conditions
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-pod
[... INFO] Pausing after activity for 10s...
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Probe: pod-in-phase
[... INFO] Probe: pod-in-conditions
[... ERROR] => failed: chaoslib.exceptions.ActivityFailed: pod go-demo-8 does not match the following given condition: {'type': 'Ready', 'status': 'True'}
[... WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[... CRITICAL] Steady state probe 'pod-in-conditions' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
To begin, all three probes executed before the action passed, meaning that the Pod existed when the experiment started, and that it was running and ready. Then, the action to terminate the Pod was executed, and after the pause of 10 seconds, a probe failed.
So, we managed to confirm that the Pod was really in the correct state this time, and that the steady state no longer holds after we destroy the only Pod with the matching labels in that Namespace. Judging by the output, the Pod was still being terminated when the post-action probes ran: it still existed and was still in the Running phase, so pod-exists and pod-in-phase passed, but it was no longer ready, so pod-in-conditions failed. Chaos Toolkit fails fast, so the execution of the experiment stopped at that failure and moved on to the rollback phase.
Like before, we’ll retrieve the last exit code, and confirm that Chaos Toolkit indeed exits correctly.
We can see that the output is 1, meaning that the experiment failed.
We are moving forward and improving our experiments. However, while our experiment is now relatively good, our application is not. It is not fault-tolerant, and we’re going to correct that in the next lesson.