Terminating Random Application Instances

Applying the CronJob#

Let’s apply the CronJob that we just explored and see what’ll happen.
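The exact command depends on where you stored the definition from the previous lesson. Assuming it is in a file like chaos/periodic.yaml (a hypothetical path; use your own), it could look similar to the one that follows.

kubectl --namespace chaos apply \
    --filename chaos/periodic.yaml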

Retrieving all CronJobs and Pods from the chaos Namespace#

Next, we’ll list all the CronJobs in the chaos Namespace.
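Assuming kubectl still points at the cluster we have been using, a command like the one that follows should do.

kubectl --namespace chaos get cronjobs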

If the LAST SCHEDULE is set to <none>, you might need to wait a while longer (up to two minutes) and re-run the previous command. Once the first Job is created, the output should be similar to the one that follows.

NAME                   SCHEDULE    SUSPEND ACTIVE LAST SCHEDULE AGE
health-instances-chaos */2 * * * * False   1      8s            107s

Next, we’ll take a look at the Jobs created by that CronJob.
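We can list them with a command like the one that follows.

kubectl --namespace chaos get jobs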

Just as we had to wait until the CronJob created the Job, now we need to wait until the Job creates the Pod and the experiment inside it finishes executing. Keep re-running the previous command until the COMPLETIONS column is set to 1/1. The output should be similar to the one that follows.

NAME                       COMPLETIONS DURATION AGE
health-instances-chaos-... 1/1         93s      98s

From this moment on, the results I present might differ from what you observe on your screen. Ultimately, we’ll end up with the same result, even though the time you need to wait for it might differ.

Finally, we’ll retrieve the Pods in the chaos Namespace and check whether there was a failure.
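The command could be as simple as the one that follows.

kubectl --namespace chaos get pods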

The output is as follows.

NAME                       READY STATUS    RESTARTS AGE
health-instances-chaos-... 0/1   Completed 0        107s

Retrieving the Pods from the application#

In my case, it seems that the first run of the experiment was successful. To be on the safe side, we’ll take a look at the Pods of the demo applications.
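Assuming the demo applications are running in the go-demo-8 Namespace, the command could be as follows.

kubectl --namespace go-demo-8 get pods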

The output is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    2/2   Running 2        27m
go-demo-8-...    2/2   Running 3        27m
go-demo-8-db-... 2/2   Running 0        27m
repeater-...     2/2   Running 0        27m
repeater-...     2/2   Running 0        53s

All the Pods are running, and, at least in my case, it seems that the experiment did not detect any anomaly. We can confirm that it indeed terminated one of the Pods by observing the AGE column. In my case, one of the repeater Pods is fifty-three seconds old. That must be the one that was created after the experiment removed one of the replicas of the repeater. The experiment indeed chose a random Pod, and, in my case, the system seems to be resilient to such actions.

We might conclude that the experiment did not uncover any weakness in the system. That’s excellent news, isn’t it?

Inspecting the output on Grafana dashboard#

I’d say that we deserve a break. We’ll open the dashboard in Grafana and let it stay on the screen while we celebrate the success. We finally have an experiment that does not uncover any fault in the system.

Please open the Istio Mesh Dashboard. You should know how to find it since we already used it in the previous sections.
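As a reminder, one way to open Grafana, assuming you installed it as an Istio addon and have istioctl available, is the command that follows. If you exposed Grafana differently (e.g., through kubectl port-forward), use your own method instead.

istioctl dashboard grafana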

In my case, everything is “green”, and that is yet another confirmation that my system was not affected negatively by the experiment. Not only did the probe pass, but I can also see that all the metrics are within the thresholds. There are no responses with 4xx or 5xx codes. The Success Rate is 100% for both Services. Life is good, and I do deserve a break as a reward. I’ll leave the dashboard on the screen while the experiment is being executed every two minutes. Since it destroys a random Pod, next time it could be any other one. It’s not always going to be the repeater.

Right now, you must use your imagination and picture me working on something else using my primary monitor, while Grafana keeps running in my secondary display.

Minutes are passing. The CronJob keeps creating Jobs every two minutes, and the experiments are running one after another. I’m focused on my work, confident that everything is OK. I know that there’s nothing to worry about because everything is “green” in Grafana. My system is rock-solid. Everything works. Life is good.

A while later, my whole world turns upside down. All of a sudden, I can see an increase in 5xx responses. The “Success Rate” turned red for both repeater and go-demo-8 Services. Something is terribly wrong. I have an issue that was not detected by any of the previous experiments. Even the one that I’m running right now uncovered an issue only after it was executed quite a few times. To make things more complicated, the experiments are successful, and the problem can be observed only through the Grafana dashboard.

Grafana dashboard after an unsuccessful experiment

You might have reached the same point as I did, or everything might still be green. In the latter case, be patient. Your system will fail as well; it’s just a matter of time.

Unlike the previous experiments, this time we did not destroy or disrupt a specific target, and we did not predict what the adverse effect would be. Still, we did observe through the dashboard that there is a problem affecting both the repeater and go-demo-8. In other words, we uncovered a potential issue. But why did it appear only after so much time? Why wasn’t it discovered right away, during the first execution of the experiment? After all, we were destroying Pods before, and we did fix all the issues we found. Yet, we’re in trouble again.

Inspecting the output on Kiali dashboard#

So far, we know that there is a problem detected through the Grafana dashboard. Let’s switch to Kiali and check whether we can find something useful there.

Please go back to the terminal, cancel the tunnel towards Grafana by pressing ctrl+c, and execute the command that follows to open Kiali.
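Assuming you’re opening the dashboards through istioctl, as in the previous sections, that command could be the one that follows. Adapt it if you exposed Kiali in some other way.

istioctl dashboard kiali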

What do we have there? We can see that there are three applications. Two of them are red.

Select Graph from the left-hand menu, and choose the go-demo-8 Namespace from the top. You should see something similar to figure 9-4.

Kiali dashboard after an unsuccessful experiment

We can see that the traffic is blocked. It’s red all around. External traffic goes to the repeater. From there, it goes to go-demo-8, and then it fails to reach the database. The problem might be that our database was destroyed.

Our database is not replicated, so it makes sense that one of the experiments eventually terminated it. That could result in the crash of all the applications that, directly or indirectly, depend on it. Destroying a Pod of the repeater is okay because there are two replicas of it. The same can be said for go-demo-8 as well. But the database, with the current setup, is not highly available. We could have predicted that destroying the only Pod of the DB would result in downtime.

All in all, eventually, one of the experiments randomly terminated the database, and that resulted in failure. The issue should be temporary, and the system should continue operating correctly as soon as Kubernetes recreates the failed Pod. It shouldn’t take more than a few seconds, a minute at the most, to get back to the desired state. Yet, more time already passed, and things are still red in Kiali. Or, at least, that’s what’s happening on my screen.

Retrieving the Pods from the application#

Let’s check the Pods and see whether we can deduce what’s going on.

Please go back to the terminal, cancel the tunnel towards Kiali by pressing ctrl+c, and execute the command that follows to retrieve the Pods.
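As before, and assuming the demo applications run in the go-demo-8 Namespace, that command could be as follows.

kubectl --namespace go-demo-8 get pods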

The output is as follows.

NAME             READY STATUS  RESTARTS AGE
go-demo-8-...    2/2   Running 0        5m2s
go-demo-8-...    2/2   Running 3        37m
go-demo-8-db-... 2/2   Running 0        3m1s
repeater-...     2/2   Running 0        37m
repeater-...     2/2   Running 0        57s

Everything looks fine, even though we know that there’s something wrong. All the Pods are running, and, in my case, the database was recreated three minutes ago.

What went wrong and how can we fix it?#

Let’s try to figure out what’s causing the issue.

The database was destroyed, and Kubernetes recreated it a few seconds later. The rest of the Pods are running, and, in my case, the last one terminated was the repeater. So why is the system still messed up even after everything seemingly went back to normal? Even though the database is up and running, the system is not operational. What could be the most likely cause of this issue?

When go-demo-8 boots, it connects to the database. My best guess is that the culprit is the logic in the demo application’s code. It was probably written in a way that, if the connection to the database is broken, the app does not try to re-establish it once the database is back online. That’s a bad design. We have a problem with the application itself, and it is an issue we did not capture during past experiments.

In the previous sections, we have already demonstrated that terminating the database produces a short downtime because it is not replicated. The new finding is that the code of the application itself (go-demo-8) does not know how to re-establish the connection. So, even if we made the database fault-tolerant, and even if Kubernetes recreated its failed replicas, the damage would be permanent due to bad code. We would need to redesign the application to make sure that, when the connection to the database is lost, the app itself tries to re-establish it.

Even better, we could make the application fail when the connection to the database is lost. That would result in the termination of the container where it’s running, and Kubernetes would recreate it. That process might repeat itself a few times. However, once the database is up and running again, the container’s restart would result in a new boot of the application, which would be able to re-establish the connection. Alternatively, we could opt for a third solution.

We could also change the health check. Right now, the address is not targeting an endpoint that uses the database. As a result, the health check seems to be working fine, and Kubernetes sees no reason to terminate containers of go-demo-8.
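As an illustration only, a change along those lines could be applied with a patch similar to the one that follows. It assumes that the application is a Deployment named go-demo-8 in the go-demo-8 Namespace, that its first container already defines an HTTP livenessProbe, and that /demo/person is an endpoint that queries the database. All of those are assumptions; adjust them to your actual definitions.

kubectl --namespace go-demo-8 \
    patch deployment go-demo-8 \
    --type json \
    --patch '[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/httpGet/path", "value": "/demo/person"}]'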

No matter the solution we should employ, the vital thing to note is that we found yet another weak spot in our system. Knowing what might go wrong is the most important thing. Solving an issue is often not a problem, as long as we know what the issue is.

Deleting the CronJob and rolling out#

There’s no need for us to continue terminating random Pods since now we know that there is a problem that needs to be resolved. So, we’re going to delete the CronJob.
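Since we know the CronJob’s name from the earlier output, the command can be as follows.

kubectl --namespace chaos delete cronjob health-instances-chaos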

I’ll leave it to you to decide which of the proposed solutions you’d employ, if any. Think of it as yet another homework assignment. For now, we’re going to restart go-demo-8. That should re-establish the connection to the database and fix the issue. To be more precise, it will not solve the problem, but it will act as a workaround that allows us to move on to the next subject. The “real” fix is the task I already assigned to you.
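Assuming go-demo-8 is defined as a Deployment with the same name, the restart could be done with the command that follows.

kubectl --namespace go-demo-8 rollout restart deployment go-demo-8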


In the next lesson, we will discuss how to disrupt network traffic, and you will get a new homework assignment.
