Terminating Random Nodes
In this lesson, we will run an experiment using a CronJob to terminate random nodes, and then we will use the dashboards to figure out whether it was successful.
Applying the CronJob#
Let’s apply the new definition of the CronJob.
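The exact path depends on how you organized the definitions; assuming the updated CronJob manifest is stored in a file like k8s/chaos/node.yaml (a hypothetical path, so adjust it to match your setup), the command could be as follows.

# Apply the CronJob that periodically drains random nodes
# (the path is an assumption; change it to wherever your definition lives)
kubectl apply --filename k8s/chaos/node.yaml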
Assuming that you left the loop that sends requests running in the second terminal, we should be able to observe that the demo application keeps responding with 200. At the moment, the demo application seems to be working correctly.
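If the loop is not running anymore, a rough sketch like the one that follows can bring it back. The INGRESS_HOST variable and the address are assumptions, so replace them with whatever you used in the earlier lessons.

# Keep sending requests to the demo application and output only the response codes
# (INGRESS_HOST and the address are assumptions from the earlier setup)
while true; do
    curl --silent --output /dev/null --write-out "%{http_code}\n" \
        "http://$INGRESS_HOST"
    sleep 1
done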
As you already know, we’ll need to wait for a while until the first Job is created, and the experiment is executed.
Retrieving all CronJobs from the chaos Namespace#
To make the wait more interesting than staring at a blank screen, we’ll retrieve the CronJobs.
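The CronJob was created in the chaos Namespace, so that’s where we’ll look for it.

kubectl --namespace chaos get cronjobs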
After a while, and a few repetitions of the previous command, the output should be similar to the one that follows.
NAME          SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
nodes-chaos   */5 * * * *   False     0        1m01s           5m2s
Once there is something other than <none> in the LAST SCHEDULE column, we can proceed knowing that a Job was created. That means that our experiment is running right now.
Retrieving all Jobs#
Let’s retrieve Jobs and confirm that everything looks correct so far.
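The Jobs created by the CronJob live in the same chaos Namespace.

kubectl --namespace chaos get jobs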
The output is as follows.
NAME              COMPLETIONS   DURATION   AGE
nodes-chaos-...   1/1           2m5s       4m46s
In my case, it runs for over two minutes (2m5s), so I (and probably you) will need a bit more patience. We need to wait until the Job is finished executing. If you’re in the same position, keep repeating the previous command.
Retrieving all Pods from the chaos Namespace#
Next, to be entirely on the safe side, we’ll retrieve Pods as well.
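Just like the Jobs, the Pods they spin up are in the chaos Namespace.

kubectl --namespace chaos get pods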
The output is as follows.
NAME              READY   STATUS              RESTARTS   AGE
nodes-chaos-...   0/1     Completed           0          4m55s
nodes-chaos-...   0/1     ContainerCreating   0          4s
Sufficient time has passed. In my case, the first experiment finished, and the next one has already been running for four seconds.
Retrieving all the nodes and inspecting them#
Finally, the last thing we need to do is retrieve the nodes and see what’s going on over there.
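Since the experiment drains nodes, listing them should reveal their current state.

kubectl get nodes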
The output is as follows.
NAME      STATUS   ROLES    AGE     VERSION
gke-...   Ready    <none>   3m37s   v1.15.9-gke.22
gke-...   Ready    <none>   55m     v1.15.9-gke.22
gke-...   Ready    <none>   55m     v1.15.9-gke.22
gke-...   Ready    <none>   55m     v1.15.9-gke.22
In my case, we can see that there are four nodes. Everything looks healthy, at least from the nodes’ perspective. The experiment drained a node and then uncordoned it, making it available for scheduling again.
Now comes the most critical step when running experiments that are not focused on a single application.
Inspecting the output on the Grafana dashboard#
Let’s take a look at Grafana and see what we have over there.
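If Grafana was installed as one of Istio’s addons, istioctl can open a tunnel to it. If you installed it some other way, open it however you usually do.

# Open a tunnel to Grafana (assumes it was deployed as an Istio addon)
istioctl dashboard grafana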
What can we see from Grafana? Is there anything useful there?
Please open the Istio Mesh Dashboard.
Everything seems to be working correctly. Nothing terrible happened when we drained a node. Or, at least, everything seems to be working correctly from the networking perspective.
Inspecting the output on the Kiali dashboard#
Next, we’ll try to see whether we can spot anything from Kiali.
Please stop the tunnel to Grafana by pressing ctrl+c, and open Kiali through the command that follows.
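Assuming Kiali was also deployed as an Istio addon, the command could be as follows.

# Open a tunnel to Kiali (assumes it was deployed as an Istio addon)
istioctl dashboard kiali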
Go through different graphs, and iterate through different Namespaces. If your situation is the same as mine, you should see that everything seems to be okay. Everything works well. There are no issues.
Please stop the tunnel to Kiali by pressing ctrl+c.
Draining a completely random node could affect anything in a cluster. Yet, we could not prove that such actions are disastrous. In parallel, the experiment keeps being executed every five minutes, so the nodes keep being drained. We can confirm that by retrieving the nodes through the kubectl get nodes command. If we keep doing that, we’ll see that every once in a while, a node is drained. Then, a while later, it’s restored to normal (uncordoned).
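A convenient alternative to repeating the command is to watch the nodes; the --watch flag streams changes as they happen.

# Watch the nodes and stream status changes; press ctrl+c to stop
kubectl get nodes --watch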
Conclusion#
We can see that the measures and improvements we have made to the system so far were successful. The cluster and the applications in it are more robust than they were when we started with the first experiment. However, this is not entirely true. I was kind of lucky, and you might not have been. The database is still running as a single replica, and that is the final weak point. Therefore, you might have seen a failure that I did not experience myself.
In the next lesson, we will briefly discuss monitoring and alerting and how to use them in chaos engineering.