Terminating Random Nodes
In this lesson, we will run an experiment using a CronJob to terminate random nodes, and then we will use the dashboards to figure out whether it was successful.
Applying the CronJob#
Let’s apply the new definition of the CronJob.
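The exact path depends on how you organized the definitions; assuming the updated CronJob manifest is stored in a file like k8s/chaos/node.yaml (a hypothetical path, so adjust it to match your setup), the command could be as follows.

# Apply the CronJob that periodically drains random nodes
# (the path is an assumption; change it to wherever your definition lives)
kubectl apply --filename k8s/chaos/node.yaml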
Assuming that you left the loop that sends requests running in the second terminal, we should be able to observe that the demo application keeps responding with 200. At the moment, the demo application seems to be working correctly.
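If the loop is not running anymore, a rough sketch like the one that follows can bring it back. The INGRESS_HOST variable and the address are assumptions, so replace them with whatever you used in the earlier lessons.

# Keep sending requests to the demo application and output only the response codes
# (INGRESS_HOST and the address are assumptions from the earlier setup)
while true; do
    curl --silent --output /dev/null --write-out "%{http_code}\n" \
        "http://$INGRESS_HOST"
    sleep 1
done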
As you already know, we’ll need to wait for a while until the first Job is created, and the experiment is executed.
Retrieving all CronJobs from the chaos Namespace#
To make the wait more interesting than staring at a blank screen, we’ll retrieve the CronJobs.
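The CronJob was created in the chaos Namespace, so that’s where we’ll look for it.

kubectl --namespace chaos get cronjobs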
After a while, and a few repetitions of the previous command, the output should be similar to the one that follows.
NAME          SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
nodes-chaos   */5 * * * *   False     0        1m01s           5m2s
Once there is something other than <none> in the LAST SCHEDULE column, we can proceed knowing that a Job was created. That means that our experiment is running right now.
Retrieving all Jobs#
Let’s retrieve Jobs and confirm that everything looks correct so far.
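The Jobs created by the CronJob live in the same chaos Namespace.

kubectl --namespace chaos get jobs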
The output is as follows.
NAME              COMPLETIONS   DURATION   AGE
nodes-chaos-...   1/1           2m5s       4m46s
In my case, it runs for over two minutes (2m5s), so I (and probably you) will need a bit more patience. We need to wait until the Job is finished executing. If you’re in the same position, keep repeating the previous command.
Retrieving all Pods from the chaos Namespace#
Next, to be entirely on the safe side, we’ll retrieve Pods as well.
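Just like the Jobs, the Pods they spin up are in the chaos Namespace.

kubectl --namespace chaos get pods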
The output is as follows.
NAME              READY   STATUS              RESTARTS   AGE
nodes-chaos-...   0/1     Completed           0          4m55s
nodes-chaos-...   0/1     ContainerCreating   0          4s
Sufficient time has passed. In my case, the first experiment finished, and the next one has already been running for four seconds.
Retrieving all the nodes and inspecting them#
Finally, the last thing we need to do is retrieve the nodes and see what’s going on over there.
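Since the experiment drains nodes, listing them should reveal their current state.

kubectl get nodes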
The output is as follows.
NAME      STATUS   ROLES    AGE     VERSION
gke-...   Ready    <none>   3m37s   v1.15.9-gke.22
gke-...   Ready    <none>   55m     v1.15.9-gke.22
gke-...   Ready    <none>   55m     v1.15.9-gke.22
gke-...   Ready    <none>   55m     v1.15.9-gke.22
In my case, we can see that there are four nodes. Everything looks healthy, at least from the nodes’ perspective. The experiment drained a node and then uncordoned it, making it available for scheduling again.
Now comes the most critical step when running experiments that are not focused on a single application.
Inspecting the output on the Grafana dashboard#
Let’s take a look at Grafana and see what we have over there.
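If Grafana was installed as one of Istio’s addons, istioctl can open a tunnel to it. If you installed it some other way, open it however you usually do.

# Open a tunnel to Grafana (assumes it was deployed as an Istio addon)
istioctl dashboard grafana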
What can we see from Grafana? Is there anything useful there?
Please open the Istio Mesh Dashboard.
Everything seems to be working correctly. Nothing terrible happened when we drained a node. Or, at least, everything seems to be working correctly from the networking perspective.
Inspecting the output on the Kiali dashboard#
Next, we’ll try to see whether we can spot anything from Kiali.
Please stop the tunnel to Grafana by pressing ctrl+c, and open Kiali through the command that follows.
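Assuming Kiali was also deployed as an Istio addon, the command could be as follows.

# Open a tunnel to Kiali (assumes it was deployed as an Istio addon)
istioctl dashboard kiali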
Go through different graphs, and iterate through different Namespaces. If your situation is the same as mine, you should see that everything seems to be okay. Everything works well. There are no issues.
Please stop the tunnel to Kiali by pressing ctrl+c.
Draining a completely random node could affect anything in a cluster. Yet, we could not prove that such actions are disastrous. In parallel, the experiment keeps being executed every five minutes, so the nodes keep being drained. We can confirm that by retrieving the nodes through the kubectl get nodes command. If we keep doing that, we’ll see that every once in a while, a node is drained. Then, a while later, it’s restored to normal (uncordoned).
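A convenient alternative to repeating the command is to watch the nodes; the --watch flag streams changes as they happen.

# Watch the nodes and stream status changes; press ctrl+c to stop
kubectl get nodes --watch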
Conclusion#
We can see that the measures and improvements we have made to the system so far were successful. The cluster and the applications in it are more robust than they were when we started with the first experiment. However, this is not entirely true. I was kind of lucky, and you might not have been. The database is still running as a single replica, and that is the final weak point. Therefore, you might have seen a failure that I did not experience myself.
In the next lesson, we will briefly discuss monitoring and alerting and how to use them in chaos engineering.