Aborting All Requests
In this lesson, we will run a chaos experiment which will simulate what would happen if our application faces a complete network failure.
Sometimes we’re unlucky and suffer partial network failures. At other times, we’re incredibly unfortunate, and the network fails entirely, or at least the parts of it related to one or a few of our applications go down. What happens in that case? Can we recover from it? Can we make our applications resilient even in those situations?
Inspecting the definition of network-abort-100.yaml and comparing it with network-rollback.yaml#
We can run an experiment that will validate what happens in such a situation. It’s going to be an easy one since we already have a very similar experiment.
Given that the experiment is almost the same as the previous one, and that you’d have a tough time spotting the difference, we’ll jump straight into the diff of the two.
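Assuming both definitions are in the chaos directory, as in the previous lessons (adjust the paths if yours differ), the comparison boils down to a single diff command.

diff chaos/network-rollback.yaml \
    chaos/network-abort-100.yaml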
The output is as follows.
51c51
< percentage: 50
---
> percentage: 100
The difference is in the percentage of abort failures. Instead of aborting 50% of the requests, we’re going to abort all of those destined for go-demo-8.
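For context, the abort action inside network-abort-100.yaml presumably looks something along these lines. This is a sketch based on the add_abort_fault action from the chaostoolkit-istio plugin; the status code, routes, and namespace are assumptions, so check your own copy of the definition.

- type: action
  name: abort-failure
  provider:
    type: python
    module: chaosistio.fault.actions
    func: add_abort_fault
    arguments:
      virtual_service_name: go-demo-8
      http_status: 500
      routes:
      - destination:
          host: go-demo-8
          subset: primary
      percentage: 100
      ns: go-demo-8
  pauses:
    after: 1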
Running the chaos experiment and inspecting the output#
I’m sure that you already know the outcome. Nevertheless, let’s run the experiment and see what happens.
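If you’re following along, the command is the same as in the previous lessons, only with the new definition (the path is assumed).

chaos run chaos/network-abort-100.yaml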
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] Rollback: remove-abort-failure
[... INFO] Action: remove-abort-failure
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
We modified the action to abort 100% of the network requests destined for go-demo-8. As expected, the first probe after the action failed. That’s normal. Our application tried and tried and tried. But after ten attempts, it gave up, and we did not receive a response. It’s a complete failure of the network associated with the go-demo-8 API.
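As a reminder, those ten attempts come from the retry policy we set up earlier. If the retries are configured through Istio, the relevant part of the VirtualService looks along these lines (illustrative values; check your own definition).

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: go-demo-8
spec:
  hosts:
  - go-demo-8
  http:
  - route:
    - destination:
        host: go-demo-8
        subset: primary
    retries:
      attempts: 10
      perTryTimeout: 3s
      retryOn: 5xx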
How can we fix this?#
I will not show you how to resolve this because the solution should most likely be applied not inside Kubernetes, but at the application level. In this scenario, assuming that we have other processes in place that deal with the infrastructure when the network fails completely, it will recover at some point. We cannot expect Kubernetes, Istio, and the other software around our applications to fix every problem. This is a case that the design of our applications themselves should be able to handle.
Let’s say that your frontend application is accessible, but the backend is not. If, for example, your frontend cannot, under any circumstance, communicate with the shopping-cart backend, it should probably show a message like “the shopping cart is currently not available, but feel free to browse our products,” since browsing and the cart are served by different backend applications. That’s why we like microservices: the smaller the applications, the smaller the scope of an issue. Or maybe your frontend application itself is not accessible, in which case you could serve your users a static version of it. There can be many different scenarios, and we won’t go through them all.
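As an illustration, here is a minimal Go sketch of that kind of fallback. The backend URL, the handler, and the message are hypothetical, not part of go-demo-8.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// cartHandler calls a (hypothetical) shopping-cart backend. When the
// backend cannot be reached at all, it degrades gracefully instead of
// returning an error, so the rest of the site stays usable.
func cartHandler(w http.ResponseWriter, r *http.Request) {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://shopping-cart/api/v1/cart")
	if err != nil {
		// Complete network failure: serve a friendly fallback message.
		fmt.Fprintln(w, "The shopping cart is currently not available, but feel free to browse our products.")
		return
	}
	defer resp.Body.Close()
	// Otherwise, pass the backend response through as usual.
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/cart", cartHandler)
	http.ListenAndServe(":8080", nil)
}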
Similarly, there are many things that we can do to limit the blast radius. We won’t go into those either because most of the solutions for a complete collapse are related to how we code our applications, not what we do outside those applications. Think about how you would design your application to handle a total failure, similar to the one we just demonstrated through yet another chaos experiment.
In the next lesson, we will be simulating the scenario of a denial of service (DoS) attack.