Making the Application Resilient to Partial Network Failures
In this lesson, we will apply a modified virtual service definition to ensure that our application is resilient to partial network failures.
How can we make our applications resilient to (some) network issues? How can we deal with the fact that the network is not 100% reliable?
The last experiment did not create a complete outage, only partial network failures. We can fix that in quite a few ways.
I already said that I will not show you how to solve issues by changing the code of your application. That would require examples in too many different languages. So, we’ll look for a solution outside the application itself, probably inside Kubernetes. In this case, Istio is the logical place.
Inspecting the modified version of the Virtual Service
We’re going to take a look at a modified version of our Virtual Service.
The output is as follows.
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: repeater
spec:
  hosts:
  - repeater.acme.com
  - repeater
  gateways:
  - repeater
  http:
  - route:
    - destination:
        host: repeater
        subset: primary
        port:
          number: 80
    retries:
      attempts: 10
      perTryTimeout: 3s
      retryOn: 5xx
Most of the VirtualService definition is the same as before. It’s called repeater, it has some hosts, and it is associated with the repeater Gateway. The destination is also the same. What is new is the spec.http.retries section.
We’re telling the Istio Virtual Service to retry failed requests up to 10 times. Each try gets a timeout of 3 seconds (perTryTimeout), and retryOn is set to 5xx, so Istio repeats a request only if the response code is in the 500 range.
That should hopefully solve the problem of our network failing some of the time.
Exploring the Envoy proxy documentation of router filters
Soon we’ll check whether the new definition really works. But before we apply it, we’ll take a quick look at the Envoy proxy documentation on router filters. Istio uses Envoy, so the information about those retryOn codes can be found in its documentation.
If you are a Windows user, the open command might not work. If that’s the case, please open the address manually in your favorite browser.
You should see that 5xx is one of many retry conditions supported by Envoy. As you probably already know, Envoy is the proxy Istio injects into Pods as a sidecar container. If, in the future, you want to fine-tune your Virtual Service definitions with other retryOn values, that page gives you all the information you need. To be more precise, everything about the Envoy router filter is there.
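As an illustration of such fine-tuning, the fragment below swaps 5xx for a narrower, comma-separated list of retry conditions. Treat it as a sketch rather than a recommendation: the condition names are among those listed in Envoy's retry documentation, while the specific combination and the lower attempt count are assumptions chosen only to show the syntax.

```yaml
# Hypothetical variation of the spec.http[].retries section shown earlier.
retries:
  attempts: 3                 # assumed value; the lesson itself uses 10
  perTryTimeout: 3s
  # Retry only on 502/503/504 responses, failed connections, and
  # refused streams, instead of on every 5xx response.
  retryOn: gateway-error,connect-failure,refused-stream
```

Whether a narrower list is appropriate depends on which failures are safe to repeat for the service in question.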
Applying the improved definition of the Virtual Service
Now that we know where to find information about Envoy filters, we can go back to our improved definition of the Virtual Service and apply it. That should allow us to see whether our application is now resilient and highly available, even in case of partial failures of the network.
Running the chaos experiment and inspecting the output
Now, we’re ready to re-run the same experiment and see whether our new setup works.
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] Rollback: remove-abort-failure
[... INFO] Action: remove-abort-failure
[... INFO] Experiment ended with status: completed
We can see that the five initial probes were executed successfully and that the action injected an abort failure affecting 50% of requests. After that, the same probes were re-run, and this time all of them were successful. Istio is indeed retrying failed requests up to ten times on behalf of our application. Since approximately 50% of them fail, up to ten retries are more than sufficient.
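To put a rough number on "more than sufficient", assume, purely for illustration, that each try fails independently with probability p = 0.5 and that attempts: 10 allows up to n = 10 retries after the initial request. Under those assumptions, a request fails only if all 1 + n tries fail:

```latex
P(\text{request fails}) = p^{\,1+n} = 0.5^{11} = \frac{1}{2048} \approx 0.05\%
```

Real failures are rarely perfectly independent, so this is an order-of-magnitude estimate rather than a guarantee.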
Everything is working. Our experiment was successful, and we can conclude that the repeater can handle partial network outages.
In the next lesson, we will find out how to increase the network latency of our application.