Making the Application Resilient to Partial Network Failures

In this lesson, we will apply a modified virtual service definition to ensure that our application is resilient to partial network failures.

We'll cover the following

Inspecting modified version of Virtual Service
Exploring the Envoy proxy documentation of route filters
Applying the improved definition of Virtual Service
Running the chaos experiment and inspecting the output

How can we make our applications resilient to (some) network issues? How can we deal with the fact that the network is not 100% reliable?

The last experiment did not create a complete outage, only partial network failures. We can fix this in quite a few ways.

I already said that I will not show you how to solve issues by changing the code of your application. That would require examples in too many different languages. So, we’ll look for a solution outside the application itself, probably inside Kubernetes. In this case, Istio is the logical place.

Inspecting modified version of Virtual Service

We’re going to take a look at a modified version of our Virtual Service.

The output is as follows.

---

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: repeater
spec:
  hosts:
  - repeater.acme.com
  - repeater
  gateways:
  - repeater
  http:
  - route:
    - destination:
        host: repeater
        subset: primary
        port:
          number: 80
    retries:
      attempts: 10
      perTryTimeout: 3s
      retryOn: 5xx

Most of the VirtualService definition is the same as before. It’s called repeater, it has some hosts, and it is associated with the repeater Gateway. The destination is also the same. What is new is the spec.http.retries section.

We’re telling Istio to retry failed requests up to 10 times. If a request fails, it will be retried, and if it fails again, it will be retried again until the limit is reached. Each attempt gets its own timeout of 3 seconds (perTryTimeout), and retryOn is set to 5xx. That means Istio retries a request only if the response code is in the 500 range. Envoy's 5xx retry policy also covers the case where the upstream does not respond at all (disconnect, reset, or read timeout).

That should solve the problem of our network failing some of the time.
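The retry behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Istio or Envoy implement it. The function name send_with_retries and the fake backend are hypothetical, the per-try timeout is left out for brevity, and the sketch assumes that attempts counts retries made after the initial try.

```python
from typing import Callable


def send_with_retries(send: Callable[[], int], attempts: int = 10) -> tuple[int, int]:
    """Call send() once, then retry up to `attempts` times while it returns a 5xx status.

    Returns (final_status, total_calls).
    """
    calls = 0
    status = 0
    # Assumption: one initial try plus up to `attempts` retries.
    for _ in range(1 + attempts):
        calls += 1
        status = send()
        # Mirrors retryOn: 5xx. Anything outside the 500 range is returned as-is.
        if not 500 <= status <= 599:
            break
    return status, calls


# A hypothetical backend that fails twice with 503 and then succeeds.
responses = iter([503, 503, 200])
result = send_with_retries(lambda: next(responses))
print(result)  # (200, 3)
```

The caller sees only the final 200 response. The two failed tries stay hidden behind the retry loop, which is the effect we want from the Virtual Service.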

Exploring the Envoy proxy documentation of route filters

Soon we’ll check whether the new definition really works. But before we apply it, we’ll take a quick look at the Envoy proxy documentation related to route filters. Istio uses Envoy, so the information about the retryOn codes can be found in its documentation.

If you are a Windows user, the open command might not work. If that’s the case, please open the address manually in your favorite browser.

You should see that 5xx is one of many retry conditions supported by Envoy. As you probably already know, Envoy is the proxy Istio injects into Pods as sidecar containers. If, in the future, you want to fine-tune your Virtual Service definitions with other retryOn values, that page gives you all the information you need. To be more precise, all router filters of the Envoy proxy are there.

Applying the improved definition of Virtual Service

Now that we know where to find information about Envoy filters, we can go back to our improved definition of the Virtual Service and apply it. That should allow us to see whether our application is now resilient and highly available, even in case of partial failures of the network.

Running the chaos experiment and inspecting the output

Now, we’re ready to re-run the same experiment and see whether our new setup works.

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] Rollback: remove-abort-failure
[... INFO] Action: remove-abort-failure
[... INFO] Experiment ended with status: completed

We can see that the five initial probes were executed successfully and that the action injected abort failures into 50% of the requests. After that, the same probes were re-run, and this time all of them were successful. Istio is indeed retrying failed requests up to ten times on behalf of our application. Since approximately 50% of the requests fail, ten retries are more than sufficient.
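To get a feel for why ten retries are more than enough, here is a rough back-of-the-envelope calculation in Python. It is a sketch under simplifying assumptions: every try is aborted independently with the same 50% probability, and attempts: 10 means ten retries after the initial try. The function names are made up for this example.

```python
def failure_probability(abort_rate: float, retries: int) -> float:
    """Probability that the initial try and every retry are all aborted."""
    return abort_rate ** (1 + retries)


def any_probe_fails(abort_rate: float, retries: int, probes: int) -> float:
    """Probability that at least one of `probes` independent requests fails for good."""
    per_request = failure_probability(abort_rate, retries)
    return 1 - (1 - per_request) ** probes


# 50% abort rate, 10 retries, 5 probes, as in the experiment above.
p_request = failure_probability(0.5, 10)
p_experiment = any_probe_fails(0.5, 10, 5)

print(f"one request fails after all retries: {p_request:.6f}")
print(f"at least one of five probes fails:   {p_experiment:.6f}")
```

Under these assumptions, a single request ends in failure less than once in two thousand tries, so an experiment with five probes will almost always pass. Without retries, each probe would fail about half the time.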

Everything is working. Our experiment was successful, and we can conclude that the repeater can handle partial network outages.

In the next lesson, we will find out how to increase the network latency of our application.
