Aborting Network Requests

In this lesson, we will run a chaos experiment to see what happens if we abort some network requests.

Networking issues are very common. They happen more often than many people think. We are about to explore what happens when we simulate or create those same issues ourselves.

So, what can we do?

We can do many different things but, in our case, we’ll start with something relatively simple: we’ll intentionally abort some of the network requests and see how our application behaves when that happens. We’re not going to abort all the requests, only some. Terminating 50% of them should do.

What happens if 50% of the requests coming to our applications are terminated? Is our application resilient enough to survive without negatively affecting users? As you can probably guess, we can check that through an experiment.

Inspecting the definition of network.yaml

Let’s take a look at yet another Chaos Toolkit definition.

The output is as follows.

version: 1.0.0
title: What happens if we abort responses
description: If responses are aborted, the dependant application should retry and/or timeout requests
tags:
- k8s
- istio
- http
configuration:
  ingress_host:
      type: env
      key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - type: probe
    name: app-responds-to-requests
    tolerance: 200
    provider:
      type: http
      timeout: 5
      verify_tls: false
      url: http://${ingress_host}?addr=http://go-demo-8
      headers:
        Host: repeater.acme.com
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
method:
- type: action
  name: abort-failure
  provider:
    type: python
    module: chaosistio.fault.actions
    func: add_abort_fault
    arguments:
      virtual_service_name: go-demo-8
      http_status: 500
      routes:
        - destination:
            host: go-demo-8
            subset: primary
      percentage: 50
      version: networking.istio.io/v1alpha3
      ns: go-demo-8
  pauses:
    after: 1

At the top, we have general information: the title asks what happens if we abort responses, and the description states that, if responses are aborted, the dependant application should retry and/or timeout requests. Those are reasonable questions and assumptions. If something bad happens with requests, we should probably retry them or time them out. We also have tags telling us that the experiment is about k8s, istio, and http.

Just as before, the configuration converts the environment variable INGRESS_HOST into the Chaos Toolkit variable ingress_host, and the steady-state-hypothesis validates that the application is healthy. We measure that health by sending a request to our application and expecting the response code 200. We are, more or less, doing the same thing as before. However, this time, we are not sending the request to go-demo-8 directly but to the repeater: the Host header is set to repeater.acme.com, and the addr query parameter points at go-demo-8.
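To make the probe more concrete, here is a minimal Python sketch of roughly what it does: build a GET request to the ingress host with the Host header set to the repeater, send it with a five-second timeout, and return the status code so it can be compared against the tolerance. The function names are hypothetical, and this is not how Chaos Toolkit implements its HTTP provider; it only mirrors the fields in the definition above.

```python
import urllib.error
import urllib.request

TOLERANCE = 200  # expected status code, mirroring the probe's tolerance


def build_probe_request(ingress_host: str) -> urllib.request.Request:
    """Build the request described by the app-responds-to-requests probe."""
    url = f"http://{ingress_host}?addr=http://go-demo-8"
    return urllib.request.Request(url, headers={"Host": "repeater.acme.com"})


def probe_status(ingress_host: str, timeout: float = 5) -> int:
    """Send the request and return the HTTP status code."""
    request = build_probe_request(ingress_host)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the status code is still
        # the value we want to compare against the tolerance.
        return error.code
```

With INGRESS_HOST exported, calling probe_status(os.environ["INGRESS_HOST"]) == TOLERANCE would be the rough equivalent of a single probe run.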

Repeating the requests

Since we are going to abort 50% of the requests, having only one probe with a single request might not produce the result that we want. Getting the desired result would rely on luck, since we couldn’t predict whether that request would fall into the 50% that are aborted. To reduce the influence of randomness on our steady-state hypothesis, we are going to repeat that request four more times. However, instead of defining the whole probe again, we use a shortcut. The second probe also has the tolerance 200, but it references the probe app-responds-to-requests. So, instead of repeating everything, we are just referencing the existing probe, and we are doing that four times.

All in all, we are sending requests, and we’re expecting the 200 response code five times.

Then we have a method with the action abort-failure. It’s using the module chaosistio.fault.actions and the function add_abort_fault. It should be self-descriptive, and you should be able to guess that it will add abort faults into an Istio Virtual Service. We can also see that the action is targeting the Virtual Service go-demo-8.

All in all, the add_abort_fault function will inject an abort fault with HTTP status 500 into the Virtual Service go-demo-8, on the route identified by the destination with the host set to go-demo-8 and the subset set to primary. Further on, we can see that the percentage is set to 50. So, fifty percent of the requests to go-demo-8 will be aborted. We also have the version, which is the Istio networking API version of the Virtual Service (networking.istio.io/v1alpha3), and the Namespace (ns) where that Virtual Service resides.
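As a rough illustration of the kind of change such an action makes, the sketch below adds an abort fault to an in-memory Virtual Service spec. The fault.abort.httpStatus and fault.abort.percentage.value fields come from Istio's VirtualService schema. The function with_abort_fault and the simplified sample spec are hypothetical, and the plugin's actual implementation may differ.

```python
import copy


def with_abort_fault(spec: dict, routes: list, http_status: int, percentage: float) -> dict:
    """Return a copy of a VirtualService spec with an abort fault added to
    every HTTP rule that contains one of the given route destinations."""
    patched = copy.deepcopy(spec)  # leave the caller's spec untouched
    wanted = [route["destination"] for route in routes]
    for http_rule in patched.get("http", []):
        destinations = [r.get("destination") for r in http_rule.get("route", [])]
        if any(destination in wanted for destination in destinations):
            http_rule["fault"] = {
                "abort": {
                    "httpStatus": http_status,
                    "percentage": {"value": float(percentage)},
                }
            }
    return patched


# Hypothetical, simplified spec for the go-demo-8 Virtual Service.
spec = {
    "hosts": ["go-demo-8"],
    "http": [
        {"route": [{"destination": {"host": "go-demo-8", "subset": "primary"}}]}
    ],
}
routes = [{"destination": {"host": "go-demo-8", "subset": "primary"}}]

patched = with_abort_fault(spec, routes, http_status=500, percentage=50)
print(patched["http"][0]["fault"])
```

The arguments passed here mirror the ones in the experiment definition: the same route, HTTP status 500, and 50 percent.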

So, we will be sending requests to the repeater, but we will be aborting those requests on the go-demo-8 API. That’s why we added an additional application. Since the repeater forwards requests to go-demo-8, we will be able to see what happens when we interact with one application that interacts with another while there is a cut in that communication between the two.

After we inject the abort fault, just to be sure that we are not too hasty, we’re going to give the system a one-second pause so that the change can be adequately propagated to the Virtual Service.

Now, let’s see what happens when we run this experiment. Can you guess? It should be obvious what happens if we abort 50% of the responses, and we are validating whether our application is responsive. Will all five requests that will be sent to our application return status code 200?

Running chaos experiment and inspecting the output

Let’s run the experiment and see.

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered

Please note that the output in your case could be different.

The probe was executed successfully five times. Then, the action added abort faults to the Istio Virtual Service. We waited for one second, and then the probes started running again.

We can see that, in my case, the first of the five post-action probes failed, and that was the end of the experiment. I was unlucky. Given that approximately 50% of the requests should be unsuccessful, it could have been the second, third, or any other probe that failed, but my luck ran out right away. A failure was to be expected: with half of the requests aborted, it is very unlikely that all five probes would pass. It just didn’t have to be the first one.
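The odds behind that observation can be worked out directly. Assuming each probe is an independent request with a 50% chance of being aborted (an assumption; anything that makes requests dependent on each other would change the numbers), the sketch below computes how likely it is that probe k is the first one to fail, and how likely it is that all five pass.

```python
ABORT_PROBABILITY = 0.5  # percentage: 50 in the experiment definition
PROBES = 5               # one full probe plus four references to it


def first_failure_distribution(p_abort: float, probes: int) -> list:
    """Probability that probe k is the first to fail, for k = 1..probes."""
    return [(1 - p_abort) ** (k - 1) * p_abort for k in range(1, probes + 1)]


def all_pass_probability(p_abort: float, probes: int) -> float:
    """Probability that none of the probes hits an aborted request."""
    return (1 - p_abort) ** probes


distribution = first_failure_distribution(ABORT_PROBABILITY, PROBES)
all_pass = all_pass_probability(ABORT_PROBABILITY, PROBES)

for k, probability in enumerate(distribution, start=1):
    print(f"first failure at probe {k}: {probability:.3%}")
print(f"all {PROBES} probes pass: {all_pass:.3%}")
```

Under that assumption, the first probe fails half of the time, so an immediate failure is the single most likely outcome, and the hypothesis would be met purely by chance only about 3% of the time.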


In the next lesson, we will find out how to roll back abort failures.
