Aborting Network Requests
In this lesson, we will run a chaos experiment to see what happens if we abort some network requests.
Networking issues are very common. They happen more often than many people think. We are about to explore what happens when we simulate or create those same issues ourselves.
So, what can we do?
We can do many different things, but, in our case, we’ll start with something relatively simple. We’ll see what happens if we intentionally abort some of the network requests. We’re going to terminate requests and see how our application behaves when that happens. We’re not going to abort all the requests, but only some. Terminating 50% of requests should do.
What happens if 50% of the requests coming to our application are terminated? Is our application resilient enough to survive without negatively affecting users? As you can probably guess, we can check that through an experiment.
Inspecting the definition of network.yaml
Let’s take a look at yet another Chaos Toolkit definition.
The output is as follows.
version: 1.0.0
title: What happens if we abort responses
description: If responses are aborted, the dependant application should retry and/or timeout requests
tags:
- k8s
- istio
- http
configuration:
  ingress_host:
    type: env
    key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - type: probe
    name: app-responds-to-requests
    tolerance: 200
    provider:
      type: http
      timeout: 5
      verify_tls: false
      url: http://${ingress_host}?addr=http://go-demo-8
      headers:
        Host: repeater.acme.com
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
  - type: probe
    tolerance: 200
    ref: app-responds-to-requests
method:
- type: action
  name: abort-failure
  provider:
    type: python
    module: chaosistio.fault.actions
    func: add_abort_fault
    arguments:
      virtual_service_name: go-demo-8
      http_status: 500
      routes:
      - destination:
          host: go-demo-8
          subset: primary
      percentage: 50
      version: networking.istio.io/v1alpha3
      ns: go-demo-8
  pauses:
    after: 1
At the top, we have general information like the title, asking what happens if we abort responses, and the description, stating that if responses are aborted, the dependent application should retry and/or time out requests. Those are reasonable questions and assumptions. If something bad happens with requests, we should probably retry them or time them out. We also have some tags telling us that the experiment is about k8s, istio, and http.
Just as before, we have configuration that converts the environment variable INGRESS_HOST into the Chaos Toolkit variable ingress_host. And we have a steady-state-hypothesis that validates that the application is healthy. We’re measuring that health by sending a request to our application and expecting the response code 200. We are, more or less, doing the same thing as before. However, this time, we are not sending a request to go-demo-8 directly but to the repeater.
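As a rough illustration of what the first probe checks, the request can be sketched in plain Python. This is not how Chaos Toolkit implements its HTTP provider. The function names build_probe_request and probe_is_within_tolerance are hypothetical; only the URL, the Host header, the timeout, and the tolerance come from the definition above.

```python
import urllib.error
import urllib.request


def build_probe_request(ingress_host):
    """Build the GET request described by the app-responds-to-requests probe."""
    # The query parameter tells the repeater which address to forward to.
    url = f"http://{ingress_host}?addr=http://go-demo-8"
    # The Host header selects the repeater behind the Istio ingress.
    return urllib.request.Request(url, headers={"Host": "repeater.acme.com"})


def probe_is_within_tolerance(request, tolerance=200, timeout=5):
    """Send the request and report whether the status code equals the tolerance."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == tolerance
    except urllib.error.HTTPError as error:
        # Responses such as 500 are raised as HTTPError by urllib.
        return error.code == tolerance
    except OSError:
        # Connection failures and timeouts count as a failed probe.
        return False
```

Calling probe_is_within_tolerance(build_probe_request(os.environ["INGRESS_HOST"])) would mirror one probe run, assuming INGRESS_HOST is set as in the previous lessons.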
Repeating the requests
Since we are going to abort 50% of the requests, a single probe with a single request might not produce the result we want. Getting the desired result would rely on luck since we couldn’t predict whether that request would fall into the 50% that are aborted. To reduce the influence of randomness on our steady-state hypothesis, we are going to repeat that request four more times. However, instead of defining the whole probe again, we use a shortcut definition. The second probe also has the tolerance 200, but it references the probe app-responds-to-requests through ref. So, instead of repeating everything, we are just referencing the existing probe, and we are doing that four times.
All in all, we are sending five requests, and we’re expecting the 200 response code each time.
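To put a number on how much the repetition helps, here is a small back-of-the-envelope calculation. It assumes that each request is aborted independently with the configured percentage and that nothing between the probe and go-demo-8 retries failed requests. The function name is hypothetical.

```python
def probability_all_probes_pass(abort_percentage, probes):
    """Chance that none of the probes hits an aborted request."""
    pass_probability = 1 - abort_percentage / 100
    return pass_probability ** probes


one_probe = probability_all_probes_pass(abort_percentage=50, probes=1)
five_probes = probability_all_probes_pass(abort_percentage=50, probes=5)

print(f"one probe passes: {one_probe:.5f}")                   # 0.50000
print(f"all five probes pass: {five_probes:.5f}")             # 0.03125
print(f"at least one of five fails: {1 - five_probes:.5f}")   # 0.96875
```

Under those assumptions, a single probe would miss the fault half of the time, while five probes would miss it only about 3% of the time.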
Then we have a method with the action abort-failure. It’s using the module chaosistio.fault.actions and the function add_abort_fault. It should be self-descriptive, and you should be able to guess that it will add abort faults into an Istio Virtual Service. We can also see that the action is targeting the Virtual Service go-demo-8.
All in all, the add_abort_fault function will inject HTTP status 500 into the Virtual Service go-demo-8, on the route identified through the destination with the host set to go-demo-8 and the subset set to primary. Further on, we can see that we have the percentage set to 50. So, fifty percent of the requests to go-demo-8 will be aborted. We also have the version of the Istio networking API that we’re using (networking.istio.io/v1alpha3) and the Namespace (ns) where that Virtual Service resides.
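To make the effect of those arguments more tangible, the sketch below shows the kind of change we would expect to see on the Virtual Service, expressed as Python dictionaries. It is an illustration only, not the chaosistio implementation. The function with_abort_fault and the simplified Virtual Service are hypothetical; the fault.abort fields (httpStatus and percentage.value) follow Istio's HTTP fault injection schema.

```python
from copy import deepcopy


def with_abort_fault(virtual_service, routes, http_status, percentage):
    """Return a copy of the Virtual Service with an abort fault on matching routes."""
    patched = deepcopy(virtual_service)
    wanted = [route["destination"] for route in routes]
    for http_route in patched["spec"].get("http", []):
        destinations = [r["destination"] for r in http_route.get("route", [])]
        # Only touch HTTP routes that point at one of the requested destinations.
        if any(destination in wanted for destination in destinations):
            http_route.setdefault("fault", {})["abort"] = {
                "httpStatus": http_status,
                "percentage": {"value": percentage},
            }
    return patched


# A simplified, assumed shape of the go-demo-8 Virtual Service.
virtual_service = {
    "apiVersion": "networking.istio.io/v1alpha3",
    "kind": "VirtualService",
    "metadata": {"name": "go-demo-8", "namespace": "go-demo-8"},
    "spec": {
        "hosts": ["go-demo-8"],
        "http": [
            {"route": [{"destination": {"host": "go-demo-8", "subset": "primary"}}]}
        ],
    },
}
routes = [{"destination": {"host": "go-demo-8", "subset": "primary"}}]

patched = with_abort_fault(virtual_service, routes, http_status=500, percentage=50)
print(patched["spec"]["http"][0]["fault"])
```

The real Virtual Service in your cluster will likely contain more fields than this simplified one, so treat the dictionaries as a reading aid rather than a manifest to apply.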
So, we will be sending requests to the repeater, but we will be aborting those requests on the go-demo-8 API. That’s why we added an additional application. Since the repeater forwards requests to go-demo-8, we will be able to see what happens when we interact with one application that interacts with another while there is a cut in the communication between the two.
After we inject the abort, just to be sure that we are not too hasty, we’re going to give the system a one-second pause so that the abort fault can be adequately propagated to the Virtual Service.
Now, let’s see what happens when we run this experiment. Can you guess? It should be obvious what happens if we abort 50% of the responses while validating whether our application is responsive. Will all five requests sent to our application return status code 200?
Running chaos experiment and inspecting the output
Let’s run the experiment and see.
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
Please note that the output in your case could be different.
The probe was executed successfully five times. Then, the action added abort faults to the Istio Virtual Service. We waited for one second, and then we started re-running the probes.
We can see that, in my case, the first of the five post-action probes failed. I was unlucky. Given that approximately 50% of the requests should be unsuccessful, it could have been the second, third, or any other probe that failed, but my luck ran out right away. The first probe failed, and that was the end of the experiment. That was to be expected. One of those probes should have failed; it didn’t have to be the first one, though.
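If you are curious how often the experiment would stop at each probe, a quick simulation gives a feel for it. It is a sketch under the assumptions that every request is aborted independently with a probability of 0.5 and that an aborted request always makes the probe fail. The function names and the seed are arbitrary choices for illustration.

```python
import random


def first_failing_probe(abort_probability, probes, rng):
    """Return the 1-based index of the first failed probe, or None if all pass."""
    for index in range(1, probes + 1):
        if rng.random() < abort_probability:
            # The experiment stops at the first probe outside the tolerance.
            return index
    return None


def tally(runs, abort_probability=0.5, probes=5, seed=42):
    """Count how often each probe is the first one to fail across many runs."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        outcome = first_failing_probe(abort_probability, probes, rng)
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts


counts = tally(runs=10_000)
for outcome in [1, 2, 3, 4, 5, None]:
    label = "no failure" if outcome is None else f"probe {outcome}"
    print(f"{label}: {counts.get(outcome, 0) / 10_000:.3f}")
```

Roughly half of the simulated runs should stop at the first probe, so failing right away is the most likely single outcome rather than exceptionally bad luck.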
In the next lesson, we will find out how to roll back abort failures.