Rolling Back Abort Failures
This lesson shows how we can roll back the abort failures in network requests.
We'll cover the following
- Sending fake requests ourselves
- Checking Virtual Service
- Restoring Virtual Service to the original definition
- Inspecting the definition of network-rollback.yaml
- Checking the difference between network.yaml and network-rollback.yaml
- Running chaos experiment and inspecting the output
- Sending the requests again
- Checking Virtual Service to confirm
Sending fake requests ourselves#
We are going to forget about experiments for a moment and see what happens if we send requests to the application ourselves. We’re going to dispatch ten requests to repeater.acme.com, the same address used in the experiment. To be more precise, we’ll pretend that we’re sending requests to repeater.acme.com, while the “real” address will be the Istio Gateway Ingress host.
To make the output more readable, we print an empty line after each request.
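A loop along the following lines does the job. Treat it as a sketch rather than a prescription: it assumes that the environment variable INGRESS_HOST holds the address of the Istio Gateway Ingress (set in an earlier lesson) and that faking the destination only requires overriding the Host header.

```bash
# Send ten requests, pretending they are aimed at repeater.acme.com,
# while actually hitting the Istio Gateway Ingress host.
for i in {1..10}; do
    curl --header "Host: repeater.acme.com" \
        "http://$INGRESS_HOST"
    # Print an empty line so the responses are easier to tell apart.
    echo ""
done
```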
The output, in my case, is as follows.
fault filter abort
Version: 0.0.1; Release: unknown
fault filter abort
fault filter abort
Version: 0.0.1; Release: unknown
fault filter abort
fault filter abort
Version: 0.0.1; Release: unknown
Version: 0.0.1; Release: unknown
fault filter abort
We can see that some of the requests returned fault filter abort. Those requests are the 50% that were aborted. Now, don’t take the 50% literally: other requests are happening inside the cluster, so the number of failed requests in that output might not be exactly half. Think of it as approximately 50%.
What matters is that some requests were aborted, and others were successful. That is very problematic for at least two reasons.
- First, the experiment showed that our application cannot deal with network abortions. If a request is terminated (and that is inevitable), our app does not know how to deal with it.
- The second issue is that we did not roll back our change. Therefore, the injected faults are still present even after the chaos experiment. We can confirm that by describing the Virtual Service.
Checking Virtual Service#
We’ll need to figure out how to roll back that change at the end of the experiment. We need to remove the damage we’re doing to our network. Before we do that, let’s confirm that our Virtual Service is indeed permanently messed up.
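Describing the Virtual Service shows us its current state. The command below assumes the Virtual Service is named go-demo-8 and lives in the go-demo-8 Namespace, as in the previous lessons.

```bash
# Describe the Virtual Service to see whether the abort fault is still there
kubectl --namespace go-demo-8 \
    describe virtualservice go-demo-8
```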
The output, limited to the relevant parts, is as follows.
...
Spec:
Hosts:
go-demo-8
Http:
Fault:
Abort:
Http Status: 500
Percentage:
Value: 50
...
We can see that, within the Spec.Http section, there is the Fault.Abort subsection, with Http Status set to 500 and Percentage set to 50.
Istio allows us to do such things. It enables us, among many other possibilities, to specify how many HTTP responses should be aborted. Through Chaos Toolkit, we ran an experiment that modified the definition of the Virtual Service by adding the Fault.Abort section. What we did not do is revert that change.
Everything we do with chaos experiments should be rolled back at the end of the experiment, restoring what we had before, unless it is a temporary change. For example, if we destroy a Pod, we expect Kubernetes to create a new one, and there is no need to revert such a change. However, injecting abort failures into our Virtual Service is permanent, and the system is not designed to recover from it on its own. We need to roll it back.
We can explain this concept differently. We should revert the changes we make to the definitions of the components inside our cluster. If we only destroy something, that does not change any definition; we’re not changing the desired state. Adding Istio abort failures is a change to a definition, while terminating a Pod is not. Therefore, the former should be reverted, while the latter shouldn’t.
Restoring Virtual Service to the original definition#
Before we improve our experiment, let’s apply the istio.yaml file again. This will restore our Virtual Service to its original definition.
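Assuming the manifest is in the k8s/network directory of the course repository (adjust the path to match your copy), re-applying it could look like this.

```bash
# Re-apply the original definition, removing the injected abort fault
kubectl --namespace go-demo-8 \
    apply --filename k8s/network/istio.yaml
```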
We rolled back, not through a chaos experiment, but by re-applying the same istio.yaml definition that we used initially.
Now, let’s describe the Virtual Service and confirm that re-applying the definition worked.
The output, limited to the relevant parts, is as follows.
...
Spec:
Hosts:
go-demo-8
Http:
Route:
Destination:
Host: go-demo-8
Port:
Number: 80
Subset: primary
...
We can see that there is no Fault.Abort section. From now on, our Virtual Service should be working correctly with all requests.
Inspecting the definition of network-rollback.yaml#
Now that we are back to where we started, let’s take a look at yet another chaos experiment.
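You can print the new definition with cat. The path below assumes the experiments live in a chaos directory, as in the previous lessons; adjust it if yours differs. We won’t reproduce the full output here, since the diff in the next section highlights the only difference.

```bash
# Display the new experiment that includes a rollback section
cat chaos/network-rollback.yaml
```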
Checking the difference between network.yaml and network-rollback.yaml#
As you can see from that output, there is a new section called rollbacks. To avoid converting this into a “where’s Waldo” type of exercise, we’ll output a diff so that you can easily see the changes compared to the previous experiment.
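A plain diff of the two definitions (again assuming the chaos directory) produces the output shown below.

```bash
# Show only what changed between the two experiment definitions
diff chaos/network.yaml chaos/network-rollback.yaml
```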
The output is as follows.
55a56,70
> rollbacks:
> - type: action
> name: remove-abort-failure
> provider:
> type: python
> func: remove_abort_fault
> module: chaosistio.fault.actions
> arguments:
> virtual_service_name: go-demo-8
> routes:
> - destination:
> host: go-demo-8
> subset: primary
> version: networking.istio.io/v1alpha3
> ns: go-demo-8
We can see that the only addition is the rollbacks section. It is also based on the chaosistio.fault.actions module, but the function is remove_abort_fault. The arguments are basically the same as those we used to add the abort fault.
All in all, we are adding an abort fault through the action, and we are removing that same abort fault during the rollback phase.
Running chaos experiment and inspecting the output#
Let’s run this experiment and see what happens.
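Running it is the same chaos run command we’ve been using, pointed at the new definition (path assumed as before).

```bash
# Run the experiment that includes the rollback phase
chaos run chaos/network-rollback.yaml
```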
The output, without timestamps, is as follows.
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] Rollback: remove-abort-failure
[... INFO] Action: remove-abort-failure
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
The result is almost the same as before. The initial probes were successful, the action added the abort fault, and one of the after-action probes failed. All that is as expected, and as it was when we ran the previous experiment.
What’s new this time is the rollback. The experiment is failing just as it was failing before, but this time, there is the rollback action to remove whatever we did through the actions. We added the abort failure, and then we removed it.
Sending the requests again#
In order to confirm that we successfully rolled back the change, we’re going to send yet another ten requests to the same address as before.
This time, the output is a stream of Version: 0.0.1; Release: unknown messages, confirming that all the requests were successful. We have not improved our application just yet, but we did manage to undo the damage created during the experiment.
Checking Virtual Service to confirm#
Let’s describe the Virtual Service and double-check that everything indeed looks OK.
The output, limited to the relevant parts, is as follows.
...
Spec:
Hosts:
go-demo-8
Http:
Route:
Destination:
Host: go-demo-8
Port:
Number: 80
Subset: primary
...
We can see that the Fault.Abort section is not there. Everything is normal. Whatever damage was done by the experiment action was rolled back at the end of the experiment.
However, there’s still at least one thing left for us to do. We still need to fix the application so that it can survive partial network failure. We’ll do that next.
In the next lesson, we will learn how to make our application resilient to some of the network issues.