Sending fake requests ourselves#

We are going to forget about experiments for a moment and see what happens if we send requests to the application ourselves. We’re going to dispatch ten requests to repeater.acme.com, the same address the experiment uses. To be more precise, we’ll pretend to send requests to repeater.acme.com, while the “real” destination will be the Istio Ingress Gateway host.
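
The exact command is not shown in this excerpt, but a minimal sketch could look like the loop below. It assumes that the environment variable INGRESS_HOST holds the address of the Istio Ingress Gateway and that repeater.acme.com is passed as the Host header so that Istio routes the requests to our application.

# Send ten requests to the Istio Ingress Gateway while pretending
# that the destination is repeater.acme.com
for i in {1..10}; do
    curl --header "Host: repeater.acme.com" \
        "http://$INGRESS_HOST"
    echo ""
done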

To make the output more readable, we added an empty line after each request.

The output, in my case, is as follows.

fault filter abort
Version: 0.0.1; Release: unknown

fault filter abort
fault filter abort
Version: 0.0.1; Release: unknown

fault filter abort
fault filter abort
Version: 0.0.1; Release: unknown

Version: 0.0.1; Release: unknown

fault filter abort

We can see that some of the requests returned fault filter abort. Those are the requests that fell into the 50% we chose to abort. Don’t take that 50% literally, though, because other requests are happening inside the cluster, so the number of failed requests in that output might not be exactly half. Think of it as approximately 50%.

What matters is that some requests were aborted, and others were successful. That is very problematic for at least two reasons.

  1. First, the experiment showed that our application cannot deal with aborted requests. If a request is terminated (and that is inevitable), our app does not know how to handle it.

  2. The second issue is that we did not roll back our change. Therefore, the injected faults are still present even after the chaos experiment finished. We can confirm that by describing the Virtual Service.

Checking Virtual Service#

We’ll need to figure out how to roll back that change at the end of the experiment. We need to remove the damage we’re doing to our network. Before we do that, let’s confirm that our Virtual Service is indeed permanently messed up.
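
The command used for that is not included in this excerpt; a sketch like the one below should do, assuming the Virtual Service is named go-demo-8 and resides in the go-demo-8 Namespace.

# Inspect the current definition of the Virtual Service
kubectl --namespace go-demo-8 \
    describe virtualservice go-demo-8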

The output, limited to the relevant parts, is as follows.

...
Spec:
  Hosts:
    go-demo-8
  Http:
    Fault:
      Abort:
        Http Status:  500
        Percentage:
          Value:  50
...

We can see that, within the Spec.Http section, there is the Fault.Abort subsection, with Http Status set to 500 and Percentage to 50.

Istio allows us to do such things. It enables us, among many other possibilities, to specify what percentage of HTTP responses should be aborted. Through Chaos Toolkit, we ran an experiment that modified the definition of the Virtual Service by adding Fault.Abort. What we did not do is revert that change.
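
For reference, the abort fault injected by the experiment corresponds roughly to what we would get by defining it by hand in the Virtual Service. The sketch below only illustrates the Istio API; it is not the actual definition used in this lesson, and the client-side dry run means nothing is applied to the cluster.

# Illustration only: a hand-written equivalent of the injected abort fault.
# The dry run validates the manifest without changing anything in the cluster.
kubectl --namespace go-demo-8 apply --dry-run=client --filename - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: go-demo-8
spec:
  hosts:
  - go-demo-8
  http:
  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 50
    route:
    - destination:
        host: go-demo-8
        subset: primary
        port:
          number: 80
EOF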

Everything we change through chaos experiments should be rolled back at the end of the experiment to the state we had before, unless the change is temporary by nature. For example, if we destroy a Pod, we expect Kubernetes to create a new one, so there is no need to revert such a change. However, injecting abort failures into our Virtual Service is permanent, and the system is not supposed to recover from it on its own. We need to roll it back.

We can explain this concept differently. We should revert the changes we make to the definitions of the components inside our cluster. If we only destroy something, that does not change any definition; we are not changing the desired state. Adding Istio abort failures is a change to a definition, while terminating a Pod is not. Therefore, the former should be reverted, while the latter shouldn’t.

Restoring Virtual Service to the original definition#

Before we improve our experiment, let’s apply the istio.yaml file again. This will restore our Virtual Service to its original definition.
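
The command is not shown in this excerpt; assuming the file is in the current working directory and that the resources belong to the go-demo-8 Namespace, it would be along these lines.

# Re-apply the original definition to remove the injected abort fault
kubectl --namespace go-demo-8 \
    apply --filename istio.yaml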

We rolled back not through a chaos experiment, but by re-applying the same istio.yaml definition that we used initially.

Now, let’s describe the Virtual Service and confirm that re-applying the definition worked.

The output, limited to the relevant parts, is as follows.

...
Spec:
  Hosts:
    go-demo-8
  Http:
    Route:
      Destination:
        Host:  go-demo-8
        Port:
          Number:  80
        Subset:    primary
...

We can see that there is no Fault.Abort section. From now on, our Virtual Service should be working correctly with all requests.

Inspecting the definition of network-rollback.yaml#

Now that we are back to where we started, let’s take a look at yet another chaos experiment.
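
The simplest way to inspect the new definition is to print it, assuming the file is in the current working directory. We won’t reproduce the full output here; instead, we’ll focus on the differences from the previous experiment.

# Print the definition of the new experiment (output omitted here)
cat network-rollback.yaml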

Checking the difference between network.yaml and network-rollback.yaml#

The new definition contains an additional section called rollbacks. To avoid turning this into a “where’s Waldo” type of exercise, we’ll output a diff so that you can easily see the changes compared to the previous experiment.
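
Assuming both files are in the current working directory, a plain diff does the job.

# Show what network-rollback.yaml adds on top of network.yaml
diff network.yaml network-rollback.yaml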

The output is as follows.

55a56,70
> rollbacks:
> - type: action
>   name: remove-abort-failure
>   provider:
>     type: python
>     func: remove_abort_fault
>     module: chaosistio.fault.actions
>     arguments:
>       virtual_service_name: go-demo-8
>       routes:
>         - destination:
>             host: go-demo-8
>             subset: primary
>       version: networking.istio.io/v1alpha3
>       ns: go-demo-8

We can see that the only addition is the rollbacks section. It is also based on the chaosistio.fault.actions module, but the function is remove_abort_fault. The arguments are basically the same as those we used to add the abort fault.

All in all, we are adding an abort fault through the action, and we are removing that same abort fault during the rollback phase.

Running chaos experiment and inspecting the output#

Let’s run this experiment and see what happens.
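
Assuming the Chaos Toolkit CLI is installed and the experiment file is in the current working directory, the command would be as follows.

# Run the experiment; Chaos Toolkit executes the rollbacks section even when
# the steady-state probes fail
chaos run network-rollback.yaml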

The output, without timestamps, is as follows.

[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we abort responses
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... INFO] Steady state hypothesis is met!
[... INFO] Action: abort-failure
[... INFO] Pausing after activity for 1s...
[... INFO] Steady state hypothesis: The app is healthy
[... INFO] Probe: app-responds-to-requests
[... INFO] Probe: app-responds-to-requests
[... CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] Rollback: remove-abort-failure
[... INFO] Action: remove-abort-failure
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered

The result is almost the same as before. The initial probes were successful, the action added the abort fault, and one of the after-action probes failed. All that is as expected, and as it was when we ran the previous experiment.

What’s new this time is the rollback. The experiment fails just as it did before, but this time, there is a rollback action that removes whatever we did through the actions. We added the abort failure, and then we removed it.

Sending the requests again#

In order to confirm that we successfully rolled back the change, we’re going to send yet another ten requests to the same address as before.

This time, the output is a stream of Version: 0.0.1; Release: unknown messages, confirming that all the requests were successful. We have not improved our application just yet, but we did manage to undo the damage created during the experiment.

Checking Virtual Service to confirm#

Let’s describe the Virtual Service and double-check that everything indeed looks OK.

The output, limited to the relevant parts, is as follows.

...
Spec:
  Hosts:
    go-demo-8
  Http:
    Route:
      Destination:
        Host:  go-demo-8
        Port:
          Number:  80
        Subset:    primary
...

We can see that the Fault.Abort section is not there. Everything is normal. Whatever damage was done by the experiment action was rolled back at the end of the experiment.

However, there’s still at least one thing left for us to do. We still need to fix the application so that it can survive partial network failure. We’ll do that next.


In the next lesson, we will learn how to make our application resilient to some of these network issues.
