Increasing Network Latency
In this lesson, we will look at another chaos experiment definition, one that increases the latency of our network, as well as an improved Virtual Service definition that deals with it.
We saw how we can deal with network failures. To be more precise, we saw one possible way to simulate network failures and one way to solve the adverse outcomes they produce. However, it’s not always going to be that easy. Sometimes the network does not fail, and requests do not immediately return 500 response codes. Sometimes there is a delay. Our applications might wait for responses for milliseconds, seconds, or even longer. How can we deal with that?
Let’s see what happens if we introduce a delay to requests’ responses. Is our application, in its current state, capable of handling this well and without affecting end users?
Inspecting the definition of network-delay.yaml
Let’s take a look at yet another chaos experiment definition. This time, there are more than a few lines of changes.
Instead of creating a separate experiment to deal with delays, we’re keeping the part that aborts requests and adding delays on top of it, so we’ll exercise both at the same time. We’re spicing it up so it’s not so far from the “real world” situation: when network failures happen, some other requests might be delayed.
Checking the difference between network-rollback.yaml and network-delay.yaml
Let’s take a look at the differences between this and the previous definition.
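If you’d like to reproduce the comparison yourself, and assuming both files are in the current directory, a plain diff should produce output similar to what is shown below.

diff network-rollback.yaml network-delay.yaml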
The output is as follows.
2,3c2,3
< title: What happens if we abort responses
< description: If responses are aborted, the dependant application should retry and/or timeout requests
---
> title: What happens if we abort and delay responses
> description: If responses are aborted and delayed, the dependant application should retry and/or timeout requests
20c20
<       timeout: 5
---
>       timeout: 15
53a54,69
> - type: action
>   name: delay
>   provider:
>     type: python
>     module: chaosistio.fault.actions
>     func: add_delay_fault
>     arguments:
>       virtual_service_name: go-demo-8
>       fixed_delay: 15s
>       routes:
>       - destination:
>           host: go-demo-8
>           subset: primary
>       percentage: 50
>       version: networking.istio.io/v1alpha3
>       ns: go-demo-8
70a87,100
> - type: action
>   name: remove-delay
>   provider:
>     type: python
>     func: remove_delay_fault
>     module: chaosistio.fault.actions
>     arguments:
>       virtual_service_name: go-demo-8
>       routes:
>       - destination:
>           host: go-demo-8
>           subset: primary
>       version: networking.istio.io/v1alpha3
>       ns: go-demo-8
We’ll ignore the changes in the title and the description since they do not affect the output of the experiment. We’re increasing the timeout to 15 seconds, not because I expect to have such a long timeout, but because it will make it easier to demonstrate what’s coming next.
We have two new actions. The first one uses the function add_delay_fault, and its arguments are very similar to what we had before. It introduces a fixed delay of 15 seconds, so when a request reaches this Virtual Service, it is delayed by 15 seconds. If we go back to the top, we can see that the probe’s timeout is also 15 seconds. Since a delay of 15 seconds plus however many milliseconds the request itself takes is more than the timeout, our probe should fail. The vital thing to note is that the delay is applied to only 50 percent of the requests.
The second new action is a rollback that removes that same delay.
All in all, we are adding a delay of 15 seconds on top of the abort faults, and we are adding a rollback that removes that delay.
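To understand what add_delay_fault does behind the scenes, it helps to know that the chaosistio actions patch the targeted Virtual Service with Istio’s standard fault-injection fields. The snippet below is only a sketch of roughly what the delay part of the go-demo-8 Virtual Service might look like while the experiment is running; the exact manifest depends on the definition already applied to the cluster.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: go-demo-8
  namespace: go-demo-8
spec:
  hosts:
  - go-demo-8
  http:
  - fault:
      delay:
        # Delay half of the requests by 15 seconds
        fixedDelay: 15s
        percentage:
          value: 50
    route:
    - destination:
        host: go-demo-8
        subset: primary

Once the remove-delay rollback runs, that fault section is removed again.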
Running the chaos experiment and inspecting the output
What do you think? Will the experiment fail, or will our application survive such conditions without affecting the users (too much)? Let’s take a look.
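The exact command is not shown here, but assuming the Chaos Toolkit CLI and the chaostoolkit-istio plugin are installed, and that the definition above is saved as network-delay.yaml, running the experiment boils down to something like the following.

chaos run network-delay.yaml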
The output is as follows.
[2020-03-13 23:45:33 INFO] Validating the experiment's syntax
[2020-03-13 23:45:33 INFO] Experiment looks valid
[2020-03-13 23:45:33 INFO] Running experiment: What happens if we abort and delay responses
[2020-03-13 23:45:34 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Steady state hypothesis is met!
[2020-03-13 23:45:34 INFO] Action: abort-failure
[2020-03-13 23:45:34 INFO] Action: delay
[2020-03-13 23:45:34 INFO] Pausing after activity for 1s...
[2020-03-13 23:45:35 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:50 ERROR] => failed: activity took too long to complete
[2020-03-13 23:45:50 WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[2020-03-13 23:45:50 CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[2020-03-13 23:45:50 INFO] Let's rollback...
[2020-03-13 23:45:50 INFO] Rollback: remove-abort-failure
[2020-03-13 23:45:50 INFO] Action: remove-abort-failure
[2020-03-13 23:45:50 INFO] Rollback: remove-delay
[2020-03-13 23:45:50 INFO] Action: remove-delay
[2020-03-13 23:45:50 INFO] Experiment ended with status: deviated
[2020-03-13 23:45:50 INFO] The steady-state has deviated, a weakness may have been discovered
The first five probes were executed before the actions and confirmed that the initial state is as desired. Then the actions introduced the abort failures, just as before, as well as the new delay.
In my case, and yours is likely different, the fourth post-action probe failed. It could have been the first, the second, or any other, but for me it was the fourth one that was unsuccessful. The message activity took too long to complete should be self-explanatory.
If we focus on the timestamps, we can see that there is precisely a 15-second difference between the moment the failed probe started and the moment the error was reported. In my case, the failed probe started at 35 seconds, and the error appeared at 50 seconds. The request was sent and, given the timeout of 15 seconds, that is how long it waited for the response.
We can conclude that our application does not know how to cope with delays. What could be the fix for that?
Improving the Virtual Service definition using istio-delay.yaml
Let’s try to improve the definition of our Virtual Service. We’ll output istio-delay.yaml and see what change might solve the problem with delayed responses.
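Assuming the file is in the current directory, displaying it is as simple as the command below.

cat istio-delay.yaml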
The output is as follows.
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: repeater
spec:
  hosts:
  - repeater.acme.com
  - repeater
  gateways:
  - repeater
  http:
  - route:
    - destination:
        host: repeater
        subset: primary
        port:
          number: 80
    retries:
      attempts: 10
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure
    timeout: 10s
We still have the retries section, with attempts set to 10 and perTryTimeout set to 2 seconds. In addition, we now have connect-failure added to the retryOn values.
We are going to retry up to ten times, with a 2-second timeout for each attempt. We’ll do that not only if we get response codes in the five-hundred range (5xx), but also when we run into connection failures (connect-failure). That 2-second timeout is crucial in this case. If we send a request and it happens to be delayed, the Istio Virtual Service will wait for only 2 seconds, even though the delay is 15 seconds. It will abort that request after 2 seconds and try again, and again, and again, until it is successful or until it has tried 10 times. On top of that, the total timeout is 10 seconds, so it might give up before all 10 attempts are made; with a 2-second per-try timeout, roughly five fully timed-out attempts fit within the overall limit. It all depends on whether the total timeout is reached first or the number of attempts is exhausted.
Let’s take a closer look at the diff between this definition and the previous one. That should give us a clearer picture of what’s going on, and what changed.
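The name of the file holding the previous Virtual Service definition is not shown in this lesson, so treat istio-repeater.yaml in the command below as a placeholder and replace it with whatever file you applied earlier.

diff istio-repeater.yaml istio-delay.yaml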
The output is as follows.
22,23c22,24
<       perTryTimeout: 3s
<       retryOn: 5xx
---
>       perTryTimeout: 2s
>       retryOn: 5xx,connect-failure
>     timeout: 10s
We can see that the perTryTimeout was reduced from 3 to 2 seconds, that we changed the retryOn codes to be not only 5xx but also connect-failure, and that we introduced a timeout of 10 seconds. The retry process will be repeated up to 10 times, and only for 10 seconds in total.
Before we proceed, I must say that a timeout of 10 seconds is unrealistically high. Nobody should have 10 seconds as a goal. But in this case, for the sake of simplicity, our expectation is that the application will remain responsive within 10 seconds, no matter whether some requests are aborted or delayed.
Applying the definition and running the chaos experiment
We’re about to apply the new definition of this Virtual Service and re-run our experiment.
Will it work? Will our application be highly available and manage to serve our users, no matter whether requests and responses are aborted or delayed?
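As a rough sketch, and assuming the updated Virtual Service is stored in istio-delay.yaml, the experiment in network-delay.yaml, and that the resources live in the go-demo-8 Namespace, the two commands would look similar to the following.

kubectl --namespace go-demo-8 apply --filename istio-delay.yaml

chaos run network-delay.yaml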
The output of the latter command is as follows.
[2020-03-13 23:46:30 INFO] Validating the experiment's syntax
[2020-03-13 23:46:30 INFO] Experiment looks valid
[2020-03-13 23:46:30 INFO] Running experiment: What happens if we abort and delay responses
[2020-03-13 23:46:30 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Steady state hypothesis is met!
[2020-03-13 23:46:30 INFO] Action: abort-failure
[2020-03-13 23:46:31 INFO] Action: delay
[2020-03-13 23:46:31 INFO] Pausing after activity for 1s...
[2020-03-13 23:46:32 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:46:32 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:38 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:44 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:46 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:48 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:54 INFO] Steady state hypothesis is met!
[2020-03-13 23:46:54 INFO] Let's rollback...
[2020-03-13 23:46:54 INFO] Rollback: remove-abort-failure
[2020-03-13 23:46:54 INFO] Action: remove-abort-failure
[2020-03-13 23:46:54 INFO] Rollback: remove-delay
[2020-03-13 23:46:54 INFO] Action: remove-delay
[2020-03-13 23:46:54 INFO] Experiment ended with status: completed
We can see that the five initial probes were executed successfully. Afterwards, the two actions added the abort and delay faults, and the same probes were executed again, this time successfully despite the faults.
Pay attention to the timestamps of the post-action probes. In my case, the first one took around six seconds, so there were probably some delays, maybe one delayed response and one aborted one, or some other combination. What matters is that it managed to get a response within those six seconds. The second probe also took around six seconds, so we can guess that there were problems and that the retries resolved them. The rest of the probes were also successful, even though they took varying amounts of time to finish.
The responses to some, if not all, of the probes took longer than usual. Nevertheless, they were all successful. Our application, in my case, managed to survive both the delays and the abort failures.
After all the actions and the probes, the experiment rolled back the changes, and our system is back to its initial state. It’s as if we never ran the experiment.
In the next lesson, we will explore how to abort all requests if we face a complete network failure.