Increasing Network Latency
In this lesson, we will look at another chaos experiment definition, one that increases the latency of our network, as well as an improved Virtual Service definition that deals with it.
We saw how we can deal with network failures. To be more precise, we saw one possible way to simulate network failures and one way to solve the adverse outcomes they produce. However, it’s not always going to be that easy. Sometimes the network does not fail, and requests do not immediately return 500 response codes. Sometimes there is a delay. Our applications might wait for responses for milliseconds, seconds, or even longer. How can we deal with that?
Let’s see what happens if we introduce a delay to requests’ responses. Is our application, in its current state, capable of handling this well and without affecting end users?
Inspecting the definition of network-delay.yaml
Let’s take a look at yet another chaos experiment definition. This time, there are more than a few lines of changes.
Instead of creating a separate experiment to deal with delays, we’re keeping the part that aborts requests and adding delays on top of it, so we’ll exercise both at the same time. We’re spicing it up so it’s not so far from the “real world” situation: when network failures happen, some other requests might be delayed.
Checking the difference between network-rollback.yaml and network-delay.yaml
Let’s take a look at the differences between this and the previous definition.
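If you’d like to reproduce the comparison yourself, and assuming both files are in the current directory, a plain diff should produce output similar to what is shown below.

diff network-rollback.yaml network-delay.yaml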
The output is as follows.
2,3c2,3
< title: What happens if we abort responses
< description: If responses are aborted, the dependant application should retry and/or timeout requests
---
> title: What happens if we abort and delay responses
> description: If responses are aborted and delayed, the dependant application should retry and/or timeout requests
20c20
<       timeout: 5
---
>       timeout: 15
53a54,69
> - type: action
>   name: delay
>   provider:
>     type: python
>     module: chaosistio.fault.actions
>     func: add_delay_fault
>     arguments:
>       virtual_service_name: go-demo-8
>       fixed_delay: 15s
>       routes:
>       - destination:
>           host: go-demo-8
>           subset: primary
>       percentage: 50
>       version: networking.istio.io/v1alpha3
>       ns: go-demo-8
70a87,100
> - type: action
>   name: remove-delay
>   provider:
>     type: python
>     func: remove_delay_fault
>     module: chaosistio.fault.actions
>     arguments:
>       virtual_service_name: go-demo-8
>       routes:
>       - destination:
>           host: go-demo-8
>           subset: primary
>       version: networking.istio.io/v1alpha3
>       ns: go-demo-8
We’ll ignore the changes in the title and the description since they do not affect the output of the experiment. We’re increasing the timeout to 15 seconds, not because I expect to have such a long timeout, but because it will make it easier to demonstrate what’s coming next.
We have two new actions. The first one uses the function add_delay_fault, and its arguments are very similar to what we had before. It introduces a fixed delay of 15 seconds, so when a request reaches this Virtual Service, it is delayed by 15 seconds. If we go back to the top, we can see that the probe’s timeout is also 15 seconds. Since a delay of 15 seconds plus however many milliseconds the request itself takes is more than the timeout, our probe should fail. The vital thing to note is that the delay is applied to only 50 percent of the requests.
The second new action is a rollback that removes that same delay.
All in all, we are adding a delay of 15 seconds on top of the abort faults, and we are adding a rollback that removes that delay.
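To understand what add_delay_fault does behind the scenes, it helps to know that the chaosistio actions patch the targeted Virtual Service with Istio’s standard fault-injection fields. The snippet below is only a sketch of roughly what the delay part of the go-demo-8 Virtual Service might look like while the experiment is running; the exact manifest depends on the definition already applied to the cluster.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: go-demo-8
  namespace: go-demo-8
spec:
  hosts:
  - go-demo-8
  http:
  - fault:
      delay:
        # Delay half of the requests by 15 seconds
        fixedDelay: 15s
        percentage:
          value: 50
    route:
    - destination:
        host: go-demo-8
        subset: primary

Once the remove-delay rollback runs, that fault section is removed again.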
Running the chaos experiment and inspecting the output
What do you think? Will the experiment fail, or will our application survive such conditions without affecting the users (too much)? Let’s take a look.
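The exact command is not shown here, but assuming the Chaos Toolkit CLI and the chaostoolkit-istio plugin are installed, and that the definition above is saved as network-delay.yaml, running the experiment boils down to something like the following.

chaos run network-delay.yaml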
The output is as follows.
[2020-03-13 23:45:33 INFO] Validating the experiment's syntax
[2020-03-13 23:45:33 INFO] Experiment looks valid
[2020-03-13 23:45:33 INFO] Running experiment: What happens if we abort and delay responses
[2020-03-13 23:45:34 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:34 INFO] Steady state hypothesis is met!
[2020-03-13 23:45:34 INFO] Action: abort-failure
[2020-03-13 23:45:34 INFO] Action: delay
[2020-03-13 23:45:34 INFO] Pausing after activity for 1s...
[2020-03-13 23:45:35 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:35 INFO] Probe: app-responds-to-requests
[2020-03-13 23:45:50 ERROR] => failed: activity took too long to complete
[2020-03-13 23:45:50 WARNING] Probe terminated unexpectedly, so its tolerance could not be validated
[2020-03-13 23:45:50 CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[2020-03-13 23:45:50 INFO] Let's rollback...
[2020-03-13 23:45:50 INFO] Rollback: remove-abort-failure
[2020-03-13 23:45:50 INFO] Action: remove-abort-failure
[2020-03-13 23:45:50 INFO] Rollback: remove-delay
[2020-03-13 23:45:50 INFO] Action: remove-delay
[2020-03-13 23:45:50 INFO] Experiment ended with status: deviated
[2020-03-13 23:45:50 INFO] The steady-state has deviated, a weakness may have been discovered
The first five probes were executed before the actions and confirmed that the initial state is as desired. Then the actions introduced the abort failures, just as before, as well as the new delay.
In my case, and yours is likely different, the fourth post-action probe failed. It could have been the first, the second, or any other, but for me it was the fourth one that was unsuccessful. The message activity took too long to complete should be self-explanatory.
If we focus on the timestamps, we can see that there is precisely a 15-second difference between the moment the failed probe started and the moment the error was reported. In my case, the failed probe started at 35 seconds, and the error appeared at 50 seconds. The request was sent and, given the timeout of 15 seconds, that is how long it waited for the response.
We can conclude that our application does not know how to cope with delays. What could be the fix for that?
Improving the Virtual Service definition using istio-delay.yaml
Let’s try to improve the definition of our Virtual Service. We’ll output istio-delay.yaml and see what change might solve the problem with delayed responses.
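Assuming the file is in the current directory, displaying it is as simple as the command below.

cat istio-delay.yaml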
The output is as follows.
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: repeater
spec:
  hosts:
  - repeater.acme.com
  - repeater
  gateways:
  - repeater
  http:
  - route:
    - destination:
        host: repeater
        subset: primary
        port:
          number: 80
    retries:
      attempts: 10
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure
    timeout: 10s
We still have the retries section, with attempts set to 10 and perTryTimeout set to 2 seconds. In addition, we now have connect-failure added to the retryOn values.
We are going to retry up to ten times, with a 2-second timeout for each attempt. We’ll do that not only if we get response codes in the five-hundred range (5xx), but also when we run into connection failures (connect-failure). That 2-second timeout is crucial in this case. If we send a request and it happens to be delayed, the Istio Virtual Service will wait for only 2 seconds, even though the delay is 15 seconds. It will abort that request after 2 seconds and try again, and again, and again, until it is successful or until it has tried 10 times. On top of that, the total timeout is 10 seconds, so it might give up before all 10 attempts are made; with a 2-second per-try timeout, roughly five fully timed-out attempts fit within the overall limit. It all depends on whether the total timeout is reached first or the number of attempts is exhausted.
Let’s take a closer look at the diff between this definition and the previous one. That should give us a clearer picture of what’s going on, and what changed.
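The name of the file holding the previous Virtual Service definition is not shown in this lesson, so treat istio-repeater.yaml in the command below as a placeholder and replace it with whatever file you applied earlier.

diff istio-repeater.yaml istio-delay.yaml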
The output is as follows.
22,23c22,24
<       perTryTimeout: 3s
<       retryOn: 5xx
---
>       perTryTimeout: 2s
>       retryOn: 5xx,connect-failure
>     timeout: 10s
We can see that the perTryTimeout was reduced from 3 to 2 seconds, that we changed the retryOn codes to be not only 5xx but also connect-failure, and that we introduced a timeout of 10 seconds. The retry process will be repeated up to 10 times, and only for 10 seconds in total.
Before we proceed, I must say that a timeout of 10 seconds is unrealistically high. Nobody should have 10 seconds as a goal. But in this case, for the sake of simplicity, our expectation is that the application will remain responsive within 10 seconds, no matter whether some requests are aborted or delayed.
Applying the definition and running the chaos experiment
We’re about to apply the new definition of this Virtual Service and re-run our experiment.
Will it work? Will our application be highly available and manage to serve our users, no matter whether requests and responses are aborted or delayed?
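As a rough sketch, and assuming the updated Virtual Service is stored in istio-delay.yaml, the experiment in network-delay.yaml, and that the resources live in the go-demo-8 Namespace, the two commands would look similar to the following.

kubectl --namespace go-demo-8 apply --filename istio-delay.yaml

chaos run network-delay.yaml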
The output of the latter command is as follows.
[2020-03-13 23:46:30 INFO] Validating the experiment's syntax
[2020-03-13 23:46:30 INFO] Experiment looks valid
[2020-03-13 23:46:30 INFO] Running experiment: What happens if we abort and delay responses
[2020-03-13 23:46:30 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:30 INFO] Steady state hypothesis is met!
[2020-03-13 23:46:30 INFO] Action: abort-failure
[2020-03-13 23:46:31 INFO] Action: delay
[2020-03-13 23:46:31 INFO] Pausing after activity for 1s...
[2020-03-13 23:46:32 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:46:32 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:38 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:44 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:46 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:48 INFO] Probe: app-responds-to-requests
[2020-03-13 23:46:54 INFO] Steady state hypothesis is met!
[2020-03-13 23:46:54 INFO] Let's rollback...
[2020-03-13 23:46:54 INFO] Rollback: remove-abort-failure
[2020-03-13 23:46:54 INFO] Action: remove-abort-failure
[2020-03-13 23:46:54 INFO] Rollback: remove-delay
[2020-03-13 23:46:54 INFO] Action: remove-delay
[2020-03-13 23:46:54 INFO] Experiment ended with status: completed
We can see that the five initial probes were executed successfully. Afterwards, the two actions added the abort and delay faults, and the same probes were executed again, this time successfully despite the faults.
Pay attention to the timestamps of the post-action probes. In my case, the first one took around six seconds, so there were probably some delays, maybe one delayed response and one aborted one, or some other combination. What matters is that it managed to get a response within those six seconds. The second probe also took around six seconds, so we can guess that there were problems and that the retries resolved them. The rest of the probes were also successful, even though they took varying amounts of time to finish.
The responses to some, if not all, of the probes took longer than usual. Nevertheless, they were all successful. Our application, in my case, managed to survive both the delays and the abort failures.
After all the actions and the probes, the experiment rolled back the changes, and our system is back to its initial state. It’s as if we never ran the experiment.
In the next lesson, we will explore how to abort all requests if we face a complete network failure.