Running Denial of Service Attacks
In this lesson, we will run a chaos experiment for DoS and observe how our application responds. We will also explore the logs.
Now that you are familiar with Siege and have seen the “trick” baked into the go-demo-8 app that allows us to limit the number of requests the application can handle, we can construct a chaos experiment that checks how the application behaves while under a Denial of Service attack.
Inspecting the definition of network-dos.yaml#
Let’s take a look at yet another chaos experiment definition.
The definition is as follows.
version: 1.0.0
title: What happens if we abort responses
description: If responses are aborted, the dependant application should retry and/or timeout requests
tags:
- k8s
- pod
- deployment
- istio
configuration:
  ingress_host:
    type: env
    key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - type: probe
    name: app-responds-to-requests
    tolerance: 200
    provider:
      type: http
      timeout: 5
      verify_tls: false
      url: http://${ingress_host}?addr=http://go-demo-8/limiter
      headers:
        Host: repeater.acme.com
method:
- type: action
  name: abort-failure
  provider:
    type: process
    path: kubectl
    arguments:
    - run
    - siege
    - --namespace
    - go-demo-8
    - --image
    - yokogawa/siege
    - --generator
    - run-pod/v1
    - -it
    - --rm
    - --
    - --concurrent
    - 50
    - --time
    - 20S
    - "http://go-demo-8/limiter"
  pauses:
    after: 5
We have a steady-state hypothesis, which validates that our application does respond with 200 on the /limiter endpoint.
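If you want to double-check that probe by hand, a roughly equivalent request, assuming the INGRESS_HOST environment variable holds your cluster’s ingress address (that is what the configuration section maps it from), would look like this.
# Reproduce the probe manually; we expect a 200 response from the /limiter endpoint
curl -i -H "Host: repeater.acme.com" \
    "http://$INGRESS_HOST?addr=http://go-demo-8/limiter"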
Then, we have an action with the type of the provider set to process. This is another reason why I’m showing you this definition. Besides simulating a Denial of Service attack by sending an increased number of concurrent requests to our application, I’m using this opportunity to explore yet another provider type.
The process provider allows us to execute any command. This is very useful when none of the Chaos Toolkit plugins can do what we need. Whatever is not available through a plugin can usually be accomplished with the process provider, which runs anything that is executable: a script, a shell command, or any other binary. In this case, the path is kubectl (a command) followed by a list of arguments. Those are the same arguments we just used when we ran Siege manually: fifty concurrent requests sent to the /limiter endpoint for 20 seconds.
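For reference, the command that this action assembles from path and arguments is, give or take, the same Siege invocation from the previous lesson; run by hand, it would look like this.
# The command the process provider builds from path + arguments
kubectl run siege \
    --namespace go-demo-8 \
    --image yokogawa/siege \
    --generator "run-pod/v1" \
    -it --rm \
    -- --concurrent 50 --time 20S "http://go-demo-8/limiter"
The difference is that, when the process provider executes it, the output is captured by the Chaos Toolkit, which is why we will later find Siege’s statistics in chaostoolkit.log.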
Running the chaos experiment and inspecting the output#
Let’s run this experiment and see what happens.
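Assuming the definition is saved as chaos/network-dos.yaml (the chaos/ directory is an assumption; adjust the path to wherever you keep your definitions), running it with the Chaos Toolkit CLI looks like this.
# Run the experiment; point the path at wherever you saved the definition
chaos run chaos/network-dos.yaml
The output should be similar to the following (your timestamps will differ).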
[2020-03-13 23:51:28 INFO] Validating the experiment's syntax
[2020-03-13 23:51:28 INFO] Experiment looks valid
[2020-03-13 23:51:28 INFO] Running experiment: What happens if we abort responses
[2020-03-13 23:51:28 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:51:28 INFO] Probe: app-responds-to-requests
[2020-03-13 23:51:28 INFO] Steady state hypothesis is met!
[2020-03-13 23:51:28 INFO] Action: abort-failure
[2020-03-13 23:51:52 INFO] Pausing after activity for 5s...
[2020-03-13 23:51:57 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:51:57 INFO] Probe: app-responds-to-requests
[2020-03-13 23:51:57 CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[2020-03-13 23:51:57 INFO] Let's rollback...
[2020-03-13 23:51:57 INFO] No declared rollbacks, let's move on.
[2020-03-13 23:51:57 INFO] Experiment ended with status: deviated
[2020-03-13 23:51:57 INFO] The steady-state has deviated, a weakness may have been discovered
We can see that, after the initial probe succeeded, we executed an action that ran the siege Pod. When the probe ran again, it failed: the application collapsed under the load and stopped responding. Admittedly, the amount of traffic was ridiculously low; we were cheating by configuring the application to handle only a very small number of concurrent requests, which is how we simulated a DoS attack with so few requests. In a “real world” situation, you would send high volumes, maybe thousands or hundreds of thousands of concurrent requests, and check whether your application remains responsive.
Either way, the conclusion is the same: the application cannot handle the load, and the experiment failed.
Exploring the logs#
The output in front of us is not very descriptive. We probably wouldn’t be able to deduce the cause of the issue just by looking at it. Fortunately, that’s only the list of events and their statuses, and more information is available. Every time we run an experiment, we get a chaostoolkit.log file that stores detailed logs of what happened, in case we need additional information. Let’s take a look at the log for this scenario.
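By default, chaostoolkit.log is created in the directory from which the chaos command was executed, so a plain cat (or your pager of choice) is enough to inspect it.
# Inspect the detailed log produced by the previous run
cat chaostoolkit.log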
The output is too big to be presented in a lesson, so I’ll let you explore it on your screen. You should see the (poorly formatted) output from siege, which gives us the same info as when we ran it manually.
All in all, if you need more information, you can always find it in chaostoolkit.log. Think of it as debug info.
What would be the fix for this situation?
If you’re waiting for me to give you the answer, you’re out of luck. Just like at the end of the previous section, I have a task for you: yet another homework assignment.
Try to figure out how to handle the situation we explored through the last experiment.
I will give you just a small tip to help you know what to look for. Istio, like almost any other service mesh, lets us define circuit breakers that limit the number of requests reaching an endpoint. In case of a Denial of Service attack, or any sudden spike in traffic, a circuit breaker can cap the maximum number of concurrent requests an application receives.
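To give you a head start without giving away the answer, the sketch below shows roughly what such a limit looks like in Istio: a DestinationRule with connection pool limits and outlier detection. The resource name, the host, and all the numbers are illustrative assumptions, and field names can vary between Istio versions, so treat it as a starting point rather than the solution.
# A hypothetical Istio circuit breaker; names, hosts, and limits are illustrative only
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: go-demo-8
spec:
  host: go-demo-8
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # cap on concurrent TCP connections
      http:
        http1MaxPendingRequests: 100   # cap on queued HTTP requests
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5          # eject an instance after this many errors
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
Whether limits like these are right for go-demo-8 is exactly what your chaos experiment should tell you.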
Assignment#
Now it’s your turn. Do the homework. Explore circuit breakers, and try to figure out how you would implement them for your applications. Use a chaos experiment to confirm that it fails before the changes are implemented and that it passes after. The goal is to figure out how to prevent a situation like this from becoming a disaster.
I know that your application has a limit. Every application does. How will you handle a sudden burst of requests far above what your app can handle at any given moment?
In the next lesson, we will remove the resources that we have created.