Running Denial of Service Attacks

In this lesson, we will run a chaos experiment for DoS and observe how our application responds. We will also explore the logs.

Now that you are familiar with Siege and have seen the “trick” baked into the go-demo-8 app that limits the number of requests the application can handle, we can construct a chaos experiment that checks how the application behaves while under a Denial of Service (DoS) attack.
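To make the “trick” concrete, here is a minimal sketch of the idea in Python (not go-demo-8’s actual Go code; the class name and the limit of 5 are assumptions for illustration): the endpoint responds successfully until a configured number of requests is exceeded, then it starts failing.

```python
# Sketch of a request limiter like the one baked into go-demo-8.
# The limit of 5 is an assumed value, not the app's real configuration.
class Limiter:
    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.count = 0

    def handle(self):
        """Return an HTTP-like status code for one incoming request."""
        self.count += 1
        if self.count > self.max_requests:
            return 500  # over the limit: the app starts failing
        return 200      # within capacity: the app responds normally


limiter = Limiter(max_requests=5)
statuses = [limiter.handle() for _ in range(7)]
print(statuses)  # the first five succeed, the rest fail
```

An experiment like the one below exploits exactly this behavior: enough concurrent requests push the counter over the limit, and the steady-state probe starts seeing errors.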

Inspecting the definition of network-dos.yaml#

Let’s take a look at yet another chaos experiment definition by outputting the contents of network-dos.yaml.

The output is as follows.

```yaml
version: 1.0.0
title: What happens if we abort responses
description: If responses are aborted, the dependant application should retry and/or timeout requests
tags:
- k8s
- pod
- deployment
- istio
configuration:
  ingress_host:
      type: env
      key: INGRESS_HOST
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - type: probe
    name: app-responds-to-requests
    tolerance: 200
    provider:
      type: http
      timeout: 5
      verify_tls: false
      url: http://${ingress_host}?addr=http://go-demo-8/limiter
      headers:
        Host: repeater.acme.com
method:
- type: action
  name: abort-failure
  provider:
    type: process
    path: kubectl
    arguments:
    - run
    - siege
    - --namespace
    - go-demo-8
    - --image
    - yokogawa/siege
    - --generator
    - run-pod/v1
    - -it
    - --rm
    - --
    - --concurrent
    - 50
    - --time
    - 20S
    - "http://go-demo-8/limiter"
  pauses:
    after: 5
```

We have a steady-state hypothesis, which validates that our application responds with 200 on the /limiter endpoint. Then, we have an action with the type of the provider set to process. This is another reason why I’m showing you this definition. Besides simulating a Denial of Service attack by sending an increased number of concurrent requests to our application, I’m using this opportunity to explore yet another provider type.

The process provider allows us to execute any command. This is very useful when none of the Chaos Toolkit plugins can do what we need. It could be a script, a shell command, or anything else, as long as it is executable. In this case, the path is kubectl (a command), followed by a list of arguments, the same ones we just executed manually. We’ll be sending fifty concurrent requests to the /limiter endpoint for 20 seconds.
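Stripped to its essentials, a process-provider action only needs an executable path and its arguments. The following is a hypothetical minimal example (the name say-hello and the echo command are made up for illustration, not part of go-demo-8):

```yaml
# Minimal sketch of a process-provider action: run any executable
# on the machine where Chaos Toolkit itself runs.
- type: action
  name: say-hello
  provider:
    type: process
    path: echo
    arguments:
    - "hello from a process provider"
```

The action in our experiment follows the same shape; the only difference is that its executable is kubectl and its arguments spin up the siege Pod.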

Running the chaos experiment and inspecting the output#

Let’s run this experiment with `chaos run network-dos.yaml` and see what happens.

```
[2020-03-13 23:51:28 INFO] Validating the experiment's syntax
[2020-03-13 23:51:28 INFO] Experiment looks valid
[2020-03-13 23:51:28 INFO] Running experiment: What happens if we abort responses
[2020-03-13 23:51:28 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:51:28 INFO] Probe: app-responds-to-requests
[2020-03-13 23:51:28 INFO] Steady state hypothesis is met!
[2020-03-13 23:51:28 INFO] Action: abort-failure
[2020-03-13 23:51:52 INFO] Pausing after activity for 5s...
[2020-03-13 23:51:57 INFO] Steady state hypothesis: The app is healthy
[2020-03-13 23:51:57 INFO] Probe: app-responds-to-requests
[2020-03-13 23:51:57 CRITICAL] Steady state probe 'app-responds-to-requests' is not in the given tolerance so failing this experiment
[2020-03-13 23:51:57 INFO] Let's rollback...
[2020-03-13 23:51:57 INFO] No declared rollbacks, let's move on.
[2020-03-13 23:51:57 INFO] Experiment ended with status: deviated
[2020-03-13 23:51:57 INFO] The steady-state has deviated, a weakness may have been discovered
```

We can see that, after the initial probe succeeded, we executed an action that ran the siege Pod. After that, the probe ran again, and it failed. Our application collapsed under the load; it couldn’t handle that amount of traffic. In this case, the volume was ridiculously low because we were cheating: we configured the application to handle only a very small number of requests. In a “real world” situation, you would send high volumes, perhaps thousands or hundreds of thousands of concurrent requests, and see whether your application remains responsive afterward.

Either way, the result is the same: the application cannot handle the load, and the experiment failed.

Exploring the logs#

The output in front of us is not very descriptive. We probably wouldn’t be able to deduce the cause of the issue just by looking at it. Fortunately, that’s only the list of events and their statuses, and more information is available. Every time we run an experiment, we get a chaostoolkit.log file that stores detailed logs of what happened in case we need additional information. Let’s take a look at the log for this scenario.

The output is too big to be presented in a lesson, so I’ll let you explore it on your screen. You should see the (poorly formatted) output from siege, which gives us the same info as when we ran it manually.

All in all, if you need more information, you can always find it in chaostoolkit.log. Think of it as debug info.

What would be the fix for this situation?

If you’re waiting for me to give you the answer, you’re out of luck. Just like at the end of the previous section, I have a task for you. You’re getting yet another homework assignment.

Try to figure out how to handle the situation we explored through the last experiment.

I will give you just a small tip to help you know what to look for. In Istio, and in almost any other service mesh, we can use circuit breakers to control the maximum number of concurrent requests an application receives. That makes them a natural defense against a Denial of Service attack or any sudden surge in traffic.
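As a starting point for the homework, Istio configures circuit-breaking behavior through a DestinationRule’s trafficPolicy. The sketch below is hypothetical: the resource name, namespace, and every number are assumptions you would need to adapt and tune for your own application.

```yaml
# Hypothetical circuit-breaker sketch for go-demo-8; all values are
# illustrative assumptions, not recommendations.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: go-demo-8
  namespace: go-demo-8
spec:
  host: go-demo-8
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap on concurrent TCP connections
      http:
        http1MaxPendingRequests: 10  # requests queued beyond this are rejected
    outlierDetection:
      consecutive5xxErrors: 5        # eject an instance after this many 5xx responses
      interval: 10s                  # how often instances are analyzed
      baseEjectionTime: 30s          # minimum ejection duration
```

With limits like these in place, excess requests are rejected quickly at the mesh layer instead of piling up and collapsing the application, which is exactly what the experiment above should be able to verify.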

Assignment#

Now it’s your turn. Do the homework. Explore circuit breakers, and try to figure out how you would implement them for your applications. Use a chaos experiment to confirm that it fails before the changes are implemented and that it passes after. The goal is to figure out how to prevent a situation like this from becoming a disaster.

I know that your application has a limit. Every application does. How will you handle a sudden outburst of requests that is way above what your app can handle at any given moment?


In the next lesson, we will remove the resources that we have created.
