WEBVTT

00:00:00.340 --> 00:00:01.130
Finally,

00:00:01.140 --> 00:00:05.500
let's look at creating, implementing, and testing the BCDR from a

00:00:05.500 --> 00:00:11.710
procedural vantage point. When we are looking at the BCDR plan,

00:00:11.970 --> 00:00:17.100
it's important that the implementers are the same people

00:00:17.100 --> 00:00:19.350
who assist in building the plan.

00:00:20.040 --> 00:00:25.070
One of the most daring feats of his time was Charles Blondin's crossing

00:00:25.070 --> 00:00:29.040
of Niagara Falls in 1859 on a tightrope.

00:00:29.050 --> 00:00:31.630
Imagine when he told his manager that for the next feat,

00:00:31.630 --> 00:00:34.400
he would love to carry a person on his back across the

00:00:34.400 --> 00:00:36.900
tightrope to increase the entertainment value.

00:00:36.980 --> 00:00:41.000
Imagine you were his manager, hearing about it and thinking, well,

00:00:41.000 --> 00:00:43.840
that would be exciting to see a person being carried on the back of

00:00:43.840 --> 00:00:48.160
Charles Blondin and would probably be a huge entertainment value. Then

00:00:48.160 --> 00:00:49.970
imagine Charles telling you,

00:00:50.160 --> 00:00:53.470
I'm glad you think it would be exciting because I would like you to be the

00:00:53.470 --> 00:00:57.180
person on my back. Now you might have a different reaction as you adjust your

00:00:57.180 --> 00:01:00.460
thinking from being a spectator to a participant.

00:01:00.470 --> 00:01:04.730
There actually needs to be close alignment between those responsible for creating

00:01:04.730 --> 00:01:08.920
the plan and those responsible for implementing the plan.

00:01:09.040 --> 00:01:13.710
It is wise to consult or even adapt existing IT project planning and risk

00:01:13.710 --> 00:01:18.130
management methodologies. In this section, some activities and concerns are

00:01:18.130 --> 00:01:23.400
highlighted that are relevant for a cloud BCDR. The BCDR plan and its

00:01:23.400 --> 00:01:27.490
implementation are embedded in an information security strategy that

00:01:27.490 --> 00:01:30.350
encompasses clearly defined roles, risk assessment,

00:01:30.350 --> 00:01:31.390
classification,

00:01:31.390 --> 00:01:35.610
policy, awareness, and training. The creation and implementation of

00:01:35.610 --> 00:01:39.950
a fully tested BCDR plan that is ready for the failover event

00:01:39.960 --> 00:01:44.020
structurally resembles any other IT implementation plan, as well as

00:01:44.030 --> 00:01:46.460
other disaster response plans.

00:01:47.940 --> 00:01:51.550
The requirements that are input for BCDR planning include

00:01:51.550 --> 00:01:54.990
identification of critical business processes and their

00:01:54.990 --> 00:01:57.960
dependencies on specific data services.

00:01:57.970 --> 00:02:00.410
So here we're digging into characteristics,

00:02:00.410 --> 00:02:02.690
descriptions, and service agreements.

00:02:02.700 --> 00:02:07.890
The things that are okay today may not actually be okay tomorrow.

00:02:07.900 --> 00:02:11.800
One reason is that the threat landscape could

00:02:11.800 --> 00:02:15.940
change. A vulnerability analysis could reveal that

00:02:15.940 --> 00:02:18.020
something else has been modified.

00:02:18.030 --> 00:02:18.810
Also,

00:02:18.820 --> 00:02:22.210
the business may be involved in activities tomorrow

00:02:22.220 --> 00:02:24.570
that it is not involved in today.

00:02:24.580 --> 00:02:28.150
So, there is a need to make sure that risk assessment is an

00:02:28.160 --> 00:02:31.650
ongoing process.

00:02:32.440 --> 00:02:37.320
Think about the cloud provider's ability to elastically expand. For

00:02:37.320 --> 00:02:42.110
instance, remember the availability zone scenario that we did with

00:02:42.110 --> 00:02:47.560
primary BCDR being on the cloud service provider's premises? How

00:02:47.560 --> 00:02:50.420
about if you have a new provider:

00:02:50.430 --> 00:02:53.520
do they address the existing concerns that you had

00:02:53.530 --> 00:02:56.220
already solved with the previous one?

00:02:56.230 --> 00:02:59.840
We cannot make assumptions. If we're doing cloud-to-cloud,

00:02:59.850 --> 00:03:04.820
we need to investigate what bandwidth there is to handle replication.
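
NOTE
A back-of-the-envelope way to sanity-check replication bandwidth,
sketched in Python. All numbers here are illustrative assumptions,
not figures from this course.
# Rough sustained-bandwidth estimate for cloud-to-cloud replication.
daily_change_gb = 500        # assumption: data changed per day, in GB
overhead = 1.3               # assumption: protocol and retransmit overhead
seconds_per_day = 24 * 3600
# 1 GB = 8,000 megabits (decimal units); spread the day's changes evenly.
required_mbps = daily_change_gb * 8_000 * overhead / seconds_per_day
print(f"Sustained replication rate needed: ~{required_mbps:.0f} Mbps")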

00:03:04.830 --> 00:03:08.300
There are newer systems and services out there,

00:03:08.300 --> 00:03:12.760
like Megaport, that allow you to dynamically connect to cloud

00:03:12.760 --> 00:03:16.040
service providers based upon emerging needs.

00:03:16.050 --> 00:03:20.560
We also need to think about what legal issues may prevent us

00:03:20.560 --> 00:03:23.380
from doing a recovery. How might that be true?

00:03:23.390 --> 00:03:28.480
Well, think about where you are from a geofencing perspective and data

00:03:28.480 --> 00:03:33.930
privacy regulations that may prohibit you from extracting data without

00:03:33.930 --> 00:03:37.760
consent based upon where you are located in the world.

00:03:38.640 --> 00:03:44.700
The objective of the design phase is to establish and evaluate

00:03:44.700 --> 00:03:48.520
candidate architecture solutions. The approaches and their

00:03:48.520 --> 00:03:52.240
components have been illustrated in earlier sections, so the design

00:03:52.240 --> 00:03:56.000
phase should not just result in technical alternatives, but also

00:03:56.000 --> 00:03:58.970
flesh out procedures and workflow.

00:03:58.980 --> 00:04:03.760
Once the design of the BCDR solution is ready, implementing the solution will

00:04:03.760 --> 00:04:08.740
begin. This will require work on both the primary solution platform and on

00:04:08.740 --> 00:04:13.250
the DR platform. On the primary platform, these activities are likely to

00:04:13.250 --> 00:04:17.570
include implementation of functionality for enabling data replication on a

00:04:17.570 --> 00:04:20.149
regular or a continuous basis.
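
NOTE
A minimal sketch of regular, interval-based replication. The
replicate() helper and the 15-minute interval are assumptions for
illustration; a real platform would call its storage or database
replication API instead.
import time
REPLICATION_INTERVAL = 15 * 60   # assumption: cycle tied to a 15-minute RPO
def replicate() -> None:
    # placeholder: push changed blocks or transaction logs to the DR site
    print("replicating changes to the DR platform...")
while True:                      # continuous operation on the primary
    replicate()
    time.sleep(REPLICATION_INTERVAL)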

00:04:20.160 --> 00:04:24.830
Also, thinking about how the business keeps modifying and

00:04:24.840 --> 00:04:28.050
changing, care must be taken that not only the required

00:04:28.050 --> 00:04:30.170
infrastructure and services are made available,

00:04:30.180 --> 00:04:34.900
but also that the DR platform tracks any relevant changes and functional

00:04:34.910 --> 00:04:37.660
updates that are being made on the primary platform.
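
NOTE
A sketch of tracking drift between the primary and DR platforms, so
functional updates made on the primary are not missed. get_config()
is a hypothetical helper; a real setup would read infrastructure-as-code
state or a cloud provider API.
def get_config(platform: str) -> dict:
    # placeholder: return the platform's deployed settings
    return {"app_version": "2.4.1", "tls": "1.3", "instance_count": 6}
primary = get_config("primary")
dr = get_config("dr")
drift = {k: (v, dr.get(k)) for k, v in primary.items() if dr.get(k) != v}
if drift:
    print("DR platform has drifted from primary:", drift)
else:
    print("DR platform matches primary")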

00:04:37.740 --> 00:04:44.130
No plan is effective unless it's tested. With DR tests,

00:04:44.140 --> 00:04:50.690
there are three basic areas that include about five different capabilities.

00:04:50.750 --> 00:04:56.480
The first one would be what we would call a desktop review. In a desktop

00:04:56.480 --> 00:05:01.370
review, at a maximum, you go through the plan as a walkthrough of the invocation

00:05:01.370 --> 00:05:06.740
and recovery process. At a minimum, you're

00:05:06.740 --> 00:05:09.680
just checking the plan to make sure that it's written in a

00:05:09.680 --> 00:05:13.070
way that's understandable and that it's up to date with the

00:05:13.070 --> 00:05:14.310
business objectives.

00:05:14.320 --> 00:05:17.650
The second way is a tabletop conversation that revolves

00:05:17.650 --> 00:05:20.270
around the invocation and recovery process.

00:05:20.280 --> 00:05:25.270
So, one person could be introduced as the chaos monkey and say,

00:05:25.270 --> 00:05:28.040
hey, this is a scenario that I'm introducing,

00:05:28.040 --> 00:05:30.900
I'm a tornado, how does everybody respond?

00:05:30.980 --> 00:05:34.630
Then you have the executive emergency management team,

00:05:34.630 --> 00:05:37.800
the recovery team, and the restoration team take on their

00:05:37.800 --> 00:05:41.320
responsibilities and their roles around the table.

00:05:41.650 --> 00:05:44.650
The next one could be a recovery simulation.

00:05:44.660 --> 00:05:46.420
Now here, you have two choices.

00:05:46.430 --> 00:05:50.270
One could be a component failure recovery, so there is some

00:05:50.280 --> 00:05:53.750
specific technological component that you fail out.

00:05:53.760 --> 00:05:58.010
Maybe it is a virtual machine or an availability zone. Or

00:05:58.020 --> 00:06:00.470
it could be based upon service recovery.

00:06:00.480 --> 00:06:04.130
What is the service provided by that availability zone

00:06:04.140 --> 00:06:08.600
or that virtual machine? Fail that out as a simulation in a

00:06:08.600 --> 00:06:10.150
nonproduction environment.
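
NOTE
A sketch of a component-failure recovery simulation run in a
NONPRODUCTION environment. stop_component() and service_healthy()
are hypothetical stand-ins for your cloud API and health probe.
import time
def stop_component(component_id: str) -> None:
    print(f"simulating failure of {component_id}")  # e.g. stop a test VM
def service_healthy() -> bool:
    return True         # placeholder: probe the service's health endpoint
stop_component("test-vm-01")   # fail out one component, e.g. a VM or an AZ
time.sleep(5)                  # allow failover mechanisms time to react
assert service_healthy(), "service did not recover from the failure"
print("component failed out; service recovered as expected")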

00:06:10.340 --> 00:06:14.170
The last one is an operational failure. Here,

00:06:14.170 --> 00:06:18.140
you could be switching between redundant systems. So, in some

00:06:18.140 --> 00:06:20.770
cases this is what's known as a full test.

00:06:20.770 --> 00:06:24.760
The test simulates, in the most realistic way, any risk that

00:06:24.760 --> 00:06:28.560
could manifest. You may have an alternate site,

00:06:28.570 --> 00:06:32.960
like a mirrored site that's working on your behalf and then you fail out

00:06:32.970 --> 00:06:36.880
what would be considered one element of that mirrored site.

00:06:36.890 --> 00:06:39.640
The other element should continue processing.

00:06:39.650 --> 00:06:40.920
Now, if this is done,

00:06:40.930 --> 00:06:45.160
obviously, you would need to make sure that

00:06:45.170 --> 00:06:49.200
operational resilience is not affected and you don't

00:06:49.200 --> 00:06:52.160
induce an actual business failure.

00:06:52.240 --> 00:06:56.280
The best kind of test could be chaos engineering.

00:06:56.290 --> 00:06:59.970
For chaos engineering, the optimal testing

00:06:59.970 --> 00:07:02.280
environment is live, and the testing is continuous.

00:07:02.290 --> 00:07:07.450
Netflix embraces chaos engineering, which led them to create the Simian Army.

00:07:07.460 --> 00:07:08.500
Chaos Monkey

00:07:08.500 --> 00:07:13.160
is a script that imposes failure on random production VMs, which would be

00:07:13.160 --> 00:07:17.710
like unplugging a live production server from its power. It is designed to

00:07:17.710 --> 00:07:20.800
expose assumptions about resilience. By 2013,

00:07:20.800 --> 00:07:24.590
Chaos Monkey had killed over 60,000 instances.
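
NOTE
A minimal Chaos Monkey-style sketch: pick one random instance and
terminate it. This is not Netflix's actual tool; terminate() stands
in for a real cloud provider API call, and the fleet list is invented.
import random
fleet = ["i-0a1", "i-0b2", "i-0c3"]     # assumption: production VM inventory
def terminate(instance_id: str) -> None:
    print(f"terminating {instance_id}") # real tools call the provider's API
victim = random.choice(fleet)           # failure lands on a random VM
terminate(victim)                       # a resilient cluster absorbs the loss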

00:07:24.600 --> 00:07:25.660
What did they learn?

00:07:26.100 --> 00:07:31.180
State is bad, clusters are good, and instance survival is not

00:07:31.180 --> 00:07:33.400
sufficient. But that wasn't enough for them.

00:07:33.410 --> 00:07:36.420
Then they created something called Chaos Gorilla, which

00:07:36.430 --> 00:07:39.980
imposed failure on production availability zones,

00:07:39.990 --> 00:07:43.000
which would be like shutting down a production datacenter.

00:07:43.010 --> 00:07:47.080
This exposed assumptions about deployment topology. Seamless

00:07:47.080 --> 00:07:51.160
recovery is a challenge, and rapidly redirected traffic has

00:07:51.170 --> 00:07:52.930
errors. What were the lessons learned?

00:07:53.240 --> 00:07:54.510
Physical, logical, and

00:07:54.510 --> 00:07:59.260
virtual disaggregation, and consolidation where it was necessary. That wasn't

00:07:59.260 --> 00:08:03.380
enough for them; they then introduced something called Chaos Kong, which

00:08:03.390 --> 00:08:08.650
simulated regional failure. When Amazon's DynamoDB service experienced an

00:08:08.650 --> 00:08:13.890
availability issue in the US‑East‑1 region, thanks to its stress testing,

00:08:13.900 --> 00:08:18.790
Netflix's system actually handled traffic failover while other large

00:08:18.790 --> 00:08:23.410
companies were out for 7 to 8 hours. If you really want to get to a high

00:08:23.410 --> 00:08:27.360
level of mature testing, that might be something that you investigate. You

00:08:27.360 --> 00:08:31.700
can actually go to GitHub and download the Chaos Monkey scripts.
