1
00:00:00,080 --> 00:00:00,960
In this lesson,

2
00:00:00,960 --> 00:00:03,570
we're going to discuss disaster recovery metrics.

3
00:00:03,570 --> 00:00:06,450
Disaster recovery metrics are quantifiable standards

4
00:00:06,450 --> 00:00:09,150
that are used to plan and evaluate an organization's ability

5
00:00:09,150 --> 00:00:12,480
to recover IT operations following a disruptive event.

6
00:00:12,480 --> 00:00:14,310
These metrics are focused on the measurements

7
00:00:14,310 --> 00:00:15,270
inside of your business

8
00:00:15,270 --> 00:00:16,590
and your enterprise network

9
00:00:16,590 --> 00:00:18,780
that are going to be used to identify organizational risks

10
00:00:18,780 --> 00:00:19,710
and determine their effect

11
00:00:19,710 --> 00:00:22,230
on your ongoing mission-critical operations.

12
00:00:22,230 --> 00:00:23,970
These recovery metrics can also be used

13
00:00:23,970 --> 00:00:25,620
to measure our levels of availability

14
00:00:25,620 --> 00:00:27,690
and how quickly we can restore our services

15
00:00:27,690 --> 00:00:29,160
if they're being negatively affected

16
00:00:29,160 --> 00:00:31,260
by some kind of an incident or event.

17
00:00:31,260 --> 00:00:33,840
Now availability is measured in what we call uptime

18
00:00:33,840 --> 00:00:36,180
or by how many minutes or hours you are up.

19
00:00:36,180 --> 00:00:38,340
And often this will be shown as a percentage

20
00:00:38,340 --> 00:00:39,690
of how many minutes you are up

21
00:00:39,690 --> 00:00:41,040
divided by the total number of minutes

22
00:00:41,040 --> 00:00:42,870
available during that period.

23
00:00:42,870 --> 00:00:44,340
We try to maintain what's called

24
00:00:44,340 --> 00:00:45,990
the five nines of availability

25
00:00:45,990 --> 00:00:47,970
in most of our commercial-based networks.

26
00:00:47,970 --> 00:00:49,080
And this is really hard,

27
00:00:49,080 --> 00:00:51,210
because when we talk about five nines of availability,

28
00:00:51,210 --> 00:00:54,390
we are talking about 99.999% uptime

29
00:00:54,390 --> 00:00:57,420
or a maximum of about five minutes of downtime per year,

30
00:00:57,420 --> 00:00:59,520
which is not a whole lot of downtime.

31
00:00:59,520 --> 00:01:01,020
In some cloud-based networks,

32
00:01:01,020 --> 00:01:05,910
they aim for six nines of availability, which is 99.9999%,

33
00:01:05,910 --> 00:01:09,330
and this equates to just 31 seconds of downtime per year.
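To make those numbers concrete, here is a minimal sketch of the uptime percentage and the yearly downtime budget. It assumes a 365-day year, and the function names are my own for illustration, not part of any standard tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_percent(uptime_minutes: float, total_minutes: float) -> float:
    """Minutes up divided by total minutes in the period, as a percentage."""
    return 100.0 * uptime_minutes / total_minutes


def allowed_downtime_seconds(target_percent: float) -> float:
    """Yearly downtime budget, in seconds, for a target availability."""
    downtime_fraction = 1.0 - target_percent / 100.0
    return downtime_fraction * MINUTES_PER_YEAR * 60


five_nines = allowed_downtime_seconds(99.999)   # a little over five minutes
six_nines = allowed_downtime_seconds(99.9999)   # a little over 31 seconds
print(f"Five nines: {five_nines / 60:.2f} minutes of downtime per year")
print(f"Six nines: {six_nines:.1f} seconds of downtime per year")
```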

34
00:01:09,330 --> 00:01:10,650
So as you can imagine,

35
00:01:10,650 --> 00:01:12,750
I'm probably going to need more than 31 seconds

36
00:01:12,750 --> 00:01:14,143
or even five minutes of downtime per year

37
00:01:14,143 --> 00:01:16,770
to be able to do things like patching my servers

38
00:01:16,770 --> 00:01:18,210
or installing a new storage device

39
00:01:18,210 --> 00:01:20,250
or to put in a new router or switch.

40
00:01:20,250 --> 00:01:23,250
So how can I maintain that high level of availability

41
00:01:23,250 --> 00:01:25,140
at five nines or six nines

42
00:01:25,140 --> 00:01:26,670
if I need to do that type of maintenance

43
00:01:26,670 --> 00:01:28,200
and recovery procedures?

44
00:01:28,200 --> 00:01:30,330
Well, I'm going to do that by designing my networks

45
00:01:30,330 --> 00:01:33,060
to be highly available and highly reliable.

46
00:01:33,060 --> 00:01:34,770
Now whenever I talk about availability,

47
00:01:34,770 --> 00:01:36,510
I'm really concerned with my network's ability

48
00:01:36,510 --> 00:01:38,130
to be up and operational.

49
00:01:38,130 --> 00:01:39,750
But when I talk about reliability,

50
00:01:39,750 --> 00:01:41,790
I'm more concerned about not dropping packets

51
00:01:41,790 --> 00:01:44,070
inside of the network because I want to ensure that I'm up

52
00:01:44,070 --> 00:01:47,370
and also effectively passing data across that network.

53
00:01:47,370 --> 00:01:49,650
So if your network is highly available

54
00:01:49,650 --> 00:01:50,820
but it's not reliable,

55
00:01:50,820 --> 00:01:52,410
that's not a very good network.

56
00:01:52,410 --> 00:01:54,840
But conversely if you have a highly reliable network

57
00:01:54,840 --> 00:01:56,520
but it's not a highly available one,

58
00:01:56,520 --> 00:01:57,600
that's not good either.

59
00:01:57,600 --> 00:01:59,430
Because even if it's the most reliable network

60
00:01:59,430 --> 00:02:00,600
in the entire world,

61
00:02:00,600 --> 00:02:02,610
if it's only going to be up 20 minutes per year,

62
00:02:02,610 --> 00:02:03,750
then it's not going to be considered

63
00:02:03,750 --> 00:02:04,950
a very good network

64
00:02:04,950 --> 00:02:07,500
because it's not usable for business operations.

65
00:02:07,500 --> 00:02:10,380
So what we need to do is balance these two extremes

66
00:02:10,380 --> 00:02:12,210
and aim for good enough in both areas

67
00:02:12,210 --> 00:02:13,950
to meet our business needs.

68
00:02:13,950 --> 00:02:15,750
So as we go through this lesson,

69
00:02:15,750 --> 00:02:17,460
we're going to discuss some key metrics

70
00:02:17,460 --> 00:02:18,840
that you should be aware of,

71
00:02:18,840 --> 00:02:20,790
and we're going to talk about how you can measure things

72
00:02:20,790 --> 00:02:22,230
within your own organization

73
00:02:22,230 --> 00:02:24,030
like the mean time between failures,

74
00:02:24,030 --> 00:02:27,150
the mean time to repair, the maximum tolerable downtime,

75
00:02:27,150 --> 00:02:28,320
the recovery point objective,

76
00:02:28,320 --> 00:02:30,540
and the recovery time objective.

77
00:02:30,540 --> 00:02:33,090
First, we have the mean time to repair.

78
00:02:33,090 --> 00:02:35,400
Now the mean time to repair, or MTTR,

79
00:02:35,400 --> 00:02:37,500
is a metric that's used to measure the average time

80
00:02:37,500 --> 00:02:40,170
it takes to repair a network device when it breaks.

81
00:02:40,170 --> 00:02:42,630
After all, everything in our networks and organizations

82
00:02:42,630 --> 00:02:44,190
will break eventually.

83
00:02:44,190 --> 00:02:45,285
So when a device breaks,

84
00:02:45,285 --> 00:02:48,690
how long does it take for you and your team to fix it?

85
00:02:48,690 --> 00:02:50,100
Now, based on that,

86
00:02:50,100 --> 00:02:52,050
how much downtime did you actually experience

87
00:02:52,050 --> 00:02:53,310
during the last year?

88
00:02:53,310 --> 00:02:54,750
This is really what we're trying to measure here

89
00:02:54,750 --> 00:02:56,580
with the mean time to repair.

90
00:02:56,580 --> 00:02:59,400
Second, we have the mean time between failures.

91
00:02:59,400 --> 00:03:02,280
Now the mean time between failures, or the MTBF,

92
00:03:02,280 --> 00:03:03,990
is a metric that measures the average time

93
00:03:03,990 --> 00:03:06,720
between when failures occur on a given device.

94
00:03:06,720 --> 00:03:07,710
Now for most people,

95
00:03:07,710 --> 00:03:10,350
these two terms can be a little confusing at first.

96
00:03:10,350 --> 00:03:12,240
So let's consider an example of both of these

97
00:03:12,240 --> 00:03:14,280
by considering some kind of system failure

98
00:03:14,280 --> 00:03:16,710
and how it's going to be resolved over time.

99
00:03:16,710 --> 00:03:18,120
Now if you have a system failure

100
00:03:18,120 --> 00:03:20,040
and then you resume normal operations,

101
00:03:20,040 --> 00:03:21,630
the amount of time between the failure

102
00:03:21,630 --> 00:03:23,400
and the resumption of normal operations

103
00:03:23,400 --> 00:03:25,320
will be considered to be the time to repair

104
00:03:25,320 --> 00:03:27,570
this particular incident or event.

105
00:03:27,570 --> 00:03:28,530
On the timeline,

106
00:03:28,530 --> 00:03:30,300
this is shown as the first stop sign

107
00:03:30,300 --> 00:03:32,400
on the left side of the timeline.

108
00:03:32,400 --> 00:03:34,650
Now if I collect all those time to repair metrics

109
00:03:34,650 --> 00:03:35,970
and I add them all together

110
00:03:35,970 --> 00:03:37,680
and then create an average from them,

111
00:03:37,680 --> 00:03:39,810
this creates what we call the mean time to repair,

112
00:03:39,810 --> 00:03:42,090
or the MTTR, for my organization's network

113
00:03:42,090 --> 00:03:43,770
or individual devices,

114
00:03:43,770 --> 00:03:45,510
depending on which metrics we were collecting

115
00:03:45,510 --> 00:03:47,760
and using as part of that average.

116
00:03:47,760 --> 00:03:49,230
Now on the failure side of things,

117
00:03:49,230 --> 00:03:50,550
we're going to need to measure the time

118
00:03:50,550 --> 00:03:52,170
between one failure occurring,

119
00:03:52,170 --> 00:03:53,460
us fixing that failure,

120
00:03:53,460 --> 00:03:55,470
and then the next failure that occurs.

121
00:03:55,470 --> 00:03:57,300
This becomes the time between failures,

122
00:03:57,300 --> 00:03:58,830
and when I average them all together,

123
00:03:58,830 --> 00:04:02,670
it becomes the mean time between failures or the MTBF.

124
00:04:02,670 --> 00:04:04,680
Now hopefully you can see the difference here.

125
00:04:04,680 --> 00:04:06,150
With the mean time to repair,

126
00:04:06,150 --> 00:04:07,530
what we want is a low number

127
00:04:07,530 --> 00:04:09,510
because we want to be able to fix things quickly

128
00:04:09,510 --> 00:04:11,310
and get ourselves back online.

129
00:04:11,310 --> 00:04:13,200
So the lower the mean time to repair,

130
00:04:13,200 --> 00:04:14,310
the better things are going to be

131
00:04:14,310 --> 00:04:16,440
in terms of our network's availability.

132
00:04:16,440 --> 00:04:17,273
On the other hand,

133
00:04:17,273 --> 00:04:19,290
if you're talking about the mean time between failures,

134
00:04:19,290 --> 00:04:21,750
then you want the number to be as large as possible

135
00:04:21,750 --> 00:04:23,040
because a larger number

136
00:04:23,040 --> 00:04:25,920
means that the mean time between failures becomes longer,

137
00:04:25,920 --> 00:04:27,540
and this means that your network's availability

138
00:04:27,540 --> 00:04:29,730
and reliability will increase.

139
00:04:29,730 --> 00:04:32,640
After all, we want to buy and operate reliable equipment

140
00:04:32,640 --> 00:04:34,200
with a lower number of failures.

141
00:04:34,200 --> 00:04:36,420
So if there's more time in between our failures,

142
00:04:36,420 --> 00:04:38,730
we can call that piece of equipment more reliable

143
00:04:38,730 --> 00:04:40,020
than an equivalent piece of equipment

144
00:04:40,020 --> 00:04:41,970
that's going to fail more often.
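As a rough sketch of that averaging, the example below computes both metrics from an incident log. The dates are made up for illustration, and I'm assuming the time between failures runs from the moment one failure is fixed until the next failure occurs.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (time of failure, time normal operations resumed).
incidents = [
    (datetime(2024, 1, 10, 8, 0), datetime(2024, 1, 10, 10, 0)),   # 2-hour repair
    (datetime(2024, 3, 1, 14, 0), datetime(2024, 3, 1, 15, 0)),    # 1-hour repair
    (datetime(2024, 6, 20, 9, 0), datetime(2024, 6, 20, 12, 0)),   # 3-hour repair
]


def mean_time_to_repair(log):
    """Average of (resumed - failed) across all incidents."""
    repairs = [resumed - failed for failed, resumed in log]
    return sum(repairs, timedelta()) / len(repairs)


def mean_time_between_failures(log):
    """Average running time from one repair until the next failure."""
    gaps = [log[i + 1][0] - log[i][1] for i in range(len(log) - 1)]
    return sum(gaps, timedelta()) / len(gaps)


mttr = mean_time_to_repair(incidents)          # lower is better
mtbf = mean_time_between_failures(incidents)   # higher is better
print(f"MTTR: {mttr}")
print(f"MTBF: {mtbf}")
```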

145
00:04:41,970 --> 00:04:45,030
Now third, we have maximum tolerable downtime.

146
00:04:45,030 --> 00:04:47,670
The maximum tolerable downtime, or MTD,

147
00:04:47,670 --> 00:04:48,960
is the longest period of time

148
00:04:48,960 --> 00:04:50,340
a business can be inoperable

149
00:04:50,340 --> 00:04:52,980
without causing irrevocable business failures.

150
00:04:52,980 --> 00:04:55,260
Essentially, the maximum tolerable downtime

151
00:04:55,260 --> 00:04:57,270
will answer a simple question for us.

152
00:04:57,270 --> 00:04:59,490
How long can our organization's network be down

153
00:04:59,490 --> 00:05:01,140
without going out of business?

154
00:05:01,140 --> 00:05:03,720
Now the maximum tolerable downtime is going to be different

155
00:05:03,720 --> 00:05:05,460
for each organization that you work at,

156
00:05:05,460 --> 00:05:07,297
and it can even be different within different departments

157
00:05:07,297 --> 00:05:09,720
inside of the same organization.

158
00:05:09,720 --> 00:05:11,910
Additionally, in larger organizations,

159
00:05:11,910 --> 00:05:13,230
each of your business processes

160
00:05:13,230 --> 00:05:16,050
can have its own maximum tolerable downtime as well.

161
00:05:16,050 --> 00:05:18,480
For example, some maximum tolerable downtimes

162
00:05:18,480 --> 00:05:19,680
may just be a couple of minutes

163
00:05:19,680 --> 00:05:21,180
for your most critical functions,

164
00:05:21,180 --> 00:05:23,250
while others might be as long as a couple of hours

165
00:05:23,250 --> 00:05:25,830
or even days for more administrative functions.

166
00:05:25,830 --> 00:05:27,840
This really does depend on your organization,

167
00:05:27,840 --> 00:05:29,490
and you're going to have to figure this out for yourself

168
00:05:29,490 --> 00:05:31,140
of what that specific target or goal

169
00:05:31,140 --> 00:05:33,330
for your maximum tolerable downtime will be

170
00:05:33,330 --> 00:05:36,150
based on working with your organization's key stakeholders.

171
00:05:36,150 --> 00:05:37,920
But for now, I want you to simply remember

172
00:05:37,920 --> 00:05:39,780
that the maximum tolerable downtime

173
00:05:39,780 --> 00:05:41,730
is really the upper limit on the recovery time

174
00:05:41,730 --> 00:05:44,400
that the system and the asset owners must resume

175
00:05:44,400 --> 00:05:46,260
your normal operations within.

176
00:05:46,260 --> 00:05:47,820
So keeping that in mind,

177
00:05:47,820 --> 00:05:49,860
let's take a look at my own business.

178
00:05:49,860 --> 00:05:51,660
Now one of the maximum tolerable downtimes

179
00:05:51,660 --> 00:05:53,220
we've established at Dion Training

180
00:05:53,220 --> 00:05:54,690
is focused on our response time

181
00:05:54,690 --> 00:05:56,490
to our students when they ask a question

182
00:05:56,490 --> 00:05:58,830
by emailing support at diontraining.com

183
00:05:58,830 --> 00:06:00,120
or by posting a question

184
00:06:00,120 --> 00:06:02,670
inside the Q&A section of the course.

185
00:06:02,670 --> 00:06:04,200
Our maximum tolerable downtime

186
00:06:04,200 --> 00:06:07,620
for this area in our business has been set at 12 hours.
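Since the maximum tolerable downtime is the upper limit on recovery time, one way to picture it is a simple per-process check like the sketch below. Only the 12-hour student support figure comes from this lesson. The other processes and all of the planned recovery times are hypothetical.

```python
from datetime import timedelta

# Maximum tolerable downtime per business process.
mtd_by_process = {
    "student_support": timedelta(hours=12),       # from the lesson
    "payment_processing": timedelta(minutes=5),   # hypothetical
    "internal_reporting": timedelta(days=2),      # hypothetical
}

# Hypothetical planned recovery times for the same processes.
planned_recovery = {
    "student_support": timedelta(hours=8),
    "payment_processing": timedelta(minutes=15),
    "internal_reporting": timedelta(hours=6),
}


def processes_over_mtd(mtd, planned):
    """Return the processes whose planned recovery time exceeds their MTD."""
    return sorted(name for name, limit in mtd.items() if planned[name] > limit)


at_risk = processes_over_mtd(mtd_by_process, planned_recovery)
print("Planned recovery exceeds MTD for:", at_risk)
```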

187
00:06:07,620 --> 00:06:09,231
Now why is our MTD 12 hours

188
00:06:09,231 --> 00:06:11,040
instead of something like five minutes

189
00:06:11,040 --> 00:06:12,600
or something short like that?

190
00:06:12,600 --> 00:06:13,830
This is a great question.

191
00:06:13,830 --> 00:06:14,940
And when we started to analyze

192
00:06:14,940 --> 00:06:16,170
the function within our business,

193
00:06:16,170 --> 00:06:18,750
we started to think about it from our students' perspective.

194
00:06:18,750 --> 00:06:20,730
If I was a student and I asked a question,

195
00:06:20,730 --> 00:06:22,410
what is the longest I would want to go

196
00:06:22,410 --> 00:06:24,270
before I get a response to my question?

197
00:06:24,270 --> 00:06:26,730
And the answer for us was somewhere around 24 hours.

198
00:06:26,730 --> 00:06:30,600
So we cut that in half and made it 12 hours for our MTD.

199
00:06:30,600 --> 00:06:31,890
Now this was a balanced decision

200
00:06:31,890 --> 00:06:34,470
to balance the cost of support for our student questions

201
00:06:34,470 --> 00:06:36,720
versus how quickly we could get them answered.

202
00:06:36,720 --> 00:06:39,060
Sure, we can hire an entire team of 100 people

203
00:06:39,060 --> 00:06:41,370
to do nothing but answer student questions all day,

204
00:06:41,370 --> 00:06:43,710
but that would cost us around $5 million per year

205
00:06:43,710 --> 00:06:44,700
in labor costs.

206
00:06:44,700 --> 00:06:46,380
And to support that kind of labor budget,

207
00:06:46,380 --> 00:06:48,030
we would have to increase our course prices

208
00:06:48,030 --> 00:06:51,090
by at least five or 10 times their current cost.

209
00:06:51,090 --> 00:06:52,740
So when we surveyed our students

210
00:06:52,740 --> 00:06:54,750
in regards to the level of service they expected

211
00:06:54,750 --> 00:06:56,640
and the prices they were willing to pay for that,

212
00:06:56,640 --> 00:06:59,340
we found that around 12 hours to 24 hours

213
00:06:59,340 --> 00:07:00,690
was a reasonable compromise

214
00:07:00,690 --> 00:07:02,190
in terms of speed of the response

215
00:07:02,190 --> 00:07:04,740
and the cost to deliver those responses to you.

216
00:07:04,740 --> 00:07:07,650
We could afford to provide answers within 12 to 24 hours

217
00:07:07,650 --> 00:07:08,730
and we could do that with a team

218
00:07:08,730 --> 00:07:10,920
of about four to seven people most of the time,

219
00:07:10,920 --> 00:07:13,260
which makes it a more cost-effective option for us,

220
00:07:13,260 --> 00:07:14,970
and in turn, for you, our students,

221
00:07:14,970 --> 00:07:17,100
because we pass the savings on to you.

222
00:07:17,100 --> 00:07:18,570
Now to accommodate that,

223
00:07:18,570 --> 00:07:20,670
we actually have split our student support team members

224
00:07:20,670 --> 00:07:22,350
into two different teams.

225
00:07:22,350 --> 00:07:23,340
One half of our team

226
00:07:23,340 --> 00:07:24,973
lives and works over in the Philippines,

227
00:07:24,973 --> 00:07:26,340
and the other half of the team

228
00:07:26,340 --> 00:07:28,020
works here in the United States.

229
00:07:28,020 --> 00:07:29,760
This means that both teams are offset

230
00:07:29,760 --> 00:07:31,260
by about 11 to 12 hours

231
00:07:31,260 --> 00:07:33,360
depending on the time of year that we're dealing with.

232
00:07:33,360 --> 00:07:35,280
So if it's daytime in the Philippines,

233
00:07:35,280 --> 00:07:37,230
it's usually nighttime here in the United States;

234
00:07:37,230 --> 00:07:38,670
and when it's daytime here in the United States,

235
00:07:38,670 --> 00:07:40,530
it's usually nighttime in the Philippines.

236
00:07:40,530 --> 00:07:42,960
And so we can cover almost 24 hours a day

237
00:07:42,960 --> 00:07:44,550
by using these two locations

238
00:07:44,550 --> 00:07:46,500
because each side will work eight hours

239
00:07:46,500 --> 00:07:47,520
and they'll do it during the day

240
00:07:47,520 --> 00:07:49,170
in their particular country.

241
00:07:49,170 --> 00:07:50,970
Now you may have noticed there's a little bit of time

242
00:07:50,970 --> 00:07:52,890
that isn't covered by both of these teams

243
00:07:52,890 --> 00:07:55,050
because each of them is only working eight hours,

244
00:07:55,050 --> 00:07:57,330
but they have to cover a 12-hour period.

245
00:07:57,330 --> 00:07:59,490
And so what we did was we have another person

246
00:07:59,490 --> 00:08:00,960
who works over in Egypt,

247
00:08:00,960 --> 00:08:03,270
and they work right in the middle of that time

248
00:08:03,270 --> 00:08:04,830
so they can cover that eight-hour block,

249
00:08:04,830 --> 00:08:06,090
the extra four hours from the US,

250
00:08:06,090 --> 00:08:07,770
and the other four hours from the Philippines

251
00:08:07,770 --> 00:08:09,330
during their working hours in Egypt

252
00:08:09,330 --> 00:08:10,410
to make sure our student questions

253
00:08:10,410 --> 00:08:12,270
are being answered effectively.

254
00:08:12,270 --> 00:08:14,130
Now another benefit of doing this kind of a setup

255
00:08:14,130 --> 00:08:16,530
where you're splitting teams across multiple locations

256
00:08:16,530 --> 00:08:18,750
is that they're now geographically distinct.

257
00:08:18,750 --> 00:08:19,890
And so if there's a big storm

258
00:08:19,890 --> 00:08:21,930
that's affecting power in the United States,

259
00:08:21,930 --> 00:08:23,220
that isn't going to be a problem for us

260
00:08:23,220 --> 00:08:24,330
because the people in the Philippines

261
00:08:24,330 --> 00:08:26,310
can still work and take care of it.

262
00:08:26,310 --> 00:08:28,560
Alternatively, if there is a big typhoon or flood

263
00:08:28,560 --> 00:08:30,840
that causes issues for our teams in the Philippines,

264
00:08:30,840 --> 00:08:32,309
it shouldn't affect our teams located here

265
00:08:32,309 --> 00:08:33,179
in the United States

266
00:08:33,179 --> 00:08:34,917
and they can cover their workload

267
00:08:34,917 --> 00:08:36,480
to be able to get answers to our students.

268
00:08:36,480 --> 00:08:38,549
By having this type of geographic diversity,

269
00:08:38,549 --> 00:08:40,919
it's going to allow us to maintain that 12-hour response time

270
00:08:40,919 --> 00:08:42,851
because even if one team is offline their entire shift

271
00:08:42,851 --> 00:08:44,580
because of a disaster,

272
00:08:44,580 --> 00:08:46,530
they're only going to be gone for eight hours

273
00:08:46,530 --> 00:08:48,120
and we should be able to get their questions answered

274
00:08:48,120 --> 00:08:49,560
before that 12-hour mark

275
00:08:49,560 --> 00:08:50,790
because the person in Egypt

276
00:08:50,790 --> 00:08:53,280
or the person in either the Philippines or the United States

277
00:08:53,280 --> 00:08:54,990
can take care of those things.

278
00:08:54,990 --> 00:08:56,040
And so at the end of the day,

279
00:08:56,040 --> 00:08:58,140
this really became a risk management decision for us

280
00:08:58,140 --> 00:09:00,240
as well as a cost-benefit analysis

281
00:09:00,240 --> 00:09:01,650
to be able to provide good service

282
00:09:01,650 --> 00:09:03,840
at a good price to all of our students.

283
00:09:03,840 --> 00:09:05,400
Now again, you're going to have to figure out

284
00:09:05,400 --> 00:09:07,590
what your maximum tolerable downtime is

285
00:09:07,590 --> 00:09:09,390
inside your own company or organization

286
00:09:09,390 --> 00:09:10,500
out in the real world,

287
00:09:10,500 --> 00:09:11,760
and this will all be a process

288
00:09:11,760 --> 00:09:14,280
of designing your risk management plan to be able to support

289
00:09:14,280 --> 00:09:16,830
that maximum tolerable downtime as well.

290
00:09:16,830 --> 00:09:19,590
Fourth, we need to cover the recovery time objective.

291
00:09:19,590 --> 00:09:22,500
The recovery time objective, also known as the RTO,

292
00:09:22,500 --> 00:09:23,730
is the length of time it takes

293
00:09:23,730 --> 00:09:24,840
after an event to resume

294
00:09:24,840 --> 00:09:27,270
your normal business operations and activities.

295
00:09:27,270 --> 00:09:29,550
When you start thinking about recovery time objective,

296
00:09:29,550 --> 00:09:30,810
I really want you to think about the fact

297
00:09:30,810 --> 00:09:33,360
that something went down, like you lost power,

298
00:09:33,360 --> 00:09:34,297
and now you have to ask yourself,

299
00:09:34,297 --> 00:09:36,720
"How quickly do we need to get it back online?"

300
00:09:36,720 --> 00:09:39,597
In the case of power, we have a 60-second recovery time for power.

301
00:09:39,597 --> 00:09:41,430
And if we want to make sure our power is back up

302
00:09:41,430 --> 00:09:43,500
and online within 60 seconds, can we do that?

303
00:09:43,500 --> 00:09:44,610
Is that achievable?

304
00:09:44,610 --> 00:09:45,660
Well, yes it is.

305
00:09:45,660 --> 00:09:47,370
If you have a backup diesel generator,

306
00:09:47,370 --> 00:09:50,010
it's going to turn on in about 30 to 45 seconds,

307
00:09:50,010 --> 00:09:51,780
and by 45 to 60 seconds,

308
00:09:51,780 --> 00:09:53,940
power will be transferred onto that diesel generator

309
00:09:53,940 --> 00:09:55,650
and we will be fully recovered.

310
00:09:55,650 --> 00:09:58,770
Now if I wasn't happy with that 45 or 60 seconds

311
00:09:58,770 --> 00:10:00,990
and I wanted to make my recovery time zero,

312
00:10:00,990 --> 00:10:02,370
can I achieve that?

313
00:10:02,370 --> 00:10:03,810
Well, yes, we still can.

314
00:10:03,810 --> 00:10:05,370
And that's one of the reasons why I installed

315
00:10:05,370 --> 00:10:08,010
a battery backup system because if power goes away,

316
00:10:08,010 --> 00:10:10,380
those batteries provide instant power to our servers

317
00:10:10,380 --> 00:10:11,520
and our networking equipment.

318
00:10:11,520 --> 00:10:14,460
So using that, we're able to hit a recovery time objective

319
00:10:14,460 --> 00:10:16,470
of zero in terms of power

320
00:10:16,470 --> 00:10:18,930
if we're using a battery backup based system.

321
00:10:18,930 --> 00:10:20,035
Now the overall full restore

322
00:10:20,035 --> 00:10:22,230
isn't what we're really talking about here,

323
00:10:22,230 --> 00:10:24,870
but instead we're talking about the recovery time objective

324
00:10:24,870 --> 00:10:26,730
to ensure that the operations can continue

325
00:10:26,730 --> 00:10:28,920
and the services are back up and running.

326
00:10:28,920 --> 00:10:30,840
Here, we're not saying that we need to wait for the power

327
00:10:30,840 --> 00:10:32,820
to be fully restored by the power grid again

328
00:10:32,820 --> 00:10:34,530
from our local electric company.

329
00:10:34,530 --> 00:10:36,090
Instead, we just need to make sure

330
00:10:36,090 --> 00:10:37,257
we can recover our business

331
00:10:37,257 --> 00:10:39,510
and we can make sure we're continuing to operate.

332
00:10:39,510 --> 00:10:42,270
And in our case, that can be done on battery, on solar,

333
00:10:42,270 --> 00:10:45,120
or on a generator within zero to 60 seconds,

334
00:10:45,120 --> 00:10:48,060
and that means we can meet our recovery time objectives.
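Here is a small sketch of that comparison, using the worst-case switchover times from the power example. The dictionary and function names are my own, chosen just for illustration.

```python
# Worst-case switchover times in seconds, taken from the power example.
switchover_seconds = {
    "battery_backup": 0,      # batteries provide instant power
    "diesel_generator": 60,   # started and carrying the load by 45 to 60 seconds
}


def meets_rto(recovery_seconds: float, rto_seconds: float) -> bool:
    """A recovery option meets the RTO if service resumes within the target."""
    return recovery_seconds <= rto_seconds


def options_meeting_rto(rto_seconds: float) -> list[str]:
    """List every power option that can satisfy the given RTO."""
    return sorted(
        name
        for name, seconds in switchover_seconds.items()
        if meets_rto(seconds, rto_seconds)
    )


print("RTO of 60 seconds:", options_meeting_rto(60))
print("RTO of 0 seconds:", options_meeting_rto(0))
```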

335
00:10:48,060 --> 00:10:50,730
Fifth, we have the recovery point objective.

336
00:10:50,730 --> 00:10:52,973
Now the recovery point objective or RPO

337
00:10:52,973 --> 00:10:55,440
is going to be defined as the longest period of time

338
00:10:55,440 --> 00:10:56,273
that an organization

339
00:10:56,273 --> 00:10:58,890
can tolerate lost data being unrecoverable.

340
00:10:58,890 --> 00:11:00,570
Now the way I like to think about the RPO

341
00:11:00,570 --> 00:11:02,070
is to think about ransomware.

342
00:11:02,070 --> 00:11:03,780
If you have ransomware in your system,

343
00:11:03,780 --> 00:11:05,550
it's going to attempt to encrypt all your data

344
00:11:05,550 --> 00:11:06,870
and all of your files.

345
00:11:06,870 --> 00:11:08,820
Now you've got a couple of choices here.

346
00:11:08,820 --> 00:11:11,130
You can pay the ransom, which we never recommend;

347
00:11:11,130 --> 00:11:12,900
you could try to crack the ransomware key,

348
00:11:12,900 --> 00:11:14,850
which could take you days, weeks, months,

349
00:11:14,850 --> 00:11:17,040
or even years depending on how strong it is;

350
00:11:17,040 --> 00:11:19,590
or you can actually format and wipe that system

351
00:11:19,590 --> 00:11:22,080
and then recover from a known good backup.

352
00:11:22,080 --> 00:11:24,540
Now most of the time, we're going to choose option three

353
00:11:24,540 --> 00:11:25,650
and we'll format the system

354
00:11:25,650 --> 00:11:27,570
and recover from a known good backup.

355
00:11:27,570 --> 00:11:30,570
So let's assume we went ahead and chose that option.

356
00:11:30,570 --> 00:11:33,090
Well, if we did that, what is the longest period of time

357
00:11:33,090 --> 00:11:34,843
that we can tolerate data loss?

358
00:11:34,843 --> 00:11:36,300
Now what I mean by this

359
00:11:36,300 --> 00:11:38,190
is that when you're recovering from a backup,

360
00:11:38,190 --> 00:11:40,650
that backup is a certain amount of time old.

361
00:11:40,650 --> 00:11:42,660
And that means we are always lagging behind

362
00:11:42,660 --> 00:11:44,190
what the current data on the system was

363
00:11:44,190 --> 00:11:45,720
when an event happens.

364
00:11:45,720 --> 00:11:47,820
That data could be several hours old

365
00:11:47,820 --> 00:11:50,040
because it may have been midnight when we last backed it up,

366
00:11:50,040 --> 00:11:51,600
or it could have been several days old

367
00:11:51,600 --> 00:11:54,210
if we only back up once a week or once a month.

368
00:11:54,210 --> 00:11:56,490
And if this ransomware hits you at 6:00 in the morning

369
00:11:56,490 --> 00:11:58,680
but you're using an every-midnight backup time,

370
00:11:58,680 --> 00:12:01,170
that means you have about six hours' worth of lost data

371
00:12:01,170 --> 00:12:02,760
because you don't have a backup going

372
00:12:02,760 --> 00:12:05,010
after midnight of that particular day,

373
00:12:05,010 --> 00:12:06,810
and that means you have about six hours of data

374
00:12:06,810 --> 00:12:07,830
that's now encrypted

375
00:12:07,830 --> 00:12:08,850
and you're not going to be able to get it back

376
00:12:08,850 --> 00:12:10,890
when this ransomware attack happens.

377
00:12:10,890 --> 00:12:12,030
Now this is what we're talking about

378
00:12:12,030 --> 00:12:14,370
when we talk about the recovery point objective.

379
00:12:14,370 --> 00:12:16,980
That six hours is going to be the lost period of time

380
00:12:16,980 --> 00:12:18,420
where you can't recover the data

381
00:12:18,420 --> 00:12:21,420
because you're not performing backups during that timeframe.

382
00:12:21,420 --> 00:12:23,070
So you have to keep this in mind

383
00:12:23,070 --> 00:12:24,750
when you're designing your systems.

384
00:12:24,750 --> 00:12:26,730
If you have an RPO of six hours,

385
00:12:26,730 --> 00:12:29,520
that means you need to run backups at least every six hours

386
00:12:29,520 --> 00:12:31,620
or less to ensure that all of your data,

387
00:12:31,620 --> 00:12:34,140
except the last six hours of the data can be recovered

388
00:12:34,140 --> 00:12:35,730
if you need to restore from a backup,

389
00:12:35,730 --> 00:12:38,400
and that's what we call the recovery point objective.
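The ransomware example can be sketched as follows. The calendar date is arbitrary and the function names are hypothetical. The point is that the data lost equals the time since the last good backup, so the backup interval must not exceed the RPO.

```python
from datetime import datetime, timedelta


def data_lost(last_backup: datetime, incident: datetime) -> timedelta:
    """Everything written after the last good backup is unrecoverable."""
    return incident - last_backup


def backup_interval_meets_rpo(interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss equals the backup interval, so it must fit the RPO."""
    return interval <= rpo


last_backup = datetime(2024, 5, 1, 0, 0)   # midnight backup
incident = datetime(2024, 5, 1, 6, 0)      # ransomware hits at 6:00 in the morning
lost = data_lost(last_backup, incident)

rpo = timedelta(hours=6)
nightly_ok = backup_interval_meets_rpo(timedelta(hours=24), rpo)
six_hourly_ok = backup_interval_meets_rpo(timedelta(hours=6), rpo)

print(f"Data lost in this incident: {lost}")
print(f"Nightly backups meet a 6-hour RPO: {nightly_ok}")
print(f"Six-hourly backups meet a 6-hour RPO: {six_hourly_ok}")
```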

390
00:12:38,400 --> 00:12:41,040
So remember, when it comes to disaster recovery metrics,

391
00:12:41,040 --> 00:12:43,650
we have five key metrics that you have to consider.

392
00:12:43,650 --> 00:12:46,440
The mean time between failures, the mean time to repair,

393
00:12:46,440 --> 00:12:48,150
the maximum tolerable downtime,

394
00:12:48,150 --> 00:12:49,470
the recovery point objective,

395
00:12:49,470 --> 00:12:51,210
and the recovery time objective.

396
00:12:51,210 --> 00:12:53,730
The mean time between failures, or the MTBF,

397
00:12:53,730 --> 00:12:55,500
is a metric that measures the average time

398
00:12:55,500 --> 00:12:58,320
between when failures are going to occur on a device.

399
00:12:58,320 --> 00:13:00,630
The mean time to repair, or MTTR,

400
00:13:00,630 --> 00:13:03,000
is a metric that's used to measure the average time it takes

401
00:13:03,000 --> 00:13:05,190
to repair a network device when it breaks.

402
00:13:05,190 --> 00:13:07,770
The maximum tolerable downtime, or MTD,

403
00:13:07,770 --> 00:13:08,940
is the longest period of time

404
00:13:08,940 --> 00:13:10,410
that a business can be inoperable

405
00:13:10,410 --> 00:13:13,050
without causing irrevocable business failure.

406
00:13:13,050 --> 00:13:15,870
The recovery time objective, also known as the RTO,

407
00:13:15,870 --> 00:13:17,700
is the length of time it takes after an event

408
00:13:17,700 --> 00:13:20,520
to resume your normal business operations and activities.

409
00:13:20,520 --> 00:13:23,100
And the recovery point objective, or RPO,

410
00:13:23,100 --> 00:13:24,720
is defined as the longest period of time

411
00:13:24,720 --> 00:13:25,553
that an organization

412
00:13:25,553 --> 00:13:27,933
can tolerate lost data being unrecoverable.