1 00:00:00,080 --> 00:00:00,960 In this lesson, 2 00:00:00,960 --> 00:00:03,570 we're going to discuss disaster recovery metrics. 3 00:00:03,570 --> 00:00:06,450 Disaster recovery metrics are quantifiable standards 4 00:00:06,450 --> 00:00:09,150 that are used to plan and evaluate an organization's ability 5 00:00:09,150 --> 00:00:12,480 to recover IT operations following a disruptive event. 6 00:00:12,480 --> 00:00:14,310 These metrics are focused on the measurements 7 00:00:14,310 --> 00:00:15,270 inside of your business 8 00:00:15,270 --> 00:00:16,590 and your enterprise network 9 00:00:16,590 --> 00:00:18,780 that are going to be used to identify organizational risks 10 00:00:18,780 --> 00:00:19,710 and determine their effect 11 00:00:19,710 --> 00:00:22,230 on your ongoing mission-critical operations. 12 00:00:22,230 --> 00:00:23,970 These recovery metrics can also be used 13 00:00:23,970 --> 00:00:25,620 to measure our levels of availability 14 00:00:25,620 --> 00:00:27,690 and how quickly we can restore our services 15 00:00:27,690 --> 00:00:29,160 if they're being negatively affected 16 00:00:29,160 --> 00:00:31,260 by some kind of an incident or event. 17 00:00:31,260 --> 00:00:33,840 Now availability is measured in what we call uptime 18 00:00:33,840 --> 00:00:36,180 or by how many minutes or hours you are up. 19 00:00:36,180 --> 00:00:38,340 And often this will be shown as a percentage 20 00:00:38,340 --> 00:00:39,690 of how many minutes you are up 21 00:00:39,690 --> 00:00:41,040 divided by the total number of minutes 22 00:00:41,040 --> 00:00:42,870 available during that period. 23 00:00:42,870 --> 00:00:44,340 We try to maintain what's called 24 00:00:44,340 --> 00:00:45,990 the five nines of availability 25 00:00:45,990 --> 00:00:47,970 in most of our commercial-based networks.
26 00:00:47,970 --> 00:00:49,080 And this is really hard, 27 00:00:49,080 --> 00:00:51,210 because when we talk about five nines of availability, 28 00:00:51,210 --> 00:00:54,390 we are talking about 99.999% uptime 29 00:00:54,390 --> 00:00:57,420 or a maximum of five minutes of downtime per year, 30 00:00:57,420 --> 00:00:59,520 which is not a whole lot of downtime. 31 00:00:59,520 --> 00:01:01,020 In some cloud-based networks, 32 00:01:01,020 --> 00:01:05,910 they aim for six nines of availability, which is 99.9999%, 33 00:01:05,910 --> 00:01:09,330 and this equates to just 31 seconds of downtime per year. 34 00:01:09,330 --> 00:01:10,650 So as you can imagine, 35 00:01:10,650 --> 00:01:12,750 I'm probably going to need more than 31 seconds 36 00:01:12,750 --> 00:01:14,143 or even five minutes of downtime per year 37 00:01:14,143 --> 00:01:16,770 to be able to do things like patching my servers 38 00:01:16,770 --> 00:01:18,210 or installing a new storage device 39 00:01:18,210 --> 00:01:20,250 or to put in a new router or switch. 40 00:01:20,250 --> 00:01:23,250 So how can I maintain that high level of availability 41 00:01:23,250 --> 00:01:25,140 at five nines or six nines 42 00:01:25,140 --> 00:01:26,670 if I need to do that type of maintenance 43 00:01:26,670 --> 00:01:28,200 and recovery procedures? 44 00:01:28,200 --> 00:01:30,330 Well, I'm going to do that by designing my networks 45 00:01:30,330 --> 00:01:33,060 to be highly available and highly reliable. 46 00:01:33,060 --> 00:01:34,770 Now whenever I talk about availability, 47 00:01:34,770 --> 00:01:36,510 I'm really concerned with my network's ability 48 00:01:36,510 --> 00:01:38,130 to be up and operational. 
49 00:01:38,130 --> 00:01:39,750 But when I talk about reliability, 50 00:01:39,750 --> 00:01:41,790 I'm more concerned about not dropping packets 51 00:01:41,790 --> 00:01:44,070 inside of the network because I want to ensure that I'm up 52 00:01:44,070 --> 00:01:47,370 and also effectively passing data across that network. 53 00:01:47,370 --> 00:01:49,650 So if your network is highly available 54 00:01:49,650 --> 00:01:50,820 but it's not reliable, 55 00:01:50,820 --> 00:01:52,410 that's not a very good network. 56 00:01:52,410 --> 00:01:54,840 But conversely, if you have a highly reliable network 57 00:01:54,840 --> 00:01:56,520 but it's not a highly available one, 58 00:01:56,520 --> 00:01:57,600 that's not good either. 59 00:01:57,600 --> 00:01:59,430 Because even if it's the most reliable network 60 00:01:59,430 --> 00:02:00,600 in the entire world, 61 00:02:00,600 --> 00:02:02,610 if it's only going to be up 20 minutes per year, 62 00:02:02,610 --> 00:02:03,750 then it's not going to be considered 63 00:02:03,750 --> 00:02:04,950 a very good network 64 00:02:04,950 --> 00:02:07,500 because it's not usable for business operations. 65 00:02:07,500 --> 00:02:10,380 So what we need to do is balance these two extremes 66 00:02:10,380 --> 00:02:12,210 and aim for good enough in both areas 67 00:02:12,210 --> 00:02:13,950 to meet our business needs.
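To make the five nines and six nines figures from a moment ago concrete, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function names are made up for illustration only; they don't come from any particular tool.

```python
# Minutes in a year, assuming a 365-day year (an assumption for this sketch).
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_percent(minutes_up, total_minutes):
    """Uptime as a percentage: minutes up divided by total minutes in the period."""
    return 100.0 * minutes_up / total_minutes


def downtime_budget_minutes(availability_pct, total_minutes=MINUTES_PER_YEAR):
    """Maximum downtime (in minutes) allowed by a given availability target."""
    return total_minutes * (1.0 - availability_pct / 100.0)


# Five nines works out to a little over five minutes per year.
five_nines_minutes = downtime_budget_minutes(99.999)

# Six nines works out to roughly half a minute per year.
six_nines_seconds = downtime_budget_minutes(99.9999) * 60

print(f"Five nines: {five_nines_minutes:.2f} minutes of downtime per year")
print(f"Six nines:  {six_nines_seconds:.1f} seconds of downtime per year")
```

Running the same function against your own target, say 99.9%, is a quick way to see how much maintenance time a given availability goal really leaves you.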
68 00:02:13,950 --> 00:02:15,750 So as we go through this lesson, 69 00:02:15,750 --> 00:02:17,460 we're going to discuss some key metrics 70 00:02:17,460 --> 00:02:18,840 that you should be aware of, 71 00:02:18,840 --> 00:02:20,790 and we're going to talk about how you can measure things 72 00:02:20,790 --> 00:02:22,230 within your own organization 73 00:02:22,230 --> 00:02:24,030 like the mean time between failures, 74 00:02:24,030 --> 00:02:27,150 the mean time to repair, the maximum tolerable downtime, 75 00:02:27,150 --> 00:02:28,320 the recovery point objective, 76 00:02:28,320 --> 00:02:30,540 and the recovery time objective. 77 00:02:30,540 --> 00:02:33,090 First, we have the mean time to repair. 78 00:02:33,090 --> 00:02:35,400 Now the mean time to repair, or MTTR, 79 00:02:35,400 --> 00:02:37,500 is a metric that's used to measure the average time 80 00:02:37,500 --> 00:02:40,170 it takes to repair a network device when it breaks. 81 00:02:40,170 --> 00:02:42,630 After all, everything in our networks and organizations 82 00:02:42,630 --> 00:02:44,190 will break eventually. 83 00:02:44,190 --> 00:02:45,285 So when a device breaks, 84 00:02:45,285 --> 00:02:48,690 how long does it take for you and your team to fix it? 85 00:02:48,690 --> 00:02:50,100 Now, based on that, 86 00:02:50,100 --> 00:02:52,050 how much downtime did you actually experience 87 00:02:52,050 --> 00:02:53,310 during the last year? 88 00:02:53,310 --> 00:02:54,750 This is really what we're trying to measure here 89 00:02:54,750 --> 00:02:56,580 with the mean time to repair. 90 00:02:56,580 --> 00:02:59,400 Second, we have the mean time between failures. 91 00:02:59,400 --> 00:03:02,280 Now the mean time between failures, or the MTBF, 92 00:03:02,280 --> 00:03:03,990 is a metric that measures the average time 93 00:03:03,990 --> 00:03:06,720 between when failures occur on a given device.
94 00:03:06,720 --> 00:03:07,710 Now for most people, 95 00:03:07,710 --> 00:03:10,350 these two terms can be a little confusing at first. 96 00:03:10,350 --> 00:03:12,240 So let's consider an example of both of these 97 00:03:12,240 --> 00:03:14,280 by considering some kind of system failure 98 00:03:14,280 --> 00:03:16,710 and how it's going to be resolved over time. 99 00:03:16,710 --> 00:03:18,120 Now if you have a system failure 100 00:03:18,120 --> 00:03:20,040 and then you resume normal operations, 101 00:03:20,040 --> 00:03:21,630 the amount of time between the failure 102 00:03:21,630 --> 00:03:23,400 and the resumption of normal operations 103 00:03:23,400 --> 00:03:25,320 will be considered to be the time to repair 104 00:03:25,320 --> 00:03:27,570 this particular incident or event. 105 00:03:27,570 --> 00:03:28,530 On the timeline, 106 00:03:28,530 --> 00:03:30,300 this is shown as the first stop sign 107 00:03:30,300 --> 00:03:32,400 on the left side of the timeline. 108 00:03:32,400 --> 00:03:34,650 Now if I collect all those time to repair metrics 109 00:03:34,650 --> 00:03:35,970 and I add them all together 110 00:03:35,970 --> 00:03:37,680 and then create an average from them, 111 00:03:37,680 --> 00:03:39,810 this creates what we call the mean time to repair, 112 00:03:39,810 --> 00:03:42,090 or the MTTR, for my organization's network 113 00:03:42,090 --> 00:03:43,770 or individual devices, 114 00:03:43,770 --> 00:03:45,510 depending on which metrics we were collecting 115 00:03:45,510 --> 00:03:47,760 and using as part of that average. 116 00:03:47,760 --> 00:03:49,230 Now on the failure side of things, 117 00:03:49,230 --> 00:03:50,550 we're going to need to measure the time 118 00:03:50,550 --> 00:03:52,170 between one failure occurring, 119 00:03:52,170 --> 00:03:53,460 us fixing that failure, 120 00:03:53,460 --> 00:03:55,470 and then the next failure that occurs.
121 00:03:55,470 --> 00:03:57,300 This becomes the time between failures, 122 00:03:57,300 --> 00:03:58,830 and when I average them all together, 123 00:03:58,830 --> 00:04:02,670 it becomes the mean time between failures or the MTBF. 124 00:04:02,670 --> 00:04:04,680 Now hopefully you can see the difference here. 125 00:04:04,680 --> 00:04:06,150 With the mean time to repair, 126 00:04:06,150 --> 00:04:07,530 what we want is a low number 127 00:04:07,530 --> 00:04:09,510 because we want to be able to fix things quickly 128 00:04:09,510 --> 00:04:11,310 and get ourselves back online. 129 00:04:11,310 --> 00:04:13,200 So the lower the mean time to repair, 130 00:04:13,200 --> 00:04:14,310 the better things are going to be 131 00:04:14,310 --> 00:04:16,440 in terms of our network's availability. 132 00:04:16,440 --> 00:04:17,273 On the other hand, 133 00:04:17,273 --> 00:04:19,290 if you're talking about the mean time between failures, 134 00:04:19,290 --> 00:04:21,750 then you want the number to be as large as possible 135 00:04:21,750 --> 00:04:23,040 because the larger number 136 00:04:23,040 --> 00:04:25,920 means that the mean time between failures becomes longer, 137 00:04:25,920 --> 00:04:27,540 and this means that your network's availability 138 00:04:27,540 --> 00:04:29,730 and reliability will increase. 139 00:04:29,730 --> 00:04:32,640 After all, we want to buy and operate reliable equipment 140 00:04:32,640 --> 00:04:34,200 with a lower number of failures. 141 00:04:34,200 --> 00:04:36,420 So if there's more time in between our failures, 142 00:04:36,420 --> 00:04:38,730 we can call that piece of equipment more reliable 143 00:04:38,730 --> 00:04:40,020 than an equivalent piece of equipment 144 00:04:40,020 --> 00:04:41,970 that's going to fail more often. 145 00:04:41,970 --> 00:04:45,030 Now third, we have maximum tolerable downtime.
146 00:04:45,030 --> 00:04:47,670 The maximum tolerable downtime, or MTD, 147 00:04:47,670 --> 00:04:48,960 is the longest period of time 148 00:04:48,960 --> 00:04:50,340 a business can be inoperable 149 00:04:50,340 --> 00:04:52,980 without causing irrevocable business failures. 150 00:04:52,980 --> 00:04:55,260 Essentially, the maximum tolerable downtime 151 00:04:55,260 --> 00:04:57,270 will answer a simple question for us. 152 00:04:57,270 --> 00:04:59,490 How long can our organization's network be down 153 00:04:59,490 --> 00:05:01,140 without going out of business? 154 00:05:01,140 --> 00:05:03,720 Now the maximum tolerable downtime is going to be different 155 00:05:03,720 --> 00:05:05,460 for each organization that you work at, 156 00:05:05,460 --> 00:05:07,297 and it can even be different within different departments 157 00:05:07,297 --> 00:05:09,720 inside of the same organization. 158 00:05:09,720 --> 00:05:11,910 Additionally, in larger organizations, 159 00:05:11,910 --> 00:05:13,230 each of your business processes 160 00:05:13,230 --> 00:05:16,050 can have its own maximum tolerable downtime as well. 161 00:05:16,050 --> 00:05:18,480 For example, some maximum tolerable downtimes 162 00:05:18,480 --> 00:05:19,680 may just be a couple of minutes 163 00:05:19,680 --> 00:05:21,180 for your most critical functions, 164 00:05:21,180 --> 00:05:23,250 while others might be as long as a couple of hours 165 00:05:23,250 --> 00:05:25,830 or even days for more administrative functions. 166 00:05:25,830 --> 00:05:27,840 This really does depend on your organization, 167 00:05:27,840 --> 00:05:29,490 and you're going to have to figure out for yourself 168 00:05:29,490 --> 00:05:31,140 what that specific target or goal 169 00:05:31,140 --> 00:05:33,330 for your maximum tolerable downtime will be 170 00:05:33,330 --> 00:05:36,150 based on working with your organization's key stakeholders.
171 00:05:36,150 --> 00:05:37,920 But for now, I want you to simply remember 172 00:05:37,920 --> 00:05:39,780 that the maximum tolerable downtime 173 00:05:39,780 --> 00:05:41,730 is really the upper limit on the recovery time 174 00:05:41,730 --> 00:05:44,400 that the system and the asset owners must resume 175 00:05:44,400 --> 00:05:46,260 your normal operations within. 176 00:05:46,260 --> 00:05:47,820 So keeping that in mind, 177 00:05:47,820 --> 00:05:49,860 let's take a look at my own business. 178 00:05:49,860 --> 00:05:51,660 Now one of the maximum tolerable downtimes 179 00:05:51,660 --> 00:05:53,220 we've established at Dion Training 180 00:05:53,220 --> 00:05:54,690 is focused on our response time 181 00:05:54,690 --> 00:05:56,490 to our students when they ask a question 182 00:05:56,490 --> 00:05:58,830 by emailing support at diontraining.com 183 00:05:58,830 --> 00:06:00,120 or by posting a question 184 00:06:00,120 --> 00:06:02,670 inside the Q&A section of the course. 185 00:06:02,670 --> 00:06:04,200 Our maximum tolerable downtime 186 00:06:04,200 --> 00:06:07,620 for this area in our business has been set at 12 hours. 187 00:06:07,620 --> 00:06:09,231 Now why is our MTD 12 hours 188 00:06:09,231 --> 00:06:11,040 instead of something like five minutes 189 00:06:11,040 --> 00:06:12,600 or something short like that? 190 00:06:12,600 --> 00:06:13,830 This is a great question. 191 00:06:13,830 --> 00:06:14,940 And when we started to analyze 192 00:06:14,940 --> 00:06:16,170 the function within our business, 193 00:06:16,170 --> 00:06:18,750 we started to think about it from our students' perspective. 194 00:06:18,750 --> 00:06:20,730 If I was a student and I asked a question, 195 00:06:20,730 --> 00:06:22,410 what is the longest I would want to go 196 00:06:22,410 --> 00:06:24,270 before I get a response to my question? 197 00:06:24,270 --> 00:06:26,730 And the answer for us was somewhere around 24 hours.
198 00:06:26,730 --> 00:06:30,600 So we cut that in half and made it 12 hours for our MTD. 199 00:06:30,600 --> 00:06:31,890 Now this was a balanced decision 200 00:06:31,890 --> 00:06:34,470 to balance the cost of support for our student questions 201 00:06:34,470 --> 00:06:36,720 versus how quickly we could get them answered. 202 00:06:36,720 --> 00:06:39,060 Sure, we could hire an entire team of 100 people 203 00:06:39,060 --> 00:06:41,370 to do nothing but answer student questions all day, 204 00:06:41,370 --> 00:06:43,710 but that would cost us around $5 million per year 205 00:06:43,710 --> 00:06:44,700 in labor costs. 206 00:06:44,700 --> 00:06:46,380 And to support that kind of labor budget, 207 00:06:46,380 --> 00:06:48,030 we would have to increase our course prices 208 00:06:48,030 --> 00:06:51,090 to at least five or 10 times their current prices. 209 00:06:51,090 --> 00:06:52,740 So when we surveyed our students 210 00:06:52,740 --> 00:06:54,750 in regards to the level of service they expected 211 00:06:54,750 --> 00:06:56,640 and the prices they were willing to pay for that, 212 00:06:56,640 --> 00:06:59,340 we found that around 12 hours to 24 hours 213 00:06:59,340 --> 00:07:00,690 was a reasonable compromise 214 00:07:00,690 --> 00:07:02,190 in terms of speed of the response 215 00:07:02,190 --> 00:07:04,740 and the cost to deliver those responses to you. 216 00:07:04,740 --> 00:07:07,650 We could afford to provide answers within 12 to 24 hours 217 00:07:07,650 --> 00:07:08,730 and we could do that with a team 218 00:07:08,730 --> 00:07:10,920 of about four to seven people most of the time, 219 00:07:10,920 --> 00:07:13,260 which makes it a more cost-effective option for us, 220 00:07:13,260 --> 00:07:14,970 and in turn, for you, our students, 221 00:07:14,970 --> 00:07:17,100 because we pass the savings on to you.
222 00:07:17,100 --> 00:07:18,570 Now to accommodate that, 223 00:07:18,570 --> 00:07:20,670 we actually have split our student support team members 224 00:07:20,670 --> 00:07:22,350 into two different teams. 225 00:07:22,350 --> 00:07:23,340 One half of our team 226 00:07:23,340 --> 00:07:24,973 lives and works over in the Philippines, 227 00:07:24,973 --> 00:07:26,340 and the other half of the team 228 00:07:26,340 --> 00:07:28,020 works here in the United States. 229 00:07:28,020 --> 00:07:29,760 This means that both teams are offset 230 00:07:29,760 --> 00:07:31,260 by about 11 to 12 hours 231 00:07:31,260 --> 00:07:33,360 depending on the time of year that we're dealing with. 232 00:07:33,360 --> 00:07:35,280 So if it's daytime in the Philippines, 233 00:07:35,280 --> 00:07:37,230 it's usually nighttime here in the United States; 234 00:07:37,230 --> 00:07:38,670 and when it's daytime here in the United States, 235 00:07:38,670 --> 00:07:40,530 it's usually nighttime in the Philippines. 236 00:07:40,530 --> 00:07:42,960 And so we can cover almost 24 hours a day 237 00:07:42,960 --> 00:07:44,550 by using these two locations 238 00:07:44,550 --> 00:07:46,500 because each side will work eight hours 239 00:07:46,500 --> 00:07:47,520 and they'll do it during the day 240 00:07:47,520 --> 00:07:49,170 in their particular country. 241 00:07:49,170 --> 00:07:50,970 Now you may have noticed there's a little bit of time 242 00:07:50,970 --> 00:07:52,890 that isn't covered by both of these teams 243 00:07:52,890 --> 00:07:55,050 because each of them is only working eight hours, 244 00:07:55,050 --> 00:07:57,330 but they have to cover a 12-hour period. 
245 00:07:57,330 --> 00:07:59,490 And so what we did was we have another person 246 00:07:59,490 --> 00:08:00,960 who works over in Egypt, 247 00:08:00,960 --> 00:08:03,270 and they work right in the middle of that time 248 00:08:03,270 --> 00:08:04,830 so they can cover that eight-hour block, 249 00:08:04,830 --> 00:08:06,090 the extra four hours from the US, 250 00:08:06,090 --> 00:08:07,770 and the other four hours from the Philippines 251 00:08:07,770 --> 00:08:09,330 during their working hours in Egypt 252 00:08:09,330 --> 00:08:10,410 to make sure our student questions 253 00:08:10,410 --> 00:08:12,270 are being answered effectively. 254 00:08:12,270 --> 00:08:14,130 Now another benefit of doing this kind of a setup 255 00:08:14,130 --> 00:08:16,530 where you're splitting teams across multiple locations 256 00:08:16,530 --> 00:08:18,750 is that they're now geographically distinct. 257 00:08:18,750 --> 00:08:19,890 And so if there's a big storm 258 00:08:19,890 --> 00:08:21,930 that's affecting power in the United States, 259 00:08:21,930 --> 00:08:23,220 that isn't going to be a problem for us 260 00:08:23,220 --> 00:08:24,330 because the people in the Philippines 261 00:08:24,330 --> 00:08:26,310 can still work and take care of it. 262 00:08:26,310 --> 00:08:28,560 Alternatively, if there is a big typhoon or flood 263 00:08:28,560 --> 00:08:30,840 that causes issues for our teams in the Philippines, 264 00:08:30,840 --> 00:08:32,309 it shouldn't affect our teams located here 265 00:08:32,309 --> 00:08:33,179 in the United States 266 00:08:33,179 --> 00:08:34,917 and they can cover their workload 267 00:08:34,917 --> 00:08:36,480 to be able to get answers to our students. 
268 00:08:36,480 --> 00:08:38,549 By having this type of geographic diversity, 269 00:08:38,549 --> 00:08:40,919 it's going to allow us to maintain that 12-hour response time 270 00:08:40,919 --> 00:08:42,851 because even if one team is offline their entire shift 271 00:08:42,851 --> 00:08:44,580 because of a disaster, 272 00:08:44,580 --> 00:08:46,530 they're only going to be gone for eight hours 273 00:08:46,530 --> 00:08:48,120 and we should be able to get their questions answered 274 00:08:48,120 --> 00:08:49,560 before that 12-hour mark 275 00:08:49,560 --> 00:08:50,790 because the person in Egypt 276 00:08:50,790 --> 00:08:53,280 or the person in either the Philippines or the United States 277 00:08:53,280 --> 00:08:54,990 can take care of those things. 278 00:08:54,990 --> 00:08:56,040 And so at the end of the day, 279 00:08:56,040 --> 00:08:58,140 this really became a risk management decision for us 280 00:08:58,140 --> 00:09:00,240 as well as a cost-benefit analysis 281 00:09:00,240 --> 00:09:01,650 to be able to provide good service 282 00:09:01,650 --> 00:09:03,840 at a good price to all of our students. 283 00:09:03,840 --> 00:09:05,400 Now again, you're going to have to figure out 284 00:09:05,400 --> 00:09:07,590 what your maximum tolerable downtime is 285 00:09:07,590 --> 00:09:09,390 inside your own company or organization 286 00:09:09,390 --> 00:09:10,500 out in the real world, 287 00:09:10,500 --> 00:09:11,760 and this will all be a process 288 00:09:11,760 --> 00:09:14,280 of designing your risk management plan to be able to support 289 00:09:14,280 --> 00:09:16,830 that maximum tolerable downtime as well. 290 00:09:16,830 --> 00:09:19,590 Fourth, we need to cover the recovery time objective.
291 00:09:19,590 --> 00:09:22,500 The recovery time objective, also known as the RTO, 292 00:09:22,500 --> 00:09:23,730 is the length of time it takes 293 00:09:23,730 --> 00:09:24,840 after an event to resume 294 00:09:24,840 --> 00:09:27,270 your normal business operations and activities. 295 00:09:27,270 --> 00:09:29,550 When you start thinking about recovery time objective, 296 00:09:29,550 --> 00:09:30,810 I really want you to think about the fact 297 00:09:30,810 --> 00:09:33,360 that something went down, like you lost power, 298 00:09:33,360 --> 00:09:34,297 and now you have to ask yourself, 299 00:09:34,297 --> 00:09:36,720 "How quickly do we need to get it back online?" 300 00:09:36,720 --> 00:09:39,597 In the case of power, we have a 60-second recovery time for power. 301 00:09:39,597 --> 00:09:41,430 And if we want to make sure our power is back up 302 00:09:41,430 --> 00:09:43,500 and online within 60 seconds, can we do that? 303 00:09:43,500 --> 00:09:44,610 Is that achievable? 304 00:09:44,610 --> 00:09:45,660 Well, yes it is. 305 00:09:45,660 --> 00:09:47,370 If you have a backup diesel generator, 306 00:09:47,370 --> 00:09:50,010 it's going to turn on in about 30 to 45 seconds, 307 00:09:50,010 --> 00:09:51,780 and by 45 to 60 seconds, 308 00:09:51,780 --> 00:09:53,940 power will be transferred onto that diesel generator 309 00:09:53,940 --> 00:09:55,650 and we will be fully recovered. 310 00:09:55,650 --> 00:09:58,770 Now if I wasn't happy with that 45 or 60 seconds 311 00:09:58,770 --> 00:10:00,990 and I wanted to make my recovery time zero, 312 00:10:00,990 --> 00:10:02,370 can I achieve that? 313 00:10:02,370 --> 00:10:03,810 Well, yes, we still can.
314 00:10:03,810 --> 00:10:05,370 And that's one of the reasons why I installed 315 00:10:05,370 --> 00:10:08,010 a battery backup system because if power goes away, 316 00:10:08,010 --> 00:10:10,380 those batteries provide instant power to our servers 317 00:10:10,380 --> 00:10:11,520 and our networking equipment. 318 00:10:11,520 --> 00:10:14,460 So using that, we're able to hit a recovery time objective 319 00:10:14,460 --> 00:10:16,470 of zero in terms of power 320 00:10:16,470 --> 00:10:18,930 if we're using a battery backup based system. 321 00:10:18,930 --> 00:10:20,035 Now the overall full restore 322 00:10:20,035 --> 00:10:22,230 isn't what we're really talking about here, 323 00:10:22,230 --> 00:10:24,870 but instead we're talking about the recovery time objective 324 00:10:24,870 --> 00:10:26,730 to ensure that the operations can continue 325 00:10:26,730 --> 00:10:28,920 and the services are back up and running. 326 00:10:28,920 --> 00:10:30,840 Here, we're not saying that we need to wait for the power 327 00:10:30,840 --> 00:10:32,820 to be fully restored by the power grid again 328 00:10:32,820 --> 00:10:34,530 from our local electric company. 329 00:10:34,530 --> 00:10:36,090 Instead, we just need to make sure 330 00:10:36,090 --> 00:10:37,257 we can recover our business 331 00:10:37,257 --> 00:10:39,510 and we can make sure we're continuing to operate. 332 00:10:39,510 --> 00:10:42,270 And in our case, that can be done on battery, on solar, 333 00:10:42,270 --> 00:10:45,120 or on a generator within zero to 60 seconds, 334 00:10:45,120 --> 00:10:48,060 and that means we can meet our recovery time objectives. 335 00:10:48,060 --> 00:10:50,730 Fifth, we have the recovery point objective. 
336 00:10:50,730 --> 00:10:52,973 Now the recovery point objective or RPO 337 00:10:52,973 --> 00:10:55,440 is going to be defined as the longest period of time 338 00:10:55,440 --> 00:10:56,273 that an organization 339 00:10:56,273 --> 00:10:58,890 can tolerate lost data being unrecoverable. 340 00:10:58,890 --> 00:11:00,570 Now the way I like to think about the RPO 341 00:11:00,570 --> 00:11:02,070 is to think about ransomware. 342 00:11:02,070 --> 00:11:03,780 If you have ransomware in your system, 343 00:11:03,780 --> 00:11:05,550 it's going to attempt to encrypt all your data 344 00:11:05,550 --> 00:11:06,870 and all of your files. 345 00:11:06,870 --> 00:11:08,820 Now you've got a couple of choices here. 346 00:11:08,820 --> 00:11:11,130 You can pay the ransom, which we never recommend; 347 00:11:11,130 --> 00:11:12,900 you could try to crack the ransomware key, 348 00:11:12,900 --> 00:11:14,850 which could take you days, weeks, months, 349 00:11:14,850 --> 00:11:17,040 or even years depending on how strong it is; 350 00:11:17,040 --> 00:11:19,590 or you can actually format and wipe that system 351 00:11:19,590 --> 00:11:22,080 and then recover from a known good backup. 352 00:11:22,080 --> 00:11:24,540 Now most of the time, we're going to choose option three 353 00:11:24,540 --> 00:11:25,650 and we'll format the system 354 00:11:25,650 --> 00:11:27,570 and recover from a known good backup. 355 00:11:27,570 --> 00:11:30,570 So let's assume we went ahead and chose that option. 356 00:11:30,570 --> 00:11:33,090 Well, if we did that, what is the longest period of time 357 00:11:33,090 --> 00:11:34,843 that we can tolerate data loss? 358 00:11:34,843 --> 00:11:36,300 Now what I mean by this 359 00:11:36,300 --> 00:11:38,190 is that when you're recovering from a backup, 360 00:11:38,190 --> 00:11:40,650 that backup is a certain amount of time old. 
361 00:11:40,650 --> 00:11:42,660 And that means we are always lagging behind 362 00:11:42,660 --> 00:11:44,190 what the current data on the system was 363 00:11:44,190 --> 00:11:45,720 when an event happens. 364 00:11:45,720 --> 00:11:47,820 That data could be several hours old 365 00:11:47,820 --> 00:11:50,040 because it may have been midnight when we last backed it up, 366 00:11:50,040 --> 00:11:51,600 or it could be several days old 367 00:11:51,600 --> 00:11:54,210 if we only back up once a week or once a month. 368 00:11:54,210 --> 00:11:56,490 And if this ransomware hits you at 6:00 in the morning 369 00:11:56,490 --> 00:11:58,680 but you're using an every-midnight backup time, 370 00:11:58,680 --> 00:12:01,170 that means you have about six hours' worth of lost data 371 00:12:01,170 --> 00:12:02,760 because you don't have a backup going 372 00:12:02,760 --> 00:12:05,010 after midnight of that particular day, 373 00:12:05,010 --> 00:12:06,810 and that means you have about six hours of data 374 00:12:06,810 --> 00:12:07,830 that's now encrypted 375 00:12:07,830 --> 00:12:08,850 and you're not going to be able to get it back 376 00:12:08,850 --> 00:12:10,890 when this ransomware attack happens. 377 00:12:10,890 --> 00:12:12,030 Now this is what we're talking about 378 00:12:12,030 --> 00:12:14,370 when we talk about the recovery point objective. 379 00:12:14,370 --> 00:12:16,980 That six hours is going to be the lost period of time 380 00:12:16,980 --> 00:12:18,420 where you can't recover the data 381 00:12:18,420 --> 00:12:21,420 because you're not performing backups during that timeframe. 382 00:12:21,420 --> 00:12:23,070 So you have to keep this in mind 383 00:12:23,070 --> 00:12:24,750 when you're designing your systems.
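The midnight backup and 6:00 a.m. ransomware example can be sketched in a few lines. This is only an illustration: the dates are made up, and the helper names are hypothetical rather than part of any backup product. The worst case it models is an incident landing just before the next scheduled backup, in which case you lose roughly one full backup interval of data.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, incident):
    """Data written after the last good backup is what you cannot restore."""
    return incident - last_backup


def schedule_meets_rpo(backup_interval, rpo):
    """Worst case, an incident hits just before the next backup runs,
    so the data lost is about one full backup interval."""
    return backup_interval <= rpo


# Hypothetical timestamps: backup at midnight, ransomware at 6:00 a.m.
last_backup = datetime(2024, 1, 1, 0, 0)
incident = datetime(2024, 1, 1, 6, 0)

lost = data_loss_window(last_backup, incident)
print(f"Unrecoverable data window: {lost}")

# A nightly backup cannot satisfy a six-hour RPO, but a six-hour schedule can.
print(schedule_meets_rpo(timedelta(hours=24), timedelta(hours=6)))
print(schedule_meets_rpo(timedelta(hours=6), timedelta(hours=6)))
```

The design point is that the RPO is checked against the backup interval, not against any one incident: a single lucky incident right after a backup tells you nothing about the worst case.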
384 00:12:24,750 --> 00:12:26,730 If you have an RPO of six hours, 385 00:12:26,730 --> 00:12:29,520 that means you need to run backups at least every six hours 386 00:12:29,520 --> 00:12:31,620 or less to ensure that all of your data, 387 00:12:31,620 --> 00:12:34,140 except the last six hours of the data, can be recovered 388 00:12:34,140 --> 00:12:35,730 if you need to restore from a backup, 389 00:12:35,730 --> 00:12:38,400 and that's what we call the recovery point objective. 390 00:12:38,400 --> 00:12:41,040 So remember, when it comes to disaster recovery metrics, 391 00:12:41,040 --> 00:12:43,650 we have five key metrics that you have to consider. 392 00:12:43,650 --> 00:12:46,440 The mean time between failures, the mean time to repair, 393 00:12:46,440 --> 00:12:48,150 the maximum tolerable downtime, 394 00:12:48,150 --> 00:12:49,470 the recovery point objective, 395 00:12:49,470 --> 00:12:51,210 and the recovery time objective. 396 00:12:51,210 --> 00:12:53,730 The mean time between failures, or the MTBF, 397 00:12:53,730 --> 00:12:55,500 is a metric that measures the average time 398 00:12:55,500 --> 00:12:58,320 between when failures are going to occur on a device. 399 00:12:58,320 --> 00:13:00,630 The mean time to repair, or MTTR, 400 00:13:00,630 --> 00:13:03,000 is a metric that's used to measure the average time it takes 401 00:13:03,000 --> 00:13:05,190 to repair a network device when it breaks. 402 00:13:05,190 --> 00:13:07,770 The maximum tolerable downtime, or MTD, 403 00:13:07,770 --> 00:13:08,940 is the longest period of time 404 00:13:08,940 --> 00:13:10,410 that a business can be inoperable 405 00:13:10,410 --> 00:13:13,050 without causing irrevocable business failure. 406 00:13:13,050 --> 00:13:15,870 The recovery time objective, also known as the RTO, 407 00:13:15,870 --> 00:13:17,700 is the length of time it takes after an event 408 00:13:17,700 --> 00:13:20,520 to resume your normal business operations and activities.
409 00:13:20,520 --> 00:13:23,100 And the recovery point objective, or RPO, 410 00:13:23,100 --> 00:13:24,720 is defined as the longest period of time 411 00:13:24,720 --> 00:13:25,553 that an organization 412 00:13:25,553 --> 00:13:27,933 can tolerate lost data being unrecoverable.
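As a closing illustration of the two averages covered in this lesson, here is a minimal sketch that works out MTTR and MTBF from an incident log. The log entries are invented for the example, the function names are hypothetical, and the sketch assumes that time between failures is counted from the moment service is restored to the moment the next failure starts. Some organizations count from failure to failure instead, so treat that as one possible convention.

```python
from datetime import datetime

# Hypothetical incident log, oldest first: (failure_time, restored_time).
incidents = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),    # 2-hour repair
    (datetime(2024, 1, 11, 2, 0), datetime(2024, 1, 11, 6, 0)),  # 4-hour repair
    (datetime(2024, 1, 31, 6, 0), datetime(2024, 1, 31, 9, 0)),  # 3-hour repair
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def mean_time_to_repair(log):
    """Average of each incident's time to repair (failure to restoration)."""
    repairs = [hours(restored - failed) for failed, restored in log]
    return sum(repairs) / len(repairs)


def mean_time_between_failures(log):
    """Average operating time from one restoration to the next failure
    (an assumed convention for this sketch)."""
    if len(log) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    gaps = [hours(log[i + 1][0] - log[i][1]) for i in range(len(log) - 1)]
    return sum(gaps) / len(gaps)


mttr = mean_time_to_repair(incidents)
mtbf = mean_time_between_failures(incidents)
print(f"MTTR: {mttr:.1f} hours (lower is better)")
print(f"MTBF: {mtbf:.1f} hours (higher is better)")
```

Whether you feed in one device's incidents or the whole network's changes what the averages describe, which matches the point made earlier about choosing which metrics go into the average.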