1
00:00:00,150 --> 00:00:01,740
So disaster recovery

2
00:00:01,740 --> 00:00:04,730
as a solutions architect is super important

3
00:00:04,730 --> 00:00:08,080
and the exam expects you to know about disaster recovery,

4
00:00:08,080 --> 00:00:10,080
and there's a white paper on it, you should read it

5
00:00:10,080 --> 00:00:12,650
but I tried to summarize everything clearly

6
00:00:12,650 --> 00:00:14,830
with graphs and diagrams in this lecture

7
00:00:14,830 --> 00:00:17,070
so you don't have to read it if you don't want to.

8
00:00:17,070 --> 00:00:19,010
But overall, you can expect some questions

9
00:00:19,010 --> 00:00:21,460
on disaster recovery and as a solutions architect

10
00:00:21,460 --> 00:00:23,920
you need to know about disaster recovery anyway.

11
00:00:23,920 --> 00:00:25,350
Don't worry, I tried to make this

12
00:00:25,350 --> 00:00:27,350
as simple as possible for you.

13
00:00:27,350 --> 00:00:28,850
So what is a disaster?

14
00:00:28,850 --> 00:00:31,200
Well it's any event that has a negative impact

15
00:00:31,200 --> 00:00:34,250
on a company's business continuity or finances,

16
00:00:34,250 --> 00:00:37,010
and so disaster recovery is about preparing

17
00:00:37,010 --> 00:00:39,070
and recovering from these disasters.

18
00:00:39,070 --> 00:00:40,840
So what kind of disaster recovery

19
00:00:40,840 --> 00:00:43,780
can we do on AWS or in general?

20
00:00:43,780 --> 00:00:46,260
Well we can do on-premise to on-premise.

21
00:00:46,260 --> 00:00:48,890
That means we have a first data center, maybe in California,

22
00:00:48,890 --> 00:00:50,980
another data center, maybe in Seattle,

23
00:00:50,980 --> 00:00:53,040
and so this is traditional disaster recovery

24
00:00:53,040 --> 00:00:55,180
and it's actually very, very expensive.

25
00:00:55,180 --> 00:00:56,720
Or we can start using the cloud

26
00:00:56,720 --> 00:00:59,200
and do on-premise as a main data center

27
00:00:59,200 --> 00:01:01,450
and then if we have any disaster, use the cloud.

28
00:01:01,450 --> 00:01:03,670
So this is called a hybrid recovery.

29
00:01:03,670 --> 00:01:05,870
Or if you're just all in the cloud

30
00:01:05,870 --> 00:01:08,820
then you can do AWS Cloud Region A to Cloud Region B,

31
00:01:08,820 --> 00:01:12,050
and that would be a full cloud type of disaster recovery.

32
00:01:12,050 --> 00:01:13,500
Now before we do the disaster recovery,

33
00:01:13,500 --> 00:01:14,860
we need to define two key terms,

34
00:01:14,860 --> 00:01:17,490
and you need to understand them from an exam perspective.

35
00:01:17,490 --> 00:01:20,940
The first one is called RPO, recovery point objective,

36
00:01:20,940 --> 00:01:25,190
and the second one is called RTO, recovery time objective.

37
00:01:25,190 --> 00:01:26,330
So remember these two terms

38
00:01:26,330 --> 00:01:28,150
and I'm going to explain them right now.

39
00:01:28,150 --> 00:01:30,410
So what is RPO and RTO?

40
00:01:30,410 --> 00:01:33,470
The first one is the RPO, recovery point objective,

41
00:01:33,470 --> 00:01:35,950
and so this is basically how often

42
00:01:35,950 --> 00:01:40,270
you run backups, how far back in time you can recover.

43
00:01:40,270 --> 00:01:44,030
And when a disaster strikes, basically, the time between

44
00:01:44,030 --> 00:01:47,540
the RPO and the disaster is going to be your data loss.

45
00:01:47,540 --> 00:01:51,040
For example, if you back up data every hour

46
00:01:51,040 --> 00:01:53,820
and a disaster strikes then you can go back in time

47
00:01:53,820 --> 00:01:57,280
for an hour and so you'll have lost one hour of data.

48
00:01:57,280 --> 00:01:59,604
So the RPO, sometimes it can be an hour,

49
00:01:59,604 --> 00:02:01,170
sometimes it can be maybe one minute.

50
00:02:01,170 --> 00:02:02,650
It really depends on our requirements,

51
00:02:02,650 --> 00:02:05,260
but RPO is how much data loss

52
00:02:05,260 --> 00:02:09,360
are you willing to accept in case a disaster happens?

53
00:02:09,360 --> 00:02:10,780
RTO, on the other hand,

54
00:02:10,780 --> 00:02:13,800
is when you recover from your disaster, okay.

55
00:02:13,800 --> 00:02:16,760
And so, between the disaster and the RTO

56
00:02:16,760 --> 00:02:19,550
is the amount of downtime your application has.

57
00:02:19,550 --> 00:02:21,973
So sometimes it's okay to have 24 hours of downtime,

58
00:02:21,973 --> 00:02:23,000
I don't think it is.

59
00:02:23,000 --> 00:02:25,160
Sometimes it's not okay and maybe you need

60
00:02:25,160 --> 00:02:27,630
just one minute of downtime, okay.

61
00:02:27,630 --> 00:02:30,630
So basically optimizing for the RPO and the RTO

62
00:02:30,630 --> 00:02:33,300
does drive some solution architecture decisions,

63
00:02:33,300 --> 00:02:35,900
and obviously the smaller you want these things to be,

64
00:02:35,900 --> 00:02:37,930
usually the higher the cost.
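
To make the two terms concrete, here is a minimal Python sketch, not an official AWS formula. It assumes data loss is measured from the last backup to the disaster and downtime from the disaster to the moment service is restored. The function names and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def loss_and_downtime(last_backup, disaster, service_restored):
    """Return (data_loss, downtime) for one incident.

    data_loss is what RPO bounds: everything written after the last backup.
    downtime is what RTO bounds: how long the application was unavailable.
    """
    data_loss = disaster - last_backup
    downtime = service_restored - disaster
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """True only if both the RPO and the RTO targets were respected."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident: hourly backups, disaster 45 minutes after the last one,
# service back three hours later.
last_backup = datetime(2024, 1, 1, 12, 0)
disaster = datetime(2024, 1, 1, 12, 45)
restored = datetime(2024, 1, 1, 15, 45)

data_loss, downtime = loss_and_downtime(last_backup, disaster, restored)
print("data loss:", data_loss)
print("downtime:", downtime)
print("within RPO=1h / RTO=4h:",
      meets_objectives(data_loss, downtime, timedelta(hours=1), timedelta(hours=4)))
```

With hourly backups the worst-case data loss is one full hour, which is why a tighter RPO usually means more frequent backups or continuous replication.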

65
00:02:37,930 --> 00:02:40,660
So let's talk about disaster recovery strategies.

66
00:02:40,660 --> 00:02:42,640
The first one is backup and restore.

67
00:02:42,640 --> 00:02:44,420
Second one is pilot light,

68
00:02:44,420 --> 00:02:45,890
third one is warm standby,

69
00:02:45,890 --> 00:02:49,250
and fourth one is hot site or multi site approach.

70
00:02:49,250 --> 00:02:53,780
So if we basically rank them, they all have a different RTO.

71
00:02:53,780 --> 00:02:56,330
Backup and restore will have the highest RTO.

72
00:02:56,330 --> 00:02:58,700
Pilot light, then warm standby and multi site,

73
00:02:58,700 --> 00:03:02,080
all these things cost more money but they get a faster RTO.

74
00:03:02,080 --> 00:03:04,460
That means you have less downtime overall.
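
The cost-versus-RTO trade-off between the four strategies can be sketched as a small lookup in Python. The RTO figures and cost ranks below are illustrative assumptions only, not numbers from AWS or from this lecture. Only the ordering (backup and restore slowest and cheapest, multi-site fastest and most expensive) follows the lecture.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta  # assumed, illustrative value
    relative_cost: int      # 1 = cheapest, 4 = most expensive


# Ordered from cheapest/slowest to most expensive/fastest.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=24), 1),
    Strategy("pilot light", timedelta(hours=4), 2),
    Strategy("warm standby", timedelta(minutes=30), 3),
    Strategy("multi-site / hot site", timedelta(minutes=1), 4),
]


def cheapest_strategy(target_rto: timedelta) -> Optional[Strategy]:
    """Pick the cheapest strategy whose assumed RTO fits the target."""
    candidates = [s for s in STRATEGIES if s.typical_rto <= target_rto]
    if not candidates:
        return None  # no strategy in this toy table is fast enough
    return min(candidates, key=lambda s: s.relative_cost)


choice = cheapest_strategy(timedelta(hours=1))
print(choice.name if choice else "no strategy meets the target")
```

In an exam scenario the same reasoning applies: start from the required RTO and RPO, then pick the least expensive strategy that still satisfies them.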

75
00:03:04,460 --> 00:03:07,430
So let's look at all of these one by one in detail

76
00:03:07,430 --> 00:03:09,580
to really understand from an architectural standpoint

77
00:03:09,580 --> 00:03:10,870
what they mean.

78
00:03:10,870 --> 00:03:13,730
Backup and restore has a high RPO.

79
00:03:13,730 --> 00:03:15,300
That means that you have a corporate data center,

80
00:03:15,300 --> 00:03:18,180
for example, and here is your AWS Cloud

81
00:03:18,180 --> 00:03:19,460
and you have an S3 bucket.

82
00:03:19,460 --> 00:03:21,870
And so if you want to backup your data over time,

83
00:03:21,870 --> 00:03:24,510
maybe we can use AWS Storage Gateway

84
00:03:24,510 --> 00:03:26,920
and have some lifecycle policy

85
00:03:26,920 --> 00:03:30,110
put data into Glacier for cost optimization purposes,

86
00:03:30,110 --> 00:03:33,250
or maybe once a week you're sending a ton of data

87
00:03:33,250 --> 00:03:36,760
into Glacier using AWS Snowball.

88
00:03:36,760 --> 00:03:38,420
So here you know if you use Snowball,

89
00:03:38,420 --> 00:03:40,530
your RPO is gonna be about one week

90
00:03:40,530 --> 00:03:42,397
because if your data center burns or whatever

91
00:03:42,397 --> 00:03:44,690
and you lose all your data then you've lost one week of data

92
00:03:44,690 --> 00:03:47,800
because you send that Snowball device once a week.

93
00:03:47,800 --> 00:03:49,920
If you're using the AWS Cloud instead,

94
00:03:49,920 --> 00:03:52,310
maybe EBS volumes, Redshift and RDS,

95
00:03:52,310 --> 00:03:56,060
if you schedule regular snapshots and you back them up

96
00:03:56,060 --> 00:03:59,780
then your RPO is going to be maybe 24 hours or one hour

97
00:03:59,780 --> 00:04:04,270
based on how frequently you create these snapshots.

98
00:04:04,270 --> 00:04:07,110
And then when a disaster strikes

99
00:04:07,110 --> 00:04:09,920
and you need to basically restore all your data

100
00:04:09,920 --> 00:04:13,350
then you can use AMIs to recreate EC2 instances

101
00:04:13,350 --> 00:04:15,520
and spin up your applications or you can restore

102
00:04:15,520 --> 00:04:16,579
straight from a snapshot

103
00:04:16,579 --> 00:04:18,700
and recreate your Amazon RDS database

104
00:04:18,700 --> 00:04:21,300
or your EBS volume or your Redshift, whatever you want.

105
00:04:21,300 --> 00:04:24,080
And so that can take a lot of time as well to restore

106
00:04:24,080 --> 00:04:27,780
this data and so you get a high RTO as well.

107
00:04:27,780 --> 00:04:29,440
But the reason we do this is actually

108
00:04:29,440 --> 00:04:31,610
it's quite cheap to do backup and restore.

109
00:04:31,610 --> 00:04:33,330
We don't manage infrastructure in the middle,

110
00:04:33,330 --> 00:04:35,840
we just recreate infrastructure when we need it,

111
00:04:35,840 --> 00:04:38,925
when we have a disaster and so the only cost we have

112
00:04:38,925 --> 00:04:41,880
is the cost of storing these backups.

113
00:04:41,880 --> 00:04:42,840
So it gives you an idea.

114
00:04:42,840 --> 00:04:45,389
Backup and restore, very easy,

115
00:04:45,389 --> 00:04:48,733
not too expensive, and you get high RPO, high RTO.

116
00:04:49,680 --> 00:04:52,200
The second one is going to be pilot light.

117
00:04:52,200 --> 00:04:55,390
So here with pilot light, a small version of the app

118
00:04:55,390 --> 00:04:57,580
is always running in the cloud,

119
00:04:57,580 --> 00:05:00,080
and so usually that's going to be your critical core,

120
00:05:00,080 --> 00:05:01,940
and this is what is called pilot light.

121
00:05:01,940 --> 00:05:03,890
So it's very similar to backup and restore,

122
00:05:03,890 --> 00:05:06,640
but this time it's faster because your critical systems,

123
00:05:06,640 --> 00:05:07,930
they're already up and running

124
00:05:07,930 --> 00:05:10,050
and so when you do recover, you just need

125
00:05:10,050 --> 00:05:13,880
to add on all the other systems that are not as critical.

126
00:05:13,880 --> 00:05:15,030
So let's have an example.

127
00:05:15,030 --> 00:05:17,890
This is your data center, it has a server and a database,

128
00:05:17,890 --> 00:05:19,320
and this is the AWS Cloud.

129
00:05:19,320 --> 00:05:22,340
Maybe you're going to do continuous data replication

130
00:05:22,340 --> 00:05:24,750
from your critical database into RDS

131
00:05:24,750 --> 00:05:26,290
which is going to be running at any time

132
00:05:26,290 --> 00:05:29,720
so you get an RDS database running and ready to go.

133
00:05:29,720 --> 00:05:33,500
But your EC2 instances, they're not critical just yet.

134
00:05:33,500 --> 00:05:34,960
What's really important is your data,

135
00:05:34,960 --> 00:05:36,310
and so they're not running,

136
00:05:36,310 --> 00:05:38,650
but in case you have a disaster happening,

137
00:05:38,650 --> 00:05:41,430
Route 53 will allow you to fail over from your server

138
00:05:41,430 --> 00:05:44,600
in your data center, recreate that EC2 instance in the cloud

139
00:05:44,600 --> 00:05:45,790
and get it up and running,

140
00:05:45,790 --> 00:05:48,780
but your RDS database is already ready.

141
00:05:48,780 --> 00:05:50,070
So here what do we get?

142
00:05:50,070 --> 00:05:53,690
Well we get a lower RPO, we get a lower RTO

143
00:05:53,690 --> 00:05:54,910
and we still manage costs.

144
00:05:54,910 --> 00:05:56,710
We still have to have an RDS running,

145
00:05:56,710 --> 00:05:59,800
but just the RDS database is running, the rest is not

146
00:05:59,800 --> 00:06:02,621
and your EC2 instances are only brought up,

147
00:06:02,621 --> 00:06:06,060
are created when you do a disaster recovery.

148
00:06:06,060 --> 00:06:08,360
So pilot light is a very popular choice.

149
00:06:08,360 --> 00:06:11,490
Remember, it's only for critical core systems.

150
00:06:11,490 --> 00:06:15,210
Warm standby is when you have a full system up and running

151
00:06:15,210 --> 00:06:17,570
but at a minimum size so it's ready to go,

152
00:06:17,570 --> 00:06:21,220
but upon disaster, we can scale it to production load.

153
00:06:21,220 --> 00:06:22,300
So let's have a look.

154
00:06:22,300 --> 00:06:23,530
We have our corporate data center.

155
00:06:23,530 --> 00:06:24,980
Maybe it's a bit more complicated this time.

156
00:06:24,980 --> 00:06:27,130
We have a reverse proxy, an app server,

157
00:06:27,130 --> 00:06:28,670
and a master database,

158
00:06:28,670 --> 00:06:31,490
and currently our Route 53 is pointing the DNS

159
00:06:31,490 --> 00:06:32,940
to our corporate data center.

160
00:06:32,940 --> 00:06:35,880
And in the cloud, we'll still have our data replication

161
00:06:35,880 --> 00:06:38,140
to an RDS Slave database that is running.

162
00:06:38,140 --> 00:06:40,940
And maybe we'll have an EC2 auto scaling group,

163
00:06:40,940 --> 00:06:44,130
but running at minimum capacity that's currently talking

164
00:06:44,130 --> 00:06:46,720
to our corporate data center database.

165
00:06:46,720 --> 00:06:49,730
And maybe we'll have an ELB as well, ready to go.

166
00:06:49,730 --> 00:06:51,760
And so if a disaster strikes you,

167
00:06:51,760 --> 00:06:53,450
because we have a warm standby,

168
00:06:53,450 --> 00:06:55,970
we can use Route 53 to fail over

169
00:06:55,970 --> 00:07:00,320
to the ELB and we can use the failover to also change

170
00:07:00,320 --> 00:07:02,750
where our application is getting our data from.

171
00:07:02,750 --> 00:07:05,440
Maybe it's getting our data from the RDS Slave now,

172
00:07:05,440 --> 00:07:07,840
and so we've effectively been standing by,

173
00:07:07,840 --> 00:07:09,120
and then maybe using auto scaling,

174
00:07:09,120 --> 00:07:11,440
our application will scale pretty quickly.

175
00:07:11,440 --> 00:07:14,100
So this is a more costly thing to do now

176
00:07:14,100 --> 00:07:16,740
because we already have an ELB and EC2 Auto Scaling

177
00:07:16,740 --> 00:07:18,460
running at any time, but again,

178
00:07:18,460 --> 00:07:21,470
you can decrease your RPO and your RTO doing that.
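
The Route 53 failover described here can be sketched as a pair of failover records: a primary pointing at the corporate data center and a secondary aliasing the ELB. The Python below only builds the request payload, following the shape of the Route 53 `ChangeResourceRecordSets` change batch as I understand it, so check the field names against the API reference before relying on it. Every name, IP, and ID in the usage example is a placeholder.

```python
import json


def failover_change_batch(record_name, on_prem_ip, health_check_id,
                          elb_dns_name, elb_hosted_zone_id):
    """Build a change batch with a PRIMARY (on-premise) and SECONDARY (ELB) record."""
    primary = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": "on-premise-primary",
        "Failover": "PRIMARY",
        "TTL": 60,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": on_prem_ip}],
        # Route 53 serves the secondary once this health check fails.
        "HealthCheckId": health_check_id,
    }
    secondary = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": "aws-secondary",
        "Failover": "SECONDARY",
        # Alias records take no TTL or ResourceRecords.
        "AliasTarget": {
            "HostedZoneId": elb_hosted_zone_id,
            "DNSName": elb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    return {
        "Comment": "Warm standby failover (illustrative)",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ],
    }


# Placeholder values, not real resources.
batch = failover_change_batch(
    record_name="app.example.com.",
    on_prem_ip="203.0.113.10",
    health_check_id="hypothetical-health-check-id",
    elb_dns_name="my-elb.example.elb.amazonaws.com.",
    elb_hosted_zone_id="ZHYPOTHETICAL",
)
print(json.dumps(batch, indent=2))
```

A payload like this would typically be passed as the `ChangeBatch` argument when updating the hosted zone. Switching the application to the RDS replica is a separate step that DNS failover does not handle by itself.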

179
00:07:21,470 --> 00:07:25,780
And then finally we get the multi site/hot site approach.

180
00:07:25,780 --> 00:07:28,490
It's very low RTO, we're talking minutes or seconds

181
00:07:28,490 --> 00:07:30,110
but it's also very expensive.

182
00:07:30,110 --> 00:07:32,350
But you get two full production-scale environments

183
00:07:32,350 --> 00:07:33,710
running on AWS and On Premise.

184
00:07:33,710 --> 00:07:36,110
So that means we have your On Premise data center,

185
00:07:36,110 --> 00:07:39,070
full production scale, you have your AWS data center,

186
00:07:39,070 --> 00:07:42,050
full production scale with some data replication happening.

187
00:07:42,050 --> 00:07:44,560
And so here what happens is that because you have a hot site

188
00:07:44,560 --> 00:07:47,250
that's already running, your Route 53 can route requests

189
00:07:47,250 --> 00:07:49,430
to both your corporate data center and the AWS Cloud

190
00:07:49,430 --> 00:07:52,130
and it's called an active-active type of setup.

191
00:07:52,130 --> 00:07:55,500
And so the idea here is that the failover can happen.

192
00:07:55,500 --> 00:07:59,230
Your EC2 can fail over to your RDS Slave database if need be,

193
00:07:59,230 --> 00:08:01,070
but you get full production scale running

194
00:08:01,070 --> 00:08:05,030
on AWS and On Premise, and so this costs a lot of money,

195
00:08:05,030 --> 00:08:07,300
but at the same time, you're ready to fail over,

196
00:08:07,300 --> 00:08:09,750
you're ready and you're running in a multi-DC

197
00:08:09,750 --> 00:08:12,130
type of infrastructure which is quite cool.

198
00:08:12,130 --> 00:08:14,640
Finally, if you wanted to go all cloud,

199
00:08:14,640 --> 00:08:16,620
you know it would be the same kind of architecture.

200
00:08:16,620 --> 00:08:19,420
It would be multi-region, so maybe we could use Aurora here

201
00:08:19,420 --> 00:08:21,230
because we're really in the cloud,

202
00:08:21,230 --> 00:08:23,810
so we have a master database in a region

203
00:08:23,810 --> 00:08:25,840
and then we have your Aurora Global database

204
00:08:25,840 --> 00:08:28,440
that's been replicated to another region as a Slave

205
00:08:28,440 --> 00:08:31,120
and so both of these regions are working for me

206
00:08:31,120 --> 00:08:33,116
and when I want to fail over, you know,

207
00:08:33,116 --> 00:08:35,100
I will be ready to go full production scale again

208
00:08:35,100 --> 00:08:36,890
in another region if I need to.

209
00:08:36,890 --> 00:08:38,980
So this gives you an idea of all the strategies

210
00:08:38,980 --> 00:08:40,559
you can have for disaster recovery.

211
00:08:40,559 --> 00:08:43,600
It's really up to you to select the disaster recovery

212
00:08:43,600 --> 00:08:45,740
strategy you need, but the exam will ask you

213
00:08:45,740 --> 00:08:48,950
basically based on some scenarios, what do you recommend?

214
00:08:48,950 --> 00:08:50,320
Do you recommend backup and restore?

215
00:08:50,320 --> 00:08:51,420
Pilot light?

216
00:08:51,420 --> 00:08:54,140
Do you recommend multi site or do you recommend hot site?

217
00:08:54,140 --> 00:08:55,790
All that kind of stuff.

218
00:08:55,790 --> 00:08:57,300
Warm standby and all that stuff.

219
00:08:57,300 --> 00:09:00,030
Okay so finally, disaster recovery tips,

220
00:09:00,030 --> 00:09:02,020
and it's more like real life stuff.

221
00:09:02,020 --> 00:09:04,360
So for backups, you can use EBS Snapshots,

222
00:09:04,360 --> 00:09:06,880
RDS automated snapshots and backups, et cetera.

223
00:09:06,880 --> 00:09:08,450
And you can push all these snapshots

224
00:09:08,450 --> 00:09:10,714
regularly to S3, S3 IA, Glacier.

225
00:09:10,714 --> 00:09:12,690
You can implement a Lifecycle Policy.

226
00:09:12,690 --> 00:09:14,695
You can use Cross Region Replication

227
00:09:14,695 --> 00:09:15,820
if you wanted to make sure these backups

228
00:09:15,820 --> 00:09:17,260
would be in different regions.

229
00:09:17,260 --> 00:09:19,390
And if you want to share your data from On-Premise

230
00:09:19,390 --> 00:09:21,330
to the cloud, Snowball or Storage Gateway

231
00:09:21,330 --> 00:09:22,830
would be great technologies.

232
00:09:22,830 --> 00:09:26,270
For high availability, using Route 53 to migrate DNS

233
00:09:26,270 --> 00:09:27,440
from one region to another region

234
00:09:27,440 --> 00:09:30,170
is really, really helpful and easy to implement.

235
00:09:30,170 --> 00:09:33,074
We can also use technology to have multi-AZ implemented,

236
00:09:33,074 --> 00:09:36,210
such as RDS Multi-AZ, ElastiCache Multi-AZ,

237
00:09:36,210 --> 00:09:38,700
EFS, S3, all these things are

238
00:09:38,700 --> 00:09:42,230
highly available by default or if you enable that setting.

239
00:09:42,230 --> 00:09:44,580
If you're talking about the high availability

240
00:09:44,580 --> 00:09:47,020
of your network, maybe you've implemented

241
00:09:47,020 --> 00:09:48,690
Direct Connect to connect from your

242
00:09:48,690 --> 00:09:50,460
corporate data center to AWS.

243
00:09:50,460 --> 00:09:52,960
But what if the connection goes down for whatever reason?

244
00:09:52,960 --> 00:09:54,970
Maybe you can use Site-to-Site VPN

245
00:09:54,970 --> 00:09:57,640
as a recovery option for your network.

246
00:09:57,640 --> 00:09:59,320
In terms of replication, you can use

247
00:09:59,320 --> 00:10:01,610
RDS Cross-Region Replication and Aurora

248
00:10:01,610 --> 00:10:03,090
Global Databases.

249
00:10:03,090 --> 00:10:06,150
Maybe you can use database replication software

250
00:10:06,150 --> 00:10:08,910
to replicate your on-premise database to RDS,

251
00:10:08,910 --> 00:10:11,120
or maybe you can use Storage Gateway as well.

252
00:10:11,120 --> 00:10:14,450
In terms of automation, so how do we recover from disasters?

253
00:10:14,450 --> 00:10:15,640
I think you would know already,

254
00:10:15,640 --> 00:10:18,180
CloudFormation/Elastic Beanstalk can help recreate

255
00:10:18,180 --> 00:10:20,880
whole new environments in the cloud very quickly.

256
00:10:20,880 --> 00:10:23,650
Or maybe if you use CloudWatch, we can recover

257
00:10:23,650 --> 00:10:25,420
or reboot our EC2 instances

258
00:10:25,420 --> 00:10:27,773
when the CloudWatch alarms trigger.

259
00:10:27,773 --> 00:10:31,150
AWS Lambda can also be great to customize automation.

260
00:10:31,150 --> 00:10:33,330
So Lambda functions are great for REST APIs,

261
00:10:33,330 --> 00:10:35,470
but they can also be used to automate your entire

262
00:10:35,470 --> 00:10:37,210
AWS infrastructure, and so overall,

263
00:10:37,210 --> 00:10:39,790
if you can manage to automate your whole disaster recovery

264
00:10:39,790 --> 00:10:42,950
then you are really, really well-set for success.

265
00:10:42,950 --> 00:10:44,800
And then finally, chaos testing,

266
00:10:44,800 --> 00:10:47,250
so how do we know how to recover from a disaster?

267
00:10:47,250 --> 00:10:49,227
Then you create disasters, and so

268
00:10:49,227 --> 00:10:50,650
an example that's, I think,

269
00:10:50,650 --> 00:10:52,730
widely quoted now in the AWS world

270
00:10:52,730 --> 00:10:56,030
is that Netflix, they run everything on AWS,

271
00:10:56,030 --> 00:10:57,680
and they have created something called a

272
00:10:57,680 --> 00:11:00,290
Simian Army, and they randomly terminate

273
00:11:00,290 --> 00:11:01,780
EC2 instances, for example.

274
00:11:01,780 --> 00:11:03,760
They do so much more, but basically

275
00:11:03,760 --> 00:11:05,560
they just take an application server

276
00:11:05,560 --> 00:11:06,660
and terminate it randomly.

277
00:11:06,660 --> 00:11:07,510
In production, okay?

278
00:11:07,510 --> 00:11:09,530
Not in dev or test, in production.

279
00:11:09,530 --> 00:11:11,600
So they want to make sure that their infrastructure

280
00:11:11,600 --> 00:11:13,860
is capable of surviving failures,

281
00:11:13,860 --> 00:11:15,160
and so that's why they're running

282
00:11:15,160 --> 00:11:18,880
a bunch of chaos monkeys that just terminate stuff randomly

283
00:11:18,880 --> 00:11:20,760
just to make sure that their infrastructure

284
00:11:20,760 --> 00:11:24,400
is rock-solid and can survive any type of failure.
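
The idea behind chaos testing can be sketched as a toy simulation in Python. This is not Netflix's tooling and it makes no AWS calls. The fleet is just a set of made-up instance IDs, and `replace_missing` stands in for whatever self-healing mechanism (an Auto Scaling group, for example) would replace terminated servers.

```python
import random


def terminate_random_instance(instances, rng):
    """Remove one randomly chosen instance from the fleet and return its ID."""
    victim = rng.choice(sorted(instances))  # sorted for a reproducible choice
    instances.discard(victim)
    return victim


def replace_missing(instances, desired_count, next_id):
    """Stand-in for self-healing: launch instances until the fleet is back to size."""
    while len(instances) < desired_count:
        instances.add(f"i-{next_id:04d}")
        next_id += 1
    return next_id


rng = random.Random(42)  # fixed seed so the run is repeatable
fleet = {"i-0001", "i-0002", "i-0003"}
next_id = 4

for round_number in range(1, 6):
    victim = terminate_random_instance(fleet, rng)
    next_id = replace_missing(fleet, desired_count=3, next_id=next_id)
    print(f"round {round_number}: terminated {victim}, fleet size is {len(fleet)}")
```

The check that matters is that the fleet returns to its desired size after every termination. In a real system you would also verify that users saw no errors while the replacement came up.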

285
00:11:24,400 --> 00:11:27,300
So that's it for this section on disaster recovery.

286
00:11:27,300 --> 00:11:28,133
I hope you enjoyed it,

287
00:11:28,133 --> 00:11:30,070
and I will see you in the next lecture.