1
00:00:00,230 --> 00:00:01,490
‫The third pilar

2
00:00:01,490 --> 00:00:04,890
‫of the Well-Architected Framework is reliability,

3
00:00:04,890 --> 00:00:07,820
‫and so reliability is the ability of a system

4
00:00:07,820 --> 00:00:11,040
‫to recover from infrastructure or service disruptions,

5
00:00:11,040 --> 00:00:13,540
‫dynamically acquire computing resources to meet demand,

6
00:00:13,540 --> 00:00:16,130
‫and mitigate disruptions such as misconfigurations

7
00:00:16,130 --> 00:00:18,450
‫or transient network issues.

8
00:00:18,450 --> 00:00:20,200
‫So it's about making sure your application

9
00:00:20,200 --> 00:00:21,830
‫runs no matter what.

10
00:00:21,830 --> 00:00:23,470
‫The design principles are simple.

11
00:00:23,470 --> 00:00:25,050
‫We need to test recovery procedures

12
00:00:25,050 --> 00:00:28,200
‫so you need to use automation to simulate different failures

13
00:00:28,200 --> 00:00:31,320
‫or to recreate scenarios that led to failures before.

14
00:00:31,320 --> 00:00:34,200
‫We need to automatically recover from failure.

15
00:00:34,200 --> 00:00:35,880
‫That means that you need to anticipate

16
00:00:35,880 --> 00:00:38,220
‫and remediate failures before they occur.

17
00:00:38,220 --> 00:00:40,440
‫Then scale horizontally in case you need

18
00:00:40,440 --> 00:00:44,810
‫to have increased system availability, or increased load.

19
00:00:44,810 --> 00:00:47,420
‫And then stop guessing capacity.

20
00:00:47,420 --> 00:00:49,390
‫So basically that means that if you think,

21
00:00:49,390 --> 00:00:52,480
‫oh, I need four streams in this for my application,

22
00:00:52,480 --> 00:00:54,760
‫that probably isn't going to work in the long term.

23
00:00:54,760 --> 00:00:56,670
‫Use auto scaling wherever you can

24
00:00:56,670 --> 00:00:59,250
‫to make sure you have the right capacity at any time.

25
00:00:59,250 --> 00:01:01,980
‫And then in terms of automation,

26
00:01:01,980 --> 00:01:04,750
‫you need to basically change everything through automation,

27
00:01:04,750 --> 00:01:08,090
‫and this is to ensure that your application will be reliable

28
00:01:08,090 --> 00:01:09,957
‫or you can roll back, or whatever.

29
00:01:09,957 --> 00:01:12,430
‫In terms of AWS Services, what do we have?

30
00:01:12,430 --> 00:01:15,080
‫Well the foundations of reliability is going to be IAM,

31
00:01:15,080 --> 00:01:17,890
‫again making sure that no one has too many rights

32
00:01:17,890 --> 00:01:20,610
‫to basically wreak havoc on your account.

33
00:01:20,610 --> 00:01:22,680
‫Amazon VPC, this is a really strong

34
00:01:22,680 --> 00:01:24,530
‫foundation for networking.

35
00:01:24,530 --> 00:01:26,750
‫And Service Limits, making sure that you

36
00:01:26,750 --> 00:01:28,770
‫do set appropriate service limits.

37
00:01:28,770 --> 00:01:30,690
‫Not too high, and not too low,

38
00:01:30,690 --> 00:01:31,920
‫just the right amount of service limits,

39
00:01:31,920 --> 00:01:33,670
‫and you monitor them over time.

40
00:01:33,670 --> 00:01:35,380
‫Such as if your application has been growing,

41
00:01:35,380 --> 00:01:36,380
‫and growing, and growing,

42
00:01:36,380 --> 00:01:38,140
‫and you're about to reach that service limit.

43
00:01:38,140 --> 00:01:40,210
‫You don't want to get any service disruptions,

44
00:01:40,210 --> 00:01:41,640
‫so you would contact AWS,

45
00:01:41,640 --> 00:01:44,010
‫and increase that service limit over time.

46
00:01:44,010 --> 00:01:45,510
‫Trusted Advisor is also great,

47
00:01:45,510 --> 00:01:47,030
‫we'll see this in this section

48
00:01:47,030 --> 00:01:49,820
‫about how we can basically look at these service limits,

49
00:01:49,820 --> 00:01:50,760
‫or look at other things,

50
00:01:50,760 --> 00:01:52,817
‫and get strong foundations over time.

51
00:01:52,817 --> 00:01:55,830
‫Change management, so how do we manage change overall?

52
00:01:55,830 --> 00:01:57,930
‫Well, Auto Scaling is a great way.

53
00:01:57,930 --> 00:02:00,730
‫Basically if my application gets more popular over time

54
00:02:00,730 --> 00:02:01,727
‫and I have set up auto scaling

55
00:02:01,727 --> 00:02:04,177
‫then I don't need to change anything, which is great.

56
00:02:04,177 --> 00:02:06,795
‫CloudWatch is a great way also of looking at your metrics.

57
00:02:06,795 --> 00:02:09,330
‫For your databases for your application,

58
00:02:09,330 --> 00:02:11,740
‫making sure everything looks reliable over time,

59
00:02:11,740 --> 00:02:14,458
‫and if the CP utilization starts to ramp up

60
00:02:14,458 --> 00:02:16,198
‫maybe do something about it.

61
00:02:16,198 --> 00:02:19,120
‫CloudTrail in terms of are we secure enough

62
00:02:19,120 --> 00:02:20,110
‫to track our API calls?

63
00:02:20,110 --> 00:02:21,810
‫And Config, again.

64
00:02:21,810 --> 00:02:24,230
‫Failure Management, so how do we manage failures?

65
00:02:24,230 --> 00:02:27,550
‫Well, we'll see this for disaster recovery explanation

66
00:02:27,550 --> 00:02:30,080
‫in this section, but you can use backups,

67
00:02:30,080 --> 00:02:31,830
‫all along the way to basically make sure

68
00:02:31,830 --> 00:02:33,880
‫that your application can be recovered

69
00:02:33,880 --> 00:02:35,730
‫if something really really bad happens.

70
00:02:35,730 --> 00:02:38,680
‫CloudFormation to recreate your whole infrastructure

71
00:02:38,680 --> 00:02:42,790
‫at once, S3, for example, to backup all your data

72
00:02:42,790 --> 00:02:44,850
‫or, you know, S3 Glacier if we're talking about

73
00:02:44,850 --> 00:02:47,890
‫archives that you don't need to touch once in a while.

74
00:02:47,890 --> 00:02:50,620
‫Finally maybe you want to use a reliable,

75
00:02:50,620 --> 00:02:53,010
‫highly available global DNS system,

76
00:02:53,010 --> 00:02:54,900
‫so Route 53 could be one of them.

77
00:02:54,900 --> 00:02:57,020
‫And in case of any failures, maybe you want

78
00:02:57,020 --> 00:03:00,620
‫to change Route 53 to just point to a new application stack

79
00:03:00,620 --> 00:03:03,050
‫somewhere else and really make your your application

80
00:03:03,050 --> 00:03:05,340
‫has some kind of disaster recovery mechanism.

81
00:03:05,340 --> 00:03:07,310
‫Don't worry, we'll see disaster recovery

82
00:03:07,310 --> 00:03:08,310
‫in this section as well,

83
00:03:08,310 --> 00:03:10,360
‫and I'll try to make it as simple as possible.

84
00:03:10,360 --> 00:03:12,370
‫So that's it, for this pillar, I hope you liked it,

85
00:03:12,370 --> 00:03:14,320
‫and I will see you in the next lecture.