1
00:00:00,230 --> 00:00:01,490
The third pillar

2
00:00:01,490 --> 00:00:04,890
of the Well-Architected Framework is reliability,

3
00:00:04,890 --> 00:00:07,820
and so reliability is the ability of a system

4
00:00:07,820 --> 00:00:11,040
to recover from infrastructure or service disruptions,

5
00:00:11,040 --> 00:00:13,540
dynamically acquire computing resources to meet demand,

6
00:00:13,540 --> 00:00:16,130
and mitigate disruptions such as misconfigurations

7
00:00:16,130 --> 00:00:18,450
or transient network issues.

8
00:00:18,450 --> 00:00:20,200
So it's about making sure your application

9
00:00:20,200 --> 00:00:21,830
runs no matter what.

10
00:00:21,830 --> 00:00:23,470
The design principles are simple.

11
00:00:23,470 --> 00:00:25,050
We need to test recovery procedures,

12
00:00:25,050 --> 00:00:28,200
so you need to use automation to simulate different failures

13
00:00:28,200 --> 00:00:31,320
or to recreate scenarios that led to failures before.

14
00:00:31,320 --> 00:00:34,200
We need to automatically recover from failure.

15
00:00:34,200 --> 00:00:35,880
That means that you need to anticipate

16
00:00:35,880 --> 00:00:38,220
and remediate failures before they occur.

17
00:00:38,220 --> 00:00:40,440
Then scale horizontally in case you need

18
00:00:40,440 --> 00:00:44,810
to increase system availability or handle increased load.

19
00:00:44,810 --> 00:00:47,420
And then stop guessing capacity.

20
00:00:47,420 --> 00:00:49,390
So basically that means that if you think,

21
00:00:49,390 --> 00:00:52,480
oh, I need four instances for my application,

22
00:00:52,480 --> 00:00:54,760
that probably isn't going to work in the long term.

23
00:00:54,760 --> 00:00:56,670
Use auto scaling wherever you can

24
00:00:56,670 --> 00:00:59,250
to make sure you have the right capacity at any time.
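To make the "stop guessing capacity" principle a bit more concrete, here is a minimal Python sketch of the idea behind scaling toward a target metric. It is an illustration only, not how AWS Auto Scaling computes capacity internally: the function name `desired_capacity`, its parameters, and the proportional formula are all assumptions chosen for this example.

```python
import math


def desired_capacity(current_instances, observed_cpu, target_cpu, min_size, max_size):
    """Return how many instances would bring average CPU back near the target.

    Hypothetical helper for illustration: it scales the fleet in proportion
    to how far the observed metric is from the target, then clamps the
    result to the configured minimum and maximum group size.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")

    # Proportional estimate: total load is roughly instances * average CPU.
    raw = math.ceil(current_instances * observed_cpu / target_cpu)

    # Never go below the minimum or above the maximum group size.
    return max(min_size, min(max_size, raw))
```

For example, with 4 instances running at 90% average CPU and a 60% target, the sketch suggests growing to 6 instances instead of staying at a hand-picked number.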
25
00:00:59,250 --> 00:01:01,980
And then in terms of automation,

26
00:01:01,980 --> 00:01:04,750
you need to basically change everything through automation,

27
00:01:04,750 --> 00:01:08,090
and this is to ensure that your application will be reliable

28
00:01:08,090 --> 00:01:09,957
or that you can roll back if needed.

29
00:01:09,957 --> 00:01:12,430
In terms of AWS Services, what do we have?

30
00:01:12,430 --> 00:01:15,080
Well, the foundations of reliability are going to be IAM,

31
00:01:15,080 --> 00:01:17,890
again making sure that no one has too many rights

32
00:01:17,890 --> 00:01:20,610
to basically wreak havoc on your account.

33
00:01:20,610 --> 00:01:22,680
Amazon VPC, this is a really strong

34
00:01:22,680 --> 00:01:24,530
foundation for networking.

35
00:01:24,530 --> 00:01:26,750
And Service Limits, making sure that you

36
00:01:26,750 --> 00:01:28,770
do set appropriate service limits.

37
00:01:28,770 --> 00:01:30,690
Not too high, and not too low,

38
00:01:30,690 --> 00:01:31,920
just the right amount of service limits,

39
00:01:31,920 --> 00:01:33,670
and you monitor them over time.

40
00:01:33,670 --> 00:01:35,380
For example, if your application has been growing,

41
00:01:35,380 --> 00:01:36,380
and growing, and growing,

42
00:01:36,380 --> 00:01:38,140
and you're about to reach that service limit,

43
00:01:38,140 --> 00:01:40,210
you don't want to get any service disruptions,

44
00:01:40,210 --> 00:01:41,640
so you would contact AWS,

45
00:01:41,640 --> 00:01:44,010
and increase that service limit over time.

46
00:01:44,010 --> 00:01:45,510
Trusted Advisor is also great,

47
00:01:45,510 --> 00:01:47,030
we'll see in this section

48
00:01:47,030 --> 00:01:49,820
how we can basically look at these service limits,
49
00:01:49,820 --> 00:01:50,760
or look at other things,

50
00:01:50,760 --> 00:01:52,817
and get strong foundations over time.
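The idea of monitoring service limits over time can be sketched in a few lines of Python. This is a hedged illustration that works on plain dictionaries rather than calling any AWS API: the function name `limits_needing_attention`, the resource names, and the 80% default threshold are assumptions made for this example.

```python
def limits_needing_attention(usage, limits, threshold=0.8):
    """Return (name, used, limit) tuples for limits at or above the threshold.

    `usage` and `limits` are hypothetical dictionaries keyed by resource
    name. Results are sorted so the most heavily used limit comes first.
    """
    flagged = []
    for name, limit in limits.items():
        used = usage.get(name, 0)
        # Skip limits of zero to avoid dividing by zero.
        if limit > 0 and used / limit >= threshold:
            flagged.append((name, used, limit))
    return sorted(flagged, key=lambda item: item[1] / item[2], reverse=True)


# Made-up numbers for illustration only.
example_usage = {"ec2_instances": 18, "vpcs": 2, "elastic_ips": 5}
example_limits = {"ec2_instances": 20, "vpcs": 5, "elastic_ips": 5}
```

Running the helper on the example data flags the two limits that are nearly exhausted, which is the point at which you would request an increase before hitting a disruption.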
51
00:01:52,817 --> 00:01:55,830
Change management, so how do we manage change overall?

52
00:01:55,830 --> 00:01:57,930
Well, Auto Scaling is a great way.

53
00:01:57,930 --> 00:02:00,730
Basically if my application gets more popular over time

54
00:02:00,730 --> 00:02:01,727
and I have set up auto scaling,

55
00:02:01,727 --> 00:02:04,177
then I don't need to change anything, which is great.

56
00:02:04,177 --> 00:02:06,795
CloudWatch is also a great way of looking at your metrics

57
00:02:06,795 --> 00:02:09,330
for your databases and for your application,

58
00:02:09,330 --> 00:02:11,740
making sure everything looks reliable over time,

59
00:02:11,740 --> 00:02:14,458
and if the CPU utilization starts to ramp up,

60
00:02:14,458 --> 00:02:16,198
maybe do something about it.

61
00:02:16,198 --> 00:02:19,120
CloudTrail, to make sure we are able

62
00:02:19,120 --> 00:02:20,110
to track our API calls.

63
00:02:20,110 --> 00:02:21,810
And Config, again.

64
00:02:21,810 --> 00:02:24,230
Failure Management, so how do we manage failures?

65
00:02:24,230 --> 00:02:27,550
Well, we'll see this in the disaster recovery explanation

66
00:02:27,550 --> 00:02:30,080
in this section, but you can use backups

67
00:02:30,080 --> 00:02:31,830
all along the way to basically make sure

68
00:02:31,830 --> 00:02:33,880
that your application can be recovered

69
00:02:33,880 --> 00:02:35,730
if something really, really bad happens.

70
00:02:35,730 --> 00:02:38,680
CloudFormation to recreate your whole infrastructure

71
00:02:38,680 --> 00:02:42,790
at once, S3, for example, to back up all your data

72
00:02:42,790 --> 00:02:44,850
or, you know, S3 Glacier if we're talking about

73
00:02:44,850 --> 00:02:47,890
archives that you only need to touch once in a while.
74
00:02:47,890 --> 00:02:50,620
Finally maybe you want to use a reliable,

75
00:02:50,620 --> 00:02:53,010
highly available global DNS system,

76
00:02:53,010 --> 00:02:54,900
so Route 53 could be one of them.

77
00:02:54,900 --> 00:02:57,020
And in case of any failures, maybe you want

78
00:02:57,020 --> 00:03:00,620
to change Route 53 to just point to a new application stack

79
00:03:00,620 --> 00:03:03,050
somewhere else and really make sure your application

80
00:03:03,050 --> 00:03:05,340
has some kind of disaster recovery mechanism.

81
00:03:05,340 --> 00:03:07,310
Don't worry, we'll see disaster recovery

82
00:03:07,310 --> 00:03:08,310
in this section as well,

83
00:03:08,310 --> 00:03:10,360
and I'll try to make it as simple as possible.

84
00:03:10,360 --> 00:03:12,370
So that's it for this pillar, I hope you liked it,

85
00:03:12,370 --> 00:03:14,320
and I will see you in the next lecture.
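The Route 53 idea mentioned above, pointing DNS at a new application stack when the main one fails, can be sketched as a small decision function. This is only a conceptual model in plain Python, not Route 53 configuration or an AWS API call: the `Endpoint` class, the `pick_endpoint` function, and the example hostnames are hypothetical, and the choice to fall back to the primary when nothing is healthy is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A hypothetical application stack that DNS could point to."""
    name: str
    healthy: bool


def pick_endpoint(primary, secondary):
    """Choose which stack should receive traffic.

    Prefer the primary while it is healthy, fail over to the secondary
    otherwise. If neither is healthy, this sketch returns the primary
    (an assumption made here so that a name always resolves).
    """
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    return primary


# Made-up stacks for illustration only.
main_stack = Endpoint(name="app.primary.example.com", healthy=False)
recovery_stack = Endpoint(name="app.recovery.example.com", healthy=True)
```

With the example values, `pick_endpoint(main_stack, recovery_stack)` selects the recovery stack, which is the failover behavior a disaster recovery setup is meant to provide.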