1 00:00:01,140 --> 00:00:03,240 Failover and Cluster Management. 2 00:00:03,240 --> 00:00:06,600 Here let's take a look at what happens in your failover cluster, 3 00:00:06,600 --> 00:00:10,180 depending upon whether the failure is reparable 4 00:00:10,180 --> 00:00:11,740 or irreparable. 5 00:00:11,740 --> 00:00:11,950 Now, 6 00:00:11,950 --> 00:00:15,220 an example of a reparable failure would be 7 00:00:15,220 --> 00:00:18,270 something transient maybe that takes down a node. 8 00:00:18,270 --> 00:00:21,670 We could consider taking a node offline for scheduled 9 00:00:21,670 --> 00:00:23,820 maintenance as a reparable failure. 10 00:00:23,820 --> 00:00:27,890 Something where the node is expected to come back online as is, 11 00:00:27,890 --> 00:00:29,440 is what we're talking about. 12 00:00:29,440 --> 00:00:32,350 And the main workflow here is three steps. 13 00:00:32,350 --> 00:00:32,850 One, 14 00:00:32,850 --> 00:00:37,780 if a failure occurs that's reparable on that node and it becomes unavailable, 15 00:00:37,780 --> 00:00:39,820 remember previously in this course, 16 00:00:39,820 --> 00:00:43,340 we discussed a little bit about the heartbeat messages, 17 00:00:43,340 --> 00:00:45,540 and this is how the nodes communicate with each 18 00:00:45,540 --> 00:00:47,460 other to make sure that they are, in fact, 19 00:00:47,460 --> 00:00:48,740 reachable. 20 00:00:48,740 --> 00:00:52,180 We would let the failover occur so that all of the failed node's 21 00:00:52,180 --> 00:00:56,340 clustered roles would be shifted to another node. 
22 00:00:56,340 --> 00:00:58,750 Of course, then we would need to, through our alerting, 23 00:00:58,750 --> 00:01:02,090 become aware of the problem, fix the problem on the failed node, 24 00:01:02,090 --> 00:01:03,650 bring it back online, 25 00:01:03,650 --> 00:01:06,890 and then depending upon how we configured those clustered roles, 26 00:01:06,890 --> 00:01:11,840 we can either let failback happen automatically or manually intervene 27 00:01:11,840 --> 00:01:16,540 to fail those clustered roles back to the original node. 28 00:01:16,540 --> 00:01:20,610 Now an irreparable failure would be where a node is really 29 00:01:20,610 --> 00:01:23,240 going to be permanently flat on its back. 30 00:01:23,240 --> 00:01:24,930 And this one has five steps. 31 00:01:24,930 --> 00:01:29,030 One, we allow failover clustering to do what it does best, 32 00:01:29,030 --> 00:01:31,310 you know, taking care of the failover process. 33 00:01:31,310 --> 00:01:32,770 Once that happens, 34 00:01:32,770 --> 00:01:36,840 we would then notify the failover cluster that the failed node 35 00:01:36,840 --> 00:01:38,950 is no longer going to be part of the cluster. 36 00:01:38,950 --> 00:01:41,340 We would evict the failed node. 37 00:01:41,340 --> 00:01:45,540 We then would replace the failed node with a new physical or virtual machine. 38 00:01:45,540 --> 00:01:48,870 We would then need to use either the Failover Cluster MMC 39 00:01:48,870 --> 00:01:52,850 Console or PowerShell to run cluster validation against the 40 00:01:52,850 --> 00:01:54,890 new node and add it into the cluster. 41 00:01:54,890 --> 00:01:56,960 And once it comes into the cluster, 42 00:01:56,960 --> 00:01:59,940 we then can transfer roles manually to the new node 43 00:01:59,940 --> 00:02:02,400 and configure preferred owners, etc., etc. 
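The two workflows just described, three steps for a reparable failure and five for an irreparable one, can be pictured with a small toy model. This is only a sketch in Python: the ToyCluster class, its method names, and the node and role names are all hypothetical and are not part of Failover Cluster Manager or of the PowerShell cmdlets. It exists only to show the order of operations.

```python
# Toy model of the two workflows. NOT the Failover Clustering API:
# every class, method, node, and role name here is made up for illustration.
class ToyCluster:
    def __init__(self, nodes):
        self.up = {name: True for name in nodes}  # node name -> reachable?
        self.owner = {}                           # role name -> owning node

    def add_role(self, role, node):
        self.owner[role] = node

    def fail_node(self, failed):
        """Failover: mark the node down and move its roles to a survivor."""
        self.up[failed] = False
        survivors = [n for n, ok in self.up.items() if ok]
        if not survivors:
            raise RuntimeError("no surviving node to take over the roles")
        for role, node in list(self.owner.items()):
            if node == failed:
                self.owner[role] = survivors[0]

    def repair_node(self, node):
        """Reparable case: the same node comes back online as it was."""
        self.up[node] = True

    def evict_node(self, node):
        """Irreparable case: remove the node from cluster membership."""
        if node in self.owner.values():
            raise RuntimeError("move roles off the node before evicting it")
        del self.up[node]

    def add_node(self, node, passed_validation):
        """Join a replacement node, but only after validation has passed."""
        if not passed_validation:
            raise ValueError("run cluster validation before adding the node")
        self.up[node] = True

    def move_role(self, role, target):
        """Manual move: used for failback or to populate a new node."""
        if not self.up.get(target, False):
            raise ValueError(f"{target} is not an online cluster member")
        self.owner[role] = target


def reparable_workflow():
    c = ToyCluster(["node1", "node2"])
    c.add_role("fileshare", "node1")
    c.fail_node("node1")                # 1. let failover happen
    c.repair_node("node1")              # 2. fix the node, bring it back online
    c.move_role("fileshare", "node1")   # 3. fail back (done manually here)
    return c


def irreparable_workflow():
    c = ToyCluster(["node1", "node2"])
    c.add_role("fileshare", "node1")
    c.fail_node("node1")                          # 1. let failover happen
    c.evict_node("node1")                         # 2. evict the failed node
    c.add_node("node3", passed_validation=True)   # 3-4. replace, validate, add
    c.move_role("fileshare", "node3")             # 5. transfer roles manually
    return c
```

In this sketch the only difference between the two cases is whether the original node returns: in the reparable case it does, and in the irreparable case a validated replacement takes its place.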
44 00:02:02,400 --> 00:02:02,920 Now, 45 00:02:02,920 --> 00:02:05,570 I guess something else that would be considered an 46 00:02:05,570 --> 00:02:08,930 "irreparable failure" would be where you have a cluster node 47 00:02:08,930 --> 00:02:10,960 that's just reached its natural end of life. 48 00:02:10,960 --> 00:02:14,170 It hasn't crashed yet, but you're going to decommission it, 49 00:02:14,170 --> 00:02:17,800 take it out of the cluster, and replace it with a new model. 50 00:02:17,800 --> 00:02:22,640 That would be another instance where you would use this particular workflow. 51 00:02:22,640 --> 00:02:25,220 Now you'll see all this stuff in the demo upcoming, 52 00:02:25,220 --> 00:02:29,620 but I want to prepare you here. In order to manually fail over a workload, 53 00:02:29,620 --> 00:02:32,590 as you can see here, we're right‑clicking a clustered role, 54 00:02:32,590 --> 00:02:36,470 and this is, of course, graphically through the Failover Cluster Manager. 55 00:02:36,470 --> 00:02:40,180 We can right‑click and, in this case, depending upon what the role is, 56 00:02:40,180 --> 00:02:42,840 this looks like a Scale‑out File Server Share, 57 00:02:42,840 --> 00:02:46,610 so we can move that workload either by letting the cluster 58 00:02:46,610 --> 00:02:49,040 determine the best possible node. 59 00:02:49,040 --> 00:02:52,390 And how does the cluster know what the best possible node is? 60 00:02:52,390 --> 00:02:55,820 Well the best way to do that is to configure preferred owners yourself 61 00:02:55,820 --> 00:02:58,200 because the cluster is only as smart as you are, 62 00:02:58,200 --> 00:02:59,900 the person who's programmed it. 63 00:02:59,900 --> 00:03:04,740 Or you can manually select the node, as you can see from the flyout menu. 64 00:03:04,740 --> 00:03:07,570 So speaking of preferred owner, let's just take a quick look. 
65 00:03:07,570 --> 00:03:10,340 If we look at the properties of a clustered role, 66 00:03:10,340 --> 00:03:13,440 there are generally two tabs here, General and Failover, 67 00:03:13,440 --> 00:03:15,750 and on Failover, you can see down at the bottom, 68 00:03:15,750 --> 00:03:19,660 this is where you instruct the cluster to automatically shift the 69 00:03:19,660 --> 00:03:22,680 clustered role to the secondary host. That is, 70 00:03:22,680 --> 00:03:23,980 when the failover occurs, 71 00:03:23,980 --> 00:03:27,630 you go from the active node to another node, and you choose whether to go back 72 00:03:27,630 --> 00:03:32,480 automatically to the original role holder or prevent failback. 73 00:03:32,480 --> 00:03:36,340 So you would just specify that according to your preferences. 74 00:03:36,340 --> 00:03:38,350 And on the General tab, as you can see here, 75 00:03:38,350 --> 00:03:42,010 this is where you can create a hierarchy of preferred owners. 76 00:03:42,010 --> 00:03:44,040 You might have workloads that, 77 00:03:44,040 --> 00:03:48,450 based on particular hardware features of a particular hardware node, 78 00:03:48,450 --> 00:03:51,170 you may feel a bit more strongly about having that 79 00:03:51,170 --> 00:03:54,440 particular node host that role most of the time. 80 00:03:54,440 --> 00:03:58,230 Now, that's in a sense, kind of an edge scenario because ideally, 81 00:03:58,230 --> 00:04:00,730 all of your cluster nodes are identical from a 82 00:04:00,730 --> 00:04:05,140 hardware and software perspective, or at least from a hardware perspective. 83 00:04:05,140 --> 00:04:08,940 Here we have a graphical screenshot of Failover Cluster Manager again, 84 00:04:08,940 --> 00:04:14,010 and, in this case, this looks like virtual machine failover. 85 00:04:14,010 --> 00:04:17,740 It's, again, a different UI because it's a different clustered role. 86 00:04:17,740 --> 00:04:19,040 Right‑click the VM. 
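As an aside on the preferred owners hierarchy and the failback setting covered above, here is a simplified picture of how the two interact. This is a Python sketch under stated assumptions: both function names and their parameters are hypothetical, and the real cluster service's placement logic takes more into account than an ordered list. The point is only that the preferred owners list is walked top-down, and that failback is a separate yes or no choice.

```python
# Simplified picture only; these helpers are hypothetical and do not
# reproduce the cluster service's actual placement algorithm.

def best_possible_node(preferred_owners, online_nodes, current_owner=None):
    """Return the first online preferred owner, else any other online node."""
    candidates = [n for n in online_nodes if n != current_owner]
    for node in preferred_owners:      # walk the hierarchy top-down
        if node in candidates:
            return node
    return candidates[0] if candidates else None


def failback_target(preferred_owners, online_nodes, current_owner,
                    allow_failback):
    """Return the node to fail back to, or None if the role should stay put."""
    if not allow_failback or not preferred_owners:
        return None                    # "prevent failback" or nothing preferred
    for node in preferred_owners:
        if node == current_owner:
            return None                # already on the best available owner
        if node in online_nodes:
            return node                # a more-preferred owner is back online
    return None
```

For example, with preferred owners node1 then node2, a role that failed over to node2 would be sent back to node1 once node1 is online again, but only if failback is allowed; with failback prevented, it stays on node2 until someone moves it manually.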
87 00:04:19,040 --> 00:04:23,560 We do a move, but with Hyper‑V virtual machines it's a little bit different. 88 00:04:23,560 --> 00:04:27,040 We can migrate just the virtual machine storage or we can 89 00:04:27,040 --> 00:04:29,030 migrate the storage and configuration. 90 00:04:29,030 --> 00:04:35,840 And I tend to do live migration in order to minimize any downtime. 91 00:04:35,840 --> 00:04:37,610 And when you go to Live Migration, 92 00:04:37,610 --> 00:04:40,610 you see there's Best Possible Node or Select Nodes. 93 00:04:40,610 --> 00:04:43,550 So it's a similar notion with preferred servers. 94 00:04:43,550 --> 00:04:44,070 Now, 95 00:04:44,070 --> 00:04:47,780 what I didn't mention in the previous screenshot that I'll mention 96 00:04:47,780 --> 00:04:51,830 here, that also applies to Scale‑out File Server, is remember that 97 00:04:51,830 --> 00:04:54,010 you're using Cluster Shared Volumes. 98 00:04:54,010 --> 00:04:57,320 This means this particular VM we're talking about on this slide, 99 00:04:57,320 --> 00:05:00,910 nestvm, has both its VHDs as well as its 100 00:05:00,910 --> 00:05:04,340 configuration available to all cluster nodes. 101 00:05:04,340 --> 00:05:07,720 So especially when you choose the Live Migration option here, 102 00:05:07,720 --> 00:05:13,730 you're minimizing that switchover time and thus improving the high 103 00:05:13,730 --> 00:05:17,170 availability of your virtual machine and your other clustered 104 00:05:17,170 --> 00:05:19,280 roles that use Cluster Shared Volumes. 105 00:05:19,280 --> 00:05:24,250 Now I had mentioned that Hyper‑V highly available VMs have live and quick migration, 106 00:05:24,250 --> 00:05:26,740 and you should use live wherever possible. 107 00:05:26,740 --> 00:05:28,710 This is just me speculating a little bit, 108 00:05:28,710 --> 00:05:33,460 but in my experience, I think that quick is just there for legacy environments. 
109 00:05:33,460 --> 00:05:37,070 Specifically, the idea is that live is truly a live 110 00:05:37,070 --> 00:05:40,270 migration where the VM experiences no downtime. 111 00:05:40,270 --> 00:05:44,250 When you do a quick migration, the failover cluster pauses the VM, 112 00:05:44,250 --> 00:05:46,940 so there is going to be some negligible downtime. 113 00:05:46,940 --> 00:05:48,590 And again, speculation alert, 114 00:05:48,590 --> 00:05:51,520 I think the difference there lies in the hardware 115 00:05:51,520 --> 00:05:53,940 capabilities of your cluster nodes. 116 00:05:53,940 --> 00:05:55,050 Years ago, of course, 117 00:05:55,050 --> 00:05:58,190 not everybody had current hardware that fully supported 118 00:05:58,190 --> 00:06:01,090 CPU virtualization and all of that. 119 00:06:01,090 --> 00:06:03,090 Nowadays that's pretty much the standard, 120 00:06:03,090 --> 00:06:06,000 which is why I recommend you always do live migration. 121 00:06:06,000 --> 00:06:09,120 When cluster nodes may not have matching hardware, 122 00:06:09,120 --> 00:06:12,140 particularly ones with older hardware, 123 00:06:12,140 --> 00:06:18,000 you may want to go a more traditional copy route, and that would be quick migration.
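To make the live versus quick distinction above a little more concrete, here is a rough back-of-the-envelope model in Python. Everything in it is an assumption for illustration: the function names, the parameters, and the simplification that quick migration pauses the VM while its memory is saved and then restored, while live migration copies memory as the VM keeps running and pauses only for a small final remainder. Real downtime depends on many factors this sketch ignores.

```python
# Rough model with made-up parameters; not a measurement of Hyper-V behavior.

def quick_migration_downtime(memory_mb, storage_mb_per_s):
    """Assume the VM is paused while memory is saved, then restored."""
    save_seconds = memory_mb / storage_mb_per_s
    restore_seconds = memory_mb / storage_mb_per_s
    return save_seconds + restore_seconds


def live_migration_downtime(memory_mb, network_mb_per_s, dirty_mb_per_s,
                            stop_below_mb=64, max_rounds=10):
    """Assume iterative copying: send memory while the VM runs, re-send
    whatever changed in the meantime, and pause only for the last remainder."""
    remaining = memory_mb
    for _ in range(max_rounds):
        if remaining <= stop_below_mb:
            break
        round_seconds = remaining / network_mb_per_s
        # Memory dirtied during this round has to be sent again.
        remaining = min(memory_mb, dirty_mb_per_s * round_seconds)
    return remaining / network_mb_per_s   # the only time the VM is paused


if __name__ == "__main__":
    quick = quick_migration_downtime(memory_mb=4096, storage_mb_per_s=400)
    live = live_migration_downtime(memory_mb=4096, network_mb_per_s=1000,
                                   dirty_mb_per_s=100)
    print(f"quick: {quick:.2f} s paused, live: {live:.3f} s paused")
```

With these example numbers the paused time for quick migration is measured in seconds, while for live migration it is a small fraction of a second, which fits the recommendation to use live migration wherever possible.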