1
00:00:01,140 --> 00:00:03,240
Failover and Cluster Management.

2
00:00:03,240 --> 00:00:06,600
Here let's take a look at what happens in your failover cluster,

3
00:00:06,600 --> 00:00:10,180
depending upon whether the failure is reparable,

4
00:00:10,180 --> 00:00:11,740
or irreparable.

5
00:00:11,740 --> 00:00:11,950
Now,

6
00:00:11,950 --> 00:00:15,220
an example of a reparable failure would be

7
00:00:15,220 --> 00:00:18,270
something transient that takes down a node.

8
00:00:18,270 --> 00:00:21,670
We could consider taking a node offline for scheduled

9
00:00:21,670 --> 00:00:23,820
maintenance as a reparable failure.

10
00:00:23,820 --> 00:00:27,890
Something where the node is expected to come back online as is,

11
00:00:27,890 --> 00:00:29,440
is what we're talking about.

12
00:00:29,440 --> 00:00:32,350
And the main workflow here is three steps.

13
00:00:32,350 --> 00:00:32,850
One,

14
00:00:32,850 --> 00:00:37,780
if a reparable failure occurs on a node and it becomes unavailable,

15
00:00:37,780 --> 00:00:39,820
remember previously in this course,

16
00:00:39,820 --> 00:00:43,340
we discussed a little bit about the heartbeat messages,

17
00:00:43,340 --> 00:00:45,540
and this is how the nodes communicate with each

18
00:00:45,540 --> 00:00:47,460
other to make sure that they are, in fact,

19
00:00:47,460 --> 00:00:48,740
reachable.

20
00:00:48,740 --> 00:00:52,180
We would let the failover occur so that all of the failed node's

21
00:00:52,180 --> 00:00:56,340
clustered roles would be shifted to another node.

22
00:00:56,340 --> 00:00:58,750
Of course, then we would need to, through our alerting,

23
00:00:58,750 --> 00:01:02,090
become aware of the problem, fix the problem on the failed node,

24
00:01:02,090 --> 00:01:03,650
bring it back online,

25
00:01:03,650 --> 00:01:06,890
and then depending upon how we configured those clustered roles,

26
00:01:06,890 --> 00:01:11,840
we can either let failback happen automatically or manually intervene

27
00:01:11,840 --> 00:01:16,540
to fail those clustered roles back to the original node.

28
00:01:16,540 --> 00:01:20,610
Now an irreparable failure would be where a node is really

29
00:01:20,610 --> 00:01:23,240
going to be permanently flat on its back.

30
00:01:23,240 --> 00:01:24,930
And this one has five steps.

31
00:01:24,930 --> 00:01:29,030
One, we allow failover clustering to do what it does best,

32
00:01:29,030 --> 00:01:31,310
you know, taking care of the failover process.

33
00:01:31,310 --> 00:01:32,770
Once that happens,

34
00:01:32,770 --> 00:01:36,840
we would then notify the failover cluster that the failed node

35
00:01:36,840 --> 00:01:38,950
is no longer going to be part of the cluster.

36
00:01:38,950 --> 00:01:41,340
We would evict the failed node.

37
00:01:41,340 --> 00:01:45,540
We then would replace the failed node with a new physical or virtual machine.

38
00:01:45,540 --> 00:01:48,870
We would then need to use either the Failover Cluster MMC

39
00:01:48,870 --> 00:01:52,850
Console or PowerShell to run cluster validation against the

40
00:01:52,850 --> 00:01:54,890
new node and add it into the cluster.

41
00:01:54,890 --> 00:01:56,960
And once it comes into the cluster,

42
00:01:56,960 --> 00:01:59,940
we then can transfer roles manually to the new node

43
00:01:59,940 --> 00:02:02,400
and configure preferred owners, etc., etc.

44
00:02:02,400 --> 00:02:02,920
Now,

45
00:02:02,920 --> 00:02:05,570
I guess something else that would be considered an

46
00:02:05,570 --> 00:02:08,930
"irreparable failure" would be where you have a cluster node

47
00:02:08,930 --> 00:02:10,960
that's just reached its natural end of life.

48
00:02:10,960 --> 00:02:14,170
It hasn't crashed yet, but you're going to decommission it,

49
00:02:14,170 --> 00:02:17,800
take it out of the cluster, and replace it with a new model.

50
00:02:17,800 --> 00:02:22,640
That would be another instance where you would use this particular workflow.

51
00:02:22,640 --> 00:02:25,220
Now you'll see all of this in the upcoming demo,

52
00:02:25,220 --> 00:02:29,620
but I want to prepare you here. To manually fail over a workload,

53
00:02:29,620 --> 00:02:32,590
as you can see here, we're right‑clicking a clustered role,

54
00:02:32,590 --> 00:02:36,470
and this is, of course, graphically through the Failover Cluster Manager.

55
00:02:36,470 --> 00:02:40,180
We can right‑click and, in this case, depending upon what the role is,

56
00:02:40,180 --> 00:02:42,840
this looks like a Scale‑Out File Server share,

57
00:02:42,840 --> 00:02:46,610
so we can move that workload either by letting the cluster

58
00:02:46,610 --> 00:02:49,040
determine the best possible node.

59
00:02:49,040 --> 00:02:52,390
And how does the cluster know what the best possible node is?

60
00:02:52,390 --> 00:02:55,820
Well the best way to do that is to configure preferred owners yourself

61
00:02:55,820 --> 00:02:58,200
because the cluster is only as smart as you are,

62
00:02:58,200 --> 00:02:59,900
the person who's programmed it.

63
00:02:59,900 --> 00:03:04,740
Or you can manually select the node, as you can see from the flyout menu.

64
00:03:04,740 --> 00:03:07,570
So speaking of preferred owner, let's just take a quick look.

65
00:03:07,570 --> 00:03:10,340
If we look at the properties of a clustered role,

66
00:03:10,340 --> 00:03:13,440
there are generally two tabs here, General and Failover,

67
00:03:13,440 --> 00:03:15,750
and on Failover, you can see down at the bottom,

68
00:03:15,750 --> 00:03:19,660
this is where you tell the cluster what to do after the

69
00:03:19,660 --> 00:03:22,680
clustered role has shifted to another host. That is,

70
00:03:22,680 --> 00:03:23,980
when the failover occurs,

71
00:03:23,980 --> 00:03:27,630
the role goes from the active node to another node, and you choose whether it goes back

72
00:03:27,630 --> 00:03:32,480
automatically to the original role holder or whether to prevent failback.

73
00:03:32,480 --> 00:03:36,340
So you would just specify that according to your preferences.

74
00:03:36,340 --> 00:03:38,350
And on the General tab, as you can see here,

75
00:03:38,350 --> 00:03:42,010
this is where you can create a hierarchy of preferred owners.

76
00:03:42,010 --> 00:03:44,040
You might have workloads that,

77
00:03:44,040 --> 00:03:48,450
based on particular hardware features of a particular hardware node,

78
00:03:48,450 --> 00:03:51,170
you may feel a bit more strongly about having that

79
00:03:51,170 --> 00:03:54,440
particular node host that role most of the time.

80
00:03:54,440 --> 00:03:58,230
Now, that's, in a sense, kind of an edge scenario because ideally,

81
00:03:58,230 --> 00:04:00,730
all of your cluster nodes are identical from a

82
00:04:00,730 --> 00:04:05,140
hardware and software perspective, or at least a hardware perspective.

83
00:04:05,140 --> 00:04:08,940
Here we have a screenshot of Failover Cluster Manager again,

84
00:04:08,940 --> 00:04:14,010
and in this case, it looks like we're using virtual machine failover.

85
00:04:14,010 --> 00:04:17,740
It's, again, a different UI because it's a different clustered role.

86
00:04:17,740 --> 00:04:19,040
Right‑click the VM.

87
00:04:19,040 --> 00:04:23,560
We do a move, but with Hyper‑V virtual machines it's a little bit different.

88
00:04:23,560 --> 00:04:27,040
We can migrate just the virtual machine storage or we can

89
00:04:27,040 --> 00:04:29,030
migrate the storage and configuration.

90
00:04:29,030 --> 00:04:35,840
And I tend to do live migration in order to minimize any downtime.

91
00:04:35,840 --> 00:04:37,610
And when you go to Live Migration,

92
00:04:37,610 --> 00:04:40,610
you see there's Best Possible Node or Select Nodes.

93
00:04:40,610 --> 00:04:43,550
So it's a similar notion, with preferred owners.

94
00:04:43,550 --> 00:04:44,070
Now,

95
00:04:44,070 --> 00:04:47,780
what I didn't mention in the previous screenshot that I'll mention

96
00:04:47,780 --> 00:04:51,830
here, which also applies to Scale‑Out File Server, is to remember that

97
00:04:51,830 --> 00:04:54,010
you're using Cluster Shared Volumes.

98
00:04:54,010 --> 00:04:57,320
This means this particular VM we're talking about on this slide,

99
00:04:57,320 --> 00:05:00,910
nestvm, has both its VHDs as well as its

100
00:05:00,910 --> 00:05:04,340
configuration available to all cluster nodes.

101
00:05:04,340 --> 00:05:07,720
So especially when you choose the Live Migration option here,

102
00:05:07,720 --> 00:05:13,730
you're minimizing that switchover time and thus improving the high

103
00:05:13,730 --> 00:05:17,170
availability of your virtual machine and your other clustered

104
00:05:17,170 --> 00:05:19,280
roles that use Cluster Shared Volumes.

105
00:05:19,280 --> 00:05:24,250
Now I had mentioned that Hyper‑V highly available VMs have live and quick migration,

106
00:05:24,250 --> 00:05:26,740
and you should use live wherever possible.

107
00:05:26,740 --> 00:05:28,710
This is just me speculating a little bit,

108
00:05:28,710 --> 00:05:33,460
but in my experience, I think that quick is just there for legacy environments.

109
00:05:33,460 --> 00:05:37,070
Specifically, the idea is that live is truly a live

110
00:05:37,070 --> 00:05:40,270
migration where the VM experiences no downtime.

111
00:05:40,270 --> 00:05:44,250
When you do a quick migration, the failover cluster pauses the VM,

112
00:05:44,250 --> 00:05:46,940
so there is going to be some brief downtime.

113
00:05:46,940 --> 00:05:48,590
And again, speculation alert,

114
00:05:48,590 --> 00:05:51,520
I think the difference there lies in the hardware

115
00:05:51,520 --> 00:05:53,940
capabilities of your cluster nodes.

116
00:05:53,940 --> 00:05:55,050
Years ago, of course,

117
00:05:55,050 --> 00:05:58,190
not everybody had current hardware that was fully equipped with

118
00:05:58,190 --> 00:06:01,090
CPU virtualization support and all of that.

119
00:06:01,090 --> 00:06:03,090
Nowadays that's pretty much the standard,

120
00:06:03,090 --> 00:06:06,000
which is why I recommend you always do live migration.

121
00:06:06,000 --> 00:06:09,120
When cluster nodes may not have matching hardware,

122
00:06:09,120 --> 00:06:12,140
particularly ones with older hardware,

123
00:06:12,140 --> 00:06:18,000
you may want to go a more traditional copy route, and that would be quick migration.