WEBVTT

00:00.000 --> 00:01.470
>> Let's switch gears a little

00:01.470 --> 00:02.970
bit and start talking about what

00:02.970 --> 00:06.659
happens when things go bad
for good network admins.

00:06.659 --> 00:09.900
Regardless of what incidents
happen on the network,

00:09.900 --> 00:11.730
fault management and
fault tolerance

00:11.730 --> 00:13.320
are always going to be essential.

00:13.320 --> 00:15.345
When we talk about
fault management,

00:15.345 --> 00:16.485
we are really talking about

00:16.485 --> 00:18.105
being able to withstand failures

00:18.105 --> 00:19.740
of one or more devices and

00:19.740 --> 00:21.329
continue to move forward.

00:21.329 --> 00:23.535
Do we have the
redundancy in place?

00:23.535 --> 00:25.050
Do we have the
capabilities to keep

00:25.050 --> 00:27.270
moving forward as
an organization?

00:27.270 --> 00:29.760
First, when we're
planning for redundancy,

00:29.760 --> 00:31.350
we have to have some idea

00:31.350 --> 00:33.089
of how much
redundancy we need.

00:33.089 --> 00:34.980
How much is enough?

00:34.980 --> 00:36.075
There are a couple of terms,

00:36.075 --> 00:38.160
a couple of metrics
that may be helpful.

00:38.160 --> 00:42.275
The first is a key
performance indicator, KPI.

00:42.275 --> 00:44.345
The key performance
indicator will give

00:44.345 --> 00:46.370
us an idea of whether
or not a device is

00:46.370 --> 00:48.500
performing to its expectations

00:48.500 --> 00:51.835
because rarely does a
device just stop working.

00:51.835 --> 00:53.660
Often, the service starts to

00:53.660 --> 00:56.195
degrade and we start to
see some issues come up.

00:56.195 --> 00:58.130
If a specific device isn't

00:58.130 --> 00:59.810
meeting its standard
performance,

00:59.810 --> 01:01.610
that might be an indicator
that we're looking

01:01.610 --> 01:03.685
at a device that's
getting ready to fail.

01:03.685 --> 01:07.610
There are also metrics that
the vendors give to us.

01:07.610 --> 01:10.550
Two are mean time to failure and

01:10.550 --> 01:12.579
mean time between failures.

01:12.579 --> 01:14.180
These are pretty comparable,

01:14.180 --> 01:16.160
but the idea is that
mean time to failure

01:16.160 --> 01:18.850
indicates this is a device
that we don't repair.

01:18.850 --> 01:21.225
This is the lifespan
of the device.

01:21.225 --> 01:22.500
We're going to buy it, and

01:22.500 --> 01:24.510
three years later
it's going to fail.

01:24.510 --> 01:26.660
Mean time between failures gives

01:26.660 --> 01:29.185
the indication that the
device can be repaired.

01:29.185 --> 01:30.605
We buy the device,

01:30.605 --> 01:32.600
it goes three years, it fails,

01:32.600 --> 01:34.490
we repair it, three years later,

01:34.490 --> 01:36.815
it fails, we repair
it, and so on.

01:36.815 --> 01:38.810
Then we've also got
to consider how

01:38.810 --> 01:40.925
long it takes us to
repair the device,

01:40.925 --> 01:43.130
which is mean time to repair.

01:43.130 --> 01:45.110
If it's going to
take us a long time

01:45.110 --> 01:46.400
to repair the device,

01:46.400 --> 01:48.850
is it possible that we
can just replace it?

01:48.850 --> 01:51.110
In practice, we
don't really repair

01:51.110 --> 01:53.270
a whole lot of
devices today because

01:53.270 --> 01:55.280
most of them can be
replaced for much less

01:55.280 --> 01:57.980
than the time and effort it
would take to repair them.

01:57.980 --> 02:00.940
The last are service-
level agreements.

02:00.940 --> 02:02.810
Vendors are going
to provide us with

02:02.810 --> 02:05.615
commitments as to performance
and availability.

02:05.615 --> 02:08.440
A lot of times that
revolves around uptime.

02:08.440 --> 02:10.310
For instance, if I'm storing

02:10.310 --> 02:12.305
my data with a cloud
service provider,

02:12.305 --> 02:13.850
the redundancy is up to them

02:13.850 --> 02:16.469
so that they can
meet their metrics.

02:16.490 --> 02:18.900
With redundancy, we want

02:18.900 --> 02:20.825
that redundancy to be comprehensive

02:20.825 --> 02:21.980
and not focus on

02:21.980 --> 02:24.905
just redundant data or
just redundant drives.

02:24.905 --> 02:27.350
We really want to make sure
that all of our areas are

02:27.350 --> 02:28.700
redundant because a chain is

02:28.700 --> 02:30.785
only as strong as
its weakest link.

02:30.785 --> 02:33.870
We start off by talking
about redundancy in servers.

02:33.870 --> 02:37.100
That's just multiple servers
performing the same role.

02:37.100 --> 02:39.080
This is not the
same as a cluster

02:39.080 --> 02:41.000
because with redundant servers,

02:41.000 --> 02:44.095
each server is its
own unique device.

02:44.095 --> 02:46.470
I have domain controller A,

02:46.470 --> 02:48.000
and domain controller B,

02:48.000 --> 02:50.520
or DNS 1, DNS 2.

02:50.520 --> 02:52.400
If one fails, the second is

02:52.400 --> 02:55.200
up and running and is available.

02:55.750 --> 02:57.890
With a cluster, I have

02:57.890 --> 02:59.630
multiple physical nodes acting

02:59.630 --> 03:01.490
as a single logical entity.

03:01.490 --> 03:03.560
They are very tightly coupled.

03:03.560 --> 03:06.845
For instance, if I have a
cluster or a server farm,

03:06.845 --> 03:08.090
as you sometimes hear them called,

03:08.090 --> 03:09.980
you're not going to be
able to differentiate

03:09.980 --> 03:11.345
between the servers.

03:11.345 --> 03:13.260
When I go to amazon.com,

03:13.260 --> 03:14.570
there are many hundreds of

03:14.570 --> 03:16.925
machines responding
to web queries.

03:16.925 --> 03:19.070
I don't know which one is
which because they are

03:19.070 --> 03:21.220
functioning as part
of the same cluster.

03:21.220 --> 03:24.590
Clusters usually also
provide load balancing,

03:24.590 --> 03:27.455
but you can also have clusters
that don't load balance.

03:27.455 --> 03:30.440
They're simply working
for fault tolerance.

03:30.440 --> 03:31.915
In a smaller environment,

03:31.915 --> 03:33.935
I might have multiple servers.

03:33.935 --> 03:35.770
They can either
both be responding

03:35.770 --> 03:37.090
to requests or one could be

03:37.090 --> 03:38.350
passive while the other is

03:38.350 --> 03:40.179
the primary
responding server.

03:40.179 --> 03:42.415
There are all sorts
of configurations,

03:42.415 --> 03:43.630
but the real purpose and the

03:43.630 --> 03:44.980
really important piece is that

03:44.980 --> 03:47.815
we need fault
tolerance for those servers.

03:47.815 --> 03:49.735
That's where our services run.

03:49.735 --> 03:51.595
That's where our resources are.

03:51.595 --> 03:53.665
Many times we use
spare equipment.

03:53.665 --> 03:55.360
I have a switch in the closet.

03:55.360 --> 03:57.920
When we talk about cold
spares, that's exactly it.

03:57.920 --> 04:00.030
I've got a device
somewhere in the closet.

04:00.030 --> 04:01.469
I can find it.

04:01.469 --> 04:02.340
I can install it.

04:02.340 --> 04:06.030
We often have
warm spares, which

04:06.030 --> 04:07.744
are ready to
go very quickly.

04:07.744 --> 04:10.075
Hot-swap or hot spares are

04:10.075 --> 04:11.680
already installed and
it's just a matter

04:11.680 --> 04:13.385
of switching over to them.

04:13.385 --> 04:16.290
Depending on the value of
what's being protected,

04:16.290 --> 04:18.410
how much downtime we
can tolerate is going

04:18.410 --> 04:21.480
to determine what types
of spares we have.

