1
00:00:00,150 --> 00:00:01,050
In this lesson,

2
00:00:01,050 --> 00:00:02,340
we're going to talk about sensors

3
00:00:02,340 --> 00:00:04,800
that help us monitor the performance of our network devices,

4
00:00:04,800 --> 00:00:07,890
those devices like routers, switches, and firewalls.

5
00:00:07,890 --> 00:00:09,600
Now, these sensors can be used to monitor

6
00:00:09,600 --> 00:00:12,900
the device's temperature, its CPU usage, and its memory,

7
00:00:12,900 --> 00:00:15,000
and these things can be key indicators

8
00:00:15,000 --> 00:00:16,890
of whether a device is operating properly

9
00:00:16,890 --> 00:00:19,980
or is about to suffer a catastrophic failure.

10
00:00:19,980 --> 00:00:22,050
The first sensor measurement we need to talk about

11
00:00:22,050 --> 00:00:23,850
is the temperature of the device.

12
00:00:23,850 --> 00:00:25,170
Now, most network devices,

13
00:00:25,170 --> 00:00:27,090
like your routers, switches, and firewalls,

14
00:00:27,090 --> 00:00:27,923
have the ability

15
00:00:27,923 --> 00:00:30,750
to report on the temperature within their chassis.

16
00:00:30,750 --> 00:00:32,220
Now, depending on the model,

17
00:00:32,220 --> 00:00:34,350
there may be only one or two temperature readings,

18
00:00:34,350 --> 00:00:36,690
or on some larger enterprise devices,

19
00:00:36,690 --> 00:00:38,070
you may have a temperature reading

20
00:00:38,070 --> 00:00:41,220
on each and every controller, processor, interface card,

21
00:00:41,220 --> 00:00:43,440
and things like that inside the system.

22
00:00:43,440 --> 00:00:45,210
Now, the temperature sensors can be used

23
00:00:45,210 --> 00:00:47,850
to measure the air temperature at the air intake

24
00:00:47,850 --> 00:00:50,850
and the air temperature at the exhaust outlet at a minimum.

25
00:00:50,850 --> 00:00:52,140
Now, for each of these sensors,

26
00:00:52,140 --> 00:00:55,110
you can set up minor and major temperature thresholds.

27
00:00:55,110 --> 00:00:57,870
A minor temperature threshold is used to set off an alarm

28
00:00:57,870 --> 00:00:59,760
when a rising temperature is detected

29
00:00:59,760 --> 00:01:02,280
but it hasn't reached dangerous levels yet.

30
00:01:02,280 --> 00:01:04,830
When this occurs, a system message is displayed,

31
00:01:04,830 --> 00:01:06,840
an SNMP notification is sent,

32
00:01:06,840 --> 00:01:09,390
and an environmental alarm can be sounded.

33
00:01:09,390 --> 00:01:11,580
Now, when you have a major temperature threshold,

34
00:01:11,580 --> 00:01:13,680
this is going to be used to set off an alarm

35
00:01:13,680 --> 00:01:16,170
when the temperature reaches dangerous levels.

36
00:01:16,170 --> 00:01:19,440
At this level, we want to still display those system messages,

37
00:01:19,440 --> 00:01:21,540
get that SNMP notification,

38
00:01:21,540 --> 00:01:23,580
and have the environmental alarm sounded,

39
00:01:23,580 --> 00:01:25,050
but in addition to that,

40
00:01:25,050 --> 00:01:27,060
the device can actually start to load shed

41
00:01:27,060 --> 00:01:28,590
by turning off different functions

42
00:01:28,590 --> 00:01:29,550
to reduce the heat

43
00:01:29,550 --> 00:01:31,950
being generated by the device's processor.

44
00:01:31,950 --> 00:01:33,660
For example, let's say you have a router

45
00:01:33,660 --> 00:01:35,670
with multiple processing cards in it.

46
00:01:35,670 --> 00:01:38,400
That device may shut down one of those processing cards

47
00:01:38,400 --> 00:01:40,650
to prevent the entire system from overheating.

48
00:01:40,650 --> 00:01:42,540
That's what I mean by load shedding.

49
00:01:42,540 --> 00:01:43,680
Now, when a device runs

50
00:01:43,680 --> 00:01:45,540
at excessive temperatures for too long,

51
00:01:45,540 --> 00:01:47,460
the performance will decrease on that device

52
00:01:47,460 --> 00:01:50,580
and the lifespan will decline on that device as well.

53
00:01:50,580 --> 00:01:52,650
Over time, that device can even suffer

54
00:01:52,650 --> 00:01:55,260
a catastrophic failure from overheating.
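
The minor and major temperature thresholds described above can be sketched in a few lines of Python. This is only an illustration, not how any particular vendor implements it. The names `classify_temperature` and `actions_for` and the threshold values of 45 and 60 degrees Celsius are assumptions made up for this example; real thresholds come from your device's documentation.

```python
from enum import Enum


class Alarm(Enum):
    NORMAL = "normal"
    MINOR = "minor"
    MAJOR = "major"


# Hypothetical thresholds in degrees Celsius (assumed for illustration only).
MINOR_C = 45.0
MAJOR_C = 60.0


def classify_temperature(celsius, minor=MINOR_C, major=MAJOR_C):
    """Map a temperature reading to an alarm level."""
    if celsius >= major:
        return Alarm.MAJOR
    if celsius >= minor:
        return Alarm.MINOR
    return Alarm.NORMAL


def actions_for(alarm):
    """List the responses the lesson describes for each alarm level."""
    if alarm is Alarm.NORMAL:
        return []
    actions = ["system message", "SNMP notification", "environmental alarm"]
    if alarm is Alarm.MAJOR:
        # Only a major alarm adds load shedding on top of the notifications.
        actions.append("load shed")
    return actions
```

Calling `actions_for(classify_temperature(62.0))` with these assumed thresholds returns the three notifications plus `"load shed"`, while a reading between the two thresholds returns only the notifications.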

55
00:01:55,260 --> 00:01:57,360
The second sensor measurement we need to talk about

56
00:01:57,360 --> 00:02:00,780
is CPU usage or utilization on the device.

57
00:02:00,780 --> 00:02:03,210
At their core, routers, switches, and firewalls

58
00:02:03,210 --> 00:02:05,070
are just specialized computers.

59
00:02:05,070 --> 00:02:07,410
When these devices are running under normal conditions,

60
00:02:07,410 --> 00:02:09,780
their CPU or central processing unit

61
00:02:09,780 --> 00:02:11,610
should have minimal utilization,

62
00:02:11,610 --> 00:02:14,340
somewhere in the range of 5 to 40%,

63
00:02:14,340 --> 00:02:16,860
but if a device becomes extremely busy

64
00:02:16,860 --> 00:02:19,380
or receives too many packets from its neighboring devices,

65
00:02:19,380 --> 00:02:22,170
the CPU can become overutilized

66
00:02:22,170 --> 00:02:23,940
and the percentage will increase.

67
00:02:23,940 --> 00:02:26,490
Now, if the CPU utilization gets too high,

68
00:02:26,490 --> 00:02:29,580
the device could become unable to process any more requests,

69
00:02:29,580 --> 00:02:31,020
and it'll start to drop packets,

70
00:02:31,020 --> 00:02:33,510
or the entire connection could fail.

71
00:02:33,510 --> 00:02:36,540
Usually when you see a high processor utilization rate,

72
00:02:36,540 --> 00:02:38,910
this is an indication of a misconfigured network

73
00:02:38,910 --> 00:02:40,650
or a network under attack.

74
00:02:40,650 --> 00:02:42,720
If the network is misconfigured, for example,

75
00:02:42,720 --> 00:02:44,640
let's say you have a switch that's misconfigured,

76
00:02:44,640 --> 00:02:47,160
you can end up having a broadcast storm that occurs,

77
00:02:47,160 --> 00:02:47,993
and that's going to create

78
00:02:47,993 --> 00:02:49,890
an excessive amount of broadcast traffic

79
00:02:49,890 --> 00:02:52,650
that'll cause the switch's CPU to become overutilized

80
00:02:52,650 --> 00:02:55,110
as it tries to process all those requests.

81
00:02:55,110 --> 00:02:56,970
Similarly, if you have a lot of complex

82
00:02:56,970 --> 00:02:58,830
and intricate ACLs on your router,

83
00:02:58,830 --> 00:03:01,230
and then people start sending a lot of inbound traffic,

84
00:03:01,230 --> 00:03:03,450
that router has to go through all of those ACLs

85
00:03:03,450 --> 00:03:05,220
each time for that traffic,

86
00:03:05,220 --> 00:03:07,110
and that can make it become unresponsive

87
00:03:07,110 --> 00:03:09,210
due to high CPU usage.

88
00:03:09,210 --> 00:03:12,180
As an administrator, you need to monitor the CPU utilization

89
00:03:12,180 --> 00:03:13,410
in your network devices

90
00:03:13,410 --> 00:03:15,300
to determine if they're operating properly,

91
00:03:15,300 --> 00:03:18,180
if they're misconfigured, or if they're under attack.
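
As a rough sketch of that kind of CPU monitoring, the Python below checks whether recent samples stay above the 5 to 40% normal range mentioned earlier. The function name `cpu_above_normal` and the `sustained` parameter, which says how many consecutive samples count as a real problem, are assumptions for this example, not part of any monitoring product.

```python
# Normal CPU utilization range from the lesson, in percent.
NORMAL_CPU_RANGE = (5.0, 40.0)


def cpu_above_normal(samples, upper=NORMAL_CPU_RANGE[1], sustained=3):
    """Return True if the last `sustained` samples all exceed `upper`.

    `samples` is a list of CPU utilization percentages, oldest first.
    `sustained` is a hypothetical setting to ignore brief spikes.
    """
    if len(samples) < sustained:
        return False
    return all(sample > upper for sample in samples[-sustained:])
```

A single spike does not trip the check, but several high readings in a row do, which would be the cue to look for a misconfiguration or an attack.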

92
00:03:18,180 --> 00:03:19,800
The third sensor measurement we use

93
00:03:19,800 --> 00:03:22,200
is memory utilization for the device.

94
00:03:22,200 --> 00:03:24,210
Similar to high CPU utilization,

95
00:03:24,210 --> 00:03:26,280
high memory utilization can be indicative

96
00:03:26,280 --> 00:03:28,260
of a larger problem in your network.

97
00:03:28,260 --> 00:03:30,240
If your devices begin to use too much memory,

98
00:03:30,240 --> 00:03:32,850
this can lead to system hangs, processor crashes,

99
00:03:32,850 --> 00:03:34,890
and other undesirable behavior.

100
00:03:34,890 --> 00:03:36,300
To help protect against this,

101
00:03:36,300 --> 00:03:37,830
you should have minor, severe,

102
00:03:37,830 --> 00:03:39,690
and critical memory threshold warnings

103
00:03:39,690 --> 00:03:41,010
set up in your devices

104
00:03:41,010 --> 00:03:43,530
and reporting back to your centralized monitoring dashboard

105
00:03:43,530 --> 00:03:45,330
using SNMP.

106
00:03:45,330 --> 00:03:47,640
As a baseline, your network devices should operate

107
00:03:47,640 --> 00:03:49,650
at around 40% memory utilization

108
00:03:49,650 --> 00:03:51,600
under normal working conditions.

109
00:03:51,600 --> 00:03:55,440
During busier times, you may see this rise up to 60 to 70%,

110
00:03:55,440 --> 00:03:58,710
and during peak times, it may be up to 80%,

111
00:03:58,710 --> 00:04:01,080
but if you're constantly seeing memory utilization

112
00:04:01,080 --> 00:04:02,580
above 80%,

113
00:04:02,580 --> 00:04:03,570
you may need to install

114
00:04:03,570 --> 00:04:06,450
a larger or more powerful device for your network,

115
00:04:06,450 --> 00:04:08,340
or you could be under an attack

116
00:04:08,340 --> 00:04:09,690
for an extended amount of time

117
00:04:09,690 --> 00:04:11,610
that's causing excessive load.
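
One way to turn "constantly seeing memory utilization above 80%" into a check is sketched below in Python. The 80% limit comes from the lesson. The function name `memory_needs_attention` and the `min_fraction` value of 0.9, which is one possible reading of "constantly", are assumptions for illustration only.

```python
# Peak memory utilization from the lesson, in percent.
PEAK_LIMIT = 80.0


def memory_needs_attention(samples, limit=PEAK_LIMIT, min_fraction=0.9):
    """Return True if at least `min_fraction` of samples exceed `limit`.

    `samples` is a list of memory utilization percentages.
    `min_fraction` is a hypothetical way to quantify "constantly".
    """
    if not samples:
        return False
    over = sum(1 for sample in samples if sample > limit)
    return over / len(samples) >= min_fraction
```

A device that only touches 85% during one peak period would not be flagged, but one that sits above 80% in nearly every sample would be.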

118
00:04:11,610 --> 00:04:13,980
As you begin to operate your networks in the real world,

119
00:04:13,980 --> 00:04:15,990
you're going to begin to see what normal looks like

120
00:04:15,990 --> 00:04:17,670
for your particular network.

121
00:04:17,670 --> 00:04:19,350
As you see temperatures rising,

122
00:04:19,350 --> 00:04:21,899
or CPU and memory utilization increasing,

123
00:04:21,899 --> 00:04:24,330
this can trigger alarms that a network configuration

124
00:04:24,330 --> 00:04:27,510
or a network performance issue is happening right now.

125
00:04:27,510 --> 00:04:30,180
Then you need to investigate the root cause of that

126
00:04:30,180 --> 00:04:31,410
and solve those issues

127
00:04:31,410 --> 00:04:33,390
by bringing those metrics back to a normal level

128
00:04:33,390 --> 00:04:34,503
within your baseline.