1 00:00:00,150 --> 00:00:01,050 In this lesson, 2 00:00:01,050 --> 00:00:02,340 we're going to talk about sensors 3 00:00:02,340 --> 00:00:04,800 that help us monitor the performance of our network devices, 4 00:00:04,800 --> 00:00:07,890 those devices like routers, switches, and firewalls. 5 00:00:07,890 --> 00:00:09,600 Now, these sensors can be used to monitor 6 00:00:09,600 --> 00:00:12,900 the device's temperature, its CPU usage, and its memory, 7 00:00:12,900 --> 00:00:15,000 and these things can be key indicators 8 00:00:15,000 --> 00:00:16,890 of whether a device is operating properly 9 00:00:16,890 --> 00:00:19,980 or is about to suffer a catastrophic failure. 10 00:00:19,980 --> 00:00:22,050 Our first sensor measurement we need to talk about 11 00:00:22,050 --> 00:00:23,850 is the temperature of the device. 12 00:00:23,850 --> 00:00:25,170 Now, most network devices, 13 00:00:25,170 --> 00:00:27,090 like your routers, switches, and firewalls, 14 00:00:27,090 --> 00:00:27,923 have the ability 15 00:00:27,923 --> 00:00:30,750 to report on the temperature within their chassis. 16 00:00:30,750 --> 00:00:32,220 Now, depending on the model, 17 00:00:32,220 --> 00:00:34,350 there may be only one or two temperature readings, 18 00:00:34,350 --> 00:00:36,690 or on some larger enterprise devices, 19 00:00:36,690 --> 00:00:38,070 you may have a temperature reading 20 00:00:38,070 --> 00:00:41,220 on each and every controller, processor, interface card, 21 00:00:41,220 --> 00:00:43,440 and things like that inside the system. 22 00:00:43,440 --> 00:00:45,210 Now, the temperature sensors can be used 23 00:00:45,210 --> 00:00:47,850 to measure the air temperature at the intake 24 00:00:47,850 --> 00:00:50,850 and the air temperature at the exhaust outlet at a minimum. 25 00:00:50,850 --> 00:00:52,140 Now, for each of these sensors, 26 00:00:52,140 --> 00:00:55,110 you can set up minor and major temperature thresholds.
27 00:00:55,110 --> 00:00:57,870 A minor temperature threshold is used to set off an alarm 28 00:00:57,870 --> 00:00:59,760 when a rising temperature is detected 29 00:00:59,760 --> 00:01:02,280 but it hasn't reached dangerous levels yet. 30 00:01:02,280 --> 00:01:04,830 When this occurs, a system message is displayed, 31 00:01:04,830 --> 00:01:06,840 an SNMP notification is sent, 32 00:01:06,840 --> 00:01:09,390 and an environmental alarm can be sounded. 33 00:01:09,390 --> 00:01:11,580 Now, when you have a major temperature threshold, 34 00:01:11,580 --> 00:01:13,680 this is going to be used to set off an alarm 35 00:01:13,680 --> 00:01:16,170 when the temperature reaches dangerous levels. 36 00:01:16,170 --> 00:01:19,440 At this level, we want to still display those system messages, 37 00:01:19,440 --> 00:01:21,540 get that SNMP notification, 38 00:01:21,540 --> 00:01:23,580 and have the environmental alarm sounded, 39 00:01:23,580 --> 00:01:25,050 but in addition to that, 40 00:01:25,050 --> 00:01:27,060 the device can actually start to load shed 41 00:01:27,060 --> 00:01:28,590 by turning off different functions 42 00:01:28,590 --> 00:01:29,550 to reduce the heat 43 00:01:29,550 --> 00:01:31,950 being generated by the device's processor. 44 00:01:31,950 --> 00:01:33,660 For example, let's say you have a router 45 00:01:33,660 --> 00:01:35,670 with multiple processing cards in it. 46 00:01:35,670 --> 00:01:38,400 That device may shut down one of those processing cards 47 00:01:38,400 --> 00:01:40,650 to prevent the entire system from overheating. 48 00:01:40,650 --> 00:01:42,540 That's what I mean by load shedding. 49 00:01:42,540 --> 00:01:43,680 Now, when a device runs 50 00:01:43,680 --> 00:01:45,540 at excessive temperatures for too long, 51 00:01:45,540 --> 00:01:47,460 the performance will decrease on that device 52 00:01:47,460 --> 00:01:50,580 and the lifespan will decline on that device as well.
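As a rough illustration of the minor and major thresholds described above, here is a minimal Python sketch that maps a temperature reading to the responses mentioned in the lesson. The class name, function name, action labels, and the 45 and 60 degree limits are all hypothetical and chosen only for the example; real devices define their own thresholds and behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemperatureThresholds:
    """Hypothetical per-sensor limits in degrees Celsius (not vendor values)."""
    minor_c: float
    major_c: float


def temperature_actions(reading_c: float, limits: TemperatureThresholds) -> list[str]:
    """Return the responses the lesson associates with a temperature reading."""
    if reading_c < limits.minor_c:
        return []  # below both thresholds: nothing to report
    # Minor threshold reached: message, SNMP notification, environmental alarm.
    actions = ["system_message", "snmp_notification", "environmental_alarm"]
    if reading_c >= limits.major_c:
        # Major threshold reached: the device may also start to load shed.
        actions.append("load_shed")
    return actions


# Assumed example limits, for illustration only.
limits = TemperatureThresholds(minor_c=45.0, major_c=60.0)
for reading in (30.0, 50.0, 65.0):
    print(reading, temperature_actions(reading, limits))
```

The major case deliberately reuses the minor actions and adds load shedding on top, which mirrors how the lesson describes the two levels.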
53 00:01:50,580 --> 00:01:52,650 Over time, that device can even suffer 54 00:01:52,650 --> 00:01:55,260 a catastrophic failure from overheating. 55 00:01:55,260 --> 00:01:57,360 Our second sensor measurement we need to talk about 56 00:01:57,360 --> 00:02:00,780 is CPU usage or utilization on the device. 57 00:02:00,780 --> 00:02:03,210 At their core, routers, switches, and firewalls 58 00:02:03,210 --> 00:02:05,070 are just specialized computers. 59 00:02:05,070 --> 00:02:07,410 When these devices are running under normal conditions, 60 00:02:07,410 --> 00:02:09,780 their CPU or central processing unit 61 00:02:09,780 --> 00:02:11,610 should have minimal utilization, 62 00:02:11,610 --> 00:02:14,340 somewhere in the range of 5 to 40%, 63 00:02:14,340 --> 00:02:16,860 but if the devices begin to become extremely busy 64 00:02:16,860 --> 00:02:19,380 or receive too many packets from their neighboring devices, 65 00:02:19,380 --> 00:02:22,170 the CPU can become overutilized 66 00:02:22,170 --> 00:02:23,940 and the percentage will increase. 67 00:02:23,940 --> 00:02:26,490 Now, if the CPU utilization gets too high, 68 00:02:26,490 --> 00:02:29,580 the device could become unable to process any more requests, 69 00:02:29,580 --> 00:02:31,020 and it'll start to drop packets, 70 00:02:31,020 --> 00:02:33,510 or the entire connection could fail. 71 00:02:33,510 --> 00:02:36,540 Usually when you see a high processor utilization rate, 72 00:02:36,540 --> 00:02:38,910 this is an indication of a misconfigured network 73 00:02:38,910 --> 00:02:40,650 or a network under attack.
74 00:02:40,650 --> 00:02:42,720 If the network is misconfigured, for example, 75 00:02:42,720 --> 00:02:44,640 let's say you have a switch that's misconfigured, 76 00:02:44,640 --> 00:02:47,160 you can end up having a broadcast storm that occurs, 77 00:02:47,160 --> 00:02:47,993 and that's going to create 78 00:02:47,993 --> 00:02:49,890 an excessive amount of broadcast traffic 79 00:02:49,890 --> 00:02:52,650 that'll cause the switch's CPU to become overutilized 80 00:02:52,650 --> 00:02:55,110 as it tries to process all those requests. 81 00:02:55,110 --> 00:02:56,970 Similarly, if you have a lot of complex 82 00:02:56,970 --> 00:02:58,830 and intricate ACLs on your router, 83 00:02:58,830 --> 00:03:01,230 and then people start sending a lot of inbound traffic, 84 00:03:01,230 --> 00:03:03,450 that router has to go through all of those ACLs 85 00:03:03,450 --> 00:03:05,220 each time for that traffic, 86 00:03:05,220 --> 00:03:07,110 and that can make it become unresponsive 87 00:03:07,110 --> 00:03:09,210 due to high CPU usage. 88 00:03:09,210 --> 00:03:12,180 As an administrator, you need to monitor the CPU utilization 89 00:03:12,180 --> 00:03:13,410 in your network devices 90 00:03:13,410 --> 00:03:15,300 to determine if they're operating properly, 91 00:03:15,300 --> 00:03:18,180 if they're misconfigured, or if they're under attack. 92 00:03:18,180 --> 00:03:19,800 The third sensor measurement we use 93 00:03:19,800 --> 00:03:22,200 is memory utilization for the device. 94 00:03:22,200 --> 00:03:24,210 Similar to high CPU utilization, 95 00:03:24,210 --> 00:03:26,280 high memory utilization can be indicative 96 00:03:26,280 --> 00:03:28,260 of a larger problem in your network. 97 00:03:28,260 --> 00:03:30,240 If your devices begin to use too much memory, 98 00:03:30,240 --> 00:03:32,850 this can lead to system hangs, processor crashes, 99 00:03:32,850 --> 00:03:34,890 and other undesirable behavior.
100 00:03:34,890 --> 00:03:36,300 To help protect against this, 101 00:03:36,300 --> 00:03:37,830 you should have minor, severe, 102 00:03:37,830 --> 00:03:39,690 and critical memory threshold warnings 103 00:03:39,690 --> 00:03:41,010 set up in your devices 104 00:03:41,010 --> 00:03:43,530 and reporting back to your centralized monitoring dashboard 105 00:03:43,530 --> 00:03:45,330 using SNMP. 106 00:03:45,330 --> 00:03:47,640 As a baseline, your network devices should operate 107 00:03:47,640 --> 00:03:49,650 at around 40% memory utilization 108 00:03:49,650 --> 00:03:51,600 under normal working conditions. 109 00:03:51,600 --> 00:03:55,440 During busier times, you may see this rise up to 60 to 70%, 110 00:03:55,440 --> 00:03:58,710 and during peak times, it may be up to 80%, 111 00:03:58,710 --> 00:04:01,080 but if you're constantly seeing memory utilization 112 00:04:01,080 --> 00:04:02,580 above 80%, 113 00:04:02,580 --> 00:04:03,570 you may need to install 114 00:04:03,570 --> 00:04:06,450 a larger or more powerful device for your network, 115 00:04:06,450 --> 00:04:08,340 or you could be under an attack 116 00:04:08,340 --> 00:04:09,690 that has gone on for an extended time 117 00:04:09,690 --> 00:04:11,610 and is causing excessive load. 118 00:04:11,610 --> 00:04:13,980 As you begin to operate your networks in the real world, 119 00:04:13,980 --> 00:04:15,990 you're going to begin to see what normal looks like 120 00:04:15,990 --> 00:04:17,670 for your particular network. 121 00:04:17,670 --> 00:04:19,350 As you see temperatures rising, 122 00:04:19,350 --> 00:04:21,899 or CPU and memory utilizations increase, 123 00:04:21,899 --> 00:04:24,330 this can trigger alarms indicating that a network configuration 124 00:04:24,330 --> 00:04:27,510 or a network performance issue is happening right now.
125 00:04:27,510 --> 00:04:30,180 Then you need to investigate the root cause of that 126 00:04:30,180 --> 00:04:31,410 and solve those issues 127 00:04:31,410 --> 00:04:33,390 by bringing those metrics back to a normal level 128 00:04:33,390 --> 00:04:34,503 within your baseline.
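To make the baseline figures from this lesson concrete, here is a small Python sketch that checks CPU and memory readings against them. The 5 to 40% CPU range and the 60, 70, and 80% memory figures come from the lesson; the function names, the band labels, and the exact cut-offs between bands (for example, treating everything below 60% as normal) are assumptions made only for illustration, and your own network's baseline may differ.

```python
def cpu_in_normal_range(percent: float, low: float = 5.0, high: float = 40.0) -> bool:
    """True when CPU utilization sits in the lesson's 5 to 40% normal range."""
    return low <= percent <= high


def memory_band(percent: float) -> str:
    """Label a memory utilization reading using the lesson's rough bands."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("utilization must be between 0 and 100")
    if percent > 80.0:
        return "above_peak"  # only a concern if it stays this high
    if percent > 70.0:
        return "peak"
    if percent >= 60.0:
        return "busy"
    return "normal"  # assumed cut-off: anything under 60%


def constantly_above(samples: list[float], limit: float = 80.0) -> bool:
    """True when every sample in a non-empty window exceeds the limit."""
    return bool(samples) and all(sample > limit for sample in samples)


# Hypothetical readings collected from a monitoring dashboard.
recent_memory = [83.0, 85.5, 91.0, 88.2]
print(memory_band(recent_memory[-1]), constantly_above(recent_memory))
print(cpu_in_normal_range(22.0), cpu_in_normal_range(97.0))
```

Checking a whole window of samples rather than a single reading reflects the lesson's point that a brief peak is expected, while utilization that is constantly above 80% is what calls for a root cause investigation.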