1 00:00:00,900 --> 00:00:05,280 Now, we are going to start with convolutional neural networks. 2 00:00:06,120 --> 00:00:11,670 And in this section, we are going to understand some of the building blocks of CNNs. 3 00:00:13,610 --> 00:00:17,360 First of all, let us understand the motivation behind CNNs. 4 00:00:18,290 --> 00:00:23,860 What is it that CNNs do better than a normal artificial neural network? 5 00:00:25,290 --> 00:00:30,890 CNNs are mostly used for problems like image recognition or speech recognition. 6 00:00:32,210 --> 00:00:39,860 The reason for this is that CNNs outperform normal artificial neural networks in these types of problems. 7 00:00:41,510 --> 00:00:47,790 In fact, the accuracy of some CNN models at image recognition is even better than that of humans. 8 00:00:50,730 --> 00:00:55,450 Let us try to see the limitation of a normal artificial neural network. 9 00:00:57,350 --> 00:01:05,360 As we understand, an artificial neural network gets the data of each pixel as input into the first 10 00:01:05,360 --> 00:01:06,170 layer of neurons. 11 00:01:06,830 --> 00:01:16,580 So if I have a 16 by 16 pixel image and I want to find what is in that image, I will feed the information 12 00:01:16,580 --> 00:01:20,300 of all these 256 pixels into the first layer. 13 00:01:21,740 --> 00:01:23,000 But here's the problem. 14 00:01:23,960 --> 00:01:30,560 If you see the pixels randomly with no order, can you identify what this image is? 15 00:01:32,450 --> 00:01:38,960 For example, in this slide, I have an image of our favorite video game character, Mario. 16 00:01:40,460 --> 00:01:42,530 It's a 16 by 16 pixel image. 17 00:01:45,630 --> 00:01:49,160 To demonstrate how our neural network sees this image, 18 00:01:50,300 --> 00:01:58,370 I have covered all of the pixels with blue squares and have revealed 25 random pixels in each of the 19 00:01:58,490 --> 00:01:59,360 four images. 20 00:02:01,160 --> 00:02:03,560 This is similar to what a neural network sees.
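To make the idea of feeding 256 separate pixels into the first layer a bit more concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the image is filled with random values as a stand-in for the Mario picture, the seed is arbitrary, and the value -1 is just a made-up marker for a covered pixel. It flattens a 16 by 16 image into a vector of 256 values and then hides everything except 25 randomly chosen pixels, roughly like the blue squares on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Stand-in for the 16 x 16 image on the slide (random values, not Mario).
image = rng.integers(0, 256, size=(16, 16))

# A fully connected first layer receives the image as one flat vector,
# so the 2D layout (which pixels are neighbors) is not given to it.
flat = image.reshape(-1)
print(flat.shape)  # (256,)

# Reveal 25 randomly chosen pixels and hide the rest.
revealed_idx = rng.choice(flat.size, size=25, replace=False)
mask = np.zeros(flat.size, dtype=bool)
mask[revealed_idx] = True

# -1 is a hypothetical marker for a hidden pixel (real values are 0..255).
masked = np.where(mask, flat, -1).reshape(16, 16)
print(int(mask.sum()))  # 25
```

Printing `masked` shows a grid that is mostly -1 with a few scattered values, which is about as hard to recognize as the covered images in the demonstration.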
21 00:02:04,820 --> 00:02:12,380 So if a neural network sees this first image, do you think it'll be able to identify the character 22 00:02:12,410 --> 00:02:15,620 in this image or the second image? 23 00:02:16,910 --> 00:02:18,590 The same goes for the third and fourth images. 24 00:02:19,670 --> 00:02:25,820 The point that I'm trying to make here is we are not considering the effect of neighboring pixels. 25 00:02:27,160 --> 00:02:32,800 If we randomly pick pixels, we cannot tell what the image behind them is. 26 00:02:33,730 --> 00:02:41,290 Only if we consider the order of the pixels are we able to identify the object in that image. 27 00:02:42,610 --> 00:02:50,170 So identifying the object by looking at such images is a very difficult task because this is not how 28 00:02:50,170 --> 00:02:51,400 the human brain works. 29 00:02:53,210 --> 00:02:56,090 We do not look at individual points or pixels. 30 00:02:56,720 --> 00:03:00,260 We recognize patterns in groups of points or pixels. 31 00:03:01,490 --> 00:03:06,080 In fact, neurons in the visual cortex respond to different patterns. 32 00:03:07,580 --> 00:03:09,140 Some respond to horizontal lines. 33 00:03:09,260 --> 00:03:10,790 Some respond to vertical lines. 34 00:03:11,000 --> 00:03:13,100 Some respond to other complex patterns. 35 00:03:13,880 --> 00:03:20,300 The output of the lower level neurons is then processed by higher level neurons to identify objects 36 00:03:20,510 --> 00:03:21,470 in that visual field. 37 00:03:23,440 --> 00:03:27,100 Convolutional neural networks are inspired by this concept. 38 00:03:28,800 --> 00:03:35,160 In CNNs, instead of looking at each individual pixel, we look at a group of pixels. 39 00:03:38,220 --> 00:03:44,460 If we look at a group of pixels, we are more likely to pick up different features of the objects in 40 00:03:44,460 --> 00:03:44,940 the image.
41 00:03:45,900 --> 00:03:51,420 And once we know the features, it is more likely that we can predict the object in the image. 42 00:03:53,130 --> 00:04:00,480 So in this slide, you can see that by using a window on the image or by looking at a group of pixels, 43 00:04:01,740 --> 00:04:03,630 we can identify certain features. 44 00:04:05,880 --> 00:04:09,990 So you can see Mario's ear here in the first image. 45 00:04:11,580 --> 00:04:16,800 In this image, you can see Mario's red colored shirt and probably a shoulder. 46 00:04:18,240 --> 00:04:22,830 In the next image, you find some very important features of the character: 47 00:04:23,070 --> 00:04:25,380 eyes, nose and a mustache. 48 00:04:27,980 --> 00:04:36,680 The last window tells us that the character is wearing blue colored pants on the legs. With these features 49 00:04:36,710 --> 00:04:37,460 identified, 50 00:04:38,060 --> 00:04:42,590 it is easier for our network to identify the object in our image. 51 00:04:43,910 --> 00:04:52,220 So if you compare it with the previous slide in which we randomly showed you 25 pixels out of 256 pixels, 52 00:04:53,150 --> 00:04:57,170 you can easily see that in the second slide 53 00:04:57,620 --> 00:05:05,640 you can identify features because the group of pixels is together, as compared to the one in the previous 54 00:05:05,640 --> 00:05:08,150 slide, where the pixels were randomly picked. 55 00:05:10,310 --> 00:05:12,170 So this is the main idea here. 56 00:05:13,130 --> 00:05:18,440 Instead of looking at each individual pixel, we will look at a group of pixels. 57 00:05:20,270 --> 00:05:22,370 Now let us see how this is implemented. 58 00:05:24,170 --> 00:05:26,270 So here is the image at the bottom. 59 00:05:27,860 --> 00:05:30,590 This is the input image to our network. 60 00:05:32,450 --> 00:05:35,810 Now, on top of it, we will have a convolutional layer. 61 00:05:37,390 --> 00:05:39,760 This is the most important concept in a CNN.
62 00:05:40,780 --> 00:05:42,430 We have a convolutional layer. 63 00:05:43,540 --> 00:05:50,460 A convolutional layer comprises neurons which take in information from a group of pixels in the previous 64 00:05:50,460 --> 00:05:50,720 layer. 65 00:05:52,670 --> 00:05:53,990 So in this first layer, 66 00:05:56,080 --> 00:06:02,920 this neuron gets the information stored in the pixels within this rectangular box 67 00:06:03,130 --> 00:06:10,990 only. This other neuron gets information from the pixels of this rectangle only. 68 00:06:12,250 --> 00:06:18,910 Similarly, in the second convolutional layer, the information of all the neurons in this small rectangle 69 00:06:20,530 --> 00:06:24,700 that is on the first layer is taken as input by this neuron. 70 00:06:27,760 --> 00:06:34,720 This architecture allows the network to concentrate on lower level features in the first layer and then 71 00:06:34,810 --> 00:06:41,000 assemble these features into larger, higher level features in the next hidden layer and so on. 72 00:06:43,050 --> 00:06:49,140 Now, let us focus more on this window, or the receptive field, of these neurons. 73 00:06:51,360 --> 00:06:56,310 So this window is also known as the receptive field of that particular neuron. 74 00:06:58,410 --> 00:07:01,380 This window has two dimensions, height and width. 75 00:07:03,340 --> 00:07:11,380 In this image, you can see that the height of the window we have taken is five pixels and the width is also 76 00:07:11,380 --> 00:07:12,160 five pixels. 77 00:07:13,360 --> 00:07:16,450 So we say that this is a five cross five window. 78 00:07:18,310 --> 00:07:20,560 We can also have a three cross three window, 79 00:07:20,860 --> 00:07:24,090 or a two cross three window, or any such dimension. 80 00:07:25,630 --> 00:07:29,950 The most commonly used dimensions are three cross three or five cross five.
81 00:07:31,570 --> 00:07:39,490 So for this particular window, the information stored in all the pixels of this window will go into one 82 00:07:39,490 --> 00:07:42,220 particular neuron in the upper convolutional layer.
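The receptive field idea can be sketched in a few lines of NumPy. This is a rough illustration and not necessarily how the lecture's own model is built. Several things here are assumptions that the lecture has not stated: each neuron computes a weighted sum of its window plus a bias, the window moves one pixel at a time with no padding, and every neuron in the layer shares the same 5 by 5 weights. The names `neuron_output` and `conv_layer` are made up for this example, and the image and weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility
image = rng.random((16, 16))              # stand-in 16 x 16 grayscale image
kernel = rng.standard_normal((5, 5))      # hypothetical 5 x 5 weights
bias = 0.0                                # hypothetical bias


def neuron_output(image, top, left, kernel, bias):
    """One neuron: sees only the window whose top-left corner is (top, left)."""
    kh, kw = kernel.shape
    window = image[top:top + kh, left:left + kw]   # the receptive field
    return float(np.sum(window * kernel) + bias)   # weighted sum of the window


def conv_layer(image, kernel, bias):
    """One neuron per window position (assumed: stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = neuron_output(image, i, j, kernel, bias)
    return out


feature_map = conv_layer(image, kernel, bias)
print(feature_map.shape)  # (12, 12)
```

With a 5 by 5 window on a 16 by 16 image there are 16 - 5 + 1 = 12 window positions along each side, so under these assumptions the layer above has 12 by 12 neurons. Changing a pixel outside a neuron's window leaves that neuron's output unchanged, which is exactly what "this neuron gets information from this rectangle only" means.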