Now we are going to start with convolutional neural networks. In this section we are going to understand some of the building blocks of CNNs. First of all, let us understand the motivation behind CNNs: what is it that CNNs do better than a normal artificial neural network? CNNs are mostly used for problems like image recognition or speech recognition. The reason for this is that CNNs outperform normal artificial neural networks in these types of problems. In fact, the accuracy of some CNN models in image recognition is even better than that of humans.

Let us try to see the limitations of a normal artificial neural network. As we understand, an artificial neural network gets the data of each pixel as input into the first layer of neurons. So if you have a sixteen by sixteen pixel image and I want to find what is in that image, I will feed the information of all these 256 pixels into the first layer. But here's the problem: if you see the pixels randomly, with no order, can you identify what the image is? For example, in the slide I have an image of our favorite video game character, Mario. It's a 16 by 16 pixel image. To demonstrate how our neural network sees this image, I have covered all of the pixels with blue squares and have revealed 25 random pixels in each of the four images. This is similar to what a neural network sees.
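The random-reveal demonstration from the slide can be sketched in a few lines of NumPy. This is only an illustration under assumed values: a random array stands in for the 16 by 16 Mario sprite, and -1 marks a covered "blue" pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 16x16 Mario sprite: any 16x16 array of pixel values.
image = rng.integers(0, 256, size=(16, 16))

# Reveal only 25 of the 256 pixels, chosen at random; cover the rest.
flat_indices = rng.choice(image.size, size=25, replace=False)
mask = np.zeros(image.size, dtype=bool)
mask[flat_indices] = True
mask = mask.reshape(image.shape)

# -1 marks a covered ("blue") pixel; everything else is revealed.
revealed = np.where(mask, image, -1)

print(np.count_nonzero(revealed >= 0))  # 25 pixels visible, 231 hidden
```

With only 25 scattered values out of 256, almost nothing about the character survives, which is exactly the point of the slide.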
So if a neural network sees this first image, do you think it will be able to identify the character in this image? Or in the second image? The same goes for the third and fourth images. The point that I am trying to make here is that we are not considering the effect of neighboring pixels. If we randomly pick pixels, we do not understand what the image behind them is. Only if we consider the order of the pixels are we able to identify what the object in that image is. So identifying the object by looking at such images is a very difficult task, because this is not how the human brain works. We do not look at individual points or pixels; we recognize patterns in groups of points or pixels. In fact, cells in our visual cortex respond to different patterns. Some respond to horizontal lines, some respond to vertical lines, and some respond to other, more complex patterns. The output of the lower-level neurons is then processed by higher-level neurons to identify objects in the visual field. Convolutional neural networks are inspired by this concept. In CNNs, instead of looking at each individual pixel, we look at a group of pixels.
If we look at a group of pixels, we are more likely to pick up different features of the objects in the image, and once we know the features, it is more likely that we can predict the object in the image. So in this slide you can see that by using a window on the image, that is, by looking at a group of pixels, we can identify certain features. You can see Mario's red hat in the first image. In this image you can see Mario's red-colored shirt and probably a shoulder. In the next image you find some very important features of the character: eyes, nose, and a moustache. The last window tells us that the character is wearing blue-colored pants on the legs. With these features identified, it is easier for our network to identify the object in our image. So if you compare this with the previous slide, in which we randomly showed you 25 pixels out of 256, you can easily see that in this second slide you can identify features, because the pixels in each group sit together, as compared to the previous slide, where the pixels were randomly picked.

So this is the main idea here: instead of looking at each individual pixel, we will look at a group of pixels. Now let's see how this is implemented. So here is an image at the bottom; this is the input image to our network. Now on top of it we will have a convolutional layer. This is the most important concept in CNNs.
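The idea of sliding a window over the image to look at groups of pixels can be sketched as follows. This is a minimal illustration, not an optimized implementation; the window size of 5 matches the slide, and the function name is our own.

```python
import numpy as np

def extract_patches(image, size):
    """Slide a size x size window over the image and collect each group of pixels."""
    h, w = image.shape
    patches = []
    for top in range(h - size + 1):
        for left in range(w - size + 1):
            patches.append(image[top:top + size, left:left + size])
    return patches

# A 16x16 image, as in the Mario example.
image = np.arange(16 * 16).reshape(16, 16)
patches = extract_patches(image, 5)

print(len(patches))      # (16 - 5 + 1) ** 2 = 144 window positions
print(patches[0].shape)  # each window is a 5x5 group of pixels
```

Each patch keeps its 25 pixels together in their original spatial order, which is what lets later layers pick up features like a hat, an eye, or a moustache.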
A convolutional layer comprises neurons which take in information from a group of pixels in the previous layer. So in this first layer, this neuron gets the information stored in the pixels within this rectangular box only, and this other neuron gets information from the pixels of this other rectangle only. Similarly, in the second convolutional layer, the information of all the neurons in this small rectangle on the first layer is taken as input by this neuron. This architecture allows the network to concentrate on lower-level features in the first layer, and then assemble these features into larger, higher-level features in the next hidden layer, and so on.

Now let us focus more on this window, also known as the receptive field of these neurons. So this window is the receptive field of that particular neuron. The window has two dimensions: height and width. In this image you can see that the height of the window we have taken is five pixels, and the width is also five pixels. So we say that this is a five cross five window. We can also have a three cross three window, a two cross three window, or any such dimension. The most commonly used dimensions are three cross three and five cross five. So for this particular window, the information stored in all the pixels of this window will go into one particular neuron in the upper convolutional layer.
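The mapping just described, where each neuron in the convolutional layer receives only the pixels in its five cross five receptive field, can be sketched directly in NumPy. This is a plain-loop sketch of the computation, with made-up random weights; real frameworks use optimized convolution routines rather than Python loops.

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.random((16, 16))   # input layer: one value per pixel
kernel = rng.random((5, 5))    # the layer's weights, shared across positions

# Each neuron in the convolutional layer sees only its 5x5 receptive field:
# its activation is the weighted sum of those 25 pixels.
out_h, out_w = 16 - 5 + 1, 16 - 5 + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        receptive_field = image[i:i + 5, j:j + 5]
        feature_map[i, j] = np.sum(receptive_field * kernel)

print(feature_map.shape)  # (12, 12): one neuron per window position
```

Note that the neuron at position (i, j) depends on nothing outside its own window, which is the point being made about the receptive field.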