1
00:00:00,900 --> 00:00:05,280
Now, we are going to start with convolutional neural networks.

2
00:00:06,120 --> 00:00:11,670
And in this section, we are going to understand some of the building blocks of CNNs.

3
00:00:13,610 --> 00:00:17,360
First of all, let us understand the motivation behind CNNs.

4
00:00:18,290 --> 00:00:23,860
What is it that CNNs do better than a normal artificial neural network?

5
00:00:25,290 --> 00:00:30,890
CNNs are mostly used for problems like image recognition or speech recognition.

6
00:00:32,210 --> 00:00:39,860
The reason for this is that CNNs outperform normal artificial neural networks in these types of problems.

7
00:00:41,510 --> 00:00:47,790
In fact, the accuracy of some CNN models at image recognition is even better than that of humans.

8
00:00:50,730 --> 00:00:55,450
Let us try to see the limitation in a normal artificial neural network.

9
00:00:57,350 --> 00:01:05,360
As we understand, an artificial neural network gets the data of each pixel as input into the first

10
00:01:05,360 --> 00:01:06,170
layer of neurons.

11
00:01:06,830 --> 00:01:16,580
So if I have a 16 by 16 pixel image and I want to find what is in that image, I will feed the information

12
00:01:16,580 --> 00:01:20,300
of all these 256 pixels into the first layer.

13
00:01:21,740 --> 00:01:23,000
But here's the problem.

14
00:01:23,960 --> 00:01:30,560
If you see the pixels randomly with no order, can you identify what this image is?

15
00:01:32,450 --> 00:01:38,960
For example, in this slide, I have an image of our favorite video game character, Mario.

16
00:01:40,460 --> 00:01:42,530
It's a 16 by 16 pixel image.

17
00:01:45,630 --> 00:01:49,160
To demonstrate how a neural network sees this image,

18
00:01:50,300 --> 00:01:58,370
I have covered all of the pixels with blue squares and have revealed 25 random pixels in each of the

19
00:01:58,490 --> 00:01:59,360
four images.

20
00:02:01,160 --> 00:02:03,560
This is similar to what a neural network sees.

21
00:02:04,820 --> 00:02:12,380
So if a neural network sees this first image, do you think it'll be able to identify the character

22
00:02:12,410 --> 00:02:15,620
in this image or the second image?

23
00:02:16,910 --> 00:02:18,590
The same goes for the third and fourth images.

24
00:02:19,670 --> 00:02:25,820
The point that I'm trying to make here is we are not considering the effect of neighboring pixels.

25
00:02:27,160 --> 00:02:32,800
If we randomly pick pixels, we cannot tell what the image behind them is.

26
00:02:33,730 --> 00:02:41,290
Only if we consider the order of the pixels are we able to identify the object in that image.

27
00:02:42,610 --> 00:02:50,170
So identifying the object by looking at such images is a very difficult task because this is not how

28
00:02:50,170 --> 00:02:51,400
the human brain works.

29
00:02:53,210 --> 00:02:56,090
We do not look at individual points or pixels.

30
00:02:56,720 --> 00:03:00,260
We recognize patterns in groups of points or pixels.

31
00:03:01,490 --> 00:03:06,080
In fact, neurons in the visual cortex respond to different patterns.

32
00:03:07,580 --> 00:03:09,140
Some respond to horizontal lines.

33
00:03:09,260 --> 00:03:10,790
Some respond to vertical lines.

34
00:03:11,000 --> 00:03:13,100
Some respond to other complex patterns.

35
00:03:13,880 --> 00:03:20,300
The output of the lower level neurons is then processed by higher level neurons to identify objects

36
00:03:20,510 --> 00:03:21,470
in that visual field.

37
00:03:23,440 --> 00:03:27,100
Convolutional neural networks are inspired by this concept.

38
00:03:28,800 --> 00:03:35,160
In CNNs, instead of looking at each individual pixel, we look at a group of pixels.

39
00:03:38,220 --> 00:03:44,460
If we look at a group of pixels, we are more likely to pick up different features of the objects in

40
00:03:44,460 --> 00:03:44,940
the image.

41
00:03:45,900 --> 00:03:51,420
And once we know the features, it is more likely that we can predict the object in the image.

42
00:03:53,130 --> 00:04:00,480
So in this slide, you can see that by using a window on the image or by looking at a group of pixels,

43
00:04:01,740 --> 00:04:03,630
we can identify certain features.

44
00:04:05,880 --> 00:04:09,990
So you can see Mario's ear here in the first image.

45
00:04:11,580 --> 00:04:16,800
In this image, you can see Mario's red colored shirt and probably a shoulder.

46
00:04:18,240 --> 00:04:22,830
In the next image, you find some very important features of the character:

47
00:04:23,070 --> 00:04:25,380
Eyes, nose and a mustache.

48
00:04:27,980 --> 00:04:36,680
The last window tells us that the character is wearing blue colored pants on the legs. With these features

49
00:04:36,710 --> 00:04:37,460
identified,

50
00:04:38,060 --> 00:04:42,590
it is easier for our network to identify the object in our image.

51
00:04:43,910 --> 00:04:52,220
So if you compare it with the previous slide, in which we randomly showed you 25 pixels out of 256 pixels,

52
00:04:53,150 --> 00:04:57,170
you can easily see that in the second slide

53
00:04:57,620 --> 00:05:05,640
you can identify features because the pixels are grouped together, as compared to the previous

54
00:05:05,640 --> 00:05:08,150
slide, where the pixels were randomly picked.

55
00:05:10,310 --> 00:05:12,170
So this is the main idea here.

56
00:05:13,130 --> 00:05:18,440
Instead of looking at each individual pixel, we will look at a group of pixels.

57
00:05:20,270 --> 00:05:22,370
Now let us see how this is implemented.

58
00:05:24,170 --> 00:05:26,270
So here is the image at the bottom.

59
00:05:27,860 --> 00:05:30,590
This is the input image to our network.

60
00:05:32,450 --> 00:05:35,810
Now, on top of it, we will have a convolutional layer.

61
00:05:37,390 --> 00:05:39,760
This is the most important concept in a CNN.

62
00:05:40,780 --> 00:05:42,430
We have a convolutional layer.

63
00:05:43,540 --> 00:05:50,460
A convolutional layer comprises neurons, which take in information from a group of pixels in the previous

64
00:05:50,460 --> 00:05:50,720
layer.

65
00:05:52,670 --> 00:05:53,990
So in this first layer,

66
00:05:56,080 --> 00:06:02,920
this neuron gets information stored in the pixels within this rectangular box

67
00:06:03,130 --> 00:06:10,990
only. This other neuron gets information from the pixels of this rectangle only.

68
00:06:12,250 --> 00:06:18,910
Similarly, in the second convolutional layer, the information of all the neurons in this small rectangle

69
00:06:20,530 --> 00:06:24,700
that is on the first layer is taken as input by this neuron.

70
00:06:27,760 --> 00:06:34,720
This architecture allows the network to concentrate on lower level features in the first layer and then

71
00:06:34,810 --> 00:06:41,000
assemble these features into larger, higher level features in the next hidden layer and so on.

72
00:06:43,050 --> 00:06:49,140
Now, let us focus more on this window, or the receptive field, of these neurons.

73
00:06:51,360 --> 00:06:56,310
So this window is also known as the receptive field of that particular neuron.

74
00:06:58,410 --> 00:07:01,380
This window has two dimensions, height and width.

75
00:07:03,340 --> 00:07:11,380
In this image, you can see that the height of the window we have taken is five pixels and the width is also

76
00:07:11,380 --> 00:07:12,160
five pixels.

77
00:07:13,360 --> 00:07:16,450
So we say that this is a five cross five window.

78
00:07:18,310 --> 00:07:20,560
We can also have a three cross three window,

79
00:07:20,860 --> 00:07:24,090
or a two cross three window, or any such dimension.

80
00:07:25,630 --> 00:07:29,950
The most commonly used dimensions are three cross three or five cross five.

81
00:07:31,570 --> 00:07:39,490
So for this particular window, the information stored in all the pixels of this window will go into one

82
00:07:39,490 --> 00:07:42,220
particular neuron in the upper convolutional layer.