Now we are going to start with convolutional neural networks. In this section we are going to understand some of the building blocks of CNNs. First of all, let us understand the motivation behind CNNs: what is it that CNNs do better than a normal artificial neural network? CNNs are mostly used for problems like image recognition or speech recognition. The reason for this is that CNNs outperform normal artificial neural networks in these types of problems. In fact, the accuracy of some CNN models in image recognition is even better than that of humans.

Let us try to see the limitations of a normal artificial neural network. As we understand, an artificial neural network gets the data of each pixel as input into the first layer of neurons. So if you have a sixteen by sixteen pixel image and I want to find what is in that image, I will feed the information of all these 256 pixels into the first layer. But here's the problem: if you see the pixels randomly, with no order, can you identify what the image is? For example, in the slide I have an image of our favorite video game character, Mario. It's a 16 by 16 pixel image. To demonstrate how our neural network sees this image, I have covered all of the pixels with blue squares and have revealed 25 random pixels in each of the four images. This is similar to what a neural network sees.
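The random-reveal demonstration from the slide can be sketched in a few lines of NumPy. This is only an illustration under assumed values: a random array stands in for the 16 by 16 Mario sprite, and -1 marks a covered "blue" pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 16x16 Mario sprite: any 16x16 array of pixel values.
image = rng.integers(0, 256, size=(16, 16))

# Reveal only 25 of the 256 pixels, chosen at random; cover the rest.
flat_indices = rng.choice(image.size, size=25, replace=False)
mask = np.zeros(image.size, dtype=bool)
mask[flat_indices] = True
mask = mask.reshape(image.shape)

# -1 marks a covered ("blue") pixel; everything else is revealed.
revealed = np.where(mask, image, -1)

print(np.count_nonzero(revealed >= 0))  # 25 pixels visible, 231 hidden
```

With only 25 scattered values out of 256, almost nothing about the character survives, which is exactly the point of the slide.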
So if a neural network sees this first image, do you think it will be able to identify the character in this image? Or in the second image? The same goes for the third and fourth images. The point that I am trying to make here is that we are not considering the effect of neighboring pixels. If we randomly pick pixels, we do not understand what the image behind them is. Only if we consider the order of the pixels are we able to identify what the object in that image is. So identifying the object by looking at such images is a very difficult task, because this is not how the human brain works. We do not look at individual points or pixels; we recognize patterns in groups of points or pixels. In fact, cells in our visual cortex respond to different patterns. Some respond to horizontal lines, some respond to vertical lines, and some respond to other, more complex patterns. The output of the lower-level neurons is then processed by higher-level neurons to identify objects in the visual field. Convolutional neural networks are inspired by this concept. In CNNs, instead of looking at each individual pixel, we look at a group of pixels.
If we look at a group of pixels, we are more likely to pick up different features of the objects in the image, and once we know the features, it is more likely that we can predict the object in the image. So in this slide you can see that by using a window on the image, that is, by looking at a group of pixels, we can identify certain features. You can see Mario's red hat in the first image. In this image you can see Mario's red-colored shirt and probably a shoulder. In the next image you find some very important features of the character: eyes, nose, and a moustache. The last window tells us that the character is wearing blue-colored pants on the legs. With these features identified, it is easier for our network to identify the object in our image. So if you compare this with the previous slide, in which we randomly showed you 25 pixels out of 256, you can easily see that in this second slide you can identify features, because the pixels in each group sit together, as compared to the previous slide, where the pixels were randomly picked.

So this is the main idea here: instead of looking at each individual pixel, we will look at a group of pixels. Now let's see how this is implemented. So here is an image at the bottom; this is the input image to our network. Now on top of it we will have a convolutional layer. This is the most important concept in CNNs.
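The idea of sliding a window over the image to look at groups of pixels can be sketched as follows. This is a minimal illustration, not an optimized implementation; the window size of 5 matches the slide, and the function name is our own.

```python
import numpy as np

def extract_patches(image, size):
    """Slide a size x size window over the image and collect each group of pixels."""
    h, w = image.shape
    patches = []
    for top in range(h - size + 1):
        for left in range(w - size + 1):
            patches.append(image[top:top + size, left:left + size])
    return patches

# A 16x16 image, as in the Mario example.
image = np.arange(16 * 16).reshape(16, 16)
patches = extract_patches(image, 5)

print(len(patches))      # (16 - 5 + 1) ** 2 = 144 window positions
print(patches[0].shape)  # each window is a 5x5 group of pixels
```

Each patch keeps its 25 pixels together in their original spatial order, which is what lets later layers pick up features like a hat, an eye, or a moustache.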
A convolutional layer comprises neurons which take in information from a group of pixels in the previous layer. So in this first layer, this neuron gets the information stored in the pixels within this rectangular box only, and this other neuron gets information from the pixels of this other rectangle only. Similarly, in the second convolutional layer, the information of all the neurons in this small rectangle on the first layer is taken as input by this neuron. This architecture allows the network to concentrate on lower-level features in the first layer, and then assemble these features into larger, higher-level features in the next hidden layer, and so on.

Now let us focus more on this window, also known as the receptive field of these neurons. So this window is the receptive field of that particular neuron. The window has two dimensions: height and width. In this image you can see that the height of the window we have taken is five pixels, and the width is also five pixels. So we say that this is a five cross five window. We can also have a three cross three window, a two cross three window, or any such dimension. The most commonly used dimensions are three cross three and five cross five. So for this particular window, the information stored in all the pixels of this window will go into one particular neuron in the upper convolutional layer.
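The mapping just described, where each neuron in the convolutional layer receives only the pixels in its five cross five receptive field, can be sketched directly in NumPy. This is a plain-loop sketch of the computation, with made-up random weights; real frameworks use optimized convolution routines rather than Python loops.

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.random((16, 16))   # input layer: one value per pixel
kernel = rng.random((5, 5))    # the layer's weights, shared across positions

# Each neuron in the convolutional layer sees only its 5x5 receptive field:
# its activation is the weighted sum of those 25 pixels.
out_h, out_w = 16 - 5 + 1, 16 - 5 + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        receptive_field = image[i:i + 5, j:j + 5]
        feature_map[i, j] = np.sum(receptive_field * kernel)

print(feature_map.shape)  # (12, 12): one neuron per window position
```

Note that the neuron at position (i, j) depends on nothing outside its own window, which is the point being made about the receptive field.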