In this lecture, we will discuss the hyperparameters of a neural network architecture and how to choose them.

You have seen that there are many hyperparameters in neural networks, and these hyperparameters give us the flexibility to create several types of architectures.

But the flexibility of neural networks is also one of their main drawbacks: we have to decide on many hyperparameters for our model.

Not only can we use any imaginable network architecture, but even in a simple multilayer perceptron you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more.

Although a lot of exciting research is still going on in the field of hyperparameter tuning for neural networks, it helps to have an idea of what values are reasonable for each hyperparameter, so that you can build a quick prototype and restrict the search space.

Here are a few guidelines for choosing the number of layers and neurons in a multilayer perceptron.

Let's first discuss the number of hidden layers. For most problems, you can begin with a single hidden layer and get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions, provided it has enough neurons.
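To make these knobs concrete, here is a minimal NumPy sketch of a multilayer perceptron in which each of the hyperparameters just listed shows up explicitly. The layer sizes, the ReLU activation, and the Gaussian initialization are illustrative assumptions, not recommendations, and this is not the framework code used elsewhere in the course:

```python
import numpy as np

def build_mlp(layer_sizes, init_scale=0.01, seed=42):
    """One (weights, biases) pair per layer.

    layer_sizes and init_scale are hyperparameters; the values
    used below are arbitrary examples.
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, init_scale, size=(n_in, n_out))  # initialization logic
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def relu(x):
    # The activation function is itself a hyperparameter choice.
    return np.maximum(0.0, x)

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # no hidden activation on the output layer here
            x = relu(x)
    return x

# Two hidden layers of 100 neurons each: [inputs, hidden, hidden, outputs].
params = build_mlp([784, 100, 100, 10])
out = forward(params, np.zeros((1, 784)))
print(out.shape)  # (1, 10)
```

Changing the depth, the widths, the activation, or the initializer means changing this list of sizes and these two small functions, which is exactly the search space the rest of the lecture tries to restrict.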
For a long time, this fact convinced researchers that there was no need to investigate deeper neural networks. But it was later found that deep networks have much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow networks, allowing them to reach much better performance with the same amount of training data.

To understand why this happens, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy and paste. You would have to draw each tree individually, branch by branch, leaf by leaf. If you could instead draw one leaf, copy and paste it to create a branch, then copy and paste the branches to create a tree, and finally copy and paste this tree to make a forest, you would be finished in no time.

Real-world data is often structured in such a hierarchical way, and deep neural networks automatically take advantage of this fact: the lower hidden layers model low-level structures, intermediate hidden layers combine these low-level structures to model intermediate-level structures, and the highest hidden layers together with the output layer combine these intermediate structures to model high-level structures.

Not only does this hierarchical structure help deep neural networks converge faster, it also improves their ability to generalize to new data.
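To put rough numbers on the parameter-efficiency point, we can count the weights and biases in a shallow-but-wide network versus a deeper-but-narrower one. The layer sizes are arbitrary assumptions, and a raw parameter count does not by itself prove the two networks are equally expressive; it only shows how quickly width inflates the budget:

```python
def mlp_param_count(layer_sizes):
    """Total number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# One very wide hidden layer vs. three moderate ones (illustrative sizes):
wide = mlp_param_count([784, 1000, 10])            # shallow but wide
deep = mlp_param_count([784, 150, 150, 150, 10])   # deeper but narrower
print(wide, deep)  # 795010 164560
```

The deeper network here has roughly a fifth as many parameters, even though it has three hidden layers instead of one.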
Say, for example, you have already trained a model to recognize faces in pictures, and you now want to train a new neural network to recognize hairstyles. You can kick-start training by reusing the lower layers of the first network: instead of randomly initializing the weights and biases of the first few layers of the new network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way, the new network will not have to learn everything from scratch; it will only have to learn the higher-level structures. This is called transfer learning.

So, in summary: for most problems, you can start with just one or two hidden layers and it will work just fine. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training data.

Next, we discuss the number of neurons per hidden layer. Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the Fashion MNIST dataset that we used requires 28 by 28, that is, 784 input neurons, and ten output neurons.

As for the hidden layers, it was long a common practice to size them to form a pyramid, with the first hidden layer having the most neurons and each subsequent layer having fewer.
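The transfer-learning idea described a moment ago can be sketched in NumPy. The layer shapes and the stand-in "trained" weights below are assumptions for illustration; in practice the first network's weights would come from actual training:

```python
import numpy as np

def random_params(layer_sizes, seed):
    """Randomly initialized (weights, biases) pairs for each layer."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# A "face" network trained earlier (random stand-ins for trained weights).
face_net = random_params([784, 300, 100, 10], seed=0)

# A new "hairstyle" network: same lower architecture, different output size.
hair_net = random_params([784, 300, 100, 5], seed=1)

# Transfer learning: copy the lower layers' weights and biases over the
# random initialization; only the top layer still starts from scratch.
for i in range(len(face_net) - 1):
    W, b = face_net[i]
    hair_net[i] = (W.copy(), b.copy())
```

After the copy, only the final layer of the new network has to be learned from scratch; the lower layers start from features the first network already discovered.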
As a concrete example of that pyramid-style sizing: with three hidden layers on the Fashion MNIST dataset, you could have 300 neurons in the first hidden layer, 200 in the second, and 100 in the third, the rationale being that many low-level features can coalesce into far fewer high-level features.

However, this practice has largely been abandoned, as it seems that simply using the same number of neurons in all hidden layers performs just as well in most cases, or maybe even better. It also has the advantage of leaving only one hyperparameter to tune instead of one per layer. So instead of having 300, 200, and 100 neurons in the three hidden layers, you can have 150 neurons in all three of them.

If you think the problem at hand is really complex, you can try increasing the number of neurons gradually until the network starts overfitting. In general, increasing the depth of the network gives a better return on accuracy than increasing the number of neurons per layer. Another approach is to build a model with a large number of layers and a large number of neurons per hidden layer, and then use early stopping to prevent the model from overfitting.

The next hyperparameter we are going to discuss is the learning rate. It is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate.
By the maximum learning rate, we mean the rate above which the training algorithm diverges.

So a simple approach for tuning the learning rate is to start with a large value that makes the algorithm diverge, then divide this value by three and try again, and repeat until the training algorithm stops diverging. At that point, you generally won't be too far from the optimal learning rate.

Then there is the batch size. As a general rule of thumb, try to keep the batch size no larger than 32, because a small batch size ensures that each training iteration is very fast. On the lower end, try to keep the batch size above 20, because this helps take advantage of hardware and software optimizations, in particular for matrix multiplications, which also speeds up training. So a good range is between 20 and 32.

Lastly, there is another hyperparameter: the number of epochs, that is, the number of passes the training algorithm makes over the training data. Instead of tuning it, we suggest that you use a large number of epochs together with the early stopping technique to prevent overfitting.

So that's all about selecting hyperparameters.
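To close, the divide-by-three heuristic for the learning rate can be sketched on a toy problem. The quadratic loss f(w) = w² and the starting rate of 10 are assumptions for illustration. For this particular loss, gradient descent diverges for any rate above 1.0, and the single-step-optimal rate is 0.5, exactly half the maximum, in line with the rule of thumb above:

```python
def diverges(lr, steps=100):
    """Run gradient descent on the toy loss f(w) = w**2 (gradient 2*w)
    and report whether the iterates blow up at this learning rate."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * w          # update: w <- w * (1 - 2*lr)
        if abs(w) > 1e6:
            return True
    return False

# Start with a deliberately large rate and divide by 3 until training
# stops diverging, as suggested in the lecture.
lr = 10.0
while diverges(lr):
    lr /= 3.0
print(lr)  # 10/27 ~= 0.370
```

The search stops at 10/27, comfortably below the divergence threshold of 1.0 and in the same ballpark as the optimum of 0.5, which is what the heuristic promises: not the best rate, but not far from it.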