In this lecture, we will discuss the hyperparameters of our neural network architecture. In the previous lectures you have seen that there are many hyperparameters in neural networks, and these hyperparameters give us the flexibility to create a wide variety of architectures. But the flexibility of neural networks is also one of their main drawbacks: we have to decide on so many hyperparameters in our model. Not only can we use any imaginable network architecture, but even in a simple multilayer perceptron you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and many more.

Although a lot of exciting research is still going on in the field of hyperparameter tuning for neural networks, it helps to have an idea of what values are reasonable for each hyperparameter, so you can build a quick prototype and restrict the search space. Here are a few guidelines for choosing the number of layers and neurons in a multilayer perceptron.

Let's first discuss the number of hidden layers. For most problems, you can just begin with a single hidden layer and you will get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions, provided it has enough neurons. For a long time, this fact convinced researchers that there was no need to investigate deeper neural networks, but it was later found that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow networks, allowing them to reach much better performance with the same amount of training data.

To understand why this happens, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy-paste. You would have to draw each tree individually, branch by branch, leaf by leaf. If you could instead draw one leaf, copy-paste it to draw a branch, then copy-paste the branch to create a tree, and finally copy-paste this tree to make a forest, you would be finished in no time.

Real-world data is often structured in such a hierarchical way, and deep neural networks automatically take advantage of this fact: lower hidden layers model low-level structures, intermediate hidden layers combine these low-level structures to model intermediate-level structures, and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures.
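To make the hyperparameters mentioned at the start of this lecture concrete, here is a minimal Keras sketch (my own illustration, not code from the lecture) of a small MLP for 28 x 28 grayscale images; the two hidden layers, the 300 and 100 neurons, the ReLU activation, and the He weight initialization are all illustrative choices, and each one is a hyperparameter you could change.

```python
import tensorflow as tf

# A minimal sketch (not the lecture's own code): a small MLP in Keras where
# each of the hyperparameters just mentioned is an explicit choice.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),          # input size is fixed by the data
    tf.keras.layers.Flatten(),
    # The number of hidden layers and the neurons per layer are hyperparameters:
    tf.keras.layers.Dense(300, activation="relu",           # activation function
                          kernel_initializer="he_normal"),  # weight initialization logic
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="softmax"),        # output size is fixed by the task
])
model.summary()
```

Every value here could be swapped for something else, which is exactly the flexibility, and the burden, described above.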
Not only does this hierarchical structure help deep neural networks converge faster, it also improves their ability to generalize to new data. For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kick-start training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way the network will not have to learn everything from scratch; it will only have to learn the higher-level structures. This is called transfer learning.

So, in summary, for most problems you can start with just one or two hidden layers and it will work just fine. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training data.

Next, we discuss the number of neurons per hidden layer. Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the Fashion MNIST dataset that we used requires 28 x 28, that is, 784 input neurons, and 10 output neurons.

As for the hidden layers, it was earlier a common practice to size them to form a pyramid, that is, the first layer had the most neurons. For example, for the Fashion MNIST dataset with three hidden layers, you could have 300 neurons in the first hidden layer, 200 in the second, and 100 in the third, the rationale being that many low-level features can coalesce into far fewer high-level features. However, this practice has been largely abandoned, as it seems that simply using the same number of neurons in all hidden layers performs just as well in most cases, or maybe even better. It also has the advantage of leaving only one hyperparameter to tune instead of one per layer as before. So instead of having 300, 200, and 100 neurons in the three hidden layers, you can have 150 neurons in all three of them. If you think that the problem at hand is really complex, you can try increasing the number of neurons gradually until the network starts overfitting. In general, increasing the depth of the network improves accuracy more than increasing the number of neurons.
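As a brief aside, the transfer learning idea described above can be sketched in a few lines of Keras. This is only an illustration under assumed names: face_model stands in for a previously trained face-recognition MLP, and the 5 hairstyle classes are hypothetical.

```python
import tensorflow as tf

# Hypothetical stand-in for a previously trained face-recognition MLP; in the
# scenario described above this would be a network trained earlier, e.g. loaded
# with tf.keras.models.load_model("face_model.keras").
face_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Reuse the lower layers (everything except the old output layer) so the new
# network starts from learned low-level structures instead of random weights.
hair_model = tf.keras.Sequential(face_model.layers[:-1])
hair_model.add(tf.keras.layers.Dense(5, activation="softmax"))  # e.g. 5 hairstyle classes

# Optionally freeze the reused layers at first, so only the new output layer
# has to learn the higher-level structures.
for layer in hair_model.layers[:-1]:
    layer.trainable = False

hair_model.compile(loss="sparse_categorical_crossentropy",
                   optimizer="sgd", metrics=["accuracy"])
```

Note that the reused layers still share their weights with face_model, so if you also want to keep the original network intact you would clone it and copy its weights before reusing the layers.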
However, another approach could be to pick a model with a large number of layers and a large number of neurons per hidden layer, and then use early stopping to prevent that model from overfitting.

The next hyperparameter we are going to discuss is the learning rate, which is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate, that is, the learning rate above which the training algorithm diverges. So a simple approach for tuning the learning rate is to start with a large value that makes the algorithm diverge, then divide this value by 3 and try again, and repeat this until the training algorithm stops diverging. At that point, you generally won't be too far from the optimal learning rate.

Then there is the batch size. As a general rule of thumb, try to keep the batch size lower than 32, because a small batch size ensures that each training iteration is very fast. On the lower end, try to keep the batch size above 20, because this helps take advantage of hardware and software optimizations, in particular for matrix multiplications, which also helps speed up training. So a good range is between 20 and 32.

Lastly, there is another hyperparameter called the number of epochs, that is, the number of passes over the training data. Instead of tuning it, we would suggest that you use a large number of epochs and rely on the early stopping technique to prevent overfitting. That's all about selecting hyperparameters.
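As a closing illustration (my own sketch, not code from the lecture), here is one way the last few ideas might be put together in Keras: the divide-by-3 learning rate search, a batch size of 32, a deliberately large number of epochs, and early stopping. The constant-width 150-neuron architecture comes from the earlier discussion; the patience value of 10 and the use of a finite loss as a crude "did not diverge" check are my own assumptions.

```python
import numpy as np
import tensorflow as tf

# Fashion MNIST, scaled to [0, 1] and split into training and validation sets.
(X_full, y_full), _ = tf.keras.datasets.fashion_mnist.load_data()
X_full = X_full / 255.0
X_valid, X_train = X_full[:5000], X_full[5000:]
y_valid, y_train = y_full[:5000], y_full[5000:]

def build_model(learning_rate):
    # A constant-width MLP, as suggested earlier: 150 neurons in each hidden layer.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

# Divide-by-3 search: start with a large learning rate and shrink it until a
# short training run no longer diverges (a finite final loss is the crude proxy here).
lr = 1.0
while True:
    probe = build_model(lr)
    history = probe.fit(X_train, y_train, epochs=1, batch_size=32, verbose=0)
    if np.isfinite(history.history["loss"][-1]):
        break          # training no longer diverges at this learning rate
    lr /= 3.0

# Train the final model with a large number of epochs plus early stopping,
# so the number of epochs itself never needs tuning.
model = build_model(lr)
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(X_train, y_train,
          epochs=1000,                        # deliberately large
          batch_size=32,                      # small batch size, as discussed
          validation_data=(X_valid, y_valid),
          callbacks=[early_stopping])
```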