In this lecture, we will discuss the hyperparameters of our neural network architecture. In the previous lectures you have seen that there are many hyperparameters in neural networks, and these hyperparameters give us the flexibility to create a wide variety of architectures. But the flexibility of neural networks is also one of their main drawbacks: we have to decide on so many hyperparameters in our model. Not only can we use any imaginable network architecture, but even in a simple multilayer perceptron you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and many more.

Although a lot of exciting research is still going on in the field of hyperparameter tuning for neural networks, it helps to have an idea of what values are reasonable for each hyperparameter, so you can build a quick prototype and restrict the search space. Here are a few guidelines for choosing the number of layers and neurons in a multilayer perceptron.

Let's first discuss the number of hidden layers. For most problems, you can just begin with a single hidden layer and you will get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions, provided it has enough neurons. For a long time, this fact convinced researchers that there was no need to investigate deeper neural networks, but it was later found that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow networks, allowing them to reach much better performance with the same amount of training data.

To understand why this happens, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy-paste. You would have to draw each tree individually, branch by branch, leaf by leaf. If you could instead draw one leaf, copy-paste it to draw a branch, then copy-paste the branch to create a tree, and finally copy-paste this tree to make a forest, you would be finished in no time.

Real-world data is often structured in such a hierarchical way, and deep neural networks automatically take advantage of this fact: lower hidden layers model low-level structures, intermediate hidden layers combine these low-level structures to model intermediate-level structures, and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures.
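To make the hyperparameters mentioned at the start of this lecture concrete, here is a minimal Keras sketch (my own illustration, not code from the lecture) of a small MLP for 28 x 28 grayscale images; the two hidden layers, the 300 and 100 neurons, the ReLU activation, and the He weight initialization are all illustrative choices, and each one is a hyperparameter you could change.

```python
import tensorflow as tf

# A minimal sketch (not the lecture's own code): a small MLP in Keras where
# each of the hyperparameters just mentioned is an explicit choice.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),          # input size is fixed by the data
    tf.keras.layers.Flatten(),
    # The number of hidden layers and the neurons per layer are hyperparameters:
    tf.keras.layers.Dense(300, activation="relu",           # activation function
                          kernel_initializer="he_normal"),  # weight initialization logic
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="softmax"),        # output size is fixed by the task
])
model.summary()
```

Every value here could be swapped for something else, which is exactly the flexibility, and the burden, described above.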
Not only does this hierarchical structure help deep neural networks converge faster, it also improves their ability to generalize to new data. For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kick-start training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way the network will not have to learn everything from scratch; it will only have to learn the higher-level structures. This is called transfer learning.

So, in summary, for most problems you can start with just one or two hidden layers and it will work just fine. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training data.

Next, we discuss the number of neurons per hidden layer. Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the Fashion MNIST dataset that we used requires 28 x 28, that is, 784 input neurons, and 10 output neurons.

As for the hidden layers, it was earlier a common practice to size them to form a pyramid, that is, the first layer had the most neurons. For example, for the Fashion MNIST dataset with three hidden layers, you could have 300 neurons in the first hidden layer, 200 in the second, and 100 in the third, the rationale being that many low-level features can coalesce into far fewer high-level features. However, this practice has been largely abandoned, as it seems that simply using the same number of neurons in all hidden layers performs just as well in most cases, or maybe even better. It also has the advantage of leaving only one hyperparameter to tune instead of one per layer as before. So instead of having 300, 200, and 100 neurons in the three hidden layers, you can have 150 neurons in all three of them. If you think that the problem at hand is really complex, you can try increasing the number of neurons gradually until the network starts overfitting. In general, increasing the depth of the network improves accuracy more than increasing the number of neurons.
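As a brief aside, the transfer learning idea described above can be sketched in a few lines of Keras. This is only an illustration under assumed names: face_model stands in for a previously trained face-recognition MLP, and the 5 hairstyle classes are hypothetical.

```python
import tensorflow as tf

# Hypothetical stand-in for a previously trained face-recognition MLP; in the
# scenario described above this would be a network trained earlier, e.g. loaded
# with tf.keras.models.load_model("face_model.keras").
face_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Reuse the lower layers (everything except the old output layer) so the new
# network starts from learned low-level structures instead of random weights.
hair_model = tf.keras.Sequential(face_model.layers[:-1])
hair_model.add(tf.keras.layers.Dense(5, activation="softmax"))  # e.g. 5 hairstyle classes

# Optionally freeze the reused layers at first, so only the new output layer
# has to learn the higher-level structures.
for layer in hair_model.layers[:-1]:
    layer.trainable = False

hair_model.compile(loss="sparse_categorical_crossentropy",
                   optimizer="sgd", metrics=["accuracy"])
```

Note that the reused layers still share their weights with face_model, so if you also want to keep the original network intact you would clone it and copy its weights before reusing the layers.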
However, another approach could be to pick a model with a large number of layers and a large number of neurons per hidden layer, and then use early stopping to prevent that model from overfitting.

The next hyperparameter we are going to discuss is the learning rate, which is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate, that is, the learning rate above which the training algorithm diverges. So a simple approach for tuning the learning rate is to start with a large value that makes the algorithm diverge, then divide this value by 3 and try again, and repeat this until the training algorithm stops diverging. At that point, you generally won't be too far from the optimal learning rate.

Then there is the batch size. As a general rule of thumb, try to keep the batch size lower than 32, because a small batch size ensures that each training iteration is very fast. On the lower end, try to keep the batch size above 20, because this helps take advantage of hardware and software optimizations, in particular for matrix multiplications, which also helps speed up training. So a good range is between 20 and 32.

Lastly, there is another hyperparameter called the number of epochs, that is, the number of passes over the training data. Instead of tuning it, we would suggest that you use a large number of epochs and rely on the early stopping technique to prevent overfitting. That's all about selecting hyperparameters.
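As a closing illustration (my own sketch, not code from the lecture), here is one way the last few ideas might be put together in Keras: the divide-by-3 learning rate search, a batch size of 32, a deliberately large number of epochs, and early stopping. The constant-width 150-neuron architecture comes from the earlier discussion; the patience value of 10 and the use of a finite loss as a crude "did not diverge" check are my own assumptions.

```python
import numpy as np
import tensorflow as tf

# Fashion MNIST, scaled to [0, 1] and split into training and validation sets.
(X_full, y_full), _ = tf.keras.datasets.fashion_mnist.load_data()
X_full = X_full / 255.0
X_valid, X_train = X_full[:5000], X_full[5000:]
y_valid, y_train = y_full[:5000], y_full[5000:]

def build_model(learning_rate):
    # A constant-width MLP, as suggested earlier: 150 neurons in each hidden layer.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

# Divide-by-3 search: start with a large learning rate and shrink it until a
# short training run no longer diverges (a finite final loss is the crude proxy here).
lr = 1.0
while True:
    probe = build_model(lr)
    history = probe.fit(X_train, y_train, epochs=1, batch_size=32, verbose=0)
    if np.isfinite(history.history["loss"][-1]):
        break          # training no longer diverges at this learning rate
    lr /= 3.0

# Train the final model with a large number of epochs plus early stopping,
# so the number of epochs itself never needs tuning.
model = build_model(lr)
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(X_train, y_train,
          epochs=1000,                        # deliberately large
          batch_size=32,                      # small batch size, as discussed
          validation_data=(X_valid, y_valid),
          callbacks=[early_stopping])
```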