In this lecture, we will discuss the hyperparameters of a neural network architecture and how to choose them.

You have seen that there are many hyperparameters in neural networks, and these hyperparameters give us the flexibility to create several types of architectures.

But the flexibility of neural networks is also one of their main drawbacks: we have to decide on many hyperparameters for our model.

Not only can we use any imaginable network architecture, but even in a simple multilayer perceptron you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more.

Although a lot of exciting research is still going on in the field of hyperparameter tuning for neural networks, it helps to have an idea of what values are reasonable for each hyperparameter, so that you can build a quick prototype and restrict the search space.

Here are a few guidelines for choosing the number of layers and neurons in a multilayer perceptron.

Let's first discuss the number of hidden layers. For most problems, you can begin with a single hidden layer and get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions, provided it has enough neurons.
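To make these knobs concrete, here is a minimal NumPy sketch of a multilayer perceptron in which each of the hyperparameters just listed shows up explicitly. The layer sizes, the ReLU activation, and the Gaussian initialization are illustrative assumptions, not recommendations, and this is not the framework code used elsewhere in the course:

```python
import numpy as np

def build_mlp(layer_sizes, init_scale=0.01, seed=42):
    """One (weights, biases) pair per layer.

    layer_sizes and init_scale are hyperparameters; the values
    used below are arbitrary examples.
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, init_scale, size=(n_in, n_out))  # initialization logic
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def relu(x):
    # The activation function is itself a hyperparameter choice.
    return np.maximum(0.0, x)

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # no hidden activation on the output layer here
            x = relu(x)
    return x

# Two hidden layers of 100 neurons each: [inputs, hidden, hidden, outputs].
params = build_mlp([784, 100, 100, 10])
out = forward(params, np.zeros((1, 784)))
print(out.shape)  # (1, 10)
```

Changing the depth, the widths, the activation, or the initializer means changing this list of sizes and these two small functions, which is exactly the search space the rest of the lecture tries to restrict.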
For a long time, this fact convinced researchers that there was no need to investigate deeper neural networks. But it was later found that deep networks have much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow networks, allowing them to reach much better performance with the same amount of training data.

To understand why this happens, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy and paste. You would have to draw each tree individually, branch by branch, leaf by leaf. If you could instead draw one leaf, copy and paste it to create a branch, then copy and paste the branches to create a tree, and finally copy and paste this tree to make a forest, you would be finished in no time.

Real-world data is often structured in such a hierarchical way, and deep neural networks automatically take advantage of this fact: the lower hidden layers model low-level structures, intermediate hidden layers combine these low-level structures to model intermediate-level structures, and the highest hidden layers together with the output layer combine these intermediate structures to model high-level structures.

Not only does this hierarchical structure help deep neural networks converge faster, it also improves their ability to generalize to new data.
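To put rough numbers on the parameter-efficiency point, we can count the weights and biases in a shallow-but-wide network versus a deeper-but-narrower one. The layer sizes are arbitrary assumptions, and a raw parameter count does not by itself prove the two networks are equally expressive; it only shows how quickly width inflates the budget:

```python
def mlp_param_count(layer_sizes):
    """Total number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# One very wide hidden layer vs. three moderate ones (illustrative sizes):
wide = mlp_param_count([784, 1000, 10])            # shallow but wide
deep = mlp_param_count([784, 150, 150, 150, 10])   # deeper but narrower
print(wide, deep)  # 795010 164560
```

The deeper network here has roughly a fifth as many parameters, even though it has three hidden layers instead of one.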
Say, for example, you have already trained a model to recognize faces in pictures, and you now want to train a new neural network to recognize hairstyles. You can kick-start training by reusing the lower layers of the first network: instead of randomly initializing the weights and biases of the first few layers of the new network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way, the new network will not have to learn everything from scratch; it will only have to learn the higher-level structures. This is called transfer learning.

So, in summary: for most problems, you can start with just one or two hidden layers and it will work just fine. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training data.

Next, we discuss the number of neurons per hidden layer. Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the Fashion MNIST dataset that we used requires 28 by 28, that is, 784 input neurons, and ten output neurons.

As for the hidden layers, it was long a common practice to size them to form a pyramid, with the first hidden layer having the most neurons and each subsequent layer having fewer.
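The transfer-learning idea described a moment ago can be sketched in NumPy. The layer shapes and the stand-in "trained" weights below are assumptions for illustration; in practice the first network's weights would come from actual training:

```python
import numpy as np

def random_params(layer_sizes, seed):
    """Randomly initialized (weights, biases) pairs for each layer."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# A "face" network trained earlier (random stand-ins for trained weights).
face_net = random_params([784, 300, 100, 10], seed=0)

# A new "hairstyle" network: same lower architecture, different output size.
hair_net = random_params([784, 300, 100, 5], seed=1)

# Transfer learning: copy the lower layers' weights and biases over the
# random initialization; only the top layer still starts from scratch.
for i in range(len(face_net) - 1):
    W, b = face_net[i]
    hair_net[i] = (W.copy(), b.copy())
```

After the copy, only the final layer of the new network has to be learned from scratch; the lower layers start from features the first network already discovered.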
As a concrete example of that pyramid-style sizing: with three hidden layers on the Fashion MNIST dataset, you could have 300 neurons in the first hidden layer, 200 in the second, and 100 in the third, the rationale being that many low-level features can coalesce into far fewer high-level features.

However, this practice has largely been abandoned, as it seems that simply using the same number of neurons in all hidden layers performs just as well in most cases, or maybe even better. It also has the advantage of leaving only one hyperparameter to tune instead of one per layer. So instead of having 300, 200, and 100 neurons in the three hidden layers, you can have 150 neurons in all three of them.

If you think the problem at hand is really complex, you can try increasing the number of neurons gradually until the network starts overfitting. In general, increasing the depth of the network gives a better return on accuracy than increasing the number of neurons per layer. Another approach is to build a model with a large number of layers and a large number of neurons per hidden layer, and then use early stopping to prevent the model from overfitting.

The next hyperparameter we are going to discuss is the learning rate. It is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate.
By the maximum learning rate, we mean the rate above which the training algorithm diverges.

So a simple approach for tuning the learning rate is to start with a large value that makes the algorithm diverge, then divide this value by three and try again, and repeat until the training algorithm stops diverging. At that point, you generally won't be too far from the optimal learning rate.

Then there is the batch size. As a general rule of thumb, try to keep the batch size no larger than 32, because a small batch size ensures that each training iteration is very fast. On the lower end, try to keep the batch size above 20, because this helps take advantage of hardware and software optimizations, in particular for matrix multiplications, which also speeds up training. So a good range is between 20 and 32.

Lastly, there is another hyperparameter: the number of epochs, that is, the number of passes the training algorithm makes over the training data. Instead of tuning it, we suggest that you use a large number of epochs together with the early stopping technique to prevent overfitting.

So that's all about selecting hyperparameters.
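To close, the divide-by-three heuristic for the learning rate can be sketched on a toy problem. The quadratic loss f(w) = w² and the starting rate of 10 are assumptions for illustration. For this particular loss, gradient descent diverges for any rate above 1.0, and the single-step-optimal rate is 0.5, exactly half the maximum, in line with the rule of thumb above:

```python
def diverges(lr, steps=100):
    """Run gradient descent on the toy loss f(w) = w**2 (gradient 2*w)
    and report whether the iterates blow up at this learning rate."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * w          # update: w <- w * (1 - 2*lr)
        if abs(w) > 1e6:
            return True
    return False

# Start with a deliberately large rate and divide by 3 until training
# stops diverging, as suggested in the lecture.
lr = 10.0
while diverges(lr):
    lr /= 3.0
print(lr)  # 10/27 ~= 0.370
```

The search stops at 10/27, comfortably below the divergence threshold of 1.0 and in the same ballpark as the optimum of 0.5, which is what the heuristic promises: not the best rate, but not far from it.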