So once we have trained our model, I told you that we can quantify the quality of fit of our model using the mean squared error term, which is given by this formula: it is basically the residual sum of squares divided by n.

If the predicted responses are very close to the observations, the mean squared error will be small.

But if we compute the mean squared error on the same data which we used to train our model, it is called the training MSE.

The training error is not really what we are interested in. We are interested in the accuracy of the predictions when we apply our method to previously unseen test data.

For example, suppose I am predicting house prices. I don't really care how well our method predicts the house prices of previously completed transactions; I care about how well it will predict the house prices of future transactions.

Similarly, if I want to predict the risk of a particular disease in different individuals, I want to do it for future patients, and not for the ones whose outcome I already know.

So what we are going to do is split our data into two parts. One will be called the training set; this will be used to train the model. The other part will be called the test set; this will be the unseen data, and it will be used to assess the accuracy of our model.
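The training MSE just described is easy to compute directly. Here is a minimal sketch; the data points and predictions are made-up toy values, not anything from the lecture:

```python
def mse(y_true, y_pred):
    """Mean squared error: the residual sum of squares divided by n."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# Toy training responses and the fitted model's predictions on the SAME data,
# so this is the training MSE (illustrative numbers only).
y_train = [2.0, 4.1, 6.0, 7.9]
y_hat = [2.1, 4.0, 5.8, 8.2]

train_mse = mse(y_train, y_hat)
print(round(train_mse, 4))  # → 0.0375
```

A small training MSE only tells us the model fits the points it was trained on; it says nothing yet about how it will do on unseen data.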
So mathematically, I have these n pairs of observations, (x1, y1), (x2, y2), ..., (xn, yn). These will be part of my training data, and I will use them to train my model. Once I have used them, I will have identified the functional form of f, that is, f̂(x).

Now I will use a previously unseen observation, (x0, y0). These observations will come from our test set. And I will try to find out the test error, which is given by this formula: the test mean squared error is the average of the squared differences between the predicted values of y and the actual values of y on the given test data.

So for different types of models, I will compare the value of this test error and then select the model with the least test error.

I hope you understand the idea behind having separate test and training data. Basically, we have training data and a corresponding training error, by which the model is fitted; but there is no guarantee that the method with the lowest training error will also have a low test error.

Roughly speaking, many statistical methods specifically estimate coefficients so that we are able to minimize the training error. For such methods, the training error will be small, but the actual test error can be quite large.

In this graph, you can see four different curves. This black one
is the true function that we want to predict. This orange line is the output of a linear regression model, and these blue and green lines are the results of some other, more flexible models. The small circles that we are seeing are the data points which were used to train the model.

You can see that as I increase the flexibility of the model, that is, as I allow it to change its shape or its direction many times, it touches more points on this graph. So this green curve, which has high flexibility, is fitting the maximum number of points, whereas the orange curve, which has the least flexibility, is touching very few points.

You can see that after a certain level of flexibility, the extra flexibility is making the curve more wiggly; that is, it is following the individual data points and not the overall function.

The effect of flexibility on the training error and the test error can be seen in the graph on the right. You can see that this grey plot is of the training error: as you keep on increasing the flexibility, the training error keeps coming down. That is, the model will be fitting, or passing through, more and more of the sample points. But after a certain point, the test error, which is given by this red curve, starts increasing with increasing flexibility.
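This training-versus-test behaviour can be sketched with a toy example. Here I use k-nearest-neighbour regression as a stand-in for the flexible models in the figure (a smaller k means a more flexible fit); the data are invented. The training error necessarily shrinks as flexibility grows, and at maximum flexibility (k = 1) the curve passes through every training point exactly, while the test error is measured on held-out points and is the number we would actually compare between models:

```python
def knn_predict(x_train, y_train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(xs, ys, x_train, y_train, k):
    """MSE of the k-NN fit on the points (xs, ys)."""
    preds = [knn_predict(x_train, y_train, x, k) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Invented noisy samples of an underlying function, plus held-out test points.
x_tr = [0.0, 1.0, 2.0, 3.0, 4.0]
y_tr = [0.1, 1.2, 1.9, 3.3, 3.8]
x_te = [0.5, 1.5, 2.5, 3.5]
y_te = [0.5, 1.5, 2.5, 3.5]

for k in (5, 3, 1):  # left to right: increasing flexibility
    print(k,
          round(mse(x_tr, y_tr, x_tr, y_tr, k), 3),  # training MSE keeps falling
          round(mse(x_te, y_te, x_tr, y_tr, k), 3))  # test MSE is what we compare
```

With more data and more noise, the test column would eventually turn back up as the very flexible fit starts chasing individual noisy points, giving the U-shape of the red curve in the figure.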
You can see that this orange point marks the training and test error for this orange curve, which is the inflexible linear fit. This blue point is for the blue curve, which is approximating the true function very closely. And this green point is for the green curve, which is very flexible: it has a lower training error than the blue curve, since it is fitting the points more closely, but it has a higher test error, because it is not approximating the true function.

So we want to identify this blue point, where we get the minimum test error. There are several techniques to split the data into training and test sets so that we can find this minimum point.

So we are going to discuss the three most popular techniques. The first is called the validation set approach. The second is leave-one-out cross-validation. And the third one is k-fold cross-validation.

The first technique, the validation set approach, is the simplest approach. In this method, we will randomly divide the data into two parts, a training set and a test set. The model will be fitted on the training set, and once the model is trained, the error on the test set will be calculated to estimate the test error.

We usually do a split of 80/20; that is, we use 80 percent of the data for training purposes and 20 percent of the data for testing purposes.
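An 80/20 validation split like the one described can be sketched as follows. This is a toy sketch with invented data; the random seed is fixed only so the example is reproducible:

```python
import random

def validation_split(data, train_frac=0.8, seed=0):
    """Randomly divide the data into a training set and a test (validation) set."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)                      # random assignment of observations
    cut = int(len(data) * train_frac)
    train = [data[i] for i in idx[:cut]]  # 80% used to fit the model
    test = [data[i] for i in idx[cut:]]   # 20% held out to estimate test error
    return train, test

data = list(range(10))
train, test = validation_split(data)
print(len(train), len(test))  # → 8 2
```

The model would then be fitted on `train`, and its MSE computed on `test` to estimate the test error.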
We'll be running this approach in our software package in a separate video.

There are basically two limitations of this approach. One is that part of the available data will not be used for training. As we know, the more data we use during training, the better the performance of the model will be. So if we keep some data aside for testing, the trained model will not be as good. And if you have a limited number of observations, your training will be severely impacted.

Secondly, the test error can be highly variable, depending on which observations are selected for training and which observations are selected for testing.

So to handle these two issues, there are two alternative approaches.

In leave-one-out cross-validation, suppose we have n observations. We will keep the first observation for testing purposes and train our model on the remaining n − 1 observations. Then we will keep the second observation for testing purposes and train the model on the remaining n − 1 observations. We will repeat this n times, so that every time we keep one observation for testing and the other n − 1 for training, and we will take the average of the error on each of these test observations.

So since we will need to fit the model n times, this method can be computationally expensive.
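The leave-one-out loop can be sketched generically. To keep the sketch self-contained I plug in a deliberately trivial "model" that always predicts the mean of its training responses; any real fit/predict pair would slot in the same way (all names and data here are invented for illustration):

```python
def loocv_mse(xs, ys, fit, predict):
    """Leave-one-out CV: hold each observation out once, train on the
    remaining n - 1, and average the squared test errors."""
    errors = []
    for i in range(len(xs)):
        x_tr = xs[:i] + xs[i + 1:]   # all observations except the i-th
        y_tr = ys[:i] + ys[i + 1:]
        model = fit(x_tr, y_tr)
        errors.append((ys[i] - predict(model, xs[i])) ** 2)
    return sum(errors) / len(errors)

# Trivial stand-in model: always predict the mean of the training responses.
fit = lambda x_tr, y_tr: sum(y_tr) / len(y_tr)
predict = lambda model, x: model

print(loocv_mse([1, 2, 3], [2.0, 4.0, 6.0], fit, predict))  # → 6.0
```

Note that `fit` is called once per observation, which is exactly why the method becomes expensive for large n.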
So an alternative to this leave-one-out cross-validation is k-fold cross-validation. In this, we will divide the data into k sets. Then we will train the model on k − 1 sets and use the k-th set for testing purposes.

You can see that leave-one-out cross-validation is a special case of k-fold cross-validation: if you have k equal to n, then k-fold cross-validation and leave-one-out cross-validation are the same thing.

So we will not be covering these two techniques in this software package; we will only be running the validation set approach in our software package.
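A k-fold version of the same sketch, again with the trivial mean-predicting stand-in model and invented data: each fold is held out once, the model is trained on the other k − 1 folds, and the errors are averaged. Setting k equal to n reproduces leave-one-out exactly:

```python
def kfold_indices(n, k):
    """Split the indices 0..n-1 into k consecutive folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def kfold_mse(xs, ys, k, fit, predict):
    """k-fold CV: train on k - 1 folds, test on the held-out fold, average."""
    errors = []
    for test_idx in kfold_indices(len(xs), k):
        held_out = set(test_idx)
        x_tr = [xs[i] for i in range(len(xs)) if i not in held_out]
        y_tr = [ys[i] for i in range(len(ys)) if i not in held_out]
        model = fit(x_tr, y_tr)
        errors += [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
    return sum(errors) / len(errors)

# Trivial stand-in model: always predict the mean of the training responses.
fit = lambda x_tr, y_tr: sum(y_tr) / len(y_tr)
predict = lambda model, x: model

# With k equal to n (= 3 here), this is exactly leave-one-out CV.
print(kfold_mse([1, 2, 3], [2.0, 4.0, 6.0], 3, fit, predict))  # → 6.0
```

With a smaller k, such as 5 or 10, each model is trained on most of the data but only k fits are needed, which is the usual compromise in practice.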