So once we have trained our model, I told you that we can quantify its performance using the confusion matrix. We can see how many times we got it right and how many times we got it wrong.

But when we are drawing the confusion matrix on the same data which we used to train our model, the error it measures is called the training error, and the training error is not something we are interested in. We are interested in the accuracy of the predictions when we apply our method to previously unseen test data.

For example, when I'm predicting whether a house will be sold within three months or not, I don't really care how well the method predicts whether the house will be sold or not on the previously completed transactions, which my system has already seen. I want to know how well it will predict whether a house will be sold on future transactions. Similarly, if I want to predict the risk of a particular disease in different individuals, I want to do it for future patients and not for the ones I already know the outcome for.

So to handle this issue, what we're going to do is split our data into two parts. One part will be called the training set, and the other part will be called the test set. The training set will be used to train the model, and the test set will be used to test its performance.
So the test set will be the unseen data, and it will be used to assess the accuracy of our model.

Mathematically, I have these pairs of x's and y's: (x1, y1), (x2, y2), and so on. These n pairs of x's and y's will be my training set. I will use them to train my model, and once my model is trained, I have a functional form of the relationship between x and y.

Now I will take this previously unseen set of data and feed these observations into the model to predict the value of y. This predicted value of y and the actual value of y available for the test set, that is, the y's in the test set, will be compared to create the confusion matrix, and this confusion matrix will be used to assess the accuracy of our model.

So when we have three different types of classifiers, that is, logistic regression, linear discriminant analysis and K nearest neighbors, we will draw the confusion matrix on the test set for all three classifiers and then compare their performance on this previously unseen data instead of the training data.

The main reason why we have to separate the data into a test set and a training set is because there is no guarantee that if a model is giving a low training error, it will also have a low test error.
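The comparison of predicted y's against actual y's on the test set can be sketched in a few lines of plain Python. The labels below are made-up illustration data (1 = house sold within three months, 0 = not sold), not from any real dataset.

```python
# Hypothetical test-set labels: 1 = sold within three months, 0 = not sold.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual outcomes in the test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the trained model predicted

# The four cells of the 2x2 confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)  # fraction we got right on unseen data
```

The same counts computed on the training set would give the training error instead, which is exactly what we want to avoid relying on.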
Roughly speaking, many statistical methods specifically estimate the y values so as to minimize the training error. For such methods, the training error may be very low, but the test error will be quite large.

In this graph, you can see that the true function of this dataset is the curved line. If I use a less flexible method, such as a straight line, to estimate the values, it will have a lot of error. But on the other hand, if I increase the flexibility too much, my line will exactly follow each and every point instead of following the general trend.

So having too much flexibility makes the model overfit the data, which also results in increasing error. In this situation, if you notice, the training error will be very low, as each point is captured by this curve. So this particular model will give a very low training error. But in fact, if you use this model on any unseen data, it will give a very high error rate, probably even more than the straight line.

So we need to find a balance when selecting the flexibility of our model, and hence we should compare our different models on the basis of their performance on unseen data instead of previously seen data, the training data.

This graph tells us how the error rate changes along with flexibility.
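The straight-line versus overly flexible fit described above can be reproduced with a small experiment. This is a sketch with made-up data: a sine curve plays the role of the true function, noise is added, and polynomials of increasing degree play the role of increasingly flexible models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a smooth true function (sine) plus noise.
x_train = np.linspace(0.0, 3.0, 15)
y_train = np.sin(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.05, 2.95, 15)
y_test = np.sin(x_test) + rng.normal(0.0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the data (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 9):                 # rigid, balanced, very flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)

# Training error can only shrink as flexibility grows (a higher-degree
# polynomial can always reproduce a lower-degree fit), but on the unseen
# test data the flexible fit tracks the noise and its error stays high.
```

Degree 1 is the straight line from the graph; degree 9 is the curve that chases every point. Only the test errors reveal which flexibility actually generalizes.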
If I keep on increasing the flexibility of my model, the training error, which is given by this light blue line, keeps on decreasing continuously. So we may be tempted to use a very flexible method so as to get a low training error. But in reality, if I present unseen data, the more flexible model will perform worse.

So we need to look at the test error rate, which first decreases and then increases as we increase the flexibility, and we have to identify the point where the test error rate is at its minimum.

Now, there are several techniques to split the data into a training set and a test set so that we can find this minimum point. We are going to discuss the three most popular techniques here. The first one is called the validation set approach, the second is leave-one-out cross-validation, and the third one is K-fold cross-validation.

The first technique, the validation set approach, is the simplest approach. We will randomly divide the data into two parts: a training set and a test set. The model will be fitted on the training set, and once the model is trained, the test error will be calculated on the test set. We usually split the available data in a ratio of 80 to 20. That is, 80 percent of the data will be used for training purposes and 20 percent will be used for testing purposes.
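The random 80:20 division in the validation set approach can be sketched with the standard library alone. The function name and the fixed seed here are illustrative choices, not part of any particular package.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Randomly divide the data into a training set and a test set (80:20 by default)."""
    shuffled = data[:]                       # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)    # random assignment of observations
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

train, test = train_test_split(list(range(100)))  # 100 toy observations
```

The seed is fixed only so the split is reproducible; in practice the random assignment is exactly what makes the resulting test error variable, which is one of the limitations discussed next.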
There are basically two limitations of this approach. One is that part of the available data will not be used for training, and as we know, the more data we use during training, the better the performance of the model will be. So if we keep some data aside for testing, the trained model will not be as good as it could be. And if you have a limited number of observations, that is, not a lot of observations, your training will be severely impacted.

Secondly, the test error can be highly variable, depending on which observations are selected for training and which observations are selected for testing.

So to handle these two issues, there are these two alternative approaches.

In leave-one-out cross-validation, we will keep the first observation for testing and run the model on the remaining n minus one observations. In the next round, we will keep the second observation for testing purposes and run the model on the remaining n minus one observations again. In each cycle, we will use just one observation for testing, and the error calculated in each cycle will be averaged to establish the test error in this method.

Since we need to run the model several times, this method can be computationally expensive. An alternative to leave-one-out cross-validation is K-fold cross-validation. In this, we will divide the data into K sets.
We will keep one set for testing and use the other K minus one sets for training. You can see that leave-one-out cross-validation is a special case of K-fold cross-validation: if you have K equal to n, that is, K is equal to the total number of observations, this is exactly the same as leave-one-out cross-validation.

We will not be covering these two techniques in the software package. We'll only be using the validation set approach.
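The cross-validation procedure described above can be sketched in one function, with leave-one-out falling out as the K = n case. The toy model (predicting every y as the training mean) and the `fit`/`loss` interface are illustrative assumptions, not from any specific library.

```python
def k_fold_cv(data, k, fit, loss):
    """Estimate the test error by K-fold cross-validation.

    data: list of observations; fit: trains a model on a list of them;
    loss: scores a trained model on one held-out observation.
    """
    n = len(data)
    # Divide the data into K near-equal-sized sets.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    errors, start = [], 0
    for size in fold_sizes:
        test_fold = data[start:start + size]             # the held-out set
        train_fold = data[:start] + data[start + size:]  # the other K-1 sets
        model = fit(train_fold)
        errors.extend(loss(model, obs) for obs in test_fold)
        start += size
    return sum(errors) / n   # average the errors over all held-out observations

# Toy model: predict every y as the mean of the training y's.
fit = lambda ys: sum(ys) / len(ys)
loss = lambda model, y: (model - y) ** 2

data = [1.0, 2.0, 3.0, 4.0]
loocv = k_fold_cv(data, k=len(data), fit=fit, loss=loss)  # K = n: leave-one-out
two_fold = k_fold_cv(data, k=2, fit=fit, loss=loss)
```

With K = n the loop runs once per observation, which is why leave-one-out is the computationally expensive extreme; smaller K trades some of that cost for fewer model fits.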