1 00:00:00,033 --> 00:00:04,633 Hello my friends, and welcome to this new section where you will finally learn 2 00:00:04,633 --> 00:00:10,333 how to evaluate your regression models and, mostly, how to select the best one. 3 00:00:10,533 --> 00:00:14,700 All right, so this is indeed a section that has been long awaited. 4 00:00:14,900 --> 00:00:19,300 Because indeed, in this whole part two on regression, we built many machine 5 00:00:19,300 --> 00:00:20,200 learning models. 6 00:00:20,200 --> 00:00:23,400 And now most of you must have the question: okay, that's cool. 7 00:00:23,400 --> 00:00:27,733 I have all these regression models in the toolkit, but which one do I select? 8 00:00:27,733 --> 00:00:30,733 Which one should I apply to my data set? 9 00:00:31,000 --> 00:00:34,233 And well, I actually have some very good news for you. 10 00:00:34,300 --> 00:00:37,866 We will give the answer to this exact question in this tutorial. 11 00:00:38,100 --> 00:00:42,000 So I'm going to try to reveal everything in this same tutorial so that, you know, 12 00:00:42,000 --> 00:00:47,166 this can be the ultimate tutorial on regression, where you finally learn 13 00:00:47,166 --> 00:00:52,066 how to use your regression toolkit the right way on your future data sets. 14 00:00:52,400 --> 00:00:57,800 So what I will do in this tutorial is I will introduce you to this toolkit 15 00:00:57,800 --> 00:01:01,800 that I've just made and which contains all the regression models we learned 16 00:01:01,800 --> 00:01:04,900 together in some very generic code 17 00:01:04,900 --> 00:01:08,833 templates. By very generic code templates, I mean that 18 00:01:08,933 --> 00:01:13,200 you will be able to use these code templates on your future data sets 19 00:01:13,333 --> 00:01:18,733 with only one or two things to change. I made them as generic as possible 20 00:01:18,733 --> 00:01:21,600 so that they can be ready to deploy on your data sets.
21 00:01:21,600 --> 00:01:25,800 And besides, each of them contains, at the end of the implementation, 22 00:01:25,800 --> 00:01:30,033 an evaluation tool, you know, allowing you to evaluate your model 23 00:01:30,200 --> 00:01:35,400 so that you can very easily and quickly compare the performance of each of them. 24 00:01:35,700 --> 00:01:39,933 In other words, you know, in short, thanks to this toolkit, you will be able 25 00:01:39,933 --> 00:01:44,633 to select the best model for your data set in a very short amount of time. 26 00:01:44,633 --> 00:01:46,400 You know, very, very efficiently. 27 00:01:46,400 --> 00:01:48,900 And that's exactly what I'll prove to you. 28 00:01:48,900 --> 00:01:51,566 You know, what I'm going to show you in this tutorial. 29 00:01:51,566 --> 00:01:53,100 We're going to take a real-world 30 00:01:53,100 --> 00:01:56,833 data set, you know, with several features and lots of observations. 31 00:01:57,133 --> 00:02:01,333 I will deploy each of the regression models of the toolkit on this data set, 32 00:02:01,566 --> 00:02:04,500 and you will see how quickly and efficiently 33 00:02:04,500 --> 00:02:06,966 I manage to figure out the best model. 34 00:02:06,966 --> 00:02:09,466 And that's actually the answer to the question: 35 00:02:09,466 --> 00:02:11,333 how should I select the best model? 36 00:02:11,333 --> 00:02:12,833 And the simple answer is: 37 00:02:12,833 --> 00:02:17,200 try all your models, try all your models, and just select the best one, 38 00:02:17,200 --> 00:02:19,400 the one having the best performance result. 39 00:02:19,400 --> 00:02:21,466 And that performance result is measured 40 00:02:21,466 --> 00:02:25,100 by, of course, the coefficient R squared or adjusted R squared. 41 00:02:25,800 --> 00:02:26,400 All right. 42 00:02:26,400 --> 00:02:27,300 So there we go. 43 00:02:27,300 --> 00:02:29,533 Let me introduce you to this toolkit. 44 00:02:29,533 --> 00:02:32,533 And then let's proceed to the demo.
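Editor's sketch: the two performance measures named here, R squared and adjusted R squared, can be computed by hand in a few lines. This is a minimal illustration and not the evaluation code from the toolkit. The function names `r_squared` and `adjusted_r_squared` and the numbers in the example are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R squared for the number of features used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up values, for illustration only (hypothetical model with 1 feature).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)
print(f"R squared: {r2:.4f}, adjusted R squared: {adj_r2:.4f}")
```

Both scores are at most 1, and a higher score means a better fit. The adjusted version only differs noticeably when the number of features is large relative to the number of observations.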
45 00:02:32,566 --> 00:02:35,833 But first let's make sure everyone here is on the same page. 46 00:02:36,033 --> 00:02:40,500 This is a new folder, you know, different from the whole machine learning 47 00:02:40,500 --> 00:02:43,000 folder containing ten parts. 48 00:02:43,000 --> 00:02:46,900 This is a new folder where you will get, you know, that regression toolkit 49 00:02:46,900 --> 00:02:48,666 containing all the regression models. 50 00:02:48,666 --> 00:02:52,533 And then, when we tackle part three, the classification toolkit with all the 51 00:02:52,533 --> 00:02:57,133 classification models. And mostly, you know, this is the model selection folder. 52 00:02:57,133 --> 00:03:00,400 This is the folder you will want to use when you want to deploy 53 00:03:00,533 --> 00:03:03,533 either your regression models or your classification models 54 00:03:03,633 --> 00:03:07,666 on your data set, in order to quickly and efficiently select the best one. 55 00:03:08,000 --> 00:03:09,300 And now there we go. 56 00:03:09,300 --> 00:03:14,600 Let's enter this regression folder for model selection, and as you see, 57 00:03:14,700 --> 00:03:18,333 it contains the five regression models 58 00:03:18,333 --> 00:03:22,866 that we studied in this part two: you know, multiple linear regression. 59 00:03:22,866 --> 00:03:25,866 And I didn't include simple linear regression of course, because 60 00:03:25,933 --> 00:03:29,933 now we will work with a real-world data set, with therefore several features. 61 00:03:30,266 --> 00:03:32,100 Then we have polynomial regression. 62 00:03:32,100 --> 00:03:35,066 Then support vector regression, then decision tree 63 00:03:35,066 --> 00:03:38,066 regression and of course random forest regression.
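Editor's sketch: the "try all your models and select the best one" approach with the five models listed here could look like the following. This assumes scikit-learn and NumPy, which the transcript does not name, and it is not the toolkit's own code. The data is synthetic, and every hyperparameter (polynomial degree, kernel, number of trees, test size) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): 500 observations, 4 numerical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The five model families named in the transcript; settings are illustrative.
models = {
    "multiple_linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Fit every model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Keep the model with the highest R squared.
best_name = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:16s} R squared = {score:.3f}")
print("Best model:", best_name)
```

Scoring on a held-out test set rather than on the training data keeps the comparison fair to models that can memorise their training set, such as decision trees.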
64 00:03:38,166 --> 00:03:42,166 And as I told you, I made each of these implementations 65 00:03:42,166 --> 00:03:46,533 very generic so that you can deploy them on your future 66 00:03:46,533 --> 00:03:50,200 data sets with only one or two things to change. 67 00:03:50,200 --> 00:03:54,000 Assuming, of course, that your data set is in CSV format 68 00:03:54,300 --> 00:03:58,366 and contains all the features in the first columns, and the dependent 69 00:03:58,366 --> 00:04:02,366 variable in the last column. That's really the essential condition. 70 00:04:02,600 --> 00:04:03,566 And then of course, here 71 00:04:03,566 --> 00:04:07,533 I chose a data set without missing values or categorical data. 72 00:04:07,533 --> 00:04:09,000 That's because I trust 73 00:04:09,000 --> 00:04:12,866 you will know how to handle these thanks to your data preprocessing toolkit. 74 00:04:13,033 --> 00:04:17,900 So this data set is quite classic, yet real-world, because as you can see, 75 00:04:17,900 --> 00:04:21,833 it contains several features and many, many observations. 76 00:04:21,833 --> 00:04:25,566 Actually almost 10,000 observations if we scroll down. 77 00:04:25,566 --> 00:04:27,133 Yes, almost 10,000. 78 00:04:27,133 --> 00:04:27,733 All right. 79 00:04:27,733 --> 00:04:31,900 With, as you can see, only numerical values, no categorical data in strings. 80 00:04:32,066 --> 00:04:34,100 And once again, no missing data. 81 00:04:34,100 --> 00:04:38,133 And I chose such a data set so that, you know, we can make our code 82 00:04:38,133 --> 00:04:41,800 templates for each of our regression models 100% generic, 83 00:04:41,933 --> 00:04:45,233 so that you only have to change the name of the data set.
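Editor's sketch: the layout condition described here (CSV format, features in the first columns, dependent variable in the last column) is what lets a template split any data set the same way. The example below assumes pandas, which the transcript does not name. The column names and the inline CSV text are hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Tiny inline CSV standing in for a real file; column names are hypothetical.
csv_text = """feature_a,feature_b,feature_c,target
1.0,2.0,3.0,10.0
4.0,5.0,6.0,20.0
7.0,8.0,9.0,30.0
"""

# For a real file, this would be the one thing to change,
# e.g. pd.read_csv("your_file.csv") (hypothetical file name).
dataset = pd.read_csv(io.StringIO(csv_text))

# Every column except the last is a feature; the last is the dependent variable.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Quick checks for the two other conditions mentioned: no missing values
# and no categorical (non-numerical) columns.
has_missing = bool(dataset.isna().any().any())
non_numeric = dataset.select_dtypes(exclude="number").columns.tolist()

print("X shape:", X.shape, "| y shape:", y.shape)
print("Missing values:", has_missing, "| Non-numeric columns:", non_numeric)
```

If either check flags a problem, the data set needs preprocessing (imputing missing values, encoding categories) before a template of this kind applies as-is.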