1 00:00:00,200 --> 00:00:02,100 And now what is this data set about? 2 00:00:02,100 --> 00:00:05,900 Well, that's a classic data set from actually the UCI Machine 3 00:00:05,900 --> 00:00:08,900 Learning Repository, which I encourage you to have a look at, 4 00:00:09,066 --> 00:00:10,500 because indeed it is a website 5 00:00:10,500 --> 00:00:13,500 that contains a lot of data sets on which you can practice. 6 00:00:13,566 --> 00:00:17,300 And this one is actually called Combined Cycle Power Plant. 7 00:00:17,566 --> 00:00:20,700 And it consists of trying to predict this 8 00:00:20,700 --> 00:00:23,933 dependent variable, which is actually an energy output. 9 00:00:23,966 --> 00:00:26,066 And don't worry, you don't have to understand 10 00:00:26,066 --> 00:00:29,466 how energy works or how the physics of this data set works. 11 00:00:29,700 --> 00:00:32,733 The only thing that you need to understand is that we want to predict 12 00:00:32,733 --> 00:00:36,333 this dependent variable, which turns out to be an energy output, 13 00:00:36,566 --> 00:00:41,233 and we are predicting this dependent variable with these four features here, 14 00:00:41,266 --> 00:00:46,566 which are first the engine temperature, second, the exhaust vacuum, 15 00:00:46,866 --> 00:00:51,233 third, the ambient pressure, and fourth the relative humidity. 16 00:00:51,433 --> 00:00:53,933 All right. So that's the only thing that matters here. 17 00:00:53,933 --> 00:00:57,666 You have to see it as, you know, a general data set where you have 18 00:00:57,666 --> 00:01:02,500 several features that you're going to use to predict that dependent variable. 
19 00:01:02,733 --> 00:01:03,733 And as you can see, 20 00:01:03,733 --> 00:01:06,966 the condition, you know, in order to deploy our regression 21 00:01:06,966 --> 00:01:10,400 models on this data set and the future data sets you'll be working on 22 00:01:10,666 --> 00:01:13,500 is to have in the first columns the features 23 00:01:13,500 --> 00:01:16,066 and in the last column the dependent variable. 24 00:01:16,066 --> 00:01:18,066 All right. That's all that matters. 25 00:01:18,066 --> 00:01:22,800 If you have a data set like that, which has no missing data and no categorical data, 26 00:01:22,800 --> 00:01:27,133 well, you can deploy each and every single one of these regression 27 00:01:27,133 --> 00:01:31,033 models by just having to change the name of your data set. 28 00:01:31,033 --> 00:01:34,266 And if your data set has missing data or categorical data, 29 00:01:34,266 --> 00:01:37,033 you just have to go to your data preprocessing toolkit 30 00:01:37,033 --> 00:01:38,100 to take care of this. 31 00:01:38,100 --> 00:01:40,866 And then you can deploy these models. 32 00:01:40,866 --> 00:01:41,700 All right. 33 00:01:41,700 --> 00:01:45,166 So now, time for the demo. I'm going to show you 34 00:01:45,166 --> 00:01:46,233 how we are going to quickly 35 00:01:46,233 --> 00:01:49,866 and efficiently plug and play each of these regression templates 36 00:01:50,100 --> 00:01:53,400 by only having to change the name of the data set. 37 00:01:53,700 --> 00:01:57,800 And then I'll show you how we will quickly identify and select 38 00:01:57,800 --> 00:02:01,533 the best regression model for this particular data set. 39 00:02:01,733 --> 00:02:03,433 All right, let's do this. 40 00:02:03,433 --> 00:02:08,500 So our first step here will be to create a copy of each of these files, 41 00:02:08,500 --> 00:02:10,833 because these are all in read-only mode 42 00:02:10,833 --> 00:02:13,066 because, you know, this folder was shared with you. 
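The layout condition stated above (features in the first columns, dependent variable in the last column, no missing data, no categorical data) can be checked before plugging a data set into a template. What follows is only a minimal sketch using the Python standard library, not the course's own code. The function name `check_layout`, the column names, and the sample values are all made up for illustration.

```python
import csv
import io


def check_layout(csv_text):
    """Return a list of problems found in a features-first, target-last CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    problems = []
    for line_no, row in enumerate(data, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} values, got {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if value.strip() == "":
                # An empty cell counts as missing data.
                problems.append(f"line {line_no}: missing value in column {name!r}")
                continue
            try:
                float(value)
            except ValueError:
                # Anything that is not a number is treated as categorical data.
                problems.append(
                    f"line {line_no}: non-numeric value {value!r} in column {name!r}"
                )
    return problems


# Made-up sample: four feature columns first, dependent variable last.
sample = """temperature,vacuum,pressure,humidity,energy_output
15.0,41.5,1020.0,70.0,460.0
25.0,,1015.0,60.0,445.0
5.0,39.0,1012.0,high,488.0
"""

for problem in check_layout(sample):
    print(problem)
```

An empty list means the file already fits the templates. Any reported line would first need the data preprocessing step mentioned above.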
43 00:02:13,066 --> 00:02:17,366 So since all of you will access it, you can of course not modify it directly, 44 00:02:17,566 --> 00:02:21,200 but in order to modify it, you just need to create a copy in your drive. 45 00:02:21,333 --> 00:02:26,566 And to do this, well, we can just right-click here and then make a copy. 46 00:02:26,566 --> 00:02:30,300 So we're going to do this for each of the regression models here. 47 00:02:30,400 --> 00:02:31,333 Let's do this. 48 00:02:31,333 --> 00:02:34,266 Make a copy for multiple linear regression. 49 00:02:34,266 --> 00:02:36,166 Then make a copy. 50 00:02:36,166 --> 00:02:38,866 Then random forest regression, make a copy. 51 00:02:38,866 --> 00:02:41,566 And finally support vector regression. 52 00:02:41,566 --> 00:02:43,533 And there we go. 53 00:02:43,533 --> 00:02:44,133 All right. Good. 54 00:02:44,133 --> 00:02:46,466 So we made a copy of each of these regression models. 55 00:02:46,466 --> 00:02:50,333 And the copies should be either on your main drive 56 00:02:50,333 --> 00:02:53,333 or in this Colab Notebooks folder. 57 00:02:53,466 --> 00:02:56,466 And well, as you can see, they are on my main drive. 58 00:02:56,633 --> 00:02:58,733 So you will actually very easily find them. 59 00:02:58,733 --> 00:03:02,366 And now what we're going to do is open each of these files 60 00:03:02,633 --> 00:03:05,633 in order to proceed with the demo. 61 00:03:05,700 --> 00:03:06,033 All right. 62 00:03:06,033 --> 00:03:09,033 So I have first multiple linear regression. 63 00:03:09,300 --> 00:03:11,733 Then I'm going to open polynomial regression. 64 00:03:11,733 --> 00:03:15,066 You know, in the same order as the one we used 65 00:03:15,066 --> 00:03:18,900 to build our regression models. Then support vector regression. 
66 00:03:19,933 --> 00:03:21,100 Once again you can either 67 00:03:21,100 --> 00:03:24,666 open them with Google Colaboratory or Jupyter Notebook, 68 00:03:24,666 --> 00:03:28,166 or even Spyder Anaconda, because I also gave you the folder 69 00:03:28,166 --> 00:03:30,266 containing all these codes and the data set 70 00:03:30,266 --> 00:03:32,833 right before this tutorial in the article. 71 00:03:32,833 --> 00:03:35,833 So then let's open decision trees 72 00:03:36,000 --> 00:03:40,466 and finally, well, random forest regression. 73 00:03:40,800 --> 00:03:41,933 All right. 74 00:03:41,933 --> 00:03:44,833 So actually let me put it like that. 75 00:03:44,833 --> 00:03:50,066 You know, the same order: support vector, decision tree, and random forest. 76 00:03:50,066 --> 00:03:50,400 All right. 77 00:03:50,400 --> 00:03:53,900 So now we have all our regression models open. 78 00:03:54,300 --> 00:03:57,033 I'm first going to show you the code templates one by one. 79 00:03:57,033 --> 00:03:59,633 And then we will deploy them on the data set. 80 00:03:59,633 --> 00:04:03,233 And I'll show you how to quickly figure out which one is the best model. 81 00:04:03,233 --> 00:04:04,066 All right. 82 00:04:04,066 --> 00:04:07,633 So starting with multiple linear regression, let's see the different steps. 83 00:04:07,833 --> 00:04:10,033 So we start by importing the libraries. 84 00:04:10,033 --> 00:04:13,033 Of course that's the first step of the data preprocessing phase. 85 00:04:13,133 --> 00:04:14,700 Then we import the data set. 86 00:04:14,700 --> 00:04:18,100 And as you can see I made it super generic, meaning that 87 00:04:18,100 --> 00:04:22,033 the only thing that you have to change is actually the name of your data set here. 88 00:04:22,033 --> 00:04:25,366 That's why I specified it in capital letters, so that you can't miss it. 89 00:04:25,600 --> 00:04:30,533 Enter the name of your data set here, and we will actually do that in a few minutes. 
90 00:04:30,900 --> 00:04:33,266 Then here you have nothing to change of course, 91 00:04:33,266 --> 00:04:36,733 because this automatically selects all the columns except the last one, 92 00:04:36,733 --> 00:04:39,900 therefore your features, and this automatically selects 93 00:04:40,066 --> 00:04:42,600 the last column, meaning the dependent variable. 94 00:04:42,600 --> 00:04:46,800 All right, then we split the data set into the training set and the test set. 95 00:04:47,033 --> 00:04:49,400 Of course here that's very important to do this 96 00:04:49,400 --> 00:04:53,033 because since we want to select the best model, well, we need this test set 97 00:04:53,166 --> 00:04:54,866 in order to evaluate the performance 98 00:04:54,866 --> 00:04:57,933 of each of them in order to compare them and select the best one. 99 00:04:58,200 --> 00:05:00,633 So we have to do this step, absolutely. 100 00:05:00,633 --> 00:05:04,900 Then once we have, well, the training set, we will train our model 101 00:05:04,900 --> 00:05:06,433 on the training set. 102 00:05:06,433 --> 00:05:09,766 Then we will predict the test results, you know, to have a look 103 00:05:09,766 --> 00:05:13,566 at the predictions and compare them to the real results in y_test. 104 00:05:13,733 --> 00:05:17,400 And then finally we will evaluate the model performance. 105 00:05:17,400 --> 00:05:21,400 And here I don't want to scroll down now because we will discover together 106 00:05:21,400 --> 00:05:25,100 a bit later the code to evaluate a regression model. 107 00:05:25,266 --> 00:05:28,166 You know, with the R squared coefficient. 108 00:05:28,166 --> 00:05:28,500 All right. 109 00:05:28,500 --> 00:05:32,800 So that's the code template for multiple linear regression. 
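The two generic steps described above, taking every column except the last as features and the last column as the dependent variable, then holding out a test set, can be sketched in plain Python. This is an illustration only and not the template shown in the video, whose code does not appear in this transcript. The function names `split_features_target` and `holdout_split`, the 20% test fraction, the seed, and the made-up rows are all assumptions.

```python
import random


def split_features_target(rows):
    """All columns except the last are features; the last one is the target."""
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


def holdout_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly, then hold out a fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up rows: four features followed by the dependent variable.
rows = [[float(i), i + 0.5, i * 2.0, i - 1.0, 10.0 * i] for i in range(10)]

X, y = split_features_target(rows)
X_train, X_test, y_train, y_test = holdout_split(X, y)
print(len(X_train), len(X_test))  # 8 2
```

Because the selection is positional, nothing in this sketch depends on column names, which is what lets the same code run on any data set with that layout.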
110 00:05:33,033 --> 00:05:36,400 And as I told you and as you see it is super generic 111 00:05:36,400 --> 00:05:39,300 because for any of your future data sets, provided 112 00:05:39,300 --> 00:05:42,700 that they have in the first columns the features and in the last column 113 00:05:42,700 --> 00:05:46,000 the dependent variable, and also provided that they don't have missing data 114 00:05:46,000 --> 00:05:47,366 or categorical data, 115 00:05:47,366 --> 00:05:49,400 well, the only thing that you have to change 116 00:05:49,400 --> 00:05:53,300 within this code template is just to enter the name of your data set here. 117 00:05:53,300 --> 00:05:54,233 And that's it. 118 00:05:54,233 --> 00:05:57,366 And by just doing this, you will be able to evaluate your model 119 00:05:57,566 --> 00:05:58,900 with a relevant metric.
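The metric mentioned earlier in this walkthrough is the R squared coefficient. As a rough sketch of how it is computed and how it could be used to compare models on the same test set, here is a standard-library version. It implements the usual definition, one minus the residual sum of squares divided by the total sum of squares. The helper name `r_squared`, the model names, and every number below are made up for illustration and are not results from the video.

```python
def r_squared(y_true, y_pred):
    """R squared = 1 - SS_res / SS_tot, computed on the test set."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up real results for a tiny test set.
y_test = [3.0, 5.0, 7.0, 9.0]

print(r_squared(y_test, [3.0, 5.0, 7.0, 9.0]))  # perfect predictions: 1.0
print(r_squared(y_test, [6.0, 6.0, 6.0, 6.0]))  # always the mean: 0.0

# Hypothetical predictions from two models evaluated on the same test set.
candidates = {
    "model_a": [3.5, 5.5, 6.5, 8.5],
    "model_b": [4.0, 4.0, 8.0, 8.0],
}
scores = {name: r_squared(y_test, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

The closer the score is to 1, the better the predictions match the real results, so under this criterion the model with the highest score on the test set is the one to keep.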