1 00:00:00,100 --> 00:00:01,700 Hello my friends, and welcome 2 00:00:01,700 --> 00:00:06,633 to this new practical activity on multiple linear regression. 3 00:00:06,900 --> 00:00:08,033 So in this new section 4 00:00:08,033 --> 00:00:12,000 we're going to learn together how to build a multiple linear regression model 5 00:00:12,233 --> 00:00:16,166 on the same data set that was introduced by Kirill in the previous lectures. 6 00:00:16,466 --> 00:00:18,166 And just before we start, 7 00:00:18,166 --> 00:00:21,766 I just want to make sure that everyone here is on the same page. 8 00:00:22,100 --> 00:00:26,266 This is the whole machine learning dataset folder containing all the codes 9 00:00:26,366 --> 00:00:27,466 and data sets. 10 00:00:27,466 --> 00:00:31,666 And right before this tutorial, I give you again the link to this folder. 11 00:00:31,800 --> 00:00:33,733 So make sure to connect to that link. 12 00:00:33,733 --> 00:00:37,966 And now we should be all on the same page ready to start this new machine 13 00:00:37,966 --> 00:00:41,233 learning model, which is of course in part two regression. 14 00:00:41,500 --> 00:00:45,233 And now we're going to go of course to the multiple linear regression folder. 15 00:00:45,500 --> 00:00:48,866 And we're going to start with Python to implement this model. 16 00:00:49,100 --> 00:00:51,366 All right so this is the data set. 17 00:00:51,366 --> 00:00:55,900 And this is the Python implementation in ipynb format 18 00:00:55,900 --> 00:01:00,333 which you can either open with Google Colaboratory or Jupyter Notebook. 19 00:01:00,333 --> 00:01:02,100 Make sure you also have the folder 20 00:01:02,100 --> 00:01:05,466 downloaded on your machine so that you can indeed get these files. 21 00:01:06,000 --> 00:01:06,300 All right. 22 00:01:06,300 --> 00:01:07,800 So before we start the implementation, 23 00:01:07,800 --> 00:01:10,800 let me just explain again what this data set is about. 
24 00:01:10,933 --> 00:01:14,500 So remember, a venture capitalist fund hired you 25 00:01:14,500 --> 00:01:18,333 as a data scientist to train a machine learning model. 26 00:01:18,333 --> 00:01:22,200 And actually a multiple linear regression model to understand 27 00:01:22,200 --> 00:01:24,700 the correlations between these features, 28 00:01:24,700 --> 00:01:28,533 which are the spend in R&D, administration and marketing, 29 00:01:28,533 --> 00:01:33,700 as well as the state, and the profit of, well, 50 startups. 30 00:01:33,900 --> 00:01:35,033 So in this data set, it's 31 00:01:35,033 --> 00:01:39,033 very important to understand that each row corresponds to a certain startup. 32 00:01:39,033 --> 00:01:43,766 And for each startup, well, you, the data scientist, collected the following data. 33 00:01:43,766 --> 00:01:47,166 R&D spend, administration spend, marketing spend 34 00:01:47,166 --> 00:01:48,900 and the state of the startups. 35 00:01:48,900 --> 00:01:51,000 And of course their profit. 36 00:01:51,000 --> 00:01:53,766 Because the goal for this VC fund 37 00:01:53,766 --> 00:01:58,466 is to figure out in which startup to invest based on this information. 38 00:01:58,633 --> 00:02:02,733 So this is all information that we already know from 50 startups. 39 00:02:02,966 --> 00:02:06,666 And therefore, if you manage to train a machine learning model that can understand 40 00:02:06,666 --> 00:02:09,866 well these correlations, well, for the next step, 41 00:02:09,900 --> 00:02:12,900 you'll be able to deploy this model on these same features for a new startup 42 00:02:12,933 --> 00:02:16,733 to predict what sort of profit this new startup might generate. 43 00:02:16,800 --> 00:02:17,500 Okay. 44 00:02:17,500 --> 00:02:20,933 So for this fund, you definitely want to build an accurate model. 45 00:02:21,533 --> 00:02:21,900 All right. 46 00:02:21,900 --> 00:02:24,133 And now we can start the implementation. 
47 00:02:24,133 --> 00:02:25,500 But before we start 48 00:02:25,500 --> 00:02:26,966 I would like you to figure out 49 00:02:26,966 --> 00:02:30,300 what are going to be the first steps of this implementation. 50 00:02:30,300 --> 00:02:32,333 You know, before I show it to you. 51 00:02:32,333 --> 00:02:32,633 All right. 52 00:02:32,633 --> 00:02:34,633 So first, I hope that the first thing that came to 53 00:02:34,633 --> 00:02:38,766 your mind is that indeed the first step is the data preprocessing phase. 54 00:02:39,000 --> 00:02:43,000 And remember, in the data preprocessing phase we start by importing the libraries. 55 00:02:43,033 --> 00:02:44,100 That's for sure. 56 00:02:44,100 --> 00:02:45,833 Then we're going to import the data set. 57 00:02:45,833 --> 00:02:47,433 That's even more for sure. 58 00:02:47,433 --> 00:02:50,966 And then we will split the data set between the training set and the test set. 59 00:02:51,200 --> 00:02:54,733 Because indeed we want to train our model on one set 60 00:02:54,733 --> 00:02:57,733 and evaluate its performance on a separate set. 61 00:02:57,866 --> 00:02:59,966 Okay. So that's always required. 62 00:02:59,966 --> 00:03:03,566 But then is there something else that we have to do here? 63 00:03:03,933 --> 00:03:06,666 Well, to answer this question let's have a look 64 00:03:06,666 --> 00:03:09,766 at the columns one by one, starting with R&D spend. 65 00:03:09,800 --> 00:03:13,500 So R&D spend is a numerical column, you know, containing numerical values. 66 00:03:13,800 --> 00:03:16,466 And when we scroll down, you know, we can scroll down 67 00:03:16,466 --> 00:03:20,000 because there are only 50 observations corresponding to 50 startups. 68 00:03:20,200 --> 00:03:22,800 And we can see that here there is no missing data. 
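[Editor's sketch] Scrolling works for 50 rows, but the same missing-data check can be done programmatically. The snippet below is a minimal sketch, not the course's own code: it assumes pandas is available, and the column names and all values are made up for illustration, standing in for the real startup data.

```python
import pandas as pd

# Hypothetical rows standing in for the startup data; names and values are assumptions.
df = pd.DataFrame({
    "R&D Spend": [165000.0, 162000.0, 153000.0, 144000.0],
    "Administration": [136000.0, 151000.0, 101000.0, 118000.0],
    "Marketing Spend": [471000.0, 443000.0, 407000.0, 383000.0],
    "State": ["New York", "California", "Florida", "New York"],
    "Profit": [192000.0, 191000.0, 190000.0, 182000.0],
})

# Count missing values per column instead of scrolling through the rows.
missing_per_column = df.isna().sum()
print(missing_per_column)

# True when at least one cell is missing anywhere in the table.
has_missing = bool(df.isna().any().any())
print("Any missing data:", has_missing)
```

On a longer data set, a non-zero count in any column would be the signal to reach for a missing-data tool before going further.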
69 00:03:22,800 --> 00:03:26,733 So all good. Then the second column, the administration spend, 70 00:03:26,766 --> 00:03:30,266 you know, all the administration spending, like paying employee salaries 71 00:03:30,300 --> 00:03:31,800 or anything else. 72 00:03:31,800 --> 00:03:35,100 So this column is once again numerical with numerical values. 73 00:03:35,100 --> 00:03:38,166 And there is once again no missing data. 74 00:03:38,466 --> 00:03:39,033 Perfect. 75 00:03:39,033 --> 00:03:42,666 So so far our three steps of the data preprocessing template 76 00:03:42,833 --> 00:03:45,966 are good. The next one, the marketing spend column. 77 00:03:46,200 --> 00:03:49,633 Well same numerical column with no missing data. 78 00:03:49,633 --> 00:03:50,833 All good. 79 00:03:50,833 --> 00:03:52,466 And now this column. 80 00:03:52,466 --> 00:03:53,966 You know the last feature. 81 00:03:53,966 --> 00:03:54,833 Notice once again 82 00:03:54,833 --> 00:03:57,166 that all the features are in the first columns 83 00:03:57,166 --> 00:04:00,266 and the dependent variable which you want to predict in the last column. 84 00:04:00,500 --> 00:04:04,900 So back to this state feature, what reflex do you have in mind? 85 00:04:04,900 --> 00:04:06,600 Now you should have the reflex 86 00:04:06,600 --> 00:04:09,900 if you paid attention to part one, data preprocessing. 87 00:04:10,166 --> 00:04:13,833 Basically, the question I'm asking now is do we need to apply 88 00:04:13,833 --> 00:04:16,833 a certain tool of our data preprocessing toolkit, 89 00:04:17,066 --> 00:04:20,366 which we built in part one, to this data set? 90 00:04:20,766 --> 00:04:23,766 Well, here the answer is obviously yes, 91 00:04:24,000 --> 00:04:28,166 because indeed this state column is not numerical. 92 00:04:28,200 --> 00:04:30,100 It actually has some categories. 93 00:04:30,100 --> 00:04:34,433 It has three categories which are New York, California and Florida. 
94 00:04:34,433 --> 00:04:35,366 And therefore 95 00:04:35,366 --> 00:04:39,000 that's exactly the same situation as in part one, data preprocessing. 96 00:04:39,266 --> 00:04:42,366 There is no order relationship between these 97 00:04:42,366 --> 00:04:45,366 three states New York, California and Florida. 98 00:04:45,633 --> 00:04:50,733 And therefore we will have to apply one hot encoding to that state column, 99 00:04:51,033 --> 00:04:55,066 and therefore we'll have to grab a tool from our data preprocessing toolkit. 100 00:04:55,066 --> 00:04:59,766 And that's why I prepared it here in order to indeed one hot encode 101 00:04:59,966 --> 00:05:02,966 that categorical variable, the state variable. 102 00:05:03,000 --> 00:05:03,800 All right. 103 00:05:03,800 --> 00:05:05,100 And then the profit is fine. 104 00:05:05,100 --> 00:05:06,600 It is numerical. 105 00:05:06,600 --> 00:05:09,533 And besides there is once again no missing data. 106 00:05:09,533 --> 00:05:11,900 So that's what you know you must do. 107 00:05:11,900 --> 00:05:14,833 First you need to have a look at your data set. 108 00:05:14,833 --> 00:05:17,700 If it is not too long, you can check that there is no missing data 109 00:05:17,700 --> 00:05:18,833 like we just did. 110 00:05:18,833 --> 00:05:22,600 If it is too long, then I recommend applying your data 111 00:05:22,600 --> 00:05:26,700 preprocessing tool that handles missing data and deploying it on this data set. 112 00:05:27,033 --> 00:05:30,633 And then you must check if any feature is categorical. 113 00:05:30,633 --> 00:05:32,800 And here we could check that very easily. 114 00:05:32,800 --> 00:05:34,066 The state is categorical. 115 00:05:34,066 --> 00:05:37,300 So we're going to apply our one hot encoding tool 116 00:05:37,300 --> 00:05:40,300 of our data preprocessing toolkit on this state column. 
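[Editor's sketch] One common way to one hot encode an unordered categorical column such as the state is scikit-learn's `ColumnTransformer` combined with `OneHotEncoder`. This is a minimal sketch under assumptions: the transcript does not show which tool the toolkit uses, the feature rows below are made up, and the state is assumed to sit at column index 3.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature rows: R&D spend, administration, marketing spend, state.
X = np.array([
    [165000.0, 136000.0, 471000.0, "New York"],
    [162000.0, 151000.0, 443000.0, "California"],
    [153000.0, 101000.0, 407000.0, "Florida"],
    [144000.0, 118000.0, 383000.0, "New York"],
], dtype=object)

# One hot encode column index 3 (the state, an assumed position) and
# pass the three numerical columns through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

# The encoded state columns come first, followed by the numerical columns.
print(X_encoded)
```

Each state becomes its own 0/1 column, so the model never sees an artificial ordering between New York, California and Florida.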
117 00:05:40,366 --> 00:05:43,800 And then of course we will apply all the rest of the three 118 00:05:43,800 --> 00:05:46,800 essential steps of our data preprocessing template. 119 00:05:46,800 --> 00:05:50,733 And once again we will do that in a flash because this is a template 120 00:05:50,733 --> 00:05:54,233 where we only have one thing to change, which is the name of the dataset.
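[Editor's sketch] The template steps named above (import the libraries, import the data set, split into training and test sets) can be sketched as follows. This is an illustration under assumptions, not the course's template: the data is built in memory with made-up values instead of being read from a file, and `test_size=0.2` and `random_state=0` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the startup table; in practice this would be read from a file.
df = pd.DataFrame({
    "R&D Spend": [165.0, 162.0, 153.0, 144.0, 142.0, 131.0, 134.0, 130.0, 120.0, 123.0],
    "Administration": [136.0, 151.0, 101.0, 118.0, 91.0, 99.0, 147.0, 145.0, 148.0, 108.0],
    "Marketing Spend": [471.0, 443.0, 407.0, 383.0, 366.0, 362.0, 127.0, 323.0, 311.0, 304.0],
    "State": ["New York", "California", "Florida", "New York", "Florida",
              "New York", "California", "Florida", "New York", "California"],
    "Profit": [192.0, 191.0, 190.0, 182.0, 166.0, 156.0, 155.0, 152.0, 149.0, 146.0],
})

# Features are all columns except the last; the dependent variable is the last column.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Hold out 20% of the rows for evaluation (an assumed ratio); fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

With the features in the first columns and the dependent variable in the last one, only the data source needs to change from one data set to the next.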