1 00:00:00,433 --> 00:00:00,700 All right. 2 00:00:00,700 --> 00:00:03,000 So first let's start by importing the libraries. 3 00:00:03,000 --> 00:00:04,933 That's easy. Done. 4 00:00:04,933 --> 00:00:07,033 Next step we import the data set. 5 00:00:07,033 --> 00:00:10,600 And we can now because we have the data set uploaded in our notebook. 6 00:00:11,033 --> 00:00:12,533 So make sure to have it as well. 7 00:00:12,533 --> 00:00:14,300 Okay. Now the data set is imported. 8 00:00:14,300 --> 00:00:17,300 We have the matrix of features and the dependent variable vector y. 9 00:00:17,400 --> 00:00:18,033 Good. 10 00:00:18,033 --> 00:00:21,233 And now I'm going to do a print to show you the state of x. 11 00:00:21,233 --> 00:00:24,100 You know, what is X exactly at this stage. 12 00:00:24,100 --> 00:00:25,966 So I'm going to do print x. 13 00:00:25,966 --> 00:00:28,933 And I'm going to run this cell. 14 00:00:28,933 --> 00:00:30,933 All right. And let's see what we get. 15 00:00:30,933 --> 00:00:31,233 All right. 16 00:00:31,233 --> 00:00:36,933 So indeed we get exactly the same columns as in this data set 17 00:00:36,933 --> 00:00:41,066 with first R&D spend then administration spend and marketing spend and state. 18 00:00:41,100 --> 00:00:41,400 Right. 19 00:00:41,400 --> 00:00:45,333 We can clearly see that we get the same columns here in the same order. 20 00:00:45,400 --> 00:00:47,466 All right. So that's the matrix of features. 21 00:00:47,466 --> 00:00:48,400 All good. 22 00:00:48,400 --> 00:00:50,966 Now I'm not going to show you the dependent variable vector 23 00:00:50,966 --> 00:00:52,166 because that's obvious. 24 00:00:52,166 --> 00:00:54,333 We're going to get the same profit. 25 00:00:54,333 --> 00:00:58,633 But what I want to show you is what becomes x 26 00:00:59,000 --> 00:01:01,400 after we encode the categorical data. 27 00:01:01,400 --> 00:01:04,133 You know, and you can actually guess what it will become. 
28 00:01:04,133 --> 00:01:07,033 But you will see that the three columns resulting here from one 29 00:01:07,033 --> 00:01:10,033 hot encoding will actually be placed at the beginning. 30 00:01:10,033 --> 00:01:11,500 All right. So let's check it out. 31 00:01:11,500 --> 00:01:15,433 Let's run the cell to indeed apply encoding categorical data. 32 00:01:15,433 --> 00:01:20,533 And now let's create a new code cell in which we're going to print X again. 33 00:01:20,933 --> 00:01:26,133 And let's run this cell to see what x becomes. 34 00:01:26,633 --> 00:01:27,433 And there you go. 35 00:01:27,433 --> 00:01:31,933 Exactly as I told you we have the same three first columns here. 36 00:01:31,933 --> 00:01:34,266 So that corresponds to R&D spend that corresponds 37 00:01:34,266 --> 00:01:37,733 to the administration spend and that corresponds to marketing spend. 38 00:01:38,033 --> 00:01:42,233 But now instead of having this state column here, we indeed have these 39 00:01:42,233 --> 00:01:46,500 three new columns at the beginning, encoding that state variable. 40 00:01:46,500 --> 00:01:47,700 And we can actually see 41 00:01:47,700 --> 00:01:50,800 what corresponds to what, you know, if we have a look at our data set. 42 00:01:51,000 --> 00:01:54,600 Well, the first row has New York as a state. 43 00:01:54,866 --> 00:01:59,066 And therefore New York was encoded as zero zero and one. 44 00:01:59,466 --> 00:02:02,166 Then let's see as the second state of the second row, 45 00:02:02,166 --> 00:02:04,900 you know, corresponding to the second store, we have California, 46 00:02:04,900 --> 00:02:08,866 and therefore California was encoded as one zero and zero. 47 00:02:09,133 --> 00:02:14,866 And finally, well, Florida was encoded as zero, one and zero. 48 00:02:15,000 --> 00:02:17,666 All right. So that's the one hot encoding that happens. 49 00:02:17,666 --> 00:02:18,666 And now all good. 50 00:02:18,666 --> 00:02:21,833 We have a fully pre-processed data set. 
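The encoding just described can be reproduced in a few lines. The notebook cell itself is not shown in this transcript, so what follows is only a plain NumPy sketch, not the course's code. The three rows and their spend values are hypothetical, and the names `state_col`, `categories`, `dummies` and `X_encoded` are made up for illustration. It assumes the categories are ordered alphabetically, which is consistent with the codes read out above (California 1 0 0, Florida 0 1 0, New York 0 0 1).

```python
import numpy as np

# Hypothetical sample rows: R&D spend, administration spend, marketing spend, state.
X = np.array([
    [1000.0, 200.0, 3000.0, "New York"],
    [900.0, 250.0, 2800.0, "California"],
    [800.0, 150.0, 2600.0, "Florida"],
], dtype=object)

state_col = 3  # index of the categorical column

# Assumption: categories are sorted alphabetically (California, Florida, New York).
categories = sorted(set(X[:, state_col]))

# One indicator column per category: 1.0 where the row's state matches, else 0.0.
dummies = np.array(
    [[1.0 if state == cat else 0.0 for cat in categories] for state in X[:, state_col]]
)

# Drop the original state column and keep the three numeric columns.
numeric = np.delete(X, state_col, axis=1).astype(float)

# Put the three indicator columns at the beginning, followed by the numeric columns.
X_encoded = np.hstack([dummies, numeric])
print(X_encoded)
```

Each row of `X_encoded` now starts with the three indicator columns, followed by R&D spend, administration spend and marketing spend.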
51 00:02:22,100 --> 00:02:23,466 And as I told you in part one. 52 00:02:23,466 --> 00:02:25,966 But I'm going to say it again here because this is important. 53 00:02:25,966 --> 00:02:28,933 We don't have to apply feature scaling. 54 00:02:28,933 --> 00:02:32,566 Why? Because, you know, in the equation of the multiple linear regression, 55 00:02:32,833 --> 00:02:34,233 you know, you have this coefficient 56 00:02:34,233 --> 00:02:36,633 that is multiplied to each independent variable. 57 00:02:36,633 --> 00:02:37,933 You know each feature. 58 00:02:37,933 --> 00:02:41,866 And therefore it doesn't matter that some features have higher values than others, 59 00:02:42,033 --> 00:02:45,900 because the coefficients will compensate to put everything on the same scale. 60 00:02:45,900 --> 00:02:49,433 And therefore remember this in multiple linear regression, 61 00:02:49,466 --> 00:02:52,766 there is absolutely no need to apply feature scaling. 62 00:02:53,166 --> 00:02:55,066 And one last thing I would like to add as well, 63 00:02:55,066 --> 00:02:58,200 because I know a lot of you ask this question, do 64 00:02:58,200 --> 00:03:02,633 we need to check the assumptions of linear regression? 65 00:03:03,133 --> 00:03:07,066 The answer is absolutely not, because I will explain 66 00:03:07,066 --> 00:03:10,933 at the end of this part, you know, part two regression that whenever you have 67 00:03:10,933 --> 00:03:14,666 a new data set and you want to experiment with some machine 68 00:03:14,666 --> 00:03:18,733 learning models to figure out which one leads to the highest accuracy. 69 00:03:18,966 --> 00:03:22,800 Well, even if your data set doesn't have linear relationships, 70 00:03:23,133 --> 00:03:26,400 you can still try a multiple linear regression on it. 
71 00:03:26,666 --> 00:03:29,900 And if, you know, your data set doesn't have linear relationships, well, 72 00:03:29,900 --> 00:03:33,933 your multiple linear regression model will just perform poorly, and therefore 73 00:03:33,966 --> 00:03:38,533 it will get an accuracy lower than the accuracy of your other models. 74 00:03:38,533 --> 00:03:41,733 So you will just not select the multiple linear regression model. 75 00:03:41,733 --> 00:03:45,900 But you don't have to check the multiple linear regression assumptions. 76 00:03:45,900 --> 00:03:47,766 It will just be a waste of time. 77 00:03:47,766 --> 00:03:51,566 Really, I will show you at the end how you can so fast and so efficiently 78 00:03:51,633 --> 00:03:53,366 try each of the models on your data 79 00:03:53,366 --> 00:03:57,366 set and select very quickly the one that has the highest accuracy. 80 00:03:57,566 --> 00:03:58,166 All right. 81 00:03:58,166 --> 00:04:00,000 So I just wanted to be clear on that point. 82 00:04:00,000 --> 00:04:03,100 Don't worry about the multiple linear regression assumptions. 83 00:04:03,300 --> 00:04:05,933 If your data set has linear relationships then good. 84 00:04:05,933 --> 00:04:07,500 Your multiple linear regression 85 00:04:07,500 --> 00:04:11,533 will check the assumptions indeed, and will bring to you a high accuracy. 86 00:04:11,800 --> 00:04:15,300 And if your data set doesn't have linear relationships, well, fine. 87 00:04:15,300 --> 00:04:18,033 Your multiple linear regression will just perform poorly 88 00:04:18,033 --> 00:04:20,533 and you will just take another model and that's it. 89 00:04:20,533 --> 00:04:22,566 It's as simple as that. All right. 90 00:04:22,566 --> 00:04:24,200 So now good. 91 00:04:24,200 --> 00:04:26,200 We're done with data preprocessing. 92 00:04:26,200 --> 00:04:29,900 We can therefore move on to the next step which is to train 93 00:04:29,933 --> 00:04:32,933 the multiple linear regression model on the training set. 
94 00:04:32,966 --> 00:04:34,333 So take a little break here. 95 00:04:34,333 --> 00:04:37,466 And as soon as you're ready to build your next machine learning model, 96 00:04:37,633 --> 00:04:40,666 well join me in the next tutorial to tackle this. 97 00:04:40,966 --> 00:04:42,866 And until then, enjoy machine learning.
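The point made earlier about feature scaling can be checked numerically. This is a minimal sketch, not code from the course. It uses synthetic data and a hypothetical helper `fit_predict`, and fits an ordinary least squares model (with an intercept, no regularization) once on raw features and once on standardized features. Under those assumptions the predictions agree, and each coefficient simply absorbs the scale of its feature.

```python
import numpy as np

# Synthetic data (an assumption for illustration): two features on very different scales.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([
    rng.uniform(0, 200_000, n),  # large-valued feature, e.g. a spend in dollars
    rng.uniform(0, 1, n),        # small-valued feature
])
y = 3.0 + 0.5 * X[:, 0] + 40_000 * X[:, 1] + rng.normal(0, 1_000, n)


def fit_predict(features, target):
    """Fit ordinary least squares with an intercept; return coefficients and fitted values."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, A @ coef


# Standardize each feature: subtract its mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

coef_raw, pred_raw = fit_predict(X, y)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

# The fitted values agree up to floating-point error.
print(np.allclose(pred_raw, pred_scaled))

# Each slope on the scaled data equals the raw slope times that feature's standard deviation.
print(np.allclose(coef_scaled[1:], coef_raw[1:] * X.std(axis=0)))
```

This equivalence holds for plain least squares as sketched here. Models with a penalty on the coefficients, or models that rely on distances between points, are generally sensitive to feature scale.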