1 00:00:00,600 --> 00:00:01,200 All right. 2 00:00:01,200 --> 00:00:02,466 So perfect. 3 00:00:02,466 --> 00:00:04,733 So now that everything is essentially set, 4 00:00:04,733 --> 00:00:06,900 let's tackle this data preprocessing phase. 5 00:00:06,900 --> 00:00:08,866 So we're going to do this very efficiently. 6 00:00:08,866 --> 00:00:11,400 I'm going to go to my data preprocessing template. 7 00:00:11,400 --> 00:00:15,300 And I'm going to copy-paste each of these code cells. 8 00:00:15,300 --> 00:00:16,766 You know the first ones. 9 00:00:16,766 --> 00:00:19,600 So I'm creating a new code cell here, pasting that here. 10 00:00:19,600 --> 00:00:21,633 That's for importing the libraries. 11 00:00:21,633 --> 00:00:25,600 Then we're going to take care of the second step of data preprocessing, 12 00:00:25,600 --> 00:00:29,533 which is importing the data set, creating therefore 13 00:00:29,533 --> 00:00:31,500 a new code cell here. 14 00:00:31,500 --> 00:00:33,400 And let's first take care of this, 15 00:00:33,400 --> 00:00:35,666 you know, the last one, the easy one: 16 00:00:35,666 --> 00:00:39,033 splitting the data set into the training set and the test set, 17 00:00:39,566 --> 00:00:43,866 and pasting that in a new code cell right here. 18 00:00:43,866 --> 00:00:45,400 All right. Perfect. 19 00:00:45,400 --> 00:00:48,600 And now before we encode the categorical data, let's just make sure 20 00:00:48,600 --> 00:00:52,033 to replace what's necessary here in this template. 21 00:00:52,300 --> 00:00:54,933 And once again that's the beauty of the template. 22 00:00:54,933 --> 00:00:59,366 We only need to replace one little thing, which is of course the name of the data 23 00:00:59,366 --> 00:01:00,733 set here. Right. 24 00:01:00,733 --> 00:01:05,700 And the name is of course 50 underscore capital S Startups dot CSV. 25 00:01:06,000 --> 00:01:06,566 So there we go. 26 00:01:06,566 --> 00:01:12,166 Let's do this: 50 underscore Startups. Great.
27 00:01:12,166 --> 00:01:15,133 And as a reminder we don't have to change anything here 28 00:01:15,133 --> 00:01:19,733 because this automatically selects all the columns except the last one, 29 00:01:19,733 --> 00:01:22,566 and therefore all the four features here. 30 00:01:22,566 --> 00:01:23,700 So that's perfect. 31 00:01:23,700 --> 00:01:27,566 And this line of code automatically selects 32 00:01:27,566 --> 00:01:31,800 the last column, which means the dependent variable, profit. 33 00:01:31,900 --> 00:01:32,766 Okay. 34 00:01:32,766 --> 00:01:36,066 So once again we tackled data preprocessing in a flash. 35 00:01:36,066 --> 00:01:40,100 And now we just have this one last tool to add in our data 36 00:01:40,100 --> 00:01:44,266 preprocessing phase, which is to encode that State variable here. 37 00:01:44,566 --> 00:01:47,800 So to do this we're going to get our data preprocessing tools, which you have 38 00:01:48,000 --> 00:01:50,900 in your Part 1 Data Preprocessing folder. 39 00:01:50,900 --> 00:01:53,700 And now we're going to scroll down to find 40 00:01:53,700 --> 00:01:57,300 that tool that, you know, encodes the categorical data. 41 00:01:57,600 --> 00:02:01,400 So remember we actually have two sub-tools here, if I may say: that 42 00:02:01,400 --> 00:02:05,033 first tool that applies one-hot encoding, which is exactly what we want, 43 00:02:05,300 --> 00:02:08,366 and that tool that just encodes a binary variable 44 00:02:08,366 --> 00:02:11,700 into zero and one. And of course what we need is this one. 45 00:02:11,700 --> 00:02:15,233 So I'm just going to copy-paste that piece of code. 46 00:02:15,433 --> 00:02:18,966 And then I'm going to paste that right here 47 00:02:18,966 --> 00:02:21,433 in a new code cell to encode categorical data. 48 00:02:21,433 --> 00:02:26,066 And now your turn, your turn to think and figure out what we need to do next.
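The two selection lines mentioned above (all columns except the last one, then only the last column) can be sketched as follows. This is a minimal illustration, assuming the template uses pandas with `iloc`; the transcript does not show the code itself. The tiny inline table stands in for 50_Startups.csv, and its feature column names and values are hypothetical placeholders.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for 50_Startups.csv.
# Column names other than State and Profit, and all values, are made up.
csv_text = """Spend 1,Spend 2,Spend 3,State,Profit
100.0,50.0,200.0,State B,300.0
90.0,60.0,180.0,State A,280.0
80.0,40.0,150.0,State C,250.0
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Every row, every column except the last one: the four features.
X = dataset.iloc[:, :-1].values

# Every row, only the last column: the dependent variable (profit).
y = dataset.iloc[:, -1].values

print(X.shape)
print(y.shape)
```

Because the slices are written relative to the last column, nothing in them depends on how many feature columns the file has, which is why only the file name needs to change in the template.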
49 00:02:26,366 --> 00:02:29,700 Please press pause on this video and figure out what you have 50 00:02:29,700 --> 00:02:35,333 to change here in order to indeed apply one-hot encoding on our data set. 51 00:02:35,366 --> 00:02:36,300 I'll give you a hint. 52 00:02:36,300 --> 00:02:39,866 You only have one little thing to change and then you'll be good to go. 53 00:02:39,866 --> 00:02:41,500 So please press pause. 54 00:02:41,500 --> 00:02:42,000 Okay. 55 00:02:42,000 --> 00:02:44,366 And now I'm going to give you the solution. 56 00:02:44,366 --> 00:02:48,466 So the only thing that you had to change here is that index here. 57 00:02:48,466 --> 00:02:51,933 Remember this corresponds to the index of the column 58 00:02:51,933 --> 00:02:54,833 you want to apply one-hot encoding to. 59 00:02:54,833 --> 00:02:57,833 And in our previous data set, you know, Data dot CSV, 60 00:02:57,966 --> 00:03:01,233 well, remember the categorical variable was the first column. 61 00:03:01,233 --> 00:03:03,333 That's why we put index zero here. 62 00:03:03,333 --> 00:03:08,666 But for our new data set the categorical variable is actually the fourth column. 63 00:03:08,966 --> 00:03:10,166 But be careful. 64 00:03:10,166 --> 00:03:12,700 Remember that indexes in Python start from zero. 65 00:03:12,700 --> 00:03:15,433 Therefore this column has index zero. This one has index one. 66 00:03:15,433 --> 00:03:18,266 This one has index two and this one has index three. 67 00:03:18,266 --> 00:03:21,300 And therefore the index you need 68 00:03:21,300 --> 00:03:24,300 here is of course three. Right. 69 00:03:24,300 --> 00:03:29,466 So this will apply one-hot encoding to the column of index three in your data set, 70 00:03:29,633 --> 00:03:33,100 therefore exactly this State column. Perfect. 71 00:03:33,100 --> 00:03:36,100 So we're done with the data preprocessing phase. 72 00:03:36,100 --> 00:03:39,900 So now we are going to observe the results of what we just built.
73 00:03:39,900 --> 00:03:42,066 You know, just in terms of data preprocessing. 74 00:03:42,066 --> 00:03:44,166 And therefore we're going to do several things here. 75 00:03:44,166 --> 00:03:48,766 First we're going to upload the data set into our notebook. 76 00:03:48,766 --> 00:03:49,133 Right. 77 00:03:49,133 --> 00:03:52,466 And to do this we click this little folder here and then Upload. 78 00:03:53,566 --> 00:03:54,033 All right. 79 00:03:54,033 --> 00:03:57,533 So as usual my machine learning dataset folder is on my desktop. 80 00:03:57,700 --> 00:04:00,500 So we're going to go inside. Make sure to find it on your machine. 81 00:04:00,500 --> 00:04:02,400 Then we're going to go to Part 2 Regression. 82 00:04:02,400 --> 00:04:05,800 Then Section 5 Multiple Linear Regression, in Python. 83 00:04:06,000 --> 00:04:06,833 And there we go. 84 00:04:06,833 --> 00:04:11,300 That's the data set which we need to open and upload into our notebook. 85 00:04:11,733 --> 00:04:13,200 All right, it is uploaded. 86 00:04:13,200 --> 00:04:16,633 And now what we're going to do is we're going to run each of these cells 87 00:04:16,633 --> 00:04:17,600 that we just made. 88 00:04:17,600 --> 00:04:19,800 But I'm going to add a few prints, 89 00:04:19,800 --> 00:04:21,866 you know, so that you can really see what we did. 90 00:04:21,866 --> 00:04:25,166 You know, how the matrix of features and the dependent variable vector 91 00:04:25,166 --> 00:04:28,200 are created and modified along the data preprocessing phase.
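The encoding step discussed earlier (changing the column index from 0 to 3) might look roughly like the sketch below, followed by a print to inspect the result. It assumes the one-hot encoding tool is built on scikit-learn's `ColumnTransformer` and `OneHotEncoder`, which the transcript does not confirm. The small matrix of features is a hypothetical stand-in for the real data, with made-up values and state names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical matrix of features: three numeric columns, then the
# categorical State column at index 3 (indexes start from zero).
X = np.array(
    [
        [100.0, 50.0, 200.0, "State B"],
        [90.0, 60.0, 180.0, "State A"],
        [80.0, 40.0, 150.0, "State C"],
    ],
    dtype=object,
)

# One-hot encode only the column of index 3; pass the others through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X = np.array(ct.fit_transform(X))

# The encoded dummy columns come first, followed by the untouched numeric columns.
print(X)
```

With three distinct states, the single State column is replaced by three dummy columns, so the matrix grows from four columns to six.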