1 00:00:00,466 --> 00:00:00,900 All right. 2 00:00:00,900 --> 00:00:04,000 And now the last step that you had to do was to figure out what 3 00:00:04,000 --> 00:00:07,000 to replace here in this feature scaling implementation. 4 00:00:07,200 --> 00:00:08,833 Well, here that's super easy. 5 00:00:08,833 --> 00:00:12,733 We simply want to, you know, feature scale all the features. 6 00:00:12,733 --> 00:00:14,033 We want to scale all the features. 7 00:00:14,033 --> 00:00:15,800 We want to scale age and salary. 8 00:00:15,800 --> 00:00:19,433 And of course we don't have to scale the dependent variable purchased 9 00:00:19,566 --> 00:00:23,666 because its values are zero and one and therefore are already 10 00:00:23,666 --> 00:00:26,400 in the range of values we want. So all good with this. 11 00:00:26,400 --> 00:00:29,900 So basically we're only going to scale these two features, age 12 00:00:29,900 --> 00:00:31,433 and estimated salary. 13 00:00:31,433 --> 00:00:34,933 And therefore here what we just had to do, you know, as replacements 14 00:00:35,200 --> 00:00:38,833 was just to remove that selection of the indexes here. 15 00:00:38,866 --> 00:00:41,866 Here we selected all indexes starting from three. 16 00:00:42,066 --> 00:00:44,266 But this time we don't have to do anything. 17 00:00:44,266 --> 00:00:47,700 We can just scale the whole matrix of features. 18 00:00:47,700 --> 00:00:50,833 So I'm just removing here the index selections. 19 00:00:51,233 --> 00:00:52,033 And there you go. 20 00:00:52,033 --> 00:00:56,166 We'll be ready to feature scale both our training set and our test 21 00:00:56,166 --> 00:01:00,300 set. And I remind you that it is absolutely compulsory to do this 22 00:01:00,566 --> 00:01:03,566 after splitting the data set into the training set and test set 23 00:01:03,600 --> 00:01:07,400 in order to avoid information leakage from the test set. 24 00:01:08,200 --> 00:01:08,700 All right.
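To make the leakage point concrete, here is a minimal from-scratch sketch of standardization using only the Python standard library. It is not the course's notebook code: the helper names `fit_scaler` and `transform` and the sample rows are hypothetical, and it assumes the scaling in question is standardization (subtract the mean, divide by the standard deviation). The key idea is that the mean and standard deviation come from the training set only and are then reused, unchanged, on the test set.

```python
from statistics import fmean, pstdev

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from the given rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize every column with previously computed statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical (age, estimated salary) rows, made up for illustration.
X_train = [[25, 30000], [35, 60000], [45, 90000], [55, 120000]]
X_test = [[30, 45000], [50, 150000]]

# Statistics are learned from the training set only...
means, stds = fit_scaler(X_train)

# ...and applied to both sets, so nothing from the test set leaks into training.
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)

print(X_train_scaled)
print(X_test_scaled)
```

Because the whole matrix of features is scaled here, no column selection is needed, which matches the removal of the index selection described above.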
25 00:01:08,700 --> 00:01:09,900 So there you go, my friends. 26 00:01:09,900 --> 00:01:12,966 That was what you needed to do for data preprocessing. 27 00:01:13,000 --> 00:01:14,066 Congratulations. 28 00:01:14,066 --> 00:01:17,566 If you've got the same thing, don't worry about that test size. 29 00:01:17,566 --> 00:01:20,200 This is just for the form. But there you go. 30 00:01:20,200 --> 00:01:22,600 This was simply what you had to do. 31 00:01:22,600 --> 00:01:22,966 All right. 32 00:01:22,966 --> 00:01:25,500 So now we're going to do a few prints 33 00:01:25,500 --> 00:01:29,300 to actually see the before and after feature scaling. 34 00:01:29,333 --> 00:01:32,466 So what I'm going to do is, right after this 35 00:01:32,466 --> 00:01:35,666 code cell splitting the data set into the training set and test set, 36 00:01:36,000 --> 00:01:40,233 I'm going to make four prints just to show you. If you don't need to 37 00:01:40,233 --> 00:01:43,233 look at that, well, feel free not to include these new code cells. 38 00:01:43,400 --> 00:01:47,100 But what I want to do is print first X train. 39 00:01:47,600 --> 00:01:51,166 Then I would like to print y train. 40 00:01:51,166 --> 00:01:53,100 So here y train. 41 00:01:53,100 --> 00:01:57,266 Then next one I would like to print X test. 42 00:01:57,700 --> 00:01:59,933 And finally I would like to print 43 00:02:01,200 --> 00:02:04,200 well, y test. Okay. 44 00:02:04,266 --> 00:02:06,500 So that's it. And after feature scaling, 45 00:02:06,500 --> 00:02:10,366 however, we will just, you know, print two cells 46 00:02:10,366 --> 00:02:13,366 because we actually don't apply feature scaling to y. 47 00:02:13,500 --> 00:02:15,500 And therefore we'll just print first, 48 00:02:15,500 --> 00:02:17,433 well, X train again. 49 00:02:17,433 --> 00:02:20,633 And second, X test. Right. 50 00:02:20,633 --> 00:02:24,400 Because these will be the only sets of data that will be changed. 51 00:02:24,566 --> 00:02:25,666 All right. Perfect. 
52 00:02:25,666 --> 00:02:28,566 So now let's execute everything. We have the data set. 53 00:02:28,566 --> 00:02:29,166 All good. 54 00:02:29,166 --> 00:02:32,400 So let's do this, starting by importing the libraries. 55 00:02:32,633 --> 00:02:33,366 Good. 56 00:02:33,366 --> 00:02:36,666 Now importing the data set. Great. 57 00:02:37,033 --> 00:02:40,533 Now splitting the data set into the training set and test set. 58 00:02:40,733 --> 00:02:42,033 There we go. 59 00:02:42,033 --> 00:02:42,633 All right. 60 00:02:42,633 --> 00:02:45,533 Now let's print X train and see what it looks like. 61 00:02:45,533 --> 00:02:47,466 All right. So let's scroll down a bit. 62 00:02:47,466 --> 00:02:47,900 All right. 63 00:02:47,900 --> 00:02:51,766 That's X train, with first the column age, the age feature. 64 00:02:52,000 --> 00:02:56,100 And second column, the estimated salary, the estimated salary feature. 65 00:02:56,300 --> 00:02:56,733 All right. 66 00:02:56,733 --> 00:03:00,800 And of course we have 300 observations in this training set. 67 00:03:00,800 --> 00:03:02,166 You don't have to count them. 68 00:03:02,166 --> 00:03:04,566 But there you go. We have many of them. 69 00:03:04,566 --> 00:03:05,133 All right. 70 00:03:05,133 --> 00:03:10,433 So now let's print y train. It's just to have a look at what we created. 71 00:03:10,433 --> 00:03:12,433 You know, this is not compulsory. 72 00:03:12,433 --> 00:03:17,100 So that's y train with all the purchase decisions on the previous SUVs. 73 00:03:17,100 --> 00:03:20,366 Zero means that the customer did not buy any SUV. 74 00:03:20,533 --> 00:03:25,966 And one means yes, the customer bought a previous SUV. And now X test. 75 00:03:26,533 --> 00:03:29,700 Scroll down a bit. Right, X test, so same. 76 00:03:29,700 --> 00:03:33,566 It contains 100 observations, corresponding to 100 customers, 77 00:03:33,866 --> 00:03:36,866 and for each of them their age and the estimated salary. 
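The 300 and 100 observation counts can be reproduced with a small standard-library sketch of a shuffled split. The total of 400 rows and the test fraction of 0.25 are assumptions inferred from 300 + 100; the transcript does not state them, and `split_indices` is a hypothetical helper, not the function used in the notebook.

```python
import random

def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for a repeatable split
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Assumed: 400 customers in total, a quarter of them held out for testing.
train_idx, test_idx = split_indices(400, 0.25)
print(len(train_idx), len(test_idx))  # 300 100
```

Every row index lands in exactly one of the two lists, so no customer appears in both the training set and the test set.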
78 00:03:37,200 --> 00:03:40,033 And you know, since X test is actually supposed to be 79 00:03:40,033 --> 00:03:44,100 some new data in production, well, we're actually going to suppose that 80 00:03:44,100 --> 00:03:49,533 X test is actually the set of customers who purchased, yes or no, the new SUV. 81 00:03:49,533 --> 00:03:52,666 You know, we're going to pretend that X test is actually some data 82 00:03:52,800 --> 00:03:56,700 when we deploy our model in production, so that we can evaluate it 83 00:03:56,966 --> 00:04:00,300 on the new observations, meaning on the new customers buying, 84 00:04:00,300 --> 00:04:02,400 yes or no, that new SUV. 85 00:04:02,400 --> 00:04:02,866 All right. 86 00:04:02,866 --> 00:04:05,066 It's more fun if we imagine X test this way 87 00:04:05,066 --> 00:04:09,666 because it is indeed supposed to be some new observations. And now y test. 88 00:04:09,933 --> 00:04:11,133 Let's see. 89 00:04:11,133 --> 00:04:11,533 All right. 90 00:04:11,533 --> 00:04:12,966 And y test contains of course 91 00:04:12,966 --> 00:04:16,666 all the purchase decisions of the customers in this test set, 92 00:04:16,900 --> 00:04:20,233 meaning whether or not they bought the new SUV. 93 00:04:20,466 --> 00:04:21,233 All right. 94 00:04:21,233 --> 00:04:24,100 Perfect. So now let's apply feature scaling. 95 00:04:24,100 --> 00:04:28,066 And let's see how X train and X test are transformed. 96 00:04:28,433 --> 00:04:28,800 All right. 97 00:04:28,800 --> 00:04:31,100 So let's do this. Let's play. 98 00:04:31,100 --> 00:04:33,300 And now let's print X train. 99 00:04:33,300 --> 00:04:33,700 All right. 100 00:04:33,700 --> 00:04:37,666 So now we have some scaled values between, well, you know, 101 00:04:37,766 --> 00:04:41,400 minus two and plus three. 102 00:04:41,400 --> 00:04:45,366 You know, this one is 2.06, -1.7. 103 00:04:45,600 --> 00:04:49,466 Anyway, it should be somewhere between minus three and plus three, okay. 
104 00:04:49,466 --> 00:04:54,400 But now we can clearly see that we have both features in the same range. 105 00:04:54,400 --> 00:04:57,900 The transformed age and the transformed estimated salary 106 00:04:58,233 --> 00:05:00,300 are now indeed in the same range. 107 00:05:00,300 --> 00:05:04,300 And that's exactly what we're supposed to get with feature scaling. 108 00:05:04,300 --> 00:05:05,066 All right. 109 00:05:05,066 --> 00:05:08,400 So now let's scroll down to print X test. 110 00:05:09,133 --> 00:05:14,766 And here, same, we get the two features age and salary taking values in the same range 111 00:05:14,766 --> 00:05:17,300 between somewhere around minus three and plus three. 112 00:05:17,300 --> 00:05:21,633 So this will improve the training performance of the logistic regression 113 00:05:21,633 --> 00:05:22,100 model. 114 00:05:22,100 --> 00:05:25,166 Well, you know, for the training set of course, only for the training set. 115 00:05:25,166 --> 00:05:28,733 But then when we deploy our model to predict 116 00:05:28,900 --> 00:05:30,666 whether the customers of the test set 117 00:05:30,666 --> 00:05:33,833 bought, yes or no, the new SUV, well, 118 00:05:33,833 --> 00:05:37,200 we will indeed have to apply the predict method on these scaled values. 119 00:05:37,500 --> 00:05:39,966 Otherwise predictions will be nonsense, right? 120 00:05:39,966 --> 00:05:43,166 The predict method has to be called on a set of features 121 00:05:43,166 --> 00:05:47,100 with the same scale as the one that was applied during the training. 122 00:05:47,333 --> 00:05:49,000 Okay, perfect. 123 00:05:49,000 --> 00:05:52,233 So now we can move on to the next exciting step 124 00:05:52,366 --> 00:05:56,866 where we build and train our logistic regression model on the training set.
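The warning about calling predict on unscaled values can be illustrated with a tiny hand-written logistic model in plain Python. Everything numeric here is hypothetical: the weights, the bias, and the training statistics are made up for illustration and do not come from a trained model or from the course data. The sketch only shows why a model whose weights were learned on scaled features gives a meaningless answer when it is fed raw ages and salaries.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of purchase under a linear logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights, as if learned on standardized (age, salary) features.
weights, bias = [0.8, 0.5], -0.3

# Hypothetical training-set statistics used for scaling.
train_means, train_stds = [40.0, 75000.0], [10.0, 30000.0]

customer = [30, 45000]  # raw age and estimated salary
scaled = [(x - m) / s for x, m, s in zip(customer, train_means, train_stds)]

p_scaled = predict_proba(scaled, weights, bias)   # input on the training scale
p_raw = predict_proba(customer, weights, bias)    # raw input: saturates near 1

print(round(p_scaled, 3), p_raw)
```

With the scaled input the model returns a moderate probability below 0.5, while the raw salary alone pushes the linear term so high that the probability saturates at essentially 1 regardless of the customer.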