1 00:00:00,100 --> 00:00:00,900 Okay, my friends. 2 00:00:00,900 --> 00:00:04,833 Are you ready for the final tool of our data preprocessing toolkit, 3 00:00:05,100 --> 00:00:10,066 feature scaling, which will allow us to put all our features on the same scale? 4 00:00:10,066 --> 00:00:11,000 So that's the what. 5 00:00:11,000 --> 00:00:13,266 And let me quickly remind you of the why. 6 00:00:13,266 --> 00:00:14,733 Why do we need to do this? 7 00:00:14,733 --> 00:00:18,166 Well, that's because, for some machine learning models, 8 00:00:18,500 --> 00:00:23,333 we want to prevent some features from being dominated by other features 9 00:00:23,333 --> 00:00:27,800 in such a way that the dominated features are not even considered by the machine learning 10 00:00:28,000 --> 00:00:28,800 model. 11 00:00:28,800 --> 00:00:32,400 Now, you also need to be aware that we will not have to apply feature 12 00:00:32,400 --> 00:00:35,966 scaling for all the machine learning models, just for some of them. 13 00:00:36,100 --> 00:00:39,733 Therefore, we won't include this in our data preprocessing template, 14 00:00:39,733 --> 00:00:42,600 which I will show you, by the way, at the end of this tutorial. 15 00:00:42,600 --> 00:00:45,466 So we will just add this tool to the toolkit, 16 00:00:45,466 --> 00:00:49,233 because indeed for a lot of machine learning models, we won't even have to apply 17 00:00:49,233 --> 00:00:53,300 feature scaling, even if we have features taking very different values. 18 00:00:53,633 --> 00:00:57,500 For example, if you already know a bit about the multiple linear 19 00:00:57,500 --> 00:01:01,800 regression model, you know that each variable is actually multiplied 20 00:01:01,800 --> 00:01:05,400 by a coefficient, you know, in the linear regression equation. 21 00:01:05,800 --> 00:01:06,733 And so, well, you know, 22 00:01:06,733 --> 00:01:09,900 if you have variables that take much higher values than others, 
23 00:01:10,066 --> 00:01:14,300 Well, when learning the coefficients, the coefficients will just compensate 24 00:01:14,400 --> 00:01:18,000 by taking small values for the variables that take high values. 25 00:01:18,000 --> 00:01:18,333 Right. 26 00:01:18,333 --> 00:01:21,333 We will explain this more in Part 2, Regression. 27 00:01:21,500 --> 00:01:25,200 But for now, just know that this is a tool that will be applied 28 00:01:25,200 --> 00:01:29,000 from time to time for certain machine learning models, but not all the time, 29 00:01:29,000 --> 00:01:31,600 as you will see in this course. All right. 30 00:01:31,600 --> 00:01:35,366 And also, in the previous tutorial, I actually told you that 31 00:01:35,366 --> 00:01:38,866 I was about to answer one of the most important questions, 32 00:01:38,866 --> 00:01:42,466 or most frequently asked questions, in the data science community. 33 00:01:42,800 --> 00:01:44,700 And I will, of course, keep my promise. 34 00:01:44,700 --> 00:01:47,000 I will answer this when it's time, 35 00:01:47,000 --> 00:01:49,766 meaning at around the middle of this tutorial. 36 00:01:49,766 --> 00:01:52,933 But no worries, all questions about feature scaling 37 00:01:52,966 --> 00:01:56,400 will be answered so that you have absolutely no confusion. 38 00:01:57,000 --> 00:01:58,866 All right, so we have the what and the why. 39 00:01:58,866 --> 00:02:00,600 And now let's proceed to the how, 40 00:02:00,600 --> 00:02:03,900 meaning, how are we going to apply feature scaling? 41 00:02:04,200 --> 00:02:08,233 And to answer this question, I'm going to show you the following slide, 42 00:02:08,433 --> 00:02:12,300 which shows the two main feature scaling techniques 43 00:02:12,300 --> 00:02:16,333 that indeed put all your features on the same scale. 
44 00:02:16,800 --> 00:02:20,100 And these two techniques are, first, standardization, 45 00:02:20,366 --> 00:02:25,333 which consists of subtracting from each value of your feature 46 00:02:25,533 --> 00:02:29,333 the mean of all the values of the feature, and then dividing 47 00:02:29,333 --> 00:02:32,833 by the standard deviation, which is the square root of the variance. 48 00:02:32,833 --> 00:02:36,400 And this will put all the values of the feature 49 00:02:36,600 --> 00:02:39,200 between around minus three and plus three, right? 50 00:02:39,200 --> 00:02:40,400 All the different features. 51 00:02:40,400 --> 00:02:45,200 When you apply this transformation on all the features of your data set, 52 00:02:45,333 --> 00:02:49,200 well, all your features will take values between around minus three and plus three. 53 00:02:49,266 --> 00:02:50,966 So that's standardization. 54 00:02:50,966 --> 00:02:56,300 And then you have normalization, which consists of subtracting from each value 55 00:02:56,300 --> 00:02:59,300 of your feature the minimum value of the feature, 56 00:02:59,466 --> 00:03:03,133 and then dividing by the difference between the maximum value of the feature 57 00:03:03,133 --> 00:03:04,966 and the minimum value of the feature. 58 00:03:04,966 --> 00:03:08,633 And so, since the numerator is positive, the denominator is positive, 59 00:03:08,633 --> 00:03:11,333 and the denominator is always at least as large as the numerator, 60 00:03:11,333 --> 00:03:14,433 well, that means that all the values of your features 61 00:03:14,633 --> 00:03:17,466 will end up between 0 and 1. 62 00:03:17,466 --> 00:03:17,900 All right. 63 00:03:17,900 --> 00:03:19,633 So standardization will result in having 64 00:03:19,633 --> 00:03:22,633 values of features between minus three and plus three, more or less. 65 00:03:22,733 --> 00:03:27,300 And normalization will result in having all the values of your features between 0 and 1. 66 00:03:27,700 --> 00:03:31,466 Now, there is a question that is also asked a lot by the data science community. 
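The two formulas described on the slide can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the function names `standardize` and `normalize`, the made-up `ages` values, and the choice of the population standard deviation are not from the tutorial.

```python
import statistics


def standardize(values):
    """Subtract the mean from each value, then divide by the standard deviation."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def normalize(values):
    """Subtract the minimum from each value, then divide by (max - min)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up values for a single feature, for illustration only.
ages = [20.0, 30.0, 40.0, 50.0, 60.0]

standardized_ages = standardize(ages)  # centered on 0, standard deviation 1
normalized_ages = normalize(ages)      # smallest value maps to 0, largest to 1
```

Both functions assume the feature is not constant; otherwise the division would be by zero.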
67 00:03:31,600 --> 00:03:35,100 Should we go for standardization or normalization? 68 00:03:35,600 --> 00:03:38,600 Well, we're going to be very pragmatic here. 69 00:03:38,766 --> 00:03:41,400 Normalization is recommended 70 00:03:41,400 --> 00:03:44,800 when you have a normal distribution in most of your features. 71 00:03:45,000 --> 00:03:47,333 This will be a great feature scaling technique 72 00:03:47,333 --> 00:03:51,700 in that case. Standardization actually works well all the time. 73 00:03:51,700 --> 00:03:53,900 It will do the job all the time. 74 00:03:53,900 --> 00:03:57,866 Therefore, since standardization is a technique that will work all the time, 75 00:03:57,866 --> 00:04:02,433 and normalization is a technique that is more recommended for some specific situations 76 00:04:02,433 --> 00:04:05,700 where you have most of your features following a normal distribution, 77 00:04:06,000 --> 00:04:09,033 then my ultimate recommendation, for sure, 78 00:04:09,033 --> 00:04:13,200 is to go for standardization, because indeed this will always work. 79 00:04:13,200 --> 00:04:15,900 You will always do some relevant feature scaling, 80 00:04:15,900 --> 00:04:18,900 and this will always improve the training process. 81 00:04:18,966 --> 00:04:21,333 So I'm going to teach you this technique. 82 00:04:21,333 --> 00:04:25,066 I'm going to teach you how to apply it on our matrices of features. 83 00:04:25,066 --> 00:04:27,000 And I'm saying matrices of features because now 84 00:04:27,000 --> 00:04:30,700 we have two matrices of features, which are X train and X test. 
85 00:04:30,933 --> 00:04:33,666 And since we understood in the previous tutorial 86 00:04:33,666 --> 00:04:37,600 that feature scaling must be applied after the split, well, you understand 87 00:04:37,600 --> 00:04:41,433 that we won't apply feature scaling on the whole matrix of features X, 88 00:04:41,666 --> 00:04:45,666 but of course on both X train and X test separately, 89 00:04:45,900 --> 00:04:50,100 and actually the scaler will be fitted only to X train. 90 00:04:50,433 --> 00:04:52,533 And then we'll transform X test. 91 00:04:52,533 --> 00:04:55,166 You know, we'll apply feature scaling on X test, 92 00:04:55,166 --> 00:04:58,666 because indeed, since X test is something that we're not supposed to have 93 00:04:58,666 --> 00:05:01,800 during training, but only after, like when going into production, 94 00:05:01,933 --> 00:05:07,266 well, we're not allowed to fit our feature scaling tool on the test set. Right, 95 00:05:07,266 --> 00:05:09,533 by fitting the feature scaling tool on the test set, 96 00:05:09,533 --> 00:05:10,600 that means that we're going 97 00:05:10,600 --> 00:05:14,500 to get the mean of the test set, and then the standard deviation of the feature. 98 00:05:14,766 --> 00:05:17,233 No, we don't have the right to do this, because X 99 00:05:17,233 --> 00:05:18,966 test is supposed to be something new. 100 00:05:18,966 --> 00:05:22,533 And therefore we'll just get the mean of the values in X train, 101 00:05:22,533 --> 00:05:25,133 then get the standard deviation of the values in X train, 102 00:05:25,133 --> 00:05:28,133 then apply this formula to transform all the values in X train, 103 00:05:28,333 --> 00:05:32,100 and then apply that same formula, but with the same mean and standard deviation 104 00:05:32,333 --> 00:05:36,033 of the values in X train, to scale the values of X 105 00:05:36,033 --> 00:05:38,633 test. It's really, really important that you understand this. 
106 00:05:38,633 --> 00:05:42,366 And this is, once again, to give some further elements of an answer 107 00:05:42,600 --> 00:05:44,466 to that previous question: 108 00:05:44,466 --> 00:05:46,966 Should we scale before or after the split? 109 00:05:46,966 --> 00:05:47,733 All right. 110 00:05:47,733 --> 00:05:52,433 So now that we are all clear on this, let's proceed to the implementation 111 00:05:52,433 --> 00:05:56,400 of the how, meaning the implementation of feature scaling.
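Before the implementation itself, here is a minimal sketch of the "fit on X train only, then transform both sets" rule in plain Python. It is not the code used in the tutorial: the helpers `fit_scaler` and `transform`, the single-feature lists `x_train` and `x_test`, and their made-up values are all assumptions for illustration.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training values only."""
    mean = statistics.mean(train_values)
    # Assumption: population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(train_values)
    return mean, std


def transform(values, mean, std):
    """Apply the standardization formula with an already-fitted mean and std."""
    return [(v - mean) / std for v in values]


# Made-up values of one feature after the train/test split.
x_train = [30000.0, 40000.0, 50000.0, 60000.0, 70000.0]
x_test = [45000.0, 80000.0]

# The scaler only ever sees the training values.
mean, std = fit_scaler(x_train)

x_train_scaled = transform(x_train, mean, std)
# The test values are scaled with the training mean and std, never their own.
x_test_scaled = transform(x_test, mean, std)
```

Note that `x_test_scaled` is not centered on 0, and that is expected: the test values are treated as new observations and measured against the statistics of the training set.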