1 00:00:00,400 --> 00:00:00,900 All right. 2 00:00:00,900 --> 00:00:04,300 So, as usual, you know, in order to be as efficient 3 00:00:04,300 --> 00:00:07,600 as we can, we're going to do this with scikit-learn, right? 4 00:00:07,600 --> 00:00:10,166 This data science library that has all the tools. 5 00:00:10,166 --> 00:00:15,133 And the tool that we're about to use is a class called StandardScaler, 6 00:00:15,133 --> 00:00:19,066 which will perform exactly this standardization on 7 00:00:19,200 --> 00:00:21,566 both your matrix of features of the training set 8 00:00:21,566 --> 00:00:24,033 and the matrix of features of the test set. 9 00:00:24,033 --> 00:00:24,933 So let's do this. 10 00:00:24,933 --> 00:00:28,800 Let's start by importing this class, which we have to take first from, 11 00:00:29,200 --> 00:00:32,700 well, scikit-learn, of course: sklearn. 12 00:00:32,966 --> 00:00:37,566 And then from there we're going to get access to the preprocessing 13 00:00:37,566 --> 00:00:42,200 module. Perfect. It's the module that contains the StandardScaler class. 14 00:00:42,200 --> 00:00:43,000 All right. 15 00:00:43,000 --> 00:00:48,766 So then we're ready to import what we want, which is the StandardScaler class. 16 00:00:48,900 --> 00:00:51,333 Perfect. So now we have the class. 17 00:00:51,333 --> 00:00:54,666 And then the natural next step here is, of course, to create 18 00:00:54,666 --> 00:00:57,666 an object of the class. I'm about to reveal, soon, 19 00:00:58,000 --> 00:01:01,500 the answer to one of the most frequent questions in the data science community. 20 00:01:02,100 --> 00:01:03,633 So let's create this object. 21 00:01:03,633 --> 00:01:06,633 We're going to call it sc, for StandardScaler. 22 00:01:06,666 --> 00:01:11,200 And then, well, this object will be created as an instance of the 23 00:01:11,200 --> 00:01:11,900 StandardScaler class. 24 00:01:11,900 --> 00:01:15,433 So I'm taking it here, pasting it right here, and adding some parentheses.
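The import and the object creation described so far can be sketched as follows. This is a minimal sketch that assumes scikit-learn is installed; `sc` is the name chosen in the video, and the comments are added here for illustration.

```python
# Import the StandardScaler class from scikit-learn's preprocessing module.
from sklearn.preprocessing import StandardScaler

# Create an instance of the class. No arguments are passed,
# so the scaler uses its default settings.
sc = StandardScaler()
```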
25 00:01:16,033 --> 00:01:17,366 And good news here. 26 00:01:17,366 --> 00:01:21,900 We don't have any arguments to input, because what we simply want to do 27 00:01:21,900 --> 00:01:24,833 is get that mean, get that standard deviation, 28 00:01:24,833 --> 00:01:27,866 and then apply this formula to all the values in the feature. 29 00:01:27,866 --> 00:01:30,166 And for this we don't actually need any parameters. 30 00:01:30,166 --> 00:01:32,733 This will automatically do the job. 31 00:01:32,733 --> 00:01:35,333 All right. So then, next step. 32 00:01:35,333 --> 00:01:38,300 And now, well, now is the time for me to reveal 33 00:01:38,300 --> 00:01:43,233 the answer to that question, which is one of the most frequently asked 34 00:01:43,500 --> 00:01:46,433 questions in the data science community. 35 00:01:46,433 --> 00:01:50,566 And that question is: do we have to apply 36 00:01:51,033 --> 00:01:54,033 feature scaling, you know, standardization, 37 00:01:54,200 --> 00:01:58,500 to the dummy variables in the matrix of features? 38 00:01:58,800 --> 00:02:01,000 This is one of the most frequently asked questions. 39 00:02:01,000 --> 00:02:02,800 You will find it everywhere online as well. 40 00:02:02,800 --> 00:02:04,666 And once again, actually, 41 00:02:04,666 --> 00:02:08,466 the answer is pretty obvious, but only after you get the explanation. 42 00:02:08,833 --> 00:02:10,266 So let me tell you the answer. 43 00:02:10,266 --> 00:02:12,466 The answer is no. 44 00:02:12,466 --> 00:02:15,000 The answer is no simply because, 45 00:02:15,000 --> 00:02:19,166 well, remember the goal of standardization, or feature scaling in general: 46 00:02:19,500 --> 00:02:23,566 it is to have all the values of the features in the same range. 47 00:02:23,800 --> 00:02:29,000 And I told you that standardization actually transforms your features 48 00:02:29,000 --> 00:02:32,933 so that they take values between more or less minus three and plus three.
49 00:02:33,200 --> 00:02:37,533 Well, since here our dummy variables already take values 50 00:02:37,533 --> 00:02:41,266 between minus three and plus three, because they're equal to either 1 or 0, 51 00:02:41,466 --> 00:02:46,133 well, there is nothing extra to be done here with standardization. 52 00:02:46,333 --> 00:02:50,566 And actually, standardization will only make it worse, because indeed 53 00:02:50,566 --> 00:02:54,333 it will still transform these values into values between minus three and plus three, 54 00:02:54,333 --> 00:02:58,066 but then you will totally lose the interpretation of these variables. 55 00:02:58,066 --> 00:02:58,900 In other words, 56 00:02:58,900 --> 00:03:03,533 you will lose the information of which country corresponds to the observation. 57 00:03:03,766 --> 00:03:06,800 Right now we know perfectly well that, you know, remember, one, 58 00:03:06,800 --> 00:03:10,200 zero, and zero corresponds to France, because that's how it was encoded. 59 00:03:10,200 --> 00:03:13,066 And then zero, zero, and one corresponds to Spain. 60 00:03:13,066 --> 00:03:13,866 But, you know, after 61 00:03:13,866 --> 00:03:17,800 we apply feature scaling, if we apply it on the dummy variables, 62 00:03:18,000 --> 00:03:22,200 we will get nonsense numerical values, and we will be absolutely incapable 63 00:03:22,366 --> 00:03:26,633 of saying which tuple of three values here corresponds to which country. 64 00:03:26,733 --> 00:03:28,800 So we will totally lose interpretation. 65 00:03:28,800 --> 00:03:32,400 And besides, this won't improve the training performance at all, 66 00:03:32,466 --> 00:03:35,933 because indeed our dummy variables are anyway already 67 00:03:36,133 --> 00:03:39,300 in the same range as your other variables. 68 00:03:39,566 --> 00:03:43,133 You will see online that applying standardization to your dummy 69 00:03:43,133 --> 00:03:46,200 variables might still slightly increase the performance.
70 00:03:46,200 --> 00:03:48,200 You know, the final accuracy of your model. 71 00:03:48,200 --> 00:03:51,166 But I've experimented many times and I've never seen, 72 00:03:51,166 --> 00:03:54,900 you know, a considerable difference that would justify applying feature 73 00:03:54,900 --> 00:03:56,400 scaling on the dummy variables. 74 00:03:56,400 --> 00:03:58,166 So really, don't do this. 75 00:03:58,166 --> 00:04:01,333 Only apply feature scaling to your numerical values. 76 00:04:01,333 --> 00:04:06,400 Right here we clearly have some variables taking values in very different ranges. 77 00:04:06,500 --> 00:04:06,866 Right. 78 00:04:06,866 --> 00:04:12,000 The age goes between 0 and 100 and the salary goes between 0 and 100,000. 79 00:04:12,166 --> 00:04:14,600 So clearly here, since it's better for the machine 80 00:04:14,600 --> 00:04:16,933 learning model, we have to apply feature scaling. 81 00:04:16,933 --> 00:04:20,133 But let's leave these dummy variables alone 82 00:04:20,300 --> 00:04:23,433 so that we can keep the interpretability of the model. 83 00:04:23,666 --> 00:04:24,333 All right. 84 00:04:24,333 --> 00:04:26,733 So that was the other very important question. 85 00:04:26,733 --> 00:04:28,200 Now I've covered everything. 86 00:04:28,200 --> 00:04:32,366 There should not be any confusion left in data preprocessing. 87 00:04:32,500 --> 00:04:34,000 I'm really glad that you know this. 88 00:04:34,000 --> 00:04:38,000 And therefore I encourage you now to press pause on this video 89 00:04:38,233 --> 00:04:41,100 and guess what the next step will be to apply feature 90 00:04:41,100 --> 00:04:44,800 scaling to our matrices of features, X_train and X_test.
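One way that next step might look is sketched below. It is not the solution given in the video, which stops here. Several things are assumptions made only for illustration: the toy values in `X_train` and `X_test` are invented, the three dummy columns are assumed to come first, followed by age and salary, and the scaler is fitted on the training set only and then reused on the test set, which is common practice but is not stated in this section.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (invented values): columns 0-2 are the country
# dummy variables, column 3 is the age, column 4 is the salary.
X_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])
X_test = np.array([
    [1.0, 0.0, 0.0, 35.0, 58000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
])

sc = StandardScaler()

# Compute the mean and standard deviation of the numerical columns of the
# training set, and standardize those columns. The dummies are left alone.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# Standardize the numerical columns of the test set with the mean and
# standard deviation learned from the training set.
X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train)
print(X_test)
```

After this, the first three columns still hold only 0s and 1s, so each row can still be read back as a country, while the age and salary columns of the training set have mean 0 and standard deviation 1.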