1
00:00:00,200 --> 00:00:00,666
Hello my

2
00:00:00,666 --> 00:00:04,500
friends, I hope you digested the previous tutorial well, where we tackled

3
00:00:04,500 --> 00:00:08,400
this big yet important tool of our data preprocessing toolkit.

4
00:00:08,666 --> 00:00:09,166
Indeed.

5
00:00:09,166 --> 00:00:13,433
Now you know how to handle the case where you have some categorical data

6
00:00:13,433 --> 00:00:17,466
in your data set, which is a situation you will encounter many times

7
00:00:17,666 --> 00:00:19,700
in your future machine learning career.

8
00:00:19,700 --> 00:00:24,100
And now we have two tools to cover, the first one being splitting

9
00:00:24,100 --> 00:00:26,800
the data set into the training set and the test set,

10
00:00:26,800 --> 00:00:29,433
and the second one being feature scaling.

11
00:00:29,433 --> 00:00:32,566
So before we start, I'm about to answer

12
00:00:32,600 --> 00:00:35,666
one of the most frequently asked questions

13
00:00:35,833 --> 00:00:39,533
in the data science community, which is, be ready for it:

14
00:00:39,966 --> 00:00:44,400
Do we have to apply feature scaling before splitting

15
00:00:44,400 --> 00:00:47,900
the data set into the training set and the test set, or after?

16
00:00:48,300 --> 00:00:51,500
I've seen this question many times, and you will find

17
00:00:51,500 --> 00:00:54,600
that question in many forums of the data science community.

18
00:00:54,700 --> 00:00:58,666
Some people will say that we have to apply feature scaling before the split.

19
00:00:58,800 --> 00:01:04,000
Some people will say after the split, and now I'm about to reveal the right answer.

20
00:01:04,000 --> 00:01:08,533
There is only one right answer, which is, by the way, totally obvious

21
00:01:08,566 --> 00:01:10,466
after you get the explanation.

22
00:01:10,466 --> 00:01:15,000
So the answer is: we have to apply feature scaling

23
00:01:15,700 --> 00:01:20,166
after splitting the data set into the training set and the test set.
24
00:01:20,300 --> 00:01:22,366
And now let me explain.

25
00:01:22,366 --> 00:01:24,600
So first, just to make sure everybody understands,

26
00:01:24,600 --> 00:01:27,300
let me explain the what first, and then I'll explain the why.

27
00:01:27,300 --> 00:01:30,666
So of course, splitting the data set into the training set and the test set

28
00:01:30,666 --> 00:01:33,666
consists of making two separate sets:

29
00:01:33,666 --> 00:01:35,700
one training set where you're going to train

30
00:01:35,700 --> 00:01:40,066
your machine learning model on existing observations, and one test set where

31
00:01:40,066 --> 00:01:44,433
you're going to evaluate the performance of your model on new observations.

32
00:01:44,633 --> 00:01:48,900
And it's important to understand that these new observations are exactly like,

33
00:01:48,900 --> 00:01:52,100
you know, some future data that you're going to get and on

34
00:01:52,100 --> 00:01:54,600
which you're going to deploy your machine learning model.

35
00:01:54,600 --> 00:01:55,100
All right.

36
00:01:55,100 --> 00:01:56,866
So that's this first tool.

37
00:01:56,866 --> 00:02:02,333
And now, feature scaling simply consists of scaling all your variables,

38
00:02:02,333 --> 00:02:07,766
all your features actually, to make sure they all take values on the same scale.

39
00:02:07,766 --> 00:02:11,666
And we do this so as to prevent one feature from dominating the others,

40
00:02:11,700 --> 00:02:15,000
which would therefore be neglected by the machine learning model.

41
00:02:15,366 --> 00:02:15,800
All right.

42
00:02:15,800 --> 00:02:18,133
So that's the what for both of these tools.

43
00:02:18,133 --> 00:02:22,733
Now let me explain why we have to apply feature scaling

44
00:02:23,000 --> 00:02:26,366
after splitting the data set into the training set and the test set.

45
00:02:26,366 --> 00:02:27,633
It's really obvious.
46
00:02:27,633 --> 00:02:30,866
It is for the simple reason that the test set

47
00:02:31,200 --> 00:02:33,900
is supposed to be a brand new set

48
00:02:33,900 --> 00:02:37,466
on which you are going to evaluate your machine learning model.

49
00:02:37,700 --> 00:02:41,100
So it's exactly like, you know, you're training your machine learning model

50
00:02:41,100 --> 00:02:45,100
on your training set, and then later on, you know, after it is trained,

51
00:02:45,133 --> 00:02:48,133
you're going to deploy it on new observations.

52
00:02:48,300 --> 00:02:51,366
So what this means is that the test set is something

53
00:02:51,366 --> 00:02:54,366
you're not supposed to work with for the training.

54
00:02:54,600 --> 00:02:58,133
And feature scaling is, as you will see, a technique

55
00:02:58,133 --> 00:03:01,533
that will get the mean and the standard deviation

56
00:03:01,533 --> 00:03:05,033
of your feature, you know, in order to perform the scaling.

57
00:03:05,466 --> 00:03:09,566
So if we apply feature scaling before the split,

58
00:03:09,900 --> 00:03:13,733
then it will actually get the mean and the standard deviation

59
00:03:13,733 --> 00:03:17,033
of all the values, including the ones in the test set.

60
00:03:17,166 --> 00:03:20,333
And since the test set is something you're not supposed to have,

61
00:03:20,333 --> 00:03:24,300
you know, like some future data in production, well, you know, applying

62
00:03:24,300 --> 00:03:28,700
feature scaling on the original data set before the split would cause

63
00:03:28,700 --> 00:03:31,933
what we call information leakage from the test set.

64
00:03:31,933 --> 00:03:32,733
You know, we would

65
00:03:32,733 --> 00:03:36,933
grab some information from the test set, which we're not supposed to get

66
00:03:37,033 --> 00:03:40,900
because it is supposed to be new data with new observations.
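
To make the leakage concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration and do not come from the tutorial, and the last two values simply play the role of the test set. It shows that the mean and standard deviation change once the test values are included, which is the information a model trained on that scaling would indirectly see.

```python
from statistics import mean, pstdev

# Hypothetical toy feature (made-up values, not from the tutorial).
# Assumption: the last two values stand in for the test set.
values = [20.0, 30.0, 40.0, 50.0, 200.0, 300.0]
train, test = values[:4], values[4:]

# Scaling before the split: statistics use every value, test set included.
leaky_mean, leaky_std = mean(values), pstdev(values)

# Scaling after the split: statistics use the training values only.
train_mean, train_std = mean(train), pstdev(train)

print(f"mean with test values included: {leaky_mean:.2f}")
print(f"mean from training values only: {train_mean:.2f}")
print(f"std with test values included:  {leaky_std:.2f}")
print(f"std from training values only:  {train_std:.2f}")
```

The gap between the two means is information that came from the test values alone.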
67
00:03:41,166 --> 00:03:46,233
So remember this: the essential reason why you should not apply feature scaling

68
00:03:46,233 --> 00:03:50,133
before the split is to prevent information leakage

69
00:03:50,366 --> 00:03:54,900
from the test set, which you're not supposed to have until the training is done.
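
The order described above can be sketched as follows, using only the Python standard library. The helper names (`train_test_split`, `fit_scaler`, `transform`), the 20% test ratio, and the salary values are assumptions made for this illustration, not something stated in the tutorial. Reusing the training mean and standard deviation to transform the test set is the usual practice, shown here as one reasonable way to do it.

```python
import random
from statistics import mean, pstdev

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(column):
    """Return the mean and standard deviation of one feature column."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardize a column using previously fitted statistics."""
    return [(x - mu) / sigma for x in column]

# Hypothetical single feature; the values are made up.
salaries = [48000.0, 52000.0, 61000.0, 58000.0, 79000.0,
            83000.0, 54000.0, 67000.0, 72000.0, 50000.0]

# 1. Split first.
train, test = train_test_split(salaries)

# 2. Fit the scaler on the training set only.
mu, sigma = fit_scaler(train)

# 3. Apply the training statistics to both sets.
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)

print("train mean after scaling:", round(mean(train_scaled), 6))
print("test values after scaling:", [round(x, 3) for x in test_scaled])
```

With this order, the test values never contribute to `mu` or `sigma`, so they remain unseen until evaluation.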