1 00:00:00,400 --> 00:00:02,066 Hello and welcome to this 2 00:00:02,066 --> 00:00:05,700 almost final tutorial of Part 1, Data Pre-processing. 3 00:00:06,033 --> 00:00:10,200 I look forward to being well-prepared with our data to start making our machine 4 00:00:10,200 --> 00:00:12,400 learning models. It's going to be a lot of fun. 5 00:00:12,400 --> 00:00:15,733 We just have to hold on for two more tutorials and then we're good to go. 6 00:00:16,266 --> 00:00:20,466 Okay, so today we're going to talk about feature scaling, which is very important 7 00:00:20,466 --> 00:00:21,500 in machine learning. 8 00:00:21,500 --> 00:00:23,500 I'm going to explain why right now. 9 00:00:23,500 --> 00:00:27,066 So I'm going to go to Google Sheets to find our data set. 10 00:00:27,200 --> 00:00:28,400 And here it is. 11 00:00:28,400 --> 00:00:31,700 Let's explain what feature scaling is and why we need to do it. 12 00:00:32,700 --> 00:00:33,000 Okay. 13 00:00:33,000 --> 00:00:33,900 So as you can see, we 14 00:00:33,900 --> 00:00:38,400 have these two columns, age and salary, that contain numerical values. 15 00:00:39,133 --> 00:00:41,566 Let's just focus on the age and the salary. 16 00:00:41,566 --> 00:00:46,666 You notice that the variables are not on the same scale, because the ages 17 00:00:46,666 --> 00:00:51,200 go from 27 to 50, 18 00:00:52,000 --> 00:00:56,100 and the salaries go from 40K to about 90K. 19 00:00:56,366 --> 00:01:01,433 So because this age variable and the salary variable don't have the same scale, 20 00:01:01,700 --> 00:01:05,166 this will cause some issues in your machine learning models. 21 00:01:05,600 --> 00:01:06,400 And why is that? 22 00:01:06,400 --> 00:01:08,700 It's because of your machine learning models. 23 00:01:08,700 --> 00:01:09,600 A lot of machine learning 24 00:01:09,600 --> 00:01:13,200 models are based on what is called the Euclidean distance.
25 00:01:13,500 --> 00:01:14,500 If you remember that back 26 00:01:14,500 --> 00:01:18,133 from high school, the Euclidean distance between two points 27 00:01:18,533 --> 00:01:21,533 is the square root of the sum of the squared coordinate differences. 28 00:01:21,933 --> 00:01:24,133 Well, actually here it's the same. 29 00:01:24,133 --> 00:01:26,700 We have two variables, age and salary. 30 00:01:26,700 --> 00:01:31,233 So you can picture age as the x coordinate and the salary as the y coordinate. 31 00:01:31,800 --> 00:01:34,333 And in the machine learning models' equations, 32 00:01:34,333 --> 00:01:37,866 some Euclidean distances between observation points, 33 00:01:37,866 --> 00:01:43,266 for example, this one and this one, are computed based on these two coordinates. 34 00:01:43,666 --> 00:01:47,000 And actually, since the salary has a much wider range 35 00:01:47,166 --> 00:01:50,366 of values, because it's going from 0 to 100K, 36 00:01:51,200 --> 00:01:55,400 the Euclidean distance will be dominated by the salary. 37 00:01:56,266 --> 00:01:58,633 Because if we take two observations, for example, 38 00:01:58,633 --> 00:02:04,766 this one, the ninth one, and the third one, well, the Euclidean 39 00:02:04,766 --> 00:02:08,766 distance will use the difference between this salary and this salary. 40 00:02:09,466 --> 00:02:10,966 So let's compute it. That's about 41 00:02:12,400 --> 00:02:12,733 okay. 42 00:02:12,733 --> 00:02:15,666 So this is 43 00:02:15,666 --> 00:02:16,933 this one. 44 00:02:16,933 --> 00:02:19,933 As you can see, this is 31,000. 45 00:02:19,933 --> 00:02:22,200 If you square that, 46 00:02:22,200 --> 00:02:22,500 okay, 47 00:02:22,500 --> 00:02:25,600 let's see, squared, that gives this. 48 00:02:26,066 --> 00:02:29,366 And now let's take the ages for the same two observations. 49 00:02:29,866 --> 00:02:34,533 So let's compute: equals 48 minus.
50 00:02:35,300 --> 00:02:37,766 It was this one, right? 27. 51 00:02:37,766 --> 00:02:39,133 Okay. That's the difference. 52 00:02:39,133 --> 00:02:42,133 And now let's take the square: equals 53 00:02:42,366 --> 00:02:46,200 this squared, 441. 54 00:02:46,600 --> 00:02:49,700 So you can see very clearly how this squared 55 00:02:49,700 --> 00:02:52,833 difference dominates this squared difference. 56 00:02:53,100 --> 00:02:56,200 And that's because these two variables are not on the same scale. 57 00:02:56,533 --> 00:02:59,566 So you know, in the machine learning equations it will be as if 58 00:02:59,800 --> 00:03:03,000 this doesn't exist, because it will be dominated by this. 59 00:03:03,400 --> 00:03:08,300 So that's why we absolutely need to put the variables on the same scale. 60 00:03:08,433 --> 00:03:11,400 That is, we are going to transform these two variables, 61 00:03:11,400 --> 00:03:14,400 and they're going to have values in the same range. 62 00:03:14,866 --> 00:03:17,933 For example, they're going to have values from minus one to plus one here 63 00:03:18,000 --> 00:03:20,200 and the same here, minus one to plus one, 64 00:03:20,200 --> 00:03:24,200 so that we don't get this sort of problem with a huge number here 65 00:03:24,400 --> 00:03:27,000 dominating a smaller number here, 66 00:03:27,000 --> 00:03:30,000 so that eventually the smaller number effectively doesn't exist. 67 00:03:30,166 --> 00:03:32,733 There are several ways of scaling your data. 68 00:03:32,733 --> 00:03:37,066 A very common one is standardization, which means that for each observation 69 00:03:37,066 --> 00:03:41,266 and each feature, you subtract the mean value of all the values 70 00:03:41,266 --> 00:03:44,266 of the feature, and you divide by the standard deviation. 71 00:03:44,500 --> 00:03:46,566 So that's the first type of feature scaling.
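[Editor's sketch] The arithmetic and the standardization step described above can be reproduced in a few lines of Python. This is only an illustrative sketch, not the code used in the tutorial. The differences (48 minus 27 for age, 31,000 for salary) are the ones quoted in the video. The `standardize` helper, the sample lists `ages` and `salaries`, and the choice of the population standard deviation are assumptions made for illustration.

```python
import math

# Differences quoted in the video for the two observations being compared.
age_diff = 48 - 27            # 21
salary_diff = 31_000          # stated in the video

age_sq = age_diff ** 2        # 441
salary_sq = salary_diff ** 2  # 961_000_000

# Euclidean distance: square root of the sum of the squared differences.
distance = math.sqrt(age_sq + salary_sq)

# Fraction of the squared distance that comes from the salary alone.
salary_share = salary_sq / (age_sq + salary_sq)


def standardize(values):
    """Subtract the mean, then divide by the standard deviation.

    Assumption: the population standard deviation (divide by n) is used;
    the video does not say which variant it has in mind.
    """
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Hypothetical sample values, for illustration only (not the tutorial's data set).
ages = [27.0, 30.0, 38.0, 48.0, 50.0]
salaries = [48_000.0, 54_000.0, 61_000.0, 79_000.0, 83_000.0]

scaled_ages = standardize(ages)
scaled_salaries = standardize(salaries)

print(distance, salary_share)
print(scaled_ages)
print(scaled_salaries)
```

After standardization both lists have mean 0 and standard deviation 1, so a squared difference in age and a squared difference in salary end up with comparable magnitudes.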
72 00:03:46,566 --> 00:03:50,400 And another type is normalization, which means that you subtract 73 00:03:50,400 --> 00:03:54,600 the minimum value of all the feature values from your observation feature x, 74 00:03:54,833 --> 00:03:58,766 and you divide it by the difference between the maximum of your feature values 75 00:03:58,766 --> 00:04:00,666 and the minimum of your feature values. 76 00:04:00,666 --> 00:04:02,133 Don't worry if you're not very comfortable 77 00:04:02,133 --> 00:04:05,366 with the mathematics here, but what you need to understand is that we are 78 00:04:05,366 --> 00:04:08,400 putting all variables in the same range, on the same scale, 79 00:04:08,700 --> 00:04:11,700 so that no variable is dominated by another. 80 00:04:12,166 --> 00:04:13,833 Okay, so let's do it right now anyway. 81 00:04:13,833 --> 00:04:16,700 You're going to see how the variables are going to be transformed. 82 00:04:16,700 --> 00:04:19,233 You're going to see how they go from having large 83 00:04:19,233 --> 00:04:22,966 and very different values to small and similar values. 84 00:04:23,500 --> 00:04:24,700 So let's move on to our.