1 00:00:00,200 --> 00:00:01,333 Hello and welcome back. 2 00:00:01,333 --> 00:00:02,700 Today we've got an important tutorial. 3 00:00:02,700 --> 00:00:04,733 We're talking about feature scaling. 4 00:00:04,733 --> 00:00:08,466 Now before we dive into the technical aspects of feature scaling, 5 00:00:08,833 --> 00:00:12,933 I would like to present you with an image that hopefully will help 6 00:00:12,933 --> 00:00:16,533 you remember what feature scaling is applied to. 7 00:00:16,933 --> 00:00:17,966 Now, even without 8 00:00:17,966 --> 00:00:21,800 knowing anything about feature scaling, please remember that feature scaling 9 00:00:21,800 --> 00:00:25,733 is always applied to columns, so feature scaling would be applied to this column. 10 00:00:25,966 --> 00:00:27,600 To this column. To this column. 11 00:00:27,600 --> 00:00:28,900 To this column. 12 00:00:28,900 --> 00:00:31,800 Feature scaling is never applied across columns, 13 00:00:31,800 --> 00:00:35,433 so you wouldn't apply feature scaling to data inside a row. 14 00:00:36,166 --> 00:00:37,400 Just remember: 15 00:00:37,400 --> 00:00:40,400 feature scaling is always applied to columns. 16 00:00:40,700 --> 00:00:44,100 Now with that out of the way, let's have a look at what feature scaling 17 00:00:44,100 --> 00:00:45,366 actually is. 18 00:00:45,366 --> 00:00:48,633 So there are multiple types of feature scaling, multiple techniques. 19 00:00:48,700 --> 00:00:51,166 We're going to look at the two main ones: 20 00:00:51,166 --> 00:00:54,000 normalization and standardization. 21 00:00:54,000 --> 00:00:57,933 Normalization is the process of taking the minimum inside 22 00:00:57,933 --> 00:01:02,366 a column, subtracting that minimum from every single value inside that column, 23 00:01:02,666 --> 00:01:05,633 and then dividing by 24 00:01:05,633 --> 00:01:08,433 the difference between the maximum and the minimum.
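As a rough sketch of the normalization just described, here is how it might look for a single column held as a plain Python list. The function name `min_max_normalize` and the sample values are made up for illustration, and mapping a constant column to zeros is an assumption rather than something covered in the tutorial.

```python
def min_max_normalize(column):
    """Rescale one column (a list of numbers) using (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Assumption: a constant column is mapped to all zeros
        # so that we never divide by zero.
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


# Hypothetical values, only to show the shape of the result.
example = [10, 20, 40]
scaled = min_max_normalize(example)
print(scaled)  # the smallest value becomes 0.0 and the largest becomes 1.0
```

The function works on one column at a time, which matches the point that feature scaling is applied to columns and never across a row.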
25 00:01:08,433 --> 00:01:12,400 So basically every single value in a column is adjusted this way. 26 00:01:12,400 --> 00:01:15,400 And you will end up with a new column or an adjusted column 27 00:01:15,666 --> 00:01:18,900 with values which are all between 0 and 1. 28 00:01:19,500 --> 00:01:23,833 Standardization, on the other hand, is a similar process, 29 00:01:23,833 --> 00:01:26,833 but instead of subtracting the minimum, we subtract the average 30 00:01:27,000 --> 00:01:30,500 and we divide by the standard deviation. 31 00:01:30,966 --> 00:01:34,600 As a result, all of the values, or almost all of the values inside 32 00:01:34,600 --> 00:01:37,600 the column will be between -3 and 3. 33 00:01:37,700 --> 00:01:40,700 If you have some extreme values or some outliers, 34 00:01:41,133 --> 00:01:44,600 then they can end up outside of these minus three and three boundaries. 35 00:01:45,566 --> 00:01:48,333 So that is normalization and standardization. 36 00:01:48,333 --> 00:01:51,700 In the practical tutorials you'll be looking at standardization. 37 00:01:51,700 --> 00:01:53,133 And for simplicity's sake 38 00:01:53,133 --> 00:01:56,300 in these intuition tutorials we'll have a look at normalization. 39 00:01:56,933 --> 00:02:01,033 So let's imagine we have a data set where we have two columns, 40 00:02:01,200 --> 00:02:04,666 annual income of a person and their age. 41 00:02:05,100 --> 00:02:07,366 Just a simple dataset with those two columns. 42 00:02:07,366 --> 00:02:10,766 And again, for simplicity's sake, we're only going to have three rows. 43 00:02:11,100 --> 00:02:14,333 We're going to have a blue person, a purple person and a red person. 44 00:02:14,733 --> 00:02:17,733 Now, here is the data. 45 00:02:17,766 --> 00:02:21,433 We have the blue person making $70,000 46 00:02:21,433 --> 00:02:24,433 a year and their age is 45 years.
47 00:02:25,100 --> 00:02:28,566 The purple person makes $60,000 a year and their age is 44 years, 48 00:02:28,900 --> 00:02:33,533 and the red person makes $52,000 a year, and their age is 40 years. 49 00:02:34,100 --> 00:02:37,966 Now, the task at hand is going to be slightly different 50 00:02:37,966 --> 00:02:43,100 from the regressions and classifications that we've been discussing. 51 00:02:43,200 --> 00:02:47,066 The task at hand is to 52 00:02:47,066 --> 00:02:50,066 see which of the two people 53 00:02:50,400 --> 00:02:54,233 the purple person is most similar to, just based on this data. 54 00:02:54,266 --> 00:02:59,000 Would you say that the purple person is more similar to the blue person? 55 00:02:59,000 --> 00:03:02,200 Or would you say that the purple person is more similar to the red person? 56 00:03:02,933 --> 00:03:07,100 This is more relevant for the 57 00:03:07,100 --> 00:03:10,700 clustering tasks or clustering 58 00:03:11,133 --> 00:03:15,266 algorithms that we'll be discussing in the following section of the course. 59 00:03:15,266 --> 00:03:19,500 But it is just such a simple illustrative example that we're going to use it here. 60 00:03:19,500 --> 00:03:20,266 We're going to talk about it 61 00:03:20,266 --> 00:03:23,800 here to show the importance of feature scaling. 62 00:03:24,100 --> 00:03:28,333 So once again, if you'd like to pause this video, please go ahead and try to see: 63 00:03:28,333 --> 00:03:31,733 would you group the purple person with the blue person, 64 00:03:31,966 --> 00:03:34,766 or would you group the purple person with the red person? 65 00:03:34,766 --> 00:03:36,533 And now let's have a look at it together. 66 00:03:36,533 --> 00:03:38,566 So let's look at the differences. 67 00:03:38,566 --> 00:03:41,733 The difference here in salary is $10,000. 68 00:03:41,733 --> 00:03:44,200 And here it is $8,000.
69 00:03:44,200 --> 00:03:45,300 And in terms of age, 70 00:03:45,300 --> 00:03:47,900 the difference between the purple and the blue person is one year. 71 00:03:47,900 --> 00:03:50,700 And between the purple and the red person it is four years. 72 00:03:50,700 --> 00:03:52,800 Now, what can happen with unscaled features, 73 00:03:52,800 --> 00:03:54,466 as we see here, 74 00:03:54,466 --> 00:03:58,300 is that the values, the unit values, 75 00:03:58,700 --> 00:04:04,266 of one column can be so much larger than the other that they might overpower it. 76 00:04:04,266 --> 00:04:08,233 So, for example, we can see that the values of 10,000 and 8,000 77 00:04:08,233 --> 00:04:11,233 are much, much greater than the values of one and four. 78 00:04:11,400 --> 00:04:17,000 So we might make the erroneous conclusion that, okay, we're going to ignore the values 79 00:04:17,000 --> 00:04:21,300 one and four, because those are such small differences compared to 10,000 and 8,000. 80 00:04:21,300 --> 00:04:26,100 We're going to focus on these large-magnitude numbers, 10,000 and 8,000. 81 00:04:26,333 --> 00:04:29,300 And out of them we'll say that the purple person 82 00:04:29,300 --> 00:04:32,366 is clearly closer to the red person because 83 00:04:33,300 --> 00:04:35,466 that value is 8,000. 84 00:04:35,466 --> 00:04:39,133 It's $2,000 less, or 2,000 units less, than 85 00:04:39,766 --> 00:04:42,166 the difference between the purple and the blue person. 86 00:04:42,166 --> 00:04:45,233 And as a result, we would group the purple person with the red person. 87 00:04:46,266 --> 00:04:47,433 Now, we don't want 88 00:04:47,433 --> 00:04:50,433 this or similar things happening in our algorithms, 89 00:04:51,200 --> 00:04:53,533 and that's why we need to normalize variables, 90 00:04:53,533 --> 00:04:56,700 because right now we can't compare salaries to years. 91 00:04:56,700 --> 00:04:59,100 It's like comparing apples and oranges.
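To see how the larger column can overpower the smaller one, here is a small sketch using the three people from the example. It assumes similarity is measured with straight-line (Euclidean) distance, which the tutorial does not specify, and the helper name `euclidean` is hypothetical.

```python
import math

# (annual income in dollars, age in years), taken from the example.
blue = (70_000, 45)
purple = (60_000, 44)
red = (52_000, 40)


def euclidean(a, b):
    """Straight-line distance between two rows (an assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_blue = euclidean(purple, blue)  # driven almost entirely by the $10,000 gap
d_red = euclidean(purple, red)    # driven almost entirely by the $8,000 gap

print(round(d_blue, 3), round(d_red, 3))
# On the raw data the purple person looks closer to red,
# because the age differences of 1 and 4 years barely register.
print("closer to red" if d_red < d_blue else "closer to blue")
```

On the unscaled data the age column contributes almost nothing to either distance, so the comparison is decided by income alone.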
92 00:04:59,100 --> 00:05:00,966 These are not comparable things. 93 00:05:00,966 --> 00:05:03,966 What if age was expressed not in years but in minutes? 94 00:05:04,100 --> 00:05:07,700 Then those values would be much higher, or in seconds they would be even higher. 95 00:05:08,066 --> 00:05:10,033 And even if you have the same units of measurement, 96 00:05:10,033 --> 00:05:13,200 like dollars and dollars in two columns, they still might not be comparable 97 00:05:13,233 --> 00:05:14,633 because they relate to different things. 98 00:05:14,633 --> 00:05:17,333 So it's important to scale your features. 99 00:05:17,333 --> 00:05:20,466 And so let's apply normalization. As a quick reminder, 100 00:05:20,466 --> 00:05:22,400 this is what the formula for normalization is like. 101 00:05:22,400 --> 00:05:25,200 So we're going to apply it to the columns one by one. 102 00:05:25,200 --> 00:05:28,500 So we are applying it to the dollar column first. 103 00:05:28,633 --> 00:05:31,733 After normalizing, our values will look like this. 104 00:05:32,000 --> 00:05:34,566 So let's go back for a second and go here. 105 00:05:34,566 --> 00:05:36,633 So I'll let you pause this video if you'd like to do 106 00:05:36,633 --> 00:05:37,933 the manual calculations. 107 00:05:37,933 --> 00:05:40,800 This is the result. 108 00:05:40,800 --> 00:05:46,266 And once we apply it to the first column, our values will look like this. 109 00:05:46,766 --> 00:05:49,766 Once again, from here we end up here. 110 00:05:49,933 --> 00:05:54,433 And now we can compare like for like. Based on this image, 111 00:05:54,433 --> 00:05:58,366 based on this data, which do you think the purple person is closest to? 112 00:05:58,766 --> 00:06:00,300 Well, I think the answer is obvious. 113 00:06:00,300 --> 00:06:00,866 The purple 114 00:06:00,866 --> 00:06:04,300 person is almost right in the middle between the red and the blue people.
115 00:06:04,633 --> 00:06:08,233 They're at 0.444, whereas in the age column 116 00:06:08,433 --> 00:06:10,800 the purple person is closest to the blue person. 117 00:06:10,800 --> 00:06:12,433 It's very clear. 118 00:06:12,433 --> 00:06:13,100 So there we go. 119 00:06:13,100 --> 00:06:17,433 That's a quick illustrative example, very simplistic but yet illustrative 120 00:06:17,433 --> 00:06:18,966 example of feature scaling. 121 00:06:18,966 --> 00:06:22,366 And I hope you enjoy seeing that in the practical tutorials. 122 00:06:22,800 --> 00:06:24,766 On that note, I look forward to seeing you back here next time. 123 00:06:24,766 --> 00:06:26,566 And until then, enjoy machine learning.
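The whole example can be reproduced in a few lines. This is only a sketch: the function and variable names are made up, Euclidean distance is an assumed way of measuring "closest", and the standardization part uses the population standard deviation, which is one of two common conventions and not something the tutorial states.

```python
import math
import statistics

# (annual income in dollars, age in years), in the order blue, purple, red.
people = {"blue": (70_000, 45), "purple": (60_000, 44), "red": (52_000, 40)}


def min_max_normalize(column):
    """Normalization: (x - min) / (max - min), applied to one column."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def standardize(column):
    """Standardization: (x - mean) / standard deviation, applied to one column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    return [(x - mean) / std for x in column]


names = list(people)
incomes = [people[n][0] for n in names]
ages = [people[n][1] for n in names]

# Scale each column separately, never across a row.
income_scaled = min_max_normalize(incomes)
age_scaled = min_max_normalize(ages)
print([round(v, 3) for v in income_scaled])  # [1.0, 0.444, 0.0]
print([round(v, 3) for v in age_scaled])     # [1.0, 0.8, 0.0]

scaled = {n: (income_scaled[i], age_scaled[i]) for i, n in enumerate(names)}
d_blue = math.dist(scaled["purple"], scaled["blue"])
d_red = math.dist(scaled["purple"], scaled["red"])
print(round(d_blue, 2), round(d_red, 2))  # roughly 0.59 and 0.92

# The same columns after standardization, for comparison.
income_z = standardize(incomes)
age_z = standardize(ages)
print([round(v, 2) for v in income_z], [round(v, 2) for v in age_z])
```

After normalization both columns live on the same 0 to 1 scale, and under the assumed distance the purple person ends up closer to the blue person, the opposite of what the raw dollar differences suggested.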