1 00:00:00,133 --> 00:00:01,566 All right, my friends, let's do this. 2 00:00:01,566 --> 00:00:03,800 Let's proceed to the next tool in our data 3 00:00:03,800 --> 00:00:07,733 preprocessing toolkit, which is about encoding categorical data. 4 00:00:07,866 --> 00:00:10,800 So first let me explain why we have to do this. 5 00:00:10,800 --> 00:00:13,466 Let's open the data set again. 6 00:00:13,466 --> 00:00:16,800 And as we can see, this data set contains one column 7 00:00:16,800 --> 00:00:20,200 with categories, you know, France, Spain or Germany. 8 00:00:20,633 --> 00:00:24,866 First, you might guess that it will be difficult for a machine learning model 9 00:00:24,900 --> 00:00:27,900 to compute some correlations between these columns, 10 00:00:27,900 --> 00:00:32,100 you know, the features, and the outcome, which is the dependent variable. 11 00:00:32,233 --> 00:00:36,033 And therefore, of course, we will have to turn these strings, 12 00:00:36,033 --> 00:00:39,000 you know, these categories, into numbers. 13 00:00:39,000 --> 00:00:42,366 So one idea would be to encode France 14 00:00:42,366 --> 00:00:45,733 into zero, Spain into one and Germany into two. 15 00:00:46,000 --> 00:00:50,233 However, if we do this, our future machine learning model could understand 16 00:00:50,333 --> 00:00:53,700 that because France is zero, Spain is one and Germany is two, 17 00:00:54,000 --> 00:00:57,533 there is a numerical order between these three countries, 18 00:00:57,533 --> 00:01:00,600 and mostly it could interpret that this order 19 00:01:00,600 --> 00:01:03,666 matters, whereas of course it is absolutely not the case. 20 00:01:03,866 --> 00:01:04,200 Right? 21 00:01:04,200 --> 00:01:06,366 There is no ordinal relationship 22 00:01:06,366 --> 00:01:09,366 between these three countries: France, Germany and Spain. 
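To make the problem with plain integer codes concrete, here is a minimal sketch in Python. The mapping `label_codes` and the sample list `countries` are made up for illustration and are not taken from the course data set.

```python
# Hypothetical integer codes, following the "one idea" described above.
label_codes = {"France": 0, "Spain": 1, "Germany": 2}

# A made-up country column.
countries = ["Germany", "France", "Spain", "France"]
encoded = [label_codes[c] for c in countries]

# Once the countries are integers, ranking and arithmetic become possible,
# even though neither means anything for country names.
ranked = sorted(label_codes, key=label_codes.get)
gap = label_codes["Germany"] - label_codes["France"]

print(encoded)  # the column as integers
print(ranked)   # an implied ordering of the countries
print(gap)      # an implied "distance" between Germany and France
```

A model that receives only the integers has no way to know that this ordering and these distances are an accident of the encoding.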
23 00:01:09,366 --> 00:01:13,366 So we want to prevent the model from having such an interpretation, 24 00:01:13,666 --> 00:01:17,666 because that could cause some misinterpreted correlations 25 00:01:17,666 --> 00:01:21,033 between the features and the outcome, which we want to predict. 26 00:01:21,433 --> 00:01:24,600 Therefore, we can actually do much better than just 27 00:01:24,800 --> 00:01:28,566 encoding these three countries into zero, one, and two. 28 00:01:28,833 --> 00:01:32,933 And this thing that we can do better is actually one-hot encoding. 29 00:01:33,166 --> 00:01:36,733 And one-hot encoding consists of turning this 30 00:01:36,933 --> 00:01:41,033 country column into three columns. Why three columns? 31 00:01:41,100 --> 00:01:42,766 Because there are actually three 32 00:01:42,766 --> 00:01:46,433 different classes in this country column, you know, three different categories. 33 00:01:46,633 --> 00:01:50,300 If there were, for example, five countries here, we would turn this column 34 00:01:50,300 --> 00:01:51,700 into five columns. 35 00:01:51,700 --> 00:01:55,000 And one-hot encoding consists of creating 36 00:01:55,000 --> 00:01:58,000 binary vectors for each of the countries. 37 00:01:58,066 --> 00:02:00,100 Let me explain this right away. 38 00:02:00,100 --> 00:02:04,900 So very simply, France would, for example, have the vector 1 0 0, 39 00:02:05,133 --> 00:02:08,433 Spain would have the vector 0 1 0 40 00:02:08,600 --> 00:02:12,900 and Germany would have the vector 0 0 1, so that then 41 00:02:12,900 --> 00:02:16,666 there is not a numerical order between the three countries, 42 00:02:16,866 --> 00:02:19,233 because instead of having zero, one and two, 43 00:02:19,233 --> 00:02:23,400 we would only have zeros and ones and therefore three new columns. 44 00:02:23,700 --> 00:02:26,100 I'm going to show you, of course, what we're going to create. 
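The binary vectors just described can be sketched by hand in plain Python. This is only an illustration of the idea, not the scikit-learn approach used later in the video; the helper `one_hot` and the category order are assumptions chosen to match the example vectors above.

```python
# Category order chosen to reproduce the example: France, Spain, Germany.
categories = ["France", "Spain", "Germany"]

def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]

# One vector per country; each vector has as many entries as there are categories.
vectors = {country: one_hot(country, categories) for country in categories}

for country, vector in vectors.items():
    print(country, vector)
```

Every vector contains exactly one 1, so no country is numerically "larger" than another.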
45 00:02:26,100 --> 00:02:30,266 We're basically going to replace this country column by three new columns 46 00:02:30,266 --> 00:02:33,900 containing the zeros and ones encoding each of the countries. 47 00:02:34,166 --> 00:02:36,300 That is called one-hot encoding. 48 00:02:36,300 --> 00:02:39,266 And that is a very useful and popular method to use 49 00:02:39,266 --> 00:02:43,266 when pre-processing your data sets containing categorical variables. 50 00:02:43,633 --> 00:02:46,633 So that's the first thing we'll do here for this country column. 51 00:02:46,633 --> 00:02:50,100 And then remember that there is also this purchased column 52 00:02:50,100 --> 00:02:54,466 that has labels, you know, non-numerical values, with yes and no. 53 00:02:54,700 --> 00:02:58,633 And we will actually have to replace them by zeros and ones. 54 00:02:58,833 --> 00:03:01,333 And that's totally fine for the dependent variable. 55 00:03:01,333 --> 00:03:04,400 As long as it is a binary outcome, it is super fine. 56 00:03:04,633 --> 00:03:08,433 It will actually not compromise the future accuracy of the model 57 00:03:08,666 --> 00:03:11,100 if you just replace no and yes by zero and one. 58 00:03:11,100 --> 00:03:14,100 Okay, so I will teach you how to do these two things. 59 00:03:14,100 --> 00:03:18,266 And first let's start by one-hot encoding the country column here. 60 00:03:18,566 --> 00:03:19,666 And there we go. 61 00:03:19,666 --> 00:03:22,566 Let's create a new code cell for this new step: 62 00:03:22,566 --> 00:03:24,766 Encoding the independent variable. 63 00:03:26,033 --> 00:03:26,366 All right. 64 00:03:26,366 --> 00:03:29,400 So to do this we're going to use two classes. 65 00:03:29,400 --> 00:03:32,000 The first one is the ColumnTransformer class 66 00:03:32,000 --> 00:03:36,000 from the compose module of, once again, the scikit-learn library. 
67 00:03:36,300 --> 00:03:38,833 And the second class is the OneHotEncoder class 68 00:03:38,833 --> 00:03:42,366 from the preprocessing module of the same scikit-learn library. 69 00:03:42,700 --> 00:03:45,600 So first let's import these two classes. 70 00:03:45,600 --> 00:03:48,600 So we have to take them from scikit-learn, 71 00:03:49,033 --> 00:03:53,100 from which we're going to call first the compose module. 72 00:03:53,100 --> 00:03:56,100 There we go, from which we're going to import 73 00:03:56,133 --> 00:04:00,400 that class we're interested in, which is, as Google Colab 74 00:04:00,400 --> 00:04:03,300 guesses perfectly, ColumnTransformer. 75 00:04:03,300 --> 00:04:06,366 And then from scikit-learn, once again, 76 00:04:06,800 --> 00:04:11,600 we're going to get access to the preprocessing module, perfect, 77 00:04:11,766 --> 00:04:17,733 from which we're going to import the OneHotEncoder class. 78 00:04:18,133 --> 00:04:21,633 And now we're going to mix these two classes in order to do this 79 00:04:21,633 --> 00:04:24,133 one-hot encoding on the country column.
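As a rough sketch of where this is heading, the two imports and one possible way of combining the classes might look like the following. The feature matrix `X`, its values, the transformer name `"encoder"` and the choice of `remainder="passthrough"` are assumptions for illustration; the video's own code for this step has not been shown yet at this point in the transcript.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: country, age, salary (made-up values).
X = np.array(
    [
        ["France", 44, 72000],
        ["Spain", 27, 48000],
        ["Germany", 30, 54000],
    ],
    dtype=object,
)

# Apply OneHotEncoder to column 0 (the country column) only;
# remainder="passthrough" keeps the other columns unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

print(X_encoded)
```

Note that OneHotEncoder sorts the categories it finds, so the new columns here come out in the order France, Germany, Spain, which differs from the order used in the spoken example. The encoded columns are placed first, followed by the passed-through columns.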