1 00:00:00,200 --> 00:00:00,966 All right. 2 00:00:00,966 --> 00:00:02,633 So now all good. 3 00:00:02,633 --> 00:00:06,800 Now we have correctly applied this transformation to one-hot 4 00:00:06,800 --> 00:00:10,166 encode the columns of this matrix of features X. 5 00:00:10,533 --> 00:00:14,266 Let's check that right away by creating a new code cell 6 00:00:14,266 --> 00:00:17,933 and then printing this matrix of features X. 7 00:00:18,300 --> 00:00:19,266 There we go. 8 00:00:19,266 --> 00:00:21,900 Now let's run this cell. 9 00:00:21,900 --> 00:00:24,900 And then run the next one. 10 00:00:24,900 --> 00:00:26,333 And what do we see? 11 00:00:26,333 --> 00:00:29,333 Well, we see exactly what was expected here, 12 00:00:29,500 --> 00:00:33,300 meaning that we no longer have the first country column with the text, 13 00:00:33,300 --> 00:00:35,800 you know, the three countries here as strings. 14 00:00:35,800 --> 00:00:38,666 This time we have, as we said, three new columns. 15 00:00:38,666 --> 00:00:40,800 So I don't know if you can see it, but that's the first one. 16 00:00:40,800 --> 00:00:42,066 That's the second one. 17 00:00:42,066 --> 00:00:46,700 And that's the third one, where each row encodes one of the three countries. 18 00:00:46,900 --> 00:00:49,300 So actually France, you know, that was the first row, 19 00:00:49,300 --> 00:00:53,300 France is encoded as one, zero and zero, 20 00:00:53,300 --> 00:00:55,366 you know, a vector of one, zero and zero. 21 00:00:55,366 --> 00:00:59,566 Then Spain is encoded as a vector of zero, zero and one. 22 00:00:59,866 --> 00:01:03,733 And Germany is encoded as a vector of zero, one and zero. 23 00:01:03,766 --> 00:01:07,633 You see the unique ID for each of these three countries, 24 00:01:07,633 --> 00:01:09,333 thanks to these three columns here. 25 00:01:09,333 --> 00:01:11,966 And that's exactly the idea of one-hot encoding.
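For reference, the result described above can be reproduced with a minimal sketch. It assumes the transformation was built with scikit-learn's `ColumnTransformer` and `OneHotEncoder`, which this segment does not show, and it uses a small hypothetical matrix `X` (country, age) rather than the course dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical matrix of features: column 0 is the country, column 1 an age.
X = np.array(
    [["France", 44.0],
     ["Spain", 27.0],
     ["Germany", 30.0]],
    dtype=object,
)

# One-hot encode column 0 and pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X = np.array(ct.fit_transform(X))

# Categories are sorted alphabetically (France, Germany, Spain), so:
#   France  -> 1, 0, 0
#   Spain   -> 0, 0, 1
#   Germany -> 0, 1, 0
print(X)
```

The country column is replaced by three 0/1 columns placed in front of the untouched age column, which matches the vectors read out in the video.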
26 00:01:11,966 --> 00:01:14,933 We not only turn our countries 27 00:01:14,933 --> 00:01:19,033 into numerical values, but also there is no numerical order, 28 00:01:19,266 --> 00:01:22,000 thanks to these zeros and ones here in three columns. 29 00:01:22,000 --> 00:01:23,000 So that's perfect. 30 00:01:23,000 --> 00:01:26,933 And that keeps our future machine learning models from assuming an order between the countries. 31 00:01:27,633 --> 00:01:29,500 All right. So congratulations. 32 00:01:29,500 --> 00:01:32,966 Now you know how to one-hot encode some categorical data. 33 00:01:33,500 --> 00:01:35,933 Now we're going to quickly do another encoding 34 00:01:35,933 --> 00:01:39,800 for the dependent variable, because indeed it has a text format: 35 00:01:39,833 --> 00:01:41,066 no and yes. 36 00:01:41,066 --> 00:01:45,700 And we would just like to convert these strings into zero and one respectively. 37 00:01:45,933 --> 00:01:47,700 And to do this, well, that's very simple. 38 00:01:47,700 --> 00:01:51,566 We're going to use another class called LabelEncoder, which will 39 00:01:51,566 --> 00:01:56,800 exactly encode these nos and yeses into zeros and ones respectively. 40 00:01:57,233 --> 00:01:57,533 All right. 41 00:01:57,533 --> 00:01:58,200 So let's do this. 42 00:01:58,200 --> 00:02:00,766 Let's create a new code cell here. 43 00:02:00,766 --> 00:02:03,033 Let's scroll down a bit. 44 00:02:03,033 --> 00:02:04,066 Here we go. 45 00:02:04,066 --> 00:02:07,000 And now let's encode the dependent variable. 46 00:02:07,000 --> 00:02:10,800 So as we said, we're going to use the LabelEncoder class, which we get 47 00:02:10,966 --> 00:02:15,633 from the scikit-learn library again, from which we're going to call 48 00:02:16,033 --> 00:02:21,166 the preprocessing module, from which we're going to import, 49 00:02:22,500 --> 00:02:26,833 well, there we go, the LabelEncoder class.
50 00:02:27,366 --> 00:02:30,600 Then exactly as we did before, we're going to create 51 00:02:30,600 --> 00:02:33,600 an object of this class, which we're going to call le. 52 00:02:33,766 --> 00:02:38,500 And to do this, well, we simply need to call the class LabelEncoder. 53 00:02:38,900 --> 00:02:41,766 There we go, then some parentheses. 54 00:02:41,766 --> 00:02:42,600 Then good news. 55 00:02:42,600 --> 00:02:44,733 We don't have to input anything in the parentheses 56 00:02:44,733 --> 00:02:48,533 because, you know, we will just need to enter y directly, 57 00:02:48,566 --> 00:02:50,733 because it is only one single vector. 58 00:02:50,733 --> 00:02:53,733 So it will be obvious what needs to be encoded. 59 00:02:53,866 --> 00:02:55,833 And so there we go. Let's do this. 60 00:02:55,833 --> 00:03:00,300 Let's first call our object le, from which we're going to call, 61 00:03:00,300 --> 00:03:01,133 well, good news. 62 00:03:01,133 --> 00:03:04,266 Once again there is a fit_transform method 63 00:03:04,600 --> 00:03:07,966 which we can call directly on y 64 00:03:08,366 --> 00:03:11,500 and which will exactly convert the nos 65 00:03:11,500 --> 00:03:14,500 and yeses, you know, the text, into numerical values. 66 00:03:14,800 --> 00:03:17,366 And this time we don't have to have a NumPy array 67 00:03:17,366 --> 00:03:19,433 because this is the dependent variable vector. 68 00:03:19,433 --> 00:03:21,366 It doesn't need to be a NumPy array, 69 00:03:21,366 --> 00:03:24,533 you know, as what is expected by the future machine learning models. 70 00:03:24,900 --> 00:03:30,533 So we can just set the new y to be what is returned by this 71 00:03:30,566 --> 00:03:34,266 fit_transform method applied to the old y with the text nos 72 00:03:34,266 --> 00:03:35,200 and yeses. 73 00:03:35,200 --> 00:03:36,400 All right, let's check this.
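The label-encoding step just described can be sketched as follows. The names `le` and `y` follow the video; the five labels are hypothetical stand-ins, not the course data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical dependent variable vector with text labels.
y = ["No", "Yes", "No", "No", "Yes"]

# No arguments are needed: the encoder works on the single vector it is given.
le = LabelEncoder()

# Fit the encoder on y and replace y with the encoded values.
y = le.fit_transform(y)

# Classes are sorted alphabetically, so "No" maps to 0 and "Yes" maps to 1.
print(le.classes_)
print(y)  # [0 1 0 0 1]
```

Because the mapping is alphabetical, a two-class No/Yes target ends up as 0 and 1 respectively, which is the binary outcome the video checks for.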
74 00:03:36,400 --> 00:03:39,900 Let's create a new code cell just to print 75 00:03:40,500 --> 00:03:42,900 our dependent variable vector y. 76 00:03:42,900 --> 00:03:47,233 Let's run this cell first and then this one. 77 00:03:47,233 --> 00:03:48,800 And let's see if we get zeros and ones. 78 00:03:48,800 --> 00:03:51,800 And yes, indeed we get the zeros and the ones. 79 00:03:51,900 --> 00:03:54,400 Right, no corresponds to zero. 80 00:03:54,400 --> 00:03:56,833 And then one here corresponds to yes. Right. 81 00:03:56,833 --> 00:04:02,233 Then it should be 0011000110. 82 00:04:02,266 --> 00:04:03,666 Okay. So all good. 83 00:04:03,666 --> 00:04:07,033 So now you not only know how to apply one-hot encoding 84 00:04:07,033 --> 00:04:08,666 when you have several categories 85 00:04:08,666 --> 00:04:11,200 in one of the features of your matrix of features, 86 00:04:11,200 --> 00:04:13,000 but also you can do a simple label 87 00:04:13,000 --> 00:04:16,966 encoding when you have, you know, two classes which you can directly encode 88 00:04:16,966 --> 00:04:19,966 into zeros and ones, you know, a binary outcome. 89 00:04:20,700 --> 00:04:22,766 All right. Perfect. Congratulations. 90 00:04:22,766 --> 00:04:26,266 Now you have an extra tool in your data preprocessing toolkit: 91 00:04:26,400 --> 00:04:28,566 the way to encode categorical data. 92 00:04:28,566 --> 00:04:32,633 And now we're going to move on to the next tool, which will be to split 93 00:04:32,633 --> 00:04:35,633 the dataset into the training set and test set. 94 00:04:35,633 --> 00:04:36,700 So digest this. 95 00:04:36,700 --> 00:04:39,533 And as soon as you're ready, meet me in the next tutorial.