1 00:00:00,166 --> 00:00:02,566 Hello and welcome to this tutorial. 2 00:00:02,566 --> 00:00:05,300 Okay, so we are halfway through our data pre-processing phase. 3 00:00:05,300 --> 00:00:07,366 We learned how to import the libraries 4 00:00:07,366 --> 00:00:08,733 and to import the data set. 5 00:00:08,733 --> 00:00:10,866 We learned how to take care of missing data. 6 00:00:10,866 --> 00:00:14,266 And today we're going to learn how to encode categorical data. 7 00:00:14,833 --> 00:00:16,900 So the first thing that we're going to do 8 00:00:16,900 --> 00:00:20,233 is explain why we need to do this. 9 00:00:20,666 --> 00:00:23,666 So I'm going to go to Google Sheets here to find my data set. 10 00:00:24,133 --> 00:00:28,300 And in this data set we can see that we have two categorical variables. 11 00:00:28,700 --> 00:00:31,933 We have the country variable here and the purchased variable. 12 00:00:32,633 --> 00:00:35,133 These two variables are categorical variables 13 00:00:35,133 --> 00:00:38,566 simply because they contain categories. Here 14 00:00:38,566 --> 00:00:40,833 the country contains three categories: 15 00:00:40,833 --> 00:00:43,466 France, Spain and Germany. 16 00:00:43,466 --> 00:00:47,066 And the purchased variable contains two categories, yes and no. 17 00:00:47,500 --> 00:00:50,366 So that's why they're called categorical variables. 18 00:00:50,366 --> 00:00:51,300 And now you can guess 19 00:00:51,300 --> 00:00:55,433 that, since machine learning models are based on mathematical equations, 20 00:00:55,700 --> 00:00:58,733 it would cause some problems 21 00:00:58,733 --> 00:01:01,733 if we kept the text of the categorical variables 22 00:01:01,766 --> 00:01:05,133 in the equations, because we only want numbers in the equations. 
23 00:01:05,700 --> 00:01:09,966 So that's why we need to encode the categorical variables, 24 00:01:09,966 --> 00:01:14,300 that is, to encode the text that we have here into numbers. 25 00:01:15,533 --> 00:01:15,800 Okay. 26 00:01:15,800 --> 00:01:16,300 So in this 27 00:01:16,300 --> 00:01:20,066 tutorial we are going to encode these two variables, country and purchased. 28 00:01:20,333 --> 00:01:23,133 And you're going to see that the technique is quite different. 29 00:01:23,133 --> 00:01:24,066 So here we are. 30 00:01:24,066 --> 00:01:27,066 We're going to use something very practical in R, 31 00:01:27,066 --> 00:01:28,566 that is, the factor function. 32 00:01:28,566 --> 00:01:29,633 And the factor function 33 00:01:29,633 --> 00:01:33,933 will transform your categorical variables into numeric categories. 34 00:01:34,233 --> 00:01:37,200 But it will see the variables as factors. 35 00:01:37,200 --> 00:01:40,200 And you will be able to choose the labels of those factors. 36 00:01:40,366 --> 00:01:42,000 And due to this specific reason, 37 00:01:42,000 --> 00:01:45,000 we won't need to create three columns, one for each category. 38 00:01:45,166 --> 00:01:49,666 We will just transform the country column into a column of factors, 39 00:01:50,100 --> 00:01:52,533 and we are going to specify what the factors are. 40 00:01:52,533 --> 00:01:53,333 So let's do this. 41 00:01:53,333 --> 00:01:56,333 Let's start by encoding the country column. 42 00:01:56,433 --> 00:01:58,333 So we just need to take the column country. 43 00:01:58,333 --> 00:02:02,633 And to do this we type data set dollar country 44 00:02:03,733 --> 00:02:04,333 equals. 45 00:02:04,333 --> 00:02:07,333 And then we use the factor function, factor. 46 00:02:08,133 --> 00:02:11,433 And in factor we are going to specify three things. 47 00:02:11,433 --> 00:02:13,033 So let's have a look at the factor function. 48 00:02:13,033 --> 00:02:15,700 To do this we press F1. Okay. 
49 00:02:15,700 --> 00:02:18,866 So this contains some info about the factor function. 50 00:02:18,866 --> 00:02:20,400 Let's look at the arguments. 51 00:02:20,400 --> 00:02:25,266 The first argument is the data that we want to transform into a factor. 52 00:02:25,300 --> 00:02:28,300 So this is of course going to be our 53 00:02:30,033 --> 00:02:33,033 country column from our data set. Okay. 54 00:02:33,566 --> 00:02:36,100 Then the second parameter is levels. 55 00:02:36,100 --> 00:02:39,633 And that's the names of the categories in our country column. 56 00:02:40,266 --> 00:02:43,266 So let's add here levels equals. 57 00:02:43,500 --> 00:02:47,600 So we are going to write the vector of levels here, c parenthesis. 58 00:02:47,600 --> 00:02:49,900 And then we will input here our three categories. 59 00:02:49,900 --> 00:02:50,933 So that's France, 60 00:02:52,933 --> 00:02:55,933 Spain and Germany. 61 00:02:56,700 --> 00:02:57,466 All right. 62 00:02:57,466 --> 00:03:00,466 So by the way, c here creates a vector in R. 63 00:03:00,733 --> 00:03:03,733 So by writing this c France Spain Germany 64 00:03:03,900 --> 00:03:06,900 we are creating a vector of three elements, 65 00:03:07,400 --> 00:03:10,200 that is France, Spain and Germany. 66 00:03:10,200 --> 00:03:10,533 Okay. 67 00:03:10,533 --> 00:03:11,866 And the last argument 68 00:03:11,866 --> 00:03:15,900 that we need to input is labels, because we want to choose which number 69 00:03:16,000 --> 00:03:19,000 we want to give to France, to Spain and to Germany. 70 00:03:19,066 --> 00:03:21,966 So we're going to add here labels. 71 00:03:21,966 --> 00:03:25,333 And then since we transform our country categorical variable 72 00:03:25,333 --> 00:03:28,566 into factors, we don't really care what numbers to use. 73 00:03:29,366 --> 00:03:32,133 So let's use one for France, 74 00:03:32,133 --> 00:03:35,466 two for Spain and three for Germany. 75 00:03:36,000 --> 00:03:38,500 So this has nothing to do with my preference. 
76 00:03:38,500 --> 00:03:40,366 It's not because I'm French that I chose one 77 00:03:40,366 --> 00:03:44,100 for France; it is absolutely not order related. 78 00:03:44,400 --> 00:03:47,400 So it's just by default that I use one, two, three. 79 00:03:47,566 --> 00:03:50,233 Okay. So that's it. 80 00:03:50,233 --> 00:03:54,733 I just forgot one parenthesis, and I think this should disappear. Yes. 81 00:03:54,733 --> 00:03:55,466 Good. 82 00:03:55,466 --> 00:04:00,066 Okay, so the encoding of the country categorical variable is ready. 83 00:04:00,400 --> 00:04:02,700 It's actually simpler than in Python. 84 00:04:02,700 --> 00:04:06,733 So before selecting and executing this, let's look at our data set. 85 00:04:06,866 --> 00:04:10,733 Our data set contains our country written in text. 86 00:04:11,466 --> 00:04:12,900 And now let's select this 87 00:04:14,700 --> 00:04:16,000 and execute. 88 00:04:16,000 --> 00:04:18,300 And now let's look at our data set. 89 00:04:18,300 --> 00:04:19,100 Perfect. 90 00:04:19,100 --> 00:04:21,866 Our country column contains our encoded variable. 91 00:04:21,866 --> 00:04:26,766 And as you can see, one is France, two is Spain and three is Germany. 92 00:04:27,200 --> 00:04:28,333 Okay. So that's all good. 93 00:04:28,333 --> 00:04:31,333 And now we need to do the same for the purchased column. 94 00:04:31,566 --> 00:04:32,700 And it's exactly the same. 95 00:04:32,700 --> 00:04:34,200 We're going to use the factor function. 96 00:04:34,200 --> 00:04:37,933 We are going to transform the text categories into numerical labels. 97 00:04:38,300 --> 00:04:42,700 So let's do it right now. We're going to copy this and paste it here. 98 00:04:43,066 --> 00:04:46,600 Here we're going to replace country with purchased. 99 00:04:48,000 --> 00:04:49,633 All right. 100 00:04:49,633 --> 00:04:52,633 Same for here. 101 00:04:53,400 --> 00:04:56,166 Here we need to change the levels because there are new categories. 
102 00:04:56,166 --> 00:04:59,766 So the levels are going to be no 103 00:05:01,866 --> 00:05:04,333 and yes. Okay. 104 00:05:04,333 --> 00:05:07,333 And now of course we need to change the labels. 105 00:05:07,566 --> 00:05:10,733 So we're going to put zero for no 106 00:05:11,266 --> 00:05:14,266 and one for yes. I think that's what actually makes sense. 107 00:05:14,566 --> 00:05:17,633 And parentheses. Okay. And that's ready. 108 00:05:17,633 --> 00:05:20,400 Let's look at the data set. The data set contains no 109 00:05:20,400 --> 00:05:22,166 and yes in the purchased column. 110 00:05:22,166 --> 00:05:25,666 And now if I select this and execute, 111 00:05:26,900 --> 00:05:29,600 the data set now contains zero and one. 112 00:05:29,600 --> 00:05:31,200 Perfect. 113 00:05:31,200 --> 00:05:32,100 We're all good. 114 00:05:32,100 --> 00:05:36,900 We encoded our categorical data in R, and now you know everything 115 00:05:36,900 --> 00:05:41,700 that there is to know about encoding categorical data in Python and R. 116 00:05:42,033 --> 00:05:43,466 So congratulations. 117 00:05:43,466 --> 00:05:47,166 We have almost finished the most difficult part, and we are soon approaching 118 00:05:47,166 --> 00:05:50,166 the fun and exciting part about making models. 119 00:05:50,366 --> 00:05:51,666 So we are almost there. 120 00:05:51,666 --> 00:05:55,033 Hang on for one or two more tutorials and this is going to get 121 00:05:55,033 --> 00:05:58,033 very exciting.
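
The two encoding steps dictated in this tutorial can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the exact script from the video: the data frame `dataset`, the column names `Country` and `Purchased`, the capitalized category spellings "No" and "Yes", and the sample rows are all assumed for illustration, since the transcript only gives them in spoken form.

```r
# Hypothetical stand-in for the tutorial's data set; the rows are made up.
dataset <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "France"),
  Purchased = c("No", "Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Encode Country: levels are the text categories found in the column,
# labels are the codes chosen for them (the order carries no meaning).
dataset$Country <- factor(dataset$Country,
                          levels = c("France", "Spain", "Germany"),
                          labels = c(1, 2, 3))

# Encode Purchased the same way: 0 for "No", 1 for "Yes".
dataset$Purchased <- factor(dataset$Purchased,
                            levels = c("No", "Yes"),
                            labels = c(0, 1))

# Both columns are now factors whose levels are the chosen labels.
str(dataset)
```

Note that `factor` stores the labels as level names, so the columns hold factors with levels "1", "2", "3" and "0", "1" rather than plain numeric vectors.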