1 00:00:00,810 --> 00:00:06,990 In this video, we will learn how to handle these categorical variables by creating dummy variables. 2 00:00:09,290 --> 00:00:16,280 As told in the theory lecture, whenever we have categorical variables, we need to transform them into dummy 3 00:00:16,280 --> 00:00:18,800 variables containing numerical values, 4 00:00:20,900 --> 00:00:25,730 since our regression model cannot handle non-numerical values. 5 00:00:27,110 --> 00:00:34,160 So we have these three categorical variables, airport, waterbody and bus_ter, which contain non-numerical 6 00:00:34,160 --> 00:00:40,670 values, and we will look at their categories and determine the dummies to be created. 7 00:00:42,260 --> 00:00:44,510 So I have the filtering options here. 8 00:00:45,350 --> 00:00:51,890 If you want to apply filtering options, you have to select the top row, go to the Data menu and click 9 00:00:51,890 --> 00:00:52,820 on this Filter option. 10 00:00:54,520 --> 00:01:01,150 And if I open the filtering options for the airport variable, you can see that it contains two values: 11 00:01:01,150 --> 00:01:02,070 yes and no. 12 00:01:02,560 --> 00:01:05,770 If I select yes, that is, uncheck no, 13 00:01:07,500 --> 00:01:13,950 the number of yeses in the observations can be seen by selecting all the cells, and you can see the count 14 00:01:13,950 --> 00:01:15,540 below: it is 279. 15 00:01:17,070 --> 00:01:20,480 So out of 506 observations, I have 279 16 00:01:20,670 --> 00:01:21,160 yes. 17 00:01:22,950 --> 00:01:24,750 So there are two categories in the airport 18 00:01:24,750 --> 00:01:31,350 variable. As told in the theory lecture, we need n minus one dummy variables, and n is two. 19 00:01:31,830 --> 00:01:33,870 So we need only one dummy variable. 20 00:01:35,340 --> 00:01:36,870 So we'll create a new column. 21 00:01:40,190 --> 00:01:42,410 We'll name it airport_yes.
22 00:01:47,900 --> 00:01:56,780 It will contain one whenever the value of the airport variable is yes, and it will contain zero whenever the 23 00:01:57,080 --> 00:01:59,180 value of the airport variable is no. 24 00:02:00,610 --> 00:02:03,820 So here we need to use the IF function. 25 00:02:04,840 --> 00:02:11,020 We will check if the value in the corresponding cell is yes, 26 00:02:12,690 --> 00:02:15,360 in double quotation marks, yes. 27 00:02:17,740 --> 00:02:20,140 If this is true, we want it to be one. 28 00:02:21,400 --> 00:02:23,470 If this is false, we want it to be zero. 29 00:02:25,720 --> 00:02:29,070 You can see we get one when the airport value is 30 00:02:29,080 --> 00:02:29,470 yes. 31 00:02:29,590 --> 00:02:37,480 If I drag it to all the other cells below the cell, I'm getting ones and zeros. Wherever there is no, I'm 32 00:02:37,480 --> 00:02:38,230 getting a zero. 33 00:02:38,230 --> 00:02:39,070 Wherever there is 34 00:02:39,070 --> 00:02:40,510 yes, I'm getting one. 35 00:02:42,810 --> 00:02:50,280 So the dummy variable for the airport categorical variable is ready. Before deleting the airport variable, 36 00:02:50,460 --> 00:02:54,390 we need to change this formula to values. 37 00:02:55,680 --> 00:03:00,270 So we will select all the cells and copy them. 38 00:03:02,520 --> 00:03:06,820 Then we will right-click and paste as values. 39 00:03:08,460 --> 00:03:10,800 Now it has values instead of the formula. 40 00:03:11,370 --> 00:03:12,870 Now I can delete this column. 41 00:03:19,200 --> 00:03:26,370 So we have the airport dummy variable instead of the airport categorical variable. The next categorical 42 00:03:26,370 --> 00:03:27,470 variable is waterbody. 43 00:03:28,290 --> 00:03:29,700 This variable contains 44 00:03:31,050 --> 00:03:36,460 four categories: one is lake, one is lake and river, one is none, 45 00:03:36,690 --> 00:03:38,430 and the last one is river. 46 00:03:39,480 --> 00:03:43,170 Since it has four categories, we need four minus one,
47 00:03:43,320 --> 00:03:45,050 that is, three dummy variables. 48 00:03:46,740 --> 00:03:53,940 So we will create three dummy variables, one with the name waterbody lake, one with the name waterbody river, 49 00:03:54,300 --> 00:03:58,560 and the last one with the name waterbody lake and river. 50 00:04:02,460 --> 00:04:04,890 So we need to insert three different columns. 51 00:04:11,820 --> 00:04:16,110 In the first column, I'll type waterbody_lake. 52 00:04:24,020 --> 00:04:33,650 This will have value one if the value in the corresponding waterbody variable is lake, otherwise it 53 00:04:33,650 --> 00:04:34,910 will have value zero. 54 00:04:36,320 --> 00:04:38,300 So we'll write: is equal to IF. 55 00:04:40,640 --> 00:04:43,630 If this cell has value lake, 56 00:04:46,820 --> 00:04:49,190 then it would be one, otherwise it would be zero. 57 00:04:53,550 --> 00:04:59,250 Extend it to other cells in the column by double-clicking on the bottom right corner of the cell. 58 00:05:01,780 --> 00:05:04,810 Now, in the next column, we'll type waterbody_river. 59 00:05:16,120 --> 00:05:21,300 Here, if the value is river, then it will be one, otherwise it will be zero. 60 00:05:36,540 --> 00:05:42,750 You can see that whenever the value is river, the waterbody_river variable will contain the value of 61 00:05:42,780 --> 00:05:46,710 one. Lastly, waterbody lake and river. 62 00:05:53,790 --> 00:05:54,370 Well. 63 00:05:59,220 --> 00:06:00,180 In this variable, 64 00:06:01,490 --> 00:06:05,270 if this value is lake and river, 65 00:06:11,660 --> 00:06:13,230 then it should be one, otherwise 66 00:06:13,280 --> 00:06:13,970 it would be zero. 67 00:06:20,260 --> 00:06:28,340 Double-click on the bottom right corner to extend it to other cells. So we have three variables to represent 68 00:06:28,340 --> 00:06:31,760 this one categorical variable, since it has four categories. 69 00:06:32,730 --> 00:06:36,060 Now we will select the values in all these three columns.
70 00:06:37,370 --> 00:06:39,770 Copy them, paste them as values. 71 00:06:43,110 --> 00:06:46,650 And we can delete the waterbody variable now. 72 00:06:51,920 --> 00:06:54,620 The last categorical variable is bus terminal. 73 00:06:56,890 --> 00:07:03,940 It has only one category, which is yes. If a categorical variable has only one category, it is 74 00:07:03,940 --> 00:07:04,960 not really a variable. 75 00:07:04,960 --> 00:07:11,240 It can be treated as a constant, and we do not need any dummy variable to represent this variable, since 76 00:07:11,770 --> 00:07:18,200 with n minus one and n equal to one, the count of dummy variables comes to zero. 77 00:07:19,900 --> 00:07:22,590 So we do not need this variable at all. 78 00:07:22,780 --> 00:07:24,850 We will straightaway delete this variable. 79 00:07:27,540 --> 00:07:36,180 So with this, our dataset has no blank values, it has no outliers, and all the categorical variables 80 00:07:36,180 --> 00:07:38,310 are converted to numerical values. 81 00:07:39,690 --> 00:07:42,900 With this dataset, we are ready to run our analysis.
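For readers who prefer code to spreadsheet steps, the same n minus one dummy variable logic can be sketched in plain Python. This is only an illustrative sketch, not part of the video: the helper name add_dummies, the sample rows, the exact category spellings and the choice of which category to leave out as the baseline are all assumptions made for the example.

```python
def add_dummies(rows, column, baseline):
    """Replace `column` with 0/1 dummy columns, one per category except `baseline`.

    Hypothetical helper for illustration. If `baseline` is one of the n
    categories present, this yields n - 1 dummy columns.
    """
    # Distinct categories, in order of first appearance.
    categories = list(dict.fromkeys(row[column] for row in rows))
    kept = [c for c in categories if c != baseline]
    result = []
    for row in rows:
        # Copy every column except the categorical one being replaced.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in kept:
            name = f"{column}_{cat.lower().replace(' ', '_')}"
            # Same test as the spreadsheet formula IF(cell = "<category>", 1, 0).
            new_row[name] = 1 if row[column] == cat else 0
        result.append(new_row)
    return result


# Made-up sample rows (assumed column names and values, not the real data).
rows = [
    {"airport": "YES", "waterbody": "Lake", "bus_ter": "YES"},
    {"airport": "NO", "waterbody": "River", "bus_ter": "YES"},
    {"airport": "YES", "waterbody": "None", "bus_ter": "YES"},
    {"airport": "NO", "waterbody": "Lake and River", "bus_ter": "YES"},
]

rows = add_dummies(rows, "airport", baseline="NO")      # 2 categories -> 1 dummy
rows = add_dummies(rows, "waterbody", baseline="None")  # 4 categories -> 3 dummies
rows = add_dummies(rows, "bus_ter", baseline="YES")     # 1 category  -> 0 dummies

for row in rows:
    print(row)
```

With a single-category column such as the bus terminal one, the helper simply drops the column, which matches the reasoning that a constant needs no dummy variable.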