Hello, everyone. In this video we will learn how to build a classification tree in Python, and we will follow exactly the same steps that we followed while creating our regression tree. So let's start.

First, we will import all the important libraries. I have copied this code from my regression tree notebook; you can copy it from there as well. Good.

Here I will import my movie classification CSV. We have already discussed the variables of this data, so let's just import it, and to view the top five rows of the data frame we will run head(). These are the top five rows, and you can see the variable names at the top of the table. Our dependent variable is Start_Tech_Oscar, and since this is a classification problem, our dependent variable takes only the values zero and one.

Now, to get details about the type of each variable and the count of values in each variable, we can use info(). You can see "506 entries": there are 506 observations, and all our variable names are listed on the left-hand side of this table. We also have the count of non-null values, and you can see the Time_taken variable has some NaNs in it. Now we need to do missing value imputation for our Time_taken variable.
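The loading and inspection steps described above can be sketched as below. Since the course file (referred to in the video as the movie classification CSV, assumed here to be named Movie_classification.csv) is not distributed with this transcript, a tiny inline stand-in with a few of the columns named in the video takes its place; the real file has 506 rows and more columns.

```python
import io
import pandas as pd

# Stand-in for the course's movie classification CSV; the second row has a
# missing Time_taken value, mirroring the NaNs the video finds with info().
csv_text = """Time_taken,3D_available,Genre,Start_Tech_Oscar
109.6,YES,Thriller,1
,NO,Drama,0
147.3,YES,Comedy,1
185.0,NO,Drama,0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # top five rows (here: all four stand-in rows)
df.info()         # dtypes and non-null counts; Time_taken shows one missing value
```

With the real file you would call `pd.read_csv("Movie_classification.csv")` instead of reading from a string.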
First, let's calculate the mean of this column. So the mean of the Time_taken variable is 157, and we will impute all the missing values with this mean value. We will use the fillna() method and provide a value which is equal to the mean of this variable. And since we want permanent changes in our data frame, we will give the inplace parameter a value of True.

Let's run this, and let's try to run info() again. Now you can see that we have imputed the missing values, and the count is now 506.

Again, there are two string variables, i.e. categorical variables: 3D_available and Genre. You can identify the categorical, or string, variables using the Dtype column; the dtype of string variables is object.

Now, as we have already discussed, to convert our categorical variables into dummy variables we can use the get_dummies() method of pandas. Here we first have to mention the data, so that is df. Then we have to mention the column names that contain our categorical variables; our variable names are 3D_available and Genre. And since we are providing two variable names, we have to put them in a list: I have put these two values inside square brackets, and square brackets represent a list. Then there is one more parameter, because we want only n minus 1 dummies from get_dummies.
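The mean imputation and dtype inspection described above can be sketched as follows, again on a small stand-in frame (the 157 mean the video reports comes from the real 506-row data):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; one missing Time_taken value to impute.
df = pd.DataFrame({
    "Time_taken": [109.6, np.nan, 147.3, 185.0],
    "3D_available": ["YES", "NO", "YES", "NO"],
    "Genre": ["Thriller", "Drama", "Comedy", "Drama"],
    "Start_Tech_Oscar": [1, 0, 1, 0],
})

mean_time = df["Time_taken"].mean()

# Fill the missing values permanently; the video passes inplace=True to
# fillna, and the dict form below scopes the fill to the Time_taken column.
df.fillna({"Time_taken": mean_time}, inplace=True)

print(df["Time_taken"].isna().sum())  # no missing values remain
print(df.dtypes)                      # 3D_available and Genre show dtype "object"
```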
To get n minus 1 dummies rather than n, we have to give the drop_first parameter a value of True.

So let's run this, and let's again take a sample of five values of our data frame. You can see that we now have dummy variables for 3D_available and Genre respectively.

Now we need to divide our data frame into X and y. X stands for our independent variables and y stands for our dependent variable. So we need to select all our variables as independent variables except the Start_Tech_Oscar variable, since that is our dependent variable. We can do that by using the loc method: we can write df.loc, and then, since we want all the rows, we will put just a colon symbol. Then, after the comma, we need to mention the columns, and here we are mentioning all the columns except the Start_Tech_Oscar column.

So always remember, for loc you have to provide two parameters, one for rows and one for columns, and these two should be separated by a comma in between. If you want all the rows, just put a colon before the comma, and if you want all the columns, just put a colon after the comma.

Let's run this, and let's look at the first five rows of our X data frame. You can see we have all the variables except the Start_Tech_Oscar column.
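The dummy-variable step and the X selection described above can be sketched like this (stand-in frame again; the Genre categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Time_taken": [109.6, 147.3, 147.3, 185.0],
    "3D_available": ["YES", "NO", "YES", "NO"],
    "Genre": ["Thriller", "Drama", "Comedy", "Drama"],
    "Start_Tech_Oscar": [1, 0, 1, 0],
})

# Each categorical column with n categories becomes n-1 dummy columns,
# because drop_first=True drops the first category of each variable.
df = pd.get_dummies(df, columns=["3D_available", "Genre"], drop_first=True)
print(df.columns.tolist())

# X: all rows (the colon before the comma), every column except the target.
X = df.loc[:, df.columns != "Start_Tech_Oscar"]
print(X.head())
```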
Now let's review the number of rows and columns in our X data frame. So overall there are 506 rows in our X data frame.

Now let's create our dependent variable, which is the y variable. Here we need only the Start_Tech_Oscar column from our data frame, so we can simply write df and mention the column name inside the square brackets. Let's look at the first five values: you can see the values are in the form of zero and one. Let's check the shape again; you can see there are 506 observations in total.

The next thing we have to do is divide our X and y variables into test and train splits. Remember, we did the same process for our regression tree as well. We will use train_test_split from sklearn.model_selection. The output of this function is in the form of four series or data frames, which we are saving into X_train, X_test, y_train and y_test.

Again, we are selecting a train sample size of 80 percent and a test size of 20 percent; that's why we have provided 0.2 in our test_size parameter. And here, in random_state, you can mention any number, but in this course we are always going to use zero, to get the same split for every model we create.

Let's run this.
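The split described above can be sketched as below; since the real X and y are not available here, a synthetic 506-row frame with one illustrative column stands in:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with 506 rows, matching the size of the course data.
rng = np.random.default_rng(0)
X = pd.DataFrame({"Time_taken": rng.normal(157, 20, size=506)})
y = pd.Series(rng.integers(0, 2, size=506), name="Start_Tech_Oscar")

# 80/20 split; a fixed random_state reproduces the same split on every run,
# which is why the course always passes the same number.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (404, 1) (102, 1)
```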
Let's look at the X_train dataset. You can see this is our train data; the Start_Tech_Oscar column is not here. Let's check its shape to know the number of observations in this dataset. So 80 percent of 506 is 404; that's why we have 404 observations in our X_train data, and in our X_test data we should have 102 observations.

So we have divided our data, which had 506 observations, into X_train and X_test containing 404 and 102 observations respectively.
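The 404/102 arithmetic above can be checked directly; note that 506 × 0.2 = 101.2, and scikit-learn rounds the test portion up, which is how 102 test rows (and the remaining 404 train rows) come about:

```python
import math

n = 506
test_size = 0.2

# scikit-learn takes the ceiling for the test split and gives the rest to train.
n_test = math.ceil(n * test_size)
n_train = n - n_test

print(n_train, n_test)  # 404 102
```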