Hello, everyone. In this video we will learn how to build a classification tree in Python, and we will follow exactly the same steps that we followed while creating our regression tree. So let's start.

First, we will import all the important libraries. I have copied this code from my regression tree notebook; you can copy it from there as well. Good.

Here I will import my movie classification CSV. We have already discussed the variables of this data, so let's just import it, and to view the top five rows of the data frame we will run head(). These are the top five rows, and you can see the variable names at the top of the table. Our dependent variable is Start_Tech_Oscar, and since this is a classification problem, our dependent variable takes only the values zero and one.

Now, to get details about the type of each variable and the count of values in each variable, we can use info(). You can see "506 entries": there are 506 observations, and all our variable names are listed on the left-hand side of this table. We also have the count of non-null values, and you can see the Time_taken variable has some NaNs in it. Now we need to do missing value imputation for our Time_taken variable.
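The loading and inspection steps described above can be sketched as below. Since the course file (referred to in the video as the movie classification CSV, assumed here to be named Movie_classification.csv) is not distributed with this transcript, a tiny inline stand-in with a few of the columns named in the video takes its place; the real file has 506 rows and more columns.

```python
import io
import pandas as pd

# Stand-in for the course's movie classification CSV; the second row has a
# missing Time_taken value, mirroring the NaNs the video finds with info().
csv_text = """Time_taken,3D_available,Genre,Start_Tech_Oscar
109.6,YES,Thriller,1
,NO,Drama,0
147.3,YES,Comedy,1
185.0,NO,Drama,0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # top five rows (here: all four stand-in rows)
df.info()         # dtypes and non-null counts; Time_taken shows one missing value
```

With the real file you would call `pd.read_csv("Movie_classification.csv")` instead of reading from a string.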
First, let's calculate the mean of this column. So the mean of the Time_taken variable is 157, and we will impute all the missing values with this mean value. We will use the fillna() method and provide a value which is equal to the mean of this variable. And since we want permanent changes in our data frame, we will give the inplace parameter a value of True.

Let's run this, and let's try to run info() again. Now you can see that we have imputed the missing values, and the count is now 506.

Again, there are two string variables, i.e. categorical variables: 3D_available and Genre. You can identify the categorical, or string, variables using the Dtype column; the dtype of string variables is object.

Now, as we have already discussed, to convert our categorical variables into dummy variables we can use the get_dummies() method of pandas. Here we first have to mention the data, so that is df. Then we have to mention the column names that contain our categorical variables; our variable names are 3D_available and Genre. And since we are providing two variable names, we have to put them in a list: I have put these two values inside square brackets, and square brackets represent a list. Then there is one more parameter, because we want only n minus 1 dummies from get_dummies.
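The mean imputation and dtype inspection described above can be sketched as follows, again on a small stand-in frame (the 157 mean the video reports comes from the real 506-row data):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; one missing Time_taken value to impute.
df = pd.DataFrame({
    "Time_taken": [109.6, np.nan, 147.3, 185.0],
    "3D_available": ["YES", "NO", "YES", "NO"],
    "Genre": ["Thriller", "Drama", "Comedy", "Drama"],
    "Start_Tech_Oscar": [1, 0, 1, 0],
})

mean_time = df["Time_taken"].mean()

# Fill the missing values permanently; the video passes inplace=True to
# fillna, and the dict form below scopes the fill to the Time_taken column.
df.fillna({"Time_taken": mean_time}, inplace=True)

print(df["Time_taken"].isna().sum())  # no missing values remain
print(df.dtypes)                      # 3D_available and Genre show dtype "object"
```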
To get n minus 1 dummies rather than n, we have to give the drop_first parameter a value of True.

So let's run this, and let's again take a sample of five values of our data frame. You can see that we now have dummy variables for 3D_available and Genre respectively.

Now we need to divide our data frame into X and y. X stands for our independent variables and y stands for our dependent variable. So we need to select all our variables as independent variables except the Start_Tech_Oscar variable, since that is our dependent variable. We can do that by using the loc method: we can write df.loc, and then, since we want all the rows, we will put just a colon symbol. Then, after the comma, we need to mention the columns, and here we are mentioning all the columns except the Start_Tech_Oscar column.

So always remember, for loc you have to provide two parameters, one for rows and one for columns, and these two should be separated by a comma in between. If you want all the rows, just put a colon before the comma, and if you want all the columns, just put a colon after the comma.

Let's run this, and let's look at the first five rows of our X data frame. You can see we have all the variables except the Start_Tech_Oscar column.
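The dummy-variable step and the X selection described above can be sketched like this (stand-in frame again; the Genre categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Time_taken": [109.6, 147.3, 147.3, 185.0],
    "3D_available": ["YES", "NO", "YES", "NO"],
    "Genre": ["Thriller", "Drama", "Comedy", "Drama"],
    "Start_Tech_Oscar": [1, 0, 1, 0],
})

# Each categorical column with n categories becomes n-1 dummy columns,
# because drop_first=True drops the first category of each variable.
df = pd.get_dummies(df, columns=["3D_available", "Genre"], drop_first=True)
print(df.columns.tolist())

# X: all rows (the colon before the comma), every column except the target.
X = df.loc[:, df.columns != "Start_Tech_Oscar"]
print(X.head())
```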
Now let's review the number of rows and columns in our X data frame. So overall there are 506 rows in our X data frame.

Now let's create our dependent variable, which is the y variable. Here we need only the Start_Tech_Oscar column from our data frame, so we can simply write df and mention the column name inside the square brackets. Let's look at the first five values: you can see the values are in the form of zero and one. Let's check the shape again; you can see there are 506 observations in total.

The next thing we have to do is divide our X and y variables into test and train splits. Remember, we did the same process for our regression tree as well. We will use train_test_split from sklearn.model_selection. The output of this function is in the form of four series or data frames, which we are saving into X_train, X_test, y_train and y_test.

Again, we are selecting a train sample size of 80 percent and a test size of 20 percent; that's why we have provided 0.2 in our test_size parameter. And here, in random_state, you can mention any number, but in this course we are always going to use zero, to get the same split for every model we create.

Let's run this.
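The split described above can be sketched as below; since the real X and y are not available here, a synthetic 506-row frame with one illustrative column stands in:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with 506 rows, matching the size of the course data.
rng = np.random.default_rng(0)
X = pd.DataFrame({"Time_taken": rng.normal(157, 20, size=506)})
y = pd.Series(rng.integers(0, 2, size=506), name="Start_Tech_Oscar")

# 80/20 split; a fixed random_state reproduces the same split on every run,
# which is why the course always passes the same number.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (404, 1) (102, 1)
```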
Let's look at the X_train dataset. You can see this is our train data; the Start_Tech_Oscar column is not here. Let's check its shape to know the number of observations in this dataset. So 80 percent of 506 is 404; that's why we have 404 observations in our X_train data, and in our X_test data we should have 102 observations.

So we have divided our data, which had 506 observations, into X_train and X_test containing 404 and 102 observations respectively.
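The 404/102 arithmetic above can be checked directly; note that 506 × 0.2 = 101.2, and scikit-learn rounds the test portion up, which is how 102 test rows (and the remaining 404 train rows) come about:

```python
import math

n = 506
test_size = 0.2

# scikit-learn takes the ceiling for the test split and gives the rest to train.
n_test = math.ceil(n * test_size)
n_train = n - n_test

print(n_train, n_test)  # 404 102
```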