1 00:00:01,870 --> 00:00:06,320 In this session, we will understand how to create a decision tree. 2 00:00:07,310 --> 00:00:16,790 The decision tree forms the basis for random forest and XGBoost, two of the most widely used algorithms in the world 3 00:00:16,790 --> 00:00:17,110 today. 4 00:00:18,430 --> 00:00:20,910 So let's start this with an example. 5 00:00:22,080 --> 00:00:28,260 In this example, I am trying to predict the likelihood of an individual getting heart disease, and 6 00:00:28,260 --> 00:00:33,620 I am considering three factors for that: blood pressure, sugar and cholesterol. 7 00:00:34,350 --> 00:00:38,550 So heart disease, the likelihood of someone getting heart disease, 8 00:00:39,570 --> 00:00:45,090 is my dependent variable, and these three are independent variables. 9 00:00:46,000 --> 00:00:46,470 OK? 10 00:00:47,570 --> 00:00:51,210 Now, a decision tree is constructed like this. 11 00:00:51,390 --> 00:00:58,080 We start with any one of the three factors, and then go on constructing a tree like this: high sugar, 12 00:00:58,430 --> 00:00:59,960 is it true or false? 13 00:01:00,620 --> 00:01:03,440 If it is true, did the individual have high BP? 14 00:01:04,220 --> 00:01:04,580 Yes. 15 00:01:04,580 --> 00:01:09,530 The individual had high BP; did he also have high cholesterol? So on and so forth. 16 00:01:10,620 --> 00:01:13,650 The challenge in a decision tree is determining 17 00:01:15,070 --> 00:01:20,020 where to start from, that is, which factor should I start the decision tree from? 18 00:01:21,170 --> 00:01:28,490 That is known as the root node. The root node is the starting point in a decision tree. Determining the root node is 19 00:01:28,490 --> 00:01:28,850 the 20 00:01:29,820 --> 00:01:36,360 first step in constructing the decision tree, because you can start with high sugar or high cholesterol, 21 00:01:36,370 --> 00:01:36,670 right? 22 00:01:36,870 --> 00:01:38,990 So I must know where to start from.
23 00:01:39,480 --> 00:01:46,560 That is done using two measures: one is the Gini index and the other is known as entropy, or information gain. 24 00:01:47,460 --> 00:01:47,910 OK. 25 00:01:49,240 --> 00:01:58,120 Now let's see how to compute the root node using the Gini index. I have populated some values based 26 00:01:58,120 --> 00:01:59,330 on historical data. 27 00:01:59,980 --> 00:02:05,510 If you see, I am considering each of the three factors in isolation now. 28 00:02:05,810 --> 00:02:12,250 OK, so I will consider the factors in isolation for the purpose of determining which node to start 29 00:02:12,250 --> 00:02:12,520 from. 30 00:02:12,950 --> 00:02:22,720 Right, so high BP: when high BP was true, the likelihood of someone developing heart disease: out of so 31 00:02:22,720 --> 00:02:30,070 many people who had high BP, 105 people developed heart disease and 39 did not develop heart disease. 32 00:02:31,840 --> 00:02:41,680 Out of so many people not having high BP, 34 people developed heart disease and 125 people did not develop 33 00:02:41,680 --> 00:02:42,400 heart disease. 34 00:02:45,040 --> 00:02:51,320 Same is the case with high sugar and high cholesterol. So many people had high cholesterol, that is, 35 00:02:51,320 --> 00:02:55,060 true, right? Out of them, 92 36 00:02:55,100 --> 00:03:00,110 people developed heart disease and 31 did not develop heart disease. 37 00:03:01,190 --> 00:03:03,660 And some people did not have high cholesterol. 38 00:03:03,830 --> 00:03:11,330 That is why you have false. Out of them, 45 developed heart disease and 129 did not develop heart 39 00:03:11,330 --> 00:03:11,740 disease. 40 00:03:12,590 --> 00:03:12,870 Right. 41 00:03:13,190 --> 00:03:17,410 So this representation is what you do first, right?
42 00:03:18,510 --> 00:03:24,900 And then you compute the Gini index using the formula: one minus the probability of yes, the whole squared, 43 00:03:24,900 --> 00:03:27,790 minus the probability of no, the whole squared. 44 00:03:28,380 --> 00:03:32,470 OK, so I'm explaining this for BP, right? 45 00:03:32,730 --> 00:03:35,840 For high BP true, yes is 105 and no is 46 00:03:36,030 --> 00:03:40,650 39, so the total is 144, and similarly the total for false is 159. 47 00:03:40,920 --> 00:03:44,700 OK, so I will compute Gini for true and 48 00:03:44,700 --> 00:03:45,870 Gini for false. 49 00:03:46,890 --> 00:03:47,260 Right. 50 00:03:47,790 --> 00:03:56,290 So how will I do it for true? One minus 105 divided by 39 plus 105, 51 00:03:56,460 --> 00:03:59,310 that is 144, the whole squared, minus the same for the no scenario. 52 00:04:00,180 --> 00:04:02,270 I get a value of 0.395. 53 00:04:02,760 --> 00:04:08,460 And then for false I get a value of 0.336. First I consider 54 00:04:08,460 --> 00:04:08,970 yes: 55 00:04:09,180 --> 00:04:16,590 34 divided by 159, which is 34 plus 125. Then I consider no, which 56 00:04:16,590 --> 00:04:19,830 is 125, with 34 plus 125, 57 00:04:19,830 --> 00:04:21,060 that is 159, in the denominator. 58 00:04:21,070 --> 00:04:22,950 And then I square each one, right? 59 00:04:23,100 --> 00:04:25,070 I got Gini for true and false. 60 00:04:25,080 --> 00:04:28,050 Now let's find the total Gini for BP. 61 00:04:28,440 --> 00:04:35,370 That is 144 for the true and 159 for the false, out of 303 in total. 62 00:04:36,030 --> 00:04:36,360 Right. 63 00:04:36,630 --> 00:04:37,500 I multiply 64 00:04:39,130 --> 00:04:43,480 these proportions with the 0.395 and 0.336 that I found and add the results. 65 00:04:44,470 --> 00:04:48,190 I then have the total Gini for BP. 66 00:04:49,440 --> 00:04:50,110 Is this clear? 67 00:04:52,060 --> 00:04:52,500 OK.
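The blood pressure arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function names `gini` and `weighted_gini` are made up for illustration and are not from the lecture, and the total is assumed to be the size-weighted average of the two group values, which is what reproduces the 0.364 quoted for BP.

```python
def gini(yes: int, no: int) -> float:
    """Gini impurity of one group: 1 - p(yes)^2 - p(no)^2."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2


def weighted_gini(true_counts, false_counts) -> float:
    """Total Gini for a factor: each group's Gini weighted by its share of people."""
    n_true = sum(true_counts)
    n_false = sum(false_counts)
    n = n_true + n_false
    return (n_true / n) * gini(*true_counts) + (n_false / n) * gini(*false_counts)


# (developed heart disease, did not) counts quoted for blood pressure.
bp_true = (105, 39)    # people with high BP
bp_false = (34, 125)   # people without high BP

gini_true = gini(*bp_true)
gini_false = gini(*bp_false)
gini_bp = weighted_gini(bp_true, bp_false)

print(f"Gini, high BP true:  {gini_true:.3f}")
print(f"Gini, high BP false: {gini_false:.3f}")
print(f"Total Gini for BP:   {gini_bp:.3f}")
```

The three printed values should line up with the 0.395, 0.336 and 0.364 mentioned in the session.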
68 00:04:55,440 --> 00:05:03,480 I will use the same approach to compute Gini for sugar and cholesterol also. First I compute Gini 69 00:05:03,480 --> 00:05:08,640 for true and Gini for false, and then I compute the total Gini. 70 00:05:09,000 --> 00:05:09,330 Right. 71 00:05:09,810 --> 00:05:14,820 So I have the total Gini for BP, sugar and cholesterol. 72 00:05:15,360 --> 00:05:23,820 OK, in the case of BP it is 0.364, sugar is 0.36, and cholesterol is 0.381. 73 00:05:24,690 --> 00:05:31,710 So where will I start the decision tree from? I will start where the Gini is the lowest. 74 00:05:32,550 --> 00:05:36,890 OK, Gini is lowest in the case of sugar. 75 00:05:36,900 --> 00:05:40,370 So my tree will start with sugar. 76 00:05:40,830 --> 00:05:43,110 The next will be the next highest one. 77 00:05:43,440 --> 00:05:43,790 Right. 78 00:05:44,040 --> 00:05:46,620 That is BP, and then cholesterol, 79 00:05:46,620 --> 00:05:48,560 where the Gini is the highest. 80 00:05:48,870 --> 00:05:52,100 So this is how you construct the decision tree. 81 00:05:52,950 --> 00:06:00,120 If you are wondering, do you need to do so many computations, so many calculations for creating a 82 00:06:00,120 --> 00:06:01,440 decision tree algorithm? 83 00:06:02,400 --> 00:06:10,080 Don't worry, there are pre-built libraries available in Python that make all these computations very 84 00:06:10,080 --> 00:06:10,470 easy. 85 00:06:10,680 --> 00:06:15,660 OK, you just need to identify the dependent and independent variables. 86 00:06:16,290 --> 00:06:24,330 The process of developing the tree, fitting the model, is done by the library. That makes the job so 87 00:06:24,330 --> 00:06:24,870 much easier. 88 00:06:24,870 --> 00:06:25,190 Right? 89 00:06:26,190 --> 00:06:30,800 OK, we used Gini for the purpose of identifying the root. 90 00:06:31,620 --> 00:06:33,720 We can also use entropy.
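Picking the root as the factor with the lowest total Gini can be sketched in Python as follows. Several things here are assumptions rather than the lecture's own material: the helper names are illustrative, the cholesterol count of 92 is a reading of a garbled number in the transcript (it does reproduce the quoted 0.381), and the sugar counts are not given at all, so the quoted 0.36 is entered directly.

```python
def gini(yes: int, no: int) -> float:
    """Gini impurity of one group: 1 - p(yes)^2 - p(no)^2."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2


def weighted_gini(true_counts, false_counts) -> float:
    """Total Gini for a factor: group values weighted by group size."""
    n_true = sum(true_counts)
    n_false = sum(false_counts)
    n = n_true + n_false
    return (n_true / n) * gini(*true_counts) + (n_false / n) * gini(*false_counts)


total_gini = {
    # Counts are (developed heart disease, did not) for true, then false.
    "blood pressure": weighted_gini((105, 39), (34, 125)),
    # ASSUMPTION: 92 is inferred from a garbled number in the transcript.
    "cholesterol": weighted_gini((92, 31), (45, 129)),
    # Counts for sugar are not stated, so the quoted total is used as-is.
    "sugar": 0.36,
}

# Lowest total Gini first: the first entry is the root node candidate.
order = sorted(total_gini, key=total_gini.get)
root = order[0]

for name in order:
    print(f"{name:15s} {total_gini[name]:.3f}")
print("Root node:", root)
```

Libraries such as scikit-learn (for example `DecisionTreeClassifier` with `criterion="gini"`) do this search automatically when fitting. Note that they typically re-evaluate the remaining factors inside each branch rather than following one fixed global order, so the order below the root may differ from this simple ranking.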
91 00:06:34,320 --> 00:06:40,740 OK, mostly the root node will be the same both in Gini as well as in entropy. 92 00:06:41,070 --> 00:06:43,920 In a few cases they will differ. 93 00:06:44,920 --> 00:06:52,160 Right, the root will differ between Gini and entropy in a few cases. 94 00:06:53,180 --> 00:06:53,600 OK.
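To see the two measures side by side, the sketch below computes both the size-weighted Gini and the size-weighted entropy for the two factors whose counts are quoted in the session. It is an illustration under assumptions, not the lecture's own code: the entropy formula used is the standard binary Shannon entropy in bits, the function names are invented here, and the cholesterol count of 92 is a reading of a garbled number in the transcript. Sugar is left out because its counts are not given.

```python
from math import log2


def gini(yes: int, no: int) -> float:
    """Gini impurity of one group: 1 - p(yes)^2 - p(no)^2."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2


def entropy(yes: int, no: int) -> float:
    """Shannon entropy of one group in bits: -sum(p * log2(p))."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # skip empty classes, since 0 * log2(0) is taken as 0
            p = count / total
            result -= p * log2(p)
    return result


def weighted(measure, true_counts, false_counts) -> float:
    """Size-weighted average of an impurity measure over the two groups."""
    n_true = sum(true_counts)
    n_false = sum(false_counts)
    n = n_true + n_false
    return (n_true / n) * measure(*true_counts) + (n_false / n) * measure(*false_counts)


# (developed heart disease, did not) for true, then false.
factors = {
    "blood pressure": ((105, 39), (34, 125)),
    "cholesterol": ((92, 31), (45, 129)),  # ASSUMPTION: 92 is inferred
}

for name, (true_counts, false_counts) in factors.items():
    g = weighted(gini, true_counts, false_counts)
    h = weighted(entropy, true_counts, false_counts)
    print(f"{name:15s} gini={g:.3f} entropy={h:.3f}")

best_by_gini = min(factors, key=lambda n: weighted(gini, *factors[n]))
best_by_entropy = min(factors, key=lambda n: weighted(entropy, *factors[n]))
print("Lowest Gini:", best_by_gini, "| lowest entropy:", best_by_entropy)
```

On these two factors both measures prefer blood pressure over cholesterol, which fits the remark that the choice is usually the same. It says nothing about the cases where they differ.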