1 00:00:00,233 --> 00:00:01,700 CAP analysis. 2 00:00:01,700 --> 00:00:05,433 We've talked about CAP a lot, and in fact, we've talked about CAP so much 3 00:00:05,433 --> 00:00:08,533 that I'm no longer even saying cumulative accuracy profile 4 00:00:08,566 --> 00:00:12,933 because I am assuming that you're entirely comfortable 5 00:00:12,933 --> 00:00:15,566 with this abbreviation and the whole term and what it means. 6 00:00:15,566 --> 00:00:18,833 So let's see how to analyze the CAP. 7 00:00:20,400 --> 00:00:21,633 As we've discussed, 8 00:00:21,633 --> 00:00:25,833 there are three lines that are important on the CAP curve. 9 00:00:25,966 --> 00:00:28,200 The blue line, which is the random line, 10 00:00:28,200 --> 00:00:34,633 when you select your samples at random. The red line, which is our model line. 11 00:00:34,766 --> 00:00:38,100 And different models will have different red lines, 12 00:00:38,100 --> 00:00:39,800 but basically it looks something like that. 13 00:00:39,800 --> 00:00:42,466 And the gray line, which is the perfect model, 14 00:00:42,466 --> 00:00:47,466 or when you have a crystal ball, when you can select all of the future 15 00:00:47,666 --> 00:00:51,966 churners or purchasers or whatever action takers, 16 00:00:52,200 --> 00:00:55,366 and you can select them right away on the dot 17 00:00:55,366 --> 00:00:58,966 without even selecting one single person that you don't want to select. 18 00:00:59,666 --> 00:01:02,166 And so these are the three main lines. 19 00:01:02,166 --> 00:01:04,866 And how do we analyze this CAP curve? 20 00:01:04,866 --> 00:01:05,866 We already know how to build it. 21 00:01:05,866 --> 00:01:07,533 But what can we derive? 22 00:01:07,533 --> 00:01:09,300 What insights can we derive from here? 23 00:01:09,300 --> 00:01:13,733 Well, it's kind of intuitive that the closer your red line is to the gray line, 24 00:01:13,733 --> 00:01:16,933 the better your model; the closer it is to the blue line, the worse.
25 00:01:17,466 --> 00:01:19,866 So how can we quantify this effect? 26 00:01:19,866 --> 00:01:24,733 Well, there is a standard approach: to calculate the accuracy ratio. 27 00:01:24,933 --> 00:01:28,933 And to calculate the accuracy ratio, you need to take the area under 28 00:01:29,000 --> 00:01:33,466 the perfect model, or the perfect line, which is colored in gray here. 29 00:01:33,466 --> 00:01:36,133 And it's called aP. And 30 00:01:37,133 --> 00:01:39,500 then you need to take the area under the red line, 31 00:01:39,500 --> 00:01:43,166 which is colored in red here, which is aR. 32 00:01:44,100 --> 00:01:46,600 And then you need to divide one by the other. 33 00:01:46,600 --> 00:01:49,600 So you need to divide aR by aP. 34 00:01:49,766 --> 00:01:53,000 And then this ratio that you get is obviously between 0 and 1. 35 00:01:53,300 --> 00:01:56,266 And the closer this ratio is to one, the better. 36 00:01:56,266 --> 00:01:59,266 The further it is away from one and closer to zero, the worse. 37 00:02:00,300 --> 00:02:03,533 However, it can be quite complicated to calculate this area under the curve. 38 00:02:03,733 --> 00:02:07,500 Statistical tools can do it for you, but how can you assess 39 00:02:07,700 --> 00:02:10,566 the CAP curve by just looking at it? 40 00:02:10,566 --> 00:02:15,100 So visually, it's not that easy to get this quantifiable metric 41 00:02:15,100 --> 00:02:16,433 just by looking at the curve. 42 00:02:16,433 --> 00:02:18,100 So there's a second approach. 43 00:02:18,100 --> 00:02:20,133 And that's what we're going to discuss now. 44 00:02:21,300 --> 00:02:23,433 Let's get rid of the areas.
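Editor's sketch: the accuracy ratio described above could be computed roughly as follows in Python with NumPy. This is a minimal illustration under stated assumptions, not the course's own code. It assumes labels are 0/1 with at least one of each, that a higher score means "more likely to act", and that both areas (aR and aP) are measured between the curve and the random line, which is what keeps the ratio at or below 1. The function names cap_curve and accuracy_ratio are made up for this example.

```python
import numpy as np

def cap_curve(y_true, y_score):
    """Return (x, y) points of a CAP curve: x is the share of the population
    contacted (highest score first), y is the share of positives captured."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")  # best scores first
    hits = np.cumsum(y_true[order])
    n = len(y_true)
    x = np.arange(0, n + 1) / n
    y = np.concatenate([[0.0], hits / hits[-1]])
    return x, y

def _area_under(x, y):
    # Trapezoid rule, written out so it does not depend on the NumPy version.
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def accuracy_ratio(y_true, y_score):
    """aR / aP, with both areas measured above the random (diagonal) line."""
    y_true = np.asarray(y_true)
    x, y_model = cap_curve(y_true, y_score)
    # Assumption: the perfect model is obtained by ranking on the true label.
    _, y_perfect = cap_curve(y_true, y_true)
    a_r = _area_under(x, y_model) - 0.5    # 0.5 is the area under the random line
    a_p = _area_under(x, y_perfect) - 0.5
    return a_r / a_p
```

With this definition a model that ranks every positive first gets a ratio of 1, a random ranking sits near 0, and a model that is worse than random comes out negative.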
45 00:02:23,433 --> 00:02:28,900 And instead of looking at the area, what you can do is look at the 50% line 46 00:02:28,900 --> 00:02:33,233 on the horizontal axis and look where it crosses your model, 47 00:02:33,433 --> 00:02:34,766 and then look at where 48 00:02:34,766 --> 00:02:38,600 the horizontal line from there crosses the vertical axis. 49 00:02:38,600 --> 00:02:44,000 So basically, how many churners will you pick up, or action takers, 50 00:02:44,000 --> 00:02:47,400 or how many positive outcomes are you going to identify 51 00:02:48,133 --> 00:02:51,200 if you take 50% of your population? 52 00:02:51,600 --> 00:02:54,766 And in this case, we can see it's around 90% or something like that. 53 00:02:55,133 --> 00:02:57,766 And just by looking at that, there's 54 00:02:57,766 --> 00:03:02,366 like a rule of thumb for how you can assess your model based on that X number. 55 00:03:02,366 --> 00:03:04,900 And here it is. Are you ready? Here we go. 56 00:03:04,900 --> 00:03:09,500 So if X is less than 60%, the model is rubbish. 57 00:03:10,266 --> 00:03:13,200 Basically it's not useful at all. 58 00:03:13,200 --> 00:03:16,200 You can create a better one. 59 00:03:16,200 --> 00:03:17,633 Probably you can create a better one. 60 00:03:17,633 --> 00:03:19,500 And you need to try again. 61 00:03:19,500 --> 00:03:23,700 If your X is between 60% and 70%, 62 00:03:23,700 --> 00:03:27,633 then the model is considered to be poor, poor or average. 63 00:03:27,633 --> 00:03:30,300 And by the way, this is my rule of thumb. 64 00:03:30,300 --> 00:03:33,300 Other people might have a different rule of thumb, but this is what I go by. 65 00:03:33,633 --> 00:03:36,966 If it's between 60% and 70%, it's a poor model, to be honest. 66 00:03:36,966 --> 00:03:39,300 Like, you can do better than that. 67 00:03:39,300 --> 00:03:44,733 If X is between 70% and 80%, that's a good model.
68 00:03:44,733 --> 00:03:48,033 That's already where you should be aiming: anything above 70%. 69 00:03:48,433 --> 00:03:50,600 That can deliver 70 00:03:50,600 --> 00:03:54,166 good quality insights to the business and actually deliver value. 71 00:03:54,633 --> 00:04:00,133 Anything between 80% and 90%, like we see here, is very good, it's extremely good. 72 00:04:00,133 --> 00:04:03,066 If you can get a model over 80%, 73 00:04:03,066 --> 00:04:06,066 that is an amazing result. 74 00:04:06,366 --> 00:04:10,133 And anything above 90% up to 100, that is just too good. 75 00:04:10,500 --> 00:04:13,600 It is too good to believe. 76 00:04:13,633 --> 00:04:19,233 One thing that you should be very careful with here is overfitting. 77 00:04:19,233 --> 00:04:22,933 If your model is showing you results like 90% or so, 78 00:04:23,033 --> 00:04:26,033 or if a model is showing 100%, then the obvious answer there is that 79 00:04:26,033 --> 00:04:29,900 one of your independent variables is actually a post facto variable, 80 00:04:29,900 --> 00:04:33,366 meaning that it shouldn't be in the data because it's looking into the future. 81 00:04:34,000 --> 00:04:36,766 The person who supplied you that variable forgot to take it out 82 00:04:36,766 --> 00:04:41,933 or forgot to explain to you that, you know, their credit score actually 83 00:04:42,200 --> 00:04:45,300 is turned into zero after they leave the bank, 84 00:04:45,300 --> 00:04:47,366 and therefore everybody with a zero credit score 85 00:04:47,366 --> 00:04:51,133 obviously has left the bank, and therefore your model is picking them up. 86 00:04:51,133 --> 00:04:53,400 Like, that is super easy. 87 00:04:53,400 --> 00:04:56,400 So if you have 100%, that's definitely one of your variables. 88 00:04:56,566 --> 00:05:00,066 Even if you have 90%, you have to check whether there could be some 89 00:05:00,366 --> 00:05:01,966 forward-looking variables.
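Editor's sketch: the 50% rule of thumb described above can be written as two small Python functions. This is only an illustration of the speaker's personal thresholds. The names captured_at and rate_model are hypothetical, labels are assumed to be 0/1 with at least one positive, and the treatment of values that fall exactly on a boundary (60%, 70%, and so on) is an assumption, because the talk does not say which bucket they belong to.

```python
import numpy as np

def captured_at(y_true, y_score, fraction=0.5):
    """Share of all positives found in the top `fraction` of the population,
    when the population is ranked by model score (highest first)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")
    k = int(fraction * len(y_true))  # number of people contacted
    return float(y_true[order][:k].sum() / y_true.sum())

def rate_model(x):
    """Map X (share of positives captured at 50% of the population, 0..1)
    to the speaker's rule-of-thumb rating. Boundary handling is assumed."""
    if x < 0.60:
        return "rubbish"
    if x < 0.70:
        return "poor"
    if x < 0.80:
        return "good"
    if x < 0.90:
        return "very good"
    return "too good: check for forward-looking variables and overfitting"

# Toy data: 10 customers, already listed from highest to lowest score.
labels = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
x = captured_at(labels, scores)  # 3 of the 4 positives are in the top 5
print(x, rate_model(x))
```

Other practitioners may use different cut-offs, as the speaker notes, so the thresholds are best kept in one place where they are easy to change.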
90 00:05:01,966 --> 00:05:03,633 The other thing is overfitting. 91 00:05:03,633 --> 00:05:07,866 You could be overfitting your model, and what that means is that 92 00:05:07,866 --> 00:05:11,100 your model has been so well fit to that 93 00:05:11,100 --> 00:05:14,466 specific data set that you supplied it with that 94 00:05:14,466 --> 00:05:18,266 it's just heavily relying on the anomalies in that data set. 95 00:05:18,466 --> 00:05:23,133 And when you feed it a new data set, like, you know, in a month's time or 96 00:05:23,133 --> 00:05:26,600 something, not training data, not the data that you trained your model on. 97 00:05:26,600 --> 00:05:29,700 And we'll talk about this a lot more, actually, in the coming tutorials. 98 00:05:29,700 --> 00:05:34,633 But if you feed this model some data that you want to actually predict on, 99 00:05:34,800 --> 00:05:36,733 then it will crash. Well, it won't crash. 100 00:05:36,733 --> 00:05:37,933 It won't perform as well. 101 00:05:37,933 --> 00:05:40,466 It will perform, you know, at the 60% mark or something. 102 00:05:40,466 --> 00:05:42,766 So that means your model is overfitted. 103 00:05:42,766 --> 00:05:44,400 And be very careful about that. 104 00:05:44,400 --> 00:05:45,866 We'll talk about overfitting more. 105 00:05:45,866 --> 00:05:49,933 In fact, in the coming tutorials we will learn how to avoid that problem. 106 00:05:50,200 --> 00:05:54,466 And finally, if you can get an X, or this 107 00:05:54,966 --> 00:05:57,566 parameter, to be between 90% and 100% and you're 108 00:05:57,566 --> 00:06:00,633 not using forward-looking parameters and you're not overfitting, 109 00:06:00,866 --> 00:06:03,866 then give me a call because I might have a job for you. 110 00:06:04,200 --> 00:06:07,466 People like that are rare, and I have 111 00:06:07,466 --> 00:06:11,966 a lot of headhunters looking for people who can do modeling like that.
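Editor's sketch: one rough way to spot the overfitting pattern described above is to compare X (the share of positives captured at 50% of the population) on the training data against X on data the model has never seen. The Python below is a sketch under assumptions, not a method taken from the talk. The names captured_at_half and looks_overfitted are hypothetical, and the max_drop tolerance of 0.10 is an arbitrary illustrative value.

```python
import numpy as np

def captured_at_half(y_true, y_score):
    """Share of positives found in the top 50% of the population by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")
    k = len(y_true) // 2
    return float(y_true[order][:k].sum() / y_true.sum())

def looks_overfitted(train, new, max_drop=0.10):
    """Compare X on training data with X on new data.

    `train` and `new` are (labels, scores) pairs. `max_drop` is a
    hypothetical tolerance: a larger fall in X from training to new data
    is flagged as a possible sign of overfitting.
    """
    x_train = captured_at_half(*train)
    x_new = captured_at_half(*new)
    return (x_train - x_new) > max_drop, x_train, x_new

# Toy example: perfect ranking on training data, much weaker on new data.
train = ([1, 1, 1, 1, 0, 0, 0, 0], [8, 7, 6, 5, 4, 3, 2, 1])
new = ([1, 0, 1, 0, 0, 1, 0, 1], [8, 7, 6, 5, 4, 3, 2, 1])
flag, x_train, x_new = looks_overfitted(train, new)
print(flag, x_train, x_new)
```

A large gap does not prove overfitting on its own, since the new data may simply differ from the training period, but it is a cheap first check before trusting a result in the "too good" range.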
112 00:06:11,966 --> 00:06:16,366 So definitely keep that in mind and look forward to seeing you then. 113 00:06:16,700 --> 00:06:18,466 Until next time, happy analyzing.