1 00:00:00,966 --> 00:00:03,966 In the previous tutorial we talked about the accuracy paradox. 2 00:00:04,066 --> 00:00:08,700 Hopefully now you see why we need more robust methods to assess our models. 3 00:00:09,266 --> 00:00:10,200 And today we're talking 4 00:00:10,200 --> 00:00:14,066 about the cumulative accuracy profile, which is in fact one of those methods. 5 00:00:15,233 --> 00:00:16,400 Let's look at a scenario. 6 00:00:16,400 --> 00:00:20,666 Let's say you're a data scientist at a store which sells clothes, 7 00:00:20,666 --> 00:00:24,100 and your store has a total of 100,000 customers. 8 00:00:24,133 --> 00:00:27,600 I'm placing that number on the horizontal axis, 9 00:00:28,033 --> 00:00:30,700 and you know that 10 00:00:30,700 --> 00:00:34,100 from experience, whenever you send an offer like an email 11 00:00:34,100 --> 00:00:37,400 to all your customers or to any random sample of your customers, 12 00:00:37,433 --> 00:00:41,100 approximately 10% of them respond and purchase the product. 13 00:00:41,433 --> 00:00:45,800 So I'm going to place 10,000, which is 10% of the total, 14 00:00:46,066 --> 00:00:48,300 on the vertical axis. 15 00:00:48,300 --> 00:00:53,400 And so what we're going to do is, we've got an offer that we want to send, 16 00:00:53,733 --> 00:00:59,266 and we want to see how many customers are going to purchase our product 17 00:00:59,266 --> 00:00:59,800 if we send it off. 18 00:00:59,800 --> 00:01:05,666 So if we send it to zero customers, obviously we'll get zero responses. 19 00:01:05,666 --> 00:01:06,366 Right. 20 00:01:06,366 --> 00:01:09,600 What do you think will happen if we send it to 20,000 customers? 21 00:01:09,900 --> 00:01:11,633 How many do you think will respond? 22 00:01:11,633 --> 00:01:16,400 Well, because this is a random sample and we know that about 10% respond, 23 00:01:16,400 --> 00:01:18,633 we would say about 2000 would respond. 24 00:01:18,633 --> 00:01:20,233 Fair enough. Right.
25 00:01:20,233 --> 00:01:24,333 If we send the offer to 40,000 of our customers, 26 00:01:24,633 --> 00:01:26,366 then about 4000 will respond. 27 00:01:26,366 --> 00:01:30,300 60,000: 6000. 80,000: 8000. 100,000: 28 00:01:30,833 --> 00:01:34,066 then 10,000 of our customers should respond. 29 00:01:35,466 --> 00:01:39,333 And this is a random selection process. 30 00:01:39,333 --> 00:01:42,433 So here we can draw a line which will actually represent 31 00:01:42,900 --> 00:01:45,000 this random selection. 32 00:01:45,000 --> 00:01:50,766 The slope of the line equals that 10% that we know respond 33 00:01:51,300 --> 00:01:54,966 on average to our offers if we just send them out like that. 34 00:01:55,366 --> 00:01:59,200 Now the question is, can we somehow improve this experience? 35 00:01:59,200 --> 00:02:03,133 Can we get more customers to respond to offers 36 00:02:03,800 --> 00:02:06,700 when we send out our letter? 37 00:02:06,700 --> 00:02:11,666 So basically, can we somehow target our customers more appropriately 38 00:02:11,666 --> 00:02:14,100 to get a better response rate? 39 00:02:14,100 --> 00:02:17,600 And how about instead of sending out these offers 40 00:02:17,600 --> 00:02:21,733 randomly to, say, a random sample of 20,000 customers, 41 00:02:21,800 --> 00:02:25,000 how about we pick and choose the customers we send these offers to? 42 00:02:25,366 --> 00:02:26,700 And how do we pick and choose? 43 00:02:26,700 --> 00:02:28,233 Well, to start off with, 44 00:02:28,233 --> 00:02:31,800 let's build a model just like we did in the previous section. 45 00:02:32,100 --> 00:02:34,400 Basically, a customer segmentation model, 46 00:02:34,400 --> 00:02:37,033 a geodemographic segmentation model, 47 00:02:37,033 --> 00:02:40,100 but one which won't predict whether or not they will leave the company. 48 00:02:40,333 --> 00:02:43,333 It will actually predict whether or not they will purchase a product.
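The random-selection line described above is simple enough to write down as arithmetic. The following is only a sketch of that baseline: the function name `expected_random_responders` is made up for illustration, and the 10% response rate and 100,000 customers are the figures quoted in the scenario.

```python
def expected_random_responders(contacted: int, response_rate: float = 0.10) -> float:
    """Expected number of purchasers when `contacted` customers are picked at random.

    Assumes every random sample responds at the same average rate, which is
    the premise of the random-selection line in the scenario.
    """
    return contacted * response_rate


# Points on the random-selection line for the store with 100,000 customers.
for contacted in (0, 20_000, 40_000, 60_000, 80_000, 100_000):
    print(contacted, round(expected_random_responders(contacted)))
```

The slope of this line is just the response rate, so contacting twice as many randomly chosen customers yields twice as many expected responses.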
49 00:02:43,466 --> 00:02:45,566 It's a very simple process actually. 50 00:02:45,566 --> 00:02:49,900 In fact, it's the same thing because purchase is also a binary variable: 51 00:02:49,900 --> 00:02:51,000 yes or no. 52 00:02:51,000 --> 00:02:53,000 And we can also run the same experiment. 53 00:02:53,000 --> 00:02:56,033 We can take a group of customers before we send out the offer, 54 00:02:56,033 --> 00:02:59,733 and then look back and see who purchased. Were they male or female? 55 00:02:59,766 --> 00:03:02,133 Which country were they in? 56 00:03:02,133 --> 00:03:04,666 What age predominantly were they? 57 00:03:04,666 --> 00:03:08,333 Were they browsing on mobile or were they browsing via computer? 58 00:03:08,433 --> 00:03:11,466 And all of these factors, we can take them into account, 59 00:03:12,300 --> 00:03:15,166 measure them, put them into a logistic regression 60 00:03:15,166 --> 00:03:19,266 and get a model which will help us assess the likelihood 61 00:03:19,266 --> 00:03:23,600 of certain types of customers purchasing based on their characteristics, 62 00:03:23,600 --> 00:03:26,966 so their geodemographic status and other characteristics. 63 00:03:27,900 --> 00:03:30,333 And once we've built this model, how about 64 00:03:30,333 --> 00:03:34,800 we apply it to select the customers we will send the offer to? 65 00:03:34,866 --> 00:03:37,266 So what the model will tell us, 66 00:03:37,266 --> 00:03:40,900 just like in the example in the previous section where 67 00:03:40,900 --> 00:03:44,300 female customers of a bank whose favorite color is red 68 00:03:44,300 --> 00:03:46,933 are most likely to leave the bank, here 69 00:03:46,933 --> 00:03:47,933 we'll have a similar result. 70 00:03:47,933 --> 00:03:53,033 It'll say, perhaps male customers in this certain age group 71 00:03:53,533 --> 00:03:56,700 who browse on mobile are most likely to purchase the product.
72 00:03:56,800 --> 00:04:00,100 It will tell us something like that, or it will actually rank our customers. 73 00:04:00,433 --> 00:04:03,700 It'll give them a probability of purchasing our product. 74 00:04:03,900 --> 00:04:07,266 And then we can use that probability to actually contact our customers. 75 00:04:07,500 --> 00:04:11,100 So of course if we contact zero customers, we'll get zero responses. 76 00:04:11,333 --> 00:04:14,566 But if we contact 20,000, we'll probably get a much higher number 77 00:04:14,566 --> 00:04:19,300 of responses than just 2000, because we're picking out the customers 78 00:04:19,300 --> 00:04:22,933 that are most likely to accept this offer. 79 00:04:23,233 --> 00:04:27,700 We know from their previous behavior or from the previous behavior of customers 80 00:04:27,700 --> 00:04:31,666 similar to them that they have a 90% chance 81 00:04:31,666 --> 00:04:34,466 or an 80% chance of purchasing this product. 82 00:04:34,466 --> 00:04:36,766 And we will go for them first. 83 00:04:36,766 --> 00:04:39,766 We will put them at the top of our list of people who we contact. 84 00:04:40,200 --> 00:04:44,466 Then, let's say we contact not 20,000 but 40,000. 85 00:04:44,766 --> 00:04:47,766 Our number of responses will be higher than the 4000 86 00:04:48,066 --> 00:04:50,900 which we get in the random scenario. 87 00:04:50,900 --> 00:04:55,866 If our model is really good, then by the time we're at around 60,000, 88 00:04:55,866 --> 00:04:59,300 so just over half of our total customer base, 89 00:04:59,466 --> 00:05:02,433 we are already getting to that 10,000 mark. 90 00:05:02,433 --> 00:05:06,766 So we know that 10,000 people will respond in total. 91 00:05:07,000 --> 00:05:12,033 There's no way we can get above that because that's just the response rate. 92 00:05:12,033 --> 00:05:13,966 If we contact everybody, it'll be 10,000. 93 00:05:13,966 --> 00:05:15,500 But we're getting very close already.
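The ranking step described above (score every customer, contact the highest scores first, count purchases as you go) can be sketched in a few lines. This is an illustration under assumptions, not the course's own code: `cap_points` is a hypothetical helper, and the ten scores and outcomes are made-up toy data standing in for model probabilities and observed purchases.

```python
def cap_points(scores, purchased):
    """Cumulative purchaser counts after contacting the top-k customers, for k = 0..n.

    scores    -- predicted purchase probabilities (assumed to come from some model)
    purchased -- observed outcomes, 1 if the customer purchased, else 0
    """
    # Contact customers in order of decreasing predicted probability.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    cumulative = [0]  # contacting zero customers gives zero responses
    for i in order:
        cumulative.append(cumulative[-1] + purchased[i])
    return cumulative


# Hypothetical toy data: 10 customers, 3 of whom purchased.
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4, 0.6, 0.15]
purchased = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

curve = cap_points(scores, purchased)
print(curve)
```

Plotting `curve` against the number of customers contacted gives the model's line; it can never rise above the total number of purchasers, which matches the 10,000 ceiling in the scenario.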
94 00:05:15,500 --> 00:05:20,533 So even at 60,000, we're already at 9500 respondents or purchases. 95 00:05:20,700 --> 00:05:22,566 We could actually stop here. 96 00:05:22,566 --> 00:05:26,100 We've already reached pretty much everyone who will purchase, but if we want to contact more, 97 00:05:26,700 --> 00:05:31,100 if we send it out to 80,000, we're getting even closer to 10,000 responses. 98 00:05:31,100 --> 00:05:36,700 And if we contact 100,000, we will be back at our 10,000 responses. 99 00:05:36,900 --> 00:05:39,100 So now let's draw a line through these 100 00:05:40,066 --> 00:05:41,233 crosses. 101 00:05:41,233 --> 00:05:43,666 So what you see, this line here, 102 00:05:43,666 --> 00:05:47,200 is called the cumulative accuracy profile of your model. 103 00:05:47,733 --> 00:05:50,800 And as you can imagine, the better your model, the 104 00:05:51,333 --> 00:05:54,333 larger will be the area under this line. 105 00:05:54,333 --> 00:05:56,066 So the area between the red 106 00:05:56,066 --> 00:05:59,266 and the blue lines, it will increase as your model gets better. 107 00:05:59,866 --> 00:06:02,800 And if your model is worse, then this red line 108 00:06:02,800 --> 00:06:05,800 will be closer to the blue line, so it'll be closer to random. 109 00:06:06,566 --> 00:06:09,566 The next step we want to do is convert these axes 110 00:06:09,900 --> 00:06:12,166 from absolute values to percentages, 111 00:06:12,166 --> 00:06:15,100 so they range from 0 to 100%. 112 00:06:15,100 --> 00:06:19,133 And this is how the CAP curve is normally represented. 113 00:06:19,800 --> 00:06:21,900 Now let's say we ran another regression model. 114 00:06:21,900 --> 00:06:25,533 And this time we use fewer variables, fewer independent variables.
115 00:06:25,533 --> 00:06:31,400 Maybe just because we had less access to independent variables, or we didn't see 116 00:06:31,400 --> 00:06:34,566 that there's a multicollinearity effect in our model, or something else 117 00:06:35,100 --> 00:06:38,900 went wrong. And that model, because it'll be worse, 118 00:06:39,300 --> 00:06:42,133 this is what its CAP curve will look like. 119 00:06:42,133 --> 00:06:45,600 And therefore, by plotting the CAP curves, you'll be able to compare models 120 00:06:45,600 --> 00:06:48,800 to each other and understand how much gain 121 00:06:48,800 --> 00:06:50,800 (this is also sometimes called the gain chart), 122 00:06:50,800 --> 00:06:54,200 how much gain you get in each of these models 123 00:06:54,200 --> 00:06:57,966 compared to the random scenario, or how much 124 00:06:57,966 --> 00:06:59,233 additional gain you get 125 00:06:59,233 --> 00:07:03,533 from switching from one model to the next, say from the green one to the red one. 126 00:07:03,533 --> 00:07:04,000 For instance, 127 00:07:04,000 --> 00:07:06,000 you're improving your hit ratio 128 00:07:06,000 --> 00:07:08,766 and therefore you're improving your return on investment. 129 00:07:08,766 --> 00:07:11,833 So therefore the red model is better. 130 00:07:11,833 --> 00:07:14,500 And this is how we are going to be assessing models. 131 00:07:14,500 --> 00:07:16,633 So let's label them. 132 00:07:16,633 --> 00:07:19,500 The blue line is a random selection process. 133 00:07:19,500 --> 00:07:20,833 Like, a monkey could do that. 134 00:07:20,833 --> 00:07:22,700 You just pick a random sample 135 00:07:22,700 --> 00:07:25,800 and you send the letter, or you just send it to everybody 136 00:07:26,500 --> 00:07:29,333 and you get your 100% of respondents. 137 00:07:29,333 --> 00:07:31,466 The green line is a poor model.
138 00:07:31,466 --> 00:07:32,800 So it's a model 139 00:07:32,800 --> 00:07:36,666 that is better than random, but it's still not as good as the red one. 140 00:07:37,066 --> 00:07:39,100 The red one is a good model. 141 00:07:39,100 --> 00:07:42,600 As you can see here, at around the 50% mark, 142 00:07:42,600 --> 00:07:45,133 we're getting just over 80% responses. 143 00:07:45,133 --> 00:07:47,133 That's considered a good model. 144 00:07:47,133 --> 00:07:50,733 And there's one more line that you can think of here. 145 00:07:51,166 --> 00:07:53,566 And it's this line. 146 00:07:53,566 --> 00:07:56,566 This line is the ideal line. And 147 00:07:57,566 --> 00:07:59,400 this is what would happen 148 00:07:59,400 --> 00:08:03,333 if you had a crystal ball. If you could predict 149 00:08:03,766 --> 00:08:07,333 exactly who is going to purchase and contact those people, 150 00:08:07,500 --> 00:08:09,000 this is what it would look like. 151 00:08:09,000 --> 00:08:12,000 Why? Well, because if you look at 152 00:08:13,200 --> 00:08:16,200 the place where that split occurs, 153 00:08:16,766 --> 00:08:19,333 you will see that it's at exactly 10% on the horizontal axis. 154 00:08:19,333 --> 00:08:23,700 As you remember, we know that only 10% of our customers ever purchase. 155 00:08:24,066 --> 00:08:29,066 So basically you're saying that on the horizontal axis, I'm going to take 10%, 156 00:08:29,800 --> 00:08:33,333 and each and every single one of those customers 157 00:08:33,333 --> 00:08:36,200 I pick in that 10%, they are going to be those that purchase. 158 00:08:36,200 --> 00:08:39,266 That means I will go straight to 100%. 159 00:08:39,733 --> 00:08:45,300 This last scenario actually took me a while to get my head around 160 00:08:45,300 --> 00:08:48,800 when I first heard about it, because I never understood: 161 00:08:48,800 --> 00:08:50,433 why is this split at the top? 162 00:08:50,433 --> 00:08:52,133 Why does it break like that?
163 00:08:52,133 --> 00:08:53,166 But that's exactly the reason. 164 00:08:53,166 --> 00:08:56,233 Because you can imagine that 165 00:08:56,233 --> 00:08:59,400 you have a crystal ball and you contact in the first 10%, 166 00:08:59,400 --> 00:09:02,400 or however many in your specific 167 00:09:02,533 --> 00:09:05,300 business scenario, the customers who ever purchase. 168 00:09:05,300 --> 00:09:07,133 You contact them right away, 169 00:09:07,133 --> 00:09:10,233 and then it's just flat from there, because it doesn't matter 170 00:09:10,233 --> 00:09:12,133 how many more you contact, they're not going to purchase. 171 00:09:12,133 --> 00:09:13,600 That's just the reality of things. 172 00:09:14,833 --> 00:09:18,133 And those are the curves that you can have on a CAP chart. 173 00:09:18,166 --> 00:09:21,166 If you ever see a model that goes under the blue line 174 00:09:21,266 --> 00:09:22,566 (I didn't even draw one here), 175 00:09:22,566 --> 00:09:26,400 if that happens, that's a very bad 176 00:09:26,400 --> 00:09:29,933 model. It is basically doing you a disservice 177 00:09:30,266 --> 00:09:31,900 if you see the curve under 178 00:09:31,900 --> 00:09:34,866 the blue line. And we'll talk about model deterioration further 179 00:09:34,866 --> 00:09:37,866 in the course when we're talking about maintaining your models. 180 00:09:38,200 --> 00:09:41,400 So that's it for the introduction to the CAP curve. 181 00:09:41,400 --> 00:09:45,600 We'll be using the CAP curve very actively in this section to assess our model. 182 00:09:45,866 --> 00:09:47,233 And in fact we'll actually build 183 00:09:47,233 --> 00:09:51,166 two of them, one for our model and one for our test data. 184 00:09:51,166 --> 00:09:53,400 So that will be very interesting to compare. 185 00:09:53,400 --> 00:09:56,400 One last thing I wanted to mention is to note 186 00:09:56,400 --> 00:10:00,233 that we have the CAP, which is the cumulative accuracy profile.
187 00:10:00,500 --> 00:10:03,800 And we have the ROC, which is the receiver operating characteristic. 188 00:10:04,133 --> 00:10:05,866 And a lot of people get these things confused, 189 00:10:05,866 --> 00:10:11,966 myself included. I used to get them confused. 190 00:10:11,966 --> 00:10:16,400 I even tried proving one time to a colleague of mine, 191 00:10:16,400 --> 00:10:19,166 who knew this stuff really well at the time 192 00:10:19,166 --> 00:10:21,566 when I was just learning it, that he was wrong. 193 00:10:21,566 --> 00:10:24,566 That was a funny experience. 194 00:10:24,700 --> 00:10:26,266 But they're not the same thing. 195 00:10:26,266 --> 00:10:28,100 So the cumulative accuracy profile is what we've talked about, 196 00:10:28,100 --> 00:10:29,600 and the receiver 197 00:10:29,600 --> 00:10:33,366 operating characteristic we won't be covering in this course. 198 00:10:33,600 --> 00:10:36,600 It'll be in my advanced course on statistics. 199 00:10:36,966 --> 00:10:38,366 It's very similar. It looks similar. 200 00:10:38,366 --> 00:10:40,300 And that's why a lot of people get confused. And actually, 201 00:10:41,500 --> 00:10:42,600 I think 202 00:10:42,600 --> 00:10:46,033 the other reason is that the ROC curve is in Wikipedia, 203 00:10:46,033 --> 00:10:47,200 there's an article for the ROC curve, 204 00:10:47,200 --> 00:10:51,300 but there isn't one in English for the cumulative accuracy profile. 205 00:10:51,300 --> 00:10:55,433 So it's quite hard to find information on the CAP curve 206 00:10:55,633 --> 00:10:58,233 just by searching in Google. 207 00:10:58,233 --> 00:11:01,433 So maybe you'll be the first person to write 208 00:11:01,433 --> 00:11:04,433 a Wikipedia article on the CAP curve. 209 00:11:04,500 --> 00:11:07,166 Who knows? Maybe. 210 00:11:07,166 --> 00:11:07,933 Anyway, 211 00:11:07,933 --> 00:11:09,600 I look forward to seeing you in the next tutorial, 212 00:11:09,600 --> 00:11:13,133 where we will be working with the CAP curve.
213 00:11:13,833 --> 00:11:16,833 And until then, happy analyzing.
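To tie the pieces of this tutorial together, here is a rough sketch of the percentage-axis CAP curve, the crystal-ball line, and the "area between the model line and the random line" comparison. Everything named here is hypothetical: `cap_fractions`, `area_under` and `accuracy_ratio` are illustrative helpers, the toy data is invented, and dividing the model's area by the crystal-ball area is one common way to put the comparison on a 0-to-1 scale rather than something stated in the tutorial.

```python
def cap_fractions(scores, purchased):
    """CAP curve on percentage axes: (fraction contacted, fraction of purchasers reached)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n, total = len(scores), sum(purchased)
    xs, ys, hits = [0.0], [0.0], 0
    for k, i in enumerate(order, start=1):
        hits += purchased[i]
        xs.append(k / n)
        ys.append(hits / total)
    return xs, ys


def area_under(xs, ys):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((xs[j + 1] - xs[j]) * (ys[j] + ys[j + 1]) / 2 for j in range(len(xs) - 1))


def accuracy_ratio(scores, purchased):
    """Area between model and random line, divided by the same area for a crystal ball."""
    xs, ys = cap_fractions(scores, purchased)
    # Crystal ball: rank customers by the true outcome, so purchasers come first.
    px, py = cap_fractions(purchased, purchased)
    random_area = 0.5  # area under the diagonal random-selection line
    return (area_under(xs, ys) - random_area) / (area_under(px, py) - random_area)


# Hypothetical toy data: 10 customers, 3 purchasers, two competing score sets.
purchased = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
good_scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4, 0.6, 0.15]
weak_scores = [0.5, 0.6, 0.4, 0.7, 0.3, 0.8, 0.2, 0.9, 0.1, 0.45]

print(accuracy_ratio(good_scores, purchased))
print(accuracy_ratio(weak_scores, purchased))
```

On this toy data the better-ranked scores give the larger ratio, a crystal ball gives exactly 1, and a curve that dips below the random line would show up as a negative value.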