1
00:00:00,966 --> 00:00:03,966
In the previous tutorial
we talked about the accuracy paradox.

2
00:00:04,066 --> 00:00:08,700
Hopefully now you see why we need
more robust methods to assess our models.

3
00:00:09,266 --> 00:00:10,200
And today we're talking

4
00:00:10,200 --> 00:00:14,066
about the cumulative accuracy profile,
which is in fact one of those methods.

5
00:00:15,233 --> 00:00:16,400
Let's look at a scenario.

6
00:00:16,400 --> 00:00:20,666
Let's say you're a data scientist at a
store which sells clothes,

7
00:00:20,666 --> 00:00:24,100
and your store has a total
of 100,000 customers.

8
00:00:24,133 --> 00:00:27,600
I'm placing that number
on the horizontal axis

9
00:00:28,033 --> 00:00:30,700
and you know that

10
00:00:30,700 --> 00:00:34,100
from experience,
whenever you send an offer like an email

11
00:00:34,100 --> 00:00:37,400
to all your customers
or to any random sample of your customers,

12
00:00:37,433 --> 00:00:41,100
approximately 10% of them
respond and purchase the product.

13
00:00:41,433 --> 00:00:45,800
So I'm going to place 10,000,
which is 10% of the total,

14
00:00:46,066 --> 00:00:48,300
on the vertical axis.

15
00:00:48,300 --> 00:00:53,400
And so what we're going to do is this:
we've got an offer that we want to send,

16
00:00:53,733 --> 00:00:59,266
and we want to see how many customers
are going to purchase our product.

17
00:00:59,266 --> 00:00:59,800
We send it off.

18
00:00:59,800 --> 00:01:05,666
So if we send it to zero customers,
obviously we'll get zero responses.

19
00:01:05,666 --> 00:01:06,366
Right.

20
00:01:06,366 --> 00:01:09,600
What do you think will happen
if we send it to 20,000 customers?

21
00:01:09,900 --> 00:01:11,633
How many do you think will respond?

22
00:01:11,633 --> 00:01:16,400
Well, this is a random sample,
and we know that about 10% respond.

23
00:01:16,400 --> 00:01:18,633
So we would say about 2,000 would respond.

24
00:01:18,633 --> 00:01:20,233
Fair enough. Right.

25
00:01:20,233 --> 00:01:24,333
If we send the offer
to 40,000 of our customers,

26
00:01:24,633 --> 00:01:26,366
then about 4,000 will respond.

27
00:01:26,366 --> 00:01:30,300
60,000: 6,000. 80,000: 8,000. And at 100,000?

28
00:01:30,833 --> 00:01:34,066
Then 10,000 of
our customers should respond.

29
00:01:35,466 --> 00:01:39,333
And this is a random selection process.

30
00:01:39,333 --> 00:01:42,433
So here we can draw a line
which will actually represent

31
00:01:42,900 --> 00:01:45,000
this random selection.

32
00:01:45,000 --> 00:01:50,766
The slope of the line equals that
10% that we know respond

33
00:01:51,300 --> 00:01:54,966
on average to our offers
if we just send them out like that.

34
00:01:55,366 --> 00:01:59,200
Now the question is,
can we somehow improve this experience?

35
00:01:59,200 --> 00:02:03,133
Can we get more customers
to respond to offers

36
00:02:03,800 --> 00:02:06,700
when we send out our letter?

37
00:02:06,700 --> 00:02:11,666
So basically, can we somehow
target our customers more appropriately,

38
00:02:11,666 --> 00:02:14,100
so as to get a better response rate?

39
00:02:14,100 --> 00:02:17,600
And how about instead of
sending out these offers

40
00:02:17,600 --> 00:02:21,733
randomly to, say,
a random sample of 20,000 customers,

41
00:02:21,800 --> 00:02:25,000
how about we pick and choose the customers
we send these offers to?

42
00:02:25,366 --> 00:02:26,700
And how do we pick and choose?

43
00:02:26,700 --> 00:02:28,233
Well, to start off with,

44
00:02:28,233 --> 00:02:31,800
let's build a model just like we did
in the previous section.

45
00:02:32,100 --> 00:02:34,400
Basically, a customer segmentation model,

46
00:02:34,400 --> 00:02:37,033
a geodemographic segmentation model,

47
00:02:37,033 --> 00:02:40,100
but one which won't predict whether or not
they will leave the company.

48
00:02:40,333 --> 00:02:43,333
It will actually predict whether or not
they will purchase a product.

49
00:02:43,466 --> 00:02:45,566
It's a very similar process actually.

50
00:02:45,566 --> 00:02:49,900
In fact, it's the same thing because
purchase is also a binary variable.

51
00:02:49,900 --> 00:02:51,000
Yes or no.

52
00:02:51,000 --> 00:02:53,000
And we can also run the same experiment.

53
00:02:53,000 --> 00:02:56,033
We can take a group of customers
before we send out the offer,

54
00:02:56,033 --> 00:02:59,733
and then look back and see who purchased.
Were they male or female?

55
00:02:59,766 --> 00:03:02,133
Which country were they in?

56
00:03:02,133 --> 00:03:04,666
What age predominantly were they?

57
00:03:04,666 --> 00:03:08,333
Were they browsing on mobile
or were they browsing via computer?

58
00:03:08,433 --> 00:03:11,466
And all of these factors,
we can take them into account,

59
00:03:12,300 --> 00:03:15,166
measure them,
put them into a logistic regression

60
00:03:15,166 --> 00:03:19,266
and get a model
which will help us assess the likelihood

61
00:03:19,266 --> 00:03:23,600
of certain types of customers
purchasing based on their characteristics,

62
00:03:23,600 --> 00:03:26,966
so their geodemographic status
and other characteristics.

63
00:03:27,900 --> 00:03:30,333
And once we've built this model, how about

64
00:03:30,333 --> 00:03:34,800
we apply it to select the customers
we will send the offer to?

65
00:03:34,866 --> 00:03:37,266
So what will the model tell us?

66
00:03:37,266 --> 00:03:40,900
Just like in the example
in the previous section, where

67
00:03:40,900 --> 00:03:44,300
female customers of a bank
whose favorite color is red

68
00:03:44,300 --> 00:03:46,933
were most likely to leave the bank,
here

69
00:03:46,933 --> 00:03:47,933
we'll have a similar result.

70
00:03:47,933 --> 00:03:53,033
It'll say, perhaps
male customers in this certain age group,

71
00:03:53,533 --> 00:03:56,700
who browse on mobile,
are most likely to purchase the product.

72
00:03:56,800 --> 00:04:00,100
It will tell us something like that,
or rather it will actually rank our customers.

73
00:04:00,433 --> 00:04:03,700
It'll give them a probability
of purchasing our product.

74
00:04:03,900 --> 00:04:07,266
And then we can use that probability
to actually contact our customers.

75
00:04:07,500 --> 00:04:11,100
So of course, if we contact zero customers,
we'll get zero responses.

76
00:04:11,333 --> 00:04:14,566
But if we contact 20,000,
we'll probably get many more responses

77
00:04:14,566 --> 00:04:19,300
than just 2,000,
because we're picking out the customers

78
00:04:19,300 --> 00:04:22,933
that are the most likely
to accept this offer.

79
00:04:23,233 --> 00:04:27,700
We know from their previous behavior
or from the previous behavior of customers

80
00:04:27,700 --> 00:04:31,666
similar to them,
that they have a 90% chance

81
00:04:31,666 --> 00:04:34,466
or an 80% chance
of purchasing this product.

82
00:04:34,466 --> 00:04:36,766
And we will go for them first.

83
00:04:36,766 --> 00:04:39,766
We will put them at the top
of our list of people who we contact.

84
00:04:40,200 --> 00:04:44,466
Then, let's say we contact
not 20,000 but 40,000.

85
00:04:44,766 --> 00:04:47,766
Our number of responses
will be higher than the 4,000

86
00:04:48,066 --> 00:04:50,900
which we get in the random scenario.

87
00:04:50,900 --> 00:04:55,866
If our model is really good, then
by the time we're at around 60,000,

88
00:04:55,866 --> 00:04:59,300
so
just over half of our total customer base,

89
00:04:59,466 --> 00:05:02,433
we are already getting to that
10,000 mark.

90
00:05:02,433 --> 00:05:06,766
So we know that 10,000 people will respond
in total.

91
00:05:07,000 --> 00:05:12,033
There's no way we can get above that
because that's just the response rate.

92
00:05:12,033 --> 00:05:13,966
If we contact everybody, it'll be 10,000.

93
00:05:13,966 --> 00:05:15,500
But we're getting very close already.

94
00:05:15,500 --> 00:05:20,533
So even at 60,000, we're already
at 9,500 respondents or purchases.

95
00:05:20,700 --> 00:05:22,566
We could actually stop here.

96
00:05:22,566 --> 00:05:26,100
We've already contacted pretty much
everyone who will buy. But if we want to contact more,

97
00:05:26,700 --> 00:05:31,100
if we send it out to 80,000, we're getting
even closer to 10,000 responses.

98
00:05:31,100 --> 00:05:36,700
And if we contact 100,000, we will
still be back at our 10,000 responses.

99
00:05:36,900 --> 00:05:39,100
So now let's draw a line through these

100
00:05:40,066 --> 00:05:41,233
crosses.

101
00:05:41,233 --> 00:05:43,666
So what you see, this line here

102
00:05:43,666 --> 00:05:47,200
is called the cumulative accuracy
profile of your model.

103
00:05:47,733 --> 00:05:50,800
And as you can imagine, the better
your model, the

104
00:05:51,333 --> 00:05:54,333
larger will be
the area under this line.

105
00:05:54,333 --> 00:05:56,066
So the area between the red

106
00:05:56,066 --> 00:05:59,266
and the blue lines will increase
as your model gets better.

107
00:05:59,866 --> 00:06:02,800
And if your model is worse,
then this red line

108
00:06:02,800 --> 00:06:05,800
will be closer to the blue line,
so it'll be closer to random.

109
00:06:06,566 --> 00:06:09,566
The next step we want to do is convert
these axes

110
00:06:09,900 --> 00:06:12,166
from absolute values to percentages.

111
00:06:12,166 --> 00:06:15,100
So they range from 0 to 100%.

112
00:06:15,100 --> 00:06:19,133
And this is how the CAP curve
is normally represented.

113
00:06:19,800 --> 00:06:21,900
Now let's say we ran
another regression model.

114
00:06:21,900 --> 00:06:25,533
And this time we used fewer variables,
fewer independent variables,

115
00:06:25,533 --> 00:06:31,400
maybe just because we had less access
to independent variables, or we didn't see

116
00:06:31,400 --> 00:06:34,566
that there's a multicollinearity effect
in our model, or something else

117
00:06:35,100 --> 00:06:38,900
went wrong.
That model will be worse,

118
00:06:39,300 --> 00:06:42,133
and this is what its CAP curve will look like.

119
00:06:42,133 --> 00:06:45,600
And therefore, by plotting the CAP
curves, you'll be able to compare models

120
00:06:45,600 --> 00:06:48,800
to each other and understand
how much gain

121
00:06:48,800 --> 00:06:50,800
(this is
also sometimes called the gain chart),

122
00:06:50,800 --> 00:06:54,200
how much gain
you get in each of these models

123
00:06:54,200 --> 00:06:57,966
compared to the random scenario,
or how much

124
00:06:57,966 --> 00:06:59,233
additional gain you get

125
00:06:59,233 --> 00:07:03,533
from switching from one model to the next,
from the green one to the red one,

126
00:07:03,533 --> 00:07:04,000
for instance.

127
00:07:04,000 --> 00:07:06,000
You're improving your hit ratio

128
00:07:06,000 --> 00:07:08,766
and therefore you're
improving your return on investment.

129
00:07:08,766 --> 00:07:11,833
So therefore the red model is better.

130
00:07:11,833 --> 00:07:14,500
And this is how
we are going to be assessing models.

131
00:07:14,500 --> 00:07:16,633
So let's label them.

132
00:07:16,633 --> 00:07:19,500
The blue line is a random selection
process.

133
00:07:19,500 --> 00:07:20,833
Like a monkey could do that.

134
00:07:20,833 --> 00:07:22,700
You just pick a random sample

135
00:07:22,700 --> 00:07:25,800
and you send the letter,
or you just send it to everybody

136
00:07:26,500 --> 00:07:29,333
and you get your 100% of respondents.

137
00:07:29,333 --> 00:07:31,466
The green line is a poor model.

138
00:07:31,466 --> 00:07:32,800
So it's a model that

139
00:07:32,800 --> 00:07:36,666
is better than random,
but it's still not as good as the red one.

140
00:07:37,066 --> 00:07:39,100
The red one is a good model.

141
00:07:39,100 --> 00:07:42,600
As you can see here, at around the 50%
mark,

142
00:07:42,600 --> 00:07:45,133
we're getting just over 80% responses.

143
00:07:45,133 --> 00:07:47,133
That's considered a good model.

144
00:07:47,133 --> 00:07:50,733
And there's one more line
that you can think of here.

145
00:07:51,166 --> 00:07:53,566
And it's this line.

146
00:07:53,566 --> 00:07:56,566
This line is the ideal line. And

147
00:07:57,566 --> 00:07:59,400
this is what would happen

148
00:07:59,400 --> 00:08:03,333
if you had a crystal ball,
if you could predict

149
00:08:03,766 --> 00:08:07,333
exactly who is going to purchase
and contact those people,

150
00:08:07,500 --> 00:08:09,000
this is what it would look like.

151
00:08:09,000 --> 00:08:12,000
Why? Well, because if you look at that,

152
00:08:13,200 --> 00:08:16,200
the place where that split occurs,

153
00:08:16,766 --> 00:08:19,333
you will see that it's at exactly
10% on the horizontal axis.

154
00:08:19,333 --> 00:08:23,700
As you remember, we know that only 10%
of our customers ever purchase.

155
00:08:24,066 --> 00:08:29,066
So basically you're saying that on the
horizontal axis, I'm going to take 10%.

156
00:08:29,800 --> 00:08:33,333
And each
and every single one of those customers

157
00:08:33,333 --> 00:08:36,200
I pick in that 10%,
they are going to be those that purchase.

158
00:08:36,200 --> 00:08:39,266
That means I will go right
straight to 100%.

159
00:08:39,733 --> 00:08:45,300
This last scenario actually
took me a while to get my head around

160
00:08:45,300 --> 00:08:48,800
when I first heard about it,
because I never understood:

161
00:08:48,800 --> 00:08:50,433
Why is this split at the top?

162
00:08:50,433 --> 00:08:52,133
Why does it break like that?

163
00:08:52,133 --> 00:08:53,166
But that's exactly the reason.

164
00:08:53,166 --> 00:08:56,233
Because
you can imagine that

165
00:08:56,233 --> 00:08:59,400
you have a crystal ball,
and the 10% of customers,

166
00:08:59,400 --> 00:09:02,400
or however many in your specific

167
00:09:02,533 --> 00:09:05,300
business
scenario, who ever purchase,

168
00:09:05,300 --> 00:09:07,133
you contact them right away,

169
00:09:07,133 --> 00:09:10,233
and then it's just flat from there,
because it doesn't matter

170
00:09:10,233 --> 00:09:12,133
how many more you contact,
they're not going to purchase.

171
00:09:12,133 --> 00:09:13,600
That's just the reality of things.

172
00:09:14,833 --> 00:09:18,133
And those are the
curves that you can have on a CAP plot.

173
00:09:18,166 --> 00:09:21,166
If you ever see a model
that goes under the blue line,

174
00:09:21,266 --> 00:09:22,566
I didn't even draw one here.

175
00:09:22,566 --> 00:09:26,400
But if that happens, that's a very bad

176
00:09:26,400 --> 00:09:29,933
model. It is basically doing you a disservice

177
00:09:30,266 --> 00:09:31,900
if you see the curve under

178
00:09:31,900 --> 00:09:34,866
the blue line. We'll talk about model
deterioration further

179
00:09:34,866 --> 00:09:37,866
in the course, when we're talking
about maintaining your models.

180
00:09:38,200 --> 00:09:41,400
So that's it for the CAP curve,
for the introduction to the CAP curve.

181
00:09:41,400 --> 00:09:45,600
We'll be using the CAP curve very actively
in this section to assess our model.

182
00:09:45,866 --> 00:09:47,233
And in fact we'll actually build

183
00:09:47,233 --> 00:09:51,166
two of them: one for our model
and one for our test data.

184
00:09:51,166 --> 00:09:53,400
So that would be very interesting
to compare.

185
00:09:53,400 --> 00:09:56,400
One last thing I wanted to
mention: note

186
00:09:56,400 --> 00:10:00,233
that we have the CAP,
which is the cumulative accuracy profile,

187
00:10:00,500 --> 00:10:03,800
and we have the ROC, which is the receiver
operating characteristic.

188
00:10:04,133 --> 00:10:05,866
And a lot of people
get these things confused.

189
00:10:05,866 --> 00:10:11,966
Myself included: I used to
get them confused.

190
00:10:11,966 --> 00:10:16,400
I even tried proving one time
to a colleague of mine,

191
00:10:16,400 --> 00:10:19,166
who knew this stuff really
well at the time

192
00:10:19,166 --> 00:10:21,566
when I was just learning it, that
he was wrong.

193
00:10:21,566 --> 00:10:24,566
But that was a funny experience.

194
00:10:24,700 --> 00:10:26,266
But they're not the same thing.

195
00:10:26,266 --> 00:10:28,100
So this is the cumulative accuracy profile.

196
00:10:28,100 --> 00:10:29,600
And the receiver

197
00:10:29,600 --> 00:10:33,366
operating characteristic
we won't be covering in this course.

198
00:10:33,600 --> 00:10:36,600
It'll be in my advanced
course on statistics.

199
00:10:36,966 --> 00:10:38,366
It's very similar. It looks similar.

200
00:10:38,366 --> 00:10:40,300
And that's why
a lot of people get confused. And actually,

201
00:10:41,500 --> 00:10:42,600
I think

202
00:10:42,600 --> 00:10:46,033
the other reason
is that the ROC curve is in Wikipedia:

203
00:10:46,033 --> 00:10:47,200
there's an article for the ROC curve,

204
00:10:47,200 --> 00:10:51,300
but there isn't one in English
for the cumulative accuracy profile.

205
00:10:51,300 --> 00:10:55,433
So it's quite hard to
find information on the CAP curve

206
00:10:55,633 --> 00:10:58,233
just by searching in Google.

207
00:10:58,233 --> 00:11:01,433
So, maybe you'll be the first person
to write

208
00:11:01,433 --> 00:11:04,433
a Wikipedia article on the CAP curve.

209
00:11:04,500 --> 00:11:07,166
Who knows? Maybe.

210
00:11:07,166 --> 00:11:07,933
Anyway,

211
00:11:07,933 --> 00:11:09,600
I look forward
to seeing you in the next tutorial,

212
00:11:09,600 --> 00:11:13,133
where we will be working
with the CAP curve.

213
00:11:13,833 --> 00:11:16,833
And until then, happy analyzing.