1
00:00:00,233 --> 00:00:01,700
CAP analysis.

2
00:00:01,700 --> 00:00:05,433
We've talked about CAP a lot, and in fact,
we've talked about CAP so much

3
00:00:05,433 --> 00:00:08,533
that I'm no longer even saying
cumulative accuracy profile

4
00:00:08,566 --> 00:00:12,933
because I am assuming
that you're entirely comfortable

5
00:00:12,933 --> 00:00:15,566
with this abbreviation and the whole term
and what it means.

6
00:00:15,566 --> 00:00:18,833
So let's see how to analyze the CAP.

7
00:00:20,400 --> 00:00:21,633
As we've discussed,

8
00:00:21,633 --> 00:00:25,833
there are three lines
that are important on the CAP curve.

9
00:00:25,966 --> 00:00:28,200
The blue line, which is the random line,

10
00:00:28,200 --> 00:00:34,633
when you select your samples at random.
The red line, which is our model line.

11
00:00:34,766 --> 00:00:38,100
And different models
will have different red lines,

12
00:00:38,100 --> 00:00:39,800
but basically it looks
something like that.

13
00:00:39,800 --> 00:00:42,466
And the gray line,
which is the perfect model,

14
00:00:42,466 --> 00:00:47,466
or when you have a crystal ball,
when you can select all of the future

15
00:00:47,666 --> 00:00:51,966
churners or purchasers
or whatever action takers,

16
00:00:52,200 --> 00:00:55,366
and you can select them
right away on the dot

17
00:00:55,366 --> 00:00:58,966
without even selecting one single person
that you don't want to select.

18
00:00:59,666 --> 00:01:02,166
And so these are the three main lines.

19
00:01:02,166 --> 00:01:04,866
And how do we analyze this CAP curve?

20
00:01:04,866 --> 00:01:05,866
We already know how to build it.

21
00:01:05,866 --> 00:01:07,533
But what can we derive?

22
00:01:07,533 --> 00:01:09,300
What insights can we derive from here?

23
00:01:09,300 --> 00:01:13,733
Well, it's kind of intuitive that the
closer your red line is to the gray line,

24
00:01:13,733 --> 00:01:16,933
the better your model;
the closer it is to the blue line, the worse.

25
00:01:17,466 --> 00:01:19,866
So how can we quantify this effect?

26
00:01:19,866 --> 00:01:24,733
Well, there is a standard approach
to calculate the accuracy ratio.

27
00:01:24,933 --> 00:01:28,933
And to calculate the accuracy ratio,
you need to take the area between

28
00:01:29,000 --> 00:01:33,466
the perfect line and the random line,
which is colored in gray here.

29
00:01:33,466 --> 00:01:36,133
And it's called aP. And

30
00:01:37,133 --> 00:01:39,500
then you need to take the area between
the red line and the random line,

31
00:01:39,500 --> 00:01:43,166
which is colored in red here,
which is aR.

32
00:01:44,100 --> 00:01:46,600
And then you need
to divide one by the other.

33
00:01:46,600 --> 00:01:49,600
So you need to divide aR by aP.

34
00:01:49,766 --> 00:01:53,000
And then this ratio that you get
is obviously between 0 and 1.

35
00:01:53,300 --> 00:01:56,266
And the closer
this ratio is to one, the better.

36
00:01:56,266 --> 00:01:59,266
The further it is away from one and closer
to zero, the worse.

37
00:02:00,300 --> 00:02:03,533
However, it can be quite complicated
to calculate this area under the curve.

38
00:02:03,733 --> 00:02:07,500
Statistical tools can do it for you,
but how can you assess

39
00:02:07,700 --> 00:02:10,566
the CAP curve by just looking at it?

40
00:02:10,566 --> 00:02:15,100
So visually, it's not that easy to
get this quantifiable metric

41
00:02:15,100 --> 00:02:16,433
just by looking at the curve.

42
00:02:16,433 --> 00:02:18,100
So there's a second approach.

43
00:02:18,100 --> 00:02:20,133
And that's
what we're going to discuss now.

44
00:02:21,300 --> 00:02:23,433
Let's get rid of the areas.

45
00:02:23,433 --> 00:02:28,900
And instead of looking at the area,
what you can do is look at the 50% line

46
00:02:28,900 --> 00:02:33,233
on the horizontal axis
and look where it crosses your model,

47
00:02:33,433 --> 00:02:34,766
and then look at where

48
00:02:34,766 --> 00:02:38,600
that line, the horizontal line from there,
crosses the vertical axis.

49
00:02:38,600 --> 00:02:44,000
So basically, how many churners
or action takers will you pick up,

50
00:02:44,000 --> 00:02:47,400
or how many positive outcomes
are you going to identify,

51
00:02:48,133 --> 00:02:51,200
if you take 50% of your population?

52
00:02:51,600 --> 00:02:54,766
And in this case, we can see
it's around 90% or something like that.

53
00:02:55,133 --> 00:02:57,766
And just by looking at that, there's

54
00:02:57,766 --> 00:03:02,366
a rule of thumb for how you can assess
your model based on that X number.

55
00:03:02,366 --> 00:03:04,900
And here it is. Are you ready? Here we go.

56
00:03:04,900 --> 00:03:09,500
So if X is less than 60%,
the model is rubbish.

57
00:03:10,266 --> 00:03:13,200
Basically it's not useful at all.

58
00:03:13,200 --> 00:03:16,200
You can create a better one.

59
00:03:16,200 --> 00:03:17,633
Probably you can create a better one.

60
00:03:17,633 --> 00:03:19,500
And you need to try again.

61
00:03:19,500 --> 00:03:23,700
If your model's
X is between 60% and 70%,

62
00:03:23,700 --> 00:03:27,633
then the model is considered
to be poor or average.

63
00:03:27,633 --> 00:03:30,300
And by the way,
this is my rule of thumb.

64
00:03:30,300 --> 00:03:33,300
Other people might have a different
rule of thumb, but this is what I go by.

65
00:03:33,633 --> 00:03:36,966
If it's between 60% and 70%,
it's a poor model, to be honest.

66
00:03:36,966 --> 00:03:39,300
Like, you can do better than that.

67
00:03:39,300 --> 00:03:44,733
If X is between 70% and 80%,
that's a good model.

68
00:03:44,733 --> 00:03:48,033
That's already where you should be aiming:
anything above 70%.

69
00:03:48,433 --> 00:03:50,600
That can deliver

70
00:03:50,600 --> 00:03:54,166
good quality insights to the business
and actually deliver value.

71
00:03:54,633 --> 00:04:00,133
Anything between 80% and 90%, like we see
here, is very good. It's extremely good.

72
00:04:00,133 --> 00:04:03,066
If you can get a model over 80%,

73
00:04:03,066 --> 00:04:06,066
that is an amazing result.

74
00:04:06,366 --> 00:04:10,133
And anything above 90% up to 100,
that is just too good.

75
00:04:10,500 --> 00:04:13,600
It is too good to believe. And there,

76
00:04:13,633 --> 00:04:19,233
there's one thing that you should be
very careful with here, which is overfitting.

77
00:04:19,233 --> 00:04:22,933
If your model is showing
you results like 90% or so,

78
00:04:23,033 --> 00:04:26,033
if a model is showing 100%,
then the obvious answer there is that

79
00:04:26,033 --> 00:04:29,900
one of your independent variables
is actually a post facto variable,

80
00:04:29,900 --> 00:04:33,366
meaning that it shouldn't be in the data
because it's looking into the future.

81
00:04:34,000 --> 00:04:36,766
The person who supplied you
that variable forgot to take it out

82
00:04:36,766 --> 00:04:41,933
or forgot to explain to you that,
you know, their credit score actually

83
00:04:42,200 --> 00:04:45,300
is set to zero
after they leave the bank,

84
00:04:45,300 --> 00:04:47,366
and therefore everybody
with a zero credit score

85
00:04:47,366 --> 00:04:51,133
obviously has left the bank, and therefore
your model is picking them up.

86
00:04:51,133 --> 00:04:53,400
Like, it's super easy.

87
00:04:53,400 --> 00:04:56,400
So if you have 100%, that's
definitely something in one of your variables.

88
00:04:56,566 --> 00:05:00,066
Even if you have 90%,
you have to check whether there could be some

89
00:05:00,366 --> 00:05:01,966
forward-looking variables.

90
00:05:01,966 --> 00:05:03,633
The other thing is overfitting.

91
00:05:03,633 --> 00:05:07,866
You could be overfitting your model,
and what that means is that

92
00:05:07,866 --> 00:05:11,100
your model has been fit so well to that

93
00:05:11,100 --> 00:05:14,466
specific data set that you supplied
when you trained it,

94
00:05:14,466 --> 00:05:18,266
that it's just heavily
relying on the anomalies in that data set.

95
00:05:18,466 --> 00:05:23,133
And when you feed it a new data set, like,
you know, in a month's time or

96
00:05:23,133 --> 00:05:26,600
something, not training data,
not the data that you trained your model on

97
00:05:26,600 --> 00:05:29,700
(and we'll talk about this a lot
more, actually, in the coming tutorials),

98
00:05:29,700 --> 00:05:34,633
so if you feed this model some data
that you want to actually predict on,

99
00:05:34,800 --> 00:05:36,733
then it will crash. Well,
it won't crash;

100
00:05:36,733 --> 00:05:37,933
it won't perform as well.

101
00:05:37,933 --> 00:05:40,466
It will perform,
you know, at the 60% mark or something.

102
00:05:40,466 --> 00:05:42,766
So that means your model is overfitted.

103
00:05:42,766 --> 00:05:44,400
And be very careful about that.

104
00:05:44,400 --> 00:05:45,866
We'll talk about overfitting more.

105
00:05:45,866 --> 00:05:49,933
In fact, in the coming tutorials
we will learn how to avoid that problem.

106
00:05:50,200 --> 00:05:54,466
And finally,
if you can get an X, or this

107
00:05:54,966 --> 00:05:57,566
parameter, to be between 90% and 100%,
and you're

108
00:05:57,566 --> 00:06:00,633
not using forward-looking parameters
and you're not overfitting,

109
00:06:00,866 --> 00:06:03,866
then give me a call
because I might have a job for you.

110
00:06:04,200 --> 00:06:07,466
People like that are rare, and I have

111
00:06:07,466 --> 00:06:11,966
a lot of headhunters looking for people
who can do modeling like that.

112
00:06:11,966 --> 00:06:16,366
So definitely keep that in mind,
and I look forward to seeing you then.

113
00:06:16,700 --> 00:06:18,466
Until next time, happy analyzing.