1
00:00:01,600 --> 00:00:04,800
Hello and welcome back to the course
on Machine Learning.

2
00:00:05,333 --> 00:00:08,533
Today we're talking about the multi-armed
bandit problem.

3
00:00:08,666 --> 00:00:10,633
Don't you just love these names
when they come up?

4
00:00:10,633 --> 00:00:15,000
Such cool names for machine
learning algorithms and problems.

5
00:00:15,733 --> 00:00:19,033
Well, today
we're indeed talking about this problem.

6
00:00:19,033 --> 00:00:22,633
And it is the example

7
00:00:22,633 --> 00:00:25,700
that we're going to be using in this
whole section on reinforcement learning.

8
00:00:25,700 --> 00:00:29,100
We're going to be looking
at different ways that we can solve

9
00:00:29,100 --> 00:00:32,100
the multi-armed bandit problem
and comparing the results.

10
00:00:32,266 --> 00:00:34,900
But before we continue,
I just wanted to mention that

11
00:00:34,900 --> 00:00:38,766
the multi-armed bandit
problem is not the only problem

12
00:00:38,766 --> 00:00:41,100
that can be solved with reinforcement
learning.

13
00:00:41,100 --> 00:00:44,066
Reinforcement
learning is actually really, really cool.

14
00:00:44,066 --> 00:00:49,133
Reinforcement learning, for instance,
is used to train robot dogs to walk.

15
00:00:49,133 --> 00:00:50,400
And I'll give you a quick example.

16
00:00:50,400 --> 00:00:55,500
For instance, once you've created
a robot dog, you can implement

17
00:00:55,500 --> 00:00:59,200
an algorithm inside the robot dog,
which will tell it how to walk.

18
00:00:59,200 --> 00:01:00,033
You can tell it,

19
00:01:00,033 --> 00:01:03,266
all right, move your front right foot
and then move your left back foot

20
00:01:03,266 --> 00:01:05,933
and then front left foot, right back foot
and so on.

21
00:01:05,933 --> 00:01:07,700
You can actually give it
the sequence of actions

22
00:01:07,700 --> 00:01:11,100
that it needs to take in order
to accomplish a task which is walking.

23
00:01:11,600 --> 00:01:14,866
Or you can implement a reinforcement
learning algorithm

24
00:01:14,866 --> 00:01:20,366
which will train the dog
to walk, in a very, very interesting way.

25
00:01:20,366 --> 00:01:20,900
So basically

26
00:01:20,900 --> 00:01:24,933
what it will do is say: hey dog, here
are all the actions you can take:

27
00:01:25,800 --> 00:01:28,966
you can move your legs like this,
you can move your legs like that.

28
00:01:28,966 --> 00:01:32,700
And your goal is to make a step forward.

29
00:01:32,700 --> 00:01:36,066
Every time you make a step forward,
you are given a reward.

30
00:01:36,066 --> 00:01:39,566
Every time you fall over,
you're given a punishment.

31
00:01:39,566 --> 00:01:42,066
And a reward
is basically a one in the algorithm.

32
00:01:42,066 --> 00:01:46,800
You don't actually have to give it a
carrot or, you know, something to eat.

33
00:01:46,800 --> 00:01:48,166
You just give it a one

34
00:01:48,166 --> 00:01:50,733
in the algorithm, and a punishment is a zero.

35
00:01:50,733 --> 00:01:54,000
And basically every time it takes a step
forward, it knows it's got a reward.

36
00:01:54,000 --> 00:01:56,166
And it knows, yes, that's good for it.

37
00:01:56,166 --> 00:01:59,233
So it basically will try
all these random sets of actions

38
00:01:59,966 --> 00:02:02,133
and see what they lead to.

39
00:02:02,133 --> 00:02:05,666
Every time it takes a step forward, it'll
remember that those were good actions,

40
00:02:05,666 --> 00:02:07,533
and it'll try to repeat them
more and more.

41
00:02:07,533 --> 00:02:10,866
And actually,
dogs like that can learn to walk.

42
00:02:11,200 --> 00:02:15,100
So you don't have to program
an actual walking algorithm into it.

43
00:02:15,100 --> 00:02:18,300
It'll figure out the steps
it needs to take on its own.

44
00:02:18,566 --> 00:02:21,566
I think that's really mind-blowing
and really cool.

45
00:02:21,800 --> 00:02:26,866
But unfortunately, that is
a topic more

46
00:02:27,066 --> 00:02:30,666
on the side of artificial intelligence
rather than just machine learning.

47
00:02:30,966 --> 00:02:34,900
And that, you know,
can be a whole course on its own.

48
00:02:34,900 --> 00:02:39,600
We're not going to delve into training
robot dogs to walk inside this section.

49
00:02:39,600 --> 00:02:40,733
Inside this section,

50
00:02:40,733 --> 00:02:44,033
we are going to talk about
the multi-armed bandit problem,

51
00:02:44,266 --> 00:02:48,700
which is a bit of a different application
of this machine

52
00:02:48,800 --> 00:02:51,800
learning branch, reinforcement learning.

53
00:02:51,933 --> 00:02:53,966
And of course, there are

54
00:02:53,966 --> 00:02:56,733
lots of other applications
of reinforcement learning as well.

55
00:02:56,733 --> 00:03:00,766
So moving on to our multi-armed
bandit problem.

56
00:03:00,766 --> 00:03:05,400
So first of all,
what on earth is a multi-armed bandit?

57
00:03:05,400 --> 00:03:05,666
Right.

58
00:03:05,666 --> 00:03:09,200
So the first thing that comes to mind
is like a robber going into a bank

59
00:03:09,200 --> 00:03:13,800
and so on, or somebody with a gun.
But actually,

60
00:03:14,133 --> 00:03:17,833
a bandit, or a one-armed bandit...

61
00:03:17,833 --> 00:03:19,800
So let's simplify things.

62
00:03:19,800 --> 00:03:24,000
A one-armed bandit is a slot machine,
right?

63
00:03:24,000 --> 00:03:24,966
It's one of these.

64
00:03:24,966 --> 00:03:27,966
And why is it called a one-armed
bandit?

65
00:03:27,966 --> 00:03:30,000
Well, it's got a bit of a history there.

66
00:03:30,000 --> 00:03:33,766
Back in the day,
they used to have this handle

67
00:03:33,766 --> 00:03:35,300
on the right,
and you can still see that in movies.

68
00:03:35,300 --> 00:03:39,233
And maybe some places
you can still find these slot machines

69
00:03:39,233 --> 00:03:40,900
where you actually have
to pull the handle,

70
00:03:40,900 --> 00:03:43,900
because now they're all electronic,
and you just press a button

71
00:03:43,900 --> 00:03:46,600
right
there: push-button slot machines.

72
00:03:46,600 --> 00:03:49,966
Whereas back in the day
you had to pull the lever

73
00:03:50,366 --> 00:03:54,333
to make it work, to initiate

74
00:03:54,333 --> 00:03:59,733
the game. And so, hence the arm.

75
00:03:59,733 --> 00:04:00,933
Yeah. But why is it called the bandit?

76
00:04:00,933 --> 00:04:06,400
Well, because these machines,
they would actually,

77
00:04:07,033 --> 00:04:07,700
you know, these

78
00:04:07,700 --> 00:04:12,000
this is one of the quickest ways
to lose your money in a casino,

79
00:04:12,666 --> 00:04:16,533
they would take, I think it was like,

80
00:04:16,566 --> 00:04:20,633
a 50% chance that they would take away
your money back in the day.

81
00:04:20,633 --> 00:04:21,766
So they would.

82
00:04:21,766 --> 00:04:24,766
Of course, you would earn less than your,

83
00:04:24,866 --> 00:04:27,433
you're actually winning.

84
00:04:27,433 --> 00:04:32,233
And it was about a, you know, 50/50
chance whether you actually

85
00:04:32,633 --> 00:04:37,666
get a win or you lose money.
But then they put a bug into them.

86
00:04:37,700 --> 00:04:39,133
I think I read up a little bit online.

87
00:04:39,133 --> 00:04:42,266
They put a bug into them
so that people who were playing

88
00:04:42,266 --> 00:04:46,066
them were losing even faster,
or even more frequently, than 50%.

89
00:04:46,066 --> 00:04:50,033
So hence the name bandit, because it was
basically robbing you of your money.

90
00:04:50,333 --> 00:04:54,033
And, you know, it's one of the quickest
ways to lose your money.

91
00:04:54,766 --> 00:04:55,600
Hence the multi...

92
00:04:55,600 --> 00:04:57,966
Oh, that's
why it's called the one-armed bandit.

93
00:04:57,966 --> 00:05:00,466
And what is the multi-armed bandit?

94
00:05:00,466 --> 00:05:05,333
Well, the multi-armed
bandit problem is kind of the challenge

95
00:05:05,333 --> 00:05:11,700
that a person is faced with when he comes up
to a whole set of these machines.

96
00:05:11,800 --> 00:05:14,800
When he doesn't have just one,
he has, like, 5 or 10,

97
00:05:14,966 --> 00:05:16,200
you know. In the programming examples

98
00:05:16,200 --> 00:05:20,233
we'll have an example of ten,
but we won't be talking specifically

99
00:05:20,233 --> 00:05:21,300
about these machines.

100
00:05:21,300 --> 00:05:24,166
Of course. This
is the historic problem.

101
00:05:24,166 --> 00:05:30,200
You'll see just now that there are
many, many other applications:

102
00:05:31,233 --> 00:05:34,233
even though it's called the
multi-armed bandit problem, it's actually

103
00:05:34,233 --> 00:05:37,233
used to solve many other problems
as well.

104
00:05:37,933 --> 00:05:40,666
So, basically here
you're faced with a challenge.

105
00:05:40,666 --> 00:05:42,466
You've got five of these machines, right?

106
00:05:42,466 --> 00:05:46,733
And how do you actually play them
to maximize your return

107
00:05:47,066 --> 00:05:52,200
from the number of games
that you can actually play?

108
00:05:52,200 --> 00:05:54,233
So, you know,
you've decided you're going to play,

109
00:05:54,233 --> 00:05:56,466
you know, 100 times or a thousand times,

110
00:05:56,466 --> 00:05:58,600
and you want to maximize your return.

111
00:05:58,600 --> 00:06:01,233
How do you figure out which of them
to play

112
00:06:01,233 --> 00:06:03,866
in order to maximize your returns?

113
00:06:03,866 --> 00:06:07,900
Well, you know,
to describe the problem in more detail,

114
00:06:07,900 --> 00:06:11,933
we've got to mention that
the assumption here

115
00:06:11,933 --> 00:06:16,666
is that each one of these machines
has a distribution behind it.

116
00:06:16,666 --> 00:06:20,000
So there's a distribution of numbers,

117
00:06:20,100 --> 00:06:24,366
or outcomes,
out of which the machine picks results.

118
00:06:24,366 --> 00:06:24,633
Right?

119
00:06:24,633 --> 00:06:27,766
So, sort of,
each one of these machines

120
00:06:27,766 --> 00:06:31,166
has its own distribution,
and it picks out a result.

121
00:06:31,166 --> 00:06:34,666
You pull the lever and it just picks
randomly, out of its distribution,

122
00:06:34,933 --> 00:06:37,866
a result, an outcome, you know,
whether you win or whether you lose

123
00:06:37,866 --> 00:06:39,433
and how much you
win and how much you lose.

124
00:06:40,600 --> 00:06:43,600
Or basically you lose the same amount,
the coin you just put in.

125
00:06:43,700 --> 00:06:46,300
But basically it tells you
whether you win or lose

126
00:06:46,300 --> 00:06:49,666
based on the distribution
that's built into the machine.

127
00:06:49,800 --> 00:06:53,600
But the problem here is that
you don't know these distributions, right?

128
00:06:53,633 --> 00:06:56,500
You don't know in advance
what the distributions are.

129
00:06:56,500 --> 00:06:59,533
And they are assumed to be different
for these machines.

130
00:06:59,666 --> 00:07:02,566
Sometimes they can be similar, or the same,
in some of the machines.

131
00:07:02,566 --> 00:07:05,566
But by default they are different.

132
00:07:05,966 --> 00:07:09,566
And your goal is to figure out

133
00:07:09,966 --> 00:07:12,733
which of these distributions is the best
one for you.

134
00:07:12,733 --> 00:07:14,533
So, let's have a look.

135
00:07:14,533 --> 00:07:16,500
So there are these distributions. Right.

136
00:07:16,500 --> 00:07:20,700
So for example, we've got these
five machines, the five distributions.

137
00:07:21,000 --> 00:07:22,533
And you can see right away,

138
00:07:22,533 --> 00:07:25,900
just by looking at this,
which is the best machine:

139
00:07:25,900 --> 00:07:29,533
obviously the one on the right,
the orange one, is the best machine

140
00:07:29,533 --> 00:07:34,500
because it's got the best,
you know, it's the most left-skewed,

141
00:07:34,900 --> 00:07:36,733
left-skewed
because the tail is on the left.

142
00:07:36,733 --> 00:07:38,100
So it's got the most

143
00:07:38,100 --> 00:07:41,100
favorable outcomes:
it's got the highest mean, median and mode.

144
00:07:41,200 --> 00:07:45,633
And if you knew these distributions
in advance, you would obviously

145
00:07:45,633 --> 00:07:49,666
just go to the fifth machine
and you would bet on the fifth machine

146
00:07:49,666 --> 00:07:52,466
just on the fifth machine all the time
because

147
00:07:52,466 --> 00:07:54,366
it's got the best distribution, right?

148
00:07:54,366 --> 00:07:56,866
So on average
you would get the best results.

149
00:07:56,866 --> 00:07:58,833
But you don't know that.
You don't know that in advance.

150
00:07:58,833 --> 00:08:01,833
And your goal is to figure out,

151
00:08:02,066 --> 00:08:05,200
you know, it's like
a mind game.

152
00:08:05,200 --> 00:08:10,500
You know how there are all these movies
about machine learning and really cool

153
00:08:10,500 --> 00:08:13,600
mathematics, and how they're using it?

154
00:08:13,600 --> 00:08:16,600
A really good movie was

155
00:08:16,600 --> 00:08:20,266
The Imitation
Game, right, about Alan Turing

156
00:08:20,266 --> 00:08:23,700
and how he was solving the Enigma
and so on.

157
00:08:23,700 --> 00:08:25,800
It's a similar kind of concept.

158
00:08:25,800 --> 00:08:28,566
You don't know which one of these is
the best.

159
00:08:28,566 --> 00:08:29,433
You've got to figure it out.

160
00:08:29,433 --> 00:08:33,666
But at the same time, you are already
spending your money doing this, right?

161
00:08:33,666 --> 00:08:37,366
You can't just, you know, the longer
you take to figure it out,

162
00:08:38,133 --> 00:08:39,133
there's a trade-off, right?

163
00:08:39,133 --> 00:08:40,866
The longer you take to figure it out,

164
00:08:40,866 --> 00:08:45,300
the more money
you will probably spend on the wrong ones.

165
00:08:45,766 --> 00:08:48,933
And therefore,
you have to figure it out very quickly.

166
00:08:49,333 --> 00:08:51,900
So there are these two factors
that are in play, exploration

167
00:08:51,900 --> 00:08:53,000
and exploitation.

168
00:08:53,000 --> 00:08:56,266
So you need to explore the machines
to find out

169
00:08:56,266 --> 00:08:58,766
which one of them is the best one.

170
00:08:58,766 --> 00:09:03,566
And at the same time, you need to, as
soon as you can, start exploiting

171
00:09:03,933 --> 00:09:08,666
these machines, exploiting
your findings to make the maximum return.

172
00:09:09,200 --> 00:09:11,866
And there's
another mathematical concept

173
00:09:11,866 --> 00:09:14,866
behind all this, which is called regret.

174
00:09:14,933 --> 00:09:17,700
And regret is mathematically defined.

175
00:09:17,700 --> 00:09:19,633
And if you want to read more about this,

176
00:09:19,633 --> 00:09:21,066
there's a really good white paper on it.

177
00:09:21,066 --> 00:09:24,833
It's called Using Confidence
Bounds for

178
00:09:24,833 --> 00:09:27,033
Exploitation-Exploration Trade-offs.

179
00:09:27,033 --> 00:09:30,000
And it is by Peter

180
00:09:31,066 --> 00:09:32,400
Auer,

181
00:09:32,400 --> 00:09:35,566
from Graz University of Technology
in Austria.

182
00:09:36,633 --> 00:09:38,100
I really like the white paper.

183
00:09:38,100 --> 00:09:40,066
It goes into a lot of detail.

184
00:09:40,066 --> 00:09:41,233
I didn't even read the whole thing,

185
00:09:41,233 --> 00:09:45,433
but the first couple of chapters are
pretty good if you want to go into detail.

186
00:09:45,433 --> 00:09:50,700
But basically, regret is
what is suffered

187
00:09:50,700 --> 00:09:54,466
when you're using a
non-optimal method.

188
00:09:54,466 --> 00:09:54,666
Right.

189
00:09:54,666 --> 00:09:57,666
So the one on the right is the optimal one,

190
00:09:57,900 --> 00:10:00,666
the one on the right
is the optimal machine.

191
00:10:00,666 --> 00:10:04,433
Whenever you're using a
non-optimal machine, you have a regret,

192
00:10:04,500 --> 00:10:08,433
which can be quantified
as the difference

193
00:10:08,433 --> 00:10:12,000
between the best outcome
and the non-best outcome,

194
00:10:12,566 --> 00:10:15,600
you know, over all of those sums of money
that you put in,

195
00:10:15,766 --> 00:10:20,000
like your opportunity cost of actually
exploring the other machines.

196
00:10:20,633 --> 00:10:25,400
And so the longer you explore the
non-optimal machines, the higher the regret.

197
00:10:25,400 --> 00:10:29,600
But at the same time, if you don't explore
for long enough, right,

198
00:10:29,600 --> 00:10:33,500
if you don't explore
for long enough, then

199
00:10:33,500 --> 00:10:38,233
a suboptimal machine
might appear as an optimal machine.

200
00:10:38,233 --> 00:10:41,166
So for instance,
this machine over here, right?

201
00:10:41,166 --> 00:10:45,666
So if we explore, explore, explore,
but we don't spend enough time exploring,

202
00:10:46,000 --> 00:10:47,700
we might think that
this is the best machine

203
00:10:47,700 --> 00:10:50,333
because it's got quite a good return, right,
close to this one.

204
00:10:50,333 --> 00:10:53,333
And we might start exploiting this one
for the rest of the time.

205
00:10:53,633 --> 00:10:56,233
But in reality, this one was the best one.

206
00:10:56,233 --> 00:11:01,366
So the goal is to find the best one
and exploit the best one,

207
00:11:02,166 --> 00:11:06,166
but spend the least amount of time
exploring all of them.

208
00:11:06,166 --> 00:11:06,466
Right?

209
00:11:06,466 --> 00:11:08,600
And while you're exploring
you're still earning money,

210
00:11:08,600 --> 00:11:10,600
but not from the optimal machine. Right.

211
00:11:10,600 --> 00:11:12,000
So that's the goal.

212
00:11:12,000 --> 00:11:14,100
That's the point of this whole exercise.

213
00:11:14,100 --> 00:11:19,766
And it's important to understand here
that there is a best one. So

214
00:11:20,133 --> 00:11:22,766
even though these machines, you know,
they

215
00:11:22,766 --> 00:11:25,400
have, like, jackpots sometimes

216
00:11:25,400 --> 00:11:28,533
and so on,
we are assuming that

217
00:11:28,533 --> 00:11:31,866
these distributions are finite.

218
00:11:31,866 --> 00:11:35,666
And out of them, there's a best one that
you are looking for. That's kind of the

219
00:11:36,100 --> 00:11:40,400
premise, or the whole assumption,
of this problem.

220
00:11:40,466 --> 00:11:43,900
There are more complex options
and versions of this problem, so,

221
00:11:44,333 --> 00:11:49,533
again, check out some additional
reading on that topic.

222
00:11:49,533 --> 00:11:51,566
That's even more advanced.

223
00:11:51,566 --> 00:11:54,600
But for what we are going to be using this for,
that's going to be sufficient.

224
00:11:54,600 --> 00:11:56,600
And why is it going
to be sufficient for us?

225
00:11:56,600 --> 00:12:02,366
Because the most common modern
application of this that we can think of

226
00:12:02,366 --> 00:12:06,300
and the one that we are going to be
exploring is advertising.

227
00:12:06,600 --> 00:12:08,400
So let's have a look at some ads.

228
00:12:08,400 --> 00:12:09,766
This is going to be fun.

229
00:12:09,766 --> 00:12:13,433
So just a disclaimer: there's
no affiliation with Coca-Cola.

230
00:12:13,433 --> 00:12:15,766
Examples
are used just for educational purposes.

231
00:12:15,766 --> 00:12:16,966
All right. So let's have a look.

232
00:12:16,966 --> 00:12:20,133
Let's say Coca-Cola, or

233
00:12:20,133 --> 00:12:23,300
some company, wants to run a campaign,

234
00:12:23,633 --> 00:12:27,633
and it's going to be called
the Welcome to the Coke Side of Life campaign.

235
00:12:28,066 --> 00:12:28,900
And if you search for this

236
00:12:28,900 --> 00:12:32,400
campaign online, you'll see that they had,
you know, hundreds of different ads

237
00:12:32,666 --> 00:12:35,300
that they came up
with for this campaign.

238
00:12:35,300 --> 00:12:38,633
And here are some examples of them.
These are just some images

239
00:12:38,633 --> 00:12:39,333
I pulled from Google.

240
00:12:39,333 --> 00:12:42,900
So maybe these are even drawn
by people, but we're going to assume

241
00:12:42,900 --> 00:12:47,400
that these are legitimate ads that
were going to go into the campaign.

242
00:12:47,700 --> 00:12:49,866
And so we want to find out
which is the best ad,

243
00:12:49,866 --> 00:12:51,300
which is the ad that works best.

244
00:12:51,300 --> 00:12:53,100
So we've got option number one,

245
00:12:53,100 --> 00:12:57,000
number two, number three, number
four, and number five.

246
00:12:57,333 --> 00:13:02,066
And so now our goal is to find out
which ad works the best,

247
00:13:02,233 --> 00:13:03,666
to maximize our returns.

248
00:13:03,666 --> 00:13:05,800
But right now we don't know
which ad works the best, right?

249
00:13:05,800 --> 00:13:10,833
So there is a distribution
behind it, but that distribution

250
00:13:10,833 --> 00:13:15,300
will only become known after thousands
and thousands and thousands of people

251
00:13:15,733 --> 00:13:18,233
look at these ads and click or don't
click on these ads.

252
00:13:18,233 --> 00:13:20,000
And this is actually going
to be very similar

253
00:13:20,000 --> 00:13:21,800
to the example
that we're going to be looking at.

254
00:13:21,800 --> 00:13:24,600
The example that Hadelin
is going to be walking you through

255
00:13:24,600 --> 00:13:25,800
in the programming tutorials.

256
00:13:25,800 --> 00:13:27,700
And in that example
we're going to have ten ads.

257
00:13:27,700 --> 00:13:29,100
So even more.

258
00:13:29,100 --> 00:13:31,100
And so, what can you do here?

259
00:13:31,100 --> 00:13:34,200
Well, one way to approach
the problem is just to run an A/B test.

260
00:13:34,200 --> 00:13:34,433
Right.

261
00:13:34,433 --> 00:13:37,700
So take your five or 50 or 500 ads

262
00:13:37,700 --> 00:13:43,400
and run a huge A/B test,
or multiple A/B tests, and

263
00:13:43,400 --> 00:13:47,766
wait until you have
a large enough sample,

264
00:13:48,533 --> 00:13:52,600
and then conclude which ad is the best,
right, with a certain confidence.

265
00:13:53,133 --> 00:13:56,800
But the problem with that
is that you would spend

266
00:13:56,800 --> 00:13:59,666
a lot of time and money doing that. Right.

267
00:13:59,666 --> 00:14:02,633
So an A/B test is pure exploration, right?

268
00:14:02,633 --> 00:14:04,433
You're not exploiting the best option.

269
00:14:04,433 --> 00:14:05,766
You are exploring the best option,

270
00:14:05,766 --> 00:14:10,100
but only to the same extent as you're
exploring the non-optimal options.

271
00:14:10,100 --> 00:14:10,300
Right.

272
00:14:10,300 --> 00:14:14,800
So if we go by our previous
distributions, if this is the best one,

273
00:14:14,800 --> 00:14:18,000
if you just run an A/B test,
you're uniformly distributing,

274
00:14:18,000 --> 00:14:21,466
or uniformly using, these five options,
and therefore,

275
00:14:21,700 --> 00:14:25,333
as much as you're using this one,
you're using the other four of them.

276
00:14:25,333 --> 00:14:27,400
So basically all five of them.

277
00:14:27,400 --> 00:14:30,433
So basically
you are exploiting it a bit, but

278
00:14:30,433 --> 00:14:33,433
unconsciously, right, in a random way.

279
00:14:34,166 --> 00:14:37,500
And therefore A/B tests
are just for exploration.

280
00:14:38,300 --> 00:14:41,400
So the challenge is to find out
which is the best one.

281
00:14:41,766 --> 00:14:45,700
But do it while you're...

282
00:14:45,700 --> 00:14:51,400
you know, exploit the best one
while you're exploring for it.

283
00:14:51,400 --> 00:14:51,600
Right.

284
00:14:51,600 --> 00:14:55,366
So find out which of them is the best
one in the process of...

285
00:14:55,900 --> 00:14:59,800
hold on, find out

286
00:14:59,800 --> 00:15:03,033
which is the best one in the process
of the actual launched campaign.

287
00:15:03,033 --> 00:15:05,033
So don't have two phases,

288
00:15:05,033 --> 00:15:08,033
where you do the A/B test
and then use the best one,

289
00:15:08,166 --> 00:15:10,900
but actually find out the best
one in the quickest way

290
00:15:10,900 --> 00:15:14,066
possible and start
exploiting it along the way.

291
00:15:14,066 --> 00:15:15,433
So that's the challenge here.

292
00:15:15,433 --> 00:15:17,200
And that's what we're going to be solving.

293
00:15:17,200 --> 00:15:22,233
And that's the modern application
of the multi-armed bandit problem.

294
00:15:22,600 --> 00:15:24,433
So hopefully you're excited about this.

295
00:15:24,433 --> 00:15:27,200
We've got two great algorithms coming up.

296
00:15:27,200 --> 00:15:29,000
Can't wait to get started.

297
00:15:29,000 --> 00:15:30,833
I look forward to seeing you
in the next tutorial.

298
00:15:30,833 --> 00:15:33,000
And until then, enjoy machine learning.