1
00:00:00,733 --> 00:00:03,300
Hello and welcome back to the course
on Machine Learning.

2
00:00:03,300 --> 00:00:04,933
Today we've got a very interesting topic.

3
00:00:04,933 --> 00:00:09,566
Today we talk about Thompson sampling
and the algorithm's intuition.

4
00:00:09,566 --> 00:00:10,833
And again

5
00:00:10,833 --> 00:00:15,600
we're going to be using this algorithm
to solve the multi-armed bandit problem.

6
00:00:16,000 --> 00:00:16,366
All right.

7
00:00:16,366 --> 00:00:17,933
So let's get started.

8
00:00:17,933 --> 00:00:20,666
A quick refresher
on that multi-armed bandit problem.

9
00:00:20,666 --> 00:00:23,400
We have several slot machines.

10
00:00:23,400 --> 00:00:25,533
each one of them has a distribution
behind it.

11
00:00:25,533 --> 00:00:26,966
Do you want to do five?

12
00:00:26,966 --> 00:00:29,300
We don't know what
these distributions are.

13
00:00:29,300 --> 00:00:31,933
And we need to start
playing these machines.

14
00:00:31,933 --> 00:00:35,533
And at the same time, figure out
which one has the best distribution

15
00:00:35,533 --> 00:00:38,533
so that we then exploit
that best distribution.

16
00:00:38,666 --> 00:00:42,433
and but,
it'll take us some time to figure it out.

17
00:00:42,433 --> 00:00:46,200
So we need to maximize our return
during the process of figuring out.

18
00:00:46,200 --> 00:00:49,666
So we have to, find that ideal balance

19
00:00:49,666 --> 00:00:53,100
or trade off
between exploration and exploitation.

20
00:00:53,766 --> 00:00:57,300
we had a few tutorials
previously on, these things.

21
00:00:57,300 --> 00:01:00,433
So we talked about the multi-armed
bandit problem in a lot of detail.

22
00:01:00,433 --> 00:01:02,633
So if you haven't watched that tutorial,
I highly recommend

23
00:01:02,633 --> 00:01:04,700
jumping into the previous section
and watching it there.

24
00:01:04,700 --> 00:01:08,966
Also, understanding
of the upper confidence bounds

25
00:01:09,433 --> 00:01:14,633
algorithm will really help you grasp
the concepts of, Thompson sampling.

26
00:01:14,633 --> 00:01:19,500
So if you haven't seen the upper
confidence bound, tutorial,

27
00:01:19,733 --> 00:01:23,800
then I highly recommend checking that out
before proceeding with today's lecture.

28
00:01:24,133 --> 00:01:26,666
All right. So let's get started.

29
00:01:26,666 --> 00:01:31,600
A quick summary of the multi-armed bandit
problem, is as follows.

30
00:01:31,600 --> 00:01:32,666
We have d'Armes.

31
00:01:32,666 --> 00:01:33,600
For example,

32
00:01:33,600 --> 00:01:37,866
arms are ads that we display to users
each time they, connect to a webpage.

33
00:01:37,866 --> 00:01:40,233
So, by the way, yes. indeed.

34
00:01:40,233 --> 00:01:43,633
a modern application of the more modern

35
00:01:43,633 --> 00:01:47,400
representation of the multi-armed bandit
problem is advertising.

36
00:01:47,400 --> 00:01:50,966
So when you display ads,
that is very similar,

37
00:01:51,533 --> 00:01:54,733
or the algorithms that we're going
to be applying can learning,

38
00:01:54,733 --> 00:01:59,066
can be applied to solving a problem
where you're displaying ads.

39
00:01:59,066 --> 00:02:01,366
So if you go back here instead of having,

40
00:02:01,366 --> 00:02:04,033
these slot machines,
each one of these is a different ad,

41
00:02:04,033 --> 00:02:05,733
and you want to figure out
which one of these

42
00:02:05,733 --> 00:02:09,466
ads is the best performing ad,
but you don't have time to do an AB,

43
00:02:09,566 --> 00:02:12,933
so you don't have the funds
or the resources to do an AB test.

44
00:02:13,033 --> 00:02:14,900
You want to figure it out on the fly.

45
00:02:14,900 --> 00:02:16,366
That's when you would apply,

46
00:02:16,366 --> 00:02:19,666
one of these algorithms that we were
talking about in this part of the course,

47
00:02:20,333 --> 00:02:24,700
and of course, this of other
lots of other modern problems

48
00:02:24,700 --> 00:02:27,166
that are very similar
to the multi-armed bandit problem.

49
00:02:27,166 --> 00:02:30,500
Therefore,
these algorithms, are valid for them too.

50
00:02:31,066 --> 00:02:32,533
All right. So moving back here.

51
00:02:32,533 --> 00:02:36,333
So we've got the arms, for example,
arms are ads that we display.

52
00:02:36,333 --> 00:02:37,066
She uses each time

53
00:02:37,066 --> 00:02:40,066
they connect to a web page,
each time a user connects to this page,

54
00:02:40,266 --> 00:02:43,700
that is considered
as a round at each round.

55
00:02:43,700 --> 00:02:47,066
And we choose which to display
to the user at each round.

56
00:02:47,066 --> 00:02:50,433
And the ad gives us a reward, either
0 or 1

57
00:02:51,266 --> 00:02:53,966
one if the ad is clicked, zero
if the ad is not clicked.

58
00:02:53,966 --> 00:02:58,800
In the case of the actual bandits,
it'll be a monetary award, so it won't be

59
00:02:58,800 --> 00:03:01,066
just 0 to 1. It'll be a some value.

60
00:03:01,066 --> 00:03:05,866
our goal is to maximize maximize
the total reward we get over many rounds.

61
00:03:06,166 --> 00:03:06,500
All right.

62
00:03:06,500 --> 00:03:10,300
So that's a quick, overview of the problem
that we're solving.

63
00:03:10,800 --> 00:03:14,133
then here we've got, some very complex

64
00:03:14,533 --> 00:03:19,766
looking mathematics and Bayesian inference
and all these, distributions

65
00:03:21,033 --> 00:03:23,233
and posterior probability

66
00:03:23,233 --> 00:03:28,066
and other, prior distributions
and beta distributions and so on.

67
00:03:28,800 --> 00:03:31,066
we're not going to delve
deep into this right now.

68
00:03:31,066 --> 00:03:35,200
So had lunch will, 
take some time to walk through this slide

69
00:03:35,200 --> 00:03:38,500
with you in the practical tutorials
because,

70
00:03:39,066 --> 00:03:42,466
we're going to be coding this,
from scratch.

71
00:03:42,833 --> 00:03:46,733
well, at least in R and therefore

72
00:03:46,766 --> 00:03:49,200
he will actually walk you
through this slide.

73
00:03:49,200 --> 00:03:54,900
Our goal for today is to understand
the intuition behind all of these things.

74
00:03:54,900 --> 00:03:58,633
So we're going to skip skip
the slide and get to the intuition part.

75
00:03:58,633 --> 00:04:00,533
And this is the actual steps

76
00:04:00,533 --> 00:04:04,500
that, you're going to be using in
the practical tutorial.

77
00:04:04,500 --> 00:04:07,500
So again had lunch
will walk you through this slide as well.

78
00:04:08,300 --> 00:04:10,366
And today we're talking about intuition.

79
00:04:10,366 --> 00:04:12,233
So this is going to be fun.

80
00:04:12,233 --> 00:04:14,100
This is some interesting slides coming up.

81
00:04:14,100 --> 00:04:16,866
And get ready for a fun ride.

82
00:04:16,866 --> 00:04:19,200
So grab a cup of coffee a cup of tea

83
00:04:19,200 --> 00:04:22,300
three rings of popcorn
and, let's get started.

84
00:04:22,800 --> 00:04:25,533
All right, so here we've got a scale.

85
00:04:25,533 --> 00:04:29,733
the, 
a y or the horizontal axis is the return.

86
00:04:30,066 --> 00:04:33,066
The return that we expect
to get from a bandit.

87
00:04:33,066 --> 00:04:34,966
So we're going to look at a simplified
problem.

88
00:04:34,966 --> 00:04:37,366
We're going to look at rather than five
or even ten.

89
00:04:37,366 --> 00:04:38,800
We're going to just look at three bandits

90
00:04:38,800 --> 00:04:41,933
because there's going to be a lot of stuff
going on on this, chart.

91
00:04:41,933 --> 00:04:44,000
And I don't want to clutter it up,

92
00:04:44,000 --> 00:04:45,700
and we want to keep it
as simple as possible,

93
00:04:45,700 --> 00:04:47,366
just so that we can understand
the concept.

94
00:04:47,366 --> 00:04:50,400
And then, the same thing
applies for 5 or 10 or however

95
00:04:50,400 --> 00:04:53,500
many, machines that, you'd be looking at.

96
00:04:54,300 --> 00:04:58,366
So here we've got the return and these
vertical lines, just like in the case,

97
00:04:58,833 --> 00:05:03,400
of the, upper confidence bound,
we had the horizontal lines.

98
00:05:03,633 --> 00:05:04,633
These vertical lines,

99
00:05:05,866 --> 00:05:08,666
represent the
expected return for each of the machines.

100
00:05:08,666 --> 00:05:11,966
So, each of the machines
out of the three machines that we have,

101
00:05:11,966 --> 00:05:15,766
each of the machines
has a distribution behind it

102
00:05:15,766 --> 00:05:19,433
so that the, result,
the amount of money that you win per

103
00:05:19,900 --> 00:05:23,700
game, is picked as a random value
from that distribution.

104
00:05:24,500 --> 00:05:27,600
and we're not going to draw distributions,
but basically just imagine distribution

105
00:05:27,600 --> 00:05:31,666
behind each one of these, 
expected values.

106
00:05:31,666 --> 00:05:35,133
So this is, this is just the center
of that distribution or

107
00:05:35,466 --> 00:05:38,600
the actual expected return
from that machine.

108
00:05:38,600 --> 00:05:39,866
And we're just visualizing it here.

109
00:05:39,866 --> 00:05:44,100
But this is also kind of like
looking into,

110
00:05:44,500 --> 00:05:48,600
into the actual machine itself
or like pulling it apart

111
00:05:48,600 --> 00:05:51,300
and knowing how it works
or being the designer of the machine.

112
00:05:51,300 --> 00:05:54,000
In reality, when you're playing these
machines, of course, you don't know this.

113
00:05:54,000 --> 00:05:58,000
So this is some additional information
that will guide

114
00:05:58,000 --> 00:06:01,000
us, that will help us understand
how the algorithm actually works.

115
00:06:01,000 --> 00:06:03,366
The algorithm itself
doesn't know this information, right.

116
00:06:03,366 --> 00:06:05,633
So this is hidden,
but it's just there for us

117
00:06:05,633 --> 00:06:07,866
so that we can better understand
what's going on.

118
00:06:07,866 --> 00:06:09,066
So these are the expected returns,

119
00:06:09,066 --> 00:06:11,500
the actual expected
returns of each of the machines.

120
00:06:11,500 --> 00:06:13,100
And obviously if you knew this right away,

121
00:06:13,100 --> 00:06:16,733
you would say that machine number three
or the yellow machine is the best one.

122
00:06:17,333 --> 00:06:19,900
because it has the highest return, right?

123
00:06:19,900 --> 00:06:22,000
It has the highest expected return.

124
00:06:22,000 --> 00:06:23,000
You'd always just bet on this one.

125
00:06:23,000 --> 00:06:25,466
But again, you don't know this yet.

126
00:06:25,466 --> 00:06:28,733
All right,
so what happens, with this algorithm?

127
00:06:28,733 --> 00:06:31,733
Well, at the start,
just like with the upper confidence

128
00:06:31,733 --> 00:06:34,400
bound algorithm,
you don't know anything, right?

129
00:06:34,400 --> 00:06:38,133
You have no prior knowledge of the,
current,

130
00:06:38,733 --> 00:06:41,366
situation or status of things,
and therefore,

131
00:06:41,366 --> 00:06:44,900
all the machines are identical to you,
and you have to have at least one

132
00:06:44,900 --> 00:06:49,366
or even a couple of trial rounds
just to get some data to analyze.

133
00:06:49,366 --> 00:06:52,366
And so let's see,
let's say that has happened.

134
00:06:52,400 --> 00:06:56,366
There are some, trial runs
for the so for machine number one

135
00:06:56,366 --> 00:06:57,600
or the Blue machine.

136
00:06:57,600 --> 00:07:01,400
And, based on those trial runs,

137
00:07:01,966 --> 00:07:05,100
the algorithm, the Thompson sampling
algorithm, this is where it starts

138
00:07:05,100 --> 00:07:09,366
getting different to the upper confidence
bound will construct a distribution.

139
00:07:10,066 --> 00:07:12,333
Right. So we'll get to this distribution
in a second.

140
00:07:12,333 --> 00:07:13,166
What means.

141
00:07:13,166 --> 00:07:16,900
But for now let's just do the same thing
for machine in the Green Machine.

142
00:07:17,400 --> 00:07:18,600
And so yeah

143
00:07:18,600 --> 00:07:23,100
we just are pulling the arm several times
like four times for instance.

144
00:07:23,100 --> 00:07:26,833
And we're getting some values and

145
00:07:26,833 --> 00:07:31,200
that are going to be somewhere around
obviously the actual expected return.

146
00:07:31,300 --> 00:07:31,566
Right.

147
00:07:31,566 --> 00:07:32,533
And based on that

148
00:07:32,533 --> 00:07:35,533
and based on those values, of course, it's
probably gonna be a bit more than four.

149
00:07:36,300 --> 00:07:38,233
we're
constructing some sort of distribution

150
00:07:38,233 --> 00:07:42,166
or some sort of perception
of the current state of things.

151
00:07:42,266 --> 00:07:44,566
And this is the part
where it gets interesting.

152
00:07:44,566 --> 00:07:47,433
So the actual meaning of this,

153
00:07:47,433 --> 00:07:50,600
these distributions is not
which you might think at first.

154
00:07:50,600 --> 00:07:54,500
So, these distributions

155
00:07:54,866 --> 00:07:58,933
are actually showing us
or they're representing

156
00:07:59,133 --> 00:08:03,300
not the we're not trying to guess
the distributions behind the machine.

157
00:08:03,300 --> 00:08:06,333
So the first thing that might come to mind
is that all right.

158
00:08:06,333 --> 00:08:07,733
So we've constructed distribution.

159
00:08:07,733 --> 00:08:11,766
And so the blue distribution
is our attempt

160
00:08:11,766 --> 00:08:14,866
at guessing the actual distribution
behind the blue machine.

161
00:08:14,866 --> 00:08:15,233
Right.

162
00:08:15,233 --> 00:08:18,300
The distribution
that this is the expected return for.

163
00:08:18,300 --> 00:08:19,600
And the green is for the green machine.

164
00:08:19,600 --> 00:08:22,266
The yellow set
yellow machine will actually not.

165
00:08:22,266 --> 00:08:23,100
That's not the case.

166
00:08:23,100 --> 00:08:26,100
We are constructing something
completely different,

167
00:08:26,100 --> 00:08:28,166
something completely out of this world.

168
00:08:28,166 --> 00:08:31,766
We're constructing distributions of where

169
00:08:31,766 --> 00:08:36,133
we think
the actual expected value might lie.

170
00:08:37,100 --> 00:08:38,533
It's very important to understand that.

171
00:08:38,533 --> 00:08:41,633
So, we are creating, kind of a,

172
00:08:42,400 --> 00:08:44,800
if you,
if you want to think of it that way,

173
00:08:44,800 --> 00:08:48,733
we're creating an auxiliary mechanism
for us to solve the problem.

174
00:08:48,733 --> 00:08:53,166
So, we're not we're not trying
to recreate these machines.

175
00:08:53,166 --> 00:08:57,433
We're recreating the possible way
these machines could have been created,

176
00:08:57,500 --> 00:08:58,466
kind of in that sense.

177
00:08:58,466 --> 00:09:03,000
So let's, let's, 
just solidify that this is where

178
00:09:03,000 --> 00:09:07,266
we think that expected
the actual expected values will be.

179
00:09:07,266 --> 00:09:09,133
So let's look at the blue one.
For instance.

180
00:09:09,133 --> 00:09:11,133
We got four values.

181
00:09:11,133 --> 00:09:14,666
And based on those four values
we've constructed this distribution

182
00:09:15,133 --> 00:09:20,300
which will show us which is showing us 
where is that value.

183
00:09:20,300 --> 00:09:21,133
Mu star.

184
00:09:21,133 --> 00:09:23,900
So this is mu star,
the actual mu start, but we don't know it.

185
00:09:23,900 --> 00:09:24,966
So the algorithm doesn't know it.

186
00:09:24,966 --> 00:09:29,033
So it's constructed a distribution
trying to guess where this value is.

187
00:09:29,033 --> 00:09:30,866
And of course can't just say it's over
here.

188
00:09:30,866 --> 00:09:32,100
It's over there, it's over there.

189
00:09:32,100 --> 00:09:35,000
It's saying, okay, there's a there's a
very high likelihood it's over here.

190
00:09:35,000 --> 00:09:37,133
But it also could be here,
it could be here, could be here.

191
00:09:37,133 --> 00:09:39,133
And as you move
away, the likelihood drops.

192
00:09:39,133 --> 00:09:42,066
But it still could be anywhere
in this blue space.

193
00:09:42,066 --> 00:09:44,966
Same thing for the green. distribution.

194
00:09:44,966 --> 00:09:49,200
So based on the values that we've seen, 
the random values that have been selected

195
00:09:49,200 --> 00:09:50,400
in the for four rounds,

196
00:09:51,466 --> 00:09:55,866
the algorithm
has created this distribution,

197
00:09:56,433 --> 00:09:59,433
which is saying that this actual,

198
00:09:59,700 --> 00:10:03,600
actual expected
return from a sheet from the Green

199
00:10:03,600 --> 00:10:08,133
Machine is somewhere in this area,
and it's most likely to be here,

200
00:10:08,133 --> 00:10:10,766
but it could be here, it could be here,
it could be here, it could be way

201
00:10:10,766 --> 00:10:13,533
it could be. So basically, at most it's
most likely to be here.

202
00:10:13,533 --> 00:10:15,400
Then it could be here, it could be here.

203
00:10:15,400 --> 00:10:19,633
And as you move away, the likelihood of it
actually being there drops off.

204
00:10:19,900 --> 00:10:20,966
Same thing for the yellow.

205
00:10:20,966 --> 00:10:22,233
So it's very important to understand.

206
00:10:22,233 --> 00:10:27,666
So just to reiterate we are not trying to
guess the distributions behind machines.

207
00:10:27,666 --> 00:10:28,233
Right.

208
00:10:28,233 --> 00:10:32,033
We are doing
like kind of a little magic trick

209
00:10:32,033 --> 00:10:35,266
or a little hat or a hat trick is
I don't know if it's called a hat trick,

210
00:10:35,266 --> 00:10:41,233
but we're trying to, do this,
this create this perception of the world.

211
00:10:41,233 --> 00:10:45,233
We're trying to, mathematically

212
00:10:45,233 --> 00:10:49,566
explain what we think is actually going on
or what could be going on.

213
00:10:49,900 --> 00:10:52,633
And that is, that is important.

214
00:10:52,633 --> 00:10:56,800
also important thing,
because this demonstrates

215
00:10:56,800 --> 00:11:00,666
that the Thompson sampling
is a probabilistic algorithm.

216
00:11:00,933 --> 00:11:04,600
The upper confidence bound
was a deterministic algorithm

217
00:11:04,600 --> 00:11:06,033
where everything was strict.

218
00:11:06,033 --> 00:11:08,033
You know, everything was okay.

219
00:11:08,033 --> 00:11:08,800
So whichever

220
00:11:08,800 --> 00:11:11,800
one has a pious upper confidence bound,
that's what we're going to choose

221
00:11:12,000 --> 00:11:12,566
and so on.

222
00:11:12,566 --> 00:11:17,266
But here we're creating a probabilistic,
perception of the world.

223
00:11:17,266 --> 00:11:19,500
We're saying, so it's likely to be here,

224
00:11:19,500 --> 00:11:21,166
but it could be anywhere
in this blue area,

225
00:11:21,166 --> 00:11:23,400
and this one could be in this green
and so on.

226
00:11:23,400 --> 00:11:25,333
And you'll see exactly
why we've done this.

227
00:11:25,333 --> 00:11:27,833
In the next slide,
we'll understand how this works.

228
00:11:27,833 --> 00:11:29,533
And let's jump straight into that.

229
00:11:29,533 --> 00:11:32,533
Let's understand
now that this is probably the hardest part

230
00:11:32,533 --> 00:11:36,100
to kind of,
get your head around what we've created.

231
00:11:36,100 --> 00:11:39,933
And now that we've created it,
let's see how the algorithm is going to,

232
00:11:40,500 --> 00:11:44,400
utilize this,
auxiliary mechanism that we have created.

233
00:11:44,700 --> 00:11:46,366
All right. So let's have a look.

234
00:11:46,366 --> 00:11:48,333
So there are our distributions.

235
00:11:48,333 --> 00:11:49,300
That's where we think.

236
00:11:49,300 --> 00:11:53,333
So these are the actual expected returns
for each of the machines.

237
00:11:53,600 --> 00:11:54,833
But the algorithm doesn't know them.

238
00:11:54,833 --> 00:11:59,233
The old algorithm has created
these are these distributions where

239
00:11:59,566 --> 00:12:03,066
which allow it to kind of guess
where the

240
00:12:03,066 --> 00:12:06,866
actual distribution might lie for each
for each of these machines.

241
00:12:07,033 --> 00:12:09,933
So what it's going to do
is it's actually going to trigger

242
00:12:09,933 --> 00:12:10,966
each of these distributions.

243
00:12:10,966 --> 00:12:14,200
So like we're in
the we're in a new round.

244
00:12:14,200 --> 00:12:16,300
We have to choose a machine to use.

245
00:12:16,300 --> 00:12:19,900
So what the algorithm will do is it
will go and call

246
00:12:20,600 --> 00:12:24,033
this distribution and it will pull out
a value out of this distribution.

247
00:12:24,033 --> 00:12:25,533
And let's say it pulled that value.

248
00:12:25,533 --> 00:12:27,633
Then it'll pull a value
out of the green distribution.

249
00:12:27,633 --> 00:12:30,400
Let's say it pulled that value
out of the green distribution.

250
00:12:30,400 --> 00:12:33,200
And let's say
and then pulls a value out of the yellow.

251
00:12:33,200 --> 00:12:35,566
This is that distribution of that value.

252
00:12:35,566 --> 00:12:37,033
And again it's pulling them.

253
00:12:37,033 --> 00:12:39,166
So according to the distribution. Right.

254
00:12:39,166 --> 00:12:41,133
So this is a distribution of values.

255
00:12:41,133 --> 00:12:44,133
So basically it's most likely
to pull a value

256
00:12:44,300 --> 00:12:48,600
somewhere in this area then less likely in
for the way you go, it's more

257
00:12:48,600 --> 00:12:52,533
or less less and less likely,
but still it can happen that you can see

258
00:12:52,533 --> 00:12:55,600
that this yellow value
is actually quite far from the center,

259
00:12:55,600 --> 00:12:58,600
but it's still can happen that it pulled
this value out of distribution.

260
00:12:58,733 --> 00:12:59,466
And it can happen.

261
00:12:59,466 --> 00:13:01,633
To pick this green, a value
out of the green distribution

262
00:13:01,633 --> 00:13:03,833
totally can happen in the long term,
of course, is

263
00:13:03,833 --> 00:13:06,833
going to be picking somewhere close
to the center like over the long run.

264
00:13:07,033 --> 00:13:10,700
but on a one off basis,
this can totally happen.

265
00:13:11,366 --> 00:13:14,866
And so now it's picked these values
and guess what that means?

266
00:13:14,866 --> 00:13:19,766
Well, what this actually means
is that we have by doing that,

267
00:13:19,766 --> 00:13:23,333
we have generated our own bandit
configuration.

268
00:13:23,700 --> 00:13:26,433
So we have created a

269
00:13:26,433 --> 00:13:30,366
this hypothetical or imaginary

270
00:13:30,766 --> 00:13:34,633
batch of machine or not batch

271
00:13:34,633 --> 00:13:37,766
imaginary set of machines in our own

272
00:13:37,766 --> 00:13:42,300
virtual world
where we're saying, okay, so the expected,

273
00:13:42,300 --> 00:13:46,566
the actual expected, return for the blue
machine is this value.

274
00:13:46,866 --> 00:13:49,733
The actual expected return
for the green Machine is this value

275
00:13:49,733 --> 00:13:52,266
and the actual expected return
for the yellow machine is this value.

276
00:13:52,266 --> 00:13:57,100
So we've created this, subzero
upside of world

277
00:13:57,100 --> 00:14:02,800
or hypothetical virtual reality in which
we have all our own three bandits.

278
00:14:02,800 --> 00:14:04,200
And how are we
going to solve this problem?

279
00:14:05,200 --> 00:14:07,466
and this whole problem is very easy
to solve.

280
00:14:07,466 --> 00:14:09,600
It's obvious how to solve this problem.

281
00:14:09,600 --> 00:14:11,100
You just pick this machine, right?

282
00:14:11,100 --> 00:14:15,300
Because this machine has the highest
expected return out of the three.

283
00:14:15,833 --> 00:14:18,766
And obviously just going to go
with this machine in the virtual world

284
00:14:18,766 --> 00:14:21,800
in the, sort of the reality.

285
00:14:22,600 --> 00:14:27,866
And, what that means is that now
we translate this result

286
00:14:27,866 --> 00:14:32,066
into the actual world in the virtual,
hypothetical world.

287
00:14:32,066 --> 00:14:33,266
We've selected the Green Machine.

288
00:14:33,266 --> 00:14:35,633
So in the actual world, the,

289
00:14:35,633 --> 00:14:39,066
the algorithm will also select
the Green Machine and what that will do.

290
00:14:39,066 --> 00:14:42,000
So basically
pull the lever for this machine right.

291
00:14:42,000 --> 00:14:46,433
And what it will do is actually
it will, give us a value.

292
00:14:46,433 --> 00:14:48,366
So the machine will spit it
out, spit out a value.

293
00:14:48,366 --> 00:14:52,933
But that value is going to be based
on the distribution behind this machine

294
00:14:52,933 --> 00:14:55,700
where this is the actual expected
value of that distribution.

295
00:14:55,700 --> 00:14:59,966
So the value is going to be somewhere
here, probably close to the actual

296
00:15:00,133 --> 00:15:01,733
expected value doesn't have to be.

297
00:15:01,733 --> 00:15:03,266
Again
there's a distribution behind all of this.

298
00:15:03,266 --> 00:15:03,866
It could be far away.

299
00:15:03,866 --> 00:15:06,300
It could be close.
But let's say in this case it's this one.

300
00:15:06,300 --> 00:15:06,633
All right.

301
00:15:06,633 --> 00:15:10,466
So now this information
this is new information to the algorithm.

302
00:15:10,600 --> 00:15:13,266
What it's going to do
is it's going to say okay.

303
00:15:13,266 --> 00:15:18,633
So the I pull the green lever or the lever
for the green machine I got this value.

304
00:15:18,866 --> 00:15:21,933
So now I have to adjust my perception
of the world.

305
00:15:21,933 --> 00:15:24,933
Right. So I have a, prior probability.

306
00:15:24,933 --> 00:15:28,800
So these are my, well, for the green
Machine, this is my prior, distribution.

307
00:15:28,800 --> 00:15:32,633
This is where the Bayesian inference comes
into play, or it's already been in play.

308
00:15:32,633 --> 00:15:34,733
And this is where
adding to the Bayesian inference.

309
00:15:34,733 --> 00:15:37,300
So that's our prior probability prior
distribution.

310
00:15:37,300 --> 00:15:39,300
Now we've got some new information.

311
00:15:39,300 --> 00:15:40,533
This is our new information.

312
00:15:40,533 --> 00:15:43,166
We're going to add it in and see what
see how that changes.

313
00:15:43,166 --> 00:15:46,733
Our perception of the world
or perception of the world has changed.

314
00:15:46,733 --> 00:15:50,100
The distribution has shifted a bit
and it's become,

315
00:15:50,833 --> 00:15:52,900
narrower
because we have more information.

316
00:15:52,900 --> 00:15:54,600
We our sample sizes increase, of course.

317
00:15:56,300 --> 00:15:56,700
Excuse me.

318
00:15:56,700 --> 00:15:57,000
Of course.

319
00:15:57,000 --> 00:16:00,000
It's not going to increase that that much.

320
00:16:00,366 --> 00:16:00,600
Yeah.

321
00:16:00,600 --> 00:16:02,900
This is just an example to,

322
00:16:02,900 --> 00:16:05,900
demonstrate what we're talking about
to get the point across.

323
00:16:06,233 --> 00:16:09,666
But that's, that's the point that,
every time we add new information,

324
00:16:09,666 --> 00:16:12,666
our distribution
becomes more and more refined.

325
00:16:13,000 --> 00:16:15,200
All right, so now we have a new perception
in the world.

326
00:16:15,200 --> 00:16:18,433
Now what happens
next is a new round, right.

327
00:16:18,433 --> 00:16:19,000
Same thing.

328
00:16:19,000 --> 00:16:22,000
We're going to do the same thing again
for a new round again.

329
00:16:22,166 --> 00:16:25,233
we generate or we pick some values
out of a distributions.

330
00:16:25,233 --> 00:16:26,233
There they are.

331
00:16:26,233 --> 00:16:31,366
now we've constructed a or we've
generated our own bounded configuration,

332
00:16:31,366 --> 00:16:36,733
you know, virtual or virtual reality
or in our hypothetical world,

333
00:16:36,900 --> 00:16:39,000
out of these three,
we have to pick the best bandit,

334
00:16:39,000 --> 00:16:42,000
which is, of course,
the one here, the Yellow Bandit.

335
00:16:42,100 --> 00:16:46,833
And we're going to, now pull the actual
pull in the real world pool.

336
00:16:46,833 --> 00:16:50,033
That, lever of the yellow bandit
algorithm is going to do that.

337
00:16:50,166 --> 00:16:53,933
That's going to trigger the distribution
behind the Yellow bandit,

338
00:16:53,933 --> 00:16:56,500
and that will give us some sort of value.

339
00:16:56,500 --> 00:16:59,833
So this is the actual value
that we received in the real world.

340
00:17:00,133 --> 00:17:04,800
Now we're going to incorporate that value
into our perception of the world.

341
00:17:04,800 --> 00:17:08,100
And our perception of the world
is going to change, is going to adjust.

342
00:17:08,100 --> 00:17:10,800
There
we go. Now we're going to do that again.

343
00:17:10,800 --> 00:17:11,200
All right.

344
00:17:11,200 --> 00:17:13,200
So let's let's just do this one more time.

345
00:17:13,200 --> 00:17:16,266
Just so we practice the logic behind
all of this.

346
00:17:16,533 --> 00:17:19,533
So a new round generates

347
00:17:19,533 --> 00:17:23,366
the generates the bandit configuration.

348
00:17:23,366 --> 00:17:23,600
Right.

349
00:17:23,600 --> 00:17:27,133
So this is what we're going to think
that our,

350
00:17:27,133 --> 00:17:31,366
expected actual expected
returns are going to be our conversion.

351
00:17:31,366 --> 00:17:33,000
This is obviously the best one.

352
00:17:33,000 --> 00:17:35,566
we're going to use a pool.

353
00:17:35,566 --> 00:17:37,433
The yellow machines are a lever.

354
00:17:37,433 --> 00:17:40,200
That's going to trigger
the distribution behind.

355
00:17:40,200 --> 00:17:42,400
The yellow machine is going to spit out
a value.

356
00:17:42,400 --> 00:17:44,500
In the real world, there's a value.

357
00:17:44,500 --> 00:17:47,933
And then we're going to have to adjust
our perception of the world again

358
00:17:48,266 --> 00:17:53,200
to match that or to incorporate
the new information and so on.

359
00:17:53,200 --> 00:17:57,000
We're going to keep doing that until,
we get to a point

360
00:17:57,000 --> 00:18:00,500
where we've refined the distributions
substantially.

361
00:18:00,500 --> 00:18:01,533
And the picture looks like this.

362
00:18:01,533 --> 00:18:03,866
So, they might be refined even more.

363
00:18:03,866 --> 00:18:06,333
This one might be more refined,
this one might be more refined.

364
00:18:06,333 --> 00:18:08,233
But as you could see from there,

365
00:18:08,233 --> 00:18:11,666
we slowly will start generating more
and more rounds

366
00:18:11,666 --> 00:18:14,766
based on the yellow machine
will be triggering the yellow machine.

367
00:18:14,766 --> 00:18:17,766
So these ones might not even get
that refined in the long run,

368
00:18:17,866 --> 00:18:21,033
which is totally fine,
because our point is to get

369
00:18:21,033 --> 00:18:24,466
to the best machine, to find it
and exploit it as much as we can.

370
00:18:25,200 --> 00:18:26,366
So there we go.

371
00:18:26,366 --> 00:18:29,833
That is pretty much how the Thompson
sampling algorithm works.

372
00:18:30,466 --> 00:18:34,366
And as you can see,
it is a, probabilistic algorithm.

373
00:18:34,366 --> 00:18:37,966
And every time
we're generating these values,

374
00:18:38,166 --> 00:18:42,533
they are
we kind of creating this hypothetical,

375
00:18:43,300 --> 00:18:46,366
setup of the bandits.

376
00:18:46,366 --> 00:18:49,600
And then we're solving that,
and then we're applying the results

377
00:18:49,600 --> 00:18:50,333
to the real world.

378
00:18:50,333 --> 00:18:51,900
We're adjusting our perception of reality

379
00:18:51,900 --> 00:18:54,100
based on the new information
that that generates.

380
00:18:54,100 --> 00:18:55,433
And then we're doing that again.

381
00:18:55,433 --> 00:18:59,600
So hopefully this was a, interesting
and tutorial.

382
00:18:59,600 --> 00:19:01,933
I find this, algorithm very, very cool.

383
00:19:01,933 --> 00:19:05,933
And in the next tutorial
we're going to compare, a little bit,

384
00:19:06,000 --> 00:19:08,966
and our confidence bound and the Thompson
sampling algorithm.

385
00:19:08,966 --> 00:19:10,266
And I can't wait to see you there.

386
00:19:10,266 --> 00:19:12,133
Until next time, happy analyzing.