1
00:00:00,233 --> 00:00:02,566
Hello and welcome to this art tutorial.

2
00:00:02,566 --> 00:00:06,200
So in the previous section we introduced
a multi-armed bandit problem for this

3
00:00:06,500 --> 00:00:09,200
ad click through rate optimization problem

4
00:00:09,200 --> 00:00:11,933
and we already tried two algorithms.

5
00:00:11,933 --> 00:00:16,000
The first algorithm was the simple random
selection algorithm

6
00:00:16,000 --> 00:00:20,900
that consists of selecting a pure random,
an ad at each round.

7
00:00:21,200 --> 00:00:25,866
And this algorithm gave us a total reward
of 1002 hundred on average.

8
00:00:25,866 --> 00:00:28,000
Because, you know,
there is this random factor.

9
00:00:28,000 --> 00:00:31,133
But on average
we got 1002 hundred total reward.

10
00:00:31,500 --> 00:00:34,133
And of course,
when we plotted the histogram,

11
00:00:34,133 --> 00:00:37,266
every ad was selected
more or less the same number of times.

12
00:00:37,633 --> 00:00:41,233
And then we tried another algorithm
which was the upper confidence

13
00:00:41,233 --> 00:00:42,433
bound algorithm.

14
00:00:42,433 --> 00:00:47,366
And there we got much better results
because not only we managed to

15
00:00:47,366 --> 00:00:53,133
almost double the total reward
because we got a total reward of 2178.

16
00:00:53,366 --> 00:00:56,533
So almost a double of what we obtained
with the random selection.

17
00:00:56,900 --> 00:00:59,933
And the even better thing
is that we managed

18
00:00:59,933 --> 00:01:03,466
to figure out
which ad version was the best to show

19
00:01:03,466 --> 00:01:07,366
to the user, that is, which ad version
had the highest conversion

20
00:01:07,366 --> 00:01:10,833
rate, the highest CTR,
that is the highest click through rate.

21
00:01:11,300 --> 00:01:13,100
And now in this section

22
00:01:13,100 --> 00:01:17,366
we are going to try a new algorithm
which is called Thompson Sampling.

23
00:01:17,800 --> 00:01:20,733
And so in this section
we are going to look at two things.

24
00:01:20,733 --> 00:01:25,166
First thing is Ken Thompson
sampling beat the upper confidence bound.

25
00:01:25,166 --> 00:01:26,933
That is Ken Thompson sampling.

26
00:01:26,933 --> 00:01:31,533
Give us a total reward
that is even higher than 2178.

27
00:01:31,766 --> 00:01:34,900
You know, we almost doubled the total
reward of random selection.

28
00:01:35,133 --> 00:01:37,600
Let's see if we can beat that again.

29
00:01:37,600 --> 00:01:41,033
Like more than double it or even triple it
I don't know.

30
00:01:41,066 --> 00:01:41,733
Let's see.

31
00:01:41,733 --> 00:01:44,233
And the second thing
that we're going to look at is

32
00:01:44,233 --> 00:01:48,000
if we get the same ad version
that has the highest conversion rate.

33
00:01:48,100 --> 00:01:50,700
Let's
see if we get ad version number five,

34
00:01:50,700 --> 00:01:53,933
which was the ad found
by the upper confidence bound algorithm.

35
00:01:53,966 --> 00:01:55,800
It found ad version number five.

36
00:01:55,800 --> 00:01:59,100
So let's hope that we also get
the ad version number five.

37
00:01:59,400 --> 00:02:01,200
This should logically be the case

38
00:02:01,200 --> 00:02:05,000
if we managed to be the upper confidence
bound in terms of total reward.

39
00:02:05,166 --> 00:02:06,333
So let's do this.

40
00:02:06,333 --> 00:02:08,833
Let's implement Thompson
sampling in this section.

41
00:02:08,833 --> 00:02:11,233
We're actually going to do that
in one step.

42
00:02:11,233 --> 00:02:13,166
Because when you think about it

43
00:02:13,166 --> 00:02:17,600
we will only need to change the strategy
here in this for n loop.

44
00:02:17,700 --> 00:02:20,666
And then keep the same to implement
Thompson sampling.

45
00:02:20,666 --> 00:02:22,933
Because you know
we'll give these parameters here.

46
00:02:22,933 --> 00:02:27,000
That is this number of rounds,
this number of ads, this ad selected

47
00:02:27,000 --> 00:02:30,400
vector that contains all the different ads
selected at each round.

48
00:02:30,833 --> 00:02:33,900
And this we will need to change
because these are the parameters

49
00:02:33,900 --> 00:02:35,666
of the upper confidence bound algorithm.

50
00:02:35,666 --> 00:02:37,033
So we will need to change it.

51
00:02:37,033 --> 00:02:39,733
And of course
we will give this total reward

52
00:02:39,733 --> 00:02:43,133
because we want to get the total reward
accumulated

53
00:02:43,133 --> 00:02:46,500
through the execution of this Thompson
sampling algorithm.

54
00:02:46,933 --> 00:02:50,666
So what we're going to do
now is simply take everything from here

55
00:02:51,000 --> 00:02:55,500
to here, copy, and then paste it here.

56
00:02:55,500 --> 00:02:57,966
And we will only change
what we need to change.

57
00:02:57,966 --> 00:03:01,966
That is the right parameters
for the Thompson sampling algorithm.

58
00:03:02,466 --> 00:03:04,133
All right. So let's do it.

59
00:03:04,133 --> 00:03:08,900
We can already replace
UCB here by Thompson sampling.

60
00:03:10,633 --> 00:03:13,266
And now let's change the strategy.

61
00:03:13,266 --> 00:03:15,466
But before let's start with the basics.

62
00:03:15,466 --> 00:03:18,033
Let's set the right
folder as working directory.

63
00:03:18,033 --> 00:03:19,833
So let's go to our Machine learning

64
00:03:19,833 --> 00:03:24,033
A-Z folder then part six Reinforcement
learning Thompson Sampling.

65
00:03:24,033 --> 00:03:28,733
And then make sure that you have this
as CTR optimization CSV file that is

66
00:03:28,733 --> 00:03:33,466
of course the same CSV file as the one
we had in upper confidence bound.

67
00:03:33,766 --> 00:03:35,700
So if you have this data set, you're
now ready

68
00:03:35,700 --> 00:03:38,933
to click on this more button here
and then set as working directory.

69
00:03:39,866 --> 00:03:40,266
All right.

70
00:03:40,266 --> 00:03:44,000
So now let's implement
our Thompson sampling algorithm.

71
00:03:44,333 --> 00:03:48,600
So let's directly jump to the slide
of the Thompson sampling algorithm.

72
00:03:49,600 --> 00:03:49,933
All right.

73
00:03:49,933 --> 00:03:51,200
So what do we see here.

74
00:03:51,200 --> 00:03:54,200
This Thompson
sampling algorithm takes three steps.

75
00:03:54,400 --> 00:03:59,666
First step is at each round we need to
consider two numbers for each eye.

76
00:04:00,033 --> 00:04:04,533
The first number is the number of times
the ad I got reward one up to round

77
00:04:04,533 --> 00:04:08,333
n, and the second number
is the number of times the ad I got.

78
00:04:08,333 --> 00:04:13,333
We weren't zero up to round N, so that's
the first thing we'll do here.

79
00:04:13,333 --> 00:04:15,666
We will consider these parameters

80
00:04:15,666 --> 00:04:19,333
and declare the variables
corresponding to these parameters.

81
00:04:19,633 --> 00:04:25,033
And we can notice that if we compare this
Thompson sampling algorithm to the UCB

82
00:04:25,033 --> 00:04:29,500
algorithm, well it's the same step
one with different parameters.

83
00:04:29,500 --> 00:04:32,733
Because as you can notice here in the step
one of the upper

84
00:04:32,733 --> 00:04:36,533
confidence bound algorithm,
we also consider two numbers.

85
00:04:36,566 --> 00:04:40,333
And these two numbers are the number
of times the ad I was selected up to round

86
00:04:40,333 --> 00:04:44,600
n, and the sum of rewards of the ad
I up to round n.

87
00:04:45,066 --> 00:04:49,266
So if we have a look at this code
section here, which is the code section

88
00:04:49,266 --> 00:04:51,333
for the upper confidence bound algorithm,

89
00:04:51,333 --> 00:04:53,933
well, we can see that
these two parameters are here

90
00:04:53,933 --> 00:04:59,566
and that these parameters are no longer
the parameters for Thompson sampling.

91
00:04:59,900 --> 00:05:01,066
And these two parameters

92
00:05:01,066 --> 00:05:05,433
that we considered in step one in the UCB
algorithm are actually replaced

93
00:05:05,433 --> 00:05:09,333
by two new parameters in the Thompson
sampling algorithm.

94
00:05:09,333 --> 00:05:13,733
So what we'll do right now
is simply to remove these two parameters

95
00:05:13,733 --> 00:05:17,133
that are the parameters considered
in the step one of the UCB algorithm,

96
00:05:17,400 --> 00:05:20,733
and replace them
by these two new parameters

97
00:05:20,733 --> 00:05:23,600
that we need to consider for Thompson
sampling algorithm.

98
00:05:23,600 --> 00:05:25,666
So let's replace them right away.

99
00:05:25,666 --> 00:05:28,666
And therefore let's declare
two new variables.

100
00:05:28,733 --> 00:05:33,766
So first number the number of times
the ad I got we want one up to round n.

101
00:05:34,033 --> 00:05:40,200
So let's call this number
numbers of rewards and then underscore.

102
00:05:40,366 --> 00:05:45,166
And one to specify that it's
the number of times the ad got reward one.

103
00:05:45,500 --> 00:05:46,966
And then the second number

104
00:05:46,966 --> 00:05:51,300
is the number of times
the ad I got reward zero up to run.

105
00:05:51,300 --> 00:05:56,400
And so same we're going to create
this variable numbers of rewards zero.

106
00:05:57,000 --> 00:05:59,533
Now what are going to be
these two variables.

107
00:05:59,533 --> 00:06:02,400
So these are the parameters of Thompson
sampling

108
00:06:02,400 --> 00:06:05,700
at the essence of the future strategy
that we're going to have here.

109
00:06:06,033 --> 00:06:09,300
So these two variables here
are going to be some vectors

110
00:06:09,533 --> 00:06:12,333
of the elements that has ten elements.

111
00:06:12,333 --> 00:06:15,533
And as you might have guessed
they will contain for each ad

112
00:06:15,533 --> 00:06:21,133
the number of times they got rewards,
one up to round N and zero up to one.

113
00:06:21,133 --> 00:06:23,533
And so we're going to initialize

114
00:06:23,533 --> 00:06:27,300
these vectors the same way as we did
for upper confidence bound.

115
00:06:27,300 --> 00:06:30,300
That is we're going to take this integer d

116
00:06:30,533 --> 00:06:33,466
that creates a vector of the elements.

117
00:06:33,466 --> 00:06:35,433
And those d elements are zeros.

118
00:06:35,433 --> 00:06:39,666
So that's exactly how we are going
to initialize these two variables here

119
00:06:39,933 --> 00:06:42,933
by a vector of ten zeros.

120
00:06:43,133 --> 00:06:47,133
Because of course at the beginning
the number of rewards of each ad is

121
00:06:47,133 --> 00:06:50,533
of course zero because simply each
ad wasn't selected yet.

122
00:06:51,100 --> 00:06:53,533
All right. So we have our two arguments.

123
00:06:53,533 --> 00:06:56,766
That means that step one is done
and we can move on to step two.

124
00:06:57,100 --> 00:07:01,833
So step two is for each ad I
we take a random draw

125
00:07:01,833 --> 00:07:05,300
from this distribution below
which is the beta distribution.

126
00:07:05,666 --> 00:07:06,600
And why is that?

127
00:07:06,600 --> 00:07:07,300
It's because we have

128
00:07:07,300 --> 00:07:11,033
two important assumptions here
which are related to Bayesian inference.

129
00:07:11,566 --> 00:07:13,533
So the first assumption
is this first line.

130
00:07:13,533 --> 00:07:17,900
Here we suppose that each ad
I gets rewards from a Bernoulli

131
00:07:17,900 --> 00:07:22,633
distribution of parameter theta
I, which is the probability of success.

132
00:07:22,633 --> 00:07:26,000
And you can picture this
probability of success by showing the ad

133
00:07:26,000 --> 00:07:29,866
to a huge amount of users,
like 1 million users and theta.

134
00:07:29,866 --> 00:07:33,633
I could be interpreted as the number of
times the ad because we were one.

135
00:07:33,666 --> 00:07:37,600
That is, the number of successes divided
by the total number of times we selected

136
00:07:37,600 --> 00:07:39,433
the ad. That is 1 million.

137
00:07:39,433 --> 00:07:42,566
So basically theta
I is the probability of success.

138
00:07:42,566 --> 00:07:46,000
That is the probability of getting reward
one when we select the ad.

139
00:07:46,500 --> 00:07:51,966
And so the assumption is that each
ad I get rewards zero and one from this

140
00:07:51,966 --> 00:07:56,400
Bernoulli distribution of parameter theta
I, which is the probability of success.

141
00:07:57,133 --> 00:08:00,933
And then the second assumption, a less
strong assumption than the first one,

142
00:08:00,933 --> 00:08:03,233
which is the strongest assumption
that we have.

143
00:08:03,233 --> 00:08:05,666
The second assumption,
which is this second line here,

144
00:08:05,666 --> 00:08:09,300
and that is that we assume that theta
I has a uniform distribution

145
00:08:09,433 --> 00:08:11,466
which has the prior distribution.

146
00:08:11,466 --> 00:08:14,833
And then we use the Bayes rule
to get the posterior distribution,

147
00:08:14,833 --> 00:08:16,333
which is p c to I.

148
00:08:16,333 --> 00:08:19,033
Given
the rewards that we got up to the round n.

149
00:08:19,033 --> 00:08:20,633
And so by using Bayes rule,

150
00:08:20,633 --> 00:08:24,633
that's how we get this beta distribution
in this step two here.

151
00:08:25,000 --> 00:08:28,000
And so by taking at each round
these random draws of these

152
00:08:28,000 --> 00:08:31,233
beta distribution, well,
since these random draws represent

153
00:08:31,233 --> 00:08:33,566
nothing else,
then this probability of success,

154
00:08:33,566 --> 00:08:37,733
we get the strategy here, which is to take
the maximum of these random draws.

155
00:08:37,733 --> 00:08:39,900
Because the maximum of these random draws

156
00:08:39,900 --> 00:08:43,433
is approximating
the highest probability of success.

157
00:08:43,766 --> 00:08:45,933
And that's the whole idea behind Thompson
sampling.

158
00:08:45,933 --> 00:08:49,566
It's that we are trying
to estimate these parameter theta

159
00:08:49,933 --> 00:08:53,200
that are the probability
of success of each of these ten ads.

160
00:08:53,400 --> 00:08:56,400
Then by taking these random draws
and taking the highest of them,

161
00:08:56,400 --> 00:08:59,566
we are estimating
the highest probability of success.

162
00:08:59,833 --> 00:09:03,600
And this highest probability of success
corresponds to one specific

163
00:09:03,600 --> 00:09:05,000
ad at each round.

164
00:09:05,000 --> 00:09:08,733
So when we take these random draws
at a specific round, we might be wrong.

165
00:09:08,933 --> 00:09:12,700
But when we take these random
draws over thousands of rounds, well,

166
00:09:12,700 --> 00:09:17,866
just based on the essence of probability,
we obtain over all the theta

167
00:09:17,866 --> 00:09:20,433
that corresponds to the ad
that has the highest probability

168
00:09:20,433 --> 00:09:23,866
of success, that is, the highest
probability of getting reward equals one.

169
00:09:24,266 --> 00:09:26,533
And that's the idea behind Thompson
sampling.

170
00:09:26,533 --> 00:09:30,100
And by the way taking these maximum
of these random draws that is this maximum

171
00:09:30,100 --> 00:09:33,666
of these estimations of the probability
of getting reward equals one.

172
00:09:33,933 --> 00:09:36,533
Well that's nothing else but step three.

173
00:09:36,533 --> 00:09:39,600
And so right now
what we'll do is to implement

174
00:09:39,600 --> 00:09:44,100
the strategy composed of step two
and step three in this code section here

175
00:09:44,200 --> 00:09:48,500
replacing the old strategy of
getting the upper confidence bound okay.

176
00:09:48,500 --> 00:09:50,366
So let's do it efficiently.

177
00:09:50,366 --> 00:09:52,666
Let's keep the logic behind this code
section.

178
00:09:52,666 --> 00:09:55,133
Let's not delete everything too quickly.

179
00:09:55,133 --> 00:09:58,466
You know, since we'll get the different
random draws of each add at each round.

180
00:09:58,733 --> 00:10:01,766
And since we need to take this highest
random draw, well,

181
00:10:01,766 --> 00:10:07,500
we should keep this coding strategy here
that gets the maximum value of something.

182
00:10:07,833 --> 00:10:11,066
So we'll replace this max upper bound here

183
00:10:11,066 --> 00:10:14,066
by a max random.

184
00:10:14,200 --> 00:10:18,033
Because you know, in the UCB algorithm
we needed to take the maximum upper

185
00:10:18,033 --> 00:10:18,966
confidence bound.

186
00:10:18,966 --> 00:10:22,666
And here for Thompson sampling
we need to take the maximum random draw.

187
00:10:22,666 --> 00:10:26,100
So we call this maximum random draw
max random.

188
00:10:26,666 --> 00:10:27,066
All right.

189
00:10:27,066 --> 00:10:31,433
And then of course we keep this at zero
because you know that's just to initialize

190
00:10:31,600 --> 00:10:35,266
the add
that we'll select at a specific round.

191
00:10:35,266 --> 00:10:39,200
Because of course with Thompson sampling
we will need to pick a not to show

192
00:10:39,200 --> 00:10:39,933
to the user.

193
00:10:39,933 --> 00:10:43,466
So we will keep this at zero
in this at equals I here.

194
00:10:43,766 --> 00:10:48,933
But then once we absolutely need to
change here is this if else because this

195
00:10:48,933 --> 00:10:53,466
if else is directly specific
to the upper confidence bound strategy.

196
00:10:53,466 --> 00:10:55,166
So we will remove this.

197
00:10:55,166 --> 00:10:58,466
And now we will implement the Thompson
sampling strategy.

198
00:10:59,266 --> 00:11:02,400
So first thing we need to do
is to generate

199
00:11:02,400 --> 00:11:05,333
the random draws of each of the ten ads.

200
00:11:05,333 --> 00:11:08,633
So we keep this for I in 1D,
because we will need this loop

201
00:11:08,633 --> 00:11:10,133
to go through the ten ads.

202
00:11:10,133 --> 00:11:12,300
And so now
we need to take the random draws.

203
00:11:12,300 --> 00:11:16,133
So we are going to declare here
a new variable that we'll call random

204
00:11:16,766 --> 00:11:18,166
underscore data.

205
00:11:18,166 --> 00:11:21,166
And that of course will correspond
to the different random draws.

206
00:11:21,300 --> 00:11:24,500
Because these are random
draws taken from the data distribution.

207
00:11:25,300 --> 00:11:26,466
So then equals.

208
00:11:26,466 --> 00:11:29,466
And now we're going to use a function of R

209
00:11:29,566 --> 00:11:35,166
which is the or beta function,
which will give us exactly what we want.

210
00:11:35,166 --> 00:11:39,900
That is it will give us some random draws
of the beta distribution of parameters

211
00:11:39,900 --> 00:11:40,866
that we choose.

212
00:11:40,866 --> 00:11:44,100
And as we can see on this slide here,
the first parameter is

213
00:11:44,100 --> 00:11:47,600
the number of times the aversion
I got reward one plus one.

214
00:11:48,000 --> 00:11:52,833
And the second parameter is the number of
times the ad I got reward zero plus one.

215
00:11:53,133 --> 00:11:54,266
So let's do it.

216
00:11:54,266 --> 00:11:58,200
Let's press F1 here to get some info
about this our beta function.

217
00:11:58,700 --> 00:12:01,266
So we will need
only three of these arguments here.

218
00:12:01,266 --> 00:12:04,866
The first argument we need
is m the number of observations.

219
00:12:04,866 --> 00:12:10,200
So here n equals one because we only want
to take one random draw and then shape

220
00:12:10,200 --> 00:12:13,366
one and shape two of the two parameters
of our beta distribution.

221
00:12:13,366 --> 00:12:17,733
That is, shape one is the number of times
the aggregate reward one plus one,

222
00:12:17,933 --> 00:12:21,466
and shape two is the number of times
the ad that we want zero plus one.

223
00:12:21,600 --> 00:12:25,500
So here for shape one
we input shape one equals

224
00:12:25,933 --> 00:12:28,533
numbers of rewards one.

225
00:12:28,533 --> 00:12:31,400
And of course,
since this corresponds to a specific ad,

226
00:12:31,400 --> 00:12:35,100
I will add some brackets here
and take the ad version I.

227
00:12:35,133 --> 00:12:38,100
This we're dealing with
at the specific time in this for I loop

228
00:12:38,100 --> 00:12:41,100
here, that is, this corresponds
to a specific ad.

229
00:12:41,233 --> 00:12:44,066
And let's not forget the plus one here.

230
00:12:44,066 --> 00:12:46,700
And then comma
to input the second parameter.

231
00:12:46,700 --> 00:12:49,700
And of course the second parameter
is going to be

232
00:12:49,900 --> 00:12:53,333
the i'th index
of this numbers of rewards zero vector.

233
00:12:53,700 --> 00:12:56,700
And then let's not forget the plus
one here.

234
00:12:57,333 --> 00:13:00,933
And that's our two parameters
of this beta distribution

235
00:13:01,100 --> 00:13:04,100
from which we are getting the random draws
okay.

236
00:13:04,566 --> 00:13:06,066
So we have everything we need.

237
00:13:06,066 --> 00:13:08,166
And now of course
we will need to play with this

238
00:13:08,166 --> 00:13:12,466
if condition here
to get the maximum of these random draws.

239
00:13:12,766 --> 00:13:15,900
So a good exercise for
you would be to pause on this video

240
00:13:15,900 --> 00:13:20,333
and try to finish this code section here
to guess the last elements

241
00:13:20,333 --> 00:13:21,433
of this code here.

242
00:13:21,433 --> 00:13:24,833
So I'm going to tell you right now,
well right now we need to take

243
00:13:24,833 --> 00:13:26,600
the maximum of these random draws.

244
00:13:26,600 --> 00:13:29,733
We already declared this max random
variable here.

245
00:13:29,733 --> 00:13:32,733
That will be
this maximum of these random draws.

246
00:13:32,933 --> 00:13:34,800
And so well you guessed it.

247
00:13:34,800 --> 00:13:39,666
Now we need to replace this
max upper bound here by max random.

248
00:13:40,533 --> 00:13:43,800
And here of course
we need to replace this upper bound here

249
00:13:43,800 --> 00:13:46,833
by random data okay.

250
00:13:47,233 --> 00:13:48,200
And then same here.

251
00:13:48,200 --> 00:13:51,200
We replace max upper bound here by max

252
00:13:51,333 --> 00:13:56,700
random
and upper bound here by random data.

253
00:13:57,366 --> 00:14:01,433
And eventually of course
we keep this as equals I here okay.

254
00:14:01,433 --> 00:14:02,333
So let's re

255
00:14:02,333 --> 00:14:06,166
explain quickly what's happening here
for each at I in this for I loop here.

256
00:14:06,566 --> 00:14:10,766
Well we're taking a random draw
from this data distribution of parameters.

257
00:14:10,766 --> 00:14:14,533
Number of times the adequate reward
one plus one and number of times the add.

258
00:14:14,533 --> 00:14:16,000
Good. We were zero plus one.

259
00:14:16,000 --> 00:14:19,400
And then each time we take a random draw
from this distribution,

260
00:14:19,633 --> 00:14:24,566
well we check to see if this random draw
is higher than this max random here.

261
00:14:24,600 --> 00:14:28,700
So of course for the first add
this will be the case because max random

262
00:14:28,700 --> 00:14:30,133
was initialized to zero.

263
00:14:30,133 --> 00:14:33,733
So this condition will be true
for the first add and therefore max

264
00:14:33,733 --> 00:14:36,733
random will be equal to this
first random draw here.

265
00:14:36,833 --> 00:14:38,700
And so we keep this first add here.

266
00:14:38,700 --> 00:14:41,733
And then when we move on to the next add,
what happens

267
00:14:41,733 --> 00:14:44,966
is that we take another random draw
from this data, distribute

268
00:14:44,966 --> 00:14:48,600
one of these parameters here
that corresponds to this new at I.

269
00:14:48,866 --> 00:14:52,300
And then if this new random draw
is higher than this max random here

270
00:14:52,333 --> 00:14:54,600
that was equal to the previous random
draw.

271
00:14:54,600 --> 00:14:56,500
Well,
that means that this condition is true

272
00:14:56,500 --> 00:15:00,300
and therefore max random
takes the value of this new random draw.

273
00:15:00,300 --> 00:15:01,200
And therefore

274
00:15:01,200 --> 00:15:05,233
we select this new add version
I here that has the highest random draw,

275
00:15:05,233 --> 00:15:07,700
and we forget about the previous add
that we selected

276
00:15:07,700 --> 00:15:10,533
because simply
it has a lower random draw. All right.

277
00:15:10,533 --> 00:15:11,500
So that's the idea.

278
00:15:11,500 --> 00:15:13,633
And that's the strategy of Thompson

279
00:15:13,633 --> 00:15:16,633
sampling
that we are implementing at each round.

280
00:15:16,900 --> 00:15:17,700
Great.

281
00:15:17,700 --> 00:15:19,733
So now we almost have everything we need.

282
00:15:19,733 --> 00:15:23,666
The thing that we need to update
now is what's happening here.

283
00:15:23,666 --> 00:15:27,433
Because this comes from the upper
confidence bound algorithm.

284
00:15:27,700 --> 00:15:32,100
So this is you know, the parameter of step
one in the UCB algorithm.

285
00:15:32,400 --> 00:15:34,566
So we will need to remove this line.

286
00:15:34,566 --> 00:15:35,866
We don't need this.

287
00:15:35,866 --> 00:15:39,700
And we have to keep this
because this is to get the real reward.

288
00:15:39,900 --> 00:15:42,000
As we explained in the YouTube algorithm.

289
00:15:42,000 --> 00:15:46,433
You know, this is actually how we proceed
to get the reward in this simulation

290
00:15:46,433 --> 00:15:47,233
data set.

291
00:15:47,233 --> 00:15:48,200
And then what do we have?

292
00:15:48,200 --> 00:15:51,100
We have this line
containing the sums of rewards.

293
00:15:51,100 --> 00:15:54,866
And of course the sums of rewards
is a parameter of the UCB algorithm.

294
00:15:54,866 --> 00:15:57,866
So we need to remove that as well.

295
00:15:58,000 --> 00:16:00,600
And now we are ready to update

296
00:16:00,600 --> 00:16:03,600
what we need to update for the Thompson
sampling algorithm.

297
00:16:03,666 --> 00:16:04,166
All right.

298
00:16:04,166 --> 00:16:05,766
And then we have the total reward that

299
00:16:05,766 --> 00:16:08,766
of course we must keep
because this is the exciting result.

300
00:16:08,833 --> 00:16:11,833
And this is kind of a
performance evaluator.

301
00:16:11,933 --> 00:16:14,400
So now according to you
what do we need to update.

302
00:16:14,400 --> 00:16:18,166
Well of course what we need to update
are these two vectors here

303
00:16:18,166 --> 00:16:21,166
numbers of rewards
one and numbers of rewards zero.

304
00:16:21,200 --> 00:16:25,066
Because you know in this strategy
here we are in putting the ith

305
00:16:25,066 --> 00:16:26,766
index of these two vectors.

306
00:16:26,766 --> 00:16:29,700
But you know we need to update them
in each round because otherwise

307
00:16:29,700 --> 00:16:33,666
they will always be equal to zero
because they were initialized as zero.

308
00:16:34,100 --> 00:16:35,400
So now what do we need to do.

309
00:16:35,400 --> 00:16:39,600
We need to increment their values
when we get the reward.

310
00:16:39,900 --> 00:16:43,566
So let's see what we need to increment
for this variable here

311
00:16:43,566 --> 00:16:44,733
the numbers of rewards one.

312
00:16:44,733 --> 00:16:45,933
Well these corresponds

313
00:16:45,933 --> 00:16:49,833
to the different numbers of times
each add gets reward one at round n.

314
00:16:50,100 --> 00:16:53,366
So we need to increment it
only when the ads go to reward one.

315
00:16:53,733 --> 00:16:55,133
And what about this vector.

316
00:16:55,133 --> 00:16:56,066
Well that's the same.

317
00:16:56,066 --> 00:16:56,700
That's the vector

318
00:16:56,700 --> 00:17:00,900
containing the different numbers of times
each add got reward zero up to round n.

319
00:17:01,200 --> 00:17:03,366
And so we need to increment this vector

320
00:17:03,366 --> 00:17:06,766
only when the add that we selected
got reward zero.

321
00:17:07,166 --> 00:17:09,966
So since this depends on the reward we get

322
00:17:09,966 --> 00:17:13,200
when we select the add,
we will need an if condition.

323
00:17:13,566 --> 00:17:16,466
And so we're going to write this
if condition here if

324
00:17:16,466 --> 00:17:21,800
and so the condition is simply
if reward equals equals one.

325
00:17:22,300 --> 00:17:26,333
So if this reward
we get at this specific round

326
00:17:26,333 --> 00:17:29,800
when we select this specific ad here,
if this reward is one

327
00:17:30,033 --> 00:17:31,000
then what do we need to do.

328
00:17:31,000 --> 00:17:36,400
We need to increment this numbers
of rewards one but only for the index ad.

329
00:17:36,566 --> 00:17:40,833
Because this index ad here corresponds
to the index of the ad that was selected.

330
00:17:41,166 --> 00:17:42,133
So let's do it.

331
00:17:42,133 --> 00:17:45,333
Let's increment
this numbers of rewards one.

332
00:17:45,766 --> 00:17:48,766
Let's copy that. And let's face it here.

333
00:17:48,866 --> 00:17:51,633
And now let's take the ad index

334
00:17:51,633 --> 00:17:54,600
corresponding to the index of the ad
that was selected.

335
00:17:54,600 --> 00:17:57,400
And then that's
where we need to increment it.

336
00:17:57,400 --> 00:18:00,933
So I'm going to copy this and paste

337
00:18:00,933 --> 00:18:03,966
it here and add a plus one.

338
00:18:04,500 --> 00:18:04,933
All right.

339
00:18:04,933 --> 00:18:08,866
So when reward is one well of course
what we need to do is to add a plus

340
00:18:08,866 --> 00:18:12,600
one to the number of times
this ad here gets reward one.

341
00:18:13,066 --> 00:18:13,466
All right.

342
00:18:13,466 --> 00:18:16,400
And then we have this else.

343
00:18:16,400 --> 00:18:21,166
And this else corresponds to the situation
where reward here is equal to zero.

344
00:18:21,366 --> 00:18:25,766
That is when the ad that we selected got
a zero reward at the specific amount end.

345
00:18:26,066 --> 00:18:27,600
And therefore when that happens

346
00:18:27,600 --> 00:18:31,733
we need to increment
the numbers of rewards zero this time.

347
00:18:31,733 --> 00:18:34,066
So I'm copying this line and

348
00:18:35,366 --> 00:18:37,200
basing it right here.

349
00:18:37,200 --> 00:18:43,333
And then replace the index I by ad
because we need to increment the numbers

350
00:18:43,333 --> 00:18:48,300
of rewards zero, but only for the value
corresponding to this ad index here.

351
00:18:48,800 --> 00:18:49,200
All right.

352
00:18:49,200 --> 00:18:50,400
And now we're done.

353
00:18:50,400 --> 00:18:53,400
Thompson
sampling is actually fully implemented.

354
00:18:53,700 --> 00:18:57,400
So now time for the exciting step
visualizing the results.

355
00:18:57,766 --> 00:18:59,700
So we'll do that in the next tutorial.

356
00:18:59,700 --> 00:19:02,700
And until then enjoy machine learning.