1 00:00:00,233 --> 00:00:02,566 Hello and welcome to this art tutorial. 2 00:00:02,566 --> 00:00:06,200 So in the previous section we introduced a multi-armed bandit problem for this 3 00:00:06,500 --> 00:00:09,200 ad click through rate optimization problem 4 00:00:09,200 --> 00:00:11,933 and we already tried two algorithms. 5 00:00:11,933 --> 00:00:16,000 The first algorithm was the simple random selection algorithm 6 00:00:16,000 --> 00:00:20,900 that consists of selecting a pure random, an ad at each round. 7 00:00:21,200 --> 00:00:25,866 And this algorithm gave us a total reward of 1002 hundred on average. 8 00:00:25,866 --> 00:00:28,000 Because, you know, there is this random factor. 9 00:00:28,000 --> 00:00:31,133 But on average we got 1002 hundred total reward. 10 00:00:31,500 --> 00:00:34,133 And of course, when we plotted the histogram, 11 00:00:34,133 --> 00:00:37,266 every ad was selected more or less the same number of times. 12 00:00:37,633 --> 00:00:41,233 And then we tried another algorithm which was the upper confidence 13 00:00:41,233 --> 00:00:42,433 bound algorithm. 14 00:00:42,433 --> 00:00:47,366 And there we got much better results because not only we managed to 15 00:00:47,366 --> 00:00:53,133 almost double the total reward because we got a total reward of 2178. 16 00:00:53,366 --> 00:00:56,533 So almost a double of what we obtained with the random selection. 17 00:00:56,900 --> 00:00:59,933 And the even better thing is that we managed 18 00:00:59,933 --> 00:01:03,466 to figure out which ad version was the best to show 19 00:01:03,466 --> 00:01:07,366 to the user, that is, which ad version had the highest conversion 20 00:01:07,366 --> 00:01:10,833 rate, the highest CTR, that is the highest click through rate. 21 00:01:11,300 --> 00:01:13,100 And now in this section 22 00:01:13,100 --> 00:01:17,366 we are going to try a new algorithm which is called Thompson Sampling. 23 00:01:17,800 --> 00:01:20,733 And so in this section we are going to look at two things. 24 00:01:20,733 --> 00:01:25,166 First thing is Ken Thompson sampling beat the upper confidence bound. 25 00:01:25,166 --> 00:01:26,933 That is Ken Thompson sampling. 26 00:01:26,933 --> 00:01:31,533 Give us a total reward that is even higher than 2178. 27 00:01:31,766 --> 00:01:34,900 You know, we almost doubled the total reward of random selection. 28 00:01:35,133 --> 00:01:37,600 Let's see if we can beat that again. 29 00:01:37,600 --> 00:01:41,033 Like more than double it or even triple it I don't know. 30 00:01:41,066 --> 00:01:41,733 Let's see. 31 00:01:41,733 --> 00:01:44,233 And the second thing that we're going to look at is 32 00:01:44,233 --> 00:01:48,000 if we get the same ad version that has the highest conversion rate. 33 00:01:48,100 --> 00:01:50,700 Let's see if we get ad version number five, 34 00:01:50,700 --> 00:01:53,933 which was the ad found by the upper confidence bound algorithm. 35 00:01:53,966 --> 00:01:55,800 It found ad version number five. 36 00:01:55,800 --> 00:01:59,100 So let's hope that we also get the ad version number five. 37 00:01:59,400 --> 00:02:01,200 This should logically be the case 38 00:02:01,200 --> 00:02:05,000 if we managed to be the upper confidence bound in terms of total reward. 39 00:02:05,166 --> 00:02:06,333 So let's do this. 40 00:02:06,333 --> 00:02:08,833 Let's implement Thompson sampling in this section. 41 00:02:08,833 --> 00:02:11,233 We're actually going to do that in one step. 42 00:02:11,233 --> 00:02:13,166 Because when you think about it 43 00:02:13,166 --> 00:02:17,600 we will only need to change the strategy here in this for n loop. 44 00:02:17,700 --> 00:02:20,666 And then keep the same to implement Thompson sampling. 45 00:02:20,666 --> 00:02:22,933 Because you know we'll give these parameters here. 46 00:02:22,933 --> 00:02:27,000 That is this number of rounds, this number of ads, this ad selected 47 00:02:27,000 --> 00:02:30,400 vector that contains all the different ads selected at each round. 48 00:02:30,833 --> 00:02:33,900 And this we will need to change because these are the parameters 49 00:02:33,900 --> 00:02:35,666 of the upper confidence bound algorithm. 50 00:02:35,666 --> 00:02:37,033 So we will need to change it. 51 00:02:37,033 --> 00:02:39,733 And of course we will give this total reward 52 00:02:39,733 --> 00:02:43,133 because we want to get the total reward accumulated 53 00:02:43,133 --> 00:02:46,500 through the execution of this Thompson sampling algorithm. 54 00:02:46,933 --> 00:02:50,666 So what we're going to do now is simply take everything from here 55 00:02:51,000 --> 00:02:55,500 to here, copy, and then paste it here. 56 00:02:55,500 --> 00:02:57,966 And we will only change what we need to change. 57 00:02:57,966 --> 00:03:01,966 That is the right parameters for the Thompson sampling algorithm. 58 00:03:02,466 --> 00:03:04,133 All right. So let's do it. 59 00:03:04,133 --> 00:03:08,900 We can already replace UCB here by Thompson sampling. 60 00:03:10,633 --> 00:03:13,266 And now let's change the strategy. 61 00:03:13,266 --> 00:03:15,466 But before let's start with the basics. 62 00:03:15,466 --> 00:03:18,033 Let's set the right folder as working directory. 63 00:03:18,033 --> 00:03:19,833 So let's go to our Machine learning 64 00:03:19,833 --> 00:03:24,033 A-Z folder then part six Reinforcement learning Thompson Sampling. 65 00:03:24,033 --> 00:03:28,733 And then make sure that you have this as CTR optimization CSV file that is 66 00:03:28,733 --> 00:03:33,466 of course the same CSV file as the one we had in upper confidence bound. 67 00:03:33,766 --> 00:03:35,700 So if you have this data set, you're now ready 68 00:03:35,700 --> 00:03:38,933 to click on this more button here and then set as working directory. 69 00:03:39,866 --> 00:03:40,266 All right. 70 00:03:40,266 --> 00:03:44,000 So now let's implement our Thompson sampling algorithm. 71 00:03:44,333 --> 00:03:48,600 So let's directly jump to the slide of the Thompson sampling algorithm. 72 00:03:49,600 --> 00:03:49,933 All right. 73 00:03:49,933 --> 00:03:51,200 So what do we see here. 74 00:03:51,200 --> 00:03:54,200 This Thompson sampling algorithm takes three steps. 75 00:03:54,400 --> 00:03:59,666 First step is at each round we need to consider two numbers for each eye. 76 00:04:00,033 --> 00:04:04,533 The first number is the number of times the ad I got reward one up to round 77 00:04:04,533 --> 00:04:08,333 n, and the second number is the number of times the ad I got. 78 00:04:08,333 --> 00:04:13,333 We weren't zero up to round N, so that's the first thing we'll do here. 79 00:04:13,333 --> 00:04:15,666 We will consider these parameters 80 00:04:15,666 --> 00:04:19,333 and declare the variables corresponding to these parameters. 81 00:04:19,633 --> 00:04:25,033 And we can notice that if we compare this Thompson sampling algorithm to the UCB 82 00:04:25,033 --> 00:04:29,500 algorithm, well it's the same step one with different parameters. 83 00:04:29,500 --> 00:04:32,733 Because as you can notice here in the step one of the upper 84 00:04:32,733 --> 00:04:36,533 confidence bound algorithm, we also consider two numbers. 85 00:04:36,566 --> 00:04:40,333 And these two numbers are the number of times the ad I was selected up to round 86 00:04:40,333 --> 00:04:44,600 n, and the sum of rewards of the ad I up to round n. 87 00:04:45,066 --> 00:04:49,266 So if we have a look at this code section here, which is the code section 88 00:04:49,266 --> 00:04:51,333 for the upper confidence bound algorithm, 89 00:04:51,333 --> 00:04:53,933 well, we can see that these two parameters are here 90 00:04:53,933 --> 00:04:59,566 and that these parameters are no longer the parameters for Thompson sampling. 91 00:04:59,900 --> 00:05:01,066 And these two parameters 92 00:05:01,066 --> 00:05:05,433 that we considered in step one in the UCB algorithm are actually replaced 93 00:05:05,433 --> 00:05:09,333 by two new parameters in the Thompson sampling algorithm. 94 00:05:09,333 --> 00:05:13,733 So what we'll do right now is simply to remove these two parameters 95 00:05:13,733 --> 00:05:17,133 that are the parameters considered in the step one of the UCB algorithm, 96 00:05:17,400 --> 00:05:20,733 and replace them by these two new parameters 97 00:05:20,733 --> 00:05:23,600 that we need to consider for Thompson sampling algorithm. 98 00:05:23,600 --> 00:05:25,666 So let's replace them right away. 99 00:05:25,666 --> 00:05:28,666 And therefore let's declare two new variables. 100 00:05:28,733 --> 00:05:33,766 So first number the number of times the ad I got we want one up to round n. 101 00:05:34,033 --> 00:05:40,200 So let's call this number numbers of rewards and then underscore. 102 00:05:40,366 --> 00:05:45,166 And one to specify that it's the number of times the ad got reward one. 103 00:05:45,500 --> 00:05:46,966 And then the second number 104 00:05:46,966 --> 00:05:51,300 is the number of times the ad I got reward zero up to run. 105 00:05:51,300 --> 00:05:56,400 And so same we're going to create this variable numbers of rewards zero. 106 00:05:57,000 --> 00:05:59,533 Now what are going to be these two variables. 107 00:05:59,533 --> 00:06:02,400 So these are the parameters of Thompson sampling 108 00:06:02,400 --> 00:06:05,700 at the essence of the future strategy that we're going to have here. 109 00:06:06,033 --> 00:06:09,300 So these two variables here are going to be some vectors 110 00:06:09,533 --> 00:06:12,333 of the elements that has ten elements. 111 00:06:12,333 --> 00:06:15,533 And as you might have guessed they will contain for each ad 112 00:06:15,533 --> 00:06:21,133 the number of times they got rewards, one up to round N and zero up to one. 113 00:06:21,133 --> 00:06:23,533 And so we're going to initialize 114 00:06:23,533 --> 00:06:27,300 these vectors the same way as we did for upper confidence bound. 115 00:06:27,300 --> 00:06:30,300 That is we're going to take this integer d 116 00:06:30,533 --> 00:06:33,466 that creates a vector of the elements. 117 00:06:33,466 --> 00:06:35,433 And those d elements are zeros. 118 00:06:35,433 --> 00:06:39,666 So that's exactly how we are going to initialize these two variables here 119 00:06:39,933 --> 00:06:42,933 by a vector of ten zeros. 120 00:06:43,133 --> 00:06:47,133 Because of course at the beginning the number of rewards of each ad is 121 00:06:47,133 --> 00:06:50,533 of course zero because simply each ad wasn't selected yet. 122 00:06:51,100 --> 00:06:53,533 All right. So we have our two arguments. 123 00:06:53,533 --> 00:06:56,766 That means that step one is done and we can move on to step two. 124 00:06:57,100 --> 00:07:01,833 So step two is for each ad I we take a random draw 125 00:07:01,833 --> 00:07:05,300 from this distribution below which is the beta distribution. 126 00:07:05,666 --> 00:07:06,600 And why is that? 127 00:07:06,600 --> 00:07:07,300 It's because we have 128 00:07:07,300 --> 00:07:11,033 two important assumptions here which are related to Bayesian inference. 129 00:07:11,566 --> 00:07:13,533 So the first assumption is this first line. 130 00:07:13,533 --> 00:07:17,900 Here we suppose that each ad I gets rewards from a Bernoulli 131 00:07:17,900 --> 00:07:22,633 distribution of parameter theta I, which is the probability of success. 132 00:07:22,633 --> 00:07:26,000 And you can picture this probability of success by showing the ad 133 00:07:26,000 --> 00:07:29,866 to a huge amount of users, like 1 million users and theta. 134 00:07:29,866 --> 00:07:33,633 I could be interpreted as the number of times the ad because we were one. 135 00:07:33,666 --> 00:07:37,600 That is, the number of successes divided by the total number of times we selected 136 00:07:37,600 --> 00:07:39,433 the ad. That is 1 million. 137 00:07:39,433 --> 00:07:42,566 So basically theta I is the probability of success. 138 00:07:42,566 --> 00:07:46,000 That is the probability of getting reward one when we select the ad. 139 00:07:46,500 --> 00:07:51,966 And so the assumption is that each ad I get rewards zero and one from this 140 00:07:51,966 --> 00:07:56,400 Bernoulli distribution of parameter theta I, which is the probability of success. 141 00:07:57,133 --> 00:08:00,933 And then the second assumption, a less strong assumption than the first one, 142 00:08:00,933 --> 00:08:03,233 which is the strongest assumption that we have. 143 00:08:03,233 --> 00:08:05,666 The second assumption, which is this second line here, 144 00:08:05,666 --> 00:08:09,300 and that is that we assume that theta I has a uniform distribution 145 00:08:09,433 --> 00:08:11,466 which has the prior distribution. 146 00:08:11,466 --> 00:08:14,833 And then we use the Bayes rule to get the posterior distribution, 147 00:08:14,833 --> 00:08:16,333 which is p c to I. 148 00:08:16,333 --> 00:08:19,033 Given the rewards that we got up to the round n. 149 00:08:19,033 --> 00:08:20,633 And so by using Bayes rule, 150 00:08:20,633 --> 00:08:24,633 that's how we get this beta distribution in this step two here. 151 00:08:25,000 --> 00:08:28,000 And so by taking at each round these random draws of these 152 00:08:28,000 --> 00:08:31,233 beta distribution, well, since these random draws represent 153 00:08:31,233 --> 00:08:33,566 nothing else, then this probability of success, 154 00:08:33,566 --> 00:08:37,733 we get the strategy here, which is to take the maximum of these random draws. 155 00:08:37,733 --> 00:08:39,900 Because the maximum of these random draws 156 00:08:39,900 --> 00:08:43,433 is approximating the highest probability of success. 157 00:08:43,766 --> 00:08:45,933 And that's the whole idea behind Thompson sampling. 158 00:08:45,933 --> 00:08:49,566 It's that we are trying to estimate these parameter theta 159 00:08:49,933 --> 00:08:53,200 that are the probability of success of each of these ten ads. 160 00:08:53,400 --> 00:08:56,400 Then by taking these random draws and taking the highest of them, 161 00:08:56,400 --> 00:08:59,566 we are estimating the highest probability of success. 162 00:08:59,833 --> 00:09:03,600 And this highest probability of success corresponds to one specific 163 00:09:03,600 --> 00:09:05,000 ad at each round. 164 00:09:05,000 --> 00:09:08,733 So when we take these random draws at a specific round, we might be wrong. 165 00:09:08,933 --> 00:09:12,700 But when we take these random draws over thousands of rounds, well, 166 00:09:12,700 --> 00:09:17,866 just based on the essence of probability, we obtain over all the theta 167 00:09:17,866 --> 00:09:20,433 that corresponds to the ad that has the highest probability 168 00:09:20,433 --> 00:09:23,866 of success, that is, the highest probability of getting reward equals one. 169 00:09:24,266 --> 00:09:26,533 And that's the idea behind Thompson sampling. 170 00:09:26,533 --> 00:09:30,100 And by the way taking these maximum of these random draws that is this maximum 171 00:09:30,100 --> 00:09:33,666 of these estimations of the probability of getting reward equals one. 172 00:09:33,933 --> 00:09:36,533 Well that's nothing else but step three. 173 00:09:36,533 --> 00:09:39,600 And so right now what we'll do is to implement 174 00:09:39,600 --> 00:09:44,100 the strategy composed of step two and step three in this code section here 175 00:09:44,200 --> 00:09:48,500 replacing the old strategy of getting the upper confidence bound okay. 176 00:09:48,500 --> 00:09:50,366 So let's do it efficiently. 177 00:09:50,366 --> 00:09:52,666 Let's keep the logic behind this code section. 178 00:09:52,666 --> 00:09:55,133 Let's not delete everything too quickly. 179 00:09:55,133 --> 00:09:58,466 You know, since we'll get the different random draws of each add at each round. 180 00:09:58,733 --> 00:10:01,766 And since we need to take this highest random draw, well, 181 00:10:01,766 --> 00:10:07,500 we should keep this coding strategy here that gets the maximum value of something. 182 00:10:07,833 --> 00:10:11,066 So we'll replace this max upper bound here 183 00:10:11,066 --> 00:10:14,066 by a max random. 184 00:10:14,200 --> 00:10:18,033 Because you know, in the UCB algorithm we needed to take the maximum upper 185 00:10:18,033 --> 00:10:18,966 confidence bound. 186 00:10:18,966 --> 00:10:22,666 And here for Thompson sampling we need to take the maximum random draw. 187 00:10:22,666 --> 00:10:26,100 So we call this maximum random draw max random. 188 00:10:26,666 --> 00:10:27,066 All right. 189 00:10:27,066 --> 00:10:31,433 And then of course we keep this at zero because you know that's just to initialize 190 00:10:31,600 --> 00:10:35,266 the add that we'll select at a specific round. 191 00:10:35,266 --> 00:10:39,200 Because of course with Thompson sampling we will need to pick a not to show 192 00:10:39,200 --> 00:10:39,933 to the user. 193 00:10:39,933 --> 00:10:43,466 So we will keep this at zero in this at equals I here. 194 00:10:43,766 --> 00:10:48,933 But then once we absolutely need to change here is this if else because this 195 00:10:48,933 --> 00:10:53,466 if else is directly specific to the upper confidence bound strategy. 196 00:10:53,466 --> 00:10:55,166 So we will remove this. 197 00:10:55,166 --> 00:10:58,466 And now we will implement the Thompson sampling strategy. 198 00:10:59,266 --> 00:11:02,400 So first thing we need to do is to generate 199 00:11:02,400 --> 00:11:05,333 the random draws of each of the ten ads. 200 00:11:05,333 --> 00:11:08,633 So we keep this for I in 1D, because we will need this loop 201 00:11:08,633 --> 00:11:10,133 to go through the ten ads. 202 00:11:10,133 --> 00:11:12,300 And so now we need to take the random draws. 203 00:11:12,300 --> 00:11:16,133 So we are going to declare here a new variable that we'll call random 204 00:11:16,766 --> 00:11:18,166 underscore data. 205 00:11:18,166 --> 00:11:21,166 And that of course will correspond to the different random draws. 206 00:11:21,300 --> 00:11:24,500 Because these are random draws taken from the data distribution. 207 00:11:25,300 --> 00:11:26,466 So then equals. 208 00:11:26,466 --> 00:11:29,466 And now we're going to use a function of R 209 00:11:29,566 --> 00:11:35,166 which is the or beta function, which will give us exactly what we want. 210 00:11:35,166 --> 00:11:39,900 That is it will give us some random draws of the beta distribution of parameters 211 00:11:39,900 --> 00:11:40,866 that we choose. 212 00:11:40,866 --> 00:11:44,100 And as we can see on this slide here, the first parameter is 213 00:11:44,100 --> 00:11:47,600 the number of times the aversion I got reward one plus one. 214 00:11:48,000 --> 00:11:52,833 And the second parameter is the number of times the ad I got reward zero plus one. 215 00:11:53,133 --> 00:11:54,266 So let's do it. 216 00:11:54,266 --> 00:11:58,200 Let's press F1 here to get some info about this our beta function. 217 00:11:58,700 --> 00:12:01,266 So we will need only three of these arguments here. 218 00:12:01,266 --> 00:12:04,866 The first argument we need is m the number of observations. 219 00:12:04,866 --> 00:12:10,200 So here n equals one because we only want to take one random draw and then shape 220 00:12:10,200 --> 00:12:13,366 one and shape two of the two parameters of our beta distribution. 221 00:12:13,366 --> 00:12:17,733 That is, shape one is the number of times the aggregate reward one plus one, 222 00:12:17,933 --> 00:12:21,466 and shape two is the number of times the ad that we want zero plus one. 223 00:12:21,600 --> 00:12:25,500 So here for shape one we input shape one equals 224 00:12:25,933 --> 00:12:28,533 numbers of rewards one. 225 00:12:28,533 --> 00:12:31,400 And of course, since this corresponds to a specific ad, 226 00:12:31,400 --> 00:12:35,100 I will add some brackets here and take the ad version I. 227 00:12:35,133 --> 00:12:38,100 This we're dealing with at the specific time in this for I loop 228 00:12:38,100 --> 00:12:41,100 here, that is, this corresponds to a specific ad. 229 00:12:41,233 --> 00:12:44,066 And let's not forget the plus one here. 230 00:12:44,066 --> 00:12:46,700 And then comma to input the second parameter. 231 00:12:46,700 --> 00:12:49,700 And of course the second parameter is going to be 232 00:12:49,900 --> 00:12:53,333 the i'th index of this numbers of rewards zero vector. 233 00:12:53,700 --> 00:12:56,700 And then let's not forget the plus one here. 234 00:12:57,333 --> 00:13:00,933 And that's our two parameters of this beta distribution 235 00:13:01,100 --> 00:13:04,100 from which we are getting the random draws okay. 236 00:13:04,566 --> 00:13:06,066 So we have everything we need. 237 00:13:06,066 --> 00:13:08,166 And now of course we will need to play with this 238 00:13:08,166 --> 00:13:12,466 if condition here to get the maximum of these random draws. 239 00:13:12,766 --> 00:13:15,900 So a good exercise for you would be to pause on this video 240 00:13:15,900 --> 00:13:20,333 and try to finish this code section here to guess the last elements 241 00:13:20,333 --> 00:13:21,433 of this code here. 242 00:13:21,433 --> 00:13:24,833 So I'm going to tell you right now, well right now we need to take 243 00:13:24,833 --> 00:13:26,600 the maximum of these random draws. 244 00:13:26,600 --> 00:13:29,733 We already declared this max random variable here. 245 00:13:29,733 --> 00:13:32,733 That will be this maximum of these random draws. 246 00:13:32,933 --> 00:13:34,800 And so well you guessed it. 247 00:13:34,800 --> 00:13:39,666 Now we need to replace this max upper bound here by max random. 248 00:13:40,533 --> 00:13:43,800 And here of course we need to replace this upper bound here 249 00:13:43,800 --> 00:13:46,833 by random data okay. 250 00:13:47,233 --> 00:13:48,200 And then same here. 251 00:13:48,200 --> 00:13:51,200 We replace max upper bound here by max 252 00:13:51,333 --> 00:13:56,700 random and upper bound here by random data. 253 00:13:57,366 --> 00:14:01,433 And eventually of course we keep this as equals I here okay. 254 00:14:01,433 --> 00:14:02,333 So let's re 255 00:14:02,333 --> 00:14:06,166 explain quickly what's happening here for each at I in this for I loop here. 256 00:14:06,566 --> 00:14:10,766 Well we're taking a random draw from this data distribution of parameters. 257 00:14:10,766 --> 00:14:14,533 Number of times the adequate reward one plus one and number of times the add. 258 00:14:14,533 --> 00:14:16,000 Good. We were zero plus one. 259 00:14:16,000 --> 00:14:19,400 And then each time we take a random draw from this distribution, 260 00:14:19,633 --> 00:14:24,566 well we check to see if this random draw is higher than this max random here. 261 00:14:24,600 --> 00:14:28,700 So of course for the first add this will be the case because max random 262 00:14:28,700 --> 00:14:30,133 was initialized to zero. 263 00:14:30,133 --> 00:14:33,733 So this condition will be true for the first add and therefore max 264 00:14:33,733 --> 00:14:36,733 random will be equal to this first random draw here. 265 00:14:36,833 --> 00:14:38,700 And so we keep this first add here. 266 00:14:38,700 --> 00:14:41,733 And then when we move on to the next add, what happens 267 00:14:41,733 --> 00:14:44,966 is that we take another random draw from this data, distribute 268 00:14:44,966 --> 00:14:48,600 one of these parameters here that corresponds to this new at I. 269 00:14:48,866 --> 00:14:52,300 And then if this new random draw is higher than this max random here 270 00:14:52,333 --> 00:14:54,600 that was equal to the previous random draw. 271 00:14:54,600 --> 00:14:56,500 Well, that means that this condition is true 272 00:14:56,500 --> 00:15:00,300 and therefore max random takes the value of this new random draw. 273 00:15:00,300 --> 00:15:01,200 And therefore 274 00:15:01,200 --> 00:15:05,233 we select this new add version I here that has the highest random draw, 275 00:15:05,233 --> 00:15:07,700 and we forget about the previous add that we selected 276 00:15:07,700 --> 00:15:10,533 because simply it has a lower random draw. All right. 277 00:15:10,533 --> 00:15:11,500 So that's the idea. 278 00:15:11,500 --> 00:15:13,633 And that's the strategy of Thompson 279 00:15:13,633 --> 00:15:16,633 sampling that we are implementing at each round. 280 00:15:16,900 --> 00:15:17,700 Great. 281 00:15:17,700 --> 00:15:19,733 So now we almost have everything we need. 282 00:15:19,733 --> 00:15:23,666 The thing that we need to update now is what's happening here. 283 00:15:23,666 --> 00:15:27,433 Because this comes from the upper confidence bound algorithm. 284 00:15:27,700 --> 00:15:32,100 So this is you know, the parameter of step one in the UCB algorithm. 285 00:15:32,400 --> 00:15:34,566 So we will need to remove this line. 286 00:15:34,566 --> 00:15:35,866 We don't need this. 287 00:15:35,866 --> 00:15:39,700 And we have to keep this because this is to get the real reward. 288 00:15:39,900 --> 00:15:42,000 As we explained in the YouTube algorithm. 289 00:15:42,000 --> 00:15:46,433 You know, this is actually how we proceed to get the reward in this simulation 290 00:15:46,433 --> 00:15:47,233 data set. 291 00:15:47,233 --> 00:15:48,200 And then what do we have? 292 00:15:48,200 --> 00:15:51,100 We have this line containing the sums of rewards. 293 00:15:51,100 --> 00:15:54,866 And of course the sums of rewards is a parameter of the UCB algorithm. 294 00:15:54,866 --> 00:15:57,866 So we need to remove that as well. 295 00:15:58,000 --> 00:16:00,600 And now we are ready to update 296 00:16:00,600 --> 00:16:03,600 what we need to update for the Thompson sampling algorithm. 297 00:16:03,666 --> 00:16:04,166 All right. 298 00:16:04,166 --> 00:16:05,766 And then we have the total reward that 299 00:16:05,766 --> 00:16:08,766 of course we must keep because this is the exciting result. 300 00:16:08,833 --> 00:16:11,833 And this is kind of a performance evaluator. 301 00:16:11,933 --> 00:16:14,400 So now according to you what do we need to update. 302 00:16:14,400 --> 00:16:18,166 Well of course what we need to update are these two vectors here 303 00:16:18,166 --> 00:16:21,166 numbers of rewards one and numbers of rewards zero. 304 00:16:21,200 --> 00:16:25,066 Because you know in this strategy here we are in putting the ith 305 00:16:25,066 --> 00:16:26,766 index of these two vectors. 306 00:16:26,766 --> 00:16:29,700 But you know we need to update them in each round because otherwise 307 00:16:29,700 --> 00:16:33,666 they will always be equal to zero because they were initialized as zero. 308 00:16:34,100 --> 00:16:35,400 So now what do we need to do. 309 00:16:35,400 --> 00:16:39,600 We need to increment their values when we get the reward. 310 00:16:39,900 --> 00:16:43,566 So let's see what we need to increment for this variable here 311 00:16:43,566 --> 00:16:44,733 the numbers of rewards one. 312 00:16:44,733 --> 00:16:45,933 Well these corresponds 313 00:16:45,933 --> 00:16:49,833 to the different numbers of times each add gets reward one at round n. 314 00:16:50,100 --> 00:16:53,366 So we need to increment it only when the ads go to reward one. 315 00:16:53,733 --> 00:16:55,133 And what about this vector. 316 00:16:55,133 --> 00:16:56,066 Well that's the same. 317 00:16:56,066 --> 00:16:56,700 That's the vector 318 00:16:56,700 --> 00:17:00,900 containing the different numbers of times each add got reward zero up to round n. 319 00:17:01,200 --> 00:17:03,366 And so we need to increment this vector 320 00:17:03,366 --> 00:17:06,766 only when the add that we selected got reward zero. 321 00:17:07,166 --> 00:17:09,966 So since this depends on the reward we get 322 00:17:09,966 --> 00:17:13,200 when we select the add, we will need an if condition. 323 00:17:13,566 --> 00:17:16,466 And so we're going to write this if condition here if 324 00:17:16,466 --> 00:17:21,800 and so the condition is simply if reward equals equals one. 325 00:17:22,300 --> 00:17:26,333 So if this reward we get at this specific round 326 00:17:26,333 --> 00:17:29,800 when we select this specific ad here, if this reward is one 327 00:17:30,033 --> 00:17:31,000 then what do we need to do. 328 00:17:31,000 --> 00:17:36,400 We need to increment this numbers of rewards one but only for the index ad. 329 00:17:36,566 --> 00:17:40,833 Because this index ad here corresponds to the index of the ad that was selected. 330 00:17:41,166 --> 00:17:42,133 So let's do it. 331 00:17:42,133 --> 00:17:45,333 Let's increment this numbers of rewards one. 332 00:17:45,766 --> 00:17:48,766 Let's copy that. And let's face it here. 333 00:17:48,866 --> 00:17:51,633 And now let's take the ad index 334 00:17:51,633 --> 00:17:54,600 corresponding to the index of the ad that was selected. 335 00:17:54,600 --> 00:17:57,400 And then that's where we need to increment it. 336 00:17:57,400 --> 00:18:00,933 So I'm going to copy this and paste 337 00:18:00,933 --> 00:18:03,966 it here and add a plus one. 338 00:18:04,500 --> 00:18:04,933 All right. 339 00:18:04,933 --> 00:18:08,866 So when reward is one well of course what we need to do is to add a plus 340 00:18:08,866 --> 00:18:12,600 one to the number of times this ad here gets reward one. 341 00:18:13,066 --> 00:18:13,466 All right. 342 00:18:13,466 --> 00:18:16,400 And then we have this else. 343 00:18:16,400 --> 00:18:21,166 And this else corresponds to the situation where reward here is equal to zero. 344 00:18:21,366 --> 00:18:25,766 That is when the ad that we selected got a zero reward at the specific amount end. 345 00:18:26,066 --> 00:18:27,600 And therefore when that happens 346 00:18:27,600 --> 00:18:31,733 we need to increment the numbers of rewards zero this time. 347 00:18:31,733 --> 00:18:34,066 So I'm copying this line and 348 00:18:35,366 --> 00:18:37,200 basing it right here. 349 00:18:37,200 --> 00:18:43,333 And then replace the index I by ad because we need to increment the numbers 350 00:18:43,333 --> 00:18:48,300 of rewards zero, but only for the value corresponding to this ad index here. 351 00:18:48,800 --> 00:18:49,200 All right. 352 00:18:49,200 --> 00:18:50,400 And now we're done. 353 00:18:50,400 --> 00:18:53,400 Thompson sampling is actually fully implemented. 354 00:18:53,700 --> 00:18:57,400 So now time for the exciting step visualizing the results. 355 00:18:57,766 --> 00:18:59,700 So we'll do that in the next tutorial. 356 00:18:59,700 --> 00:19:02,700 And until then enjoy machine learning.