1 00:00:00,233 --> 00:00:02,533 Hello and welcome to this R tutorial. 2 00:00:02,533 --> 00:00:03,900 So today we're starting a 3 00:00:03,900 --> 00:00:07,166 new branch of machine learning, which is reinforcement learning. 4 00:00:07,400 --> 00:00:07,800 And that 5 00:00:07,800 --> 00:00:12,633 is taking us closer to the field of artificial intelligence, because robots 6 00:00:12,633 --> 00:00:14,933 and the artificial intelligence that comes with 7 00:00:14,933 --> 00:00:16,633 them are partly built with 8 00:00:16,633 --> 00:00:18,300 reinforcement learning. 9 00:00:18,300 --> 00:00:18,866 So, to 10 00:00:18,866 --> 00:00:20,333 prevent any disappointment 11 00:00:20,333 --> 00:00:22,566 in the next tutorials, we're not going to build any 12 00:00:22,566 --> 00:00:24,566 robots, but we will solve a very 13 00:00:24,566 --> 00:00:28,133 interesting problem, which is called the multi-armed bandit problem. 14 00:00:28,466 --> 00:00:29,500 And we are going to 15 00:00:29,500 --> 00:00:32,100 solve this problem with the two most popular algorithms for 16 00:00:32,100 --> 00:00:32,600 this. 17 00:00:32,600 --> 00:00:36,500 These are the upper confidence bounds and the Thompson sampling algorithms. 18 00:00:36,533 --> 00:00:38,566 So we're going to start today with 19 00:00:38,566 --> 00:00:39,766 upper confidence bounds. 20 00:00:39,766 --> 00:00:42,566 We're going to implement this algorithm in R. 21 00:00:42,566 --> 00:00:44,100 And in this first tutorial, 22 00:00:44,100 --> 00:00:46,133 we are going to import the data set and 23 00:00:46,133 --> 00:00:47,500 explain what the problem is. 24 00:00:47,500 --> 00:00:51,233 That is, we will explain what the multi-armed bandit problem is about. 25 00:00:51,433 --> 00:00:53,300 So let's start with the basics. 26 00:00:53,300 --> 00:00:55,900 Let's set the right folder as working directory. 27 00:00:55,900 --> 00:00:58,366 So let's go to our Machine Learning A-Z folder.
28 00:00:58,366 --> 00:01:04,133 Then Part 6, Reinforcement Learning, and Section 32, Upper Confidence Bounds, UCB. 29 00:01:04,533 --> 00:01:05,366 All right. 30 00:01:05,366 --> 00:01:06,533 And now in this folder, make 31 00:01:06,533 --> 00:01:08,400 sure that you have the Ads CTR 32 00:01:08,400 --> 00:01:11,200 Optimization file. CTR is for click-through rate. 33 00:01:11,200 --> 00:01:15,466 So we are going to try to optimize the click-through rate of 34 00:01:15,466 --> 00:01:18,633 different users on an ad that we put on a social network. 35 00:01:18,700 --> 00:01:20,466 And therefore that's the name of our 36 00:01:20,466 --> 00:01:22,033 data set CSV file. 37 00:01:22,033 --> 00:01:23,400 So if you have this data 38 00:01:23,400 --> 00:01:25,766 set, you're now ready to click on this More button here 39 00:01:25,766 --> 00:01:28,233 and then Set As Working Directory. 40 00:01:28,233 --> 00:01:28,600 Good. 41 00:01:28,600 --> 00:01:31,300 And now we are going to import the data set. 42 00:01:31,300 --> 00:01:34,800 So as usual we are going to call our variable for the data set 43 00:01:34,966 --> 00:01:36,266 dataset. 44 00:01:36,266 --> 00:01:36,900 And then 45 00:01:36,900 --> 00:01:39,666 equals read dot csv. 46 00:01:39,666 --> 00:01:40,500 Here it is. 47 00:01:40,500 --> 00:01:44,066 And then in parentheses we just need to add, 48 00:01:44,066 --> 00:01:46,500 in quotes, the name of the data set. 49 00:01:46,500 --> 00:01:48,433 So it's right here. 50 00:01:48,433 --> 00:01:52,333 And it is Ads underscore CTR 51 00:01:52,800 --> 00:01:55,700 underscore Optimization. 52 00:01:55,700 --> 00:01:56,266 Here we go. 53 00:01:56,266 --> 00:01:57,200 And let's not 54 00:01:57,200 --> 00:01:59,700 forget to add dot csv at the end. 55 00:01:59,700 --> 00:02:01,800 And now we're ready to import this data set. 56 00:02:01,800 --> 00:02:02,866 So let's do it. 57 00:02:02,866 --> 00:02:04,833 Let's select this line and
58 00:02:04,833 --> 00:02:06,500 execute. The data set is 59 00:02:06,500 --> 00:02:07,200 well imported. 60 00:02:07,200 --> 00:02:07,833 So now 61 00:02:07,833 --> 00:02:09,600 let's have a look by 62 00:02:09,600 --> 00:02:10,833 clicking on dataset 63 00:02:10,833 --> 00:02:12,000 right here. 64 00:02:12,000 --> 00:02:13,166 Here we go, okay. 65 00:02:13,166 --> 00:02:14,666 So remember, in Part 66 00:02:14,666 --> 00:02:17,333 3, Classification, we dealt with a problem that 67 00:02:17,333 --> 00:02:18,433 consisted of 68 00:02:18,433 --> 00:02:19,800 classifying and targeting 69 00:02:19,800 --> 00:02:23,733 the users on the social network for some car company marketing campaigns. 70 00:02:23,733 --> 00:02:26,633 Remember, we had this business client of the social network 71 00:02:26,633 --> 00:02:29,100 that put ads on the social network. 72 00:02:29,100 --> 00:02:31,033 And then we made these classification models 73 00:02:31,033 --> 00:02:34,133 to target the users on the social network most likely 74 00:02:34,133 --> 00:02:37,133 to buy this brand new luxury SUV that the 75 00:02:37,133 --> 00:02:39,833 car company launched at a very low price. 76 00:02:39,833 --> 00:02:40,833 And basically, to 77 00:02:40,833 --> 00:02:45,133 prepare this marketing campaign, this car company prepared an ad 78 00:02:45,366 --> 00:02:47,600 that they would put on the social network. 79 00:02:47,600 --> 00:02:49,933 And what happened is that the Department of Marketing 80 00:02:49,933 --> 00:02:53,600 prepared some different versions of this same ad, you know, putting the car 81 00:02:53,600 --> 00:02:58,100 in different sceneries, like, for example, one ad had the car on a beautiful road, 82 00:02:58,433 --> 00:03:01,433 and on another version of the ad, the car is on the mountain, 83 00:03:01,500 --> 00:03:04,200 and maybe on another version it's on a beautiful bridge. 84 00:03:04,200 --> 00:03:04,900 Well, the
85 00:03:04,900 --> 00:03:08,433 Department of Marketing prepared different versions of this same 86 00:03:08,433 --> 00:03:10,100 ad that they would put 87 00:03:10,100 --> 00:03:11,666 on the social network. 88 00:03:11,666 --> 00:03:15,700 But the problem is that they prepared ten great 89 00:03:15,700 --> 00:03:17,800 versions of the same ad. 90 00:03:17,800 --> 00:03:18,833 The ten versions of 91 00:03:18,833 --> 00:03:21,333 this ad look great, so they're 92 00:03:21,333 --> 00:03:21,866 actually not 93 00:03:21,866 --> 00:03:25,133 very sure of which ad to put on the social network. 94 00:03:25,333 --> 00:03:27,100 They want to put the ad that will get 95 00:03:27,100 --> 00:03:29,500 the maximum clicks, you know, so that most 96 00:03:29,500 --> 00:03:31,100 users buy the SUV. 97 00:03:31,100 --> 00:03:34,766 And so they need to put the ad that will lead to the best conversion rate. 98 00:03:35,266 --> 00:03:35,933 So what 99 00:03:35,933 --> 00:03:36,900 this car company did 100 00:03:36,900 --> 00:03:38,466 is that they hired us as a data 101 00:03:38,466 --> 00:03:40,500 scientist, and they said: okay, 102 00:03:40,500 --> 00:03:42,400 I have ten versions of the ad. 103 00:03:42,400 --> 00:03:44,433 We have a limited budget to place the ads 104 00:03:44,433 --> 00:03:45,866 on the social network, because 105 00:03:45,866 --> 00:03:48,566 putting these ads on the social network costs some money. 106 00:03:48,566 --> 00:03:50,000 And so this car company would like 107 00:03:50,000 --> 00:03:50,533 a data 108 00:03:50,533 --> 00:03:51,933 scientist to find the 109 00:03:51,933 --> 00:03:54,900 best strategy to quickly find out which 110 00:03:54,900 --> 00:03:55,433 version of 111 00:03:55,433 --> 00:03:58,033 this ad is the best for the users. 112 00:03:58,033 --> 00:04:00,900 That is, which version of the ad will lead us to the 113 00:04:00,900 --> 00:04:02,300 highest conversion rate.
114 00:04:02,300 --> 00:04:05,166 That's the CTR, that's the click-through rate. 115 00:04:05,166 --> 00:04:06,000 We want to find the 116 00:04:06,000 --> 00:04:08,700 ad that will get the most clicks. 117 00:04:08,700 --> 00:04:09,300 And so now, 118 00:04:09,300 --> 00:04:13,333 speaking of this, this is leading us to the key difference between what we 119 00:04:13,333 --> 00:04:16,233 are about to do now and what we've been doing earlier. 120 00:04:16,233 --> 00:04:19,000 Because earlier we had a data set with 121 00:04:19,000 --> 00:04:22,666 some data containing independent variables and one dependent variable, 122 00:04:23,033 --> 00:04:26,500 and then we did some clustering where we had independent variables only. 123 00:04:26,900 --> 00:04:29,033 And now things are different. 124 00:04:29,033 --> 00:04:31,433 We start with no data. 125 00:04:31,433 --> 00:04:33,266 I know we have some data set in front of us, 126 00:04:33,266 --> 00:04:35,700 but this is just a data set for simulation. 127 00:04:35,700 --> 00:04:38,800 Because what happens in real life, and we're going to pretend 128 00:04:38,800 --> 00:04:40,433 we're in real life, we're going to pretend 129 00:04:40,433 --> 00:04:41,566 that we don't have any 130 00:04:41,566 --> 00:04:42,600 data yet, 131 00:04:42,600 --> 00:04:44,600 well, what happens in real life is that 132 00:04:44,600 --> 00:04:46,666 we are going to start experimenting with these 133 00:04:46,666 --> 00:04:50,333 ads by placing them on a social network, the different versions of the ads, 134 00:04:50,566 --> 00:04:51,700 and according to the 135 00:04:51,700 --> 00:04:56,566 results we observe, we will change our strategy to place these ads on the 136 00:04:56,566 --> 00:04:57,600 social network. 137 00:04:57,600 --> 00:04:59,333 So here are the different steps 138 00:04:59,333 --> 00:05:00,466 of the process. 139 00:05:00,466 --> 00:05:01,100 We have
140 00:05:01,100 --> 00:05:04,500 ten versions of the same ad, ten versions of 141 00:05:04,700 --> 00:05:08,866 this ad trying to sell this cheap luxury SUV. 142 00:05:09,233 --> 00:05:10,866 And each time a user of 143 00:05:10,866 --> 00:05:13,366 the social network logs in to his account, 144 00:05:13,366 --> 00:05:14,333 we will place 145 00:05:14,333 --> 00:05:17,566 one version of these ten ads, and that will 146 00:05:17,566 --> 00:05:18,733 be a round. Each 147 00:05:18,733 --> 00:05:20,866 time a user connects to his account, 148 00:05:20,866 --> 00:05:22,933 we will show him one version of the ad, 149 00:05:22,933 --> 00:05:24,166 for example, ad three, 150 00:05:24,166 --> 00:05:25,100 version three of the 151 00:05:25,100 --> 00:05:28,100 ad, and we will observe his response. 152 00:05:28,233 --> 00:05:32,233 If the user clicks on the ad, we get a reward equal to one, 153 00:05:32,533 --> 00:05:36,600 and if the user doesn't click on the ad, we get a reward equal to zero. 154 00:05:37,033 --> 00:05:41,533 And we're going to do this for 10,000 users on the social network. 155 00:05:41,700 --> 00:05:44,100 We're going to show the ad to 10,000 users. 156 00:05:44,100 --> 00:05:44,600 We're going to 157 00:05:44,600 --> 00:05:49,066 observe whether the user clicks on the ad or not. If the user clicks on the ad, 158 00:05:49,066 --> 00:05:50,966 that will give us a reward of one, 159 00:05:50,966 --> 00:05:52,766 and if the user doesn't click on the ad, 160 00:05:52,766 --> 00:05:55,100 that will give us a reward of zero. 161 00:05:55,100 --> 00:05:57,600 However, we're not going to show the 162 00:05:57,600 --> 00:06:00,400 different versions of the ad to each user at random. 163 00:06:00,400 --> 00:06:01,266 There's going to be 164 00:06:01,266 --> 00:06:03,666 a specific strategy to do this. 165 00:06:03,666 --> 00:06:05,100 And the key 166 00:06:05,100 --> 00:06:07,833 thing to understand about reinforcement learning is that
167 00:06:07,833 --> 00:06:08,833 the strategy will 168 00:06:08,833 --> 00:06:11,166 depend at each round on 169 00:06:11,166 --> 00:06:14,766 the previous results we observed at the previous rounds. 170 00:06:15,000 --> 00:06:15,566 So for 171 00:06:15,566 --> 00:06:19,000 example, when we are at round ten, well, what happens behind the scenes 172 00:06:19,000 --> 00:06:20,100 is that the algorithm will 173 00:06:20,100 --> 00:06:22,200 look at the different results observed 174 00:06:22,200 --> 00:06:23,833 during the first ten rounds, 175 00:06:23,833 --> 00:06:25,500 and according to these results 176 00:06:25,500 --> 00:06:29,633 it will decide which version of the ad it will show to the user. 177 00:06:29,933 --> 00:06:31,433 That's why reinforcement learning 178 00:06:31,433 --> 00:06:34,933 is also called online learning or interactive learning, 179 00:06:35,300 --> 00:06:36,900 because the strategy is dynamic. 180 00:06:36,900 --> 00:06:39,900 It depends on the observations from the beginning of the experiment 181 00:06:40,166 --> 00:06:41,866 up to the present time. 182 00:06:41,866 --> 00:06:43,633 And so now, what is this data set? 183 00:06:43,633 --> 00:06:46,800 This is just some simulation of what is going to happen when we 184 00:06:46,800 --> 00:06:48,900 show the ads to the users. 185 00:06:48,900 --> 00:06:52,800 In other words, this is what God knows, because we have no idea 186 00:06:52,800 --> 00:06:55,766 which ad each user is going to click on. 187 00:06:55,766 --> 00:06:57,566 And that's what this data set is telling us. 188 00:06:57,566 --> 00:06:59,966 It's telling us for each round, that is, for 189 00:06:59,966 --> 00:07:00,700 each user 190 00:07:00,700 --> 00:07:03,233 connecting to his account, which versions 191 00:07:03,233 --> 00:07:05,033 of the ad the user is going to click on. 192 00:07:05,033 --> 00:07:06,566 So let's give an example.
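Before that example, the round-by-round process described so far can be sketched in R. This is only an illustration under stated assumptions: the data matrix is synthetic (invented click probabilities, not the tutorial's CSV file), and `choose_ad` is a hypothetical placeholder rule (try every ad once, then pick the best observed average). It is not the UCB algorithm; it only shows a decision that depends on the results of earlier rounds.

```r
set.seed(42)   # assumption: fixed seed so the sketch is reproducible
N <- 10000     # number of rounds, one per user, as in the tutorial
d <- 10        # number of ad versions

# Synthetic stand-in for the simulation data set: 1 = would click, 0 = would not.
# The click probabilities below are invented for this sketch.
click_prob <- seq(0.02, 0.20, length.out = d)
dataset <- sapply(click_prob, function(p) rbinom(N, 1, p))  # N x d matrix

selections <- integer(d)    # how many times each ad has been shown so far
reward_sums <- numeric(d)   # how many clicks each ad has collected so far
total_reward <- 0

# Hypothetical placeholder strategy (not UCB): show each ad once,
# then always show the ad with the best observed click rate.
choose_ad <- function(selections, reward_sums) {
  untried <- which(selections == 0)
  if (length(untried) > 0) return(untried[1])
  which.max(reward_sums / selections)
}

for (n in 1:N) {
  ad <- choose_ad(selections, reward_sums)  # uses only earlier rounds
  reward <- dataset[n, ad]                  # 1 if user n clicks, else 0
  selections[ad] <- selections[ad] + 1
  reward_sums[ad] <- reward_sums[ad] + reward
  total_reward <- total_reward + reward
}
```

The two running vectors are the whole "memory" of the experiment here: whatever strategy is plugged in, it can only read what was observed up to the current round.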
193 00:07:06,566 --> 00:07:09,566 Let's explain what happens for the first five users. 194 00:07:09,633 --> 00:07:11,433 So let's take the first round. 195 00:07:11,433 --> 00:07:14,433 According to the simulation, or according to God, 196 00:07:14,566 --> 00:07:15,600 this first user of 197 00:07:15,600 --> 00:07:17,766 the social network is going to click on the ad 198 00:07:17,766 --> 00:07:22,766 if we show him the first version, the fifth version or the ninth version, 199 00:07:23,166 --> 00:07:25,966 and if we show him the second version, the third version, the fourth 200 00:07:25,966 --> 00:07:29,366 version, or the sixth, seventh, eighth or tenth version, 201 00:07:29,700 --> 00:07:32,700 well, this user is not going to click on the ad. 202 00:07:32,866 --> 00:07:34,633 So this is what God knows. 203 00:07:34,633 --> 00:07:37,533 But as far as we are concerned, we have no 204 00:07:37,533 --> 00:07:40,400 idea which ads this user will click on. 205 00:07:40,400 --> 00:07:41,800 So what about the second user? 206 00:07:41,800 --> 00:07:43,000 So that's the second round. 207 00:07:43,000 --> 00:07:47,066 At the second round we show another version of the ad, and according to God's 208 00:07:47,066 --> 00:07:49,700 truth, the second user will only click on the ad 209 00:07:49,700 --> 00:07:54,300 if we show him the ninth version. The third user will never click on the ad, 210 00:07:54,300 --> 00:07:55,233 whatever version we 211 00:07:55,233 --> 00:07:57,400 display. The fourth user will only 212 00:07:57,400 --> 00:07:59,500 click on the second version and the 213 00:07:59,500 --> 00:08:02,433 eighth version, and the fifth user will never click on the 214 00:08:02,433 --> 00:08:04,866 ad, whatever version we show to him. 215 00:08:04,866 --> 00:08:06,900 All right. So that's the idea of the problem. 216 00:08:06,900 --> 00:08:09,000 And so we're going to build two algorithms, the 217 00:08:09,000 --> 00:08:12,000 UCB
Algorithm and the Thompson sampling algorithm. 218 00:08:12,300 --> 00:08:14,433 And these algorithms will decide at each 219 00:08:14,433 --> 00:08:15,066 round here 220 00:08:15,066 --> 00:08:18,200 which version of the ad to show to the user. 221 00:08:18,500 --> 00:08:19,366 And depending on the 222 00:08:19,366 --> 00:08:21,800 reward the ad gets, that is, reward 223 00:08:21,800 --> 00:08:22,366 equals one 224 00:08:22,366 --> 00:08:24,933 if the user clicks on the ad, or reward equals zero 225 00:08:24,933 --> 00:08:26,833 if the user doesn't click on the ad, 226 00:08:26,833 --> 00:08:28,966 it will decide which ad to show to the 227 00:08:28,966 --> 00:08:31,300 user at the next round according to the 228 00:08:31,300 --> 00:08:32,900 previous observations. 229 00:08:32,900 --> 00:08:35,666 And so we're going to have 10,000 rounds. 230 00:08:35,666 --> 00:08:37,300 If we go down here, 231 00:08:37,300 --> 00:08:40,466 we can see that we are showing the ad to 10,000 users. 232 00:08:40,800 --> 00:08:42,600 And so of course the goal of the 233 00:08:42,600 --> 00:08:44,366 algorithm is to maximize 234 00:08:44,366 --> 00:08:45,900 the total reward, that is, 235 00:08:45,900 --> 00:08:48,800 the sum of all the different rewards at each round 236 00:08:48,800 --> 00:08:50,933 obtained by the different selections of 237 00:08:50,933 --> 00:08:51,900 the ads. 238 00:08:51,900 --> 00:08:53,233 Okay, so let's do it. 239 00:08:53,233 --> 00:08:54,466 Let's start with 240 00:08:54,466 --> 00:08:57,466 upper confidence bound, the UCB algorithm. 241 00:08:57,466 --> 00:08:59,133 But before we start implementing 242 00:08:59,133 --> 00:09:01,833 this algorithm, I would like to show you something. 243 00:09:01,833 --> 00:09:03,900 I would like to show you what would happen if 244 00:09:03,900 --> 00:09:07,566 we randomly selected the versions of the ad at each round. 245 00:09:07,633 --> 00:09:10,633 You know, no algorithm, no strategy.
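To make the layout of the simulation data and the total reward concrete, here is a small R sketch. It rebuilds only the first five rows exactly as they were read out (one row per user, one column per ad version); the object names `first_rows` and `shown`, and the sequence of ads in `shown`, are hypothetical and chosen purely for illustration.

```r
# First five rows as described: columns are ad versions 1..10,
# 1 = the user would click on that version, 0 = the user would not.
first_rows <- matrix(0, nrow = 5, ncol = 10)
first_rows[1, c(1, 5, 9)] <- 1   # user 1 clicks on versions 1, 5 and 9
first_rows[2, 9] <- 1            # user 2 clicks only on version 9
                                 # user 3 clicks on no version
first_rows[4, c(2, 8)] <- 1      # user 4 clicks on versions 2 and 8
                                 # user 5 clicks on no version

# Hypothetical choices for rounds 1..5 (one ad shown per round).
shown <- c(3, 9, 1, 2, 7)

# Reward at each round = the cell for (that round's user, the ad shown).
rewards <- first_rows[cbind(1:5, shown)]
total_reward <- sum(rewards)     # the quantity the algorithm tries to maximize

print(rewards)        # 0 1 0 1 0
print(total_reward)   # 2
```

Only one cell per row is ever revealed to us, namely the one for the ad we chose to show; the rest of the row stays hidden, which is why the choice at each round matters.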
246 00:09:10,700 --> 00:09:14,766 Each time a user connects to his account, we are displaying one version 247 00:09:14,766 --> 00:09:17,666 of these ten ads totally at random. 248 00:09:17,666 --> 00:09:20,866 So I actually prepared this algorithm. 249 00:09:20,866 --> 00:09:22,333 We're not going to implement it 250 00:09:22,333 --> 00:09:24,066 together because this algorithm 251 00:09:24,066 --> 00:09:25,566 is actually not very relevant. 252 00:09:25,566 --> 00:09:27,333 It's just to give you the motivation for what 253 00:09:27,333 --> 00:09:29,633 we will implement in the next tutorials. 254 00:09:29,633 --> 00:09:33,333 But this algorithm is actually provided in the folder. 255 00:09:33,333 --> 00:09:36,233 You see, it's this random selection file. 256 00:09:36,233 --> 00:09:38,400 And actually I prepared it here. 257 00:09:38,400 --> 00:09:39,900 That's the algorithm. 258 00:09:39,900 --> 00:09:42,900 So as you can see, I call this algorithm random selection. 259 00:09:43,100 --> 00:09:44,700 Here I'm importing the data set 260 00:09:44,700 --> 00:09:45,533 as we just did, 261 00:09:45,533 --> 00:09:47,800 so I don't need to execute that again. 262 00:09:47,800 --> 00:09:51,866 And in this section I am implementing the random selection algorithm, which 263 00:09:52,000 --> 00:09:54,266 just consists of selecting 264 00:09:54,266 --> 00:09:57,066 at random one version of the ad at each round, that is, 265 00:09:57,066 --> 00:10:00,133 each time a user connects to his social network account. 266 00:10:00,666 --> 00:10:03,133 So I'm going to execute this section right now. 267 00:10:04,166 --> 00:10:06,300 So here it is, well implemented. 268 00:10:06,300 --> 00:10:08,033 And we can see the different results 269 00:10:08,033 --> 00:10:08,966 of this algorithm. 270 00:10:08,966 --> 00:10:10,333 So the most important 271 00:10:10,333 --> 00:10:12,533 result is the total reward. 272 00:10:12,533 --> 00:10:14,766 That is, this variable is the sum
273 00:10:14,766 --> 00:10:16,166 of the different rewards 274 00:10:16,166 --> 00:10:17,333 up to the last round, 275 00:10:17,333 --> 00:10:19,800 that is, up to the 10,000th user. 276 00:10:19,800 --> 00:10:21,433 And so what is this total reward? 277 00:10:21,433 --> 00:10:22,900 This total reward is 278 00:10:22,900 --> 00:10:25,400 1,242. 279 00:10:25,400 --> 00:10:27,133 So what happened is that the 280 00:10:27,133 --> 00:10:29,833 random selection algorithm randomly selected 281 00:10:29,833 --> 00:10:31,200 an ad at each round. 282 00:10:31,200 --> 00:10:32,100 We can actually see 283 00:10:32,100 --> 00:10:33,566 the random selections in this 284 00:10:33,566 --> 00:10:35,433 ads selected list. 285 00:10:35,433 --> 00:10:38,100 So we can clearly see what happened at round one. 286 00:10:38,100 --> 00:10:39,366 For the first user, 287 00:10:39,366 --> 00:10:42,833 the random selection algorithm selected version number four, 288 00:10:42,833 --> 00:10:44,866 then at the second round, version number four, 289 00:10:44,866 --> 00:10:48,000 then at the third round, version number three, then at the fourth round, 290 00:10:48,000 --> 00:10:51,000 version number one, and at the fifth round, version number four. 291 00:10:51,166 --> 00:10:52,666 Those are the random selections. 292 00:10:52,666 --> 00:10:55,066 And so at each round, based on 293 00:10:55,066 --> 00:10:59,100 God's true results, the selection of the ad generated a reward. 294 00:10:59,400 --> 00:11:01,200 So at the first round, for the first 295 00:11:01,200 --> 00:11:02,666 user connecting to his account, 296 00:11:02,666 --> 00:11:05,666 the random selection algorithm selected ad number four. 297 00:11:06,000 --> 00:11:08,700 And we see a zero here, which means that this first user 298 00:11:08,700 --> 00:11:10,200 doesn't click on this ad. 299 00:11:10,200 --> 00:11:12,933 So we get a zero reward at the first round.
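A random selection baseline of the kind described here could look roughly like the following R sketch. It is not the file provided in the course folder, only an illustration. The data matrix is synthetic: every cell is a click with probability 0.12, an invented value picked so the total lands in the same range as the figure quoted in the tutorial. With the real CSV, `dataset` would come from `read.csv` instead.

```r
set.seed(123)  # assumption: fixed seed; a different seed gives a slightly different total
N <- 10000     # rounds (users)
d <- 10        # ad versions

# Synthetic stand-in for the simulation data set (assumption, not the real file).
dataset <- matrix(rbinom(N * d, 1, 0.12), nrow = N, ncol = d)

ads_selected <- integer(N)  # which ad was shown at each round
total_reward <- 0

for (n in 1:N) {
  ad <- sample(1:d, 1)            # pick one of the d ads uniformly at random
  ads_selected[n] <- ad
  reward <- dataset[n, ad]        # 1 if user n would click on that ad, else 0
  total_reward <- total_reward + reward
}

head(ads_selected, 5)  # the first five random selections
total_reward           # sum of the 0/1 rewards over all rounds
```

Because the selection ignores every previous result, rerunning the loop with another seed only moves the total by a small random amount, which matches the behaviour shown when the section is executed several times.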
300 00:11:12,933 --> 00:11:15,633 Then what about the second selection, four? 301 00:11:15,633 --> 00:11:16,700 Well, we see here at the 302 00:11:16,700 --> 00:11:17,633 second round that there is 303 00:11:17,633 --> 00:11:18,433 also a zero 304 00:11:18,433 --> 00:11:22,266 for ad number four, which means that the second user doesn't click on this ad, 305 00:11:22,266 --> 00:11:24,966 and therefore we also get a zero reward. 306 00:11:24,966 --> 00:11:25,266 And the 307 00:11:25,266 --> 00:11:27,800 total reward that we observe here is actually 308 00:11:27,800 --> 00:11:28,800 the sum of 309 00:11:28,800 --> 00:11:31,500 all the rewards that it gets, whether it's zero 310 00:11:31,500 --> 00:11:32,233 or one, 311 00:11:32,233 --> 00:11:35,233 at the end of the 10,000 rounds. 312 00:11:35,366 --> 00:11:35,700 All right. 313 00:11:35,700 --> 00:11:36,566 So the 314 00:11:36,566 --> 00:11:38,700 interesting thing to remember here is that 315 00:11:38,700 --> 00:11:39,566 when we randomly 316 00:11:39,566 --> 00:11:41,233 select the ads, we get a 317 00:11:41,233 --> 00:11:44,166 reward of 1,242. 318 00:11:44,166 --> 00:11:46,466 Well, you know, there is this random factor. 319 00:11:46,466 --> 00:11:47,000 And so of 320 00:11:47,000 --> 00:11:48,233 course if we select 321 00:11:48,233 --> 00:11:51,000 that again we'll get another reward, 322 00:11:51,000 --> 00:11:53,733 but it will be very close to this value here. 323 00:11:53,733 --> 00:11:55,833 I'm going to do that again. 324 00:11:55,833 --> 00:11:56,566 And as you can see, 325 00:11:56,566 --> 00:11:59,533 we get 1,232. 326 00:11:59,533 --> 00:12:06,733 And I can even do that again: I get 1,246. Again: 1,236. 327 00:12:06,733 --> 00:12:10,200 So we always get a total reward close to 328 00:12:10,200 --> 00:12:12,200 1,200. 329 00:12:12,200 --> 00:12:13,333 So let's keep this 330 00:12:13,333 --> 00:12:15,833 result in our minds, because then we will compare
331 00:12:15,833 --> 00:12:19,133 it to the total reward that we get 332 00:12:19,366 --> 00:12:23,733 thanks to our more advanced algorithms, which are the upper confidence bound 333 00:12:23,966 --> 00:12:26,566 and then the Thompson sampling algorithm. 334 00:12:26,566 --> 00:12:28,600 So 1,200. 335 00:12:28,600 --> 00:12:29,466 Let's see how 336 00:12:29,466 --> 00:12:32,266 UCB and Thompson sampling beat that. 337 00:12:32,266 --> 00:12:34,833 And now just the last thing to show you. As for every 338 00:12:34,833 --> 00:12:36,933 algorithm we implement in this course, we get 339 00:12:36,933 --> 00:12:38,400 the exciting step at the end, which is 340 00:12:38,400 --> 00:12:39,933 to visualize the results. 341 00:12:39,933 --> 00:12:42,533 Well, in this part, reinforcement learning, the 342 00:12:42,533 --> 00:12:45,333 visualization of the results will consist 343 00:12:45,333 --> 00:12:48,600 of visualizing a histogram where we see 344 00:12:48,600 --> 00:12:51,433 the different selections of the different versions of the ad. 345 00:12:51,433 --> 00:12:55,166 So I'm going to show you what happened for our random selection algorithm. 346 00:12:55,733 --> 00:12:56,833 Let's do it. 347 00:12:56,833 --> 00:12:59,266 Press Command or Control plus Enter to execute. 348 00:12:59,266 --> 00:13:00,466 And here we go. 349 00:13:00,466 --> 00:13:03,400 Of course, since our algorithm randomly 350 00:13:03,400 --> 00:13:03,900 selected 351 00:13:03,900 --> 00:13:05,000 the different versions of the 352 00:13:05,000 --> 00:13:07,966 ads at each round, well, of course we get 353 00:13:07,966 --> 00:13:11,966 a nearly uniform distribution of the different versions of the ads. 354 00:13:12,200 --> 00:13:13,966 The ten different versions of the 355 00:13:13,966 --> 00:13:15,966 ads were selected more or less the 356 00:13:15,966 --> 00:13:17,566 same number of times. 357 00:13:17,566 --> 00:13:17,966 All right.
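The histogram step can be sketched in base R as follows. This is an assumption-laden illustration rather than the course's plotting code: `ads_selected` is regenerated here with `sample` so the block runs on its own, and the colour, title and axis labels are arbitrary choices.

```r
set.seed(1)  # assumption: fixed seed for a reproducible picture
# Stand-in for the selections made by a random strategy over 10,000 rounds.
ads_selected <- sample(1:10, 10000, replace = TRUE)

# One bar per ad version: breaks at 0.5, 1.5, ..., 10.5 centre each bar on an integer.
hist(ads_selected,
     breaks = seq(0.5, 10.5, by = 1),
     col = "blue",
     main = "Histogram of ad selections",
     xlab = "Ad version",
     ylab = "Number of times selected")

# The same information as a table of counts per ad version.
counts <- table(factor(ads_selected, levels = 1:10))
counts
```

With uniform random choices each bar should sit near 1,000, which is the nearly flat shape described above; a strategy that learns from the rewards would instead be expected to concentrate its selections on a few ads.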
358 00:13:17,966 --> 00:13:20,966 So that was just to give you a little extra motivation. 359 00:13:21,033 --> 00:13:23,200 And now, time to go pro. 360 00:13:23,200 --> 00:13:25,900 Let's go back to our UCB 361 00:13:25,900 --> 00:13:28,566 algorithm and start implementing it. 362 00:13:28,566 --> 00:13:33,066 So remember: 1,200 total reward for the random selection algorithm. 363 00:13:33,366 --> 00:13:35,600 Let's see how UCB beats that. 364 00:13:35,600 --> 00:13:37,600 So we'll find out in the next tutorial. 365 00:13:37,600 --> 00:13:39,366 And until then, enjoy machine learning.