1 00:00:00,733 --> 00:00:03,300 Hello and welcome back to the course on Machine Learning. 2 00:00:03,300 --> 00:00:04,933 Today we've got a very interesting topic. 3 00:00:04,933 --> 00:00:09,566 Today we talk about Thompson sampling and the algorithm's intuition. 4 00:00:09,566 --> 00:00:10,833 And again 5 00:00:10,833 --> 00:00:15,600 we're going to be using this algorithm to solve the multi-armed bandit problem. 6 00:00:16,000 --> 00:00:16,366 All right. 7 00:00:16,366 --> 00:00:17,933 So let's get started. 8 00:00:17,933 --> 00:00:20,666 A quick refresher on that multi-armed bandit problem. 9 00:00:20,666 --> 00:00:23,400 We have several slot machines. 10 00:00:23,400 --> 00:00:25,533 each one of them has a distribution behind it. 11 00:00:25,533 --> 00:00:26,966 Do you want to do five? 12 00:00:26,966 --> 00:00:29,300 We don't know what these distributions are. 13 00:00:29,300 --> 00:00:31,933 And we need to start playing these machines. 14 00:00:31,933 --> 00:00:35,533 And at the same time, figure out which one has the best distribution 15 00:00:35,533 --> 00:00:38,533 so that we then exploit that best distribution. 16 00:00:38,666 --> 00:00:42,433 and but, it'll take us some time to figure it out. 17 00:00:42,433 --> 00:00:46,200 So we need to maximize our return during the process of figuring out. 18 00:00:46,200 --> 00:00:49,666 So we have to, find that ideal balance 19 00:00:49,666 --> 00:00:53,100 or trade off between exploration and exploitation. 20 00:00:53,766 --> 00:00:57,300 we had a few tutorials previously on, these things. 21 00:00:57,300 --> 00:01:00,433 So we talked about the multi-armed bandit problem in a lot of detail. 22 00:01:00,433 --> 00:01:02,633 So if you haven't watched that tutorial, I highly recommend 23 00:01:02,633 --> 00:01:04,700 jumping into the previous section and watching it there. 24 00:01:04,700 --> 00:01:08,966 Also, understanding of the upper confidence bounds 25 00:01:09,433 --> 00:01:14,633 algorithm will really help you grasp the concepts of, Thompson sampling. 26 00:01:14,633 --> 00:01:19,500 So if you haven't seen the upper confidence bound, tutorial, 27 00:01:19,733 --> 00:01:23,800 then I highly recommend checking that out before proceeding with today's lecture. 28 00:01:24,133 --> 00:01:26,666 All right. So let's get started. 29 00:01:26,666 --> 00:01:31,600 A quick summary of the multi-armed bandit problem, is as follows. 30 00:01:31,600 --> 00:01:32,666 We have d'Armes. 31 00:01:32,666 --> 00:01:33,600 For example, 32 00:01:33,600 --> 00:01:37,866 arms are ads that we display to users each time they, connect to a webpage. 33 00:01:37,866 --> 00:01:40,233 So, by the way, yes. indeed. 34 00:01:40,233 --> 00:01:43,633 a modern application of the more modern 35 00:01:43,633 --> 00:01:47,400 representation of the multi-armed bandit problem is advertising. 36 00:01:47,400 --> 00:01:50,966 So when you display ads, that is very similar, 37 00:01:51,533 --> 00:01:54,733 or the algorithms that we're going to be applying can learning, 38 00:01:54,733 --> 00:01:59,066 can be applied to solving a problem where you're displaying ads. 39 00:01:59,066 --> 00:02:01,366 So if you go back here instead of having, 40 00:02:01,366 --> 00:02:04,033 these slot machines, each one of these is a different ad, 41 00:02:04,033 --> 00:02:05,733 and you want to figure out which one of these 42 00:02:05,733 --> 00:02:09,466 ads is the best performing ad, but you don't have time to do an AB, 43 00:02:09,566 --> 00:02:12,933 so you don't have the funds or the resources to do an AB test. 44 00:02:13,033 --> 00:02:14,900 You want to figure it out on the fly. 45 00:02:14,900 --> 00:02:16,366 That's when you would apply, 46 00:02:16,366 --> 00:02:19,666 one of these algorithms that we were talking about in this part of the course, 47 00:02:20,333 --> 00:02:24,700 and of course, this of other lots of other modern problems 48 00:02:24,700 --> 00:02:27,166 that are very similar to the multi-armed bandit problem. 49 00:02:27,166 --> 00:02:30,500 Therefore, these algorithms, are valid for them too. 50 00:02:31,066 --> 00:02:32,533 All right. So moving back here. 51 00:02:32,533 --> 00:02:36,333 So we've got the arms, for example, arms are ads that we display. 52 00:02:36,333 --> 00:02:37,066 She uses each time 53 00:02:37,066 --> 00:02:40,066 they connect to a web page, each time a user connects to this page, 54 00:02:40,266 --> 00:02:43,700 that is considered as a round at each round. 55 00:02:43,700 --> 00:02:47,066 And we choose which to display to the user at each round. 56 00:02:47,066 --> 00:02:50,433 And the ad gives us a reward, either 0 or 1 57 00:02:51,266 --> 00:02:53,966 one if the ad is clicked, zero if the ad is not clicked. 58 00:02:53,966 --> 00:02:58,800 In the case of the actual bandits, it'll be a monetary award, so it won't be 59 00:02:58,800 --> 00:03:01,066 just 0 to 1. It'll be a some value. 60 00:03:01,066 --> 00:03:05,866 our goal is to maximize maximize the total reward we get over many rounds. 61 00:03:06,166 --> 00:03:06,500 All right. 62 00:03:06,500 --> 00:03:10,300 So that's a quick, overview of the problem that we're solving. 63 00:03:10,800 --> 00:03:14,133 then here we've got, some very complex 64 00:03:14,533 --> 00:03:19,766 looking mathematics and Bayesian inference and all these, distributions 65 00:03:21,033 --> 00:03:23,233 and posterior probability 66 00:03:23,233 --> 00:03:28,066 and other, prior distributions and beta distributions and so on. 67 00:03:28,800 --> 00:03:31,066 we're not going to delve deep into this right now. 68 00:03:31,066 --> 00:03:35,200 So had lunch will, take some time to walk through this slide 69 00:03:35,200 --> 00:03:38,500 with you in the practical tutorials because, 70 00:03:39,066 --> 00:03:42,466 we're going to be coding this, from scratch. 71 00:03:42,833 --> 00:03:46,733 well, at least in R and therefore 72 00:03:46,766 --> 00:03:49,200 he will actually walk you through this slide. 73 00:03:49,200 --> 00:03:54,900 Our goal for today is to understand the intuition behind all of these things. 74 00:03:54,900 --> 00:03:58,633 So we're going to skip skip the slide and get to the intuition part. 75 00:03:58,633 --> 00:04:00,533 And this is the actual steps 76 00:04:00,533 --> 00:04:04,500 that, you're going to be using in the practical tutorial. 77 00:04:04,500 --> 00:04:07,500 So again had lunch will walk you through this slide as well. 78 00:04:08,300 --> 00:04:10,366 And today we're talking about intuition. 79 00:04:10,366 --> 00:04:12,233 So this is going to be fun. 80 00:04:12,233 --> 00:04:14,100 This is some interesting slides coming up. 81 00:04:14,100 --> 00:04:16,866 And get ready for a fun ride. 82 00:04:16,866 --> 00:04:19,200 So grab a cup of coffee a cup of tea 83 00:04:19,200 --> 00:04:22,300 three rings of popcorn and, let's get started. 84 00:04:22,800 --> 00:04:25,533 All right, so here we've got a scale. 85 00:04:25,533 --> 00:04:29,733 the, a y or the horizontal axis is the return. 86 00:04:30,066 --> 00:04:33,066 The return that we expect to get from a bandit. 87 00:04:33,066 --> 00:04:34,966 So we're going to look at a simplified problem. 88 00:04:34,966 --> 00:04:37,366 We're going to look at rather than five or even ten. 89 00:04:37,366 --> 00:04:38,800 We're going to just look at three bandits 90 00:04:38,800 --> 00:04:41,933 because there's going to be a lot of stuff going on on this, chart. 91 00:04:41,933 --> 00:04:44,000 And I don't want to clutter it up, 92 00:04:44,000 --> 00:04:45,700 and we want to keep it as simple as possible, 93 00:04:45,700 --> 00:04:47,366 just so that we can understand the concept. 94 00:04:47,366 --> 00:04:50,400 And then, the same thing applies for 5 or 10 or however 95 00:04:50,400 --> 00:04:53,500 many, machines that, you'd be looking at. 96 00:04:54,300 --> 00:04:58,366 So here we've got the return and these vertical lines, just like in the case, 97 00:04:58,833 --> 00:05:03,400 of the, upper confidence bound, we had the horizontal lines. 98 00:05:03,633 --> 00:05:04,633 These vertical lines, 99 00:05:05,866 --> 00:05:08,666 represent the expected return for each of the machines. 100 00:05:08,666 --> 00:05:11,966 So, each of the machines out of the three machines that we have, 101 00:05:11,966 --> 00:05:15,766 each of the machines has a distribution behind it 102 00:05:15,766 --> 00:05:19,433 so that the, result, the amount of money that you win per 103 00:05:19,900 --> 00:05:23,700 game, is picked as a random value from that distribution. 104 00:05:24,500 --> 00:05:27,600 and we're not going to draw distributions, but basically just imagine distribution 105 00:05:27,600 --> 00:05:31,666 behind each one of these, expected values. 106 00:05:31,666 --> 00:05:35,133 So this is, this is just the center of that distribution or 107 00:05:35,466 --> 00:05:38,600 the actual expected return from that machine. 108 00:05:38,600 --> 00:05:39,866 And we're just visualizing it here. 109 00:05:39,866 --> 00:05:44,100 But this is also kind of like looking into, 110 00:05:44,500 --> 00:05:48,600 into the actual machine itself or like pulling it apart 111 00:05:48,600 --> 00:05:51,300 and knowing how it works or being the designer of the machine. 112 00:05:51,300 --> 00:05:54,000 In reality, when you're playing these machines, of course, you don't know this. 113 00:05:54,000 --> 00:05:58,000 So this is some additional information that will guide 114 00:05:58,000 --> 00:06:01,000 us, that will help us understand how the algorithm actually works. 115 00:06:01,000 --> 00:06:03,366 The algorithm itself doesn't know this information, right. 116 00:06:03,366 --> 00:06:05,633 So this is hidden, but it's just there for us 117 00:06:05,633 --> 00:06:07,866 so that we can better understand what's going on. 118 00:06:07,866 --> 00:06:09,066 So these are the expected returns, 119 00:06:09,066 --> 00:06:11,500 the actual expected returns of each of the machines. 120 00:06:11,500 --> 00:06:13,100 And obviously if you knew this right away, 121 00:06:13,100 --> 00:06:16,733 you would say that machine number three or the yellow machine is the best one. 122 00:06:17,333 --> 00:06:19,900 because it has the highest return, right? 123 00:06:19,900 --> 00:06:22,000 It has the highest expected return. 124 00:06:22,000 --> 00:06:23,000 You'd always just bet on this one. 125 00:06:23,000 --> 00:06:25,466 But again, you don't know this yet. 126 00:06:25,466 --> 00:06:28,733 All right, so what happens, with this algorithm? 127 00:06:28,733 --> 00:06:31,733 Well, at the start, just like with the upper confidence 128 00:06:31,733 --> 00:06:34,400 bound algorithm, you don't know anything, right? 129 00:06:34,400 --> 00:06:38,133 You have no prior knowledge of the, current, 130 00:06:38,733 --> 00:06:41,366 situation or status of things, and therefore, 131 00:06:41,366 --> 00:06:44,900 all the machines are identical to you, and you have to have at least one 132 00:06:44,900 --> 00:06:49,366 or even a couple of trial rounds just to get some data to analyze. 133 00:06:49,366 --> 00:06:52,366 And so let's see, let's say that has happened. 134 00:06:52,400 --> 00:06:56,366 There are some, trial runs for the so for machine number one 135 00:06:56,366 --> 00:06:57,600 or the Blue machine. 136 00:06:57,600 --> 00:07:01,400 And, based on those trial runs, 137 00:07:01,966 --> 00:07:05,100 the algorithm, the Thompson sampling algorithm, this is where it starts 138 00:07:05,100 --> 00:07:09,366 getting different to the upper confidence bound will construct a distribution. 139 00:07:10,066 --> 00:07:12,333 Right. So we'll get to this distribution in a second. 140 00:07:12,333 --> 00:07:13,166 What means. 141 00:07:13,166 --> 00:07:16,900 But for now let's just do the same thing for machine in the Green Machine. 142 00:07:17,400 --> 00:07:18,600 And so yeah 143 00:07:18,600 --> 00:07:23,100 we just are pulling the arm several times like four times for instance. 144 00:07:23,100 --> 00:07:26,833 And we're getting some values and 145 00:07:26,833 --> 00:07:31,200 that are going to be somewhere around obviously the actual expected return. 146 00:07:31,300 --> 00:07:31,566 Right. 147 00:07:31,566 --> 00:07:32,533 And based on that 148 00:07:32,533 --> 00:07:35,533 and based on those values, of course, it's probably gonna be a bit more than four. 149 00:07:36,300 --> 00:07:38,233 we're constructing some sort of distribution 150 00:07:38,233 --> 00:07:42,166 or some sort of perception of the current state of things. 151 00:07:42,266 --> 00:07:44,566 And this is the part where it gets interesting. 152 00:07:44,566 --> 00:07:47,433 So the actual meaning of this, 153 00:07:47,433 --> 00:07:50,600 these distributions is not which you might think at first. 154 00:07:50,600 --> 00:07:54,500 So, these distributions 155 00:07:54,866 --> 00:07:58,933 are actually showing us or they're representing 156 00:07:59,133 --> 00:08:03,300 not the we're not trying to guess the distributions behind the machine. 157 00:08:03,300 --> 00:08:06,333 So the first thing that might come to mind is that all right. 158 00:08:06,333 --> 00:08:07,733 So we've constructed distribution. 159 00:08:07,733 --> 00:08:11,766 And so the blue distribution is our attempt 160 00:08:11,766 --> 00:08:14,866 at guessing the actual distribution behind the blue machine. 161 00:08:14,866 --> 00:08:15,233 Right. 162 00:08:15,233 --> 00:08:18,300 The distribution that this is the expected return for. 163 00:08:18,300 --> 00:08:19,600 And the green is for the green machine. 164 00:08:19,600 --> 00:08:22,266 The yellow set yellow machine will actually not. 165 00:08:22,266 --> 00:08:23,100 That's not the case. 166 00:08:23,100 --> 00:08:26,100 We are constructing something completely different, 167 00:08:26,100 --> 00:08:28,166 something completely out of this world. 168 00:08:28,166 --> 00:08:31,766 We're constructing distributions of where 169 00:08:31,766 --> 00:08:36,133 we think the actual expected value might lie. 170 00:08:37,100 --> 00:08:38,533 It's very important to understand that. 171 00:08:38,533 --> 00:08:41,633 So, we are creating, kind of a, 172 00:08:42,400 --> 00:08:44,800 if you, if you want to think of it that way, 173 00:08:44,800 --> 00:08:48,733 we're creating an auxiliary mechanism for us to solve the problem. 174 00:08:48,733 --> 00:08:53,166 So, we're not we're not trying to recreate these machines. 175 00:08:53,166 --> 00:08:57,433 We're recreating the possible way these machines could have been created, 176 00:08:57,500 --> 00:08:58,466 kind of in that sense. 177 00:08:58,466 --> 00:09:03,000 So let's, let's, just solidify that this is where 178 00:09:03,000 --> 00:09:07,266 we think that expected the actual expected values will be. 179 00:09:07,266 --> 00:09:09,133 So let's look at the blue one. For instance. 180 00:09:09,133 --> 00:09:11,133 We got four values. 181 00:09:11,133 --> 00:09:14,666 And based on those four values we've constructed this distribution 182 00:09:15,133 --> 00:09:20,300 which will show us which is showing us where is that value. 183 00:09:20,300 --> 00:09:21,133 Mu star. 184 00:09:21,133 --> 00:09:23,900 So this is mu star, the actual mu start, but we don't know it. 185 00:09:23,900 --> 00:09:24,966 So the algorithm doesn't know it. 186 00:09:24,966 --> 00:09:29,033 So it's constructed a distribution trying to guess where this value is. 187 00:09:29,033 --> 00:09:30,866 And of course can't just say it's over here. 188 00:09:30,866 --> 00:09:32,100 It's over there, it's over there. 189 00:09:32,100 --> 00:09:35,000 It's saying, okay, there's a there's a very high likelihood it's over here. 190 00:09:35,000 --> 00:09:37,133 But it also could be here, it could be here, could be here. 191 00:09:37,133 --> 00:09:39,133 And as you move away, the likelihood drops. 192 00:09:39,133 --> 00:09:42,066 But it still could be anywhere in this blue space. 193 00:09:42,066 --> 00:09:44,966 Same thing for the green. distribution. 194 00:09:44,966 --> 00:09:49,200 So based on the values that we've seen, the random values that have been selected 195 00:09:49,200 --> 00:09:50,400 in the for four rounds, 196 00:09:51,466 --> 00:09:55,866 the algorithm has created this distribution, 197 00:09:56,433 --> 00:09:59,433 which is saying that this actual, 198 00:09:59,700 --> 00:10:03,600 actual expected return from a sheet from the Green 199 00:10:03,600 --> 00:10:08,133 Machine is somewhere in this area, and it's most likely to be here, 200 00:10:08,133 --> 00:10:10,766 but it could be here, it could be here, it could be here, it could be way 201 00:10:10,766 --> 00:10:13,533 it could be. So basically, at most it's most likely to be here. 202 00:10:13,533 --> 00:10:15,400 Then it could be here, it could be here. 203 00:10:15,400 --> 00:10:19,633 And as you move away, the likelihood of it actually being there drops off. 204 00:10:19,900 --> 00:10:20,966 Same thing for the yellow. 205 00:10:20,966 --> 00:10:22,233 So it's very important to understand. 206 00:10:22,233 --> 00:10:27,666 So just to reiterate we are not trying to guess the distributions behind machines. 207 00:10:27,666 --> 00:10:28,233 Right. 208 00:10:28,233 --> 00:10:32,033 We are doing like kind of a little magic trick 209 00:10:32,033 --> 00:10:35,266 or a little hat or a hat trick is I don't know if it's called a hat trick, 210 00:10:35,266 --> 00:10:41,233 but we're trying to, do this, this create this perception of the world. 211 00:10:41,233 --> 00:10:45,233 We're trying to, mathematically 212 00:10:45,233 --> 00:10:49,566 explain what we think is actually going on or what could be going on. 213 00:10:49,900 --> 00:10:52,633 And that is, that is important. 214 00:10:52,633 --> 00:10:56,800 also important thing, because this demonstrates 215 00:10:56,800 --> 00:11:00,666 that the Thompson sampling is a probabilistic algorithm. 216 00:11:00,933 --> 00:11:04,600 The upper confidence bound was a deterministic algorithm 217 00:11:04,600 --> 00:11:06,033 where everything was strict. 218 00:11:06,033 --> 00:11:08,033 You know, everything was okay. 219 00:11:08,033 --> 00:11:08,800 So whichever 220 00:11:08,800 --> 00:11:11,800 one has a pious upper confidence bound, that's what we're going to choose 221 00:11:12,000 --> 00:11:12,566 and so on. 222 00:11:12,566 --> 00:11:17,266 But here we're creating a probabilistic, perception of the world. 223 00:11:17,266 --> 00:11:19,500 We're saying, so it's likely to be here, 224 00:11:19,500 --> 00:11:21,166 but it could be anywhere in this blue area, 225 00:11:21,166 --> 00:11:23,400 and this one could be in this green and so on. 226 00:11:23,400 --> 00:11:25,333 And you'll see exactly why we've done this. 227 00:11:25,333 --> 00:11:27,833 In the next slide, we'll understand how this works. 228 00:11:27,833 --> 00:11:29,533 And let's jump straight into that. 229 00:11:29,533 --> 00:11:32,533 Let's understand now that this is probably the hardest part 230 00:11:32,533 --> 00:11:36,100 to kind of, get your head around what we've created. 231 00:11:36,100 --> 00:11:39,933 And now that we've created it, let's see how the algorithm is going to, 232 00:11:40,500 --> 00:11:44,400 utilize this, auxiliary mechanism that we have created. 233 00:11:44,700 --> 00:11:46,366 All right. So let's have a look. 234 00:11:46,366 --> 00:11:48,333 So there are our distributions. 235 00:11:48,333 --> 00:11:49,300 That's where we think. 236 00:11:49,300 --> 00:11:53,333 So these are the actual expected returns for each of the machines. 237 00:11:53,600 --> 00:11:54,833 But the algorithm doesn't know them. 238 00:11:54,833 --> 00:11:59,233 The old algorithm has created these are these distributions where 239 00:11:59,566 --> 00:12:03,066 which allow it to kind of guess where the 240 00:12:03,066 --> 00:12:06,866 actual distribution might lie for each for each of these machines. 241 00:12:07,033 --> 00:12:09,933 So what it's going to do is it's actually going to trigger 242 00:12:09,933 --> 00:12:10,966 each of these distributions. 243 00:12:10,966 --> 00:12:14,200 So like we're in the we're in a new round. 244 00:12:14,200 --> 00:12:16,300 We have to choose a machine to use. 245 00:12:16,300 --> 00:12:19,900 So what the algorithm will do is it will go and call 246 00:12:20,600 --> 00:12:24,033 this distribution and it will pull out a value out of this distribution. 247 00:12:24,033 --> 00:12:25,533 And let's say it pulled that value. 248 00:12:25,533 --> 00:12:27,633 Then it'll pull a value out of the green distribution. 249 00:12:27,633 --> 00:12:30,400 Let's say it pulled that value out of the green distribution. 250 00:12:30,400 --> 00:12:33,200 And let's say and then pulls a value out of the yellow. 251 00:12:33,200 --> 00:12:35,566 This is that distribution of that value. 252 00:12:35,566 --> 00:12:37,033 And again it's pulling them. 253 00:12:37,033 --> 00:12:39,166 So according to the distribution. Right. 254 00:12:39,166 --> 00:12:41,133 So this is a distribution of values. 255 00:12:41,133 --> 00:12:44,133 So basically it's most likely to pull a value 256 00:12:44,300 --> 00:12:48,600 somewhere in this area then less likely in for the way you go, it's more 257 00:12:48,600 --> 00:12:52,533 or less less and less likely, but still it can happen that you can see 258 00:12:52,533 --> 00:12:55,600 that this yellow value is actually quite far from the center, 259 00:12:55,600 --> 00:12:58,600 but it's still can happen that it pulled this value out of distribution. 260 00:12:58,733 --> 00:12:59,466 And it can happen. 261 00:12:59,466 --> 00:13:01,633 To pick this green, a value out of the green distribution 262 00:13:01,633 --> 00:13:03,833 totally can happen in the long term, of course, is 263 00:13:03,833 --> 00:13:06,833 going to be picking somewhere close to the center like over the long run. 264 00:13:07,033 --> 00:13:10,700 but on a one off basis, this can totally happen. 265 00:13:11,366 --> 00:13:14,866 And so now it's picked these values and guess what that means? 266 00:13:14,866 --> 00:13:19,766 Well, what this actually means is that we have by doing that, 267 00:13:19,766 --> 00:13:23,333 we have generated our own bandit configuration. 268 00:13:23,700 --> 00:13:26,433 So we have created a 269 00:13:26,433 --> 00:13:30,366 this hypothetical or imaginary 270 00:13:30,766 --> 00:13:34,633 batch of machine or not batch 271 00:13:34,633 --> 00:13:37,766 imaginary set of machines in our own 272 00:13:37,766 --> 00:13:42,300 virtual world where we're saying, okay, so the expected, 273 00:13:42,300 --> 00:13:46,566 the actual expected, return for the blue machine is this value. 274 00:13:46,866 --> 00:13:49,733 The actual expected return for the green Machine is this value 275 00:13:49,733 --> 00:13:52,266 and the actual expected return for the yellow machine is this value. 276 00:13:52,266 --> 00:13:57,100 So we've created this, subzero upside of world 277 00:13:57,100 --> 00:14:02,800 or hypothetical virtual reality in which we have all our own three bandits. 278 00:14:02,800 --> 00:14:04,200 And how are we going to solve this problem? 279 00:14:05,200 --> 00:14:07,466 and this whole problem is very easy to solve. 280 00:14:07,466 --> 00:14:09,600 It's obvious how to solve this problem. 281 00:14:09,600 --> 00:14:11,100 You just pick this machine, right? 282 00:14:11,100 --> 00:14:15,300 Because this machine has the highest expected return out of the three. 283 00:14:15,833 --> 00:14:18,766 And obviously just going to go with this machine in the virtual world 284 00:14:18,766 --> 00:14:21,800 in the, sort of the reality. 285 00:14:22,600 --> 00:14:27,866 And, what that means is that now we translate this result 286 00:14:27,866 --> 00:14:32,066 into the actual world in the virtual, hypothetical world. 287 00:14:32,066 --> 00:14:33,266 We've selected the Green Machine. 288 00:14:33,266 --> 00:14:35,633 So in the actual world, the, 289 00:14:35,633 --> 00:14:39,066 the algorithm will also select the Green Machine and what that will do. 290 00:14:39,066 --> 00:14:42,000 So basically pull the lever for this machine right. 291 00:14:42,000 --> 00:14:46,433 And what it will do is actually it will, give us a value. 292 00:14:46,433 --> 00:14:48,366 So the machine will spit it out, spit out a value. 293 00:14:48,366 --> 00:14:52,933 But that value is going to be based on the distribution behind this machine 294 00:14:52,933 --> 00:14:55,700 where this is the actual expected value of that distribution. 295 00:14:55,700 --> 00:14:59,966 So the value is going to be somewhere here, probably close to the actual 296 00:15:00,133 --> 00:15:01,733 expected value doesn't have to be. 297 00:15:01,733 --> 00:15:03,266 Again there's a distribution behind all of this. 298 00:15:03,266 --> 00:15:03,866 It could be far away. 299 00:15:03,866 --> 00:15:06,300 It could be close. But let's say in this case it's this one. 300 00:15:06,300 --> 00:15:06,633 All right. 301 00:15:06,633 --> 00:15:10,466 So now this information this is new information to the algorithm. 302 00:15:10,600 --> 00:15:13,266 What it's going to do is it's going to say okay. 303 00:15:13,266 --> 00:15:18,633 So the I pull the green lever or the lever for the green machine I got this value. 304 00:15:18,866 --> 00:15:21,933 So now I have to adjust my perception of the world. 305 00:15:21,933 --> 00:15:24,933 Right. So I have a, prior probability. 306 00:15:24,933 --> 00:15:28,800 So these are my, well, for the green Machine, this is my prior, distribution. 307 00:15:28,800 --> 00:15:32,633 This is where the Bayesian inference comes into play, or it's already been in play. 308 00:15:32,633 --> 00:15:34,733 And this is where adding to the Bayesian inference. 309 00:15:34,733 --> 00:15:37,300 So that's our prior probability prior distribution. 310 00:15:37,300 --> 00:15:39,300 Now we've got some new information. 311 00:15:39,300 --> 00:15:40,533 This is our new information. 312 00:15:40,533 --> 00:15:43,166 We're going to add it in and see what see how that changes. 313 00:15:43,166 --> 00:15:46,733 Our perception of the world or perception of the world has changed. 314 00:15:46,733 --> 00:15:50,100 The distribution has shifted a bit and it's become, 315 00:15:50,833 --> 00:15:52,900 narrower because we have more information. 316 00:15:52,900 --> 00:15:54,600 We our sample sizes increase, of course. 317 00:15:56,300 --> 00:15:56,700 Excuse me. 318 00:15:56,700 --> 00:15:57,000 Of course. 319 00:15:57,000 --> 00:16:00,000 It's not going to increase that that much. 320 00:16:00,366 --> 00:16:00,600 Yeah. 321 00:16:00,600 --> 00:16:02,900 This is just an example to, 322 00:16:02,900 --> 00:16:05,900 demonstrate what we're talking about to get the point across. 323 00:16:06,233 --> 00:16:09,666 But that's, that's the point that, every time we add new information, 324 00:16:09,666 --> 00:16:12,666 our distribution becomes more and more refined. 325 00:16:13,000 --> 00:16:15,200 All right, so now we have a new perception in the world. 326 00:16:15,200 --> 00:16:18,433 Now what happens next is a new round, right. 327 00:16:18,433 --> 00:16:19,000 Same thing. 328 00:16:19,000 --> 00:16:22,000 We're going to do the same thing again for a new round again. 329 00:16:22,166 --> 00:16:25,233 we generate or we pick some values out of a distributions. 330 00:16:25,233 --> 00:16:26,233 There they are. 331 00:16:26,233 --> 00:16:31,366 now we've constructed a or we've generated our own bounded configuration, 332 00:16:31,366 --> 00:16:36,733 you know, virtual or virtual reality or in our hypothetical world, 333 00:16:36,900 --> 00:16:39,000 out of these three, we have to pick the best bandit, 334 00:16:39,000 --> 00:16:42,000 which is, of course, the one here, the Yellow Bandit. 335 00:16:42,100 --> 00:16:46,833 And we're going to, now pull the actual pull in the real world pool. 336 00:16:46,833 --> 00:16:50,033 That, lever of the yellow bandit algorithm is going to do that. 337 00:16:50,166 --> 00:16:53,933 That's going to trigger the distribution behind the Yellow bandit, 338 00:16:53,933 --> 00:16:56,500 and that will give us some sort of value. 339 00:16:56,500 --> 00:16:59,833 So this is the actual value that we received in the real world. 340 00:17:00,133 --> 00:17:04,800 Now we're going to incorporate that value into our perception of the world. 341 00:17:04,800 --> 00:17:08,100 And our perception of the world is going to change, is going to adjust. 342 00:17:08,100 --> 00:17:10,800 There we go. Now we're going to do that again. 343 00:17:10,800 --> 00:17:11,200 All right. 344 00:17:11,200 --> 00:17:13,200 So let's let's just do this one more time. 345 00:17:13,200 --> 00:17:16,266 Just so we practice the logic behind all of this. 346 00:17:16,533 --> 00:17:19,533 So a new round generates 347 00:17:19,533 --> 00:17:23,366 the generates the bandit configuration. 348 00:17:23,366 --> 00:17:23,600 Right. 349 00:17:23,600 --> 00:17:27,133 So this is what we're going to think that our, 350 00:17:27,133 --> 00:17:31,366 expected actual expected returns are going to be our conversion. 351 00:17:31,366 --> 00:17:33,000 This is obviously the best one. 352 00:17:33,000 --> 00:17:35,566 we're going to use a pool. 353 00:17:35,566 --> 00:17:37,433 The yellow machines are a lever. 354 00:17:37,433 --> 00:17:40,200 That's going to trigger the distribution behind. 355 00:17:40,200 --> 00:17:42,400 The yellow machine is going to spit out a value. 356 00:17:42,400 --> 00:17:44,500 In the real world, there's a value. 357 00:17:44,500 --> 00:17:47,933 And then we're going to have to adjust our perception of the world again 358 00:17:48,266 --> 00:17:53,200 to match that or to incorporate the new information and so on. 359 00:17:53,200 --> 00:17:57,000 We're going to keep doing that until, we get to a point 360 00:17:57,000 --> 00:18:00,500 where we've refined the distributions substantially. 361 00:18:00,500 --> 00:18:01,533 And the picture looks like this. 362 00:18:01,533 --> 00:18:03,866 So, they might be refined even more. 363 00:18:03,866 --> 00:18:06,333 This one might be more refined, this one might be more refined. 364 00:18:06,333 --> 00:18:08,233 But as you could see from there, 365 00:18:08,233 --> 00:18:11,666 we slowly will start generating more and more rounds 366 00:18:11,666 --> 00:18:14,766 based on the yellow machine will be triggering the yellow machine. 367 00:18:14,766 --> 00:18:17,766 So these ones might not even get that refined in the long run, 368 00:18:17,866 --> 00:18:21,033 which is totally fine, because our point is to get 369 00:18:21,033 --> 00:18:24,466 to the best machine, to find it and exploit it as much as we can. 370 00:18:25,200 --> 00:18:26,366 So there we go. 371 00:18:26,366 --> 00:18:29,833 That is pretty much how the Thompson sampling algorithm works. 372 00:18:30,466 --> 00:18:34,366 And as you can see, it is a, probabilistic algorithm. 373 00:18:34,366 --> 00:18:37,966 And every time we're generating these values, 374 00:18:38,166 --> 00:18:42,533 they are we kind of creating this hypothetical, 375 00:18:43,300 --> 00:18:46,366 setup of the bandits. 376 00:18:46,366 --> 00:18:49,600 And then we're solving that, and then we're applying the results 377 00:18:49,600 --> 00:18:50,333 to the real world. 378 00:18:50,333 --> 00:18:51,900 We're adjusting our perception of reality 379 00:18:51,900 --> 00:18:54,100 based on the new information that that generates. 380 00:18:54,100 --> 00:18:55,433 And then we're doing that again. 381 00:18:55,433 --> 00:18:59,600 So hopefully this was a, interesting and tutorial. 382 00:18:59,600 --> 00:19:01,933 I find this, algorithm very, very cool. 383 00:19:01,933 --> 00:19:05,933 And in the next tutorial we're going to compare, a little bit, 384 00:19:06,000 --> 00:19:08,966 and our confidence bound and the Thompson sampling algorithm. 385 00:19:08,966 --> 00:19:10,266 And I can't wait to see you there. 386 00:19:10,266 --> 00:19:12,133 Until next time, happy analyzing.