1 00:00:00,133 --> 00:00:02,500 Hello and welcome to this R tutorial. 2 00:00:02,500 --> 00:00:05,466 So in the previous tutorial we did the data pre-processing step. 3 00:00:05,466 --> 00:00:09,366 We first imported the data set the usual way with the read.csv function, 4 00:00:09,600 --> 00:00:13,500 but then we explained that we needed to create a sparse matrix containing 5 00:00:13,500 --> 00:00:16,700 all the transactions that occurred in the store during the whole week, 6 00:00:17,000 --> 00:00:20,700 and to build the sparse matrix we used the read.transactions function, 7 00:00:20,800 --> 00:00:24,333 including the rm.duplicates argument to remove all the duplicates. 8 00:00:24,600 --> 00:00:25,800 And this sparse matrix 9 00:00:25,800 --> 00:00:29,900 is exactly what we need to train our apriori model on the data set. 10 00:00:29,900 --> 00:00:32,233 And that's what we are going to do in this tutorial. 11 00:00:32,233 --> 00:00:35,700 We're going to make the rules, if I can say that. Okay. 12 00:00:35,700 --> 00:00:40,766 So now thanks to the arules package, the training is going to be very simple 13 00:00:40,766 --> 00:00:43,800 because we're just going to use one function, which is by the way 14 00:00:43,800 --> 00:00:47,266 called the apriori function, with only two arguments. 15 00:00:47,600 --> 00:00:48,466 So let's do it. 16 00:00:48,466 --> 00:00:51,733 We're going to create a new variable and we will call it rules. 17 00:00:51,966 --> 00:00:56,400 Because this variable will contain in some way the different rules 18 00:00:56,633 --> 00:00:58,200 of our business problem. 19 00:00:58,200 --> 00:01:00,200 So rules here and then equals. 20 00:01:00,200 --> 00:01:04,533 And that's where we use the apriori function with the different arguments. 21 00:01:04,866 --> 00:01:07,600 And so we're going to input two arguments. 22 00:01:07,600 --> 00:01:10,266 The first argument is going to be our data set. 
23 00:01:10,266 --> 00:01:14,500 And the second argument is going to be the parameter argument that will contain 24 00:01:14,700 --> 00:01:18,866 the choice of a minimum support and the choice of a minimum confidence. 25 00:01:19,200 --> 00:01:20,566 So let's have a look at these arguments. 26 00:01:20,566 --> 00:01:23,100 I'm going to press F1 here. 27 00:01:23,100 --> 00:01:26,566 And here I have some info about the apriori function. 28 00:01:26,566 --> 00:01:29,066 So as you can see the first argument is data. 29 00:01:29,066 --> 00:01:30,300 So let's input it. 30 00:01:30,300 --> 00:01:33,000 Now that's the easiest argument to input. 31 00:01:33,000 --> 00:01:35,500 So data equals dataset. 32 00:01:35,500 --> 00:01:36,300 And then comma. 33 00:01:36,300 --> 00:01:38,533 And then the second argument. 34 00:01:38,533 --> 00:01:40,633 So the second argument is parameter. 35 00:01:40,633 --> 00:01:45,600 And as it's written here the parameter is an object of class APparameter. 36 00:01:45,600 --> 00:01:48,666 And so this object will contain the minimum support 37 00:01:48,666 --> 00:01:52,666 that we will set ourselves and a minimum confidence. 38 00:01:52,800 --> 00:01:57,433 And we can also specify a maximum number of items you want to have in the rules. 39 00:01:57,433 --> 00:01:59,266 So that's given with maxlen. 40 00:01:59,266 --> 00:02:02,800 And actually it's also possible to include a minlen specifying 41 00:02:02,800 --> 00:02:05,766 the minimum number of products you want to have in your rules, 42 00:02:05,766 --> 00:02:07,233 but we won't actually need that. 43 00:02:07,233 --> 00:02:11,166 And what we will need inevitably is the support and the confidence. 44 00:02:11,166 --> 00:02:14,300 So let's input that, so parameter here. 45 00:02:14,866 --> 00:02:18,766 And we need to include the support and the confidence in this parameter 46 00:02:18,766 --> 00:02:21,833 the following way: we're going to take the list function. 
47 00:02:22,366 --> 00:02:26,066 And in this list we're going to input the support and the confidence. 48 00:02:26,066 --> 00:02:28,433 So I'm going to add the two arguments here. 49 00:02:28,433 --> 00:02:31,466 And then we'll see what value we will input for these two arguments. 50 00:02:32,666 --> 00:02:33,233 All right. 51 00:02:33,233 --> 00:02:37,600 So let's take the slide of the intuition tutorial about the apriori algorithm. 52 00:02:37,900 --> 00:02:40,900 And let's see what the different steps of this algorithm are. 53 00:02:40,966 --> 00:02:45,733 So the first step as you can see is to set a minimum support and confidence. 54 00:02:45,733 --> 00:02:48,433 So that's exactly what we are about to do right now. 55 00:02:48,433 --> 00:02:51,433 We are in the first step of the apriori algorithm. 56 00:02:51,766 --> 00:02:54,900 And that consists of choosing a support and a confidence. 57 00:02:55,366 --> 00:02:59,800 So the choice of a support and confidence does not follow a general rule. 58 00:02:59,833 --> 00:03:04,633 We cannot express the support or the confidence with an explicit equation. 59 00:03:05,100 --> 00:03:07,333 It actually depends on the business problem itself. 60 00:03:07,333 --> 00:03:09,266 It actually depends on your goals. 61 00:03:09,266 --> 00:03:11,700 The goal related to your business problem. 62 00:03:11,700 --> 00:03:13,266 It also depends on your data 63 00:03:13,266 --> 00:03:16,366 set, the number of observations you have, the number of items. 64 00:03:16,566 --> 00:03:20,466 So that depends on different circumstances that don't allow us 65 00:03:20,466 --> 00:03:24,800 to make some general rule about how to calculate the support and the confidence. 66 00:03:24,800 --> 00:03:25,566 But don't worry, 67 00:03:25,566 --> 00:03:27,933 it will make a lot of sense when we explain 68 00:03:27,933 --> 00:03:30,766 how we calculate the support and the confidence here. 
69 00:03:30,766 --> 00:03:33,766 And you will be able to apply it to your business problem. 70 00:03:34,600 --> 00:03:36,733 Okay, so let's start with the support. 71 00:03:36,733 --> 00:03:39,766 The support of a set of items I is equal 72 00:03:39,766 --> 00:03:42,866 to the number of transactions containing this set of items 73 00:03:42,866 --> 00:03:45,866 I, divided by the total number of transactions. 74 00:03:46,166 --> 00:03:48,633 And the support argument that we're inputting here 75 00:03:48,633 --> 00:03:51,766 is actually the minimum support you want to have in your rules. 76 00:03:51,766 --> 00:03:55,233 That means that the items that are going to appear in your rules 77 00:03:55,233 --> 00:03:59,566 will have a higher support than this minimum support here. 78 00:03:59,800 --> 00:04:01,500 And same for the confidence. 79 00:04:01,500 --> 00:04:05,466 So what we must ask ourselves is what supports do we want to have 80 00:04:05,466 --> 00:04:09,533 for our different items in the rules so that the rules are relevant? 81 00:04:09,933 --> 00:04:12,666 Because, for example, if we go back to the plot here with 82 00:04:12,666 --> 00:04:18,000 actually the 100 observations, which is this one, if we zoom on it, well, 83 00:04:18,000 --> 00:04:22,000 we can see that we have a lot of products that are not purchased very frequently. 84 00:04:22,133 --> 00:04:25,300 And these specific products are the products with small supports 85 00:04:25,500 --> 00:04:28,166 because only a few transactions contain these products here. 86 00:04:28,166 --> 00:04:31,300 So when you divide the number of transactions containing these products 87 00:04:31,300 --> 00:04:34,800 by the total number of transactions, then you will get a small support. 
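The definition of support given above (transactions containing an item set, divided by all transactions) can be sketched in a few lines. This is only an illustration in plain Python, not the R code used in the tutorial; the `transactions` list and the `support` helper are hypothetical names invented for the example, and the data is a toy basket, not the store's data set.

```python
# Toy transactions (hypothetical data, not the tutorial's store data set).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    containing = sum(1 for t in transactions if itemset <= t)
    return containing / len(transactions)


# "bread" appears in 3 of the 4 transactions.
bread_support = support({"bread"}, transactions)
# "bread" and "milk" appear together in 2 of the 4 transactions.
bread_milk_support = support({"bread", "milk"}, transactions)
print(bread_support, bread_milk_support)
```

A rarely purchased product would show up in few transactions, so its support would be small, which is the point made about the products on the right of the plot.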
88 00:04:35,200 --> 00:04:38,233 And you know, since these products are not purchased very often, 89 00:04:38,600 --> 00:04:39,733 they're not very relevant 90 00:04:39,733 --> 00:04:43,433 for our optimization problem because, you know, we want to optimize the sales, 91 00:04:43,633 --> 00:04:46,633 but what we want to optimize overall is the revenue. 92 00:04:46,733 --> 00:04:49,900 And since the revenue is a linear combination of the different numbers 93 00:04:49,900 --> 00:04:53,600 of products where the coefficients are actually the prices of these products, 94 00:04:53,700 --> 00:04:57,666 well, in order to optimize the revenue, we would need to optimize the sales 95 00:04:57,666 --> 00:05:00,666 of these products here that are purchased very often, 96 00:05:00,900 --> 00:05:03,900 rather than these products here that are purchased less often. 97 00:05:04,066 --> 00:05:07,066 And so what we need to choose here is a support 98 00:05:07,066 --> 00:05:09,666 that will only include the products 99 00:05:09,666 --> 00:05:13,700 on the left of this vertical bar here that will correspond 100 00:05:13,700 --> 00:05:14,700 to the minimum support. 101 00:05:14,700 --> 00:05:20,866 So for example, let's say the value here on the y axis is 0.05. 102 00:05:21,033 --> 00:05:23,866 Then that means that all the products on the left of this vertical 103 00:05:23,866 --> 00:05:27,033 bar will have a support higher than 0.05. 104 00:05:27,233 --> 00:05:32,100 And so if we set a minimum support of 0.05, then the rules will only contain 105 00:05:32,100 --> 00:05:35,700 the different products on this left side of the vertical bar here. 106 00:05:36,233 --> 00:05:38,033 And so now how to choose the support. 107 00:05:38,033 --> 00:05:41,533 Well we need to look at the products that are purchased rather frequently. 108 00:05:41,933 --> 00:05:44,400 Like at least 3 or 4 times a day. 109 00:05:44,400 --> 00:05:46,866 Again that depends on your business goal. 
110 00:05:46,866 --> 00:05:51,533 But what's for sure is that if we manage to find some strong rules about items 111 00:05:51,533 --> 00:05:54,200 that are bought at least 3 or 4 times a day, 112 00:05:54,200 --> 00:05:58,100 then by associating them and placing them together, customers 113 00:05:58,100 --> 00:06:00,333 will be more likely to put them in their basket 114 00:06:00,333 --> 00:06:02,933 and therefore more of these products will be purchased 115 00:06:02,933 --> 00:06:05,000 and therefore the sales will increase. 116 00:06:05,000 --> 00:06:09,600 So that will be the starting point of how we are going to set the minimum support. 117 00:06:09,766 --> 00:06:11,533 We are going to consider the products 118 00:06:11,533 --> 00:06:14,533 that are purchased at least 3 or 4 times a day. 119 00:06:14,700 --> 00:06:16,166 And then we will look at the rules. 120 00:06:16,166 --> 00:06:16,800 And of course, 121 00:06:16,800 --> 00:06:20,900 if we're not convinced by the rules, we will change this value of the support. 122 00:06:20,900 --> 00:06:23,033 That's how we work with the apriori model. 123 00:06:23,033 --> 00:06:23,266 You know, 124 00:06:23,266 --> 00:06:26,966 we try different values of the support, different values of the confidence 125 00:06:27,133 --> 00:06:30,633 until we are satisfied with the rules and until we think it makes sense. 126 00:06:30,933 --> 00:06:34,500 And, you know, we can also try these rules within a certain period of time. 127 00:06:34,700 --> 00:06:36,900 And then we look at the impact on the revenue. 128 00:06:36,900 --> 00:06:40,166 And if we don't observe a meaningful increase in the sales revenue, 129 00:06:40,333 --> 00:06:43,900 we can later change the support and the confidence to change the rules 130 00:06:43,900 --> 00:06:45,633 and then experiment again 131 00:06:45,633 --> 00:06:48,766 until we find the strongest rules that optimize the sales. 
132 00:06:49,133 --> 00:06:51,066 So that's actually what happens in real life. 133 00:06:51,066 --> 00:06:53,066 But of course, in these tutorials, we're going to try 134 00:06:53,066 --> 00:06:56,066 with products purchased 3 or 4 times a day. 135 00:06:56,133 --> 00:06:58,400 And so we'll see what happens. Okay. 136 00:06:58,400 --> 00:07:01,233 So actually we didn't set the support yet. 137 00:07:01,233 --> 00:07:02,833 We just decided that we will 138 00:07:02,833 --> 00:07:05,833 look at the products that are purchased at least 3 or 4 times a day. 139 00:07:05,900 --> 00:07:08,366 But that will quickly lead us to the support. 140 00:07:08,366 --> 00:07:11,866 Because if a product is bought, let's say three times a day, 141 00:07:11,866 --> 00:07:16,300 that means it's purchased three times seven equals 21 times a week. 142 00:07:16,800 --> 00:07:19,466 And since the support is the number of transactions 143 00:07:19,466 --> 00:07:22,800 containing this product, over the total number of transactions, 144 00:07:23,266 --> 00:07:27,733 and since there are 7,500 transactions, then we get the minimum support 145 00:07:27,733 --> 00:07:31,966 that is equal to seven times three over 7,500. 146 00:07:32,300 --> 00:07:34,966 So let me explain that by writing this here. 147 00:07:34,966 --> 00:07:37,133 So okay, we said we considered the products 148 00:07:37,133 --> 00:07:39,000 that are purchased three times a day. 149 00:07:39,000 --> 00:07:40,500 So that's three here. 150 00:07:40,500 --> 00:07:44,366 Then since the total number of transactions were registered over a week, 151 00:07:44,600 --> 00:07:46,300 that means that if we consider the products 152 00:07:46,300 --> 00:07:49,366 that are purchased three times a day, that means that they are purchased 153 00:07:49,366 --> 00:07:52,800 on average three times seven times a week. 154 00:07:52,800 --> 00:07:55,200 So this three times seven equals 21. 
155 00:07:55,200 --> 00:07:58,200 Here is the number of transactions containing 156 00:07:58,300 --> 00:08:01,933 this product bought three times a day over the whole week. 157 00:08:02,400 --> 00:08:05,233 And now we need to divide by the total number 158 00:08:05,233 --> 00:08:08,466 of transactions to get this minimum support. 159 00:08:08,466 --> 00:08:12,966 And the total number of transactions is actually 7,500. 160 00:08:13,600 --> 00:08:17,566 And this value here that we're about to compute is nothing else 161 00:08:17,566 --> 00:08:22,600 than the support of a product that is purchased three times a day. 162 00:08:23,100 --> 00:08:23,600 And, you know, 163 00:08:23,600 --> 00:08:27,566 we want our rules to consider only the products that are at least purchased 164 00:08:27,600 --> 00:08:28,433 three times a day. 165 00:08:28,433 --> 00:08:32,333 So all the products of our rules will have a higher support 166 00:08:32,500 --> 00:08:34,900 than this support here that we're about to compute. 167 00:08:34,900 --> 00:08:36,000 So let's compute it. 168 00:08:36,000 --> 00:08:37,433 Let's find out what it is. 169 00:08:37,433 --> 00:08:40,800 And that's the value we will give to this support parameter here. 170 00:08:41,133 --> 00:08:43,633 So right now I just need to press enter. 171 00:08:43,633 --> 00:08:46,733 And that's the value 0.0028. 172 00:08:46,733 --> 00:08:49,633 We will round this to 0.003. 173 00:08:49,633 --> 00:08:52,500 And so that is the minimum support 174 00:08:52,500 --> 00:08:55,866 of the products that will be considered by our rules. 175 00:08:56,366 --> 00:09:01,500 So let's input it, 0.003. Right, okay. 176 00:09:01,500 --> 00:09:03,000 So that's it for the support. 177 00:09:03,000 --> 00:09:07,533 Now the second step of our step one is to set a minimum confidence. 
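The support arithmetic just described (three purchases a day, seven days, 7,500 transactions) can be checked with a short calculation. This is a sketch in plain Python rather than the R console used in the tutorial, and the variable names are made up for the example; only the numbers come from the tutorial.

```python
# Numbers taken from the tutorial; variable names are illustrative only.
purchases_per_day = 3       # we only care about products bought at least 3 times a day
days_in_week = 7            # the transactions were recorded over one week
total_transactions = 7500   # total number of transactions in the data set

# Transactions per week containing a product bought 3 times a day.
weekly_purchases = purchases_per_day * days_in_week

# Minimum support = transactions containing the product / all transactions.
min_support = weekly_purchases / total_transactions

# Rounded to three decimals, as done in the tutorial.
rounded_support = round(min_support, 3)

print(weekly_purchases, min_support, rounded_support)
```

Choosing 4 purchases a day instead would give 28 / 7,500, so the threshold moves with the business assumption rather than with any fixed formula.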
178 00:09:08,066 --> 00:09:10,866 So the choice of the confidence still depends 179 00:09:10,866 --> 00:09:13,933 on the business problem but mostly on your business goals. 180 00:09:14,266 --> 00:09:17,866 So what we'll do now is we're not going to compute a confidence 181 00:09:17,866 --> 00:09:19,133 like we computed the support. 182 00:09:19,133 --> 00:09:23,300 So we are going to start with the default value and then decrease 183 00:09:23,533 --> 00:09:27,600 the confidence step by step until we get some relevant rules. 184 00:09:28,033 --> 00:09:31,700 Because you know the confidence is kind of an arbitrary choice. 185 00:09:31,900 --> 00:09:32,966 We don't want to have a too 186 00:09:32,966 --> 00:09:35,533 high confidence because if we get a too high confidence, 187 00:09:35,533 --> 00:09:39,433 we will get too obvious rules, you know, rules where we don't need 188 00:09:39,500 --> 00:09:43,233 a machine learning algorithm to understand that we need to place the products 189 00:09:43,233 --> 00:09:44,133 next to each other. 190 00:09:44,133 --> 00:09:47,100 And we shouldn't have a too small confidence, 191 00:09:47,100 --> 00:09:51,333 because if we get a too small confidence, we will get some nonsense rules. 192 00:09:51,333 --> 00:09:55,166 Like, you know, if I'm buying chocolate, I want to buy shampoo. 193 00:09:55,500 --> 00:09:58,200 That's a nonsense rule that doesn't make any sense. 194 00:09:58,200 --> 00:10:02,000 And that's the kind of rule we will get if we set a too small confidence. 195 00:10:02,000 --> 00:10:06,466 So we will start with the default value, which is actually 0.8. 196 00:10:06,466 --> 00:10:08,100 I think we will have a look at it. 197 00:10:08,100 --> 00:10:12,566 We can go back to help here to look at the description. 198 00:10:12,900 --> 00:10:16,566 And if we want to have the info about these two arguments here, support 199 00:10:16,566 --> 00:10:17,633 and confidence. 
200 00:10:17,633 --> 00:10:21,900 What we need to do is click on this APparameter here, which is the class. 201 00:10:22,233 --> 00:10:25,200 And here we go, that gives you the information 202 00:10:25,200 --> 00:10:28,666 of the parameter arguments of apriori and eclat, 203 00:10:28,866 --> 00:10:31,866 the other model that we'll make after this section. 204 00:10:31,933 --> 00:10:35,033 And as you can see we get information 205 00:10:35,066 --> 00:10:38,133 about the support and confidence and other arguments. 206 00:10:38,500 --> 00:10:42,900 So these are actually the arguments that are both in apriori and eclat. 207 00:10:42,900 --> 00:10:45,933 And below you have some additional arguments 208 00:10:45,933 --> 00:10:50,700 that are only for apriori, because you'll see that the Eclat algorithm doesn't 209 00:10:50,700 --> 00:10:55,533 have a confidence in its algorithm and only considers the support. 210 00:10:55,566 --> 00:10:57,033 We will see that afterwards. 211 00:10:57,033 --> 00:11:00,466 But right now what we're interested in is actually the confidence. 212 00:11:00,900 --> 00:11:03,900 And you can see that the default value is 0.8. 213 00:11:04,566 --> 00:11:07,533 So that's what we will start with. 214 00:11:07,533 --> 00:11:10,333 I'm not saying we will get some interesting results. 215 00:11:10,333 --> 00:11:12,900 You can already imagine what we're going to get 216 00:11:12,900 --> 00:11:15,900 because 0.8 is a very high confidence. 217 00:11:15,900 --> 00:11:18,933 Try to guess what we'll get with this high confidence 0.8. 218 00:11:19,266 --> 00:11:22,333 And don't worry, we will divide it by two to try some smaller 219 00:11:22,333 --> 00:11:25,333 confidences until we get some relevant rules. 
220 00:11:25,333 --> 00:11:28,266 Okay, so that's actually ready. With the single line 221 00:11:28,266 --> 00:11:31,300 of code containing only these two parameters, the data set 222 00:11:31,666 --> 00:11:35,666 and this parameter here with the minimum support and the minimum confidence, 223 00:11:36,000 --> 00:11:40,200 we train our apriori model on our data set. 224 00:11:40,800 --> 00:11:44,066 So let's select this line and execute. 225 00:11:44,400 --> 00:11:45,433 And here we go. 226 00:11:45,433 --> 00:11:47,800 Our apriori model is created. 227 00:11:47,800 --> 00:11:50,766 And by the way the rules are also created. 228 00:11:50,766 --> 00:11:52,500 So let's have a look at the info here. 229 00:11:52,500 --> 00:11:54,900 All right. So that's the apriori model. 230 00:11:54,900 --> 00:11:59,066 And here we have the default parameters of this parameter argument here. 231 00:11:59,433 --> 00:12:02,233 So we can see that we have the minimum confidence here 0.8. 232 00:12:02,233 --> 00:12:05,233 And the minimum support 0.003. 233 00:12:05,500 --> 00:12:08,266 And we also have the minlen of the basket. 234 00:12:08,266 --> 00:12:12,133 That means that the basket that the rules will consider will contain 235 00:12:12,133 --> 00:12:13,300 at least one product. 236 00:12:13,300 --> 00:12:17,700 Well, we could have actually set two here to have at least two products in the rules. 237 00:12:17,900 --> 00:12:19,866 We'll see if we get a problem with that. 238 00:12:19,866 --> 00:12:24,000 But so far the most important arguments we need to input are the support 239 00:12:24,000 --> 00:12:25,533 and the confidence. 240 00:12:25,533 --> 00:12:26,233 Okay. 241 00:12:26,233 --> 00:12:29,066 So the algorithmic control, that's not very important for us. 242 00:12:29,066 --> 00:12:31,733 Now that's kind of a little more advanced. 243 00:12:31,733 --> 00:12:34,733 And here we get some other interesting information. 
244 00:12:34,866 --> 00:12:39,166 The most important information we need to look at here is the number of rules. 245 00:12:39,366 --> 00:12:42,200 We can actually see zero rules here. 246 00:12:42,200 --> 00:12:45,800 That means that when we trained our apriori model here, 247 00:12:46,000 --> 00:12:49,000 this model actually found zero rules. 248 00:12:49,233 --> 00:12:51,200 And can you guess why? 249 00:12:51,200 --> 00:12:54,600 Well of course it's due to the choice of our minimum confidence. 250 00:12:54,966 --> 00:12:59,466 Because by setting this minimum confidence 0.8, that means that all the rules 251 00:12:59,466 --> 00:13:04,266 made by our apriori algorithm have a confidence higher than 0.8. 252 00:13:04,266 --> 00:13:05,600 And what does that mean? 253 00:13:05,600 --> 00:13:08,366 That means that each rule should be correct 254 00:13:08,366 --> 00:13:11,366 at least on 80% of the transactions. 255 00:13:11,900 --> 00:13:13,066 So 80% is a lot. 256 00:13:13,066 --> 00:13:17,200 That means that the rule must be true at least four times out of five. 257 00:13:17,433 --> 00:13:21,933 And that's why the apriori found zero rules with a minimum confidence of 0.8, 258 00:13:22,166 --> 00:13:26,333 because no rule is true in at least four transactions out of five. 259 00:13:26,800 --> 00:13:28,233 So that's what I was telling you. 260 00:13:28,233 --> 00:13:32,066 We can start with the default value, but since we have a lot of transactions 261 00:13:32,100 --> 00:13:36,000 and a lot of products that the customers can purchase, well, 262 00:13:36,000 --> 00:13:38,366 of course we need to set a smaller confidence. 263 00:13:38,366 --> 00:13:40,633 So we will divide it by two. 264 00:13:40,633 --> 00:13:42,733 So we will try now 0.4. 265 00:13:42,733 --> 00:13:44,566 And now let's see what we get. 266 00:13:44,566 --> 00:13:46,866 So let's re-execute this line. 
267 00:13:46,866 --> 00:13:49,866 It will retrain the model on a data set 268 00:13:49,966 --> 00:13:53,200 and recreate some new rules. Here we go. 269 00:13:53,233 --> 00:13:56,100 Now we have 281 rules. 270 00:13:56,100 --> 00:13:58,366 Much better. So that's a relief. 271 00:13:58,366 --> 00:14:02,466 And now of course what we're going to do is look at the rules themselves. 272 00:14:02,466 --> 00:14:05,000 We're going to visually see what the rules are. 273 00:14:05,000 --> 00:14:06,200 So we're going to see exactly 274 00:14:06,200 --> 00:14:08,466 which products should be placed next to each other. 275 00:14:08,466 --> 00:14:11,133 And we will see the strongest association rules. 276 00:14:11,133 --> 00:14:15,133 We will see what product customers purchase if they buy another product. 277 00:14:15,433 --> 00:14:17,600 So we will see all this very explicitly. 278 00:14:17,600 --> 00:14:20,400 And that's what we are going to do in the next tutorial. 279 00:14:20,400 --> 00:14:22,800 So I look forward to discovering these rules with you. 280 00:14:22,800 --> 00:14:24,600 And until then, enjoy machine learning.
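The effect of lowering the minimum confidence from 0.8 to 0.4 can be illustrated with a small sketch. It assumes the usual definition of confidence for a rule "lhs implies rhs" (transactions containing both sides, divided by transactions containing the left-hand side), which the tutorial only describes informally. This is plain Python, not the arules call used in the tutorial; the toy `transactions` and the `count` and `confidence` helpers are hypothetical.

```python
# Toy transactions (hypothetical data, not the tutorial's store data set).
transactions = [
    {"pasta", "sauce"},
    {"pasta", "sauce", "cheese"},
    {"pasta"},
    {"pasta", "cheese"},
    {"sauce"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs, transactions):
    """Share of the transactions containing `lhs` that also contain `rhs`."""
    both = count(set(lhs) | set(rhs), transactions)
    return both / count(lhs, transactions)


# Rule: customers who buy pasta also buy sauce.
# 4 transactions contain pasta, 2 of them also contain sauce.
conf = confidence({"pasta"}, {"sauce"}, transactions)

passes_08 = conf >= 0.8  # threshold tried first in the tutorial
passes_04 = conf >= 0.4  # threshold tried second
print(conf, passes_08, passes_04)
```

In this toy case the rule is rejected at 0.8 and kept at 0.4, which mirrors, on a much smaller scale, why the tutorial goes from zero rules to 281 rules after halving the confidence.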