1 00:00:00,233 --> 00:00:02,533 Hello and welcome to this R tutorial. 2 00:00:02,533 --> 00:00:06,200 So in the following tutorials we are going to implement the apriori algorithm. 3 00:00:06,466 --> 00:00:08,566 And as usual, we are going to make this machine 4 00:00:08,566 --> 00:00:12,600 learning model to create some added value in some specific business. 5 00:00:12,633 --> 00:00:15,766 And in this part this business problem is going to be about 6 00:00:15,766 --> 00:00:20,133 optimizing the sales in a grocery store, a grocery store in the south of France. 7 00:00:20,433 --> 00:00:23,766 And you're going to perfectly understand how the apriori algorithm 8 00:00:23,766 --> 00:00:27,666 is going to do a perfect job at doing this, optimizing the sales. 9 00:00:28,000 --> 00:00:32,600 Because recently a lot of stores considerably created some added value 10 00:00:32,600 --> 00:00:36,133 thanks to machine learning and especially association rule learning, 11 00:00:36,433 --> 00:00:39,633 by using it to optimize the sales of their products. 12 00:00:39,833 --> 00:00:41,200 And how did they do that? 13 00:00:41,200 --> 00:00:44,266 Well, they just used association rule learning 14 00:00:44,433 --> 00:00:48,100 to know exactly where to place the products in the store. 15 00:00:48,333 --> 00:00:51,333 You know, for example, I'll give you a very simple example. 16 00:00:51,600 --> 00:00:54,033 If someone buys some cereals, well, 17 00:00:54,033 --> 00:00:57,033 the same person is very likely to buy some milk as well. 18 00:00:57,200 --> 00:01:00,066 So by placing the cereals close to the milk, 19 00:01:00,066 --> 00:01:03,800 the store is very likely to put these two products into the same basket, 20 00:01:04,066 --> 00:01:07,133 even if the buyer originally intended to only buy cereals. 21 00:01:07,400 --> 00:01:09,600 Or I can give you a more general example. 22 00:01:09,600 --> 00:01:12,733 Suppose a person wants to buy a specific product. 
23 00:01:12,733 --> 00:01:16,766 Let's call it product A, and this product A can be associated 24 00:01:16,766 --> 00:01:18,800 very well with another product B, 25 00:01:18,800 --> 00:01:23,100 and the person who wants to buy the product A might not think of this 26 00:01:23,100 --> 00:01:26,700 good association between the product A and the product B. Well, 27 00:01:26,700 --> 00:01:29,900 if you place the product A and the product B next to each other, 28 00:01:30,033 --> 00:01:33,166 well, the association can suddenly pop up in the buyer's mind 29 00:01:33,466 --> 00:01:37,300 so that, you know, the buyer can tell, hey, that's actually a good combination. 30 00:01:37,300 --> 00:01:40,300 Why don't I try these two for my next lunch or something? 31 00:01:40,466 --> 00:01:44,700 And so again, even if the buyer originally meant to buy only the product 32 00:01:44,700 --> 00:01:48,233 A, well, due to this association that popped up in their mind 33 00:01:48,366 --> 00:01:51,466 thanks to the placement of product A and product B next to each other, 34 00:01:51,666 --> 00:01:56,100 well, the buyer finally buys the two, product A and product B. So that's 35 00:01:56,100 --> 00:02:00,566 the idea of how we can create added value for retail stores or grocery stores. 36 00:02:00,833 --> 00:02:04,766 And so what we'll make in these next tutorials to optimize the sales 37 00:02:04,866 --> 00:02:08,600 can also be applied to any other store that is selling some different products. 38 00:02:08,733 --> 00:02:10,566 You can think of an online store. 39 00:02:10,566 --> 00:02:12,233 You know these recommendations. 40 00:02:12,233 --> 00:02:14,333 People who bought this also bought that. 41 00:02:14,333 --> 00:02:17,766 Well, these recommendations are based on association rules as well. 
42 00:02:17,766 --> 00:02:21,600 But not only: they can also be the result of recommendation systems 43 00:02:21,600 --> 00:02:25,866 like collaborative filtering, content-based or item-based collaborative filtering. 44 00:02:26,000 --> 00:02:29,000 But association rule learning has a role to play. 45 00:02:29,166 --> 00:02:32,700 So now let's make our first association rule learning algorithm, 46 00:02:32,700 --> 00:02:36,233 which is the apriori model for this specific store in the south of France. 47 00:02:36,800 --> 00:02:38,000 So let's do it. 48 00:02:38,000 --> 00:02:41,400 As usual, we're going to set the working directory by going to our 49 00:02:41,433 --> 00:02:45,766 new part folder, which is the folder Part 5, Association Rule Learning. 50 00:02:46,200 --> 00:02:49,033 And we are starting with the apriori algorithm. 51 00:02:49,033 --> 00:02:51,600 So that's the folder we want to set as working directory. 52 00:02:51,600 --> 00:02:55,000 Make sure that you have the market basket optimization CSV file. 53 00:02:55,333 --> 00:02:58,933 And you can click on this More button here and then Set As Working Directory. 54 00:02:59,466 --> 00:03:01,200 All right. We're in the right folder now. 55 00:03:01,200 --> 00:03:05,500 So the first thing we're going to do is to import the dataset. 56 00:03:05,700 --> 00:03:08,200 So the dataset is market basket optimization. 57 00:03:08,200 --> 00:03:11,266 So as usual we are going to call it dataset. 58 00:03:11,633 --> 00:03:15,600 And then of course we're going to use the read.csv function. 59 00:03:16,066 --> 00:03:19,666 And now we simply input the name of the CSV file. 60 00:03:19,866 --> 00:03:22,200 So here we go: market 61 00:03:24,033 --> 00:03:26,100 basket 62 00:03:26,100 --> 00:03:29,100 optimization dot csv. 63 00:03:29,700 --> 00:03:30,000 All right. 64 00:03:30,000 --> 00:03:31,633 So let's execute that. 65 00:03:31,633 --> 00:03:34,633 And let's explain what the dataset is about. 
66 00:03:34,800 --> 00:03:37,066 Here we go. Dataset well imported. 67 00:03:37,066 --> 00:03:41,300 It has 7500 observations and 20 variables. 68 00:03:41,700 --> 00:03:44,700 So let's check it out. I'm clicking on dataset here. 69 00:03:44,966 --> 00:03:45,366 All right. 70 00:03:45,366 --> 00:03:46,466 And that is the dataset. 71 00:03:46,466 --> 00:03:52,033 So the first thing that we can see here is that this line here contains 72 00:03:52,300 --> 00:03:55,300 some products, these products here. 73 00:03:55,433 --> 00:03:59,466 And of course these are not the titles of the different columns here. 74 00:03:59,466 --> 00:04:05,733 So to improve this, what we can first do is to add this header argument here, 75 00:04:06,066 --> 00:04:09,600 header equals, and then simply FALSE, this way. 76 00:04:09,900 --> 00:04:12,633 And that tells R that the first line of our dataset 77 00:04:12,633 --> 00:04:15,633 doesn't contain the titles of the columns. 78 00:04:15,833 --> 00:04:19,666 So let's check it out now. Let's select this line and execute. 79 00:04:19,800 --> 00:04:20,900 Here we go. 80 00:04:20,900 --> 00:04:24,433 Let's close this and click again on the dataset. 81 00:04:24,766 --> 00:04:25,400 And here we go. 82 00:04:25,400 --> 00:04:28,500 We don't have any titles for the columns, but you know, this 83 00:04:28,500 --> 00:04:32,000 first observation is no longer seen as the titles of the columns. 84 00:04:32,233 --> 00:04:34,433 It's now a real observation itself. Okay. 85 00:04:34,433 --> 00:04:35,766 So better. 86 00:04:35,766 --> 00:04:37,733 And now let's describe the dataset. 87 00:04:37,733 --> 00:04:40,666 So as I told you, we are making this apriori 88 00:04:40,666 --> 00:04:43,800 model for a store in the south of France. 
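As a side note, here is a minimal sketch of what the header argument changes. It does not use the real CSV file: the three baskets and the temporary file below are made up for illustration, and only read.csv and header = FALSE come from the tutorial itself.

```r
# Hypothetical three-line sample standing in for the real CSV file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("shrimp,almonds,avocado",
             "burgers,meatballs,eggs",
             "chutney"), csv_path)

# Default header = TRUE: the first basket is consumed as column titles.
with_header <- read.csv(csv_path)

# header = FALSE: every line of the file is kept as an observation.
dataset <- read.csv(csv_path, header = FALSE)

nrow(with_header)  # one basket fewer than the file contains
nrow(dataset)      # all baskets are kept
names(dataset)     # R falls back to generic column names
```

Shorter baskets are padded with empty cells, which is one more hint that this table layout is not the natural shape for transaction data.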
89 00:04:44,000 --> 00:04:47,766 And so we want to find out the association rules of the different products 90 00:04:47,766 --> 00:04:51,666 of the store, to see how the manager of the store can optimize 91 00:04:51,666 --> 00:04:55,066 the placement of the different products to optimize the sales. 92 00:04:55,500 --> 00:04:55,833 Okay. 93 00:04:55,833 --> 00:04:59,400 So the first thing to say now is that this store is located 94 00:04:59,400 --> 00:05:02,366 in one of the most popular places in the south of France. 95 00:05:02,366 --> 00:05:04,866 So a lot of people go into the store. 96 00:05:04,866 --> 00:05:07,633 And so, you know, this place is a very convivial place, 97 00:05:07,633 --> 00:05:11,933 a very friendly place where people love to hang out, relax, talk to each other. 98 00:05:12,266 --> 00:05:15,833 And so these people come very often to this store because 99 00:05:15,833 --> 00:05:19,133 even if it's not to buy something, it's at least to meet their friends. 100 00:05:19,800 --> 00:05:23,500 And therefore, the manager of the store noticed and calculated that on average, 101 00:05:23,766 --> 00:05:27,100 each customer goes and buys something at the store once a week. 102 00:05:27,600 --> 00:05:32,933 So this dataset here contains the 7500 transactions 103 00:05:33,133 --> 00:05:34,533 of all the different customers 104 00:05:34,533 --> 00:05:37,700 that bought a basket of products during a whole week. 105 00:05:37,966 --> 00:05:40,833 Indeed, the manager took it as the basis of his analysis 106 00:05:40,833 --> 00:05:42,900 because since each customer is going on average 107 00:05:42,900 --> 00:05:46,500 once a week to the store, then the transactions registered over 108 00:05:46,500 --> 00:05:50,233 a week are quite representative of what customers want to buy. 
109 00:05:50,666 --> 00:05:54,133 So based on all these 7500 transactions, 110 00:05:54,433 --> 00:05:58,566 our machine learning model, our apriori model is going to learn 111 00:05:58,566 --> 00:06:02,233 the different associations it can make to actually understand 112 00:06:02,233 --> 00:06:05,733 the rules, such as if customers buy this set of products, 113 00:06:05,733 --> 00:06:08,733 then they're likely to buy this other set of products. 114 00:06:08,800 --> 00:06:10,500 So that's what we want to figure out. 115 00:06:10,500 --> 00:06:13,533 And that's what our apriori model will tell us. 116 00:06:13,900 --> 00:06:14,833 Okay. 117 00:06:14,833 --> 00:06:17,233 So each observation line here corresponds 118 00:06:17,233 --> 00:06:21,500 to a specific customer who bought a specific basket of products. 119 00:06:21,666 --> 00:06:26,466 So for example if we look at line two here, that corresponds to one customer 120 00:06:26,466 --> 00:06:31,300 who bought burgers, meatballs and eggs at a specific time of this week. 121 00:06:31,700 --> 00:06:34,066 And that's the same for all the other observations 122 00:06:34,066 --> 00:06:37,366 that correspond to other customers, or maybe the same customer 123 00:06:37,366 --> 00:06:40,433 who went back to the store another day or another time. 124 00:06:40,766 --> 00:06:42,733 So that's what the dataset is about. 125 00:06:42,733 --> 00:06:45,900 But actually, this is not the dataset we're going to use 126 00:06:46,033 --> 00:06:49,033 to train our apriori model. 127 00:06:49,566 --> 00:06:54,166 And the reason is that the package we're going to use to build our apriori model, 128 00:06:54,333 --> 00:06:58,133 which is, by the way, the arules package, doesn't take a dataset 129 00:06:58,133 --> 00:06:59,900 like this as input. 130 00:06:59,900 --> 00:07:03,966 It doesn't take a CSV file that we imported thanks to the read.csv 131 00:07:03,966 --> 00:07:05,033 function. 
132 00:07:05,033 --> 00:07:08,700 What it takes as input is called a sparse matrix. 133 00:07:09,000 --> 00:07:11,000 And so what is a sparse matrix? 134 00:07:11,000 --> 00:07:14,200 It's actually a matrix that contains a lot of zeros. 135 00:07:14,500 --> 00:07:18,133 In machine learning you will often encounter the word sparsity, 136 00:07:18,466 --> 00:07:21,400 which corresponds to a large number of zeros. 137 00:07:21,400 --> 00:07:26,100 So a sparse matrix is a matrix containing very few non-zero values. 138 00:07:26,633 --> 00:07:32,066 So what we're going to do now is transform this dataset here into a sparse matrix. 139 00:07:32,066 --> 00:07:34,133 And can you guess what we're going to do? 140 00:07:34,133 --> 00:07:38,700 Well, what we're going to do is take all the different products of this dataset. 141 00:07:39,266 --> 00:07:42,966 And actually I already know that there are 120 products. 142 00:07:43,200 --> 00:07:47,966 And we're going to attribute one column to each of these 120 products. 143 00:07:47,966 --> 00:07:50,400 So that means we'll get 120 columns. 144 00:07:52,233 --> 00:07:53,666 So for example, 145 00:07:53,666 --> 00:07:56,433 we'll have the column shrimp, 146 00:07:56,433 --> 00:07:59,866 the column almonds, the column avocado, the column vegetables 147 00:07:59,866 --> 00:08:03,033 mix, cottage cheese, energy drink, tomato juice, up 148 00:08:03,033 --> 00:08:06,133 to the 120th product that there is. 149 00:08:06,133 --> 00:08:08,900 We're going to see all the products later on a plot. 150 00:08:08,900 --> 00:08:12,033 But in this dataset there are 120 products 151 00:08:12,033 --> 00:08:15,833 which are, by the way, the 120 products of the store. 152 00:08:16,433 --> 00:08:19,433 So there's going to be one column for each of these products, 153 00:08:19,800 --> 00:08:21,433 and that's going to be the columns. 
154 00:08:21,433 --> 00:08:25,266 And then the lines are still going to be the different transactions 155 00:08:25,266 --> 00:08:29,100 corresponding to each of the 7,500 customers 156 00:08:29,100 --> 00:08:32,100 that bought a basket of products during the whole week. 157 00:08:32,100 --> 00:08:35,100 But instead of having the list of the products they bought, 158 00:08:35,233 --> 00:08:40,066 we will have in each of the 120 columns here, a 0 or 1, 159 00:08:40,500 --> 00:08:41,733 and it's going to be a one 160 00:08:41,733 --> 00:08:45,633 if the product is in the basket of the customer during their transaction, 161 00:08:46,033 --> 00:08:49,033 and a zero if the product is not in the basket. 162 00:08:49,166 --> 00:08:52,400 So for example, let's take the second customer here, 163 00:08:52,833 --> 00:08:58,200 the second customer bought a basket of three products: burgers, meatballs and eggs. 164 00:08:58,500 --> 00:08:59,133 Okay. 165 00:08:59,133 --> 00:09:02,266 So in our sparse matrix we'll have one 166 00:09:02,266 --> 00:09:05,466 burgers column, one meatballs column and one eggs column. 167 00:09:05,500 --> 00:09:08,266 They're not necessarily going to be next to each other. 168 00:09:08,266 --> 00:09:11,500 You know burgers can be the fifth column and meatballs can be 169 00:09:11,500 --> 00:09:14,633 the ninth column and eggs can be the 12th column. 170 00:09:14,833 --> 00:09:18,200 That depends on how the arules package is going to make the matrix. 171 00:09:18,200 --> 00:09:21,066 But we will have a column for each of these three products. 172 00:09:21,066 --> 00:09:23,933 And so in these columns, since the customer number two 173 00:09:23,933 --> 00:09:27,133 bought some burgers, meatballs and eggs, there will be a one 174 00:09:27,500 --> 00:09:28,500 in each of these columns. 175 00:09:28,500 --> 00:09:31,533 There will be a one in the burgers column, a one in the meatballs column, 176 00:09:31,766 --> 00:09:33,233 and a one in the eggs column. 
177 00:09:33,233 --> 00:09:36,500 And all the rest of the columns are going to have a zero value. 178 00:09:36,666 --> 00:09:41,233 And that's because all the other products were not in the basket of this customer 179 00:09:41,233 --> 00:09:42,233 number two. 180 00:09:42,233 --> 00:09:47,100 So you can guess, you can imagine that we are going to have a lot of zero values. 181 00:09:47,400 --> 00:09:49,233 And that's even more true considering the fact 182 00:09:49,233 --> 00:09:53,100 that we have a lot of customers that bought baskets of only one product. 183 00:09:53,100 --> 00:09:56,133 For example, this customer number ten here bought some French fries, 184 00:09:56,400 --> 00:09:58,100 this one bought some cookies. 185 00:09:58,100 --> 00:10:00,266 This one bought some mineral water. 186 00:10:00,266 --> 00:10:03,600 So, you know, for these three customers who bought only one product, 187 00:10:03,833 --> 00:10:05,200 we're going to have only one column 188 00:10:05,200 --> 00:10:08,333 that contains a non-zero value, and all the other columns, 189 00:10:08,333 --> 00:10:12,233 that means all the other 119 columns, will contain zeros. 190 00:10:12,533 --> 00:10:15,900 So you can see that we're going to have a lot of zeros in this matrix. 191 00:10:16,200 --> 00:10:19,000 And so for those of you who are discovering sparsity, 192 00:10:19,000 --> 00:10:21,900 I'm happy to introduce you to sparse matrices. 193 00:10:21,900 --> 00:10:25,000 So let's build this sparse matrix right now. 194 00:10:25,000 --> 00:10:27,233 You'll see that it's going to be very easy. 195 00:10:27,233 --> 00:10:30,933 So let's go back to our code and let's create this sparse matrix. 196 00:10:30,933 --> 00:10:35,933 So to create this sparse matrix we're going to use a package of course. 197 00:10:35,933 --> 00:10:38,166 And this package is the arules package. 198 00:10:38,166 --> 00:10:42,000 So we're going to install it and import it. 
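Before touching any package, the 0/1 layout described above can be sketched by hand in base R. This is only an illustration under stated assumptions: the four baskets are made up, the variable names (baskets, products, onehot) are hypothetical, and the result is an ordinary dense matrix, not the compact sparse storage the package uses internally.

```r
# Hypothetical baskets, echoing the examples from the walkthrough.
baskets <- list(
  c("burgers", "meatballs", "eggs"),
  c("french fries"),
  c("cookies"),
  c("mineral water", "eggs")
)

# One column per distinct product, in alphabetical order.
products <- sort(unique(unlist(baskets)))

# One row per transaction: 1 if the product is in the basket, 0 otherwise.
onehot <- t(sapply(baskets, function(b) as.integer(products %in% b)))
colnames(onehot) <- products

print(onehot)

# Share of non-zero cells in the matrix.
density <- sum(onehot) / length(onehot)
print(density)
```

Even with four baskets and six products, most cells are zero. With thousands of baskets and more than a hundred products, the share of ones gets much smaller still.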
199 00:10:42,533 --> 00:10:46,700 So as usual we're going to take the function install.packages. 200 00:10:47,066 --> 00:10:52,100 And then in parentheses we just input the name of the package in quotes. 201 00:10:52,566 --> 00:10:54,900 And that's the arules package. 202 00:10:54,900 --> 00:10:55,900 All right. 203 00:10:55,900 --> 00:10:59,400 So let's check to see if I have it. 204 00:10:59,666 --> 00:11:01,333 Well I already know I have it. 205 00:11:01,333 --> 00:11:03,766 It's actually already here and already imported. 206 00:11:03,766 --> 00:11:07,200 So that's the package for which the description says that 207 00:11:07,300 --> 00:11:11,333 it's Mining Association Rules and Frequent Itemsets. 208 00:11:11,700 --> 00:11:12,633 Okay. 209 00:11:12,633 --> 00:11:14,200 So mine is already installed. 210 00:11:14,200 --> 00:11:15,866 So I'm not going to execute this line. 211 00:11:15,866 --> 00:11:17,300 I'll just put it in comments. 212 00:11:17,300 --> 00:11:20,933 And so if you don't have the package here in the packages list, 213 00:11:21,333 --> 00:11:24,566 you need to select this line and execute. 214 00:11:24,566 --> 00:11:27,566 And this will install the package without any issue. 215 00:11:27,566 --> 00:11:31,633 And as far as I'm concerned, I'm just going to put that in comments. Right. 216 00:11:32,400 --> 00:11:35,900 And to make sure that the arules package is well imported, 217 00:11:36,300 --> 00:11:37,633 we need to add the line here: 218 00:11:37,633 --> 00:11:41,933 library, and in parentheses, arules. Already there. 219 00:11:41,933 --> 00:11:42,900 Perfect. 220 00:11:42,900 --> 00:11:43,866 And that makes sure 221 00:11:43,866 --> 00:11:47,400 that if you execute the whole script, the arules package will be imported. 222 00:11:48,066 --> 00:11:51,300 And now we're ready to create our sparse matrix. 
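As an alternative to commenting the install line in and out, here is a small sketch that installs the package only when it is missing. The requireNamespace check and the explicit CRAN mirror URL are my own additions, not part of the tutorial's script.

```r
# Install arules only if it is not already available.
# The mirror URL is an assumption; any CRAN mirror should work.
if (!requireNamespace("arules", quietly = TRUE)) {
  install.packages("arules", repos = "https://cloud.r-project.org")
}

# Load the package so its functions are available in the script.
library(arules)
```

This way the whole script can be executed top to bottom on any machine without editing it first.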
223 00:11:51,866 --> 00:11:55,033 So since our dataset has no use here 224 00:11:55,033 --> 00:11:58,566 because we're not going to use it to build and train our apriori model, 225 00:11:58,733 --> 00:12:01,733 we will call our sparse matrix again dataset. 226 00:12:02,133 --> 00:12:02,733 Okay. 227 00:12:02,733 --> 00:12:07,533 And to create this sparse matrix, it's actually almost the same as importing 228 00:12:07,533 --> 00:12:12,100 a CSV file, because instead of writing here read.csv, 229 00:12:12,433 --> 00:12:15,933 we simply need to type read.transactions, 230 00:12:17,233 --> 00:12:20,600 read.transactions, and then it's the same: in parentheses 231 00:12:20,600 --> 00:12:23,533 we need to input the name of the CSV file. 232 00:12:23,533 --> 00:12:26,866 So we'll copy that and paste it here. 233 00:12:27,800 --> 00:12:29,066 That's the first argument. 234 00:12:29,066 --> 00:12:33,300 But then we need to specify to this function that the separator 235 00:12:33,566 --> 00:12:36,900 of our CSV file is actually a comma. 236 00:12:37,166 --> 00:12:40,933 So we need to add here sep equals, in quotes, comma. 237 00:12:40,933 --> 00:12:42,733 And why do we need to do this? 238 00:12:42,733 --> 00:12:45,200 It's because, you know, our CSV file, 239 00:12:45,200 --> 00:12:47,100 if you open it with a text editor, 240 00:12:47,100 --> 00:12:50,100 you will see that the different products are separated by a comma. 241 00:12:50,400 --> 00:12:54,000 And we actually didn't have to specify that the separator was a comma here, 242 00:12:54,000 --> 00:12:57,600 because that's the default separator of the read.csv function. 243 00:12:57,966 --> 00:13:01,900 But that's not the default separator of the read.transactions function. 244 00:13:01,900 --> 00:13:04,366 So that's why we need to specify it here. 245 00:13:04,366 --> 00:13:06,600 So sep equals comma. 246 00:13:06,600 --> 00:13:08,700 And actually we could stop here. 
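To see why the sep argument matters, here is a small sketch on a made-up two-line file (the baskets and the temporary path are assumptions, not the store's data). As far as I know from the arules documentation, the default sep = "" splits on whitespace, so a comma-separated line without spaces is read as one single item.

```r
library(arules)

# Hypothetical two-basket file, comma-separated and without spaces.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("burgers,meatballs,eggs",
             "eggs,cookies"), csv_path)

# Without sep, each whole line is treated as one item.
no_sep <- read.transactions(csv_path)

# With sep = ",", each product becomes its own item (column).
with_sep <- read.transactions(csv_path, sep = ",")

itemLabels(no_sep)
itemLabels(with_sep)
```

Comparing the two sets of item labels makes the problem visible right away, which is a quick sanity check worth doing on any new transaction file.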
247 00:13:08,700 --> 00:13:10,366 But since I promised to give you 248 00:13:10,366 --> 00:13:14,200 real life datasets, I added on purpose some reality in the dataset. 249 00:13:14,200 --> 00:13:17,700 And this reality is about having some anomalies in the data. 250 00:13:18,300 --> 00:13:21,300 And these anomalies are actually some duplicates. 251 00:13:21,466 --> 00:13:24,833 Indeed, when this manager registered all the different transactions, 252 00:13:25,100 --> 00:13:27,866 well, he was quite likely to make some human 253 00:13:27,866 --> 00:13:30,866 mistakes and put some duplicates in the data. 254 00:13:30,866 --> 00:13:33,133 So for example, if we go back to our dataset. 255 00:13:33,133 --> 00:13:37,166 So that's the whole dataset imported from the CSV with the read.csv function. 256 00:13:37,433 --> 00:13:42,500 And so for example, when this transaction of the 31st customer was registered, 257 00:13:42,766 --> 00:13:47,133 one can make the mistake of putting light cream twice here, for example. 258 00:13:47,633 --> 00:13:51,600 And to train the apriori algorithm we need to have no duplicates. 259 00:13:51,866 --> 00:13:55,766 So there is actually a good way to handle these duplicates. 260 00:13:55,766 --> 00:14:00,000 It's actually very simple because we just need to add an additional argument. 261 00:14:00,400 --> 00:14:04,500 If we look at the read.transactions function here by pressing F1, 262 00:14:04,500 --> 00:14:08,133 you can see that there is an rm.duplicates argument. 263 00:14:08,566 --> 00:14:11,233 And as you can see, it's a logical value 264 00:14:11,233 --> 00:14:14,666 specifying if duplicate items should be removed from the transactions. 265 00:14:15,100 --> 00:14:17,933 So the apriori algorithm is trained 266 00:14:17,933 --> 00:14:21,666 on transaction datasets that are supposed to have no duplicate values. 
267 00:14:21,966 --> 00:14:24,766 We need to add this argument, 268 00:14:26,533 --> 00:14:30,100 rm.duplicates, and set it to TRUE, 269 00:14:30,600 --> 00:14:34,066 and that will remove all the duplicates in each of the transactions. 270 00:14:34,300 --> 00:14:36,766 Maybe your datasets won't have any duplicates, but 271 00:14:36,766 --> 00:14:40,733 it's very common to have a few anomalies in datasets, such as some duplicates. 272 00:14:41,066 --> 00:14:44,933 But here we will be fine thanks to this rm.duplicates argument. 273 00:14:45,500 --> 00:14:49,933 All right, so we're now actually ready to create our sparse matrix. 274 00:14:50,333 --> 00:14:51,300 So let's do this. 275 00:14:51,300 --> 00:14:54,466 Let's execute this line. And here we go. 276 00:14:54,833 --> 00:14:57,833 All right, so now the sparse matrix is created. 277 00:14:58,200 --> 00:15:01,966 Unfortunately we cannot have a look at it because, as you can see, if I click 278 00:15:01,966 --> 00:15:05,900 on this dataset here, the new dataset, the sparse matrix, is not appearing here. 279 00:15:05,900 --> 00:15:08,766 That's actually the old one. So we can close this. 280 00:15:08,766 --> 00:15:13,333 But we can actually get some info about this sparse matrix. 281 00:15:13,533 --> 00:15:16,733 But before getting all this detailed information, well, we can see 282 00:15:16,733 --> 00:15:20,266 that we already have some information about the duplicates themselves. 283 00:15:20,566 --> 00:15:23,500 When you execute this line to create your sparse matrix 284 00:15:23,500 --> 00:15:27,866 with the read.transactions function, including the rm.duplicates argument, 285 00:15:28,100 --> 00:15:29,166 you will automatically 286 00:15:29,166 --> 00:15:32,166 have this message, distribution of transactions with duplicates. 287 00:15:32,200 --> 00:15:34,000 And here we see that we have 1 and 5. 288 00:15:34,000 --> 00:15:37,900 That means that there are five transactions containing one duplicate. 
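The full call described above can be sketched on a tiny made-up file. The baskets and the temporary path are assumptions for illustration; in the tutorial the first argument is the name of the real CSV file in the working directory.

```r
library(arules)

# Hypothetical sample: the second basket lists "light cream" twice.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("burgers,meatballs,eggs",
             "light cream,shallot,light cream",
             "mineral water"), csv_path)

# sep = "," because products are comma-separated;
# rm.duplicates = TRUE drops repeated items within a transaction.
dataset <- read.transactions(csv_path, sep = ",", rm.duplicates = TRUE)

dim(dataset)      # transactions x distinct products
size(dataset)     # number of items per basket after de-duplication
inspect(dataset)  # print each basket
```

Since the data frame viewer cannot open this object, inspect is a handy way to look at a few baskets directly in the console.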
289 00:15:38,166 --> 00:15:42,333 And for example, if in your dataset you have some triplicates, that is, duplicates 290 00:15:42,333 --> 00:15:45,866 that appear twice in any transaction, well, 291 00:15:45,900 --> 00:15:49,800 you will have a 2 here and you will have the number of triplicates here. 292 00:15:50,000 --> 00:15:53,000 So that just gives the distribution of transactions with duplicates. 293 00:15:53,233 --> 00:15:55,000 And anyway now they're removed. 294 00:15:55,000 --> 00:16:00,833 So we can actually get some more detailed info about this sparse matrix. 295 00:16:00,833 --> 00:16:04,166 And to get this info, we, as we already did many times 296 00:16:04,166 --> 00:16:07,200 before, need to use the summary function. 297 00:16:07,400 --> 00:16:08,566 So summary. 298 00:16:08,566 --> 00:16:10,966 And here we input dataset. 299 00:16:10,966 --> 00:16:11,366 All right. 300 00:16:11,366 --> 00:16:14,166 And that will give us some info about the sparse matrix. 301 00:16:14,166 --> 00:16:16,633 So let's execute this line. And here we go. 302 00:16:17,700 --> 00:16:18,800 So what do we see here? 303 00:16:18,800 --> 00:16:22,166 First we are reminded that this dataset contains 304 00:16:22,166 --> 00:16:25,700 transactions as itemMatrix in sparse format. 305 00:16:25,766 --> 00:16:28,566 So that exactly means that's a sparse matrix. 306 00:16:28,566 --> 00:16:32,600 And we can see that we have 7,501 rows. 307 00:16:32,600 --> 00:16:35,200 And we have 119 columns. 308 00:16:35,200 --> 00:16:39,833 And we can see that we have a density of 0.03 in this sparse matrix. 309 00:16:39,833 --> 00:16:40,800 And what does that mean? 310 00:16:40,800 --> 00:16:46,400 That means that the proportion of non-zero values is 0.03. 311 00:16:46,633 --> 00:16:50,633 We have 3% non-zero values and 97% zero values. 312 00:16:51,000 --> 00:16:53,366 Okay. Then we have the most frequent items. 313 00:16:53,366 --> 00:16:56,366 So the item that is the most bought is mineral water. 
314 00:16:56,600 --> 00:16:59,100 Yes, it can be very hot in the south of France. 315 00:16:59,100 --> 00:17:02,400 And it's a good French tradition to have a bottle of water during meals. 316 00:17:03,000 --> 00:17:03,500 Okay, then 317 00:17:03,500 --> 00:17:05,066 the French love eggs very much. 318 00:17:05,066 --> 00:17:07,700 They love spaghetti, French fries, chocolate, 319 00:17:07,700 --> 00:17:09,566 and then there are all the other products. 320 00:17:09,566 --> 00:17:13,200 And then we have some interesting information about the distribution 321 00:17:13,200 --> 00:17:17,133 of the baskets of all the 7500 transactions. 322 00:17:17,133 --> 00:17:22,166 So for example here, this 1 associated with 1754 323 00:17:22,200 --> 00:17:28,033 means that there were 1754 baskets containing only one product. 324 00:17:28,033 --> 00:17:32,566 And then we have 1358 baskets containing two products, 325 00:17:32,766 --> 00:17:36,366 1044 baskets containing three products, etc. 326 00:17:37,033 --> 00:17:38,433 And we also have the quantiles 327 00:17:38,433 --> 00:17:42,033 of this distribution with the minimum value, the maximum value. 328 00:17:42,300 --> 00:17:45,166 So the minimum value is of course the basket of one product. 329 00:17:45,166 --> 00:17:48,933 The maximum value is a basket of 20 products, and on average, 330 00:17:49,166 --> 00:17:53,733 people put four products in their basket when they go to the store. 331 00:17:54,066 --> 00:17:56,400 All right. So that's interesting information. 332 00:17:56,400 --> 00:18:00,033 But of course we will get some even more interesting information afterwards. 333 00:18:00,266 --> 00:18:04,066 And speaking of this more interesting information, we can already get some. 
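The numbers reported by summary can be recomputed by hand, which helps to check what they mean. A minimal sketch, assuming four made-up baskets converted directly from an R list rather than read from the real file (the variable names are hypothetical):

```r
library(arules)

# Hypothetical baskets; the real object comes from read.transactions.
baskets <- list(
  c("mineral water", "eggs"),
  c("eggs"),
  c("spaghetti", "mineral water", "chocolate"),
  c("mineral water")
)
dataset <- as(baskets, "transactions")

# Rows, columns, density, most frequent items, basket-size distribution.
summary(dataset)

# Density by hand: purchased items divided by all cells of the matrix.
density_by_hand <- sum(size(dataset)) / (nrow(dataset) * ncol(dataset))
print(density_by_hand)

# Basket-size distribution: how many baskets hold 1, 2, 3, ... products.
basket_sizes <- table(size(dataset))
print(basket_sizes)
```

The same arithmetic links the figures quoted above: the density multiplied by the number of columns gives the average basket size, so a density of about 0.03 over 119 columns is in line with baskets of roughly four products.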
334 00:18:04,066 --> 00:18:05,233 Now it's actually 335 00:18:05,233 --> 00:18:08,833 going to be visual information because we're going to make a frequency 336 00:18:08,833 --> 00:18:10,500 plot of the different products 337 00:18:10,500 --> 00:18:13,766 bought by the different customers in the store during this whole week. 338 00:18:14,266 --> 00:18:17,700 And so to get this plot very easily, we can use one function 339 00:18:17,700 --> 00:18:22,800 of the arules package, which is the itemFrequencyPlot function. 340 00:18:23,000 --> 00:18:24,166 And in this function 341 00:18:24,166 --> 00:18:28,133 we just need to input two arguments. The first is going to be the dataset. 342 00:18:28,133 --> 00:18:30,900 So that's the sparse matrix. So that's the first argument. 343 00:18:30,900 --> 00:18:33,266 And the second argument is topN. 344 00:18:33,266 --> 00:18:34,200 And that's the number 345 00:18:34,200 --> 00:18:38,033 of the most sold products you want to have in this frequency plot. 346 00:18:38,200 --> 00:18:41,200 So for example if I put topN equals 347 00:18:41,233 --> 00:18:44,433 100, I will get the 100 348 00:18:44,433 --> 00:18:48,233 most purchased products by the French customers in this French store. 349 00:18:48,400 --> 00:18:49,700 So let's check it out. 350 00:18:49,700 --> 00:18:52,300 I'm going to execute this line. 351 00:18:52,300 --> 00:18:53,966 Here we go. And that's the plot. 352 00:18:53,966 --> 00:18:57,900 Don't worry, I'm going to zoom in on it so that we can see the products better. 353 00:18:58,500 --> 00:18:59,566 And here we go. 354 00:18:59,566 --> 00:19:04,100 So that's the 100 products most purchased by the customers. 355 00:19:04,100 --> 00:19:05,266 So that's kind of interesting. 
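Here is a minimal sketch of the plotting call, again on four made-up baskets rather than the store's data. The itemFrequency call that follows it is my own addition: it returns the numbers behind the bars, which can be easier to read than the plot when there are many products.

```r
library(arules)

# Hypothetical baskets; the real object comes from read.transactions.
baskets <- list(
  c("mineral water", "eggs"),
  c("eggs"),
  c("spaghetti", "mineral water", "chocolate"),
  c("mineral water")
)
dataset <- as(baskets, "transactions")

# Bar plot of the topN most frequently purchased products.
itemFrequencyPlot(dataset, topN = 3)

# The same information as numbers: the share of baskets containing each item.
freq <- sort(itemFrequency(dataset), decreasing = TRUE)
print(head(freq, 3))
```

On the real dataset, topN = 100 or topN = 10 would be used in the same way; the value 3 here only reflects the tiny sample.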
356 00:19:05,266 --> 00:19:09,500 And if you want to have fewer products in this plot, you can just look at the top 357 00:19:09,500 --> 00:19:13,400 ten and you'll get actually the first ten products 358 00:19:13,400 --> 00:19:17,133 purchased by the customers, which are of course the same first ten products. 359 00:19:17,433 --> 00:19:21,133 Okay, so this plot is actually going to be interesting for us in 360 00:19:21,233 --> 00:19:24,900 what's coming next, because we will have to choose a value 361 00:19:24,900 --> 00:19:28,200 for the support according to the apriori algorithm itself, 362 00:19:28,300 --> 00:19:32,666 and we will be able to actually use this plot to look at the different 363 00:19:32,666 --> 00:19:36,800 supports of the products, to choose a good value for our support. 364 00:19:37,300 --> 00:19:39,833 So that's what we're going to do in the next tutorials. 365 00:19:39,833 --> 00:19:43,966 We're going to start training our apriori model on our dataset, 366 00:19:44,200 --> 00:19:47,233 which is going to be the sparse matrix here that we just built. 367 00:19:47,600 --> 00:19:50,566 And so I look forward to building the apriori model with you. 368 00:19:50,566 --> 00:19:52,800 And until then enjoy machine learning.