1
00:00:00,233 --> 00:00:02,533
Hello and welcome to this R tutorial.

2
00:00:02,533 --> 00:00:06,200
So in the following tutorials we are going
to implement the apriori algorithm.

3
00:00:06,466 --> 00:00:08,566
And as usual,
we are going to make this machine

4
00:00:08,566 --> 00:00:12,600
learning model to create some added value
in some specific business.

5
00:00:12,633 --> 00:00:15,766
And in this part
this business problem is going to be about

6
00:00:15,766 --> 00:00:20,133
optimizing the sales in a grocery store,
a grocery store in the south of France.

7
00:00:20,433 --> 00:00:23,766
And you're going to perfectly understand
how the apriori algorithm

8
00:00:23,766 --> 00:00:27,666
is going to do a perfect job
at doing this, optimizing the sales.

9
00:00:28,000 --> 00:00:32,600
Because recently a lot of stores
considerably created some added value

10
00:00:32,600 --> 00:00:36,133
thanks to machine learning
and especially association rule learning,

11
00:00:36,433 --> 00:00:39,633
by using it
to optimize the sales of their products.

12
00:00:39,833 --> 00:00:41,200
And how did they do that?

13
00:00:41,200 --> 00:00:44,266
Well,
they just used association rule learning

14
00:00:44,433 --> 00:00:48,100
to know exactly where
to place the products in the store.

15
00:00:48,333 --> 00:00:51,333
You know, for example,
I'll give you a very simple example.

16
00:00:51,600 --> 00:00:54,033
If someone buys some cereals, well,

17
00:00:54,033 --> 00:00:57,033
the same person is very likely
to buy some milk as well.

18
00:00:57,200 --> 00:01:00,066
So by placing the cereals
close to the milk,

19
00:01:00,066 --> 00:01:03,800
the store is very likely to put these
two products into the same basket,

20
00:01:04,066 --> 00:01:07,133
even if the buyer originally intended
to only buy cereals.

21
00:01:07,400 --> 00:01:09,600
Or I can give you a more general example.

22
00:01:09,600 --> 00:01:12,733
Suppose
a person wants to buy a specific product.

23
00:01:12,733 --> 00:01:16,766
Let's call it product
A, and this product A can be associated

24
00:01:16,766 --> 00:01:18,800
very well to another product B

25
00:01:18,800 --> 00:01:23,100
and the person who wants to buy
the product A might not think of this

26
00:01:23,100 --> 00:01:26,700
good association between the product A
and the product B. Well,

27
00:01:26,700 --> 00:01:29,900
if you place the product A
and the product B next to each other,

28
00:01:30,033 --> 00:01:33,166
well, the association
can suddenly pop up in the buyer's mind

29
00:01:33,466 --> 00:01:37,300
so that, you know, the buyer can tell,
hey, that's actually a good combination.

30
00:01:37,300 --> 00:01:40,300
Why don't I try these
two for my next lunch or something?

31
00:01:40,466 --> 00:01:44,700
And so again, even if the buyer
originally meant to buy only product

32
00:01:44,700 --> 00:01:48,233
A, well, due to this association
that popped up in their mind

33
00:01:48,366 --> 00:01:51,466
thanks to the placement of product
A and product B next to each other,

34
00:01:51,666 --> 00:01:56,100
well, the buyer finally buys both
product A and product B. So that's

35
00:01:56,100 --> 00:02:00,566
the idea of how we can create added value
for retail stores or grocery stores.

36
00:02:00,833 --> 00:02:04,766
And so what we'll make in this future
tutorial to optimize the sales

37
00:02:04,866 --> 00:02:08,600
can also be applied to any other store
that is selling some different products.

38
00:02:08,733 --> 00:02:10,566
You can think of an online store.

39
00:02:10,566 --> 00:02:12,233
You know these recommendations.

40
00:02:12,233 --> 00:02:14,333
People who bought this also bought that.

41
00:02:14,333 --> 00:02:17,766
Well, these recommendations are based
on association rules as well.

42
00:02:17,766 --> 00:02:21,600
But not only: they can also be the result
of recommendation systems

43
00:02:21,600 --> 00:02:25,866
like collaborative filtering, or content-based
or item-based collaborative filtering.

44
00:02:26,000 --> 00:02:29,000
But association
rule learning has a role to play.

45
00:02:29,166 --> 00:02:32,700
So now let's make our first association
rule learning algorithm,

46
00:02:32,700 --> 00:02:36,233
which is the apriori model for this
specific store in the south of France.

47
00:02:36,800 --> 00:02:38,000
So let's do it.

48
00:02:38,000 --> 00:02:41,400
As usual, we're going to set
the working directory by going to our

49
00:02:41,433 --> 00:02:45,766
new part folder, which is the folder Part
5, Association Rule Learning.

50
00:02:46,200 --> 00:02:49,033
And we are starting
with the apriori algorithm.

51
00:02:49,033 --> 00:02:51,600
So that's the folder
we want to set as working directory.

52
00:02:51,600 --> 00:02:55,000
Make sure that you have the market basket
optimization CSV file.

53
00:02:55,333 --> 00:02:58,933
And you can click on this more button here
and then set as working directory.

54
00:02:59,466 --> 00:03:01,200
All right. We're in the right folder now.

55
00:03:01,200 --> 00:03:05,500
So the first thing we're going to do
is to import the data set.

56
00:03:05,700 --> 00:03:08,200
So the data set is market
basket optimization.

57
00:03:08,200 --> 00:03:11,266
So as usual,
we are going to call it dataset.

58
00:03:11,633 --> 00:03:15,600
And then of course we're going to use
the read.csv function.

59
00:03:16,066 --> 00:03:19,666
And now we simply input
the name of the CSV file.

60
00:03:19,866 --> 00:03:22,200
So here we go to market

61
00:03:24,033 --> 00:03:26,100
basket

62
00:03:26,100 --> 00:03:29,100
optimization.csv.

63
00:03:29,700 --> 00:03:30,000
All right.

64
00:03:30,000 --> 00:03:31,633
So let's execute that.

65
00:03:31,633 --> 00:03:34,633
And let's explain
what the data set is about.

66
00:03:34,800 --> 00:03:37,066
Here we go. Dataset well imported.

67
00:03:37,066 --> 00:03:41,300
It has 7500 observations and 20 variables.

68
00:03:41,700 --> 00:03:44,700
So let's check it out. I'm clicking on dataset
here.

69
00:03:44,966 --> 00:03:45,366
All right.

70
00:03:45,366 --> 00:03:46,466
And that is the data set.

71
00:03:46,466 --> 00:03:52,033
So the first thing that we can see here
is that this line here contains

72
00:03:52,300 --> 00:03:55,300
some products as these products here.

73
00:03:55,433 --> 00:03:59,466
And of course these are not the titles
of the different columns here.

74
00:03:59,466 --> 00:04:05,733
So to improve this what we can first do
is to add this header argument here

75
00:04:06,066 --> 00:04:09,600
header equals,
and then simply FALSE, this way.

76
00:04:09,900 --> 00:04:12,633
And that tells
R that the first line of our data

77
00:04:12,633 --> 00:04:15,633
set doesn't
contain the titles of the columns.

78
00:04:15,833 --> 00:04:19,666
So let's check it out
now let's select this line and execute.

79
00:04:19,800 --> 00:04:20,900
Here we go.

80
00:04:20,900 --> 00:04:24,433
Let's close this and click again
on the data set.

81
00:04:24,766 --> 00:04:25,400
And here we go.

82
00:04:25,400 --> 00:04:28,500
We don't have any titles for the columns,
but you know, this

83
00:04:28,500 --> 00:04:32,000
first observation is no longer
seen as the titles of the columns.

84
00:04:32,233 --> 00:04:34,433
But that's a real observation itself.
Okay.

85
00:04:34,433 --> 00:04:35,766
So better.

86
00:04:35,766 --> 00:04:37,733
And now let's describe the data set.

87
00:04:37,733 --> 00:04:40,666
So as I told you,
we are making this apriori

88
00:04:40,666 --> 00:04:43,800
model for a store in the south of France.

89
00:04:44,000 --> 00:04:47,766
And so we want to find out the association
rules of the different products

90
00:04:47,766 --> 00:04:51,666
of the store, to see
how the manager of the store can optimize

91
00:04:51,666 --> 00:04:55,066
the placement of its different products
to optimize the sales.

92
00:04:55,500 --> 00:04:55,833
Okay.

93
00:04:55,833 --> 00:04:59,400
So the first thing to say
now is that this store is located

94
00:04:59,400 --> 00:05:02,366
in one of the most popular places
in the south of France.

95
00:05:02,366 --> 00:05:04,866
So a lot of people go into the store.

96
00:05:04,866 --> 00:05:07,633
And so, you know,
this place is a very convivial place,

97
00:05:07,633 --> 00:05:11,933
a very friendly place where people love
to hang out, relax, talk to each other.

98
00:05:12,266 --> 00:05:15,833
And so these people
come very often to this store because

99
00:05:15,833 --> 00:05:19,133
even if it's not to buy something, it's
at least to meet their friends.

100
00:05:19,800 --> 00:05:23,500
And therefore, the manager of the store
noticed and calculated that on average,

101
00:05:23,766 --> 00:05:27,100
each customer goes and buys
something at the store once a week.

102
00:05:27,600 --> 00:05:32,933
So this data set here contains
the 7500 transactions

103
00:05:33,133 --> 00:05:34,533
of all the different customers

104
00:05:34,533 --> 00:05:37,700
that bought a basket of products
during a whole week.

105
00:05:37,966 --> 00:05:40,833
Indeed, the manager took it
as the basis of his analysis

106
00:05:40,833 --> 00:05:42,900
because since each customer
is going on average

107
00:05:42,900 --> 00:05:46,500
once a week to the store,
then the transactions registered over

108
00:05:46,500 --> 00:05:50,233
a week are quite representative
of what customers want to buy.

109
00:05:50,666 --> 00:05:54,133
So based on all these 7500 transactions,

110
00:05:54,433 --> 00:05:58,566
our machine learning model,
our apriori model is going to learn

111
00:05:58,566 --> 00:06:02,233
the different associations
it can make to actually understand

112
00:06:02,233 --> 00:06:05,733
the rules, such
as if customers buy this set of products,

113
00:06:05,733 --> 00:06:08,733
then they're likely to buy
this other set of products.

114
00:06:08,800 --> 00:06:10,500
So that's what we want to figure out.

115
00:06:10,500 --> 00:06:13,533
And that's
what our apriori model will tell us.

116
00:06:13,900 --> 00:06:14,833
Okay.

117
00:06:14,833 --> 00:06:17,233
So each observation line here corresponds

118
00:06:17,233 --> 00:06:21,500
to a specific customer
who bought a specific basket of products.

119
00:06:21,666 --> 00:06:26,466
So for example if we look at line
two here, that corresponds to one customer

120
00:06:26,466 --> 00:06:31,300
who bought burgers, meatballs
and eggs at a specific time of this week.

121
00:06:31,700 --> 00:06:34,066
And that's the same
for all the other observations

122
00:06:34,066 --> 00:06:37,366
that correspond to other customers,
or maybe the same customer

123
00:06:37,366 --> 00:06:40,433
who went back to the store
another day or another time.

124
00:06:40,766 --> 00:06:42,733
So that's what the data set is about.

125
00:06:42,733 --> 00:06:45,900
But actually, this is not the dataset
we're going to use

126
00:06:46,033 --> 00:06:49,033
to train our apriori model.

127
00:06:49,566 --> 00:06:54,166
And the reason is that the package we're
going to use to build our apriori model,

128
00:06:54,333 --> 00:06:58,133
which is, by the way,
the arules package, doesn't take a data

129
00:06:58,133 --> 00:06:59,900
set like this as input.

130
00:06:59,900 --> 00:07:03,966
It doesn't take a CSV file
that we imported thanks to the read.csv

131
00:07:03,966 --> 00:07:05,033
function.

132
00:07:05,033 --> 00:07:08,700
What it takes as input
is called a sparse matrix.

133
00:07:09,000 --> 00:07:11,000
And so what is a sparse matrix?

134
00:07:11,000 --> 00:07:14,200
It's actually a matrix
that contains a lot of zeros.

135
00:07:14,500 --> 00:07:18,133
In machine learning you will encounter
a lot of times the word sparsity

136
00:07:18,466 --> 00:07:21,400
that corresponds
to a large number of zeros.

137
00:07:21,400 --> 00:07:26,100
So a sparse matrix is a matrix containing
very few non-zero values.

138
00:07:26,633 --> 00:07:32,066
So what we're going to do now is transform
this data set here into a sparse matrix.

139
00:07:32,066 --> 00:07:34,133
And can you guess what we're going to do.

140
00:07:34,133 --> 00:07:38,700
Well what we're going to do is take all
the different products of this data set.

141
00:07:39,266 --> 00:07:42,966
And actually I already know
that there are 120 products.

142
00:07:43,200 --> 00:07:47,966
And we're going to attribute one column
to each of these 120 products.

143
00:07:47,966 --> 00:07:50,400
So that means we'll get 120 columns.

144
00:07:52,233 --> 00:07:53,666
So for example.

145
00:07:53,666 --> 00:07:56,433
So for example
we'll have the column shrimp

146
00:07:56,433 --> 00:07:59,866
the column almonds, the column avocado,
the column vegetables

147
00:07:59,866 --> 00:08:03,033
mix, cottage cheese, energy drink, tomato juice, up

148
00:08:03,033 --> 00:08:06,133
to the 120th product that there is.

149
00:08:06,133 --> 00:08:08,900
We're going to see all the products
then on a plot.

150
00:08:08,900 --> 00:08:12,033
But in this data set
there are 120 products

151
00:08:12,033 --> 00:08:15,833
which are, by the way,
the 120 products of the store.

152
00:08:16,433 --> 00:08:19,433
So there's going to be one column
for each of these products,

153
00:08:19,800 --> 00:08:21,433
and that's going to be the columns.

154
00:08:21,433 --> 00:08:25,266
And then the lines are still going
to be the different transactions

155
00:08:25,266 --> 00:08:29,100
corresponding to each of the 7500
customers

156
00:08:29,100 --> 00:08:32,100
that bought a basket of products
during the whole week.

157
00:08:32,100 --> 00:08:35,100
But instead of having the list
of the products they bought,

158
00:08:35,233 --> 00:08:40,066
we will have in each of the 120 columns
here, a 0 or 1,

159
00:08:40,500 --> 00:08:41,733
and it's going to be a one

160
00:08:41,733 --> 00:08:45,633
if the product is in the basket
of the customer during the transaction,

161
00:08:46,033 --> 00:08:49,033
and a zero
if the product is not in the basket.

162
00:08:49,166 --> 00:08:52,400
So for example,
let's take the second customer here,

163
00:08:52,833 --> 00:08:58,200
the second customer bought a basket of three
products: burgers, meatballs and eggs.

164
00:08:58,500 --> 00:08:59,133
Okay.

165
00:08:59,133 --> 00:09:02,266
So in our sparse matrix we'll have one

166
00:09:02,266 --> 00:09:05,466
burgers column,
one meatballs column and one eggs column.

167
00:09:05,500 --> 00:09:08,266
They're not necessarily
going to be next to each other.

168
00:09:08,266 --> 00:09:11,500
You know burgers can be the fifth column
and meatballs can be

169
00:09:11,500 --> 00:09:14,633
the ninth column
and eggs can be the 12th column.

170
00:09:14,833 --> 00:09:18,200
That depends on how the arules package
is going to make the matrix.

171
00:09:18,200 --> 00:09:21,066
But we will have a column
for each of these three products.

172
00:09:21,066 --> 00:09:23,933
And so in these columns,
since the customer number two

173
00:09:23,933 --> 00:09:27,133
bought some burgers, meatballs and eggs,
there will be a one

174
00:09:27,500 --> 00:09:28,500
in each of these columns.

175
00:09:28,500 --> 00:09:31,533
There will be a one in the burgers column,
a one in the meatballs column,

176
00:09:31,766 --> 00:09:33,233
and a one in the eggs column.

177
00:09:33,233 --> 00:09:36,500
And all the rest of the columns
are going to have a zero value.

178
00:09:36,666 --> 00:09:41,233
And that's because all the other products
were not in the basket of this customer

179
00:09:41,233 --> 00:09:42,233
number two.

180
00:09:42,233 --> 00:09:47,100
So you can guess, you can imagine that
we are going to have a lot of zero values.

181
00:09:47,400 --> 00:09:49,233
And that's even more true
considering the fact

182
00:09:49,233 --> 00:09:53,100
that we have a lot of customers
that bought baskets of only one product.

183
00:09:53,100 --> 00:09:56,133
For example, this customer number ten here
bought some French fries,

184
00:09:56,400 --> 00:09:58,100
this one bought some cookies.

185
00:09:58,100 --> 00:10:00,266
This one bought some mineral water.

186
00:10:00,266 --> 00:10:03,600
So, you know, for these three customers
who bought only one product,

187
00:10:03,833 --> 00:10:05,200
we're going to have only one column

188
00:10:05,200 --> 00:10:08,333
that contains a non-zero value,
and all the other columns,

189
00:10:08,333 --> 00:10:12,233
that means all the 119 other columns,
will contain zeros.
will contain zeros.

190
00:10:12,533 --> 00:10:15,900
So you can see that we're going to have
a lot of zeros in this matrix.

191
00:10:16,200 --> 00:10:19,000
And so for those of you
who are discovering sparsity,

192
00:10:19,000 --> 00:10:21,900
I'm happy
to introduce you to sparse matrices.

193
00:10:21,900 --> 00:10:25,000
So let's build this sparse matrix
right now.

194
00:10:25,000 --> 00:10:27,233
You'll see that
it's going to be very easy.

195
00:10:27,233 --> 00:10:30,933
So let's go back to our code
and let's create this sparse matrix.

196
00:10:30,933 --> 00:10:35,933
So to create this sparse matrix
we're going to use a package of course.

197
00:10:35,933 --> 00:10:38,166
And this package is the arules package.

198
00:10:38,166 --> 00:10:42,000
So we're going to install it
and import it.

199
00:10:42,533 --> 00:10:46,700
So as usual we're going to take
the function install.packages.

200
00:10:47,066 --> 00:10:52,100
And then in parentheses we just input
the name of the package in quotes.

201
00:10:52,566 --> 00:10:54,900
And that's the arules package.

202
00:10:54,900 --> 00:10:55,900
All right.

203
00:10:55,900 --> 00:10:59,400
So let's check to see if I have it.

204
00:10:59,666 --> 00:11:01,333
Well I already know I have it.

205
00:11:01,333 --> 00:11:03,766
It's actually already here
and already imported.

206
00:11:03,766 --> 00:11:07,200
So that's the package
for which the description says that

207
00:11:07,300 --> 00:11:11,333
it's mining
association rules and frequent item sets.

208
00:11:11,700 --> 00:11:12,633
Okay.

209
00:11:12,633 --> 00:11:14,200
So mine is already installed.

210
00:11:14,200 --> 00:11:15,866
So I'm not going to execute this line.

211
00:11:15,866 --> 00:11:17,300
I'll just put it in comments.

212
00:11:17,300 --> 00:11:20,933
And so if you don't have the package
here in the packages list,

213
00:11:21,333 --> 00:11:24,566
you need to select this line and execute.

214
00:11:24,566 --> 00:11:27,566
And this will install the package
without any issue.

215
00:11:27,566 --> 00:11:31,633
And as far as I'm concerned, I'm
just going to put that in comment. Right.

216
00:11:32,400 --> 00:11:35,900
And to make sure that the arules package is
well imported,

217
00:11:36,300 --> 00:11:37,633
we need to add the line here.

218
00:11:37,633 --> 00:11:41,933
library,
and in parentheses, arules. Already there.

219
00:11:41,933 --> 00:11:42,900
Perfect.

220
00:11:42,900 --> 00:11:43,866
And that makes sure

221
00:11:43,866 --> 00:11:47,400
that if you execute the whole script,
the arules package will be imported.

222
00:11:48,066 --> 00:11:51,300
And now we're ready
to create our sparse matrix.

223
00:11:51,866 --> 00:11:55,033
So since our data set has no use here

224
00:11:55,033 --> 00:11:58,566
because we're not going to use it
to build and train our apriori model,

225
00:11:58,733 --> 00:12:01,733
we will again call our sparse matrix
dataset.

226
00:12:02,133 --> 00:12:02,733
Okay.

227
00:12:02,733 --> 00:12:07,533
And to create this sparse matrix,
it's actually almost the same as importing

228
00:12:07,533 --> 00:12:12,100
a CSV file, because instead of
writing here read.csv,

229
00:12:12,433 --> 00:12:15,933
we simply need to
type read.transactions,

230
00:12:17,233 --> 00:12:20,600
read.transactions,
and then it's the same in the parentheses.

231
00:12:20,600 --> 00:12:23,533
We need to input the name of the CSV file.

232
00:12:23,533 --> 00:12:26,866
So we'll copy that and paste it here.

233
00:12:27,800 --> 00:12:29,066
That's the first argument.

234
00:12:29,066 --> 00:12:33,300
But then we need to specify
to this function that the separator

235
00:12:33,566 --> 00:12:36,900
of our CSV file is actually comma.

236
00:12:37,166 --> 00:12:40,933
So we need to add here
sep equals, in quotes, comma.

237
00:12:40,933 --> 00:12:42,733
And why do we need to do this.

238
00:12:42,733 --> 00:12:45,200
It's because you know our CSV file.

239
00:12:45,200 --> 00:12:47,100
If you open it with a text editor

240
00:12:47,100 --> 00:12:50,100
you will see that the different products
are separated by a comma.

241
00:12:50,400 --> 00:12:54,000
And we actually didn't have to specify
that the separator was a comma here,

242
00:12:54,000 --> 00:12:57,600
because that's the default separator
of the read.csv function.

243
00:12:57,966 --> 00:13:01,900
But that's not the default separator
of the read.transactions function.

244
00:13:01,900 --> 00:13:04,366
So that's why we need to specify it here.

245
00:13:04,366 --> 00:13:06,600
So sep equals comma.

246
00:13:06,600 --> 00:13:08,700
And actually we could stop here.

247
00:13:08,700 --> 00:13:10,366
But since I promised you to give you

248
00:13:10,366 --> 00:13:14,200
real-life datasets, I added on purpose
some reality in the dataset.

249
00:13:14,200 --> 00:13:17,700
And this reality is about having
some anomalies in the data.

250
00:13:18,300 --> 00:13:21,300
And these anomalies
are actually some duplicates.

251
00:13:21,466 --> 00:13:24,833
Indeed, when this manager registered
all the different transactions,

252
00:13:25,100 --> 00:13:27,866
well, he might have been very likely
to make some human

253
00:13:27,866 --> 00:13:30,866
mistakes
and put some duplicates in the data.

254
00:13:30,866 --> 00:13:33,133
So for example,
if we go back to our data set.

255
00:13:33,133 --> 00:13:37,166
So that's the old dataset, imported
from the CSV with the read.csv function.

256
00:13:37,433 --> 00:13:42,500
And so for example, when this transaction
of the 31st customer was registered,

257
00:13:42,766 --> 00:13:47,133
one could make the mistake of putting
light cream twice here, for example.

258
00:13:47,633 --> 00:13:51,600
And to train the apriori algorithm
we need to have no duplicates.

259
00:13:51,866 --> 00:13:55,766
So there is actually a good way
to handle these duplicates.

260
00:13:55,766 --> 00:14:00,000
It's actually very simple because we just
need to add an additional argument.

261
00:14:00,400 --> 00:14:04,500
If we look at the read.transactions
function here by pressing F1,

262
00:14:04,500 --> 00:14:08,133
you can see this
rm.duplicates argument.

263
00:14:08,566 --> 00:14:11,233
And as you can see, it's a logical value

264
00:14:11,233 --> 00:14:14,666
specifying if duplicate items
should be removed from the transaction.

265
00:14:15,100 --> 00:14:17,933
So since the apriori algorithm is trained

266
00:14:17,933 --> 00:14:21,666
on transaction datasets that are supposed
to have no duplicate values,

267
00:14:21,966 --> 00:14:24,766
we need to add this argument,

268
00:14:26,533 --> 00:14:30,100
rm.duplicates, and set it to TRUE,

269
00:14:30,600 --> 00:14:34,066
and that will remove all the duplicates
in each of the transactions.

270
00:14:34,300 --> 00:14:36,766
Maybe your data sets
won't have any duplicates, but

271
00:14:36,766 --> 00:14:40,733
it's very common to have a few anomalies
in data sets, such as some duplicates.

272
00:14:41,066 --> 00:14:44,933
But here we will be fine
thanks to this rm.duplicates argument.

273
00:14:45,500 --> 00:14:49,933
All right, so we're now actually ready
to create our sparse matrix.

274
00:14:50,333 --> 00:14:51,300
So let's do this.

275
00:14:51,300 --> 00:14:54,466
Let's execute this line. And here we go.

276
00:14:54,833 --> 00:14:57,833
All right
so now the sparse matrix is created.

277
00:14:58,200 --> 00:15:01,966
Unfortunately we cannot have a look at it
because as you can see if I click

278
00:15:01,966 --> 00:15:05,900
on this dataset here, the new dataset,
the sparse matrix, is not appearing here.

279
00:15:05,900 --> 00:15:08,766
That's actually the old one.
So we can close this.

280
00:15:08,766 --> 00:15:13,333
But we can actually get some info
about this sparse matrix.

281
00:15:13,533 --> 00:15:16,733
But before getting all
this detailed information, well we can see

282
00:15:16,733 --> 00:15:20,266
that we already have some information
about the duplicates themselves.

283
00:15:20,566 --> 00:15:23,500
When you execute this line
to create your sparse matrix

284
00:15:23,500 --> 00:15:27,866
with the read.transactions function,
including the rm.duplicates argument,

285
00:15:28,100 --> 00:15:29,166
you will automatically

286
00:15:29,166 --> 00:15:32,166
have this message: distribution
of transactions with duplicates.

287
00:15:32,200 --> 00:15:34,000
And here we see that we have one five.

288
00:15:34,000 --> 00:15:37,900
That means that there are five
transactions containing one duplicate.

289
00:15:38,166 --> 00:15:42,333
And for example, if in your data set
you have some triplicate duplicates

290
00:15:42,333 --> 00:15:45,866
that appear twice in any transaction,
well,

291
00:15:45,900 --> 00:15:49,800
you will have a two here and you will have
the number of triplicates here.

292
00:15:50,000 --> 00:15:53,000
So that just gives the distribution
of transactions with duplicates.

293
00:15:53,233 --> 00:15:55,000
And anyway now they're removed.

294
00:15:55,000 --> 00:16:00,833
So we can actually get some more detailed
info about this sparse matrix.

295
00:16:00,833 --> 00:16:04,166
And to get this info, we,
as we already did many times

296
00:16:04,166 --> 00:16:07,200
before, need to use the summary function.

297
00:16:07,400 --> 00:16:08,566
So summary.

298
00:16:08,566 --> 00:16:10,966
And here we input data set.

299
00:16:10,966 --> 00:16:11,366
All right.

300
00:16:11,366 --> 00:16:14,166
And that will give us some info
about the sparse matrix.

301
00:16:14,166 --> 00:16:16,633
So let's execute
this line. And here we go.

302
00:16:17,700 --> 00:16:18,800
So what do we see here.

303
00:16:18,800 --> 00:16:22,166
First we are reminded
that this data set contains

304
00:16:22,166 --> 00:16:25,700
transactions
as itemMatrix in sparse format.

305
00:16:25,766 --> 00:16:28,566
So that exactly means
that's a sparse matrix.

306
00:16:28,566 --> 00:16:32,600
And we can see that
we have 7501 rows.

307
00:16:32,600 --> 00:16:35,200
And we have 119 columns.

308
00:16:35,200 --> 00:16:39,833
And we can see that we have
a density of 0.03 in this sparse matrix.

309
00:16:39,833 --> 00:16:40,800
And what does that mean.

310
00:16:40,800 --> 00:16:46,400
That means that the proportion of non-zero
values is 0.03.

311
00:16:46,633 --> 00:16:50,633
We have 3% non-zero
values and 97% zero values.

312
00:16:51,000 --> 00:16:53,366
Okay.
Then we have the most frequent items.

313
00:16:53,366 --> 00:16:56,366
So the item that is
the most bought is mineral water.

314
00:16:56,600 --> 00:16:59,100
Yes, it can be very hot
in the south of France.

315
00:16:59,100 --> 00:17:02,400
And it's a good French tradition
to have a bottle of water during meals.

316
00:17:03,000 --> 00:17:03,500
Okay then.

317
00:17:03,500 --> 00:17:05,066
The French love eggs very much.

318
00:17:05,066 --> 00:17:07,700
They love spaghetti,
French fries, chocolate

319
00:17:07,700 --> 00:17:09,566
and that's all the other products.

320
00:17:09,566 --> 00:17:13,200
And then we have some interesting
information about the distribution

321
00:17:13,200 --> 00:17:17,133
of the baskets
of all the 7500 transactions.

322
00:17:17,133 --> 00:17:22,166
So for example here,
this one associated to 1754,

323
00:17:22,200 --> 00:17:28,033
means that there were 1754 baskets
containing only one product.

324
00:17:28,033 --> 00:17:32,566
And then we have 1358 baskets
containing two products,

325
00:17:32,766 --> 00:17:36,366
1044 baskets
containing three products, etc.

326
00:17:37,033 --> 00:17:38,433
And we also have the quantiles

327
00:17:38,433 --> 00:17:42,033
of this distribution
with the minimum value, the maximum value.

328
00:17:42,300 --> 00:17:45,166
So the minimum value is
of course the basket of one product.

329
00:17:45,166 --> 00:17:48,933
The maximum value is a basket
of 20 products, and on average,

330
00:17:49,166 --> 00:17:53,733
people put four products in their basket
when they go to the store.

331
00:17:54,066 --> 00:17:56,400
All right.
So that's interesting information.

332
00:17:56,400 --> 00:18:00,033
But of course we will get some even more
interesting information afterwards.

333
00:18:00,266 --> 00:18:04,066
And speaking of this more interesting
information, we can already have some.

334
00:18:04,066 --> 00:18:05,233
Now it's actually

335
00:18:05,233 --> 00:18:08,833
going to be visual information
because we're going to make a frequency

336
00:18:08,833 --> 00:18:10,500
plot of the different products,

337
00:18:10,500 --> 00:18:13,766
bought by the different customers
in the store during this whole week.

338
00:18:14,266 --> 00:18:17,700
And so to get this plot very easily,
we can use one function

339
00:18:17,700 --> 00:18:22,800
of the arules package,
which is the itemFrequencyPlot function.

340
00:18:23,000 --> 00:18:24,166
And in this function

341
00:18:24,166 --> 00:18:28,133
we just need to input two arguments.
The first is going to be the dataset.

342
00:18:28,133 --> 00:18:30,900
So that's the sparse matrix.
So that's the first argument.

343
00:18:30,900 --> 00:18:33,266
And the second argument is topN.

344
00:18:33,266 --> 00:18:34,200
And that's the number

345
00:18:34,200 --> 00:18:38,033
of the most sold products
you want to have in this frequency plot.

346
00:18:38,200 --> 00:18:41,200
So for example if I put topN equals

347
00:18:41,233 --> 00:18:44,433
100 I will get the 100

348
00:18:44,433 --> 00:18:48,233
most purchased products by the French
customers in this French store.

349
00:18:48,400 --> 00:18:49,700
So let's check it out.

350
00:18:49,700 --> 00:18:52,300
I'm going to execute this line.

351
00:18:52,300 --> 00:18:53,966
Here we go. And that's the plot.

352
00:18:53,966 --> 00:18:57,900
Don't worry, I'm going to zoom in on it
so that we can see the products better.

353
00:18:58,500 --> 00:18:59,566
And here we go.

354
00:18:59,566 --> 00:19:04,100
So that's the first 100 products
most purchased by the customers.

355
00:19:04,100 --> 00:19:05,266
So that's kind of interesting.

356
00:19:05,266 --> 00:19:09,500
And if you want to have fewer products
in this plot, you can just look at the top

357
00:19:09,500 --> 00:19:13,400
ten and you'll get actually
the first ten products

358
00:19:13,400 --> 00:19:17,133
purchased by the customers, which are
of course the same first ten products.

359
00:19:17,433 --> 00:19:21,133
Okay, so this plot is actually going
to be interesting for us,

360
00:19:21,233 --> 00:19:24,900
for what's coming next,
because we will have to choose a value

361
00:19:24,900 --> 00:19:28,200
for the support
according to the apriori algorithm itself,

362
00:19:28,300 --> 00:19:32,666
and we will be able to actually
use this plot to look at different

363
00:19:32,666 --> 00:19:36,800
supports of the products,
to choose a good value for our support.

364
00:19:37,300 --> 00:19:39,833
So that's what we're going to do
in the next tutorials.

365
00:19:39,833 --> 00:19:43,966
We're going to start training
our apriori model on our data set,

366
00:19:44,200 --> 00:19:47,233
which is going to be
the sparse matrix here that we just built.

367
00:19:47,600 --> 00:19:50,566
And so I look forward
to building the apriori model with you.

368
00:19:50,566 --> 00:19:52,800
And until then enjoy machine learning.