1 00:00:00,500 --> 00:00:03,066 Hello and welcome back to the course on Machine Learning. 2 00:00:03,066 --> 00:00:04,100 Today we're going to cover 3 00:00:04,100 --> 00:00:07,100 some additional comments about the Naive Bayes classifier. 4 00:00:07,233 --> 00:00:08,400 All right, so let's have a look. 5 00:00:08,400 --> 00:00:09,900 Today we're going to go through three things. 6 00:00:09,900 --> 00:00:12,933 Number one is the question: why is it called naive? 7 00:00:13,433 --> 00:00:17,200 Number two is p of x and how we can sometimes drop it. 8 00:00:17,700 --> 00:00:19,466 I'll show you a quick shortcut. 9 00:00:19,466 --> 00:00:20,900 And number three is what happens 10 00:00:20,900 --> 00:00:23,966 when there are more than two classes involved in your data set. 11 00:00:24,466 --> 00:00:26,400 All right, so let's get started. 12 00:00:26,400 --> 00:00:31,266 Question: why is this algorithm called the Naive Bayes algorithm? 13 00:00:31,733 --> 00:00:33,300 Well, the answer is pretty simple. 14 00:00:33,300 --> 00:00:37,166 The answer is because the Bayes theorem, as we apply it, requires 15 00:00:37,166 --> 00:00:41,333 some independence assumptions, and the Bayes theorem is the foundation 16 00:00:41,333 --> 00:00:46,466 of the Naive Bayes machine learning algorithm, and therefore the Naive Bayes 17 00:00:46,700 --> 00:00:47,900 machine learning algorithm 18 00:00:47,900 --> 00:00:52,300 also relies on these assumptions, which are oftentimes not correct, 19 00:00:52,800 --> 00:00:56,533 and therefore it's kind of naive to assume that they're going to be correct. 20 00:00:56,800 --> 00:00:57,900 That's the reason why it's called naive. 21 00:00:57,900 --> 00:01:00,900 Let's go back to our example and see what that means. 22 00:01:01,066 --> 00:01:03,333 So here we've got age and salary, right.
23 00:01:03,333 --> 00:01:04,333 And based on 24 00:01:04,333 --> 00:01:08,166 those we're using the Naive Bayes algorithm to classify our data points 25 00:01:08,166 --> 00:01:11,333 into people who drive to work or people who walk to work. 26 00:01:11,933 --> 00:01:15,700 Well, the Bayes theorem, the way we apply it, actually requires 27 00:01:15,700 --> 00:01:19,533 that age and salary have to be independent, or the variables 28 00:01:19,533 --> 00:01:22,633 that we're working with, in this case age and salary, have to be independent. 29 00:01:23,200 --> 00:01:26,566 And that is like a fundamental assumption of the way we apply the Bayes theorem. 30 00:01:26,566 --> 00:01:27,933 Only then can you apply it. 31 00:01:27,933 --> 00:01:30,833 And then you can get those probabilities and so on. 32 00:01:30,833 --> 00:01:33,866 But in our case, if you just think about it 33 00:01:33,866 --> 00:01:37,700 fundamentally, it is probably not the case. 34 00:01:37,700 --> 00:01:41,100 There is probably some sort of correlation between age and salary, 35 00:01:41,100 --> 00:01:45,200 because as a person gets older, their experience grows, 36 00:01:45,200 --> 00:01:48,366 or the number of years that they've spent in the workforce grows, 37 00:01:48,533 --> 00:01:50,166 and therefore their salary grows. 38 00:01:50,166 --> 00:01:52,133 So it's natural for the salary to grow 39 00:01:52,133 --> 00:01:55,333 with age. It might not be a super strong correlation, and it's 40 00:01:55,333 --> 00:01:56,533 not for everybody. 41 00:01:56,533 --> 00:01:59,033 But overall there is some sort of correlation. 42 00:01:59,033 --> 00:02:01,133 So they are not absolutely independent variables. 43 00:02:01,133 --> 00:02:02,766 And you can even see that from the chart. 44 00:02:02,766 --> 00:02:04,466 You can, just by looking at our chart, 45 00:02:04,466 --> 00:02:07,800 see that there's some sort of correlation between the two variables.
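As a sketch in standard notation, the independence assumption discussed here is usually written as a statement about the features given the class. The symbol $y$ for the class label (walks or drives) is an assumption of this note and is not used in the lecture itself:

```latex
P(\text{age}, \text{salary} \mid y) = P(\text{age} \mid y)\, P(\text{salary} \mid y),
\qquad y \in \{\text{walks}, \text{drives}\}
```

When age and salary are correlated within a class, this equality holds only approximately, which is the "naive" part.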
46 00:02:08,366 --> 00:02:11,133 And therefore, given that they're not independent, 47 00:02:11,133 --> 00:02:14,166 you can't really apply the Bayes theorem in this way, and therefore you can't 48 00:02:14,166 --> 00:02:17,333 apply the Bayes algorithm to machine learning. 49 00:02:17,333 --> 00:02:21,400 And that's why it's called the Naive Bayes algorithm, because oftentimes it's applied 50 00:02:21,633 --> 00:02:24,500 even though the variables or the features 51 00:02:24,500 --> 00:02:27,766 are not independent or not completely independent, 52 00:02:28,200 --> 00:02:30,900 and it's still applied and it still gives good results. 53 00:02:30,900 --> 00:02:34,100 And that's why it's called naive, because it's a naive assumption. 54 00:02:34,766 --> 00:02:35,566 All right. 55 00:02:35,566 --> 00:02:37,300 Number two: p of x. 56 00:02:37,300 --> 00:02:40,300 So let's have a look at what we performed. 57 00:02:40,366 --> 00:02:43,300 So let's kind of rewind and now analyze 58 00:02:43,300 --> 00:02:46,333 what we did in our steps in the previous tutorial. 59 00:02:46,333 --> 00:02:50,400 So in step two what we did is we took p of x. 60 00:02:50,400 --> 00:02:55,300 So when we were calculating p of x we drew a circle around our new data point. 61 00:02:55,300 --> 00:02:58,200 We removed the data point just so it's not in the way. 62 00:02:58,200 --> 00:02:59,966 And then we shaded the area. 63 00:02:59,966 --> 00:03:01,800 And so what is p of x? 64 00:03:01,800 --> 00:03:06,033 Well, p of x is the likelihood that a randomly selected 65 00:03:06,033 --> 00:03:09,900 point from this data set will exhibit features 66 00:03:09,966 --> 00:03:13,266 similar to the data point that we were about to add. 67 00:03:13,600 --> 00:03:14,933 And as we agreed, 68 00:03:14,933 --> 00:03:18,833 anything in that circle is deemed to be similar to our data point.
69 00:03:19,166 --> 00:03:23,300 Another way to think about it is: what happens if I throw a random 70 00:03:23,300 --> 00:03:27,233 variable or a random data point into this data set right now? 71 00:03:27,666 --> 00:03:30,800 What is the likelihood that it will fall into the circle? 72 00:03:30,800 --> 00:03:34,400 What is the likelihood that it will exhibit features similar 73 00:03:34,400 --> 00:03:39,133 to the features of the point that I'm about to add into the data set? 74 00:03:39,633 --> 00:03:41,933 So basically, that it falls into that circle. 75 00:03:41,933 --> 00:03:46,200 And so p of x is the number of similar observations, where similar observations 76 00:03:46,200 --> 00:03:49,866 means observations similar to the point that we're about to add, 77 00:03:50,266 --> 00:03:53,933 divided by the total number of observations, which is 30, okay. 78 00:03:54,300 --> 00:03:55,566 So it's four. 79 00:03:55,566 --> 00:03:59,166 We can see that there are four points in this circle right now, divided by 30. 80 00:03:59,833 --> 00:04:05,400 And the interesting thing here is that this result is the same both times. 81 00:04:05,400 --> 00:04:06,800 So this is in step two. 82 00:04:06,800 --> 00:04:11,233 In step one, where we were calculating for the people who walk to work, 83 00:04:11,233 --> 00:04:14,466 so the probability for people who walk to work, it was the same. 84 00:04:14,466 --> 00:04:17,866 So basically p of x doesn't change whether you're calculating it 85 00:04:18,300 --> 00:04:21,266 in step one, where we were calculating the probability 86 00:04:21,266 --> 00:04:24,533 that the person with these features is a person who walks to work, 87 00:04:24,533 --> 00:04:28,033 or if you're calculating in the step two scenario, where you're calculating 88 00:04:28,033 --> 00:04:31,133 if that person with these features is a person who drives to work, 89 00:04:31,800 --> 00:04:33,833 and therefore it's the same both times.
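The p of x calculation described above can be sketched in a few lines of Python. The counts (4 points in the circle, 30 points in total) are the ones quoted in the lecture; the variable names are only illustrative.

```python
from fractions import Fraction

# Counts quoted in the lecture: 4 points inside the circle, 30 points in total.
similar_observations = 4
total_observations = 30

# P(X): the chance that a randomly chosen point falls inside the circle.
# It does not depend on the class, so it is the same in step one and step two.
p_x = Fraction(similar_observations, total_observations)

print(p_x)  # 2/15
```

Using `Fraction` keeps the result exact, so 4/30 shows up in its reduced form 2/15 rather than as a rounded decimal.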
90 00:04:33,833 --> 00:04:35,533 So what does that mean? Well, let's have a look at the formula. 91 00:04:35,533 --> 00:04:39,366 So you can see that the formula in step one was the probability 92 00:04:39,366 --> 00:04:42,366 that a person who exhibits features x walks, 93 00:04:42,366 --> 00:04:44,566 that he or she walks to work. 94 00:04:44,566 --> 00:04:45,933 And that was the formula. 95 00:04:45,933 --> 00:04:48,000 And see, p of x is at the bottom. 96 00:04:48,000 --> 00:04:50,433 And for step two it was the probability of a person 97 00:04:50,433 --> 00:04:54,900 who exhibits features x being a person who drives to work. 98 00:04:55,066 --> 00:04:57,100 And as you can see, p of x is at the bottom here. 99 00:04:57,100 --> 00:05:00,400 So what did we do in step three? Right. 100 00:05:00,400 --> 00:05:02,100 So let's move on to step three from here. 101 00:05:02,100 --> 00:05:03,633 In step three we compared the two. 102 00:05:03,633 --> 00:05:08,000 So now if we take these two formulas, these right sides of the formulas, 103 00:05:08,366 --> 00:05:10,400 and put them into the comparison 104 00:05:10,400 --> 00:05:13,533 like that, you will see that at the bottom the denominator is the same. 105 00:05:13,900 --> 00:05:16,566 Now, we know that the denominator is not zero 106 00:05:16,566 --> 00:05:20,400 and it's actually greater than zero; a probability is never less than zero, 107 00:05:20,400 --> 00:05:23,066 and we know it's not zero. So we can just get rid of it, right? 108 00:05:23,066 --> 00:05:27,900 We can multiply both sides by p of x and therefore the sign won't change. 109 00:05:27,900 --> 00:05:30,400 And also we'll get rid of the denominator. 110 00:05:30,400 --> 00:05:33,866 And that way we won't actually have to perform that calculation. 111 00:05:33,866 --> 00:05:35,566 So that's one less calculation to perform. 112 00:05:35,566 --> 00:05:39,533 So you can just compare the top parts of these calculations.
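A minimal sketch of this shortcut in Python follows. The class sizes (10 walkers, 20 drivers) and the split of the four circle points (3 walkers, 1 driver) are assumptions chosen only so that the numbers agree with the 4/30 and 75% / 25% figures mentioned in the lecture; they are not stated in this section.

```python
from fractions import Fraction

# Hypothetical counts, chosen to match the 4/30 and 75% / 25% figures in the lecture.
n_total = 30
n_walks, n_drives = 10, 20                      # assumed class sizes
n_walks_in_circle, n_drives_in_circle = 3, 1    # assumed points inside the circle


def numerator(n_class, n_class_in_circle):
    """P(X | class) * P(class): the top part of the Bayes formula."""
    prior = Fraction(n_class, n_total)
    likelihood = Fraction(n_class_in_circle, n_class)
    return likelihood * prior


# The shared denominator P(X): all points in the circle over all points.
p_x = Fraction(n_walks_in_circle + n_drives_in_circle, n_total)

top_walks = numerator(n_walks, n_walks_in_circle)
top_drives = numerator(n_drives, n_drives_in_circle)

# Dividing both sides by the same positive P(X) cannot flip the comparison.
assert (top_walks > top_drives) == (top_walks / p_x > top_drives / p_x)

print(top_walks, top_drives)              # 1/10 1/30
print(top_walks / p_x, top_drives / p_x)  # 3/4 1/4
```

The first printed pair is enough to pick the class; only the second pair, obtained by dividing by p of x, gives the actual probabilities.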
113 00:05:39,833 --> 00:05:42,433 And so that is what is done a lot of the time. 114 00:05:42,433 --> 00:05:46,200 So if you've done other courses on machine learning or if you've read some 115 00:05:46,200 --> 00:05:50,500 articles on machine learning, you will find that this is oftentimes 116 00:05:50,500 --> 00:05:50,933 the case. 117 00:05:50,933 --> 00:05:55,266 And also sometimes it is not mentioned that this is happening. 118 00:05:55,266 --> 00:06:00,000 So sometimes it's assumed, or it can be assumed, that you know what's going on. 119 00:06:00,000 --> 00:06:02,366 So just be careful and look out for that. 120 00:06:02,366 --> 00:06:04,500 And it's a valid approach, as we discussed. 121 00:06:04,500 --> 00:06:06,400 It's totally valid to do it that way. 122 00:06:06,400 --> 00:06:09,600 But that is only if you're comparing the two, right. 123 00:06:09,600 --> 00:06:12,233 If you're only comparing the two, then you can do that. 124 00:06:12,233 --> 00:06:14,500 If you actually want to calculate the value, 125 00:06:14,500 --> 00:06:16,800 so we said 75%, 25%, 126 00:06:16,800 --> 00:06:20,366 if you want to calculate the value, you can't do that, because it'll be two 127 00:06:20,566 --> 00:06:24,566 different realities, because you're supposed to divide by a certain value. 128 00:06:24,566 --> 00:06:26,000 Without that it's not the actual value. 129 00:06:26,000 --> 00:06:30,766 And moreover, comparing is okay, 130 00:06:30,766 --> 00:06:35,066 but say you're calculating the actual value and maybe performing some operations, 131 00:06:35,066 --> 00:06:38,833 or, you know, you calculate the value from this and you want to compare the value 132 00:06:38,833 --> 00:06:42,533 from this scenario to a value from another problem, right? 133 00:06:42,533 --> 00:06:43,566 Like a value 134 00:06:43,566 --> 00:06:48,366 from a different kind of scenario where p of x will be different, right?
135 00:06:48,400 --> 00:06:53,633 Not from this particular example that you're working with. 136 00:06:53,633 --> 00:06:56,033 Say you want to compare the probability of the person being a person 137 00:06:56,033 --> 00:07:00,100 who walks to work in this example to the probability of that person 138 00:07:00,300 --> 00:07:03,600 being a person who walks to work from a different example, right, 139 00:07:03,600 --> 00:07:04,966 where p of x will be different. 140 00:07:04,966 --> 00:07:08,133 If you want to compare across like that, that will also not work, 141 00:07:08,133 --> 00:07:10,766 because p of x is different, so be careful of that. 142 00:07:10,766 --> 00:07:14,166 The safer way is always just to perform the full 143 00:07:14,166 --> 00:07:15,200 calculation. 144 00:07:15,200 --> 00:07:18,600 But if you're doing it often, or if you just want to save some time, 145 00:07:18,933 --> 00:07:20,100 or if you're just maybe 146 00:07:20,100 --> 00:07:23,800 reading other literature, then it's also good to know about this approach 147 00:07:23,800 --> 00:07:28,200 where the denominator can be dropped when just comparing between the two. 148 00:07:28,833 --> 00:07:29,133 All right. 149 00:07:29,133 --> 00:07:33,700 So that was another point, kind of a hint there or maybe a shortcut. 150 00:07:34,266 --> 00:07:38,566 And so the last point for today: what happens when we have more than two 151 00:07:38,566 --> 00:07:39,600 classes? 152 00:07:39,600 --> 00:07:42,800 So as you remember, in this scenario we only had two classes, the red and green, 153 00:07:42,800 --> 00:07:45,800 or the people who walk to work and the people who drive to work. 154 00:07:46,100 --> 00:07:48,200 What happens when you have more classes? 155 00:07:48,200 --> 00:07:50,666 How is the challenge different? 156 00:07:50,666 --> 00:07:54,400 Well, when we had only two classes, we compared, as you remember, 157 00:07:54,400 --> 00:07:56,066 we compared the two.
158 00:07:56,066 --> 00:07:59,933 The probability that a person who exhibits features X walks to work, 159 00:07:59,933 --> 00:08:01,966 so basically that new data point that we added, 160 00:08:01,966 --> 00:08:04,200 what is the probability that it's a person who walks to work, 161 00:08:04,200 --> 00:08:06,266 versus what's the probability that that new data point 162 00:08:06,266 --> 00:08:10,200 is a person who drives to work. And it turned out that we were comparing 163 00:08:10,433 --> 00:08:15,400 75% versus 25%, and 75% was greater than 25%. 164 00:08:15,400 --> 00:08:18,633 And therefore the probability of that person 165 00:08:18,633 --> 00:08:19,033 being a person 166 00:08:19,033 --> 00:08:23,100 who walks to work was greater than that person being a person who drives to work, 167 00:08:23,100 --> 00:08:26,800 and therefore we decided to classify them as a person who walks to work. 168 00:08:27,166 --> 00:08:28,966 And it's very straightforward. 169 00:08:28,966 --> 00:08:32,766 And moreover, you'll find that every time when you just have two classes, 170 00:08:33,133 --> 00:08:34,600 they will always add up to one. 171 00:08:34,600 --> 00:08:37,600 So we didn't even have to calculate the second one. 172 00:08:37,600 --> 00:08:39,333 We could have just stopped at this one, 173 00:08:39,333 --> 00:08:42,933 because if this is 75%, this one is automatically 25%. 174 00:08:43,200 --> 00:08:44,266 It's always going to be like that 175 00:08:44,266 --> 00:08:45,933 if you have two classes. 176 00:08:45,933 --> 00:08:49,100 The way it changes if you have three classes is it gets more interesting. 177 00:08:49,100 --> 00:08:51,200 Right? So you've calculated one, and 178 00:08:51,200 --> 00:08:52,766 then there are still two more left. 179 00:08:52,766 --> 00:08:56,800 So if you're working with just two classes, 180 00:08:56,800 --> 00:08:58,366 as soon as you calculate one you're done.
181 00:08:58,366 --> 00:09:01,366 You can decide right away: if it's greater than 50%, 182 00:09:01,366 --> 00:09:03,466 then you assign that class. 183 00:09:03,466 --> 00:09:05,333 If it's less than 50%, you assign the other class. 184 00:09:05,333 --> 00:09:08,366 If it's equal to 50%, then you've got like a tie. 185 00:09:09,000 --> 00:09:12,266 Whereas if you have three or more classes, then just calculating 186 00:09:12,266 --> 00:09:15,066 the one won't be enough, because then you have two other ones 187 00:09:15,066 --> 00:09:17,633 and you would still have to calculate at least another one. 188 00:09:17,633 --> 00:09:21,300 So it just means that it's more of an interesting 189 00:09:21,500 --> 00:09:24,866 selection problem when you have more classes. 190 00:09:24,866 --> 00:09:29,266 That's pretty much the main thing that changes when you have more classes. 191 00:09:29,900 --> 00:09:31,066 And there we go. 192 00:09:31,066 --> 00:09:33,233 That's all for today. 193 00:09:33,233 --> 00:09:37,133 I hope you enjoyed these extra tips on the Naive Bayes classifier, 194 00:09:37,466 --> 00:09:39,333 and I look forward to seeing you next time. 195 00:09:39,333 --> 00:09:41,166 Until then, enjoy machine learning.
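The two-class and multi-class decision rules from this last point can be sketched as follows. The function names, the `cycles` class and the three-class numbers are hypothetical and serve only as an illustration; the 75% figure is the one from the lecture.

```python
def classify_two_classes(p_first, first="walks", second="drives"):
    """With two classes, one posterior is enough: the other is 1 - p_first."""
    if p_first > 0.5:
        return first
    if p_first < 0.5:
        return second
    return None  # exactly 50%: a tie


def classify_many_classes(posteriors):
    """With three or more classes, compare the posteriors and pick the largest."""
    return max(posteriors, key=posteriors.get)


print(classify_two_classes(0.75))  # walks

# Hypothetical three-class posteriors (the "cycles" class and the numbers are illustrative only).
example = {"walks": 0.40, "drives": 0.35, "cycles": 0.25}
print(classify_many_classes(example))  # walks
```

In the three-class case no single value crosses 50%, so knowing one posterior alone does not settle the classification.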