1 00:00:00,533 --> 00:00:01,266 And now let's see. 2 00:00:01,266 --> 00:00:05,000 Okay, so the random forest classifier definitely catches 3 00:00:05,000 --> 00:00:08,666 most of the users that didn't buy the SUV in the right category. 4 00:00:08,666 --> 00:00:09,533 That is the red region. 5 00:00:09,533 --> 00:00:13,600 So that means that it classified well most of the users who didn't buy the SUV. 6 00:00:13,600 --> 00:00:17,366 And the same for the green users, who are the users who bought the SUV in reality, 7 00:00:17,600 --> 00:00:21,500 because as we can see, most of them are in the right green region, 8 00:00:22,266 --> 00:00:26,066 and it's desperately trying to catch some outliers. 9 00:00:26,466 --> 00:00:27,600 We can call them that way. 10 00:00:27,600 --> 00:00:29,300 For example, this guy here is 11 00:00:29,300 --> 00:00:32,900 a user that didn't buy the SUV in reality, because this is a red point, 12 00:00:33,266 --> 00:00:36,100 and it is way into the green region here, as we can see. 13 00:00:36,100 --> 00:00:39,100 But the random forest classifier managed to make this 14 00:00:39,333 --> 00:00:42,500 little rectangle part of the region in red 15 00:00:42,633 --> 00:00:46,600 to catch this user that didn't buy the SUV and classify it well. 16 00:00:47,000 --> 00:00:48,566 But is it the smart way of doing it? 17 00:00:48,566 --> 00:00:52,033 Because what will tell you that for some new observations, we will have, 18 00:00:52,033 --> 00:00:57,266 you know, some users who didn't buy the SUV in this red rectangle here? 19 00:00:57,300 --> 00:01:00,800 So that looks like overfitting, because it made this red rectangle here 20 00:01:00,800 --> 00:01:03,800 because we had this user indeed who didn't buy the SUV. 
21 00:01:03,900 --> 00:01:06,800 But nothing tells us that for some new observations, 22 00:01:06,800 --> 00:01:10,166 we will have some users who didn't buy the SUV in this 23 00:01:10,166 --> 00:01:13,500 red rectangle here, so we should be careful with that. 24 00:01:13,800 --> 00:01:15,366 And same for this user here. 25 00:01:15,366 --> 00:01:19,366 As you can see, this user is in some sort of an irregular red region here, 26 00:01:19,533 --> 00:01:23,066 but fortunately our random forest classifier was not too obsessed 27 00:01:23,166 --> 00:01:25,500 with making all the predictions correct. 28 00:01:25,500 --> 00:01:28,700 Because as we can see, this red user here is in the green region. 29 00:01:28,933 --> 00:01:32,000 So that means that it still paid attention to overfitting, 30 00:01:32,000 --> 00:01:34,400 but not too much. And we should be careful with that. 31 00:01:34,400 --> 00:01:37,333 So speaking of overfitting, let's check that right now. 32 00:01:37,333 --> 00:01:42,300 Let's look at the test results right now to see how this region is here. 33 00:01:42,300 --> 00:01:43,833 Because, you know, the regions won't change. 34 00:01:43,833 --> 00:01:46,133 These are the regions built by our model. 35 00:01:46,133 --> 00:01:50,100 So when we look at the test results, we will have the same red region 36 00:01:50,100 --> 00:01:53,100 here with this rectangle here and the green region here. 37 00:01:53,400 --> 00:01:56,500 But what will change will be the test set observation points. 38 00:01:56,500 --> 00:01:58,233 That is, all the red points and the green points. 39 00:01:58,233 --> 00:01:59,166 These will change, 40 00:01:59,166 --> 00:02:03,100 and we will see if we have some red points here in this rectangle here. 41 00:02:03,100 --> 00:02:06,600 And actually probably not, because this looks like overfitting 42 00:02:06,600 --> 00:02:10,566 that occurred because our classifier was fitted too much to the training set. 
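The check described here, keeping the fitted regions fixed and only swapping in new points, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tutorial's own code: the data is synthetic (the `make_users` rule and its 10% label noise are invented for the sketch), and a one-nearest-neighbour classifier stands in for the random forest because it memorises the training set in an even more extreme way. All function and variable names are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def make_users(n):
    """Synthetic (age, salary_in_thousands, bought) rows; NOT the course data set."""
    rows = []
    for _ in range(n):
        age = random.uniform(18, 60)
        salary = random.uniform(15, 150)
        # Assumed rule: older, better-paid users tend to buy.
        bought = 1 if (age - 18) / 42 + (salary - 15) / 135 > 1.0 else 0
        if random.random() < 0.1:  # 10% label noise plays the role of the outliers
            bought = 1 - bought
        rows.append((age, salary, bought))
    return rows


def predict_1nn(train, age, salary):
    """Return the label of the closest training user (features roughly rescaled)."""
    def dist(row):
        return ((row[0] - age) / 42) ** 2 + ((row[1] - salary) / 135) ** 2
    return min(train, key=dist)[2]


def accuracy(train, rows):
    """Share of rows whose label matches the prediction made from `train`."""
    hits = sum(predict_1nn(train, a, s) == y for a, s, y in rows)
    return hits / len(rows)


train_set = make_users(300)
test_set = make_users(100)

train_acc = accuracy(train_set, train_set)  # points the model has already seen
test_acc = accuracy(train_set, test_set)    # same regions, new points
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

A model that carves a small region around every training outlier scores perfectly on the training set; a noticeably lower score on the new points is the numerical counterpart of the empty red rectangle.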
43 00:02:10,566 --> 00:02:12,700 So let's find out about that right now. 44 00:02:12,700 --> 00:02:18,000 Let's select this section dedicated to visualizing the test set results. 45 00:02:18,566 --> 00:02:23,066 So I'll just select everything and press Command or Control plus Enter to execute. 46 00:02:24,000 --> 00:02:24,433 All right. 47 00:02:24,433 --> 00:02:26,933 So what is the first thing you see here? 48 00:02:26,933 --> 00:02:28,366 Well, yes indeed. 49 00:02:28,366 --> 00:02:33,233 This red rectangle here is totally useless for some new observations. 50 00:02:33,533 --> 00:02:38,466 So that was clearly a red rectangle region to catch some users of the training set, 51 00:02:38,733 --> 00:02:41,733 because our classifier was fitted too much to the training set. 52 00:02:41,866 --> 00:02:45,466 And this red rectangle actually doesn't make any sense here, 53 00:02:45,466 --> 00:02:49,733 because indeed we don't have any red user in this rectangle region here. 54 00:02:49,733 --> 00:02:50,500 Well, it's 55 00:02:50,500 --> 00:02:53,500 not that it doesn't make any sense, but it's totally useless here. 56 00:02:54,300 --> 00:02:57,900 And besides, you know, we have this green point here, and this green point 57 00:02:57,900 --> 00:03:01,800 could have been in this region here, and that would make an incorrect prediction. 58 00:03:01,800 --> 00:03:02,933 We were lucky on this one, 59 00:03:02,933 --> 00:03:06,300 but this could have happened because these are new observations. 60 00:03:06,466 --> 00:03:08,600 And our random forest classification machine learning 61 00:03:08,600 --> 00:03:12,366 model didn't learn anything from these new observation points. 62 00:03:12,500 --> 00:03:15,500 So this guy could totally have ended up here. 63 00:03:16,033 --> 00:03:17,700 So lucky on this one. 64 00:03:17,700 --> 00:03:21,566 And by the way, same for this region here: we don't have any red user. 
65 00:03:21,566 --> 00:03:24,233 That is, some user who didn't buy the SUV in this red region. 66 00:03:24,233 --> 00:03:27,233 So this red region is totally useless as well. 67 00:03:27,700 --> 00:03:28,800 Okay, so that's the idea. 68 00:03:28,800 --> 00:03:31,666 But most of all, it did a pretty good job, because of course it got 69 00:03:31,666 --> 00:03:34,933 most of the red users here with a low age and low 70 00:03:34,933 --> 00:03:38,033 estimated salary, and therefore users who didn't buy the SUV. 71 00:03:38,466 --> 00:03:41,533 And most of the green users who are quite old with a higher 72 00:03:41,533 --> 00:03:45,766 estimated salary, who bought this awesome, cheap luxury SUV. 73 00:03:46,633 --> 00:03:48,600 Okay, and now what is the conclusion of all this? 74 00:03:48,600 --> 00:03:49,600 Because we reached 75 00:03:49,600 --> 00:03:53,800 the end of our classification adventure, we built all our classifiers. 76 00:03:53,800 --> 00:03:54,866 So according to you, 77 00:03:54,866 --> 00:03:58,800 what is the best classifier for this particular business problem? 78 00:03:58,800 --> 00:04:00,366 What is the best one? 79 00:04:00,366 --> 00:04:04,166 It should be a classifier that correctly classified the users who didn't buy 80 00:04:04,166 --> 00:04:08,100 the SUV and the users who bought the SUV, and at the same time 81 00:04:08,300 --> 00:04:11,500 prevented overfitting to the training set, to be able 82 00:04:11,500 --> 00:04:14,666 to make some good new predictions for some new observations. 83 00:04:15,200 --> 00:04:18,700 So in my opinion, the best classifier would be the kernel 84 00:04:18,700 --> 00:04:22,000 SVM in terms of the balance between the percentage 85 00:04:22,000 --> 00:04:25,500 of incorrect predictions and the fact that we want to prevent overfitting. 86 00:04:25,866 --> 00:04:28,566 Well, if we look at them again, in my opinion 87 00:04:28,566 --> 00:04:31,400 the kernel SVM classifier would be the best one. 
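The balance mentioned here, between the percentage of incorrect predictions and the risk of overfitting, can be made concrete by recording both the training and the test error rate of each candidate. The sketch below is only an illustration under assumptions: the data is synthetic, and k-nearest-neighbour classifiers with different `k` stand in for "more flexible" and "smoother" models; they are not the kernel SVM or random forest built in the tutorial, and every name in the code is hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def make_users(n):
    """Synthetic (age, salary_in_thousands, bought) rows; NOT the course data set."""
    rows = []
    for _ in range(n):
        age = random.uniform(18, 60)
        salary = random.uniform(15, 150)
        bought = 1 if (age - 18) / 42 + (salary - 15) / 135 > 1.0 else 0
        if random.random() < 0.1:  # assumed 10% label noise (the "outliers")
            bought = 1 - bought
        rows.append((age, salary, bought))
    return rows


def knn_predict(train, k, age, salary):
    """Majority vote among the k closest training users (k is odd, so no ties)."""
    def dist(row):
        return ((row[0] - age) / 42) ** 2 + ((row[1] - salary) / 135) ** 2
    nearest = sorted(train, key=dist)[:k]
    votes = sum(row[2] for row in nearest)
    return 1 if 2 * votes > k else 0


def error_rate(train, k, rows):
    """Share of rows that the k-NN model built on `train` gets wrong."""
    wrong = sum(knn_predict(train, k, a, s) != y for a, s, y in rows)
    return wrong / len(rows)


train_set = make_users(300)
test_set = make_users(100)

results = {}
for k in (1, 5, 25):  # small k = very flexible regions, large k = smoother regions
    train_err = error_rate(train_set, k, train_set)
    test_err = error_rate(train_set, k, test_set)
    results[f"k={k}"] = (train_err, test_err)
    print(f"k={k:>2}  train error {train_err:.2f}  test error {test_err:.2f}  "
          f"gap {test_err - train_err:+.2f}")

# Pick the candidate with the lowest error on points it has never seen.
best = min(results, key=lambda name: results[name][1])
print("lowest test error:", best)
```

A candidate with near-zero training error but a large gap to its test error is the one drawing useless little rectangles; the preferred model is the one whose test error is low and close to its training error.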
88 00:04:31,400 --> 00:04:33,533 All right. So that's the end of this tutorial. 89 00:04:33,533 --> 00:04:36,200 And now I have to say congratulations, 90 00:04:36,200 --> 00:04:40,400 because you built a great number of classifiers, from simple classifiers 91 00:04:40,400 --> 00:04:44,800 with logistic regression to more sophisticated and more complex 92 00:04:45,000 --> 00:04:49,033 classifiers like kernel SVM or random forest classifiers. 93 00:04:49,433 --> 00:04:51,000 But that's not the end of the journey. 94 00:04:51,000 --> 00:04:54,333 In the next section, we will be talking about how to evaluate 95 00:04:54,566 --> 00:04:58,500 the performance of our models and how we can improve them. 96 00:04:58,700 --> 00:05:02,400 And then eventually we will have a homework on a real-life data set, 97 00:05:02,700 --> 00:05:06,433 where we will combine what we learned here about how to build some classifiers 98 00:05:06,566 --> 00:05:08,600 and all the next concepts that we will learn 99 00:05:08,600 --> 00:05:12,300 to evaluate the model performance, in order to find the best model 100 00:05:12,566 --> 00:05:16,266 for this real-life business problem data set that you will be given, and we will 101 00:05:16,266 --> 00:05:20,866 do the job as a data scientist or machine learning scientist would do in reality. 102 00:05:21,133 --> 00:05:22,666 So congratulations again. 103 00:05:22,666 --> 00:05:24,633 I look forward to seeing you in the next section. 104 00:05:24,633 --> 00:05:26,433 And until then, enjoy machine learning.