1 00:00:00,566 --> 00:00:00,933 All right. 2 00:00:00,933 --> 00:00:05,200 So now, my friends, we're going to rerun all this, right? 3 00:00:05,200 --> 00:00:07,133 Because we're basically done. 4 00:00:07,133 --> 00:00:11,000 And so to run all this again we're going to first upload 5 00:00:11,000 --> 00:00:14,466 the data set because it is not yet uploaded in the notebook. 6 00:00:14,900 --> 00:00:17,866 So upload and then make sure to find 7 00:00:17,866 --> 00:00:21,366 your whole Machine Learning A-Z folder containing all the codes and data sets. 8 00:00:21,633 --> 00:00:23,033 So then we're going to go inside. 9 00:00:23,033 --> 00:00:25,600 Then we're going to go to Part 3, Classification. 10 00:00:25,600 --> 00:00:29,866 And then K-Nearest Neighbors, KNN, and then Python. 11 00:00:29,866 --> 00:00:35,066 And then we select our Social Network Ads data set and click Open. 12 00:00:35,566 --> 00:00:38,300 Now it's going to be uploaded on the notebook. 13 00:00:38,300 --> 00:00:39,933 Perfect. There we go. 14 00:00:39,933 --> 00:00:41,533 Now we can run everything. 15 00:00:41,533 --> 00:00:45,000 And to do this, well, we're going to click Runtime here. 16 00:00:45,000 --> 00:00:50,566 And then simply run, and let's just start from here. Okay. 17 00:00:50,566 --> 00:00:51,433 Great. 18 00:00:51,433 --> 00:00:55,333 The K-nearest neighbors classifier was created and trained properly. 19 00:00:55,800 --> 00:00:57,133 And at the end. 20 00:00:57,133 --> 00:00:59,400 So all the predictions are generated. 21 00:00:59,400 --> 00:01:01,966 Again we can actually have a look at them. 22 00:01:01,966 --> 00:01:05,966 And wow, we seem to have some great predictions, and remember, that's on the test set. 23 00:01:06,233 --> 00:01:07,633 So all this is correct.
24 00:01:07,633 --> 00:01:11,466 Here we have an incorrect prediction, meaning that the KNN predicted 25 00:01:11,466 --> 00:01:12,733 this particular customer 26 00:01:12,733 --> 00:01:16,866 to buy the SUV, whereas in reality that customer didn't buy the SUV. 27 00:01:17,400 --> 00:01:19,600 Then all this is correct. All correct. 28 00:01:19,600 --> 00:01:20,466 All correct. 29 00:01:20,466 --> 00:01:21,633 It seems really good. 30 00:01:21,633 --> 00:01:24,766 Here we have another incorrect prediction, this time the opposite one. 31 00:01:24,933 --> 00:01:26,033 Our model predicted 32 00:01:26,033 --> 00:01:30,666 that this particular customer didn't buy the SUV, whereas in reality that customer 33 00:01:30,666 --> 00:01:32,066 bought the SUV. 34 00:01:32,066 --> 00:01:34,500 All right. Correct. All correct. All correct. 35 00:01:34,500 --> 00:01:35,233 So pretty good. 36 00:01:35,233 --> 00:01:38,400 Wow, I think we actually have very, very few incorrect predictions. 37 00:01:38,433 --> 00:01:40,700 Let's actually check it out right now. Okay, no. 38 00:01:40,700 --> 00:01:42,266 So I didn't see it properly. 39 00:01:42,266 --> 00:01:47,066 But we actually have in total four plus three incorrect predictions. 40 00:01:47,066 --> 00:01:48,133 That's really, really good. 41 00:01:48,133 --> 00:01:50,300 That's better than the logistic regression model. 42 00:01:50,300 --> 00:01:54,400 Remember, we actually had an accuracy on the test set 43 00:01:55,400 --> 00:01:57,566 of 89%. 44 00:01:57,566 --> 00:01:59,833 So here clearly the K-nearest neighbors did 45 00:01:59,833 --> 00:02:01,500 better than the logistic regression model. 46 00:02:01,500 --> 00:02:05,466 And you will perfectly understand why when I show you the visualization. 47 00:02:05,733 --> 00:02:07,333 So it's actually here. 48 00:02:07,333 --> 00:02:07,700 All right. 49 00:02:07,700 --> 00:02:10,933 So four plus three equals seven incorrect predictions.
50 00:02:11,166 --> 00:02:14,666 Then 64 correct predictions of the class zero, 51 00:02:14,666 --> 00:02:18,966 meaning 64 correct predictions that the customer doesn't buy the SUV. 52 00:02:19,266 --> 00:02:22,133 And 29 correct predictions of the class one, 53 00:02:22,133 --> 00:02:26,066 meaning 29 correct predictions that the customer buys the SUV. 54 00:02:26,466 --> 00:02:27,433 And then we have indeed 55 00:02:27,433 --> 00:02:31,466 four incorrect predictions of the class one, meaning four incorrect predictions 56 00:02:31,533 --> 00:02:36,300 that the customer buys the SUV, and three incorrect predictions of the class zero, 57 00:02:36,300 --> 00:02:40,866 meaning three incorrect predictions that the customer doesn't buy the SUV. 58 00:02:41,366 --> 00:02:42,766 So very, very good. 59 00:02:42,766 --> 00:02:47,400 And now, as you can see, it is still executing the cell that plots 60 00:02:47,700 --> 00:02:49,333 that 2D plot, 61 00:02:49,333 --> 00:02:52,500 you know, showing all the prediction regions 62 00:02:52,500 --> 00:02:55,700 and prediction boundary of the KNN model, 63 00:02:55,800 --> 00:02:59,000 first on the training set and then on the test set. 64 00:02:59,500 --> 00:03:03,133 And the reason why it is taking some time here to execute... 65 00:03:03,133 --> 00:03:03,700 Oh, there we go. 66 00:03:03,700 --> 00:03:07,500 We have perfect timing. It is simply because, you know, the KNN 67 00:03:07,500 --> 00:03:10,933 model is actually very compute intensive. 68 00:03:10,966 --> 00:03:14,466 You know, there are a lot of computations when running the KNN model. 69 00:03:14,466 --> 00:03:16,133 And that's why it took some time. 70 00:03:16,133 --> 00:03:18,466 And so now let's interpret the results. 71 00:03:18,466 --> 00:03:21,500 So remember with the logistic regression 72 00:03:21,500 --> 00:03:25,133 classifier the prediction boundary was a straight line.
73 00:03:25,133 --> 00:03:29,100 And that's because the logistic regression model is a linear classifier. 74 00:03:29,166 --> 00:03:32,166 We'll actually see another type of linear classifier in this part. 75 00:03:32,333 --> 00:03:36,500 But remember for linear classifiers the prediction boundary 76 00:03:36,500 --> 00:03:40,400 or prediction curve for classification is a straight line. 77 00:03:40,400 --> 00:03:42,700 And in three dimensions it's a straight plane. 78 00:03:42,700 --> 00:03:45,600 And then in n dimensions it's a straight hyperplane. 79 00:03:45,600 --> 00:03:48,600 And so here we had a straight line which therefore 80 00:03:48,633 --> 00:03:51,733 resulted in having a lot of incorrect predictions. 81 00:03:51,733 --> 00:03:54,433 Because as we can see there are a lot of green points here. 82 00:03:54,433 --> 00:03:58,800 You know, green customers who bought the SUV in reality, that fell 83 00:03:58,800 --> 00:04:02,766 in the red region where we predict indeed that the customers don't buy the SUV. 84 00:04:03,000 --> 00:04:04,300 And same for these ones here. 85 00:04:04,300 --> 00:04:08,333 You know, here we have a lot of customers who in reality bought the SUV. 86 00:04:08,333 --> 00:04:11,566 But we're predicting that they don't, because they fall in the red region. 87 00:04:11,933 --> 00:04:15,733 And so what we were hoping for is to build a new classifier 88 00:04:15,966 --> 00:04:20,200 for which the decision boundary, you know, separating the two prediction regions, 89 00:04:20,433 --> 00:04:23,833 does something like this, you know, would not be a straight line, 90 00:04:23,833 --> 00:04:27,133 but some kind of a curve catching all the red points here 91 00:04:27,133 --> 00:04:30,100 without catching those green points here and here. 92 00:04:30,100 --> 00:04:32,000 Okay, so that's what we were hoping for. 93 00:04:32,000 --> 00:04:34,266 And well, that's exactly what we got here, 94 00:04:34,266 --> 00:04:36,266 thanks to our KNN model.
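[Editor's note] The curved boundary described above comes from how KNN predicts: each point takes the majority class of its nearest training points, so no straight line is imposed. The following is a minimal sketch of that idea only, not the notebook's code. The function name `knn_predict`, the value of `k`, and the toy data are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # The distance to EVERY training point is computed for each prediction,
    # which is why KNN gets compute-intensive on a dense plotting grid.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (scaled age, scaled salary); 0 = didn't buy, 1 = bought.
train_points = [(-1.0, -1.0), (-0.8, -0.5), (-0.5, -0.9),
                (1.0, 1.0), (0.8, 0.6), (0.9, -0.2)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (-0.7, -0.7)))
print(knn_predict(train_points, train_labels, (0.9, 0.5)))
```

Because the vote is local to each query point, the border between the two regions follows the data instead of a single line.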
95 00:04:36,266 --> 00:04:40,300 We don't get a smooth curve, but we get actually some kind of a curve 96 00:04:40,300 --> 00:04:42,766 that indeed catches all the red points here 97 00:04:42,766 --> 00:04:46,466 without catching the green points, and leaving those green points 98 00:04:46,466 --> 00:04:50,600 and these ones inside the green region where they should end up, right? 99 00:04:50,966 --> 00:04:54,333 And of course, here we have some green points that were super hard to catch 100 00:04:54,333 --> 00:04:57,166 because they are stuck in the middle of all those red points here. 101 00:04:57,166 --> 00:04:57,766 So that's fine. 102 00:04:57,766 --> 00:05:01,433 And anyway, in machine learning we want to avoid overfitting. 103 00:05:01,433 --> 00:05:04,466 You know, when we have some too-perfect predictions on the training set 104 00:05:04,600 --> 00:05:08,166 and therefore probably resulting in having bad performance, bad predictions 105 00:05:08,366 --> 00:05:10,900 on the test set. And well, speaking of the test set, 106 00:05:10,900 --> 00:05:14,466 let's see how our KNN did. It should be fine now. 107 00:05:14,500 --> 00:05:19,833 Yes, it is executed and once again we get excellent results. 108 00:05:19,933 --> 00:05:20,233 Right. 109 00:05:20,233 --> 00:05:23,966 Because, same, the KNN prediction curve or prediction boundary 110 00:05:23,966 --> 00:05:28,100 or even decision boundary catches all the right customers. 111 00:05:28,100 --> 00:05:30,366 You know, the red customers who didn't buy in reality 112 00:05:30,366 --> 00:05:34,233 the SUV, therefore leaving the green customers who 113 00:05:34,233 --> 00:05:38,933 bought in reality the SUV in the right prediction region, the green region. 114 00:05:39,433 --> 00:05:41,833 All right. So that's already much better. 115 00:05:41,833 --> 00:05:46,233 And indeed we increased the accuracy on the test set by 4%. 116 00:05:46,533 --> 00:05:49,800 And now the question is can we do even better than this.
117 00:05:49,800 --> 00:05:52,233 Well, that's exactly what we're going to find out 118 00:05:52,233 --> 00:05:55,666 in the next practical activities to discover the big winner. 119 00:05:55,966 --> 00:05:57,833 And until then enjoy machine learning.
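[Editor's note] The figures quoted in the video (64 and 29 correct predictions, 4 and 3 incorrect ones, and 89% accuracy for logistic regression) can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the row/column layout of the matrix (rows = actual class, columns = predicted class) is an assumption about how the notebook displays it.

```python
# Counts quoted in the video for the KNN model on the test set.
true_negatives = 64   # correctly predicted "doesn't buy" (class 0)
true_positives = 29   # correctly predicted "buys" (class 1)
false_positives = 4   # predicted "buys", customer didn't buy
false_negatives = 3   # predicted "doesn't buy", customer bought

# Assumed layout: rows are actual classes, columns are predicted classes.
confusion_matrix = [
    [true_negatives, false_positives],
    [false_negatives, true_positives],
]

total = sum(sum(row) for row in confusion_matrix)
correct = true_negatives + true_positives
incorrect = false_positives + false_negatives

knn_accuracy = correct / total
logreg_accuracy = 0.89  # test-set accuracy quoted for logistic regression

print(f"Incorrect predictions: {incorrect}")
print(f"KNN accuracy: {knn_accuracy:.2f}")
print(f"Gain over logistic regression: {(knn_accuracy - logreg_accuracy) * 100:.0f} points")
```

The 4% improvement mentioned in the video is therefore a gain of four percentage points, from 89% to 93%.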