1 00:00:00,233 --> 00:00:00,566 All right. 2 00:00:00,566 --> 00:00:03,766 Then, as we said, we want to get the same criterion 3 00:00:03,766 --> 00:00:07,333 as in the intuition lectures, meaning entropy with the information gain. 4 00:00:07,566 --> 00:00:08,533 So there we go. 5 00:00:08,533 --> 00:00:13,166 Let's add criterion equals, in quotes, entropy. 6 00:00:13,766 --> 00:00:14,300 Great. 7 00:00:14,300 --> 00:00:18,100 And then finally that final parameter, random_state, 8 00:00:18,566 --> 00:00:21,566 to which we set the value zero. 9 00:00:21,766 --> 00:00:22,600 Perfect. 10 00:00:22,600 --> 00:00:26,300 And now, final step, you know it by heart: classifier. 11 00:00:26,566 --> 00:00:29,333 Then from this classifier we call the fit 12 00:00:29,333 --> 00:00:33,166 method, which will train the classifier we built 13 00:00:33,166 --> 00:00:37,866 so far on the training set, composed of the two arguments. 14 00:00:37,866 --> 00:00:42,566 We have two inputs here, which are X_train for the matrix of features 15 00:00:42,566 --> 00:00:45,933 of the training set, and then y_train 16 00:00:45,933 --> 00:00:49,400 for the dependent variable vector of the same training set. 17 00:00:49,800 --> 00:00:50,933 And that's it, my friends. 18 00:00:50,933 --> 00:00:56,866 Now we're about to find out if we can beat that record accuracy of 93%. 19 00:00:57,166 --> 00:00:58,933 I actually have a good feeling about this. 20 00:00:58,933 --> 00:01:01,433 We might beat it, but let's not talk too fast. 21 00:01:01,433 --> 00:01:03,066 We never know what's going to happen. 22 00:01:03,066 --> 00:01:06,733 So first let's upload the data set by clicking this folder here. 23 00:01:06,733 --> 00:01:08,866 You know, let's upload it in the notebook. 24 00:01:08,866 --> 00:01:11,633 So right now, as usual, you know, same story. 25 00:01:11,633 --> 00:01:12,966 The Colab notebook 26 00:01:12,966 --> 00:01:16,466 is connecting to a runtime to enable file browsing on your machine. 
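For reference, here is a minimal sketch of the classifier we just built and trained. A few things in it are assumptions and not what is on screen: the data is a synthetic two-feature stand-in generated with make_classification instead of the course's CSV file, the 75/25 split is only illustrative, and n_estimators is set to ten, the value this lecture returns to later on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic two-feature data standing in for the course data set.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# ASSUMPTION: an illustrative 75/25 split into training set and test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build the random forest with the entropy criterion and a fixed random_state.
classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
)

# Train on the matrix of features and the dependent variable vector.
classifier.fit(X_train, y_train)

# Predict the test set results, one label per test observation.
y_pred = classifier.predict(X_test)
print(y_pred.shape)
```

Fixing random_state to zero is what makes the forest reproducible from one run to the next, so rerunning all the cells gives the same trees.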
27 00:01:16,800 --> 00:01:19,500 And we will get the upload button in a second. 28 00:01:19,500 --> 00:01:20,533 There we go. 29 00:01:20,533 --> 00:01:25,400 So let's click it and let's go to where we have our Machine Learning 30 00:01:25,400 --> 00:01:26,666 A-Z folder. 31 00:01:26,666 --> 00:01:28,666 There it is. Mine is on my machine. 32 00:01:28,666 --> 00:01:31,600 So we're going to go inside, and then Part 3, Classification. 33 00:01:31,600 --> 00:01:34,433 Then Section 20, Random Forest Classification. 34 00:01:34,433 --> 00:01:37,666 The last classification model of this part. Congratulations again 35 00:01:37,666 --> 00:01:40,666 for making such huge progress with this course. 36 00:01:40,766 --> 00:01:45,866 There we go inside, then Python, and then Social_Network_Ads.csv. 37 00:01:46,533 --> 00:01:47,433 Let's open it. 38 00:01:47,433 --> 00:01:50,233 And now we're very close to the final result, 39 00:01:50,233 --> 00:01:54,466 you know, to the final discovery of whether we're going to beat, yes or no, 40 00:01:54,466 --> 00:01:57,766 the record accuracy of 93%. So there we go. 41 00:01:57,766 --> 00:01:59,600 Let's click Runtime here. 42 00:01:59,600 --> 00:02:04,500 And then let's click Run all to build and train again 43 00:02:04,500 --> 00:02:06,366 the random forest classification. Here we go. 44 00:02:06,366 --> 00:02:09,133 We have it now and our future prediction. 45 00:02:09,133 --> 00:02:13,066 So let's see, let's see, let's see what we get. First, that prediction 46 00:02:13,066 --> 00:02:15,466 of the purchase decision of that single customer of age 47 00:02:15,466 --> 00:02:19,600 30 and $87,000 estimated salary is correct, right? 48 00:02:19,600 --> 00:02:22,633 Because in reality, this customer didn't buy the SUV. 49 00:02:23,100 --> 00:02:24,900 And now with the test results, 50 00:02:24,900 --> 00:02:28,166 let's scroll back up here and let's see a bit what we have. 51 00:02:28,500 --> 00:02:30,700 So all this is correct here. Correct. 
52 00:02:30,700 --> 00:02:32,633 One incorrect prediction here. 53 00:02:32,633 --> 00:02:34,466 Two other incorrect predictions here. 54 00:02:34,466 --> 00:02:36,466 Oh, maybe we won't beat it. 55 00:02:36,466 --> 00:02:40,800 You know, let's see directly if we beat it, and, well, actually no. 56 00:02:40,800 --> 00:02:42,733 Wow. Okay. I'm very surprised. 57 00:02:42,733 --> 00:02:44,900 I thought we had a chance to beat it. 58 00:02:44,900 --> 00:02:46,533 I hope you're not too disappointed. 59 00:02:46,533 --> 00:02:50,833 But indeed, we didn't beat that record accuracy of 93%. 60 00:02:51,000 --> 00:02:54,600 Because indeed, with the random forest, we get 91%. 61 00:02:54,600 --> 00:02:55,766 Let's try to tune. 62 00:02:55,766 --> 00:02:57,600 You know, this is not our final word. 63 00:02:57,600 --> 00:03:00,733 Let's try to tune a bit the number of estimators. 64 00:03:00,733 --> 00:03:02,133 Maybe we can get a better one. 65 00:03:02,133 --> 00:03:05,066 Let's try, for example, the default value of 100. 66 00:03:05,066 --> 00:03:05,500 But you know, 67 00:03:05,500 --> 00:03:09,900 I don't think we will even improve that, because we might get overfitting anyway. 68 00:03:09,900 --> 00:03:13,400 And this will not help, of course, for the predictions of new observations 69 00:03:13,600 --> 00:03:14,333 in the test set. 70 00:03:14,333 --> 00:03:15,600 But anyway, let's try. 71 00:03:15,600 --> 00:03:17,666 Let's run all again. 72 00:03:17,666 --> 00:03:19,800 So this will rebuild and retrain 73 00:03:19,800 --> 00:03:22,800 your random forest classification with 100 trees. 74 00:03:23,233 --> 00:03:23,800 All right. 75 00:03:23,800 --> 00:03:25,633 We're about to get a new one. There we go. 76 00:03:25,633 --> 00:03:29,333 So now we have indeed 100 trees in the random forest. 77 00:03:29,733 --> 00:03:30,166 All right. 78 00:03:30,166 --> 00:03:32,900 The new prediction is still correct as a result. 79 00:03:32,900 --> 00:03:35,266 Okay. 
And now let's see the confusion matrix. 80 00:03:35,266 --> 00:03:36,366 That's what I was telling you. 81 00:03:36,366 --> 00:03:37,700 Still 91%. 82 00:03:37,700 --> 00:03:40,600 So it was perhaps better trained on the training set. 83 00:03:40,600 --> 00:03:43,433 But what we get on the test set is just the same. 84 00:03:43,433 --> 00:03:47,033 So anyway, you know, clearly the best models for our data set here, 85 00:03:47,033 --> 00:03:52,133 you know, for classification, are kernel SVM and K-nearest neighbors. 86 00:03:52,133 --> 00:03:55,133 So I'm going to put that back to ten, right. 87 00:03:55,200 --> 00:03:57,466 Press save, run everything again. 88 00:03:57,466 --> 00:04:01,233 And I'm going to show you, you know, the final visualization results 89 00:04:01,233 --> 00:04:03,566 for random forest, because it's always good to see them. 90 00:04:03,566 --> 00:04:07,100 You know, even if we didn't beat the accuracy, let's observe them. 91 00:04:07,233 --> 00:04:09,433 Let's actually observe them in the original file 92 00:04:09,433 --> 00:04:11,933 because it is right now running. 93 00:04:11,933 --> 00:04:14,400 All right. So we'll find them at the bottom. 94 00:04:14,400 --> 00:04:15,200 And so there you go. 95 00:04:15,200 --> 00:04:16,800 That's the result on the training set. 96 00:04:16,800 --> 00:04:19,500 And below you have the results on the test set. 97 00:04:19,500 --> 00:04:23,566 And indeed we see that even if it could be very well trained on the training set, 98 00:04:23,800 --> 00:04:28,000 well, we still have some wrong predictions here of green customers 99 00:04:28,000 --> 00:04:30,000 who fall in the wrong red region. 100 00:04:30,000 --> 00:04:31,700 And there's not much we can do, 101 00:04:31,700 --> 00:04:33,000 you know, when tuning the random forest 102 00:04:33,000 --> 00:04:36,933 classification, to correctly catch these customers here. 103 00:04:37,300 --> 00:04:39,600 But I want to say something now. 
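As a side note on how a figure like 91% comes out of a confusion matrix, here is a small sketch in plain Python. The counts below are hypothetical, chosen only so that 100 test observations give 91 correct predictions; they are not read off the lecture's screen. The function name accuracy_from_confusion_matrix is also just an illustrative helper, not a library call.

```python
def accuracy_from_confusion_matrix(cm):
    """Return the share of correct predictions in a square confusion matrix."""
    # Correct predictions sit on the diagonal (actual class == predicted class).
    correct = sum(cm[i][i] for i in range(len(cm)))
    # Every cell counts one group of test observations.
    total = sum(sum(row) for row in cm)
    return correct / total


# HYPOTHETICAL counts for a test set of 100 observations:
# rows are the actual classes, columns are the predicted classes.
cm = [
    [63, 5],  # actual 0: 63 predicted 0, 5 predicted 1
    [4, 28],  # actual 1: 4 predicted 0, 28 predicted 1
]

accuracy = accuracy_from_confusion_matrix(cm)
print(f"accuracy = {accuracy:.2f}")  # prints "accuracy = 0.91"
```

With counts like these, going from 91% to 93% would mean turning just two of the nine off-diagonal predictions into correct ones, which is why a handful of customers in the wrong region decides the comparison.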
104 00:04:39,600 --> 00:04:39,933 You know, 105 00:04:39,933 --> 00:04:43,733 maybe you want to play with the diverse classification models we implemented. 106 00:04:43,733 --> 00:04:46,733 And by playing with them I mean playing with the parameters, you know, 107 00:04:46,733 --> 00:04:49,733 trying different values of the parameters. 108 00:04:49,766 --> 00:04:53,900 And please let me know, you know, either by private message or in the Q&A, 109 00:04:54,200 --> 00:04:58,200 if you manage to beat 93%, you know, as a final accuracy 110 00:04:58,200 --> 00:04:59,600 on the test set, of course. 111 00:04:59,600 --> 00:05:00,866 But let me know if you succeed. 112 00:05:00,866 --> 00:05:03,633 I'll be very interested to see how you did it. 113 00:05:03,633 --> 00:05:06,933 All right, so here we are at the end of the section. 114 00:05:06,933 --> 00:05:08,933 Congratulations on completing it. 115 00:05:08,933 --> 00:05:12,100 Now you have some great tools in your classification toolkit. 116 00:05:12,366 --> 00:05:15,300 Please understand that the best models we got here 117 00:05:15,300 --> 00:05:18,566 are just for this data set. For your future data sets, 118 00:05:18,566 --> 00:05:20,100 the best model might be another one. 119 00:05:20,100 --> 00:05:23,200 It might be random forest, or it might be Naive Bayes. 120 00:05:23,400 --> 00:05:25,000 So you have to try all of them. 121 00:05:25,000 --> 00:05:28,500 And speaking of which, this is exactly what we'll do next 122 00:05:28,500 --> 00:05:32,700 in this Part 3. I'm going to take all these code templates here that we made. 123 00:05:32,700 --> 00:05:33,766 I'm going to simplify them. 
124 00:05:33,766 --> 00:05:36,800 You know, I'm going to remove all the prints and everything 125 00:05:36,900 --> 00:05:40,033 so that it can be very clear and well-structured, and mostly 126 00:05:40,166 --> 00:05:44,633 so that you can get very efficient code templates that you can try and deploy 127 00:05:44,666 --> 00:05:47,400 very quickly and efficiently on your data sets, 128 00:05:47,400 --> 00:05:51,033 so that you can quickly figure out which is the best model. And that's 129 00:05:51,033 --> 00:05:53,266 what we'll do in this last section of Part 3. 130 00:05:53,266 --> 00:05:54,466 Can't wait to meet you there. 131 00:05:54,466 --> 00:05:56,200 And until then, enjoy machine learning.