1
00:00:00,700 --> 00:00:02,200
All right, so that's done.

2
00:00:02,200 --> 00:00:04,766
That's what we had to input for train here.

3
00:00:04,766 --> 00:00:10,266
So train is your training set, but without the dependent variable. Then comma.

4
00:00:10,266 --> 00:00:12,233
And then let's add the second argument.

5
00:00:12,233 --> 00:00:14,300
So the second argument is test.

6
00:00:14,300 --> 00:00:15,700
So you can guess what it is.

7
00:00:15,700 --> 00:00:19,466
It's going to be the same: test equals test set, of course.

8
00:00:20,333 --> 00:00:20,633
All right.

9
00:00:20,633 --> 00:00:22,966
So test set, and then, same as for the training set,

10
00:00:22,966 --> 00:00:24,133
we're going to remove

11
00:00:24,133 --> 00:00:28,500
the dependent variable, because anyway we are not supposed to know the results.

12
00:00:28,500 --> 00:00:31,066
We want to predict the observations of the test set.

13
00:00:31,066 --> 00:00:32,800
So anyway we need to remove it.

14
00:00:32,800 --> 00:00:35,766
So comma to take all the lines, and minus

15
00:00:35,766 --> 00:00:38,766
three to remove the last column.

16
00:00:39,233 --> 00:00:39,566
Okay.

17
00:00:39,566 --> 00:00:42,066
So we have our training set and our test set.

18
00:00:42,066 --> 00:00:44,066
And now, what is the next parameter? Okay.

19
00:00:44,066 --> 00:00:48,333
The next parameter is cl: "factor of true classifications of training set".

20
00:00:48,933 --> 00:00:51,266
So can you guess what it's going to be?

21
00:00:52,600 --> 00:00:52,866
Okay.

22
00:00:52,866 --> 00:00:57,033
Let's see: cl equals... in your opinion, what is it going to be?

23
00:00:57,733 --> 00:01:00,900
Well, you know, to train a classifier,

24
00:01:00,900 --> 00:01:04,166
the classifier needs to have, okay, the independent variables.

25
00:01:04,466 --> 00:01:05,900
But it also needs to have

26
00:01:05,900 --> 00:01:10,000
the dependent variable, because it needs to have the results to,

27
00:01:10,366 --> 00:01:15,233
you know, find the correlations between the information in the independent

28
00:01:15,233 --> 00:01:18,233
variables and the information contained in the dependent variable.

29
00:01:18,333 --> 00:01:22,100
So here, since we only have the info about the independent variables,

30
00:01:22,200 --> 00:01:25,833
we also need to include somewhere the info of the dependent variable.

31
00:01:26,133 --> 00:01:27,500
And that's what we add here.

32
00:01:27,500 --> 00:01:31,300
That's cl, so "factor of true classifications of training set".

33
00:01:31,300 --> 00:01:33,933
That is the categorical dependent variable.

34
00:01:33,933 --> 00:01:34,800
So let's do this.

35
00:01:34,800 --> 00:01:37,333
So to take this vector, actually,

36
00:01:37,333 --> 00:01:40,466
as you can see, that's the last column of the training set.

37
00:01:40,766 --> 00:01:44,933
So it's going to be training set, taking all the lines of the observations,

38
00:01:45,166 --> 00:01:47,066
and then the 1, 2, 3,

39
00:01:47,066 --> 00:01:50,233
so third index, of the column Purchased.

40
00:01:50,566 --> 00:01:55,900
So let's take that: training set, brackets, comma,

41
00:01:56,133 --> 00:02:01,066
and then three, because the column we want is indexed by three.

42
00:02:01,766 --> 00:02:03,466
All right. So that's for the third argument.

43
00:02:03,466 --> 00:02:07,400
And then we have one more argument, which is the number of neighbors.

44
00:02:07,766 --> 00:02:09,333
So let's add this one.

45
00:02:09,333 --> 00:02:12,666
So remember, in Python we took five neighbors.

46
00:02:12,933 --> 00:02:14,533
That's actually the default parameter.

47
00:02:14,533 --> 00:02:15,666
So here we're going to take the same.

48
00:02:15,666 --> 00:02:20,333
That will allow us to compare the results we obtained in Python and in R.
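
The four arguments described so far can be put together roughly as follows. This is a minimal sketch, not the script from the video: it assumes the function being called is `knn()` from the `class` package (its `train`, `test`, `cl` and `k` arguments match the ones read out), the data is synthetic, and the names `dataset`, `training_set`, `test_set`, `y_pred`, `Age` and `EstimatedSalary` are assumptions made for illustration. Only the argument names, the `-3` and `3` column indices and the Purchased column in third position come from the walkthrough. Note that `class::knn()` defaults to `k = 1`, so `k = 5` has to be passed explicitly.

```r
library(class)  # provides knn()

# Synthetic stand-in data (hypothetical): two numeric features and a 0/1
# label in the third column, mirroring the column layout described above.
set.seed(123)
n <- 400
dataset <- data.frame(
  Age = rnorm(n),
  EstimatedSalary = rnorm(n)
)
dataset$Purchased <- factor(as.integer(dataset$Age + dataset$EstimatedSalary > 0))

# Hold out 100 observations as the test set (split chosen for illustration).
test_idx <- sample(n, 100)
training_set <- dataset[-test_idx, ]
test_set <- dataset[test_idx, ]

# K-NN has no separate fitting step here: one call takes the training
# features, the test features, the training labels and k.
y_pred <- knn(
  train = training_set[, -3],  # all rows, drop the dependent variable
  test  = test_set[, -3],      # same for the test set
  cl    = training_set[, 3],   # true classes of the training set
  k     = 5                    # number of neighbors
)

length(y_pred)  # one prediction per test observation
```

The result is a factor with one predicted class per row of the test set, which is what gets inspected in the console next.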

49
00:02:20,700 --> 00:02:22,300
So it will be interesting.

50
00:02:22,300 --> 00:02:25,300
So let's take k equals five neighbors.

51
00:02:26,100 --> 00:02:26,566
All right.

52
00:02:26,566 --> 00:02:28,200
And now we have everything we need.

53
00:02:28,200 --> 00:02:30,366
We can select this.

54
00:02:30,366 --> 00:02:31,733
And here it is, y_pred.

55
00:02:31,733 --> 00:02:32,700
All good.

56
00:02:32,700 --> 00:02:35,700
So now let's have a look at y_pred.

57
00:02:35,900 --> 00:02:36,800
We can have a look here.

58
00:02:36,800 --> 00:02:40,600
y_pred: typing y_pred in the console and pressing enter to

59
00:02:40,666 --> 00:02:41,700
have a look at it.

60
00:02:41,700 --> 00:02:43,700
And here are all the predictions for the test set.

61
00:02:43,700 --> 00:02:45,000
So remember, the test set,

62
00:02:45,000 --> 00:02:46,633
it contains 100 observations.

63
00:02:46,633 --> 00:02:49,500
So here we have 100 predictions.

64
00:02:49,500 --> 00:02:53,700
They correspond to the same observations as these guys here.

65
00:02:54,100 --> 00:02:58,566
So for example, let's take the first observations, the first users.

66
00:02:58,933 --> 00:03:02,633
So let's take the 1, 2, 3, 4, 5.

67
00:03:02,633 --> 00:03:07,600
So the first five users: these first five users didn't buy the SUV

68
00:03:07,600 --> 00:03:11,500
in reality, because the Purchased variable equals zero here.

69
00:03:11,500 --> 00:03:12,400
And that's the truth.

70
00:03:12,400 --> 00:03:14,900
That's what actually happened in reality.

71
00:03:14,900 --> 00:03:17,333
And what does our prediction say?

72
00:03:17,333 --> 00:03:20,066
1, 2, 3, 4, 5: five zeros here.

73
00:03:20,066 --> 00:03:23,900
So correct predictions for the first five users. Okay, perfect.

74
00:03:24,433 --> 00:03:27,433
Then we have four ones.

75
00:03:27,600 --> 00:03:30,300
The 6th, 7th, 8th

76
00:03:30,300 --> 00:03:33,600
and 9th users actually bought the SUV.

77
00:03:33,900 --> 00:03:35,633
So great for the sixth one.

78
00:03:35,633 --> 00:03:38,466
The seventh one, great. Correct prediction. Eighth one,

79
00:03:38,466 --> 00:03:41,200
correct prediction as well. But the classifier

80
00:03:41,200 --> 00:03:42,900
made a little mistake here.

81
00:03:42,900 --> 00:03:43,533
But that's fine.

82
00:03:43,533 --> 00:03:46,666
It looks like it's making correct predictions most of the time.

83
00:03:46,666 --> 00:03:49,500
And then we are going to check that on the confusion matrix.

84
00:03:49,500 --> 00:03:50,533
That will be faster.

85
00:03:51,566 --> 00:03:52,400
That was just to,

86
00:03:52,400 --> 00:03:55,400
you know, explain what y_pred was.

87
00:03:55,400 --> 00:03:56,900
But I think you get it.

88
00:03:56,900 --> 00:03:59,900
So here we just need to select this and execute.

89
00:04:00,066 --> 00:04:03,000
Here, cm: we can have a look at it in the console.

90
00:04:03,000 --> 00:04:07,600
cm, and that's the predictions. Okay.

91
00:04:07,600 --> 00:04:12,333
So we have six plus five incorrect predictions: 11 incorrect predictions.

92
00:04:12,333 --> 00:04:13,966
So that's not too bad.

93
00:04:13,966 --> 00:04:18,433
And now what we are most interested to see is the prediction regions,

94
00:04:18,633 --> 00:04:21,500
how they behave, and especially the prediction boundary, to

95
00:04:21,500 --> 00:04:24,966
see if it's going to be a straight line or something else.

96
00:04:25,333 --> 00:04:29,366
And actually you're going to see that K-NN is a nonlinear classifier.

97
00:04:29,366 --> 00:04:31,566
So we will get something different

98
00:04:31,566 --> 00:04:34,566
than the one we got with logistic regression.
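
The confusion matrix step can be illustrated on a small scale. The block below is a sketch under assumptions: it supposes the matrix is built with base R's `table()`, and it uses only the nine users read out above (five who did not buy, then four who did, with the ninth prediction assumed to be the wrong one) rather than the full 100-observation test set. The names `actual`, `predicted`, `incorrect` and `accuracy` are hypothetical.

```r
# True classes and predictions for the nine users walked through above.
# The ninth prediction is assumed to be the misclassified one.
actual    <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 0), levels = c(0, 1))

# Rows are the true classes, columns are the predicted classes.
cm <- table(actual, predicted)
print(cm)

# Correct predictions sit on the diagonal; everything else is an error.
incorrect <- sum(cm) - sum(diag(cm))
accuracy  <- sum(diag(cm)) / sum(cm)
```

On the full test set in the video, the two off-diagonal cells hold 6 and 5, which is where the 11 incorrect predictions out of 100 come from.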