1 00:00:00,066 --> 00:00:01,500 Now, the other very 2 00:00:01,500 --> 00:00:05,366 important thing to understand is that these are two prediction regions 3 00:00:05,633 --> 00:00:08,866 separated by a straight line, which is the straight line 4 00:00:08,866 --> 00:00:10,066 here. 5 00:00:10,066 --> 00:00:13,066 And the straight line is called the prediction boundary, 6 00:00:13,200 --> 00:00:16,200 because it's the boundary between the two prediction regions. 7 00:00:16,866 --> 00:00:20,100 And the fact that it's a straight line is not random. 8 00:00:20,633 --> 00:00:22,033 It is for a particular reason. 9 00:00:22,033 --> 00:00:24,466 And that's a very important thing to understand, 10 00:00:24,466 --> 00:00:27,766 because that's the essence of logistic regression. 11 00:00:28,500 --> 00:00:31,066 If the prediction boundary is a straight line here, 12 00:00:31,066 --> 00:00:35,966 that's because our logistic regression classifier is a linear classifier. 13 00:00:36,566 --> 00:00:39,266 That means that here, since we are in two dimensions, you know, 14 00:00:39,266 --> 00:00:42,366 because we have two independent variables, the age and the estimated salary, 15 00:00:42,366 --> 00:00:43,866 so we are in two dimensions, 16 00:00:43,866 --> 00:00:47,800 and since the logistic regression classifier is a linear classifier, 17 00:00:48,300 --> 00:00:52,900 the prediction boundary separator here can only be a straight line. 18 00:00:53,200 --> 00:00:55,966 If we were in three dimensions then it would be 19 00:00:55,966 --> 00:00:58,966 a flat plane separating two spaces. 20 00:00:59,000 --> 00:01:01,233 But here in two dimensions it's a straight line 21 00:01:01,233 --> 00:01:03,000 and it will always be a straight line.
22 00:01:03,000 --> 00:01:06,933 If your classifier is a linear classifier. But you will see later 23 00:01:06,933 --> 00:01:10,166 that when we build nonlinear classifiers, 24 00:01:10,300 --> 00:01:14,500 then the prediction boundary separator won't be a straight line anymore. 25 00:01:14,766 --> 00:01:17,933 I won't tell you more right now and I will let you wait for the surprise. 26 00:01:18,533 --> 00:01:22,966 So here we can clearly see that our logistic regression classifier manages 27 00:01:22,966 --> 00:01:27,866 to catch most of the users who didn't buy the SUV in the red region here, 28 00:01:28,300 --> 00:01:32,300 and most of the users who bought the SUV in the green region here. 29 00:01:32,566 --> 00:01:34,866 So it actually did a pretty good job. 30 00:01:34,866 --> 00:01:39,033 However, it seems to have trouble catching some green users here 31 00:01:39,033 --> 00:01:43,066 who, in spite of their low salary, bought the luxury SUV, 32 00:01:43,533 --> 00:01:47,700 as well as those other green users here who also bought the luxury SUV, 33 00:01:48,266 --> 00:01:49,266 because as you can see, 34 00:01:49,266 --> 00:01:53,200 these green points here and those here are in the red region, 35 00:01:53,500 --> 00:01:56,400 which is the region where our classifier predicts 36 00:01:56,400 --> 00:01:59,400 that the users don't buy the SUV. 37 00:01:59,433 --> 00:02:02,666 And those incorrect predictions are due specifically 38 00:02:02,666 --> 00:02:06,300 to the fact that our classifier is a linear classifier 39 00:02:06,300 --> 00:02:09,900 and because our users are not linearly separable. 40 00:02:10,266 --> 00:02:13,966 If they were linearly separable, then we would have all the green points 41 00:02:13,966 --> 00:02:17,266 here in this space and all the red points here in this space.
42 00:02:17,466 --> 00:02:19,800 And then the linear classifier with a straight line could 43 00:02:19,800 --> 00:02:23,400 perfectly separate all the red points here, and all the green points here. 44 00:02:23,833 --> 00:02:28,266 But here we have some rebellious points that are not in the wanted linear regions. 45 00:02:28,533 --> 00:02:32,200 And because our classifier has a linear straight line separator, 46 00:02:32,200 --> 00:02:36,333 that's why it has trouble catching those users here and those here. 47 00:02:36,566 --> 00:02:40,366 You can clearly see that even if you try to rotate this straight 48 00:02:40,366 --> 00:02:44,833 line here, well, you will always have some green points in the wrong category. 49 00:02:45,166 --> 00:02:50,166 For example, if we try to rotate here this way, like putting it down, well 50 00:02:50,166 --> 00:02:53,733 okay, we will catch these green points here in the right green region here. 51 00:02:54,033 --> 00:02:59,233 But since we rotated down, we will take more green users here 52 00:02:59,233 --> 00:03:04,200 because this will go up and more green users here will be in the red region. 53 00:03:04,600 --> 00:03:07,033 So that's the best separator 54 00:03:07,033 --> 00:03:09,366 the logistic regression classifier could find. 55 00:03:09,366 --> 00:03:10,766 And it couldn't do better 56 00:03:10,766 --> 00:03:14,366 because it can only be a straight line separating these two regions. 57 00:03:14,966 --> 00:03:18,000 Because to catch those users, the green users here and the green users 58 00:03:18,000 --> 00:03:21,133 here that are wrongly classified in the red category instead of the green region, 59 00:03:21,133 --> 00:03:25,433 we need to make some kind of a curve here to, you know, classify 60 00:03:25,433 --> 00:03:29,433 correctly those green users here and here and place them in the green region.
61 00:03:29,600 --> 00:03:33,866 And that would prevent our classifier from making these incorrect predictions here 62 00:03:34,000 --> 00:03:36,333 because it is a straight line. With a curve 63 00:03:36,333 --> 00:03:39,833 here, we would catch all the red users, probably in the red region 64 00:03:40,033 --> 00:03:42,566 and all the green users in the green region. 65 00:03:42,566 --> 00:03:45,100 So that would make an awesome classifier. 66 00:03:45,100 --> 00:03:45,766 And you will see 67 00:03:45,766 --> 00:03:49,633 how our nonlinear classifiers will do a terrific job in doing this. 68 00:03:49,866 --> 00:03:51,066 I can't wait to show you this. 69 00:03:52,066 --> 00:03:52,500 Okay. 70 00:03:52,500 --> 00:03:55,700 And now finally, the last very important thing to understand is that 71 00:03:56,266 --> 00:03:58,333 this is the training set. 72 00:03:58,333 --> 00:03:59,300 This is a training set. 73 00:03:59,300 --> 00:04:00,333 So that means that 74 00:04:00,333 --> 00:04:04,600 our classifier learns how to classify based on this information here. 75 00:04:04,833 --> 00:04:08,100 So I would hold my breath a few more seconds until I find out 76 00:04:08,100 --> 00:04:12,100 if our logistic regression classifier can manage to make good predictions 77 00:04:12,100 --> 00:04:16,366 of new observations, that is, to classify new users into the right regions, 78 00:04:16,633 --> 00:04:20,600 which, by the way, are fixed regions here, because these are the regions 79 00:04:20,600 --> 00:04:24,566 generated by the learning experience of our logistic regression classifier, 80 00:04:24,900 --> 00:04:28,200 and therefore won't change if we look at some new observations, 81 00:04:28,200 --> 00:04:31,033 that is, new social network users. 82 00:04:31,033 --> 00:04:34,033 And that's what we are about to find out on the test set. 83 00:04:34,166 --> 00:04:35,500 So hold on. 84 00:04:35,500 --> 00:04:37,500 So it's very simple.
85 00:04:37,500 --> 00:04:40,933 We're just going to copy all this code section here. 86 00:04:42,900 --> 00:04:44,100 Paste it here. 87 00:04:44,100 --> 00:04:46,900 And I'm just going to change the training set here 88 00:04:46,900 --> 00:04:49,466 to the test set. 89 00:04:49,466 --> 00:04:52,833 Same here, I change the training set to the test set. 90 00:04:52,833 --> 00:04:54,233 And that's all. 91 00:04:54,233 --> 00:04:56,566 That's all because I structured the code in such a way 92 00:04:56,566 --> 00:04:59,900 that we only need to change the training set into the test set here 93 00:05:00,200 --> 00:05:03,200 to plot this graph on a specific set. 94 00:05:03,366 --> 00:05:08,200 However, let's change the title here because we want to specify 95 00:05:08,200 --> 00:05:11,200 that it's the test set. And it's ready. 96 00:05:11,266 --> 00:05:14,333 So let's select this and execute. 97 00:05:16,800 --> 00:05:19,800 Let's see what happens. 98 00:05:19,966 --> 00:05:22,966 And here are the results of the test set. 99 00:05:23,400 --> 00:05:24,433 So that's not too bad. 100 00:05:24,433 --> 00:05:26,666 That's not too bad. Because as we can see 101 00:05:26,666 --> 00:05:30,033 the majority of red points are in the right region. 102 00:05:30,300 --> 00:05:34,466 That means the region predicted to be zero. And the majority of green points 103 00:05:34,466 --> 00:05:35,966 are in the right region. 104 00:05:37,166 --> 00:05:37,666 As for the 105 00:05:37,666 --> 00:05:41,400 training set, there are some observations that were incorrectly predicted. 106 00:05:41,633 --> 00:05:42,366 That's normal. 107 00:05:42,366 --> 00:05:46,166 That's because it's a linear classifier and it cannot make a curve here 108 00:05:46,200 --> 00:05:49,200 catching all the right guys. 109 00:05:49,333 --> 00:05:52,266 All right, so that's it for the interpretation of the graph. 110 00:05:52,266 --> 00:05:55,566 I can't wait to show you how we can make more powerful classifiers.
111 00:05:55,766 --> 00:05:58,866 And of course these are going to be nonlinear classifiers.
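Editor's sketch, not the lecture's own code: the two points made above (the boundary of a logistic regression classifier is a straight line in two dimensions, and the regions learned on the training set stay fixed when new users from the test set are classified) can be illustrated with a few lines of plain Python. Everything here is an assumption made for illustration: the synthetic users, the helper names `make_users`, `fit_logistic`, `predict` and `accuracy`, and the hand-written gradient descent, which stands in for whatever library the course actually uses.

```python
import math
import random

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_users(n, rng):
    # Hypothetical synthetic users with two standardized features
    # (age, estimated salary). Noise makes the classes overlap, so
    # no straight line can separate them perfectly.
    users = []
    for _ in range(n):
        age = rng.gauss(0.0, 1.0)
        salary = rng.gauss(0.0, 1.0)
        bought = 1 if age + salary + rng.gauss(0.0, 0.7) > 0 else 0
        users.append(((age, salary), bought))
    return users

def fit_logistic(data, lr=0.5, epochs=500):
    # Plain batch gradient descent on the log loss.
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in data:
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

def predict(model, x):
    # Predict 1 (green region) when the estimated probability is at least 0.5.
    w1, w2, b = model
    return 1 if sigmoid(w1 * x[0] + w2 * x[1] + b) >= 0.5 else 0

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_users(300, rng)
test = make_users(100, rng)

# The model, and therefore the two regions, is fixed by the training set only.
model = fit_logistic(train)
w1, w2, b = model

# The boundary is the straight line w1*age + w2*salary + b = 0.
print(f"boundary: {w1:.2f}*age + {w2:.2f}*salary + {b:.2f} = 0")
print(f"training accuracy: {accuracy(model, train):.2f}")
print(f"test accuracy:     {accuracy(model, test):.2f}")
```

The probability reaches 0.5 exactly when `w1*age + w2*salary + b` equals zero, which is why the separator can only be a straight line here, and why some overlapping points end up on the wrong side on both the training set and the test set.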