1 00:00:00,133 --> 00:00:02,800 Hello and welcome to this R tutorial. 2 00:00:02,800 --> 00:00:06,666 In the previous tutorials we implemented our logistic regression model in Python. 3 00:00:06,900 --> 00:00:09,100 And this time we're going to do it in R. 4 00:00:09,100 --> 00:00:12,600 So the first thing that we need to do is to set a working directory. 5 00:00:13,000 --> 00:00:14,566 Right now I'm on my desktop. 6 00:00:14,566 --> 00:00:17,366 So let's go to the Machine Learning A-Z folder. 7 00:00:17,366 --> 00:00:21,333 Part three, Classification, section Logistic Regression. 8 00:00:21,800 --> 00:00:23,900 And here we are in the right folder. 9 00:00:23,900 --> 00:00:26,400 So that's the folder to set as working directory. 10 00:00:26,400 --> 00:00:28,633 Let's make sure we have the Social Network Ads 11 00:00:28,633 --> 00:00:31,300 CSV file. All good. 12 00:00:31,300 --> 00:00:35,166 So I'm going to click on this More button here to set this folder 13 00:00:35,166 --> 00:00:36,666 as working directory. 14 00:00:36,666 --> 00:00:37,500 And here we go. 15 00:00:37,500 --> 00:00:38,366 Everything is fine. 16 00:00:38,366 --> 00:00:40,866 We're ready to start making the model. 17 00:00:40,866 --> 00:00:44,266 So the first step as usual is to preprocess the data. 18 00:00:44,266 --> 00:00:47,400 And to do this of course we're going to use our data 19 00:00:47,400 --> 00:00:50,400 pre-processing template that we made in part one. 20 00:00:50,566 --> 00:00:53,033 So I'm going to select all this, 21 00:00:54,133 --> 00:00:55,166 copy. 22 00:00:55,166 --> 00:00:59,100 And then I'm going to go back to my logistic regression file to paste it here. 23 00:00:59,800 --> 00:01:00,666 All right. 24 00:01:00,666 --> 00:01:04,766 And now we just need to change a few things to preprocess our data. 25 00:01:04,800 --> 00:01:07,800 That's of course the name of the data set here 26 00:01:07,800 --> 00:01:10,800 which is Social 27 00:01:11,200 --> 00:01:14,200 Network Ads. 
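As a rough sketch of the import step just described, the snippet below reads a CSV into a data frame with `read.csv`. The real file is not reproduced here: the three rows are made-up placeholders written to a temporary file so the example runs on its own, and the column names are assumptions based on the columns mentioned in the video.

```r
# Toy stand-in for the real CSV: three made-up rows in a temporary file.
# Column names are assumed, not copied from the actual data set.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "User.ID,Gender,Age,EstimatedSalary,Purchased",
  "1,Male,19,19000,0",
  "2,Female,35,20000,0",
  "3,Female,47,25000,1"
), csv_path)

# In the tutorial's session the path would instead be the CSV file
# sitting in the working directory that was just set.
dataset <- read.csv(csv_path)

print(dim(dataset))    # number of rows and columns
print(names(dataset))  # column names as R sees them
```

Printing the dimensions and names right after the import is a quick way to confirm the file was read as expected before changing anything else in the template.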
28 00:01:14,666 --> 00:01:15,600 All right. 29 00:01:15,600 --> 00:01:17,500 And then we need to change a few more things. 30 00:01:17,500 --> 00:01:21,600 But first let's select this line to see what our data set looks like. 31 00:01:22,000 --> 00:01:23,033 So Command or Control plus 32 00:01:23,033 --> 00:01:24,666 Enter to execute. 33 00:01:24,666 --> 00:01:27,266 Here we go. The data set is well imported. 34 00:01:27,266 --> 00:01:29,100 Let's click on that. 35 00:01:29,100 --> 00:01:30,800 And here's the data set. 36 00:01:30,800 --> 00:01:33,300 So as a quick reminder this data set 37 00:01:33,300 --> 00:01:37,100 contains information about users of a social network. 38 00:01:37,433 --> 00:01:42,266 This information is the user ID, the gender, the age and the estimated salary. 39 00:01:42,800 --> 00:01:45,900 And the social network has several business clients. 40 00:01:46,233 --> 00:01:47,866 And these business clients 41 00:01:47,866 --> 00:01:51,833 put ads on the social network for marketing campaign purposes. 42 00:01:52,500 --> 00:01:56,900 And one of their business clients is a car company that has just launched 43 00:01:56,900 --> 00:02:00,000 its brand new luxury SUV at a ridiculous price. 44 00:02:00,800 --> 00:02:05,533 So this car company put ads for their new SUV product on the social network. 45 00:02:05,833 --> 00:02:08,966 And then the social network gathered some information about 46 00:02:09,300 --> 00:02:11,966 which users responded positively 47 00:02:11,966 --> 00:02:15,266 to the ad by buying the product 48 00:02:15,600 --> 00:02:19,266 and those who responded negatively by not buying the product. 49 00:02:19,300 --> 00:02:21,433 So that's what the last column is about. 50 00:02:21,433 --> 00:02:25,566 The last column tells for each user if the user bought the car, 51 00:02:25,900 --> 00:02:29,600 and then it's a one, or didn't buy the car, and then it's a zero. 52 00:02:30,466 --> 00:02:30,800 All right. 
53 00:02:30,800 --> 00:02:32,733 So that's the business problem itself. 54 00:02:32,733 --> 00:02:36,200 And now our mission is to make a logistic regression model 55 00:02:36,333 --> 00:02:40,166 that will try to understand the correlations between information 56 00:02:40,266 --> 00:02:44,333 such as the age and the salary, and the decision of the user to buy, 57 00:02:44,333 --> 00:02:46,666 yes or no, the SUV. 58 00:02:46,666 --> 00:02:46,966 All right. 59 00:02:46,966 --> 00:02:49,500 So let's go back to our logistic regression model. 60 00:02:49,500 --> 00:02:51,966 And let's see what we need to change next. 61 00:02:51,966 --> 00:02:56,533 So this line is to select the variables we want to train our model with. 62 00:02:57,000 --> 00:02:59,700 So as I just said we're going to train our model 63 00:02:59,700 --> 00:03:01,966 with only the age and the salary. 64 00:03:01,966 --> 00:03:05,166 So that means that we want to predict if the user is going to buy the SUV 65 00:03:05,500 --> 00:03:08,500 based on only the age and the salary. 66 00:03:08,500 --> 00:03:11,600 So here we will need to select the indexes of the columns 67 00:03:11,600 --> 00:03:13,300 we want to take for our model. 68 00:03:13,300 --> 00:03:15,100 So I'm going to uncomment that line. 69 00:03:16,200 --> 00:03:18,266 And let's look at the indexes. Okay. 70 00:03:18,266 --> 00:03:20,166 So indexes in R start at one. 71 00:03:20,166 --> 00:03:23,400 So that's one, two, three, four and five. 72 00:03:23,600 --> 00:03:27,000 So we only want to take the indexes three, four and five. 73 00:03:27,566 --> 00:03:29,300 So let's do this. 74 00:03:29,300 --> 00:03:33,333 We're going to choose from 3 to 5. 75 00:03:33,900 --> 00:03:34,266 All right. 76 00:03:34,266 --> 00:03:37,533 So now let's select this and execute. 77 00:03:38,433 --> 00:03:39,000 All right. 
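The column selection described above can be sketched as follows. Because R indexes from one, `3:5` picks the third, fourth and fifth columns. The data frame here is a made-up stand-in with assumed column names, not the real data set.

```r
# Made-up stand-in for the imported data set (values and names are assumptions).
dataset <- data.frame(
  User.ID         = c(1, 2, 3, 4),
  Gender          = c("Male", "Female", "Female", "Male"),
  Age             = c(19, 35, 47, 52),
  EstimatedSalary = c(19000, 20000, 25000, 90000),
  Purchased       = c(0, 0, 1, 1)
)

# Keep only columns 3 to 5 (R indexes start at 1, so these are
# Age, EstimatedSalary and Purchased).
dataset <- dataset[, 3:5]

print(names(dataset))
```

Selecting by position is short but fragile if the column order ever changes; selecting by name, for example `dataset[, c("Age", "EstimatedSalary", "Purchased")]`, is one alternative.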
78 00:03:39,000 --> 00:03:41,066 And now if we go back to our data set 79 00:03:41,066 --> 00:03:43,900 you can see that we only have our three columns of interest 80 00:03:43,900 --> 00:03:46,900 which are the age, the salary and purchased. 81 00:03:47,733 --> 00:03:49,233 Okay. 82 00:03:49,233 --> 00:03:50,100 Now next step. 83 00:03:50,100 --> 00:03:53,700 Next step is to split the data set into the training set and the test set. 84 00:03:54,333 --> 00:03:57,400 And here the only thing we need to change is the split ratio. 85 00:03:57,600 --> 00:04:00,600 Or maybe not. But we have 400 observations. 86 00:04:00,600 --> 00:04:04,100 I think a good split would be to have 300 observations 87 00:04:04,100 --> 00:04:07,133 in the training set, and 100 observations in the test set. 88 00:04:08,000 --> 00:04:11,500 And to do this we need to take 0.75. 89 00:04:12,166 --> 00:04:15,166 That is 75% going to the training set. 90 00:04:15,300 --> 00:04:17,366 That is 300 observations. 91 00:04:17,366 --> 00:04:17,700 Okay. 92 00:04:17,700 --> 00:04:21,633 So let's select this and Command or Control plus Enter to execute. 93 00:04:22,400 --> 00:04:23,200 Here we go. 94 00:04:23,200 --> 00:04:26,233 Now let's look at our training set and our test set. 95 00:04:28,500 --> 00:04:28,866 All right. 96 00:04:28,866 --> 00:04:31,866 So our training set as you can see has 300 observations. 97 00:04:32,233 --> 00:04:35,966 And the test set has 100 observations listed here. 98 00:04:36,433 --> 00:04:37,666 Perfect. 99 00:04:37,666 --> 00:04:39,400 Now let's go back to our logistic regression 100 00:04:39,400 --> 00:04:42,400 and take care of the next step which is the feature scaling. 101 00:04:42,933 --> 00:04:46,300 So for classification it's better to do feature scaling. 102 00:04:46,600 --> 00:04:47,500 So we're going to do it. 
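The 75/25 split described above, 300 of 400 observations for training and 100 for testing, can be illustrated with a minimal sketch. The transcript does not show which splitting function the template uses, so this version swaps in base R's `sample` as a plain random split. The data is randomly generated placeholder data and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Placeholder data with 400 rows, matching the number of observations
# mentioned in the video (the values themselves are random, not real).
n <- 400
dataset <- data.frame(
  Age             = sample(18:60, n, replace = TRUE),
  EstimatedSalary = sample(seq(15000, 150000, by = 1000), n, replace = TRUE),
  Purchased       = sample(0:1, n, replace = TRUE)
)

split_ratio <- 0.75  # 75% of the rows go to the training set

# Draw 300 distinct row indexes at random for the training set.
train_idx <- sample(seq_len(n), size = round(split_ratio * n))

training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]  # the remaining 100 rows

print(nrow(training_set))
print(nrow(test_set))
```

A plain random split like this does not try to keep the proportion of ones and zeros in `Purchased` equal across the two sets; a stratified splitter would.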
103 00:04:47,500 --> 00:04:50,500 We're going to remove those comments here by pressing 104 00:04:50,666 --> 00:04:53,666 Command or Control plus Shift plus C. 105 00:04:53,800 --> 00:04:54,800 All right. 106 00:04:54,800 --> 00:04:57,800 And now let's check that we have the right indexes. 107 00:04:57,800 --> 00:05:00,800 Here we have indexes two and three. 108 00:05:01,633 --> 00:05:04,200 Here the dependent variable is categorical. 109 00:05:04,200 --> 00:05:05,866 So we will only scale these. 110 00:05:05,866 --> 00:05:08,133 And that's indexes one and two. 111 00:05:08,133 --> 00:05:10,666 So let's go back to our logistic regression model. 112 00:05:10,666 --> 00:05:14,133 And so here we need to choose 1 to 2. 113 00:05:14,966 --> 00:05:17,233 So let's do it for all four. 114 00:05:18,200 --> 00:05:22,166 One to two, one to two 115 00:05:22,966 --> 00:05:25,966 and one to two. 116 00:05:26,166 --> 00:05:30,233 All right, let's select this, and Command or Control plus Enter to execute. 117 00:05:31,033 --> 00:05:32,233 And here we go. 118 00:05:32,233 --> 00:05:34,300 Now let's have a look at our training set. 119 00:05:34,300 --> 00:05:37,300 Yep. The age and the salary are perfectly scaled. 120 00:05:37,366 --> 00:05:39,433 And same for the test set. 121 00:05:39,433 --> 00:05:41,433 Perfect. Perfectly scaled. 122 00:05:41,433 --> 00:05:44,966 All right, so we are done with the data pre-processing phase. 123 00:05:45,133 --> 00:05:47,700 Now our data set is well pre-processed. 124 00:05:47,700 --> 00:05:49,233 So that's the end of this tutorial. 125 00:05:49,233 --> 00:05:53,133 I can't wait to build this logistic regression model 126 00:05:53,433 --> 00:05:56,666 on our now-prepared data set in the next tutorials. 127 00:05:57,100 --> 00:06:00,100 Until then, enjoy machine learning.
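To illustrate the feature scaling step, here is a small sketch that standardizes only columns 1 and 2 (age and salary) and leaves the categorical `Purchased` column alone. It uses base R's `scale`, applied to the training set and the test set separately as the four "one to two" edits suggest. The rows are made-up placeholders and the column names are assumptions.

```r
# Made-up training and test rows (values and column names are assumptions).
training_set <- data.frame(
  Age             = c(19, 35, 26, 27, 45, 52),
  EstimatedSalary = c(19000, 20000, 43000, 57000, 76000, 150000),
  Purchased       = c(0, 0, 0, 0, 1, 1)
)
test_set <- data.frame(
  Age             = c(30, 41, 58),
  EstimatedSalary = c(87000, 52000, 144000),
  Purchased       = c(0, 0, 1)
)

# Standardize columns 1 and 2 only: subtract each column's mean and
# divide by its standard deviation. Column 3 (Purchased) is untouched.
training_set[, 1:2] <- scale(training_set[, 1:2])
test_set[, 1:2]     <- scale(test_set[, 1:2])

print(training_set)
print(test_set)
```

After this, each scaled column has mean 0 and standard deviation 1 within its own set. A common alternative is to scale the test set with the training set's mean and standard deviation instead of its own.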