1 00:00:00,533 --> 00:00:03,633 So I'm going to jump back to my folder. 2 00:00:03,633 --> 00:00:06,566 Part 1, Data Preprocessing, which is here. 3 00:00:06,566 --> 00:00:07,666 And here we go. 4 00:00:07,666 --> 00:00:10,666 So the categorical data R file is here. 5 00:00:10,833 --> 00:00:13,833 So let's open it. Here it is. 6 00:00:13,933 --> 00:00:20,466 And right now what I'm going to do is I'm going to take this and copy it. 7 00:00:20,800 --> 00:00:21,933 So you have the same file. 8 00:00:21,933 --> 00:00:26,000 So you can also take it from your folder or take it from the course. 9 00:00:26,100 --> 00:00:30,600 And let's go back to our multiple linear regression file and paste 10 00:00:31,700 --> 00:00:33,266 that here. 11 00:00:33,266 --> 00:00:34,166 Okay. 12 00:00:34,166 --> 00:00:36,766 Then of course we need to change a few things here. 13 00:00:36,766 --> 00:00:40,033 So we need to change the name of our categorical variable. 14 00:00:40,200 --> 00:00:41,766 So in Part 1 it was country. 15 00:00:41,766 --> 00:00:44,733 And here it is state. 16 00:00:44,733 --> 00:00:45,500 Okay. 17 00:00:45,500 --> 00:00:46,200 Same here. 18 00:00:46,200 --> 00:00:49,200 We need to replace country with state. 19 00:00:49,600 --> 00:00:51,200 Let's not forget to align this. 20 00:00:51,200 --> 00:00:53,000 This is very important in R. 21 00:00:53,000 --> 00:00:56,000 And in any programming language. 22 00:00:56,466 --> 00:00:59,000 Here we go. 23 00:00:59,000 --> 00:01:00,766 And now we need to change the levels. 24 00:01:00,766 --> 00:01:03,566 So before, you know, the categorical variable was the countries. 25 00:01:03,566 --> 00:01:06,566 And the three categories were France, Spain and Germany. 26 00:01:06,800 --> 00:01:11,500 And here our three categories are New York, California and Florida. 27 00:01:12,800 --> 00:01:14,133 So let's do it on our dataset. 28 00:01:14,133 --> 00:01:17,133 Actually I will close that because we no longer need it. 
29 00:01:17,333 --> 00:01:21,366 So here our levels are, as we said, New York. 30 00:01:25,033 --> 00:01:28,033 California. And. 31 00:01:29,566 --> 00:01:31,566 Florida. 32 00:01:31,566 --> 00:01:32,400 Okay. 33 00:01:32,400 --> 00:01:36,000 And then the labels, that is, the numbers 34 00:01:36,000 --> 00:01:38,866 which are actually factors, the numeric factors 35 00:01:38,866 --> 00:01:42,600 that are going to replace these three texts here, New York, California and Florida, 36 00:01:42,600 --> 00:01:45,600 are these numbers you choose here for labels. 37 00:01:45,800 --> 00:01:47,166 So here we have 1, 2, 3. 38 00:01:47,166 --> 00:01:49,833 That means that New York is going to be one. 39 00:01:49,833 --> 00:01:52,833 California is going to be two and Florida is going to be three. 40 00:01:52,866 --> 00:01:57,000 You're going to see. I'm going to select this and execute. 41 00:01:58,100 --> 00:01:58,566 All right. 42 00:01:58,566 --> 00:02:01,300 And now let's look at our dataset. 43 00:02:01,300 --> 00:02:05,100 As you can see, the state is now encoded with the values 1, 2, 3. 44 00:02:05,200 --> 00:02:09,000 So one for New York, two for California and three for Florida. 45 00:02:09,800 --> 00:02:11,733 Let's go back. Okay. 46 00:02:11,733 --> 00:02:13,266 So the encoding is done. 47 00:02:13,266 --> 00:02:15,966 And that's a much better thing for our model. 48 00:02:15,966 --> 00:02:19,033 Now our model has a greater chance to work. 49 00:02:19,500 --> 00:02:23,066 And now the last thing we need to do is to split the dataset 50 00:02:23,066 --> 00:02:25,500 into the training set and the test set. 51 00:02:25,500 --> 00:02:29,300 So here let's not forget to change the name of the dependent variable here, 52 00:02:29,766 --> 00:02:32,466 which is not purchased but profit. 53 00:02:34,500 --> 00:02:35,466 All right. 54 00:02:35,466 --> 00:02:38,433 And then we need to change the split ratio if necessary. 55 00:02:38,433 --> 00:02:40,000 Let's see. We have 50 observations. 
56 00:02:40,000 --> 00:02:42,000 So a good split would be to have 57 00:02:42,000 --> 00:02:45,600 40 observations in the training set and 10 observations in the test set. 58 00:02:45,900 --> 00:02:50,833 So that actually makes an 80% split ratio, with 80% going to the training set. 59 00:02:51,200 --> 00:02:53,600 And this is already what we have. Perfect. 60 00:02:53,600 --> 00:02:56,233 So we don't have to do anything here for the split ratio. 61 00:02:56,233 --> 00:02:59,233 And we are ready to take all of this. 62 00:02:59,400 --> 00:03:02,100 And execute. 63 00:03:02,100 --> 00:03:03,900 And here we go. 64 00:03:03,900 --> 00:03:06,900 Let's have a look at our training set and our test set. 65 00:03:09,366 --> 00:03:09,900 Here it is. 66 00:03:09,900 --> 00:03:11,333 That's the training set. 67 00:03:11,333 --> 00:03:11,566 Okay. 68 00:03:11,566 --> 00:03:14,566 So it contains 40 observations. 69 00:03:14,633 --> 00:03:15,600 Great. 70 00:03:15,600 --> 00:03:17,700 We have our encoded variable for state. 71 00:03:17,700 --> 00:03:18,966 That's perfect. 72 00:03:18,966 --> 00:03:22,300 And then a test set that contains 10 observations. 73 00:03:22,566 --> 00:03:24,700 And everything looks fine. 74 00:03:24,700 --> 00:03:28,433 All right, so let's go back to multiple linear regression. 75 00:03:29,466 --> 00:03:31,600 And the last step is feature scaling. 76 00:03:31,600 --> 00:03:35,433 But as for simple linear regression, we won't need to apply 77 00:03:35,433 --> 00:03:37,033 feature scaling manually. 78 00:03:37,033 --> 00:03:40,466 This will be taken care of by the function that we're going to use 79 00:03:40,466 --> 00:03:43,800 to fit multiple linear regression to our training set. 80 00:03:44,166 --> 00:03:46,166 So we're all fine. We're all good here. 81 00:03:46,166 --> 00:03:48,233 We are ready to move on to the next step. 82 00:03:48,233 --> 00:03:51,133 And that's what we're going to do in the next tutorial. 
83 00:03:51,133 --> 00:03:54,566 Thank you for watching this one and I look forward to seeing you in the next one. 84 00:03:55,133 --> 00:03:58,133 Until then, enjoy machine learning.
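For readers following along without the video, the two steps walked through above (encoding the state variable with levels and labels, then an 80/20 split of 50 observations) might look roughly like the sketch below. It is only an illustration under stated assumptions: the data frame is made up, the column names `State` and `Profit` are guesses at how the columns are spelled, and the split uses a plain base R random draw because the transcript does not name the splitting function used on screen.

```r
# Made-up stand-in for the course dataset: 50 rows with a State column
# and a Profit column. Values and column spellings are assumptions.
set.seed(123)
dataset <- data.frame(
  State  = rep(c("New York", "California", "Florida"), length.out = 50),
  Profit = round(runif(50, min = 50000, max = 200000), 2),
  stringsAsFactors = FALSE
)

# Encode the categorical variable: each level is replaced by its label,
# so New York becomes 1, California 2 and Florida 3.
dataset$State <- factor(dataset$State,
                        levels = c("New York", "California", "Florida"),
                        labels = c(1, 2, 3))

# 80/20 split with base R (assumption: a simple random draw stands in
# for whatever splitting helper the video uses).
n_train      <- round(0.8 * nrow(dataset))
train_idx    <- sample(seq_len(nrow(dataset)), size = n_train)
training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]

nrow(training_set)  # 40 rows
nrow(test_set)      # 10 rows
```

The labels passed to `factor()` are stored as the factor's level names, so the column remains a factor rather than becoming a numeric vector, which matches the remark that the numbers "are actually factors".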