1 00:00:00,533 --> 00:00:01,333 Hello and 2 00:00:01,333 --> 00:00:05,300 welcome to the final tutorial of Part One, Data Pre-processing. 3 00:00:05,600 --> 00:00:10,766 We finally completed all the steps there are to prepare any data 4 00:00:10,766 --> 00:00:13,766 sets on which we will build our machine learning models. 5 00:00:13,800 --> 00:00:14,400 And now all 6 00:00:14,400 --> 00:00:17,933 we have left to do is to prepare the data pre-processing template. 7 00:00:18,200 --> 00:00:21,433 Because we learned a few things: we learned how to import a data 8 00:00:21,433 --> 00:00:25,766 set, to take care of missing data, to encode categorical data, 9 00:00:26,000 --> 00:00:30,066 to split the data set into the training set and the test set, and to apply feature 10 00:00:30,066 --> 00:00:33,066 scaling to put all of our variables on the same scale. 11 00:00:33,433 --> 00:00:35,566 However, in the data pre-processing template, 12 00:00:35,566 --> 00:00:37,533 we're not going to include all of these. 13 00:00:37,533 --> 00:00:40,600 We are only going to include importing the libraries. 14 00:00:41,066 --> 00:00:43,733 Then of course we need to import the data set. 15 00:00:43,733 --> 00:00:46,866 Then regarding missing data, I just wanted to show you how to take care 16 00:00:46,866 --> 00:00:49,866 of that in case you have some missing data in your data sets, 17 00:00:50,166 --> 00:00:51,333 you know, in your work. 18 00:00:51,333 --> 00:00:53,700 So we are not going to include this in the template. 19 00:00:53,700 --> 00:00:55,800 It's just good to know how to take care of this. 20 00:00:55,800 --> 00:00:58,533 But then we will focus on the machine learning models themselves. 21 00:00:58,533 --> 00:01:02,366 And if in your work experience you encounter missing data, 22 00:01:02,366 --> 00:01:03,766 you know how to handle this, 23 00:01:03,766 --> 00:01:06,066 any missing data issues.
24 00:01:06,066 --> 00:01:09,600 As for categorical data, we won't include it in the template either 25 00:01:09,600 --> 00:01:11,100 because we are going to find 26 00:01:11,100 --> 00:01:15,300 very few examples of data where we have to encode the data. 27 00:01:15,300 --> 00:01:16,833 There will be some examples, 28 00:01:16,833 --> 00:01:20,333 but we won't include it in the template because our data sets are going 29 00:01:20,333 --> 00:01:21,300 to be well prepared 30 00:01:21,300 --> 00:01:25,566 so that we can mainly focus on the machine learning models and have the maximum fun. 31 00:01:26,200 --> 00:01:29,300 And then of course, we will include this in the template: 32 00:01:29,300 --> 00:01:32,300 splitting the data set into the training set and the test set, 33 00:01:32,400 --> 00:01:34,300 because that's a very important step. 34 00:01:34,300 --> 00:01:35,600 You need to split your data 35 00:01:35,600 --> 00:01:40,233 set between training and test, because you need to evaluate your model 36 00:01:40,233 --> 00:01:45,700 on a different set than the set on which you build your model. And feature scaling. 37 00:01:45,700 --> 00:01:47,066 Okay, so feature scaling. 38 00:01:47,066 --> 00:01:50,966 I hesitated to include it in the template, but we are going to include it, 39 00:01:51,233 --> 00:01:54,933 only we're going to put that as a comment, because you're going to see that 40 00:01:55,266 --> 00:01:59,366 there are several libraries in R and Python, and some of them require us 41 00:01:59,366 --> 00:02:03,000 to apply feature scaling to the data, and some of them don't. 42 00:02:03,400 --> 00:02:05,100 So most of them don't require it. 43 00:02:05,100 --> 00:02:07,666 Actually most of them take care of that for you. 44 00:02:07,666 --> 00:02:09,300 You don't have to do it manually, 45 00:02:09,300 --> 00:02:12,600 but you will see that some libraries don't apply the feature scaling.
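To make the two steps kept in the template more concrete, here is a minimal sketch in plain Python, using only the standard library. It is not the course's template code, which relies on library functions; the helper names `split_train_test`, `fit_scaler`, and `scale`, and the `ages` values, are hypothetical and made up for illustration. It shows the point made above: the model is evaluated on rows it was not built on, and the scaling statistics are learned from the training set only and then reused on the test set.

```python
import random
import statistics


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a test set (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_column):
    """Return the mean and standard deviation of the training values only."""
    return statistics.mean(train_column), statistics.pstdev(train_column)


def scale(column, mean, std):
    """Standardize values using statistics learned on the training set."""
    return [(value - mean) / std for value in column]


# Made-up illustrative values, not taken from the course data set.
ages = [44, 27, 30, 38, 40, 35, 48, 50, 37, 29]

train, test = split_train_test(ages, test_fraction=0.2, seed=0)

# Feature scaling: fit on the training set, then apply to both sets.
mean, std = fit_scaler(train)
train_scaled = scale(train, mean, std)
test_scaled = scale(test, mean, std)
```

After scaling, the training values have mean 0 and standard deviation 1; the test values are transformed with the same mean and standard deviation, so they will generally not be exactly centered.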
46 00:02:12,933 --> 00:02:13,500 I won't tell you 47 00:02:13,500 --> 00:02:17,066 which models they are right now, because I will let you discover the surprise. 48 00:02:17,266 --> 00:02:18,900 But keep that in mind 49 00:02:18,900 --> 00:02:22,100 and you will see that sometimes we will have to use feature scaling. 50 00:02:22,666 --> 00:02:24,600 Okay, so let's start making the template. 51 00:02:24,600 --> 00:02:25,966 It's going to be very quick. 52 00:02:25,966 --> 00:02:28,300 Let's do it right now. Let's jump to R. 53 00:02:28,300 --> 00:02:28,733 All right. 54 00:02:28,733 --> 00:02:31,600 So here are all the steps that we did together. 55 00:02:31,600 --> 00:02:33,166 So importing the data set: 56 00:02:33,166 --> 00:02:36,933 yes, we're keeping it, of course. Taking care of missing data: 57 00:02:36,933 --> 00:02:40,433 that was just to show you. We won't need it in the future. 58 00:02:40,633 --> 00:02:42,766 Well, you will probably need it in your future, 59 00:02:42,766 --> 00:02:45,300 but in the future of this course, we won't need it. 60 00:02:45,300 --> 00:02:46,300 So we will remove it. 61 00:02:47,400 --> 00:02:48,733 Encoding categorical data: 62 00:02:48,733 --> 00:02:49,466 same here. 63 00:02:49,466 --> 00:02:53,400 We will have to do it once or twice, but not all the time. 64 00:02:53,400 --> 00:02:54,666 So we will remove it. 65 00:02:54,666 --> 00:02:57,500 And as I told you, I'm putting this in a separate file 66 00:02:57,500 --> 00:03:00,500 so that you can still use it if you need it for your work. 67 00:03:01,866 --> 00:03:03,166 All right. 68 00:03:03,166 --> 00:03:05,100 And now I think we have everything. 69 00:03:05,100 --> 00:03:07,533 Splitting the data set into the training set and test set: 70 00:03:07,533 --> 00:03:12,300 of course we are keeping that. And feature scaling: 71 00:03:12,700 --> 00:03:15,166 as we did for Python, we're going to put that in a comment.
72 00:03:15,166 --> 00:03:18,066 So, the three double quotes 73 00:03:18,066 --> 00:03:21,066 are not a way to write multiline comments in R. 74 00:03:21,366 --> 00:03:22,866 So unfortunately we cannot do that. 75 00:03:22,866 --> 00:03:25,466 What we need to do instead, and this is a good trick: 76 00:03:25,466 --> 00:03:29,900 you select the two lines here and then you press Command or Control 77 00:03:30,266 --> 00:03:33,266 plus Shift plus C. 78 00:03:33,766 --> 00:03:36,200 And that puts all your lines in comments. 79 00:03:36,200 --> 00:03:39,500 And finally, we'll just add one last touch to this template, 80 00:03:39,700 --> 00:03:44,133 which is for the case where we need to take a subset of our data set. 81 00:03:44,533 --> 00:03:47,633 And in that case we will add the line: dataset 82 00:03:48,600 --> 00:03:49,666 equals, 83 00:03:49,666 --> 00:03:52,666 then we take the same dataset, 84 00:03:53,366 --> 00:03:56,133 and we use the square brackets to select 85 00:03:56,133 --> 00:03:59,200 the indexes of the columns we are interested in to build our model. 86 00:03:59,466 --> 00:04:01,833 So let's just put some numbers here. 87 00:04:01,833 --> 00:04:04,833 And if we need to select specific columns of interest 88 00:04:04,833 --> 00:04:08,100 in our future data sets, we will change those indexes. 89 00:04:08,633 --> 00:04:11,633 But for now let's just put this line in a comment. 90 00:04:13,233 --> 00:04:14,500 All right. 91 00:04:14,500 --> 00:04:15,000 All right. 92 00:04:15,000 --> 00:04:17,900 So the template is ready in R as well. 93 00:04:17,900 --> 00:04:20,166 We are ready to start our machine learning models. 94 00:04:20,166 --> 00:04:22,266 You have no idea how excited I am about that. 95 00:04:22,266 --> 00:04:24,600 I can't wait to show you; it's going to be a lot of fun. 96 00:04:24,600 --> 00:04:26,766 We're going to make awesome predictions. 97 00:04:26,766 --> 00:04:27,366 You know, we will.
98 00:04:27,366 --> 00:04:27,900 We will have 99 00:04:27,900 --> 00:04:31,500 some kind of scenarios, business problems, and we will have to solve them. 100 00:04:31,666 --> 00:04:34,700 And then we will use the power of machine learning to solve these problems. 101 00:04:35,033 --> 00:04:37,966 You will see how powerful machine learning is. 102 00:04:37,966 --> 00:04:40,966 It gives you amazing predictions, amazing results. 103 00:04:41,000 --> 00:04:44,466 And keep in mind you will never be lost because we will always know what 104 00:04:44,466 --> 00:04:46,766 we are coding. We will always visualize it. 105 00:04:46,766 --> 00:04:49,500 There will always be some visual graphs 106 00:04:49,500 --> 00:04:52,533 of what we are doing when we make every machine learning model. 107 00:04:52,833 --> 00:04:55,166 So I can't wait to start this with you. 108 00:04:55,166 --> 00:04:58,000 This was the boring part, but a very important part to do, 109 00:04:58,000 --> 00:05:01,400 so it's great that we did it, and now it's time to have fun. 110 00:05:02,033 --> 00:05:05,400 So thank you for watching these data pre-processing tutorials. 111 00:05:05,400 --> 00:05:08,100 Congratulations on completing this part. 112 00:05:08,100 --> 00:05:10,433 You are now ready to make machine learning models, 113 00:05:10,433 --> 00:05:13,033 and I look forward to seeing you in the next part. 114 00:05:13,033 --> 00:05:14,833 Until then, enjoy machine learning.
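As a recap of the optional column-subset step described in this tutorial, here is a minimal plain-Python sketch of the same idea. The tutorial itself does this in R with square brackets on the data set; the helper `select_columns`, the rows, and the chosen indexes below are hypothetical and made up for illustration. Note that R counts columns from 1 while Python counts from 0, so the second and third columns are indexes 1 and 2 here.

```python
def select_columns(rows, indexes):
    """Keep only the columns at the given zero-based indexes (hypothetical helper)."""
    return [[row[i] for i in indexes] for row in rows]


# Made-up illustrative rows: country, age, salary, purchased.
dataset = [
    ["France", 44, 72000, "No"],
    ["Spain", 27, 48000, "Yes"],
    ["Germany", 30, 54000, "No"],
]

# In the template this step would stay commented out until a data set
# actually needs it; the indexes are then changed to the columns of interest.
subset = select_columns(dataset, [1, 2])
print(subset)
```

Keeping the step in the template as a comment means it costs nothing when the data set is already well prepared, and only the indexes have to be edited when it is needed.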