1 00:00:00,033 --> 00:00:00,633 So there you go. 2 00:00:00,633 --> 00:00:03,566 That's the whole implementation. Let me scroll down a bit. 3 00:00:03,566 --> 00:00:05,233 And as you can see, you will 4 00:00:05,233 --> 00:00:06,300 get many tools. 5 00:00:06,300 --> 00:00:10,033 So let's have a look at them one by one through the table of contents. 6 00:00:10,500 --> 00:00:12,700 So the first thing I'll teach you is how 7 00:00:12,700 --> 00:00:14,400 to import the libraries. 8 00:00:14,400 --> 00:00:16,900 These are the libraries we will always use 9 00:00:16,900 --> 00:00:19,333 in any machine learning model implementation. 10 00:00:19,333 --> 00:00:21,700 So we will include them in the template 11 00:00:21,700 --> 00:00:23,066 so that they can be ready to 12 00:00:23,066 --> 00:00:25,333 use for our implementations. 13 00:00:25,333 --> 00:00:27,566 Then I will teach you how to import 14 00:00:27,566 --> 00:00:31,033 the data set, that exact same data set which I 15 00:00:31,033 --> 00:00:33,366 just introduced to you a few seconds ago. 16 00:00:33,366 --> 00:00:36,466 That data set, I will show you how to first upload 17 00:00:36,466 --> 00:00:40,000 it on Google Colab, and then import it in your Python file. 18 00:00:40,333 --> 00:00:40,666 All right. 19 00:00:40,666 --> 00:00:43,333 So that's the second step, importing the data set. 20 00:00:43,333 --> 00:00:45,466 Then I will teach you how to take care 21 00:00:45,466 --> 00:00:46,500 of missing data, 22 00:00:46,500 --> 00:00:48,466 because indeed in most of the 23 00:00:48,466 --> 00:00:50,533 data sets you work with in 24 00:00:50,533 --> 00:00:54,000 your machine learning career, you might encounter some missing data. 25 00:00:54,300 --> 00:00:56,600 And that's the case here for our data set. 26 00:00:56,600 --> 00:01:00,600 As you can see, there is a missing salary here and a missing age. 27 00:01:00,766 --> 00:01:02,866 And I will teach you exactly what to
28 00:01:02,866 --> 00:01:04,466 do in order to handle this. 29 00:01:04,466 --> 00:01:06,633 There are some techniques which are 30 00:01:06,633 --> 00:01:08,400 the most relevant to then, you know, 31 00:01:08,400 --> 00:01:11,000 optimize the training of your machine learning models. 32 00:01:11,000 --> 00:01:13,700 And I will show you the best of these techniques. 33 00:01:13,700 --> 00:01:15,066 All right. Then, after 34 00:01:15,066 --> 00:01:17,800 taking care of the missing data, I will teach you how 35 00:01:17,800 --> 00:01:18,900 to encode 36 00:01:18,900 --> 00:01:21,300 categorical data, whether it is the 37 00:01:21,300 --> 00:01:22,333 independent variables, 38 00:01:22,333 --> 00:01:24,700 which are your predictors, or the 39 00:01:24,700 --> 00:01:27,433 dependent variable, which is what you want to predict. 40 00:01:27,433 --> 00:01:29,833 So as you can see in our data 41 00:01:29,833 --> 00:01:32,866 set, well, we have actually two categorical variables. 42 00:01:32,866 --> 00:01:35,633 We have this one, the country column, containing 43 00:01:35,633 --> 00:01:37,066 these three categories, 44 00:01:37,066 --> 00:01:39,333 France, Spain and Germany, as 45 00:01:39,333 --> 00:01:40,833 opposed to all these numerical 46 00:01:40,833 --> 00:01:42,666 values in the other variables. 47 00:01:42,666 --> 00:01:44,666 And also we have this categorical 48 00:01:44,666 --> 00:01:46,500 variable containing two categories, 49 00:01:46,500 --> 00:01:47,800 yes and no. 50 00:01:47,800 --> 00:01:50,033 All right. So I will teach you what to do with that 51 00:01:50,033 --> 00:01:52,233 situation so that then you can be ready 52 00:01:52,233 --> 00:01:54,400 to preprocess any data. 53 00:01:54,400 --> 00:01:58,166 After this I will teach you how to split the data set 54 00:01:58,166 --> 00:02:00,033 into the training set and the test set.
55 00:02:00,033 --> 00:02:03,566 And that step is very important because each time you want to train a 56 00:02:03,566 --> 00:02:04,700 machine learning model, 57 00:02:04,700 --> 00:02:06,800 well, you have to create two separate sets: 58 00:02:06,800 --> 00:02:08,000 one training set 59 00:02:08,000 --> 00:02:09,166 where you're going to 60 00:02:09,166 --> 00:02:12,633 train your machine learning model to understand the correlations inside your 61 00:02:12,633 --> 00:02:13,366 data set, 62 00:02:13,366 --> 00:02:14,266 and one test 63 00:02:14,266 --> 00:02:17,333 set that you're going to use to evaluate your 64 00:02:17,366 --> 00:02:18,266 machine learning models, 65 00:02:18,266 --> 00:02:20,700 and that's therefore on new observations, 66 00:02:20,700 --> 00:02:21,666 because the test set, 67 00:02:21,666 --> 00:02:25,300 you know, is like new data on which the model wasn't trained. 68 00:02:25,300 --> 00:02:29,266 So it's very important to do this in order to check that indeed there is no 69 00:02:29,266 --> 00:02:30,033 overfitting, 70 00:02:30,033 --> 00:02:32,633 you know, when your machine learning model is trained too 71 00:02:32,633 --> 00:02:34,266 well on the training set, 72 00:02:34,266 --> 00:02:35,666 so well that 73 00:02:35,666 --> 00:02:38,100 it doesn't perform well on new observations. 74 00:02:38,100 --> 00:02:40,166 And so that's why this step is very important. 75 00:02:40,166 --> 00:02:43,166 And we will include it in the data preprocessing template. 76 00:02:43,466 --> 00:02:44,133 And finally, 77 00:02:44,133 --> 00:02:45,000 I will teach you a 78 00:02:45,000 --> 00:02:46,300 very important tool 79 00:02:46,300 --> 00:02:50,233 that you might have to use in some of your machine learning model implementations, 80 00:02:50,466 --> 00:02:52,466 which is feature scaling, 81 00:02:52,466 --> 00:02:55,733 which actually scales all your features 82 00:02:55,733 --> 00:02:58,300 to make sure they are on the right scale.
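[Editor's sketch] The two steps just described, splitting into a training set and a test set and then scaling the features, might look roughly like the following. This is a minimal plain-Python illustration, not the course's notebook code, which is not shown in this transcript. The function names, the 20% test fraction, the choice of standardization as the scaling method, and the (age, salary) rows are all assumptions made up for the example.

```python
import random
from statistics import mean, pstdev


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(train_rows):
    """Compute per-column mean and standard deviation from the training set only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return means, stds


def standardize(rows, means, stds):
    """Apply (x - mean) / std to every column of every row."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical (age, salary) rows, invented for illustration.
data = [[44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, 63000],
        [35, 58000], [48, 79000], [50, 83000], [37, 67000], [29, 52000]]

train, test = train_test_split(data, test_fraction=0.2, seed=0)
means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuse the training statistics
```

The scaling statistics are computed on the training rows only and then reused on the test rows, so the test set keeps playing the role of new observations the model has not seen.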
83 00:02:58,300 --> 00:03:00,433 So you won't have to use that all the time. 84 00:03:00,433 --> 00:03:01,066 We will see, 85 00:03:01,066 --> 00:03:04,233 of course, in each of the machine learning model implementations, whether 86 00:03:04,233 --> 00:03:06,300 we have to apply feature scaling or not. 87 00:03:06,300 --> 00:03:08,266 So you'll have everything, but 88 00:03:08,266 --> 00:03:10,000 we have to include this feature scaling 89 00:03:10,000 --> 00:03:12,733 tool in the data preprocessing toolkit, because indeed 90 00:03:12,733 --> 00:03:15,300 we'll have to use it from time to time. 91 00:03:15,300 --> 00:03:15,900 All right. 92 00:03:15,900 --> 00:03:17,366 That's the table of contents. 93 00:03:17,366 --> 00:03:19,500 And we are going to re-implement 94 00:03:19,500 --> 00:03:21,000 all this from scratch. 95 00:03:21,000 --> 00:03:23,633 However, if you feel you're comfortable 96 00:03:23,633 --> 00:03:24,566 with all these tools 97 00:03:24,566 --> 00:03:27,466 and you understand them already, and you can't wait 98 00:03:27,466 --> 00:03:30,466 to move on to the machine learning implementations, 99 00:03:30,500 --> 00:03:33,033 well, feel free to just read this code 100 00:03:33,033 --> 00:03:34,433 and make sure you understand it 101 00:03:34,433 --> 00:03:36,433 100%, and then move on to 102 00:03:36,433 --> 00:03:37,000 Part two, 103 00:03:37,000 --> 00:03:38,333 Regression, where we're going to build 104 00:03:38,333 --> 00:03:40,133 our first regression models. 105 00:03:40,133 --> 00:03:43,933 However, I really insist that this course must be action based. 106 00:03:43,933 --> 00:03:46,700 So I really want you to take action as much as you can. 107 00:03:46,700 --> 00:03:47,533 And that's why, 108 00:03:47,533 --> 00:03:51,366 if you're ready to do this for data preprocessing, well, stay
109 00:03:51,366 --> 00:03:52,566 with me in this part one, 110 00:03:52,566 --> 00:03:56,333 because we are going to re-implement each and every single one 111 00:03:56,333 --> 00:03:57,400 of these tools. 112 00:03:57,400 --> 00:03:58,366 And to do this, 113 00:03:58,366 --> 00:03:59,400 because remember that 114 00:03:59,400 --> 00:04:00,733 this Colab file is 115 00:04:00,733 --> 00:04:02,633 actually in read-only mode, 116 00:04:02,633 --> 00:04:05,233 well, to do this, we'll have to create a copy by 117 00:04:05,233 --> 00:04:06,466 clicking File here 118 00:04:06,466 --> 00:04:09,266 and then Save a copy in Drive. 119 00:04:09,266 --> 00:04:09,866 This will, 120 00:04:09,866 --> 00:04:10,633 as you can see, 121 00:04:10,633 --> 00:04:12,966 create a copy of this Google 122 00:04:12,966 --> 00:04:14,700 Colab implementation on 123 00:04:14,700 --> 00:04:16,500 which you will be able to 124 00:04:16,500 --> 00:04:17,833 modify and, 125 00:04:17,833 --> 00:04:20,466 mostly, code your own implementation. 126 00:04:20,466 --> 00:04:20,866 All right. 127 00:04:20,866 --> 00:04:22,966 And that's exactly what we're going to do now. We're going to 128 00:04:22,966 --> 00:04:25,233 actually re-implement all these code 129 00:04:25,233 --> 00:04:26,666 cells from scratch so that I make 130 00:04:26,666 --> 00:04:28,933 sure that you take action. 131 00:04:28,933 --> 00:04:31,133 And therefore we're going to remove 132 00:04:31,133 --> 00:04:31,666 each 133 00:04:31,666 --> 00:04:32,266 one of 134 00:04:32,266 --> 00:04:33,866 these code cells by 135 00:04:33,866 --> 00:04:35,433 clicking this trash button here. 136 00:04:35,433 --> 00:04:37,433 But make sure not to remove, 137 00:04:37,433 --> 00:04:38,533 well, the text cells. 138 00:04:38,533 --> 00:04:39,300 This is a text cell. 139 00:04:39,300 --> 00:04:40,466 This is a code cell. 140 00:04:40,466 --> 00:04:42,433 So make sure to only remove
141 00:04:42,433 --> 00:04:44,133 the code cells 142 00:04:44,133 --> 00:04:48,766 so that we can actually keep this well highlighted structure of this 143 00:04:48,766 --> 00:04:51,766 implementation. All right. Almost done. 144 00:04:52,100 --> 00:04:54,000 A few code cells left. 145 00:04:54,000 --> 00:04:55,333 Trash button. 146 00:04:55,333 --> 00:04:57,333 Trash, trash, trash. 147 00:04:57,333 --> 00:04:59,866 And almost done. There we go. 148 00:04:59,866 --> 00:05:00,200 All right. 149 00:05:00,200 --> 00:05:03,700 So that's the whole structure of this implementation. 150 00:05:03,900 --> 00:05:05,866 These are all the tools that you will 151 00:05:05,866 --> 00:05:07,333 probably need when 152 00:05:07,333 --> 00:05:10,133 preprocessing your future data sets for your future 153 00:05:10,133 --> 00:05:11,200 machine learning models. 154 00:05:11,200 --> 00:05:14,766 So it's very important that you understand this implementation well. 155 00:05:15,300 --> 00:05:17,566 So whenever you're ready, let's start 156 00:05:17,566 --> 00:05:19,133 from the next tutorial to 157 00:05:19,133 --> 00:05:20,866 master data preprocessing.
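[Editor's sketch] As a preview of the missing-data and categorical-encoding tools listed in the table of contents, here is a minimal plain-Python illustration. The transcript does not say which technique the course treats as the best one for missing values, so mean imputation is an assumption here. The one-hot encoding of the country column and the 0/1 encoding of the yes/no column are also illustrative choices. The rows and the column name `purchased` are invented for the example.

```python
from statistics import mean

# Hypothetical rows: (country, age, salary, purchased). None marks a missing value.
rows = [
    ("France", 44, 72000, "No"),
    ("Spain", 27, 48000, "Yes"),
    ("Germany", 30, 54000, "No"),
    ("Spain", 38, None, "No"),
    ("Germany", None, 58000, "Yes"),
]


def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Turn each category into a 0/1 vector, one position per sorted category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


countries, ages, salaries, purchased = (list(col) for col in zip(*rows))

ages = impute_mean(ages)          # fills the missing age
salaries = impute_mean(salaries)  # fills the missing salary
country_vectors, country_names = one_hot(countries)

# Dependent variable: encode the two categories as 1 (yes) and 0 (no).
y = [1 if p == "Yes" else 0 for p in purchased]

# Independent variables: one-hot country columns followed by age and salary.
X = [vec + [age, salary]
     for vec, age, salary in zip(country_vectors, ages, salaries)]
```

One-hot encoding is used for the country column so that no artificial order is implied between France, Spain and Germany, while a single 0/1 column is enough for a two-category variable.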