1 00:00:00,100 --> 00:00:00,900 Hello my friends. 2 00:00:00,900 --> 00:00:05,400 Congratulations again on completing Part 1, Data Preprocessing. 3 00:00:05,500 --> 00:00:08,533 You are now ready to build machine learning models, 4 00:00:08,700 --> 00:00:13,066 and the first models that we're going to build together are regression models. 5 00:00:13,100 --> 00:00:16,100 Welcome to Part 2 on Regression. 6 00:00:16,166 --> 00:00:19,633 This is the branch of machine learning that aims 7 00:00:19,633 --> 00:00:23,833 to predict continuous real numbers, for example 8 00:00:23,833 --> 00:00:28,866 a salary or a temperature or any kind of continuous numerical value. 9 00:00:29,066 --> 00:00:32,933 And together we will build the best models to make these predictions. 10 00:00:33,366 --> 00:00:37,400 And so now, first of all, let's make sure we are all on the same page. 11 00:00:37,500 --> 00:00:41,033 This is the whole Machine Learning A-Z folder 12 00:00:41,033 --> 00:00:44,900 containing all the code in Python and R and the data sets. 13 00:00:45,200 --> 00:00:49,433 I gave you the link to this folder right before this tutorial, 14 00:00:49,433 --> 00:00:52,500 you know, in the article, so make sure to connect to this folder. 15 00:00:52,600 --> 00:00:55,233 And also make sure to download the whole folder 16 00:00:55,233 --> 00:00:58,933 in order to get the data sets, because we will indeed have to import them 17 00:00:58,933 --> 00:01:01,300 whenever we build a machine learning model. 18 00:01:01,300 --> 00:01:02,200 All right. 19 00:01:02,200 --> 00:01:05,533 And so now, since we're all on the same page, let's start 20 00:01:05,566 --> 00:01:10,133 our new journey within regression: Part 2, Regression. 21 00:01:10,133 --> 00:01:11,000 There we go. 22 00:01:11,000 --> 00:01:13,633 There are several sections inside, each section 23 00:01:13,633 --> 00:01:16,633 corresponding to a different regression model.
24 00:01:16,733 --> 00:01:19,733 We're going to start with simple linear regression, 25 00:01:19,733 --> 00:01:22,833 which is the simplest machine learning model you could ever build. 26 00:01:22,833 --> 00:01:25,933 And so it's good that we start with this, because indeed with this one 27 00:01:25,933 --> 00:01:28,200 we will only have one independent variable. 28 00:01:28,200 --> 00:01:31,966 You know, one feature, and of course one continuous real value to predict. 29 00:01:32,266 --> 00:01:35,266 Then we will move on to multiple linear regression, 30 00:01:35,266 --> 00:01:38,733 which is based on the same equation as simple linear regression. 31 00:01:38,933 --> 00:01:42,200 Only this time we will have several features instead of one. 32 00:01:42,633 --> 00:01:45,433 Then we will move on to polynomial regression, 33 00:01:45,433 --> 00:01:49,300 which will allow us to tackle some non-linear data sets, 34 00:01:49,300 --> 00:01:53,633 you know, data sets with non-linear correlations, as opposed to the previous 35 00:01:53,633 --> 00:01:56,633 models, simple linear regression and multiple linear regression, 36 00:01:56,700 --> 00:02:00,866 which can provide some amazing and accurate predictions for linear data sets, 37 00:02:00,900 --> 00:02:03,600 you know, data sets with linear correlations. 38 00:02:03,600 --> 00:02:07,266 Then after polynomial regression, we will move on to support 39 00:02:07,266 --> 00:02:10,900 vector regression, which is another kind of non-linear model 40 00:02:10,900 --> 00:02:13,100 that can make some accurate predictions 41 00:02:13,100 --> 00:02:16,200 for non-linear data sets with non-linear correlations. 42 00:02:16,666 --> 00:02:19,400 And then finally we will move on to decision tree regression 43 00:02:19,400 --> 00:02:22,866 and random forest regression, which can provide an alternative 44 00:02:23,033 --> 00:02:26,033 to predict an outcome for non-linear data sets. 45 00:02:26,066 --> 00:02:28,000 So you will have many options.
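For reference, the three equation-based models listed above are commonly written as follows. This is only a sketch of the standard textbook forms: the symbols $y$ (the value to predict), $x_1, \dots, x_n$ (the features) and $b_0, \dots, b_n$ (the coefficients the model learns) are notation assumed here, not taken from this video.

```latex
\begin{align*}
\text{Simple linear regression:}   \quad & y = b_0 + b_1 x_1 \\
\text{Multiple linear regression:} \quad & y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \\
\text{Polynomial regression:}      \quad & y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n
\end{align*}
```

Read this way, multiple linear regression keeps the same form as simple linear regression and adds one term per extra feature, while polynomial regression reuses a single feature raised to increasing powers.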
46 00:02:28,000 --> 00:02:29,633 You know, after this Part 2, 47 00:02:29,633 --> 00:02:33,666 you will have basically a toolkit of several regression models. 48 00:02:33,666 --> 00:02:36,966 And so whenever you end up with a new data set where you have to predict 49 00:02:36,966 --> 00:02:40,366 a real continuous outcome, well, you can just try all of them 50 00:02:40,366 --> 00:02:43,933 and select in the end the one that has the best accuracy. 51 00:02:44,166 --> 00:02:47,400 And thanks to the code templates that will result 52 00:02:47,400 --> 00:02:50,633 from each of these sections, you know, you will get very clear code 53 00:02:50,633 --> 00:02:54,000 templates that you can adapt to your own data sets, so that you can try 54 00:02:54,000 --> 00:02:57,866 these models on your own data sets in a flash, quickly and efficiently, 55 00:02:57,933 --> 00:03:01,666 so that you can select the best one, giving you the best accuracy. 56 00:03:02,100 --> 00:03:03,666 And so now in this first section 57 00:03:03,666 --> 00:03:07,233 in regression, we're going to start with simple linear regression, of course. 58 00:03:07,433 --> 00:03:08,266 So there we go. 59 00:03:08,266 --> 00:03:10,266 Make sure to go inside this folder. 60 00:03:10,266 --> 00:03:11,700 We're going to start with Python. 61 00:03:11,700 --> 00:03:16,266 So let's go inside this Python folder, and you will find two files inside. 62 00:03:16,466 --> 00:03:19,900 First, the data set, salary data dot CSV. 63 00:03:20,200 --> 00:03:24,800 And of course our Python implementation, which has the IPython notebook format, 64 00:03:24,800 --> 00:03:29,100 which I remind you can run on either Google Colab or Jupyter Notebook. 65 00:03:29,400 --> 00:03:33,433 And of course we're going to implement this model together on Google Colab. 66 00:03:33,433 --> 00:03:36,566 But first let me explain the data set and the problem 67 00:03:36,566 --> 00:03:39,566 that we're going to solve with simple linear regression.
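As a side note on the "try all of them and select the best one" idea mentioned earlier, here is a minimal sketch of what such a comparison could look like. Everything in it is an assumption made for illustration: the data is synthetic, the candidates are plain polynomial fits of degree 1 to 3 standing in for the models of this part, and "accuracy" is taken to mean the R² score on held-out data.

```python
import numpy as np

# Synthetic, made-up data with a curved (non-linear) relationship.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# Hold out every second point as a test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Stand-in candidates: polynomials of degree 1 (a straight line), 2 and 3.
scores = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on the training part only
    scores[degree] = r2_score(y_test, np.polyval(coeffs, x_test))

# Keep the candidate with the highest score on the held-out points.
best_degree = max(scores, key=scores.get)
print(scores, "-> best degree:", best_degree)
```

The key design choice is that each candidate is scored on points it was not trained on, so the selection reflects how well it predicts new data rather than how well it memorizes the training set.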
68 00:03:39,566 --> 00:03:42,600 And to do this, I'm going to open our data 69 00:03:42,600 --> 00:03:45,600 set here and explain what this is about. 70 00:03:45,700 --> 00:03:46,200 All right. 71 00:03:46,200 --> 00:03:50,800 So first I want to reassure you that indeed this is a very simple data set, 72 00:03:50,800 --> 00:03:52,900 you know, with only 30 observations. 73 00:03:52,900 --> 00:03:55,800 And of course in real life the data sets are more complex. 74 00:03:55,800 --> 00:03:58,766 But I want to start working on a simple data set 75 00:03:58,766 --> 00:04:02,300 so that we can really focus on how to build the model itself. 76 00:04:02,333 --> 00:04:04,500 You know, because if we had a complex data set, 77 00:04:04,500 --> 00:04:07,233 we would lose a bit of our focus on the model. 78 00:04:07,233 --> 00:04:09,500 And I really want us to focus on the model. 79 00:04:09,500 --> 00:04:11,633 So let me describe the data set. 80 00:04:11,633 --> 00:04:13,600 It is a data set containing, as you can see, 81 00:04:13,600 --> 00:04:17,733 30 observations and two columns, with of course one feature. 82 00:04:17,733 --> 00:04:21,566 This is the feature, years of experience, and the dependent variable, 83 00:04:21,566 --> 00:04:24,400 which we want to predict, which is the salary. 84 00:04:24,400 --> 00:04:28,900 So let's say that this data set belongs to a company that gathered 85 00:04:28,900 --> 00:04:32,700 data on some of their employees, collecting for each of them 86 00:04:32,833 --> 00:04:36,000 their years of experience and their salary. 87 00:04:36,000 --> 00:04:41,700 So you see, each row of the data set here corresponds to a different employee, 88 00:04:41,700 --> 00:04:43,900 that is, to one employee of the company, 89 00:04:43,900 --> 00:04:47,766 and for each employee of this company, we have indeed the number of years 90 00:04:47,766 --> 00:04:50,833 of experience in the company and their salary. 91 00:04:50,833 --> 00:04:51,733 Okay.
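To picture the layout just described (one row per employee, one feature column, one column to predict), here is a small sketch using only the Python standard library. The four rows are made up and stand in for the real 30-observation file, and the column names `YearsExperience` and `Salary` are assumptions, not read from the actual data set.

```python
import csv
import io

# Made-up rows standing in for the real file; the column names are assumptions.
raw = io.StringIO(
    "YearsExperience,Salary\n"
    "1.0,40000\n"
    "3.0,55000\n"
    "5.0,70000\n"
    "8.0,100000\n"
)

# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(raw))

# X holds the single feature, y the dependent variable to predict.
X = [float(row["YearsExperience"]) for row in rows]
y = [float(row["Salary"]) for row in rows]

print(len(rows), "observations")
print("feature (years of experience):", X)
print("dependent variable (salary):", y)
```

With a real file on disk, the `io.StringIO(...)` object would be replaced by an opened file handle; the split into `X` and `y` stays the same.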
92 00:04:51,733 --> 00:04:56,333 And so the goal, very simply, is to build a simple linear regression model 93 00:04:56,600 --> 00:05:00,233 that will be trained to understand the correlations 94 00:05:00,233 --> 00:05:03,933 between the number of years of experience and the salary, 95 00:05:04,166 --> 00:05:08,866 so that it can predict for a new employee, you know, having a new number 96 00:05:08,866 --> 00:05:10,033 of years of experience, 97 00:05:10,033 --> 00:05:14,066 well, the corresponding salary, or the salary that this person should get. 98 00:05:14,366 --> 00:05:16,233 So you see, that's a very easy problem. 99 00:05:16,233 --> 00:05:19,833 But at least you'll know how to build a simple linear 100 00:05:19,833 --> 00:05:23,100 regression model and perfectly master it in every detail. 101 00:05:23,466 --> 00:05:25,766 And then I want to say something very important. 102 00:05:25,766 --> 00:05:28,133 Remember that we're going to build each time 103 00:05:28,133 --> 00:05:31,433 some code templates which you can adapt to your own data set, 104 00:05:31,766 --> 00:05:34,833 so that when you want to use the code template on your own data set, 105 00:05:35,000 --> 00:05:39,433 you will only have one thing to change, which will be the name of your data set. 106 00:05:39,666 --> 00:05:44,300 I will always make sure to make the code templates as generic as we can, 107 00:05:44,300 --> 00:05:48,066 so that you only have one or two things to change when you want to deploy them 108 00:05:48,066 --> 00:05:48,833 on your data set.
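The goal described above, learning the relationship between years of experience and salary and then predicting the salary for a new employee, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's template: it fits the line with ordinary least squares written by hand, the function name `fit_simple_linear_regression` is hypothetical, and the five data points are made up.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data (not the real data set): years of experience and salary.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [40000.0, 45000.0, 50000.0, 55000.0, 60000.0]

intercept, slope = fit_simple_linear_regression(years, salary)


def predict(years_of_experience):
    """Predicted salary for a given number of years of experience."""
    return intercept + slope * years_of_experience


# Prediction for a new employee with 6 years of experience.
print(predict(6.0))
```

To reuse the sketch on other data, only the `years` and `salary` lists would need to change, which mirrors the "one or two things to change" idea behind the templates.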