1 00:00:00,300 --> 00:00:01,100 Hello, my friends. 2 00:00:01,100 --> 00:00:01,966 Welcome back. 3 00:00:01,966 --> 00:00:05,400 Now let's learn together how to import a data set. 4 00:00:05,533 --> 00:00:08,700 As a reminder, we're going to learn how to import the following data 5 00:00:08,700 --> 00:00:13,200 set, Data.csv, which is a very simple data set of, 6 00:00:13,200 --> 00:00:16,066 let's say, a retail company that is doing some analysis 7 00:00:16,066 --> 00:00:19,300 on which clients purchased one of their products. 8 00:00:19,466 --> 00:00:24,533 So the rows in this data set correspond to different customers of this company. 9 00:00:24,766 --> 00:00:28,266 And for each of these customers, we have the country they live in, their 10 00:00:28,266 --> 00:00:32,566 age, their salary, and whether or not they purchased the product. 11 00:00:32,600 --> 00:00:33,133 Okay. 12 00:00:33,133 --> 00:00:36,033 So we're going to learn how to import that CSV 13 00:00:36,033 --> 00:00:39,166 in Python using, of course, the pandas library. 14 00:00:39,300 --> 00:00:39,966 Here we go. 15 00:00:39,966 --> 00:00:42,933 So let's first create a new code cell. 16 00:00:42,933 --> 00:00:45,366 And now let's import this data set. 17 00:00:45,366 --> 00:00:48,500 So the first thing we have to do is to create a new variable. 18 00:00:48,633 --> 00:00:51,733 And this variable will contain exactly the data set. 19 00:00:52,100 --> 00:00:54,766 So I always like to choose simple names 20 00:00:54,766 --> 00:00:57,966 for my variables that represent well what we're creating. 21 00:00:57,966 --> 00:01:00,233 And therefore, since now we're importing the data set 22 00:01:00,233 --> 00:01:02,700 and we want to integrate the data set in a variable, 23 00:01:02,700 --> 00:01:05,033 I'm going to call this variable dataset. 24 00:01:05,033 --> 00:01:06,933 Okay. As simple as that. 25 00:01:06,933 --> 00:01:09,600 So this variable, what will it be equal to?
26 00:01:09,600 --> 00:01:14,733 Well, it will be equal to the output of a certain function from pandas. 27 00:01:15,000 --> 00:01:17,966 And this function will read 28 00:01:17,966 --> 00:01:20,966 all the values of this data set 29 00:01:21,133 --> 00:01:23,933 and will create what we call a data frame. 30 00:01:23,933 --> 00:01:28,133 It's a certain format of data, whether it is in Python or even R. 31 00:01:28,400 --> 00:01:31,200 So it will create a data frame and it will contain 32 00:01:31,200 --> 00:01:35,100 exactly the same rows and columns and values as what you see here. 33 00:01:35,133 --> 00:01:38,600 And this data frame will be exactly this dataset variable. 34 00:01:38,733 --> 00:01:39,500 All right. 35 00:01:39,500 --> 00:01:40,233 So there we go. 36 00:01:40,233 --> 00:01:43,733 In order to create this data frame, we're going to call a certain function 37 00:01:43,733 --> 00:01:47,766 from the pandas library, which is called read_csv. 38 00:01:47,966 --> 00:01:49,933 And in this function we will only have to 39 00:01:49,933 --> 00:01:53,466 input the name of the data set with the extension. 40 00:01:54,033 --> 00:01:58,333 So since we're about to call a function of this library, well, 41 00:01:58,333 --> 00:02:01,333 the first thing we have to do is call the pandas library. 42 00:02:01,533 --> 00:02:04,933 And therefore, remember, since we gave it the shortcut name pd, 43 00:02:05,100 --> 00:02:08,100 in order to call it, we need to add here pd. 44 00:02:08,233 --> 00:02:11,900 And then to call a function from a library, we need to add a dot. 45 00:02:11,900 --> 00:02:13,433 It's always done like that. 46 00:02:13,433 --> 00:02:16,433 And that's where you can call the function you want to use. 47 00:02:16,533 --> 00:02:21,100 And as we said, this function is named read_csv. 48 00:02:21,300 --> 00:02:24,666 And then you add some parentheses to enter the argument. 49 00:02:24,833 --> 00:02:26,133 So there we go. Let's do this.
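The call described so far can be sketched as follows. This is a minimal illustration, assuming pandas was imported under the usual alias pd. The real Data.csv is not reproduced in this transcript, so the sketch reads a small in-memory stand-in instead: the column names follow the description in the video, and the three rows are invented purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv. The column names follow the video's
# description (country, age, salary, purchased); the rows are made up.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

# read_csv accepts a file path or a file-like object.
# With the real file in the working directory, the call would be:
#     dataset = pd.read_csv("Data.csv")
dataset = pd.read_csv(io.StringIO(csv_text))

print(type(dataset))  # the result is a pandas DataFrame
print(dataset.shape)  # (number of rows, number of columns)
```

Reading from a string keeps the sketch runnable without the course file; the only change needed for the real data set is to pass the file name in quotes.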
50 00:02:26,133 --> 00:02:27,900 This is the only thing you will have to do 51 00:02:27,900 --> 00:02:30,733 when using this read_csv function. 52 00:02:30,733 --> 00:02:33,966 You have to input, in quotes, the name of the data set. 53 00:02:34,266 --> 00:02:39,000 As a reminder, the name of the data set is Data.csv, with a capital D. 54 00:02:39,266 --> 00:02:40,066 So there we go. 55 00:02:40,066 --> 00:02:46,333 Data.csv. Okay, and this will create the data frame with, 56 00:02:46,333 --> 00:02:48,700 you know, all the values inside this data set. 57 00:02:48,700 --> 00:02:52,766 And this data frame will be exactly this dataset variable. Okay. 58 00:02:53,166 --> 00:02:54,600 So that's the first step. 59 00:02:54,600 --> 00:02:56,766 But that's not enough to import the data set. 60 00:02:56,766 --> 00:02:59,133 You know, as a first step of data preprocessing, 61 00:02:59,133 --> 00:03:03,233 the next thing that you have to do is create two new entities. 62 00:03:03,266 --> 00:03:05,933 The first one is the matrix of features. 63 00:03:05,933 --> 00:03:08,933 And the second one is the dependent variable vector. 64 00:03:08,966 --> 00:03:11,966 So let me show you exactly what they mean and what they are 65 00:03:12,100 --> 00:03:15,300 in the data set right here. Okay. 66 00:03:15,800 --> 00:03:20,700 So I'm going to give you now a first, very important principle in machine learning. 67 00:03:21,166 --> 00:03:23,800 In any data set with which you're going 68 00:03:23,800 --> 00:03:27,066 to train a machine learning model, you have the same entities, 69 00:03:27,066 --> 00:03:30,600 which are the features and the dependent variable vector. 70 00:03:30,933 --> 00:03:32,266 Can you guess here 71 00:03:32,266 --> 00:03:35,266 what are the features and what is the dependent variable? 72 00:03:35,533 --> 00:03:40,066 Well, very simply, the features are the columns 73 00:03:40,066 --> 00:03:43,466 with which you're going to predict the dependent variable.
74 00:03:43,800 --> 00:03:46,900 And the dependent variable, of course, is the last column. 75 00:03:46,900 --> 00:03:49,900 Because, you know, this company would like to predict 76 00:03:49,933 --> 00:03:53,066 if some future customers are going to purchase 77 00:03:53,066 --> 00:03:56,100 a certain product based on this information. 78 00:03:56,300 --> 00:04:01,033 So very simply, the features, also called the independent variables, 79 00:04:01,333 --> 00:04:05,700 are the variables containing some information with which you can 80 00:04:05,700 --> 00:04:09,633 predict what you want to predict, which is called the dependent variable. 81 00:04:10,033 --> 00:04:10,366 All right. 82 00:04:10,366 --> 00:04:14,100 So remember this very important principle: in any machine 83 00:04:14,100 --> 00:04:17,133 learning model that you're going to build, you're going to have separately 84 00:04:17,333 --> 00:04:20,966 the features, usually in the first columns of your data set, 85 00:04:21,333 --> 00:04:25,433 and the dependent variable, usually in the last column of your data set. 86 00:04:25,566 --> 00:04:28,066 You will see that all the data sets we will use in this course 87 00:04:28,066 --> 00:04:31,300 and most of the data sets you'll use in your machine learning career 88 00:04:31,433 --> 00:04:35,200 will have the same format, with first the features, you know, in the first columns 89 00:04:35,366 --> 00:04:38,366 and the dependent variable vector in the last column. 90 00:04:38,433 --> 00:04:39,333 Okay.
91 00:04:39,333 --> 00:04:42,666 And so right now what we want is to create, you know, the two entities. 92 00:04:42,666 --> 00:04:44,366 We want to create first 93 00:04:44,366 --> 00:04:49,033 the matrix of features, containing separately these three columns here, 94 00:04:49,033 --> 00:04:51,266 you know, country, age, salary, 95 00:04:51,266 --> 00:04:55,200 and separately we want to create the dependent variable vector, 96 00:04:55,266 --> 00:04:59,366 containing only this last column, because that's the column we want to predict. 97 00:04:59,700 --> 00:05:03,200 And that's exactly what we always have to do in this first data 98 00:05:03,200 --> 00:05:04,000 preprocessing phase. 99 00:05:04,000 --> 00:05:05,133 So let's do this. 100 00:05:05,133 --> 00:05:07,066 Let's create these two entities. 101 00:05:07,066 --> 00:05:10,066 And we are going to call them X for the matrix of features 102 00:05:10,200 --> 00:05:13,133 and y for the dependent variable vector.
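One possible way to build these two entities is sketched below. The transcript up to this point does not say which pandas calls are used, so selecting columns by position with iloc and converting with to_numpy is an assumption made for illustration, not necessarily the approach taken in the video. The sample rows are again a made-up stand-in for Data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv (invented rows, for illustration only).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Matrix of features: every row, every column except the last one.
X = dataset.iloc[:, :-1].to_numpy()

# Dependent variable vector: every row, only the last column.
y = dataset.iloc[:, -1].to_numpy()

print(X.shape)  # (number of customers, number of feature columns)
print(y.shape)  # (number of customers,)
```

Selecting by position rather than by column name relies on the layout described above, with the features in the first columns and the dependent variable in the last one; a data set arranged differently would need different indices.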