1 00:00:00,860 --> 00:00:07,310 In this video, we are going to learn how to import the data into our software and get it ready to 2 00:00:07,310 --> 00:00:08,270 start building the model. 3 00:00:10,650 --> 00:00:15,220 I have the code here; you can pause the video and copy it out. 4 00:00:15,480 --> 00:00:19,890 I'll also be sharing the file of this code in the resources section. 5 00:00:20,190 --> 00:00:25,110 You can download that file and copy-paste the code from that file, too. 6 00:00:26,050 --> 00:00:29,060 So the first part is how to import the dataset. 7 00:00:30,140 --> 00:00:35,060 To import the data, we use the read.csv function. 8 00:00:36,390 --> 00:00:43,770 So if your data is saved as a CSV, or even if it is an Excel file, you can open it in Excel and save 9 00:00:43,770 --> 00:00:45,540 it as a CSV file. 10 00:00:45,840 --> 00:00:49,950 So when you have a CSV file, that is, a comma-separated values file, 11 00:00:50,340 --> 00:00:53,820 you can import that file using the read.csv function. 12 00:00:54,920 --> 00:00:56,220 In the read.csv function, 13 00:00:56,300 --> 00:00:57,620 you have to give the 14 00:00:59,750 --> 00:01:06,020 path of that file, and the data in that file will be imported into this variable, which I have named 15 00:01:06,020 --> 00:01:06,830 as movie. 16 00:01:16,550 --> 00:01:23,180 So when I run that command, it assigns the data in that file to the movie 17 00:01:23,180 --> 00:01:23,710 variable. 18 00:01:25,140 --> 00:01:29,040 When I run View on this variable, I can see that data. 19 00:01:30,180 --> 00:01:33,540 This is the data that we discussed earlier. In the columns, 20 00:01:33,570 --> 00:01:34,860 we have the variables. 21 00:01:35,070 --> 00:01:41,630 The last column will be the collection column, which is the dependent variable, or the 22 00:01:42,060 --> 00:01:43,410 variable that we want to predict.
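The import step described above can be sketched roughly as follows. The transcript does not show the actual file path or column names, so this sketch writes a tiny stand-in CSV to a temporary file; the columns `Budget`, `Time_taken` and `Collection` and all the values are hypothetical placeholders, not the course data.

```r
# Write a tiny stand-in CSV so the example runs anywhere.
# The column names and values here are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Budget,Time_taken,Collection",
             "100,120.5,48000",
             "150,,43200",
             "90,98.2,69400"), csv_path)

# read.csv takes the path of the file. header = TRUE is its default,
# so the first row is used as the column names.
movie <- read.csv(csv_path)

# View(movie) opens a spreadsheet-style viewer in an interactive session
# such as RStudio; str() and head() work in any session.
str(movie)
head(movie)
```

Note that the empty field in the second data row is read in as NA, which is the situation the missing-value step later in the video deals with.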
23 00:01:45,450 --> 00:01:48,900 And we have all the 506 observations 24 00:01:54,350 --> 00:01:54,950 in this data. 25 00:01:57,990 --> 00:02:04,870 You can also note that the headers, that is, the column names, were auto-detected by the read.csv function. 26 00:02:05,370 --> 00:02:09,090 And it has identified that these are column headers, 27 00:02:09,540 --> 00:02:10,050 not data 28 00:02:10,130 --> 00:02:10,650 values. 29 00:02:14,750 --> 00:02:20,780 So by using these two lines, you can import the data into your R software. 30 00:02:23,150 --> 00:02:27,800 The next step: before running any model or training any model, 31 00:02:27,920 --> 00:02:29,360 we need to prepare the data. 32 00:02:30,440 --> 00:02:33,250 I'm showing you only one step of data processing. 33 00:02:35,070 --> 00:02:41,460 This step is called missing value imputation, that is, whenever the value of a variable is missing, 34 00:02:43,320 --> 00:02:44,280 that we have to fill. 35 00:02:45,240 --> 00:02:50,580 It is a critical step because if a value is missing in the dataset, the models cannot be trained, 36 00:02:50,670 --> 00:02:54,660 that is, we will not be able to run the functions to train these models. 37 00:02:55,200 --> 00:02:56,820 So it is a mandatory step. 38 00:02:56,880 --> 00:02:58,260 That is why I'm showing it to you. 39 00:02:58,740 --> 00:03:01,910 There are various other processing techniques 40 00:03:03,010 --> 00:03:05,140 that we have to do to get the data ready. 41 00:03:06,210 --> 00:03:12,090 If you want to learn those techniques also, I'll be sharing a link to where you can learn those techniques 42 00:03:12,440 --> 00:03:13,230 in the description. 43 00:03:14,370 --> 00:03:16,740 I'm not going to cover that part in this video. 44 00:03:18,280 --> 00:03:20,110 So how to handle missing values? 45 00:03:20,830 --> 00:03:21,460 First of all, 46 00:03:21,730 --> 00:03:27,850 we'll run this command to see whether there are any missing values in any of the variables.
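A check for missing values like the one just mentioned might look like the sketch below. The small data frame and its column names are hypothetical stand-ins for the imported data. `colSums(is.na(...))` is an extra convenience that is not part of the video; it is added here only to get the per-column counts as plain numbers.

```r
# Hypothetical stand-in for the imported data, with one missing value.
movie <- data.frame(
  Time_taken = c(120.5, NA, 98.2, 110.0),
  Collection = c(48000, 43200, 69400, 66800)
)

# summary() reports the minimum, quartiles, mean and maximum per column,
# plus an extra NA's entry for any column that has missing values.
summary(movie)

# Count of missing values per column, as a named numeric vector.
na_counts <- colSums(is.na(movie))
print(na_counts)
```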
47 00:03:28,090 --> 00:03:30,430 So we'll run summary of the data. 48 00:03:38,020 --> 00:03:43,400 In this summary, we get the minimum, maximum and the three quartile values. 49 00:03:45,180 --> 00:03:50,790 Apart from this, whichever variable has some missing values will get an additional piece of information, 50 00:03:50,790 --> 00:03:51,490 the NA's. 51 00:03:52,320 --> 00:03:56,790 So the time taken variable has 12 NA's. 52 00:03:57,830 --> 00:04:01,320 So in this variable, there are 12 entries which were empty. 53 00:04:02,740 --> 00:04:06,190 So if you open the Excel file, there will be 12 cells empty. 54 00:04:06,430 --> 00:04:12,130 When we imported that file into R, those cells were marked as NA. 55 00:04:13,370 --> 00:04:15,150 We cannot have NA in our dataset. 56 00:04:15,350 --> 00:04:19,220 So we'll be changing the NA's to some value. 57 00:04:19,820 --> 00:04:27,140 The value we are going to assign these blank cells should be some harmless value, such as mean or median. 58 00:04:27,950 --> 00:04:33,860 So we are going to find out the mean of the values in the time taken variable. 59 00:04:34,460 --> 00:04:39,440 And we are going to put that mean value into these blank cells. To do that, 60 00:04:39,470 --> 00:04:49,790 by running this line, we can replace the blank values with the mean of the other values in that particular 61 00:04:49,790 --> 00:04:50,290 variable. 62 00:04:52,110 --> 00:04:57,880 So let me explain to you what this line is doing. On the left part, 63 00:04:58,340 --> 00:05:01,750 we are identifying those cells which are blank. 64 00:05:02,500 --> 00:05:06,490 So movie dollar time taken takes you to that particular variable. 65 00:05:06,970 --> 00:05:11,890 And within that variable, you want to find out those cells where 66 00:05:13,200 --> 00:05:14,680 the value is NA. 67 00:05:15,660 --> 00:05:20,560 So the is.na function is finding those cells which have the value 68 00:05:20,700 --> 00:05:22,350 NA in this variable.
69 00:05:24,580 --> 00:05:30,260 And we are putting in the mean of the time taken variable after removing NA values. 70 00:05:31,000 --> 00:05:33,010 So we had 506 observations. 71 00:05:33,700 --> 00:05:35,980 Twelve of them had NA values. 72 00:05:37,750 --> 00:05:46,150 So for the remaining 494 observations, we find the mean using this function and put that mean 73 00:05:46,150 --> 00:05:49,390 value into those cells which had NA. 74 00:05:50,950 --> 00:05:54,970 Using this single line, we'll be doing missing value imputation. 75 00:05:56,710 --> 00:06:05,350 Once the missing values are replaced, or imputed, we can run our analysis. For better analysis, 76 00:06:05,500 --> 00:06:11,620 we should take some other steps also, such as outlier treatment, variable transformation, 77 00:06:12,190 --> 00:06:14,800 looking at correlations between variables and so on. 78 00:06:15,900 --> 00:06:19,850 We will not be covering all those steps here. After missing value imputation, 79 00:06:20,060 --> 00:06:24,470 our data is ready, so that we can use it for training our model.
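The single-line mean imputation described above can be sketched as follows. The exact line used in the video is not visible in the transcript, so this is a reconstruction under assumptions: the column name `Time_taken` and the sample values are hypothetical, and `na.rm = TRUE` is assumed to be how the NA values are dropped before the mean is taken.

```r
# Hypothetical stand-in for the time taken variable, with two missing values.
movie <- data.frame(Time_taken = c(120, NA, 100, 110, NA))

# Left side: is.na() picks out the cells that are NA.
# Right side: mean(..., na.rm = TRUE) averages only the non-missing values.
movie$Time_taken[is.na(movie$Time_taken)] <- mean(movie$Time_taken, na.rm = TRUE)

# Every NA is now replaced by the mean of the observed values.
print(movie$Time_taken)
```

Replacing `mean` with `median` in the same line would give median imputation, the other harmless value mentioned in the video.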