1
00:00:00,860 --> 00:00:07,310
In this video, we are going to learn how to import the data into our software and get it ready to

2
00:00:07,310 --> 00:00:08,270
start building the model.

3
00:00:10,650 --> 00:00:15,220
I have the R code here; you can pause the video and copy it out.

4
00:00:15,480 --> 00:00:19,890
I'll also be sharing the file of this code in the resource section.

5
00:00:20,190 --> 00:00:25,110
You can download that file and copy-paste the R code from that file too.

6
00:00:26,050 --> 00:00:29,060
So the first part is how to import the dataset.

7
00:00:30,140 --> 00:00:35,060
To import the data, we use this read.csv function.

8
00:00:36,390 --> 00:00:43,770
So your data should be saved as a CSV. Even if it is an Excel file, you can open it in Excel and save

9
00:00:43,770 --> 00:00:45,540
it as a CSV file.

10
00:00:45,840 --> 00:00:49,950
So say you have a CSV file, that is, a comma-separated values file.

11
00:00:50,340 --> 00:00:53,820
You can import that file using the read.csv function.

12
00:00:54,920 --> 00:00:56,220
In the read.csv function,

13
00:00:56,300 --> 00:00:57,620
you have to give the

14
00:00:59,750 --> 00:01:06,020
path of that file, and the data in that file will be imported into this variable, which I have named

15
00:01:06,020 --> 00:01:06,830
as movie.

16
00:01:16,550 --> 00:01:23,180
So when I run this command, it assigns the data in that file to the movie

17
00:01:23,180 --> 00:01:23,710
variable.

18
00:01:25,140 --> 00:01:29,040
When I run View on this variable, I can see that data.

19
00:01:30,180 --> 00:01:33,540
This is the data that we discussed earlier. In the columns,

20
00:01:33,570 --> 00:01:34,860
we have the variables.

21
00:01:35,070 --> 00:01:41,630
The last column is the Collection column, which is the dependent variable, or the

22
00:01:42,060 --> 00:01:43,410
variable that we want to predict.

23
00:01:45,450 --> 00:01:48,900
And we have all the 506 observations.

24
00:01:54,350 --> 00:01:54,950
In this data.

25
00:01:57,990 --> 00:02:04,870
You can also note that the headers, that is, the column names, were picked up automatically by the read.csv function.

26
00:02:05,370 --> 00:02:09,090
And it has identified that these are column headers,

27
00:02:09,540 --> 00:02:10,050
not the data

28
00:02:10,130 --> 00:02:10,650
values.

29
00:02:14,750 --> 00:02:20,780
So by using these two lines, you can import the data into your R software.
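
A minimal sketch of those two lines, under some assumptions: the column names Budget and Time_taken and the temporary CSV file below are hypothetical stand-ins so that the example runs on its own. In the video, the path given to read.csv points to the course dataset instead.

```r
# Hypothetical stand-in data: a tiny CSV written to a temporary file so the
# sketch runs anywhere. The second row leaves Time_taken empty on purpose.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Budget,Time_taken,Collection",
  "36524,109.6,48000",
  "35668,,43200",
  "39912,147.0,69400"
), csv_path)

# Line 1: read.csv takes the path of the file. By default (header = TRUE)
# the first row is used as the column names.
movie <- read.csv(csv_path, header = TRUE)

# Line 2: View() opens a spreadsheet-style viewer, so it is only called
# when R is running interactively.
if (interactive()) View(movie)

# str() gives a quick text overview of the imported columns.
str(movie)
```

Note that the empty cell in the numeric Time_taken column comes in as NA.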

30
00:02:23,150 --> 00:02:27,800
Next, before running any model or training any model,

31
00:02:27,920 --> 00:02:29,360
We need to prepare the data.

32
00:02:30,440 --> 00:02:33,250
I'm showing you only one step of data processing.

33
00:02:35,070 --> 00:02:41,460
This step is called missing value imputation; that is, whenever the value of a variable is missing,

34
00:02:43,320 --> 00:02:44,280
we have to fill it in.

35
00:02:45,240 --> 00:02:50,580
It is a critical step because if a value is missing in the dataset, the models cannot be trained,

36
00:02:50,670 --> 00:02:54,660
that is, R will not be able to run the functions to train these models.

37
00:02:55,200 --> 00:02:56,820
So it is a mandatory step.

38
00:02:56,880 --> 00:02:58,260
That is why I'm showing it to you.

39
00:02:58,740 --> 00:03:01,910
There are various other processing techniques.

40
00:03:03,010 --> 00:03:05,140
That we have to apply to get the data ready.

41
00:03:06,210 --> 00:03:12,090
If you want to learn those techniques as well, I'll be sharing a link showing how you can learn those techniques

42
00:03:12,440 --> 00:03:13,230
in the description.

43
00:03:14,370 --> 00:03:16,740
I'm not going to cover that part in this video.

44
00:03:18,280 --> 00:03:20,110
So how to handle missing values?

45
00:03:20,830 --> 00:03:21,460
First of all,

46
00:03:21,730 --> 00:03:27,850
we'll run this command to see whether there are any missing values in any of the variables.

47
00:03:28,090 --> 00:03:30,430
So we'll run summary on the data.

48
00:03:38,020 --> 00:03:43,400
In this summary, we get the minimum, the maximum, and the three quartile values.

49
00:03:45,180 --> 00:03:50,790
Apart from this, whichever variable has some missing values will get an additional piece of information:

50
00:03:50,790 --> 00:03:51,490
the number of NAs.

51
00:03:52,320 --> 00:03:56,790
So the Time_taken variable has 12 NAs.

52
00:03:57,830 --> 00:04:01,320
So in this variable, there are 12 entries which were empty.

53
00:04:02,740 --> 00:04:06,190
So if you open the Excel file, there will be 12 cells empty.

54
00:04:06,430 --> 00:04:12,130
When we imported that file into R, those cells were marked as NA.
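
A small sketch of this check, assuming a toy data frame in place of the movie dataset (the values below are made up for illustration; only the Time_taken and Collection names come from the video). The colSums(is.na(...)) line is an extra that is not shown in the video; it counts the NAs per column directly.

```r
# Hypothetical toy data standing in for the movie dataset.
movie <- data.frame(
  Time_taken = c(109.6, NA, 147.0, 150.2, NA),
  Collection = c(48000, 43200, 69400, 66800, 72400)
)

# summary() prints the minimum, quartiles, mean, and maximum for each numeric
# column, plus an "NA's" count for any column that has missing values.
print(summary(movie))

# A more direct count of missing values per column.
na_counts <- colSums(is.na(movie))
print(na_counts)
```

Here only Time_taken gets the extra NA's row in the summary, because Collection has no missing values.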

55
00:04:13,370 --> 00:04:15,150
We cannot have NA in our dataset.

56
00:04:15,350 --> 00:04:19,220
So we'll be changing the NAs to some value.

57
00:04:19,820 --> 00:04:27,140
The value we are going to assign to these blank cells should be some harmless value, such as the mean or median.

58
00:04:27,950 --> 00:04:33,860
So we are going to find out the mean of the values in the Time_taken variable.

59
00:04:34,460 --> 00:04:39,440
And we are going to put that mean value into these blank cells. To do that,

60
00:04:39,470 --> 00:04:49,790
by running this line, we can replace the blank values with the mean of the other values in that particular

61
00:04:49,790 --> 00:04:50,290
variable.

62
00:04:52,110 --> 00:04:57,880
So let me explain to you what this line is doing. On the left part,

63
00:04:58,340 --> 00:05:01,750
We are identifying those cells which are blank.

64
00:05:02,500 --> 00:05:06,490
So movie$Time_taken takes you to that particular variable.

65
00:05:06,970 --> 00:05:11,890
And within that variable, you want to find out those cells where.

66
00:05:13,200 --> 00:05:14,680
The value is NA.

67
00:05:15,660 --> 00:05:20,560
So the is.na function is finding out those cells which have the value

68
00:05:20,700 --> 00:05:22,350
NA in this variable.

69
00:05:24,580 --> 00:05:30,260
And we are putting in the mean of the Time_taken variable after removing the NA values.

70
00:05:31,000 --> 00:05:33,010
So we had 506 observations.

71
00:05:33,700 --> 00:05:35,980
Twelve of them had the value NA.

72
00:05:37,750 --> 00:05:46,150
So for the remaining 494 observations, we find out the mean using this function and put that mean

73
00:05:46,150 --> 00:05:49,390
value into those cells which had NA.

74
00:05:50,950 --> 00:05:54,970
Using this single line, we'll be doing missing value imputation.
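
A sketch of what that single line can look like, assuming the column is spelled Time_taken and using a made-up five-row data frame in place of the 506-row movie dataset. The mean_time helper variable is only there to make the two halves of the line easier to see.

```r
# Hypothetical toy data: two of the five Time_taken values are missing.
movie <- data.frame(
  Time_taken = c(100, NA, 140, 150, NA),
  Collection = c(48000, 43200, 69400, 66800, 72400)
)

# Right part: the mean of Time_taken after removing the NA values.
# With the toy data this is (100 + 140 + 150) / 3 = 130.
mean_time <- mean(movie$Time_taken, na.rm = TRUE)

# Left part: is.na() picks out the cells whose value is NA, and the mean is
# assigned only to those cells.
movie$Time_taken[is.na(movie$Time_taken)] <- mean_time

print(movie$Time_taken)
```

The same thing can be written on one line by placing the mean(...) call directly on the right side of the assignment.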

75
00:05:56,710 --> 00:06:05,350
Once the missing values are replaced, or imputed, we can run our analysis. But for better analysis,

76
00:06:05,500 --> 00:06:11,620
We should take some other steps also, such as outlier treatment, variable transformation.

77
00:06:12,190 --> 00:06:14,800
Looking at correlations between variables and so on.

78
00:06:15,900 --> 00:06:19,850
We will not be covering all those steps here. After missing value imputation,

79
00:06:20,060 --> 00:06:24,470
Our data is ready, so we can use it for training our model.