1
00:00:01,410 --> 00:00:05,220
Now we are going to start building the SVM model in R.

2
00:00:08,170 --> 00:00:14,590
If you do not know how to set up R or RStudio, or you are not comfortable with the basic operations

3
00:00:14,590 --> 00:00:21,020
in R, I would recommend that you first complete the section on the basics of R.

4
00:00:22,210 --> 00:00:23,030
In that section,

5
00:00:24,380 --> 00:00:30,790
you will learn how to set up RStudio and how to do some basic operations in R so that you do not

6
00:00:30,790 --> 00:00:34,600
face difficulty while creating this SVM model.

7
00:00:37,450 --> 00:00:38,170
So let's start.

8
00:00:38,950 --> 00:00:43,690
The first step of building a model is getting the data.

9
00:00:45,730 --> 00:00:47,510
I have the data in a CSV file.

10
00:00:47,980 --> 00:00:51,180
And that CSV file is saved at this location.

11
00:00:53,650 --> 00:00:59,090
So I'm importing that data into R using this read.csv function.

12
00:01:01,920 --> 00:01:10,560
So whenever you have your data in text format or XLS format or XLSX format, try to create a

13
00:01:10,720 --> 00:01:11,830
CSV file out of it.

14
00:01:12,310 --> 00:01:16,110
And it will be very easy to get the data from that CSV file.

15
00:01:17,200 --> 00:01:21,160
However, in R, you can also import data from text files.

16
00:01:21,820 --> 00:01:29,350
But reading from a CSV is more efficient, and there will be less chance of any error when you do it

17
00:01:29,350 --> 00:01:29,890
like this.

18
00:01:33,920 --> 00:01:34,680
So in the read.csv

19
00:01:34,760 --> 00:01:35,600
function,

20
00:01:36,170 --> 00:01:37,610
you first have to give the

21
00:01:38,770 --> 00:01:39,800
location of the file.

22
00:01:41,680 --> 00:01:42,690
Note two things here.

23
00:01:43,360 --> 00:01:44,110
First thing is.

24
00:01:45,370 --> 00:01:50,170
If you copy and paste the location of the file, by default you will get backslashes in it.

25
00:01:51,190 --> 00:01:53,410
But in R, when you are giving the location,

26
00:01:53,950 --> 00:01:55,300
there should be forward slashes.

27
00:01:55,540 --> 00:01:56,710
So it should be like this.

28
00:01:57,520 --> 00:01:59,740
That is a forward slash and not like this.

29
00:01:59,910 --> 00:02:01,000
That is a backward slash.

30
00:02:02,580 --> 00:02:10,050
The second parameter is header, which specifies whether the first row of your data contains

31
00:02:10,440 --> 00:02:11,190
headers or not.

32
00:02:11,280 --> 00:02:12,800
That is, the variable names or not.

33
00:02:13,440 --> 00:02:18,900
So since in my CSV file the first row contains the header, or the variable names,

34
00:02:19,080 --> 00:02:21,370
that is why I have written here header equal to TRUE.
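
The on-screen code is not captured in this transcript, so here is only a minimal sketch of the import step just described. The temporary file, the column names Time_taken and Start_Tech_Oscar, and the sample values are assumptions made for illustration; in the lecture the CSV sits at a fixed location on disk.

```r
# Hypothetical example data: write a tiny CSV to a temporary file so that
# this sketch runs on its own. The second data row has a blank cell.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Time_taken,Start_Tech_Oscar",
             "109.6,1",
             ",0",
             "147.2,1"), csv_path)

# A path copied on Windows contains backslashes; inside an R string the
# location should use forward slashes, so convert any backslashes.
csv_path <- gsub("\\", "/", csv_path, fixed = TRUE)

# header = TRUE tells read.csv that the first row holds the variable names.
movie <- read.csv(csv_path, header = TRUE)

# Inspect the structure of the imported data frame.
str(movie)
```

Running View(movie) in RStudio afterwards opens the same data frame in the data viewer.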

35
00:02:23,820 --> 00:02:28,110
Now the result of this function will be stored in movie.

36
00:02:28,440 --> 00:02:31,020
And the result is the data from the CSV.

37
00:02:31,680 --> 00:02:35,490
So when I run this by pressing Ctrl+Enter,

38
00:02:38,810 --> 00:02:43,650
you can see that a data frame called movie is created in the right section.

39
00:02:44,630 --> 00:02:45,830
If I click on movie,

40
00:02:51,440 --> 00:02:58,100
a command is run, which is View(movie). So you can either run this command, which is View(movie), to

41
00:02:58,100 --> 00:03:01,960
look at the data, or you can click here, which automatically runs this command.

42
00:03:03,970 --> 00:03:09,780
And you can see that this is the data that we want to do analysis on.

43
00:03:10,650 --> 00:03:16,170
It has all the variables in columns and all the observations in rows.

44
00:03:16,800 --> 00:03:26,760
And the last column, which is Start_Tech_Oscar, which we want to predict, contains the value one or zero,

45
00:03:27,030 --> 00:03:31,290
meaning that if that particular movie won an Oscar, there will be one here.

46
00:03:31,530 --> 00:03:34,750
And if that particular movie did not win an Oscar, there'll be zero.

47
00:03:36,870 --> 00:03:37,920
So let's get back.

48
00:03:39,600 --> 00:03:44,730
Now, we have imported the data and it is stored in this movie variable.

49
00:03:46,650 --> 00:03:53,130
But before we start analyzing the data, there are certain steps that we need to take, which are part of

50
00:03:53,130 --> 00:03:58,080
data preprocessing. Data preprocessing is critical to the model.

51
00:03:59,070 --> 00:04:04,440
If the data preprocessing is done correctly, only then will we be able to get good model results.

52
00:04:06,480 --> 00:04:14,430
I'm showing here only one step of data preprocessing, which is missing value imputation, because it is

53
00:04:14,430 --> 00:04:19,920
mandatory that there should not be any missing value in any cell of the data.

54
00:04:22,020 --> 00:04:25,610
If there is any missing value, our model will not be able to run.

55
00:04:26,430 --> 00:04:27,900
So because this is mandatory,

56
00:04:27,920 --> 00:04:29,100
that is why I am telling you.

57
00:04:29,370 --> 00:04:35,910
But in general, data preprocessing contains many steps, such as outlier treatment,

58
00:04:36,690 --> 00:04:43,770
looking at the variable distributions by plotting histograms and identifying skewness, or finding out

59
00:04:43,770 --> 00:04:48,870
correlations and removing variables which are highly correlated, and so on.

60
00:04:49,020 --> 00:04:51,330
So there are several steps in data preprocessing.

61
00:04:52,470 --> 00:04:55,860
So here we'll be doing missing value imputation.

62
00:04:56,640 --> 00:05:01,440
First, we need to identify the variables in which there are missing values.

63
00:05:01,860 --> 00:05:08,370
So we'll run the summary command, which gives us a summary of all the variables in our dataset.

64
00:05:10,200 --> 00:05:15,120
So you can see that for all the variables, I am getting the minimum, maximum, average,

65
00:05:15,870 --> 00:05:18,570
first, second and third quartile values.

66
00:05:20,410 --> 00:05:26,940
But for this variable, Time_taken, there is an additional value, which is NA's.

67
00:05:28,110 --> 00:05:31,700
So whenever there is a blank cell in your CSV file

68
00:05:32,190 --> 00:05:38,560
and we import that data into R, that blank cell will be saved as NA.

69
00:05:39,990 --> 00:05:42,490
And this Time_taken variable has 12

70
00:05:42,690 --> 00:05:46,260
NA's, that is, there are 12 cells which are empty.
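
As a rough illustration of how the missing values show up, here is a small sketch. The data frame below is a made-up stand-in for the lecture's movie data, with two missing cells instead of twelve, and the column names are assumed.

```r
# Hypothetical stand-in for the movie data frame from the lecture.
movie <- data.frame(Time_taken = c(109.6, NA, 147.2, NA, 130.4),
                    Start_Tech_Oscar = c(1, 0, 1, 0, 1))

# summary() reports the minimum, quartiles, median, mean and maximum of each
# numeric column, plus an NA's count for any column with missing values.
summary(movie)

# Count the missing values per column directly.
na_counts <- colSums(is.na(movie))
print(na_counts)
```

Only columns that actually contain missing cells get the extra NA's entry in the summary output.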

71
00:05:48,000 --> 00:05:53,490
Now, when we have empty cells in a variable, we need to put some value in those cells.

72
00:05:55,110 --> 00:05:59,520
Usually we have to look at business knowledge to determine

73
00:05:59,730 --> 00:06:03,960
what value to use to replace these blank values.

74
00:06:05,520 --> 00:06:08,310
Sometimes it makes sense to use zero,

75
00:06:08,940 --> 00:06:13,860
or it may make sense to use the maximum value out of all the observations

76
00:06:14,040 --> 00:06:21,870
as the replacement for blank cells. Generally, it makes more sense to replace those values with the mean

77
00:06:21,960 --> 00:06:26,400
or median of the other values that we actually have in that variable.

78
00:06:28,050 --> 00:06:32,310
So here I am showing you the method to replace these blank values

79
00:06:32,460 --> 00:06:36,930
using the mean of the other observations that we have.

80
00:06:38,460 --> 00:06:40,820
So let me explain to you this line of code.

81
00:06:42,000 --> 00:06:47,920
We first find the mean of all the values which are not NA.

82
00:06:49,170 --> 00:06:59,400
So for the movie dataset, in the Time_taken variable, we remove the NAs and find the mean. This

83
00:06:59,400 --> 00:07:07,140
value is to be placed in those cells where the Time_taken variable has an NA value.

84
00:07:10,230 --> 00:07:17,490
So wherever in the Time_taken variable we have an NA value, we will replace that NA value with the mean

85
00:07:17,490 --> 00:07:20,280
of all the other cells where the value is not NA.
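
The exact line of code shown on screen is not in this transcript, so the following is only one common way to write the mean imputation described above. The small data frame and the helper name time_mean are assumptions for illustration.

```r
# Hypothetical stand-in for the movie data, with two missing cells.
movie <- data.frame(Time_taken = c(100, NA, 140, NA, 120))

# Mean of the non-missing values; na.rm = TRUE drops the NAs before averaging.
time_mean <- mean(movie$Time_taken, na.rm = TRUE)

# is.na() marks the missing cells, and only those cells are overwritten.
movie$Time_taken[is.na(movie$Time_taken)] <- time_mean

# The summary no longer shows an NA's entry for this variable.
summary(movie$Time_taken)
```

Replacing the call to mean() with median() gives median imputation with the same indexing pattern.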

86
00:07:21,810 --> 00:07:22,580
So if I run this

87
00:07:22,580 --> 00:07:22,980
command

88
00:07:26,310 --> 00:07:34,800
and then look at the summary again, you will see that the Time_taken variable now does not have any NA.

89
00:07:36,600 --> 00:07:40,020
So in this way we do missing value imputation.

90
00:07:40,830 --> 00:07:43,490
This is the only thing we are doing in data preprocessing.

91
00:07:44,100 --> 00:07:46,950
There are other steps also that you should take.

92
00:07:47,520 --> 00:07:55,410
And I recommend that you learn all those steps so that you create a better model. After importing the

93
00:07:55,410 --> 00:07:57,450
data and doing data preprocessing,

94
00:07:57,780 --> 00:07:59,340
we are ready to work on our data.