1 00:00:01,410 --> 00:00:05,220 Now we are going to start building an SVM model in R. 2 00:00:08,170 --> 00:00:14,590 If you do not know how to set up R or RStudio, or you are not comfortable with the basic operations 3 00:00:14,590 --> 00:00:21,020 in R, I would recommend that you first complete the section on basics of R. 4 00:00:22,210 --> 00:00:23,030 In that section, 5 00:00:24,380 --> 00:00:30,790 you will learn how to set up RStudio and how to do some basic operations in R so that you do not 6 00:00:30,790 --> 00:00:34,600 face difficulty while creating this SVM model. 7 00:00:37,450 --> 00:00:38,170 So let's start. 8 00:00:38,950 --> 00:00:43,690 The first step of building a model is getting the data. 9 00:00:45,730 --> 00:00:47,510 I have the data in a CSV file. 10 00:00:47,980 --> 00:00:51,180 And that CSV file is saved at this location. 11 00:00:53,650 --> 00:00:59,090 So I'm importing that data into R using this read.csv function. 12 00:01:01,920 --> 00:01:10,560 So whenever you have your data in text format or Excel format or CSV format, try to create a 13 00:01:10,720 --> 00:01:11,830 CSV file out of it. 14 00:01:12,310 --> 00:01:16,110 And it will be very easy to get the data from that CSV file. 15 00:01:17,200 --> 00:01:21,160 However, in R, you can also import data from text files. 16 00:01:21,820 --> 00:01:29,350 But reading from a CSV is more efficient and there'll be less chance of any error when you do it 17 00:01:29,350 --> 00:01:29,890 like this. 18 00:01:33,920 --> 00:01:34,680 So in the 19 00:01:34,760 --> 00:01:35,600 read.csv function, 20 00:01:36,170 --> 00:01:37,610 you first have to give the 21 00:01:38,770 --> 00:01:39,800 location of the file. 22 00:01:41,680 --> 00:01:42,690 Note two things here. 23 00:01:43,360 --> 00:01:44,110 The first thing is: 24 00:01:45,370 --> 00:01:50,170 if you copy-paste the location of the file, by default you'll be getting backslashes in it. 
25 00:01:51,190 --> 00:01:53,410 But in R, when you're giving the location, 26 00:01:53,950 --> 00:01:55,300 it should have forward slashes. 27 00:01:55,540 --> 00:01:56,710 So it should be like this, 28 00:01:57,520 --> 00:01:59,740 that is, a forward slash, and not like this, 29 00:01:59,910 --> 00:02:01,000 that is, a backward slash. 30 00:02:02,580 --> 00:02:10,050 The second parameter is header, which specifies whether the first row of your data contains 31 00:02:10,440 --> 00:02:11,190 headers or not, 32 00:02:11,280 --> 00:02:12,800 that is, the variable names or not. 33 00:02:13,440 --> 00:02:18,900 Since in my CSV file the first row contains the header or the variable names, 34 00:02:19,080 --> 00:02:21,370 that is why I have written here header is equal to TRUE. 35 00:02:23,820 --> 00:02:28,110 Now the result of this function will be stored in movie. 36 00:02:28,440 --> 00:02:31,020 And the result is the data from the CSV. 37 00:02:31,680 --> 00:02:35,490 So when I run this by pressing Ctrl+Enter, 38 00:02:38,810 --> 00:02:43,650 you can see that a data frame called movie is created in the right section. 39 00:02:44,630 --> 00:02:45,830 If I click on movie, 40 00:02:51,440 --> 00:02:58,100 a command is run, which is View(movie), so you can either run this command, which is View(movie), to 41 00:02:58,100 --> 00:03:01,960 look at the data, or you can click here, which automatically runs this command. 42 00:03:03,970 --> 00:03:09,780 And you can see that this is the data that we want to do analysis on. 43 00:03:10,650 --> 00:03:16,170 It has all the variables in columns and all the observations in rows. 44 00:03:16,800 --> 00:03:26,760 And the last column, which is Start_Tech_Oscar, which we want to predict, contains the value one or zero, 45 00:03:27,030 --> 00:03:31,290 meaning that if that particular movie won an Oscar, there'll be one here. 46 00:03:31,530 --> 00:03:34,750 And if that particular movie did not win an Oscar, there'll be zero. 
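A minimal sketch of the import step described above. The file contents, the temporary path, and the column names `Budget`, `Time_taken`, and `Start_Tech_Oscar` are assumptions for illustration only; the lecture reads its own CSV from a location shown on screen.

```r
# Hypothetical sample data standing in for the lecture's CSV file.
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Budget,Time_taken,Start_Tech_Oscar",
  "36524.125,109.6,1",
  "35668.655,,0",
  "39912.675,147.88,1"
), sample_path)

# R path strings should use forward slashes (or doubled backslashes).
# normalizePath() with winslash = "/" converts a Windows-style path.
sample_path <- normalizePath(sample_path, winslash = "/")

# header = TRUE tells read.csv that the first row holds the variable names.
movie <- read.csv(sample_path, header = TRUE)

str(movie)      # structure of the imported data frame
# View(movie)   # opens the data viewer in RStudio (interactive sessions only)
```

The blank cell in the second data row is read as `NA`, which is the situation the missing value step later deals with.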
47 00:03:36,870 --> 00:03:37,920 So let's get back. 48 00:03:39,600 --> 00:03:44,730 Now, we have imported the data and it is stored in this movie variable. 49 00:03:46,650 --> 00:03:53,130 But before we start analyzing the data, there are certain steps that we need to take, which are part of 50 00:03:53,130 --> 00:03:58,080 data preprocessing. Data preprocessing is critical to the model. 51 00:03:59,070 --> 00:04:04,440 If the data preprocessing is done correctly, only then will we be able to get good model results. 52 00:04:06,480 --> 00:04:14,430 I'm showing here only one step of data preprocessing, which is missing value imputation, because it is 53 00:04:14,430 --> 00:04:19,920 mandatory that there should not be any missing value in any cell of the data. 54 00:04:22,020 --> 00:04:25,610 If there is any missing value, our model will not be able to run. 55 00:04:26,430 --> 00:04:27,900 So because this is mandatory, 56 00:04:27,920 --> 00:04:29,100 that is why I am telling you. 57 00:04:29,370 --> 00:04:35,910 But in general, data preprocessing contains many steps, such as outlier treatment, 58 00:04:36,690 --> 00:04:43,770 looking at the variable distributions by plotting histograms and identifying skewness, or finding 59 00:04:43,770 --> 00:04:48,870 correlations and removing variables which are highly correlated, and so on. 60 00:04:49,020 --> 00:04:51,330 So there are several steps in data preprocessing. 61 00:04:52,470 --> 00:04:55,860 So here we'll be doing missing value imputation. 62 00:04:56,640 --> 00:05:01,440 First, we need to identify the variables in which there are any missing values. 63 00:05:01,860 --> 00:05:08,370 So we'll run the summary command, which gives us a summary of all the variables in our dataset. 64 00:05:10,200 --> 00:05:15,120 So you can see for all the variables, I'm getting minimum, maximum, average, 65 00:05:15,870 --> 00:05:18,570 first, second and third quartile values. 
66 00:05:20,410 --> 00:05:26,940 But for this variable, Time_taken, there is an additional value, which is NA's. 67 00:05:28,110 --> 00:05:31,700 So whenever in your CSV file there is a blank cell 68 00:05:32,190 --> 00:05:38,560 and we import that data into R, that blank cell will be saved as an NA. 69 00:05:39,990 --> 00:05:42,490 And this Time_taken variable has 12 70 00:05:42,690 --> 00:05:46,260 NA's, that is, there are 12 cells which are empty. 71 00:05:48,000 --> 00:05:53,490 Now, when we have empty cells in a variable, we need to put some value in those cells. 72 00:05:55,110 --> 00:05:59,520 Usually we have to look at business knowledge to determine 73 00:05:59,730 --> 00:06:03,960 what value to be used to replace these blank values. 74 00:06:05,520 --> 00:06:08,310 Sometimes it makes sense to use zero. 75 00:06:08,940 --> 00:06:13,860 Or it may make sense to use the maximum value out of all the observations 76 00:06:14,040 --> 00:06:21,870 as the replacement for blank cells. Generally, it makes more sense to replace those values with the mean 77 00:06:21,960 --> 00:06:26,400 or median of the other values that we actually have in that variable. 78 00:06:28,050 --> 00:06:32,310 So here I am showing you the method to replace these blank values 79 00:06:32,460 --> 00:06:36,930 using the mean of the other observations that we have. 80 00:06:38,460 --> 00:06:40,820 So let me explain to you this line of code. 81 00:06:42,000 --> 00:06:47,920 We first find out the mean of all the values which do not have NA in them. 82 00:06:49,170 --> 00:06:59,400 So for the movie dataset, in the Time_taken variable, we remove the NAs and find out the mean. This 83 00:06:59,400 --> 00:07:07,140 value is to be placed in those cells where the Time_taken variable has an NA value. 84 00:07:10,230 --> 00:07:17,490 So wherever in the Time_taken variable we have an NA value, we will replace that NA value with the mean 85 00:07:17,490 --> 00:07:20,280 of all the other cells where the value is not NA. 
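The mean imputation just described can be sketched as follows. This is not necessarily the exact line used in the lecture; the small `movie` data frame and its values are made up for illustration, and the column name `Time_taken` is assumed from the narration.

```r
# Hypothetical stand-in for the imported data, with two blank (NA) cells.
movie <- data.frame(
  Time_taken       = c(100, NA, 140, 120, NA),
  Start_Tech_Oscar = c(1, 0, 1, 0, 1)
)

# summary() reports an extra "NA's" count for variables with missing values.
summary(movie$Time_taken)

# Mean of the observed values only; na.rm = TRUE drops the NAs first.
mean_time <- mean(movie$Time_taken, na.rm = TRUE)

# is.na() flags the missing cells; assign the mean only to those cells.
movie$Time_taken[is.na(movie$Time_taken)] <- mean_time

# The "NA's" entry no longer appears after imputation.
summary(movie$Time_taken)
```

Swapping `mean()` for `median()` in the same pattern gives median imputation, the other option mentioned above.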
86 00:07:21,810 --> 00:07:22,580 So if I run this 87 00:07:22,580 --> 00:07:22,980 command 88 00:07:26,310 --> 00:07:34,800 and then look at the summary again, you'll see that the Time_taken variable now does not have any NA. 89 00:07:36,600 --> 00:07:40,020 So in this way we do missing value imputation. 90 00:07:40,830 --> 00:07:43,490 This is the only thing we are doing in data preprocessing. 91 00:07:44,100 --> 00:07:46,950 There are other steps also that you should take. 92 00:07:47,520 --> 00:07:55,410 And I recommend that you learn all those steps so that you create a better model. After importing the 93 00:07:55,410 --> 00:07:57,450 data and doing data preprocessing, 94 00:07:57,780 --> 00:07:59,340 we are ready to work on our data.
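Since the model needs every cell to be filled, it can help to count missing values per column rather than reading them off the summary output. A small sketch, using a made-up data frame with assumed column names:

```r
# Hypothetical data frame; values and column names are for illustration only.
movie <- data.frame(
  Budget     = c(36524, 35668, NA),
  Time_taken = c(NA, 110.5, NA),
  Genre      = c("Drama", "Comedy", "Action")
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(movie))
print(na_counts)

# Columns that still need imputation before modelling.
print(names(na_counts)[na_counts > 0])

# anyNA() is a quick yes/no check over the whole data frame.
print(anyNA(movie))
```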