Now we have imported the data and the raw data is ready. We will first take a look at a sample of this data. To take the sample, we will write our variable name, which is df, and then dot head: df.head().

We are getting the top five rows from the raw data. This is a sample of our data. We have 19 columns, and from the sample you can see the type of each variable and what values it is taking.

Next, to look at the count of each variable and at its data type, we will write df.info().

Here you can see we have our variable names on the left. Then we have a column for the number of non-null values. If you notice, for n_hos_beds the number of values is 498. That means eight values are missing from this variable. We need to handle these missing values before running logistic regression, or any other kind of machine learning algorithm, on this data.

And on the right-hand side, you can see the type of each variable: float64 stands for numerical data. We also have object, which is a distinct type of data; that is categorical data.

To get the number of rows and columns in your data, you can also use dot shape. So if I write df.shape, I will get the number of rows and columns.
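The steps above can be sketched as follows. The lecture's actual CSV isn't reproduced here, so this uses a small made-up frame; the column names (n_hos_beds, airport, bus_ter) follow the lecture, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lecture's data: numeric columns
# (one with missing values) plus categorical ones.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2],
    "n_hos_beds": [5.48, np.nan, 7.39, 9.34, np.nan],
    "airport": ["YES", "NO", "NO", "YES", "NO"],
    "bus_ter": ["YES", "YES", "YES", "YES", "YES"],
})

print(df.head())   # first five rows of the data
df.info()          # non-null count and dtype for every column
print(df.shape)    # (number of rows, number of columns)

# df.info() shows non-null counts; the missing count per column is
# also available directly:
missing = df.isnull().sum()
print(missing["n_hos_beds"])
```

Here df.info() would report 3 non-null values out of 5 for n_hos_beds, i.e. two missing, which is the same kind of gap the lecture finds in the real data (498 of 506).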
So as you can see, we have five hundred and six rows in total, and there are 19 columns, of which eighteen columns are our independent variables, and the last column is our dependent variable.

Now let's run an EDD on this data. By default, you will get the EDD only for your numerical variables, so you will not get it for the airport, waterbody and bus_ter variables. To run the EDD, we just have to write dot describe. So we'll write df, which is our variable name, and then .describe().

If we run this, you can see all our numerical variables are listed along the top, and we have values such as count, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile and 75th percentile for each of these variables.

Count stands for the total number of values in that variable. Mean stands for the mean, or average, of that variable. Standard deviation is the standard deviation of that variable. Min and max are the minimum and maximum values that variable is taking in our dataset. Then we have the 25th, 50th and 75th percentile values.
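A minimal sketch of the describe step, using an invented rainfall-like column rather than the lecture's full dataset:

```python
import pandas as pd

# Hypothetical numerical column; describe() summarises each numeric
# column with count, mean, std, min, the three quartiles, and max.
df = pd.DataFrame({"rainfall": [3, 28, 39, 42, 60]})
summary = df.describe()
print(summary)
# The index of `summary` is: count, mean, std, min, 25%, 50%, 75%, max.
# The 50% row is the median of the column.
```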
Percentile, if you remember, just means that if you arrange all the values in ascending order, your 25th percentile value will be the value that occurs at the 25 per cent position of that arranged data. Similarly, the 50th percentile is the value that occurs at the 50 per cent position of that data; this is the same as the median of the data. And similarly, the 75th percentile stands for the value that occurs at the 75 per cent position of the arranged data.

Now we will look at all of these variables one by one. What we want from this EDD is the number of missing values, and the variables which have outliers. Outliers means values that are not following the pattern of that variable. So, for example, if the values of some variable are between one and ten, and then there is just one value which is in the range of a thousand or ten thousand, then we call that value an outlier.

So first we want to identify missing values. Second, we want to identify the outliers. Third, we will look at the distribution of the categorical variables.

We have already identified the missing values with df.info(); you can also look at the count row of this EDD.

To identify the outliers, there are two methods. First, you have to look at the difference between the mean and the median.
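The mean-versus-median check can be demonstrated on a tiny made-up sample: a single extreme value drags the mean upward while barely moving the median.

```python
import pandas as pd

# Ten well-behaved values, then the same values plus one outlier.
clean = pd.Series([11, 12, 12, 13, 13, 14, 14, 15, 15, 16])
with_outlier = pd.concat([clean, pd.Series([101])], ignore_index=True)

# Without the outlier, mean and median agree.
print(clean.mean(), clean.median())            # 13.5 13.5

# With the outlier, the mean jumps while the median stays close,
# so a large mean-median gap is a cheap outlier signal.
print(with_outlier.mean(), with_outlier.median())
```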
The median is the 50th percentile value. So if there is any outlier, there will be a huge difference between the mean and the median, because outliers only affect the mean, not the median value. That is, if there is an outlier in one of our variables, we will see a huge difference between the mean and the median.

Second, we can look at the distribution of the minimum, the 25th percentile, the 50th percentile, the 75th percentile and the maximum to notice any outlier. If you see a major difference between any two consecutive values, that means there may be an outlier.

For example, in our first column, which is for price, you can see the minimum value is five and the 25th percentile value is seventeen. Then again, the 50th percentile value is 21. If you notice, there is not a great difference between any two of these categories; the differences between consecutive categories fall between four and 25. So you can see there is not a huge difference, or we can say that there is no outlier in our price data.

Now, similarly, you can look at each of these individual variables. I will directly jump to the variables in which there are defects. So if you look at the variable n_hot_rooms: the minimum value is ten and the 25th percentile value is eleven.
Again, the 50th and the 75th percentile values are twelve and fourteen, and the maximum value is one zero one: 101. You can see there is a huge difference between the 75th percentile value and the maximum value. So we can say that there is something wrong with this data: either there is a skewed distribution, or there is an outlier in this data.

Similarly, if you look at the rainfall data, the minimum value is just three, while the 25th percentile value is 28, and then the 50th percentile value is 39, and so on. You can see that there is an outlier on the lower end of this data.

To confirm our assumptions, there are several methods. We will first draw a box plot for our n_hot_rooms, and then we will use a scatter plot to find outliers for our rainfall.

One more thing: there is no hard and fast rule for identifying outliers or for tidying your data. These are iterative processes, and you have to come back if you are finding any difficulty in handling this data later on in the process.

Now, before drawing our box plot for n_hot_rooms, I will first explain what a box plot is, using an example for n_hos_beds. So I will write sns.boxplot, and then y equal to 'n_hos_beds', and then data equal to df.
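The consecutive-percentile check can be automated. This sketch uses an invented series shaped like the n_hot_rooms column described above: quartiles close together, then a huge jump from the 75th percentile to the maximum.

```python
import pandas as pd

# Hypothetical column: the bulk sits between 10 and 15, with one
# extreme value of 101, mimicking the lecture's n_hot_rooms pattern.
n_hot_rooms = pd.Series([10, 11, 11, 12, 12, 13, 14, 14, 15, 101])

# The five summary points from describe(): min, 25%, 50%, 75%, max.
q = n_hot_rooms.quantile([0.0, 0.25, 0.50, 0.75, 1.0])
print(q)

# Difference between each pair of consecutive summary points;
# one spike dwarfing the others flags a possible outlier or skew.
gaps = q.diff().dropna()
print(gaps.idxmax())   # the largest gap is between 75% and max
```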
So this is our box plot. In the middle you can see a rectangle with a line in between. The upper line of this box is the third quartile value, which is the same as the 75th percentile value. The bottom line of this box is the first quartile value, which is the 25th percentile value. And the middle line is the 50th percentile value, also known as the median.

Now, the difference between the upper line of this box and the lower line of this box, which is the 75th percentile value minus the 25th percentile value, is known as the interquartile range, also known as the IQR. And these whiskers, which are present at the top and at the bottom of the box, are calculated using this IQR. By default, they extend 1.5 times the IQR from the upper and the lower end of the box. We are not going into detail about this, but whatever points lie outside these whiskers are known as outliers.

If you want to know more about boxplot, you can write this statement: you can use a question mark before any function to open the help for that function. So if we execute it, we will get the default help about this function. So you can use a question mark and run the line to open the help section.

So now let's draw a box plot for n_hot_rooms. We will write sns.boxplot.
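The 1.5 × IQR rule the box plot uses can be computed directly, which is handy when you want the outlier rows themselves rather than a picture. The values below are invented stand-ins for a column like n_hos_beds.

```python
import pandas as pd

# Hypothetical series: eight ordinary values plus one extreme one.
s = pd.Series([5.0, 6.0, 7.0, 7.5, 8.0, 8.5, 9.0, 10.0, 21.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Box-plot convention: points beyond 1.5 * IQR from the box edges
# are drawn individually as outliers ("fliers").
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(lower, upper)
print(outliers.tolist())
```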
Then our y variable is n_hot_rooms, and the data is df. If we execute this, you can see the box and the whiskers are lying at values below 20. There are two points present outside this box plot, which are at values around 80 and 100. Clearly, these two points are outliers for this variable: there are no values between 20 and 80, and these two outliers are the only values lying above 80. So from this box plot we have identified two outliers in our n_hot_rooms.

Now let's discuss the second method of identifying outliers, which is the scatter plot. We will draw a scatter plot between our X and Y variables. Here, our X variable will be rainfall, and our Y variable will be our dependent variable, which is Sold. To plot the scatter plot, we'll write sns.jointplot, then our x is 'rainfall', y is 'Sold', and data is df.

You can see on the top we have the histogram of our rainfall data, and on the right-hand side we have the histogram of our Sold, which is the dependent variable. Our dependent variable is only taking the values zero and one; that's why we are only getting two bars in its histogram. And in between, in the plot area, we have the scatter plot. As you can see, all the points are lying between 20 and 60, except one point, which is near zero.
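What the scatter plot shows visually on the lower end can also be read off the numbers: the minimum sits far below the 25th percentile. A sketch with invented rainfall values in the same shape as the lecture's (bulk between roughly 20 and 60, one stray reading near zero):

```python
import pandas as pd

# Hypothetical rainfall column: one isolated low value, then a bulk.
rainfall = pd.Series([3, 25, 28, 33, 39, 41, 47, 52, 58, 60])

q = rainfall.quantile([0.0, 0.25, 0.50])
print(q)

# A large jump from min to the 25th percentile points at a
# lower-end outlier, matching the lone point in the scatter plot.
gap = q[0.25] - q[0.0]
print(gap)
```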
We can clearly see this is a kind of outlier for this rainfall data, and we will treat this outlier in a later part of our course.

Now we have two observations from our EDD of the numerical data. First, there are missing values in n_hos_beds. Second, there are outliers on the higher end in n_hot_rooms, and there is an outlier on the lower end in the rainfall variable.

Now let's look at our categorical variables. For our categorical variables, we will plot bar graphs. Our first categorical variable is airport. To plot a graph for it, we will write sns.countplot, with x equal to 'airport' and data equal to df.

You can see our airport variable is taking two values, yes and no, and the height of each bar represents the number of occurrences of that value in our data. As you can see, there is nothing wrong with this data.

Now let's move on to our next variable, which is waterbody. We will copy this code and write 'waterbody' instead of 'airport'. So our waterbody variable is taking four values: River, Lake, None, and River & Lake. Again, there is nothing wrong with this data.
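The information a countplot draws as bar heights is exactly what value_counts computes, so you can also inspect categorical distributions without plotting. The frame below is invented; the category labels follow the lecture.

```python
import pandas as pd

# Hypothetical categorical columns like airport and waterbody.
df = pd.DataFrame({
    "airport": ["YES", "NO", "YES", "YES", "NO", "NO", "YES"],
    "waterbody": ["River", "Lake", "None", "River & Lake",
                  "Lake", "River", "None"],
})

# One count per category: these are the countplot's bar heights.
counts = df["airport"].value_counts()
print(counts)

# Number of distinct categories a column takes.
print(df["waterbody"].nunique())
```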
Next, to get a similar plot for the bus_ter variable, we will just write 'bus_ter' instead of 'waterbody' and run this.

Here you can see bus_ter is only taking one value, which is yes. So in a way, this is not a variable; this is a constant. It will not provide any additional information to our model; it will just act as a constant. Therefore, we can delete this variable, or drop it from our data, as it will not provide any additional information to the model.

Now let's note down our observations. Our first observation was missing values in n_hos_beds. Our second observation was that bus_ter has only yes values, so it will not provide any additional information for our model, and we can drop it. And our third observation was about outliers in n_hot_rooms and rainfall.

So always perform this EDD and univariate analysis before training your model. This will help you clean your data and make your data usable for your machine learning algorithms.
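The "constant column" check and drop from the observations above can be sketched like this, again on an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7],
    "bus_ter": ["YES", "YES", "YES"],   # constant: one distinct value
})

# A column with a single unique value carries no information for the
# model, so find such columns and drop them.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)     # -> ['bus_ter']
print(list(df.columns))  # -> ['price']
```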