1 00:00:01,430 --> 00:00:05,160 Now, let us see how to do univariate analysis in R. 2 00:00:06,270 --> 00:00:11,190 We'll first find the extended data dictionary of each variable, that is, EDD. 3 00:00:11,520 --> 00:00:17,420 And after looking at EDD, if we have doubts about the distribution of a particular variable, we will plot 4 00:00:17,570 --> 00:00:20,440 its histogram to look at its distribution. 5 00:00:21,590 --> 00:00:24,740 Then we will also plot bar plots of the categorical variables. 6 00:00:25,430 --> 00:00:28,910 And lastly, we will summarize our observations from the univariate analysis. 7 00:00:31,360 --> 00:00:32,390 To get EDD, 8 00:00:32,450 --> 00:00:36,710 we just run summary and within brackets write df. 9 00:00:40,300 --> 00:00:41,160 And we run this. 10 00:00:41,430 --> 00:00:42,850 You can see the output in the window below. 11 00:00:43,320 --> 00:00:45,480 Let us make this window big. 12 00:00:45,870 --> 00:00:46,620 Let's look here. 13 00:00:48,450 --> 00:00:58,750 So for each variable we are getting the minimum, maximum, mean and quartile values. By quartile, I 14 00:00:58,750 --> 00:01:04,840 mean the first quartile, that is the 25th percentile, the median, which is the 50th percentile, 15 00:01:05,320 --> 00:01:07,270 and the third quartile, which is the 75th percentile. 16 00:01:08,300 --> 00:01:15,340 So since this is a little bit difficult to read, because the same value is going into different 17 00:01:16,300 --> 00:01:18,820 rows for different variables, 18 00:01:19,690 --> 00:01:22,030 let us put it into a readable format. 19 00:01:22,360 --> 00:01:24,940 We will reduce the font size for this. 20 00:01:25,610 --> 00:01:26,560 So go to View, 21 00:01:28,200 --> 00:01:34,570 Panes, and we'll go to Pane Layout, then Appearance. 22 00:01:35,370 --> 00:01:37,830 We will reduce this font size from thirteen to eleven. 23 00:01:39,210 --> 00:01:39,940 Click on Apply. 24 00:01:42,010 --> 00:01:47,200 Let's click on OK, and we will extend this to the right. 
25 00:01:48,320 --> 00:01:48,950 So that 26 00:01:50,200 --> 00:01:51,880 it will align in order. 27 00:01:52,420 --> 00:01:52,690 So. 28 00:01:53,930 --> 00:01:54,650 Let's keep it here. 29 00:01:56,360 --> 00:01:57,920 Now it is clearly visible. 30 00:01:59,680 --> 00:02:04,490 As you know, the difference between mean and median values really indicates skewness and outliers. 31 00:02:05,320 --> 00:02:10,690 So if we look at price, its mean and median values seem fairly close. 32 00:02:12,120 --> 00:02:14,550 So we need not look at price. 33 00:02:15,590 --> 00:02:17,240 Crime rate, however, has 34 00:02:18,780 --> 00:02:21,030 a bigger difference between mean and median values. 35 00:02:21,930 --> 00:02:26,130 Another indicator of skewness and outliers is the distribution within quartiles. 36 00:02:26,730 --> 00:02:34,010 So if we look at the first quartile of crime rate, it is within a short range of 0.06 to 37 00:02:34,010 --> 00:02:39,420 0.08; in fact, it is 0.006 to 0.08. 38 00:02:40,540 --> 00:02:46,030 That is the first quartile, that is, 25 percent of values are within this short range. 39 00:02:47,220 --> 00:02:49,950 But the third quartile to the maximum value, 40 00:02:50,040 --> 00:02:51,450 that is the last quartile, 41 00:02:53,140 --> 00:02:56,410 has a huge range, from 3.6 to 88.9. 42 00:02:57,880 --> 00:02:59,800 And it also has 25 percent of the values. 43 00:03:00,520 --> 00:03:07,000 So there is a huge difference in the ranges, and they both contain only 25 percent of the values. 44 00:03:07,840 --> 00:03:13,210 So this variable either has outliers or skewness in its distribution. 45 00:03:14,680 --> 00:03:22,420 As we move ahead and look at each variable and its mean and median values, the next variable of interest 46 00:03:22,420 --> 00:03:23,860 is n_hot_rooms. 
47 00:03:26,140 --> 00:03:30,190 Here also the mean and median values have some difference. 48 00:03:30,370 --> 00:03:34,930 But if you look at the quartiles, the max value is exceptionally high. 49 00:03:35,320 --> 00:03:41,530 So from the minimum of 10.06 to the third quartile, it is only going from 10 to 14. 50 00:03:42,190 --> 00:03:44,590 But from the third quartile to the maximum, 51 00:03:45,190 --> 00:03:47,350 it is moving from 14 to 101. 52 00:03:48,910 --> 00:03:53,200 So there is some distribution or outlier issue here. 53 00:03:54,450 --> 00:03:56,120 Similarly with rainfall. 54 00:03:57,670 --> 00:03:58,660 The minimum is 3. 55 00:04:00,140 --> 00:04:02,800 And the first quartile comes at 28. 56 00:04:03,980 --> 00:04:11,340 But for all other quartiles, the distribution is nearly within a range of 10 units. So this minimum 57 00:04:11,380 --> 00:04:15,560 to first quartile range is a little bit skewed or has some outliers. 58 00:04:16,400 --> 00:04:24,370 So for crime rate, n_hot_rooms and rainfall, we believe there is either skewness in the distribution 59 00:04:25,320 --> 00:04:26,670 or presence of outliers. 60 00:04:27,560 --> 00:04:31,920 Now, another thing to notice is in this variable, n_hos_beds. 61 00:04:33,100 --> 00:04:34,650 It has one additional value of 62 00:04:34,800 --> 00:04:37,190 NA's; there are eight NA values. 63 00:04:38,110 --> 00:04:40,750 So while we import data from a CSV file, 64 00:04:40,870 --> 00:04:46,570 if there is any blank cell, R automatically assigns it a value of NA. 65 00:04:48,070 --> 00:04:50,380 So we need to handle these missing values also. 66 00:04:51,220 --> 00:04:55,960 Now, let's get back to these three variables that we identified earlier, which either have outliers 67 00:04:55,990 --> 00:04:56,740 or skewness. 68 00:04:57,520 --> 00:05:03,190 We need to see the actual distribution to identify which issue they are actually facing. 
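The EDD checks described up to this point can be sketched in R roughly as follows. This is only an illustration: the course dataset is not part of this transcript, so the data frame below is synthetic, and the column names (price, crime_rate, n_hos_beds) are assumptions reconstructed from the audio.

```r
# Synthetic stand-in for the course data; names and values are illustrative only.
set.seed(42)
df <- data.frame(
  price      = rnorm(100, mean = 22, sd = 5),
  crime_rate = rexp(100, rate = 0.3),                        # right-skewed on purpose
  n_hos_beds = c(rnorm(92, mean = 8, sd = 1.5), rep(NA, 8))  # 8 missing values
)

# EDD: min, quartiles, median, mean, max, plus an NA count where present
print(summary(df))

# Mean minus median per column; a large gap hints at skewness or outliers
gap <- sapply(df, function(x) mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))
print(gap)

# Width of the first and last quartile ranges for crime_rate;
# each range holds 25 percent of the values
q <- quantile(df$crime_rate, probs = c(0, 0.25, 0.75, 1))
first_width <- unname(q[2] - q[1])
last_width  <- unname(q[4] - q[3])
print(c(first = first_width, last = last_width))

# Missing values per column
na_counts <- colSums(is.na(df))
print(na_counts)
```

Comparing first_width with last_width is the same quartile-range comparison made by eye on the summary output, just expressed as numbers.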
69 00:05:04,210 --> 00:05:07,930 The distribution can be seen in histograms or in scatterplots. 70 00:05:09,180 --> 00:05:11,490 First, we will plot a histogram for each of these. 71 00:05:12,870 --> 00:05:13,940 To plot a histogram, 72 00:05:14,100 --> 00:05:15,600 just write hist, 73 00:05:17,930 --> 00:05:18,610 H-I-S-T. 74 00:05:20,280 --> 00:05:21,950 And within brackets, we write 75 00:05:26,050 --> 00:05:29,520 df$crime_rate. 76 00:05:36,910 --> 00:05:38,950 On the right, you can see the plot for this. 77 00:05:41,190 --> 00:05:47,250 If we look at the histogram, it has the bulk of values on the left, between zero and ten probably. 78 00:05:48,360 --> 00:05:51,690 But how many values are going beyond the value of twenty, 79 00:05:51,810 --> 00:05:55,980 we are not really sure. Even if there are 10 to 15 values, 80 00:05:56,370 --> 00:05:57,690 these may not be outliers. 81 00:05:57,720 --> 00:05:59,310 These may be genuine values. 82 00:05:59,940 --> 00:06:04,650 So this histogram is not giving us the correct picture for this particular variable. 83 00:06:05,130 --> 00:06:07,320 We should go and look at the scatterplot. 84 00:06:08,370 --> 00:06:16,650 So let's plot scatterplots for each of these three variables. To get scatterplots for all of these simultaneously, 85 00:06:16,860 --> 00:06:21,180 we will write pairs, and within brackets 86 00:06:25,560 --> 00:06:31,990 we will start with a tilde, and we'll start writing all the variables that we want in this plot. 87 00:06:32,390 --> 00:06:33,470 So first is price, 88 00:06:37,780 --> 00:06:38,290 plus 89 00:06:41,560 --> 00:06:45,970 crime_rate, plus n_hot_rooms, 90 00:06:51,450 --> 00:06:53,850 plus rainfall, comma, 91 00:06:58,640 --> 00:07:00,820 data equals df. 92 00:07:07,080 --> 00:07:07,500 Run this. 93 00:07:15,180 --> 00:07:16,390 Let us correct this variable. 94 00:07:16,950 --> 00:07:18,060 It should be n_hot_rooms. 95 00:07:18,790 --> 00:07:19,650 Let's run it again. 
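For reference, the two plotting commands dictated above would look something like this when typed out. The data frame here is a synthetic placeholder, not the course dataset, and the column names are assumptions taken from how they sound in the recording; the outliers in it are planted deliberately so the plots have something to show.

```r
# Synthetic placeholder data; values are made up for illustration.
set.seed(1)
df <- data.frame(
  price       = rnorm(100, mean = 22, sd = 5),
  crime_rate  = rexp(100, rate = 0.3),
  n_hot_rooms = c(runif(98, min = 10, max = 15), 80, 101),  # two planted high values
  rainfall    = c(runif(99, min = 20, max = 60), 3)         # one planted low value
)

# Histogram of a single variable; hist() also returns the bin counts
h <- hist(df$crime_rate)

# Scatterplot matrix: a one-sided formula after the tilde lists the
# variables, and the data frame is passed through the data argument
pairs(~ price + crime_rate + n_hot_rooms + rainfall, data = df)
```

In the scatterplot matrix, the first row shows price against each of the other three variables, which are the panels examined next.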
96 00:07:21,120 --> 00:07:23,850 You can see we have a set of scatterplots here. 97 00:07:23,880 --> 00:07:26,160 Let us click on the Zoom button to look at them. 98 00:07:28,060 --> 00:07:28,890 Let's maximize it. 99 00:07:32,480 --> 00:07:36,200 So first, we need to look at price vs. crime rate. 100 00:07:37,350 --> 00:07:41,430 It has a lot of values within the range of zero to 10. 101 00:07:41,940 --> 00:07:50,280 But still a considerable number of values are going beyond this value too. Price and crime rate probably 102 00:07:50,280 --> 00:07:52,410 do not have a linear relationship. 103 00:07:52,980 --> 00:07:54,930 They have some other type of relationship. 104 00:07:57,380 --> 00:08:04,280 So we need to transform this variable in some way so that they end up having a linear relationship. 105 00:08:04,640 --> 00:08:07,040 We will see this transformation a little later. 106 00:08:08,220 --> 00:08:11,160 Next, we will look at price and n_hot_rooms. 107 00:08:13,000 --> 00:08:20,650 If you look at this graph, all the values are within a short range, but two values are exceptionally 108 00:08:20,740 --> 00:08:21,440 out of order. 109 00:08:23,190 --> 00:08:29,370 These exceptionally out-of-order values are certainly outliers that do not behave in any particular 110 00:08:29,370 --> 00:08:29,850 pattern. 111 00:08:30,260 --> 00:08:31,770 They are clearly outliers. 112 00:08:32,250 --> 00:08:33,630 Similarly for rainfall. 113 00:08:35,840 --> 00:08:39,680 All of these values seem to be beyond 20 and within 60. 114 00:08:41,260 --> 00:08:44,260 Only one value is below 10. 115 00:08:44,770 --> 00:08:46,750 This value is clearly an outlier. 116 00:08:47,560 --> 00:08:56,290 So we have identified that n_hot_rooms and rainfall have outliers. Crime rate has some different type 117 00:08:56,290 --> 00:08:57,580 of relationship with price, 118 00:08:57,760 --> 00:09:03,530 and we will manipulate this variable so that it has a linear type of relationship with price. 
119 00:09:03,700 --> 00:09:04,210 Later on. 120 00:09:08,810 --> 00:09:12,950 One last thing we should do is look at bar plots of categorical variables. 121 00:09:14,020 --> 00:09:15,790 We have three categorical variables: 122 00:09:16,820 --> 00:09:23,140 airport, waterbody and bus terminal, which is represented by bus_ter. 123 00:09:25,040 --> 00:09:26,240 To plot a bar plot, 124 00:09:27,050 --> 00:09:27,750 we just write 125 00:09:29,160 --> 00:09:29,990 barplot, 126 00:09:31,490 --> 00:09:32,480 and within brackets 127 00:09:37,390 --> 00:09:44,520 we'll write table, and again within brackets we will specify the variable. 128 00:09:44,700 --> 00:09:47,230 It is df$airport. 129 00:09:53,570 --> 00:09:54,350 Now run this. 130 00:09:56,930 --> 00:09:57,980 So here on the right, 131 00:09:58,160 --> 00:10:00,020 we have the bar plot of 132 00:10:01,420 --> 00:10:04,270 airport, which is a categorical variable with two values, 133 00:10:04,450 --> 00:10:05,380 yes and no. 134 00:10:06,070 --> 00:10:12,190 And by looking at this graph, we do not see anything suspicious about this particular variable. 135 00:10:14,370 --> 00:10:16,560 Now, let us do the same thing for waterbody. 136 00:10:19,680 --> 00:10:20,640 Just change this variable 137 00:10:20,670 --> 00:10:21,750 to waterbody. 138 00:10:26,160 --> 00:10:35,480 And run it. This has four values: lake, lake and river, none, 139 00:10:35,830 --> 00:10:36,400 and river. 140 00:10:37,870 --> 00:10:39,690 And this also seems fine to us. 141 00:10:41,080 --> 00:10:44,130 Now, let us run the bar plot for bus terminal. 142 00:10:52,970 --> 00:11:00,660 Here you can see it has only one value: all the cities of our dataset have a bus terminal in them. 143 00:11:02,330 --> 00:11:09,490 So since all cities of our data have a bus terminal, we cannot identify whether this variable impacts 144 00:11:09,490 --> 00:11:10,760 the final solution or not. 
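The bar plot commands above, written out as a small sketch. The data frame is a tiny made-up example rather than the course dataset, and the column names (airport, waterbody, bus_ter) and category labels are assumptions based on the audio. The last few lines are an addition not shown in the video: one possible way to flag a single-valued column such as bus_ter without reading the plot.

```r
# Tiny made-up example; the real dataset has many more rows.
df <- data.frame(
  airport   = c("YES", "NO", "YES", "YES", "NO", "NO"),
  waterbody = c("Lake", "River", "None", "Lake and River", "None", "Lake"),
  bus_ter   = rep("YES", 6),
  stringsAsFactors = TRUE
)

# table() counts the rows in each category; barplot() draws those counts
airport_counts <- table(df$airport)
barplot(airport_counts)
barplot(table(df$waterbody))
barplot(table(df$bus_ter))   # a single bar: every row has the same value

# Columns with only one distinct value cannot explain any variation
n_distinct <- sapply(df, function(x) length(unique(x)))
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)
```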
145 00:11:11,950 --> 00:11:14,740 So we need to ignore this variable in our analysis. 146 00:11:15,850 --> 00:11:22,930 So we did EDD, plotted scatterplots and bar plots to identify variables with outliers, skewness, 147 00:11:23,200 --> 00:11:25,840 missing values, and a useless categorical variable. 148 00:11:27,230 --> 00:11:28,840 So let's list down the observations. 149 00:11:30,610 --> 00:11:35,200 First is, n_hot_rooms and rainfall have outliers. 150 00:11:41,700 --> 00:11:45,390 And rainfall has outliers. 151 00:11:48,780 --> 00:11:52,830 Second, n_hos_beds has missing values. 152 00:12:04,130 --> 00:12:08,240 Third, bus terminal is a useless variable. 153 00:12:19,680 --> 00:12:22,220 And the last observation is crime rate. 154 00:12:25,930 --> 00:12:30,140 Crime rate has some other functional relationship with price. 155 00:12:40,350 --> 00:12:42,390 We'll handle these issues in the coming videos.