1 00:00:00,720 --> 00:00:08,490 In this video, we are going to learn how to do univariate analysis in Microsoft Excel, as I told you, 2 00:00:08,490 --> 00:00:09,330 in detail related. 3 00:00:10,290 --> 00:00:17,460 We want to look at some summary statistics like mean median mode of all the quantitative variables. 4 00:00:19,440 --> 00:00:24,000 To do that, we use data analysis then in Excel. 5 00:00:26,890 --> 00:00:34,570 You can find the data analysis and data menu if you are not able to see this data analysis option, 6 00:00:34,990 --> 00:00:42,490 you need to go to the file menu, click on the options button here. 7 00:00:42,490 --> 00:00:43,690 You have to go to Alden's. 8 00:00:46,950 --> 00:00:52,020 In the bottom, you have to click on Go Button, which is for many excellent. 9 00:00:54,240 --> 00:01:02,040 Hair analysis tool that should be paid if testing you, you'll get this data analysis and then you determine 10 00:01:02,370 --> 00:01:04,220 if it is not, you will not see this. 11 00:01:04,410 --> 00:01:05,930 So take this and click on. 12 00:01:09,870 --> 00:01:18,270 Before we use the data analysis again to do univariate analysis, we need to segregate the quantitative 13 00:01:18,300 --> 00:01:19,740 and qualitative variables. 14 00:01:21,330 --> 00:01:29,730 As you can see in our table, we have several qualitative variables, such as airport, water, body. 15 00:01:31,250 --> 00:01:32,150 And Buster. 16 00:01:33,850 --> 00:01:37,610 These variables are categorical, they do not have numerical values in them. 17 00:01:39,160 --> 00:01:44,450 We will take all these categorical variables and put them in the end of the data. 18 00:01:44,470 --> 00:01:44,830 They will. 19 00:01:47,350 --> 00:01:54,400 So we'll select the whole column by clicking on the name of the column, pressing controllers to cut 20 00:01:54,820 --> 00:01:57,650 and control, we in the end to pasted. 21 00:02:00,550 --> 00:02:02,620 We will do the same for the other columns. 22 00:02:08,660 --> 00:02:14,240 This is the last categorical column now these blank columns will be deleted. 23 00:02:15,910 --> 00:02:18,740 Just right, click on the name of development, select delete. 24 00:02:24,440 --> 00:02:36,380 So all are quantitative variables are from column B to column Q, the columns are these are having categorical 25 00:02:36,380 --> 00:02:36,870 variables. 26 00:02:38,510 --> 00:02:43,300 So now we will go to data analysis here. 27 00:02:43,310 --> 00:02:45,480 We will select descriptive statistics. 28 00:02:46,280 --> 00:02:52,510 This is the option we need to select to do univariate analysis will click on OK. 29 00:02:53,910 --> 00:03:01,530 We will select the whole input range from B three to two five zero nine. 30 00:03:05,210 --> 00:03:11,270 You can see the shortcuts on the bottom right of the screen, which popped up when I pressed them. 31 00:03:13,600 --> 00:03:14,500 So important. 32 00:03:14,860 --> 00:03:23,530 And the next option is group by here we have to select the option where we have our variables, since 33 00:03:23,530 --> 00:03:29,830 our variables are in the columns and our observations are in rows, we want to group them by the columns. 34 00:03:31,890 --> 00:03:38,730 Next option is Labor's infrastructure, since our table contains Labor's infrastructure, that is the 35 00:03:38,730 --> 00:03:41,500 name of the variable is in the first row of the table. 36 00:03:42,150 --> 00:03:44,100 Therefore, we will take this checkbox. 37 00:03:47,460 --> 00:03:51,870 Then is the option that where do we want the airport to be placed? 38 00:03:52,110 --> 00:03:53,950 We want the airport in a new wheelchair. 39 00:03:54,090 --> 00:03:56,280 Therefore, we have selected new Wilkshire. 40 00:03:58,600 --> 00:04:00,730 Then is a check checkbox for some of these statistics. 41 00:04:00,880 --> 00:04:06,310 We will take it since we are doing univariate analysis and we do want to see these summary statistics. 42 00:04:08,370 --> 00:04:16,350 The last two options will be used to get the cartel values, since by default, descriptive statistics 43 00:04:16,350 --> 00:04:22,680 option is not going to give us the cartel values, we will put 25. 44 00:04:25,110 --> 00:04:26,100 In both of these. 45 00:04:27,880 --> 00:04:35,880 So the 25th smallest value is actually the first quartile value, that is the 25th percentile value 46 00:04:36,820 --> 00:04:44,740 and this twenty fifth largest value is going to be the 75 percentile value of the third quartile value. 47 00:04:46,420 --> 00:04:49,610 So with all these options, it will just click on, OK? 48 00:04:51,840 --> 00:04:56,010 And you can see in a new sheet I am getting. 49 00:04:57,160 --> 00:05:00,970 The descriptive statistics of all the quantitative variables. 50 00:05:02,860 --> 00:05:08,890 Here you can see the mean standard median mode, standard deviation. 51 00:05:11,040 --> 00:05:13,470 The Bulgarians and other such values. 52 00:05:15,670 --> 00:05:23,350 The point of looking at this descriptive statistic values is that we want to identify the problems in 53 00:05:23,350 --> 00:05:27,650 our dataset, a dataset can have several types of problems. 54 00:05:29,050 --> 00:05:36,220 One problem could be of missing values, for example, some variable, maybe missing a few of the values 55 00:05:36,220 --> 00:05:37,180 from its data. 56 00:05:38,560 --> 00:05:40,660 One problem could be of outliers. 57 00:05:40,930 --> 00:05:50,050 That is some variable maybe containing extraordinarily large or smaller values, although we will see 58 00:05:50,050 --> 00:05:54,730 how to identify and handle these problems in the coming videos here. 59 00:05:54,760 --> 00:06:01,660 We will see how to identify these problems using the descriptive statistics we have with us. 60 00:06:03,550 --> 00:06:12,370 So to identify if any particular variable is missing some values, we can use this control, all the 61 00:06:12,370 --> 00:06:14,490 variables should have the same goal. 62 00:06:14,680 --> 00:06:22,390 That is, since we have 506 observations inaudible, all the variables should have 506 values. 63 00:06:23,470 --> 00:06:27,030 If any variable is missing some values, it will have a lesser count. 64 00:06:28,900 --> 00:06:32,470 We can see that for the variable and hospita. 65 00:06:35,310 --> 00:06:39,750 We have got less than five or six, it is brain. 66 00:06:40,800 --> 00:06:50,370 This means that it values are missing for this particular body will we will see how to go and rectify 67 00:06:50,370 --> 00:06:51,450 this in our dataset. 68 00:06:54,460 --> 00:06:56,470 The second problem would be of outliers. 69 00:06:58,560 --> 00:07:06,210 To identify outliers from Italy, there are two possible ways, one is to look at the difference between 70 00:07:06,210 --> 00:07:07,060 mean and median. 71 00:07:08,910 --> 00:07:14,880 So if there is an outlier, suppose we have an exceptionally large value in some variable. 72 00:07:16,140 --> 00:07:22,680 So far, such a variable demesne will be shifted towards the outlier and median will remain almost at 73 00:07:22,680 --> 00:07:23,350 the same range. 74 00:07:23,790 --> 00:07:26,840 So there will be a large difference between mean and median values. 75 00:07:29,220 --> 00:07:35,340 The second method is to compare the distribution of the variable between different coordinates. 76 00:07:38,210 --> 00:07:44,450 To compare the distribution and content, we will see the difference between smallest and the first 77 00:07:44,450 --> 00:07:49,590 quartile value and the 70 percentile and largest value. 78 00:07:50,750 --> 00:07:52,730 So if I look at this price variable. 79 00:07:54,750 --> 00:07:58,620 The mean and median are almost in the same range. 80 00:07:58,680 --> 00:08:03,360 I mean, is twenty two point something and median is twenty one point something. 81 00:08:06,360 --> 00:08:15,030 If I look at the quarterly distribution, the first quarter is between minimum and the 25th percentile 82 00:08:15,030 --> 00:08:20,160 value, the minimum value is five and the 25th percentile value is ten point two. 83 00:08:22,920 --> 00:08:26,580 So in the first quarter, we have a range of five point. 84 00:08:30,350 --> 00:08:37,340 In the last quartile, the seventy fifth percentile values, forty three point eight, and the largest 85 00:08:37,340 --> 00:08:38,450 value is 50. 86 00:08:40,460 --> 00:08:44,390 So the last quarter has a range of six point to. 87 00:08:46,220 --> 00:08:50,240 So the difference between the Quartel distribution is not very large. 88 00:08:52,660 --> 00:08:57,550 This is suggesting that price, well, does not have any outliers. 89 00:08:59,290 --> 00:09:03,880 So I will highlight this very well, which actually has some outliers. 90 00:09:05,880 --> 00:09:08,550 If you look at the Anholt rooms very well. 91 00:09:14,120 --> 00:09:16,550 From the mean and median value. 92 00:09:19,590 --> 00:09:20,970 There is not much different. 93 00:09:24,860 --> 00:09:33,080 Which may tell us that maybe there is no players when we look at the Quartel distribution, the first 94 00:09:33,080 --> 00:09:42,620 quartile is ranging from ten point zero five to ten point one, which is a very small range of point 95 00:09:43,430 --> 00:09:43,820 one. 96 00:09:47,760 --> 00:09:56,220 And the last quarter is ranging from fifteen point one nine to one hundred one, which is a very huge 97 00:09:56,220 --> 00:10:05,580 range of eighty five unit, so definitely there are some outliers in this particular variable. 98 00:10:08,400 --> 00:10:11,640 How to handle all players will be seeing in a separate letter. 99 00:10:14,740 --> 00:10:22,780 So this is how by looking at the entity, we can identify problems in different variables, I suggest 100 00:10:22,780 --> 00:10:27,600 that you go through each and every individual variable to identify the issues with that variable. 101 00:10:29,230 --> 00:10:32,580 Later on, we will learn how to handle each of these issues.