1 00:00:02,320 --> 00:00:10,450 So whenever we have the data ready in tabular format, we should first look at the descriptive statistics 2 00:00:10,450 --> 00:00:11,890 of each variable. 3 00:00:13,000 --> 00:00:19,550 The descriptive statistics, as we discussed in another section, are used to describe the data and 4 00:00:19,550 --> 00:00:20,270 summarize it. 5 00:00:21,610 --> 00:00:27,460 And since we describe each variable on its own, and not relationships between two or more 6 00:00:27,460 --> 00:00:31,900 variables, it is called univariate analysis. "Uni" stands for 7 00:00:31,910 --> 00:00:34,120 one, and "variate" stands for variable. 8 00:00:34,390 --> 00:00:36,820 So it is one-variable analysis. 9 00:00:38,830 --> 00:00:43,120 So we can look at the following as part of the univariate analysis. 10 00:00:44,470 --> 00:00:46,060 We can see the mean, median, and mode. 11 00:00:47,050 --> 00:00:52,300 We can see measures of dispersion like range, quartiles, and standard deviation. 12 00:00:53,850 --> 00:00:57,360 And for categorical data, we can look at the count of each category. 13 00:00:58,940 --> 00:01:06,020 Most software packages for statistics have a very easy way to do univariate analysis for all the variables 14 00:01:06,020 --> 00:01:06,740 of the dataset. 15 00:01:08,110 --> 00:01:10,990 And when we run it, we see something like this. 16 00:01:15,400 --> 00:01:24,100 So for a variable like age, we can have info like the mean, median, minimum, maximum, and the 25th, 17 00:01:24,310 --> 00:01:27,070 50th, and 75th percentile values. 18 00:01:29,520 --> 00:01:37,530 Imagine I arrange all the ages in ascending order. The first value will be the minimum value. 19 00:01:38,400 --> 00:01:40,650 The last value will be the maximum value. 20 00:01:42,030 --> 00:01:46,530 The 25th percentile value, this 25.75 21 00:01:46,550 --> 00:01:49,560 value, will be at the one-fourth position.
22 00:01:50,130 --> 00:01:54,200 That is, 25 percent of the values will be lower than this value. 23 00:01:55,980 --> 00:01:56,830 Twenty-nine, 24 00:01:57,250 --> 00:02:00,490 this 50th percentile value, will be at the middle. 25 00:02:00,790 --> 00:02:03,280 That is, 50 percent of the values will be lower than this. 26 00:02:04,120 --> 00:02:07,270 And this 50th percentile value is the same as the median, 27 00:02:07,750 --> 00:02:09,610 if you remember the definition of the median. 28 00:02:11,570 --> 00:02:17,440 The 75th percentile value, 35.25, is at the three-fourths position. 29 00:02:18,190 --> 00:02:22,380 So 75 percent of the values are below 35.25. 30 00:02:23,680 --> 00:02:29,530 And from the 75th percentile to the maximum value is 35.25 to 51. 31 00:02:29,650 --> 00:02:32,980 This range is the 75th percentile to the maximum. 32 00:02:34,190 --> 00:02:36,590 So here is what you can observe from this data. 33 00:02:37,810 --> 00:02:44,680 The first 25 percent of the values are in a very small range, from 24 to 34 00:02:44,680 --> 00:02:44,970 25.75. 35 00:02:46,510 --> 00:02:54,130 But if you look at only the last quartile, the last 25 percent of the values lie in the huge 36 00:02:54,130 --> 00:02:57,430 range of 35.25 to 51. 37 00:02:58,390 --> 00:03:01,420 So this data is not evenly distributed. 38 00:03:03,400 --> 00:03:06,680 Such observations help us identify issues in our data. 39 00:03:08,980 --> 00:03:16,270 This whole information for all the variables of the dataset is called the Extended Data Dictionary. 40 00:03:17,770 --> 00:03:25,450 Using the EDD, we can deduce a lot of things, like the presence of outliers, the presence of missing values, and 41 00:03:25,450 --> 00:03:25,930 so on. 42 00:03:27,510 --> 00:03:32,980 What these issues are and how we handle them will be covered in the coming videos.
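
The univariate summary described in this video can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the tool shown on screen: the `ages` and `cities` lists are made-up samples (the ages were picked so that the quartiles come out to the 25.75, 29 and 35.25 figures quoted above), and the `describe` helper is a hypothetical name.

```python
import statistics
from collections import Counter

# Hypothetical sample, NOT the course dataset. The values were chosen so the
# quartiles match the numbers quoted in the video.
ages = [24, 25, 26, 28, 30, 35, 36, 51]


def describe(values):
    """Return a univariate summary for one numeric variable."""
    # method="inclusive" treats min and max as the 0th and 100th percentiles
    # and interpolates linearly between neighbouring sorted values.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,  # same as the median
        "75%": q3,
        "max": max(values),
    }


summary = describe(ages)
for name, value in summary.items():
    print(f"{name:>5}: {value}")

# Compare how wide the first and last quarters of the data are.
first_quarter_width = summary["25%"] - summary["min"]
last_quarter_width = summary["max"] - summary["75%"]
print("first 25% spans", first_quarter_width)
print("last 25% spans", last_quarter_width)

# For a categorical variable, the summary is the count of each category.
cities = ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"]  # hypothetical
category_counts = Counter(cities)
print(category_counts.most_common())
```

With this sample, the lowest quarter of the ages spans far fewer years than the highest quarter, which is the kind of uneven spread the video points out. Other quantile conventions (for example `method="exclusive"`) would give slightly different percentile values on the same data.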