1 00:00:02,290 --> 00:00:10,450 So whenever we have the data ready in tabular format, we should first look at the description 2 00:00:10,450 --> 00:00:11,860 of each variable. 3 00:00:12,950 --> 00:00:19,550 Descriptive statistics, as we discussed in another section, are used to describe the data and 4 00:00:19,550 --> 00:00:20,270 summarize it. 5 00:00:21,490 --> 00:00:27,490 And since we describe each variable on its own, and not relationships between two or more 6 00:00:27,490 --> 00:00:34,050 variables, it is called univariate analysis. "Uni" stands for one and "variate" stands for variable. 7 00:00:34,360 --> 00:00:36,820 So it is one-variable analysis. 8 00:00:38,770 --> 00:00:43,120 So we can look at the central tendencies as part of the univariate analysis. 9 00:00:44,380 --> 00:00:52,270 We can see the mean, median, and mode, and we can see measures of dispersion like range, quartiles, and standard deviation. 10 00:00:53,760 --> 00:00:57,300 And for categorical data, we can look at the count of each category. 11 00:00:58,850 --> 00:01:06,020 Most software packages for statistics have a very easy way to do univariate analysis for all the variables 12 00:01:06,020 --> 00:01:06,740 of the dataset. 13 00:01:07,990 --> 00:01:10,960 And when we run it, we see something like this. 14 00:01:15,370 --> 00:01:25,000 So for a variable like age, we can have info like the mean, median, minimum, maximum, and the 25th, 50th, 15 00:01:25,000 --> 00:01:27,070 and 75th percentile values. 16 00:01:29,400 --> 00:01:37,500 Imagine I arrange all the ages in ascending order. The first value will be the minimum value. 17 00:01:38,350 --> 00:01:44,250 The last value will be the maximum value. The 25th percentile value, 18 00:01:44,640 --> 00:01:49,520 this 25.75 value, will be at the one-fourth position. 19 00:01:50,040 --> 00:01:54,150 That is, 25 percent of the values will be lower than this value.
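As a rough illustration of the summary described so far, here is a minimal sketch using only Python's standard library. The `ages` list and the `genders` list are hypothetical values made up for this example (the ages were picked only so that the output matches the figures quoted in this lesson), and `summarize` is a hypothetical helper name. Different statistics packages interpolate percentiles in slightly different ways, so their numbers may not match exactly.

```python
import statistics
from collections import Counter

# Hypothetical ages, chosen only so the summary reproduces the figures
# quoted in the lesson; this is NOT the dataset shown in the video.
ages = [24, 25, 26, 28, 30, 35, 36, 51]

def summarize(values):
    """Return a small univariate summary for one numeric variable."""
    ordered = sorted(values)  # ascending order, as described in the lesson
    # method="inclusive" assumes the data contains the true minimum and
    # maximum and interpolates linearly between neighbouring values.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "min": ordered[0],    # first value after sorting
        "25%": q1,            # one-fourth position
        "50%": q2,            # middle position, same as the median
        "75%": q3,            # three-fourths position
        "max": ordered[-1],   # last value after sorting
    }

summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>5}: {value}")

# For a categorical variable, the univariate summary is a count per category.
# These category values are also hypothetical.
genders = ["F", "M", "F", "F", "M", "M", "F", "M"]
category_counts = Counter(genders)
print(category_counts)
```

The 50% entry and `statistics.median(ages)` agree, which is the point made about the 50th percentile being the median.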
20 00:01:55,930 --> 00:02:00,440 29, the 50th percentile value, will be at the middle. 21 00:02:00,700 --> 00:02:03,250 That is, 50 percent of the values will be lower than this. 22 00:02:04,000 --> 00:02:06,820 And this 50th percentile value is the same as the median, 23 00:02:06,820 --> 00:02:09,610 if you remember the definition of the median. 24 00:02:11,530 --> 00:02:20,980 The 75th percentile value, 35.25, is at the three-fourths position, so 75 percent of the values are below 25 00:02:20,980 --> 00:02:22,330 35.25. 26 00:02:23,560 --> 00:02:32,080 And from the 75th percentile to the maximum value is 35.25 to 51. This range is the 75th percentile 27 00:02:32,080 --> 00:02:32,980 to the maximum. 28 00:02:34,130 --> 00:02:36,620 So you can observe for this data that 29 00:02:37,780 --> 00:02:44,990 the first 25 percent of values are in a very small range of 24 to 25.75. 30 00:02:46,450 --> 00:02:55,270 But if you see the last quartile, the last 25 percent of values are spread over a huge range of 31 00:02:55,300 --> 00:02:57,370 35.25 to 51. 32 00:02:58,330 --> 00:03:01,390 So this data is not evenly distributed. 33 00:03:03,340 --> 00:03:06,700 Such observations help us identify issues in data. 34 00:03:08,920 --> 00:03:16,210 All this information, for all the variables of the dataset, is called the Extended Data Dictionary (EDD). 35 00:03:17,710 --> 00:03:25,630 Using the EDD, we can identify a lot of things like patterns of outliers, presence of missing values, and so 36 00:03:25,630 --> 00:03:25,900 on. 37 00:03:27,420 --> 00:03:32,970 What these issues are and how we handle them will be covered in the coming videos.
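To make the "not evenly distributed" observation concrete, here is a small sketch that takes the five summary values quoted for age and measures how wide each quarter of the data is. The helper name `quarter_widths` is hypothetical. The last step uses the 1.5 × IQR rule of thumb, which is one common convention for flagging possible outliers and is an assumption here, not necessarily the method used later in this course.

```python
# Summary values for the age variable as quoted in the lesson.
five_number = {"min": 24, "25%": 25.75, "50%": 29, "75%": 35.25, "max": 51}

def quarter_widths(s):
    """Width of each quarter: min-25%, 25%-50%, 50%-75%, 75%-max."""
    points = [s["min"], s["25%"], s["50%"], s["75%"], s["max"]]
    return [high - low for low, high in zip(points, points[1:])]

widths = quarter_widths(five_number)
labels = ["min to 25%", "25% to 50%", "50% to 75%", "75% to max"]
for label, width in zip(labels, widths):
    print(f"{label:>11}: {width}")

# Assumption: the 1.5 * IQR rule of thumb. A maximum above the upper
# fence suggests the top of the data deserves a closer look.
iqr = five_number["75%"] - five_number["25%"]
upper_fence = five_number["75%"] + 1.5 * iqr
print("upper fence:", upper_fence)
print("max above fence:", five_number["max"] > upper_fence)
```

Each quarter holds the same share of the values, so a much wider last quarter means those values are far more spread out than the rest.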