1
00:00:02,320 --> 00:00:10,450
So whenever we have the data in tabular format ready with us, we should first look at the descriptives

2
00:00:10,450 --> 00:00:11,890
of each variable.

3
00:00:13,000 --> 00:00:19,550
The descriptive statistics, as we discussed in another section, are used to describe the data and

4
00:00:19,550 --> 00:00:20,270
summarize it.

5
00:00:21,610 --> 00:00:27,460
And since we will describe each and every single variable and not relationships between two or more

6
00:00:27,460 --> 00:00:31,900
variables, it is called univariate analysis. "Uni" stands for

7
00:00:31,910 --> 00:00:34,120
one, and "variate" stands for variable.

8
00:00:34,390 --> 00:00:36,820
So it is one variable analysis.

9
00:00:38,830 --> 00:00:43,120
So we can look at the following as part of the univariate analysis.

10
00:00:44,470 --> 00:00:46,060
We can see the mean, median, and mode.

11
00:00:47,050 --> 00:00:52,300
We can see measures of dispersion like range, quartiles, and standard deviation.

12
00:00:53,850 --> 00:00:57,360
And for categorical data, we can look at the count of each category.
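As a minimal sketch of the measures just listed, the snippet below uses only Python's standard library. The `ages` and `cities` lists are made-up illustration data, not the dataset shown in the video.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented for illustration only.
ages = [24, 25, 26, 28, 28, 35, 36, 51]
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"]

# Measures of central tendency for a numeric variable.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Measures of dispersion.
age_range = max(ages) - min(ages)
std_age = statistics.stdev(ages)  # sample standard deviation

# For a categorical variable, count how often each category occurs.
city_counts = Counter(cities)

print(mean_age, median_age, mode_age, age_range, round(std_age, 2))
print(city_counts.most_common())
```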

13
00:00:58,940 --> 00:01:06,020
Most software packages for statistics have a very easy way to do univariate analysis for all the variables

14
00:01:06,020 --> 00:01:06,740
of the dataset.

15
00:01:08,110 --> 00:01:10,990
And when we run it, we see something like this.

16
00:01:15,400 --> 00:01:24,100
So for a variable like age, we can have info like the mean, median, minimum, maximum, and the 25th,

17
00:01:24,310 --> 00:01:27,070
50th, and 75th percentile values.

18
00:01:29,520 --> 00:01:37,530
Imagine if I arrange all the ages in ascending order, the first value will be the minimum value.

19
00:01:38,400 --> 00:01:40,650
The last value will be the maximum value.

20
00:01:42,030 --> 00:01:46,530
The 25th percentile value, this 25.75

21
00:01:46,550 --> 00:01:49,560
value, will be at the one-fourth position.

22
00:01:50,130 --> 00:01:54,200
That is, 25 percent of the values will be lower than this value.

23
00:01:55,980 --> 00:01:56,830
Twenty-nine,

24
00:01:57,250 --> 00:02:00,490
this 50th percentile value, will be at the middle.

25
00:02:00,790 --> 00:02:03,280
That is, 50 percent of values will be lower than this.

26
00:02:04,120 --> 00:02:07,270
And this 50th percentile value is also the same as the median,

27
00:02:07,750 --> 00:02:09,610
If you remember the definition of median.
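The percentile positions described here can be checked with a small sketch. The `ages` list is hypothetical, chosen only so that its 25th and 50th percentiles come out at the 25.75 and 29 read out above. The `method="inclusive"` setting is one of several common interpolation conventions, so other tools may report slightly different values.

```python
import statistics

# Hypothetical ages, already in ascending order (not the video's dataset).
ages = [24, 25, 26, 28, 30, 35, 36, 51]

# With sorted data, the first and last values are the minimum and maximum.
minimum, maximum = ages[0], ages[-1]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th, and 75th percentiles.
p25, p50, p75 = statistics.quantiles(ages, n=4, method="inclusive")

# Share of values strictly below the 25th percentile.
share_below_p25 = sum(a < p25 for a in ages) / len(ages)

print(minimum, p25, p50, p75, maximum)
print(share_below_p25)
print(p50 == statistics.median(ages))
```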

28
00:02:11,570 --> 00:02:17,440
The 75th percentile value, 35.25, is at the three-fourths position.

29
00:02:18,190 --> 00:02:22,380
So 75 percent of the values are below 35.25.

30
00:02:23,680 --> 00:02:29,530
And the range from the 75th percentile to the maximum value is 35.25 to 51.

31
00:02:29,650 --> 00:02:32,980
This range is from the 75th percentile to the maximum.

32
00:02:34,190 --> 00:02:36,590
So here is what you can observe from this data.

33
00:02:37,810 --> 00:02:44,680
The first 25 percent of the values are in a very small range, from 24 to

34
00:02:44,680 --> 00:02:44,970
25.75.

35
00:02:46,510 --> 00:02:54,130
But if you look only at the last quartile, the last 25 percent of the values lie in the huge

36
00:02:54,130 --> 00:02:57,430
range of 35.25 to 51.

37
00:02:58,390 --> 00:03:01,420
So this data is not evenly distributed.
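The comparison of quartile ranges can be made explicit with a short sketch. The five summary values are the ones read out in the video; the dictionary keys and the `spread_ratio` check are illustrative assumptions, not a standard test.

```python
# Five-number summary for age, as read out in the video.
summary = {"min": 24, "p25": 25.75, "p50": 29, "p75": 35.25, "max": 51}

labels = list(summary)
values = list(summary.values())

# Width of the range covered by each successive quarter of the data.
widths = {
    f"{lo}-{hi}": values[i + 1] - values[i]
    for i, (lo, hi) in enumerate(zip(labels, labels[1:]))
}

for quarter, width in widths.items():
    print(quarter, width)

# Illustrative check: how many times wider is the widest quarter
# than the narrowest one? A large ratio hints at an uneven spread.
spread_ratio = max(widths.values()) / min(widths.values())
print(spread_ratio)
```

Each quarter holds the same share of the observations, so a much wider last quarter means those values are spread far more thinly than the first 25 percent.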

38
00:03:03,400 --> 00:03:06,680
Such observations help us identify issues in our data.

39
00:03:08,980 --> 00:03:16,270
This whole set of information for all the variables of the dataset is called the extended data dictionary (EDD).

40
00:03:17,770 --> 00:03:25,450
Using the EDD, we can deduce a lot of things like patterns of outliers, presence of missing values and

41
00:03:25,450 --> 00:03:25,930
so on.
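A rough sketch of how such a per-variable summary could be assembled is shown below, using only the standard library. The `table` contents, the `describe_column` helper, and the output keys are all hypothetical; statistics packages produce their own layout.

```python
import statistics
from collections import Counter

# Hypothetical table: each column is a list, and None marks a missing value.
table = {
    "age": [24, 25, None, 28, 30, 35, 36, 51],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai", "Pune", "Delhi", "Pune"],
}

def describe_column(values):
    """Summarize one column; the output keys are illustrative, not a standard."""
    present = [v for v in values if v is not None]
    info = {"count": len(present), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric variable: location and percentile summary.
        p25, p50, p75 = statistics.quantiles(present, n=4, method="inclusive")
        info.update(mean=statistics.mean(present), min=min(present),
                    p25=p25, p50=p50, p75=p75, max=max(present))
    else:
        # Categorical variable: count of each category.
        info["category_counts"] = dict(Counter(present))
    return info

# One summary entry per variable in the dataset.
edd = {name: describe_column(col) for name, col in table.items()}
for name, info in edd.items():
    print(name, info)
```

A nonzero `missing` entry flags a variable with missing values, and a maximum far from the 75th percentile is one hint that outliers may be present.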

42
00:03:27,510 --> 00:03:32,980
What these issues are and how we handle them will be covered in the coming videos.