1 00:00:01,900 --> 00:00:07,480 In this video, we will learn how to summarize data by creating frequency distribution tables for 2 00:00:07,480 --> 00:00:09,210 qualitative and quantitative data. 3 00:00:10,370 --> 00:00:15,380 Then we will see how to convert these tables into bar charts and how to construct histograms. 4 00:00:17,680 --> 00:00:22,390 This is basically descriptive statistics, where we describe the distribution of our data. 5 00:00:23,200 --> 00:00:27,160 So let's start. What is a frequency distribution? 6 00:00:28,210 --> 00:00:33,910 A frequency distribution is a listing of the categories from our data, and against each category 7 00:00:34,300 --> 00:00:38,440 we write the number of elements or data points that belong to that category. 8 00:00:41,930 --> 00:00:48,740 So suppose 420 students get admission into a college and you have two columns: column one contains student 9 00:00:48,740 --> 00:00:52,980 names and column two contains the branch of specialization of each student. 10 00:00:54,410 --> 00:01:00,980 Such data is raw data, but when you take out these branches as categories and find out the number of 11 00:01:00,980 --> 00:01:04,970 students that belong to each branch, you get a table like this. 12 00:01:06,570 --> 00:01:08,550 This is called a frequency distribution. 13 00:01:09,710 --> 00:01:15,930 Now, these numbers on the right are the frequencies of occurrence of each branch in our raw data. 14 00:01:17,000 --> 00:01:23,510 There is a term called the relative frequency of a category, which represents the contribution of that category 15 00:01:23,540 --> 00:01:24,290 to the total. 16 00:01:25,070 --> 00:01:31,790 Mathematically, it is the frequency of that category divided by the sum of all frequencies. 17 00:01:32,670 --> 00:01:38,970 So the relative frequency for biotechnology will be sixty divided by 420, 18 00:01:40,810 --> 00:01:44,560 which comes out to be 0.14, which is nearly fourteen point three percent.
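The frequency and relative-frequency calculation described above can be sketched in Python (one of the tools the course itself uses). The branch list below is hypothetical, except that the biotechnology count of 60 out of 420 students is taken from the example:

```python
from collections import Counter

# Hypothetical raw data: one branch name per admitted student.
branches = (["Biotechnology"] * 60 + ["Electrical"] * 150
            + ["Mechanical"] * 120 + ["Mathematics"] * 30
            + ["Civil"] * 60)

freq = Counter(branches)      # frequency of each category
total = sum(freq.values())    # 420 students in all

for branch, count in freq.items():
    rel = count / total       # relative frequency of this category
    print(f"{branch:15s} {count:4d} {rel:.3f}")

# Biotechnology: 60 / 420 = 0.143, i.e. about 14.3 percent.
```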
19 00:01:48,000 --> 00:01:52,230 The same information can be depicted in the form of a graph called a bar chart. 20 00:01:54,770 --> 00:02:01,700 In a vertical bar chart, the categories go on the horizontal axis and the frequency value is denoted 21 00:02:01,700 --> 00:02:03,650 by the height of the vertical bars. 22 00:02:04,850 --> 00:02:08,150 This is a fairly common chart and it's very easy to draw. 23 00:02:09,530 --> 00:02:15,010 Judging which is the most popular branch may have been difficult if we were looking at raw data, 24 00:02:15,440 --> 00:02:21,680 but here we can easily see that the most students go for electrical engineering and the fewest go for mathematics. 25 00:02:23,860 --> 00:02:29,890 Earlier, we saw the frequency distribution for qualitative data. To create a frequency distribution 26 00:02:29,890 --> 00:02:35,020 for quantitative data, we have to create the buckets first. 27 00:02:35,710 --> 00:02:41,440 And when we create these buckets and assign the frequency of occurrence of values that belong to each 28 00:02:41,440 --> 00:02:44,170 bucket, we get grouped data. 29 00:02:45,650 --> 00:02:52,280 So suppose we have a list of students and their science marks. This is ungrouped data, but when I 30 00:02:52,280 --> 00:02:59,690 create categories of marks such as zero to 35, 36 to 55 and so on, and find out the number of students 31 00:02:59,990 --> 00:03:06,170 who scored marks in each category, I get this table, and this table is grouped data. 32 00:03:08,620 --> 00:03:14,750 We will learn to do this using software like R and Python, but for a small number of observations 33 00:03:14,750 --> 00:03:16,220 we can do it manually also. 34 00:03:17,220 --> 00:03:18,870 So let's learn how to do it manually. 35 00:03:20,950 --> 00:03:27,580 First, we have to decide the number of buckets; usually we keep it between five and 20.
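The bar chart idea can be sketched even without a plotting library by printing one bar per category, with length proportional to the frequency (in practice you would use matplotlib's `bar()` in Python). The counts here are illustrative:

```python
# Illustrative frequency table (branch -> number of students).
freq = {"Electrical": 150, "Mechanical": 120, "Biotechnology": 60,
        "Civil": 60, "Mathematics": 30}

# One '#' per 10 students; the longest bar marks the most popular branch.
for branch, count in freq.items():
    print(f"{branch:15s} {'#' * (count // 10):15s} {count}")
```

Just as with the drawn chart, the most and least popular categories can be read off at a glance from the bar lengths.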
36 00:03:29,630 --> 00:03:37,580 Next, we decide the class width. The class width is decided using this formula, which is the maximum value minus the 37 00:03:37,580 --> 00:03:42,290 minimum value divided by the number of classes that you decided in the first step. 38 00:03:44,910 --> 00:03:51,210 Then, starting with the minimum value, we keep on adding the class width to get the buckets, and once 39 00:03:51,210 --> 00:03:55,910 we have the buckets, we just need to assign the number of observations belonging to each bucket. 40 00:03:57,300 --> 00:04:00,000 So let's do it on an example for better clarity. 41 00:04:04,110 --> 00:04:06,990 Suppose we have the following numbers as customer ages. 42 00:04:07,960 --> 00:04:12,790 We want to group this data. First, we will select the number of groups we want to create. 43 00:04:13,420 --> 00:04:15,650 So let's say we want to create five buckets. 44 00:04:17,530 --> 00:04:20,470 Next, we find out the class width using the formula. 45 00:04:22,270 --> 00:04:30,730 So 34 minus seven, the highest value minus the lowest value, divided by the number of classes that we 46 00:04:30,730 --> 00:04:35,740 decided, which is five, comes out to five point four, which we round down to five. 47 00:04:37,820 --> 00:04:44,210 Now, starting with the lowest value, we add the class width to get 12, so the first class is seven to 48 00:04:44,210 --> 00:04:48,150 12, and the next class starts at 13. 49 00:04:48,350 --> 00:04:51,600 We again add the class width and we get 18. 50 00:04:51,620 --> 00:04:54,020 So 13 to 18 is the next class. 51 00:04:54,650 --> 00:04:55,950 And we continue to do so. 52 00:04:56,110 --> 00:04:59,390 We get the last class as 31 to 36. 53 00:05:01,970 --> 00:05:04,130 The next column is called the tally. 54 00:05:05,130 --> 00:05:13,650 We put a mark for each value that belongs to that category. So the first value belongs to the seven to 55 00:05:13,650 --> 00:05:14,380 12 category.
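The class-width and bucket-construction steps can be sketched in code. The age list below is hypothetical, chosen only so that the minimum is 7 and the maximum is 34 as in the example:

```python
# Hypothetical customer ages, with minimum 7 and maximum 34 as in the example.
ages = [9, 14, 7, 21, 33, 28, 17, 25, 30, 12, 34, 29, 19, 26, 31]

n_classes = 5
raw_width = (max(ages) - min(ages)) / n_classes  # (34 - 7) / 5 = 5.4
width = int(raw_width)                           # rounded down to 5 for convenience

# Starting at the minimum, keep adding the class width to get the buckets:
# 7-12, 13-18, 19-24, 25-30, 31-36.
classes = []
lower = min(ages)
for _ in range(n_classes):
    classes.append((lower, lower + width))
    lower = lower + width + 1

print(classes)  # [(7, 12), (13, 18), (19, 24), (25, 30), (31, 36)]
```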
56 00:05:14,430 --> 00:05:21,110 So we put a mark. The next value, 14, belongs to the 13 to 18 category. 57 00:05:21,810 --> 00:05:28,950 So we put a line here, and we continue to do so till we get to the last value, which is 33, which belongs 58 00:05:28,950 --> 00:05:32,100 to the 31 to 36 category. 59 00:05:33,960 --> 00:05:40,920 You can also notice that every fifth mark is a slanted line crossing the previous four; having this slanted line helps in counting. 60 00:05:42,030 --> 00:05:44,550 The last column is simply the count of these lines. 61 00:05:44,730 --> 00:05:51,780 So the first category has two lines, the second has four, and the last one, with this crossed group, which means 62 00:05:51,780 --> 00:05:55,290 five, and three more, has eight as its frequency. 63 00:05:57,040 --> 00:06:00,220 We have this table, which has the frequency distribution. 64 00:06:03,140 --> 00:06:07,400 When we plot the frequency distribution of quantitative data, it is called a histogram. 65 00:06:10,250 --> 00:06:16,430 I have created this one in PowerPoint only. It is a little bit different from the table we created, because 66 00:06:16,430 --> 00:06:21,620 the limits are in decimal points, so it starts from seven to twelve point four, 67 00:06:23,530 --> 00:06:29,410 and from twelve point four to seventeen point eight, because the actual value of the class width was 68 00:06:29,530 --> 00:06:30,410 five point four. 69 00:06:30,820 --> 00:06:33,450 We rounded it down to five for our convenience. 70 00:06:36,810 --> 00:06:42,830 Like a bar chart, a histogram also represents the frequency values of each class. 71 00:06:43,630 --> 00:06:49,210 So whenever you get data, it is good to plot a histogram so that you see the distribution of your 72 00:06:49,210 --> 00:06:49,630 data. 73 00:06:52,870 --> 00:06:54,550 So as you can see from this graph, 74 00:06:55,600 --> 00:07:00,130 most of the customers belong to the category of 28 to 34, 75 00:07:01,550 --> 00:07:06,530 and very few customers belong to ages less than 23.
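The tally-and-count step can likewise be done in code. This sketch uses a hypothetical age list (minimum 7, maximum 34, as in the example) and counts how many values fall into each inclusive class, printing one tally mark per value:

```python
# Hypothetical ages and the classes built from them (width 5, inclusive limits).
ages = [9, 14, 7, 21, 33, 28, 17, 25, 30, 12, 34, 29, 19, 26, 31]
classes = [(7, 12), (13, 18), (19, 24), (25, 30), (31, 36)]

# Tally: for each value, find the class it falls into and count it there.
freq = {c: 0 for c in classes}
for age in ages:
    for lo, hi in classes:
        if lo <= age <= hi:
            freq[(lo, hi)] += 1
            break

for (lo, hi), count in freq.items():
    print(f"{lo:2d}-{hi:2d}: {'|' * count}  ({count})")
```

The resulting counts are the frequency column of the grouped-data table; every observation lands in exactly one class, so the counts sum to the number of observations.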
76 00:07:08,300 --> 00:07:12,550 So we have very few teenagers, but we have a lot of young customers. 77 00:07:15,780 --> 00:07:18,240 Next, we look at some common shapes of histograms. 78 00:07:19,340 --> 00:07:20,790 There are three properties here. 79 00:07:21,080 --> 00:07:26,330 One is symmetry, the second is skewness, and the third is uniformity. 80 00:07:27,620 --> 00:07:32,090 A symmetric variable distribution is symmetric about the center. 81 00:07:33,060 --> 00:07:36,420 You can see that the left is the mirror image of the right. 82 00:07:39,180 --> 00:07:42,030 So these two types are symmetric. 83 00:07:43,870 --> 00:07:51,460 If higher frequencies are shifted more towards one side and the other side has lower frequencies, then 84 00:07:51,460 --> 00:07:55,210 the graph is skewed, as represented in these two graphs. 85 00:07:57,470 --> 00:08:02,810 If the frequencies are uniformly distributed across all the classes, then the graph is uniform. 86 00:08:03,830 --> 00:08:09,520 If you go back to the histogram we drew earlier, you can see that it is skewed data. 87 00:08:10,280 --> 00:08:17,720 The frequency of the last class is higher and the frequency of the previous classes is much lower. 88 00:08:19,610 --> 00:08:25,490 It does not have symmetry, since if you look at the left part of this graph, it does not mirror the right 89 00:08:25,820 --> 00:08:34,540 part of this graph, and it is not uniform, since all the different classes have very different values. 90 00:08:36,570 --> 00:08:42,030 In the end, we will look at an important probability distribution called the normal distribution. 91 00:08:43,420 --> 00:08:49,690 It occurs very often in real recorded data, and very often we make the assumption that the data is normally 92 00:08:49,690 --> 00:08:50,350 distributed.
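The skewness discussed above can also be quantified: the skewness coefficient is zero for a perfectly symmetric data set, positive when the tail stretches to the right, and negative when it stretches to the left. A minimal sketch using only the Python standard library, on two made-up data sets:

```python
from statistics import mean, pstdev

def skewness(data):
    """Population skewness: mean cubed deviation divided by stdev cubed."""
    m, s = mean(data), pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * s ** 3)

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]    # left is the mirror image of right
left_skewed = [1, 4, 5, 5, 6, 6, 6, 6, 6]  # long tail stretching to the left

print(round(skewness(symmetric), 3))    # 0.0
print(round(skewness(left_skewed), 3))  # negative
```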
93 00:08:51,300 --> 00:08:58,110 What this means is that our data is continuous numerical data, and the probability density at a 94 00:08:58,110 --> 00:09:03,750 point is actually a function of the distance of that point from the mean of the data. 95 00:09:05,240 --> 00:09:07,490 It is represented by this formula. 96 00:09:10,440 --> 00:09:20,190 We don't need to remember this formula; just notice that the height of the curve at 97 00:09:20,190 --> 00:09:26,130 any point is a function of the distance of that particular value from the mean. 98 00:09:26,790 --> 00:09:33,420 So x minus mu, the distance from the mean, determines the probability density at any particular point. 99 00:09:35,860 --> 00:09:39,820 So when we have such data and we draw a histogram of such data, 100 00:09:41,680 --> 00:09:49,120 that is, if we create a number of classes and assign the frequency distribution of values 101 00:09:49,120 --> 00:09:55,270 to those classes, it will be drawn something like this; the shape will resemble this. 102 00:09:56,350 --> 00:10:05,500 So in simple terms, what this graph is representing is that the value at the mean has the maximum probability of 103 00:10:05,500 --> 00:10:05,960 occurrence. 104 00:10:07,570 --> 00:10:13,300 And as you go farther from the mean on either side, the probability of occurrence decreases. 105 00:10:15,400 --> 00:10:23,140 Also note that since y denotes probability density and not probability, therefore, to get the probability 106 00:10:23,140 --> 00:10:27,530 of occurrence within a range, you will have to calculate the area under the graph. 107 00:10:28,300 --> 00:10:34,240 So, for example, if you want to find out the probability of occurrence of a 108 00:10:36,080 --> 00:10:43,940 value which is less than the mean, it will be the area on the left side of the graph, and this will 109 00:10:43,940 --> 00:10:45,590 be half of the total area.
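The formula referred to above is the normal probability density, f(x) = 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²)). A small sketch that evaluates it and checks numerically that the density peaks at the mean and that the area to the left of the mean is half the total:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The density is maximal at the mean and symmetric on either side of it.
print(normal_pdf(0.0))   # ~0.3989, the peak
print(normal_pdf(1.0))   # smaller
print(normal_pdf(-1.0))  # equal to the value at +1.0, by symmetry

# Area to the left of the mean, by a crude Riemann sum: approximately 0.5.
area = sum(normal_pdf(-6 + i * 0.001) * 0.001 for i in range(6000))
print(round(area, 3))    # 0.5
```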
110 00:10:47,590 --> 00:10:53,250 We won't discuss how to calculate this area and this probability by hand; the software packages 111 00:10:53,260 --> 00:10:58,960 which we are using in this course will automatically give us the probability for any range that we 112 00:10:58,960 --> 00:10:59,740 want to calculate. 113 00:11:01,280 --> 00:11:04,040 Listed here are three properties of the normal distribution. 114 00:11:04,750 --> 00:11:10,730 The total area under the curve is one, because this graph is showing the probability density. 115 00:11:11,180 --> 00:11:15,920 So if you want to find the probability of all values happening, it is one. 116 00:11:16,830 --> 00:11:17,930 The curve is symmetric. 117 00:11:18,650 --> 00:11:24,200 And the two tails at the ends extend indefinitely. 118 00:11:25,790 --> 00:11:28,370 We won't go deeper than this into the normal distribution. 119 00:11:28,380 --> 00:11:32,500 That is all we need for what we are going to learn later in the course. 120 00:11:33,470 --> 00:11:34,270 That's all for this 121 00:11:34,280 --> 00:11:36,170 video. I'll see you in the next video.
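The probability for any range really can be read off directly in software: Python's standard library exposes the error function, from which the normal CDF (the area to the left of a point) follows. A sketch verifying the total-area and symmetry properties mentioned above:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Total probability is 1: the area between the far-left and far-right tails.
print(normal_cdf(10) - normal_cdf(-10))          # ~1.0

# Symmetry: exactly half the area lies to the left of the mean.
print(normal_cdf(0.0))                           # 0.5

# Probability of a value within one standard deviation of the mean.
print(round(normal_cdf(1) - normal_cdf(-1), 3))  # ~0.683
```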