In this session, we're going to see univariate analysis. The reason for the name is that I'm considering each of the variables one by one; I'm analyzing each variable in isolation. Like, if gender is a variable, I'm analyzing gender. If income is a variable, I'm analyzing income. That is how I analyze them: in isolation.

What kind of analysis I do depends on whether the data is numerical, that is a continuous variable, or categorical. If, let's say, income is a continuous variable, I'll be using measures of central tendency and dispersion. What are measures of central tendency and dispersion? You will see that very shortly.

But why are we doing this? We are doing this for two reasons. One, I want to ensure I get a better idea about my data; that's very obvious. The second reason is because of the concept of population versus sample.

What is this population? Many times in real-life situations, we won't get to work on the complete historical data; we will only be provided with a sample of the historical data. It is up to us to ensure that the sample data we have taken up for developing the model is representative of the entire historical dataset. For example, say my entire historical dataset consisted of data from all the states in India or the US. If I take sample data for the purpose of creating my machine learning model, I must ensure that representativeness is adhered to. What I mean by that is, if the Midwest, in the US, contributed 30 percent of all the insurance cases in the historical data, then in the sample also I must ensure that the Midwest contributes 30 percent.

This is something that happens in opinion polls as well. When they do a survey as to which party will win, which candidate will win, the person doing the survey needs to ensure that representativeness is taken care of. Like, if I'm doing an all-India survey as to which party will come to power, I need to ensure this representativeness of the electorate population is maintained.

So we apply the same concept here also, because it's a very basic concept: analyzing the variables in isolation helps to validate this representativeness aspect, as the small sketch below shows.
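To make that check concrete, here is a minimal Python sketch. It assumes a pandas DataFrame with a hypothetical "region" column; the data, the sample size, and the category labels are all invented purely for illustration, not taken from the course material.

```python
# Hypothetical illustration of the representativeness check described above.
# The column name "region" and the category labels are invented for this sketch.
import pandas as pd

population = pd.DataFrame({
    "region": ["Midwest"] * 30 + ["Northeast"] * 25 + ["South"] * 25 + ["West"] * 20
})
sample = population.sample(n=40, random_state=42)   # stand-in for the sample we are handed

pop_share = population["region"].value_counts(normalize=True)   # proportions in the population
sample_share = sample["region"].value_counts(normalize=True)    # proportions in the sample

comparison = pd.concat([pop_share, sample_share], axis=1, keys=["population", "sample"])
print(comparison.round(2))   # Midwest should stay near 0.30 in both columns
```

If the two columns diverge badly, say the Midwest drops from roughly 30 percent in the population to 10 percent in the sample, the sampling should be revisited before any modelling is done.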
That is, whatever proportion I have in the complete dataset, which is called the population, I must ensure the same representation level is maintained in my sample data also. Please note that you often don't get to work on the complete dataset, what is called the population. There can be various reasons for that: not all the data can be collected, only the recent data may be available, or your database may have some limitations, so you may be compromising. There can be various reasons, but that's a fact of life. If you get to work on the complete dataset, well and good, but what if you don't? You must still continue, right? We have a wonderful principle of statistics, which is this representativeness aspect of population versus sample. Please adhere to it, to ensure that the sample you are taking for developing the machine learning model is representative of the population, the complete dataset. I believe you are clear on that.

So let's get back to univariate analysis. In univariate analysis, if it is continuous data, a number, what I do is look at the measures of central tendency and the dispersion. Central tendency tells me around what value my data is centred. What does dispersion tell? It tells me the level of variation that I have in my data. If, let's say, I take the example of marks obtained in a class, I might say that on average the marks are, say, 80 percent, but a few students have scored even 95 or 99 percent, and a few students have scored just 10 or 8 percent. Those extremes, 8 percent and 99 percent, tell me the variation that is there. I need to understand this also, right? So that's what I'll do for my continuous variables.

So if, let's say, these are the numbers: the mean is nothing but the arithmetic average. I add all the numbers and divide by the number of entries, which here means dividing by five. The median is the midpoint after arranging the data in ascending order. Please note, I need to order the data in ascending order first, and the midpoint is the value that I take.
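As a quick sketch of that arithmetic, here is how the mean and median of five made-up numbers could be computed in Python; the values themselves are assumptions, not the course data.

```python
# A quick sketch of the mean and the median, using five made-up numbers.
import statistics

values = [12, 7, 9, 15, 7]

mean = sum(values) / len(values)      # add all the entries and divide by the count (five here)
median = statistics.median(values)    # sorts ascending internally and takes the midpoint

print(mean, median)                   # 10.0 and 9
```

The statistics module also offers mode() and stdev() for the mode and standard deviation discussed next, in the same spirit.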
That's my median. Then the mode: this is the most frequently occurring value. Standard deviation is the extent of deviation from the average, how much am I deviating from the average? That's what I look at under measures of central tendency and dispersion. Other concepts like percentiles are also used, but I'm not covering them here; if you are interested, you can go ahead and understand them. This is sufficient for most scenarios. You will be using the mean, the median and the other measures of central tendency and dispersion subsequently in the other analyses also.

So now let's apply central tendency and dispersion to our data. I start with income. I look at what the mean is. One way to understand the variation, yes, of course, is calculating the standard deviation, but I can also plot the frequency distribution. If I plot this frequency distribution, I can find that most of the income is between zero and 5,000, the next chunk is between 5,000 and 20,000, and beyond 20,000 there are only a few values. That's the first piece of information I get.

If I were to apply the concept of population versus sample based on this, you can see I can convert this into percentages: the majority is in zero to 5,000, some of it is in 5,000 to 20,000, and a few entries are greater than 20,000. This, let's say, is my sample. I need to verify whether the same proportion is there in my population, the complete dataset. If it is not there, I'm setting myself up for failure; my forecast accuracy will definitely take a hit. Please stop your exercise if the sample is not representative of the population. You must stop and first ensure that the representativeness is followed. That's one of the fundamental things in machine learning, or in statistics; machine learning is based on statistics. So that's the thing you will validate. A small sketch of this frequency-distribution check appears at the end of this section.

Now that you've got this, for categorical variables I find out the percentages. For example, in the case of gender, what is the percentage of male, what is the percentage of female? What is the percentage of graduates, what is the percentage of non-graduates?
Same for the numbers saying whether the individual is married or not. I can also put this in the form of a chart. All that I explained about population versus sample applies here also, so please ensure that. So I do this for both my continuous as well as my categorical variables. If I put it in a graph, I'm able to understand the composition of the data better, because a picture is worth a thousand words; I like to always use graphs. A sketch of these categorical percentages and a quick bar chart is given below as well.

So that's what we do in univariate analysis. Is that clear, and do you understand why univariate analysis is necessary?

OK, having understood this, now we are getting to bivariate analysis. Bivariate analysis studies how one variable interacts with the other variable, how the dependent variable interacts with the independent variables. That's what we're going to see in bivariate analysis. OK.
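Coming back to the income example from this section: here is a minimal sketch of how the frequency distribution and its percentage breakdown could be computed. Only the bin edges (0 to 5,000, 5,000 to 20,000, above 20,000) follow the lecture; the income figures themselves are invented.

```python
# Sketch of the income frequency distribution and its percentage breakdown.
# The income figures are invented; only the bin edges follow the lecture.
import pandas as pd

income = pd.Series([1200, 3400, 4800, 2500, 800, 7600, 15200, 9900, 18500, 26000, 31000, 4100])

bins = [0, 5000, 20000, float("inf")]
labels = ["0-5000", "5000-20000", ">20000"]
binned = pd.cut(income, bins=bins, labels=labels)

counts = binned.value_counts().reindex(labels)    # the frequency distribution
shares = (counts / counts.sum() * 100).round(1)   # the same distribution as percentages

print(counts)
print(shares)    # these sample percentages are what we compare against the population
print(income.mean(), income.median(), income.std())   # central tendency and dispersion for income
```

The printed percentages for the sample are the numbers you would then check against the corresponding breakdown in the complete dataset, exactly as described above.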
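And for the categorical side described above, a small hedged sketch of the percentage breakdown plus a quick bar chart; the column names ("gender", "graduate", "married") and the rows are made up for illustration.

```python
# Sketch of the categorical univariate step: percentage of each category plus a bar chart.
# The column names ("gender", "graduate", "married") and the rows are made up.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "gender":   ["Male", "Female", "Female", "Male", "Male", "Female", "Male"],
    "graduate": ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"],
    "married":  ["Yes", "No", "Yes", "Yes", "No", "No", "Yes"],
})

for col in ["gender", "graduate", "married"]:
    shares = (df[col].value_counts(normalize=True) * 100).round(1)
    print(f"\n{col} (%):")
    print(shares)
    shares.plot(kind="bar", title=f"{col} composition (%)")   # the picture worth a thousand words
    plt.show()
```

The same representativeness check applies: the percentage of each category in the sample should track its percentage in the population before you move on to bivariate analysis.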