1 00:00:02,530 --> 00:00:10,150 This session is about population versus sample, and what we are learning here will form the basis 2 00:00:10,150 --> 00:00:11,600 for the next session. 3 00:00:12,100 --> 00:00:14,050 That is hypothesis testing. 4 00:00:14,890 --> 00:00:17,560 So let's understand this concept clearly. 5 00:00:19,040 --> 00:00:25,400 When you're working on a machine learning project, you don't get to work on the complete data set. 6 00:00:26,210 --> 00:00:30,890 It could be due to various reasons, like the buying cost. 7 00:00:31,340 --> 00:00:32,890 The space may not be available. 8 00:00:33,500 --> 00:00:38,020 Part of the data may even be corrupted, or there may be other reasons. 9 00:00:38,330 --> 00:00:45,170 So you get to work on a part of the complete data set, known as a sample. When you're working on a 10 00:00:45,170 --> 00:00:45,750 sample, 11 00:00:46,250 --> 00:00:55,310 what you're actually doing is trying to draw inferences for the population based on the inferences 12 00:00:55,310 --> 00:00:56,060 from the sample. 13 00:00:57,450 --> 00:00:57,820 Right. 14 00:00:57,840 --> 00:01:04,140 That's what you're doing: you're developing a machine learning model for your smaller data set, but you 15 00:01:04,140 --> 00:01:07,620 apply that machine learning model to the entire data set. 16 00:01:09,850 --> 00:01:15,140 This population versus sample idea plays out in our day-to-day life also. 17 00:01:15,970 --> 00:01:23,110 Let's say you're trying to admit your kid in a school, and let's say the school has got 500 students. 18 00:01:23,620 --> 00:01:25,140 Before admitting your kid, 19 00:01:25,510 --> 00:01:31,030 you may probably want to take feedback about the school from 20 students whom you already know. 
20 00:01:32,610 --> 00:01:39,090 It is practically not possible to collect feedback from 500 students, but it is possible to collect 21 00:01:39,090 --> 00:01:46,890 feedback from 20 students. If you see here, if all the 20 students give good feedback, or 22 00:01:46,890 --> 00:01:52,740 the majority of the students give good feedback, I may probably end up admitting my kid in that 23 00:01:52,740 --> 00:01:53,400 particular school. 24 00:01:54,000 --> 00:01:55,300 So what am I doing here? 25 00:01:55,920 --> 00:02:02,520 I am extrapolating the inference that I got from my sample on to the entire population. 26 00:02:03,640 --> 00:02:09,620 I'm saying whatever inferences are drawn for the sample are applicable to the population as well. 27 00:02:10,670 --> 00:02:18,200 Right. Now, to ensure that whatever I have drawn as inferences for the sample is applicable to the 28 00:02:18,200 --> 00:02:23,940 population, I must ensure that this sample is representative. 29 00:02:24,470 --> 00:02:30,300 So what is this representativeness? If you see the example that is there on your screen, 30 00:02:30,650 --> 00:02:35,270 there are different types of students in the population and there are different types of students in 31 00:02:35,270 --> 00:02:35,760 the sample. 32 00:02:37,010 --> 00:02:46,250 OK, now, if the percentage of different types of students is maintained in the sample also, if the 33 00:02:46,250 --> 00:02:53,900 same level of percentage is maintained in my sample, then the sample is said to be representative of 34 00:02:53,900 --> 00:02:54,740 the population. 35 00:02:55,690 --> 00:03:05,620 In this scenario, this 43, 22, 17 and 17 is not reflected in the sample. The percentage contribution 36 00:03:05,830 --> 00:03:11,500 of different types of students is different from the numbers or the percentages that you see here. 37 00:03:11,950 --> 00:03:14,650 Hence, the sample is not representative. 
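A minimal sketch of the representativeness check just described, assuming the population percentages 43, 22, 17 and 17 quoted in the session. The group names, the sample counts and the 5-point tolerance are hypothetical, chosen only for illustration; the on-screen sample figures are not given in the transcript.

```python
# Population percentages quoted in the session; group names are placeholders.
population_pct = {"type_a": 43, "type_b": 22, "type_c": 17, "type_d": 17}

# Hypothetical sample of 20 students (made-up counts, for illustration only).
sample_counts = {"type_a": 3, "type_b": 10, "type_c": 5, "type_d": 2}


def sample_percentages(counts):
    """Convert raw group counts into percentages of the sample."""
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}


def largest_gap(population, sample):
    """Largest absolute difference, in percentage points, across groups."""
    return max(abs(population[g] - sample.get(g, 0.0)) for g in population)


def is_representative(population, counts, tolerance_pp=5.0):
    """True when every group's sample share is within tolerance_pp points.

    The tolerance is an assumed threshold, not a standard value.
    """
    return largest_gap(population, sample_percentages(counts)) <= tolerance_pp


print(sample_percentages(sample_counts))
print(is_representative(population_pct, sample_counts))
```

A sample whose group shares stay close to the population shares passes the check; one that over-weights a single group, like the made-up sample above, does not.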
38 00:03:15,460 --> 00:03:19,640 This is a very, very important concept in machine learning or deep learning. 39 00:03:20,780 --> 00:03:27,620 Why do you think this is applicable? Let's say you're developing a machine learning model to predict 40 00:03:27,620 --> 00:03:38,690 heart disease, and the majority of your data set comprises Indians, and you go and apply the model 41 00:03:38,690 --> 00:03:45,430 that you've developed with the majority of Indians in your database to, let's say, Chinese or Caucasians. 42 00:03:46,100 --> 00:03:55,520 Your inference wouldn't be accurate enough, because the propensity for heart disease is much higher in 43 00:03:55,520 --> 00:03:59,420 Indians and Asians compared to Chinese and Caucasians. 44 00:04:00,500 --> 00:04:12,170 Right, your model will be accurate only if the percentage contribution of different races is addressed 45 00:04:12,170 --> 00:04:13,640 in your sample. 46 00:04:15,200 --> 00:04:20,630 Are you understanding why this concept is relevant when it comes to machine learning and deep learning? 47 00:04:22,120 --> 00:04:22,620 OK. 48 00:04:25,250 --> 00:04:33,950 The same concept of representative sampling is important in the world of opinion polls also. If, let's 49 00:04:33,950 --> 00:04:40,520 say, you're trying to assess the likelihood of a particular party winning the elections in, say, the US or 50 00:04:40,550 --> 00:04:47,490 India, you must ensure that your sample is representative of the population of the electorate. 51 00:04:48,230 --> 00:04:56,300 If you're going to collect more feedback from, say, northern or central India, your conclusion or 52 00:04:56,300 --> 00:04:59,900 inference won't be comprehensive or accurate. 53 00:05:00,440 --> 00:05:02,370 The same holds true for the US also. 
54 00:05:02,840 --> 00:05:11,210 If your feedback is primarily from coastal areas, right, and you ignore the central and midwestern areas, your 55 00:05:11,210 --> 00:05:20,620 inference wouldn't be accurate because your sample is not representative of the population of the electorate. 56 00:05:21,590 --> 00:05:26,350 So please keep this in mind whenever you're working on samples. 57 00:05:28,960 --> 00:05:33,640 The other aspect when it comes to sampling is what is known as random sampling. 58 00:05:35,500 --> 00:05:42,760 Random sampling ensures that there are no biases when it comes to picking the sample for taking feedback 59 00:05:42,760 --> 00:05:50,290 or doing any type of study. In the example that you see on your screen, there is one yellow ball, one red 60 00:05:50,290 --> 00:05:54,990 ball, one blue ball and two green balls. 61 00:05:55,840 --> 00:06:02,640 Now, for the blue ball, I have taken the ball that is number 6. Instead of 6, 62 00:06:02,650 --> 00:06:08,640 I could have picked up 8 or 11, 9, 14, 16 or 17. 63 00:06:09,310 --> 00:06:15,270 Each of these blue balls has an equal chance of being picked up for the sample. 64 00:06:16,030 --> 00:06:23,410 If I ensure that each of the blue balls has an equal chance of being picked up, then I have ensured 65 00:06:23,620 --> 00:06:25,570 randomness in the sampling. 66 00:06:27,730 --> 00:06:34,180 Now, let's try to understand this using the school example. There are 500 students and I am taking 67 00:06:34,180 --> 00:06:36,360 feedback from 20 students. 68 00:06:37,240 --> 00:06:39,970 What if I collect feedback from 20 students 69 00:06:39,970 --> 00:06:43,000 I know very well? I'm introducing a bias, right? 70 00:06:46,020 --> 00:06:54,480 So when it comes to sampling, let's ensure the randomness is maintained and representativeness is also 71 00:06:54,480 --> 00:07:04,290 maintained. Only then will the model that you develop be accurate; otherwise you will have lower accuracy. 
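A minimal sketch of the random sampling idea above, using Python's standard `random` module. The blue ball numbers and the 500/20 figures come from the session; the student IDs 1 to 500 and the fixed seed are assumptions added so the sketch is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed (an assumption) so reruns give the same draw

# Blue ball numbers mentioned in the session.
blue_balls = [6, 8, 9, 11, 14, 16, 17]

# Every blue ball has the same 1-in-7 chance of being picked.
picked_ball = rng.choice(blue_balls)

# School example: 500 students, feedback from 20 chosen at random.
student_ids = list(range(1, 501))  # hypothetical IDs 1..500
feedback_group = rng.sample(student_ids, k=20)  # drawn without replacement

print(picked_ball)
print(sorted(feedback_group))
```

Picking the 20 students you already know would replace `rng.sample` with a hand-picked list, which is exactly where the bias comes in.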
72 00:07:05,490 --> 00:07:06,020 OK.