1 00:00:00,600 --> 00:00:05,070 Hello, guys, and welcome back to another class of our course, the complete introduction to 2 00:00:05,070 --> 00:00:08,040 data science and Python programming. 3 00:00:08,940 --> 00:00:15,480 So in this class, we are going to have a basic introduction to some statistical concepts. So in the 4 00:00:15,480 --> 00:00:16,080 last class, 5 00:00:16,080 --> 00:00:21,660 we talked about what statistics is, and in this class, we are going to talk about some basic statistical 6 00:00:21,660 --> 00:00:23,850 concepts. The statistical 7 00:00:23,850 --> 00:00:28,620 concept that we are going to cover in this class is sampling. 8 00:00:28,920 --> 00:00:32,020 And this is exactly what we are going to talk about. 9 00:00:32,040 --> 00:00:35,950 So let's jump right into it right now. 10 00:00:36,420 --> 00:00:37,970 So basically, 11 00:00:38,190 --> 00:00:42,680 in statistics, or in the world in general, there is a population. 12 00:00:42,690 --> 00:00:46,710 So let's say, for example, we want to study a certain population. We want to study, 13 00:00:46,710 --> 00:00:48,840 for example, the population of the United States, 14 00:00:49,530 --> 00:00:54,110 to know, for example, if they love pizza. Once again, it's just an example. 15 00:00:54,360 --> 00:01:02,850 So we can ask each citizen of the US, or each person who lives in the US, if they love pizza by sending 16 00:01:03,090 --> 00:01:07,450 them a survey, by just asking them a question, or whatever.
17 00:01:07,830 --> 00:01:13,830 So basically, you could ask the whole population, but in practice it's not possible to ask the whole population because, 18 00:01:13,830 --> 00:01:21,420 first of all, it's going to take too many resources to monitor, control, and be able to ask the 19 00:01:21,420 --> 00:01:28,410 whole population this question, if they like pizza, for example, and also because logistically it's 20 00:01:28,410 --> 00:01:34,310 almost impossible, and, well, there will be no worthwhile outcome. 21 00:01:34,320 --> 00:01:40,180 So basically, you will not make more money by asking the whole population if they love pizza. 22 00:01:40,650 --> 00:01:42,340 So what exactly would we do? 23 00:01:42,630 --> 00:01:48,780 So instead of asking the whole population, we ask a sample of the population, and this is where we have 24 00:01:48,780 --> 00:01:51,210 the difference between a population and a sample. 25 00:01:51,450 --> 00:01:59,610 So basically, as you can see here, the population will be, well, all the US citizens, all 26 00:01:59,610 --> 00:02:06,750 the US population, and the sample will simply be a part of this population to whom we will ask the question. 27 00:02:07,260 --> 00:02:10,770 So first of all, having a sample is way easier. 28 00:02:11,610 --> 00:02:16,230 So in this case, when you guys have a sample to whom you will ask a certain question, 29 00:02:17,070 --> 00:02:23,280 it's easier to monitor, it's easier to control, and it's easier to find a sample than to ask the whole 30 00:02:23,280 --> 00:02:23,880 population. 31 00:02:24,120 --> 00:02:25,530 Logistically, it's way easier. 32 00:02:25,830 --> 00:02:29,190 And usually the results are pretty good. 33 00:02:29,190 --> 00:02:36,270 But there are a lot of different sampling techniques, and not all of them are perfect, 34 00:02:36,270 --> 00:02:39,270 but usually they work pretty well. 35 00:02:40,560 --> 00:02:44,250 So you have two types of sampling methods.
36 00:02:44,520 --> 00:02:51,570 So you have the probability sampling methods and the non-probability sampling methods, and basically the main difference 37 00:02:51,570 --> 00:02:53,340 between both of them, 38 00:02:53,340 --> 00:03:01,410 so probability and non-probability, is that the probability sampling method will be based on selecting 39 00:03:01,410 --> 00:03:02,520 random people. 40 00:03:02,910 --> 00:03:11,700 So you will try to select people as randomly as possible, and the non-probability sampling method 41 00:03:12,660 --> 00:03:16,260 will not necessarily always select random people. 42 00:03:16,260 --> 00:03:23,190 So yes, we'll have random people, but you will understand in the next few slides why it's not as 43 00:03:23,460 --> 00:03:31,620 randomized as the probability sampling method, which is really based 44 00:03:31,620 --> 00:03:33,210 on random people. 45 00:03:34,650 --> 00:03:42,090 Does it make the non-probability sampling method worse than the probability sampling method 46 00:03:42,090 --> 00:03:45,710 because the people are not necessarily 100 percent random? 47 00:03:47,010 --> 00:03:49,800 No, but at the same time it's not the best. 48 00:03:49,800 --> 00:03:55,860 So usually experts prefer using probability sampling methods because they are based more on random people 49 00:03:55,860 --> 00:04:00,660 and, well, they are more randomized than the non-probability sampling methods. 50 00:04:01,590 --> 00:04:08,490 But once again, sometimes it's way more practical to work with a non-probability sampling method, even 51 00:04:08,490 --> 00:04:12,300 if it's not the best, than to use a probability sampling method. 52 00:04:12,990 --> 00:04:23,160 So let's understand what exactly a sampling method is and what all the sampling methods are 53 00:04:23,160 --> 00:04:23,670 right here.
54 00:04:25,470 --> 00:04:29,850 So basically, in the probability sampling methods, we have four sampling methods. 55 00:04:30,060 --> 00:04:35,760 So we have the simple random sample, the systematic sample, the stratified sample, and finally the 56 00:04:35,760 --> 00:04:36,960 cluster sample. 57 00:04:37,710 --> 00:04:43,830 The simple random sample is basically, let's say, for example, we have a population of one 58 00:04:43,830 --> 00:04:45,060 thousand people, for example. 59 00:04:45,300 --> 00:04:50,340 And you want to select, let's say, ten people. You need ten people for your sample. 60 00:04:50,700 --> 00:04:55,530 So instead of, let's say, I don't know, picking specific people, I will select really random people 61 00:04:55,530 --> 00:04:57,570 out of those one thousand persons. 62 00:04:57,840 --> 00:04:59,460 So I will select, for example, 63 00:04:59,460 --> 00:04:59,730 I will select 64 00:05:00,190 --> 00:05:08,710 really ten random people, and this would be a simple random sample, and this is really simple 65 00:05:08,710 --> 00:05:11,520 to do, so you just select 10 persons. 66 00:05:12,220 --> 00:05:14,920 The next one would be the systematic sample. 67 00:05:15,430 --> 00:05:17,340 It's pretty much the same. 68 00:05:17,530 --> 00:05:20,110 Well, it's not the same thing as the simple random sample. 69 00:05:20,320 --> 00:05:24,770 It looks like the simple random sample, but it's a bit different. 70 00:05:25,030 --> 00:05:30,730 So let's say, for example, we have once again our population of one thousand people, and 71 00:05:30,740 --> 00:05:33,240 we'll give a number to each person. 72 00:05:33,250 --> 00:05:37,120 So let's say, for example, each person will be number one, number two, number three, up to one thousand. 73 00:05:38,140 --> 00:05:42,790 Then from that moment, let's say, for example, you select person number five.
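The simple random sample described above (ten people drawn from a population of one thousand) can be sketched in Python roughly like this. The numbered population and the fixed seed are assumptions made only for illustration; they are not from the class.

```python
import random

# Hypothetical population: people numbered 1 to 1000, as in the example above.
population = list(range(1, 1001))

# Fixed seed (an assumption) so the sketch gives the same result on every run.
rng = random.Random(42)

# Draw 10 distinct people; every person has the same chance of being picked.
simple_random_sample = rng.sample(population, k=10)

print(simple_random_sample)
```

`random.sample` draws without replacement, so the same person cannot appear twice in the sample.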
74 00:05:43,180 --> 00:05:49,640 And from this person, you say, OK, each person plus ten will be used inside of my sample. 75 00:05:49,990 --> 00:05:55,060 So in other words, you select person number five, then the next person in your sample would be 76 00:05:55,060 --> 00:05:57,370 person number 15. So you take person number five, 77 00:05:57,370 --> 00:05:58,420 fifteen, twenty-five, 78 00:05:58,420 --> 00:05:59,990 thirty-five, forty-five. 79 00:06:00,250 --> 00:06:04,600 So as you can see in the picture here, you will select person number five, person 80 00:06:04,600 --> 00:06:13,580 fifteen, twenty-five. It's going to be the exact same jump each and every time. 81 00:06:13,600 --> 00:06:14,130 So, 82 00:06:14,140 --> 00:06:16,330 this is for the systematic sample. 83 00:06:17,080 --> 00:06:18,830 As you can see, it's not really complicated. 84 00:06:19,510 --> 00:06:22,100 So the next one would be the stratified sampling. 85 00:06:22,810 --> 00:06:24,220 This one is a bit different. 86 00:06:24,230 --> 00:06:26,470 So once again, it's based on random people. 87 00:06:26,470 --> 00:06:34,420 But you want, let's say, for example, to have a clear, how do you say, you want to have 88 00:06:34,420 --> 00:06:39,100 a clear representation of each group of the population inside of your sample. 89 00:06:39,400 --> 00:06:43,540 So let's say, for example, you have once again your population of one thousand people, and you have 90 00:06:43,540 --> 00:06:50,390 two hundred people who wear a green t-shirt and 800 people who wear, let's say, a blue t-shirt. 91 00:06:50,950 --> 00:06:54,540 So what will you do? You need a sample of 10 persons. 92 00:06:54,910 --> 00:07:01,390 So you will select two persons who wear a green t-shirt and you select eight persons who wear a blue 93 00:07:01,390 --> 00:07:06,700 t-shirt, just to be able to properly represent each group inside of your sample.
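The two ideas just described, the fixed jump of the systematic sample and the two-green, eight-blue split of the stratified sample, can be sketched as follows. The function names `systematic_sample` and `stratified_sample`, the labels such as `green_0`, and the fixed seed are hypothetical, chosen only for this illustration.

```python
import random

# Hypothetical population: people numbered 1 to 1000.
population = list(range(1, 1001))


def systematic_sample(people, start, step):
    """Take the person at position `start`, then every `step`-th person after."""
    return people[start - 1::step]


# Start at person 5 and always jump by 10.
systematic = systematic_sample(population, start=5, step=10)
print(systematic[:5])  # [5, 15, 25, 35, 45]


def stratified_sample(strata, sample_size, rng):
    """Draw randomly inside each group, in proportion to the group's size."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding is a simplification; with other sizes the quotas
        # might not add up exactly to sample_size.
        quota = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k=quota)
    return sample


# 200 people with a green t-shirt and 800 with a blue t-shirt.
strata = {
    "green": [f"green_{i}" for i in range(200)],
    "blue": [f"blue_{i}" for i in range(800)],
}
stratified = stratified_sample(strata, sample_size=10, rng=random.Random(0))
print({name: len(chosen) for name, chosen in stratified.items()})  # {'green': 2, 'blue': 8}
```

Note that starting at person 5 with a jump of 10 over one thousand people gives 100 people, not 10; for a sample of 10 the jump would have to be 100.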
94 00:07:06,910 --> 00:07:13,360 So you will not select just random people in this population, because let's say you end up with five persons 95 00:07:13,360 --> 00:07:19,150 who wear a green t-shirt and five persons who wear a blue t-shirt. This will not be representative 96 00:07:19,150 --> 00:07:24,280 of your population because you have way more persons with a blue t-shirt. 97 00:07:25,210 --> 00:07:30,280 So in this case, let's say you select eight persons with a blue t-shirt, but you will select random 98 00:07:30,280 --> 00:07:35,960 people inside of this population of eight hundred people with a blue t-shirt. 99 00:07:36,520 --> 00:07:43,750 So basically, the stratified sample is really about selecting, well, the right amount 100 00:07:43,750 --> 00:07:45,170 of people in each group. 101 00:07:45,190 --> 00:07:51,910 So basically, if you have a bigger group of people with a blue t-shirt, necessarily you will select 102 00:07:52,450 --> 00:07:58,940 well, a higher number of people from the group of people who have blue t-shirts. 103 00:07:59,410 --> 00:08:01,510 So this is for the stratified sample. 104 00:08:01,900 --> 00:08:08,890 The last type of sample that is in the probability sampling methods would be the cluster, 105 00:08:09,880 --> 00:08:16,090 the cluster sampling, and basically this method involves dividing the population into 106 00:08:16,090 --> 00:08:23,950 subgroups, and each subgroup should have similar characteristics to the whole 107 00:08:23,950 --> 00:08:24,550 population. 108 00:08:24,700 --> 00:08:30,730 So in other words, instead of sampling individuals from each subgroup, you randomly select 109 00:08:30,730 --> 00:08:31,810 entire subgroups. 110 00:08:32,170 --> 00:08:33,740 So basically, what does this mean? 111 00:08:33,760 --> 00:08:36,440 It means you will work with, 112 00:08:36,460 --> 00:08:37,960 well, small groups.
113 00:08:38,410 --> 00:08:41,050 Let's say you have a population of one thousand people once again. 114 00:08:41,710 --> 00:08:46,350 And this population, you will divide it into groups of, let's say, five people. 115 00:08:46,630 --> 00:08:51,340 So instead of selecting one person, you will select a whole group of five people. 116 00:08:52,340 --> 00:08:55,390 So it's like the random sampling, the simple random sample. 117 00:08:55,630 --> 00:09:00,430 But instead of working with one person, you will work with small groups. 118 00:09:00,460 --> 00:09:02,830 So basically, the units would be groups. 119 00:09:04,150 --> 00:09:06,510 So this is for the probability sampling methods. 120 00:09:06,520 --> 00:09:10,830 As I said, those methods are really based on randomness. 121 00:09:10,840 --> 00:09:19,570 So you work with random people, and those are really the most used by experts because they are more 122 00:09:19,570 --> 00:09:25,870 representative of the population, and you have more representative numbers in general with those sampling 123 00:09:25,870 --> 00:09:26,290 methods. 124 00:09:27,250 --> 00:09:32,200 So the second type of sampling methods will be the non-probability sampling methods. 125 00:09:32,500 --> 00:09:38,680 And in this case, as I said, this one is, well, a bit different from the probability sampling method 126 00:09:38,980 --> 00:09:44,530 because you will work with people, not necessarily people you know, but people that are more accessible to you. 127 00:09:45,100 --> 00:09:51,910 So basically, you will select the people whom you ask, or with whom you will conduct the survey or 128 00:09:51,910 --> 00:09:52,950 whatever you want to do. 129 00:09:52,960 --> 00:09:53,860 So, 130 00:09:53,860 --> 00:09:59,620 let's say you want to study people who love pizza, so you will decide whom you ask.
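The cluster sampling described a little earlier (a population of one thousand divided into groups of five, with whole groups selected at random) can be sketched like this. Grouping consecutive numbers together, selecting two groups, and the fixed seed are all assumptions for illustration; real clusters are usually natural groups.

```python
import random

# Hypothetical population: people numbered 1 to 1000.
population = list(range(1, 1001))
group_size = 5

# Divide the population into 200 groups of 5 people each.
clusters = [
    population[i:i + group_size]
    for i in range(0, len(population), group_size)
]

# Fixed seed (an assumption) so the sketch is repeatable.
rng = random.Random(1)

# Randomly select 2 entire groups instead of 10 individual people.
chosen_clusters = rng.sample(clusters, k=2)

# Everyone in a selected group ends up in the sample.
cluster_sample = [person for cluster in chosen_clusters for person in cluster]
print(cluster_sample)
```

The random step happens at the group level only: once a group is selected, all five of its members are taken.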
131 00:10:00,100 --> 00:10:07,000 So let's see those four sampling methods. First of all, we have the convenience sample, 132 00:10:07,000 --> 00:10:13,150 which is the first one right here, then we have the voluntary response sample, the purposive sample, 133 00:10:13,150 --> 00:10:15,070 and finally the snowball sample. 134 00:10:15,940 --> 00:10:17,920 So first of all, we have the convenience sample. 135 00:10:17,930 --> 00:10:22,810 This one is, well, the simplest. Basically, how it works, 136 00:10:22,810 --> 00:10:24,550 it's pretty simple. 137 00:10:24,880 --> 00:10:32,680 Imagine you need a sample of 10 people and you don't select random people. 138 00:10:32,680 --> 00:10:34,460 You just select people who you know. 139 00:10:34,480 --> 00:10:40,480 So in this case, let's say, for example, you have friends, and it is way more convenient for you 140 00:10:40,660 --> 00:10:45,880 to ask your friends to fill out the survey than to go and ask some strangers somewhere. 141 00:10:46,240 --> 00:10:48,100 So basically, this is the convenience sample. 142 00:10:48,100 --> 00:10:53,560 So you ask, let's say, for example, your friends. You tell your friends, you have your sample of 143 00:10:53,560 --> 00:10:55,810 10 people, and here we go. 144 00:10:55,810 --> 00:11:00,250 Well, you have your sample and you conduct your study just with your 10 friends. 145 00:11:00,610 --> 00:11:04,570 But let's say, for example, all ten of your friends like pizza. Once again, this is not representative 146 00:11:04,570 --> 00:11:08,100 of the whole population because, once again, you controlled the sample. 147 00:11:08,500 --> 00:11:11,320 So this is why it's not always a good thing. 148 00:11:12,760 --> 00:11:16,060 The second one would be the voluntary response sample. 149 00:11:17,200 --> 00:11:20,020 This one is pretty much like the convenience sample. 150 00:11:20,020 --> 00:11:24,580 But instead of asking your friends, you ask, well,
151 00:11:26,450 --> 00:11:31,970 you will ask people to answer, well, with surveys and everything, you ask people to volunteer to 152 00:11:31,970 --> 00:11:33,580 respond to your survey. 153 00:11:33,650 --> 00:11:39,110 So basically, let's say you come somewhere, you drop your survey, and people come and fill out your 154 00:11:39,110 --> 00:11:39,560 survey. 155 00:11:40,460 --> 00:11:47,570 Once again, this is really limited, and this is not really random and representative of the whole population, 156 00:11:47,570 --> 00:11:52,970 because let's say, for example, you're at the university, you come to your campus somewhere, you 157 00:11:52,970 --> 00:11:57,590 sit down, you put out your survey. You will not necessarily have access to the whole population or 158 00:11:57,590 --> 00:11:59,270 to a representative sample. 159 00:11:59,660 --> 00:12:05,870 You will have people, let's say, from one department, who are studying at that department, at that 160 00:12:05,870 --> 00:12:06,590 university, 161 00:12:06,830 --> 00:12:12,800 who will mainly answer your survey, which is not really representative of the whole population. 162 00:12:12,800 --> 00:12:17,510 It's just representative of the department where you conducted the survey in question. 163 00:12:18,950 --> 00:12:19,210 All right. 164 00:12:19,220 --> 00:12:24,980 So this is for the voluntary response sample, and then after that we'll have the purposive sampling. 165 00:12:25,160 --> 00:12:30,650 So this method of sampling will simply involve the researcher using their judgment to select 166 00:12:30,650 --> 00:12:34,800 a sample that is the most useful to the purpose of the research. 167 00:12:35,660 --> 00:12:43,110 In other words, you select the people on whom you will conduct the research. 168 00:12:43,490 --> 00:12:48,110 So once again, this is not representative at all, in my opinion.
169 00:12:48,110 --> 00:12:54,260 Once again, in this case, if you want to stay as random as possible, this is not 170 00:12:54,260 --> 00:12:55,270 representative at all. 171 00:12:56,030 --> 00:12:56,330 Why? 172 00:12:56,340 --> 00:13:03,250 Because once again, the researcher will select the people who, for him, are fit for the research. 173 00:13:03,560 --> 00:13:10,280 So let's say, for example, you want to look for people who, I don't know, love pizza, and the researcher 174 00:13:10,460 --> 00:13:18,050 wants to have, let's say, for example, positive answers in his, I don't 175 00:13:18,050 --> 00:13:18,670 know, his survey. 176 00:13:19,130 --> 00:13:24,310 So he starts asking people who he knows love pizza to answer his survey. 177 00:13:24,320 --> 00:13:29,440 Once again, the results will not necessarily represent the population in general. 178 00:13:29,810 --> 00:13:32,640 So basically, purposive sampling is just that. 179 00:13:32,650 --> 00:13:41,670 Well, it could work in some cases, in cases where you really need some exact people, where you 180 00:13:41,750 --> 00:13:46,390 need subjects that fit certain criteria. 181 00:13:46,670 --> 00:13:50,780 But if you want to stay as random as possible, it's not the best way to conduct the research. 182 00:13:51,170 --> 00:13:58,540 This is the third type of non-probability sampling method, and the last one would be the snowball sampling. 183 00:13:59,180 --> 00:14:02,390 So in this case, you will do the exact same thing. 184 00:14:02,390 --> 00:14:04,580 So the way it works, 185 00:14:04,580 --> 00:14:08,720 it's like convenience sampling, but just with a snowball effect.
186 00:14:08,960 --> 00:14:13,850 So let's say, for example, you ask two of your friends to answer your survey and you ask your friends 187 00:14:13,850 --> 00:14:20,030 to share your survey with some other people, so your friends will share your survey with people who will 188 00:14:20,030 --> 00:14:23,180 share your survey with other people, etc. 189 00:14:23,510 --> 00:14:30,050 So basically, as I said, it's like a convenience sample, but just with a snowball effect. In some 190 00:14:30,050 --> 00:14:32,300 cases, this could be really effective. 191 00:14:32,300 --> 00:14:37,400 So let's say, for example, you are studying, I don't know, you are studying a certain community 192 00:14:37,410 --> 00:14:41,900 somewhere in the country, and you know just one person inside of this community. 193 00:14:41,900 --> 00:14:48,380 So maybe you can ask this person to fill out your survey and ask this person to share, 194 00:14:48,380 --> 00:14:52,880 for example, this survey in his or her community. 195 00:14:53,300 --> 00:14:57,800 And in this case, you will be able to conduct your study on this community in general. 196 00:14:58,640 --> 00:15:05,240 Once again, this is not necessarily representative, well, of the community, because it may only be representative 197 00:15:05,240 --> 00:15:08,790 of the community in a certain place. 198 00:15:08,790 --> 00:15:13,010 So, for example, you are studying, I don't know, you are studying the population of people who wear 199 00:15:13,010 --> 00:15:15,950 a blue t-shirt in the United States. 200 00:15:15,950 --> 00:15:20,060 But you are conducting your research, let's say, in New York City. 201 00:15:20,360 --> 00:15:23,480 So you ask someone to fill out your survey and to pass it on. 202 00:15:23,660 --> 00:15:29,090 And since you don't know people who are wearing blue t-shirts, this person who is wearing a blue 203 00:15:29,090 --> 00:15:34,250 t-shirt will simply hand your survey to other people who are wearing blue t-shirts.
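The snowball effect described above, where two friends answer the survey and pass it on to the people they know, can be sketched as a simple walk through a network of contacts. The names, the `contacts` dictionary, and the `snowball_sample` function are all hypothetical, made up only for this illustration.

```python
from collections import deque

# Hypothetical network: who passes the survey on to whom.
contacts = {
    "ana": ["cara", "dev"],
    "ben": ["dev", "eli"],
    "cara": [],
    "dev": ["fay"],
    "eli": [],
    "fay": [],
}


def snowball_sample(contacts, seeds, max_size):
    """Start from the seed people and follow referrals until max_size is reached."""
    respondents = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(respondents) < max_size:
        person = queue.popleft()
        respondents.append(person)
        # Each respondent shares the survey with the people they know.
        for friend in contacts.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return respondents


# The two friends you asked directly are the seeds.
sample = snowball_sample(contacts, seeds=["ana", "ben"], max_size=5)
print(sample)  # ['ana', 'ben', 'cara', 'dev', 'eli']
```

Nobody outside the seeds' network can ever be reached this way, which is one way to see why such a sample is not necessarily representative of the whole population.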
204 00:15:34,370 --> 00:15:37,760 Once again, this is just an example for you guys to understand. 205 00:15:38,930 --> 00:15:43,340 So in this case, it could be representative, 206 00:15:43,340 --> 00:15:50,060 but only up to a certain point, and it's not the best way, well, to conduct 207 00:15:50,060 --> 00:15:54,650 a random study, to represent the population in general and to conduct a study. 208 00:15:55,550 --> 00:15:58,400 So this is for the sampling methods that exist. 209 00:15:58,970 --> 00:16:01,850 So, as I said, you have eight of those sampling methods. 210 00:16:02,300 --> 00:16:05,000 You have two groups of sampling methods. 211 00:16:05,000 --> 00:16:11,150 You have the probability sampling methods and the non-probability sampling methods. To understand the difference 212 00:16:11,150 --> 00:16:12,110 between both: 213 00:16:12,110 --> 00:16:14,060 the first one is really randomized. 214 00:16:14,060 --> 00:16:19,520 So you ask some random people and you work really with random people, which is more representative 215 00:16:19,520 --> 00:16:20,900 of the population in general. 216 00:16:20,900 --> 00:16:25,310 And the second one would be the non-probability sampling method, which is not 217 00:16:25,390 --> 00:16:29,750 necessarily worse. 218 00:16:29,920 --> 00:16:34,870 Well, it's more efficient and easier to do because you will ask, for example, friends. 219 00:16:34,870 --> 00:16:40,260 You ask people who you know. Once again, it's not necessarily representative of the whole population. 220 00:16:40,270 --> 00:16:40,840 It could work. 221 00:16:40,840 --> 00:16:41,940 It could bring you numbers. 222 00:16:42,190 --> 00:16:44,590 Once again, those numbers are not necessarily exact. 223 00:16:45,470 --> 00:16:46,750 So that's it for this part of the course. 224 00:16:46,790 --> 00:16:53,500 Guys, right now, you know the difference between all the sampling methods and all in our next class.