1
00:00:02,530 --> 00:00:10,150
This session is about population versus sample, and what we are learning here will form the basis

2
00:00:10,150 --> 00:00:11,600
for the next session.

3
00:00:12,100 --> 00:00:14,050
It is hypothesis testing.

4
00:00:14,890 --> 00:00:17,560
So let's understand this concept clearly.

5
00:00:19,040 --> 00:00:25,400
When you're working on a machine learning project, you don't get to work on the complete data set.

6
00:00:26,210 --> 00:00:30,890
It could be due to various reasons like buying cost.

7
00:00:31,340 --> 00:00:32,890
The space may not be available.

8
00:00:33,500 --> 00:00:38,020
Part of the data may even be corrupted, or there may be other reasons.

9
00:00:38,330 --> 00:00:45,170
So you get to work on a part of the complete data set, known as a sample. When you're working on a

10
00:00:45,170 --> 00:00:45,750
sample,

11
00:00:46,250 --> 00:00:55,310
what you're actually doing is trying to draw inferences for the population based on the inferences

12
00:00:55,310 --> 00:00:56,060
of the sample.

13
00:00:57,450 --> 00:00:57,820
Right.

14
00:00:57,840 --> 00:01:04,140
That's what you're doing: you're developing a machine learning model on your smaller data set, but you

15
00:01:04,140 --> 00:01:07,620
apply that machine learning model to the entire data set.

16
00:01:09,850 --> 00:01:15,140
This population versus sample idea plays out in our day-to-day life also.

17
00:01:15,970 --> 00:01:23,110
Let's say you're trying to admit your kid to a school, and let's say the school has got 500 students.

18
00:01:23,620 --> 00:01:25,140
Before admitting your kid,

19
00:01:25,510 --> 00:01:31,030
you may probably want to take feedback about the school from 20 students whom you already know.

20
00:01:32,610 --> 00:01:39,090
It is practically not possible to collect feedback from 500 students, but it is possible to collect

21
00:01:39,090 --> 00:01:46,890
feedback from 20 students. If you see here, if all the 20 students give good feedback, or

22
00:01:46,890 --> 00:01:52,740
the majority of the students give good feedback, I may probably end up admitting my kid to that

23
00:01:52,740 --> 00:01:53,400
particular school.

24
00:01:54,000 --> 00:01:55,300
So what am I doing here?

25
00:01:55,920 --> 00:02:02,520
I am extrapolating the inference that I got from my sample onto the entire population.

26
00:02:03,640 --> 00:02:09,620
I'm saying whatever inferences are drawn for the sample are applicable to the population as well.

27
00:02:10,670 --> 00:02:18,200
Right. Now, to ensure that whatever I have drawn as inference for the sample is applicable to the

28
00:02:18,200 --> 00:02:23,940
population, I must ensure that this sample is representative.

29
00:02:24,470 --> 00:02:30,300
So what is this representativeness? Look at the example that is there on your screen.

30
00:02:30,650 --> 00:02:35,270
There are different types of students in the population and there are different types of students in

31
00:02:35,270 --> 00:02:35,760
the sample.

32
00:02:37,010 --> 00:02:46,250
OK, now if the percentage of different types of students is maintained in the sample also, if the

33
00:02:46,250 --> 00:02:53,900
same level of percentage is maintained in my sample, then the sample is said to be representative of

34
00:02:53,900 --> 00:02:54,740
the population.

35
00:02:55,690 --> 00:03:05,620
In this scenario, this 43, 22, 17 and 17 is not reflected in the sample. The percentage contribution

36
00:03:05,830 --> 00:03:11,500
of different types of students is different from the numbers or the percentages that you see here.

37
00:03:11,950 --> 00:03:14,650
Hence, the sample is not representative.
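One way to make this representativeness check concrete is to compare the percentage of each type in the population with its percentage in the sample. The following is a minimal Python sketch, not something shown in the session. The type labels `A` to `D`, the tolerance of 5 percentage points, and both 20-student samples are assumptions made up for illustration. Only the 43, 22, 17 and 17 split comes from the example on the screen, and it is used here as counts, which gives roughly those percentages.

```python
from collections import Counter


def proportions(labels):
    """Return each category's share of the list as a percentage."""
    counts = Counter(labels)
    total = len(labels)
    return {category: 100.0 * n / total for category, n in counts.items()}


def is_representative(population, sample, tolerance=5.0):
    """True if every population category's share in the sample is within
    `tolerance` percentage points of its share in the population.
    The tolerance is an illustrative assumption, not a standard threshold."""
    pop = proportions(population)
    smp = proportions(sample)
    return all(abs(pop[c] - smp.get(c, 0.0)) <= tolerance for c in pop)


# Hypothetical population built from the 43 / 22 / 17 / 17 split.
population = ["A"] * 43 + ["B"] * 22 + ["C"] * 17 + ["D"] * 17

# Two made-up samples of 20 students each.
balanced_sample = ["A"] * 9 + ["B"] * 4 + ["C"] * 4 + ["D"] * 3
skewed_sample = ["A"] * 2 + ["B"] * 10 + ["C"] * 4 + ["D"] * 4

print(is_representative(population, balanced_sample))
print(is_representative(population, skewed_sample))
```

The balanced sample keeps every type within a few percentage points of the population, while the skewed sample over-represents type `B` and under-represents type `A`, so it fails the check.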

38
00:03:15,460 --> 00:03:19,640
This is a very, very important concept in machine learning or deep learning.

39
00:03:20,780 --> 00:03:27,620
Why do you think this is applicable? Let's say you're developing a machine learning model to predict

40
00:03:27,620 --> 00:03:38,690
heart disease. If the majority of your data set comprises Indians and you go and apply the model

41
00:03:38,690 --> 00:03:45,430
that you've developed with the majority of Indians in your data set to, let's say, Chinese or Caucasians,

42
00:03:46,100 --> 00:03:55,520
your inference wouldn't be accurate enough because the propensity for heart disease is much higher in

43
00:03:55,520 --> 00:03:59,420
Indians and South Asians compared to Chinese and Caucasians.

44
00:04:00,500 --> 00:04:12,170
Right, your model will be accurate only if the percentage contribution of different races is addressed

45
00:04:12,170 --> 00:04:13,640
in your sample.

46
00:04:15,200 --> 00:04:20,630
Are you understanding why this concept is relevant when it comes to machine learning and deep learning?

47
00:04:22,120 --> 00:04:22,620
OK.

48
00:04:25,250 --> 00:04:33,950
The same concept of representative sampling is important in the world of opinion polls also. If, let's

49
00:04:33,950 --> 00:04:40,520
say, you're trying to assess the likelihood of a particular party winning the elections in the US or

50
00:04:40,550 --> 00:04:47,490
India, you must ensure that your sample is representative of the population of the electorate.

51
00:04:48,230 --> 00:04:56,300
If you're going to collect more feedback from, say, northern or central India, your conclusion or

52
00:04:56,300 --> 00:04:59,900
inference won't be comprehensive or accurate.

53
00:05:00,440 --> 00:05:02,370
The same holds true for the US also.

54
00:05:02,840 --> 00:05:11,210
If your feedback is primarily from coastal areas, right, and you ignore the central and Midwest areas, your

55
00:05:11,210 --> 00:05:20,620
inference wouldn't be accurate because your sample is not representative of the population of the electorate.

56
00:05:21,590 --> 00:05:26,350
So please keep this in mind whenever you're working on samples.

57
00:05:28,960 --> 00:05:33,640
The other aspect when it comes to sampling is what is known as random sampling.

58
00:05:35,500 --> 00:05:42,760
Random sampling ensures that there are no biases when it comes to picking the sample for taking feedback

59
00:05:42,760 --> 00:05:50,290
or doing any type of study. In the example that you see on your screen, there is one yellow-colored one, one red-colored

60
00:05:50,290 --> 00:05:54,990
one, one blue-colored one and two green-colored balls.

61
00:05:55,840 --> 00:06:02,640
Now, for the blue-colored ball, I have taken the ball that is number six. Instead of six,

62
00:06:02,650 --> 00:06:08,640
I could have picked up eight or 11, nine, 14 or 16 or 17.

63
00:06:09,310 --> 00:06:15,270
Each of these blue balls has an equal chance of being picked up for the sample.

64
00:06:16,030 --> 00:06:23,410
If I ensure that each of the blue balls has an equal chance of being picked up, then I have ensured

65
00:06:23,620 --> 00:06:25,570
randomness in the sampling.
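The equal-chance idea can be checked with a small simulation. This is a rough Python sketch rather than anything from the session: the function name `draw_frequencies`, the number of draws and the fixed seed are assumptions for illustration, and the ball numbers are the blue ones mentioned above.

```python
import random
from collections import Counter


def draw_frequencies(items, n_draws, seed=0):
    """Pick one item uniformly at random, n_draws times, and return
    the fraction of draws in which each item was picked."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    counts = Counter(rng.choice(items) for _ in range(n_draws))
    return {item: counts[item] / n_draws for item in items}


# Blue ball numbers mentioned in the example.
blue_balls = [6, 8, 9, 11, 14, 16, 17]

frequencies = draw_frequencies(blue_balls, n_draws=70_000)
for ball, freq in frequencies.items():
    print(f"ball {ball}: picked in {freq:.3f} of draws")
```

With seven balls and an unbiased pick, each fraction should come out close to 1/7, which is about 0.143. A pick that always favored ball six would show up as one fraction far above the others.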

66
00:06:27,730 --> 00:06:34,180
Now, let's try to understand this using the school example. There are 500 students and I am taking

67
00:06:34,180 --> 00:06:36,360
feedback from 20 students.

68
00:06:37,240 --> 00:06:39,970
What if I collect feedback from 20 students

69
00:06:39,970 --> 00:06:43,000
I know very well? I'm introducing a bias, right?

70
00:06:46,020 --> 00:06:54,480
So when it comes to sampling, let's ensure that randomness is maintained and representativeness is also

71
00:06:54,480 --> 00:07:04,290
maintained. Only then will the model that you develop be accurate; otherwise you will have lower accuracy.
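Randomness and representativeness can be maintained together by sampling at random within each group, in proportion to the group's size. This is usually called stratified random sampling, a name the session itself does not use. The sketch below is one possible Python version under stated assumptions: the group names and their sizes are hypothetical, chosen so that a school of 500 students splits evenly into a sample of 20.

```python
import random


def stratified_sample(groups, sample_size, seed=None):
    """Draw a random sample from each group, sized in proportion to the
    group's share of the whole population.

    Note: round() is used for simplicity, so when the shares do not
    divide evenly the total may differ slightly from sample_size."""
    rng = random.Random(seed)
    total = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)  # picks k members without repeats
    return sample


# Hypothetical school of 500 students, identified by number, in four groups.
groups = {
    "group_1": list(range(0, 200)),
    "group_2": list(range(200, 350)),
    "group_3": list(range(350, 450)),
    "group_4": list(range(450, 500)),
}

picked = stratified_sample(groups, sample_size=20, seed=42)
for name, students in picked.items():
    print(name, len(students), sorted(students))
```

Each group contributes the same share to the sample as it has in the school (8, 6, 4 and 2 students here), and within a group every student has an equal chance of being picked.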

72
00:07:05,490 --> 00:07:06,020
OK.