1 00:00:02,480 --> 00:00:05,900 So let's get started with the exploratory data analysis. 2 00:00:06,500 --> 00:00:09,470 The first step is understanding the data. 3 00:00:10,320 --> 00:00:17,670 I want to understand what is really there in my data, and the first step for that is determining what 4 00:00:17,670 --> 00:00:20,520 is my dependent variable and what is my independent variable. 5 00:00:21,900 --> 00:00:30,140 Let me explain this concept by taking a very simple example, a restaurant that provides pizza and home 6 00:00:30,150 --> 00:00:31,140 delivers pizza. 7 00:00:33,450 --> 00:00:40,920 That restaurant's time taken to deliver pizza is dependent on the order volume, right, how many 8 00:00:40,920 --> 00:00:41,650 orders are there. 9 00:00:42,660 --> 00:00:44,950 The time taken to deliver the pizza. 10 00:00:45,480 --> 00:00:47,440 The time taken to prepare the pizza. 11 00:00:47,640 --> 00:00:50,390 Its cooking time and the oven temperature. 12 00:00:51,270 --> 00:00:55,890 So the time taken to deliver is dependent on these four factors. 13 00:00:56,920 --> 00:01:04,180 So this is my dependent variable and these are the independent variables. 14 00:01:05,700 --> 00:01:06,010 Right. 15 00:01:06,720 --> 00:01:14,280 This is the first thing we ascertain. If I were to develop a machine learning model for this example, it 16 00:01:14,280 --> 00:01:14,730 would be: 17 00:01:16,300 --> 00:01:23,980 What will be the time taken to deliver pizza based on order volume, cooking time, oven temperature and delivery? 18 00:01:25,090 --> 00:01:25,290 Right. 19 00:01:26,170 --> 00:01:29,410 So you get an idea of what is an independent variable and a dependent variable. 20 00:01:30,930 --> 00:01:32,340 Another example would be: 21 00:01:34,000 --> 00:01:37,690 The number of marks I score in an exam is dependent 22 00:01:38,950 --> 00:01:41,620 on the number of hours I study.
23 00:01:42,960 --> 00:01:52,020 Right, so the marks obtained is the dependent variable and the number of hours I study is the independent variable. 24 00:01:53,600 --> 00:01:53,910 Right. 25 00:01:55,090 --> 00:02:01,110 So let's see the scenario that we will be taking in our exploratory data analysis. 26 00:02:02,050 --> 00:02:09,280 We will be taking an insurance case study. The insurance company has to determine the insurance charges 27 00:02:09,280 --> 00:02:11,980 for an individual who has taken a home loan. 28 00:02:14,110 --> 00:02:21,520 The insurance company will evaluate eleven parameters and determine the insurance charges for the individual. 29 00:02:22,330 --> 00:02:28,300 As you can see, the parameters are gender, married, whether the individual is married or not, number 30 00:02:28,300 --> 00:02:34,660 of dependents, education level, whether the individual is self-employed, what is the applicant's income? 31 00:02:35,500 --> 00:02:37,530 What is the co-applicant's income? 32 00:02:37,570 --> 00:02:38,820 What is the loan amount? 33 00:02:38,830 --> 00:02:42,130 What is the loan tenure? What is the credit history, 34 00:02:43,000 --> 00:02:48,610 if the individual has taken loans in the past? What is the property? 35 00:02:49,000 --> 00:02:51,790 So all these things are ascertained by the 36 00:02:54,090 --> 00:03:00,210 insurance company. As you can see, that I own my summers and American summers normally, right? 37 00:03:02,870 --> 00:03:08,300 And if you see here, the insurance charges are dependent on these factors. Health insurance charges 38 00:03:08,300 --> 00:03:13,990 is the dependent variable and all the others that you see here are independent variables. 39 00:03:15,660 --> 00:03:16,140 Correct. 40 00:03:17,070 --> 00:03:23,480 So this is the first step. We are going to be using this example throughout the exploratory data analysis. 41 00:03:26,110 --> 00:03:30,840 After this, you want to understand what is actually there in the data, right?
42 00:03:31,810 --> 00:03:38,710 For that, you will use data.head(3). It tells you the top three records that are there in 43 00:03:38,710 --> 00:03:41,210 my data, so you get an idea of what is there. 44 00:03:41,550 --> 00:03:44,770 You can see that there is also a case of 45 00:03:45,910 --> 00:03:47,720 probably a junk or missing value. 46 00:03:48,370 --> 00:03:49,760 How are you going to treat that? 47 00:03:50,260 --> 00:03:54,670 That is one of the things that we are going to cover in the subsequent sections. 48 00:03:55,130 --> 00:04:01,390 Remember, if you don't address a lot of junk values, your forecast accuracy will take a hit and probably 49 00:04:01,390 --> 00:04:03,400 you may even get an error. 50 00:04:05,760 --> 00:04:11,640 Point to note: we'll be using Google Colab and Python. We will use Colab as the development environment 51 00:04:11,640 --> 00:04:13,500 and Python as the programming language. 52 00:04:14,250 --> 00:04:21,240 So the file that I will be using for analysis, the file containing the past data, is actually stored in 53 00:04:21,240 --> 00:04:22,210 Google Drive. 54 00:04:22,860 --> 00:04:26,960 I first need to establish the connection between Colab and Google Drive. 55 00:04:27,300 --> 00:04:27,520 Right. 56 00:04:28,020 --> 00:04:29,190 And after that, 57 00:04:30,110 --> 00:04:35,630 I import pandas. And then, remember, this is my file location. 58 00:04:36,550 --> 00:04:38,170 The file is stored in Google Drive. 59 00:04:39,250 --> 00:04:39,560 Yeah. 60 00:04:42,190 --> 00:04:45,550 Then, I want to understand what type of data is there in my dataset. 61 00:04:46,530 --> 00:04:49,080 Like, you see, these are the variables that you saw. 62 00:04:50,460 --> 00:04:50,780 Right. 63 00:04:52,360 --> 00:04:58,080 If I use data.info(), I get whether it is a float or an integer or an object. 64 00:04:58,660 --> 00:05:03,550 I get all this type of information. 65 00:05:04,660 --> 00:05:05,020 Right.
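The steps described above (loading the file with pandas, previewing the top three records, then checking the column types) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the presenter's actual notebook: the column names and values are hypothetical, and an in-memory CSV stands in for the file on Google Drive. In Colab, one would typically mount Drive first (for example with `google.colab.drive.mount`) and pass the file's path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV stored on Google Drive; in Colab you would
# typically mount Drive first and pass the real file path to pd.read_csv.
csv_text = """Gender,Married,ApplicantIncome,LoanAmount,InsuranceCharges
Male,No,5849,,1200.5
Female,Yes,4583,128,980.0
Male,Yes,3000,66,450.25
Male,No,2583,120,610.0
"""

data = pd.read_csv(io.StringIO(csv_text))

# Preview the top three records; the blank LoanAmount is read as NaN (missing).
top_three = data.head(3)
print(top_three)

# Print each column's name, non-null count, and dtype
# (for example int64 or float64; text columns typically show as object).
data.info()
```

Note that `head()` with no argument returns five rows, so the count of three is passed explicitly here to match the "top three records" in the narration.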
66 00:05:05,440 --> 00:05:12,070 So this is part of understanding the data: determining dependent versus independent variables, 67 00:05:14,800 --> 00:05:23,890 understanding what the variables are, which variables are numerical, like integer or float, versus strings, and getting 68 00:05:24,310 --> 00:05:30,850 an idea about the type of data, the kind of data that we have, by looking at the top three records. 69 00:05:32,230 --> 00:05:36,050 So that's what you do in understanding the data, right? 70 00:05:36,460 --> 00:05:39,740 So the next step is doing univariate analysis. 71 00:05:40,650 --> 00:05:41,050 OK.
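As a recap of the first point in this section, separating the dependent variable from the independent variables might look like the sketch below. The DataFrame contents and the column name `InsuranceCharges` are assumptions made for illustration; the actual column names in the course dataset are not given in the narration.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["No", "Yes", "Yes"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [130, 128, 66],
    "InsuranceCharges": [1200.5, 980.0, 450.25],
})

# Assumed name of the column to be predicted.
target_column = "InsuranceCharges"

# Dependent variable: the value the model should predict.
y = data[target_column]

# Independent variables: every other column in the dataset.
X = data.drop(columns=[target_column])

print("Dependent variable:", y.name)
print("Independent variables:", list(X.columns))
```

Keeping the target name in one variable makes it easy to swap in the real column once the dataset's header is known.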