1 00:00:00,930 --> 00:00:03,580 So here is the data that we are going to use 2 00:00:03,750 --> 00:00:04,590 in this course. 3 00:00:06,270 --> 00:00:12,750 The problem statement here is that you are a manager in a real estate company and you want to find out 4 00:00:13,350 --> 00:00:15,180 the true value of a property. 5 00:00:16,620 --> 00:00:20,880 That is, based on the pricing of past property transactions, 6 00:00:21,360 --> 00:00:25,420 you want to get the price of the property you want to sell. 7 00:00:27,130 --> 00:00:36,100 So the price of a property will be the dependent variable in our analysis, and other factors which impact 8 00:00:36,100 --> 00:00:38,850 the price, which we identified through 9 00:00:38,980 --> 00:00:43,150 primary and secondary research, are the independent variables. 10 00:00:44,590 --> 00:00:47,770 This dataset has values for all these variables. 11 00:00:49,900 --> 00:00:51,100 Whenever we collect data, 12 00:00:51,760 --> 00:00:54,520 different parts of the data come from different sources, 13 00:00:55,000 --> 00:01:00,940 and you need to collate this data into a single tabular format such as this. 14 00:01:03,460 --> 00:01:07,780 In this table, the header tells us the variable names. 15 00:01:09,170 --> 00:01:12,290 While naming variables, try not to use spaces. 16 00:01:13,250 --> 00:01:18,920 Instead, use an underscore. For example, for crime rate, 17 00:01:18,950 --> 00:01:20,870 we have used crime_rate. 18 00:01:22,520 --> 00:01:27,410 This is because some software does not accept spaces in variable names. 19 00:01:28,850 --> 00:01:33,980 Secondly, try to keep names such that you can recognize the actual variable. 20 00:01:34,730 --> 00:01:37,970 That is, avoid putting names like x1, x2, x3, 21 00:01:38,420 --> 00:01:45,230 since, post analysis, the results from the software will also show these variable names, 22 00:01:45,650 --> 00:01:48,500 and then it will become difficult to make sense of the results.
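As a rough illustration of the naming advice above, here is a minimal Python sketch that turns free-text column labels into underscore-separated variable names. The function name `to_variable_name` and the raw labels are assumptions made for this example; they are not part of the course material.

```python
import re


def to_variable_name(label: str) -> str:
    """Turn a free-text column label into a lowercase, underscore-separated name."""
    # Replace every run of characters that is not a letter or digit with one underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", label.strip())
    # Drop underscores left over at either end, then lowercase the result.
    return cleaned.strip("_").lower()


# Hypothetical raw labels, as they might arrive from different sources.
raw_headers = ["Crime Rate", "Air Quality", " Poor-Prop "]
clean_headers = [to_variable_name(h) for h in raw_headers]
print(clean_headers)
```

Names cleaned this way contain no spaces and still say what the variable is, unlike generic labels such as x1 or x2.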
23 00:01:49,610 --> 00:01:53,330 I will show you in a couple of minutes what each variable stands for, 24 00:01:53,630 --> 00:01:58,250 but even by just looking at the names, you get some idea of what this data is. 25 00:01:59,600 --> 00:02:05,360 So this data is about the price of houses and other variables related to each house. 26 00:02:06,950 --> 00:02:14,960 We will look at these variables in detail in some time. After this top header row, each row contains 27 00:02:14,960 --> 00:02:17,180 values pertaining to one observation. 28 00:02:18,110 --> 00:02:21,350 And we have a total of 506 observations. 29 00:02:24,440 --> 00:02:27,530 The total number of columns we have is 19. 30 00:02:31,550 --> 00:02:38,300 We may have received different parts of this data from different sources. Probably one source gave 31 00:02:38,330 --> 00:02:40,730 us the data of house address and its price. 32 00:02:42,330 --> 00:02:47,910 One source gave us the data of crime rates in different neighborhoods. In such a case, 33 00:02:48,330 --> 00:02:55,260 we need to use a unique identifier which will help us match these two parts. 34 00:02:57,270 --> 00:03:05,850 So join all these different parts of data and get them into one single dataset with rows and columns, where 35 00:03:05,920 --> 00:03:10,730 columns contain each variable and rows contain each observation. 36 00:03:13,110 --> 00:03:20,460 Now your dataset is ready. But whenever you are going to share your data or your analysis with anyone, 37 00:03:21,000 --> 00:03:23,880 the data should always be accompanied by a data 38 00:03:23,930 --> 00:03:24,480 dictionary. 39 00:03:27,190 --> 00:03:31,840 A data dictionary contains information about the variables in the dataset. 40 00:03:33,910 --> 00:03:37,480 So a comprehensive data dictionary includes the following. 41 00:03:38,680 --> 00:03:41,860 It has definitions of all the variables of your dataset.
42 00:03:43,380 --> 00:03:47,420 Then it should tell about the unique identifier for each observation. 43 00:03:49,410 --> 00:03:52,020 This unique identifier is called the primary key. 44 00:03:53,900 --> 00:03:55,910 To match data from another table, 45 00:03:56,330 --> 00:03:59,570 if there is a matching key used, that should also be highlighted. 46 00:04:00,770 --> 00:04:06,050 If you want to learn more about primary and secondary keys, you can look at our video here. 47 00:04:06,410 --> 00:04:07,550 We have shared the link. 48 00:04:09,520 --> 00:04:14,590 This video is part of another course of ours on database management using SQL. 49 00:04:16,250 --> 00:04:19,610 Lastly, if there is a categorical variable in the data, 50 00:04:20,660 --> 00:04:22,790 all its values should also be explained. 51 00:04:24,820 --> 00:04:27,910 Now, let's define the variables of our house pricing 52 00:04:28,010 --> 00:04:28,490 dataset. 53 00:04:30,880 --> 00:04:38,200 On the left is the variable name as in the dataset; on the right is a brief description of what 54 00:04:38,200 --> 00:04:39,350 that variable is. 55 00:04:41,390 --> 00:04:43,750 I'm going to read these one by one 56 00:04:43,870 --> 00:04:44,100 now. 57 00:04:45,700 --> 00:04:54,170 So price tells us the value of the house. Crime rate tells us the crime rate in that neighborhood. Resid 58 00:04:54,220 --> 00:04:58,090 area stands for the proportion of residential area in the town. 59 00:04:59,140 --> 00:05:02,680 Air quality gives us the index of quality of air in that neighborhood. 60 00:05:04,150 --> 00:05:08,110 Room number is the average number of rooms in houses of that locality. 61 00:05:09,560 --> 00:05:13,260 Age is how old the house construction is in years, 62 00:05:13,350 --> 00:05:16,140 that is, how many years ago it was constructed.
63 00:05:17,850 --> 00:05:25,470 Dist1, dist2, dist3, and dist4 are four distances of that house from the nearest employment 64 00:05:25,470 --> 00:05:25,870 hubs, 65 00:05:27,130 --> 00:05:28,990 because we hypothesize that 66 00:05:30,130 --> 00:05:38,860 the price of homes will depend on the employment opportunities available nearby. Teachers stands for the 67 00:05:38,860 --> 00:05:42,610 number of teachers per thousand population in that town. 68 00:05:46,390 --> 00:05:47,980 Then we have poor prop. 69 00:05:48,290 --> 00:05:51,250 This is the proportion of poor population in the town. 70 00:05:53,230 --> 00:05:55,290 Airport stands for 71 00:05:56,050 --> 00:05:58,120 whether there is any airport in the city or not. 72 00:05:58,270 --> 00:06:00,730 So this is a categorical variable. It has 73 00:06:00,730 --> 00:06:01,500 two values: 74 00:06:01,630 --> 00:06:02,430 yes and no. 75 00:06:02,860 --> 00:06:04,220 These are self-explanatory. 76 00:06:04,360 --> 00:06:04,930 That is, yes 77 00:06:04,930 --> 00:06:06,910 means that there is an airport in the city; 78 00:06:06,970 --> 00:06:08,860 no means there is no airport in the city. 79 00:06:09,860 --> 00:06:14,050 And n hos beds is giving us the number of hospital beds 80 00:06:14,530 --> 00:06:20,300 per thousand population in the town, and n hot rooms is giving us the number of hotel rooms 81 00:06:20,350 --> 00:06:27,430 per thousand population in the town. Waterbody is another categorical variable. It has values like 82 00:06:27,520 --> 00:06:35,590 lake, river, both, or none, which tell us whether there is a natural freshwater source within the 83 00:06:35,590 --> 00:06:36,250 city or not. 84 00:06:38,350 --> 00:06:41,800 Rainfall is giving us the average rainfall in centimeters. 85 00:06:43,880 --> 00:06:46,470 Bus ter is a categorical variable 86 00:06:46,590 --> 00:06:49,620 which is telling us whether there is a bus terminal in the city. 87 00:06:50,190 --> 00:06:50,570 Yes
88 00:06:50,580 --> 00:06:54,210 means there is; no means there is no bus terminal in the city. 89 00:06:55,260 --> 00:07:02,150 And the parks variable is giving us the proportion of land assigned as green land in that town. 90 00:07:04,240 --> 00:07:10,900 So with the CSV file and the PDF of the PPT attached in the resource section of this lecture, 91 00:07:12,010 --> 00:07:13,360 you have the data 92 00:07:13,420 --> 00:07:15,520 we will be using for our analysis. 93 00:07:17,480 --> 00:07:23,460 Note that this data is not real data and is similar to the Boston housing dataset commonly used for 94 00:07:23,480 --> 00:07:24,140 analysis. 95 00:07:25,570 --> 00:07:28,570 So do not take the results of our analysis too seriously. 96 00:07:28,840 --> 00:07:30,280 This is not actual data.
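The earlier steps of matching two data sources on a unique identifier and then documenting every variable in a data dictionary can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `house_id` key, the `left_join` helper, and all the values are hypothetical and do not come from the course dataset; only the names `price` and `crime_rate` and their descriptions follow the lecture.

```python
def left_join(left, right, key):
    """Attach the columns of `right` to every row of `left` with the same key value."""
    # Index the right-hand rows by key; this assumes the key is unique in `right`.
    lookup = {row[key]: row for row in right}
    # Collect the extra columns so unmatched rows still get them, filled with None.
    extra_columns = [col for col in right[0] if col != key] if right else []
    joined = []
    for row in left:
        match = lookup.get(row[key])
        merged = dict(row)
        for col in extra_columns:
            merged[col] = match[col] if match is not None else None
        joined.append(merged)
    return joined


# Hypothetical source 1: house identifier and price (made-up values).
prices = [
    {"house_id": 1, "price": 24.0},
    {"house_id": 2, "price": 21.6},
    {"house_id": 3, "price": 34.7},
]
# Hypothetical source 2: crime rate per house identifier (made-up values).
crime = [
    {"house_id": 1, "crime_rate": 0.006},
    {"house_id": 2, "crime_rate": 0.027},
]

dataset = left_join(prices, crime, "house_id")

# A small data dictionary: one definition per variable, with the primary key flagged.
data_dictionary = {
    "house_id": "Unique identifier of each observation (primary key)",
    "price": "Value of the house",
    "crime_rate": "Crime rate in that neighborhood",
}

# Every column of the joined dataset should have a definition.
undocumented = [col for col in dataset[0] if col not in data_dictionary]
print(dataset)
print("Undocumented columns:", undocumented)
```

Rows without a match in the second source keep a `None` in the added column, which makes gaps in the collated data easy to spot before analysis.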