1 00:00:00,870 --> 00:00:04,590 So here is the data that we are going to use in this course. 2 00:00:06,240 --> 00:00:12,750 The problem statement here is that you are a manager in a real estate company and you want to find out 3 00:00:13,290 --> 00:00:22,110 the true value of a property. That is, based on the pricing of past property transactions, you want to 4 00:00:22,470 --> 00:00:25,410 get the price of the property you want to sell. 5 00:00:27,070 --> 00:00:36,100 So the price of a property will be the dependent variable in our analysis, and other factors which impact 6 00:00:36,100 --> 00:00:43,100 the price, which we identified with primary and secondary research, are the independent variables. 7 00:00:44,530 --> 00:00:51,140 This dataset has values for all these variables. Whenever we collect data, 8 00:00:51,730 --> 00:00:54,490 different parts of the data come from different sources, 9 00:00:54,890 --> 00:01:00,910 and you need to collect this data into a single tabular format such as this. 10 00:01:03,340 --> 00:01:07,750 In this table, the top row tells us the variable names. 11 00:01:09,110 --> 00:01:18,680 When naming variables, try not to use spaces; use an underscore instead. For example, for crime 12 00:01:18,680 --> 00:01:20,870 rate, we have used crime_rate. 13 00:01:22,460 --> 00:01:27,410 This is because some software does not accept spaces in variable names. 14 00:01:28,820 --> 00:01:36,350 Secondly, try to keep names such that you can recognize the actual variable, that is, avoid names 15 00:01:36,350 --> 00:01:37,910 like X1, X2, X3. 16 00:01:38,360 --> 00:01:46,070 This is because, after the analysis, the results from the software will also show these variable names, and with such names 17 00:01:46,070 --> 00:01:48,560 it will become difficult to make sense of the result.
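The naming advice above can be illustrated with a small sketch. It is not part of the lecture material: the helper name `clean_name` and the sample labels are assumptions made for illustration, and the rule it applies (lowercase, with underscores in place of spaces and punctuation) is just one reasonable convention.

```python
import re


def clean_name(raw: str) -> str:
    """Turn a free-form column label into a space-free variable name."""
    name = raw.strip().lower()
    # Any run of characters other than letters and digits becomes one underscore.
    name = re.sub(r"[^0-9a-z]+", "_", name)
    # Drop underscores left over at either end.
    return name.strip("_")


# Hypothetical labels as they might arrive from different sources.
labels = ["Crime Rate", "Air Quality ", "Room-Number"]
cleaned = [clean_name(label) for label in labels]
print(cleaned)
```

Keeping the words of the original label in the cleaned name is what makes the output of later analysis readable, which is the point made about avoiding names like X1, X2, X3.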
18 00:01:49,490 --> 00:01:55,310 I will show you in a couple of minutes what each variable stands for, but even by just looking at the 19 00:01:55,310 --> 00:01:58,230 names, you get some idea of what this data is. 20 00:01:59,540 --> 00:02:05,300 So this data is about the price of houses and other variables related to each house. 21 00:02:06,850 --> 00:02:08,660 We will look at these variables in detail 22 00:02:08,660 --> 00:02:18,440 in some time. After this top header row, each row contains values pertaining to one observation, and we 23 00:02:18,440 --> 00:02:21,320 have a total of 506 observations. 24 00:02:24,380 --> 00:02:27,500 The total number of columns we have is 19. 25 00:02:31,490 --> 00:02:38,330 We may have received different parts of this data from different sources. Perhaps one source gave 26 00:02:38,330 --> 00:02:40,700 us data on home addresses and prices. 27 00:02:42,270 --> 00:02:49,740 Another source gave us the crime rates in different neighborhoods. In such a case, we need to use 28 00:02:49,740 --> 00:02:55,200 a unique identifier which will help us match these two parts. 29 00:02:57,210 --> 00:03:05,610 So join all these different parts of data and get them into one single dataset with rows and columns, 30 00:03:05,610 --> 00:03:10,590 where each column contains one variable and each row contains one observation. 31 00:03:13,050 --> 00:03:20,430 Now, your dataset is ready, but whenever you are going to share your data or your analysis with anyone, 32 00:03:20,940 --> 00:03:24,450 the data should always be accompanied by a data dictionary. 33 00:03:27,070 --> 00:03:31,870 A data dictionary contains information about the variables in the dataset. 34 00:03:33,850 --> 00:03:37,450 A comprehensive data dictionary includes the following. 35 00:03:38,590 --> 00:03:41,830 It has definitions of all the variables of your dataset. 36 00:03:43,320 --> 00:03:47,430 Then it should tell about the unique identifier for each observation.
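The matching step described a few cues earlier, where parts of the data from two sources are combined through a unique identifier, might look roughly like the sketch below. All of the records, the `neighborhood_id` key, and the `join_on_key` helper are made up for illustration; the lecture does not say which identifier or tool was actually used.

```python
# Hypothetical source 1: one record per house, with an invented neighborhood_id.
prices = [
    {"house_id": "H1", "neighborhood_id": "N1", "price": 24.0},
    {"house_id": "H2", "neighborhood_id": "N2", "price": 21.6},
    {"house_id": "H3", "neighborhood_id": "N9", "price": 30.1},
]

# Hypothetical source 2: one record per neighborhood.
crime = [
    {"neighborhood_id": "N1", "crime_rate": 0.006},
    {"neighborhood_id": "N2", "crime_rate": 0.027},
]


def join_on_key(left, right, key):
    """Inner join: keep only left rows whose key value also appears in right."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two records into one row of the combined table.
            joined.append({**row, **match})
    return joined


dataset = join_on_key(prices, crime, "neighborhood_id")
for row in dataset:
    print(row)
```

House H3 is dropped here because its neighborhood has no crime record. Whether unmatched rows should be dropped or kept with missing values is a design choice that depends on the analysis.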
37 00:03:49,380 --> 00:03:51,930 This unique identifier is called the primary key. 38 00:03:53,840 --> 00:03:59,530 To match data from another table, if there is a matching key used, that should also be highlighted. 39 00:04:00,680 --> 00:04:06,000 If you want to learn more about primary and secondary keys, you can look at our video here. 40 00:04:06,320 --> 00:04:07,530 We have shared the link. 41 00:04:09,490 --> 00:04:14,560 This video is part of our other course on database management using SQL. 42 00:04:16,160 --> 00:04:19,610 Lastly, if there is a categorical variable in the data, 43 00:04:20,540 --> 00:04:22,820 all its values should also be explained. 44 00:04:24,740 --> 00:04:28,490 Now, let's define the variables of our house data set. 45 00:04:30,790 --> 00:04:38,230 On the left is the variable name, as in the data set; on the right is a brief description of what 46 00:04:38,230 --> 00:04:39,340 that variable is. 47 00:04:41,300 --> 00:04:44,060 I'm going to read these one by one now. 48 00:04:45,640 --> 00:04:54,180 So price tells us the value of the house. crime_rate gives us the crime rate in that neighborhood. resid_area 49 00:04:54,190 --> 00:04:58,040 stands for the proportion of residential area in the town. 50 00:04:59,020 --> 00:05:02,670 Air quality gives us the index of quality of air in that neighborhood. 51 00:05:04,090 --> 00:05:08,080 Room number is the average number of rooms in houses of that locality. 52 00:05:09,570 --> 00:05:16,110 Age is how old that house construction is in years, that is, how many years ago it was constructed. 53 00:05:17,790 --> 00:05:25,500 dist1, dist2, dist3 and dist4 are four distances of that house from the nearest employment 54 00:05:25,500 --> 00:05:25,860 hub. 55 00:05:27,040 --> 00:05:28,960 Because we hypothesize that
56 00:05:30,070 --> 00:05:38,860 The price of a house will depend on the employment opportunities available nearby. Teachers stands for the 57 00:05:38,860 --> 00:05:42,580 number of teachers per 1000 population in that town. 58 00:05:46,330 --> 00:05:47,950 Then we have poor_prop. 59 00:05:48,220 --> 00:05:57,400 This is the proportion of poor population in the town. Airport tells us whether there is any airport in the 60 00:05:57,400 --> 00:05:58,180 city or not. 61 00:05:58,210 --> 00:06:00,090 So this is a categorical variable. 62 00:06:00,370 --> 00:06:01,570 It has two values, 63 00:06:01,570 --> 00:06:02,370 Yes and No. 64 00:06:02,800 --> 00:06:04,260 These are self-explanatory. 65 00:06:04,300 --> 00:06:06,970 Yes means that there is an airport in the city. 66 00:06:06,970 --> 00:06:08,830 No means there is no airport in the city. 67 00:06:10,300 --> 00:06:14,050 And n_hos_beds is giving us the number of hospital beds 68 00:06:14,500 --> 00:06:20,320 per 1000 population in the town, and n_hot_rooms is giving us the number of hotel rooms 69 00:06:20,320 --> 00:06:25,740 per 1000 population in the town. Waterbody is another categorical variable. 70 00:06:26,230 --> 00:06:34,450 It has values like Lake, River, Both or None, which is giving us whether there is a natural freshwater 71 00:06:34,450 --> 00:06:36,250 source within the city or not. 72 00:06:38,230 --> 00:06:41,800 Rainfall is giving us the average rainfall in centimeters. 73 00:06:43,860 --> 00:06:49,550 bus_ter is a categorical variable, which is telling us whether there is a bus terminal in the city. 74 00:06:50,100 --> 00:06:54,150 Yes means there is, and No means there is no bus terminal in the city. 75 00:06:55,140 --> 00:07:02,160 And the parks variable is giving us the proportion of land assigned as green land in that town. 76 00:07:04,210 --> 00:07:10,870 So with the CSV file and the PDF attached in the resource section of this lecture,
77 00:07:11,950 --> 00:07:15,520 You have with you the data we will be using for our analysis. 78 00:07:17,390 --> 00:07:23,270 Note that this data is not real data; it is similar to the Boston housing data set commonly used 79 00:07:23,270 --> 00:07:24,110 for analysis. 80 00:07:25,480 --> 00:07:28,600 So do not take the results of our analysis too seriously. 81 00:07:28,870 --> 00:07:30,340 This is not actually the.
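As a companion to the data dictionary described in this lecture, here is a minimal sketch of how a few of its entries could be kept in machine-readable form and used to check that every categorical value in the data is explained. The structure, the `undocumented_values` helper, and the sample rows are assumptions made for illustration. Only four variables are shown, and apart from crime_rate the exact column spellings in the file are assumed.

```python
# A partial, illustrative data dictionary. "values" lists the allowed levels
# of a categorical variable and is None for numeric variables.
DATA_DICTIONARY = {
    "price": {"description": "Value of the house", "values": None},
    "crime_rate": {"description": "Crime rate in that neighborhood", "values": None},
    "airport": {
        "description": "Whether there is an airport in the city",
        "values": {"Yes", "No"},
    },
    "waterbody": {
        "description": "Natural freshwater source within the city",
        "values": {"Lake", "River", "Both", "None"},
    },
}


def undocumented_values(rows, dictionary):
    """Return (column, value) pairs that the data dictionary does not explain."""
    problems = []
    for row in rows:
        for column, value in row.items():
            entry = dictionary.get(column)
            if entry is None:
                # The variable has no definition at all.
                problems.append((column, value))
            elif entry["values"] is not None and value not in entry["values"]:
                # The categorical level is not listed in the dictionary.
                problems.append((column, value))
    return problems


# Made-up rows; the second one carries a level the dictionary does not define.
rows = [
    {"price": 24.0, "airport": "Yes", "waterbody": "River"},
    {"price": 21.6, "airport": "Maybe", "waterbody": "Lake"},
]
issues = undocumented_values(rows, DATA_DICTIONARY)
print(issues)
```

A check like this is one way to confirm that the dictionary shared with the data really covers all values of each categorical variable.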