1 00:00:00,790 --> 00:00:08,080 So here is the data that we are going to use in this course. The problem statement here is that you 2 00:00:08,080 --> 00:00:10,180 are a manager in a real estate company. 3 00:00:11,200 --> 00:00:18,970 You want to find out the selling potential of a property. This is basically data of past property transactions. 4 00:00:19,660 --> 00:00:24,250 You want to predict whether the property will be sold within three months or not. 5 00:00:26,680 --> 00:00:34,840 So the Sold variable here, which is the last variable of this data, will be the dependent variable 6 00:00:34,900 --> 00:00:35,770 in our analysis. 7 00:00:37,150 --> 00:00:43,360 And the other factors which impact this Sold variable, which we identified on the basis of primary and secondary 8 00:00:43,360 --> 00:00:46,150 research, are the independent variables. 9 00:00:48,390 --> 00:00:51,810 This dataset has values for all these variables. 10 00:00:52,800 --> 00:00:54,090 Whenever we collect data, 11 00:00:54,420 --> 00:00:57,120 different parts of the data come from different sources. 12 00:00:57,960 --> 00:01:02,070 And we need to collate this data into a single tabular format, 13 00:01:02,360 --> 00:01:02,990 such as this. 14 00:01:05,190 --> 00:01:09,090 In this table, the header row is just the variable names. 15 00:01:11,210 --> 00:01:16,280 While naming the variables, try not to use spaces; use underscores instead. 16 00:01:17,840 --> 00:01:24,800 For example, in the air quality variable we have used an underscore. This is because some software does not 17 00:01:24,800 --> 00:01:25,760 accept spaces. 18 00:01:28,420 --> 00:01:35,680 Secondly, try to keep names so that you can recognize the actual variable. That is, avoid 19 00:01:35,680 --> 00:01:37,930 putting names like X1, X2, X3. 20 00:01:39,610 --> 00:01:45,550 After the analysis, the results from the software will also be showing these variable names. 
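The naming advice above can be sketched in a few lines of Python. This is only an illustration, not the course's own code: the helper `clean_name` and the raw header strings are hypothetical, and the lowercase style is an assumption.

```python
import re


def clean_name(raw: str) -> str:
    """Turn a raw column header into a lowercase, underscore-separated name."""
    name = raw.strip().lower()
    # Replace every run of characters that is not a letter or digit with "_".
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop underscores left over at the ends, e.g. from "Price ($)".
    return name.strip("_")


# Hypothetical raw headers, as they might arrive from different sources.
raw_headers = ["Air Quality", "Bus Terminal", " Price "]
clean_headers = [clean_name(h) for h in raw_headers]
print(clean_headers)  # ['air_quality', 'bus_terminal', 'price']
```

The cleaned names stay recognizable, unlike X1, X2, X3, and contain no spaces.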
21 00:01:45,760 --> 00:01:48,290 Otherwise, it will become difficult to make sense of the results. 22 00:01:50,990 --> 00:01:54,540 I will show you in a couple of minutes what each variable stands for. 23 00:01:54,950 --> 00:02:00,290 But even by just looking at the names, you'll get some idea of what this data is. 24 00:02:02,810 --> 00:02:08,540 So this data is about house transactions and other variables related to each house. 25 00:02:10,580 --> 00:02:16,590 We will look at these variables in detail in some time. After this top header row, 26 00:02:17,090 --> 00:02:20,900 each row contains values pertaining to one single observation. 27 00:02:22,640 --> 00:02:25,820 And we have a total of 506 observations. 28 00:02:29,050 --> 00:02:31,720 The total number of columns we have is 19. 29 00:02:33,370 --> 00:02:39,340 We may have received different parts of this data from different sources. Probably one source gave 30 00:02:39,400 --> 00:02:42,100 us data of the house address and its price. 31 00:02:43,330 --> 00:02:49,470 Another source gave data of air quality in different neighborhoods. In that case, 32 00:02:49,830 --> 00:02:54,570 we need to use a unique identifier which will help us match these two parts. 33 00:02:56,100 --> 00:03:04,170 So join all these different parts of the data and get the data into one single dataset with rows and 34 00:03:04,170 --> 00:03:09,780 columns, where the columns contain each variable and the rows contain each observation. 35 00:03:13,370 --> 00:03:18,500 Now your dataset is ready. But whenever you are going to share your data or your analysis with 36 00:03:18,530 --> 00:03:22,430 anyone, the data should be accompanied by a data dictionary. 37 00:03:25,650 --> 00:03:29,100 A data dictionary contains information about the variables 38 00:03:29,140 --> 00:03:29,990 in the dataset. 39 00:03:31,950 --> 00:03:35,190 So a comprehensive data dictionary includes the following. 
40 00:03:36,480 --> 00:03:39,720 It has definitions of all the variables of your dataset. 41 00:03:41,350 --> 00:03:45,520 Then it should tell about the unique identifier for each observation. 42 00:03:47,230 --> 00:03:52,990 This unique identifier is called the primary key. To match the data from another table, 43 00:03:53,500 --> 00:03:56,540 if there is a foreign key used, that should also be highlighted. 44 00:03:58,010 --> 00:04:03,650 If you want to learn more about primary and foreign keys, you can look at our video here. 45 00:04:04,250 --> 00:04:05,870 We have shared the link in this presentation. 46 00:04:07,070 --> 00:04:11,500 This video is part of another course on database management using SQL. 47 00:04:13,990 --> 00:04:20,900 Lastly, if there is a categorical variable in the data, all its values should also be explained. 48 00:04:23,550 --> 00:04:26,730 Now, let's define the variables of our house pricing dataset. 49 00:04:31,550 --> 00:04:40,580 On the left is the variable name as in the dataset. On the right is a brief description of what that 50 00:04:40,580 --> 00:04:41,300 variable is. 51 00:04:43,190 --> 00:04:44,570 I'm going to read these one by one 52 00:04:44,600 --> 00:04:44,930 now. 53 00:04:47,350 --> 00:04:51,330 So price tells us the asking price of the property by the owner. 54 00:04:53,490 --> 00:04:56,820 That is, how much the owner is willing to sell the property for. 55 00:04:59,810 --> 00:05:00,800 Resid_area stands for 56 00:05:00,890 --> 00:05:02,900 the proportion of residential area in the town. 57 00:05:04,200 --> 00:05:08,120 Air_qual stands for the index of quality of air in that neighborhood. 58 00:05:10,020 --> 00:05:14,140 Room_num is the average number of rooms in the houses of that locality. 59 00:05:16,460 --> 00:05:21,650 Age is how old that house construction is, that is, how many years ago 60 00:05:21,740 --> 00:05:22,660 it was constructed. 
61 00:05:24,870 --> 00:05:31,740 Dist1, 2, 3, 4 are four distances of that house from the nearest employment hubs. 62 00:05:34,480 --> 00:05:40,120 This is because we hypothesized that the ease of selling the house will depend on the employment opportunities 63 00:05:40,210 --> 00:05:41,110 available near it. 64 00:05:43,310 --> 00:05:47,720 Teachers stands for the number of teachers per thousand population in that town. 65 00:05:53,130 --> 00:05:54,570 Then we have poor_prop. 66 00:05:55,860 --> 00:05:58,650 That is the proportion of poor population in the town. 67 00:06:00,410 --> 00:06:01,700 Airport stands for 68 00:06:02,390 --> 00:06:04,310 whether there is an airport in the city or not. 69 00:06:04,880 --> 00:06:06,620 So this is a categorical variable. 70 00:06:07,370 --> 00:06:08,470 It has two values, 71 00:06:08,600 --> 00:06:09,290 Yes and No. 72 00:06:09,680 --> 00:06:11,200 These are self-explanatory. 73 00:06:12,560 --> 00:06:12,850 Yes 74 00:06:12,860 --> 00:06:14,930 means that there is an airport in the city. 75 00:06:15,170 --> 00:06:17,090 No means there is no airport in the city. 76 00:06:19,550 --> 00:06:22,970 And hos_beds is giving us the number of hospital beds 77 00:06:23,160 --> 00:06:25,100 per thousand population in the town. 78 00:06:27,240 --> 00:06:31,780 And hot_rooms gives the number of hotel rooms per thousand population in the town. 79 00:06:34,000 --> 00:06:36,300 Waterbody is another categorical variable. 80 00:06:37,170 --> 00:06:40,740 It has values like Lake, River, Both, or None, 81 00:06:42,230 --> 00:06:46,670 which is telling us whether there is a natural freshwater source within the city or not. 82 00:06:49,420 --> 00:06:52,710 Rainfall is giving us the average rainfall in centimeters. 83 00:06:54,890 --> 00:07:00,820 Bus_ter is a categorical variable which is telling us whether there is a bus terminal in the city. 84 00:07:03,200 --> 00:07:07,190 Yes means that there is, and No means there is no bus terminal in the city. 
85 00:07:09,890 --> 00:07:15,660 And the parks variable is giving us the proportion of land assigned as green land in that town. 86 00:07:17,630 --> 00:07:22,910 And lastly, we have the Sold variable, which is our dependent variable, which is telling us whether 87 00:07:23,300 --> 00:07:27,720 that particular property was sold within three months of getting listed or not. 88 00:07:28,880 --> 00:07:36,350 It will be having the values zero and one, one meaning that the house was sold within three months 89 00:07:36,860 --> 00:07:41,700 and zero meaning that the house was not sold within three months. 90 00:07:44,380 --> 00:07:44,920 So that's the data. 91 00:07:44,950 --> 00:07:49,990 You will find the PDF of the PPT attached in the resource section of this lecture. 92 00:07:51,130 --> 00:07:53,890 You will also have the data we'll be using for analysis. 93 00:07:55,380 --> 00:08:01,050 Note that this data is not real data, and it's similar to the Boston housing dataset commonly used for 94 00:08:01,070 --> 00:08:01,650 analysis. 95 00:08:02,880 --> 00:08:05,780 So do not take the results of our analysis too seriously. 96 00:08:06,020 --> 00:08:07,310 This is not an actual dataset. 97 00:08:09,130 --> 00:08:13,560 In the next video, we will learn how to import this dataset into our software package.
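To make the ideas from this lecture concrete, here is a minimal Python sketch of two steps: matching two sources on a unique identifier, and checking the combined rows against a data dictionary. Everything in it is an assumption made for illustration: the identifier name `house_id`, the helper `find_problems`, the exact column spellings, and all the values, which are made up and not taken from the course dataset.

```python
# Two hypothetical sources keyed by an assumed identifier, house_id.
# All values below are made up for illustration.
prices = {1: {"price": 30.0, "airport": "Yes"}, 2: {"price": 22.5, "airport": "No"}}
air_quality = {1: {"air_qual": 0.5}, 2: {"air_qual": 0.4}}

# Match the two parts on the shared key: one combined row per observation.
shared_ids = sorted(prices.keys() & air_quality.keys())
rows = [{"house_id": i, **prices[i], **air_quality[i]} for i in shared_ids]

# Data dictionary: a definition for each variable, plus the allowed
# values of each categorical variable.
data_dictionary = {
    "house_id": {"description": "Unique identifier (primary key)"},
    "price": {"description": "Asking price of the property"},
    "air_qual": {"description": "Index of air quality in the neighborhood"},
    "airport": {
        "description": "Whether there is an airport in the city",
        "values": {"Yes", "No"},
    },
}


def find_problems(rows, dictionary):
    """List (row_index, column, reason) for undocumented columns or values."""
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            entry = dictionary.get(column)
            if entry is None:
                problems.append((index, column, "not in data dictionary"))
            elif "values" in entry and value not in entry["values"]:
                problems.append((index, column, f"unexpected value {value!r}"))
    return problems


# An empty list means every column and every category is documented.
print(find_problems(rows, data_dictionary))  # []
```

A check like this is one way to notice, before sharing the data, that a categorical variable contains a value the data dictionary does not explain.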