1
00:00:00,930 --> 00:00:03,580
So here is the data that we are going to use

2
00:00:03,750 --> 00:00:04,590
in this course.

3
00:00:06,270 --> 00:00:12,750
The problem statement here is that you are a manager in a real estate company and you want to find out

4
00:00:13,350 --> 00:00:15,180
the true value of a property.

5
00:00:16,620 --> 00:00:20,880
That is, based on the pricing of past property transactions,

6
00:00:21,360 --> 00:00:25,420
you want to get the price of the property you want to sell.

7
00:00:27,130 --> 00:00:36,100
So the price of a property will be the dependent variable in our analysis, and other factors which impact

8
00:00:36,100 --> 00:00:38,850
the price, which we identified through

9
00:00:38,980 --> 00:00:43,150
primary and secondary research, are the independent variables.

10
00:00:44,590 --> 00:00:47,770
This dataset has values for all these variables.

11
00:00:49,900 --> 00:00:51,100
Whenever we collect data,

12
00:00:51,760 --> 00:00:54,520
different parts of the data come from different sources.

13
00:00:55,000 --> 00:01:00,940
And you need to collate this data into a single tabular format such as this.

14
00:01:03,460 --> 00:01:07,780
In this table, the header row tells us the variable names.

15
00:01:09,170 --> 00:01:12,290
While naming variables, try not to use spaces.

16
00:01:13,250 --> 00:01:18,920
Instead, use an underscore. For example, for crime rate,

17
00:01:18,950 --> 00:01:20,870
we have used crime_rate.

18
00:01:22,520 --> 00:01:27,410
This is because some software does not accept spaces in variable names.

19
00:01:28,850 --> 00:01:33,980
Secondly, try to keep names such that you can recognize the actual variable.

20
00:01:34,730 --> 00:01:37,970
That is, avoid names like x1, x2, x3,

21
00:01:38,420 --> 00:01:45,230
since, post analysis, the results from the software will also show these variable names,

22
00:01:45,650 --> 00:01:48,500
and then it will become difficult to make sense of them.

23
00:01:49,610 --> 00:01:53,330
I will show you in a couple of minutes what each variable stands for.

24
00:01:53,630 --> 00:01:58,250
But even by just looking at the names, you get some idea of what this data is.

25
00:01:59,600 --> 00:02:05,360
So this data is about the price of houses and other variables related to each house.

26
00:02:06,950 --> 00:02:14,960
We will look at these variables in detail in some time. After this top header row, each row contains

27
00:02:14,960 --> 00:02:17,180
values pertaining to one observation.

28
00:02:18,110 --> 00:02:21,350
And we have a total of 506 observations.

29
00:02:24,440 --> 00:02:27,530
The total number of columns we have is 19.

30
00:02:31,550 --> 00:02:38,300
We may have received different parts of this data from different sources. Probably one source gave

31
00:02:38,330 --> 00:02:40,730
us data on each house's address and its price.

32
00:02:42,330 --> 00:02:47,910
Another source gave us the crime rates in different neighborhoods. In such a case,

33
00:02:48,330 --> 00:02:55,260
we need to use a unique identifier which will help us match these two parts.

34
00:02:57,270 --> 00:03:05,850
So join all these different parts of the data and get them into one single dataset with rows and columns, such that

35
00:03:05,920 --> 00:03:10,730
columns contain each variable and rows contain each observation.

36
00:03:13,110 --> 00:03:20,460
Now your dataset is ready. But whenever you are going to share your data or your analysis with anyone,

37
00:03:21,000 --> 00:03:23,880
the data should always be accompanied by a data

38
00:03:23,930 --> 00:03:24,480
dictionary.

39
00:03:27,190 --> 00:03:31,840
A data dictionary contains information about the variables in the dataset.

40
00:03:33,910 --> 00:03:37,480
A comprehensive data dictionary includes the following.

41
00:03:38,680 --> 00:03:41,860
It has definitions of all the variables of your dataset.

42
00:03:43,380 --> 00:03:47,420
Then it should specify the unique identifier for each observation.

43
00:03:49,410 --> 00:03:52,020
This unique identifier is called the primary key.

44
00:03:53,900 --> 00:03:55,910
To match data from another table,

45
00:03:56,330 --> 00:03:59,570
if there is a matching key used, that should also be highlighted.

46
00:04:00,770 --> 00:04:06,050
If you want to learn more about primary and secondary keys, you can look at our video here.

47
00:04:06,410 --> 00:04:07,550
We have shared the link here.

48
00:04:09,520 --> 00:04:14,590
This video is part of another course of ours on database management using SQL.

49
00:04:16,250 --> 00:04:19,610
Lastly, if there is a categorical variable in the data,

50
00:04:20,660 --> 00:04:22,790
all its values should also be explained.

51
00:04:24,820 --> 00:04:27,910
Now, let's define the variables of our house pricing

52
00:04:28,010 --> 00:04:28,490
dataset.

53
00:04:30,880 --> 00:04:38,200
On the left is the variable name as in the dataset; on the right is a brief description of what

54
00:04:38,200 --> 00:04:39,350
that variable is.

55
00:04:41,390 --> 00:04:43,750
I'm going to read these one by one

56
00:04:43,870 --> 00:04:44,100
now.

57
00:04:45,700 --> 00:04:54,170
So price tells us the value of the house. Crime rate tells us the crime rate in that neighborhood. Resid

58
00:04:54,220 --> 00:04:58,090
area stands for the proportion of residential area in the town.

59
00:04:59,140 --> 00:05:02,680
Air quality gives us the index of quality of air in that neighborhood.

60
00:05:04,150 --> 00:05:08,110
Room number is the average number of rooms in houses of that locality.

61
00:05:09,560 --> 00:05:13,260
Age is how old that house's construction is in years,

62
00:05:13,350 --> 00:05:16,140
that is, how many years ago it was constructed.

63
00:05:17,850 --> 00:05:25,470
Dist1, dist2, dist3, and dist4 are four distances of that house from the nearest employment

64
00:05:25,470 --> 00:05:25,870
hub.

65
00:05:27,130 --> 00:05:28,990
Because we hypothesize that

66
00:05:30,130 --> 00:05:38,860
the price of homes will depend on the employment opportunities available nearby. Teachers stands for the

67
00:05:38,860 --> 00:05:42,610
number of teachers per thousand population in that town.

68
00:05:46,390 --> 00:05:47,980
Then we have poor prop.

69
00:05:48,290 --> 00:05:51,250
This is the proportion of poor population in the town.

70
00:05:53,230 --> 00:05:55,290
Airport stands for

71
00:05:56,050 --> 00:05:58,120
whether there is any airport in the city or not.

72
00:05:58,270 --> 00:06:00,730
So this is a categorical variable. It has

73
00:06:00,730 --> 00:06:01,500
two values,

74
00:06:01,630 --> 00:06:02,430
yes and no.

75
00:06:02,860 --> 00:06:04,220
These are self-explanatory:

76
00:06:04,360 --> 00:06:04,930
yes

77
00:06:04,930 --> 00:06:06,910
means that there is an airport in the city;

78
00:06:06,970 --> 00:06:08,860
no means there is no airport in the city.

79
00:06:09,860 --> 00:06:14,050
And hos beds is giving us the number of hospital beds

80
00:06:14,530 --> 00:06:20,300
per thousand population in the town, and hot rooms is giving us the number of hotel rooms

81
00:06:20,350 --> 00:06:27,430
per thousand population in the town. Waterbody is another categorical variable. It has values like

82
00:06:27,520 --> 00:06:35,590
lake, river, both, or none, which tell us whether there is a natural freshwater source within the

83
00:06:35,590 --> 00:06:36,250
city or not.

84
00:06:38,350 --> 00:06:41,800
Rainfall is giving us the average rainfall in centimeters.

85
00:06:43,880 --> 00:06:46,470
Bus ter is a categorical variable

86
00:06:46,590 --> 00:06:49,620
which is telling us whether there is a bus terminal in the city.

87
00:06:50,190 --> 00:06:50,570
Yes

88
00:06:50,580 --> 00:06:54,210
means there is; no means there is no bus terminal in the city.

89
00:06:55,260 --> 00:07:02,150
And the parks variable is giving us the proportion of land assigned as green land in that town.

90
00:07:04,240 --> 00:07:10,900
So with the CSV file and the PDF of the data dictionary in the resource section of this lecture,

91
00:07:12,010 --> 00:07:13,360
you have the data

92
00:07:13,420 --> 00:07:15,520
we will be using for our analysis.

93
00:07:17,480 --> 00:07:23,460
Note that this data is not real data and is similar to the Boston housing dataset commonly used for

94
00:07:23,480 --> 00:07:24,140
analysis.

95
00:07:25,570 --> 00:07:28,570
So do not take the results of our analysis too seriously.

96
00:07:28,840 --> 00:07:30,280
This is not actual data.