1
00:00:03,210 --> 00:00:06,980
Now we have imported the data, and our data is ready.

2
00:00:07,860 --> 00:00:10,300
So let's run EDD on this data.

3
00:00:11,630 --> 00:00:12,450
To run EDD,

4
00:00:13,330 --> 00:00:15,130
just write your variable name,

5
00:00:15,160 --> 00:00:17,620
that is, df.describe().

6
00:00:25,600 --> 00:00:26,830
This is our EDD.

7
00:00:27,280 --> 00:00:29,860
So let's go through all the variables

8
00:00:29,930 --> 00:00:30,610
one by one.

9
00:00:32,340 --> 00:00:37,960
As we calculated using shape, the total number of records is 506.

10
00:00:38,580 --> 00:00:39,990
So for the price variable,

11
00:00:40,120 --> 00:00:42,190
our total count is 506,

12
00:00:42,780 --> 00:00:46,800
and the mean house price is 22.52.

13
00:00:47,940 --> 00:00:50,450
We also have the standard deviation of this variable.

14
00:00:50,750 --> 00:00:57,000
The minimum value is 5 and the maximum value is 50.

15
00:00:57,420 --> 00:00:59,070
And the median is 21.

16
00:00:59,790 --> 00:01:03,510
The median is always represented as the 50th percentile value.

17
00:01:07,040 --> 00:01:10,640
You can see this information for all of the variables.

18
00:01:13,530 --> 00:01:20,100
There are two important values that we should look for in the EDD. The first is the count value.

19
00:01:22,410 --> 00:01:26,930
The count for almost all the variables is 506.

20
00:01:28,640 --> 00:01:30,160
But for n_hos_beds,

21
00:01:30,530 --> 00:01:33,350
that is, the number of hospital beds,

22
00:01:34,720 --> 00:01:36,630
the count is not 506.

23
00:01:37,510 --> 00:01:42,160
This means that there are eight missing values in our data for this

24
00:01:42,190 --> 00:01:42,460
variable.

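The count check described above can be sketched in a few lines. This is a minimal illustration, not the course notebook: the DataFrame below is a hypothetical stand-in with made-up values, and the column names price and n_hos_beds are assumed from the narration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data; names and values are assumptions.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
    "n_hos_beds": [5.48, 7.33, np.nan, 9.27, 8.82, np.nan],
})

# describe() reports count, mean, std, min, 25%, 50%, 75% and max per numeric column.
edd = df.describe()
print(edd)

# count ignores NaN, so a count below the number of rows signals missing values.
missing = len(df) - edd.loc["count"]
print(missing[missing > 0])
```

The 50% row of the describe() output is the median, which is why the narration reads the median off the 50th percentile.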
25
00:01:44,480 --> 00:01:50,870
We will treat these missing values in a later video, but at the moment we only want to

26
00:01:50,870 --> 00:01:54,170
look at which variables have missing values in them.

27
00:01:54,950 --> 00:01:58,900
So it looks like only n_hos_beds has missing

28
00:01:58,900 --> 00:01:59,450
values.

29
00:02:01,320 --> 00:02:05,430
The second important thing to look for is the difference between the mean and the median.

30
00:02:07,210 --> 00:02:13,090
If you see, there is no significant difference between the mean and median of price, but there is a significant

31
00:02:13,090 --> 00:02:16,120
difference between the mean and median of crime rate.

32
00:02:18,020 --> 00:02:19,330
You can also look at the

33
00:02:20,370 --> 00:02:23,430
min, max, 25th percentile, and 75th

34
00:02:23,510 --> 00:02:25,590
percentile of that variable.

35
00:02:26,340 --> 00:02:30,020
You get a sense of the distribution of data of that variable.

36
00:02:31,440 --> 00:02:38,710
If you see, the minimum value is 0.006 and the 25th percentile value is

37
00:02:38,710 --> 00:02:39,450
0.08.

38
00:02:40,140 --> 00:02:47,120
This means that there is little difference between the 0th percentile, that is, the minimum value, and the 25th

39
00:02:47,130 --> 00:02:48,150
percentile value.

40
00:02:48,870 --> 00:02:53,190
And there is a large difference between the 75th percentile value and the max value.

41
00:02:56,470 --> 00:03:02,680
This means that our data for this variable is positively skewed.

42
00:03:03,010 --> 00:03:08,850
Most of the values lie in the small range of 0 to 3.6,

43
00:03:10,040 --> 00:03:14,340
and the rest of the values are dispersed from 3 to 88.

44
00:03:15,830 --> 00:03:19,840
This means that our data is skewed for this variable.

45
00:03:23,990 --> 00:03:30,390
If you look at the mean and median values of other variables, you will find out that for

46
00:03:32,000 --> 00:03:38,630
n_hot_rooms, the maximum value is significantly larger than the 75th percentile

47
00:03:38,920 --> 00:03:40,120
and the mean value.

48
00:03:41,550 --> 00:03:43,140
The largest value is 101,

49
00:03:43,830 --> 00:03:44,480
whereas the

50
00:03:44,520 --> 00:03:46,170
75th percentile value is 14,

51
00:03:47,280 --> 00:03:50,460
and the 25th percentile value is around 11.

52
00:03:52,530 --> 00:03:58,110
So the data may be positively skewed, or there may be some outliers in this data.

53
00:03:59,340 --> 00:04:02,460
Similarly, for rainfall also, we can see that

54
00:04:03,540 --> 00:04:10,230
the minimum value is 3, the 25th percentile value is 28, and the maximum

55
00:04:10,230 --> 00:04:11,170
value is 60.

56
00:04:12,120 --> 00:04:18,990
So here I can see that the data is either negatively skewed or there is an outlier in this data.

57
00:04:19,650 --> 00:04:23,070
We'll discuss outliers and missing values

58
00:04:23,920 --> 00:04:24,240
in the

59
00:04:24,240 --> 00:04:25,020
next videos.

60
00:04:25,230 --> 00:04:30,690
But at this point, we will use the EDD just to identify such variables.

61
00:04:33,390 --> 00:04:38,070
As you can see, we've identified these issues in our data using the EDD.

62
00:04:39,870 --> 00:04:44,310
And you should always look at the EDD before starting your analysis.

63
00:04:48,150 --> 00:04:53,310
Since we identified problems with the rainfall and n_hot_rooms data,

64
00:04:54,740 --> 00:04:59,000
we will plot scatter plots for these variables to understand them completely.

65
00:05:03,220 --> 00:05:04,700
To plot a scatter plot,

66
00:05:05,430 --> 00:05:06,520
I will use sns,

67
00:05:07,290 --> 00:05:09,650
that is, the seaborn function jointplot.

68
00:05:09,860 --> 00:05:11,270
sns.jointplot,

69
00:05:13,200 --> 00:05:14,430
the x variable

70
00:05:17,980 --> 00:05:19,560
is my n_hot_rooms.

71
00:05:24,260 --> 00:05:27,110
Remember to put variable names in quotes.

72
00:05:30,800 --> 00:05:32,400
And then y equal to

73
00:05:34,470 --> 00:05:42,540
price, since my dependent variable is price. I'll put y equal to price, and then data equal to df.

74
00:05:48,890 --> 00:05:49,260
Remember,

75
00:05:49,440 --> 00:05:51,630
write the variable names exactly as they are.

76
00:05:56,780 --> 00:06:01,160
You can see on the top we have the histogram of n_hot_rooms,

77
00:06:04,590 --> 00:06:08,070
and on the right, we have the histogram of price.

78
00:06:09,350 --> 00:06:10,310
And in between,

79
00:06:11,770 --> 00:06:16,410
in the main plot, we have the scatter plot of price versus n_hot_rooms.

80
00:06:17,280 --> 00:06:22,470
One thing you can quickly identify from this graph is the two outliers.

81
00:06:23,040 --> 00:06:27,690
These two points are at a very large distance from the rest of the points.

82
00:06:28,320 --> 00:06:33,210
So most of my data lies between 0 and 20 for n_hot_rooms.

83
00:06:33,510 --> 00:06:38,390
But there are two points for which we have a value near 80 and 100.

84
00:06:40,520 --> 00:06:44,490
Similarly, we will plot a scatter plot for rainfall also. Write

85
00:06:44,760 --> 00:06:46,370
sns.jointplot,

86
00:06:51,310 --> 00:06:51,940
we will write

87
00:06:52,080 --> 00:06:53,380
x equal to rainfall,

88
00:06:57,280 --> 00:06:59,110
and y equal to price,

89
00:07:06,550 --> 00:07:08,270
and data equal to df.

90
00:07:14,420 --> 00:07:17,600
Now, again, on the top, we have a histogram of rainfall,

91
00:07:17,720 --> 00:07:20,060
and on the right, we have a histogram of y.

92
00:07:20,600 --> 00:07:25,580
And in between we have the scatter plot of rainfall versus price.

93
00:07:27,160 --> 00:07:31,210
Here you can see my rainfall values lie between 20 and 60.

94
00:07:31,300 --> 00:07:36,230
Most of my rainfall values lie between 20 and 60, but I can see one outlier.

95
00:07:36,400 --> 00:07:41,410
There is a single point for which the value is almost like 4 or 5.

96
00:07:42,960 --> 00:07:49,590
This may be due to some sampling error or calculation error or something else.

97
00:07:50,460 --> 00:07:53,700
And we'll learn more about outliers in our outlier videos.

98
00:07:57,150 --> 00:08:01,150
So far, we have only covered numerical variables.

99
00:08:01,510 --> 00:08:04,000
Now let's move on to categorical variables.

100
00:08:07,130 --> 00:08:12,520
So, again, we will take a sample of five values. We'll write df.head().

101
00:08:20,440 --> 00:08:26,080
And we'll try to identify our categorical variables. Our first categorical variable is airport.

102
00:08:28,830 --> 00:08:31,290
To analyze this, we'll plot a bar

103
00:08:31,400 --> 00:08:41,280
plot, or count plot, of the categorical variable. We'll write sns.countplot and then x equal to

104
00:08:43,860 --> 00:08:46,170
our categorical variable, which is airport,

105
00:08:50,900 --> 00:09:00,170
and then data equal to df. We'll run this. Now we have a plot of our categorical

106
00:09:00,310 --> 00:09:03,980
variable airport, and you can see the distribution of airport.

107
00:09:06,490 --> 00:09:08,920
There is nothing unusual in this data.

108
00:09:09,100 --> 00:09:11,620
So we'll move on to the next categorical variable.

109
00:09:15,030 --> 00:09:16,340
That is waterbody.

110
00:09:21,030 --> 00:09:26,170
Again, we'll write sns.countplot, and my x is waterbody.

111
00:09:28,060 --> 00:09:29,760
The data is df.

112
00:09:40,550 --> 00:09:43,250
Again, you can see the distribution of waterbody.

113
00:09:46,620 --> 00:09:49,420
And there is nothing unusual about that also.

114
00:09:49,560 --> 00:09:51,780
So we'll move on to the next variable.

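A rough sketch of the count plots described above follows. The DataFrame and its category labels are hypothetical stand-ins, the column names are assumed from the narration, and the value_counts call is an addition that prints the same counts as numbers.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the course data; names and labels are assumptions.
df = pd.DataFrame({
    "airport": ["YES", "NO", "NO", "YES", "YES", "NO"],
    "waterbody": ["River", "Lake", "Lake", "River", "Lake", "Lake"],
})

for column in ["airport", "waterbody"]:
    plt.figure()                           # one figure per variable
    ax = sns.countplot(x=column, data=df)  # one bar per category
    print(df[column].value_counts())       # the same counts as a table
```

A variable with roughly balanced bars, like the ones here, is what the narration calls "nothing unusual".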
115
00:09:55,770 --> 00:09:57,010
That is bus terminal.

116
00:10:21,130 --> 00:10:25,990
Here you can see my bus terminal is only taking one value, that is,

117
00:10:26,200 --> 00:10:26,710
YES.

118
00:10:28,370 --> 00:10:36,710
So this may not be useful in our data, since this will not provide any differentiating power for our

119
00:10:36,770 --> 00:10:37,790
dependent variable.

120
00:10:38,220 --> 00:10:40,670
For all the values of the dependent variable,

121
00:10:40,850 --> 00:10:41,860
this will be YES.

122
00:10:42,500 --> 00:10:46,850
So in all the cases, this will not impact our model anyway.

123
00:10:47,270 --> 00:10:51,830
So we'll look at it in more detail in a later part of this course.

124
00:10:53,450 --> 00:10:59,540
For now, we will just mark this variable, and we'll write down some of the observations that we have from

125
00:10:59,540 --> 00:11:07,750
looking at the EDD, histograms, and bar plots. Our first observation is about the missing values in

126
00:11:07,830 --> 00:11:09,530
the variable n_hos_beds.

127
00:11:17,250 --> 00:11:24,420
The second observation was skewness or outliers in the variable crime rate.

128
00:11:30,690 --> 00:11:34,820
Our third observation was outliers in

129
00:11:37,430 --> 00:11:40,160
the hotel rooms variable, that is, n_hot_rooms.

130
00:11:47,050 --> 00:11:49,550
And also in the variable rainfall.

131
00:11:51,650 --> 00:11:53,380
And our fourth observation

132
00:11:56,520 --> 00:12:01,440
was about the bus terminal variable. It was only taking one value.

133
00:12:06,570 --> 00:12:10,020
We'll look at these observations in a later part of the course.
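The single-value problem noted for the bus terminal variable can also be found without a plot. The sketch below is an illustration on a hypothetical DataFrame; the column name bus_ter and all values are assumptions, not taken from the course files.

```python
import pandas as pd

# Hypothetical stand-in for the course data; names and values are assumptions.
df = pd.DataFrame({
    "airport": ["YES", "NO", "NO", "YES"],
    "bus_ter": ["YES", "YES", "YES", "YES"],
    "price": [24.0, 21.6, 34.7, 33.4],
})

# nunique() counts distinct non-missing values per column; a column with
# exactly one distinct value cannot help separate one record from another.
n_unique = df.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

Any column this check flags is a candidate for dropping before modelling, which matches the fourth observation in the narration.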