In this video, we will learn how to do feature engineering for datetime data.

Feature engineering is one of the most important steps for any forecasting-related machine learning project. Feature engineering means modifying, deleting, or combining existing raw features from our data to create some new features. These newly created features will help us in enhancing the performance of our forecasting model.

Let's first start with creating new datetime-related features from our date column. We are going to use the same dataset that we imported last time; we named it df2. So let's look at the content of this data frame. Here in our first column we have the dates, and in the second column we have the births. So we have date data in the first column and numeric data in the second column.

Now, suppose I just want to create a column for the year. So here the year is 1959, so I want a column that says 1959 as the year value in the next column. Similarly, for all these dates, I want an additional column mentioning that year. So how to do that?

First, let's create another data frame, that is, a features data frame. We are using the df2.copy() method to do this. Now, here we can use the datetime-like (dt) properties to do this. So you select the column that contains the date and use the .dt accessor on it.
After .dt, if you write .year, .month, or .day, you will get the corresponding year, month, or day for that date. For example, if I am creating a new column here and I am writing df2['date'].dt.year, I will get the year information in my new column. Similarly, if I want to extract the month information (for example, for all these dates the month is one), and if I want this information in another column, I can write this: I'm creating a new column, which is month, and again I'm using df2's date column, then the datetime-like accessor, and on that I am calling month. I can do a similar thing if I want a separate column for days as well. So here I want another column with 01, 02, 03, which contains the days.

If I run this, and if I take the first five values of my new data frame, that is, features, you can see that we are getting the desired result. We have separated the year information, the month information, and the day information from this combined date field into three new fields: year, month, and day.

So this is the first way of doing feature engineering: from a column that contains datetime-format data, we can create separate fields for year, month, day, hour, minute, time, and so on.
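The steps above can be sketched in pandas like this. The column names date and births and the frame names df2 and features follow the video; the three sample rows are made up here for illustration:

```python
import pandas as pd

# A small sample frame resembling the one in the video
# (the real dataset holds daily birth counts for 1959).
df2 = pd.DataFrame({
    "date": pd.to_datetime(["1959-01-01", "1959-01-02", "1959-01-03"]),
    "births": [35, 32, 30],
})

# Work on a copy so the raw data stays untouched.
features = df2.copy()

# The .dt accessor exposes the datetime parts of the column.
features["year"] = features["date"].dt.year
features["month"] = features["date"].dt.month
features["day"] = features["date"].dt.day

print(features.head())
```

Each .dt property returns a Series aligned with the original rows, so the new columns line up with their dates automatically.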
If you want to look at some more information on this, you can click on this official documentation of pandas. Here you will see all the properties that are available as datetime-like (dt) properties. So we just used year, month, and day, but there are other additional methods as well that you can use.

Now, the next type of feature we want is lag features. So, for example, suppose for this row number two, I want to get the value of the previous time period. For this row three, I want an additional column which says 32 here, because this is the value for the previous day. Similarly, for the second row, I want the value 35, since 35 is the value for the previous day. These types of features are called lag features, and they are very important for any forecasting problem.

So let's see how to create these lag features. Creating lag features is very simple: we can use the .shift() method of pandas. So here I am creating a feature named lag_1, and again I'm using the births values; I want to have the births value of the previous time period. To do this, we can use the .shift() method. We can just call the column whose values are needed and then use .shift(), and in the brackets we can mention the difference of the time period.
So here we want the information of the previous time period; that is, the difference is one. That's right: if I write .shift() and then in the brackets I write 1, I will get the value of the previous time period.

Now, to get the value of the same day last year, we can use 365, since there are 365 days in a year. So if we do .shift() and in the brackets we mention 365, we will get the value that was there on the same day of the last year.

And there is one more thing: if you see, there is no previous-day value available for this first row, since this is the first entry in our data. So in this case, the output will be NaN, and that means "not a number". Similarly, if we are creating a lag feature with one year of delay, for the first 365 values we will get NaN in this variable.

So we have created these two lag variables: lag_1 with the previous day's value, and another one with the last year's value. So let's look at our data frame once again. You can see now we have two more variables, lag_1 and lag_2. As I said earlier, for the first record we have NaN, since the previous day's value was not available. And for the lag_2 variable, we have NaN values as well.
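A minimal sketch of the two lag features, assuming the same births column as in the video. The five sample values are the ones read out on screen; on this short sample the 365-day lag is entirely NaN, just as it is for the first 365 rows of the full dataset:

```python
import pandas as pd

# Five sample births values, as read out in the video.
features = pd.DataFrame({"births": [35, 32, 30, 31, 44]})

# shift(1): each row receives the previous row's value; the first row becomes NaN.
features["lag_1"] = features["births"].shift(1)

# shift(365): on a full year of daily data this gives "same day last year";
# here it only shows that the first 365 rows come out as NaN.
features["lag_2"] = features["births"].shift(365)

print(features)
```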
For the first 365 records the values are NaN, because the previous year's values are not available for those rows. Here you can also see that we are getting a defined value in the second row, since the births value of the first row was 35. You can see that there is a lag of one day between the births column and the lag_1 column.

So this is how we create lag features in Python using pandas. Now let's look at some other features. The first one is the window feature. So suppose I want another variable here where I want the average of these two values, and here I want the average of these two values. So we want the average of this period's births data and the last period's births data. How can we do that?

For this, we have to use the rolling method. The method is almost similar to creating lag variables. Here we have to provide the .rolling() method, and we have to mention the number of periods we want to consider. So if I just want to consider only two periods, suppose here I want the average of these two values, my window size will be equal to two. If I want to consider three periods, my window size will be equal to three.

Now, after this, I have to use an aggregating function.
So suppose we are considering three births values: we have 35, 32, and 31, let's say. Now we have to mention what we want from these values. We can take the mean of all these values, we can take the max of these values, or we can take the minimum of these three values. So in this case, I also have to provide the aggregating function that we are going to use.

So let's create a roll_mean variable, in which we will have the mean of this period's value and the last period's value. So again, we are creating another variable in our features data frame with the name roll_mean. And here we are again taking df2's births data, we are setting a rolling window of two periods, and then we are taking the mean of these two values.

So let's run this and look at our new data. You can see we have another variable, that is, roll_mean. Again, the first value is NaN, since the previous period's data is not available for the first record. For the second record, we are getting a value of 33.5; 33.5 is the average of 35 and 32. Similarly, for the third record, we are getting a value of 31, which is the average of 32 and 30, and so on.
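The rolling computation can be sketched like this, using the same five sample values. The roll_mean column matches what was just built; the roll_max line previews the three-period maximum that the video constructs next, and works the same way:

```python
import pandas as pd

features = pd.DataFrame({"births": [35, 32, 30, 31, 44]})

# Two-period rolling window aggregated with mean:
# each row is the average of itself and the previous row.
features["roll_mean"] = features["births"].rolling(window=2).mean()

# The same idea with a three-period window and max as the aggregating function.
features["roll_max"] = features["births"].rolling(window=3).max()

print(features)
```

Rows without enough history (one row for roll_mean, two rows for roll_max) come out as NaN, matching the result shown on screen.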
So here our window size was of two records, and our aggregating function was mean.

Similarly, let's create another variable where we want to take the maximum value of the previous three periods. So we want to consider the current period, we want to consider the previous period, and also we want to consider the period before that. So, for example, for this third record, I want the value 35, since 35 is the maximum of 35, 32, and 30. So I want to consider three time periods and I want to get the max out of those three.

Again, I am creating another variable, that is, roll_max. We are using df2's births column, and in this case my rolling window is of three units, so I'm using the .rolling() method with the window equal to three. And as the aggregating function, I am using the max method. So here I don't want the mean of these three values; I want the maximum of these three values.

Let's run this and look at our data frame once again. You can see this is our data: we have another column for roll_max. For the first two records, the value is NaN, since this time the window size is three.
Even for the second record, we don't have the three values to perform this operation. So for the first two records the values are NaN, and from the third record, we are getting our desired result. For the third record, the values which we are considering are 35, 32, and 30. The maximum of all these three values is 35, so that's why we are getting 35 here. For this fourth record, the values we are considering are 32, 30, and 31. The maximum of all these values is 32; that's what we are getting here. For this last record, the maximum value is 44; that's why we are getting 44.

So you can use mean as well as max or minimum as the aggregating function. Here, if you want some more details about this, you can again click at this link to look at the official documentation of pandas.

Now, suppose we don't want to use a window, but we want to consider all the values from the start of our data. For that, we can use expanding features. In expanding features, we don't give a window size; we just consider all the values from the start up to that record.
So, for example, if I want the maximum value of births till 1959-01-02, I will consider 35 and 32 to get the maximum of these two values. If I want to get the maximum value till 1959-01-03, I have to consider these three values. For creating these types of features, where we are considering all the values that we have, we use the expanding method of pandas.

So how to use that? Expanding is similar to rolling. In rolling, we have to give an additional parameter for the size of the window; in expanding, there is no need for that, because we are considering all the values till that time. So again, we are creating another variable, that is, expand_max, and here we want to get the maximum value of the births to date. So I am using df2, and again the births data; then I can use the .expanding() method, and as the aggregating function, I can use .max() on this.

Let's look at the first values here. You can see the maximum value of births for the first record is 35, so we are getting 35. For the second record, we are considering 35 and 32; again, the maximum is 35, so we are getting 35. For the third record as well, the maximum is 35, since we are considering these three values, so we are getting 35.
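The expanding feature can be sketched the same way; again the five sample values come from the transcript:

```python
import pandas as pd

features = pd.DataFrame({"births": [35, 32, 30, 31, 44]})

# expanding() considers every value from the first row up to the current row,
# so no window size is needed; max() is the aggregating function.
features["expand_max"] = features["births"].expanding().max()

print(features)
```

Unlike rolling, the expanding window is never short of data, so even the first row gets a value (its own).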
Now, let's move to this fifth record. Here we are considering all these five values: we have 35, 32, 30, 31, and 44. Again, the maximum is 44, so here the value is 44. Now, for this record, say the tenth record, we are considering all the values from the start again. So we have 35, 32, 30, and so on, up to 43, 45, 38, and 27. And the maximum out of all these values is 45; that's why we are getting 45 here.

So that's all for feature engineering here. We have discussed how to do feature engineering in Python. We have not discussed the application, but in the later part of this course, we will learn how to apply this feature engineering to generate some new and meaningful, useful data. Thank you.