1 00:00:01,020 --> 00:00:06,300 Now we are beginning this section on feature engineering. In this section, 2 00:00:06,390 --> 00:00:13,950 we are going to learn the following things. We will first discuss what features are and what feature engineering is. 3 00:00:15,450 --> 00:00:21,330 Then we will see the rationale, that is, we will explore the reason for doing feature engineering in time 4 00:00:21,330 --> 00:00:21,930 series data. 5 00:00:23,610 --> 00:00:31,330 Then in the practical sessions, we will learn how to develop basic date-time-based input features, lag 6 00:00:31,350 --> 00:00:34,800 based features, and sliding-window-based features. 7 00:00:37,050 --> 00:00:39,480 So let us understand what features are. 8 00:00:42,820 --> 00:00:50,020 In a time series forecasting problem, we usually have this type of data: in one column, 9 00:00:50,170 --> 00:00:57,130 we will have date-time values, and in the other column we have the values of the variable that we are monitoring 10 00:00:57,670 --> 00:01:02,660 and want to forecast. For example, the first column will have date values, 11 00:01:03,100 --> 00:01:05,530 and the second column has the sales values. 12 00:01:07,450 --> 00:01:16,630 We have to transform this type of data to this format, where we have input variables and one output variable 13 00:01:16,750 --> 00:01:17,610 for our model. 14 00:01:19,280 --> 00:01:26,900 For example, instead of having only a date here, we should probably input whether it is a weekday 15 00:01:26,900 --> 00:01:27,480 or weekend. 16 00:01:28,880 --> 00:01:30,650 This can be one input variable. 17 00:01:31,830 --> 00:01:40,230 We can input seasons, such as summer or winter, as another input. These inputs are called features. 18 00:01:42,080 --> 00:01:49,050 And the task of creating or inventing new features from the time series data is called feature engineering. 19 00:01:51,060 --> 00:01:55,290 The idea is that our software does not understand much.
20 00:01:56,790 --> 00:02:02,530 The idea is that our software does not understand much when it sees a date, say, 21 00:02:02,640 --> 00:02:07,050 3rd Jan. It would make more sense of the data 22 00:02:07,170 --> 00:02:10,920 if we tell it whether 3rd Jan was a weekend or not, 23 00:02:12,080 --> 00:02:19,700 or whether it rained or not on that day, or whether people received their salary in that week or not, et cetera. 24 00:02:22,520 --> 00:02:30,140 Such features, which might actually impact the output variable, are to be identified and given as 25 00:02:30,260 --> 00:02:30,680 input. 26 00:02:31,950 --> 00:02:38,770 And this is the aim of doing feature engineering. In the practical part of this course, 27 00:02:39,070 --> 00:02:43,910 we are going to explore how to find these three classes of features. 28 00:02:46,440 --> 00:02:48,550 First is date-time features. 29 00:02:49,910 --> 00:02:52,670 Date and time values have several components. 30 00:02:53,840 --> 00:02:56,690 We try to separate those components for each observation. 31 00:02:57,560 --> 00:03:05,060 For example, from the date we can find the day of the week; we can find the season of the year. 32 00:03:06,320 --> 00:03:11,180 These components are present as information in the date itself. 33 00:03:14,090 --> 00:03:16,730 The second class of features is called lag features. 34 00:03:18,110 --> 00:03:20,650 These are values at previous time steps. 35 00:03:22,220 --> 00:03:25,580 For example, while forecasting this month's sales, 36 00:03:26,270 --> 00:03:27,740 it makes sense to consider 37 00:03:27,860 --> 00:03:33,440 last month's sales also. The third class of features is called window features. 38 00:03:34,780 --> 00:03:39,880 These are a summary of values over a fixed window of prior time steps.
39 00:03:41,000 --> 00:03:48,320 For example, we can use the average sales of the previous three months instead of last month's to help predict 40 00:03:48,350 --> 00:03:54,180 this month's sales. Average values can help us remove noise in the data. 41 00:03:55,040 --> 00:03:59,390 And if carefully done, we can also remove seasonality effects from the data. 42 00:04:00,830 --> 00:04:04,120 So in some situations, window features are very helpful. 43 00:04:06,280 --> 00:04:07,720 Now, let's look at an example 44 00:04:07,960 --> 00:04:14,380 to further understand these three types of features. In the first table, you'll see time series data, 45 00:04:15,040 --> 00:04:16,580 that is, in one column 46 00:04:16,720 --> 00:04:17,880 you have the date value, 47 00:04:18,410 --> 00:04:22,630 and in the second column, you have the footfall in your store, 48 00:04:23,920 --> 00:04:26,890 that is, the number of customers visiting your store. 49 00:04:28,000 --> 00:04:35,680 So on 10th Jan of 2020, you had 853 customers visiting your store. On 11th Jan, 50 00:04:35,860 --> 00:04:38,200 it was 376, and so on. 51 00:04:40,510 --> 00:04:49,120 Now, to convert this time series data into a format which is usable for our model, we convert this 52 00:04:49,120 --> 00:04:50,910 date into features. 53 00:04:53,280 --> 00:04:59,460 The first is an example of a date-time feature, where corresponding to that particular date, 54 00:04:59,790 --> 00:05:03,890 we are assigning whether that day was a weekend or not. 55 00:05:05,850 --> 00:05:12,170 We are doing this because we think that probably footfall depends on whether the day is a weekday or 56 00:05:12,190 --> 00:05:12,620 weekend. 57 00:05:14,170 --> 00:05:18,880 So we try to extract that information from the dates that we have. 58 00:05:21,000 --> 00:05:24,460 In the second column, I have a lag feature. 59 00:05:25,770 --> 00:05:28,260 So we are looking at the footfall 60 00:05:28,650 --> 00:05:30,400 exactly seven days ago.
61 00:05:31,730 --> 00:05:38,700 Why are we doing this? Because exactly seven days ago, it was the same day of the week as today. 62 00:05:39,200 --> 00:05:43,530 So if it is Thursday today, seven days ago will also be Thursday. 63 00:05:44,750 --> 00:05:51,050 And the idea behind that could be that sales on a particular day of the week could be similar. 64 00:05:51,770 --> 00:05:56,360 So sales on this Saturday will be similar to sales on last Saturday. 65 00:05:58,620 --> 00:06:05,640 So that is why we create one more feature, in which we are looking at the lagged value of the variable 66 00:06:05,700 --> 00:06:08,760 to be predicted. In the third column, 67 00:06:09,150 --> 00:06:10,950 we are looking at a window feature. 68 00:06:12,990 --> 00:06:15,540 We are looking at the last seven days. 69 00:06:15,990 --> 00:06:17,820 This is the window at which we are looking. 70 00:06:18,450 --> 00:06:23,670 We are looking at the last seven days and finding the average footfall for the last seven days. 71 00:06:25,350 --> 00:06:29,880 How does it help? Probably, averaging the footfall for the last seven days 72 00:06:30,090 --> 00:06:33,600 removes the noise and seasonality within a week. 73 00:06:34,500 --> 00:06:41,910 So the average value in the last seven days can probably help us predict the footfall on the present day. 74 00:06:45,430 --> 00:06:53,080 So these are the three types of features, date-time features, lag features, and window features, that we 75 00:06:53,080 --> 00:06:56,410 will learn how to extract using our software. 76 00:06:58,660 --> 00:07:02,950 Now, within window features, we will explore two types of window features. 77 00:07:03,940 --> 00:07:07,410 One is a rolling window and the other is an expanding window. 78 00:07:08,900 --> 00:07:15,740 The rolling window is when we fix the width of the window and we slide it forward as we move forward.
79 00:07:17,420 --> 00:07:27,380 For example, if I fix the window width to the last seven days, on 10th January I will look at the last 80 00:07:27,380 --> 00:07:27,980 seven days, 81 00:07:28,220 --> 00:07:30,060 that is, from 2nd Jan to 9th, 82 00:07:31,070 --> 00:07:39,470 and average that. This 985 is the average value of footfall from 2nd Jan to 9th. 83 00:07:41,420 --> 00:07:46,650 This 972 is the average footfall value from 3rd Jan to 10th. 84 00:07:48,160 --> 00:07:50,870 So the window width stays the same. 85 00:07:51,050 --> 00:07:56,300 It is the last seven days only, but the window is sliding. 86 00:07:56,480 --> 00:07:57,740 That is, on 10th Jan 87 00:07:58,310 --> 00:08:02,910 the window was from 2nd Jan to 9th Jan; on 11th Jan, 88 00:08:04,680 --> 00:08:06,740 the window is from 3rd Jan to 10th. 89 00:08:07,640 --> 00:08:12,810 So with each step that we take, we also move our window accordingly. 90 00:08:14,660 --> 00:08:21,410 The other option is an expanding window, in which the window size keeps on increasing. 91 00:08:23,500 --> 00:08:31,360 So in this scenario, if we are looking at the maximum footfall value till date, the window of observation will 92 00:08:31,360 --> 00:08:34,840 keep on expanding by one day each day. 93 00:08:36,120 --> 00:08:45,150 So on 10th Jan, we are looking at all the footfall values till 10th Jan. So suppose the maximum value of 94 00:08:45,150 --> 00:08:48,760 footfall till 10th Jan was 1195. 95 00:08:50,150 --> 00:08:56,210 On 11th Jan also, since the value of footfall on 10th Jan was less than this value, 96 00:08:56,750 --> 00:09:00,080 this value of maximum footfall stays the same. 97 00:09:02,000 --> 00:09:08,870 But on 12th Jan, the 1376 footfall value will also be included in the 98 00:09:08,870 --> 00:09:09,290 window. 99 00:09:10,400 --> 00:09:14,660 And now the 1376 value becomes the largest footfall value.
100 00:09:16,310 --> 00:09:22,280 So from 12th Jan onwards, the maximum footfall value will be 1376. 101 00:09:22,940 --> 00:09:27,910 And this will continue until the footfall value becomes more than 102 00:09:27,910 --> 00:09:28,850 1376. 103 00:09:31,260 --> 00:09:38,430 So the concept is, even if we are increasing the date, we do not shift the window, we just expand it. 104 00:09:39,510 --> 00:09:47,490 So on 11th Jan, the window will include 11th Jan; on 12th Jan, it will include 11th Jan and 105 00:09:47,490 --> 00:09:49,650 12th Jan both, and so on. 106 00:09:49,770 --> 00:09:57,060 It will keep on including the new date into the window. This is an expanding window. 107 00:09:59,440 --> 00:10:04,960 Before we start doing this practically, I want to answer this one question that students often ask: 108 00:10:06,460 --> 00:10:11,260 how do I know which features and how many features I should use for my problem? 109 00:10:13,130 --> 00:10:19,580 My suggestion is that you should understand the problem deeply and try to identify as many features 110 00:10:19,580 --> 00:10:20,120 as you can. 111 00:10:21,200 --> 00:10:28,220 Even if a feature intuitively makes less sense, I would suggest that you should create that feature and run 112 00:10:28,220 --> 00:10:28,910 the analysis. 113 00:10:29,570 --> 00:10:35,840 You will later see that most of our models also tell us which of the variables are useful and which 114 00:10:35,840 --> 00:10:36,290 are not. 115 00:10:38,330 --> 00:10:46,070 My point is that including a useless variable does less harm than excluding a useful variable. 116 00:10:47,600 --> 00:10:55,940 So do avoid absurd variables, but do not exclude a business-relevant variable because it seems counterintuitive 117 00:10:55,940 --> 00:10:56,360 to you. 118 00:10:58,720 --> 00:10:59,230 That's all. 119 00:10:59,750 --> 00:11:02,680 Now, let us see the practical implementation of these concepts.
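The four kinds of feature described in this section (a date-time feature, a lag feature, a rolling window feature, and an expanding window feature) could be sketched as follows. This is only a minimal illustration in plain Python, which is an assumption because the section does not name the software used in the course. The function name `build_features`, the dictionary keys, and the footfall numbers are all hypothetical and are not the values from the lecture's table. In this sketch both windows use only observations strictly before the current day.

```python
from datetime import date, timedelta
from statistics import mean


def build_features(dates, footfall, lag=7, window=7):
    """Return one feature row per observation (hypothetical helper)."""
    rows = []
    for i, (day, value) in enumerate(zip(dates, footfall)):
        prior = footfall[:i]  # observations strictly before the current day
        rows.append({
            "date": day,
            "footfall": value,  # output variable to be predicted
            # Date-time feature: Saturday (5) or Sunday (6).
            "is_weekend": day.weekday() >= 5,
            # Lag feature: the value exactly `lag` steps earlier, if it exists.
            "lag_value": footfall[i - lag] if i >= lag else None,
            # Rolling window: mean of the last `window` prior values;
            # the width stays fixed and the window slides forward.
            "rolling_mean": mean(prior[-window:]) if i >= window else None,
            # Expanding window: maximum of all prior values;
            # the window grows by one observation at each step.
            "expanding_max": max(prior) if prior else None,
        })
    return rows


# Made-up data: ten consecutive days starting on 1 January 2020.
start = date(2020, 1, 1)
dates = [start + timedelta(days=i) for i in range(10)]
footfall = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

rows = build_features(dates, footfall)
for row in rows:
    print(row)
```

The first seven rows have no lag or rolling value because seven earlier observations are not yet available; how to treat such rows (drop them or fill them) is a separate modelling choice that this sketch leaves open.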