1 00:00:00,570 --> 00:00:08,340 In this video, we will learn about the train-test split for time series data, that is, the train-test split 2 00:00:08,340 --> 00:00:09,070 for time series. 3 00:00:09,180 --> 00:00:12,570 This split is very different from the train-test split 4 00:00:12,570 --> 00:00:18,690 used for other machine learning algorithms such as linear regression, logistic regression, decision 5 00:00:18,690 --> 00:00:19,610 trees, etc. 6 00:00:21,060 --> 00:00:25,050 The first difference: for other machine learning algorithms, 7 00:00:25,230 --> 00:00:31,720 we randomly choose a subset of our data as the test set or validation set. 8 00:00:33,510 --> 00:00:42,210 But we cannot randomly choose data from a time series, because in a time series the data 9 00:00:42,300 --> 00:00:47,550 is organized in a particular order using dates or time values. 10 00:00:48,270 --> 00:00:53,100 So we cannot randomly pick a particular value from the time series. 11 00:00:54,540 --> 00:01:01,140 The second difference is that for other machine learning algorithms, we create three different sets: 12 00:01:01,590 --> 00:01:04,410 test, train, and validation. 13 00:01:05,040 --> 00:01:07,400 We usually train our model on the train set. 14 00:01:08,730 --> 00:01:11,850 We tune our model's hyperparameters using the validation set. 15 00:01:12,210 --> 00:01:18,370 And then we finally use the model to predict values on the test set. 16 00:01:20,910 --> 00:01:22,320 But with time series data, 17 00:01:22,530 --> 00:01:26,310 we usually divide our data into just test and train. 18 00:01:27,270 --> 00:01:36,450 This is mainly because in most cases we have limited data, and using the validation data for training 19 00:01:36,450 --> 00:01:41,910 the model makes more sense than using it only to validate the model. 20 00:01:43,140 --> 00:01:47,230 So for time series, we just divide our data into test and train.
21 00:01:47,940 --> 00:01:53,130 And we do not randomly pick some values to create the test set. 22 00:01:54,960 --> 00:02:01,650 We generally keep the last few values of our time series to act as the test set. 23 00:02:03,600 --> 00:02:11,550 So if you have monthly data for some time series, you may take the last three or four months as 24 00:02:11,550 --> 00:02:12,480 your test set. 25 00:02:15,400 --> 00:02:19,030 Now, let's start creating train and test data in Python. 26 00:02:21,340 --> 00:02:26,680 We will be using the daily minimum temperatures dataset that we were using earlier. 27 00:02:27,560 --> 00:02:31,980 So the data frame name is temp_df. 28 00:02:32,540 --> 00:02:34,730 Let's look at the first five values. 29 00:02:37,220 --> 00:02:45,020 You can see that we have two columns, Date and Temp, where Temp stands for temperature; the Date column contains 30 00:02:45,020 --> 00:02:47,600 the datetime values for our time series. 31 00:02:48,470 --> 00:02:51,230 You can see this is the daily data. 32 00:02:52,040 --> 00:02:58,610 So in the first row we have the data of 1st of January, and in the second row the data of 2nd January. 33 00:03:00,230 --> 00:03:02,120 We can also look at the last five values 34 00:03:10,630 --> 00:03:13,210 to get an idea of the last values of the series. 35 00:03:13,520 --> 00:03:15,530 So these are the last five rows. 36 00:03:16,150 --> 00:03:21,850 You can see we have data of 10 years, from 1981 to 1990. 37 00:03:23,920 --> 00:03:30,270 Now, let's look at how many values we have in our data frame. 38 00:03:30,580 --> 00:03:32,470 So we are using the .shape attribute. 39 00:03:37,020 --> 00:03:45,250 We have three thousand six hundred and fifty rows and just two columns. 40 00:03:46,510 --> 00:03:55,120 We are planning to use 80 percent of this data as our train set and the remaining 20 percent as our test set.
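These exploration steps can be sketched in pandas roughly as follows. The video loads a real CSV file that is not shown here, so the synthetic stand-in frame below (its dates and constant temperature values) is only an assumption to make the example runnable:

```python
import pandas as pd

# Stand-in for the daily minimum temperatures data in the video; the real
# frame is read from a CSV, so these 3650 rows of dummy values are synthetic.
dates = pd.date_range("1981-01-01", periods=3650, freq="D")
temp_df = pd.DataFrame({"Date": dates, "Temp": 11.0})

print(temp_df.head())   # first five rows: Date and Temp columns
print(temp_df.tail())   # last five rows, near the end of the series
print(temp_df.shape)    # (3650, 2) -> 3650 rows, 2 columns
```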
41 00:03:58,120 --> 00:04:06,760 So what we are going to do: out of these 10 years of data, we will use the first eight years as our train set and 42 00:04:06,760 --> 00:04:09,520 the last two years as our test set. 43 00:04:10,150 --> 00:04:13,750 So now let's get the first value of the shape. 44 00:04:14,320 --> 00:04:18,250 This is the number of rows we have in our data. 45 00:04:19,630 --> 00:04:26,140 Now, we will create another variable, train_size, in which we will store how many records we 46 00:04:26,140 --> 00:04:27,940 want in our train set. 47 00:04:28,600 --> 00:04:35,450 So we have three thousand six hundred and fifty records, and we want to take 80 percent of these records 48 00:04:35,470 --> 00:04:36,400 into our train set. 49 00:04:36,730 --> 00:04:41,660 So we are using temp_df.shape, 50 00:04:42,160 --> 00:04:47,470 and we are getting the first element of our shape tuple, which is three six five zero. 51 00:04:47,980 --> 00:04:50,570 And we are multiplying this value by 0.8. 52 00:04:51,370 --> 00:04:54,390 And finally, we are converting this value to an int. 53 00:04:54,620 --> 00:05:02,770 So suppose after multiplication we get a value of, say, two thousand six hundred four point five. 54 00:05:03,070 --> 00:05:06,520 We don't want that point five decimal value. 55 00:05:06,790 --> 00:05:13,570 So we are converting this value into an integer, because a size should be an integer value. 56 00:05:15,990 --> 00:05:20,830 Let's print this. So our train size is two nine two zero records. 57 00:05:21,610 --> 00:05:26,500 So we want two nine two zero records in our train set and the remaining records in the test set. 58 00:05:29,500 --> 00:05:33,910 Now, to create the train set, we will split our temp_df. 59 00:05:34,050 --> 00:05:34,260 So.
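The train-size computation described above can be sketched like this; the stand-in frame is an assumption (the real temp_df holds the temperature data), but the arithmetic is exactly as narrated:

```python
import pandas as pd

# Stand-in frame with the same length as the dataset in the video.
temp_df = pd.DataFrame({"Temp": [11.0] * 3650})

# 80% of the 3650 rows; int() truncates any fractional part (e.g. 2604.5
# would become 2604), because a record count must be a whole number.
train_size = int(temp_df.shape[0] * 0.8)
print(train_size)  # 2920
```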
60 00:05:34,890 --> 00:05:42,430 We will take the first two nine two zero records as train, and the remaining records from index two nine 61 00:05:42,430 --> 00:05:48,460 two zero to three six five zero as test. To do that, 62 00:05:48,610 --> 00:05:54,680 we will use slicing of data frames. So we will just write temp_df. 63 00:05:55,690 --> 00:05:59,240 Now, we want all the records starting from the first record. 64 00:05:59,410 --> 00:06:06,580 The index is zero for the first record, so we want all rows from zero to the train size; the train 65 00:06:06,580 --> 00:06:08,090 size is two nine two zero. 66 00:06:08,770 --> 00:06:10,600 So we will have all the records 67 00:06:11,620 --> 00:06:16,090 whose indexes are between zero and two nine two zero. 68 00:06:18,520 --> 00:06:22,720 You can see these are the indexes, just before the column names. 69 00:06:23,620 --> 00:06:25,100 We get these indexes 70 00:06:25,630 --> 00:06:27,160 when we look at our data frame. 71 00:06:30,340 --> 00:06:34,240 So let's run this. Now, for the test set, 72 00:06:35,050 --> 00:06:39,250 we will need all the records whose indexes are two nine two zero or greater. 73 00:06:39,790 --> 00:06:43,200 So here we are again selecting our temp_df, 74 00:06:43,690 --> 00:06:45,610 and we are selecting all the data 75 00:06:46,450 --> 00:06:50,650 where the index is two nine two zero or greater. 76 00:06:50,890 --> 00:06:54,220 So we write train_size here and then a colon, 77 00:06:57,400 --> 00:07:04,050 and after the colon we are not writing anything, leaving it blank, meaning till the end. 78 00:07:06,390 --> 00:07:09,070 Similarly, we can also mention the last number here. 79 00:07:09,600 --> 00:07:13,980 So train_size, then colon, then three six five zero. 80 00:07:14,040 --> 00:07:15,010 This will give the same result. 81 00:07:20,640 --> 00:07:21,790 So let's run this. 82 00:07:23,890 --> 00:07:25,570 Let's review what we have done.
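The slicing described above can be sketched as below; a stand-in frame replaces the real temperature data, but the split itself follows the narration (first 2920 rows as train, the rest as test):

```python
import pandas as pd

temp_df = pd.DataFrame({"Temp": [11.0] * 3650})  # stand-in for the real frame
train_size = int(temp_df.shape[0] * 0.8)         # 2920

# With the default integer index, df[a:b] slices rows by position,
# including a and excluding b.
train = temp_df[0:train_size]    # rows 0 .. 2919
test = temp_df[train_size:]      # rows 2920 .. 3649; same as temp_df[train_size:3650]

print(train.shape, test.shape)   # (2920, 1) (730, 1)
```

Leaving the end of the slice blank means "till the end", so `temp_df[train_size:]` and `temp_df[train_size:3650]` select the same rows.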
83 00:07:25,900 --> 00:07:35,830 We have selected all the data from index zero to the train size as train, and we have selected the 84 00:07:35,830 --> 00:07:42,260 remaining data, which is from the train size index to the last index, as our test set. 85 00:07:43,880 --> 00:07:47,830 Now let's look at the shape of our train and test datasets. 86 00:07:50,930 --> 00:08:00,410 You can see we have two nine two zero records in train, and we have the remaining 730 records in test. 87 00:08:04,000 --> 00:08:08,920 This is how we split data into test and train for time series. 88 00:08:11,200 --> 00:08:15,670 Now let's discuss another concept here, that is, walk-forward validation. 89 00:08:16,900 --> 00:08:20,380 So suppose this light gray is your training data. 90 00:08:21,880 --> 00:08:30,940 What we usually do is we train a model, say M1, on this training set, and then we will 91 00:08:30,940 --> 00:08:33,580 use this model to predict the future values. 92 00:08:34,030 --> 00:08:42,230 So we will use the same model to predict the value at time t1, then at time t2, then at time 93 00:08:42,430 --> 00:08:43,960 t3, and so on. 94 00:08:46,540 --> 00:08:53,460 But at time t2, we already have all the information available for time t1. 95 00:08:54,070 --> 00:09:02,200 So if you want to predict the value at time t3, and you have the additional values at time t1 and time 96 00:09:02,200 --> 00:09:11,470 t2, you can use these two new records to improve your model and then predict the value at time t3.
97 00:09:13,330 --> 00:09:21,760 Similarly, if you want to predict the value at time t5, and you already have data of all the values 98 00:09:21,960 --> 00:09:29,530 at the times before, you want to use all the available information with you. You don't want to use the 99 00:09:29,530 --> 00:09:36,670 same M1 model that you created at time t1, because in that case you are missing out on values 100 00:09:36,700 --> 00:09:39,900 that are available at times t2, t3 and t4. 101 00:09:42,070 --> 00:09:49,610 So what we can do is: if we want to train a model for time t2, we can take all the values 102 00:09:49,650 --> 00:09:54,940 till time t1, create a model, and use that model to predict the value at time 103 00:09:54,950 --> 00:09:59,720 t2. If we want to predict the value at time t3, 104 00:10:00,190 --> 00:10:03,040 we will take all the values till time t2. 105 00:10:03,730 --> 00:10:05,410 So we will take all those values, 106 00:10:05,440 --> 00:10:09,090 we'll create another model, and predict the value at time t3. 107 00:10:10,130 --> 00:10:17,590 Similarly, if we want to predict the value at time t5, we will take all the information that 108 00:10:17,590 --> 00:10:25,150 is available to us till time t4, we'll create a model on that, and then we will predict the value at 109 00:10:25,150 --> 00:10:25,540 time 110 00:10:25,630 --> 00:10:25,880 t5. 111 00:10:28,540 --> 00:10:37,350 So for our test set also, we are not going to create a single model and predict the values for all the 112 00:10:37,370 --> 00:10:43,240 seven hundred and thirty records. For the first record in our test set, 113 00:10:43,540 --> 00:10:51,160 we are going to use the train set, create a model, and predict the first value. For the second 114 00:10:51,160 --> 00:10:52,090 value of our test set, 115 00:10:52,100 --> 00:10:55,120 we will take all the values of our train set,
116 00:10:55,360 --> 00:11:02,230 we will add the first value of the test set, and then we will predict the value for the second record 117 00:11:02,350 --> 00:11:03,340 of our test set. 118 00:11:05,050 --> 00:11:12,490 So this way of validation is known as walk-forward validation, and it usually gives us more accuracy 119 00:11:12,700 --> 00:11:14,010 than a single time series 120 00:11:14,020 --> 00:11:14,490 model. 121 00:11:17,740 --> 00:11:21,120 So we will learn how to use walk-forward validation as well, 122 00:11:22,540 --> 00:11:25,240 along with creating a single time series model. 123 00:11:26,500 --> 00:11:27,590 That's all for this video. 124 00:11:27,730 --> 00:11:28,170 Thank you.
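The walk-forward loop described above can be sketched as below. The video has not yet fixed a concrete forecasting model, so a naive last-value forecast stands in for the "create a model" step; the point of the sketch is only the loop that folds each new test observation into the history before the next prediction:

```python
import pandas as pd

series = pd.Series(range(100), dtype=float)   # stand-in time series
train_size = int(len(series) * 0.8)           # 80 train points, 20 test points

history = list(series[:train_size])           # start from the full train set
predictions = []

for actual in series[train_size:]:
    # "Create a model" on everything seen so far; the naive last-value
    # forecast here is a placeholder for whatever model the course builds.
    forecast = history[-1]
    predictions.append(forecast)
    history.append(actual)                    # walk forward: add the new value

print(len(predictions))                       # one forecast per test record
```

Each test record thus gets a prediction from a model that has seen all values before it, which is why this scheme usually beats a single model trained once on the train set.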