Since we've built a model which is able to make predictions, and we've made some on the test dataset and exported them, the people you share these predictions with (or in fact yourself, and I know in my case I'm very, very interested) might be curious about which parts of the data led to these predictions. So this is where feature importance comes in. Feature importance, in other words, seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable, or in our case the sale price. Let's write that down: feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting (let's put this in bold) the target variable. So in our case it's the sale price. All right, so how might we do this? We know we're using a RandomForestRegressor. We've seen in a previous video, in the classification project as well as the scikit-learn project, that some scikit-learn models have the attribute feature_importances_. But if we didn't know, if we were using a model we weren't familiar with, maybe we could search something like "random forest regressor feature importance". Beautiful, so maybe this will give us some results to explore. We've got the scikit-learn documentation, which is always helpful, so we're looking in here.
We might search for what we want: "feature importances". "Return the feature importances; the higher, the more important the feature." Let's see what that actually does. So let's take our model and find the feature importances of our best model: ideal_model.feature_importances_. Well now, that returns a fairly large array, a whole bunch of different values here. Some of them are zero, some of them are pretty low, ten to the negative six there. But what does this have in common with our training dataset? Let's check the length of it: 102. So we're inferring that we've got 102 columns here and 102 values here, which means we're getting a value for every feature. So SalesID would map to this, and MachineID would map to this. Now we could make a dictionary by zipping X_train.columns together with these values, but I don't know about you, I prefer to see things visually. So let's make a helper function which helps us do that: a helper function for plotting feature importances so we can see them visually.
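The idea described above can be sketched like this. This is a minimal example on synthetic stand-in data, since the project's preprocessed X_train and y_train aren't available here; the column names are just placeholders borrowed from the bulldozers dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Stand-in data: in the project this would be the preprocessed X_train / y_train
rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.random((200, 5)),
                       columns=["YearMade", "ProductSize", "saleYear",
                                "Enclosure", "MachineID"])
# Target driven almost entirely by one feature, so its importance stands out
y_train = X_train["YearMade"] * 10 + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# One importance score per column; scores sum to 1, higher = more important
importances = model.feature_importances_

# Pair each column name with its importance (the "dictionary" idea from above)
importance_dict = dict(zip(X_train.columns, importances))
```

Here `len(importances)` matches `len(X_train.columns)`, which is exactly the 102-for-102 correspondence noted above.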
So we go def plot_features, and we might take a list of columns and a list of importances, and we're going to set n to 20. Now this will make sense in a second, because we have 102 values here, but realistically, when we look at a plot, we only want the top 20 values, and that's why we set n, so we can adjust it with this little function here. Helper functions are great. I should have started using functions earlier; writing functions saves a lot of time, put it that way. I used to just write the same line of code over and over and over again in different cells, but I've started to get into the habit of writing functions for different uses. So what we might do is use a little pandas trick here called chaining, and what that means is simply putting a number of different pandas functions in brackets. We'll see that in a second, because we're going to make a DataFrame called df, and it'll have two columns: "features" is going to be columns and "feature_importances" is going to be importances, right? So we're just creating a DataFrame here, and then we're going to call sort_values. See here, these are still within the brackets; what this means is it's going to do pd.DataFrame and .sort_values in one hit. We want to sort it by "feature_importances", and we want ascending equal to False, so it goes from highest to lowest.
That makes sense, and then we want to reset the index with drop equals True. Actually, this bracket should be down here. Because you've got a bracket here and here and two dots, that means it's just going to do these three pandas steps in one hit. And now we need to plot, so let's plot the DataFrame we've created. What we might do is go fig, ax to instantiate a plot with plt.subplots(), and then we might do a horizontal bar, because in my experience, plotting feature importances looks really good on a horizontal bar. Now we only want up to n; that's why we have n there, so that's going to be the first 20 examples, and the same with the feature importances, we only want up to the first 20. Then we're going to set the y label, so ax.set_ylabel, to "Features" (because on a horizontal bar, x and y are rearranged), and we go ax.set_xlabel, to add some communication to it, and put "Feature importance" down the bottom. Let's see what this looks like. So we run our helper function there, and all we have to do to call it is go plot_features, and then we'll pass it the columns, so we just want the columns from X_train, because that's a variable there, and then the feature importances are just ideal_model.feature_importances_. Let's see what this looks like. Okay, so we've got the features on the left here.
And their values here. I want the top one, the most valuable feature, to be at the top. So what we might do is ax dot... how do we do it? Invert? Can we invert? I think it's ax.invert_yaxis(). Maybe that's not a function. If in doubt, run the code. invert_yaxis... and that didn't work. Maybe it is a function. Hmm. If in doubt, run the code. Trust your instincts, right? That's how you learn things: you practice by running it without looking it up. And if in doubt, you can always look it up, right? You can always ask "how to invert a horizontal bar plot matplotlib"; Google should be able to help. All right, so what can we infer from this? There's a fair bit going on; we've got about 20 different features here. So it's saying that YearMade... let's have a look at our X_train, let's get that in here... it's saying that the year the bulldozer was made is the most important feature according to the ideal model. And then ProductSize. What is that? Let's look at our data dictionary: ProductSize, ProductSize... don't know what this is. OK, that's great.
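Pulled together, the helper function described above might look like this. This is a sketch of what's built in the video; it also returns the sorted DataFrame (which the video doesn't do) purely so the result is easy to inspect.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_features(columns, importances, n=20):
    """Plot the top-n feature importances as a horizontal bar chart."""
    # Pandas chaining: build, sort, and re-index the DataFrame in one expression
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))

    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")          # x and y swap on a horizontal bar
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()                  # most important feature at the top
    return df

# In the notebook it would be called like:
# plot_features(X_train.columns, ideal_model.feature_importances_)
```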
The size grouping for a product group; subsets within product group. This is an example of where you maybe need to do some more research and figure out what features are what. If you've got something like this in your data dictionary, you might have to reach out to, like, a subject matter expert, or we could just go X_train["ProductSize"], and maybe we check the value counts. Okay, so there's one, two, three, four, five, six different sizes. Now zero is going to be missing; a lot of missing values there. All right, but maybe this doesn't really tell us much, so maybe we could do it on our original DataFrame. There we go. All right, so the product size... that kind of makes sense, that the product size is influencing the sale price. saleYear and Enclosure are also doing a fair bit of damage there. Enclosure value counts... okay, so that's... mm hmm, that doesn't really make much sense to me, so then we go back to the data dictionary and we find Enclosure: machine configuration, does the machine have an enclosed cab or not. All right, so maybe we'd need to figure out what these different values might mean.
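For reference, value_counts on a ProductSize-style column looks like this. The values here are made up for illustration, not the real ones from the bulldozers DataFrame.

```python
import pandas as pd

# Illustrative data only; the real column comes from the bulldozers DataFrame
sizes = pd.Series(["Medium", "Small", None, "Large / Medium", "Medium",
                   None, "Mini", "Compact", "Large", "Small"],
                  name="ProductSize")

print(sizes.value_counts())              # counts per category, NaN excluded
print(sizes.value_counts(dropna=False))  # include the missing values too
```

By default `value_counts()` skips missing values, which is why counting on the numerically encoded X_train (where missing became 0) can look different from counting on the original DataFrame.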
But this is the kind of exploration you'd probably do towards the end, after you've built a model, right? You might bring this to someone after you've made some predictions; you might take this into a sort of presentation. So maybe you're meeting with a client or meeting with another team member or something, and you're going, "Hey, these are the most important features our model has found. Does this agree with your intuition, with what someone already knows about the data? Does it make sense that the model is using these features to derive its predictions?" Or this information here might influence how you go about collecting data in the future on your bulldozer sales, so you might put a bit more effort into the values here that are contributing most to predicting the sale price of a bulldozer. The last question that I'll leave you with (we've kind of just glazed over it a little bit, but I want you to do a little bit of research) is: why might knowing the feature importances of a trained model be helpful? That's the finishing question. So, a question to finish, which is going to involve some research: why might knowing the feature importances of a trained machine learning model be helpful?
So that's going to finish off this project, finishing off with a little question here. What you might want to try and do, if you've followed along with all of these steps here, is see how far you can go with hyperparameter tuning. So something like this: maybe you could leave it running for a while on your computer, see what it finds, see if you can improve the validation RMSLE score, and see how far you'd get up on the leaderboard. Now, if you do reach a point where you're not really improving with a random forest model, a final extension, a final challenge, may be: what other machine learning models could you try on our dataset? And what that might involve is something like searching for the scikit-learn machine learning map. So what we've done is we've gone through this: we've got a regression problem, we found ensemble regressors, and we've used a random forest. Maybe you want to try another one of these regression models using the format of data that we've worked through in this notebook. So I'll leave a little hint here: check out the regression section of this map, or try to look at something like catboost.ai or xgboost.ai.
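As a sketch of that challenge, here's how you might drop in a different scikit-learn ensemble regressor and score it with RMSLE. I'm using GradientBoostingRegressor here as one plausible choice from the map (the hint above names CatBoost and XGBoost, which aren't part of scikit-learn), and stand-in data; in the notebook you would reuse the preprocessed X_train and y_train instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_log_error

# Stand-in data with positive targets (RMSLE needs non-negative values)
rng = np.random.default_rng(0)
X_train = rng.random((200, 5))
y_train = 1000 + 5000 * X_train[:, 0] + rng.normal(0, 100, 200)

# Same fit/predict interface as RandomForestRegressor, so it slots in as-is
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_train)

# RMSLE, the evaluation metric used on the competition leaderboard
rmsle = np.sqrt(mean_squared_log_error(y_train, preds))
print(f"Training RMSLE: {rmsle:.4f}")
```

Because scikit-learn estimators share the same fit/predict interface, swapping models is usually just a matter of changing the import and the constructor call; the data preparation stays the same.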
So these are two extra resources, extracurricular, and of course they're optional at this point. Because once we've started to work through these sorts of projects, now that we've gone end to end on a full machine learning project, your next steps are kind of trying to figure things out on your own, taking what you've learnt from here and then expanding on that. That's the kind of knowledge that is really going to help out, right? So rather than sort of always going through projects like this, it's doing some research and trying out new things. Remember step six of our little framework here: experimentation. That's the challenge I leave to you. But that being said, if you have made it this far, you have gone through the entire project. Congratulations! We've just gone end to end on a regression problem using machine learning. How cool is that? So have a look in the notebook, see what improvements you can make, and if you have any questions whatsoever, feel free to ask them. Leave them in the Discord chat or in the Udemy interface, somewhere where you can leave a question, and we'll see what we can do from there. But congratulations on working through a first end-to-end regression project. How exciting. All the best.