1 00:00:02,070 --> 00:00:06,480 We have been through a lot but we're going to persevere. 2 00:00:06,500 --> 00:00:07,560 We're going to get through this together. 3 00:00:07,560 --> 00:00:09,330 We've got a few more functions to go through. 4 00:00:10,010 --> 00:00:13,730 So the next one is sample. 5 00:00:13,730 --> 00:00:17,630 Now this is a great way to shuffle your data frame. 6 00:00:17,700 --> 00:00:21,130 What I mean by that is mix up the indexes. 7 00:00:21,300 --> 00:00:28,940 So in future videos, when we start applying machine learning algorithms to our data, one of the most 8 00:00:28,940 --> 00:00:32,470 important steps is creating a training, validation and test set. 9 00:00:33,110 --> 00:00:38,690 And one of the steps to doing that is randomizing the order your data comes in. 10 00:00:38,690 --> 00:00:43,180 Because what you want to avoid is the order that these samples come in 11 00:00:43,310 --> 00:00:48,830 having some kind of influence on the patterns in the data when our machine learning algorithm is learning 12 00:00:48,830 --> 00:00:49,580 things. 13 00:00:49,580 --> 00:00:54,500 We want it to be as general as possible, so we don't want it to care about the order. 14 00:00:54,500 --> 00:01:00,080 Now there are some exceptions to the rule with time series data and whatnot, but I'm just going to show 15 00:01:00,080 --> 00:01:07,070 you this example if you wanted to shuffle rows in a data frame. So let's do car sales 16 00:01:07,090 --> 00:01:16,580 dot sample. Now sample is basically saying take a sample from this data frame, and this parameter, 17 00:01:16,580 --> 00:01:19,940 frac, is the fraction of the data frame to sample. 18 00:01:20,060 --> 00:01:22,700 So it's going to be in decimal form. 19 00:01:22,700 --> 00:01:27,380 So for example 0.5 is 50 percent of the data. 
20 00:01:27,380 --> 00:01:33,800 So it's going to get five rows out of 10, but in this case we want to shuffle everything so we will use 21 00:01:33,800 --> 00:01:37,540 one for 100 percent of the data. There we go. 22 00:01:37,540 --> 00:01:40,990 So now you can see that the index has been shuffled. 23 00:01:41,000 --> 00:01:46,180 So this is essentially the same data frame just with the order mixed around. 24 00:01:46,190 --> 00:01:51,440 Now the important thing to remember with pandas is that the information in a row will stay with its 25 00:01:51,440 --> 00:02:00,050 row. So row nine: Nissan, white, 31,600 kilometers, four doors. 26 00:02:00,050 --> 00:02:01,760 So the rows stay the same. 27 00:02:01,790 --> 00:02:08,570 All that's changing with this function sample is the order that things appear in. And now let's have 28 00:02:08,570 --> 00:02:14,370 a look at the data frame again. Silly me, we keep forgetting to reassign. 29 00:02:14,620 --> 00:02:18,370 So we want to reassign here, equals. 30 00:02:18,400 --> 00:02:19,930 There we go. 31 00:02:19,930 --> 00:02:25,090 Now it's definitely going to shuffle the car sales data frame. Or maybe we create a new one, shuffled, 32 00:02:25,120 --> 00:02:28,060 because we want to keep the original one in order. 33 00:02:28,060 --> 00:02:34,990 So we've got car sales shuffled. This returns a copy, but car sales shuffled will be in a different order 34 00:02:34,990 --> 00:02:37,220 to the original car sales data frame. 35 00:02:37,360 --> 00:02:38,260 Beautiful. 36 00:02:38,260 --> 00:02:42,910 Now if we wanted to manipulate this shuffled data frame we could call the same functions we've called 37 00:02:42,910 --> 00:02:50,980 before on the shuffled version. Another handy thing about the sample function: say for example 38 00:02:50,980 --> 00:02:53,920 you had a data frame with two million rows. 39 00:02:53,920 --> 00:02:55,890 Ours is really small, it's only 10. 
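The shuffle described above can be sketched roughly as follows. The small `car_sales` frame is a hypothetical stand-in for the one in the video (the column names and values are assumptions), and `random_state` is an addition here so the shuffle is repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame used in the video.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"],
    "Colour": ["White", "Red", "Blue", "Black", "White"],
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095],
    "Doors": [4, 4, 3, 5, 4],
})

# frac=1 samples 100% of the rows without replacement, which amounts to a shuffle.
# random_state is not used in the video; it only makes the result repeatable.
car_sales_shuffled = car_sales.sample(frac=1, random_state=42)

# sample() returns a new data frame, so the original keeps its order.
print(car_sales_shuffled)
print(car_sales.index.tolist())
```

Each row keeps its original index label and its own values; only the order of the rows changes.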
40 00:02:55,900 --> 00:03:02,020 But in practice you're going to be working on data sets with a lot more rows. Sometimes running functions 41 00:03:02,020 --> 00:03:09,670 in pandas takes a long time on millions of different rows. What you might want to do is practice on 42 00:03:09,680 --> 00:03:11,780 only 20 percent of the data. 43 00:03:12,490 --> 00:03:15,880 Only select 20 percent of data. 44 00:03:15,880 --> 00:03:18,120 And now this number could be arbitrary, right? 45 00:03:18,130 --> 00:03:20,310 In our case we've chosen 20 percent. 46 00:03:20,320 --> 00:03:23,030 So that's giving us two different rows. 47 00:03:23,110 --> 00:03:27,630 If you had two million rows maybe you want to practice on 1 percent of the data. 48 00:03:27,640 --> 00:03:30,400 So that's still 20,000 rows. 49 00:03:30,400 --> 00:03:36,460 And so that will allow you to do lots of different experiments a lot quicker than doing it all on two 50 00:03:36,460 --> 00:03:38,360 million rows at one time. 51 00:03:38,380 --> 00:03:42,730 So that's something you'll have to consider in the future in your projects when you're working on larger 52 00:03:42,730 --> 00:03:49,280 amounts of data: is your computer powerful enough to continually run functions on millions of rows? 53 00:03:49,420 --> 00:03:55,930 Or do you want to flesh out some little experiments on smaller amounts of data first and then upgrade 54 00:03:55,930 --> 00:04:00,550 those experiments as you figure out what works and what doesn't? 55 00:04:00,560 --> 00:04:04,250 Now we look at our car sales shuffled data again, 56 00:04:07,540 --> 00:04:09,800 we've got our indexes but they're all out of order. 57 00:04:09,800 --> 00:04:11,640 How would we get those back into order? 58 00:04:11,700 --> 00:04:18,430 We go car sales shuffled dot reset index. This function, as you could imagine, is going to reset the 59 00:04:18,430 --> 00:04:19,720 index here. 
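As a rough illustration of practising on a fraction of a large data set, here is a minimal sketch. The frame `big_df` and its single `value` column are made up for the example, and `random_state` is again only there for repeatability.

```python
import pandas as pd

# Hypothetical two-million-row data frame standing in for a large data set.
big_df = pd.DataFrame({"value": range(2_000_000)})

# frac=0.01 keeps a random 1% of the rows for quicker experiments.
small_df = big_df.sample(frac=0.01, random_state=0)

print(len(big_df))    # 2000000
print(len(small_df))  # 20000
```

Once an experiment works on `small_df`, the same code can be pointed at the full frame.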
60 00:04:19,720 --> 00:04:29,720 So we want, hmm. Now there's a way here that you can remove this column, so maybe we go in place. 61 00:04:29,960 --> 00:04:31,260 Let's have a look. 62 00:04:31,280 --> 00:04:33,700 Did that do that in place? Shuffled. 63 00:04:33,890 --> 00:04:43,030 Because by default the reset index function adds the old index as a new column on the very left. 64 00:04:43,040 --> 00:04:48,020 And so we've got all the numbers here, but this is the shuffled index that it's made a column. Maybe we 65 00:04:48,020 --> 00:04:49,140 don't want that. 66 00:04:49,220 --> 00:04:54,760 I think we can get rid of that by in place equals true. 67 00:04:54,770 --> 00:05:01,150 Remember a lot of pandas functions have the inplace parameter. Shuffled, in place equals true. 68 00:05:01,350 --> 00:05:03,230 It still has that. 69 00:05:03,740 --> 00:05:10,310 Well, the way we figure this out is if we look at the documentation for reset index. Let's do that: pandas 70 00:05:10,430 --> 00:05:16,320 reset index. I don't want that extra index column there. 71 00:05:17,150 --> 00:05:24,750 So let's see, drop: "Do not try to insert index into dataframe columns." 72 00:05:24,750 --> 00:05:26,700 I think that's what we need. 73 00:05:26,700 --> 00:05:27,900 Is there an example? 74 00:05:28,080 --> 00:05:29,740 Drop equals true. 75 00:05:29,750 --> 00:05:29,970 Yeah. 76 00:05:30,210 --> 00:05:31,620 I think that's what we need. 77 00:05:31,620 --> 00:05:32,900 So we want in place equals 78 00:05:32,940 --> 00:05:42,590 true and we want drop equals true. Drop is false by default. Yep, default false. 79 00:05:42,590 --> 00:05:46,920 So this is another example of how you can look up what a function does in pandas. 80 00:05:46,920 --> 00:05:53,610 I just googled the function name, reset index. Again, you won't know these things from the start, but it's 81 00:05:53,700 --> 00:05:55,400 not about knowing things off by heart. 
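To make the effect of the `drop` parameter concrete, here is a small sketch. The three-row `shuffled` frame is a made-up example rather than the frame from the video, and it returns new data frames instead of using `inplace=True` so both results can be compared side by side.

```python
import pandas as pd

# Hypothetical shuffled data frame: the index labels are out of order.
shuffled = pd.DataFrame({"Make": ["Toyota", "Honda", "BMW"]}, index=[2, 0, 1])

# Default behaviour (drop=False): the old index is kept as a new "index" column.
with_old_index = shuffled.reset_index()
print(with_old_index.columns.tolist())  # ['index', 'Make']

# drop=True discards the old index and leaves only the fresh 0..n-1 index.
clean = shuffled.reset_index(drop=True)
print(clean.columns.tolist())  # ['Make']
print(clean.index.tolist())    # [0, 1, 2]
```

One thing to keep in mind: if an earlier in-place call has already inserted the old index as a column, passing `drop=True` on a later call will not remove that column, so the shuffled frame has to be recreated first.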
82 00:05:55,410 --> 00:05:56,850 I don't know everything off by heart. 83 00:05:56,850 --> 00:06:03,890 It's about researching different ways that functions work and trying things out. If in doubt, run the 84 00:06:03,890 --> 00:06:04,490 code. 85 00:06:04,550 --> 00:06:06,090 Let's do that. 86 00:06:06,260 --> 00:06:08,690 We still have that there. 87 00:06:10,400 --> 00:06:12,700 Maybe we come back up here. 88 00:06:12,980 --> 00:06:15,040 We're going to reset this, reset 89 00:06:15,040 --> 00:06:21,030 this, reset this. There we go. 90 00:06:21,930 --> 00:06:25,060 We had to reset our car sales shuffled data frame. 91 00:06:25,080 --> 00:06:27,240 Now we've set drop to equal true. 92 00:06:27,240 --> 00:06:29,280 We don't have that index column on the left. 93 00:06:29,280 --> 00:06:35,550 Here is one more thing we're going to look at, and that's how to apply a function to a column. 94 00:06:35,610 --> 00:06:39,640 So let's see, we'll have a look at our car sales data frame. 95 00:06:39,660 --> 00:06:45,870 Remember this is the process you're going to be taking a lot: viewing your data frame and then manipulating 96 00:06:45,870 --> 00:06:48,750 it, viewing it, manipulating it, viewing it, manipulating it. 97 00:06:48,870 --> 00:06:53,940 So it's looking pretty fine at the moment but we want to convert our odometer to miles. 98 00:06:53,940 --> 00:06:58,640 So how would we do that? Car sales. First we have to select it. 99 00:06:58,930 --> 00:07:04,650 You do that by typing in the column name, and then we're going to reassign it. Remember to reassign 100 00:07:04,650 --> 00:07:08,450 it this time. Odometer kilometres. 101 00:07:08,490 --> 00:07:09,570 Yep. 102 00:07:09,600 --> 00:07:17,460 Now this is the apply function. Apply lets you apply some kind of function, whether it be NumPy or lambda, 103 00:07:17,760 --> 00:07:19,300 to a certain column. 104 00:07:19,350 --> 00:07:23,820 So let's just type this in. And lambda functions can be confusing to begin with. 
105 00:07:23,820 --> 00:07:28,560 Took me a while to learn them, but we're going to step through this one and figure out what it's actually 106 00:07:28,560 --> 00:07:37,390 doing. So if in doubt, run the code. We want car sales to view the whole data frame. Run the code. 107 00:07:37,390 --> 00:07:37,960 There we go. 108 00:07:38,800 --> 00:07:40,890 Now what do you think this has just done? 109 00:07:40,890 --> 00:07:42,690 These numbers have changed. 110 00:07:42,830 --> 00:07:44,440 One hundred and fifty thousand 111 00:07:44,610 --> 00:07:47,820 is now 93,776. 112 00:07:48,210 --> 00:07:51,610 Well, let's step through this. Odometer column, tick. 113 00:07:51,630 --> 00:07:52,390 Odometer column, 114 00:07:52,410 --> 00:07:53,150 tick. 115 00:07:53,220 --> 00:07:54,350 Dot apply. 116 00:07:54,430 --> 00:08:01,830 Yeah, lambda. Now lambda is a keyword in Python which defines an anonymous function. 117 00:08:01,860 --> 00:08:08,330 So basically this is saying apply this function, x divided by 1.6. 118 00:08:08,380 --> 00:08:09,250 So that's what this is saying. 119 00:08:09,270 --> 00:08:20,210 Apply, reassign x to be equal to x divided by 1.6. Why 1.6? Because one mile 120 00:08:20,210 --> 00:08:25,800 is about 1.609 kilometers. 121 00:08:25,860 --> 00:08:27,690 But I've just rounded it to 1.6. 122 00:08:27,690 --> 00:08:32,700 That's going to divide our number of kilometers by 1.6. 123 00:08:32,700 --> 00:08:38,110 Because in this case x is each of the values in our odometer column. 124 00:08:38,130 --> 00:08:43,710 So this little line here is going to look at this column and go, every time, it's going to pick up this 125 00:08:43,710 --> 00:08:50,550 hundred fifty thousand, which is x, and then divide it by 1.6 and then reassign it to that value. 
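The conversion stepped through above might look roughly like this. The two-row `car_sales` frame and the column name `Odometer (KM)` are assumptions for the example, and the named function `km_to_miles` is a hypothetical helper added to show that `apply` also accepts an ordinary function.

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame (column name is an assumption).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda"],
    "Odometer (KM)": [150043, 87899],
})

# Keep a copy of the kilometre values so the two approaches can be compared.
odometer_km = car_sales["Odometer (KM)"].copy()

# apply() calls the lambda once per value in the column; x is a single odometer reading.
# The result has to be reassigned, otherwise the column is left unchanged.
car_sales["Odometer (KM)"] = car_sales["Odometer (KM)"].apply(lambda x: x / 1.6)
print(car_sales)


def km_to_miles(km):
    """Hypothetical helper: convert kilometres to miles using the rounded 1.6 factor."""
    return km / 1.6


# A named function written elsewhere can be passed to apply() in the same way.
odometer_miles = odometer_km.apply(km_to_miles)
```

With these values the first reading becomes 93776.875 and the second 54936.875, in line with the numbers read out in the video. The column still carries its kilometre name after the conversion, so renaming it would be a sensible follow-up.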
126 00:08:50,550 --> 00:08:57,540 So the first column, 94,000 or thereabouts, first row, sorry. Second row, 87,899 127 00:08:57,540 --> 00:09:01,500 is actually about 55,000 miles. 128 00:09:01,500 --> 00:09:03,810 So that's what this apply function is doing. 129 00:09:03,870 --> 00:09:10,020 And now this is just a simple example of a little lambda function, but you can get very elaborate with 130 00:09:10,020 --> 00:09:13,250 the things you put in apply. We're not going to dive too deep into that. 131 00:09:13,270 --> 00:09:19,510 For now I want you to know that apply is a way that you can apply a function to a column. 132 00:09:19,560 --> 00:09:26,680 You might have written another function somewhere else and then use apply to apply it to a column. Whew, 133 00:09:27,460 --> 00:09:32,070 we have been through an absolute mountain of information here. 134 00:09:32,290 --> 00:09:36,220 And don't fret, you don't have to know all of this off by heart to begin with. 135 00:09:36,220 --> 00:09:37,570 Remember the steps. 136 00:09:37,570 --> 00:09:39,070 Try it. 137 00:09:39,100 --> 00:09:43,950 Run your code. If in doubt, search for it. 138 00:09:44,300 --> 00:09:46,070 Try again. 139 00:09:46,070 --> 00:09:47,900 And then if you're still stuck, 140 00:09:48,350 --> 00:09:54,800 ask. In the meantime, before we get into the next section, this is going to wrap up the pandas section. 141 00:09:54,800 --> 00:10:00,290 Check out the resources section for a little bit more documentation about pandas and 142 00:10:00,290 --> 00:10:06,560 some exercises you can try out for yourself. But otherwise, if you're with me, let's jump in and learn 143 00:10:06,560 --> 00:10:11,240 a bit more about some other tools that we can use for data science and machine learning.