1
00:00:02,070 --> 00:00:06,480
We have been through a lot but we're going to persevere.

2
00:00:06,500 --> 00:00:07,560
We're going to get through this together.

3
00:00:07,560 --> 00:00:09,330
We've got a few more functions to go through.

4
00:00:10,010 --> 00:00:13,730
So the next one is sample.

5
00:00:13,730 --> 00:00:17,630
Now this is a great way to shuffle your data frame.

6
00:00:17,700 --> 00:00:21,130
What I mean by that is mix up the indexes.

7
00:00:21,300 --> 00:00:28,940
So in future videos when we start applying machine learning algorithms to our data, one of the most

8
00:00:28,940 --> 00:00:32,470
important steps is creating a training validation and test set.

9
00:00:33,110 --> 00:00:38,690
And one of the steps to doing that is randomizing the order your data comes in.

10
00:00:38,690 --> 00:00:43,180
Because what you want to avoid is the order that these samples come in

11
00:00:43,310 --> 00:00:48,830
having some kind of influence on patterns in the data when our machine learning algorithm is learning

12
00:00:48,830 --> 00:00:49,580
things.

13
00:00:49,580 --> 00:00:54,500
We want it to be as general as possible so we don't want it to care about the order.

14
00:00:54,500 --> 00:01:00,080
Now there are some exceptions to the rule with time series data and whatnot but I'm just going to show

15
00:01:00,080 --> 00:01:07,070
you this example if you wanted to shuffle rows in a data frame so let's do car sales.

16
00:01:07,090 --> 00:01:16,580
dot sample, and now sample is basically saying take a sample from this data frame. This parameter

17
00:01:16,580 --> 00:01:19,940
frac is the fraction of that sample.

18
00:01:20,060 --> 00:01:22,700
So it's gonna be in decimal form.

19
00:01:22,700 --> 00:01:27,380
So for example zero point five is 50 percent of the data.

20
00:01:27,380 --> 00:01:33,800
So it's gonna get five rows out of 10 but we in this case want to shuffle everything so we will use

21
00:01:33,800 --> 00:01:37,540
one for 100 percent of the data. There we go.

22
00:01:37,540 --> 00:01:40,990
So now you can see that the index has been shuffled.

23
00:01:41,000 --> 00:01:46,180
So this is essentially the same data frame just with the order mixed around.

24
00:01:46,190 --> 00:01:51,440
Now the important thing to remember with pandas is that the information in a row will stay with its

25
00:01:51,440 --> 00:02:00,050
row. So row nine: Nissan, white, thirty one thousand six hundred kilometers, four doors.

26
00:02:00,050 --> 00:02:01,760
So the rows stay the same.

27
00:02:01,790 --> 00:02:08,570
All that's changing with this function sample is the order that things appear in and now let's have

28
00:02:08,570 --> 00:02:14,370
a look at the data frame again. Silly me, we keep forgetting to reassign.

29
00:02:14,620 --> 00:02:18,370
So we want to reassign here equals.

30
00:02:18,400 --> 00:02:19,930
There we go.

31
00:02:19,930 --> 00:02:25,090
Now it's going to definitely shuffle the car sales data frame or maybe we create a new one shuffled

32
00:02:25,120 --> 00:02:28,060
because we want to keep the original one in order.

33
00:02:28,060 --> 00:02:34,990
So we've got car sales shuffled. This returns a copy, but car sales shuffled will be in a different order

34
00:02:34,990 --> 00:02:37,220
to the original car sales data frame.

35
00:02:37,360 --> 00:02:38,260
Beautiful.
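
The shuffle described above can be sketched roughly like this. The column names and values are hypothetical stand-ins for the car sales data frame from the video, and `random_state` is an optional seed added here only so the shuffle is repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the car sales data frame used in the video.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"],
    "Colour": ["White", "Red", "Blue", "Black", "White"],
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095],
})

# frac=1 samples 100 percent of the rows, which amounts to a full shuffle.
# random_state is an assumption added for repeatability; it is optional.
car_sales_shuffled = car_sales.sample(frac=1, random_state=42)

# sample() returns a new data frame, so the original keeps its order.
# Each row keeps its own index label and its own values.
print(car_sales_shuffled)
print(car_sales.index.tolist())
```

Assigning the result to a new name such as `car_sales_shuffled` keeps the original data frame in order, which is the choice made in the video.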

36
00:02:38,260 --> 00:02:42,910
Now if we wanted to manipulate this shuffled data frame we could call the same functions we've called

37
00:02:42,910 --> 00:02:50,980
before on the shuffled version and another handy thing about the sample function is that say for example

38
00:02:50,980 --> 00:02:53,920
you had like a data frame with two million rows.

39
00:02:53,920 --> 00:02:55,890
Ours is really small it's only 10.

40
00:02:55,900 --> 00:03:02,020
But in practice you're going to be working on data sets with a lot more rows sometimes running functions

41
00:03:02,020 --> 00:03:09,670
in pandas takes a long time on millions of different rows. What you might want to do is practice on

42
00:03:09,680 --> 00:03:11,780
only 20 percent of the data.

43
00:03:12,490 --> 00:03:15,880
Only select 20 percent of data.

44
00:03:15,880 --> 00:03:18,120
And now this number could be arbitrary right.

45
00:03:18,130 --> 00:03:20,310
In our case we've chosen 20 percent.

46
00:03:20,320 --> 00:03:23,030
So that's giving us two different rows.

47
00:03:23,110 --> 00:03:27,630
If you had two million rows maybe you want to practice on 1 percent of the data.

48
00:03:27,640 --> 00:03:30,400
So that's still 20000 rows.

49
00:03:30,400 --> 00:03:36,460
And so that will allow you to do lots of different experiments a lot quicker than doing it all on two

50
00:03:36,460 --> 00:03:38,360
million rows at one time.
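
As a rough sketch of the idea above, here is `sample` with a smaller `frac` on a 10 row data frame and on a 2 million row one. Both data frames are made up for illustration (a single hypothetical `value` column), and `random_state` is again just an optional seed.

```python
import pandas as pd

# Made-up data frames standing in for a small and a large dataset.
small = pd.DataFrame({"value": range(10)})
large = pd.DataFrame({"value": range(2_000_000)})

# 20 percent of 10 rows is 2 rows.
small_sample = small.sample(frac=0.2, random_state=0)

# 1 percent of 2 million rows is 20,000 rows.
large_sample = large.sample(frac=0.01, random_state=0)

print(len(small_sample), len(large_sample))
```

Experiments can then run on `large_sample` first and move to the full data frame once you know what works.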

51
00:03:38,380 --> 00:03:42,730
So that's something you'll have to consider in the future in your projects when you're working on larger

52
00:03:42,730 --> 00:03:49,280
amounts of data is your computer powerful enough to continually run functions on millions of rows.

53
00:03:49,420 --> 00:03:55,930
Or do you want to flesh out some little experiments on smaller amounts of data first and then upgrade

54
00:03:55,930 --> 00:04:00,550
those experiments as you figure out what works and what doesn't.

55
00:04:00,560 --> 00:04:04,250
Now we look at our car sales shuffled data again

56
00:04:07,540 --> 00:04:09,800
we've got our indexes but they're all out of order.

57
00:04:09,800 --> 00:04:11,640
How would we get those back into order.

58
00:04:11,700 --> 00:04:18,430
We go car sales shuffled dot reset index. This function, as you could imagine, is going to reset the

59
00:04:18,430 --> 00:04:19,720
index here.

60
00:04:19,720 --> 00:04:29,720
So we run that, and now there's a way here that you can remove this column, so maybe we go inplace.

61
00:04:29,960 --> 00:04:31,260
Let's have a look.

62
00:04:31,280 --> 00:04:33,700
Did that do that in place shuffled.

63
00:04:33,890 --> 00:04:43,030
Because by default the reset index function adds the old index as a new column on the very left.

64
00:04:43,040 --> 00:04:48,020
And so we've got all the numbers here but this is the shuffled index that it's made a column maybe we

65
00:04:48,020 --> 00:04:49,140
don't want that.

66
00:04:49,220 --> 00:04:54,760
I think we can get rid of that by in place equals true.

67
00:04:54,770 --> 00:05:01,150
Remember, a lot of pandas functions have the inplace parameter. Shuffled, inplace equals true.

68
00:05:01,350 --> 00:05:03,230
It still has that.

69
00:05:03,740 --> 00:05:10,310
Well the way we figure this out is if we look at the documentation for reset index let's do that pandas

70
00:05:10,430 --> 00:05:16,320
reset index I don't want that extra index column there.

71
00:05:17,150 --> 00:05:24,750
So let's see, drop: do not try to insert index into the data frame columns.

72
00:05:24,750 --> 00:05:26,700
I think that's what we need.

73
00:05:26,700 --> 00:05:27,900
Is there an example.

74
00:05:28,080 --> 00:05:29,740
Drop equals true.

75
00:05:29,750 --> 00:05:29,970
Yeah.

76
00:05:30,210 --> 00:05:31,620
I think that's what we need.

77
00:05:31,620 --> 00:05:32,900
So we want inplace equals

78
00:05:32,940 --> 00:05:42,590
true and we want drop equals true. Drop is false by default, yeah, default false.

79
00:05:42,590 --> 00:05:46,920
So this is another example of how you can look up what a function does in pandas.

80
00:05:46,920 --> 00:05:53,610
I just googled the function name, reset index. Again, you won't know these things from the start, but it's

81
00:05:53,700 --> 00:05:55,400
not about knowing things off by heart.

82
00:05:55,410 --> 00:05:56,850
I don't know everything off by heart.

83
00:05:56,850 --> 00:06:03,890
It's about researching different ways that functions work and trying things out. If in doubt, run the

84
00:06:03,890 --> 00:06:04,490
code.

85
00:06:04,550 --> 00:06:06,090
Let's do that.

86
00:06:06,260 --> 00:06:08,690
We still have that there.

87
00:06:10,400 --> 00:06:12,700
Maybe we come back up here.

88
00:06:12,980 --> 00:06:15,040
We're going to reset this reset.

89
00:06:15,040 --> 00:06:21,030
This reset this there we go.

90
00:06:21,930 --> 00:06:25,060
We had to reset our car sales shuffled data frame.

91
00:06:25,080 --> 00:06:27,240
Now we've set drop to equal true.

92
00:06:27,240 --> 00:06:29,280
We don't have that index column on the left.
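
Here is a small sketch of the difference `drop` makes. The data frame is a hypothetical stand-in, and the variable names `with_old_index` and `clean` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in data, shuffled the same way as in the video.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW", "Nissan"],
    "Doors": [4, 4, 5, 4],
})
car_sales_shuffled = car_sales.sample(frac=1, random_state=0)

# Default (drop=False): the old index is kept as a new column named "index".
with_old_index = car_sales_shuffled.reset_index()

# drop=True: the old index is discarded and a fresh 0..n-1 index is used.
clean = car_sales_shuffled.reset_index(drop=True)

print(with_old_index.columns.tolist())
print(clean.columns.tolist())
print(clean.index.tolist())
```

One likely reason the extra column stuck around in the video: once `reset_index(inplace=True)` has run without `drop=True`, the old index is already an ordinary column, so a later `drop=True` call has nothing left to drop. Re-creating the shuffled data frame first, as done above, avoids that.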

93
00:06:29,280 --> 00:06:35,550
Here is one more thing we're going to look at and that's how to apply a function to a column.

94
00:06:35,610 --> 00:06:39,640
So let's say, well, we'll look at our car sales data frame.

95
00:06:39,660 --> 00:06:45,870
Remember this is the process you're going to be taking a lot is viewing your data frame and then manipulating

96
00:06:45,870 --> 00:06:48,750
it viewing it manipulating it viewing it manipulating it.

97
00:06:48,870 --> 00:06:53,940
So it's looking pretty good at the moment, but we want to convert our odometer to miles.

98
00:06:53,940 --> 00:06:58,640
So how would we do that car sales first we have to select it.

99
00:06:58,930 --> 00:07:04,650
You do that by typing in the column name and then we're going to reassign it, remember to reassign

100
00:07:04,650 --> 00:07:08,450
at this time odometer kilometres.

101
00:07:08,490 --> 00:07:09,570
Yep.

102
00:07:09,600 --> 00:07:17,460
Now this is the apply function. Apply lets you apply some kind of function, whether it be NumPy or lambda,

103
00:07:17,760 --> 00:07:19,300
to a certain column.

104
00:07:19,350 --> 00:07:23,820
So let's just type this in and lambda functions can be confusing to begin with.

105
00:07:23,820 --> 00:07:28,560
Took me a while to learn them but we're going to step through this one and figure out what it's actually

106
00:07:28,560 --> 00:07:37,390
doing so if in doubt run the code we want car sales to view the whole data frame run the code.

107
00:07:37,390 --> 00:07:37,960
There we go.

108
00:07:38,800 --> 00:07:40,890
Now what do you think this is just done.

109
00:07:40,890 --> 00:07:42,690
These numbers have changed.

110
00:07:42,830 --> 00:07:44,440
One hundred and fifty thousand.

111
00:07:44,610 --> 00:07:47,820
It is now ninety three thousand seven hundred seventy six.

112
00:07:48,210 --> 00:07:51,610
Well let's step through this odometer column tick.

113
00:07:51,630 --> 00:07:52,390
Odometer column.

114
00:07:52,410 --> 00:07:53,150
Tick.

115
00:07:53,220 --> 00:07:54,350
Dot apply.

116
00:07:54,430 --> 00:08:01,830
Yeah, lambda. Now lambda is a keyword in Python which is short for an anonymous function.

117
00:08:01,860 --> 00:08:08,330
So basically this is saying apply this function to X divided by one point six.

118
00:08:08,380 --> 00:08:09,250
So that's what this is saying.

119
00:08:09,270 --> 00:08:20,210
Apply, reassign x to be equal to x divided by one point six. Why one point six? Because the conversion from kilometers

120
00:08:20,210 --> 00:08:25,800
to miles is about one point six zero nine kilometers per mile.

121
00:08:25,860 --> 00:08:27,690
But I've just rounded it to one point six.

122
00:08:27,690 --> 00:08:32,700
That's going to divide our number of kilometers by one point six.

123
00:08:32,700 --> 00:08:38,110
Because in this case x is the values in our odometer column.

124
00:08:38,130 --> 00:08:43,710
So this little line here is going to look at this column and go every time it's going to pick up this

125
00:08:43,710 --> 00:08:50,550
hundred fifty thousand which is x and then divide it by one point six and then reassign it to that value.

126
00:08:50,550 --> 00:08:57,540
So the first column 94000 thereabouts first row sorry second row eighty seven thousand eight hundred

127
00:08:57,540 --> 00:09:01,500
ninety nine is actually about 55000 miles.

128
00:09:01,500 --> 00:09:03,810
So that's what this apply function is doing.
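
The conversion walked through above might look something like this. The column name `Odometer (KM)` and the odometer values are assumptions for illustration, and `km_to_miles` is a hypothetical named function added to show that `apply` also accepts a function written elsewhere.

```python
import pandas as pd

# Hypothetical stand-in for the odometer column of the car sales data frame.
car_sales = pd.DataFrame({"Odometer (KM)": [150043, 87899, 32549]})

# apply() calls the lambda once per value; x is a single odometer reading.
# 1.6 is the rounded conversion factor used in the video (about 1.609 km per mile).
miles_lambda = car_sales["Odometer (KM)"].apply(lambda x: x / 1.6)


def km_to_miles(km):
    """Hypothetical helper doing the same conversion as the lambda above."""
    return km / 1.6


# A named function can be passed to apply() in exactly the same way.
miles_named = car_sales["Odometer (KM)"].apply(km_to_miles)

# Reassign so the data frame keeps the converted values.
car_sales["Odometer (KM)"] = miles_lambda
print(car_sales)
```

For simple arithmetic like this, `car_sales["Odometer (KM)"] / 1.6` would give the same numbers without `apply`; the lambda version is shown because it mirrors the step by step explanation.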

129
00:09:03,870 --> 00:09:10,020
And now this is just a simple example of a little lambda function but you can get very elaborate with

130
00:09:10,020 --> 00:09:13,250
the things you put in apply. I'm not gonna dive too deep into that.

131
00:09:13,270 --> 00:09:19,510
For now I want you to know that apply is a way that you can apply a function to a column.

132
00:09:19,560 --> 00:09:26,680
You might have written another function somewhere else and then use apply to apply it to a column. Phew,

133
00:09:27,460 --> 00:09:32,070
we have been through an absolute mountain of information here.

134
00:09:32,290 --> 00:09:36,220
And don't fret, you don't have to know all of this off by heart to begin with.

135
00:09:36,220 --> 00:09:37,570
Remember the steps.

136
00:09:37,570 --> 00:09:39,070
Try it.

137
00:09:39,100 --> 00:09:43,950
Run your code if in doubt search for it.

138
00:09:44,300 --> 00:09:46,070
Try again.

139
00:09:46,070 --> 00:09:47,900
And then if you're still stuck.

140
00:09:48,350 --> 00:09:54,800
Ask. In the meantime, before we get into the next section, this is going to wrap up the pandas section.

141
00:09:54,800 --> 00:10:00,290
Check out the resources section for a little bit more documentation about pandas.

142
00:10:00,290 --> 00:10:06,560
Some exercises you can try out for yourself but otherwise if you're with me let's jump in and learn

143
00:10:06,560 --> 00:10:11,240
a bit more about some other tools that we can use for data science and machine learning.