1 00:00:01,130 --> 00:00:07,730 By now you're probably thinking, "My goodness, there is so much to remember," and I'm not going to lie. 2 00:00:07,730 --> 00:00:10,080 I felt the same way when I first started. 3 00:00:10,100 --> 00:00:15,240 I went through some information on pandas and realized, holy goodness, there is... 4 00:00:15,260 --> 00:00:17,470 Well, I didn't actually say "goodness"; I said a different word there. 5 00:00:17,480 --> 00:00:21,800 You can probably imagine what word. But I said, there's so much here. 6 00:00:21,800 --> 00:00:24,430 How am I going to remember all of these functions off by heart? 7 00:00:24,500 --> 00:00:29,570 And the point to remember here is that you don't necessarily need to know all of this off by heart. 8 00:00:29,570 --> 00:00:34,790 The only reason I can kind of do this on the fly is because I've had a little bit of practice. 9 00:00:34,880 --> 00:00:37,700 Don't forget to always try to write the code yourself. 10 00:00:37,700 --> 00:00:40,100 You can look up and find things like this. 11 00:00:40,100 --> 00:00:45,050 This is part of the research. Part of being a data scientist, part of being a machine learning engineer, 12 00:00:45,380 --> 00:00:50,150 is learning how to ask different questions, thinking about "how can I do this with the data?", and then trying 13 00:00:50,150 --> 00:00:51,270 to find out. 14 00:00:51,290 --> 00:00:56,480 With that being said, let's have a look in this next little section at how to manipulate data with 15 00:00:56,480 --> 00:00:57,330 pandas. 16 00:00:57,350 --> 00:01:02,420 So we'll put another little heading like we've been doing, maybe "Manipulating data". 17 00:01:02,420 --> 00:01:03,520 That sounds a bit better. 18 00:01:03,530 --> 00:01:10,340 Escape, then press M for Markdown, because we're staying communicative. Excellent. 19 00:01:10,470 --> 00:01:16,580 And now again, the beautiful thing about pandas is that if you can think of it, you can probably do it.
20 00:01:17,150 --> 00:01:22,160 So the first thing we'll see in manipulating data is some string methods. 21 00:01:22,190 --> 00:01:28,120 So Python has a bunch of different string methods, like lower. What do you think this will do if we go 22 00:01:28,130 --> 00:01:39,200 car sales, Make, .str.lower()? Let's hit Shift and Enter. It's lowered the string, because its data 23 00:01:39,200 --> 00:01:44,220 type is object according to pandas, but it's actually also a Python string. 24 00:01:44,360 --> 00:01:51,150 So anything that you can do on strings in Python, you can do on string columns in pandas. 25 00:01:51,280 --> 00:01:52,940 There's a bunch of different functions to remember. 26 00:01:52,960 --> 00:01:59,480 Well, just remember that you can access the string values of a column by doing .str. 27 00:01:59,480 --> 00:02:05,220 We did that above in the Price column, where we converted it from a string to an integer. 28 00:02:05,220 --> 00:02:11,980 Now if we have a look at our car sales data frame again... wait, didn't we just lower the string? 29 00:02:13,030 --> 00:02:15,490 But it didn't actually save. 30 00:02:15,490 --> 00:02:22,540 Now here's an important concept to remember: if you want to change a column, pandas 31 00:02:22,540 --> 00:02:25,030 requires reassigning that column. 32 00:02:25,450 --> 00:02:34,030 So that means we want to set car sales Make equals car sales Make, and then we want to do .str 33 00:02:34,330 --> 00:02:35,980 .lower(). 34 00:02:35,980 --> 00:02:40,900 There we go: car sales Make .str.lower(). 35 00:02:41,950 --> 00:02:49,190 Now what this will do is reassign the car sales Make column to be the lowered string. 36 00:02:49,240 --> 00:02:50,890 So let's have a look. 37 00:02:50,980 --> 00:02:51,910 There we go. 38 00:02:53,270 --> 00:02:54,430 The string has been lowered. 39 00:02:54,500 --> 00:02:57,630 So now all the letters in here are lowercase.
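The reassignment idea described here can be sketched in a few lines. This is a minimal illustration rather than the notebook from the video: the small data frame and the exact column name "Make" are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the car sales data frame used in the video.
car_sales = pd.DataFrame({"Make": ["Toyota", "Honda", "BMW"]})

# .str.lower() returns a new, lowercased Series; car_sales itself is untouched.
lowered = car_sales["Make"].str.lower()
still_original = car_sales["Make"].tolist()

# Reassigning the column is what makes the change stick.
car_sales["Make"] = car_sales["Make"].str.lower()

print(still_original)
print(car_sales["Make"].tolist())
```

Printing both lists side by side shows the difference between computing a result and saving it back to the data frame.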
40 00:02:57,770 --> 00:03:01,790 You'll probably come across this a fair few times when you're working on different data frames: 41 00:03:02,090 --> 00:03:09,290 this concept of reassignment. What some functions have, to save you from having to go 42 00:03:09,420 --> 00:03:16,890 "this column equals this column with some sort of function", is a parameter called 43 00:03:17,040 --> 00:03:23,490 inplace. We'll have a look at that inplace parameter along with some missing data. You might be saying, 44 00:03:23,520 --> 00:03:31,510 "Hey, our car sales data frame doesn't have any missing data." But in practice, in the real world, 45 00:03:31,530 --> 00:03:34,190 when you get real datasets... I've made this one up, 46 00:03:34,240 --> 00:03:40,020 right, so I've typed this in here. Real datasets often contain missing data. 47 00:03:40,410 --> 00:03:44,730 So they often look like this, because you might have someone tracking some records. 48 00:03:44,730 --> 00:03:52,380 Ideally they'd look like this, with all the data there, but in reality when you get a new dataset sometimes 49 00:03:52,380 --> 00:03:54,220 it's got a few holes in it. 50 00:03:54,270 --> 00:03:54,710 Right. 51 00:03:54,720 --> 00:03:58,710 Someone hasn't tracked everything correctly, and that's just life. 52 00:03:58,950 --> 00:04:01,020 You don't always get perfection. 53 00:04:01,020 --> 00:04:05,660 So what we're going to do... you remember how to import a CSV file? 54 00:04:05,770 --> 00:04:07,440 I've already exported this one. 55 00:04:07,590 --> 00:04:08,860 So we'll go pd 56 00:04:08,900 --> 00:04:11,120 .read_csv. 57 00:04:11,370 --> 00:04:18,930 We're gonna do Tab autocomplete: car sales missing data .csv. What this is going to do is take the 58 00:04:18,930 --> 00:04:24,840 car sales missing CSV and import it as a pandas data frame.
59 00:04:24,870 --> 00:04:33,210 So what it's gonna do is bring this in as a data frame. Shift and Enter. Now, this is how pandas describes 60 00:04:33,360 --> 00:04:37,740 missing data: N-a-N, or NaN. 61 00:04:37,740 --> 00:04:43,230 That's basically just saying, "Hey, I've got no value here." "Not a number" or something like that; I'm actually 62 00:04:43,230 --> 00:04:43,660 not... 63 00:04:43,680 --> 00:04:50,690 not entirely sure what NaN stands for, but all I know is that it just means there's no value here. 64 00:04:50,800 --> 00:04:51,180 All right. 65 00:04:51,450 --> 00:04:54,800 So what if you wanted to fill these missing values with something? 66 00:04:54,810 --> 00:04:59,970 So say, for example, we wanted to fill the Odometer column, so we're accessing that. 67 00:04:59,970 --> 00:05:07,770 There's a function called fillna, which is "fill these NaN values", and you can pass fillna some 68 00:05:07,770 --> 00:05:11,010 kind of value and it will fill these values. 69 00:05:11,040 --> 00:05:17,430 So maybe we want to fill the Odometer column with the mean of the rest of the Odometer column, 70 00:05:17,790 --> 00:05:22,450 which in reality isn't a very good practice, but just for demonstration's sake, 71 00:05:22,470 --> 00:05:25,590 let's do that: .mean(). 72 00:05:25,590 --> 00:05:33,630 So what this is saying... oh no, we want "missing", because this is our car sales missing. Missing, missing. 73 00:05:33,690 --> 00:05:41,120 There we go. What this is saying is: grab the car sales missing Odometer column, this one, and fillna, 74 00:05:41,120 --> 00:05:48,090 fill the missing values, with the car sales missing Odometer column's mean value. 75 00:05:48,110 --> 00:05:50,960 So from the values that do actually exist there, 76 00:05:51,230 --> 00:05:53,590 we're gonna get the mean and fill with it. 77 00:05:53,660 --> 00:05:55,590 Let's quickly see what that is. 78 00:05:55,590 --> 00:05:57,150 Car sales 79 00:05:57,210 --> 00:05:59,550 missing, Odometer,
80 00:06:02,010 --> 00:06:04,800 .mean(). So, 81 00:06:04,830 --> 00:06:07,740 if we do this correctly, it should fill them up with this value. 82 00:06:08,070 --> 00:06:08,640 Let's go. 83 00:06:08,650 --> 00:06:10,960 There. Beautiful. 84 00:06:10,960 --> 00:06:13,860 Now we've got all those missing values filled. 85 00:06:13,870 --> 00:06:19,400 The third one down is now filled with the car sales missing Odometer mean value. 86 00:06:19,410 --> 00:06:22,540 We've got a few recurrences there for all the missing ones. 87 00:06:22,560 --> 00:06:27,450 So now if we have a look at the data frame... and this is a practice you'll be going through, right: you'll 88 00:06:27,450 --> 00:06:32,760 be making manipulations here and then you'll be consistently viewing the data frame. You might do head, 89 00:06:32,770 --> 00:06:34,760 but because it's only small, 90 00:06:34,830 --> 00:06:36,110 we'll look at the whole thing. 91 00:06:37,950 --> 00:06:39,140 What's happened here? 92 00:06:39,230 --> 00:06:43,170 It's still NaN. This is the problem, right? 93 00:06:43,170 --> 00:06:49,300 This is what I was mentioning before, that inplace value that we can use. 94 00:06:49,350 --> 00:06:50,720 So let's see it in action. 95 00:06:50,730 --> 00:07:00,980 So we could do a reassignment by going car sales missing, Odometer, 96 00:07:06,010 --> 00:07:10,680 equals that, and then reassigning the values to being something like that. 97 00:07:10,690 --> 00:07:14,590 But what we'll do is look at the inplace parameter so you can see what I mean. 98 00:07:14,590 --> 00:07:16,620 So we'll get rid of that. 99 00:07:16,900 --> 00:07:20,420 We're going to copy this line of code so we don't have to retype it out. 100 00:07:20,440 --> 00:07:23,370 But remember, you should always practice retyping code 101 00:07:23,680 --> 00:07:25,570 so you get used to it. 102 00:07:25,660 --> 00:07:28,540 Now, this fillna has a parameter.
103 00:07:28,630 --> 00:07:38,770 And lots of pandas functions have this: inplace equals... it's False by default; we want True. 104 00:07:38,770 --> 00:07:48,130 Actually, it's a good time to look at the documentation. So say we didn't know this fillna function. 105 00:07:48,380 --> 00:07:50,360 What would you do to figure that out? 106 00:07:50,360 --> 00:07:58,280 So maybe we go to Google and go "pandas how to fill missing data in a column". 107 00:08:02,710 --> 00:08:04,270 "Working with missing data". 108 00:08:04,870 --> 00:08:10,630 Yes, this is the pandas documentation. Get very familiar with looking at these. When you first start, looking 109 00:08:10,630 --> 00:08:14,840 at this documentation can be a bit overwhelming because there's a lot going on. 110 00:08:14,980 --> 00:08:20,230 Remember, this documentation is for every single thing that pandas can do, so going through it can take 111 00:08:20,230 --> 00:08:28,980 a while. And if we're looking here, maybe we want to search for "fill", "fill missing values". 112 00:08:28,980 --> 00:08:30,160 There we go. 113 00:08:30,250 --> 00:08:31,700 fillna. 114 00:08:31,800 --> 00:08:33,420 Now often it comes with... 115 00:08:33,420 --> 00:08:37,330 let me zoom in for you... a little hyperlink to the exact function. 116 00:08:37,500 --> 00:08:42,260 "fillna can fill in NA values with non-NA data in a couple of ways." 117 00:08:42,270 --> 00:08:42,690 Excellent. 118 00:08:42,720 --> 00:08:45,230 And then it gives some examples here. 119 00:08:45,330 --> 00:08:47,940 We're going to click on the fillna function and see what happens. 120 00:08:47,940 --> 00:08:54,930 So we've got DataFrame.fillna, which is what we've got so far: data frame, Odometer, because we 121 00:08:54,930 --> 00:09:01,170 only want the Odometer column, .fillna. Yeah, tick for that. value equals None. 122 00:09:01,170 --> 00:09:03,060 So by default the value is None, 123 00:09:03,060 --> 00:09:05,340 but it can take a few different values.
124 00:09:05,520 --> 00:09:10,610 In our case, our value is the mean of the Odometer column. 125 00:09:10,790 --> 00:09:11,980 We'll come back. 126 00:09:12,120 --> 00:09:15,460 We can see this little inplace parameter. 127 00:09:15,570 --> 00:09:18,650 It's got a few others, but we're not going to cover them for now. 128 00:09:18,720 --> 00:09:23,880 So, "inplace: bool, default False. If True, fill in-place. 129 00:09:23,880 --> 00:09:29,840 Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame)." 130 00:09:30,510 --> 00:09:34,830 So if we set inplace to True, it means fill in place. 131 00:09:34,860 --> 00:09:41,360 Let's have a look: inplace equals True. Before we do, we'll take one last look at what our data frame 132 00:09:41,360 --> 00:09:44,760 looks like. Still missing values. 133 00:09:44,890 --> 00:09:46,000 So let's see what happens. 134 00:09:46,010 --> 00:09:49,970 fillna, inplace equals True. 135 00:09:50,000 --> 00:09:52,870 Now let's have another look. 136 00:09:53,180 --> 00:09:56,360 Missing. 137 00:09:56,460 --> 00:10:01,110 There we go. So we've filled the values without reassigning, so without typing 138 00:10:01,110 --> 00:10:06,230 car sales missing, Odometer, 139 00:10:09,000 --> 00:10:10,770 equals that. 140 00:10:10,920 --> 00:10:12,300 So that's just the thing to remember. 141 00:10:12,300 --> 00:10:14,360 You can always do it this way. 142 00:10:14,370 --> 00:10:18,750 Remember how I said pandas has multiple different ways of doing the same thing? 143 00:10:18,900 --> 00:10:24,270 One way to update values is to reassign them by putting the exact column you want to replace at the 144 00:10:24,270 --> 00:10:29,760 start; the other is by setting that inplace parameter of a function to True. 145 00:10:30,300 --> 00:10:33,270 So that'll really depend on how your workflow goes.
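The two approaches compared here, reassigning versus inplace, can be sketched side by side. This is an illustration under stated assumptions, not the code from the video: the data and the column name "Odometer" are made up. One deliberate difference from the video: recent pandas versions may warn about, or not propagate, an inplace fill called on a column selected with square brackets, so the sketch applies inplace at the DataFrame level with a column-to-value dictionary instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data with holes; the column names are assumptions for this sketch.
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", "Honda", None, "BMW"],
    "Odometer": [150000.0, np.nan, 60000.0, np.nan],
})

# mean() skips NaN by default, so this averages only the values that exist.
odometer_mean = car_sales_missing["Odometer"].mean()

# Option 1: leave inplace at its default and reassign the column.
filled_by_reassign = car_sales_missing.copy()
filled_by_reassign["Odometer"] = filled_by_reassign["Odometer"].fillna(odometer_mean)

# Option 2: inplace=True on the data frame, filling only the Odometer column.
filled_in_place = car_sales_missing.copy()
filled_in_place.fillna({"Odometer": odometer_mean}, inplace=True)

print(filled_by_reassign["Odometer"].tolist())
print(filled_in_place["Odometer"].tolist())
```

Both options end up with the same Odometer values, and the original car_sales_missing frame still has its gaps because each option worked on a copy.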
146 00:10:33,270 --> 00:10:38,340 Personally, I'm not a fan of inplace; I like just leaving it as the default and then reassigning the 147 00:10:38,340 --> 00:10:40,060 value at the start. 148 00:10:40,080 --> 00:10:44,250 Let's have a look at one more thing before we take a little break, 149 00:10:44,370 --> 00:10:50,720 as this lecture is getting a bit long: what if we wanted to remove these missing values? 150 00:10:50,730 --> 00:10:56,250 So say, for example, we were happy with filling the Odometer column, but for every other row that has missing 151 00:10:56,250 --> 00:10:56,950 values, 152 00:10:57,090 --> 00:11:02,460 we just want to ignore them, because missing values aren't gonna help us with our data analysis. 153 00:11:02,460 --> 00:11:07,190 In truth they might, but we just want to see how we would get rid of them. 154 00:11:07,230 --> 00:11:13,410 Car sales missing... now there's a function here called dropna. We might just see what happens if 155 00:11:13,410 --> 00:11:17,800 we just run that: dropna. Beautiful. 156 00:11:17,800 --> 00:11:19,110 Now if we type in here 157 00:11:19,120 --> 00:11:23,150 car sales missing, what does that data frame look like? 158 00:11:23,140 --> 00:11:26,430 Oh, it hasn't happened, again. 159 00:11:26,560 --> 00:11:27,580 Well, we've just covered this. 160 00:11:27,580 --> 00:11:31,160 What do you think we can do to fix this? So I might do 161 00:11:31,160 --> 00:11:35,960 car sales missing .dropna, inplace equals True, 162 00:11:40,120 --> 00:11:49,270 Shift and Enter, and we'll have a look at our data frame, car sales missing. Beautiful. Now, what do you think 163 00:11:49,270 --> 00:11:53,380 the value is in maybe not setting inplace to True? Now 164 00:11:53,380 --> 00:11:59,680 our car sales missing data frame is always going to have those rows with missing values 165 00:11:59,680 --> 00:12:00,550 dropped. 166 00:12:00,550 --> 00:12:02,470 What if we wanted to re-access them?
167 00:12:02,470 --> 00:12:10,160 We'd have to re-import the CSV. So maybe we do that: it goes pd.read_csv, 168 00:12:10,390 --> 00:12:16,850 car sales missing. So we'll reset our data frame. 169 00:12:16,850 --> 00:12:17,730 We'll have a look at it. 170 00:12:17,750 --> 00:12:23,270 This one's going to have missing values. So maybe we create a new data frame: 171 00:12:23,270 --> 00:12:36,970 car sales missing dropped equals car sales missing .dropna, inplace equals True. 172 00:12:37,020 --> 00:12:46,140 So what this will do is return a copy of our car sales missing data frame, but with dropped values. 173 00:12:46,630 --> 00:12:49,730 Whoops, I set inplace equals True. 174 00:12:49,760 --> 00:12:51,050 After a brief rerun... 175 00:12:51,120 --> 00:12:53,020 So that's okay. 176 00:12:53,040 --> 00:12:54,690 So sometimes you make mistakes here. 177 00:12:54,690 --> 00:12:59,600 We're going .dropna, and there we go: car sales missing dropped. Beautiful. 178 00:12:59,610 --> 00:13:08,010 So it's going to be the same data frame but with the missing values dropped. Then what we might do is car 179 00:13:08,010 --> 00:13:13,500 sales missing dropped, save this: .to_csv as 180 00:13:16,170 --> 00:13:26,830 car sales missing dropped .csv. That way, if we did want to access the rows with missing values, 181 00:13:26,860 --> 00:13:30,380 we can from our car sales missing data frame, 182 00:13:30,390 --> 00:13:37,350 this one here, or if we wanted to manipulate our data frame with missing values dropped, we can from this 183 00:13:37,350 --> 00:13:38,520 data frame. 184 00:13:38,730 --> 00:13:43,820 So that's the important distinction there between using inplace or not using inplace. 185 00:13:43,890 --> 00:13:44,210 All right. 186 00:13:44,220 --> 00:13:49,680 We've covered enough on manipulating data in this lesson for the time being, so take a little break and revise 187 00:13:49,680 --> 00:13:51,900 over the few functions that we've gone through.
188 00:13:51,990 --> 00:13:56,400 Have a try with inplace versus not inplace, and I'll see you in the next video.
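For revision, the dropna workflow from the end of the lesson can be sketched as follows. The data, the column names, and the file name are hypothetical, and a temporary folder is used so the sketch leaves no files behind. The key point is that leaving inplace at its default gives back a new data frame to assign to a new name, whereas the pandas documentation (at least through the 2.x releases) describes an inplace=True call as returning None, which is why combining inplace=True with an assignment was a mistake in the video.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data with holes; the column names are assumptions for this sketch.
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", None, "BMW"],
    "Odometer": [150000.0, 87000.0, np.nan],
})

# Default inplace=False: dropna returns a new frame and the original keeps every row.
car_sales_missing_dropped = car_sales_missing.dropna()

# Save the cleaned copy, then read it back to confirm what was written.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "car-sales-missing-dropped.csv")  # hypothetical file name
    car_sales_missing_dropped.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(len(car_sales_missing), len(car_sales_missing_dropped))
print(reloaded)
```

Keeping both frames around means the rows with missing values stay available in car_sales_missing while the cleaned rows live in car_sales_missing_dropped.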