1 00:00:00,540 --> 00:00:01,520 Hurrah. 2 00:00:01,600 --> 00:00:03,540 Making some real progress here. 3 00:00:03,670 --> 00:00:08,190 There's almost an infinite way that you can view and select data with panders to head of frame. 4 00:00:08,290 --> 00:00:10,140 We're only really scratching the surface here. 5 00:00:10,150 --> 00:00:15,520 But once you learn these foundation steps you can really combine them in kind of any different way you 6 00:00:15,520 --> 00:00:18,580 can imagine and really transform and view your data in a different light. 7 00:00:19,150 --> 00:00:20,950 So let's keep going. 8 00:00:20,950 --> 00:00:23,660 We're still viewing and selecting data at the moment. 9 00:00:23,810 --> 00:00:30,100 So the next one I want to show you is paid a cross tab and this is a great way to compare two columns. 10 00:00:30,100 --> 00:00:36,110 If you want to see them like a little innovative innovative is that the right word. 11 00:00:36,130 --> 00:00:36,760 Not sure. 12 00:00:36,760 --> 00:00:40,990 Doesn't really matter but you'll see a little comparative way. 13 00:00:40,990 --> 00:00:43,030 Maybe that's a bit better word. 14 00:00:43,030 --> 00:00:47,190 So cross tab is basically NSA say a pen is grab the make column. 15 00:00:47,710 --> 00:00:50,260 Let's refresh our memory of what our data frame looks like. 16 00:00:50,260 --> 00:00:53,440 We have make widgets car makes doors which is here. 17 00:00:54,040 --> 00:01:01,010 So this is going to crossover the make column with the door column and so cross tab is called panders 18 00:01:01,010 --> 00:01:06,750 dot cross tab and you pass it one column and then another column until we go in here. 19 00:01:06,880 --> 00:01:09,580 These adores the amount of doors three four and five. 20 00:01:09,610 --> 00:01:11,860 So we got three four five. 21 00:01:12,370 --> 00:01:14,270 And these are the different makes here. 22 00:01:14,410 --> 00:01:20,120 We got four different makes Toyota Honda BMW and Nissan beautiful. 23 00:01:20,170 --> 00:01:24,950 So what this is doing is it's kind of aggregating these two columns together. 24 00:01:24,970 --> 00:01:27,160 So let's explore this BMW. 25 00:01:27,160 --> 00:01:33,020 There's one car with five doors Honda there's three cars with four doors. 26 00:01:33,100 --> 00:01:35,980 Toyota is the only car brand that has three doors. 27 00:01:36,040 --> 00:01:39,210 Not quite sure why that car would look like but who knows. 28 00:01:39,220 --> 00:01:41,950 This is a kind of data you'll come across. 29 00:01:41,950 --> 00:01:48,550 Okay now cross tab is great for comparing two columns but what if you wanted to compare more columns 30 00:01:48,580 --> 00:01:50,570 in the context of another. 31 00:01:50,680 --> 00:01:53,440 This is where a group buy comes in. 32 00:01:53,560 --> 00:01:55,250 So let's put a little comment here. 33 00:01:55,270 --> 00:01:58,800 Group by car sales. 34 00:01:58,850 --> 00:02:04,180 I was going to say Petey I'm used to say NPD Group Buy and we want the make column 35 00:02:06,860 --> 00:02:09,560 dot Main. 36 00:02:09,720 --> 00:02:13,580 Now what do you think before we even run this code. 37 00:02:13,600 --> 00:02:17,830 Now I'm going against my only rule here I've always run the code first. 38 00:02:17,850 --> 00:02:20,520 What do you think this will do if we looked at this. 39 00:02:20,790 --> 00:02:27,590 So we've got car sales which is our data frame that's up here we've got this little function here called 40 00:02:27,590 --> 00:02:32,900 group buy we know it's a function because it's followed by brackets and then within the group by we've 41 00:02:32,900 --> 00:02:39,780 got the make column we've got main had a think about it doesn't matter. 42 00:02:39,780 --> 00:02:42,960 You should write the code out first and press shift and enter and see what happens. 43 00:02:43,890 --> 00:02:50,220 So what this is going to do is it's going to group the data frame on the make column similar to what 44 00:02:50,220 --> 00:02:51,480 our cross tab did. 45 00:02:51,480 --> 00:02:54,150 But this time we have more columns here. 46 00:02:54,150 --> 00:03:00,600 So it's given us the main values of the odometer of all the different types and makes as well as the 47 00:03:00,600 --> 00:03:01,540 doors. 48 00:03:01,560 --> 00:03:07,270 So BMW the main odometer rating is eleven thousand one hundred seventy nine. 49 00:03:07,290 --> 00:03:09,520 And the main amount of doors is five. 50 00:03:09,600 --> 00:03:17,720 And that makes sense because there's only one BMW car but again this group by you could group by almost 51 00:03:17,720 --> 00:03:24,380 any column you can imagine and then call some operation on it as you go and you can imagine if you had 52 00:03:24,380 --> 00:03:29,510 more columns here and you wanted to combine them in some way and view them you can use this group by 53 00:03:29,510 --> 00:03:33,530 function to make sure you own the group it by a certain type. 54 00:03:33,530 --> 00:03:36,130 Maybe you want only white Toyotas. 55 00:03:36,260 --> 00:03:37,470 Maybe that's a little exercise. 56 00:03:37,490 --> 00:03:43,610 See how you could group it by only looking at say two white Toyotas yeah two white Toyotas and then 57 00:03:43,610 --> 00:03:44,830 seeing what happens from there. 58 00:03:45,170 --> 00:03:46,900 But I just want to show you. 59 00:03:46,900 --> 00:03:53,030 Group Buy it's kind of like cross tab but for focusing on more than one column at a time. 60 00:03:54,050 --> 00:03:55,760 So what's next. 61 00:03:55,760 --> 00:03:56,120 All right. 62 00:03:56,390 --> 00:03:58,970 Well maybe but we have a look at some visualizations here. 63 00:03:58,980 --> 00:04:00,250 That's a great idea. 64 00:04:00,290 --> 00:04:01,160 So go. 65 00:04:01,190 --> 00:04:08,230 Car sales odometer let's see if we can do a little plot here. 66 00:04:08,270 --> 00:04:09,410 I'm going to call dot plot. 67 00:04:09,410 --> 00:04:10,730 What do you think this will do. 68 00:04:10,730 --> 00:04:13,790 Run that beautiful. 69 00:04:13,910 --> 00:04:16,840 We'll get a plot of the different odometer value. 70 00:04:16,860 --> 00:04:22,280 So this is what the beautiful thing about pandas is how beautiful and how quickly you can get an analysis 71 00:04:22,280 --> 00:04:29,060 going and you might look at a table like this and not really be sure what's going on especially have 72 00:04:29,060 --> 00:04:34,700 lots of rows but getting a visualization out is a great way to quickly understand what's happening and 73 00:04:34,700 --> 00:04:40,040 we'll have a section coming up on how to make these visualizations a bit more interactive and a bit 74 00:04:40,040 --> 00:04:44,240 more prettier with labels and different information here. 75 00:04:44,450 --> 00:04:51,890 But what you might find is that if this plot doesn't show up you might need to run these two lines of 76 00:04:51,890 --> 00:04:52,520 code. 77 00:04:52,730 --> 00:05:00,470 And this is map plot lib in line because you see here this has generated this plot is actually remember 78 00:05:00,470 --> 00:05:04,000 how we create an environment and then installed a whole bunch of tools for us. 79 00:05:04,010 --> 00:05:10,340 Condit did that for us when we created our environment installed map plot lib because some of the functionality 80 00:05:10,340 --> 00:05:14,500 like the plot function in pandas depends on map plot lib. 81 00:05:14,570 --> 00:05:17,930 So behind the scenes Condor is actually set this up for us. 82 00:05:18,290 --> 00:05:23,270 And when we imported pandas right up the top here we won't go right back up the top because that'll 83 00:05:23,300 --> 00:05:30,620 take a little while but when we typed in import pandas as PD it might have done some work behind the 84 00:05:30,620 --> 00:05:37,810 scenes for us to organize this plot but if it hasn't you can type in these two lines of code import 85 00:05:37,880 --> 00:05:42,370 map plot lib not pi plot as penalty. 86 00:05:42,440 --> 00:05:48,410 And this one here this percentage sign at the front is a little magic function you know books have different 87 00:05:48,410 --> 00:05:50,930 magic functions which do a specific kind of thing. 88 00:05:50,990 --> 00:05:57,320 All this is saying is plot our map plot leave plots inside the notebook so we can see them. 89 00:05:57,950 --> 00:06:03,710 So if your plots don't show up make sure you've run these two lines of code so you can type them and 90 00:06:03,710 --> 00:06:07,820 put them at the top of your notebook potentially if your plots aren't showing up. 91 00:06:07,820 --> 00:06:12,590 So next thing we want to see is another kind of plot which is called a histogram. 92 00:06:12,590 --> 00:06:16,250 Now histogram is a great way to see the distribution. 93 00:06:16,430 --> 00:06:20,510 Distribution is a fancy word for spread of data. 94 00:06:20,510 --> 00:06:21,980 So you imagine this. 95 00:06:22,070 --> 00:06:29,380 Most of our odometer readings are between let's say this is 20 odd thousand up to 100000. 96 00:06:29,390 --> 00:06:36,140 So the reason you can see this kind of curve shape here is because this is where the average about odometer 97 00:06:36,140 --> 00:06:42,800 readings fall into and that these two values out here might be considered outliers because they're not 98 00:06:42,800 --> 00:06:45,320 within this big clump of values. 99 00:06:45,320 --> 00:06:49,880 So that's an easy way to run a histogram on your numerical columns. 100 00:06:49,910 --> 00:06:55,850 Let's have a look at the price column we want to see what different prices of cars we have dot plot 101 00:06:56,520 --> 00:06:59,980 her Now why is that. 102 00:06:59,980 --> 00:07:01,480 What do you think that came up. 103 00:07:01,480 --> 00:07:02,260 Do you remember. 104 00:07:02,890 --> 00:07:05,460 No numeric data to plant. 105 00:07:05,470 --> 00:07:06,620 Why is that. 106 00:07:07,600 --> 00:07:12,400 Let's have a look at the time price dot D type. 107 00:07:12,400 --> 00:07:19,450 We did dot D types at the start in one of the videos dot D type is just on a single column so d type 108 00:07:19,570 --> 00:07:21,190 O for object. 109 00:07:21,400 --> 00:07:22,480 So it's not numeric. 110 00:07:22,480 --> 00:07:26,660 So our error is not numeric dated a plot. 111 00:07:26,890 --> 00:07:28,510 What can we do here. 112 00:07:28,570 --> 00:07:33,790 Now we could plot around for a bit and try to get this to work but let's have a look at how we might 113 00:07:33,790 --> 00:07:44,960 research a problem like this but we might type in how to convert a panda's column price to integer 114 00:07:47,590 --> 00:07:50,970 price column object to it in pandas. 115 00:07:51,010 --> 00:07:54,370 Now this is a step you might take if you run into some kind of error. 116 00:07:54,430 --> 00:07:57,880 We could always play around with this price column and try to fix it. 117 00:07:57,880 --> 00:08:03,610 But I want to emphasize what what an actual workflow might look like when you're looking up something. 118 00:08:03,610 --> 00:08:09,270 So you've googled how to convert a pan is common priced integer because that's what you want to do when 119 00:08:09,270 --> 00:08:12,110 I click on this stack overflow. 120 00:08:12,170 --> 00:08:16,760 I have a column called amount withholds values that look like this. 121 00:08:16,780 --> 00:08:23,500 Now what you want to do is when you get to somewhere like Stack Overflow you want to see if this description 122 00:08:23,500 --> 00:08:26,140 of this question matches our problem. 123 00:08:26,200 --> 00:08:30,760 So they have a column called a mount which holds values that look like this at similar to us but ours 124 00:08:30,760 --> 00:08:32,240 is called price. 125 00:08:32,380 --> 00:08:37,090 When I do data framed up D times it returns this column as an object. 126 00:08:37,080 --> 00:08:38,950 Ah so does ours. 127 00:08:38,950 --> 00:08:40,810 How do I convert this to type int. 128 00:08:40,840 --> 00:08:41,890 Yes that's what we want to do. 129 00:08:41,920 --> 00:08:44,470 Let's have a look at some of these answers. 130 00:08:44,510 --> 00:08:46,450 OK this is using Reg. 131 00:08:46,750 --> 00:08:52,000 What you'll do with these answers is kind of look through them and see which one stands out to you the 132 00:08:52,000 --> 00:08:52,490 most. 133 00:08:52,510 --> 00:08:58,630 This green tech means that the question asker has marked this as the right answer to help them solve 134 00:08:58,630 --> 00:08:59,640 their problem. 135 00:08:59,950 --> 00:09:06,850 But sometimes there's other ones like maybe this one here I want to use one without regex cause rejects 136 00:09:06,880 --> 00:09:08,890 kind of confuses me. 137 00:09:09,310 --> 00:09:12,010 So this is also got 5 upvotes. 138 00:09:12,010 --> 00:09:13,070 One of them is mine. 139 00:09:13,120 --> 00:09:15,770 Previously in doing research for this video here. 140 00:09:15,790 --> 00:09:24,330 So what you might do is copy this and so this you can see they've started off with a mount which is 141 00:09:24,330 --> 00:09:29,430 like our price column and it's come out like this which is what we want to do we want to convert our 142 00:09:29,430 --> 00:09:30,270 price column. 143 00:09:30,270 --> 00:09:31,100 Let's have a look. 144 00:09:31,110 --> 00:09:37,340 Car sales want to convert our price columns to something like what our odometer column looks like. 145 00:09:38,170 --> 00:09:41,870 We've copied this piece of code from Stack Overflow. 146 00:09:41,920 --> 00:09:46,330 We're going to type it in here but we're going to use it as reference right because we want to get practice 147 00:09:46,330 --> 00:09:48,300 writing our own code. 148 00:09:48,460 --> 00:09:54,250 So we're going to compare it to what ours is let's zoom out one little bit so we can see two sales at 149 00:09:54,250 --> 00:09:55,120 once. 150 00:09:55,120 --> 00:09:55,710 Excellent. 151 00:09:55,720 --> 00:10:05,170 So this is saying data frame with a column amount equals data frame amount dot string dot replace some 152 00:10:05,200 --> 00:10:10,740 interesting combination of code there with nothing as type int. 153 00:10:10,810 --> 00:10:18,070 Now that's a fair bit in one guy but let's try replace the parts that we do know to our own data frame 154 00:10:18,370 --> 00:10:24,280 our data frame is actually called car sales so we're going to type in car sales our market column is 155 00:10:24,280 --> 00:10:34,360 actually called price so we'll type in price and we're going to reassign it to car sales price is what 156 00:10:34,360 --> 00:10:39,320 this is saying is Hey grab the price column and set it equal to this. 157 00:10:39,350 --> 00:10:44,740 Another car sales don't price column but let's put some adjustments. 158 00:10:44,830 --> 00:10:49,330 So this means access axis this string and replace. 159 00:10:49,330 --> 00:11:02,710 Now this little funky combination of slashes Dollar Signs comma and dots is saying hey find these symbols 160 00:11:03,310 --> 00:11:12,480 within the price column and replace it after the comma replace it with nothing and then change the type 161 00:11:12,510 --> 00:11:18,030 with as type 2 integer so let's go through that one more time. 162 00:11:18,070 --> 00:11:19,770 Car sales price column. 163 00:11:19,990 --> 00:11:27,040 We want to readjusted to equaling the car sales price column to keep the values the same but access 164 00:11:27,040 --> 00:11:32,480 the string and replace the dollar sign commas and dots. 165 00:11:32,560 --> 00:11:35,740 So dollar sign commas and dance with nothing. 166 00:11:35,740 --> 00:11:43,240 So this is an empty string and then change it from a string so dot as type Integer. 167 00:11:43,240 --> 00:11:45,910 Now that was a mouthful but remember. 168 00:11:45,970 --> 00:11:49,120 If in doubt run the code shift and into. 169 00:11:49,150 --> 00:11:53,130 Now let's see what our car sales data frame looks like. 170 00:11:53,140 --> 00:11:59,100 Beautiful we've changed the price column to look more like the odometer column. 171 00:11:59,320 --> 00:12:01,840 What's happened here. 172 00:12:02,200 --> 00:12:06,410 The price column has included the Saints. 173 00:12:07,360 --> 00:12:13,240 So maybe that's an exercise of how to fix that to not include the sense because we only want it in thousands 174 00:12:13,240 --> 00:12:18,080 and this is increased it's are now the Nissan the White Nissan is now nine hundred and seventy thousand 175 00:12:18,130 --> 00:12:25,540 dollars but let's see if we can plot it now price dot plot 176 00:12:27,890 --> 00:12:35,600 beautiful so the plot is incorrect because the numbers are just way too high but at least we can plot 177 00:12:35,600 --> 00:12:40,370 it now so the next thing we'd look at in terms of manipulating data is looking at this line of code 178 00:12:40,400 --> 00:12:42,360 and going how can we readjust that. 179 00:12:42,480 --> 00:12:48,890 So maybe that's a little exercise you could try try to readjust this line of code to not include those 180 00:12:48,890 --> 00:12:54,930 last two zeros on the end and now we've seen a few different ways of viewing and selecting data some 181 00:12:54,930 --> 00:13:02,110 of the major ones with pandas will now look at a few more ways that we can manipulate data so take a 182 00:13:02,110 --> 00:13:06,460 little break if you need go back over the code we've gone through and I'll see in the next video.