1 00:00:00,180 --> 00:00:07,260 Next up, we move on to one of the most useful and most important clauses in all of SQL, which is grouped 2 00:00:07,260 --> 00:00:08,740 by group. 3 00:00:08,800 --> 00:00:12,900 It also happens to be a little bit confusing up front. 4 00:00:12,900 --> 00:00:16,050 Or it can be when you're starting out, it's not terrible or anything. 5 00:00:16,440 --> 00:00:19,590 It's just conceptually different than anything we've done so far. 6 00:00:19,620 --> 00:00:21,270 So take a deep breath with me. 7 00:00:21,360 --> 00:00:21,930 Stay. 8 00:00:21,930 --> 00:00:22,560 Stay in there. 9 00:00:22,560 --> 00:00:23,060 Hang in there. 10 00:00:23,070 --> 00:00:23,940 We're going to make it. 11 00:00:23,940 --> 00:00:24,810 It will be fine. 12 00:00:24,810 --> 00:00:25,740 It's not that bad. 13 00:00:25,740 --> 00:00:26,940 I feel like I'm scaring you. 14 00:00:27,180 --> 00:00:34,950 But here is the definition from the docs grouped by summarizes or aggregates identical data into single 15 00:00:34,950 --> 00:00:35,580 rows. 16 00:00:35,580 --> 00:00:36,990 So what do I mean by that? 17 00:00:37,260 --> 00:00:40,170 Well, here's an example of how we could use it. 18 00:00:40,200 --> 00:00:41,340 I'll come back to this. 19 00:00:41,340 --> 00:00:46,350 But Group VI and then some column is going to group internally. 20 00:00:46,350 --> 00:00:50,970 It's going to group the rows from our books table based upon author L name. 21 00:00:51,330 --> 00:00:55,800 So if this is our table, I've simplified it drastically. 22 00:00:55,800 --> 00:00:57,890 We have what, three different authors? 23 00:00:57,900 --> 00:01:03,690 If I said to SQL, please group all of these rows by author last name. 24 00:01:03,690 --> 00:01:09,210 Don't worry about what it prints out or anything just yet, but you can visualize it this way in memory. 25 00:01:09,210 --> 00:01:15,540 It makes a group with all the books, all the rows written by Lahiri, and it also makes a group that 26 00:01:15,540 --> 00:01:20,610 has the game in rows together and a final group that has the Edgars rows. 27 00:01:20,610 --> 00:01:23,580 It's only one row, so this is not what's printed out. 28 00:01:23,580 --> 00:01:25,440 I want to be very clear about that. 29 00:01:25,440 --> 00:01:28,650 This is just what is happening behind the scenes. 30 00:01:28,650 --> 00:01:34,830 In fact, if I try this right now, let's do first, let's just do select author. 31 00:01:35,600 --> 00:01:37,640 L name from books. 32 00:01:37,760 --> 00:01:38,960 No grouping. 33 00:01:38,990 --> 00:01:41,150 Of course, we see all the duplicates. 34 00:01:42,290 --> 00:01:49,040 And if I instead say select author l name from books group by author L name. 35 00:01:49,220 --> 00:01:50,540 This is what we see. 36 00:01:50,570 --> 00:01:56,650 Just every every author's name appears once, but this doesn't show us what's happening behind the scenes. 37 00:01:56,660 --> 00:02:01,570 As I said, in memory, out of out of view in the code. 38 00:02:01,580 --> 00:02:08,120 Essentially, there are these groups that are formed and we can ask questions or query things about 39 00:02:08,120 --> 00:02:09,380 those individual groups. 40 00:02:09,380 --> 00:02:13,790 And that's where it becomes useful because right now I've basically done the same thing. 41 00:02:13,790 --> 00:02:20,900 It's just asking for the distinct author names, last names rather, but actually behind the scenes 42 00:02:20,900 --> 00:02:22,310 there's a lot more going on. 43 00:02:22,310 --> 00:02:27,680 So instead of just grouping them and then just selecting the author name after they've been grouped, 44 00:02:27,800 --> 00:02:30,140 what if I did something like this? 45 00:02:30,140 --> 00:02:33,440 What if we count how many books each author has written? 46 00:02:33,650 --> 00:02:39,920 So if you believe me that there are these groups behind the scenes grouped by the author last name, 47 00:02:40,280 --> 00:02:46,670 I should then be able to use count to find out how many rows exist in each group. 48 00:02:46,970 --> 00:02:48,920 And that's exactly what I'm going to do. 49 00:02:49,250 --> 00:02:56,150 So this query here says, Let's start at the end group all of our rows together by the author last name. 50 00:02:56,150 --> 00:03:02,360 So imagine we have these little mini groups all over the place, and then we say we want to select the 51 00:03:02,360 --> 00:03:07,280 author last name from each of those groups and of course, each group that has the same author last 52 00:03:07,280 --> 00:03:07,430 name. 53 00:03:07,430 --> 00:03:10,040 They have to because that's how they're grouped. 54 00:03:10,040 --> 00:03:14,780 So we take that name like Lahiri and then Count Star. 55 00:03:14,780 --> 00:03:18,470 And as we just saw, Count Star means count the rows. 56 00:03:18,470 --> 00:03:24,410 And in this case, it's going to count the rows of each group, not all the rows in the whole table, 57 00:03:24,410 --> 00:03:25,940 but the rows in each group. 58 00:03:26,150 --> 00:03:27,950 So let's see what that might end up with. 59 00:03:27,950 --> 00:03:32,720 We might see two for Lahiri, two for Gaiman, and one for Eggers. 60 00:03:32,720 --> 00:03:36,530 And that's based off of this fiction or this tiny data set from my slides. 61 00:03:36,530 --> 00:03:38,930 And the end result would look something like this. 62 00:03:38,930 --> 00:03:41,570 Lahiri To Gaiman. 63 00:03:41,570 --> 00:03:43,730 To Eggers one. 64 00:03:44,120 --> 00:03:46,880 So let's try it for real over here. 65 00:03:46,940 --> 00:03:53,420 Instead of selecting author l name from books grouped by author L name, I'm going to keep all of that, 66 00:03:53,420 --> 00:03:56,630 but after my author L name or before doesn't matter. 67 00:03:56,630 --> 00:04:00,650 But I'm going to select the author name first, so that will appear on the left. 68 00:04:00,650 --> 00:04:02,750 Then Count Star. 69 00:04:03,490 --> 00:04:08,650 And once again, the way that this works is that we have grouped by the author's last name. 70 00:04:08,650 --> 00:04:14,650 So there's all these mini groupings and then we count the rows in each of those mini groups and we see 71 00:04:14,680 --> 00:04:19,600 Lahiri appears twice, Garman appears three times, Edgar's three times. 72 00:04:19,600 --> 00:04:29,380 And I could even get fancier and do something like count and let's give it a name, like as books written. 73 00:04:30,100 --> 00:04:33,250 And then why don't we sort by that? 74 00:04:33,250 --> 00:04:38,140 So I'll get rid of my semicolon order by books written. 75 00:04:39,100 --> 00:04:41,020 And what did I mess up here? 76 00:04:41,140 --> 00:04:43,330 I think I have a missing space. 77 00:04:43,330 --> 00:04:44,260 Possibly. 78 00:04:44,290 --> 00:04:45,100 There we go. 79 00:04:45,490 --> 00:04:48,850 And I kind of want it to be in descending order. 80 00:04:49,750 --> 00:04:50,800 And here we are. 81 00:04:50,830 --> 00:04:55,290 We have each author in our dataset and the number of books that they wrote. 82 00:04:55,300 --> 00:04:57,280 It's not exactly accurate because. 83 00:04:57,280 --> 00:05:03,490 Harris If you remember, there's two Harris's the same last name, so we're only grouping by last name, 84 00:05:03,490 --> 00:05:08,080 not first name, but you can see we've got three, three, two, two. 85 00:05:08,080 --> 00:05:11,690 And that holds up with just how we hopefully remember the data. 86 00:05:11,710 --> 00:05:15,070 Of course, we could always take a look and just do select. 87 00:05:15,640 --> 00:05:19,690 How about title author L name from books? 88 00:05:19,990 --> 00:05:22,750 And we see, you know, gay men. 89 00:05:22,750 --> 00:05:25,120 One, two, three. 90 00:05:25,120 --> 00:05:26,980 Lahiri One, two. 91 00:05:26,980 --> 00:05:28,480 Eggers One, two. 92 00:05:28,660 --> 00:05:29,590 Is there a third? 93 00:05:29,590 --> 00:05:30,220 Three. 94 00:05:30,490 --> 00:05:36,910 So in this case, we can verify our work by counting the the old way, just counting, you know, using 95 00:05:36,910 --> 00:05:37,510 our fingers. 96 00:05:37,510 --> 00:05:43,380 But we don't need to do that because we can have group by do it for us or group by along with count. 97 00:05:43,390 --> 00:05:50,290 So let's try another example instead of grouping by the author's last name, why don't we try grouping 98 00:05:50,290 --> 00:05:56,860 by the released year so we could figure out how many books were released in a given year? 99 00:05:57,160 --> 00:05:59,860 So this table looks nasty because it's just too long. 100 00:05:59,860 --> 00:06:02,290 With my font size, it all kind of gets collapsed. 101 00:06:02,290 --> 00:06:07,770 But let's do a select released year from books. 102 00:06:07,780 --> 00:06:09,070 We'll just start with that. 103 00:06:09,070 --> 00:06:13,510 And you can see all of the years that we had different books released in and there's actually not that 104 00:06:13,510 --> 00:06:14,320 much overlap. 105 00:06:14,320 --> 00:06:15,430 We've already discussed this. 106 00:06:15,430 --> 00:06:19,300 There's a couple of years that show up more than once, but not that many. 107 00:06:19,900 --> 00:06:23,920 What I want to do now is select released year. 108 00:06:24,790 --> 00:06:28,510 From books and group by. 109 00:06:29,920 --> 00:06:30,940 Released here. 110 00:06:30,940 --> 00:06:33,670 And this is not going to do a whole lot for us. 111 00:06:33,670 --> 00:06:36,610 It basically will just show us the distinct years. 112 00:06:36,610 --> 00:06:41,950 But that's because we're not taking advantage of those groups that have been formed in memory. 113 00:06:41,950 --> 00:06:48,160 So in memory, there's a group that has I think there's two books in 2003, so there's two rows and 114 00:06:48,160 --> 00:06:50,140 then there's two rows for 2001. 115 00:06:50,230 --> 00:06:55,870 But we only get one printed out because there's a group here and a group there and so on. 116 00:06:55,870 --> 00:07:02,710 So I can use count once again count Star and this is going to count the number of rows in each group. 117 00:07:02,710 --> 00:07:08,410 And we see 2003 appears twice, 2001 appears three times. 118 00:07:08,410 --> 00:07:17,050 And well we've got Noel Darren Nolan there appears twice for the year 2017 once and then we could order 119 00:07:17,050 --> 00:07:19,660 them, we could limit, we could do whatever we want. 120 00:07:19,660 --> 00:07:25,840 But what I care about right now is this mechanic of grouping into smaller groups, these little mini 121 00:07:25,840 --> 00:07:26,260 tables. 122 00:07:26,260 --> 00:07:32,440 You can think of them kind of as many tables that have one or more rows, depending on if there's multiple 123 00:07:32,440 --> 00:07:36,280 years of 2000, three, multiple 2000 ones, those are grouped together. 124 00:07:36,280 --> 00:07:38,500 And then we're asking for the count. 125 00:07:38,500 --> 00:07:41,290 How many of those rows are present in each group? 126 00:07:41,980 --> 00:07:47,180 Okay, One last thing that's important to talk about is we have all of these rows, right? 127 00:07:47,230 --> 00:07:54,430 Select star from books, a rather all of these columns, book ID title released years quantity. 128 00:07:54,760 --> 00:07:59,860 I can't do something like select star from books group by. 129 00:08:01,120 --> 00:08:04,960 And then how about we grouped by author last name again? 130 00:08:05,680 --> 00:08:11,830 I'm going to run into trouble here because my SQL is getting complain and say, You you asked me to 131 00:08:11,830 --> 00:08:15,400 make all these groups based off of the author last name, which I did. 132 00:08:15,790 --> 00:08:19,990 Again, pretend it looks like this, these groups based off of the last name. 133 00:08:19,990 --> 00:08:24,430 But then you said Select star and star means every column. 134 00:08:24,430 --> 00:08:26,950 So how do I select a title? 135 00:08:26,950 --> 00:08:29,350 How do I select a year or release year? 136 00:08:29,350 --> 00:08:30,940 How do I select an ID? 137 00:08:31,060 --> 00:08:33,730 Because I have more than one of each of those. 138 00:08:33,820 --> 00:08:37,090 It doesn't work, so it gives me an error. 139 00:08:37,480 --> 00:08:45,970 It says, Well, you could change the SQL mode, and if you do that, you can kind of make it work. 140 00:08:46,000 --> 00:08:52,150 It will show you just one result or one match for each column, but I don't want to do that. 141 00:08:52,150 --> 00:08:57,970 Generally what we do is we make our groups and then we use some aggregate function like count. 142 00:08:57,970 --> 00:08:59,860 That's the only one we know so far. 143 00:08:59,920 --> 00:09:03,430 But we're going to learn men and Macs and average and some others. 144 00:09:03,430 --> 00:09:08,320 So we use that to boil down multiple rows in a group into a single value. 145 00:09:08,710 --> 00:09:13,150 But also we can access any of the columns that we're grouping by. 146 00:09:13,150 --> 00:09:21,430 So instead of Select Star, of course, I could select author last name because every row in that group 147 00:09:21,430 --> 00:09:23,470 has the same author last name. 148 00:09:23,560 --> 00:09:28,960 If we look at our visualization of the groups, we can select last name and print that out. 149 00:09:28,960 --> 00:09:30,940 Because Lahiri and Lahiri, it's the same. 150 00:09:30,940 --> 00:09:36,040 There's no ambiguity, but I couldn't ask for title, as we've already discussed. 151 00:09:36,160 --> 00:09:37,870 Author, Last Name, comma Title. 152 00:09:38,020 --> 00:09:39,040 Which title? 153 00:09:39,040 --> 00:09:39,880 There's a group. 154 00:09:39,910 --> 00:09:41,440 There's more than one title there. 155 00:09:41,440 --> 00:09:44,440 What do you want me to group or what do you want me to to display? 156 00:09:44,440 --> 00:09:46,480 That's me personifying SQL. 157 00:09:46,870 --> 00:09:49,330 So that's important to understand. 158 00:09:49,690 --> 00:09:55,810 Generally the formula is will group by some column, and then we might use that column in our select, 159 00:09:55,810 --> 00:10:01,450 but then we almost always are going to have some aggregate function going on like account min, max, 160 00:10:01,450 --> 00:10:02,800 average and sum. 161 00:10:02,800 --> 00:10:04,780 So I think that's a good note to end on. 162 00:10:04,780 --> 00:10:10,330 Let's move on and start learning some of those other things we can do those other built in aggregation 163 00:10:10,330 --> 00:10:11,140 functions.