1 00:00:00,020 --> 00:00:05,090 ‫So one of you guys asks a very interesting question, and I thought, I'll make this lecture just to 2 00:00:05,090 --> 00:00:07,700 ‫address that particular question in details. 3 00:00:07,850 --> 00:00:08,660 ‫Let's jump into it. 4 00:00:08,750 --> 00:00:11,060 ‫This comes from Adam Salama. 5 00:00:12,370 --> 00:00:16,180 ‫Hopefully that's how you pronounce there's an H in your name. 6 00:00:16,180 --> 00:00:17,620 ‫So I suppose it's Adam. 7 00:00:17,740 --> 00:00:24,200 ‫So question is an explain analyze shows, bitmap scan instead of an index only scan. 8 00:00:24,220 --> 00:00:25,570 ‫Why is that? 9 00:00:25,570 --> 00:00:30,880 ‫And thank you, by the way, Adam, for actually including the script, because, you know, I built 10 00:00:30,880 --> 00:00:36,370 ‫this this lecture months ago, even maybe years. 11 00:00:36,370 --> 00:00:36,520 ‫Right. 12 00:00:36,520 --> 00:00:40,360 ‫So I have no idea if I read a question, I don't remember anything. 13 00:00:40,360 --> 00:00:46,420 ‫So by you actually providing exactly what you did, I just copy it in my Postgres instance and I am 14 00:00:46,420 --> 00:00:48,550 ‫with you exactly the same thing. 15 00:00:48,550 --> 00:00:54,040 ‫And instead of asking me, Oh, in this minute you did this, I have to open the lecture. 16 00:00:54,070 --> 00:00:54,760 ‫Go to the minute. 17 00:00:54,760 --> 00:00:55,600 ‫It takes time. 18 00:00:55,600 --> 00:00:56,590 ‫Right this way. 19 00:00:56,590 --> 00:00:58,150 ‫I just look at what you have. 20 00:00:58,150 --> 00:00:59,470 ‫I Exactly. 21 00:00:59,470 --> 00:01:00,970 ‫I'm in the same boat as you. 22 00:01:00,970 --> 00:01:02,350 ‫So thank you for doing this. 23 00:01:02,500 --> 00:01:03,760 ‫Oh, this is awesome. 24 00:01:03,760 --> 00:01:04,930 ‫If you do this right. 25 00:01:05,050 --> 00:01:09,430 ‫So what what Adam is experiencing is he created the. 26 00:01:09,430 --> 00:01:10,840 ‫The grade table. 27 00:01:10,870 --> 00:01:12,170 ‫He created the partitions. 28 00:01:12,170 --> 00:01:18,650 ‫He just slammed a bunch of rows, a thousand rows, and then immediately did a query on the on the, 29 00:01:18,680 --> 00:01:25,310 ‫on the main partition, you know, doing explain, analyze, select count star from grades port where 30 00:01:25,340 --> 00:01:26,300 ‫G is equal one. 31 00:01:26,300 --> 00:01:30,680 ‫So he wants the first grade that we know that since G is equal to one, we're going to hit the first 32 00:01:30,680 --> 00:01:35,360 ‫partition right between 0 and 35 because that's that's the range. 33 00:01:35,360 --> 00:01:35,720 ‫Right? 34 00:01:35,720 --> 00:01:37,400 ‫So that's how how it works. 35 00:01:37,400 --> 00:01:38,390 ‫So we're going to hit that. 36 00:01:38,390 --> 00:01:42,680 ‫And we have an index there because we create an index on the main. 37 00:01:42,680 --> 00:01:47,320 ‫An index will be created on each partition because a partition is really just a table, right? 38 00:01:47,330 --> 00:01:49,580 ‫Each one is just an individual table. 39 00:01:49,580 --> 00:01:57,440 ‫And this partition, this main part is really just the logical thing, a nice view for you, you know, 40 00:01:58,430 --> 00:02:12,560 ‫So what what Adam is, is a, I suppose a question is like they got a bit bitmap index scan right on 41 00:02:12,560 --> 00:02:16,040 ‫the G zero 35 G index, which is perfect. 42 00:02:16,040 --> 00:02:18,800 ‫That's the index of that table. 43 00:02:18,830 --> 00:02:19,040 ‫Right. 44 00:02:19,490 --> 00:02:24,230 ‫And the cost here is 0 to 4. 45 00:02:24,230 --> 00:02:27,800 ‫And again, I made a mistake in one of my lectures that said, oh, this is millisecond. 46 00:02:27,800 --> 00:02:29,030 ‫That's that's not correct. 47 00:02:29,030 --> 00:02:32,300 ‫It's just a unit that Postgres uses. 48 00:02:32,330 --> 00:02:35,960 ‫And this unit means it's a cost, right? 49 00:02:37,100 --> 00:02:39,620 ‫The more the value, the higher the cost. 50 00:02:39,620 --> 00:02:42,770 ‫You can think of it as I o you can think of it as milliseconds. 51 00:02:42,800 --> 00:02:45,590 ‫Not really, but it could be anything. 52 00:02:45,590 --> 00:02:51,140 ‫And, and that's just the value that the higher the cost, the, the, the worse the performance. 53 00:02:51,140 --> 00:02:51,590 ‫Right. 54 00:02:52,070 --> 00:02:56,120 ‫So we estimated nine rows. 55 00:02:56,120 --> 00:02:58,070 ‫We got nine rows with this. 56 00:02:58,070 --> 00:02:59,000 ‫So it's just an estimate. 57 00:02:59,000 --> 00:03:01,610 ‫And then we the actual gave us one. 58 00:03:01,610 --> 00:03:06,170 ‫But the question is like why is it not using the index on his. 59 00:03:06,470 --> 00:03:07,640 ‫It's a count. 60 00:03:07,670 --> 00:03:08,330 ‫Right. 61 00:03:08,330 --> 00:03:10,160 ‫So I'm just counting the rows. 62 00:03:10,160 --> 00:03:14,390 ‫Like why does it have to go to the heap and the cost of going to the heap is high, right? 63 00:03:14,390 --> 00:03:21,470 ‫Because now going to the heap, you need to fetch the pages and all the columns for that table. 64 00:03:21,650 --> 00:03:21,770 ‫Right? 65 00:03:21,860 --> 00:03:30,560 ‫So if you have like a 100 columns going to the heap will by default fetch the page and fetch the table 66 00:03:30,560 --> 00:03:31,430 ‫the tuple. 67 00:03:31,520 --> 00:03:32,090 ‫Right. 68 00:03:32,090 --> 00:03:35,390 ‫And then the table will by default give you all the columns. 69 00:03:35,390 --> 00:03:35,840 ‫Right. 70 00:03:35,990 --> 00:03:36,650 ‫And. 71 00:03:37,380 --> 00:03:42,270 ‫And I know that some of you asked the question was like, oh, in this case, like, why is why do we 72 00:03:42,300 --> 00:03:43,620 ‫do select column? 73 00:03:43,620 --> 00:03:45,060 ‫Just just always do select star. 74 00:03:45,090 --> 00:03:46,620 ‫No, not not quite. 75 00:03:46,650 --> 00:03:50,490 ‫While true, when you do a tuple query, you get all the column. 76 00:03:50,490 --> 00:03:59,700 ‫But the but the parser still have to do work to actually crack that index and actually find the attributes 77 00:03:59,700 --> 00:04:05,520 ‫that you're looking for by parsing the binary page in Postgres. 78 00:04:05,520 --> 00:04:06,690 ‫So there is work there. 79 00:04:06,690 --> 00:04:07,890 ‫It's not like free. 80 00:04:08,220 --> 00:04:09,330 ‫So that's the first word. 81 00:04:09,360 --> 00:04:14,700 ‫Then you have to serialize it to the data structure in Postgres, then you have to transfer it to the 82 00:04:14,700 --> 00:04:15,240 ‫client. 83 00:04:15,240 --> 00:04:18,690 ‫So select Star is bad regardless, even if it's a raw store. 84 00:04:18,720 --> 00:04:19,140 ‫Right? 85 00:04:19,170 --> 00:04:21,630 ‫There is so much work, right? 86 00:04:22,110 --> 00:04:27,360 ‫Especially like let's say you have Blob Fields, you're doing Select Star, you're pulling it from the 87 00:04:27,480 --> 00:04:30,390 ‫host table and then pulling it back to the network to the client. 88 00:04:30,780 --> 00:04:33,750 ‫NA All right, so back to the original question. 89 00:04:33,750 --> 00:04:37,600 ‫Why did it do a heap scan, an index heap scan? 90 00:04:37,600 --> 00:04:40,720 ‫And we talked about that, a bitmap index scan. 91 00:04:40,720 --> 00:04:46,900 ‫What it does is that it scan the index and it doesn't really care about the columns per se. 92 00:04:47,890 --> 00:04:48,910 ‫Sorry, the tuples. 93 00:04:48,940 --> 00:04:51,850 ‫It just make marks of the pages. 94 00:04:51,850 --> 00:04:57,400 ‫Oh, we found page zero and page 17 and page 30. 95 00:04:57,430 --> 00:04:58,000 ‫Right. 96 00:04:58,030 --> 00:05:01,600 ‫These are the pages that contains the rows we want. 97 00:05:01,600 --> 00:05:03,190 ‫That's all what it does, right? 98 00:05:03,190 --> 00:05:07,660 ‫And then it takes that and fetches page zero with all its tuples. 99 00:05:07,660 --> 00:05:08,590 ‫Fetches page. 100 00:05:08,590 --> 00:05:09,250 ‫What did we say? 101 00:05:09,250 --> 00:05:09,790 ‫Seven. 102 00:05:09,790 --> 00:05:17,770 ‫And then fetches page 31 with all the rows and then it does an additional filter on the heap to filter 103 00:05:17,770 --> 00:05:19,030 ‫out what it doesn't want. 104 00:05:19,060 --> 00:05:21,610 ‫In this case, we're looking for grade equal one, right? 105 00:05:21,760 --> 00:05:29,620 ‫This is in this case the bitmap this way, because if it has to, it had to go to the heap for some 106 00:05:29,620 --> 00:05:30,220 ‫reason. 107 00:05:30,220 --> 00:05:32,260 ‫We don't know why in here. 108 00:05:32,440 --> 00:05:33,070 ‫I know. 109 00:05:33,070 --> 00:05:34,240 ‫I'm going to tell you why. 110 00:05:34,270 --> 00:05:34,750 ‫Right. 111 00:05:34,750 --> 00:05:40,390 ‫But if it has to go to the heap, it doesn't use an index only scan because it's not an index only scan. 112 00:05:40,390 --> 00:05:42,550 ‫We have to go to the heap to fetch an additional column. 113 00:05:42,550 --> 00:05:44,920 ‫But what is the column that it needed? 114 00:05:45,040 --> 00:05:46,600 ‫We didn't need anything. 115 00:05:46,640 --> 00:05:54,730 ‫The question the answer is there are two hidden system columns that Postgres uses to maintain the Mvcc 116 00:05:54,880 --> 00:05:57,220 ‫the Multiversion Concurrency control. 117 00:05:57,750 --> 00:06:04,450 ‫And those are the X-Men and then X-Men and some other columns. 118 00:06:04,470 --> 00:06:08,940 ‫This lives as just another column in the table in the heap. 119 00:06:09,090 --> 00:06:09,210 ‫Okay. 120 00:06:09,450 --> 00:06:10,620 ‫So in the pages. 121 00:06:10,620 --> 00:06:16,180 ‫And there's just two additional attributes, including the table ID as well. 122 00:06:16,200 --> 00:06:16,710 ‫Right. 123 00:06:16,770 --> 00:06:26,280 ‫So now, if you think about it, these two columns will tell you whether the row is visible or not, 124 00:06:26,310 --> 00:06:27,780 ‫whether the tuple is visible or not. 125 00:06:27,810 --> 00:06:28,590 ‫What does that mean? 126 00:06:28,620 --> 00:06:31,410 ‫So if I am transaction right. 127 00:06:31,950 --> 00:06:36,510 ‫If I'm transaction one and I start it, I'm doing some work here. 128 00:06:36,510 --> 00:06:37,350 ‫I started. 129 00:06:37,350 --> 00:06:41,250 ‫And meanwhile, after a little bit, while I didn't commit transaction one started running and then 130 00:06:41,250 --> 00:06:47,700 ‫another transaction continues to run right as it started another transaction and inserted a bunch of 131 00:06:47,700 --> 00:06:48,420 ‫tuples. 132 00:06:48,450 --> 00:06:48,960 ‫Right. 133 00:06:48,960 --> 00:06:50,190 ‫And didn't commit. 134 00:06:50,220 --> 00:06:52,440 ‫Does this transaction. 135 00:06:52,650 --> 00:06:55,020 ‫Transaction one should should it see these rows? 136 00:06:55,050 --> 00:06:56,390 ‫No, it should not. 137 00:06:56,400 --> 00:06:56,820 ‫Right. 138 00:06:56,820 --> 00:07:01,720 ‫Because it started before that transaction. 139 00:07:01,720 --> 00:07:09,960 ‫And even if it did, if transaction two did commit those rows, if I'm in transaction one and I started 140 00:07:09,970 --> 00:07:12,820 ‫a query, I started running a long query. 141 00:07:12,820 --> 00:07:14,710 ‫Let's say I'm selecting everything. 142 00:07:15,010 --> 00:07:25,450 ‫And just right after I started executing the query, then it's running the other, uh, the other transaction 143 00:07:25,450 --> 00:07:27,370 ‫committed the inserted rows. 144 00:07:27,460 --> 00:07:34,750 ‫This transaction should not see those inserted rows because it started right before the commit because 145 00:07:34,750 --> 00:07:36,310 ‫I have already read committed works. 146 00:07:36,310 --> 00:07:36,620 ‫Right. 147 00:07:36,640 --> 00:07:40,420 ‫Read committed works the moment at the start of the query. 148 00:07:40,450 --> 00:07:40,810 ‫Right. 149 00:07:40,810 --> 00:07:44,610 ‫So yeah, once we finish that query, the next select will see those inserts. 150 00:07:44,620 --> 00:07:46,150 ‫So that's how our isolation works. 151 00:07:46,150 --> 00:07:48,220 ‫We talked about it in details in this course, right? 152 00:07:48,370 --> 00:07:57,730 ‫The problem here is Postgres does not trust the index because the index has what it has. 153 00:07:57,730 --> 00:07:59,050 ‫The key, which is the grades. 154 00:07:59,050 --> 00:08:02,410 ‫In this case, that's where we're building the b-tree on. 155 00:08:02,410 --> 00:08:08,620 ‫And the value that is corresponding to the key is the tuple ID, right. 156 00:08:08,740 --> 00:08:16,060 ‫Which is which is basically the page index, which is zero one, which page and which index in that 157 00:08:16,060 --> 00:08:17,680 ‫page belongs to which tuple. 158 00:08:17,710 --> 00:08:17,980 ‫Right? 159 00:08:17,980 --> 00:08:20,280 ‫So that's how the tuple ID works in Postgres. 160 00:08:20,290 --> 00:08:22,330 ‫So that's the only information we have. 161 00:08:22,360 --> 00:08:26,080 ‫So we have no idea that someone inserted something. 162 00:08:26,080 --> 00:08:28,270 ‫When you insert something, you update the index as well, right? 163 00:08:28,270 --> 00:08:31,390 ‫So when you read it, that transaction should check. 164 00:08:31,420 --> 00:08:33,070 ‫Am I supposed to see that row? 165 00:08:33,100 --> 00:08:36,040 ‫Is this row deleted and I'm still needed? 166 00:08:36,190 --> 00:08:39,040 ‫Is this row inserted and I'm not really need to see it. 167 00:08:39,040 --> 00:08:46,330 ‫So this visibility information lives in the heap in the table as just another column which is x men 168 00:08:46,330 --> 00:08:51,400 ‫and x max x men tells you that this is the transaction that started the this is the transaction that 169 00:08:51,400 --> 00:08:52,760 ‫started the. 170 00:08:53,890 --> 00:08:59,740 ‫That created those tuples X max tells you this is the transaction that killed the tuple or deleted it 171 00:08:59,740 --> 00:09:01,390 ‫or even updated it. 172 00:09:01,390 --> 00:09:01,660 ‫Right? 173 00:09:01,720 --> 00:09:04,780 ‫Because Postgres creates a new tuple for every update. 174 00:09:04,990 --> 00:09:11,860 ‫So now, based on this information, Postgres said, All right, uh, for Adam's case, a select count. 175 00:09:11,870 --> 00:09:14,200 ‫I know I got this, but guess what? 176 00:09:14,230 --> 00:09:15,970 ‫These are new stuff. 177 00:09:15,970 --> 00:09:17,800 ‫He just inserted them, Remember? 178 00:09:17,800 --> 00:09:19,540 ‫They just inserted this row. 179 00:09:19,540 --> 00:09:21,900 ‫And then immediately after that, he did a query. 180 00:09:21,910 --> 00:09:29,650 ‫So based on Postgres will say, Well, I can't really trust those rows because they are. 181 00:09:29,680 --> 00:09:31,930 ‫They might not be visible for me. 182 00:09:31,930 --> 00:09:35,740 ‫So I need to go to the heap to read the X men. 183 00:09:35,740 --> 00:09:40,240 ‫And when I read the X men say, Oh, this is a transaction that started before me and committed. 184 00:09:40,420 --> 00:09:43,630 ‫I am supposed to see those, those rows. 185 00:09:43,630 --> 00:09:45,460 ‫I am actually supposed to see them. 186 00:09:45,460 --> 00:09:47,500 ‫So they will pull it back. 187 00:09:48,220 --> 00:09:48,460 ‫Right. 188 00:09:49,120 --> 00:09:49,900 ‫But. 189 00:09:50,770 --> 00:09:56,740 ‫You might say, ask the question Now, does that mean that every query now has to go to the heap to 190 00:09:56,740 --> 00:09:57,850 ‫check if it's visible or not? 191 00:09:58,000 --> 00:09:59,020 ‫Not really. 192 00:09:59,050 --> 00:10:02,320 ‫Postgres has this optimization called the Visibility Page. 193 00:10:02,350 --> 00:10:07,870 ‫Each page there's like a data section in memory that Postgres maintains that says, okay, page zero, 194 00:10:08,200 --> 00:10:10,490 ‫don't bother going to the heap. 195 00:10:10,510 --> 00:10:17,050 ‫All the rows in this page zero are visible to every single transaction. 196 00:10:17,410 --> 00:10:26,860 ‫There is no transaction that is running for a long time that requires the the old version or requires 197 00:10:26,920 --> 00:10:32,170 ‫not not to see these rows for some reason, for isolation purposes, all of them are required. 198 00:10:32,200 --> 00:10:33,020 ‫How do we do that? 199 00:10:33,040 --> 00:10:36,910 ‫We do a vacuum on the table. 200 00:10:36,910 --> 00:10:43,480 ‫If you do a vacuum on the table, Postgres will go through all of them and do this expensive check, 201 00:10:43,480 --> 00:10:48,520 ‫which is like, Is there anybody really running right now needing. 202 00:10:49,120 --> 00:10:51,080 ‫Is there anybody running? 203 00:10:51,620 --> 00:10:53,450 ‫Is there any transaction that is running that? 204 00:10:54,100 --> 00:10:55,810 ‫Needs not to see this row. 205 00:10:55,840 --> 00:10:56,220 ‫Right? 206 00:10:56,260 --> 00:10:59,720 ‫Is this one of these rows are invisible to any of them? 207 00:10:59,740 --> 00:11:02,650 ‫The question is no, because nobody is running. 208 00:11:02,650 --> 00:11:04,090 ‫So by running vacuum. 209 00:11:04,090 --> 00:11:04,310 ‫Right. 210 00:11:04,330 --> 00:11:05,260 ‫I actually tested it. 211 00:11:05,740 --> 00:11:12,280 ‫If you run vacuum right after this and then you run experiment, you get an index only scan. 212 00:11:12,820 --> 00:11:17,790 ‫So that explains why do we get an index and apply this to everything, Right? 213 00:11:17,800 --> 00:11:18,220 ‫Everything. 214 00:11:18,220 --> 00:11:20,200 ‫You get this heap, I think. 215 00:11:20,200 --> 00:11:20,370 ‫Yeah. 216 00:11:20,380 --> 00:11:24,310 ‫The heap plan is like the heap features you're going to see. 217 00:11:24,310 --> 00:11:26,050 ‫The heap fetches somewhere here. 218 00:11:26,080 --> 00:11:27,070 ‫Sometimes I see it. 219 00:11:27,160 --> 00:11:33,400 ‫I can't see it right now, but you'll see the number of pages that we had to go to the, to the, to 220 00:11:33,400 --> 00:11:34,300 ‫the heap for this. 221 00:11:34,570 --> 00:11:40,450 ‫So that's basically one reason why the the in this case, the the plan completely is different. 222 00:11:40,480 --> 00:11:42,760 ‫They said, hey, you know what, this is a brand new thing. 223 00:11:42,760 --> 00:11:46,180 ‫I really need to go to the heap anyway to do this. 224 00:11:46,600 --> 00:11:50,440 ‫Well, that's one reason sometimes the index only scan. 225 00:11:51,100 --> 00:11:53,320 ‫In this case, we'll always do that. 226 00:11:53,320 --> 00:11:53,980 ‫But. 227 00:11:55,140 --> 00:11:57,300 ‫Again, it's always not guaranteed, right? 228 00:11:57,750 --> 00:12:04,440 ‫Because especially in the count case, like you inserted a row, something is not visible. 229 00:12:04,800 --> 00:12:07,580 ‫Everything just just falls apart. 230 00:12:07,590 --> 00:12:13,110 ‫So it's really hard for the planner to pick up the right plan based on this information. 231 00:12:13,110 --> 00:12:13,440 ‫Right? 232 00:12:13,600 --> 00:12:17,150 ‫So so, yeah, that's all I wanted to talk about here. 233 00:12:17,160 --> 00:12:19,680 ‫It's really interesting to understand that. 234 00:12:19,680 --> 00:12:22,740 ‫And based on that, you can explain some of the things that happened to you. 235 00:12:22,740 --> 00:12:27,480 ‫I can't think of every possible situation, but always happen that if a planner does something, it 236 00:12:27,480 --> 00:12:28,650 ‫does it for a reason. 237 00:12:28,650 --> 00:12:31,350 ‫It might be a bad plan, right? 238 00:12:31,350 --> 00:12:31,800 ‫Listen to me. 239 00:12:31,800 --> 00:12:38,220 ‫It might be a bad plan because just the statistics are not up to date or the statistics are not really 240 00:12:38,220 --> 00:12:42,360 ‫smart enough to for a specific use case. 241 00:12:42,360 --> 00:12:45,600 ‫And that's basically another another another talk altogether. 242 00:12:45,600 --> 00:12:47,370 ‫This is advanced statistics. 243 00:12:47,370 --> 00:12:47,560 ‫Right. 244 00:12:47,640 --> 00:12:48,780 ‫And database. 245 00:12:48,870 --> 00:12:49,350 ‫All right, guys. 246 00:12:49,350 --> 00:12:50,640 ‫That's that's it for me today. 247 00:12:50,640 --> 00:12:51,840 ‫Hope you enjoyed this lecture.