1 00:00:00,150 --> 00:00:02,610 ‫Welcome to another Q&A lecture. 2 00:00:03,330 --> 00:00:06,420 ‫This is for the month of May 2022. 3 00:00:06,420 --> 00:00:12,660 ‫And the question for today's lecture is from Robyn. 4 00:00:13,530 --> 00:00:20,310 ‫And if you don't know, this is our series and, and this Udemy course where we pick one of your questions, 5 00:00:20,910 --> 00:00:22,230 ‫they're fantastic questions. 6 00:00:22,230 --> 00:00:27,390 ‫All of the questions I get is amazing and I answer Titans at all of them to my best of their knowledge 7 00:00:27,390 --> 00:00:28,530 ‫and obviously time. 8 00:00:29,310 --> 00:00:34,470 ‫But for those questions that need some more time, I dedicate a lecture. 9 00:00:34,500 --> 00:00:35,730 ‫Yeah, like this. 10 00:00:36,150 --> 00:00:40,350 ‫So this comes from Robin Srivastava. 11 00:00:41,420 --> 00:00:42,890 ‫Let's read it and discuss. 12 00:00:43,730 --> 00:00:46,490 ‫Index on column with duplicate values. 13 00:00:47,270 --> 00:00:57,050 ‫I absolutely love that when you guys watch my lectures and then start thinking about these cases because 14 00:00:57,050 --> 00:00:58,730 ‫they are real world cases. 15 00:00:59,270 --> 00:00:59,570 ‫Right? 16 00:00:59,600 --> 00:01:06,710 ‫I never mentioned anything about duplicate values, but this is this comes to you, right, as you think 17 00:01:06,710 --> 00:01:07,170 ‫about it. 18 00:01:07,210 --> 00:01:08,000 ‫So stuck about it. 19 00:01:08,210 --> 00:01:15,050 ‫And this is this is how you basically you see yourself evolve, right, as you ask these questions. 20 00:01:15,500 --> 00:01:22,400 ‫That means you're penetrating and you're breaking these walls that the confusion and everything. 21 00:01:23,240 --> 00:01:30,020 ‫So is really the question if you are creating an index on column which can have duplicate values. 22 00:01:30,260 --> 00:01:32,060 ‫So it's a non unique index, right? 23 00:01:32,720 --> 00:01:36,140 ‫Then how is it stored in the index data structure? 24 00:01:36,620 --> 00:01:42,950 ‫Does the that is a doesn't maintain a key with a name as duplicate values, a list of rows or corresponding 25 00:01:42,950 --> 00:01:43,670 ‫page information? 26 00:01:44,120 --> 00:01:45,320 ‫Here's what I want you to think. 27 00:01:45,320 --> 00:01:47,960 ‫So this is for Robert and everyone here. 28 00:01:47,960 --> 00:01:48,410 ‫Really. 29 00:01:48,980 --> 00:01:51,500 ‫So the question, I don't think we need this screen anymore. 30 00:01:51,500 --> 00:01:55,400 ‫So let's go back to this and just talk through an excuse. 31 00:01:55,400 --> 00:01:59,270 ‫My hair needs a haircut, so. 32 00:02:01,140 --> 00:02:06,120 ‫The moment you find yourself thinking about how is it implemented? 33 00:02:07,270 --> 00:02:10,600 ‫Uh, don't worry too much about that. 34 00:02:10,720 --> 00:02:11,530 ‫What does that mean? 35 00:02:12,310 --> 00:02:13,090 ‫It means that. 36 00:02:13,360 --> 00:02:14,050 ‫Whatever. 37 00:02:15,410 --> 00:02:22,610 ‫If you find a way yourself, like don't worry about how big companies do it. 38 00:02:23,120 --> 00:02:29,480 ‫It's like everyone make mistakes, everyone implement something and then find out that it's not really 39 00:02:29,480 --> 00:02:29,930 ‫efficient. 40 00:02:30,440 --> 00:02:30,660 ‫Right? 41 00:02:30,950 --> 00:02:31,340 ‫So. 42 00:02:32,450 --> 00:02:35,750 ‫You provided ways here to solve this problem. 43 00:02:37,040 --> 00:02:38,600 ‫Work it out yourself. 44 00:02:39,050 --> 00:02:41,420 ‫If you think these are optimized and they are fine. 45 00:02:42,440 --> 00:02:43,670 ‫This is an implementation. 46 00:02:44,150 --> 00:02:45,410 ‫There is no longer right. 47 00:02:46,010 --> 00:02:46,370 ‫Yeah. 48 00:02:46,400 --> 00:02:51,230 ‫There is one implementation that could read the wrong result and that's obviously not applicable, but 49 00:02:51,230 --> 00:02:53,600 ‫there is implementation that is more efficient than the other. 50 00:02:54,350 --> 00:03:02,000 ‫So that being said, I don't want you guys to worry about much as like, Oh, if both of us do it this 51 00:03:02,000 --> 00:03:03,410 ‫way, this is the right way. 52 00:03:03,860 --> 00:03:05,780 ‫If my cycle is there, this is the right way. 53 00:03:05,810 --> 00:03:06,110 ‫No. 54 00:03:07,360 --> 00:03:08,530 ‫No, not at all. 55 00:03:08,590 --> 00:03:16,850 ‫From my 20 years of experience, though, what I have been seeing is always I'd be wanting about, oh, 56 00:03:17,140 --> 00:03:21,680 ‫if someone else implemented something that must be better than my idea. 57 00:03:21,700 --> 00:03:22,120 ‫No. 58 00:03:22,810 --> 00:03:28,750 ‫As you gain more experience, of course, if you're new to the course, you get to stumble upon things 59 00:03:28,750 --> 00:03:29,940 ‫that you've never seen before. 60 00:03:29,950 --> 00:03:34,000 ‫You might have ideas, but they might be shut down because you still don't have. 61 00:03:35,030 --> 00:03:36,890 ‫Expert exploded knowledge. 62 00:03:36,890 --> 00:03:41,430 ‫But the more you knowledge you get, the more confident you become about your implementation. 63 00:03:41,450 --> 00:03:46,060 ‫I think I got into the tangent, but I think it's a little bit valuable here to talk about. 64 00:03:46,070 --> 00:03:49,010 ‫But so the index duplicate. 65 00:03:49,190 --> 00:03:51,800 ‫So let's start with this. 66 00:03:52,700 --> 00:03:57,950 ‫If you have a column, right, and I don't know. 67 00:03:59,340 --> 00:03:59,910 ‫Gender. 68 00:04:00,270 --> 00:04:00,570 ‫Right. 69 00:04:02,220 --> 00:04:04,440 ‫What are the values in the gender column? 70 00:04:05,640 --> 00:04:12,600 ‫Male, female, you know, on a specified you don't want it to specified or you can have a little bit 71 00:04:12,600 --> 00:04:17,440 ‫of a spectrum there based on what do you want to date a model to be. 72 00:04:18,050 --> 00:04:20,130 ‫Let's say you have three values you. 73 00:04:21,770 --> 00:04:23,810 ‫The moment you have these three values. 74 00:04:24,590 --> 00:04:24,890 ‫All right. 75 00:04:25,310 --> 00:04:27,260 ‫If you have 100 million rows. 76 00:04:29,270 --> 00:04:30,830 ‫Indexing their role. 77 00:04:31,190 --> 00:04:34,670 ‫So indexing that column becomes useless. 78 00:04:35,300 --> 00:04:36,050 ‫Let's think about it. 79 00:04:36,200 --> 00:04:36,500 ‫Right. 80 00:04:37,040 --> 00:04:38,930 ‫Adding an index on that. 81 00:04:40,380 --> 00:04:47,920 ‫Becomes really if you're just searching for, I don't know, all the I don't know male who watches their 82 00:04:47,940 --> 00:04:49,320 ‫certain shows, right? 83 00:04:49,320 --> 00:04:55,530 ‫For example, a female that watches this certain show like a writer building a system for them and to 84 00:04:55,560 --> 00:05:01,530 ‫gain experience indexing that column, just that becomes really useless because. 85 00:05:03,190 --> 00:05:05,690 ‫Remember, there's something called index selectivity. 86 00:05:06,560 --> 00:05:14,420 ‫So index selectivity means you want the index to give you as small a result as possible. 87 00:05:14,660 --> 00:05:23,840 ‫So if you if you index this and I don't know, like 99% of the value is the same, then it's pointless, 88 00:05:23,840 --> 00:05:24,080 ‫right? 89 00:05:24,080 --> 00:05:27,770 ‫Because it didn't reduce your results and it's not selective enough. 90 00:05:28,310 --> 00:05:29,420 ‫And that's something to worry about. 91 00:05:29,450 --> 00:05:30,460 ‫So that's one thing. 92 00:05:30,470 --> 00:05:31,100 ‫You have what I want. 93 00:05:31,490 --> 00:05:34,250 ‫But sometimes there is duplicate values regardless. 94 00:05:34,680 --> 00:05:34,940 ‫Right. 95 00:05:35,660 --> 00:05:37,110 ‫But there's, like, more selectivity. 96 00:05:37,140 --> 00:05:42,800 ‫Like, so there's a bunch of duplicate values, like, say, categories, like a movie show category. 97 00:05:42,830 --> 00:05:43,100 ‫Right. 98 00:05:44,020 --> 00:05:47,660 ‫And then then that is, has more variety, right? 99 00:05:47,660 --> 00:05:51,230 ‫So the the value is becomes duplicated. 100 00:05:51,260 --> 00:05:51,800 ‫Sure. 101 00:05:52,130 --> 00:05:59,900 ‫But they won't be as dense as like if you have two or three distinct values. 102 00:05:59,930 --> 00:06:00,530 ‫Does that make sense? 103 00:06:01,160 --> 00:06:08,060 ‫That being said, you can combine that index with other things to make it more selective, like if you 104 00:06:08,060 --> 00:06:09,760 ‫have other indexes by itself. 105 00:06:09,770 --> 00:06:11,120 ‫That's what I mean by itself. 106 00:06:11,120 --> 00:06:16,670 ‫It's a bit useless if you have a lot of duplicate values with very distinct, small, distinct values. 107 00:06:17,510 --> 00:06:19,400 ‫So this solution here. 108 00:06:21,330 --> 00:06:21,750 ‫Now. 109 00:06:21,900 --> 00:06:24,030 ‫Now I'm going to answer Robert and question again. 110 00:06:24,900 --> 00:06:27,980 ‫I felt that whatever I said is the kind of relevant to this. 111 00:06:27,990 --> 00:06:28,770 ‫That's why I mentioned. 112 00:06:29,370 --> 00:06:33,360 ‫So to have duplicate values. 113 00:06:34,020 --> 00:06:34,260 ‫Yeah. 114 00:06:34,890 --> 00:06:36,150 ‫That's how I'm indexing column. 115 00:06:37,380 --> 00:06:42,540 ‫And think about something like project code. 116 00:06:42,690 --> 00:06:42,980 ‫Right. 117 00:06:42,990 --> 00:06:45,150 ‫And you have like ten codes, right? 118 00:06:45,810 --> 00:06:48,540 ‫So different products will have the same quotes, obviously. 119 00:06:48,540 --> 00:06:48,750 ‫Right. 120 00:06:48,990 --> 00:06:50,790 ‫Like, oh, that this is not unique. 121 00:06:50,790 --> 00:06:53,400 ‫This is like a code code as a category. 122 00:06:53,400 --> 00:06:53,730 ‫I mean. 123 00:06:54,030 --> 00:06:54,210 ‫Right. 124 00:06:54,240 --> 00:06:57,330 ‫So the code zero code wine code three code for up to code ten. 125 00:06:58,320 --> 00:07:01,170 ‫So let's say you have ten products with the same code zero. 126 00:07:01,440 --> 00:07:02,970 ‫So if you're indexed by the code. 127 00:07:04,230 --> 00:07:05,670 ‫Then how? 128 00:07:05,850 --> 00:07:08,310 ‫How does the index looks like without any? 129 00:07:08,730 --> 00:07:09,870 ‫We aren't being smart. 130 00:07:11,250 --> 00:07:11,910 ‫How do we do it. 131 00:07:11,940 --> 00:07:14,110 ‫Today in the B3 leaf. 132 00:07:14,250 --> 00:07:14,580 ‫Right. 133 00:07:15,090 --> 00:07:18,720 ‫I get you have to watch the three lectures to understand this. 134 00:07:19,320 --> 00:07:20,640 ‫But what we're going to do is. 135 00:07:21,900 --> 00:07:24,990 ‫The key becomes the code. 136 00:07:24,990 --> 00:07:25,290 ‫Right. 137 00:07:25,290 --> 00:07:26,340 ‫So code zero. 138 00:07:27,290 --> 00:07:27,830 ‫And then. 139 00:07:28,970 --> 00:07:30,620 ‫The value becomes zero. 140 00:07:31,540 --> 00:07:31,750 ‫Right. 141 00:07:31,750 --> 00:07:33,640 ‫So to say it all, one depends. 142 00:07:33,640 --> 00:07:35,410 ‫This is a primary or secondary key. 143 00:07:35,770 --> 00:07:40,110 ‫The value will be either a pointer to the primary key. 144 00:07:40,120 --> 00:07:46,060 ‫So that would be a primary key of their own or it would be a pointer to the battle itself if it's like 145 00:07:46,060 --> 00:07:47,290 ‫in case of progress, right? 146 00:07:48,250 --> 00:07:53,410 ‫If it's a secondary key or if it's a primary key that you're looking at, the now primary key cannot 147 00:07:53,410 --> 00:07:54,080 ‫be unique. 148 00:07:54,100 --> 00:07:54,970 ‫So forget about that. 149 00:07:55,390 --> 00:07:59,820 ‫So, yeah, so it would be the code and the row itself, right? 150 00:07:59,830 --> 00:08:03,190 ‫The overall idea which like 32 bits would depend on the size of the role. 151 00:08:05,090 --> 00:08:09,440 ‫And yeah, if it's prime at a key, then it's going to be like the key, primary key. 152 00:08:09,440 --> 00:08:10,730 ‫That's why you have to be careful. 153 00:08:10,760 --> 00:08:12,920 ‫What is your primary key here? 154 00:08:12,920 --> 00:08:16,470 ‫Because it will be duplicated right in this case. 155 00:08:16,730 --> 00:08:22,520 ‫So code zero becomes this right, this value wrote role one, right, let's say. 156 00:08:22,790 --> 00:08:32,690 ‫And the code zero again is allowed to encode zero or three and code 004 and goes 005 and go. 157 00:08:33,440 --> 00:08:35,000 ‫So that's how we're going to do it. 158 00:08:35,660 --> 00:08:35,920 ‫All right. 159 00:08:36,650 --> 00:08:42,230 ‫Obviously, if you think about it, you just duplicated the value zero many times. 160 00:08:42,230 --> 00:08:48,170 ‫So if you're indexing a 32 bit and you know, that's 32 bits, 32 bit, 32 bit, the two in the index. 161 00:08:48,170 --> 00:08:49,460 ‫So the index will be bloated. 162 00:08:50,030 --> 00:08:50,270 ‫Right. 163 00:08:51,170 --> 00:08:53,030 ‫So that's that's what we have today. 164 00:08:53,330 --> 00:08:53,570 ‫Right. 165 00:08:55,880 --> 00:09:00,290 ‫As a as a naive implementation, if you will, only rate. 166 00:09:00,680 --> 00:09:02,510 ‫This was implemented for the longest time. 167 00:09:02,900 --> 00:09:10,310 ‫It's only recently when PostgreSQL it's only recently when PostgreSQL, I suppose is 13 implemented 168 00:09:10,320 --> 00:09:13,610 ‫something called the deduplication in the B three index. 169 00:09:14,450 --> 00:09:25,700 ‫So that what they did is they d duplicated this so that in case of doing 0010020003 that had they would 170 00:09:25,700 --> 00:09:33,380 ‫put zero and then a specific array of the all the rows effectively that points to that. 171 00:09:33,410 --> 00:09:33,740 ‫Right. 172 00:09:34,280 --> 00:09:40,700 ‫Or based on what Robin said, just says, okay, let's put let's put just the page information. 173 00:09:40,850 --> 00:09:41,140 ‫Right. 174 00:09:41,360 --> 00:09:42,470 ‫That's why these are just right. 175 00:09:42,710 --> 00:09:48,830 ‫Let's just put the page information and then we can filter that when we go to the page. 176 00:09:48,860 --> 00:09:49,130 ‫Right. 177 00:09:49,490 --> 00:09:50,810 ‫That's another implementation. 178 00:09:51,170 --> 00:10:02,180 ‫So you can do any implementation as as long as it, you know, as you work it out and then kind of measure 179 00:10:02,180 --> 00:10:03,470 ‫the value here. 180 00:10:03,650 --> 00:10:04,490 ‫It really depends. 181 00:10:04,730 --> 00:10:09,020 ‫But all comes back to what the user wan with the DB actually doing. 182 00:10:09,380 --> 00:10:10,850 ‫Why are you indexing value? 183 00:10:10,850 --> 00:10:15,860 ‫So if you're indexing something with a lot of duplicates, really think about it really hard. 184 00:10:16,490 --> 00:10:19,490 ‫You know, what value does this give you? 185 00:10:19,490 --> 00:10:21,530 ‫Test it, you know, do the explain. 186 00:10:21,530 --> 00:10:24,680 ‫We have done thousands, you know, of these lectures. 187 00:10:24,770 --> 00:10:25,310 ‫Thousands. 188 00:10:25,730 --> 00:10:28,130 ‫I'm exaggerating a little bit, but we did a lot, right. 189 00:10:28,160 --> 00:10:30,890 ‫Talking about this almost as add an index. 190 00:10:30,890 --> 00:10:34,640 ‫See, does the explainer give you any of the above about our access path? 191 00:10:34,970 --> 00:10:35,200 ‫Right. 192 00:10:35,510 --> 00:10:38,930 ‫This is it's all at the end of the day, try and at or thinking about. 193 00:10:38,930 --> 00:10:39,260 ‫Okay. 194 00:10:39,260 --> 00:10:39,490 ‫Yeah. 195 00:10:39,560 --> 00:10:48,140 ‫While the theory is is sound, my use case might not be, you know, might not fit this case. 196 00:10:48,260 --> 00:10:50,150 ‫And that's basically what I'm going to leave it. 197 00:10:50,150 --> 00:10:51,560 ‫I hope you enjoyed this lecture. 198 00:10:51,560 --> 00:10:52,460 ‫I'm going to see you on the next one. 199 00:10:52,850 --> 00:10:53,450 ‫Enjoy the course.