1 00:00:00,290 --> 00:00:05,870 MongoDB is a document-based NoSQL database. 2 00:00:07,730 --> 00:00:15,800 It became very, very popular for its schemaless way of storing documents. 3 00:00:16,110 --> 00:00:21,780 You know, because the friction when it comes to writing code is gone, 4 00:00:21,800 --> 00:00:24,800 compared to SQL-based databases. 5 00:00:24,920 --> 00:00:33,230 But I get this question a lot, and I thought this video would be the perfect segue to actually answer 6 00:00:33,230 --> 00:00:33,920 that question. 7 00:00:33,920 --> 00:00:37,000 What is really the difference between NoSQL and SQL? 8 00:00:37,010 --> 00:00:39,410 So I'm going to address that in this video. 9 00:00:39,410 --> 00:00:49,130 But the main purpose of this video is actually going through the evolution of MongoDB's internal architecture. 10 00:00:49,130 --> 00:00:56,210 So this is a topic that people very rarely discuss because we're going into the bowels of the database, 11 00:00:56,210 --> 00:01:01,860 not the front end, in the sense of how you interact with it and store data, right? 12 00:01:01,880 --> 00:01:05,360 So we're talking about the actual architecture of the internals, right? 13 00:01:05,870 --> 00:01:09,720 There's been evolution up until version 5.3. 14 00:01:09,750 --> 00:01:13,760 A very interesting feature was added. 15 00:01:13,770 --> 00:01:15,540 It's called clustered collections. 16 00:01:15,540 --> 00:01:18,140 So I'll go through the evolution of this. 17 00:01:18,140 --> 00:01:27,270 So I'll discuss what the difference between SQL and NoSQL is, in a very deep way, and we'll discuss 18 00:01:28,440 --> 00:01:37,230 the first version of MongoDB, which started with their own storage engine, MMAPv1, then MongoDB acquiring WiredTiger 19 00:01:37,230 --> 00:01:44,280 back in 2014, I think, and then moving all the way to the recent changes, which is the clustered 20 00:01:44,280 --> 00:01:45,240 collections.
21 00:01:45,240 --> 00:01:48,630 And this is all going to make sense by the end of the video, hopefully. 22 00:01:48,840 --> 00:01:50,070 How about we get started? 23 00:01:50,070 --> 00:01:57,030 All right, so I'm going to use my Medium article, or at least the images in my Medium article, to illustrate 24 00:01:57,030 --> 00:01:57,420 this. 25 00:01:57,510 --> 00:02:04,140 Think of them as the slides, but go ahead and make sure to follow me on Medium. 26 00:02:04,290 --> 00:02:06,900 I started posting a lot of content there, 27 00:02:06,900 --> 00:02:11,610 if you like the written medium more than actual videos. 28 00:02:11,610 --> 00:02:15,120 But the first thing we're going to discuss is the database internals. 29 00:02:15,120 --> 00:02:24,900 If you look really at any database, any database almost always will have two pieces. 30 00:02:24,900 --> 00:02:31,890 And the piece that we actually deal with and interact with the most is the front end piece of the 31 00:02:31,890 --> 00:02:34,110 database, which is the API. 32 00:02:34,650 --> 00:02:42,120 You see, the most popular database API to actually communicate with the database, to tell it what to 33 00:02:42,120 --> 00:02:47,040 fetch, to actually ask it to store something, 34 00:02:47,070 --> 00:02:54,210 is the SQL language, which stands for Structured Query Language, right? 35 00:02:54,210 --> 00:02:58,260 And that is the API that we know and love. 36 00:02:58,260 --> 00:03:03,870 And an example of a different API could be Redis or Mongo, right? 37 00:03:03,870 --> 00:03:08,310 So, hey, get this document and store this document. 38 00:03:08,310 --> 00:03:14,940 There is no structured query language, there is no selecting tables and fields, right? 39 00:03:14,970 --> 00:03:17,280 It's just its own different API. 40 00:03:17,280 --> 00:03:21,290 So the API can actually change based on the database.
41 00:03:21,300 --> 00:03:23,940 The second portion is actually the data format. 42 00:03:23,940 --> 00:03:29,670 When I ask you to get something or I want to store something, what am I giving you and what am I taking 43 00:03:29,670 --> 00:03:30,780 back from you? 44 00:03:30,780 --> 00:03:34,920 And this is where a database can really shine. 45 00:03:35,070 --> 00:03:40,050 It's the core of any database system, the data format. 46 00:03:40,170 --> 00:03:48,840 So for the longest time, databases have always been tables and rows and columns. 47 00:03:49,440 --> 00:03:49,950 Right. 48 00:03:50,670 --> 00:03:59,390 And to interact with these rows and columns, you use the SQL language, right, 49 00:03:59,430 --> 00:04:00,720 to query it. 50 00:04:00,720 --> 00:04:06,480 So when people designed databases back in the '70s or even the '60s, right, they thought 51 00:04:06,480 --> 00:04:11,160 about it and said it's always going to be tables and it's always going to be rows and always going 52 00:04:11,160 --> 00:04:14,220 to be columns, and then the application can build on top of it, right? 53 00:04:14,520 --> 00:04:17,010 We built it bottom up, if you will. 54 00:04:17,730 --> 00:04:19,650 And then, uh. 55 00:04:20,640 --> 00:04:22,850 So that's one data format, right? 56 00:04:22,860 --> 00:04:25,290 But then people challenged this. 57 00:04:25,320 --> 00:04:32,280 People came in the 2000s era and said, wait a minute, why do I have to be 58 00:04:33,340 --> 00:04:35,650 really fixed to these tables? 59 00:04:35,650 --> 00:04:38,650 I don't know, my application has nothing to do with tables. 60 00:04:39,190 --> 00:04:43,750 As the web evolved, as the evolution of the web, right, 61 00:04:43,750 --> 00:04:48,580 came in, and JSON and documents, really, I want to deal with documents. 62 00:04:48,580 --> 00:04:50,590 I don't even have a schema per se.
63 00:04:50,620 --> 00:04:52,900 I don't have tables with a specific schema. 64 00:04:52,900 --> 00:04:54,210 I want to be flexible. 65 00:04:54,220 --> 00:04:56,830 Why are you forcing me to do tables? 66 00:04:56,830 --> 00:05:02,830 And that's the idea of where documents came in. Later, graphs came in. Later, 67 00:05:02,830 --> 00:05:07,240 other column-based storage came in, right, 68 00:05:07,240 --> 00:05:14,850 instead of row storage. With all of this, the database really automatically became this two-part thing, 69 00:05:14,860 --> 00:05:19,180 with the front end and the storage engine, which is the most important part, the storage engine here, 70 00:05:19,750 --> 00:05:20,380 you see. 71 00:05:20,650 --> 00:05:25,750 So the storage, once we discuss this, there's the data format, how I'm returning these things to 72 00:05:25,750 --> 00:05:29,080 the user, and by the user here, I really mean the application. 73 00:05:29,080 --> 00:05:35,270 And again, the front end here, I'm talking about the actual database front end. 74 00:05:35,270 --> 00:05:38,450 It's a portion inside the database, right. 75 00:05:38,930 --> 00:05:49,760 And then the second portion, which is the most important part, is how am I storing the data on disk, 76 00:05:49,970 --> 00:05:50,420 right? 77 00:05:50,420 --> 00:05:57,950 And the storage engine doesn't really care what you're storing in it. To the storage engine, 78 00:05:57,950 --> 00:06:03,650 you have something called a page, and in the page you have bytes. 79 00:06:04,250 --> 00:06:06,170 That's all it cares about. 80 00:06:06,170 --> 00:06:07,370 And the 81 00:06:08,140 --> 00:06:08,620 front 82 00:06:08,620 --> 00:06:16,780 end part of the database will say, hey, by the way, in this page there is a bunch of rows, right? 83 00:06:16,960 --> 00:06:23,490 I have a row store, right, where I have a table and I put the row and all the columns.
84 00:06:23,500 --> 00:06:31,450 And then right after the final column of the first row, I put the second row, and you can see that it's 85 00:06:31,450 --> 00:06:38,530 just, if you think of it like an actual page, a rectangle, the first row, and then followed 86 00:06:38,530 --> 00:06:44,140 by the second row, followed by the third row and all its columns, fourth row and all its columns, fifth 87 00:06:44,140 --> 00:06:46,380 row and all its columns, until the page fills. 88 00:06:46,390 --> 00:06:50,720 That's why the storage engine has a property called the page size. In InnoDB, 89 00:06:50,860 --> 00:06:51,430 MySQL, 90 00:06:51,430 --> 00:06:55,450 that's 16 KB. In MongoDB, 91 00:06:55,450 --> 00:07:03,190 I don't remember, but in Postgres it's 8 KB, and you can change this page size. So in Mongo, 92 00:07:04,060 --> 00:07:06,610 what we're storing is just a document. 93 00:07:06,730 --> 00:07:16,270 It's a JSON document that the front end receives and it turns it into a bunch of bytes and then we flush 94 00:07:16,270 --> 00:07:22,930 it to a page, and it's the same thing, the document, the first key and the value, and then 95 00:07:22,930 --> 00:07:25,660 we just write it to the storage engine. 96 00:07:26,710 --> 00:07:33,040 So if you really, really think about it, it's always the storage, what you're storing, and how 97 00:07:33,040 --> 00:07:35,560 the front end is actually extracting this information. 98 00:07:35,560 --> 00:07:40,060 So document, graph, right, graph-based database. 99 00:07:40,060 --> 00:07:44,230 When I say, hey, this is a graph-based database, the storage engine doesn't care. 100 00:07:44,590 --> 00:07:52,210 It's just how the front end part of the database actually organizes the bytes such that when I store 101 00:07:52,210 --> 00:07:52,720 them 102 00:07:53,530 --> 00:08:02,250 and want to read that page, I get as much efficiency in my read as possible.
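To make the "a page is just bytes" idea concrete, here is a minimal sketch, not how any real engine lays out its pages. The 8 KB `PAGE_SIZE`, the 4-byte length prefix, and the function names are all assumptions made up for illustration, and plain JSON text stands in for the on-disk document encoding.

```python
import json

PAGE_SIZE = 8192  # assumed page size in bytes, chosen only for illustration


def pack_into_pages(records, page_size=PAGE_SIZE):
    """Place length-prefixed records back to back, starting a new page when one fills."""
    pages = [bytearray()]
    for record in records:
        entry = len(record).to_bytes(4, "little") + record
        if len(entry) > page_size:
            raise ValueError("record larger than a page; not handled in this sketch")
        if len(pages[-1]) + len(entry) > page_size:
            pages.append(bytearray())
        pages[-1] += entry
    return [bytes(p) for p in pages]


def unpack_page(page):
    """The front end's job: turn a page's raw bytes back into records."""
    records, pos = [], 0
    while pos < len(page):
        size = int.from_bytes(page[pos:pos + 4], "little")
        records.append(page[pos + 4:pos + 4 + size])
        pos += 4 + size
    return records


# The "storage engine" functions above never look inside the bytes.
docs = [{"_id": i, "name": f"user{i}", "salary": 10000 + i} for i in range(500)]
encoded = [json.dumps(d).encode("utf-8") for d in docs]
pages = pack_into_pages(encoded)
decoded = [json.loads(r) for page in pages for r in unpack_page(page)]
```

The same two functions would work unchanged if the records were rows or graph nodes; only the encoding and decoding steps at the bottom would differ.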
103 00:08:02,260 --> 00:08:06,110 So another piece of the storage engine is indexes, right? 104 00:08:06,130 --> 00:08:09,850 Because now we're storing things in a bunch of pages, right? 105 00:08:09,880 --> 00:08:13,120 How we are storing them is also another story, right? 106 00:08:13,420 --> 00:08:16,300 Are they just a bunch of files? 107 00:08:16,330 --> 00:08:20,920 Does each file represent a table, or a collection in MongoDB, 108 00:08:21,010 --> 00:08:21,550 right? 109 00:08:21,580 --> 00:08:26,040 Or am I storing the actual data in the index itself? 110 00:08:26,050 --> 00:08:27,310 We're going to talk about all that. 111 00:08:27,310 --> 00:08:27,700 Right. 112 00:08:27,730 --> 00:08:33,630 Indexes will help fast-track what you're looking for, right? 113 00:08:33,640 --> 00:08:35,790 That's also part of the storage engine, 114 00:08:35,800 --> 00:08:38,560 the type of indexes you're creating. 115 00:08:38,670 --> 00:08:40,360 Is it a B-tree? 116 00:08:40,630 --> 00:08:49,330 It all really helps pinpoint exactly what page you are trying to read. 117 00:08:49,450 --> 00:08:49,630 Okay. 118 00:08:50,260 --> 00:08:58,130 So if you have, like, a table or a MongoDB collection, a MongoDB collection is just a bunch of 119 00:08:58,130 --> 00:09:00,050 JSON documents, right? 120 00:09:00,980 --> 00:09:03,710 The storage engine can decide, you know what, this document is really large. 121 00:09:03,710 --> 00:09:05,390 I'm going to decide to compress it. 122 00:09:05,540 --> 00:09:08,180 So that's all a property of the storage engine. 123 00:09:08,180 --> 00:09:12,050 The front end doesn't even know that this document is compressed. 124 00:09:12,080 --> 00:09:17,090 All it does is, hey, just give me that document, and the storage engine will just decompress it and give it 125 00:09:17,090 --> 00:09:17,540 back. 126 00:09:17,750 --> 00:09:21,200 It gives it back to the front end and the front end will return it.
127 00:09:21,220 --> 00:09:21,550 Right? 128 00:09:21,560 --> 00:09:27,710 So this is like a clear separation between these two, and they can share tasks as well. 129 00:09:27,740 --> 00:09:34,340 Of course, the files, like, where is the actual full data, you know, because indexes only have part 130 00:09:34,340 --> 00:09:35,180 of the data, right? 131 00:09:35,340 --> 00:09:44,210 Like, I'm indexing on the first name, so first name is one index, and, I don't know, salary 132 00:09:44,240 --> 00:09:45,430 maybe, right? 133 00:09:45,470 --> 00:09:51,950 Salary is another data structure we create, and then we traverse it to get back to exactly that 134 00:09:51,950 --> 00:09:52,340 data, 135 00:09:52,340 --> 00:09:59,150 right, to that data file, which then pulls the entire document or row, and then we return it. 136 00:09:59,450 --> 00:10:03,620 And you can really be creative here. 137 00:10:03,620 --> 00:10:04,970 And that's what people did, right? 138 00:10:04,970 --> 00:10:08,300 The storage engine is also responsible for transactions. 139 00:10:08,570 --> 00:10:14,120 You know, when I'm changing this and this and this and this and this, I want to do it as one unit 140 00:10:14,150 --> 00:10:20,990 of work such that if there is a failure, please roll back all these changes. Don't persist anything 141 00:10:21,940 --> 00:10:23,470 halfway through. 142 00:10:23,480 --> 00:10:25,760 I want to be consistent. 143 00:10:25,760 --> 00:10:29,630 I want to be atomic and I want to be isolated 144 00:10:29,630 --> 00:10:33,200 from my concurrent transactions. All of these are properties of the storage engine, 145 00:10:33,200 --> 00:10:33,800 really. 146 00:10:34,040 --> 00:10:34,450 Right? 147 00:10:34,460 --> 00:10:35,690 I want durability. 148 00:10:35,690 --> 00:10:36,350 I want that. 149 00:10:36,350 --> 00:10:40,670 If I say commit and you told me, the front end, 150 00:10:40,670 --> 00:10:40,850 right...
151 00:10:40,940 --> 00:10:42,030 That's another thing, right? 152 00:10:42,030 --> 00:10:46,400 The transaction will say, hey, I want you to commit, and the storage engine will say, yes, you committed 153 00:10:46,400 --> 00:10:47,240 successfully. 154 00:10:47,240 --> 00:10:54,860 If I get the success and return to the user and then later you crash, that data better be there when 155 00:10:54,860 --> 00:10:58,940 I come back, because you told me you committed successfully. 156 00:10:58,940 --> 00:10:59,300 Right. 157 00:10:59,360 --> 00:11:00,470 All of these things. 158 00:11:00,470 --> 00:11:05,990 Well, the write-ahead log, or journaling, as MongoDB calls it, right? 159 00:11:05,990 --> 00:11:12,350 As I'm writing things, usually when you write things, it needs to go to the data file. 160 00:11:12,350 --> 00:11:17,660 That's where the major storage lives. 161 00:11:17,660 --> 00:11:25,710 But then writing to the data file is really expensive because you're writing in pages, these massive 162 00:11:25,710 --> 00:11:28,710 pages, 8 KB and 16 KB, right? 163 00:11:28,740 --> 00:11:34,770 Imagine you're touching one column, one property; you need to write a whole page. 164 00:11:34,770 --> 00:11:37,440 There is no writing one byte in databases. 165 00:11:37,470 --> 00:11:38,400 No, no, no, sir. 166 00:11:38,430 --> 00:11:44,370 We don't go to disk and say, hey, just change that tiny byte or just change those tiny three bytes or 167 00:11:44,370 --> 00:11:45,780 just change that one kilobyte. 168 00:11:45,810 --> 00:11:47,310 Nope, you can't do that. 169 00:11:47,640 --> 00:11:50,550 That's not how SSDs and hard drives work. 170 00:11:50,550 --> 00:11:54,270 You have to write in chunks, and big chunks, for efficiency.
171 00:11:54,310 --> 00:12:00,570 You do an I/O, you're going to write a whole sector; on an SSD, you write a whole page or a block 172 00:12:00,570 --> 00:12:03,990 or an erasable unit, based on the new technology of SSDs. 173 00:12:03,990 --> 00:12:04,500 Right. 174 00:12:07,090 --> 00:12:14,530 That's where the shingled hard drives come into the picture, where they increase the write portions 175 00:12:14,530 --> 00:12:15,520 and stuff like that. 176 00:12:15,850 --> 00:12:22,390 We don't have byte addressability on disk, unfortunately, as of today, 2022. 177 00:12:22,420 --> 00:12:26,050 We have byte addressability in RAM. 178 00:12:26,080 --> 00:12:28,680 You can definitely write a single byte in RAM. 179 00:12:28,690 --> 00:12:29,260 Definitely. 180 00:12:29,260 --> 00:12:30,220 That's fine. 181 00:12:30,310 --> 00:12:30,660 Right. 182 00:12:30,730 --> 00:12:33,190 But on disk, persisted? 183 00:12:33,220 --> 00:12:34,750 No, you gotta write in pages. 184 00:12:34,750 --> 00:12:35,220 Right. 185 00:12:35,230 --> 00:12:36,970 And that's what we have today. 186 00:12:37,960 --> 00:12:40,750 And because of that cost, 187 00:12:40,750 --> 00:12:46,300 right, of writing to the whole data files, what the storage engine does is, as you make changes, all these changes 188 00:12:46,300 --> 00:12:46,810 go to RAM. 189 00:12:46,810 --> 00:12:48,110 Bup bup bup bup bup bup bup bup bup. 190 00:12:48,130 --> 00:12:49,750 We call them dirty pages. 191 00:12:49,780 --> 00:12:56,400 The moment you touch a page where you have a row or collection or document, we just mark it as dirty. 192 00:12:56,410 --> 00:12:58,450 And again, the storage engine doesn't know it's a document. 193 00:12:58,450 --> 00:12:59,680 It just knows it's bytes. 194 00:12:59,710 --> 00:13:03,880 It knows it's a page with a bunch of bytes that you touched. 195 00:13:03,910 --> 00:13:04,500 Right? 196 00:13:04,510 --> 00:13:07,460 And then you write to memory, and so it's fast.
197 00:13:07,460 --> 00:13:13,390 And then later the storage engine will collect as many changes as possible and then flush them once, right, 198 00:13:13,430 --> 00:13:14,690 to the database. 199 00:13:14,930 --> 00:13:17,600 All of this is the job of the storage engine. 200 00:13:18,440 --> 00:13:23,270 I still didn't come to the difference between NoSQL and SQL, but we'll get there, right? 201 00:13:23,570 --> 00:13:29,390 You clearly are going to see it. I think by this time, if you're still watching or listening, 202 00:13:29,420 --> 00:13:32,060 you probably already know the difference, right? 203 00:13:33,470 --> 00:13:37,100 So we're not writing immediately. 204 00:13:37,310 --> 00:13:39,860 We're collecting these changes. You might say, Hussein, but wait a minute. 205 00:13:39,860 --> 00:13:40,790 You're writing to RAM. 206 00:13:40,790 --> 00:13:42,290 If I commit, you're writing to RAM. 207 00:13:42,290 --> 00:13:43,360 What if I crash? 208 00:13:43,370 --> 00:13:44,690 That's the problem, right? 209 00:13:44,780 --> 00:13:49,880 So that's why, to recover from a crash, we create this thing. 210 00:13:49,880 --> 00:13:52,520 This thing is called the WAL, the write-ahead log. 211 00:13:52,520 --> 00:14:01,910 So as we write to RAM, to these data pages, we also write on disk tiny things that say, hey, here's 212 00:14:01,910 --> 00:14:03,950 a journal. On this date, 213 00:14:03,950 --> 00:14:14,780 on this date, right, dear diary, on this date I updated the salary from 10,000 to 10,000 and 50 cents. 214 00:14:15,710 --> 00:14:16,730 It's a bad year. 215 00:14:16,760 --> 00:14:18,200 What do you want me to say? 216 00:14:18,620 --> 00:14:18,920 Right. 217 00:14:18,920 --> 00:14:19,520 So. 218 00:14:19,520 --> 00:14:21,020 And on this date, 219 00:14:21,020 --> 00:14:22,520 I wrote this, on this date. 220 00:14:22,520 --> 00:14:24,560 You just write the changes.
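The journal idea can be sketched in a few lines. This is a toy under heavy assumptions, not MongoDB's or any other engine's journaling: `TinyStore`, the file names `data.json` and `journal.log`, and the one-JSON-object-per-line entry format are all invented for illustration. It shows the two halves of the trick: every change is appended to a small log on disk before it touches RAM, and on startup the store reloads the last checkpoint and redoes whatever the log recorded after it.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store: changes go to RAM and to an append-only journal."""

    def __init__(self, directory):
        self.data_path = os.path.join(directory, "data.json")   # last checkpoint
        self.wal_path = os.path.join(directory, "journal.log")  # write-ahead log
        self.memory = {}
        self._recover()

    def _recover(self):
        # Start from the last checkpoint, then redo every journaled change.
        if os.path.exists(self.data_path):
            with open(self.data_path) as f:
                self.memory = json.load(f)
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.memory[entry["key"]] = entry["value"]

    def set(self, key, value):
        # Journal first (a small append forced to disk), then change RAM.
        with open(self.wal_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.memory[key] = value

    def checkpoint(self):
        # Write the whole in-memory state to the data file, then empty the journal.
        with open(self.data_path, "w") as f:
            json.dump(self.memory, f)
            f.flush()
            os.fsync(f.fileno())
        open(self.wal_path, "w").close()


workdir = tempfile.mkdtemp()
store = TinyStore(workdir)
store.set("salary", 10000)
store.checkpoint()
store.set("salary", 10000.5)  # journaled, but not yet in the data file
del store                     # simulate a crash: the RAM copy is gone
recovered = TinyStore(workdir)
```

After the simulated crash, the data file still holds the old salary, and the recovered store gets the new one purely from replaying the journal.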
221 00:14:24,800 --> 00:14:30,320 So then in case of a crash, we're going to lose the dirty pages in memory. 222 00:14:30,320 --> 00:14:36,650 But when I come back, I have all the WAL and I have the last checkpoint in the data files. 223 00:14:36,650 --> 00:14:41,870 So I restore that and I redo the changes. 224 00:14:41,870 --> 00:14:43,610 I apply the WAL 225 00:14:44,250 --> 00:14:46,470 to the data files. 226 00:14:46,470 --> 00:14:51,580 And now in memory, I have the final representation as it was when I crashed. 227 00:14:51,600 --> 00:14:52,980 Brilliant design. 228 00:14:53,190 --> 00:14:58,320 Anyway, I'm going all over the place, but the storage engine and the front end, those are the main pieces. 229 00:14:58,320 --> 00:14:59,940 So we talked about what the front end is. 230 00:14:59,970 --> 00:15:01,790 We talked about what the storage engine is. 231 00:15:01,800 --> 00:15:07,410 The difference between SQL and NoSQL, mainly, 232 00:15:07,410 --> 00:15:09,060 is this puppy, the front end. 233 00:15:10,130 --> 00:15:17,870 The NoSQL guys came in and said, you've really restricted me with these tables and columns and the 234 00:15:17,900 --> 00:15:18,390 SQL. 235 00:15:18,410 --> 00:15:19,130 I hate SQL. 236 00:15:19,160 --> 00:15:20,560 I don't like SQL at all. 237 00:15:20,570 --> 00:15:21,180 Right? 238 00:15:21,200 --> 00:15:22,190 I don't like it. 239 00:15:23,270 --> 00:15:26,000 And it just didn't fit our application. 240 00:15:26,000 --> 00:15:29,570 I just want to give you a document, just store it. 241 00:15:29,570 --> 00:15:31,880 And that's where they redesigned. 242 00:15:32,120 --> 00:15:35,600 I think someone came in one day and shouted, like, NoSQL. 243 00:15:35,750 --> 00:15:36,980 They started a movement. 244 00:15:36,980 --> 00:15:40,490 They said NoSQL, NoSQL, NoSQL, no more SQL. 245 00:15:40,610 --> 00:15:45,290 And they created their own storage engine.
246 00:15:45,740 --> 00:15:50,080 And they, I believe, if I'm not mistaken, didn't even have indexes. 247 00:15:50,090 --> 00:15:55,850 So, you see, when databases like Oracle and SQL Server were created, they were just so 248 00:15:55,880 --> 00:16:05,570 wired to, you know, have tables and rows, and so everything was glued together and sticky; you can't 249 00:16:05,570 --> 00:16:06,020 change it. 250 00:16:06,020 --> 00:16:08,180 So they created everything from scratch. 251 00:16:08,210 --> 00:16:09,260 A storage engine. 252 00:16:09,620 --> 00:16:11,670 I'm storing just documents, for example. 253 00:16:11,670 --> 00:16:13,050 That's the first use case. 254 00:16:13,050 --> 00:16:15,780 I have a document, a JSON document, just store it. 255 00:16:15,780 --> 00:16:17,130 It's just a bunch of bytes. 256 00:16:17,160 --> 00:16:22,680 Later they added transactions, later they added the WAL, later they added indexes; they slowly added 257 00:16:22,740 --> 00:16:25,440 it, and then the API was just get and set. 258 00:16:26,730 --> 00:16:31,300 The user will give us a document and we're going to store it in the storage engine. 259 00:16:31,320 --> 00:16:31,920 That's it. 260 00:16:31,950 --> 00:16:33,180 It's just a bunch of bytes. 261 00:16:33,210 --> 00:16:40,650 We're going to convert the JSON into BSON, binary JSON, and then we persist it. 262 00:16:40,680 --> 00:16:42,290 That's the only difference. 263 00:16:42,300 --> 00:16:44,550 That's the only difference between SQL and 264 00:16:44,550 --> 00:16:45,710 NoSQL, right? 265 00:16:46,720 --> 00:16:52,180 The data format, which we changed from tables and rows into documents, and then the API, which is 266 00:16:52,180 --> 00:16:58,150 get and set instead of SQL, and a clear separation. 267 00:16:58,150 --> 00:17:06,550 And then of course there are out-of-the-box storage engines such as LevelDB or MyRocks, right, RocksDB.
268 00:17:06,580 --> 00:17:11,650 Sorry, RocksDB is a very popular storage engine that does exactly that: it takes a bunch of bytes, doesn't 269 00:17:11,650 --> 00:17:17,470 care what you have in your bytes, and just gives you the beauty of indexes and the storage 270 00:17:17,470 --> 00:17:19,310 engine, all this stuff, right? 271 00:17:19,330 --> 00:17:24,490 But then in the front end you can build your database the way you want. 272 00:17:24,580 --> 00:17:26,470 That's why you can build a graph database. 273 00:17:26,470 --> 00:17:35,090 So graph will prioritize not rows or columns per se, or even documents, but the traversability. 274 00:17:35,140 --> 00:17:39,250 So if this node is connected to this node, which connects to this node, I want to store them next to each 275 00:17:39,250 --> 00:17:41,860 other, right, in this way. 276 00:17:41,860 --> 00:17:49,490 And so the whole goal between the API, the front end, and the storage engine is that when I do an 277 00:17:49,490 --> 00:17:50,810 I/O and it gives me a page, 278 00:17:51,980 --> 00:17:57,000 you want, as much as possible, this page to have everything you need. 279 00:17:57,020 --> 00:18:03,710 You don't want to go back to read more pages. And I can go on for ages about this. 280 00:18:03,710 --> 00:18:08,810 You know, just the efficiency of the I/O, I think this is the most important thing, but we still didn't 281 00:18:08,810 --> 00:18:11,870 get to the main part, which is MongoDB. 282 00:18:11,870 --> 00:18:14,500 So now we talked about NoSQL versus SQL. 283 00:18:14,510 --> 00:18:15,230 What's the difference? 284 00:18:15,270 --> 00:18:24,650 Right. Now, what we want to discuss is the first version, ish, of MongoDB. 285 00:18:24,800 --> 00:18:28,550 Yeah, this is prior to 4.2. 286 00:18:28,670 --> 00:18:35,630 MongoDB's first storage engine was called MMAPv1, memory map version one, which is literally just a bunch of 287 00:18:35,840 --> 00:18:37,340 data files, right.
288 00:18:37,340 --> 00:18:39,900 And the data file, right. 289 00:18:41,130 --> 00:18:41,460 Uh. 290 00:18:43,020 --> 00:18:49,060 The data files store one document after the other. 291 00:18:49,080 --> 00:18:52,320 Now, I don't know if there is one data file per collection. 292 00:18:52,320 --> 00:18:55,370 Maybe when you have a collection you will have a data file. 293 00:18:55,380 --> 00:18:55,740 Maybe. 294 00:18:55,740 --> 00:18:56,670 Maybe it's different. 295 00:18:56,670 --> 00:18:58,110 But what? 296 00:18:59,140 --> 00:19:03,430 The brilliant design behind the first version was that it was offset-based. 297 00:19:03,460 --> 00:19:07,390 That means, hey, I want this document. 298 00:19:07,660 --> 00:19:08,750 Document. 299 00:19:08,770 --> 00:19:10,720 This particular document with an ID. 300 00:19:10,990 --> 00:19:16,170 So what Mongo has is a unique identifier, right? 301 00:19:16,180 --> 00:19:23,350 If you know about that, this unique identifier will tell you exactly which document it is. 302 00:19:23,350 --> 00:19:23,830 Right. 303 00:19:24,220 --> 00:19:26,230 And there is an index attached to it. 304 00:19:26,260 --> 00:19:26,710 Right? 305 00:19:27,750 --> 00:19:28,980 And this index, 306 00:19:28,980 --> 00:19:30,510 this is a B-tree index. 307 00:19:30,540 --> 00:19:35,070 When you traverse the B-tree index, you find the ID. 308 00:19:35,160 --> 00:19:35,420 Okay. 309 00:19:35,740 --> 00:19:37,010 It's in this page. 310 00:19:37,020 --> 00:19:37,740 It's in this page. 311 00:19:37,740 --> 00:19:38,810 And then you find it. 312 00:19:38,820 --> 00:19:45,000 What this unique identifier points to is something called a disk location. 313 00:19:45,180 --> 00:19:47,520 I think it's 32... 314 00:19:47,550 --> 00:19:49,800 it's actually 64 bits. 315 00:19:49,800 --> 00:19:50,310 Sorry. 316 00:19:50,340 --> 00:19:53,760 It's a 64-bit pointer. 317 00:19:53,760 --> 00:19:54,780 32...
318 00:19:54,810 --> 00:20:01,560 The 32 bits, the first 32 bits tell you the file, which file, and the second 32 bits tell you 319 00:20:01,560 --> 00:20:04,740 the offset, because now you know which file. 320 00:20:04,740 --> 00:20:08,130 But then the file is, is one gig, right. 321 00:20:08,160 --> 00:20:12,570 Where exactly the document is in this file is the offset. 322 00:20:12,960 --> 00:20:16,860 So with one single read, you can go 323 00:20:16,890 --> 00:20:17,910 exactly there. 324 00:20:17,910 --> 00:20:24,540 Because this is how the OS reads, right? You give the OS the file name and say, hey, go 325 00:20:24,540 --> 00:20:25,740 exactly to that location. 326 00:20:25,740 --> 00:20:27,700 You can absolutely do that; the file system 327 00:20:27,700 --> 00:20:32,590 allows you to say read from that position and read for X amount of bytes. 328 00:20:32,800 --> 00:20:33,040 Right. 329 00:20:33,280 --> 00:20:37,210 So I suppose another property is the document size. 330 00:20:37,210 --> 00:20:39,700 So you need to also store the document size, right? 331 00:20:40,300 --> 00:20:45,490 So you say read this part and then you're going to read that, right? 332 00:20:45,490 --> 00:20:48,430 And then you get a bunch of pages probably. 333 00:20:48,430 --> 00:20:52,780 And then if you're lucky, you're going to get one document or more. 334 00:20:52,810 --> 00:20:53,020 Right? 335 00:20:53,170 --> 00:20:55,360 That's why the document also has a maximum size. 336 00:20:55,360 --> 00:20:59,520 You can't go beyond a certain size because of these limitations, right? 337 00:20:59,530 --> 00:21:00,580 So now you got it. 338 00:21:00,580 --> 00:21:05,410 So you do one B-tree lookup from the ID, right, 339 00:21:05,440 --> 00:21:11,260 to find exactly which document to pull. 340 00:21:11,410 --> 00:21:11,950 Right. 341 00:21:12,610 --> 00:21:17,740 Again, you're going to get a bunch of bytes and then the front end is responsible for parsing
342 00:21:18,260 --> 00:21:21,950 the bytes to actually find documents per se. 343 00:21:22,130 --> 00:21:27,830 And of course, if this was like a relational database, then it's going to be columns and rows, right? 344 00:21:27,830 --> 00:21:31,820 If it was a graph, you're going to parse it such that you know the beginning and the end. 345 00:21:31,880 --> 00:21:32,330 Right? 346 00:21:32,480 --> 00:21:35,480 And it's not really rocket science at the end of the day. 347 00:21:35,540 --> 00:21:39,200 So we're getting a big O of log N, right? 348 00:21:39,200 --> 00:21:45,110 So it's just one I/O or multiple I/Os to traverse the nodes. 349 00:21:45,230 --> 00:21:51,450 That's why it's important that the B-tree is small enough to fit in memory, because 350 00:21:51,860 --> 00:21:55,550 the index is just another data structure which is persisted on disk. 351 00:21:55,580 --> 00:21:57,890 You read it from disk and you put it in memory. 352 00:21:57,890 --> 00:21:59,630 Hopefully it fits in memory. 353 00:21:59,630 --> 00:22:06,350 That's why Discord, actually, one problem that Discord faced, they moved from MongoDB because 354 00:22:06,350 --> 00:22:10,580 their indexes were so large they couldn't even fit in memory. 355 00:22:10,820 --> 00:22:19,650 And if your index doesn't fit in memory, that means as you traverse, right, the operating system will 356 00:22:19,650 --> 00:22:25,140 do this paging and swapping and will swap things to disk if they're not used, right. 357 00:22:25,470 --> 00:22:31,530 And this scanning is going to become slower just to find the DiskLoc. 358 00:22:31,920 --> 00:22:33,540 But that was the original thing. 359 00:22:33,540 --> 00:22:40,110 The problem, the clear problem with this is anything you touch, you change the document size, you 360 00:22:40,110 --> 00:22:42,330 update it to a longer string, 361 00:22:42,360 --> 00:22:46,560 the entire file is now scrambled.
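The offset-based lookup described above can be sketched like this. It is only an illustration of the idea, not MMAPv1's real on-disk format: the helper names, the 4-byte length prefix, the `collection.N` file names, and the plain dictionary standing in for the B-tree on the ID are all assumptions.

```python
import os
import tempfile


def make_disk_loc(file_no, offset):
    """Pack a 32-bit file number and a 32-bit offset into one 64-bit value."""
    return (file_no << 32) | offset


def split_disk_loc(loc):
    """Return (file number, offset) from a packed 64-bit location."""
    return loc >> 32, loc & 0xFFFFFFFF


def append_document(path, payload):
    """Append a length-prefixed document; return the offset where it starts."""
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)
        f.write(len(payload).to_bytes(4, "little") + payload)
    return offset


def read_document(paths, loc):
    """Seek straight to the document; the file is never scanned."""
    file_no, offset = split_disk_loc(loc)
    with open(paths[file_no], "rb") as f:
        f.seek(offset)
        size = int.from_bytes(f.read(4), "little")
        return f.read(size)


workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, f"collection.{n}") for n in range(2)]
index = {}  # stand-in for the B-tree on the ID: id -> disk location
index["a"] = make_disk_loc(1, append_document(paths[1], b'{"_id": "a", "name": "Ann"}'))
index["b"] = make_disk_loc(1, append_document(paths[1], b'{"_id": "b", "name": "Bob"}'))
found = read_document(paths, index["b"])
```

The sketch also makes the weakness visible: if document "a" were rewritten in place with a longer name, the bytes of "b" would have to move, and the location stored for "b" in the index would point at the wrong place until it was updated.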
362 00:22:46,560 --> 00:22:46,920 Right? 363 00:22:46,920 --> 00:22:51,450 Because of the offset; you changed the physical offset on the disk. 364 00:22:51,450 --> 00:22:52,020 Right. 365 00:22:52,020 --> 00:22:57,900 I suppose you can play games with this, but this became very, very problematic, right? 366 00:22:58,080 --> 00:22:59,850 Because the documents are based on offsets. 367 00:22:59,850 --> 00:23:03,030 The moment you change the document size, you push it a little bit. 368 00:23:03,060 --> 00:23:05,460 All the offsets are now off. 369 00:23:05,460 --> 00:23:06,060 Right? 370 00:23:06,090 --> 00:23:08,920 That was the original design, I suppose, if I'm not mistaken, 371 00:23:08,920 --> 00:23:15,600 of MyISAM in MySQL, which is hardly used anymore for the same reason. 372 00:23:15,600 --> 00:23:18,210 Yeah, it's nice for read-only. 373 00:23:18,330 --> 00:23:19,110 It's beautiful. 374 00:23:19,110 --> 00:23:19,860 All right. 375 00:23:20,190 --> 00:23:21,450 If I'm not changing it, 376 00:23:21,450 --> 00:23:22,740 yeah, it's very fast. 377 00:23:22,740 --> 00:23:27,960 You know exactly where it is and you pull it, but as you change it, it's just... 378 00:23:27,990 --> 00:23:29,510 it becomes really a mess. 379 00:23:29,520 --> 00:23:30,810 I suppose you can play tricks. 380 00:23:30,810 --> 00:23:32,610 Of course you can update the offsets. 381 00:23:32,610 --> 00:23:33,780 Offsets, right? 382 00:23:33,930 --> 00:23:35,520 You can update the offsets. 383 00:23:35,520 --> 00:23:36,060 But 384 00:23:36,720 --> 00:23:37,710 that was a problem. 385 00:23:37,760 --> 00:23:43,230 And plus, another problem with MMAP is the locking model, right? 386 00:23:43,260 --> 00:23:44,100 That's another thing. 387 00:23:44,100 --> 00:23:47,700 That is a responsibility of the storage engine, really, locking. 388 00:23:47,700 --> 00:23:48,170 Right.
389 00:23:48,180 --> 00:23:53,970 ‫How do you prevent two people from editing the same document at the same time? 390 00:23:55,020 --> 00:23:57,350 ‫You shouldn't really do that, right? 391 00:23:57,360 --> 00:23:58,440 ‫No database, 392 00:23:58,440 --> 00:24:01,290 ‫no database will allow you to update the same 393 00:24:02,320 --> 00:24:04,150 ‫unit of work concurrently, if you will. 394 00:24:04,600 --> 00:24:05,710 ‫Whether it's a row, 395 00:24:05,890 --> 00:24:09,010 ‫a table, or a collection. 396 00:24:09,100 --> 00:24:09,910 ‫Right. 397 00:24:10,490 --> 00:24:11,960 ‫In MMAPv1, 398 00:24:12,380 --> 00:24:14,750 ‫it was very strict, right? 399 00:24:14,840 --> 00:24:16,790 ‫Imagine this. 400 00:24:18,300 --> 00:24:22,950 ‫In the first version of MMAPv1, they didn't even bother. 401 00:24:22,970 --> 00:24:26,850 ‫Imagine, because these are people who are building a database from scratch. 402 00:24:26,850 --> 00:24:31,980 ‫So they didn't think about all this stuff that database 403 00:24:31,980 --> 00:24:34,290 ‫people have been doing for years. 404 00:24:34,290 --> 00:24:34,470 ‫Right? 405 00:24:34,560 --> 00:24:35,850 ‫For decades, actually. 406 00:24:36,090 --> 00:24:43,080 ‫So the first problem they ran into is like, oh, two people can change the same document, or different documents. 407 00:24:43,200 --> 00:24:45,660 ‫Oh, the offsets are all off. 408 00:24:45,660 --> 00:24:46,910 ‫Oh, you know what? 409 00:24:46,920 --> 00:24:49,380 ‫Let's just create a lock, a global lock. 410 00:24:49,380 --> 00:24:53,100 ‫So the first version was a global lock per database. 411 00:24:53,100 --> 00:24:54,270 ‫So no 412 00:24:54,270 --> 00:24:55,110 ‫two 413 00:24:55,770 --> 00:25:02,730 ‫transactions can change documents at the same time, even in different collections. 414 00:25:02,730 --> 00:25:08,280 ‫So if you have collection one and collection two, you can't even change collection one and collection two 415 00:25:08,310 --> 00:25:09,450 ‫documents at the same time.
416 00:25:09,870 --> 00:25:12,590 ‫Concurrent writes are serialized. 417 00:25:12,600 --> 00:25:14,220 ‫There is one global lock. 418 00:25:14,340 --> 00:25:19,890 ‫Again, that was the first version, because it's a single database lock. 419 00:25:20,010 --> 00:25:22,590 ‫So when you read about the data files, 420 00:25:22,590 --> 00:25:24,660 ‫this tells me that the data files are actually 421 00:25:26,090 --> 00:25:26,690 ‫collapsed. 422 00:25:26,690 --> 00:25:32,570 ‫So multiple data files, I mean, multiple collections can live in the same data files. 423 00:25:32,570 --> 00:25:37,100 ‫That's one reason you have to acquire a lock, so that no 424 00:25:37,100 --> 00:25:37,820 ‫two, no 425 00:25:37,820 --> 00:25:38,960 ‫two transactions can change it at once. 426 00:25:38,960 --> 00:25:40,700 ‫But then they improved this in 3.0. 427 00:25:40,730 --> 00:25:44,180 ‫In version three, it wasn't version 428 00:25:44,180 --> 00:25:49,730 ‫two of Mongo, in Mongo 3.0, they made it a collection-level lock, which is still not good. 429 00:25:49,850 --> 00:25:50,360 ‫Right. 430 00:25:50,540 --> 00:25:54,020 ‫For the SQL people, 431 00:25:54,020 --> 00:25:55,820 ‫it's like saying a table lock. 432 00:25:55,850 --> 00:26:02,180 ‫Imagine you have a table of a million rows and you want to insert a row in the table and then you want 433 00:26:02,180 --> 00:26:06,170 ‫to update another row in the same table; they have nothing to do with each other. 434 00:26:06,170 --> 00:26:06,650 ‫Right. 435 00:26:06,680 --> 00:26:08,180 ‫Imagine these are blocked. 436 00:26:08,180 --> 00:26:09,530 ‫Yes, it was blocked. 437 00:26:09,530 --> 00:26:13,280 ‫And if you're using MMAPv1, this is still the case. 438 00:26:13,280 --> 00:26:18,530 ‫MMAPv1, which is deprecated, by the way; MMAPv1 is deprecated now. 439 00:26:18,530 --> 00:26:21,560 ‫MMAPv1 uses a per-collection lock.
440 00:26:21,560 --> 00:26:26,580 ‫So now sure, you can do a concurrent write on two different collections, right? 441 00:26:26,820 --> 00:26:28,200 ‫Without blocking. 442 00:26:28,200 --> 00:26:33,180 ‫But now if you're updating documents in the same collection, that's a problem, right? 443 00:26:33,420 --> 00:26:37,830 ‫So then it became very challenging to manage the storage engine. 444 00:26:37,830 --> 00:26:44,010 ‫So what MongoDB did is say, you know what, let's just acquire this WiredTiger storage engine, a very, 445 00:26:44,010 --> 00:26:48,270 ‫very popular, very efficient storage engine. 446 00:26:48,270 --> 00:26:49,590 ‫So what they did is, 447 00:26:50,940 --> 00:26:56,190 ‫MongoDB just scrapped this and they bought a storage engine out of the box. 448 00:26:56,340 --> 00:27:01,320 ‫This has become what we call WiredTiger, right? 449 00:27:01,350 --> 00:27:02,880 ‫And the front end didn't change. 450 00:27:03,000 --> 00:27:05,520 ‫So your application code doesn't change. 451 00:27:05,520 --> 00:27:07,690 ‫The storage engine in the back end changed, right? 452 00:27:07,710 --> 00:27:09,750 ‫So now this is WiredTiger. 453 00:27:09,780 --> 00:27:12,450 ‫WiredTiger gave them the ability. 454 00:27:13,380 --> 00:27:22,260 ‫Now, here's the thing: with WiredTiger, document-level locking became possible. 455 00:27:22,470 --> 00:27:26,610 ‫Now you can update two documents in the same collection. 456 00:27:27,060 --> 00:27:28,920 ‫I'm saying these things 457 00:27:28,920 --> 00:27:30,960 ‫and you might say, this all already exists. 458 00:27:30,960 --> 00:27:31,260 ‫I know. 459 00:27:31,260 --> 00:27:37,560 ‫But I'm telling you the history of things, because building databases is not really a trivial thing. 460 00:27:37,710 --> 00:27:45,660 ‫Brilliant engineers went through this and, you know, they ran into a lot of challenges, 461 00:27:45,690 --> 00:27:46,680 ‫and this is one of them.
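The three lock granularities in this history (database, collection, document) can be compared with a small conflict model. This is a sketch, not engine code: `lock_resource`, `writers_conflict`, and the sample database and collection names are hypothetical, and a real engine tracks lock modes and queues rather than a simple equality check.

```python
def lock_resource(granularity, db, collection, doc_id):
    """Return the resource a writer must lock at the given granularity."""
    if granularity == "database":
        return (db,)
    if granularity == "collection":
        return (db, collection)
    if granularity == "document":
        return (db, collection, doc_id)
    raise ValueError(f"unknown granularity: {granularity}")


def writers_conflict(granularity, write_a, write_b):
    """Two writers block each other when they need the same lock resource."""
    return lock_resource(granularity, *write_a) == lock_resource(granularity, *write_b)


w1 = ("shop", "orders", 1)   # update document 1 in orders
w2 = ("shop", "users", 7)    # update a document in another collection
w3 = ("shop", "orders", 2)   # update a different document in orders

for level in ("database", "collection", "document"):
    print(level, writers_conflict(level, w1, w2), writers_conflict(level, w1, w3))
```

With a database-level lock both pairs block each other, with a collection-level lock only the two writes to `orders` do, and with document-level locking neither pair does.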
462 00:27:46,680 --> 00:27:54,690 ‫So the WiredTiger storage engine allowed you to update two different documents in the same collection 463 00:27:54,690 --> 00:27:59,310 ‫concurrently, which is now a beautiful thing. 464 00:27:59,310 --> 00:27:59,580 ‫Right? 465 00:27:59,610 --> 00:28:08,850 ‫And this now made it equivalent to basically all databases, because databases have 466 00:28:08,850 --> 00:28:11,700 ‫row-level locks, at least MySQL and Postgres. 467 00:28:11,730 --> 00:28:19,180 ‫You can definitely update two rows in the same table, but you cannot update the same row 468 00:28:19,210 --> 00:28:20,560 ‫in the same table concurrently. 469 00:28:21,220 --> 00:28:21,700 ‫Right. 470 00:28:21,730 --> 00:28:27,910 ‫One transaction acquires a lock, and then the second transaction tries to update the same row. 471 00:28:28,270 --> 00:28:33,880 ‫That will basically block the second transaction with row-level locking. 472 00:28:33,910 --> 00:28:37,840 ‫Now there is, I think, what's the database called? 473 00:28:38,680 --> 00:28:39,420 ‫Yuga? 474 00:28:39,460 --> 00:28:39,940 ‫Yuga. 475 00:28:39,970 --> 00:28:40,360 ‫Yugabyte. 476 00:28:40,400 --> 00:28:42,160 ‫YugabyteDB, if I'm not mistaken. 477 00:28:42,190 --> 00:28:49,630 ‫They even introduced column-level locking, which is another thing, like if I have a row and I'm 478 00:28:49,630 --> 00:28:53,290 ‫updating field one in that row, but you're updating field two. 479 00:28:53,320 --> 00:28:54,700 ‫Technically, 480 00:28:54,730 --> 00:28:56,590 ‫we're not changing the same thing. 481 00:28:56,650 --> 00:29:01,990 ‫Postgres will block you even if you're updating different things; MySQL, if I'm not mistaken, will 482 00:29:01,990 --> 00:29:04,000 ‫also lock it because it's a row-level lock. 483 00:29:04,030 --> 00:29:05,050 ‫But now 484 00:29:05,920 --> 00:29:10,380 ‫you can also have column-level locking, which says, hey,
485 00:29:10,570 --> 00:29:14,140 ‫yeah, you touched this row, but different fields of this row. 486 00:29:14,320 --> 00:29:15,820 ‫Same thing with the document. 487 00:29:15,820 --> 00:29:16,390 ‫Right? 488 00:29:16,570 --> 00:29:22,450 ‫I am really just updating this field in the JSON document and someone is inserting a new 489 00:29:22,450 --> 00:29:24,900 ‫field or updating another. Are we locking? 490 00:29:24,910 --> 00:29:26,200 ‫Do we really need to lock it? 491 00:29:26,230 --> 00:29:29,880 ‫Well, at the end of the day, this is what we do. 492 00:29:29,890 --> 00:29:30,880 ‫We lock it. 493 00:29:31,120 --> 00:29:38,140 ‫So yeah, if you happen to have two transactions updating the same row, even different columns, 494 00:29:38,290 --> 00:29:44,770 ‫you can't do that unless you have column-level locking, or key-level locking, if you will, in Mongo, which 495 00:29:44,770 --> 00:29:46,120 ‫I don't think exists. 496 00:29:46,540 --> 00:29:50,830 ‫And believe me, when I'm talking about these things, this is not cheap. 497 00:29:50,830 --> 00:29:51,310 ‫Right. 498 00:29:51,460 --> 00:29:56,530 ‫The moment you introduce column-level locking, that's another expense, because now you have to keep 499 00:29:56,560 --> 00:29:59,350 ‫track of what you're locking, and locks are, 500 00:29:59,380 --> 00:29:59,860 ‫guess what, 501 00:29:59,860 --> 00:30:07,430 ‫in memory. And row locks are more expensive than page locks or table locks or collection locks, because 502 00:30:07,430 --> 00:30:14,810 ‫for those you just need one, versus if you updated a million rows and transactions are in 503 00:30:14,810 --> 00:30:15,680 ‫progress, 504 00:30:15,680 --> 00:30:17,960 ‫that's a million locks, right? 505 00:30:18,110 --> 00:30:20,300 ‫Imagine adding column locks to that. 506 00:30:20,300 --> 00:30:26,570 ‫So a million times however many columns you're updating; it becomes really challenging, right? 507 00:30:27,150 --> 00:30:27,950 ‫Yeah.
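The bookkeeping cost argument comes down to simple arithmetic. The sketch below counts lock entries only. `locks_held` is a hypothetical helper, and real engines may escalate or batch locks, so treat the numbers as an upper-bound illustration.

```python
def locks_held(rows_updated, granularity, columns_per_row=1):
    """Count lock entries a transaction keeps in memory at each granularity."""
    if granularity == "collection":
        return 1  # one lock covers everything
    if granularity == "row":
        return rows_updated  # one lock per touched row
    if granularity == "column":
        return rows_updated * columns_per_row  # one lock per touched field
    raise ValueError(f"unknown granularity: {granularity}")


for level in ("collection", "row", "column"):
    print(level, locks_held(1_000_000, level, columns_per_row=3))
```

Finer granularity buys concurrency, but the lock table grows with the number of rows and columns touched.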
508 00:30:27,980 --> 00:30:30,020 ‫Building a database is not trivial. 509 00:30:30,050 --> 00:30:30,410 ‫All right. 510 00:30:30,440 --> 00:30:32,110 ‫Going back to WiredTiger. 511 00:30:32,120 --> 00:30:33,650 ‫We talked about that, right? 512 00:30:33,680 --> 00:30:40,260 ‫WiredTiger introduced compression, which didn't exist, by the way, in MMAPv1, right? 513 00:30:40,280 --> 00:30:41,540 ‫It didn't exist here. 514 00:30:41,570 --> 00:30:43,610 ‫WiredTiger introduced compression. 515 00:30:43,610 --> 00:30:51,440 ‫Now, when you actually take the document, WiredTiger compresses the JSON document, so that's really 516 00:30:51,440 --> 00:30:52,400 ‫brilliant. 517 00:30:52,430 --> 00:30:58,430 ‫Especially because JSON documents have these field names repeated all the time, right? 518 00:30:58,730 --> 00:31:01,250 ‫The field repeats, so you need to compress it. 519 00:31:01,250 --> 00:31:04,730 ‫So MongoDB WiredTiger actually compresses that. 520 00:31:04,730 --> 00:31:05,800 ‫So that's tiny. 521 00:31:05,810 --> 00:31:06,740 ‫Why is it tiny? 522 00:31:06,740 --> 00:31:13,880 ‫Because now if I'm compressing it, the page will fit more documents; one I/O will give me more documents 523 00:31:13,970 --> 00:31:18,140 ‫than one I/O did uncompressed. 524 00:31:18,380 --> 00:31:21,320 ‫If one I/O uncompressed gives me three documents, 525 00:31:21,350 --> 00:31:26,900 ‫one I/O compressed, in a single page, will give me 20 documents. 526 00:31:27,660 --> 00:31:36,000 ‫This is really powerful, because if I'm fetching 20 documents in 527 00:31:36,000 --> 00:31:38,160 ‫the older model, I have to do multiple I/Os. 528 00:31:38,190 --> 00:31:44,580 ‫I have to hit the disk multiple times, versus in WiredTiger just one I/O pulls all this stuff 529 00:31:44,580 --> 00:31:45,390 ‫compressed. 530 00:31:45,420 --> 00:31:51,090 ‫Do a little decompression in memory and you get a beautiful 20 documents.
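A rough way to see the effect of compression on documents per page is to pack the same documents into a fixed-size page with and without compression. This is a sketch under assumptions: the 4096-byte `PAGE_SIZE`, the sample fields, and both helper functions are made up for illustration, JSON text stands in for the stored document format, and `zlib` is used only because it ships with Python, not because it is the compressor the engine uses.

```python
import json
import zlib

PAGE_SIZE = 4096  # hypothetical page size in bytes

docs = [
    {"customer_name": f"user{i}", "customer_email": f"user{i}@example.com", "order_total": i * 10}
    for i in range(200)
]
encoded = [json.dumps(d).encode() for d in docs]


def docs_per_page_uncompressed(blobs, page_size):
    """Count how many documents fit in one page when stored as-is."""
    used = count = 0
    for blob in blobs:
        if used + len(blob) > page_size:
            break
        used += len(blob)
        count += 1
    return count


def docs_per_page_compressed(blobs, page_size):
    """Count how many leading documents fit once the block is compressed."""
    count = 0
    for n in range(1, len(blobs) + 1):
        if len(zlib.compress(b"".join(blobs[:n]))) > page_size:
            break
        count = n
    return count


plain = docs_per_page_uncompressed(encoded, PAGE_SIZE)
packed = docs_per_page_compressed(encoded, PAGE_SIZE)
print(plain, packed)
```

Because the field names repeat in every document, the compressed page holds more documents, so the same single read returns more of them.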
531 00:31:51,090 --> 00:31:55,050 ‫The major thing you have to think about here: how do I save I/Os? 532 00:31:55,080 --> 00:32:02,430 ‫That is the number one job of a DBA, of a developer, of a database: saving I/Os. 533 00:32:02,460 --> 00:32:06,180 ‫The less the I/O, the faster the database; nothing else matters. 534 00:32:07,020 --> 00:32:08,940 ‫That is exactly what it is. 535 00:32:11,280 --> 00:32:14,520 ‫All right, so now, 536 00:32:14,520 --> 00:32:17,880 ‫the way WiredTiger stores the data has completely changed. 537 00:32:17,880 --> 00:32:20,640 ‫It's no longer using this DiskLoc thing, right, 538 00:32:20,640 --> 00:32:24,780 ‫where it's just a bunch of data files and then you have offsets, because offsets are really terrible, 539 00:32:24,780 --> 00:32:31,170 ‫right, for changes: the offset changes and you have to update everything; changing one 540 00:32:31,170 --> 00:32:33,900 ‫document will screw up all your offsets. 541 00:32:33,930 --> 00:32:34,470 ‫Right? 542 00:32:34,680 --> 00:32:40,470 ‫So what they did instead, they stored it as a clustered B-tree index. 543 00:32:40,470 --> 00:32:42,690 ‫And I talked about this in another video. 544 00:32:42,930 --> 00:32:48,540 ‫I'm not going to go into details, but in a nutshell, right, they have something called the record 545 00:32:48,570 --> 00:32:49,560 ‫ID here. 546 00:32:49,560 --> 00:32:51,630 ‫And you can basically create anything. 547 00:32:51,630 --> 00:32:55,800 ‫This is a hidden clustered index in WiredTiger, and 548 00:32:56,470 --> 00:32:58,990 ‫it works based on the key; you can search. 549 00:32:58,990 --> 00:33:03,850 ‫And when you get here, the value is actually the entire document. 550 00:33:03,850 --> 00:33:05,630 ‫And not only a document, right. 551 00:33:05,650 --> 00:33:09,430 ‫But physically, all the documents 552 00:33:10,290 --> 00:33:13,380 ‫are ordered next to each other.
553 00:33:13,380 --> 00:33:21,450 ‫So the page that you land on here, the leaf page, is the data. 554 00:33:21,480 --> 00:33:23,310 ‫This is the data. 555 00:33:23,340 --> 00:33:25,500 ‫The entire data is in the index. 556 00:33:25,500 --> 00:33:27,000 ‫That's what a clustered index is. 557 00:33:27,040 --> 00:33:33,030 ‫It's by default what you get in MySQL; not in Postgres, but in MySQL. 558 00:33:33,030 --> 00:33:34,680 ‫Everything is a clustered index. 559 00:33:34,680 --> 00:33:40,890 ‫Every table has a clustered index and that's how your data is organized around the index. 560 00:33:40,890 --> 00:33:46,290 ‫So your table is organized around this index, where the leaf pages are the data. 561 00:33:46,290 --> 00:33:49,860 ‫So now if you land here, you get the document and guess what? 562 00:33:49,890 --> 00:33:54,720 ‫Because it's all on one page, you get the documents before it and the documents after it. 563 00:33:54,720 --> 00:33:58,590 ‫And because it's compressed, you're going to get a lot of tightly packed documents as well. 564 00:33:58,590 --> 00:34:04,590 ‫So you read this page and you get all the documents nearby, because it's ordered. 565 00:34:04,710 --> 00:34:06,060 ‫Not only that. 566 00:34:06,920 --> 00:34:14,010 ‫Each leaf page in a B+ tree is actually linked to the next page and to the next page and to the next page. 567 00:34:14,030 --> 00:34:18,080 ‫It's a linked list of pages, so the entire data is right here. 568 00:34:18,080 --> 00:34:23,870 ‫So if you want to do a range query and say, find me all record IDs between X 569 00:34:23,870 --> 00:34:29,900 ‫and Y, and we're going to talk about what the record ID is, because this is not the ID of the document. 570 00:34:29,900 --> 00:34:34,640 ‫And that's the problem that WiredTiger and Mongo introduced in a way.
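The linked leaf pages and the range query described here can be sketched as follows. It is a simplified model, not engine code: `LeafPage`, `build_leaves`, and `range_scan` are hypothetical names, the inner nodes of the tree are replaced by a binary search over the first key of each leaf, and keys are plain integers standing in for record IDs.

```python
from bisect import bisect_right


class LeafPage:
    """A leaf holding sorted (key, document) pairs and a link to the next leaf."""

    def __init__(self, entries):
        self.entries = entries
        self.next = None


def build_leaves(sorted_entries, per_page):
    """Split sorted entries into leaf pages and chain them as a linked list."""
    pages = [
        LeafPage(sorted_entries[i:i + per_page])
        for i in range(0, len(sorted_entries), per_page)
    ]
    for left, right in zip(pages, pages[1:]):
        left.next = right
    return pages


def range_scan(pages, low, high):
    """Return documents with low <= key <= high and how many leaves were read."""
    first_keys = [page.entries[0][0] for page in pages]
    # Stand-in for descending the inner nodes: pick the leaf that may hold `low`.
    page = pages[max(bisect_right(first_keys, low) - 1, 0)]
    results, pages_read = [], 0
    while page is not None:
        pages_read += 1
        for key, doc in page.entries:
            if key > high:
                return results, pages_read
            if key >= low:
                results.append(doc)
        page = page.next  # follow the leaf link instead of searching again
    return results, pages_read


leaves = build_leaves([(k, {"n": k}) for k in range(1, 13)], per_page=4)
found, pages_read = range_scan(leaves, 3, 6)
print([d["n"] for d in found], pages_read)
```

One search finds the starting leaf, and every later document in the range comes from walking the leaf links, which is why neighbouring keys are cheap to read.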
571 00:34:35,470 --> 00:34:44,830 ‫Now, if we have this, if you do a range scan, you're going to get all the documents that 572 00:34:44,830 --> 00:34:45,670 ‫are next to each other. 573 00:34:45,670 --> 00:34:50,380 ‫So a range scan is really powerful in a B+ tree, especially if it's clustered, because now you're 574 00:34:50,380 --> 00:34:54,100 ‫going to get all the nice documents tucked in together, right? 575 00:34:54,100 --> 00:34:56,380 ‫So you can find your 576 00:34:57,210 --> 00:35:02,340 ‫document using a B+ tree search in WiredTiger using the record ID. 577 00:35:02,400 --> 00:35:02,890 ‫But guess what? 578 00:35:02,910 --> 00:35:03,930 ‫What is this record ID? 579 00:35:05,970 --> 00:35:08,130 ‫It doesn't mean anything to the user. 580 00:35:08,160 --> 00:35:09,720 ‫This is an internal thing. 581 00:35:10,230 --> 00:35:13,050 ‫But what happened to this DiskLoc? 582 00:35:13,080 --> 00:35:16,050 ‫This used to be called the DiskLoc, but they changed it. 583 00:35:16,050 --> 00:35:17,130 ‫And that's what they had. 584 00:35:17,160 --> 00:35:27,810 ‫They had this as DiskLoc, and their indexes, the _id, the actual user-facing document ID index, has 585 00:35:27,810 --> 00:35:31,800 ‫always been mapped to the DiskLoc, because that's what we had, right? 586 00:35:31,800 --> 00:35:32,320 ‫DiskLoc. 587 00:35:32,340 --> 00:35:33,420 ‫That's exactly this. 588 00:35:33,420 --> 00:35:34,110 ‫This 589 00:35:34,990 --> 00:35:35,700 ‫is this. 590 00:35:35,710 --> 00:35:39,580 ‫This used to be the DiskLoc; they later changed it to record ID, 591 00:35:39,700 --> 00:35:41,800 ‫because it doesn't make sense to call it DiskLoc anymore. 592 00:35:42,010 --> 00:35:47,380 ‫But then this record ID now is just a pointer to where? Not to disk. 593 00:35:47,410 --> 00:35:51,940 ‫It is a pointer into this B+ tree, which is the hidden index.
594 00:35:51,940 --> 00:35:58,630 ‫So now if you're actually searching for the _id, the primary key, you're doing two lookups in WiredTiger, 595 00:35:58,630 --> 00:35:59,830 ‫not one. 596 00:36:00,370 --> 00:36:10,510 ‫So actually _id lookups in WiredTiger were slower than in the older engine, because now you have 597 00:36:10,510 --> 00:36:16,210 ‫to search two indexes, you have to load two indexes in memory, double the space, double the searches, 598 00:36:16,210 --> 00:36:17,290 ‫double the I/O. 599 00:36:17,410 --> 00:36:22,810 ‫You have to write to multiple indexes, because you have to keep those two guys in sync. 600 00:36:24,790 --> 00:36:30,360 ‫Secondary indexes, not so much, because the secondary indexes, right, if you think about it, really, 601 00:36:30,370 --> 00:36:31,570 ‫secondary indexes, 602 00:36:31,600 --> 00:36:33,100 ‫secondary indexes 603 00:36:33,370 --> 00:36:40,360 ‫now just point directly to the record ID. So yeah, in this particular case, all of these indexes 604 00:36:40,360 --> 00:36:46,150 ‫always point to the record ID; whether it's a primary index or a secondary index, they all point to 605 00:36:46,150 --> 00:36:52,540 ‫the hidden key, and that's what's causing us the double search, effectively, right? 606 00:36:53,320 --> 00:36:56,290 ‫So very similar to MySQL. 607 00:36:56,290 --> 00:37:03,610 ‫Not quite, because the MySQL primary key is actually this thing, right? 608 00:37:03,700 --> 00:37:11,080 ‫But the primary key in the earlier versions, at least from 4.2 to 5.2, and this is very recent, 609 00:37:11,080 --> 00:37:11,950 ‫this change, by the way, 610 00:37:11,950 --> 00:37:12,220 ‫right, 611 00:37:14,370 --> 00:37:18,450 ‫until very recently, 4.2 to 5.2, is like this. 612 00:37:18,480 --> 00:37:23,250 ‫When you search for _id, you find this and then you do another search, another search. 613 00:37:23,250 --> 00:37:25,350 ‫This is not a big O of one, right?
614 00:37:25,350 --> 00:37:30,180 ‫This is a big O of log n plus a big O of log n: two searches. 615 00:37:30,210 --> 00:37:30,610 ‫Right. 616 00:37:30,630 --> 00:37:34,260 ‫Whereas with this guy, you do a big O of log n and then a big O of one. 617 00:37:34,650 --> 00:37:36,150 ‫So now we have this beautiful design. 618 00:37:36,450 --> 00:37:43,410 ‫The problem, we understand now, with the _id is that we have to kind of duplicate 619 00:37:43,410 --> 00:37:44,310 ‫stuff, right? 620 00:37:44,970 --> 00:37:50,620 ‫The record ID is 64-bit, same thing here, but secondary indexes all point to the record ID. 621 00:37:50,650 --> 00:37:53,310 ‫That's the state of the art as of 5.2, right? 622 00:37:53,730 --> 00:37:58,530 ‫And the _id index is just another secondary index at this point. 623 00:37:58,530 --> 00:38:05,880 ‫It's not really a true primary index, because the primary index, by definition at least, is the clustered 624 00:38:05,880 --> 00:38:06,360 ‫index. 625 00:38:06,360 --> 00:38:06,670 ‫Right? 626 00:38:06,690 --> 00:38:07,530 ‫It is this one. 627 00:38:07,530 --> 00:38:09,120 ‫But we have now doubled. 628 00:38:10,020 --> 00:38:17,200 ‫Now let's go to the final stage, which is 5.3, I think July of 2022. 629 00:38:17,200 --> 00:38:19,750 ‫A really brand-new feature. 630 00:38:19,960 --> 00:38:26,650 ‫It's called clustered collections, where you can create a collection and you can make it a clustered 631 00:38:26,650 --> 00:38:27,480 ‫collection. 632 00:38:27,490 --> 00:38:36,760 ‫That means the WiredTiger hidden key disappears, and instead this becomes your hidden index. 633 00:38:36,760 --> 00:38:39,190 ‫Effectively, this becomes your clustered index. 634 00:38:39,190 --> 00:38:43,660 ‫And the _id field is the main focus for this.
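The difference between the two-search layout and the clustered layout can be made concrete with a small sketch. Everything here is illustrative: the sorted lists stand in for B+ trees, `btree_search` is a binary search used as a substitute for one tree descent, and the IDs and record numbers are made up.

```python
from bisect import bisect_left


def btree_search(sorted_keys, key):
    """Binary search standing in for one O(log n) tree descent."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return None


# Non-clustered layout: the _id index yields a record ID,
# and a second, hidden index maps record IDs to documents.
ids = ["a17", "b02", "c99"]            # sorted _id values
id_to_record = [3, 1, 2]               # record ID stored with each _id entry
record_ids = [1, 2, 3]                 # sorted keys of the hidden index
record_docs = [{"_id": "b02"}, {"_id": "c99"}, {"_id": "a17"}]

# Clustered layout: the _id index leaves hold the documents themselves.
clustered_docs = [{"_id": "a17"}, {"_id": "b02"}, {"_id": "c99"}]


def find_non_clustered(doc_id):
    """Look up by _id with two searches; return (document, searches done)."""
    i = btree_search(ids, doc_id)
    if i is None:
        return None, 1
    j = btree_search(record_ids, id_to_record[i])
    return record_docs[j], 2


def find_clustered(doc_id):
    """Look up by _id with a single search; return (document, searches done)."""
    i = btree_search(ids, doc_id)
    return (clustered_docs[i] if i is not None else None), 1


print(find_non_clustered("a17"))
print(find_clustered("a17"))
```

Both layouts return the same document, but the first pays for two descents while the clustered one pays for one.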
635 00:38:43,690 --> 00:38:52,000 ‫Now, if you're searching by _id, right, you will immediately search by _id, do a little B-tree lookup, and 636 00:38:52,000 --> 00:38:54,520 ‫then find the document, because the clustered document is right here. 637 00:38:54,520 --> 00:38:57,940 ‫All the leaf pages have the full documents right here. 638 00:38:58,630 --> 00:39:00,790 ‫Pretty neat. 639 00:39:01,360 --> 00:39:05,590 ‫You don't really need to do these two lookups anymore if you're searching for _id, right? 640 00:39:05,920 --> 00:39:07,780 ‫Again, this is an option. 641 00:39:07,780 --> 00:39:09,990 ‫You don't have to do it, right? 642 00:39:10,000 --> 00:39:15,340 ‫So if you still want this design for some reason, and we're going to talk about why in a minute, 643 00:39:15,970 --> 00:39:17,080 ‫you can still have it. 644 00:39:17,080 --> 00:39:19,180 ‫But now in this guy, 645 00:39:20,620 --> 00:39:21,520 ‫you can do this. 646 00:39:21,550 --> 00:39:23,120 ‫What's the problem with this now? 647 00:39:23,380 --> 00:39:24,400 ‫We talked about the good thing. 648 00:39:24,400 --> 00:39:25,180 ‫The good thing: 649 00:39:25,270 --> 00:39:26,130 ‫single search, 650 00:39:26,140 --> 00:39:28,390 ‫if you're using the _id for MongoDB. 651 00:39:28,540 --> 00:39:28,840 ‫Right. 652 00:39:29,530 --> 00:39:33,910 ‫If you're looking up a document by its _id, it's a single, beautiful search. 653 00:39:33,940 --> 00:39:36,610 ‫You immediately find the document, the BSON document. 654 00:39:36,640 --> 00:39:37,240 ‫Right. 655 00:39:37,240 --> 00:39:40,270 ‫And if you're lucky, you're going to find anything next to it. 656 00:39:40,270 --> 00:39:40,420 ‫Right. 657 00:39:40,420 --> 00:39:41,760 ‫It's not just one document. 658 00:39:41,770 --> 00:39:45,160 ‫This is a collection of documents in a single page. 659 00:39:45,370 --> 00:39:48,700 ‫I've got to find out what the page size is in WiredTiger. 660 00:39:49,240 --> 00:39:49,570 ‫Right.
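Opting in to a clustered collection is done when the collection is created. The sketch below only builds the document for the `create` database command, using its `clusteredIndex` option as I understand it from the MongoDB manual. The helper name, the collection name, and the index name are hypothetical, and sending the command through a driver is left out, so check the manual for your server version before relying on it.

```python
def clustered_collection_command(name, index_name=None):
    """Build a `create` command document for a collection clustered on _id."""
    clustered_index = {"key": {"_id": 1}, "unique": True}
    if index_name is not None:
        clustered_index["name"] = index_name  # optional label for the index
    return {"create": name, "clusteredIndex": clustered_index}


command = clustered_collection_command("orders", index_name="orders_by_id")
print(command)
```

A collection created without the `clusteredIndex` option keeps the earlier layout with the hidden record ID index.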
661 00:39:49,690 --> 00:39:50,920 ‫But this is what you get. 662 00:39:50,920 --> 00:39:54,190 ‫You're going to get this and it's going to be cached in memory temporarily. 663 00:39:54,250 --> 00:39:54,550 ‫Right? 664 00:39:54,550 --> 00:40:01,480 ‫So if you're lucky, the next or the previous _id next to it, you're going to get that as well, 665 00:40:01,480 --> 00:40:04,480 ‫right, if the sequence really makes sense here. 666 00:40:04,570 --> 00:40:05,890 ‫The problem, though. 667 00:40:06,540 --> 00:40:11,460 ‫The problem, my friends, is now let's go back to the secondary indexes. 668 00:40:11,760 --> 00:40:16,560 ‫The moment you introduce it, now this becomes identical to MySQL. 669 00:40:16,590 --> 00:40:22,920 ‫MongoDB after 5.3, if you choose a clustered collection, it's almost identical to MySQL now. 670 00:40:22,950 --> 00:40:25,080 ‫It became identical to MySQL. 671 00:40:25,170 --> 00:40:29,670 ‫The _id field, which is the primary key, is the clustered index. 672 00:40:29,700 --> 00:40:34,110 ‫The secondary indexes point to what now? 673 00:40:35,090 --> 00:40:37,520 ‫They have to point to the _id, right? 674 00:40:37,550 --> 00:40:41,330 ‫There is no record ID; you moved where the data is. 675 00:40:41,750 --> 00:40:42,500 ‫Right. 676 00:40:42,770 --> 00:40:44,240 ‫Previously the secondary index, 677 00:40:44,330 --> 00:40:46,640 ‫I should have drawn this, but sorry, I did not. 678 00:40:46,670 --> 00:40:53,540 ‫The secondary indexes used to point to this thing, the hidden key, which is a very tiny value, the record ID, 64 679 00:40:53,540 --> 00:40:53,720 ‫bits. 680 00:40:53,750 --> 00:40:54,350 ‫That's it. 681 00:40:55,190 --> 00:40:57,170 ‫You know how large the _id field is? 682 00:40:58,540 --> 00:41:01,540 ‫And did I actually mention that it's called the ObjectId? 683 00:41:01,810 --> 00:41:03,520 ‫I actually mentioned that; someone highlighted it. 684 00:41:03,850 --> 00:41:05,590 ‫12 bytes. 685 00:41:05,620 --> 00:41:07,780 ‫Dude, this is bytes, not bits.
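The size comparison between an 8-byte record ID and a 12-byte ObjectId is easy to put into numbers. The sketch below counts only the bytes each secondary index entry spends pointing back at the document. The helper name and the 36-byte example (a UUID stored as text, standing in for a larger user-chosen key) are assumptions for illustration, and it ignores per-entry overhead and any key compression the engine applies.

```python
RECORD_ID_BYTES = 8    # 64-bit hidden record ID
OBJECT_ID_BYTES = 12   # default ObjectId value
UUID_TEXT_BYTES = 36   # assumed: a UUID stored as a 36-character string


def pointer_bytes(entries, pointer_size, index_count=1):
    """Bytes spent on document pointers across all secondary indexes."""
    return entries * pointer_size * index_count


entries = 1_000_000
for label, size in (
    ("record ID", RECORD_ID_BYTES),
    ("ObjectId", OBJECT_ID_BYTES),
    ("UUID text", UUID_TEXT_BYTES),
):
    print(label, pointer_bytes(entries, size, index_count=3))
```

The pointer is repeated in every entry of every secondary index, so a few extra bytes per key multiply quickly.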
686 00:41:07,810 --> 00:41:14,590 ‫This thing is 12 bytes by default, and the first four bytes is the timestamp. 687 00:41:14,590 --> 00:41:15,760 ‫The next three bytes, 688 00:41:15,760 --> 00:41:21,420 ‫I don't know exactly what they are, but this is because Mongo decided to scale first, right? 689 00:41:21,430 --> 00:41:25,510 ‫So they wanted their IDs to be unique across machines. 690 00:41:25,510 --> 00:41:32,140 ‫So the middle bytes are, historically, a combination of the machine identifier and the process ID. 691 00:41:32,140 --> 00:41:36,640 ‫And so the ID is truly unique across machines. 692 00:41:36,640 --> 00:41:37,660 ‫So that's why it's so big. 693 00:41:37,690 --> 00:41:39,420 ‫12 bytes is so large. 694 00:41:39,430 --> 00:41:39,790 ‫Yeah. 695 00:41:39,790 --> 00:41:42,610 ‫So it's 12 bytes compared to eight bytes. 696 00:41:42,610 --> 00:41:42,820 ‫Right. 697 00:41:42,820 --> 00:41:49,270 ‫Because 64 bits is eight bytes, and 12 bytes is 12 bytes. 698 00:41:49,270 --> 00:41:49,690 ‫Right. 699 00:41:49,720 --> 00:41:51,520 ‫So four bytes extra, you might say. 700 00:41:51,520 --> 00:41:52,210 ‫You'll say, who cares? 701 00:41:52,210 --> 00:41:52,870 ‫Four bytes extra. 702 00:41:52,870 --> 00:41:54,010 ‫But here's the thing. 703 00:41:54,040 --> 00:41:54,550 ‫Here's the thing. 704 00:41:54,550 --> 00:41:58,120 ‫I didn't know MongoDB actually allows this, and 705 00:41:58,120 --> 00:42:02,050 ‫those who use MongoDB more might know: 706 00:42:02,920 --> 00:42:05,260 ‫you can actually set anything in the _id field. 707 00:42:05,260 --> 00:42:07,510 ‫So it's a user-controlled field. 708 00:42:08,140 --> 00:42:11,260 ‫If you don't set an _id, it's going to generate one for you. 709 00:42:11,260 --> 00:42:19,390 ‫But if you do set it, you can make it very large; people can have crazy _id values, and guess 710 00:42:19,390 --> 00:42:19,900 ‫what?
711 00:42:20,500 --> 00:42:26,680 ‫The secondary index now has to point to the _id, because that's where the data is. 712 00:42:27,530 --> 00:42:35,390 ‫And that's where all the problems of MySQL arise, where if the _id is a poorly chosen value, if the 713 00:42:35,390 --> 00:42:43,670 ‫primary key is poorly chosen, like a GUID, right, and again, there's a lot of discussion, of course, about 714 00:42:43,670 --> 00:42:48,200 ‫having a UUID as a primary key, but we know it's very large. 715 00:42:48,230 --> 00:42:54,130 ‫If you use it as a primary key, then those primary keys are stored in the secondary indexes as 716 00:42:54,140 --> 00:42:57,170 ‫values, and that's what blows everything up. 717 00:42:57,800 --> 00:43:01,070 ‫So now the secondary indexes just blow up. 718 00:43:01,780 --> 00:43:02,380 ‫Right. 719 00:43:02,380 --> 00:43:07,360 ‫And that's basically the evolution of MongoDB, you guys, right? 720 00:43:08,990 --> 00:43:09,860 ‫As a summary: 721 00:43:09,980 --> 00:43:11,680 ‫we started with MMAPv1. 722 00:43:11,690 --> 00:43:12,150 ‫Right. 723 00:43:12,170 --> 00:43:13,520 ‫Moved to WiredTiger. 724 00:43:13,550 --> 00:43:18,620 ‫Gained some new features, but introduced new problems, and since 5.3 and 6.0, 725 00:43:18,650 --> 00:43:20,960 ‫you can actually do clustered indexes. 726 00:43:21,410 --> 00:43:22,220 ‫I'm going to see you in the next one. 727 00:43:22,220 --> 00:43:23,450 ‫I hope you enjoyed this video. 728 00:43:23,480 --> 00:43:24,050 ‫Goodbye.