1
00:00:00,290 --> 00:00:05,870
‫MongoDB is a document-based NoSQL database.

2
00:00:07,730 --> 00:00:15,800
‫It became very, very popular for its schemaless way of storing documents.

3
00:00:16,110 --> 00:00:21,780
‫You know, because the friction when it comes to writing code is gone

4
00:00:21,800 --> 00:00:24,800
‫compared to SQL-based databases.

5
00:00:24,920 --> 00:00:33,230
‫But I get this question a lot, and I thought this video would be the perfect segue to actually answer

6
00:00:33,230 --> 00:00:33,920
‫that question.

7
00:00:33,920 --> 00:00:37,000
‫What is really the difference between NoSQL and SQL?

8
00:00:37,010 --> 00:00:39,410
‫So I'm going to address that in this video.

9
00:00:39,410 --> 00:00:49,130
‫But the main purpose of this video is actually going through the evolution of MongoDB's internal architecture.

10
00:00:49,130 --> 00:00:56,210
‫So this is a topic that people very rarely discuss because we're going into the bowels of the database,

11
00:00:56,210 --> 00:01:01,860
‫not the front end, in the sense of how you interact with it and store data, right?

12
00:01:01,880 --> 00:01:05,360
‫So we're talking about the actual architecture of the internals, right?

13
00:01:05,870 --> 00:01:09,720
‫There's been an evolution up until version 5.3,

14
00:01:09,750 --> 00:01:13,760
‫where a very interesting feature was added.

15
00:01:13,770 --> 00:01:15,540
‫It's called clustered collections.

16
00:01:15,540 --> 00:01:18,140
‫So I'll go through the evolution of this.

17
00:01:18,140 --> 00:01:27,270
‫So I'll discuss the difference between SQL and NoSQL in a very deep way, and we'll discuss

18
00:01:28,440 --> 00:01:37,230
‫how the first version of MongoDB started with its storage engine, MMAPv1, then MongoDB acquiring

19
00:01:37,230 --> 00:01:44,280
‫WiredTiger back in 2014, I think, and then moving all the way to the recent change, which is clustered

20
00:01:44,280 --> 00:01:45,240
‫collections.

21
00:01:45,240 --> 00:01:48,630
‫And this is all going to make sense by the end of the video, hopefully.

22
00:01:48,840 --> 00:01:50,070
‫How about we get started?

23
00:01:50,070 --> 00:01:57,030
‫All right, so I'm going to use my Medium article, at least the images in my Medium article, to illustrate

24
00:01:57,030 --> 00:01:57,420
‫this.

25
00:01:57,510 --> 00:02:04,140
‫Think of it as the slideshow, but go ahead and make sure to follow me on Medium.

26
00:02:04,290 --> 00:02:06,900
‫I started posting a lot of content there,

27
00:02:06,900 --> 00:02:11,610
‫if you like the written medium more than actual videos.

28
00:02:11,610 --> 00:02:15,120
‫But the first thing we're going to discuss is like the database internals.

29
00:02:15,120 --> 00:02:24,900
‫If you look really at any database, any database almost always will have two pieces.

30
00:02:24,900 --> 00:02:31,890
‫And the piece that we actually deal with and interact with the most is the front end piece of the

31
00:02:31,890 --> 00:02:34,110
‫database, which is the API.

32
00:02:34,650 --> 00:02:42,120
‫You see, the most popular database API to actually communicate with the database, to tell it what to

33
00:02:42,120 --> 00:02:47,040
‫fetch or to ask it to store something,

34
00:02:47,070 --> 00:02:54,210
‫is the SQL language, which stands for Structured Query Language, right?

35
00:02:54,210 --> 00:02:58,260
‫And that is the API that we know and love.

36
00:02:58,260 --> 00:03:03,870
‫And a different API could be like Redis or Mongo, right?

37
00:03:03,870 --> 00:03:08,310
‫So hey, get this document and store this document.

38
00:03:08,310 --> 00:03:14,940
‫There is no structured query language, there is no selecting tables and fields, right?

39
00:03:14,970 --> 00:03:17,280
‫It's just its own different API.

40
00:03:17,280 --> 00:03:21,290
‫So the API can actually change based on the database.

41
00:03:21,300 --> 00:03:23,940
‫The second portion is actually the data format.

42
00:03:23,940 --> 00:03:29,670
‫When I ask you to get something or I want to store something, what am I giving you and what am I taking

43
00:03:29,670 --> 00:03:30,780
‫back from you?

44
00:03:30,780 --> 00:03:34,920
‫And this is where really a database can shine.

45
00:03:35,070 --> 00:03:40,050
‫It's the core of any database system, the data format.

46
00:03:40,170 --> 00:03:48,840
‫So for the longest time, databases have always been tables and rows and columns.

47
00:03:49,440 --> 00:03:49,950
‫Right.

48
00:03:50,670 --> 00:03:59,390
‫And to interact with these rows and columns, you use the SQL language, right,

49
00:03:59,430 --> 00:04:00,720
‫to query it.

50
00:04:00,720 --> 00:04:06,480
‫So when people designed the database back in the '70s or '60s even, right, they thought

51
00:04:06,480 --> 00:04:11,160
‫about it and said it's always going to be tables and it's always going to be rows and always going

52
00:04:11,160 --> 00:04:14,220
‫to be columns, and then the application can build on top of it, right?

53
00:04:14,520 --> 00:04:17,010
‫We built it bottom up, if you will.

54
00:04:17,730 --> 00:04:19,650
‫And then, uh.

55
00:04:20,640 --> 00:04:22,850
‫So that's one data format, right?

56
00:04:22,860 --> 00:04:25,290
‫But then people challenged this.

57
00:04:25,320 --> 00:04:32,280
‫People came in the 2000s era and said, wait a minute, why do I have to be

58
00:04:33,340 --> 00:04:35,650
‫really fixed to these tables?

59
00:04:35,650 --> 00:04:38,650
‫My application has nothing to do with tables.

60
00:04:39,190 --> 00:04:43,750
‫As the web evolved, right,

61
00:04:43,750 --> 00:04:48,580
‫JSON and documents came in. Really, I want to deal with documents.

62
00:04:48,580 --> 00:04:50,590
‫I don't even have a schema per se.

63
00:04:50,620 --> 00:04:52,900
‫I don't have tables with a specific schema.

64
00:04:52,900 --> 00:04:54,210
‫I want to be flexible.

65
00:04:54,220 --> 00:04:56,830
‫Why are you forcing me to do tables?

66
00:04:56,830 --> 00:05:02,830
‫And that's the idea of where documents came in. Later, graphs came in.

67
00:05:02,830 --> 00:05:07,240
‫Later, column-based storage came in, right,

68
00:05:07,240 --> 00:05:14,850
‫instead of row storage. With all of this, the database really became this two-part thing:

69
00:05:14,860 --> 00:05:19,180
‫the front end and the storage engine, and the most important part is the storage engine here,

70
00:05:19,750 --> 00:05:20,380
‫you see.

71
00:05:20,650 --> 00:05:25,750
‫So, as we discussed, there's the data format, how I'm returning these things to

72
00:05:25,750 --> 00:05:29,080
‫the user, and by the user here, I really mean the application.

73
00:05:29,080 --> 00:05:35,270
‫And again, by front end here, I'm talking about the actual database front end.

74
00:05:35,270 --> 00:05:38,450
‫It's a portion of the database, right?

75
00:05:38,930 --> 00:05:49,760
‫And then the second portion, which is the most important part, is how am I storing the data on disk,

76
00:05:49,970 --> 00:05:50,420
‫Right?

77
00:05:50,420 --> 00:05:57,950
‫And the storage engine doesn't really care what you're storing in it. To the storage engine,

78
00:05:57,950 --> 00:06:03,650
‫you have something called a page, and in the page you have bytes.

79
00:06:04,250 --> 00:06:06,170
‫That's all it cares about.

80
00:06:06,170 --> 00:06:07,370
‫And the

81
00:06:08,140 --> 00:06:08,620
‫front

82
00:06:08,620 --> 00:06:16,780
‫end part of the database will say, hey, by the way, in this page there is a bunch of rows, right?

83
00:06:16,960 --> 00:06:23,490
‫I have a row store right where I have a table and I put the row and all the columns.

84
00:06:23,500 --> 00:06:31,450
‫And then right after the final column of the first row, I put the second row, and you can see that,

85
00:06:31,450 --> 00:06:38,530
‫if you think of it like an actual page, a rectangle, it's just the first row, followed

86
00:06:38,530 --> 00:06:44,140
‫by the second row, followed by the third row and all its columns, fourth row and all its columns, fifth

87
00:06:44,140 --> 00:06:46,380
‫row and all its columns, until the page fills.
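
The row-after-row layout described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 64-byte `PAGE_SIZE`, the `|` separator and the `pack_rows` helper are made up for the example and are not how any real engine encodes rows.

```python
PAGE_SIZE = 64  # tiny on purpose; real engines use sizes such as 8 KB or 16 KB


def pack_rows(rows, page_size=PAGE_SIZE):
    """Pack encoded rows back to back into fixed-size pages."""
    pages, current = [], b""
    for row in rows:
        # Hypothetical encoding: column values joined by a separator.
        encoded = "|".join(str(col) for col in row).encode() + b"\n"
        if len(encoded) > page_size:
            raise ValueError("row does not fit in a page")
        if len(current) + len(encoded) > page_size:
            # The page is full: pad it and start a new one.
            pages.append(current.ljust(page_size, b"\x00"))
            current = b""
        current += encoded
    if current:
        pages.append(current.ljust(page_size, b"\x00"))
    return pages


rows = [(i, f"user{i}", 1000 + i) for i in range(10)]
pages = pack_rows(rows)
```

Each encoded row here is 13 bytes, so four rows fit in a page and the fifth one starts the next page.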

88
00:06:46,390 --> 00:06:50,720
‫That's why the storage engine has a property called the page size. In InnoDB,

89
00:06:50,860 --> 00:06:51,430
‫in MySQL,

90
00:06:51,430 --> 00:06:55,450
‫that's 16 KB. In MongoDB,

91
00:06:55,450 --> 00:07:03,190
‫I don't remember for MongoDB, but Postgres is 8 KB, and you can change this page size. So in Mongo,

92
00:07:04,060 --> 00:07:06,610
‫What we're storing is just a document.

93
00:07:06,730 --> 00:07:16,270
‫It's a JSON document that the front end receives and turns into a bunch of bytes, and then we flush

94
00:07:16,270 --> 00:07:22,930
‫it to a page, and it's the same thing: the document, the first key and the value, and then

95
00:07:22,930 --> 00:07:25,660
‫we just write it to the storage engine.

96
00:07:26,710 --> 00:07:33,040
‫So if you really, really think about it, it's always about the storage, what you're storing, and how

97
00:07:33,040 --> 00:07:35,560
‫the front end is actually extracting this information.

98
00:07:35,560 --> 00:07:40,060
‫So document, graph-based database, right?

99
00:07:40,060 --> 00:07:44,230
‫When I say, hey, this is a graph-based database, the storage engine doesn't care.

100
00:07:44,590 --> 00:07:52,210
‫It's just how the front end part of the database actually organizes the bytes such that when I store

101
00:07:52,210 --> 00:07:52,720
‫them

102
00:07:53,530 --> 00:08:02,250
‫and later read that page, I get as much efficiency in my read as possible.

103
00:08:02,260 --> 00:08:06,110
‫So another piece of the storage engine is indexes, right?

104
00:08:06,130 --> 00:08:09,850
‫Because now we're storing things in a bunch of pages, right?

105
00:08:09,880 --> 00:08:13,120
‫How are we storing them is also another story, right?

106
00:08:13,420 --> 00:08:16,300
‫Are they just a bunch of files?

107
00:08:16,330 --> 00:08:20,920
‫Does each file represent a table or a collection in MongoDB?

108
00:08:21,010 --> 00:08:21,550
‫Right.

109
00:08:21,580 --> 00:08:26,040
‫Or am I storing the actual data in the indexes themselves?

110
00:08:26,050 --> 00:08:27,310
‫We're going to talk about all that.

111
00:08:27,310 --> 00:08:27,700
‫Right.

112
00:08:27,730 --> 00:08:33,630
‫Indexes will help fast track what you're looking for, Right?

113
00:08:33,640 --> 00:08:35,790
‫That's also part of the storage engine.

114
00:08:35,800 --> 00:08:38,560
‫The type of indexes you're creating,

115
00:08:38,670 --> 00:08:40,360
‫is it a B-tree?

116
00:08:40,630 --> 00:08:49,330
‫It all really helps pinpoint exactly what page you are trying to read.
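
As a rough picture of an index pointing at pages, here is a small Python sketch. It assumes a sorted list searched with `bisect` as a stand-in for the leaf level of a B-tree; the keys, page numbers and the `find_page` name are invented for the example.

```python
import bisect

# Sorted keys with the page number that holds each one.
# A real B-tree has inner nodes too; this only mimics the sorted leaf level.
index_keys = [3, 8, 15, 23, 42]
index_pages = [0, 0, 1, 2, 2]


def find_page(key):
    """Return the page number holding `key`, or None if it is not indexed."""
    pos = bisect.bisect_left(index_keys, key)  # binary search, O(log n)
    if pos < len(index_keys) and index_keys[pos] == key:
        return index_pages[pos]
    return None
```

With this, `find_page(15)` answers "page 1" without scanning every page, which is the whole point of the index.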

117
00:08:49,450 --> 00:08:49,630
‫Okay.

118
00:08:50,260 --> 00:08:58,130
‫So if you have a table or a MongoDB collection, a MongoDB collection is just a bunch of

119
00:08:58,130 --> 00:09:00,050
‫JSON documents, right?

120
00:09:00,980 --> 00:09:03,710
‫The storage engine can decide, you know what, this document is really large.

121
00:09:03,710 --> 00:09:05,390
‫I'm going to decide to compress it.

122
00:09:05,540 --> 00:09:08,180
‫So that's all a property of the storage engine.

123
00:09:08,180 --> 00:09:12,050
‫The front end doesn't even know that this document is compressed.

124
00:09:12,080 --> 00:09:17,090
‫All it says is, hey, just give me that document, and the storage engine will just decompress it and give it

125
00:09:17,090 --> 00:09:17,540
‫back,

126
00:09:17,750 --> 00:09:21,200
‫give it back to the front end, and the front end will return it.
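
To make the "front end doesn't know it's compressed" idea concrete, here is a minimal sketch. The `CompressingEngine` class is hypothetical, and `zlib` is used only because it ships with Python, not because it is what MongoDB uses.

```python
import json
import zlib


class CompressingEngine:
    """Hypothetical storage engine that compresses values transparently."""

    def __init__(self):
        self._store = {}  # key -> compressed bytes

    def put(self, key, raw: bytes):
        self._store[key] = zlib.compress(raw)

    def get(self, key) -> bytes:
        # The caller always gets the original bytes back.
        return zlib.decompress(self._store[key])


engine = CompressingEngine()
doc = {"name": "Ada", "bio": "x" * 1000}
engine.put(1, json.dumps(doc).encode())

# The "front end" side only ever sees plain bytes that it can parse.
restored = json.loads(engine.get(1))
```

The caller hands over bytes and gets the same bytes back; whether they were compressed in between is invisible to it.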

127
00:09:21,220 --> 00:09:21,550
‫Right?

128
00:09:21,560 --> 00:09:27,710
‫So this is like a clear separation between these two and they can share tasks as well.

129
00:09:27,740 --> 00:09:34,340
‫Of course, there are the files, like where is the actual full data, you know, because indexes only have part

130
00:09:34,340 --> 00:09:35,180
‫of the data, right?

131
00:09:35,340 --> 00:09:44,210
‫Like I'm indexing on the first name, well, first name is a bad index, so, I don't know, salary

132
00:09:44,240 --> 00:09:45,430
‫maybe, right?

133
00:09:45,470 --> 00:09:51,950
‫The salary index is another data structure we create, and then we traverse it to get back to exactly that

134
00:09:51,950 --> 00:09:52,340
‫data,

135
00:09:52,340 --> 00:09:59,150
‫right, to that data file, and then we pull the entire document or row and return it.

136
00:09:59,450 --> 00:10:03,620
‫And you can you can really be creative here.

137
00:10:03,620 --> 00:10:04,970
‫And that's what people did, right?

138
00:10:04,970 --> 00:10:08,300
‫The storage engine is also responsible for transactions.

139
00:10:08,570 --> 00:10:14,120
‫You know, when I'm changing this and this and this and this and this, I want it done as one unit

140
00:10:14,150 --> 00:10:20,990
‫of work, such that if there is a failure, please roll back all these changes. Don't persist anything

141
00:10:21,940 --> 00:10:23,470
‫halfway through.
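
The "one unit of work" idea can be sketched with a toy key-value store. This is an assumption-heavy illustration: `TinyStore` is made up, and real engines get atomicity through logging and locking rather than by copying the whole data set.

```python
class TinyStore:
    """Toy key-value store with all-or-nothing transactions."""

    def __init__(self):
        self.data = {}

    def transaction(self, changes):
        staged = dict(self.data)  # work on a private copy
        for key, value in changes:
            if value is None:
                # Simulated failure halfway through the unit of work.
                raise ValueError(f"invalid value for {key!r}")
            staged[key] = value
        self.data = staged  # commit: all changes become visible at once


store = TinyStore()
store.transaction([("a", 1), ("b", 2)])
try:
    store.transaction([("a", 100), ("c", None)])
except ValueError:
    pass  # the failed transaction left nothing behind
```

After the failed second transaction, `store.data` still holds only the first transaction's values, so nothing was persisted halfway.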

142
00:10:23,480 --> 00:10:25,760
‫I want to be consistent.

143
00:10:25,760 --> 00:10:29,630
‫I want to be atomic and I want to be isolated

144
00:10:29,630 --> 00:10:33,200
‫from my concurrent transactions. All of these are properties of the storage engine.

145
00:10:33,200 --> 00:10:33,800
‫Really.

146
00:10:34,040 --> 00:10:34,450
‫Right?

147
00:10:34,460 --> 00:10:35,690
‫I want durability.

148
00:10:35,690 --> 00:10:36,350
‫I want that.

149
00:10:36,350 --> 00:10:40,670
‫If I say commit and you told me, the front end,

150
00:10:40,670 --> 00:10:40,850
‫Right.

151
00:10:40,940 --> 00:10:42,030
‫That's another thing, right?

152
00:10:42,030 --> 00:10:46,400
‫The transaction will say, Hey, I want you to commit and the storage engine will say, yes, you committed

153
00:10:46,400 --> 00:10:47,240
‫successfully.

154
00:10:47,240 --> 00:10:54,860
‫If I get the success and return it to the user and then later you crash, that data better be there when

155
00:10:54,860 --> 00:10:58,940
‫I come back because you told me you committed successfully.

156
00:10:58,940 --> 00:10:59,300
‫Right.

157
00:10:59,360 --> 00:11:00,470
‫All of these things.

158
00:11:00,470 --> 00:11:05,990
‫Then there's the write-ahead log, or journaling, as MongoDB calls it, right?

159
00:11:05,990 --> 00:11:12,350
‫As I'm writing things, usually when you write things, it needs to go to the data file.

160
00:11:12,350 --> 00:11:17,660
‫That's where the main storage lives.

161
00:11:17,660 --> 00:11:25,710
‫But then writing to the data file is really expensive because you're writing in pages, these massive

162
00:11:25,710 --> 00:11:28,710
‫pages, 8 KB and 16 KB, right?

163
00:11:28,740 --> 00:11:34,770
‫Imagine you're touching one column, one property: you need to write a whole page.

164
00:11:34,770 --> 00:11:37,440
‫There is no writing one byte in databases.

165
00:11:37,470 --> 00:11:38,400
‫No, no, no, sir.

166
00:11:38,430 --> 00:11:44,370
‫We don't go to disk and say, hey, just change that tiny byte or just change those tiny three bytes or

167
00:11:44,370 --> 00:11:45,780
‫just change that one kilobyte.

168
00:11:45,810 --> 00:11:47,310
‫Nope, you can't do that.

169
00:11:47,640 --> 00:11:50,550
‫That's not how SSDs and hard drives work.

170
00:11:50,550 --> 00:11:54,270
‫You have to write in chunks and big chunks for efficiency.

171
00:11:54,310 --> 00:12:00,570
‫You do an I/O on a hard drive, you're going to write a whole sector; you do it on an SSD, you write a whole page or a block

172
00:12:00,570 --> 00:12:03,990
‫or an erasable unit, based on the new technology of SSDs.

173
00:12:03,990 --> 00:12:04,500
‫Right.

174
00:12:07,090 --> 00:12:14,530
‫That's where shingled hard drives come into the picture, where they increase the write portions

175
00:12:14,530 --> 00:12:15,520
‫and stuff like that.

176
00:12:15,850 --> 00:12:22,390
‫We don't have byte addressability on disk, unfortunately, as of today, 2022.

177
00:12:22,420 --> 00:12:26,050
‫We have byte addressability in RAM.

178
00:12:26,080 --> 00:12:28,680
‫You can definitely write a single byte in RAM.

179
00:12:28,690 --> 00:12:29,260
‫Definitely.

180
00:12:29,260 --> 00:12:30,220
‫That's fine.

181
00:12:30,310 --> 00:12:30,660
‫Right.

182
00:12:30,730 --> 00:12:33,190
‫But on disk persisted.

183
00:12:33,220 --> 00:12:34,750
‫No, you gotta write in pages.

184
00:12:34,750 --> 00:12:35,220
‫Right.

185
00:12:35,230 --> 00:12:36,970
‫And that's what we have today.

186
00:12:37,960 --> 00:12:40,750
‫And because of that cost,

187
00:12:40,750 --> 00:12:46,300
‫right, of writing whole data files, what the storage engine does is, as you make changes, all these changes

188
00:12:46,300 --> 00:12:46,810
‫go to RAM.

189
00:12:46,810 --> 00:12:48,110
‫Bup bup bup bup bup bup bup bup bup.

190
00:12:48,130 --> 00:12:49,750
‫We call them dirty pages.

191
00:12:49,780 --> 00:12:56,400
‫The moment you touch a page where you have a row or collection or document, we just mark it as dirty.

192
00:12:56,410 --> 00:12:58,450
‫And again, the storage engine doesn't know it's a document.

193
00:12:58,450 --> 00:12:59,680
‫It just knows it's bytes.

194
00:12:59,710 --> 00:13:03,880
‫It knows it's a page with a bunch of bytes that you touched.

195
00:13:03,910 --> 00:13:04,500
‫Right?

196
00:13:04,510 --> 00:13:07,460
‫And you write to memory, so it's fast.

197
00:13:07,460 --> 00:13:13,390
‫And then later the storage engine will collect as many changes as possible and then flush them at once, right,

198
00:13:13,430 --> 00:13:14,690
‫to the data files.

199
00:13:14,930 --> 00:13:17,600
‫All of this is the job of the storage engine.
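
A toy buffer pool shows why collecting changes in memory pays off. The `BufferPool` class, its fields and the dictionary standing in for the data file are all assumptions made for this sketch.

```python
class BufferPool:
    """Toy buffer pool: writes touch memory and mark pages dirty."""

    def __init__(self, disk):
        self.disk = disk      # page number -> bytes, stands in for the data file
        self.cache = {}       # pages currently held in memory
        self.dirty = set()    # page numbers changed since the last flush
        self.disk_writes = 0  # counts simulated page writes to disk

    def write(self, page_no, data: bytes):
        self.cache[page_no] = data
        self.dirty.add(page_no)  # no disk I/O happens here

    def flush(self):
        for page_no in sorted(self.dirty):
            self.disk[page_no] = self.cache[page_no]
            self.disk_writes += 1
        self.dirty.clear()


disk = {}
pool = BufferPool(disk)
for i in range(100):
    pool.write(7, f"version {i}".encode())  # 100 updates to the same page
pool.flush()
```

A hundred updates to the same page cost a single simulated disk write, because only the final state of the dirty page is flushed.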

200
00:13:18,440 --> 00:13:23,270
‫I still didn't get to the difference between NoSQL and SQL, but we'll get there, right?

201
00:13:23,570 --> 00:13:29,390
‫You're clearly going to see it. I think by this time, if you're still watching or listening,

202
00:13:29,420 --> 00:13:32,060
‫you're probably going to know the difference, right?

203
00:13:33,470 --> 00:13:37,100
‫So we're not writing immediately.

204
00:13:37,310 --> 00:13:39,860
‫We're collecting these changes. You might say, Hussein, but wait a minute.

205
00:13:39,860 --> 00:13:40,790
‫You're writing to RAM.

206
00:13:40,790 --> 00:13:42,290
‫If I commit, you're writing to RAM.

207
00:13:42,290 --> 00:13:43,360
‫What if I crash?

208
00:13:43,370 --> 00:13:44,690
‫That's the problem, right?

209
00:13:44,780 --> 00:13:49,880
‫So that's why, to recover from the crash, we create this thing.

210
00:13:49,880 --> 00:13:52,520
‫This thing is called the WAL, the write-ahead log.

211
00:13:52,520 --> 00:14:01,910
‫So as we write to RAM, to these data pages, we also write to disk tiny things that say, hey, here's

212
00:14:01,910 --> 00:14:03,950
‫a journal: on this date,

213
00:14:03,950 --> 00:14:14,780
‫dear diary, on this date I updated the salary from 10,000 to 10,000 and 50 cents.

214
00:14:15,710 --> 00:14:16,730
‫It's a bad year.

215
00:14:16,760 --> 00:14:18,200
‫What do you want me to say?

216
00:14:18,620 --> 00:14:18,920
‫Right.

217
00:14:18,920 --> 00:14:19,520
‫So.

218
00:14:19,520 --> 00:14:21,020
‫And this on this date?

219
00:14:21,020 --> 00:14:22,520
‫I wrote this on this date.

220
00:14:22,520 --> 00:14:24,560
‫You just write the changes.

221
00:14:24,800 --> 00:14:30,320
‫So then in case of a crash, we're going to lose the dirty pages in memory.

222
00:14:30,320 --> 00:14:36,650
‫But when I come back, I have the whole WAL and I have the last checkpoint in the data file.

223
00:14:36,650 --> 00:14:41,870
‫So I restore that and I redo the changes.

224
00:14:41,870 --> 00:14:43,610
‫I apply the WAL

225
00:14:44,250 --> 00:14:46,470
‫to the data files.

226
00:14:46,470 --> 00:14:51,580
‫And now in memory, I have the final representation as it was when I crashed.
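
The recovery flow just described, last checkpoint plus replaying the log, can be sketched as follows. Everything here is a simplification: a Python list stands in for the append-only log file on disk, the JSON record shape and the `apply` helper are invented, and real logs also deal with things like partial writes and truncation.

```python
import json


def apply(state, record):
    """Apply one logged change to an in-memory state."""
    state[record["key"]] = record["value"]


checkpoint = {"salary": 10000}  # last state flushed to the data file
wal = []                        # stands in for an append-only file on disk

# Normal operation: log the change first, then update memory.
memory = dict(checkpoint)
changes = [
    {"key": "salary", "value": 10000.5},
    {"key": "title", "value": "DBA"},
]
for change in changes:
    wal.append(json.dumps(change))  # small sequential write
    apply(memory, change)

# Crash: the in-memory pages are gone. Recover from checkpoint plus log.
recovered = dict(checkpoint)
for line in wal:
    apply(recovered, json.loads(line))
```

After the replay, `recovered` matches what was in memory before the crash, even though the data file itself was never rewritten.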

227
00:14:51,600 --> 00:14:52,980
‫Brilliant design.

228
00:14:53,190 --> 00:14:58,320
‫Anyway, I'm going all over the place, but the storage engine and the front end, those are the main pieces.

229
00:14:58,320 --> 00:14:59,940
‫So we talked about what the front end is.

230
00:14:59,970 --> 00:15:01,790
‫We talked about what the storage engine is.

231
00:15:01,800 --> 00:15:07,410
‫The difference between SQL and NoSQL mainly

232
00:15:07,410 --> 00:15:09,060
‫is this puppy, the front end.

233
00:15:10,130 --> 00:15:17,870
‫The NoSQL guys came in and said, are you really restricting me with these tables and columns and

234
00:15:17,900 --> 00:15:18,390
‫SQL?

235
00:15:18,410 --> 00:15:19,130
‫I hate SQL.

236
00:15:19,160 --> 00:15:20,560
‫I don't like SQL at all.

237
00:15:20,570 --> 00:15:21,180
‫Right?

238
00:15:21,200 --> 00:15:22,190
‫I don't like it.

239
00:15:23,270 --> 00:15:26,000
‫And it just didn't fit our application.

240
00:15:26,000 --> 00:15:29,570
‫I want to just give you a document; just store it.

241
00:15:29,570 --> 00:15:31,880
‫And that's where they redesigned.

242
00:15:32,120 --> 00:15:35,600
‫I think someone came in one day and shouted, like, NoSQL.

243
00:15:35,750 --> 00:15:36,980
‫They started a movement.

244
00:15:36,980 --> 00:15:40,490
‫They said NoSQL, NoSQL, NoSQL, no more SQL.

245
00:15:40,610 --> 00:15:45,290
‫And they created, technically, their own storage engine.

246
00:15:45,740 --> 00:15:50,080
‫And, I believe, if I'm not mistaken, they didn't even have indexes.

247
00:15:50,090 --> 00:15:55,850
‫So, see, because when databases like Oracle and SQL Server were created, they were just so

248
00:15:55,880 --> 00:16:05,570
‫wired to have tables and rows, and so everything was glued together and sticky; you can't

249
00:16:05,570 --> 00:16:06,020
‫change it.

250
00:16:06,020 --> 00:16:08,180
‫So they created everything from scratch.

251
00:16:08,210 --> 00:16:09,260
‫A storage engine.

252
00:16:09,620 --> 00:16:11,670
‫I'm storing just documents, for example.

253
00:16:11,670 --> 00:16:13,050
‫That's the first use case.

254
00:16:13,050 --> 00:16:15,780
‫I have a document, a JSON document, just store it.

255
00:16:15,780 --> 00:16:17,130
‫It's just a bunch of bytes.

256
00:16:17,160 --> 00:16:22,680
‫Later they added transactions, later they added the WAL, later they added the indexes, they slowly added

257
00:16:22,740 --> 00:16:25,440
‫it all, and then the API was just get and set.

258
00:16:26,730 --> 00:16:31,300
‫The user will give us a document and we're going to store it in the storage engine.

259
00:16:31,320 --> 00:16:31,920
‫That's it.

260
00:16:31,950 --> 00:16:33,180
‫It's just a bunch of bytes.

261
00:16:33,210 --> 00:16:40,650
‫We're going to convert the JSON into BSON, binary JSON, and then we persist it.

262
00:16:40,680 --> 00:16:42,290
‫That's the only difference.

263
00:16:42,300 --> 00:16:44,550
‫That's the only difference between SQL

264
00:16:44,550 --> 00:16:45,710
‫and NoSQL, right?

265
00:16:46,720 --> 00:16:52,180
‫The data format, which we changed from tables and rows into documents, and then the API, which is

266
00:16:52,180 --> 00:16:58,150
‫the get and set instead of just SQL, and a clear separation.
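
A small sketch of that separation, with a get/set front end sitting on top of an engine that only sees bytes. Both classes are hypothetical, and plain JSON text is used as the on-disk encoding purely to keep the example dependency-free; MongoDB itself uses BSON.

```python
import json


class ByteStore:
    """Storage engine stand-in: it only ever sees keys and raw bytes."""

    def __init__(self):
        self._blobs = {}

    def write(self, key: str, blob: bytes):
        self._blobs[key] = blob

    def read(self, key: str) -> bytes:
        return self._blobs[key]


class DocumentFrontEnd:
    """Front end stand-in: turns documents into bytes and back."""

    def __init__(self, engine: ByteStore):
        self.engine = engine

    def set(self, doc_id: str, doc: dict):
        # JSON text here for simplicity; this is not BSON.
        self.engine.write(doc_id, json.dumps(doc).encode("utf-8"))

    def get(self, doc_id: str) -> dict:
        return json.loads(self.engine.read(doc_id).decode("utf-8"))


db = DocumentFrontEnd(ByteStore())
db.set("u1", {"name": "Ada", "tags": ["admin", "dev"]})
```

Swapping `DocumentFrontEnd` for a front end that speaks rows or graph nodes would not require touching `ByteStore` at all, which is the separation being described.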

267
00:16:58,150 --> 00:17:06,550
‫And then of course there are out-of-the-box storage engines such as LevelDB or MyRocks, right, RocksDB,

268
00:17:06,580 --> 00:17:11,650
‫sorry. RocksDB is a very popular storage engine that does exactly that: takes a bunch of bytes, doesn't

269
00:17:11,650 --> 00:17:17,470
‫care what you have in your bytes, it just gives you the beauty of indexes and storage

270
00:17:17,470 --> 00:17:19,310
‫engine, all this stuff, right?

271
00:17:19,330 --> 00:17:24,490
‫But then in the front end you can build your database the way you want.

272
00:17:24,580 --> 00:17:26,470
‫That's why you can build a graph database.

273
00:17:26,470 --> 00:17:35,090
‫So a graph database will prioritize not rows or columns per se, or even documents, but the traversability.

274
00:17:35,140 --> 00:17:39,250
‫So if this node is connected to this node, which connects to this node, I want to store them next to each

275
00:17:39,250 --> 00:17:41,860
‫other, right, in this way.

276
00:17:41,860 --> 00:17:49,490
‫And so the whole goal between the API, the storage engine, and the front end is that when I do an I/O

277
00:17:49,490 --> 00:17:50,810
‫and it gives me a page,

278
00:17:51,980 --> 00:17:57,000
‫you want this page, as much as possible, to have everything you need.

279
00:17:57,020 --> 00:18:03,710
‫You don't want to go back to read more pages. And I can go on for ages about this.

280
00:18:03,710 --> 00:18:08,810
‫You know, the efficiency of the I/O, I think this is the most important thing, but we still didn't

281
00:18:08,810 --> 00:18:11,870
‫get to the main part, which is the MongoDB database.

282
00:18:11,870 --> 00:18:14,500
‫So now we talked about NoSQL versus SQL.

283
00:18:14,510 --> 00:18:15,230
‫What's the difference?

284
00:18:15,270 --> 00:18:24,650
‫Right now, what we want to discuss is the first version ish of MongoDB.

285
00:18:24,800 --> 00:18:28,550
‫Yeah, this is prior to 4.2.

286
00:18:28,670 --> 00:18:35,630
‫MongoDB's first storage engine was called MMAPv1, memory map version one, which is literally just a bunch of

287
00:18:35,840 --> 00:18:37,340
‫data files, right?

288
00:18:37,340 --> 00:18:39,900
‫And in the data file, right,

289
00:18:41,130 --> 00:18:41,460
‫uh,

290
00:18:43,020 --> 00:18:49,060
‫in the data file, the documents are stored one after the other.

291
00:18:49,080 --> 00:18:52,320
‫Now, I don't know if there is one data file per collection.

292
00:18:52,320 --> 00:18:55,370
‫Maybe when you have a collection you will have a data file.

293
00:18:55,380 --> 00:18:55,740
‫Maybe.

294
00:18:55,740 --> 00:18:56,670
‫Maybe it's different.

295
00:18:56,670 --> 00:18:58,110
‫But what?

296
00:18:59,140 --> 00:19:03,430
‫The brilliant design behind the first version was that it was offset-based.

297
00:19:03,460 --> 00:19:07,390
‫That means, Hey, I want this document.

298
00:19:07,660 --> 00:19:08,750
‫Document.

299
00:19:08,770 --> 00:19:10,720
‫This particular document with an ID.

300
00:19:10,990 --> 00:19:16,170
‫So what Mongo has is a unique identifier, right?

301
00:19:16,180 --> 00:19:23,350
‫If you know about that, this unique identifier tells you exactly which document you want.

302
00:19:23,350 --> 00:19:23,830
‫Right.

303
00:19:24,220 --> 00:19:26,230
‫And there is an index attached to it.

304
00:19:26,260 --> 00:19:26,710
‫Right?

305
00:19:27,750 --> 00:19:28,980
‫And this index,

306
00:19:28,980 --> 00:19:30,510
‫this is a B-tree index.

307
00:19:30,540 --> 00:19:35,070
‫When you traverse the B-tree index, you find the ID.

308
00:19:35,160 --> 00:19:35,420
‫Okay.

309
00:19:35,740 --> 00:19:37,010
‫It's in this page.

310
00:19:37,020 --> 00:19:37,740
‫It's in this page.

311
00:19:37,740 --> 00:19:38,810
‫And then you find it.

312
00:19:38,820 --> 00:19:45,000
‫The pointer of this unique identifier is something called a disk location.

313
00:19:45,180 --> 00:19:47,520
‫I think it's a 32 byte.

314
00:19:47,550 --> 00:19:49,800
‫It's actually 64 bit.

315
00:19:49,800 --> 00:19:50,310
‫Sorry.

316
00:19:50,340 --> 00:19:53,760
‫It's a 64 bit pointer.

317
00:19:53,760 --> 00:19:54,780
‫32.

318
00:19:54,810 --> 00:20:01,560
‫The first 32 bits tell you the file, which file, and the second 32 bits tell you

319
00:20:01,560 --> 00:20:04,740
‫the offset because now you know which file.

320
00:20:04,740 --> 00:20:08,130
‫But then the file is, say, one gig, right?

321
00:20:08,160 --> 00:20:12,570
‫Where exactly the document is in this file is the offset.
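
The 64-bit pointer split into a 32-bit file number and a 32-bit offset can be illustrated with plain bit arithmetic. The function names are made up for this sketch, and the exact bit layout MongoDB used is not shown in the talk, so treat the ordering of the two halves as an assumption.

```python
def pack_disk_loc(file_no: int, offset: int) -> int:
    """Combine a 32-bit file number and a 32-bit offset into one 64-bit value."""
    if not (0 <= file_no < 2**32 and 0 <= offset < 2**32):
        raise ValueError("each part must fit in 32 bits")
    return (file_no << 32) | offset  # file number in the high half


def unpack_disk_loc(loc: int):
    """Split a 64-bit location back into (file number, offset)."""
    return loc >> 32, loc & 0xFFFFFFFF


loc = pack_disk_loc(3, 4096)  # "file 3, byte 4096"
```

One side effect of a 32-bit offset is that a single file can address at most 4 GiB, which fits with the data files being kept relatively small.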

322
00:20:12,960 --> 00:20:16,860
‫So with one single read, you can go.

323
00:20:16,890 --> 00:20:17,910
‫Exactly.

324
00:20:17,910 --> 00:20:24,540
‫Because this is how the OS reads, right? You give the OS the file name and say, hey, go

325
00:20:24,540 --> 00:20:25,740
‫exactly to that location.

326
00:20:25,740 --> 00:20:27,700
‫You can absolutely do that; the file system

327
00:20:27,700 --> 00:20:32,590
‫allows you to say read from that position and read for X amount of bytes.

328
00:20:32,800 --> 00:20:33,040
‫Right.

329
00:20:33,280 --> 00:20:37,210
‫So I suppose another property is the document size.

330
00:20:37,210 --> 00:20:39,700
‫So you also need to store the document size, right?
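
Reading a document by offset and size looks roughly like this in Python. The sample documents and the `read_document` helper are assumptions for the sketch; it uses a temporary file and the standard `seek` and `read` calls.

```python
import os
import tempfile

docs = [b'{"_id": 1, "name": "Ada"}', b'{"_id": 2, "name": "Linus"}']

# Write the documents back to back and remember (offset, size) for each.
locations = []
with tempfile.NamedTemporaryFile(delete=False) as f:
    for d in docs:
        locations.append((f.tell(), len(d)))
        f.write(d)
    path = f.name


def read_document(path, offset, size):
    """Jump straight to `offset` and read exactly `size` bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


second = read_document(path, *locations[1])
os.remove(path)  # clean up the temporary data file
```

Without the stored size there would be no way to know where the second document ends, which is why the size has to be kept alongside the offset.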

331
00:20:40,300 --> 00:20:45,490
‫So say read this part and then you're going to read that, right?

332
00:20:45,490 --> 00:20:48,430
‫And then you get a bunch of pages probably.

333
00:20:48,430 --> 00:20:52,780
‫And then if you're lucky, you're going to get one document or more.

334
00:20:52,810 --> 00:20:53,020
‫Right?

335
00:20:53,170 --> 00:20:55,360
‫That's why the document also has a maximum size.

336
00:20:55,360 --> 00:20:59,520
‫You can't go beyond a certain size because of these limitations, right?

337
00:20:59,530 --> 00:21:00,580
‫So now you got it.

338
00:21:00,580 --> 00:21:05,410
‫So you do one B-tree scan from the ID, right,

339
00:21:05,440 --> 00:21:11,260
‫To find exactly which document to pull.

340
00:21:11,410 --> 00:21:11,950
‫Right.

341
00:21:12,610 --> 00:21:17,740
‫Again, you're going to get a bunch of bytes, and then the front end is responsible for parsing

342
00:21:18,260 --> 00:21:21,950
‫the bytes to actually find documents per se.

343
00:21:22,130 --> 00:21:27,830
‫And of course, if this was a relational database, then it's going to be columns and rows, right?

344
00:21:27,830 --> 00:21:31,820
‫If it was a graph, you're going to parse it such that you know the beginning and the end.

345
00:21:31,880 --> 00:21:32,330
‫Right?

346
00:21:32,480 --> 00:21:35,480
‫And it's not really rocket science at the end of the day.

347
00:21:35,540 --> 00:21:39,200
‫So we're getting a big O of log N, right?

348
00:21:39,200 --> 00:21:45,110
‫So it's just A1IO or multiple i os to traverse the nodes.

349
00:21:45,230 --> 00:21:51,450
‫That's why it's important that the b-tree is small enough to fit in memory such that because the bit

350
00:21:51,860 --> 00:21:55,550
‫the index is just not a data structure which is persisted on disk.

351
00:21:55,580 --> 00:21:57,890
‫You read it from disk and you put it in memory.

352
00:21:57,890 --> 00:21:59,630
‫Hopefully it fits in memory.

353
00:21:59,630 --> 00:22:06,350
‫That's why Discord, actually, one problem that Discord faced was they moved from MongoDB because

354
00:22:06,350 --> 00:22:10,580
‫their indexes were so large they couldn't even fit in memory.

355
00:22:10,820 --> 00:22:19,650
‫And if your index doesn't fit in memory, that means as you traverse, right, the operating system

356
00:22:19,650 --> 00:22:25,140
‫will do this paging with swap files and will swap things to disk if they're not used, right?

357
00:22:25,470 --> 00:22:31,530
‫And this scanning is going to become slower just to find the DiskLoc.

358
00:22:31,920 --> 00:22:33,540
‫But that was the original thing.

359
00:22:33,540 --> 00:22:40,110
‫The problem, the clear problem with this is anything you touch, you change the document size, you

360
00:22:40,110 --> 00:22:42,330
‫update it to a longer string.

361
00:22:42,360 --> 00:22:46,560
‫The entire file is now scrambled.

362
00:22:46,560 --> 00:22:46,920
‫Right?

363
00:22:46,920 --> 00:22:51,450
‫Because you changed the physical offset on disk.

364
00:22:51,450 --> 00:22:52,020
‫Right.

365
00:22:52,020 --> 00:22:57,900
‫I suppose you can play games with this, but this became very, very problematic, right?

366
00:22:58,080 --> 00:22:59,850
‫Because the documents are based on offset.

367
00:22:59,850 --> 00:23:03,030
‫The moment you change the document size, you push it a little bit.

368
00:23:03,060 --> 00:23:05,460
‫All the offsets are now off.
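
To make the offset problem concrete, here is a toy sketch in Python. It is only an illustration of offset-addressed storage in general, not MMAPv1's actual on-disk format; the `build` and `read` helpers and the sample documents are made up for the demo.

```python
# Toy model of offset-addressed storage (an illustration, not MMAPv1's real format).
docs = [b'{"_id":1,"name":"ann"}', b'{"_id":2,"name":"bob"}', b'{"_id":3,"name":"cy"}']

def build(docs):
    """Pack documents back to back; return the file bytes and an _id -> (offset, size) index."""
    data, index, offset = b"", {}, 0
    for doc_id, doc in enumerate(docs, start=1):
        index[doc_id] = (offset, len(doc))
        data += doc
        offset += len(doc)
    return data, index

def read(data, index, doc_id):
    """Fetch a document purely by the offset and size recorded in the index."""
    offset, size = index[doc_id]
    return data[offset:offset + size]

data, index = build(docs)
assert read(data, index, 2) == docs[1]

# Grow document 1 in place without touching the index:
# every offset recorded after it is now stale.
grown = b'{"_id":1,"name":"annabelle"}'
data_after = grown + data[len(docs[0]):]
stale_read = read(data_after, index, 2)
print(stale_read)  # bytes straddling two documents, no longer document 2
```

The only way out in this model is to rewrite every offset that follows the changed document, which is the "playing tricks" mentioned here.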

369
00:23:05,460 --> 00:23:06,060
‫Right?

370
00:23:06,090 --> 00:23:08,920
‫That was the original design, I suppose, if I'm not mistaken,

371
00:23:08,920 --> 00:23:15,600
‫of MyISAM in MySQL, which is no longer used because of the same reason.

372
00:23:15,600 --> 00:23:18,210
‫Yeah, it's nice for read only.

373
00:23:18,330 --> 00:23:19,110
‫It's beautiful.

374
00:23:19,110 --> 00:23:19,860
‫All right.

375
00:23:20,190 --> 00:23:21,450
‫If I'm not changing it.

376
00:23:21,450 --> 00:23:22,740
‫Yeah, it's very fast.

377
00:23:22,740 --> 00:23:27,960
‫You know exactly where it is and you pull it, but as you change it, it's just,

378
00:23:27,990 --> 00:23:29,510
‫It becomes really a mess.

379
00:23:29,520 --> 00:23:30,810
‫I suppose you can play tricks.

380
00:23:30,810 --> 00:23:32,610
‫Of course you can update the offsets.

381
00:23:32,610 --> 00:23:33,780
‫Offsets, right?

382
00:23:33,930 --> 00:23:35,520
‫You can update the offsets.

383
00:23:35,520 --> 00:23:36,060
‫But.

384
00:23:36,720 --> 00:23:37,710
‫That was a problem.

385
00:23:37,760 --> 00:23:43,230
‫And plus, another problem with MMAP is the locking model, right?

386
00:23:43,260 --> 00:23:44,100
‫That's another thing.

387
00:23:44,100 --> 00:23:47,700
‫That is a responsibility of the storage engine, really: locking.

388
00:23:47,700 --> 00:23:48,170
‫Right.

389
00:23:48,180 --> 00:23:53,970
‫How do you prevent two people from editing the same document at the same time?

390
00:23:55,020 --> 00:23:57,350
‫You shouldn't really do that, right?

391
00:23:57,360 --> 00:23:58,440
‫Databases, no,

392
00:23:58,440 --> 00:24:01,290
‫no database will allow you to update the same

393
00:24:02,320 --> 00:24:04,150
‫unit of work, if you will.

394
00:24:04,600 --> 00:24:05,710
‫If it's a row.

395
00:24:05,890 --> 00:24:09,010
‫If it's a table, If it's a collection.

396
00:24:09,100 --> 00:24:09,910
‫Right.

397
00:24:10,490 --> 00:24:11,960
‫In MMAP,

398
00:24:12,380 --> 00:24:14,750
‫it was very strict, right?

399
00:24:14,840 --> 00:24:16,790
‫Imagine this like.

400
00:24:18,300 --> 00:24:22,950
‫The first version of MMAP, they didn't even bother.

401
00:24:22,970 --> 00:24:26,850
‫Imagine, because these are people who are rebuilding a database from scratch.

402
00:24:26,850 --> 00:24:31,980
‫So they didn't think about all this stuff that database

403
00:24:31,980 --> 00:24:34,290
‫people have been doing for years.

404
00:24:34,290 --> 00:24:34,470
‫Right?

405
00:24:34,560 --> 00:24:35,850
‫For decades, actually.

406
00:24:36,090 --> 00:24:43,080
‫So the first problem they ran into is like, oh, two people can change the same doc, different documents.

407
00:24:43,200 --> 00:24:45,660
‫Oh, it's all offset-based.

408
00:24:45,660 --> 00:24:46,910
‫Oh, you know what?

409
00:24:46,920 --> 00:24:49,380
‫Let's just create a lock, a global lock.

410
00:24:49,380 --> 00:24:53,100
‫So the first version was a global lock per database.

411
00:24:53,100 --> 00:24:54,270
‫So no

412
00:24:54,270 --> 00:24:55,110
‫two

413
00:24:55,770 --> 00:25:02,730
‫transactions can change documents, even in different collections, at the same time.

414
00:25:02,730 --> 00:25:08,280
‫So if you have collection one and collection two, you can't even change collection one and collection two

415
00:25:08,310 --> 00:25:09,450
‫documents

416
00:25:09,870 --> 00:25:12,590
‫concurrently. They are serialized.

417
00:25:12,600 --> 00:25:14,220
‫There is one global lock.

418
00:25:14,340 --> 00:25:19,890
‫Again, that was the very first version, because it's a single database lock.

419
00:25:20,010 --> 00:25:22,590
‫So you read this: data files.

420
00:25:22,590 --> 00:25:24,660
‫This tells me that the data files are actually

421
00:25:26,090 --> 00:25:26,690
‫collapsed.

422
00:25:26,690 --> 00:25:32,570
‫So multiple data files, I mean, multiple collections can live in the same data files.

423
00:25:32,570 --> 00:25:37,100
‫That's one reason you have to acquire a lock, so that no

424
00:25:37,100 --> 00:25:37,820
‫two, no

425
00:25:37,820 --> 00:25:38,960
‫two transactions can change it.

426
00:25:38,960 --> 00:25:40,700
‫But then they improved this in 3,

427
00:25:40,730 --> 00:25:44,180
‫in version 3; that wasn't in version

428
00:25:44,180 --> 00:25:49,730
‫2 of Mongo. They made it a collection-level lock, which is still not good.

429
00:25:49,850 --> 00:25:50,360
‫Right.

430
00:25:50,540 --> 00:25:54,020
‫For the SQL people,

431
00:25:54,020 --> 00:25:55,820
‫it's like saying a table lock.

432
00:25:55,850 --> 00:26:02,180
‫Imagine you have a table of a million rows and you want to insert a row in the table and then you want

433
00:26:02,180 --> 00:26:06,170
‫to update another row in the same table; they have nothing to do with each other.

434
00:26:06,170 --> 00:26:06,650
‫Right.

435
00:26:06,680 --> 00:26:08,180
‫Imagine these are blocked.

436
00:26:08,180 --> 00:26:09,530
‫Yes, it was blocked.

437
00:26:09,530 --> 00:26:13,280
‫And if you're using MMAP, this is still the case.

438
00:26:13,280 --> 00:26:18,530
‫MMAPv1, which is deprecated, by the way, MMAPv1 is deprecated now.

439
00:26:18,530 --> 00:26:21,560
‫MMAPv1 is a per-collection lock.

440
00:26:21,560 --> 00:26:26,580
‫So now sure, you can do a concurrent write on two different collections, right?

441
00:26:26,820 --> 00:26:28,200
‫Without blocking.

442
00:26:28,200 --> 00:26:33,180
‫But now if you're updating the same document, that's a problem, right?

443
00:26:33,420 --> 00:26:37,830
‫So then it became very challenging to manage the storage engine.

444
00:26:37,830 --> 00:26:44,010
‫So what MongoDB did is say, you know what, let's just acquire this WiredTiger storage engine, a very,

445
00:26:44,010 --> 00:26:48,270
‫very popular, very efficient storage engine.

446
00:26:48,270 --> 00:26:49,590
‫So what they did is they,

447
00:26:50,940 --> 00:26:56,190
‫MongoDB just discarded this and they got a storage engine out of the box.

448
00:26:56,340 --> 00:27:01,320
‫This has become what we call WiredTiger, right?

449
00:27:01,350 --> 00:27:02,880
‫And the front end didn't change.

450
00:27:03,000 --> 00:27:05,520
‫So your application code doesn't change.

451
00:27:05,520 --> 00:27:07,690
‫The storage engine in the back end changed, right?

452
00:27:07,710 --> 00:27:09,750
‫So now this is Wiredtiger.

453
00:27:09,780 --> 00:27:12,450
‫WiredTiger gave them the ability.

454
00:27:13,380 --> 00:27:22,260
‫Now, here's the thing: with WiredTiger, document-level locking has become possible.

455
00:27:22,470 --> 00:27:26,610
‫Now you can update two documents in the same collection.

456
00:27:27,060 --> 00:27:28,920
‫I'm saying these things

457
00:27:28,920 --> 00:27:30,960
‫and you might say, this all already exists.

458
00:27:30,960 --> 00:27:31,260
‫I know.

459
00:27:31,260 --> 00:27:37,560
‫But I'm telling you the history of things because building databases is not really a trivial thing.

460
00:27:37,710 --> 00:27:45,660
‫The brilliant engineers went through this and, you know, they ran into a lot of challenges

461
00:27:45,690 --> 00:27:46,680
‫and this is one of them.

462
00:27:46,680 --> 00:27:54,690
‫So the Wiredtiger storage engine allowed you to update two different documents on the same collection

463
00:27:54,690 --> 00:27:59,310
‫concurrently, which is now a beautiful thing.

464
00:27:59,310 --> 00:27:59,580
‫Right?

465
00:27:59,610 --> 00:28:08,850
‫And this now made it equivalent to basically all databases, because the databases have

466
00:28:08,850 --> 00:28:11,700
‫row-level locks, at least MySQL and Postgres.

467
00:28:11,730 --> 00:28:19,180
‫You can definitely update two rows in the same table, but you cannot update the same row

468
00:28:19,210 --> 00:28:20,560
‫in the same table concurrently.

469
00:28:21,220 --> 00:28:21,700
‫Right.

470
00:28:21,730 --> 00:28:27,910
‫We acquire a lock and then the second transaction tries to update the same row.

471
00:28:28,270 --> 00:28:33,880
‫That will basically pause the second transaction with row level locking.

472
00:28:33,910 --> 00:28:37,840
‫Now there is, I think, what's the database called?

473
00:28:38,680 --> 00:28:39,420
‫Yugabyte?

474
00:28:39,460 --> 00:28:39,940
‫Yugabyte.

475
00:28:39,970 --> 00:28:40,360
‫Yugabyte.

476
00:28:40,400 --> 00:28:42,160
‫YugabyteDB, if I'm not mistaken.

477
00:28:42,190 --> 00:28:49,630
‫They even introduced column-level locking, which is another thing, like if I have a row and I'm

478
00:28:49,630 --> 00:28:53,290
‫updating field one in that row, but you're updating field two.

479
00:28:53,320 --> 00:28:54,700
‫Technically, I'm not.

480
00:28:54,730 --> 00:28:56,590
‫We're not changing the same thing.

481
00:28:56,650 --> 00:29:01,990
‫Postgres will block you even if you're updating different things. MySQL, if I'm not mistaken, will

482
00:29:01,990 --> 00:29:04,000
‫also lock it because it's a row-level lock.

483
00:29:04,030 --> 00:29:05,050
‫But now.

484
00:29:05,920 --> 00:29:10,380
‫You can also include column level locking which says, hey, if you.

485
00:29:10,570 --> 00:29:14,140
‫Yeah, you touched this row but different fields from this row.

486
00:29:14,320 --> 00:29:15,820
‫Same thing with the document.

487
00:29:15,820 --> 00:29:16,390
‫Right?

488
00:29:16,570 --> 00:29:22,450
‫I am really just updating this field in the JSON document, and someone is inserting a new

489
00:29:22,450 --> 00:29:24,900
‫field or updating another. Are we locking?

490
00:29:24,910 --> 00:29:26,200
‫Do we really need to lock it?

491
00:29:26,230 --> 00:29:29,880
‫Well, at the end of the day, this is what we do.

492
00:29:29,890 --> 00:29:30,880
‫We lock it.

493
00:29:31,120 --> 00:29:38,140
‫So yeah, if you happen to have two transactions updating the same row, even different columns,

494
00:29:38,290 --> 00:29:44,770
‫you can't do that unless you have column-level locking, or key-level locking, if you will, in Mongo, which

495
00:29:44,770 --> 00:29:46,120
‫I don't think exists.

496
00:29:46,540 --> 00:29:50,830
‫And believe me, when I'm talking about these things, this is not cheap.

497
00:29:50,830 --> 00:29:51,310
‫Right.

498
00:29:51,460 --> 00:29:56,530
‫The moment you introduce column level locking, that's another expense because now you have to keep

499
00:29:56,560 --> 00:29:59,350
‫track of what you're locking, and locks are,

500
00:29:59,380 --> 00:29:59,860
‫guess what,

501
00:29:59,860 --> 00:30:07,430
‫in memory. And row locks are more expensive than page locks or table locks or collection locks, because

502
00:30:07,430 --> 00:30:14,810
‫for those you just need one, versus if you updated a million rows and the transactions are in

503
00:30:14,810 --> 00:30:15,680
‫progress,

504
00:30:15,680 --> 00:30:17,960
‫that's a million locks, right?

505
00:30:18,110 --> 00:30:20,300
‫Imagine adding column locks to that.

506
00:30:20,300 --> 00:30:26,570
‫So a million times however many columns you're updating. It becomes really challenging, right?

507
00:30:27,150 --> 00:30:27,950
‫Yeah.

508
00:30:27,980 --> 00:30:30,020
‫Building a database is not trivial.

509
00:30:30,050 --> 00:30:30,410
‫All right.

510
00:30:30,440 --> 00:30:32,110
‫Going back to WiredTiger.

511
00:30:32,120 --> 00:30:33,650
‫We talked about that, right?

512
00:30:33,680 --> 00:30:40,260
‫WiredTiger introduced compression, which didn't exist, by the way, in MMAP, right?

513
00:30:40,280 --> 00:30:41,540
‫It didn't exist here.

514
00:30:41,570 --> 00:30:43,610
‫Wiredtiger introduced compression.

515
00:30:43,610 --> 00:30:51,440
‫Now, when you actually take the document, WiredTiger compresses the JSON document, so that's really

516
00:30:51,440 --> 00:30:52,400
‫brilliant.

517
00:30:52,430 --> 00:30:58,430
‫Especially because JSON documents have these field names repeated all the time, right?

518
00:30:58,730 --> 00:31:01,250
‫The field repeats, so you need to compress it.

519
00:31:01,250 --> 00:31:04,730
‫So MongoDB Wiredtiger actually compresses that.

520
00:31:04,730 --> 00:31:05,800
‫So that's tiny.

521
00:31:05,810 --> 00:31:06,740
‫Why is it tiny?

522
00:31:06,740 --> 00:31:13,880
‫Because now if I'm compressing it, the page will fit more documents. One I/O will give me more documents

523
00:31:13,970 --> 00:31:18,140
‫than one I/O did uncompressed.

524
00:31:18,380 --> 00:31:21,320
‫If one I/O uncompressed gives me three documents,

525
00:31:21,350 --> 00:31:26,900
‫one I/O compressed in a single page will give me 20 documents.
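
A rough way to see the "more documents per I/O" effect is to count how many JSON documents fit in one fixed-size page with and without compression. This is a sketch under assumptions: `PAGE_SIZE`, the `docs_per_page` helper and the sample documents are invented for the demo, and `zlib` from Python's standard library stands in for whatever block compressor the storage engine is configured with.

```python
import json
import zlib

PAGE_SIZE = 4096  # hypothetical page size in bytes, chosen only for the demo

def docs_per_page(docs, compress):
    """Count how many leading documents fit in one page, optionally zlib-compressed."""
    count = 0
    for n in range(1, len(docs) + 1):
        blob = b"".join(docs[:n])
        if compress:
            blob = zlib.compress(blob)
        if len(blob) > PAGE_SIZE:
            break
        count = n
    return count

# Documents that repeat the same field names over and over, as JSON does.
docs = [
    json.dumps({"_id": i, "first_name": "user", "last_name": "example", "city": "somewhere"}).encode()
    for i in range(2000)
]
plain = docs_per_page(docs, compress=False)
packed = docs_per_page(docs, compress=True)
print(plain, packed)  # documents per page: uncompressed vs compressed
```

The exact numbers depend on the data and the compressor; the point is only that repeated field names compress well, so one page read returns more documents.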

526
00:31:27,660 --> 00:31:36,000
‫This is really powerful, because if I'm fetching 20 documents in

527
00:31:36,000 --> 00:31:38,160
‫the older model, I have to do multiple I/Os.

528
00:31:38,190 --> 00:31:44,580
‫I have to hit the disk multiple times, versus in WiredTiger, just one I/O pulls all this stuff

529
00:31:44,580 --> 00:31:45,390
‫compressed.

530
00:31:45,420 --> 00:31:51,090
‫Do a little decompression in memory and you get a beautiful 20 documents.

531
00:31:51,090 --> 00:31:55,050
‫The major thing you have to think about here: how do I save I/Os?

532
00:31:55,080 --> 00:32:02,430
‫That is the number one job of a DBA, of a developer of a database: saving I/Os.

533
00:32:02,460 --> 00:32:06,180
‫The less the I/O, the faster the database. Nothing else matters.

534
00:32:07,020 --> 00:32:08,940
‫That is exactly what it is.

535
00:32:11,280 --> 00:32:14,520
‫All right, so now what?

536
00:32:14,520 --> 00:32:17,880
‫The way WiredTiger stores the data is completely changed.

537
00:32:17,880 --> 00:32:20,640
‫It's no longer using this DiskLoc thing, right,

538
00:32:20,640 --> 00:32:24,780
‫where it's just a bunch of data files, and then you have offsets, because offsets are really terrible,

539
00:32:24,780 --> 00:32:31,170
‫right, for changes: the offset changes and you have to update everything. Changing one

540
00:32:31,170 --> 00:32:33,900
‫document will screw up all your offsets.

541
00:32:33,930 --> 00:32:34,470
‫Right?

542
00:32:34,680 --> 00:32:40,470
‫So what they did instead, they stored it as a clustered B-tree index.

543
00:32:40,470 --> 00:32:42,690
‫And I talked about this in another video.

544
00:32:42,930 --> 00:32:48,540
‫I'm not going to go into details, but in a nutshell, right, they have something called the record

545
00:32:48,570 --> 00:32:49,560
‫ID here.

546
00:32:49,560 --> 00:32:51,630
‫And you can basically create anything.

547
00:32:51,630 --> 00:32:55,800
‫This is a hidden clustered index in WiredTiger, and

548
00:32:56,470 --> 00:32:58,990
‫based on the key you can search.

549
00:32:58,990 --> 00:33:03,850
‫And when you get here, the value is actually the entire document.

550
00:33:03,850 --> 00:33:05,630
‫And not only a document, right.

551
00:33:05,650 --> 00:33:09,430
‫But physically all the documents.

552
00:33:10,290 --> 00:33:13,380
‫are ordered next to each other.

553
00:33:13,380 --> 00:33:21,450
‫So the page that you land on here, the leaf pages are the data.

554
00:33:21,480 --> 00:33:23,310
‫This is the data.

555
00:33:23,340 --> 00:33:25,500
‫The entire data is the index.

556
00:33:25,500 --> 00:33:27,000
‫That's what a clustered index is.

557
00:33:27,040 --> 00:33:33,030
‫It's by default what you get in MySQL. Not in Postgres, but in MySQL.

558
00:33:33,030 --> 00:33:34,680
‫Everything is a clustered index.

559
00:33:34,680 --> 00:33:40,890
‫Every table has a clustered index, and that's how your data is organized around the index.

560
00:33:40,890 --> 00:33:46,290
‫So your table is organized around this index, where the leaf pages are the data.

561
00:33:46,290 --> 00:33:49,860
‫So now if you land here, you get the document and guess what?

562
00:33:49,890 --> 00:33:54,720
‫Because it's on one page, you get any document before it and you get any documents after it.

563
00:33:54,720 --> 00:33:58,590
‫And because it's compressed, you're going to get a lot of tightly packed documents as well.

564
00:33:58,590 --> 00:34:04,590
‫So you read this page and you get all the documents nearby because it's ordered.

565
00:34:04,710 --> 00:34:06,060
‫Not only that.

566
00:34:06,920 --> 00:34:14,010
‫Each leaf page in a B+ tree is actually linked to the next page and to the next page and to the next page.

567
00:34:14,030 --> 00:34:18,080
‫It's a linked list of pages, so the entire data is right here.

568
00:34:18,080 --> 00:34:23,870
‫So if you want to do a range query and say, find me all record IDs between X

569
00:34:23,870 --> 00:34:29,900
‫and Y, and we're going to talk about what the record ID is, because this is not the ID of the document.

570
00:34:29,900 --> 00:34:34,640
‫And that's the problem that WiredTiger and Mongo introduced in a way.

571
00:34:35,470 --> 00:34:44,830
‫Now, if we have this, if you do a range scan, you're going to get all the documents that

572
00:34:44,830 --> 00:34:45,670
‫are next to each other.

573
00:34:45,670 --> 00:34:50,380
‫So a range scan is really powerful in a B+ tree, especially if it's clustered, because now you're

574
00:34:50,380 --> 00:34:54,100
‫going to get all the nice documents tucked in together, right?

575
00:34:54,100 --> 00:34:56,380
‫So you can find your

576
00:34:57,210 --> 00:35:02,340
‫document using a B+ tree search in WiredTiger using the record ID.

577
00:35:02,400 --> 00:35:02,890
‫But guess what?

578
00:35:02,910 --> 00:35:03,930
‫What is this record ID?

579
00:35:05,970 --> 00:35:08,130
‫It doesn't mean anything to the user.

580
00:35:08,160 --> 00:35:09,720
‫This is an internal thing.

581
00:35:10,230 --> 00:35:13,050
‫But where did this DiskLoc go?

582
00:35:13,080 --> 00:35:16,050
‫This used to be called the DiskLoc, but they changed it.

583
00:35:16,050 --> 00:35:17,130
‫And that's what they had.

584
00:35:17,160 --> 00:35:27,810
‫They had this as DiskLoc, and their indexes, the ID, the actual user-facing document ID index, has

585
00:35:27,810 --> 00:35:31,800
‫always been mapped to the DiskLoc because that's what we had, right?

586
00:35:31,800 --> 00:35:32,320
‫DiskLoc.

587
00:35:32,340 --> 00:35:33,420
‫That's exactly this.

588
00:35:33,420 --> 00:35:34,110
‫This.

589
00:35:34,990 --> 00:35:35,700
‫Is this.

590
00:35:35,710 --> 00:35:39,580
‫This used to be the DiskLoc; they later changed it to record ID.

591
00:35:39,700 --> 00:35:41,800
‫So it doesn't make sense to call it DiskLoc.

592
00:35:42,010 --> 00:35:47,380
‫But then this record ID now is just a pointer to where? Not to disk.

593
00:35:47,410 --> 00:35:51,940
‫It is a pointer to this B+ tree, which is the hidden index.

594
00:35:51,940 --> 00:35:58,630
‫So now if you're actually searching for the ID, the primary key, you're doing two lookups in WiredTiger,

595
00:35:58,630 --> 00:35:59,830
‫not one.

596
00:36:00,370 --> 00:36:10,510
‫So actually ID lookups in WiredTiger were slower than in the older one, because now you have

597
00:36:10,510 --> 00:36:16,210
‫to search two indexes, you have to load two indexes in memory, double the space, double the searches,

598
00:36:16,210 --> 00:36:17,290
‫double the I/O.

599
00:36:17,410 --> 00:36:22,810
‫When you write, you have to write to multiple indexes because you have to sync those two guys together.

600
00:36:24,790 --> 00:36:30,360
‫Secondary indexes, not so much, because the secondary indexes, right, if you think about it, really,

601
00:36:30,370 --> 00:36:31,570
‫secondary indexes,

602
00:36:31,600 --> 00:36:33,100
‫secondary indexes

603
00:36:33,370 --> 00:36:40,360
‫now just point directly to the record ID. So yeah, in this particular case, all of these indexes

604
00:36:40,360 --> 00:36:46,150
‫always point to the record ID. Whether it's a primary index or a secondary index, they all point to

605
00:36:46,150 --> 00:36:52,540
‫the record ID, and that's what's causing the double search, effectively, right?

606
00:36:53,320 --> 00:36:56,290
‫So very similar to MySQL.

607
00:36:56,290 --> 00:37:03,610
‫Not quite because MySQL primary key is actually this thing, right?

608
00:37:03,700 --> 00:37:11,080
‫But the primary key in the first versions, at least from 4.2 to 5.2, very recent,

609
00:37:11,080 --> 00:37:11,950
‫this changed, by the way.

610
00:37:11,950 --> 00:37:12,220
‫Right.

611
00:37:14,370 --> 00:37:18,450
‫Until very recently, 4.2 to 5.2, it's like this.

613
00:37:23,250 --> 00:37:25,350
‫This is not a big o of one, right?

614
00:37:25,350 --> 00:37:30,180
‫This is a big O of log n plus big O of log n two searches.

615
00:37:30,210 --> 00:37:30,610
‫Right.

616
00:37:30,630 --> 00:37:34,260
‫Whereas this guy you do big O of log n and then big O of one.

617
00:37:34,650 --> 00:37:36,150
‫So now we have this beautiful design.

618
00:37:36,450 --> 00:37:43,410
‫The problem, we understand now, is the ID: we have to kind of duplicate

619
00:37:43,410 --> 00:37:44,310
‫stuff, right?

620
00:37:44,970 --> 00:37:50,620
‫The record ID is 64-bit, same thing here, and secondary indexes all point to the record ID.

621
00:37:50,650 --> 00:37:53,310
‫That's the state of the art as of 5.2, right?

622
00:37:53,730 --> 00:37:58,530
‫And the ID index is just another secondary index at this point.

623
00:37:58,530 --> 00:38:05,880
‫It's not really a true primary index because the primary index, by definition at least, is the clustered

624
00:38:05,880 --> 00:38:06,360
‫index.

625
00:38:06,360 --> 00:38:06,670
‫Right?

626
00:38:06,690 --> 00:38:07,530
‫It is this one.

627
00:38:07,530 --> 00:38:09,120
‫But we have now doubled.

628
00:38:10,020 --> 00:38:17,200
‫Now let's go to the final stage where 5.3, I think is July of 2022.

629
00:38:17,200 --> 00:38:19,750
‫Really very, very brand new feature.

630
00:38:19,960 --> 00:38:26,650
‫It's called clustered collections, where you can create a collection and you can make it a clustered

631
00:38:26,650 --> 00:38:27,480
‫collection.

632
00:38:27,490 --> 00:38:36,760
‫That means the WiredTiger hidden key disappears and instead this becomes your index.

633
00:38:36,760 --> 00:38:39,190
‫Effectively, this becomes your clustered index.

634
00:38:39,190 --> 00:38:43,660
‫And the ID field is the main focus for this.

635
00:38:43,690 --> 00:38:52,000
‫Now, if you're searching by ID, right, you will immediately search by ID, do a little lookup and

636
00:38:52,000 --> 00:38:54,520
‫then find the document, because the clustered document is right here.

637
00:38:54,520 --> 00:38:57,940
‫All the leaf pages have the full documents right here.

638
00:38:58,630 --> 00:39:00,790
‫Pretty neat.

639
00:39:01,360 --> 00:39:05,590
‫You don't really need to do these two lookups anymore if you're searching for ID, right?

640
00:39:05,920 --> 00:39:07,780
‫Again, this is an option.

641
00:39:07,780 --> 00:39:09,990
‫You don't have to do it, right?

642
00:39:10,000 --> 00:39:15,340
‫So if you still want this design, for some reason, we're going to talk about why in a minute.

643
00:39:15,970 --> 00:39:17,080
‫You can still have it.

644
00:39:17,080 --> 00:39:19,180
‫But now in this guy.

645
00:39:20,620 --> 00:39:21,520
‫You can do this.

646
00:39:21,550 --> 00:39:23,120
‫What's the problem with this now?

647
00:39:23,380 --> 00:39:24,400
‫We talked about the good thing.

648
00:39:24,400 --> 00:39:25,180
‫The good thing.

649
00:39:25,270 --> 00:39:26,130
‫Single search.

650
00:39:26,140 --> 00:39:28,390
‫If you're using the ID for MongoDB.

651
00:39:28,540 --> 00:39:28,840
‫Right.

652
00:39:29,530 --> 00:39:33,910
‫If you're looking up a document by its ID, it's a single, beautiful search.

653
00:39:33,940 --> 00:39:36,610
‫You immediately find the document.
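
The difference between the two layouts can be sketched in a few lines of Python. Everything here is a stand-in: sorted lists searched with `bisect` play the role of B+ trees, and the keys, record IDs and helper names are made up for illustration rather than taken from WiredTiger.

```python
from bisect import bisect_left

def btree_search(sorted_keys, key):
    """Binary search standing in for an O(log n) B-tree descent; returns the key's position."""
    pos = bisect_left(sorted_keys, key)
    if pos == len(sorted_keys) or sorted_keys[pos] != key:
        raise KeyError(key)
    return pos

# Non-clustered layout: _id index -> record ID, then hidden clustered index -> document.
id_keys = ["a17", "b42", "c99"]      # sorted user-facing _id values
id_to_record = [3, 1, 2]             # record ID stored next to each _id entry
record_keys = [1, 2, 3]              # sorted hidden record IDs
record_docs = [{"_id": "b42"}, {"_id": "c99"}, {"_id": "a17"}]

def find_two_lookups(doc_id):
    record_id = id_to_record[btree_search(id_keys, doc_id)]    # search 1: _id index
    return record_docs[btree_search(record_keys, record_id)]   # search 2: hidden index

# Clustered layout: the leaves of the _id index hold the documents themselves.
clustered_keys = ["a17", "b42", "c99"]
clustered_docs = [{"_id": "a17"}, {"_id": "b42"}, {"_id": "c99"}]

def find_one_lookup(doc_id):
    return clustered_docs[btree_search(clustered_keys, doc_id)]  # single search

print(find_two_lookups("b42"), find_one_lookup("b42"))
```

Both functions return the same document; the clustered version simply skips the second descent, which also means one less index to keep in memory.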

654
00:39:36,640 --> 00:39:37,240
‫Right.

655
00:39:37,240 --> 00:39:40,270
‫And if you're lucky, you're going to find anything next to it.

656
00:39:40,270 --> 00:39:40,420
‫Right.

657
00:39:40,420 --> 00:39:41,760
‫It's not just one document.

658
00:39:41,770 --> 00:39:45,160
‫This is a collection of documents in a single page.

659
00:39:45,370 --> 00:39:48,700
‫I've got to find out what the page size is in WiredTiger.

660
00:39:49,240 --> 00:39:49,570
‫Right.

661
00:39:49,690 --> 00:39:50,920
‫But this is what you get.

662
00:39:50,920 --> 00:39:54,190
‫You're going to get this and it's going to be cached in memory temporarily.

663
00:39:54,250 --> 00:39:54,550
‫Right?

664
00:39:54,550 --> 00:40:01,480
‫So if you're lucky, the next or the previous ID next to it, you're going to get that as well,

665
00:40:01,480 --> 00:40:04,480
‫right, if the sequence really makes sense here.

666
00:40:04,570 --> 00:40:05,890
‫The problem, though.

667
00:40:06,540 --> 00:40:11,460
‫The problem, my friends, is now let's go back to the secondary indexes.

668
00:40:11,760 --> 00:40:16,560
‫The moment you introduce it, now this becomes identical to MySQL.

669
00:40:16,590 --> 00:40:22,920
‫MongoDB after 5.3, if you choose a clustered collection, is almost identical to MySQL now.

670
00:40:22,950 --> 00:40:25,080
‫It became identical to MySQL.

671
00:40:25,170 --> 00:40:29,670
‫The ID field, which is the primary key, is the clustered index.

672
00:40:29,700 --> 00:40:34,110
‫The secondary indexes point to what now?

673
00:40:35,090 --> 00:40:37,520
‫They have to point to the ID, right?

674
00:40:37,550 --> 00:40:41,330
‫There is no record ID; you moved where the data is.

675
00:40:41,750 --> 00:40:42,500
‫Right.

676
00:40:42,770 --> 00:40:44,240
‫Previously the secondary indexes,

677
00:40:44,330 --> 00:40:46,640
‫I should have drawn this, but sorry, I did not.

678
00:40:46,670 --> 00:40:53,540
‫The secondary indexes used to point to this thing, the hidden record ID, which is a very tiny value, 64

679
00:40:53,540 --> 00:40:53,720
‫bits.

680
00:40:53,750 --> 00:40:54,350
‫That's it.

681
00:40:55,190 --> 00:40:57,170
‫You know how large the ID field is?

682
00:40:58,540 --> 00:41:01,540
‫And did I did I actually mention that it's called the object ID?

683
00:41:01,810 --> 00:41:03,520
‫I actually mentioned that someone highlighted it.

684
00:41:03,850 --> 00:41:05,590
‫12 bytes.

685
00:41:05,620 --> 00:41:07,780
‫Dude, this is bytes, not bits.

686
00:41:07,810 --> 00:41:14,590
‫This thing is 12 bytes by default, and the first four bytes are the timestamp.

687
00:41:14,590 --> 00:41:15,760
‫The second three bytes are,

688
00:41:15,760 --> 00:41:21,420
‫I don't know what this is, because Mongo decided to scale first, right?

689
00:41:21,430 --> 00:41:25,510
‫So they wanted their IDs to be unique across machines.

690
00:41:25,510 --> 00:41:32,140
‫So even the second four bytes are a combination of the process ID and the machine name.

691
00:41:32,140 --> 00:41:36,640
‫And so the ID is truly universally unique across machines.
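
For reference, the documented layout of a current ObjectId is a 4-byte timestamp, a 5-byte random value, and a 3-byte counter; in older versions those middle 5 bytes were a 3-byte machine identifier plus a 2-byte process ID. The Python sketch below just slices the 12 bytes apart. The helper name `parse_object_id` is made up, and the sample value is an arbitrary 24-character hex string used for illustration.

```python
import struct
from datetime import datetime, timezone

def parse_object_id(hex_id):
    """Split a 24-hex-character ObjectId into timestamp, random value, and counter."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    (seconds,) = struct.unpack(">I", raw[:4])   # 4-byte big-endian Unix timestamp
    random_part = raw[4:9]                       # 5 bytes (machine + process ID in older versions)
    counter = int.from_bytes(raw[9:], "big")     # 3-byte incrementing counter
    return datetime.fromtimestamp(seconds, tz=timezone.utc), random_part, counter

created, random_part, counter = parse_object_id("507f1f77bcf86cd799439011")
print(created.isoformat(), random_part.hex(), counter)
```

So every default ID carries 12 bytes, against 8 bytes for a 64-bit record ID.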

692
00:41:36,640 --> 00:41:37,660
‫So that's why it's so big.

693
00:41:37,690 --> 00:41:39,420
‫12 bytes is so large.

694
00:41:39,430 --> 00:41:39,790
‫Yeah.

695
00:41:39,790 --> 00:41:42,610
‫So it's 12 bytes compared to eight bytes.

696
00:41:42,610 --> 00:41:42,820
‫Right.

697
00:41:42,820 --> 00:41:49,270
‫Because 64 bit is, is eight bytes and 12 bytes is 12 bytes.

698
00:41:49,270 --> 00:41:49,690
‫Right.

699
00:41:49,720 --> 00:41:51,520
‫So four bytes extra you might say.

700
00:41:51,520 --> 00:41:52,210
‫I'll say who cares.

701
00:41:52,210 --> 00:41:52,870
‫Four bytes extra.

702
00:41:52,870 --> 00:41:54,010
‫But here's the thing.

703
00:41:54,040 --> 00:41:54,550
‫Here's the thing.

704
00:41:54,550 --> 00:41:58,120
‫I didn't know MongoDB actually allows you,

705
00:41:58,120 --> 00:42:02,050
‫those who use MongoDB more might know,

706
00:42:02,920 --> 00:42:05,260
‫You can actually set anything in the ID field.

707
00:42:05,260 --> 00:42:07,510
‫So it's a user controlled field.

708
00:42:08,140 --> 00:42:11,260
‫If you don't set an ID, it's going to generate one for you.

709
00:42:11,260 --> 00:42:19,390
‫But if you do set it, you can have it be very large. People can have crazy ID values, and guess

710
00:42:19,390 --> 00:42:19,900
‫what?

711
00:42:20,500 --> 00:42:26,680
‫The secondary index now has to point to the ID because that's where the data is.

712
00:42:27,530 --> 00:42:35,390
‫And that's where all the problems of MySQL arise, where if the ID is a poorly chosen value, if the

713
00:42:35,390 --> 00:42:43,670
‫primary key is poorly chosen, like a GUID, right? Again, there's a lot of discussion, of course, about

714
00:42:43,670 --> 00:42:48,200
‫having a UUID as a primary key, but we know it's very large.

715
00:42:48,230 --> 00:42:54,130
‫If you use it as a primary key, then those primary keys are stored in the secondary indexes as

716
00:42:54,140 --> 00:42:57,170
‫values and that's what blows everything up.
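
Some back-of-the-envelope arithmetic shows why the key size matters once every secondary index entry has to carry the primary key. The workload numbers and the `pointer_bytes` helper below are hypothetical, and the estimate ignores compression and any prefix tricks an engine might apply.

```python
def pointer_bytes(num_docs, num_secondary_indexes, key_size_bytes):
    """Bytes spent only on the 'pointer back to the document' in every secondary index entry."""
    return num_docs * num_secondary_indexes * key_size_bytes

docs, indexes = 100_000_000, 5                  # hypothetical workload
record_id = pointer_bytes(docs, indexes, 8)     # 64-bit hidden record ID
object_id = pointer_bytes(docs, indexes, 12)    # default 12-byte ObjectId
uuid_text = pointer_bytes(docs, indexes, 36)    # UUID stored as a 36-character string

for label, size in [("record ID", record_id), ("ObjectId", object_id), ("UUID string", uuid_text)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

The cost scales with both the key size and the number of secondary indexes, so a large user-chosen ID is paid for once per index entry.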

717
00:42:57,800 --> 00:43:01,070
‫So now the secondary indexes just blow up.

718
00:43:01,780 --> 00:43:02,380
‫Right.

719
00:43:02,380 --> 00:43:07,360
‫And that's basically the evolution of MongoDB, you guys, right.

720
00:43:08,990 --> 00:43:09,860
‫As a summary.

721
00:43:09,980 --> 00:43:11,680
‫We started with MMAP.

722
00:43:11,690 --> 00:43:12,150
‫Right.

723
00:43:12,170 --> 00:43:13,520
‫Moved to WiredTiger.

724
00:43:13,550 --> 00:43:18,620
‫Gained a few new features, but introduced new problems. Since 5.3 and 6.0,

725
00:43:18,650 --> 00:43:20,960
‫you can actually do clustered indexes.

726
00:43:21,410 --> 00:43:22,220
‫I'm going to see you in the next one.

727
00:43:22,220 --> 00:43:23,450
‫Hope you enjoyed this video.

728
00:43:23,480 --> 00:43:24,050
‫Goodbye.