1 00:00:00,370 --> 00:00:06,240 ‫Filters are one of those concepts that always confused me for the longest time in computer science. 2 00:00:06,390 --> 00:00:14,220 ‫I'm going to take a few minutes to actually explain it to you guys and not what are they, but why do 3 00:00:14,220 --> 00:00:15,650 ‫they exist. 4 00:00:15,660 --> 00:00:18,780 ‫So I'm going to flip that question a little bit, if you're interested. 5 00:00:19,080 --> 00:00:19,680 ‫Stay tuned. 6 00:00:21,410 --> 00:00:23,810 ‫So here is our problem. 7 00:00:23,840 --> 00:00:24,470 ‫Forget about balloon. 8 00:00:25,190 --> 00:00:30,860 ‫Here's a problem today that we know how to solve, but we can do a better. 9 00:00:32,280 --> 00:00:40,380 ‫I'm going to ride a service, a Web service, express, no gas ride that essentially check if my username 10 00:00:40,380 --> 00:00:41,370 ‫exists or not. 11 00:00:41,850 --> 00:00:48,210 ‫And if you think about it a little bit, this this this capability, this feature is very simple to 12 00:00:48,210 --> 00:00:48,540 ‫build. 13 00:00:48,630 --> 00:00:49,010 ‫Right. 14 00:00:49,350 --> 00:00:53,370 ‫Build a database with all the user names as you start writing user names. 15 00:00:53,490 --> 00:00:56,580 ‫If you want to build this interface, you make a good request. 16 00:00:56,730 --> 00:00:58,260 ‫Does Paul exist? 17 00:00:58,410 --> 00:01:06,510 ‫You make a request to the server, express Jango, anything, and then you execute a query against your 18 00:01:06,510 --> 00:01:11,160 ‫database, selects user name from the stable. 19 00:01:11,370 --> 00:01:13,050 ‫Hopefully you have an index there. 20 00:01:13,320 --> 00:01:17,150 ‫And if the of the record comes back, that means that username exists. 21 00:01:17,310 --> 00:01:20,340 ‫If not, then it doesn't exist. 22 00:01:20,640 --> 00:01:21,200 ‫Right. 23 00:01:22,400 --> 00:01:24,150 ‫Probably this is very slow. 24 00:01:24,900 --> 00:01:25,410 ‫Right. 25 00:01:25,830 --> 00:01:29,690 ‫And this feature is going to be very popular. 26 00:01:29,700 --> 00:01:33,300 ‫Ride all these users going to this web page and timing. 27 00:01:33,300 --> 00:01:37,370 ‫Hey, does test one, two or three exist, does whatever. 28 00:01:37,530 --> 00:01:39,970 ‫Everybody wants a fancy nickname. 29 00:01:39,990 --> 00:01:40,380 ‫Right. 30 00:01:40,840 --> 00:01:42,390 ‫So here's the problem. 31 00:01:42,390 --> 00:01:42,680 ‫Right. 32 00:01:43,320 --> 00:01:44,470 ‫This is very slow. 33 00:01:44,880 --> 00:01:45,690 ‫So what do we do? 34 00:01:45,750 --> 00:01:48,660 ‫Well, I heard about this Redus thing, right? 35 00:01:48,660 --> 00:01:51,120 ‫That is actually in memory database. 36 00:01:51,120 --> 00:01:54,560 ‫So let's take it from disk and put it in right as well. 37 00:01:55,590 --> 00:01:56,460 ‫That's fine. 38 00:01:56,550 --> 00:02:00,840 ‫Or you're going to do the same thing, execute the same giedd request. 39 00:02:00,990 --> 00:02:02,670 ‫But this time I'm going to have the database. 40 00:02:02,670 --> 00:02:03,180 ‫Right. 41 00:02:03,180 --> 00:02:09,510 ‫And if it's not there, OK, I might sometimes need to go to the actual database because these two can 42 00:02:09,510 --> 00:02:10,230 ‫get out of sync. 43 00:02:10,230 --> 00:02:16,950 ‫So you created some inefficiency and you actually doubled your memory footprint because you're storing 44 00:02:16,950 --> 00:02:20,700 ‫data here and storing data here just to solve this simple problem. 45 00:02:20,820 --> 00:02:29,070 ‫OK, so you we know how to solve this thing, but some smart people, computer science professors came 46 00:02:29,070 --> 00:02:35,070 ‫up with a solution, very efficient solution, and they called it Bloem Filters. 47 00:02:35,190 --> 00:02:37,920 ‫So let's explain what these things are. 48 00:02:38,650 --> 00:02:39,870 ‫OK, so. 49 00:02:41,210 --> 00:02:42,700 ‫With Bloom Fulcher. 50 00:02:44,070 --> 00:02:47,020 ‫We're going to use some in-memory representation. 51 00:02:47,040 --> 00:02:49,440 ‫Usually it's very tiny. 52 00:02:49,710 --> 00:02:52,480 ‫I'm using 64 bit in this case, OK? 53 00:02:52,650 --> 00:02:55,430 ‫And this 64 bit magically have some numbers, right? 54 00:02:55,620 --> 00:02:58,260 ‫And this gets the bit zeros not set to zero. 55 00:02:58,500 --> 00:03:00,590 ‫This is one this is zero zero zero zero. 56 00:03:00,600 --> 00:03:01,320 ‫And this is one. 57 00:03:01,320 --> 00:03:01,640 ‫Right. 58 00:03:01,920 --> 00:03:03,300 ‫So how does it come up? 59 00:03:03,300 --> 00:03:04,080 ‫Will come to that. 60 00:03:04,080 --> 00:03:05,840 ‫But here's the thing. 61 00:03:06,540 --> 00:03:10,580 ‫If we're going to make a request, hey, does you does Paula exist? 62 00:03:10,890 --> 00:03:14,520 ‫We will make a request to the server and the server will write a function. 63 00:03:14,790 --> 00:03:20,240 ‫We will hash the string Paul and Maudet 64. 64 00:03:20,400 --> 00:03:26,280 ‫And if you know mod, what will happen is this result will only come back with the result from zero 65 00:03:26,490 --> 00:03:27,590 ‫to 63. 66 00:03:27,600 --> 00:03:28,020 ‫Right. 67 00:03:28,710 --> 00:03:30,810 ‫And just like that out of the box. 68 00:03:32,110 --> 00:03:35,190 ‫You're going to get collision all the time, but that's fine. 69 00:03:35,700 --> 00:03:39,360 ‫So in this case, Paul is bet number three, right. 70 00:03:39,370 --> 00:03:47,520 ‫And we if we go ahead and check in my integer in my 64 bit integer, this is the filter that we build. 71 00:03:48,030 --> 00:03:49,680 ‫Does this bit exist? 72 00:03:49,680 --> 00:03:50,400 ‫Is it set? 73 00:03:50,760 --> 00:03:51,270 ‫No. 74 00:03:51,450 --> 00:04:00,210 ‫If it's not set, then you can absolutely 100 percent guarantee say that Paul does not exist in the 75 00:04:00,210 --> 00:04:02,550 ‫database because it's not set here. 76 00:04:02,580 --> 00:04:05,010 ‫And we're going to show how that that happened. 77 00:04:05,670 --> 00:04:09,660 ‫OK, so Paul doesn't exist, so I didn't have to even query the database. 78 00:04:10,020 --> 00:04:14,380 ‫Let's take another example right where I'm going to check if Jack exists. 79 00:04:14,440 --> 00:04:19,380 ‫I'm going to make a get request to the server and I'm going to mod that string. 80 00:04:19,380 --> 00:04:26,210 ‫Jack, I wish first of all, we're going to hash the string, Jack, get a bunch of big number, right. 81 00:04:26,370 --> 00:04:30,950 ‫And then Mod 64, I'm going to get a value from zero and 63. 82 00:04:30,960 --> 00:04:32,340 ‫It happened to be 63. 83 00:04:32,550 --> 00:04:34,410 ‫I think that bit a 60. 84 00:04:34,410 --> 00:04:35,280 ‫Oh, a said. 85 00:04:35,820 --> 00:04:37,860 ‫And if it's said here's the thing. 86 00:04:38,220 --> 00:04:41,520 ‫If it's if that bit said that means Jack. 87 00:04:42,430 --> 00:04:43,900 ‫Maybe they're. 88 00:04:44,930 --> 00:04:52,910 ‫And why maybe maybe because there might be another strike that matched hash and Mod 64 that resulted 89 00:04:52,910 --> 00:04:57,110 ‫in 63 and was not necessary Jack himself. 90 00:04:57,110 --> 00:05:00,340 ‫Right, but some other strength that matched it. 91 00:05:00,830 --> 00:05:03,260 ‫But that is actually enough for us. 92 00:05:03,270 --> 00:05:09,040 ‫If it's set, then, well, it said I'm going to take the hit and hit the database. 93 00:05:09,380 --> 00:05:16,700 ‫So I kind of saved myself some queries that there is this perfect. 94 00:05:16,850 --> 00:05:21,650 ‫No, but it's a very efficient thing to actually query. 95 00:05:21,920 --> 00:05:24,730 ‫Right, to prevent unnecessary querying. 96 00:05:25,250 --> 00:05:32,870 ‫By the way, Cassandra uses this in their implementation of that consistent caching all the time. 97 00:05:33,360 --> 00:05:33,710 ‫Right. 98 00:05:34,290 --> 00:05:36,710 ‫There is stables and all that stuff. 99 00:05:36,710 --> 00:05:38,240 ‫They use this internally. 100 00:05:38,240 --> 00:05:38,570 ‫Right. 101 00:05:38,750 --> 00:05:46,640 ‫Because anything any time you want to avoid an expensive query to check if something exists or not or 102 00:05:47,120 --> 00:05:50,800 ‫if you want to do this query, but you're not sure if you're going to get a result or not. 103 00:05:51,080 --> 00:05:52,830 ‫Blue filters are very useful. 104 00:05:52,890 --> 00:05:55,460 ‫Yes, there are some of disadvantages to this. 105 00:05:55,460 --> 00:05:58,780 ‫But let's look at actually how to create a Blum filter. 106 00:05:58,790 --> 00:06:04,010 ‫I have a brand spanking new bit set here, 64 bit. 107 00:06:04,010 --> 00:06:04,370 ‫Right. 108 00:06:04,640 --> 00:06:08,600 ‫And I'm going to create user Jack for the first time. 109 00:06:08,600 --> 00:06:11,780 ‫It's a it's a blank database, right? 110 00:06:11,810 --> 00:06:12,470 ‫There is nothing. 111 00:06:12,470 --> 00:06:13,700 ‫And I'm going to create Jack. 112 00:06:14,450 --> 00:06:20,780 ‫So if I'm going to create Jack, I am going obviously to make a post request the server to the express 113 00:06:20,780 --> 00:06:23,950 ‫server and I'm going to hash Jack more. 114 00:06:23,960 --> 00:06:24,710 ‫Sixty four. 115 00:06:24,950 --> 00:06:27,320 ‫I am going to get 63 in this case. 116 00:06:27,530 --> 00:06:36,080 ‫And what do you do before writing to the database, the user name Jack, as you sit this bit niceties, 117 00:06:36,080 --> 00:06:36,650 ‫isn't it? 118 00:06:36,830 --> 00:06:38,930 ‫And then you obviously right at the levels. 119 00:06:38,990 --> 00:06:44,630 ‫So this is how we start building this in-memory representation of blah filter. 120 00:06:44,640 --> 00:06:45,020 ‫Right. 121 00:06:45,590 --> 00:06:48,590 ‫And then let's try Paul. 122 00:06:49,220 --> 00:06:50,570 ‫Hey, I'm going to create a user. 123 00:06:50,570 --> 00:06:54,980 ‫Paul Proof Post Paul Ryan Mod 64. 124 00:06:54,980 --> 00:06:55,460 ‫What do we get? 125 00:06:55,460 --> 00:06:56,210 ‫All three. 126 00:06:56,210 --> 00:06:57,520 ‫Let's start with number three. 127 00:06:57,800 --> 00:06:58,150 ‫All right. 128 00:06:58,160 --> 00:06:59,690 ‫So far, so far, so good. 129 00:07:00,110 --> 00:07:00,630 ‫Let's try. 130 00:07:00,660 --> 00:07:02,550 ‫And obviously we related to the database. 131 00:07:03,020 --> 00:07:04,580 ‫Let's try some other user. 132 00:07:04,790 --> 00:07:05,270 ‫Tim. 133 00:07:05,480 --> 00:07:08,560 ‫Well, I'm going to take Tim and hash it more. 134 00:07:08,580 --> 00:07:10,460 ‫64, guess what? 135 00:07:10,460 --> 00:07:12,230 ‫I got number 63 again. 136 00:07:12,410 --> 00:07:15,200 ‫And that's absolutely perfect. 137 00:07:15,470 --> 00:07:16,070 ‫That's OK. 138 00:07:16,070 --> 00:07:22,580 ‫Because you're going to get you only have 63, 64 bits, obviously all the strings and names in the 139 00:07:22,580 --> 00:07:22,970 ‫war. 140 00:07:23,180 --> 00:07:25,580 ‫You will fill between these things. 141 00:07:25,580 --> 00:07:25,910 ‫Right. 142 00:07:26,600 --> 00:07:30,140 ‫And obviously, when you say 63, it's already said so. 143 00:07:30,140 --> 00:07:34,940 ‫You don't have to even bother yourself setting it because it's already said, but you always have to 144 00:07:34,940 --> 00:07:36,620 ‫had the database and write it. 145 00:07:36,720 --> 00:07:37,070 ‫Right. 146 00:07:37,280 --> 00:07:38,760 ‫So that's how it's actually made. 147 00:07:38,780 --> 00:07:40,040 ‫Let's take another user, Ali. 148 00:07:40,160 --> 00:07:40,550 ‫All right. 149 00:07:40,790 --> 00:07:44,830 ‫So Ali, hash Ali and get six more. 150 00:07:44,840 --> 00:07:45,560 ‫Sixty four. 151 00:07:45,590 --> 00:07:50,120 ‫You're going to have big before in this case and you're going to set that bit. 152 00:07:50,270 --> 00:07:50,660 ‫All right. 153 00:07:50,840 --> 00:07:51,980 ‫And then obviously right there. 154 00:07:51,990 --> 00:07:53,020 ‫That is all right, guys. 155 00:07:53,060 --> 00:07:55,760 ‫So that's essentially Bloomfield's are in a nutshell. 156 00:07:55,910 --> 00:08:00,500 ‫I know the actual implementation of Bloom filters are a little bit fancier. 157 00:08:00,500 --> 00:08:03,890 ‫They use like three locations and all that stuff. 158 00:08:03,890 --> 00:08:04,190 ‫Right. 159 00:08:04,340 --> 00:08:07,400 ‫Sometimes they have they have more bits. 160 00:08:07,400 --> 00:08:07,730 ‫Right. 161 00:08:08,000 --> 00:08:13,850 ‫They use three hash functions just to make the odds harder to get right. 162 00:08:14,540 --> 00:08:17,960 ‫But and that's that's just to me, that's just an implementation. 163 00:08:17,960 --> 00:08:22,970 ‫But if you if you understand how it works, that's how it works and that's why it exists. 164 00:08:22,970 --> 00:08:23,260 ‫Right. 165 00:08:23,540 --> 00:08:25,400 ‫So some limitation of Blum filter. 166 00:08:25,400 --> 00:08:31,730 ‫You can get into a case where all of these puppies become one one one one one, one one. 167 00:08:31,940 --> 00:08:36,590 ‫And in this case, you will your blue filter is essentially useless. 168 00:08:36,590 --> 00:08:40,670 ‫You you became the first case where you're always going to head the database. 169 00:08:41,000 --> 00:08:42,470 ‫It's not really harmful. 170 00:08:42,680 --> 00:08:49,190 ‫It's not going to there is not going to slow you down, but it's not going to give you any benefits 171 00:08:49,190 --> 00:08:49,970 ‫per say. 172 00:08:49,970 --> 00:08:50,330 ‫Right. 173 00:08:50,570 --> 00:08:52,460 ‫So are you going to have to think about this? 174 00:08:52,610 --> 00:09:00,440 ‫Like the bigger you make this thing right, then you kind of interfere with your memory footprint. 175 00:09:00,710 --> 00:09:03,410 ‫But I mean, it depends how big it is, right? 176 00:09:03,410 --> 00:09:03,830 ‫Really. 177 00:09:04,100 --> 00:09:08,960 ‫But the shorter it is, then you're going to get all these false positive cases. 178 00:09:08,960 --> 00:09:11,710 ‫Where you going to always hit the database regardless. 179 00:09:11,820 --> 00:09:12,830 ‫I'm all right, guys. 180 00:09:13,040 --> 00:09:14,150 ‫That's it for me today. 181 00:09:14,300 --> 00:09:15,410 ‫I hope you enjoyed this video. 182 00:09:15,410 --> 00:09:16,790 ‫I'm going to see in the next one. 183 00:09:17,300 --> 00:09:17,680 ‫You guys. 184 00:09:17,690 --> 00:09:17,940 ‫They are.