1 00:00:00,760 --> 00:00:04,400 So, let's just pause on this screen for a minute 2 00:00:04,400 --> 00:00:06,280 because the lake looks amazing 3 00:00:06,280 --> 00:00:08,420 and I would love to be sitting in that rowboat 4 00:00:08,420 --> 00:00:10,470 just kind of rowing around the mountains. 5 00:00:11,509 --> 00:00:12,700 (Brian sighing) 6 00:00:12,700 --> 00:00:14,220 Feeling refreshed? 7 00:00:14,220 --> 00:00:15,100 Me too. 8 00:00:15,100 --> 00:00:18,490 Let's talk about Introduction to Data Lakes. 9 00:00:18,490 --> 00:00:19,730 In this lesson, 10 00:00:19,730 --> 00:00:22,260 we are going to be talking about the terminology. 11 00:00:22,260 --> 00:00:24,150 What is a Data Lake? We're 12 00:00:24,150 --> 00:00:27,040 going to talk about structured versus unstructured data 13 00:00:27,040 --> 00:00:28,690 and what that looks like. 14 00:00:28,690 --> 00:00:31,450 We're going to talk about why Data Lakes are used 15 00:00:31,450 --> 00:00:32,550 and we're going to take a look 16 00:00:32,550 --> 00:00:34,393 at some Data Lake architecture. 17 00:00:35,890 --> 00:00:36,723 To start off with, 18 00:00:36,723 --> 00:00:40,390 let's talk about structured versus unstructured. 19 00:00:40,390 --> 00:00:44,660 So really that's a discussion of SQL versus NoSQL. 20 00:00:44,660 --> 00:00:47,660 Now, when we talk about structured data, or SQL, 21 00:00:47,660 --> 00:00:49,680 we're talking about relational data. 22 00:00:49,680 --> 00:00:52,360 Now, relational versus non-relational, 23 00:00:52,360 --> 00:00:55,070 when you think relational, think Excel spreadsheet, right? 24 00:00:55,070 --> 00:00:57,400 You have rows and columns. 25 00:00:57,400 --> 00:01:00,510 When you think non-SQL or non-relational, 26 00:01:00,510 --> 00:01:04,090 what you're thinking about there is a variety. 27 00:01:04,090 --> 00:01:06,610 Things that change in shape or size. 
28 00:01:06,610 --> 00:01:08,570 You might have a row, 29 00:01:08,570 --> 00:01:12,190 but inside of a cell within that row, you have JSON. 30 00:01:12,190 --> 00:01:16,480 So, non-relational is different shapes and sizes. 31 00:01:16,480 --> 00:01:21,240 Relational is an Excel spreadsheet. Fixed shape, fixed size, 32 00:01:21,240 --> 00:01:24,340 fixed schema versus dynamic. Same discussion, right? 33 00:01:24,340 --> 00:01:27,380 Fixed schema is the same rows and columns. 34 00:01:27,380 --> 00:01:30,580 Dynamic means those rows and columns might change depending 35 00:01:30,580 --> 00:01:32,280 upon the data source or even change 36 00:01:32,280 --> 00:01:34,223 within the same data source over time. 37 00:01:35,160 --> 00:01:36,290 When we look at SQL, 38 00:01:36,290 --> 00:01:40,700 SQL is well-designed for complex queries because it's fixed. 39 00:01:40,700 --> 00:01:44,100 It's easier to do complex queries on fixed data. 40 00:01:44,100 --> 00:01:45,730 When you look at NoSQL, 41 00:01:45,730 --> 00:01:47,980 that's not great for complex queries 42 00:01:47,980 --> 00:01:50,830 because it takes a lot longer to search for the data, 43 00:01:50,830 --> 00:01:53,403 because again, there's no fixed schema. 44 00:01:54,650 --> 00:01:57,930 So with that, in a SQL database, 45 00:01:57,930 --> 00:01:59,700 you're going to see more vertical scaling 46 00:01:59,700 --> 00:02:02,750 as you look at how we grow. 47 00:02:02,750 --> 00:02:07,570 So vertical scaling means we're going to add more RAM 48 00:02:07,570 --> 00:02:09,110 and CPU power, right? 49 00:02:09,110 --> 00:02:12,040 That's how we're going to solve the problems that we have. 50 00:02:12,040 --> 00:02:15,000 In NoSQL, we're going to have more horizontal scaling.
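As a rough illustration of the fixed versus dynamic schema idea described above, here is a minimal Python sketch. The column names, the sample records, and the helper `fits_fixed_schema` are all made up for illustration and do not come from the lesson.

```python
# Hypothetical example: a fixed (relational-style) schema versus
# records whose shape varies (NoSQL-style documents).

FIXED_COLUMNS = ("id", "name", "price")  # assumed column names


def fits_fixed_schema(record: dict) -> bool:
    """Return True only if the record has exactly the fixed columns."""
    return set(record) == set(FIXED_COLUMNS)


# Relational-style rows: every row has the same columns.
rows = [
    {"id": 1, "name": "kayak", "price": 450.0},
    {"id": 2, "name": "paddle", "price": 35.0},
]

# Document-style records: the shape differs from record to record,
# and one field holds nested JSON-like data.
documents = [
    {"id": 1, "name": "kayak", "specs": {"length_m": 3.2, "seats": 1}},
    {"id": 2, "tags": ["paddle", "carbon"]},
]

rows_ok = [fits_fixed_schema(r) for r in rows]
docs_ok = [fits_fixed_schema(d) for d in documents]
print(rows_ok)
print(docs_ok)
```

Every row passes the fixed-schema check, while the documents do not, which is the "same rows and columns" versus "different shapes and sizes" contrast in miniature.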
51 00:02:15,000 --> 00:02:17,680 So we're going to add multiple devices together 52 00:02:17,680 --> 00:02:20,520 and you'll see that as we go through the DP-203 53 00:02:20,520 --> 00:02:21,640 and look more in depth, 54 00:02:21,640 --> 00:02:24,093 especially at Azure Synapse, for example. 55 00:02:27,280 --> 00:02:32,260 So with that, let's reintroduce Blob storage. 56 00:02:32,260 --> 00:02:33,530 Wait, what? 57 00:02:33,530 --> 00:02:35,653 What is a Blob, exactly? 58 00:02:36,800 --> 00:02:39,150 A Blob is just a general-purpose object store 59 00:02:39,150 --> 00:02:43,180 and it's really for virtually any storage scenario. 60 00:02:43,180 --> 00:02:45,860 So when we look at, specifically, Blobs, 61 00:02:45,860 --> 00:02:47,910 we're going to be thinking about serving images, 62 00:02:47,910 --> 00:02:50,350 serving video or audio, 63 00:02:50,350 --> 00:02:54,450 storing lots of log files, or disaster recovery. 64 00:02:54,450 --> 00:02:57,840 It's also fantastic as a backup source. 65 00:02:57,840 --> 00:03:01,100 So when you look at backup for regulatory, 66 00:03:01,100 --> 00:03:04,360 you'll see Blob as a very common scenario 67 00:03:04,360 --> 00:03:06,083 or common storage choice. 68 00:03:07,470 --> 00:03:10,010 Now, if you remember from the last lesson, 69 00:03:10,010 --> 00:03:12,790 we also talked about Data Lakes being a part 70 00:03:12,790 --> 00:03:14,550 of Blob storage. 71 00:03:14,550 --> 00:03:18,353 So let's talk about Data Lakes with Blob storage. 72 00:03:19,780 --> 00:03:21,960 The big thing that separates Data Lakes 73 00:03:21,960 --> 00:03:24,610 when we look at Blobs versus Data Lake 74 00:03:24,610 --> 00:03:26,800 in Blob storage is the use 75 00:03:26,800 --> 00:03:29,930 or enabling of hierarchical namespace. 
76 00:03:29,930 --> 00:03:34,090 So what that big word really means is just taking our files 77 00:03:34,090 --> 00:03:37,010 and we're going to organize those into directories 78 00:03:37,010 --> 00:03:38,730 or subdirectories, 79 00:03:38,730 --> 00:03:40,850 much like you would see in Finder for Macs 80 00:03:40,850 --> 00:03:43,500 or File Explorer for PC, right? 81 00:03:43,500 --> 00:03:45,710 That directory/subdirectory, 82 00:03:45,710 --> 00:03:48,760 that is an example of hierarchical namespace. 83 00:03:48,760 --> 00:03:50,780 And so when you're standing up Blob storage, 84 00:03:50,780 --> 00:03:53,430 if you turn on hierarchical namespace, 85 00:03:53,430 --> 00:03:56,370 you can create a Data Lake. 86 00:03:56,370 --> 00:03:58,660 One more important piece to think about 87 00:03:58,660 --> 00:04:02,870 when we create a hierarchical namespace on Blob storage, 88 00:04:02,870 --> 00:04:05,000 once you create that hierarchical namespace, 89 00:04:05,000 --> 00:04:09,530 which is done upon creation of the Blob storage, 90 00:04:09,530 --> 00:04:12,400 once you enable that, you cannot change it on the fly. 91 00:04:12,400 --> 00:04:14,840 So if I've enabled hierarchical namespace, 92 00:04:14,840 --> 00:04:17,603 I can't go back and disable it later down the road. 93 00:04:19,340 --> 00:04:22,053 So why do we want hierarchical namespace? 94 00:04:23,360 --> 00:04:25,320 The main 2 reasons are: 1, 95 00:04:25,320 --> 00:04:27,330 it decreases our processing time 96 00:04:27,330 --> 00:04:29,760 because the files are more structured, 97 00:04:29,760 --> 00:04:33,610 and 2, file systems are more familiar and easier to use. 98 00:04:33,610 --> 00:04:34,960 This is really important 99 00:04:34,960 --> 00:04:37,400 if you want to have non-IT people 100 00:04:38,370 --> 00:04:41,000 in your Blob storage.
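One way to picture the processing-time point is to compare a flat namespace, where a "folder" is only a prefix in each blob name, with a hierarchical one, where a directory is a real node. The sketch below is a hypothetical in-memory model in Python, not the Azure SDK; the function names and the `raw`/`landing` folder names are assumptions for illustration.

```python
# Hypothetical, in-memory model; this is not the Azure SDK.


def rename_prefix_flat(blobs: dict, old: str, new: str) -> int:
    """Flat namespace: a 'folder' is only a name prefix, so renaming it
    means rewriting every blob name that starts with that prefix."""
    touched = 0
    for name in list(blobs):
        if name.startswith(old + "/"):
            blobs[new + name[len(old):]] = blobs.pop(name)
            touched += 1
    return touched


def rename_dir_tree(root: dict, old: str, new: str) -> int:
    """Hierarchical namespace: a directory is a real node, so renaming it
    is a single operation on the parent, however many files it holds."""
    root[new] = root.pop(old)
    return 1


flat = {f"raw/file{i}.csv": b"" for i in range(1000)}
tree = {"raw": {f"file{i}.csv": b"" for i in range(1000)}}

flat_ops = rename_prefix_flat(flat, "raw", "landing")
tree_ops = rename_dir_tree(tree, "raw", "landing")
print(flat_ops, tree_ops)
```

In this toy model the flat rename touches every file, while the directory rename is one step regardless of how many files sit underneath.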
101 00:04:41,000 --> 00:04:43,530 Because, with those files and directories, 102 00:04:43,530 --> 00:04:45,960 it looks a lot more like a general desktop 103 00:04:45,960 --> 00:04:47,410 to be able to search through folders 104 00:04:47,410 --> 00:04:49,840 and find individual files. 105 00:04:49,840 --> 00:04:51,800 So those are the 2 main reasons 106 00:04:51,800 --> 00:04:53,923 that we would want to enable a Data Lake. 107 00:04:57,040 --> 00:05:00,270 So let's talk about Data Lake architecture now. 108 00:05:00,270 --> 00:05:03,670 So, to start off with, we have our data source. 109 00:05:03,670 --> 00:05:06,240 This could be a laptop or a desktop 110 00:05:06,240 --> 00:05:07,680 or it could be another database. 111 00:05:07,680 --> 00:05:08,550 It really doesn't matter. 112 00:05:08,550 --> 00:05:10,580 We just have some sort of data source 113 00:05:10,580 --> 00:05:12,460 that is pulling into the cloud. 114 00:05:12,460 --> 00:05:16,110 So we're going to ingest that with Data Factory 115 00:05:16,110 --> 00:05:18,340 and then we have our Data Lake. Now, 116 00:05:18,340 --> 00:05:21,780 when the data first comes in, we're going to create zones. 117 00:05:21,780 --> 00:05:23,930 This is a very common practice. 118 00:05:23,930 --> 00:05:26,000 The first zone is typically raw 119 00:05:26,000 --> 00:05:28,260 which just means it's just the raw data. 120 00:05:28,260 --> 00:05:30,610 So we might have multiple sources coming in 121 00:05:30,610 --> 00:05:32,230 and all of those are getting dumped 122 00:05:32,230 --> 00:05:34,310 into this raw data source. 123 00:05:34,310 --> 00:05:35,750 They're not processed. 124 00:05:35,750 --> 00:05:37,170 They're not curated. 125 00:05:37,170 --> 00:05:38,510 There might be errors in it. 126 00:05:38,510 --> 00:05:39,510 We have no idea. 127 00:05:39,510 --> 00:05:42,483 It's just a direct dump into our Data Lake. 
128 00:05:43,380 --> 00:05:46,980 From there, we're going to use Data Factory 129 00:05:46,980 --> 00:05:51,100 to move data from raw into processed, 130 00:05:51,100 --> 00:05:54,430 and it moves data into a processed zone typically 131 00:05:54,430 --> 00:05:56,850 by moving it through a transformation service 132 00:05:56,850 --> 00:05:58,890 like Databricks. 133 00:05:58,890 --> 00:06:00,440 And so we're going to take that data 134 00:06:00,440 --> 00:06:02,500 and we're going to clean it up a little bit. 135 00:06:02,500 --> 00:06:05,760 And maybe we remove the null columns that are in there, 136 00:06:05,760 --> 00:06:09,850 or maybe we make sure that all of the columns match 137 00:06:09,850 --> 00:06:11,240 or we map things. 138 00:06:11,240 --> 00:06:14,610 We would do that, combine those sources together, 139 00:06:14,610 --> 00:06:17,470 and move that into a processed zone 140 00:06:17,470 --> 00:06:21,113 which is generally a cleaner area of a Data Lake. 141 00:06:22,400 --> 00:06:25,710 Once that's done, we would repeat the same process 142 00:06:25,710 --> 00:06:29,560 and we would move data into a curated zone. 143 00:06:29,560 --> 00:06:33,830 Now a curated zone is usually going to be a different look. 144 00:06:33,830 --> 00:06:38,060 So finance might want data slightly different than HR. 145 00:06:38,060 --> 00:06:40,670 And so the curated section is generally going 146 00:06:40,670 --> 00:06:43,170 to be more specifically cleaned up 147 00:06:43,170 --> 00:06:45,610 with a focus on a different look 148 00:06:45,610 --> 00:06:49,490 for a silo or business group like HR, 149 00:06:49,490 --> 00:06:53,180 or finance, or marketing, or something like that. 150 00:06:53,180 --> 00:06:54,870 Now keep in mind when we talk about this, 151 00:06:54,870 --> 00:06:56,770 this is a very basic example.
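The raw, processed, and curated zones described above can be sketched in a few lines of Python. This is only an in-memory stand-in under stated assumptions: in the lesson the movement is done with Data Factory and a transformation service such as Databricks, whereas here plain functions play those roles. The source names, column names, and `COLUMN_MAP` are hypothetical.

```python
# Hypothetical in-memory sketch of raw -> processed -> curated zones.
# Plain functions stand in for the ingestion and transformation services.

# Raw zone: direct dumps from two assumed sources, uncleaned.
raw_zone = {
    "source_a": [{"emp_id": 1, "dept": "HR", "salary": 50000, "notes": None}],
    "source_b": [{"EmployeeID": 2, "Department": "Finance", "Pay": 65000, "notes": None}],
}

# Assumed mapping so both sources end up with matching column names.
COLUMN_MAP = {"EmployeeID": "emp_id", "Department": "dept", "Pay": "salary"}


def to_processed(raw: dict) -> list:
    """Map column names, drop null values, and combine the sources."""
    combined = []
    for records in raw.values():
        for record in records:
            cleaned = {
                COLUMN_MAP.get(key, key): value
                for key, value in record.items()
                if value is not None
            }
            combined.append(cleaned)
    return combined


def to_curated(processed: list) -> dict:
    """Build one view per business group from the processed data."""
    return {
        "finance": [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in processed],
        "hr": [{"emp_id": r["emp_id"], "dept": r["dept"]} for r in processed],
    }


processed_zone = to_processed(raw_zone)
curated_zone = to_curated(processed_zone)
print(processed_zone)
print(curated_zone["finance"])
```

The raw zone keeps whatever arrived, the processed zone holds the cleaned and combined records, and the curated zone exposes a different look for each business group.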
152 00:06:56,770 --> 00:06:59,220 Your business might do things slightly differently 153 00:06:59,220 --> 00:07:01,900 but this is pretty commonly what happens 154 00:07:01,900 --> 00:07:02,760 and what you want to think 155 00:07:02,760 --> 00:07:05,960 about most importantly is those zones. 156 00:07:05,960 --> 00:07:07,700 Because if you don't have these zones 157 00:07:07,700 --> 00:07:10,090 and everything is just raw, 158 00:07:10,090 --> 00:07:12,450 you wind up with what's known as a data dump. 159 00:07:12,450 --> 00:07:15,220 Basically, you have massive amounts of data 160 00:07:15,220 --> 00:07:18,610 and it turns into so much data that it becomes very hard 161 00:07:18,610 --> 00:07:21,410 to pull insights out of it to create business reports 162 00:07:21,410 --> 00:07:23,850 or other things that the business needs 163 00:07:23,850 --> 00:07:25,590 to make decisions on. 164 00:07:25,590 --> 00:07:28,080 So keep that in mind as we talk about Data Lakes 165 00:07:28,080 --> 00:07:30,670 and what you should have as you move forward. 166 00:07:31,960 --> 00:07:33,590 Finally, after all that's done, 167 00:07:33,590 --> 00:07:36,840 like I said, you're going to pull those data insights out 168 00:07:36,840 --> 00:07:39,060 of the curated folder and you're going to create reports 169 00:07:39,060 --> 00:07:42,310 and those reports are used to make business decisions. 170 00:07:42,310 --> 00:07:43,510 Should we change the price? 171 00:07:43,510 --> 00:07:45,090 Should we buy this new thing? 172 00:07:45,090 --> 00:07:47,203 Should we restructure, or whatever? 173 00:07:48,160 --> 00:07:51,210 So now let's take a look at this lesson in review. 174 00:07:51,210 --> 00:07:54,060 To start off with, we talked about terminology. 175 00:07:54,060 --> 00:07:55,030 We talked about the differences 176 00:07:55,030 --> 00:07:56,840 between Blobs and Data Lakes 177 00:07:56,840 --> 00:07:59,800 with the introduction of hierarchical namespace.
178 00:07:59,800 --> 00:08:01,670 That is very important. 179 00:08:01,670 --> 00:08:04,220 We talked about structured and unstructured data. 180 00:08:04,220 --> 00:08:06,610 So, SQL versus NoSQL. 181 00:08:06,610 --> 00:08:08,390 Now this one is really important. 182 00:08:08,390 --> 00:08:10,440 Make sure that you understand the differences 183 00:08:10,440 --> 00:08:12,400 between structured and unstructured 184 00:08:12,400 --> 00:08:14,860 and why you would choose one over the other, 185 00:08:14,860 --> 00:08:16,700 because this is going to drive 186 00:08:16,700 --> 00:08:19,300 not only a very important data engineering concept, 187 00:08:19,300 --> 00:08:23,050 but it'll also drive what solutions you choose in Azure. 188 00:08:23,050 --> 00:08:25,403 So it's a critical component to understand. 189 00:08:26,360 --> 00:08:28,920 We talked about why Data Lakes are used. 190 00:08:28,920 --> 00:08:31,180 Again, it's that hierarchical namespace. 191 00:08:31,180 --> 00:08:32,393 So think about that. 192 00:08:33,370 --> 00:08:35,750 And we talked about Data Lake architecture. 193 00:08:35,750 --> 00:08:37,720 So proper planning and zone creation 194 00:08:37,720 --> 00:08:39,973 to help avoid those data dumps. 195 00:08:40,970 --> 00:08:42,800 So make sure you have a good handle on that 196 00:08:42,800 --> 00:08:44,733 and I will see you in the next lesson.