1 00:00:00,520 --> 00:00:02,460 All right, Gurus, welcome back. 2 00:00:02,460 --> 00:00:05,310 In this lesson, we are going to be talking about 3 00:00:05,310 --> 00:00:07,150 Azure Databricks. 4 00:00:07,150 --> 00:00:08,510 And specifically, 5 00:00:08,510 --> 00:00:12,630 I'm going to be teaching you what Azure Databricks is. 6 00:00:12,630 --> 00:00:14,830 So like the other lessons, 7 00:00:14,830 --> 00:00:16,140 we're going to walk through an introduction 8 00:00:16,140 --> 00:00:17,710 to Azure Databricks. 9 00:00:17,710 --> 00:00:18,640 Then we're going to talk about 10 00:00:18,640 --> 00:00:21,880 the when, where, and why it's used, 11 00:00:21,880 --> 00:00:22,800 how it works, 12 00:00:22,800 --> 00:00:25,430 and I'm going to do that by showing you a little bit 13 00:00:25,430 --> 00:00:27,310 of a live look in the demo, 14 00:00:27,310 --> 00:00:30,180 kind of like we did on Synapse Analytics. 15 00:00:30,180 --> 00:00:32,493 So with that, let's dive in and get going. 16 00:00:33,620 --> 00:00:36,350 Introducing Azure Databricks. 17 00:00:36,350 --> 00:00:38,330 What is Azure Databricks? 18 00:00:38,330 --> 00:00:41,070 Well, it has 3 main functions. 19 00:00:41,070 --> 00:00:44,680 First, 20 00:00:44,680 --> 00:00:47,780 it serves as Databricks SQL. 21 00:00:47,780 --> 00:00:50,230 And what that means is Databricks SQL 22 00:00:50,230 --> 00:00:53,370 is just a simple secure access to data 23 00:00:53,370 --> 00:00:56,770 that allows you to create and reuse SQL queries. 24 00:00:56,770 --> 00:00:59,540 Next, it's used in data engineering. 25 00:00:59,540 --> 00:01:02,510 And then third, it's used in machine learning. 26 00:01:02,510 --> 00:01:03,870 Now for our purposes, 27 00:01:03,870 --> 00:01:07,050 we really only care about data engineering. 28 00:01:07,050 --> 00:01:09,220 So the other things we're not even going to talk about, 29 00:01:09,220 --> 00:01:11,873 because they don't relate to the DP-203. 
30 00:01:13,640 --> 00:01:15,280 Now specifically for this course, 31 00:01:15,280 --> 00:01:17,550 when we're talking about data engineering, 32 00:01:17,550 --> 00:01:20,380 we're going to be talking about transformations. 33 00:01:20,380 --> 00:01:21,840 So as you can see 34 00:01:21,840 --> 00:01:25,663 from the sign over here on the right, we have an error. 35 00:01:26,520 --> 00:01:30,913 Now, someone has gone along and messed up the sign. 36 00:01:31,850 --> 00:01:35,810 This is a common fix for something like Azure Databricks. 37 00:01:35,810 --> 00:01:37,230 It's going to process and curate, 38 00:01:37,230 --> 00:01:39,280 if you remember from our Databricks lesson, 39 00:01:39,280 --> 00:01:42,240 and it does that by taking the data and standardizing it. 40 00:01:42,240 --> 00:01:44,400 So remember we talked about the country codes, 41 00:01:44,400 --> 00:01:48,200 where you could have USA or US or United States, 42 00:01:48,200 --> 00:01:51,430 or even just, you know, 02 for a country code, right? 43 00:01:51,430 --> 00:01:54,250 Well, Databricks is going to be able to go in 44 00:01:54,250 --> 00:01:57,410 and make all of that data look the same 45 00:01:57,410 --> 00:02:00,290 so that when we do processing later on, 46 00:02:00,290 --> 00:02:02,400 we're all looking at the same source of data. 47 00:02:02,400 --> 00:02:04,870 And we can be reasonably certain that we've cured it 48 00:02:04,870 --> 00:02:08,913 of at least all of the obvious errors in the data. 49 00:02:10,770 --> 00:02:13,640 So looking at where it actually lives, 50 00:02:13,640 --> 00:02:17,710 I've pulled up a Microsoft diagram here, 51 00:02:17,710 --> 00:02:19,570 and you can see, as we're looking at this, 52 00:02:19,570 --> 00:02:21,310 and hopefully you're starting to sense a pattern. 53 00:02:21,310 --> 00:02:25,150 But we have our ingest, far left, we have our store. 
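To make the country-code standardization described above a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration: the alias table, the canonical code "US", and the function name `standardize_country` are assumptions made for this example, and in Databricks the same idea would more typically be applied to a whole column of a table rather than to a short list.

```python
# Hypothetical alias table. The raw values come from the lesson's example;
# choosing "US" as the canonical code is an assumption for illustration.
COUNTRY_ALIASES = {
    "usa": "US",
    "us": "US",
    "united states": "US",
    "02": "US",
}


def standardize_country(raw: str) -> str:
    """Return the canonical code for a raw value, or the trimmed input if unknown."""
    cleaned = raw.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


# Every spelling of the same country collapses to one value.
records = ["USA", " us ", "United States", "02", "Canada"]
standardized = [standardize_country(value) for value in records]
print(standardized)
```

Unknown values such as "Canada" are passed through unchanged here, so nothing is silently dropped; a real pipeline might instead flag them for review.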
54 00:02:25,150 --> 00:02:27,060 So we ingest data into the system, 55 00:02:27,060 --> 00:02:28,630 then we store the data, 56 00:02:28,630 --> 00:02:31,420 and then we pull it out of our storage, our data lake, 57 00:02:31,420 --> 00:02:35,570 or our Blob, and we use Databricks to prep and train. 58 00:02:35,570 --> 00:02:38,010 And so that's going to be the transformation. 59 00:02:38,010 --> 00:02:39,650 Now, when we talk about training data, 60 00:02:39,650 --> 00:02:41,220 that's more machine learning. 61 00:02:41,220 --> 00:02:44,320 What we're talking about is prepping data for the 62 00:02:44,320 --> 00:02:48,550 model-and-serve phase in Azure Synapse Analytics, or Cosmos 63 00:02:48,550 --> 00:02:50,100 DB, or something else. 64 00:02:50,100 --> 00:02:51,760 So Databricks is going to serve 65 00:02:51,760 --> 00:02:53,480 as that transformational layer 66 00:02:53,480 --> 00:02:56,020 to curate and process your data into the form 67 00:02:56,020 --> 00:02:57,660 that it needs to be in. 68 00:02:57,660 --> 00:03:01,560 Now, at the very top, you'll also notice Data Factory. 69 00:03:01,560 --> 00:03:03,470 This is something you're going to see continuously 70 00:03:03,470 --> 00:03:05,150 on all of the diagrams, 71 00:03:05,150 --> 00:03:07,260 because Data Factory, if you remember, 72 00:03:07,260 --> 00:03:09,410 serves as the movement in the pipelines, 73 00:03:09,410 --> 00:03:12,100 to take that data, and move it through those stages, 74 00:03:12,100 --> 00:03:13,943 including Databricks. 75 00:03:15,490 --> 00:03:19,650 So now let's talk about some core concepts of Databricks. 76 00:03:19,650 --> 00:03:21,800 The first is clusters. 77 00:03:21,800 --> 00:03:24,470 Clusters are just a group of compute resources 78 00:03:24,470 --> 00:03:26,510 that you're going to use to process 79 00:03:26,510 --> 00:03:28,573 your Databricks transformations. 80 00:03:29,580 --> 00:03:31,950 Think of it as the engine, if you will. 
81 00:03:31,950 --> 00:03:34,840 Next up, we have our workspace. 82 00:03:34,840 --> 00:03:37,860 So workspaces kind of serve as the filing cabinet 83 00:03:37,860 --> 00:03:40,120 to store all of our files, 84 00:03:40,120 --> 00:03:42,020 which we'll talk about here in just a second. 85 00:03:42,020 --> 00:03:45,320 But the workspace is kind of like a filing cabinet. 86 00:03:45,320 --> 00:03:46,760 Now in that workspace 87 00:03:46,760 --> 00:03:49,200 you're going to have individual folders. 88 00:03:49,200 --> 00:03:51,270 Those folders are the notebooks. 89 00:03:51,270 --> 00:03:55,760 The notebooks are going to store all of your cells, 90 00:03:55,760 --> 00:03:58,720 which are your individual pieces of code. 91 00:03:58,720 --> 00:04:02,270 So we have our individual pieces of code, written in cells, 92 00:04:02,270 --> 00:04:04,000 stored in notebooks. 93 00:04:04,000 --> 00:04:06,550 The notebooks are stored in our filing cabinet, 94 00:04:06,550 --> 00:04:08,083 or our workspace. 95 00:04:08,970 --> 00:04:12,040 And then we have our libraries. 96 00:04:12,040 --> 00:04:15,890 Libraries are a package or a module 97 00:04:15,890 --> 00:04:20,890 that conveys or gives additional functionality to Databricks 98 00:04:21,520 --> 00:04:23,620 so that we can do something specific 99 00:04:23,620 --> 00:04:26,640 that wouldn't normally be there in Databricks 100 00:04:26,640 --> 00:04:27,860 as a standard unit. 101 00:04:27,860 --> 00:04:32,270 So we can create and import our libraries as needed as well. 102 00:04:32,270 --> 00:04:34,440 Lastly, don't forget, there are tables. 103 00:04:34,440 --> 00:04:37,040 Tables are just the storage for all 104 00:04:37,040 --> 00:04:39,850 of the structured data that's created in Databricks. 105 00:04:39,850 --> 00:04:41,680 So you'll see that as well. 106 00:04:41,680 --> 00:04:46,680 These are the 6 main concepts that go with Databricks. 
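If it helps to see the filing-cabinet picture as a data structure, the nesting of workspace, notebooks, and cells can be sketched as below. This is a mental model only, not the Databricks API: the class names `Workspace`, `Notebook`, and `Cell` and the method `all_cells` are hypothetical, and clusters, libraries, and tables are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """One individual piece of code."""
    source: str


@dataclass
class Notebook:
    """A folder-like container that stores cells."""
    name: str
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Workspace:
    """The filing cabinet that stores notebooks."""
    notebooks: List[Notebook] = field(default_factory=list)

    def all_cells(self) -> List[Cell]:
        # Walk every notebook and collect its cells in order.
        return [cell for notebook in self.notebooks for cell in notebook.cells]


# Hypothetical example content, not taken from a real workspace.
quickstart = Notebook("Quickstart", [Cell("print('hello')"), Cell("x = 1 + 1")])
workspace = Workspace([quickstart])
print(len(workspace.all_cells()))
```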
107 00:04:49,590 --> 00:04:52,670 All right, so now let's hop over into the portal 108 00:04:52,670 --> 00:04:56,200 and take a look at Databricks in action. 109 00:04:56,200 --> 00:05:00,460 So just like with Synapse, I have opened up the Azure portal 110 00:05:00,460 --> 00:05:02,830 and I have clicked on Launch Workspace 111 00:05:02,830 --> 00:05:05,940 to give us our Azure Databricks workspace. 112 00:05:05,940 --> 00:05:09,850 Now you'll note that over here on the left-hand side, 113 00:05:09,850 --> 00:05:12,610 again, you're going to see that menu bar that's 114 00:05:12,610 --> 00:05:15,460 going to walk you through the main pieces that you need. 115 00:05:15,460 --> 00:05:17,410 You can also find it here in the middle, 116 00:05:17,410 --> 00:05:19,070 including additional documentation 117 00:05:19,070 --> 00:05:20,870 if you really want to get down into the weeds 118 00:05:20,870 --> 00:05:22,203 on a particular topic. 119 00:05:23,070 --> 00:05:26,830 So for us, you can see first, at our very far left, 120 00:05:26,830 --> 00:05:28,880 this is our workspace. 121 00:05:28,880 --> 00:05:30,810 So remember that is our filing cabinet, 122 00:05:30,810 --> 00:05:32,210 and it's going to break it down, 123 00:05:32,210 --> 00:05:35,380 so that we can see our individual users. 124 00:05:35,380 --> 00:05:36,390 And we're going to go ahead 125 00:05:36,390 --> 00:05:39,173 and click on the Quickstart Notebook. 126 00:05:40,410 --> 00:05:43,320 Now the notebook, remember that's the folder, right? 127 00:05:43,320 --> 00:05:47,130 That's going to store our individual cells. 128 00:05:47,130 --> 00:05:49,410 And you can see that each one of these 129 00:05:49,410 --> 00:05:51,460 is an individual Cell. 130 00:05:51,460 --> 00:05:55,720 And I've just opened up a Databricks Quickstart notebook. 131 00:05:55,720 --> 00:05:59,220 So we can see some data right at the start to begin with. 
132 00:05:59,220 --> 00:06:00,840 And I want to scroll down and just show you 133 00:06:00,840 --> 00:06:02,150 a few different things. 134 00:06:02,150 --> 00:06:03,900 So the first is we can go ahead 135 00:06:03,900 --> 00:06:08,900 and start creating snippets of code in any of these cells. 136 00:06:09,890 --> 00:06:11,580 In addition, it's going to give us data, 137 00:06:11,580 --> 00:06:13,170 as we run those cells. 138 00:06:13,170 --> 00:06:14,673 And in this one you can see at the bottom, 139 00:06:14,673 --> 00:06:18,490 it has already gone ahead and run all of the cells for us. 140 00:06:18,490 --> 00:06:20,650 So we can see a table format of the data. 141 00:06:20,650 --> 00:06:23,120 And the other nice thing is with Databricks, 142 00:06:23,120 --> 00:06:24,480 I can also click here 143 00:06:24,480 --> 00:06:27,700 and I can see graphs of the data as well, 144 00:06:27,700 --> 00:06:30,120 if I want to see the data in different formats, 145 00:06:30,120 --> 00:06:31,963 which is very, very helpful. 146 00:06:33,070 --> 00:06:35,000 So as a data engineer, 147 00:06:35,000 --> 00:06:37,120 I can go through, write some snippets of code, 148 00:06:37,120 --> 00:06:39,240 and start to see how it's working 149 00:06:39,240 --> 00:06:40,520 right at the very beginning, 150 00:06:40,520 --> 00:06:43,830 rather than having to complete the entire code base 151 00:06:43,830 --> 00:06:45,863 before I can run and get real data. 152 00:06:47,170 --> 00:06:49,040 The other piece that I want to show you is here, 153 00:06:49,040 --> 00:06:52,610 see this little % sign, and then python. 154 00:06:52,610 --> 00:06:55,320 Databricks allows you to use different code languages, 155 00:06:55,320 --> 00:06:57,510 and we can actually switch from cell to cell. 156 00:06:57,510 --> 00:07:00,760 So I can just do %python and start writing in Python. 157 00:07:00,760 --> 00:07:03,610 Or I could do %scala and start writing in Scala. 
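One way to picture what the % line at the top of a cell does is a small function that reads the first line of a cell and picks the language from it. The sketch below is purely illustrative and is not how Databricks implements it: the function name `cell_language`, the `SUPPORTED_MAGICS` table, and the default language are all assumptions, and only the two commands mentioned in the lesson are included.

```python
# Hypothetical lookup table limited to the two commands shown in the lesson.
SUPPORTED_MAGICS = {"%python": "python", "%scala": "scala"}


def cell_language(source: str, default: str = "python") -> str:
    """Pick a cell's language from a leading % command, else use the default."""
    lines = source.lstrip().splitlines()
    first_line = lines[0].strip() if lines else ""
    # Only the first word of the first line is treated as the command.
    first_token = first_line.split()[0] if first_line else ""
    return SUPPORTED_MAGICS.get(first_token, default)


print(cell_language("%scala\nval total = 1 + 1"))
print(cell_language("total = 1 + 1"))
```

The point of the default argument is that a cell without a % line simply stays in the notebook's own language, which matches the idea of switching only in the cells where you need to.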
158 00:07:03,610 --> 00:07:06,430 That's also really handy, if you have, for some reason, 159 00:07:06,430 --> 00:07:08,380 a need to write in different languages, 160 00:07:08,380 --> 00:07:10,120 you can actually change that on the fly 161 00:07:10,120 --> 00:07:12,133 as you go through individual cells. 162 00:07:14,080 --> 00:07:15,270 So there you go, 163 00:07:15,270 --> 00:07:20,060 there is a super high-level look at Databricks 164 00:07:20,060 --> 00:07:21,393 in the Azure portal. 165 00:07:22,330 --> 00:07:26,090 Finally, let's jump in and talk about our review. 166 00:07:26,090 --> 00:07:28,430 First, we have our introduction. 167 00:07:28,430 --> 00:07:32,260 So, you need to know what Databricks is 168 00:07:33,250 --> 00:07:37,070 and where you would use it, which is in transformations. 169 00:07:37,070 --> 00:07:39,670 And specifically, large-scale transformations. 170 00:07:39,670 --> 00:07:42,470 That's really where you're going to want to use Databricks. 171 00:07:44,500 --> 00:07:47,210 Core concepts, remember the cluster, workspace, notebook, 172 00:07:47,210 --> 00:07:49,440 cell, library, and table. 173 00:07:49,440 --> 00:07:50,990 That's going to be important just for you 174 00:07:50,990 --> 00:07:53,990 to kind of get in your head, how all of this works together. 175 00:07:55,700 --> 00:07:58,200 And then finally, we looked at this in the portal. 176 00:07:58,200 --> 00:08:01,510 So again, we're going to dive further into these concepts 177 00:08:01,510 --> 00:08:02,610 for what you need to know, 178 00:08:02,610 --> 00:08:04,330 but right now, what I really want you to do 179 00:08:04,330 --> 00:08:05,970 is to be able to get in your head 180 00:08:05,970 --> 00:08:10,110 where these services live and what they do. 181 00:08:10,110 --> 00:08:13,053 If you've done that, then you should be good to go. 
182 00:08:13,960 --> 00:08:17,440 All right, we are almost done with this section. 183 00:08:17,440 --> 00:08:21,930 In the next lesson, we are going to jump in and have a review. 184 00:08:21,930 --> 00:08:22,880 I'll see you there.