Hey, what's up, Gurus? In this lesson, we are going to be talking about an introduction to Azure Data Factory. Specifically, I'm going to teach you what Data Factory is, we're going to talk about the core concepts that you need to know, and we're going to talk a little bit about Data Factory architecture, so you can start to see how all of these concepts play together.

So let's start with the Data Factory introduction. When we talk about Azure Data Factory, Microsoft says it's a cloud-based data integration service that allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and transformation. To summarize, we're going to create activities that move data and help to complete your cloud projects. So when you think Data Factory, think pipelines that help connect all of the different components of your cloud service together.

The other piece to think about is data pipeline orchestration. So not only are we connecting those pieces, but we're also helping to orchestrate the movement. We're controlling the flow of the data, not just creating the infrastructure that allows that to happen. We'll talk more about this in a few minutes, but those are the 2 big concepts for Azure Data Factory.

So when we talk about core concepts for Data Factory, the first is a pipeline. A pipeline is a logical grouping of activities, and these activities perform a task. For instance, an activity could be data movement: we could have an activity that takes an ingestion source, or a source from your computer, pulls it into the cloud, and puts it into a data lake. That would be an activity, and the task it's performing is data movement. So that is what an activity is, and a pipeline is a grouping of all of those activities together.

With activities, we have 3 different types. First is data movement, which I just talked about. Second is data transformation, so that would be connecting to Databricks, or doing something else that's going to transform data. And the third is general control activities that help us to control the flow of data in one way or another.

Datasets are the next concept. A dataset is a data structure within your data store. This is where the data that you need for the inputs or outputs of your projects and pipelines lives.

Finally, we have linked services. A linked service is essentially the connection string that tells Data Factory how to find and connect to your data.
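To make those last two concepts concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. The lesson itself doesn't show code, so treat this as an illustration: the subscription ID, resource group, factory name, connection string, and paths are all placeholder assumptions. It defines a linked service (the connection string) and a dataset that points at blob data through it:

```python
# A minimal sketch with the azure-mgmt-datafactory SDK.
# All names and the connection string below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    AzureStorageLinkedService,
    DatasetResource,
    LinkedServiceReference,
    LinkedServiceResource,
    SecureString,
)

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "my-rg"               # placeholder
FACTORY_NAME = "my-data-factory"       # placeholder

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Linked service: the connection string that tells Data Factory
# how to find and connect to your data store.
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."
        )
    )
)
adf.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "BlobStorageLinkedService", storage_ls
)

# Dataset: a named structure describing the data a pipeline will
# consume or produce, resolved through the linked service above.
blob_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference",
            reference_name="BlobStorageLinkedService",
        ),
        folder_path="input-container/raw",
        file_name="input.csv",
    )
)
adf.datasets.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "InputDataset", blob_ds)
```

Notice that the dataset doesn't hold any data itself; it just describes where the data lives, and the linked service supplies the connection details needed to reach it.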
So with those concepts in mind, let's see if we can put this into some architecture. I'm actually going to start on the far right-hand side with the pipeline, and we're going to work our way back. The pipeline, like I said, is that large grouping of activities, and within that pipeline we can do things like schedule the pipeline, or monitor or manage the pipeline that we have created.

Moving back a step, activities are grouped together to create the pipeline, and activities are going to do things that either control, move, or transform data. That could be something like Hive, connecting through to Databricks, running a stored procedure, or copying or moving data. Those are all activities.

Now, our activity is going to run on our linked service, down here at the bottom. It's going to use that connection string, which is the linked service, to pull the data, or connect to the service, so that it can do what it needs to do. When the activity is done, typically it's going to move data and create a new, updated dataset. So we're going to transform a dataset and then put the new dataset somewhere else, or we're going to move data from one place to another, or combine 3 datasets into one, and that activity is going to produce your dataset. It's also going to be consuming data, because you have your start and your end, right? If we're going to move data, we have to consume the old data, transform it, and then produce the new data. And that ties back in, because you have to be able to connect to that dataset, and you do that through that linked service. So really, we're kind of creating a chain here, and all of this ties up into our pipelines.

The last piece, besides just building these connections, is thinking about the orchestration. This is where Data Factory shines, because you can take the steps in your pipeline and automate them. You can create schedules around them. You can monitor each step in the pipeline to check for errors, set alerts if that happens, or set up other things like a webhook, for instance. So we can actually manage how the pipeline performs and functions, and what happens if something fails.
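Continuing the earlier sketch with the same placeholder client, resource group, and factory names, here is roughly what that chain might look like end to end: a copy (data movement) activity wired to the InputDataset from before and a hypothetical OutputDataset defined the same way, grouped into a pipeline, then run and monitored. Again, this is an illustration, not code from the lesson:

```python
# Continues the sketch above; OutputDataset is assumed to be defined
# like InputDataset. All names remain placeholders.
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

# Activity: one data-movement task (copy from a blob source to a blob sink).
copy_activity = CopyActivity(
    name="CopyInputToOutput",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Pipeline: the logical grouping of activities.
pipeline = PipelineResource(activities=[copy_activity])
adf.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", pipeline)

# Orchestration: trigger a run, then monitor its status.
run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", parameters={})
status = adf.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status
print(f"Pipeline run {run.run_id}: {status}")  # e.g. Queued, InProgress, Succeeded
```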
So to tie all of this together, let's talk about our review. Let's start off with that Data Factory introduction. Data Factory is really designed to create the connections to build pipelines that move and transform data within your cloud projects, and it's also designed to orchestrate the flow of that data, or those steps, or activities within your cloud project. For core concepts, we talked about pipelines, activities, datasets, and linked services. And then, finally, we talked about Data Factory architecture and how those core concepts fit into that architecture.

Now, I realize that we have reached the end, and that can be a very sad thing. However, the good news is, coming up in section 5, we are actually going to get hands-on practice with Data Factory, and I'll be showing you some demos as we dive deeper into those lessons. So we'll be able to see how all of these concepts play out. So don't worry about diving deeper. We will definitely do that, but I wanted to make sure that we had the groundwork laid, so that you understood what Data Factory is and why it's important. All right, I'll see you in the next lesson.