1 00:00:00,920 --> 00:00:02,010 Welcome back. 2 00:00:02,010 --> 00:00:02,930 In this lesson, 3 00:00:02,930 --> 00:00:04,990 we are going to be putting the puzzle pieces together 4 00:00:04,990 --> 00:00:08,553 and identifying Azure services for batch processing. 5 00:00:10,340 --> 00:00:12,410 So in this lesson, specifically, 6 00:00:12,410 --> 00:00:15,760 we are going to take a look at what batch processing is 7 00:00:16,910 --> 00:00:20,813 and how we typically use batch processing in Azure. 8 00:00:22,820 --> 00:00:23,870 Finally, we're going to talk 9 00:00:23,870 --> 00:00:26,550 about some common batch processing solutions 10 00:00:26,550 --> 00:00:29,713 that you need to know for the DP-203. 11 00:00:32,010 --> 00:00:35,660 So to start off, let's talk about the tale of 2 processes 12 00:00:35,660 --> 00:00:37,440 just a little bit more deeply. 13 00:00:37,440 --> 00:00:39,980 When we talk about batch services, 14 00:00:39,980 --> 00:00:41,920 really think, more pizza. 15 00:00:41,920 --> 00:00:45,280 So we are going to have a bunch of pizza orders 16 00:00:45,280 --> 00:00:46,690 that come into the restaurant. 17 00:00:46,690 --> 00:00:48,280 We take those pizza orders, 18 00:00:48,280 --> 00:00:51,920 and we process all of them in this oven, right? 19 00:00:51,920 --> 00:00:54,260 We bake the pizza and process it in the oven. 20 00:00:54,260 --> 00:00:57,170 We are not going to take every individual slice 21 00:00:57,170 --> 00:00:59,510 and run them out to the table one slice at a time. 22 00:00:59,510 --> 00:01:01,670 No, we're going to take the entire table, 23 00:01:01,670 --> 00:01:02,660 do all of the orders 24 00:01:02,660 --> 00:01:06,010 and then take all of those orders out to the table. 25 00:01:06,010 --> 00:01:07,510 Streaming is different. 26 00:01:07,510 --> 00:01:10,340 When we talk about streaming, we have flowing water. 27 00:01:10,340 --> 00:01:12,610 And it's just going to continually flow. 
28 00:01:12,610 --> 00:01:15,620 We don't stop and then start the water and then, 29 00:01:15,620 --> 00:01:17,220 it's just going to keep going. 30 00:01:17,220 --> 00:01:18,520 Same thing's true for the data. 31 00:01:18,520 --> 00:01:23,200 So batch processing, think about bounded and unbounded data. 32 00:01:23,200 --> 00:01:25,560 So bounded just means we have a start and a stop. 33 00:01:25,560 --> 00:01:28,070 We're going to take that bounded data source, 34 00:01:28,070 --> 00:01:29,370 we know how big it is, 35 00:01:29,370 --> 00:01:33,050 and we are going to do a batch process transformation 36 00:01:33,050 --> 00:01:35,380 on that bounded data source. 37 00:01:35,380 --> 00:01:37,760 Streaming then is unbounded data. 38 00:01:37,760 --> 00:01:39,540 We don't know how big it is, 39 00:01:39,540 --> 00:01:43,573 so we're just going to start processing with more unknowns. 40 00:01:44,860 --> 00:01:48,250 So batch data is not required immediately. 41 00:01:48,250 --> 00:01:49,350 This is going to be something, 42 00:01:49,350 --> 00:01:51,350 typically you see that like in a bank for instance, 43 00:01:51,350 --> 00:01:52,850 they'll do a nightly run. 44 00:01:52,850 --> 00:01:54,461 They'll take all of the orders from the day, 45 00:01:54,461 --> 00:01:57,958 they'll transform, put them all into databases, 46 00:01:57,958 --> 00:02:01,820 and that is going to be a batch process. 47 00:02:01,820 --> 00:02:04,830 Typically, you can handle much larger transformations 48 00:02:04,830 --> 00:02:08,040 with batch, and it's going to be lower cost. 49 00:02:08,040 --> 00:02:10,230 So larger data, lower cost, 50 00:02:10,230 --> 00:02:12,680 that's typically what we're going to see with batch. 51 00:02:13,780 --> 00:02:17,020 With streaming, we need the data right now. 52 00:02:17,020 --> 00:02:19,160 So things to think about for streaming 53 00:02:19,160 --> 00:02:21,460 would be like a recommendation engine. 
54 00:02:21,460 --> 00:02:23,830 So I'm on a website, I'm looking at something, 55 00:02:23,830 --> 00:02:25,730 and you want to tell me, 56 00:02:25,730 --> 00:02:27,520 while I'm still on the website, 57 00:02:27,520 --> 00:02:30,830 some other services that might be good for me to purchase. 58 00:02:30,830 --> 00:02:31,680 That would be something 59 00:02:31,680 --> 00:02:33,970 that you would need streaming data for. 60 00:02:33,970 --> 00:02:35,130 Typically you're going to have 61 00:02:35,130 --> 00:02:37,850 less complex transformations with streaming, 62 00:02:37,850 --> 00:02:42,070 because we can't stop and run complex transformations. 63 00:02:42,070 --> 00:02:45,170 Instead, we have to keep moving that data. 64 00:02:45,170 --> 00:02:47,718 And so we can't do as much transformation 65 00:02:47,718 --> 00:02:50,500 as we can do with batch. 66 00:02:50,500 --> 00:02:53,060 And because of that, we're going to have higher costs, 67 00:02:53,060 --> 00:02:54,950 because we're doing things on the fly, 68 00:02:54,950 --> 00:02:57,180 and so typically that's going to be higher cost 69 00:02:57,180 --> 00:02:58,930 than what you would see with batch. 70 00:03:01,270 --> 00:03:04,570 So let's take a look then at where batch is used. 71 00:03:04,570 --> 00:03:06,820 So I already talked about the banking example 72 00:03:06,820 --> 00:03:08,570 for batch processing. 73 00:03:08,570 --> 00:03:12,550 Another common use of batch would be at retail outlets. 74 00:03:12,550 --> 00:03:14,760 So you take all of those orders 75 00:03:14,760 --> 00:03:18,240 and you process all of the orders overnight. 76 00:03:18,240 --> 00:03:22,150 And that way, in the morning or on a weekly basis, 77 00:03:22,150 --> 00:03:24,270 managers of those different stores 78 00:03:24,270 --> 00:03:25,770 can take a look at all the sales 79 00:03:25,770 --> 00:03:27,370 and where those sales were coming in. 80 00:03:27,370 --> 00:03:30,720 It's not something that needs to be done on the fly. 
81 00:03:30,720 --> 00:03:34,960 Hospitals, another good example for batch processing. 82 00:03:34,960 --> 00:03:38,640 So taking a look at high volumes of patient information, 83 00:03:38,640 --> 00:03:40,470 transaction information, 84 00:03:40,470 --> 00:03:43,010 those are other things that you would see for batch. 85 00:03:43,010 --> 00:03:44,730 And then finally, for marketing. 86 00:03:44,730 --> 00:03:47,200 So with batch processing in marketing, 87 00:03:47,200 --> 00:03:49,940 we could be looking at a marketing campaign 88 00:03:49,940 --> 00:03:52,300 and taking a look at, for instance, 89 00:03:52,300 --> 00:03:53,430 hits on a website, 90 00:03:53,430 --> 00:03:55,330 processing all of that data together 91 00:03:55,330 --> 00:03:57,530 and seeing if we can come up with some insights 92 00:03:57,530 --> 00:04:00,676 to determine how we need to change our campaigns. 93 00:04:00,676 --> 00:04:01,509 It wouldn't be something 94 00:04:01,509 --> 00:04:03,540 I'd need to see every single minute, 95 00:04:03,540 --> 00:04:04,373 but it would be something 96 00:04:04,373 --> 00:04:05,720 that I would want to see every week 97 00:04:05,720 --> 00:04:06,913 or every 2 weeks. 98 00:04:09,190 --> 00:04:10,680 Some challenges that we want to think about 99 00:04:10,680 --> 00:04:12,460 when we look at data, 100 00:04:12,460 --> 00:04:14,160 first would be data format. 101 00:04:14,160 --> 00:04:17,220 So is all of the data in the correct format? 102 00:04:17,220 --> 00:04:18,414 And when I talk about format, 103 00:04:18,414 --> 00:04:22,100 we're kind of thinking about things like country code, 104 00:04:22,100 --> 00:04:24,210 for example, or name. 105 00:04:24,210 --> 00:04:26,480 Is it first name, last name as 2 different columns? 106 00:04:26,480 --> 00:04:29,200 Is it first name, last name as 1 column? 107 00:04:29,200 --> 00:04:30,810 What is the data in? 108 00:04:30,810 --> 00:04:32,140 What's the schema look like? 
109 00:04:32,140 --> 00:04:35,620 Does everything match up there for batch processing? 110 00:04:35,620 --> 00:04:37,410 Second, would be encoding. 111 00:04:37,410 --> 00:04:39,300 So in addition to the data 112 00:04:39,300 --> 00:04:40,890 maybe being in a different format, 113 00:04:40,890 --> 00:04:43,360 it's also possible that the data 114 00:04:43,360 --> 00:04:46,240 has been encoded in different ways, 115 00:04:46,240 --> 00:04:49,060 and so we have to convert the data into a form 116 00:04:49,060 --> 00:04:51,110 that could be used for data processing. 117 00:04:51,110 --> 00:04:53,100 For instance, we want everything in JSON. 118 00:04:53,100 --> 00:04:57,010 Third, dealing with windows and missed runs. 119 00:04:57,010 --> 00:04:59,330 So we've talked a little bit about windows, 120 00:04:59,330 --> 00:05:02,710 so what is the frame of the data that we're looking at? 121 00:05:02,710 --> 00:05:05,900 What happens if you have data 122 00:05:05,900 --> 00:05:08,360 that sits outside of those windows? 123 00:05:08,360 --> 00:05:09,770 What are you going to do with that? 124 00:05:09,770 --> 00:05:11,950 And then also, missed runs. 125 00:05:11,950 --> 00:05:14,890 So what happens if I try and run a nightly process 126 00:05:14,890 --> 00:05:16,550 and it doesn't happen? 127 00:05:16,550 --> 00:05:17,630 Are we going to run it again? 128 00:05:17,630 --> 00:05:19,163 Two times? Five times? 129 00:05:20,530 --> 00:05:22,370 Just challenges that you want to consider 130 00:05:22,370 --> 00:05:24,090 for batch processing. 131 00:05:24,090 --> 00:05:27,000 Now I'll also tell you that dealing with windows 132 00:05:27,000 --> 00:05:29,140 and missed runs in batch processing 133 00:05:29,140 --> 00:05:31,690 is infinitely easier than with streaming, 134 00:05:31,690 --> 00:05:33,790 but still something you want to think about. 135 00:05:35,570 --> 00:05:36,530 So let's talk a little bit 136 00:05:36,530 --> 00:05:40,120 about where batch processing services live.
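Before moving on, the data format and missed-run challenges just described can be made a little more concrete. What follows is only a minimal sketch in plain Python, not anything from the lesson or from an Azure service: the function names `normalize_name` and `run_with_retries`, the `name`/`first_name`/`last_name` column names, and the choice of three attempts are all assumptions made for illustration.

```python
from typing import Callable, Dict


def normalize_name(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with separate first_name/last_name columns.

    Hypothetical schema: a record either already has both columns, or has a
    single "name" column holding "First Last".
    """
    if "first_name" in record and "last_name" in record:
        return dict(record)
    # Split on the first space only; anything after it is treated as last name.
    first, _, last = record["name"].partition(" ")
    rest = {key: value for key, value in record.items() if key != "name"}
    return {**rest, "first_name": first, "last_name": last}


def run_with_retries(job: Callable[[], int], max_attempts: int = 3) -> int:
    """Run a batch job, retrying a failed (missed) run up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job()
        except RuntimeError as err:
            # Remember the failure and try the run again.
            last_error = err
    raise RuntimeError(f"batch run failed after {max_attempts} attempts") from last_error
```

The point of the sketch is that both decisions are made up front in batch: you pick one target schema and convert everything to it, and you decide how many times a missed nightly run is retried before someone is alerted.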
137 00:05:40,120 --> 00:05:42,300 So we start off with an input source, 138 00:05:42,300 --> 00:05:43,840 a blob or a data lake, 139 00:05:43,840 --> 00:05:46,460 some sort of place that's holding our data, 140 00:05:46,460 --> 00:05:49,310 and then we start to look at doing batch processing. 141 00:05:49,310 --> 00:05:52,330 So you could do batch processing through Databricks 142 00:05:52,330 --> 00:05:54,930 or through Synapse Analytics 143 00:05:54,930 --> 00:05:59,620 or through HDInsight or Azure Data Lake Analytics. 144 00:05:59,620 --> 00:06:01,430 Those are all the common services 145 00:06:01,430 --> 00:06:03,990 that you would see for batch processing. 146 00:06:03,990 --> 00:06:04,927 And then you take that data 147 00:06:04,927 --> 00:06:08,680 and you move it through one of those services. 148 00:06:08,680 --> 00:06:11,700 And then it's going to come out something like Power BI, 149 00:06:11,700 --> 00:06:13,690 where we are going to do 150 00:06:13,690 --> 00:06:16,520 some sort of reporting off of it, typically. 151 00:06:16,520 --> 00:06:18,560 And then finally down here at the bottom, 152 00:06:18,560 --> 00:06:21,540 you see we have Data Factory orchestration. 153 00:06:21,540 --> 00:06:22,500 Now in this section, 154 00:06:22,500 --> 00:06:25,800 we are actually going to talk a lot about Data Factory, 155 00:06:25,800 --> 00:06:29,210 because the critical element to batch processing, 156 00:06:29,210 --> 00:06:31,870 or one of the critical components to batch processing, 157 00:06:31,870 --> 00:06:33,660 is moving that data, 158 00:06:33,660 --> 00:06:36,380 building a pipeline that can perform 159 00:06:36,380 --> 00:06:39,480 the batch processing things that we need. 
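The flow just described (input source, processing service, reporting, with orchestration tying the stages together) can be sketched in a few lines. This is an assumption-laden illustration in plain Python, not Data Factory or Databricks code: `extract`, `transform`, `load`, and `run_pipeline` are hypothetical names, an in-memory list stands in for the blob or data lake, and a list of formatted strings stands in for the report.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Order = Dict[str, Any]


def extract(source: Iterable[Order]) -> List[Order]:
    """Stand-in for reading a bounded set of records from a blob or data lake."""
    return list(source)


def transform(orders: List[Order]) -> Dict[str, float]:
    """Stand-in for the processing service: total sales per store."""
    totals: Dict[str, float] = defaultdict(float)
    for order in orders:
        totals[str(order["store"])] += float(order["amount"])
    return dict(totals)


def load(totals: Dict[str, float]) -> List[str]:
    """Stand-in for handing results to a reporting tool."""
    return [f"{store}: {total:.2f}" for store, total in sorted(totals.items())]


def run_pipeline(source: Iterable[Order]) -> List[str]:
    """Orchestration step: run the stages in order over the whole batch."""
    return load(transform(extract(source)))
```

For example, calling `run_pipeline` on a day's worth of order dictionaries returns one summary line per store. In a real solution each stage would be a separate service and the orchestrator would schedule and monitor them, which is the role the lesson assigns to Data Factory.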
160 00:06:39,480 --> 00:06:41,280 So in this section, we'll talk a lot 161 00:06:41,280 --> 00:06:43,830 about how to take data from that data lake, 162 00:06:43,830 --> 00:06:46,320 move it into something like Databricks 163 00:06:46,320 --> 00:06:48,900 to do our transformations, and then pull it out of that 164 00:06:48,900 --> 00:06:52,340 and move it into something else like Power BI. 165 00:06:52,340 --> 00:06:53,530 Of these services, 166 00:06:53,530 --> 00:06:55,710 I will tell you that Databricks 167 00:06:55,710 --> 00:06:58,040 and Azure Synapse Analytics at the top 168 00:06:58,040 --> 00:07:01,023 are by far the most common for batch processing. 169 00:07:01,910 --> 00:07:05,330 I will also tell you that as we look at batch processing 170 00:07:05,330 --> 00:07:07,360 in the DP-203, 171 00:07:07,360 --> 00:07:10,650 you're much more likely to see questions 172 00:07:10,650 --> 00:07:15,020 on how to move data, Data Factory orchestration, 173 00:07:15,020 --> 00:07:16,800 those kinds of things, 174 00:07:16,800 --> 00:07:19,600 than you will specific types of batch processing. 175 00:07:19,600 --> 00:07:22,550 For instance, you're much less likely to see a question 176 00:07:22,550 --> 00:07:26,930 on how you perform a specific transformation in HDInsight 177 00:07:26,930 --> 00:07:30,530 than you would to see questions on Data Factory pipelines. 178 00:07:30,530 --> 00:07:32,230 So just keep that in mind. 179 00:07:32,230 --> 00:07:36,683 And that's based upon the exam requirements in the DP-203. 180 00:07:38,480 --> 00:07:40,750 All right, so to review this lesson, 181 00:07:40,750 --> 00:07:42,387 we talked about batch processing 182 00:07:42,387 --> 00:07:45,763 and the characteristics of batch processing versus stream. 183 00:07:47,150 --> 00:07:49,270 We talked about examples and challenges. 
184 00:07:49,270 --> 00:07:52,550 So be ready to defend batch versus stream, 185 00:07:52,550 --> 00:07:55,720 because a lot of things that you see in the corporate world, 186 00:07:55,720 --> 00:07:58,270 everybody wants information now, 187 00:07:58,270 --> 00:08:00,730 and you need to be aware 188 00:08:00,730 --> 00:08:02,700 of the differences between batch and stream 189 00:08:02,700 --> 00:08:04,420 to be able to explain 190 00:08:04,420 --> 00:08:06,610 why you would prefer batch over stream, 191 00:08:06,610 --> 00:08:10,050 be that cost, be that the ability to do transformations, 192 00:08:10,050 --> 00:08:12,060 simplicity, whatever it is, 193 00:08:12,060 --> 00:08:13,860 if there's really not a need for stream, 194 00:08:13,860 --> 00:08:16,180 or the other way around if there is. 195 00:08:16,180 --> 00:08:17,933 So just be aware of that. 196 00:08:18,930 --> 00:08:21,690 And then finally, the Azure technology choices. 197 00:08:21,690 --> 00:08:23,320 So if we're going to do batch processing, 198 00:08:23,320 --> 00:08:26,830 what are the services that we are most likely going to use 199 00:08:26,830 --> 00:08:28,490 in Azure. 200 00:08:28,490 --> 00:08:30,530 All right, that's it for this lesson. 201 00:08:30,530 --> 00:08:32,683 I will see you in the next.