In this lesson, we are going to talk about how to measure the performance of data movement. To do that, we're going to start at the beginning by talking about what that actually means. Then we're going to jump in and talk about estimating throughput, and follow that up by taking a look at the Monitor view of the copy activity in Azure Data Factory, so you can see how all of this comes together.

But let's start at the beginning. Measuring throughput is something you will most likely encounter in Data Factory or Synapse Analytics, where large-scale data transfers happen through the copy activity in pipelines. What we're trying to do is optimize how long it takes data to move from one place to another using pipelines in Data Factory or Synapse Analytics.

A few things make this possible. First, serverless architecture: whether it's Data Factory or Synapse, both use a serverless architecture. Second, parallelism: we can have multiple streams working at the same time. And finally, full utilization.
These pipelines can fully utilize your network bandwidth and your input/output operations, so we don't need to treat those as separate bottlenecks when we're looking at the copy activity.

To sum all of that up, it means we can use a bandwidth chart to figure out roughly how long it's going to take us to move data. For instance, if I had a bandwidth of 50 megabits per second and a data size of 1 gigabyte, I would expect it to take about 3 minutes to move that data if everything is optimal. If we jump all the way over, let's say I have 1 gigabit per second of bandwidth and I'm trying to move a terabyte; that's going to take me about 2.3 hours, and so on and so forth. We can rely on this chart because we have the full utilization and the parallelism of Data Factory and Azure Synapse pipelines.

So once we have that down, we need to go in and see what's actually happening.
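The chart's estimate is really just data size divided by effective bandwidth. As a minimal sketch of that arithmetic (plain Python, not tied to any Azure API), it might look like this:

```python
def estimated_transfer_seconds(data_bytes: float, bandwidth_mbps: float) -> float:
    """Estimate copy time, assuming the pipeline fully utilizes the link.

    bandwidth_mbps is in megabits per second, so convert bytes to bits first.
    """
    bits = data_bytes * 8
    return bits / (bandwidth_mbps * 1_000_000)

# 1 GB over a 50 Mbit/s link: 160 seconds, roughly the 3 minutes from the chart
print(estimated_transfer_seconds(1_000_000_000, 50))

# 1 TB over a 1 Gbit/s link: 8,000 seconds (~2.2 hours), close to the chart's 2.3
print(estimated_transfer_seconds(1_000_000_000_000, 1000))
```

Real transfers land above these figures once queuing, serialization, and source/sink limits come into play, which is exactly the gap the Monitor view lets you observe.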
To do that, let's jump into the portal and take a look at the copy data activity in Data Factory to see some actual performance.

Here we are in Data Factory. I built a single, one-step pipeline and triggered it, so we can take a look and start to see how long it takes to move that data. Let's go over to Monitor and look at our pipeline run. You can see that the pipeline run succeeded, and it took a whopping 12 seconds. But let's dive in a little further to see if we can figure out what actually happened.

You can see here we have our copy data activity, and it copied the data with a duration of 9 seconds. If I click on the handy little glasses icon, it pulls up the run details. You can see that it moved a whopping 4 kilobytes of data, it gives me my copy duration, and then it gives me actual details on the parallel copy: one parallel copy was used. It also breaks the duration down, showing how much was spent in the queue and how much in the transfer.
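That details pane is backed by the copy activity's output payload, which reports figures like the bytes read, the copy duration, the parallel copies used, and the queue versus transfer split. As a rough sketch of turning such a payload into an observed throughput number (the field names and sample values here are illustrative assumptions, not a guaranteed schema), you could do:

```python
import json

# Sample payload shaped like a copy activity's monitoring output.
# Field names and values are assumptions for illustration only.
sample_output = json.loads("""
{
    "dataRead": 4096,
    "copyDuration": 9,
    "usedParallelCopies": 1,
    "detailedDurations": {"queuingDuration": 4, "transferDuration": 5}
}
""")

def observed_throughput_bps(output: dict) -> float:
    """Bytes per second over the transfer phase only, excluding queue time."""
    transfer_seconds = output["detailedDurations"]["transferDuration"]
    return output["dataRead"] / transfer_seconds

# 4096 bytes over 5 seconds of transfer time
print(observed_throughput_bps(sample_output))
```

Separating queue time from transfer time matters: queue time is overhead you want to shrink, while transfer time is what you compare against the bandwidth chart's estimate.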
So I can see how long each of those steps took. From that, I can go back and compare my throughput chart with this one, and I can see exactly how long the copy activity took versus how long it should have taken if everything was optimized. Then, if what's observed differs from what's expected, I can apply optimization techniques to hopefully make up some of that slack.

So the key points to remember from this lesson: one, we estimate how long the data should take to move; two, we compare that estimate in Data Factory or Synapse to how long it actually took; and three, we optimize as needed. Keep in mind that full utilization is what allows the throughput estimation, so I can use that chart to figure out what's different without worrying about things like bandwidth or IOPS limits. Finally, the Monitor view of the copy data activity is what we use to measure that performance.
Pretty simple. If you understand those three concepts, you should be good to go on the next lesson, and you should have a pretty good handle on how long moving data should actually take. With that, I'll see you in the next lesson.