In this lesson, we are going to talk about how to measure the performance of data movement. To do that, we're going to start at the beginning by talking about what that actually means. Then we're going to jump in and talk about estimating throughput, and follow that up by taking a look at the Monitor view of the copy activity in Azure Data Factory, so you can see how all of this comes together.

But let's start at the beginning. Measuring throughput is something you will most likely encounter in Data Factory or Synapse Analytics, where large-scale data transfers happen through the copy activity in pipelines. What we're trying to do is optimize how long it takes data to move from one place to another using pipelines in Data Factory or Synapse Analytics.

A few things make this possible. First, serverless architecture: whether it's Data Factory or Synapse, both use a serverless architecture. Second, parallelism: we can have multiple streams working at the same time. And finally, full utilization.
These pipelines can fully utilize your network bandwidth and your input/output operations, so we don't need to treat those as separate bottlenecks when we're looking at the copy activity.

To sum all of that up, it means we can use a bandwidth chart to figure out roughly how long it's going to take us to move data. For instance, if I had a bandwidth of 50 megabits per second and a data size of 1 gigabyte, I would expect it to take about 3 minutes to move that data if everything is optimal. If we jump all the way over, let's say I have 1 gigabit per second of bandwidth and I'm trying to move a terabyte; that's going to take me about 2.3 hours, and so on and so forth. We can rely on this chart because we have the full utilization and the parallelism of Data Factory and Azure Synapse pipelines.

So once we have that down, we need to go in and see what's actually happening.
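The chart's estimate is really just data size divided by effective bandwidth. As a minimal sketch of that arithmetic (plain Python, not tied to any Azure API), it might look like this:

```python
def estimated_transfer_seconds(data_bytes: float, bandwidth_mbps: float) -> float:
    """Estimate copy time, assuming the pipeline fully utilizes the link.

    bandwidth_mbps is in megabits per second, so convert bytes to bits first.
    """
    bits = data_bytes * 8
    return bits / (bandwidth_mbps * 1_000_000)

# 1 GB over a 50 Mbit/s link: 160 seconds, roughly the 3 minutes from the chart
print(estimated_transfer_seconds(1_000_000_000, 50))

# 1 TB over a 1 Gbit/s link: 8,000 seconds (~2.2 hours), close to the chart's 2.3
print(estimated_transfer_seconds(1_000_000_000_000, 1000))
```

Real transfers land above these figures once queuing, serialization, and source/sink limits come into play, which is exactly the gap the Monitor view lets you observe.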
To do that, let's jump into the portal and take a look at the copy data activity in Data Factory to see some actual performance.

Here we are in Data Factory. I built a single, one-step pipeline and triggered it, so we can take a look and start to see how long it takes to move that data. Let's go over to Monitor and look at our pipeline run. You can see that the pipeline run succeeded, and it took a whopping 12 seconds. But let's dive in a little further to see if we can figure out what actually happened.

You can see here we have our copy data activity, and it copied the data with a duration of 9 seconds. If I click on the handy little glasses icon, it pulls up the run details. You can see that it moved a whopping 4 kilobytes of data, it gives me my copy duration, and then it gives me actual details on the parallel copy: one parallel copy was used. It also breaks the duration down, showing how much was spent in the queue and how much in the transfer.
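That details pane is backed by the copy activity's output payload, which reports figures like the bytes read, the copy duration, the parallel copies used, and the queue versus transfer split. As a rough sketch of turning such a payload into an observed throughput number (the field names and sample values here are illustrative assumptions, not a guaranteed schema), you could do:

```python
import json

# Sample payload shaped like a copy activity's monitoring output.
# Field names and values are assumptions for illustration only.
sample_output = json.loads("""
{
    "dataRead": 4096,
    "copyDuration": 9,
    "usedParallelCopies": 1,
    "detailedDurations": {"queuingDuration": 4, "transferDuration": 5}
}
""")

def observed_throughput_bps(output: dict) -> float:
    """Bytes per second over the transfer phase only, excluding queue time."""
    transfer_seconds = output["detailedDurations"]["transferDuration"]
    return output["dataRead"] / transfer_seconds

# 4096 bytes over 5 seconds of transfer time
print(observed_throughput_bps(sample_output))
```

Separating queue time from transfer time matters: queue time is overhead you want to shrink, while transfer time is what you compare against the bandwidth chart's estimate.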
So I can see how long each of those steps took. From that, I can go back and compare my throughput chart with this one, and I can see exactly how long the copy activity took versus how long it should have taken if everything was optimized. Then, if what's observed differs from what's expected, I can apply optimization techniques to hopefully make up some of that slack.

So the key points to remember from this lesson: one, we estimate how long the data should take to move; two, we compare that estimate in Data Factory or Synapse to how long it actually took; and three, we optimize as needed. Keep in mind that full utilization is what allows the throughput estimation, so I can use that chart to figure out what's different without worrying about things like bandwidth or IOPS limits. Finally, the Monitor view of the copy data activity is what we use to measure that performance.
Pretty simple. If you understand those three concepts, you should be good to go on the next lesson, and you should have a pretty good handle on how long moving data should actually take. With that, I'll see you in the next lesson.