Hey, Cloud Gurus. Welcome back to our further adventures in data, continuing on with pruning data.

In this lesson, we're going to be taking a look at what data skipping is, how Z-ordering plays into that, really building up to dynamic file pruning, and then making sure we know how that all ties together in a review.

First step, data skipping. There is metadata collection that is happening automatically. There's no need to enable this; it's part of the Databricks Runtime and is applied whenever applicable. So data skipping information is automatically collected when data is written into a Delta table. This gives us target ranges, where Delta Lake on Databricks uses the minimum and maximum values stored for each file to speed up queries. The min and max values are part of that aforementioned metadata, and the engine capitalizes on those to target specific data.

Now, this all depends on the data layout. For the highest level of effectiveness, something like Z-ordering should be used, which raises the question: what is Z-ordering?

Well, this is a technique for co-locating related information in the same set of files. It is automatically used by the data-skipping algorithms of Delta Lake on Databricks to substantially reduce the amount of data to be read. And, of course, these are the same algorithms that we were just speaking about.

All of that to say, dynamic file pruning, or DFP, is what we're using to prune out unnecessary data. This is all about going fast. It can dramatically improve query performance because it allows files to be skipped within partitions. Most of our performance tactics have been centered around limiting the result set as much as possible, and so we've already looked at ways to partition data and only pull back the partitions we need. This takes that a step further, allowing us to skip information within the partition itself, such as only reading the records from this week.
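To make that concrete, here's a minimal sketch of Z-ordering a Delta table, assuming a Databricks notebook where `spark` is already defined. The table name (`events`) and column names (`customer_id`, `event_date`) are hypothetical, just for illustration; the idea is that OPTIMIZE with ZORDER BY rewrites the files so related values land together, narrowing each file's min/max statistics so data skipping can eliminate files.

```python
from pyspark.sql import functions as F

# Write sample data as a Delta table partitioned by date. Delta collects
# per-file min/max statistics (the data-skipping metadata) automatically.
(spark.range(1_000_000)
    .withColumn("customer_id", F.col("id") % 10_000)
    .withColumn("event_date", F.expr("date_add('2023-01-01', cast(id % 90 as int))"))
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("events"))

# Co-locate related customer_id values in the same files, so each file
# covers a narrow min/max range of customer_id.
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# This filter can now skip whole files inside each date partition,
# because most files' customer_id ranges won't contain 42.
spark.table("events").filter("customer_id = 42").count()
```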
The performance impact is correlated with the clustering that's used, and so this is where Z-ordering comes in. DFP relies on pre-sorted data, such as Z-order clustering, and so when you're using Z-ordering, it is especially speedy. It really stands out for the non-partitioned case: it's especially efficient for non-partitioned tables, or for joins on non-partitioned columns.

Some of the basic configurations you'll encounter are spark.databricks.optimizer.dynamicFilePruning, and this is the top-level setting to turn DFP on or off. By default, it's set to true.

There's also spark.databricks.optimizer.deltaTableSizeThreshold, defaulting to 10 billion bytes, or 10 GB. And this represents the minimum size, in bytes, of the Delta table on the probe side of the join required to trigger DFP. If the probe side is not very large, it's probably not worthwhile to push down the filters, and at that point, you can just scan the whole table.

Lastly, there's spark.databricks.optimizer.deltaTableFilesThreshold, defaulting to 10 for Databricks Runtime 8.4 and above. This represents the number of files of the Delta table on the probe side of the join required to trigger DFP. And when the probe-side table contains fewer files than the threshold value, DFP is not triggered. If a table only has a few files, it's probably not worthwhile to enable dynamic file pruning. There's a short code sketch pulling these three settings together at the end of this lesson.

By way of review: Delta Lake on Databricks utilizes metadata to power data skipping. Z-ordering co-locates related information in the same set of files, pre-sorting your data. And then dynamic file pruning utilizes these technologies to skip files within a partition based on filters. This is a powerful technology for speeding up your queries, and I hope this lesson helps you understand how it accomplishes that.

Thanks for joining me. And when you're ready, I'll see you in the next video.
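As promised, a quick reference before you go: a minimal sketch of the three settings covered in this lesson, again assuming a Databricks notebook where `spark` is already defined. The values shown are just the defaults described above, spelled out explicitly so you can see where you'd tune them.

```python
# Top-level switch for dynamic file pruning (defaults to true).
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning", "true")

# Minimum size, in bytes, of the probe-side Delta table required to
# trigger DFP (defaults to 10 billion bytes, i.e., 10 GB).
spark.conf.set("spark.databricks.optimizer.deltaTableSizeThreshold", "10000000000")

# Minimum number of files in the probe-side Delta table required to
# trigger DFP (defaults to 10 on Databricks Runtime 8.4 and above).
spark.conf.set("spark.databricks.optimizer.deltaTableFilesThreshold", "10")
```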