Hey, Cloud Gurus. Welcome back to our further adventures in data, continuing on with pruning data.

In this lesson, we're going to be taking a look at what data skipping is, how Z-ordering plays into that, really building up to dynamic file pruning, and then making sure we know how that all ties together in a review.

First step, data skipping. There is metadata collection that is happening automatically. There's no need to enable this; it's part of the Databricks Runtime and is applied whenever applicable. So data skipping information is automatically collected when data is written into a Delta table. This gives us target ranges, where Delta Lake on Databricks uses the minimum and maximum values stored for each file to speed up queries. The min and max values are part of that aforementioned metadata, and the engine capitalizes on those to target specific data.

Now, this all depends on the data layout. For the highest level of effectiveness, something like Z-ordering should be used, which raises the question: what is Z-ordering?

Well, this is a technique for co-locating related information in the same set of files. It is automatically used by the data-skipping algorithms of Delta Lake on Databricks to substantially reduce the amount of data to be read. And, of course, these are the same algorithms that we were just speaking about.

All of that to say, dynamic file pruning, or DFP, is what we're using to prune out unnecessary data. This is all about going fast. It can dramatically improve query performance because it allows files to be skipped within partitions. Most of our performance tactics have been centered around limiting the result set as much as possible, and so we've already looked at ways to partition data and only pull back the partitions we need. This takes that a step further, allowing us to skip information within the partition itself, such as only reading the records from this week.
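To make that concrete, here's a minimal sketch of Z-ordering a Delta table, assuming a Databricks notebook where `spark` is already defined. The table name (`events`) and column names (`customer_id`, `event_date`) are hypothetical, just for illustration; the idea is that OPTIMIZE with ZORDER BY rewrites the files so related values land together, narrowing each file's min/max statistics so data skipping can eliminate files.

```python
from pyspark.sql import functions as F

# Write sample data as a Delta table partitioned by date. Delta collects
# per-file min/max statistics (the data-skipping metadata) automatically.
(spark.range(1_000_000)
    .withColumn("customer_id", F.col("id") % 10_000)
    .withColumn("event_date", F.expr("date_add('2023-01-01', cast(id % 90 as int))"))
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("events"))

# Co-locate related customer_id values in the same files, so each file
# covers a narrow min/max range of customer_id.
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# This filter can now skip whole files inside each date partition,
# because most files' customer_id ranges won't contain 42.
spark.table("events").filter("customer_id = 42").count()
```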
The performance impact is correlated with the clustering that's used, and so this is where Z-ordering comes in. DFP relies on pre-sorted data, such as Z-order clustering, and so when you're using Z-ordering, it is especially speedy. It really stands out for the non-partitioned case: it's especially efficient for non-partitioned tables, or for joins on non-partitioned columns.

Some of the basic configurations you'll encounter are spark.databricks.optimizer.dynamicFilePruning, and this is the top-level setting to turn DFP on or off. By default, it's set to true.

There's also spark.databricks.optimizer.deltaTableSizeThreshold, defaulting to 10 billion bytes, or 10 GB. And this represents the minimum size, in bytes, of the Delta table on the probe side of the join required to trigger DFP. If the probe side is not very large, it's probably not worthwhile to push down the filters, and at that point, you can just scan the whole table.

Lastly, there's spark.databricks.optimizer.deltaTableFilesThreshold, defaulting to 10 for Databricks Runtime 8.4 and above. This represents the number of files of the Delta table on the probe side of the join required to trigger DFP. And when the probe-side table contains fewer files than the threshold value, DFP is not triggered. If a table only has a few files, it's probably not worthwhile to enable dynamic file pruning. There's a short code sketch pulling these three settings together at the end of this lesson.

By way of review: Delta Lake on Databricks utilizes metadata to power data skipping. Z-ordering co-locates related information in the same set of files, pre-sorting your data. And then dynamic file pruning utilizes these technologies to skip files within a partition based on filters. This is a powerful technology for speeding up your queries, and I hope this lesson helps you understand how it accomplishes that.

Thanks for joining me. And when you're ready, I'll see you in the next video.
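As promised, a quick reference before you go: a minimal sketch of the three settings covered in this lesson, again assuming a Databricks notebook where `spark` is already defined. The values shown are just the defaults described above, spelled out explicitly so you can see where you'd tune them.

```python
# Top-level switch for dynamic file pruning (defaults to true).
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning", "true")

# Minimum size, in bytes, of the probe-side Delta table required to
# trigger DFP (defaults to 10 billion bytes, i.e., 10 GB).
spark.conf.set("spark.databricks.optimizer.deltaTableSizeThreshold", "10000000000")

# Minimum number of files in the probe-side Delta table required to
# trigger DFP (defaults to 10 on Databricks Runtime 8.4 and above).
spark.conf.set("spark.databricks.optimizer.deltaTableFilesThreshold", "10")
```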