1
00:00:00,017 --> 00:00:01,460
Hey Cloud Gurus.

2
00:00:01,460 --> 00:00:04,710
Welcome back. We've talked a little bit about data zones.

3
00:00:04,710 --> 00:00:06,560
We've talked about folder structure.

4
00:00:06,560 --> 00:00:08,990
Now let's talk about the files actually going

5
00:00:08,990 --> 00:00:10,193
into that structure.

6
00:00:12,300 --> 00:00:14,440
In this lesson, we're going to start by taking a look

7
00:00:14,440 --> 00:00:17,600
at the file types available to us,

8
00:00:17,600 --> 00:00:19,330
and we're going to go over specific ones,

9
00:00:19,330 --> 00:00:22,033
such as Avro, Parquet, and ORC.

10
00:00:23,260 --> 00:00:25,910
After that, we'll do a little comparison of those three

11
00:00:25,910 --> 00:00:28,133
and wrap everything up with a review.

12
00:00:30,520 --> 00:00:32,300
We've talked a little bit about data lakes

13
00:00:32,300 --> 00:00:34,240
and how they're a great place to store

14
00:00:34,240 --> 00:00:36,340
all kinds of information.

15
00:00:36,340 --> 00:00:39,610
The good news is that all formats are welcome,

16
00:00:39,610 --> 00:00:43,970
be that CSV, JSON, whatever. You can throw it all

17
00:00:43,970 --> 00:00:45,710
in your data lake.

18
00:00:45,710 --> 00:00:49,478
The bad news is all formats are welcome.

19
00:00:49,478 --> 00:00:52,940
And I don't have all these pictures of fruit for no reason.

20
00:00:52,940 --> 00:00:55,440
If you think about a basket of fruit, if I tell you,

21
00:00:55,440 --> 00:00:57,340
hey, there's a bunch of fruit,

22
00:00:57,340 --> 00:00:58,850
well, it doesn't really tell you much,

23
00:00:58,850 --> 00:01:01,870
because an apple and a lemon are very different.

24
00:01:01,870 --> 00:01:04,150
And so you're going to want to be careful

25
00:01:04,150 --> 00:01:05,660
which one you're using.

26
00:01:05,660 --> 00:01:09,530
Just because you can use something doesn't mean you should.

27
00:01:09,530 --> 00:01:12,300
Even though CSV and JSON are common file formats

28
00:01:12,300 --> 00:01:13,780
and easy to work with,

29
00:01:13,780 --> 00:01:17,610
they're not necessarily ideal for analytics situations.

30
00:01:17,610 --> 00:01:19,910
They can lead to lots of small files,

31
00:01:19,910 --> 00:01:22,930
which creates performance and cost issues for us.

32
00:01:22,930 --> 00:01:25,840
And they're not designed to work with parallel systems.

33
00:01:25,840 --> 00:01:28,870
Luckily, even if our data comes in these formats,

34
00:01:28,870 --> 00:01:31,300
we can consolidate them into ones more useful

35
00:01:31,300 --> 00:01:32,483
to our purpose.

36
00:01:33,856 --> 00:01:37,457
The first of these that we'll discuss is Avro,

37
00:01:37,457 --> 00:01:40,820
and it uses row-based storage, where all the record data

38
00:01:40,820 --> 00:01:42,490
is stored together.

39
00:01:42,490 --> 00:01:44,110
If we take a look at this visually,

40
00:01:44,110 --> 00:01:46,830
let's say we have a table with names

41
00:01:46,830 --> 00:01:48,890
and course topics in it.

42
00:01:48,890 --> 00:01:52,690
You can see that each of the rows is stored sequentially.

43
00:01:52,690 --> 00:01:55,160
So we have all of our guru1 information,

44
00:01:55,160 --> 00:01:57,270
all of our guru2 information,

45
00:01:57,270 --> 00:01:59,773
and then all of our guru3 information.

46
00:02:03,240 --> 00:02:05,740
The schema for Avro is stored in JSON,

47
00:02:05,740 --> 00:02:07,903
which means it's easy for humans to read.
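To make that concrete, here's a minimal sketch of writing and reading those guru records, assuming the third-party fastavro library (pip install fastavro). The field names, sample topic values, and file path are illustrative, not from the lesson.

from fastavro import writer, reader

# The Avro schema is plain JSON, easy for humans to read.
# Field names and sample values here are illustrative.
schema = {
    "type": "record",
    "name": "Guru",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "topic", "type": "string"},
    ],
}

# Row-based storage: each record's fields travel together.
records = [
    {"id": 1, "name": "guru1", "topic": "Data Lakes"},
    {"id": 2, "name": "guru2", "topic": "File Formats"},
    {"id": 3, "name": "guru3", "topic": "Analytics"},
]

# The data itself is written in a compact binary encoding.
with open("gurus.avro", "wb") as out:
    writer(out, schema, records)

# Reading back yields whole rows, one record at a time.
with open("gurus.avro", "rb") as fp:
    for record in reader(fp):
        print(record)

Notice the split the lesson describes: the schema is JSON you can read at a glance, while the records on disk are binary for the computer's benefit.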
48
00:02:09,210 --> 00:02:11,180
And the data is stored in binary,

49
00:02:11,180 --> 00:02:13,240
which is easy for computers to read,

50
00:02:13,240 --> 00:02:14,903
meaning it's very efficient.

51
00:02:17,360 --> 00:02:19,840
Some of the benefits of Avro include

52
00:02:19,840 --> 00:02:22,270
easily handling schema changes.

53
00:02:22,270 --> 00:02:24,640
We all know the fast pace of IT,

54
00:02:24,640 --> 00:02:27,060
and schema changes are a real thing these days.

55
00:02:27,060 --> 00:02:29,670
You need to adapt to growing business needs,

56
00:02:29,670 --> 00:02:31,540
and Avro easily handles that

57
00:02:31,540 --> 00:02:33,563
because of its JSON-formatted schema.

58
00:02:34,960 --> 00:02:38,192
It's great for write-heavy workloads because it's row-based,

59
00:02:38,192 --> 00:02:43,000
and being row-based also means it's great for ETL operations

60
00:02:43,000 --> 00:02:44,453
that need all the columns.

61
00:02:45,970 --> 00:02:49,880
Comparing apples to oranges, Parquet is another option,

62
00:02:49,880 --> 00:02:52,267
and this one uses columnar storage.

63
00:02:52,267 --> 00:02:54,730
This is where the values of each column

64
00:02:54,730 --> 00:02:56,320
are stored together.

65
00:02:56,320 --> 00:02:58,960
And so using the same example from earlier,

66
00:02:58,960 --> 00:03:01,170
instead of having all of our row information

67
00:03:01,170 --> 00:03:04,900
stored sequentially, we store each column together.

68
00:03:04,900 --> 00:03:07,730
So we have all of our ID values stored together,

69
00:03:07,730 --> 00:03:11,723
all of our name values, and all of our topic values.

70
00:03:14,380 --> 00:03:17,616
A standout feature of Parquet is that it supports nested data

71
00:03:17,616 --> 00:03:21,943
structures, allowing nested fields to be read individually.

72
00:03:23,760 --> 00:03:27,670
And just like Avro, its data is stored in binary,

73
00:03:27,670 --> 00:03:30,633
also making it efficient for computers to use.

74
00:03:32,730 --> 00:03:36,450
Some of its benefits include: it's great for queries

75
00:03:36,450 --> 00:03:40,700
on wide tables, it's efficient for read-heavy workloads,

76
00:03:40,700 --> 00:03:43,550
and this makes it a great choice for analytics in general,

77
00:03:43,550 --> 00:03:45,341
especially in the curated zone,

78
00:03:45,341 --> 00:03:47,474
where read performance really matters.

79
00:03:47,474 --> 00:03:51,520
And you can easily query a subset of columns

80
00:03:51,520 --> 00:03:52,763
or nested data.

81
00:03:55,710 --> 00:04:00,120
Another option is ORC, or Optimized Row Columnar,

82
00:04:00,120 --> 00:04:01,810
and this one was designed to improve

83
00:04:01,810 --> 00:04:03,623
on the other file formats.

84
00:04:05,160 --> 00:04:08,253
It too uses columnar storage, just like Parquet,

85
00:04:09,500 --> 00:04:13,490
but it uses stripes: collections of rows

86
00:04:13,490 --> 00:04:15,973
that store their row data in columnar format.

87
00:04:17,090 --> 00:04:20,023
And like all the others, its data is stored in binary.

88
00:04:22,110 --> 00:04:26,360
Some of its benefits include: it supports ACID properties,

89
00:04:26,360 --> 00:04:29,123
which can be a big deal if your business requires that;

90
00:04:30,600 --> 00:04:34,190
it compresses more efficiently;

91
00:04:34,190 --> 00:04:36,823
and its stripes enable large, efficient reads.

92
00:04:39,486 --> 00:04:41,420
That was a lot of information.
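Before the recap, here's a minimal sketch of the columnar side, assuming the pyarrow library (pip install pyarrow). The column names mirror the lesson's example table; the sample values and file paths are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

# Columnar storage: all id values sit together,
# all names together, and so on.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["guru1", "guru2", "guru3"],
    "topic": ["Data Lakes", "File Formats", "Analytics"],
})

# Parquet: write once, then read back only the columns
# a query actually needs.
pq.write_table(table, "gurus.parquet")
names_only = pq.read_table("gurus.parquet", columns=["name"])
print(names_only.to_pydict())

# ORC: the same columnar idea, with rows grouped
# into stripes on disk.
orc.write_table(table, "gurus.orc")
print(orc.read_table("gurus.orc").num_rows)

That columns=["name"] read is the payoff of columnar storage: the query touches one column's data instead of scanning every full row.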
93
00:04:41,420 --> 00:04:44,170
So let's do a bit of a TL;DR on which format you would pick

94
00:04:44,170 --> 00:04:45,483
for which situation.

95
00:04:46,480 --> 00:04:48,110
For our analytical queries,

96
00:04:48,110 --> 00:04:51,100
we're going to use either Parquet or ORC

97
00:04:51,100 --> 00:04:53,820
because of their read performance benefits.

98
00:04:53,820 --> 00:04:56,390
Conversely, if we have heavy write operations,

99
00:04:56,390 --> 00:04:59,320
such as ETL work, we're going to use Avro

100
00:04:59,320 --> 00:05:00,970
because of its row-based storage.

101
00:05:01,950 --> 00:05:05,376
For nested data, that's a standout feature of Parquet,

102
00:05:05,376 --> 00:05:08,140
and needing ACID properties is going to point us

103
00:05:08,140 --> 00:05:09,523
toward using ORC.

104
00:05:10,570 --> 00:05:12,880
Lastly, if you need a lot of flexibility

105
00:05:12,880 --> 00:05:14,600
in your schema evolution,

106
00:05:14,600 --> 00:05:16,713
that is going to bring you to Avro.

107
00:05:17,770 --> 00:05:21,100
And so they are each created to work efficiently

108
00:05:21,100 --> 00:05:22,510
in analytic solutions.

109
00:05:22,510 --> 00:05:24,810
It just depends on your particular needs

110
00:05:24,810 --> 00:05:26,023
as to which you'll use.

111
00:05:29,290 --> 00:05:32,960
By way of review, while you can store any type of data,

112
00:05:32,960 --> 00:05:34,900
you want to use formats that are designed

113
00:05:34,900 --> 00:05:37,010
for big data processing.

114
00:05:37,010 --> 00:05:40,560
As I said before, storing data as CSV or JSON

115
00:05:40,560 --> 00:05:43,660
could lead to a large number of small files,

116
00:05:43,660 --> 00:05:46,640
and that's your enemy in these analytics situations.

117
00:05:46,640 --> 00:05:49,940
It gives you less performance at a higher cost.

118
00:05:49,940 --> 00:05:51,980
And so you're going to want to stick with Avro,

119
00:05:51,980 --> 00:05:53,533
Parquet, or ORC.

120
00:05:55,390 --> 00:05:57,770
Your primary decision may come down to

121
00:05:57,770 --> 00:06:02,180
whether you need row-based storage, with Avro,

122
00:06:02,180 --> 00:06:03,533
or columnar storage, with Parquet and ORC.

123
00:06:05,950 --> 00:06:08,400
And while Parquet and ORC are similar,

124
00:06:08,400 --> 00:06:10,080
they do have differences,

125
00:06:10,080 --> 00:06:13,950
so evaluate those based on your business needs

126
00:06:13,950 --> 00:06:16,000
in order to choose which is best for you.

127
00:06:17,470 --> 00:06:19,570
I hope that you have enjoyed this lesson

128
00:06:19,570 --> 00:06:22,410
and that the differences in these file formats are clear to you.

129
00:06:22,410 --> 00:06:23,730
If you have any questions,

130
00:06:23,730 --> 00:06:25,600
please feel free to reach out to me,

131
00:06:25,600 --> 00:06:27,400
and I'll see you in the next lesson.