1
00:00:00,017 --> 00:00:01,460
Hey Cloud Gurus.

2
00:00:01,460 --> 00:00:04,710
Welcome back. We've talked a little bit about data zones.

3
00:00:04,710 --> 00:00:06,560
We've talked about folder structure.

4
00:00:06,560 --> 00:00:08,990
Now let's talk about the files actually going

5
00:00:08,990 --> 00:00:10,193
into that structure.

6
00:00:12,300 --> 00:00:14,440
In this lesson, we're going to start by taking a look

7
00:00:14,440 --> 00:00:17,600
at the file types available to us,

8
00:00:17,600 --> 00:00:19,330
and we're going to go over specific ones,

9
00:00:19,330 --> 00:00:22,033
such as Avro, Parquet, and ORC.

10
00:00:23,260 --> 00:00:25,910
After that, we'll do a little comparison of those three

11
00:00:25,910 --> 00:00:28,133
and wrap everything up with a review.

12
00:00:30,520 --> 00:00:32,300
We've talked a little bit about data lakes

13
00:00:32,300 --> 00:00:34,240
and how they're a great place to store

14
00:00:34,240 --> 00:00:36,340
all kinds of information.

15
00:00:36,340 --> 00:00:39,610
The good news is that all formats are welcome,

16
00:00:39,610 --> 00:00:43,970
be that CSV, JSON, whatever. You can throw it all

17
00:00:43,970 --> 00:00:45,710
in your data lake.

18
00:00:45,710 --> 00:00:49,478
The bad news is all formats are welcome.

19
00:00:49,478 --> 00:00:52,940
And I don't have all these pictures of fruit for no reason.

20
00:00:52,940 --> 00:00:55,440
If you think about a basket of fruit, if I tell you,

21
00:00:55,440 --> 00:00:57,340
hey, there's a bunch of fruit,

22
00:00:57,340 --> 00:00:58,850
well, it doesn't really tell you much,

23
00:00:58,850 --> 00:01:01,870
because an apple and a lemon are very different.

24
00:01:01,870 --> 00:01:04,150
And so you're going to want to be careful

25
00:01:04,150 --> 00:01:05,660
which one you're using.

26
00:01:05,660 --> 00:01:09,530
Just because you can use something doesn't mean you should.

27
00:01:09,530 --> 00:01:12,300
Even though CSV and JSON are common file formats

28
00:01:12,300 --> 00:01:13,780
and easy to work with,

29
00:01:13,780 --> 00:01:17,610
they're not necessarily ideal for analytics situations.

30
00:01:17,610 --> 00:01:19,910
They can lead to lots of small files,

31
00:01:19,910 --> 00:01:22,930
which creates performance and cost issues for us.

32
00:01:22,930 --> 00:01:25,840
And they're not designed to work with parallel systems.

33
00:01:25,840 --> 00:01:28,870
Luckily, even if our data comes in these formats,

34
00:01:28,870 --> 00:01:31,300
we can consolidate them into ones more useful

35
00:01:31,300 --> 00:01:32,483
to our purpose.

36
00:01:33,856 --> 00:01:37,457
The first of these that we'll discuss is Avro,

37
00:01:37,457 --> 00:01:40,820
and it uses row-based storage, where all the record data

38
00:01:40,820 --> 00:01:42,490
is stored together.

39
00:01:42,490 --> 00:01:44,110
If we take a look at this visually,

40
00:01:44,110 --> 00:01:46,830
let's say we have a table with names

41
00:01:46,830 --> 00:01:48,890
and course topics in it.

42
00:01:48,890 --> 00:01:52,690
You can see that each of the rows is stored sequentially.

43
00:01:52,690 --> 00:01:55,160
So we have all of our guru1 information,

44
00:01:55,160 --> 00:01:57,270
all of our guru2 information,

45
00:01:57,270 --> 00:01:59,773
and then all of our guru3 information.

46
00:02:03,240 --> 00:02:05,740
The schema for Avro is stored in JSON,

47
00:02:05,740 --> 00:02:07,903
which means it's easy for humans to read.
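To make that concrete, here's a minimal sketch of writing and reading those guru records, assuming the third-party fastavro library (pip install fastavro). The field names, sample topic values, and file path are illustrative, not from the lesson.

from fastavro import writer, reader

# The Avro schema is plain JSON, easy for humans to read.
# Field names and sample values here are illustrative.
schema = {
    "type": "record",
    "name": "Guru",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "topic", "type": "string"},
    ],
}

# Row-based storage: each record's fields travel together.
records = [
    {"id": 1, "name": "guru1", "topic": "Data Lakes"},
    {"id": 2, "name": "guru2", "topic": "File Formats"},
    {"id": 3, "name": "guru3", "topic": "Analytics"},
]

# The data itself is written in a compact binary encoding.
with open("gurus.avro", "wb") as out:
    writer(out, schema, records)

# Reading back yields whole rows, one record at a time.
with open("gurus.avro", "rb") as fp:
    for record in reader(fp):
        print(record)

Notice the split the lesson describes: the schema is JSON you can read at a glance, while the records on disk are binary for the computer's benefit.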
48
00:02:09,210 --> 00:02:11,180
And the data is stored in binary,

49
00:02:11,180 --> 00:02:13,240
which is easy for computers to read,

50
00:02:13,240 --> 00:02:14,903
meaning it's very efficient.

51
00:02:17,360 --> 00:02:19,840
Some of the benefits of Avro include

52
00:02:19,840 --> 00:02:22,270
easily handling schema changes.

53
00:02:22,270 --> 00:02:24,640
We all know the fast pace of IT,

54
00:02:24,640 --> 00:02:27,060
and schema changes are a real thing these days.

55
00:02:27,060 --> 00:02:29,670
You need to adapt to growing business needs,

56
00:02:29,670 --> 00:02:31,540
and Avro easily handles that

57
00:02:31,540 --> 00:02:33,563
because of its JSON-formatted schema.

58
00:02:34,960 --> 00:02:38,192
It's great for write-heavy workloads because it's row-based,

59
00:02:38,192 --> 00:02:43,000
and being row-based also means it's great for ETL operations

60
00:02:43,000 --> 00:02:44,453
that need all the columns.

61
00:02:45,970 --> 00:02:49,880
Comparing apples to oranges, Parquet is another option,

62
00:02:49,880 --> 00:02:52,267
and this one uses columnar storage.

63
00:02:52,267 --> 00:02:54,730
This is where the values of each column

64
00:02:54,730 --> 00:02:56,320
are stored together.

65
00:02:56,320 --> 00:02:58,960
And so using the same example from earlier,

66
00:02:58,960 --> 00:03:01,170
instead of having all of our row information

67
00:03:01,170 --> 00:03:04,900
stored sequentially, we store each column together.

68
00:03:04,900 --> 00:03:07,730
So we have all of our ID values stored together,

69
00:03:07,730 --> 00:03:11,723
all of our name values, and all of our topic values.

70
00:03:14,380 --> 00:03:17,616
A standout feature of Parquet is that it supports nested data

71
00:03:17,616 --> 00:03:21,943
structures, allowing nested fields to be read individually.

72
00:03:23,760 --> 00:03:27,670
And just like Avro, its data is stored in binary,

73
00:03:27,670 --> 00:03:30,633
also making it efficient for computers to use.

74
00:03:32,730 --> 00:03:36,450
Some of its benefits include: it's great for queries

75
00:03:36,450 --> 00:03:40,700
on wide tables, it's efficient for read-heavy workloads,

76
00:03:40,700 --> 00:03:43,550
and this makes it a great choice for analytics in general,

77
00:03:43,550 --> 00:03:45,341
especially in the curated zone,

78
00:03:45,341 --> 00:03:47,474
where read performance really matters.

79
00:03:47,474 --> 00:03:51,520
And you can easily query a subset of columns

80
00:03:51,520 --> 00:03:52,763
or nested data.

81
00:03:55,710 --> 00:04:00,120
Another option is ORC, or Optimized Row Columnar,

82
00:04:00,120 --> 00:04:01,810
and this one was designed to improve

83
00:04:01,810 --> 00:04:03,623
on the other file formats.

84
00:04:05,160 --> 00:04:08,253
It too uses columnar storage, just like Parquet,

85
00:04:09,500 --> 00:04:13,490
but it uses stripes: collections of rows

86
00:04:13,490 --> 00:04:15,973
that store their row data in columnar format.

87
00:04:17,090 --> 00:04:20,023
And like all the others, its data is stored in binary.

88
00:04:22,110 --> 00:04:26,360
Some of its benefits include: it supports ACID properties,

89
00:04:26,360 --> 00:04:29,123
which can be a big deal if your business requires that;

90
00:04:30,600 --> 00:04:34,190
it compresses more efficiently;

91
00:04:34,190 --> 00:04:36,823
and its stripes enable large, efficient reads.

92
00:04:39,486 --> 00:04:41,420
That was a lot of information.
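Before the recap, here's a minimal sketch of the columnar side, assuming the pyarrow library (pip install pyarrow). The column names mirror the lesson's example table; the sample values and file paths are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

# Columnar storage: all id values sit together,
# all names together, and so on.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["guru1", "guru2", "guru3"],
    "topic": ["Data Lakes", "File Formats", "Analytics"],
})

# Parquet: write once, then read back only the columns
# a query actually needs.
pq.write_table(table, "gurus.parquet")
names_only = pq.read_table("gurus.parquet", columns=["name"])
print(names_only.to_pydict())

# ORC: the same columnar idea, with rows grouped
# into stripes on disk.
orc.write_table(table, "gurus.orc")
print(orc.read_table("gurus.orc").num_rows)

That columns=["name"] read is the payoff of columnar storage: the query touches one column's data instead of scanning every full row.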
93
00:04:41,420 --> 00:04:44,170
So let's do a bit of a TL;DR on which format you would pick

94
00:04:44,170 --> 00:04:45,483
for which situation.

95
00:04:46,480 --> 00:04:48,110
For our analytical queries,

96
00:04:48,110 --> 00:04:51,100
we're going to use either Parquet or ORC

97
00:04:51,100 --> 00:04:53,820
because of their read performance benefits.

98
00:04:53,820 --> 00:04:56,390
Conversely, if we have heavy write operations,

99
00:04:56,390 --> 00:04:59,320
such as ETL work, we're going to use Avro

100
00:04:59,320 --> 00:05:00,970
because of its row-based storage.

101
00:05:01,950 --> 00:05:05,376
For nested data, that's a standout feature of Parquet,

102
00:05:05,376 --> 00:05:08,140
and needing ACID properties is going to point us

103
00:05:08,140 --> 00:05:09,523
toward using ORC.

104
00:05:10,570 --> 00:05:12,880
Lastly, if you need a lot of flexibility

105
00:05:12,880 --> 00:05:14,600
in your schema evolution,

106
00:05:14,600 --> 00:05:16,713
that is going to bring you to Avro.

107
00:05:17,770 --> 00:05:21,100
And so they are each created to work efficiently

108
00:05:21,100 --> 00:05:22,510
in analytic solutions.

109
00:05:22,510 --> 00:05:24,810
It just depends on your particular needs

110
00:05:24,810 --> 00:05:26,023
as to which you'll use.

111
00:05:29,290 --> 00:05:32,960
By way of review, while you can store any type of data,

112
00:05:32,960 --> 00:05:34,900
you want to use formats that are designed

113
00:05:34,900 --> 00:05:37,010
for big data processing.

114
00:05:37,010 --> 00:05:40,560
As I said before, storing data as CSV or JSON

115
00:05:40,560 --> 00:05:43,660
could lead to a large number of small files,

116
00:05:43,660 --> 00:05:46,640
and that's your enemy in these analytics situations.

117
00:05:46,640 --> 00:05:49,940
It gives you less performance at a higher cost.

118
00:05:49,940 --> 00:05:51,980
And so you're going to want to stick with Avro,

119
00:05:51,980 --> 00:05:53,533
Parquet, or ORC.

120
00:05:55,390 --> 00:05:57,770
Your primary decision may come down to

121
00:05:57,770 --> 00:06:02,180
whether you need row-based storage, with Avro,

122
00:06:02,180 --> 00:06:03,533
or columnar storage, with Parquet and ORC.

123
00:06:05,950 --> 00:06:08,400
And while Parquet and ORC are similar,

124
00:06:08,400 --> 00:06:10,080
they do have differences,

125
00:06:10,080 --> 00:06:13,950
so evaluate those based on your business needs

126
00:06:13,950 --> 00:06:16,000
in order to choose which is best for you.

127
00:06:17,470 --> 00:06:19,570
I hope that you have enjoyed this lesson

128
00:06:19,570 --> 00:06:22,410
and that the differences in these file formats are clear to you.

129
00:06:22,410 --> 00:06:23,730
If you have any questions,

130
00:06:23,730 --> 00:06:25,600
please feel free to reach out to me,

131
00:06:25,600 --> 00:06:27,400
and I'll see you in the next lesson.