Let's talk about Hadoop. As companies started to generate more and more data, the databases we had, like MySQL, became too inefficient or simply unable to hold such large amounts of data, and Hadoop became a cornerstone of data engineering.

You see, Hadoop is an open source distributed processing framework. It allows us to do data processing and storage for big data. It was actually developed at Yahoo back in the day, and later donated to the Apache Software Foundation. Hadoop was all the rage when the big data craze started; it was the solution that allowed all these companies with petabytes of data to store that information.

Now, Hadoop was essentially a data lake, right? It's a data lake solution. It allows us to just store all of this data. But the popularity of Hadoop came from two big things.

One was HDFS, the Hadoop Distributed File System. Hadoop was able to store so much data because of HDFS, a file system just like the one on your computer, except it stores files across multiple computers. That is, we can use Hadoop on multiple machines to store as much data as we want. It was scalable: we could scale out, so data is spread across different physical computers.

The second was MapReduce, because once we store data, we need to perform some jobs, some processing, on that data, right? And MapReduce in Hadoop allowed us to run jobs against the data we had in our data lake, using languages like Java or Python. You can use MapReduce and Hadoop to do what we call batch processing; that is, every night we can run some sort of job to clean our data, to process it and make it useful for the company. Now, MapReduce isn't used as often anymore, as we'll see later on in the videos, because we have something called Apache Spark that's actually a bit faster, but we'll get to that.

The idea here is that we finally had a tool with Hadoop that allows us to store a lot of data across multiple machines using HDFS and run batch jobs using MapReduce on these large data sets. And a lot of other tools started popping up around Hadoop, tools such as Hive (love their logo, by the way, check that out). Anyway, Hive makes your Hadoop cluster feel like it's a relational database, even though it isn't. It allows you to write SQL queries against your HDFS file system.
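Just to make that concrete, here is a rough sketch of what a Hive query can look like. The table name, columns, and HDFS path are made up for illustration; the idea is that you point an external table at a folder of files already sitting in HDFS, and then query it with plain SQL.

-- hypothetical table over CSV files already stored in HDFS
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- ordinary SQL, but Hive turns it into jobs that run across the cluster
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;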
So in a sense, you kind of had this data lake with Hadoop, and also a little bit of a data warehouse type thing with Hive, by allowing SQL commands on your data that hopefully was processed using MapReduce.

And you may not have heard of Hadoop before, because it's used by data engineers. Remember, data engineers usually don't work on systems that are customer facing; that is, they don't build the databases that you would use with your app or with a web app. Instead, they build the systems behind the scenes that ingest and collect data and run ETL jobs, that is, extract, transform, load.

So Hadoop is that behind-the-scenes storage layer that stores large amounts of data and runs batch processing against it, so that we can get some useful information out of it. And Hadoop is still very popular and very capable when it comes to large data, petabytes of data. Let's learn a few more tools in the next video.