Let's talk about Hadoop. As companies started to generate more and more data, the databases we had, like MySQL, became too inefficient or simply unable to hold such large amounts of data, and Hadoop became a cornerstone of data engineering.

You see, Hadoop is an open source distributed processing framework. It allows us to do data processing and storage for big data. It was actually developed at Yahoo back in the day, and later donated to the Apache Software Foundation. Hadoop was all the rage when the big data craze started; it was the solution that allowed all these companies with petabytes of data to store that information.

Now, Hadoop was essentially a data lake, right? It's a data lake solution. It allows us to just store all of this data. But the popularity of Hadoop came from two big things.

One was HDFS, the Hadoop Distributed File System. Hadoop was able to store so much data because of HDFS, a file system just like the one on your computer, except it stores files across multiple computers. That is, we can use Hadoop on multiple machines to store as much data as we want. It was scalable: we could scale out, so data is spread across different physical computers.

The second was MapReduce, because once we store data, we need to perform some jobs, some processing, on that data, right? And MapReduce in Hadoop allowed us to run jobs against the data we had in our data lake, using languages like Java or Python. You can use MapReduce and Hadoop to do what we call batch processing; that is, every night we can run some sort of job to clean our data, to process it and make it useful for the company. Now, MapReduce isn't used as often anymore, as we'll see later on in the videos, because we have something called Apache Spark that's actually a bit faster, but we'll get to that.

The idea here is that we finally had a tool with Hadoop that allows us to store a lot of data across multiple machines using HDFS and run batch jobs using MapReduce on these large data sets. And a lot of other tools started popping up around Hadoop, tools such as Hive (love their logo, by the way, check that out). Anyway, Hive makes your Hadoop cluster feel like it's a relational database, even though it isn't. It allows you to write SQL queries against your HDFS file system.
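Just to make that concrete, here is a rough sketch of what a Hive query can look like. The table name, columns, and HDFS path are made up for illustration; the idea is that you point an external table at a folder of files already sitting in HDFS, and then query it with plain SQL.

-- hypothetical table over CSV files already stored in HDFS
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- ordinary SQL, but Hive turns it into jobs that run across the cluster
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;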
So in a sense, you kind of had this data lake with Hadoop, and also a little bit of a data warehouse type thing with Hive, by allowing SQL commands on your data that hopefully was processed using MapReduce.

And you may not have heard of Hadoop before, because it's used by data engineers. Remember, data engineers usually don't work on systems that are customer facing; that is, they don't build the databases that you would use with your app or with a web app. Instead, they build the systems behind the scenes that ingest and collect data and run ETL jobs, that is, extract, transform, load.

So Hadoop is that behind-the-scenes storage layer that stores large amounts of data and runs batch processing against it, so that we can get some useful information out of it. And Hadoop is still very popular and very capable when it comes to large data, petabytes of data. Let's learn a few more tools in the next video.