OK, so in the previous video we understood that a data engineer essentially creates this data pipeline for us, for people like data scientists, so that they have data available to work with.

But I want to use a different example. Let's pretend that Keiko Corp. has a need, and that need is for water. The employees need water to work hard, right? So let's think of this as a problem that a data engineer solves. Now bear with me, we're going to be talking about water for a few minutes, but I promise it relates to data.

You see, the very first thing we need in order to get water is rain, right? We need rain, and this is similar to the data sources that we had: the websites, the mobile phones, the cameras, the cars, the drones. All of these are collecting data. That's what the raindrops are. In order for us to get water to our employees we have to get water from somewhere, and that water comes from rain.

Next, that water gets collected into different streams and rivers all over the place, from all over the world. Just like data from all different places, these streams of water start forming. And what happens next? Well, these streams turn into lakes. They all flow into some sort of a lake, and that lake just sits there. But again, we at Keiko Corp. just want water for our employees, so we build a dam on the lake, or some system to make use of that lake. We transport the water to something like a filtration and sanitation plant so that we can clean and sanitize it, so it's safe for people to drink. So we process the water that has been collected, and then finally, after all of that, we build all the plumbing and pipes to deliver water to Keiko Corp. That's how we satisfy Keiko Corp.'s need to provide water for its employees.

And this is how data works. This is what a data engineer builds. A data engineer starts off with what we call data ingestion, that is, acquiring data from various sources. We acquire all these different sources of data and ingest them into what we call a data lake. A data lake is a collection of all this data in one location. From there we could just leave the lake as it is, but there's a problem: we don't want the lake to overflow or dry up. We need to manage this lake somehow. We don't want to flood the village. We need dams.
We need pipes to the houses. We need to filter and clean the water, and we need piping to deliver it. So, after we ingest this data from various sources and collect it in a data lake, we then perform something called data transformation, that is, converting data from one format to another, usually into something we call a data warehouse. A data warehouse is a place that stores accessible data that is useful for the business. That is, although the lake collects so much water, so much data that was kind of all over the place, a data engineer looks at this data, takes the parts that are useful, and puts them in a data warehouse so that other parts of the business can use them.

Again, a good way to think about it is that a data lake is a pool of raw data whose purpose we may not know yet, while a data warehouse is a location for structured, filtered data that has been processed and has a specific purpose. That is, the data in a warehouse is useful for the business. So data lakes are usually less organized and less filtered than something like a data warehouse.

Now, why would businesses want to do that? Well, for one, it's a lot easier to analyze data when it's organized well. We also might have data in the lake that we don't really need, so by storing only the structured data that is useful to us we save on storage space, and we save ourselves some money.

So, to review: a data engineer builds this pipeline, taking data from data creation and data capture and, using data engineering practices, turning it into something that can be analyzed by data scientists and data analysts. But we still have a little bit more to go, so I'll see you in the next video.
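To make that review a little more concrete, here is a minimal sketch of the same pipeline in Python. Everything in it is hypothetical and purely illustrative: the keiko_lake folder standing in for a data lake, the keiko_warehouse.db SQLite file standing in for a data warehouse, and the fake website and mobile-app records. Real pipelines use dedicated ingestion, storage, and transformation tools, but the shape of the flow is the same.

```python
# Illustrative sketch of the pipeline described above, using only the Python
# standard library. All names (keiko_lake/, keiko_warehouse.db, the fake
# "sources") are hypothetical -- this is not any specific tool's API, just the
# ingestion -> data lake -> transformation -> data warehouse idea.
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

LAKE_DIR = Path("keiko_lake")              # the "data lake": raw files, any shape
WAREHOUSE_DB = Path("keiko_warehouse.db")  # the "data warehouse": structured, queryable


def ingest(source_name: str, records: list) -> Path:
    """Data ingestion: land raw data from a source in the lake, untouched."""
    LAKE_DIR.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = LAKE_DIR / f"{source_name}_{stamp}.json"
    path.write_text(json.dumps(records))
    return path


def transform_to_warehouse() -> None:
    """Data transformation: keep only the fields the business needs,
    in a fixed schema, and load them into the warehouse."""
    conn = sqlite3.connect(WAREHOUSE_DB)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views (user_id TEXT, page TEXT, viewed_at TEXT)"
    )
    for raw_file in LAKE_DIR.glob("*.json"):
        for record in json.loads(raw_file.read_text()):
            # Raw records may carry extra, messy fields; keep only what is useful.
            if "user_id" in record and "page" in record:
                conn.execute(
                    "INSERT INTO page_views VALUES (?, ?, ?)",
                    (record["user_id"], record["page"], record.get("timestamp", "")),
                )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    # Pretend two different sources (website, mobile app) are "raining" data on us.
    ingest("website", [{"user_id": "u1", "page": "/home",
                        "timestamp": "2024-01-01T10:00:00", "ip": "203.0.113.7"}])
    ingest("mobile_app", [{"user_id": "u2", "page": "/pricing", "battery": 0.81}])
    transform_to_warehouse()

    # A data analyst can now query the structured, filtered table.
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        print(conn.execute("SELECT * FROM page_views").fetchall())
```

The thing to notice is the split: ingestion lands everything raw and messy in the lake, while the transformation step picks out the useful fields and gives them a fixed schema that the rest of the business can query.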