1 00:00:01,740 --> 00:00:02,050 OK. 2 00:00:02,070 --> 00:00:05,950 OK, so what does a data engineer actually do? 3 00:00:05,950 --> 00:00:10,960 Let's start off with three big words that we need to understand before we really understand what a data 4 00:00:10,990 --> 00:00:12,760 engineer does. 5 00:00:12,760 --> 00:00:18,730 Yes, I know I keep circling around data engineer and I keep getting you excited, but I promise we're getting 6 00:00:18,730 --> 00:00:20,130 really, really close. 7 00:00:20,230 --> 00:00:25,450 We just need to discuss these three words that you may or may not have heard of. 8 00:00:25,470 --> 00:00:33,360 One is data mining, and data mining simply means pre-processing and extracting some knowledge from the 9 00:00:33,360 --> 00:00:42,680 data. So we use some sort of data to extract knowledge from; that's what data mining is. Big data: 10 00:00:42,700 --> 00:00:49,090 well, it means that we have a lot of data, a lot of variables, so much data in fact that you can't run 11 00:00:49,600 --> 00:00:57,620 it on your laptop, and there's just too much data for your laptop or computer to hold. Usually when we talk 12 00:00:57,620 --> 00:00:58,970 about big data, 13 00:00:58,970 --> 00:01:06,500 it's data that's so big that you need to have it running on cloud computing or multiple computers, such 14 00:01:06,500 --> 00:01:14,960 as AWS, Azure, and Google Cloud, because they have a lot of computers and a lot of storage to store 15 00:01:14,990 --> 00:01:15,880 this data. 16 00:01:15,920 --> 00:01:23,180 So that's what big data is: big data is data that you usually can't hold on just one computer. 17 00:01:23,210 --> 00:01:26,840 This usually happens because data sets get so big.
18 00:01:26,840 --> 00:01:35,490 We have petabytes and petabytes of data now, so just having data in a database like MySQL or 19 00:01:35,490 --> 00:01:42,450 Postgres or any type of database becomes really difficult when that database is just one single machine. 20 00:01:43,650 --> 00:01:43,980 Now, 21 00:01:43,980 --> 00:01:50,550 new technologies were invented to solve this problem of big data, like Hadoop and NoSQL, 22 00:01:50,610 --> 00:02:01,580 which we'll talk about. Now, a data pipeline is essentially a pipeline that a data engineer builds to essentially 23 00:02:02,420 --> 00:02:10,310 use the fact that we have this big amount of data, and we need to extract information from this data using 24 00:02:10,310 --> 00:02:11,570 data mining. 25 00:02:11,570 --> 00:02:19,580 So we need to build a pipeline that allows us to flow from that unknown large amount of data 26 00:02:19,820 --> 00:02:25,560 to a pipeline that extracts data into a more useful form. 27 00:02:25,700 --> 00:02:34,670 So a data engineer essentially does this: creates a data pipeline for all the information from different 28 00:02:34,670 --> 00:02:44,270 devices like IoT devices, mobile applications, web apps, cameras, cars, and pretty much anything that collects 29 00:02:44,390 --> 00:02:53,570 data and stores information or logs data into servers or to the cloud. A data engineer essentially accumulates 30 00:02:53,840 --> 00:03:02,630 all this information into nicely packed databases and storage engines so that different parts of the 31 00:03:02,630 --> 00:03:11,210 company can create visualizations, they can monitor the performance of their product, they can get business 32 00:03:11,240 --> 00:03:18,230 insights and make business decisions from this data, and even use this data in their apps, for example 33 00:03:18,230 --> 00:03:24,770 for user profiles. Before a data scientist or a machine learning expert or even a business intelligence 34 00:03:24,770 --> 00:03:28,770 or data analyst gets
hired at a big company, 35 00:03:28,790 --> 00:03:35,690 the thing that the company needs to do before all of that can be done is to hire a data engineer. They build 36 00:03:35,690 --> 00:03:40,300 the pipeline that allows us to work as data scientists. 37 00:03:40,340 --> 00:03:44,960 Remember this diagram and how we've been focusing on this part? Well, 38 00:03:45,530 --> 00:03:50,270 a data engineer allows us to do this data collection part. 39 00:03:50,360 --> 00:03:57,760 They bring in all this information and organize it in a way for us to do our data modelling. 40 00:03:57,860 --> 00:04:05,120 They really help us with this data collection part that usually, as a machine learning engineer or data 41 00:04:05,120 --> 00:04:11,090 scientist, you don't have to concern yourself too much with. But you know what? My favorite demonstration 42 00:04:11,090 --> 00:04:14,370 of what a data engineer is is in the next video. 43 00:04:14,480 --> 00:04:18,820 I hope you're excited, because things are gonna start to make more and more sense. 44 00:04:18,920 --> 00:04:20,430 I'll see you on that one. Bye.
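
Note: the data pipeline idea from this video (collect raw data from many sources, pull out the useful part, store it in a database that other teams can query) can be sketched in a few lines. This is only a toy illustration, not how a production pipeline is built. The event records, the function names (extract, transform, load, run_pipeline) and the table layout are all made up for the example, and an in-memory SQLite database stands in for the "nicely packed databases" mentioned above.

```python
import sqlite3

# Hypothetical raw events, standing in for logs from IoT devices,
# mobile apps and web apps (made-up data for illustration only).
RAW_EVENTS = [
    {"source": "iot_sensor", "user_id": 1, "value": "21.5"},
    {"source": "mobile_app", "user_id": 2, "value": "3"},
    {"source": "web_app", "user_id": 1, "value": None},  # malformed: no value
    {"source": "web_app", "user_id": 2, "value": "7"},
]


def extract(events):
    """Keep only the records that actually carry a value."""
    return [e for e in events if e.get("value") is not None]


def transform(events):
    """Turn each record into a (source, user_id, numeric value) row."""
    return [(e["source"], e["user_id"], float(e["value"])) for e in events]


def load(rows, conn):
    """Store the cleaned rows in a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(source TEXT, user_id INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(events):
    """Run extract -> transform -> load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(events)), conn)
    return conn


conn = run_pipeline(RAW_EVENTS)

# An analyst or data scientist can now query the cleaned data.
totals = conn.execute(
    "SELECT user_id, SUM(value) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(totals)
```

At the scale described in the video, the list of events would be replaced by streams of logs and the single SQLite file by distributed storage, but the flow from raw data to a queryable store keeps the same shape.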