1 00:00:00,850 --> 00:00:01,590 Welcome back. 2 00:00:01,600 --> 00:00:10,000 So we just went through this whole flow of how data is used by a company, but there are a few more things 3 00:00:10,000 --> 00:00:16,000 I want to talk about, because right now we just talked about water, and how does this really apply? 4 00:00:16,030 --> 00:00:20,710 What kind of tools do data engineers actually use? 5 00:00:20,710 --> 00:00:27,880 Well, you may have heard of something like Kafka, and we'll get to Kafka later on. 6 00:00:28,000 --> 00:00:36,340 But something like collecting these data streams and putting them into a data lake might be something 7 00:00:36,340 --> 00:00:41,080 that Kafka, or Apache Kafka, might do. For data lakes, for example, 8 00:00:41,080 --> 00:00:45,280 these are simply programs that companies can use. 9 00:00:45,280 --> 00:00:51,800 For example, you may have heard of Hadoop or Amazon S3 or the Azure Data Lake. 10 00:00:51,820 --> 00:01:00,260 These are programs that have been built by engineers to hold large amounts of data, like a data lake. 11 00:01:01,190 --> 00:01:03,280 And then for a data warehouse, 12 00:01:03,380 --> 00:01:10,910 we have tools like Google BigQuery, Amazon Redshift, even Amazon Athena. These are data warehouses that 13 00:01:10,910 --> 00:01:17,720 allow engineers to make queries or analyze this structured data. 14 00:01:17,720 --> 00:01:18,940 Now I know what you're thinking. 15 00:01:19,010 --> 00:01:19,970 Who does what? 16 00:01:19,970 --> 00:01:26,000 In this whole system, we've learned that the data engineer creates this entire system for us. 17 00:01:26,030 --> 00:01:35,700 They use different tools and programs to ingest data and then put it into a data lake or a data warehouse. 18 00:01:35,810 --> 00:01:41,640 So as a data scientist and a machine learning expert, which data do you use?
19 00:01:41,660 --> 00:01:50,500 Pause the video and think about it. Ready for the answer? Well, most of the time you'd be working with 20 00:01:50,500 --> 00:01:58,690 a data lake, because remember, if you're doing machine learning, the more data you have the better. 21 00:01:58,690 --> 00:01:59,270 Right? 22 00:02:00,120 --> 00:02:08,060 And with machine learning you can use structured or unstructured data, so you can go into a data lake 23 00:02:08,540 --> 00:02:17,090 and actually grab a bunch of data to use for your models, whether it's in CSV form or any other form. 24 00:02:18,220 --> 00:02:26,440 And usually data warehouses are used by BI, or business intelligence, people, or business analysts, or 25 00:02:26,530 --> 00:02:35,260 data analysts to make visualizations or analyze data, because the data warehouse usually has more structured 26 00:02:35,260 --> 00:02:40,470 data that has been cleaned up since it was in the data lake. 27 00:02:40,480 --> 00:02:45,700 It's a lot easier to understand and use the data for information. 28 00:02:45,700 --> 00:02:50,440 Now, as a data scientist you can use the data from a data warehouse. 29 00:02:50,440 --> 00:02:56,790 And this isn't a strict rule; usually you use whatever data is usable to you. 30 00:02:57,070 --> 00:03:04,240 But this is a good way to think about things: a data scientist uses as much data as they can, as 31 00:03:04,240 --> 00:03:11,920 much useful data as they can, while somebody like a business intelligence person or a data analyst has 32 00:03:12,010 --> 00:03:20,800 the data already cleaned and processed by a data engineer and uses something like a data warehouse to analyze 33 00:03:20,920 --> 00:03:24,660 data. And Google's BigQuery is exactly that. 34 00:03:24,660 --> 00:03:30,880 It allows somebody with not too much engineering experience or programming experience to analyze this 35 00:03:31,270 --> 00:03:34,590 data in a data warehouse.
36 00:03:34,600 --> 00:03:43,330 Another way to think about it is like this: usually a software engineer, software developer, app developer, 37 00:03:43,360 --> 00:03:54,140 or mobile developer builds programs and apps that users and customers use, and that releases data. Then 38 00:03:54,620 --> 00:04:04,910 a data engineer would build this piping and pipeline for us to ingest data and store it in different 39 00:04:04,910 --> 00:04:13,390 services like Hadoop or Google BigQuery so that that data can be accessed by the rest of the business. 40 00:04:13,400 --> 00:04:23,690 Next we have data scientists, who use the data lake as well as the data warehouse to extract information 41 00:04:23,750 --> 00:04:27,440 and deliver some sort of business value. 42 00:04:27,450 --> 00:04:36,030 Finally we have data analysts or business intelligence, who use something like a data warehouse or structured 43 00:04:36,030 --> 00:04:39,640 data to again derive business value. 44 00:04:39,960 --> 00:04:42,530 And I hope that clears up the picture a little bit 45 00:04:42,570 --> 00:04:48,990 when you hear all these titles out there in the industry. Now, the industry is fast evolving and there's 46 00:04:48,990 --> 00:04:50,330 definitely some overlap. 47 00:04:50,340 --> 00:04:56,250 And sometimes what one job description says might be different from another, but these are general 48 00:04:56,340 --> 00:05:02,850 simplified rules that you can use to understand how each role plays its part in a company.