Welcome back. So we learned that as a data engineer, you collect data, you ingest data, and then you store it in a data lake, often a Hadoop cluster. Now, this idea of a data lake can also be used with something like AWS S3 object storage. Those are the two most popular options. Once the data is in your data lake, you can usually run batch jobs on it to clean it up, modify it, or wrangle it in some way. Hadoop allowed us to do that with MapReduce. We can use Apache Spark. You can manually write scripts against Amazon S3, or you can even do batch jobs with databases, like SQL databases.

But this idea of stream processing is now becoming more and more popular. Instead of processing data using just batch jobs, say every hour or every day, there's now a movement towards real-time stream processing. That means when data is received, we process it right away instead of waiting until we can do a batch at a time. This allows us to react to data faster.

Now, stream processing is pretty complicated, but one of the main tools being used right now for stream processing is something called Kafka. Kafka, if we look on their website, is a distributed streaming platform. It allows us to read and write streams of data like a messaging system, it allows us to process streams, and it allows us to store data as well. As you can see, Kafka receives messages and passes them on to different places.

So what you'll often see is that we use Apache Kafka to receive all these messages, logs, and data from all these sources, collect them in a central location, and then Kafka can pass them on to different destinations. For example, we can pass whatever Kafka receives to a Hadoop cluster to store that data, and then, if we want, use Apache Spark to process that data into a data warehouse. But now, with the idea of real-time processing, Kafka allows us to pass things on to, well, itself, or we can use something like Spark Streaming, or Apache Flink, which we've talked about and which is used more for stream processing, or Apache Storm, or even Kinesis by Amazon, which can actually also be used in Kafka's place here to handle all these messages from all these locations.

Now, at the end of the day, all of this might have been confusing. You may be overwhelmed by all of these tools that are available.
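Before moving on, here is a minimal sketch of the publish/subscribe flow described above, using the kafka-python client. The broker address and the "events" topic name are assumptions for illustration, not a fixed setup.

```python
# A minimal sketch of Kafka's publish/subscribe flow using the
# kafka-python client. The broker address and the "events" topic
# are illustrative assumptions.
import json

from kafka import KafkaProducer, KafkaConsumer

# A producer writes messages (e.g. logs or click events) to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": 42, "action": "page_view"})
producer.flush()  # make sure the message actually leaves the client

# A consumer reads from the same topic as messages arrive, which is
# what enables the "process it right away" style described above.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # react to each event as it is received
```

The same topic can feed many consumers at once, which is how Kafka passes data on to Hadoop, Spark, and other destinations in parallel.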
But remember what I said at the beginning of this section: a data engineer is somebody you'll interact with as a data scientist. So it's good to know what they do, what their roles are, and what their responsibilities are, because you'll work closely with them in order to have data available to perform analysis on or to build machine learning models. But I hope that this diagram now makes sense. A data engineer creates this landscape for you, where they ingest data through Kafka, store information using something like Hadoop or Amazon S3, process that data using something like Apache Spark or Kinesis, and then even store data in data warehouses. So tools like Redshift or BigQuery can be used by people inside the business to learn something useful from the data. They build the structure for you to flourish, so be nice to them. And if you're ever interested in any of these technologies, they're changing very fast, but they're definitely worth a look, and they're not as intimidating to learn as you might think. I'll see you in the next one. Bye.
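As a concrete illustration of the batch step in that landscape, here is a rough PySpark sketch that reads raw data from a data lake on S3, wrangles it, and writes a clean copy ready for a warehouse. The bucket paths and column names are hypothetical, and a real pipeline would load the result into Redshift or BigQuery with a dedicated connector.

```python
# A rough sketch of the batch side of the pipeline: read raw events
# from a data lake on S3, clean them with Spark, and write a tidy
# copy that a warehouse such as Redshift or BigQuery could load.
# The bucket paths and column names here are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-events").getOrCreate()

# Ingested data lands in the lake as raw JSON (e.g. via Kafka).
raw = spark.read.json("s3a://my-data-lake/raw/events/")

# "Wrangle" step: drop malformed rows and normalize a column.
clean = (
    raw.dropna(subset=["user", "action"])
       .withColumn("action", F.lower(F.col("action")))
)

# Write a columnar copy back to the lake, ready for warehouse loading.
clean.write.mode("overwrite").parquet("s3a://my-data-lake/clean/events/")
```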