Welcome back. So we learned that as a data engineer, you collect data, you ingest data, and then you store it in a data lake, often a Hadoop cluster. Now, this idea of a data lake can also be used with something like AWS S3 object storage. Those are the two most popular options. Once the data is in your data lake, you can usually run batch jobs on it to clean it up, modify it, or wrangle it in some way. Hadoop allowed us to do that with MapReduce. We can use Apache Spark. You can manually write scripts against Amazon S3, or you can even do batch jobs with databases, like SQL databases.

But this idea of stream processing is now becoming more and more popular. Instead of processing data using just batch jobs, say every hour or every day, there's now a movement towards real-time stream processing. That means when data is received, we process it right away instead of waiting until we can do a batch at a time. This allows us to react to data faster.

Now, stream processing is pretty complicated, but one of the main tools being used right now for stream processing is something called Kafka. Kafka, if we look on their website, is a distributed streaming platform. It allows us to read and write streams of data like a messaging system, it allows us to process streams, and it allows us to store data as well. As you can see, Kafka receives messages and passes them on to different places.

So what you'll often see is that we use Apache Kafka to receive all these messages, logs, and data from all these sources, collect them in a central location, and then Kafka can pass them on to different destinations. For example, we can pass whatever Kafka receives to a Hadoop cluster to store that data, and then, if we want, use Apache Spark to process that data into a data warehouse. But now, with the idea of real-time processing, Kafka allows us to pass things on to, well, itself, or we can use something like Spark Streaming, or Apache Flink, which we've talked about and which is used more for stream processing, or Apache Storm, or even Kinesis by Amazon, which can actually also be used in Kafka's place here to handle all these messages from all these locations.

Now, at the end of the day, all of this might have been confusing. You may be overwhelmed by all of these tools that are available.
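Before moving on, here is a minimal sketch of the publish/subscribe flow described above, using the kafka-python client. The broker address and the "events" topic name are assumptions for illustration, not a fixed setup.

```python
# A minimal sketch of Kafka's publish/subscribe flow using the
# kafka-python client. The broker address and the "events" topic
# are illustrative assumptions.
import json

from kafka import KafkaProducer, KafkaConsumer

# A producer writes messages (e.g. logs or click events) to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": 42, "action": "page_view"})
producer.flush()  # make sure the message actually leaves the client

# A consumer reads from the same topic as messages arrive, which is
# what enables the "process it right away" style described above.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # react to each event as it is received
```

The same topic can feed many consumers at once, which is how Kafka passes data on to Hadoop, Spark, and other destinations in parallel.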
But remember what I said at the beginning of this section: a data engineer is somebody you'll interact with as a data scientist. So it's good to know what they do, what their roles are, and what their responsibilities are, because you'll work closely with them in order to have data available to perform analysis on or to build machine learning models. But I hope that this diagram now makes sense. A data engineer creates this landscape for you, where they ingest data through Kafka, store information using something like Hadoop or Amazon S3, process that data using something like Apache Spark or Kinesis, and then even store data in data warehouses. So tools like Redshift or BigQuery can be used by people inside the business to learn something useful from the data. They build the structure for you to flourish, so be nice to them. And if you're ever interested in any of these technologies, they're changing very fast, but they're definitely worth a look, and they're not as intimidating to learn as you might think. I'll see you in the next one. Bye.
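As a concrete illustration of the batch step in that landscape, here is a rough PySpark sketch that reads raw data from a data lake on S3, wrangles it, and writes a clean copy ready for a warehouse. The bucket paths and column names are hypothetical, and a real pipeline would load the result into Redshift or BigQuery with a dedicated connector.

```python
# A rough sketch of the batch side of the pipeline: read raw events
# from a data lake on S3, clean them with Spark, and write a tidy
# copy that a warehouse such as Redshift or BigQuery could load.
# The bucket paths and column names here are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-events").getOrCreate()

# Ingested data lands in the lake as raw JSON (e.g. via Kafka).
raw = spark.read.json("s3a://my-data-lake/raw/events/")

# "Wrangle" step: drop malformed rows and normalize a column.
clean = (
    raw.dropna(subset=["user", "action"])
       .withColumn("action", F.lower(F.col("action")))
)

# Write a columnar copy back to the lake, ready for warehouse loading.
clean.write.mode("overwrite").parquet("s3a://my-data-lake/clean/events/")
```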