1 00:00:00,910 --> 00:00:04,510 Hello, and welcome to Mars, 2 00:00:04,510 --> 00:00:06,420 or at least the view from Mars. 3 00:00:06,420 --> 00:00:09,040 In this lesson, we are going to be talking about 4 00:00:09,040 --> 00:00:11,830 data engineering at a glance. 5 00:00:11,830 --> 00:00:14,080 And so what does that mean, exactly? 6 00:00:14,080 --> 00:00:16,070 So, today we're going to talk about 7 00:00:16,070 --> 00:00:18,050 the 4 pieces of data engineering, right? 8 00:00:18,050 --> 00:00:21,470 So, the first thing we have is data movement. 9 00:00:21,470 --> 00:00:22,800 Then we're going to talk a little bit 10 00:00:22,800 --> 00:00:25,420 about data ingestion and what that is. We're 11 00:00:25,420 --> 00:00:29,230 going to talk about data storage and data transformation. 12 00:00:29,230 --> 00:00:32,710 So, this lesson is going to be a lot about data. 13 00:00:32,710 --> 00:00:34,053 Let's get started. 14 00:00:35,870 --> 00:00:40,220 In the beginning of data we have a shopper. 15 00:00:40,220 --> 00:00:41,910 Let's call her Sally. 16 00:00:41,910 --> 00:00:44,950 Now, Sally likes to buy lots of stuff. 17 00:00:44,950 --> 00:00:48,840 So, she is going to go to the store and purchase 18 00:00:48,840 --> 00:00:50,460 all of these things. 19 00:00:50,460 --> 00:00:54,363 Now, when she does that, she generates data. 20 00:00:55,290 --> 00:00:58,190 Now, this data can't just live 21 00:00:58,190 --> 00:01:02,520 at the register where it's generated; it has to move. 22 00:01:02,520 --> 00:01:05,420 And so the very first piece of data engineering 23 00:01:05,420 --> 00:01:06,840 is data movement. 24 00:01:06,840 --> 00:01:09,170 We have all these data points being created somewhere 25 00:01:09,170 --> 00:01:10,750 and then they have to move. 26 00:01:10,750 --> 00:01:12,320 And as that data moves, 27 00:01:12,320 --> 00:01:15,513 we have to be able to do stuff with it, right?
28 00:01:16,580 --> 00:01:19,170 So, what we'll have is we'll have-- 29 00:01:19,170 --> 00:01:21,100 jumping into a business sense now-- 30 00:01:21,100 --> 00:01:24,810 we'll have, on the left, our ingestion sources. 31 00:01:24,810 --> 00:01:27,060 So, this is the data that's being generated. 32 00:01:27,060 --> 00:01:29,790 It could be media, it could be logs, 33 00:01:29,790 --> 00:01:31,790 it could be business data that's in a database 34 00:01:31,790 --> 00:01:35,720 coming from some sort of shopper like Sally, right? 35 00:01:35,720 --> 00:01:40,600 And that data needs to be pulled into our cloud. 36 00:01:40,600 --> 00:01:43,130 Now, to do that we're going to have what's called 37 00:01:43,130 --> 00:01:45,040 an ingestion layer. 38 00:01:45,040 --> 00:01:47,190 The ingestion layer is where we take all 39 00:01:47,190 --> 00:01:49,350 of those different data sources-- 40 00:01:49,350 --> 00:01:51,500 could be a bunch of different cash registers-- 41 00:01:51,500 --> 00:01:55,460 it could be, like I said, log files. Whatever you want. 42 00:01:55,460 --> 00:01:58,100 And we're going to ingest that data 43 00:01:58,100 --> 00:02:01,500 into something like Data Factory. 44 00:02:01,500 --> 00:02:04,170 Now, we're going to talk a lot about these services later. 45 00:02:04,170 --> 00:02:07,200 So, don't get bogged down in what Data Factory is. 46 00:02:07,200 --> 00:02:10,320 Just know that Azure Data Factory can be a fantastic 47 00:02:10,320 --> 00:02:14,163 ingestion service to pull data into the cloud. 48 00:02:15,310 --> 00:02:17,490 Once we have that data in Data Factory, 49 00:02:17,490 --> 00:02:19,200 we've got to do something with it, right? 50 00:02:19,200 --> 00:02:22,233 So, we've moved it, but where are we moving it to?
51 00:02:23,370 --> 00:02:26,650 That's where our storage layer comes in 52 00:02:26,650 --> 00:02:28,290 and we'll take something 53 00:02:28,290 --> 00:02:30,620 like Azure Data Lake Storage Gen2 54 00:02:31,530 --> 00:02:35,040 and we will move our data from Data Factory 55 00:02:35,040 --> 00:02:37,440 into the Data Lake. 56 00:02:37,440 --> 00:02:39,730 Now, once we have it in that Data Lake 57 00:02:39,730 --> 00:02:42,900 we're going to have all kinds of different data in there 58 00:02:42,900 --> 00:02:45,433 and it's not going to be particularly clean. 59 00:02:47,270 --> 00:02:50,170 This is where data transformation comes in. 60 00:02:50,170 --> 00:02:51,590 And this is a great example. 61 00:02:51,590 --> 00:02:54,010 So, we have our antiques sign 62 00:02:54,010 --> 00:02:56,710 or anitques, I don't know. 63 00:02:56,710 --> 00:02:58,160 However you want to say that. 64 00:02:58,160 --> 00:03:01,330 So, we have a misspelling here, right? 65 00:03:01,330 --> 00:03:03,870 And when we look at data, that's what's going to happen. 66 00:03:03,870 --> 00:03:05,490 You're going to have all sorts 67 00:03:05,490 --> 00:03:08,720 of differences between different data types. 68 00:03:08,720 --> 00:03:10,610 It could be duplicate data. 69 00:03:10,610 --> 00:03:12,860 It could be data that's misspelled. 70 00:03:12,860 --> 00:03:15,800 It could even be something like a country. 71 00:03:15,800 --> 00:03:18,400 So, for instance, if we have the country, 72 00:03:18,400 --> 00:03:21,270 like United States, we could have some places, 73 00:03:21,270 --> 00:03:24,290 some data sources listing that as US, 74 00:03:24,290 --> 00:03:27,400 some listing it as USA, some United States, 75 00:03:27,400 --> 00:03:30,280 some may even have a 2-letter country code. 76 00:03:30,280 --> 00:03:33,230 All of those sources reference the United States 77 00:03:33,230 --> 00:03:34,870 but they're all different.
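The country example above can be sketched in a few lines of Python. This is only an illustration, not how Azure Databricks or Azure Synapse Analytics do it: the alias table, the field names `customer` and `country`, and both helper functions are hypothetical and were made up for this sketch. It maps each spelling of a country to one canonical code and then drops the rows that turn out to be duplicates.

```python
# Hypothetical alias table (an assumption for this sketch): every label we
# expect to see in the sources maps to one canonical code.
COUNTRY_ALIASES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
}


def normalize_country(raw):
    """Return the canonical code for a label, or None if the label is unknown."""
    key = raw.strip().lower()
    return COUNTRY_ALIASES.get(key)


def clean_records(records):
    """Normalize the country field, then drop duplicate rows (first one wins)."""
    seen = set()
    cleaned = []
    for rec in records:
        row = (rec["customer"], normalize_country(rec["country"]))
        if row not in seen:
            seen.add(row)
            cleaned.append({"customer": row[0], "country": row[1]})
    return cleaned


# Three sources describe the same shopper with three different country labels.
records = [
    {"customer": "Sally", "country": "USA"},
    {"customer": "Sally", "country": "United States"},
    {"customer": "Sally", "country": " us "},
]

print(clean_records(records))
```

Once the labels agree, the three rows collapse into one, which is the point of the cleaning step: the duplicates only become visible after the values are made consistent.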
78 00:03:34,870 --> 00:03:37,380 And when we try and use that data 79 00:03:37,380 --> 00:03:41,040 it becomes very difficult with duplicates 80 00:03:41,040 --> 00:03:45,990 or missing data, or misspelled data, or different types. 81 00:03:45,990 --> 00:03:48,607 And so we do data transformation. 82 00:03:48,607 --> 00:03:51,800 So you could use a service like Azure Databricks 83 00:03:51,800 --> 00:03:56,090 or Azure Synapse Analytics to do data transformation. 84 00:03:56,090 --> 00:03:58,770 So, we could pull data from that Data Lake 85 00:03:58,770 --> 00:04:00,550 and we could start to manipulate 86 00:04:00,550 --> 00:04:03,000 and work all of that data together 87 00:04:03,000 --> 00:04:05,100 so that it's all similar. 88 00:04:05,100 --> 00:04:08,930 This is data transformation, and you'll also hear it called, 89 00:04:08,930 --> 00:04:10,527 like, a cleaning step or 90 00:04:12,770 --> 00:04:14,790 a processed data layer. 91 00:04:14,790 --> 00:04:15,890 That's what we're talking about. 92 00:04:15,890 --> 00:04:18,230 We're talking about making all of the data consistent 93 00:04:18,230 --> 00:04:21,600 so that it can be used for business decisions. 94 00:04:21,600 --> 00:04:23,373 So, that's our data transformation. 95 00:04:25,290 --> 00:04:28,120 And so then the question comes up, why do we care? 96 00:04:28,120 --> 00:04:28,953 Right? 97 00:04:28,953 --> 00:04:32,880 The real goal of data engineering is to give you accessible, 98 00:04:32,880 --> 00:04:36,440 clean data that's in a usable format. 99 00:04:36,440 --> 00:04:39,950 And when we do that, we empower these things on the right. 100 00:04:39,950 --> 00:04:42,820 We empower actionable business decisions. 101 00:04:42,820 --> 00:04:46,010 So, we give reports that you could actually look at and say, 102 00:04:46,010 --> 00:04:47,490 oh, I can see trends here. 103 00:04:47,490 --> 00:04:51,240 Or I can see a step that we need to do in our business.
104 00:04:51,240 --> 00:04:53,170 We can also drive machine learning 105 00:04:53,170 --> 00:04:55,960 through that and generate machine learning insights. 106 00:04:55,960 --> 00:04:57,890 So, we can give information to customers, 107 00:04:57,890 --> 00:05:01,240 such as recommended products, or things like that. 108 00:05:01,240 --> 00:05:05,890 And by doing all of this, we lower those customer barriers. 109 00:05:05,890 --> 00:05:08,210 Just like the example I just gave, 110 00:05:08,210 --> 00:05:10,430 if I'm looking to purchase something, 111 00:05:10,430 --> 00:05:12,210 my barrier is going to be lower 112 00:05:12,210 --> 00:05:14,040 if it's a 1-click purchase. 113 00:05:14,040 --> 00:05:16,670 If the system knows my information 114 00:05:16,670 --> 00:05:18,530 and can pull it back up easily, 115 00:05:18,530 --> 00:05:22,120 if the system can recommend new products to me 116 00:05:22,120 --> 00:05:23,820 that I might find interesting 117 00:05:23,820 --> 00:05:27,180 based upon my previous purchasing habits, right? 118 00:05:27,180 --> 00:05:30,410 All of those things--if you can send me a coupon, even-- 119 00:05:30,410 --> 00:05:33,260 because you know that I purchased something in the past. 120 00:05:33,260 --> 00:05:36,800 All of these solutions that are generated off of the data 121 00:05:36,800 --> 00:05:38,150 that's coming through our system 122 00:05:38,150 --> 00:05:40,410 and driven through data engineering 123 00:05:40,410 --> 00:05:42,903 can help to lower customer barriers. 124 00:05:44,660 --> 00:05:47,560 So, in this lesson, we've done a very high-level 125 00:05:47,560 --> 00:05:49,480 look at data movement. 126 00:05:49,480 --> 00:05:50,880 Hey, where's it coming from? 127 00:05:50,880 --> 00:05:52,300 Where's it going? 128 00:05:52,300 --> 00:05:54,070 We looked at data ingestion. 129 00:05:54,070 --> 00:05:57,160 So, how do we ingest data into the cloud system? 130 00:05:57,160 --> 00:05:59,270 And what does that look like?
131 00:05:59,270 --> 00:06:02,750 Data storage. We talked about the management of data storage 132 00:06:02,750 --> 00:06:05,700 and why it's important to store data in the cloud. 133 00:06:05,700 --> 00:06:08,310 And then we talked quite a bit about data transformation. 134 00:06:08,310 --> 00:06:10,430 So, the cleaning, the manipulating, 135 00:06:10,430 --> 00:06:13,850 transforming data into a usable format. 136 00:06:13,850 --> 00:06:17,040 Now, keep in mind through the entire rest of this course, 137 00:06:17,040 --> 00:06:18,970 we are going to be diving way down deep 138 00:06:18,970 --> 00:06:22,360 into all of these concepts, but data engineering-- 139 00:06:22,360 --> 00:06:24,280 generally, anything you're going to do-- 140 00:06:24,280 --> 00:06:27,500 is going to fall into one of these 4 concepts. 141 00:06:27,500 --> 00:06:30,220 So, keep that in mind and we will see you 142 00:06:30,220 --> 00:06:32,340 in the next lesson as we dive a little further 143 00:06:32,340 --> 00:06:33,823 into data engineering.