Okay, so now let's talk about another architecture: a Big Data Ingestion Pipeline. Ideally, we want this ingestion pipeline to be fully serverless and fully managed by AWS. We want to collect data in real time, transform the data, query the transformed data using SQL, and store the reports created by these queries in S3. Then we want to load that data into a data warehouse and create dashboards on it. So overall, these are the usual big data problems: ingestion, collection, transformation, querying, and analysis.

So how do we do this? There are some technologies we may not have seen directly in this course, but that's okay, I'll introduce them as part of this pipeline because they really help.

Let's assume the producers of data are IoT devices. There is a really cool service in AWS called IoT Core, and IoT Core helps you manage these IoT devices, so remember this if you go into the exam. These devices can send data in real time to IoT Core, and IoT Core can send it directly into a Kinesis Data Stream. Kinesis Data Streams, remember, allows us to pipe big data in real time, very fast, into the Kinesis service.

Now Kinesis can be connected to Kinesis Data Firehose, and Firehose allows us to, for example, every one minute, offload data into an Amazon S3 bucket, which will be our ingestion bucket. So what we've done here is build a whole pipeline to get a lot of data from a lot of devices in real time and put it, every minute, into an S3 bucket. On top of that, it's possible for us to cleanse or quickly transform the data using an AWS Lambda function that is directly linked to Kinesis Data Firehose.
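To make that Firehose transformation step concrete, here is a minimal sketch of what such a Lambda function could look like. Firehose invokes the function with a batch of base64-encoded records and expects each record back with the same recordId and a result status; the cleansing logic shown here (trimming whitespace and newline-delimiting records) is just a hypothetical example.

```python
import base64

def lambda_handler(event, context):
    """Data transformation Lambda for Kinesis Data Firehose.

    Firehose passes a batch of records; each must be returned with
    its original recordId, a result status, and base64-encoded data.
    """
    output = []
    for record in event["records"]:
        # Incoming record data is base64-encoded by Firehose
        payload = base64.b64decode(record["data"]).decode("utf-8")

        # Hypothetical cleansing step: trim whitespace and append a
        # newline so records land in S3 one per line
        cleaned = payload.strip() + "\n"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(cleaned.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```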
So now we have that ingestion bucket, what can we do with it? Well, for example, we can have it trigger an SQS queue, and maybe the SQS queue can trigger an AWS Lambda function. The SQS queue is optional, because Lambda can be triggered directly by our S3 bucket, but I just wanted to show you the possibility of invoking SQS on this slide.

Lambda will then trigger an Amazon Athena SQL query, and this Athena query will pull data from the ingestion bucket. It's all serverless, and the output of this serverless query will go into a reporting bucket, again in Amazon S3 but as a different bucket.

From this, we have data that has been cleansed, analyzed, and reported on. We can either visualize it directly using QuickSight, which is a way for us to visualize data sitting in an Amazon S3 bucket, or we can load our data into a proper data warehouse for analytics, such as Amazon Redshift. Please note that Redshift is not serverless, and this Redshift data warehouse can also serve as an endpoint for QuickSight.

This shows you, overall, what you can expect from a Big Data Ingestion Pipeline at a high level, including real-time ingestion, transformation, serverless Lambda, data warehousing using Redshift, and visualization using QuickSight.

So, let's discuss this pipeline. IoT Core allows you to harvest data from many IoT devices. Kinesis is great for real-time data collection, and Firehose helps you with data delivery to S3 in near real time; one minute is the lowest buffer interval you can choose. Lambda can help Firehose with data transformation, and then Amazon S3 can trigger notifications to SQS, SNS, or Lambda. Lambda can subscribe to SQS, but we could, as I've said, connect S3 directly to Lambda. Athena is a serverless SQL service, and we can store the results of Athena directly back into S3. The reporting bucket contains the analyzed data, and we can use reporting tools such as QuickSight for visualization, or Redshift if we want to do more analytics on it.

So that's it for the Big Data Ingestion Pipeline at a high level, from a solutions architecture perspective. It's really good to know how these things work together.
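Here is a minimal sketch of the Lambda function that kicks off the Athena step. The database, table, and bucket names are hypothetical placeholders; the function simply starts a query whose results Athena writes back to the reporting bucket in S3.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names -- substitute your own database, table, and bucket
DATABASE = "iot_ingestion"
REPORTING_BUCKET = "s3://my-reporting-bucket/athena-results/"

def lambda_handler(event, context):
    """Triggered by the S3/SQS notification; starts a serverless
    Athena SQL query whose output lands in the reporting bucket."""
    response = athena.start_query_execution(
        QueryString="""
            SELECT device_id, avg(temperature) AS avg_temp
            FROM sensor_readings
            GROUP BY device_id
        """,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": REPORTING_BUCKET},
    )
    # Athena runs asynchronously; the returned execution id can be
    # used to poll for completion with get_query_execution
    return {"queryExecutionId": response["QueryExecutionId"]}
```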
I hope you liked this, and I will see you in the next lecture.