1 00:00:00,820 --> 00:00:02,150 Hey, Cloud Gurus. 2 00:00:02,150 --> 00:00:03,110 Welcome to our lesson 3 00:00:03,110 --> 00:00:05,853 on normalizing and denormalizing values. 4 00:00:07,410 --> 00:00:08,243 In this lesson, 5 00:00:08,243 --> 00:00:11,550 we're going to do a quick overview of what normalization is, 6 00:00:11,550 --> 00:00:14,490 jump over and show a demo in Azure Data Factory, 7 00:00:14,490 --> 00:00:16,303 and then wrap up with a review. 8 00:00:19,330 --> 00:00:22,800 Normalization is a common practice when analyzing data, 9 00:00:22,800 --> 00:00:25,300 and the goal of normalization is to change the values 10 00:00:25,300 --> 00:00:28,920 of numeric columns in the dataset to use a common scale, 11 00:00:28,920 --> 00:00:32,110 without distorting differences in the ranges of values 12 00:00:32,110 --> 00:00:34,190 or losing information. 13 00:00:34,190 --> 00:00:37,013 Or, put another way, we're scaling the values 14 00:00:37,875 --> 00:00:39,370 so that the mean of all values is 0 15 00:00:39,370 --> 00:00:41,253 and the standard deviation is 1. 16 00:00:42,221 --> 00:00:43,747 And this helps greatly 17 00:00:43,747 --> 00:00:46,410 if you have, say, 2 columns with vastly different scales, 18 00:00:46,410 --> 00:00:48,050 maybe one in the 10s or 100s 19 00:00:48,050 --> 00:00:50,810 and another in the 1000s or 1,000,000s, 20 00:00:50,810 --> 00:00:53,300 because that can cause problems with your data work, 21 00:00:53,300 --> 00:00:56,090 particularly if you're combining a couple of columns 22 00:00:56,090 --> 00:00:58,023 as features for data modeling. 23 00:00:59,000 --> 00:01:02,070 You'll really see this used a lot in machine learning. 24 00:01:02,070 --> 00:01:05,520 Some algorithms require data to be normalized ahead of time. 25 00:01:05,520 --> 00:01:08,840 Others don't, but this helps put our data in a better form 26 00:01:08,840 --> 00:01:12,540 for machine learning and other applications to use it.
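The scaling described here, mean 0 and standard deviation 1, is commonly called z-score normalization. As a rough sketch, it can be written as the formula below. The symbols $x_i$, $\mu$, $\sigma$, $z_i$, and $n$ are introduced only for illustration, and the population form of the standard deviation is an assumption; the lesson does not say which form is used.

```latex
z_i = \frac{x_i - \mu}{\sigma},
\qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \mu\right)^2}
```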
27 00:01:12,540 --> 00:01:15,193 And of course, denormalizing is just the reverse. 28 00:01:17,220 --> 00:01:18,930 So with that quick definition in mind, 29 00:01:18,930 --> 00:01:20,980 let's jump over to the Azure portal 30 00:01:20,980 --> 00:01:23,353 and take a look at this in Azure Data Factory. 31 00:01:25,580 --> 00:01:27,800 Here we are in Azure Data Factory Studio. 32 00:01:27,800 --> 00:01:30,920 And since this is a little bit more of a complex data flow, 33 00:01:30,920 --> 00:01:33,950 I've gone ahead and set it up for the sake of time. 34 00:01:33,950 --> 00:01:37,400 Basically, what we have here is 3 different sources, 35 00:01:37,400 --> 00:01:39,870 all pulling from the same products table 36 00:01:39,870 --> 00:01:42,140 in order to normalize the values. 37 00:01:42,140 --> 00:01:47,140 We're first taking the products and using a SQL query, 38 00:01:47,350 --> 00:01:51,300 under Source Options, to find the standard deviation 39 00:01:51,300 --> 00:01:53,313 of the standard cost column. 40 00:01:54,680 --> 00:01:58,853 We're then caching that in a cache sink. 41 00:01:59,690 --> 00:02:01,560 Similarly, for the average, 42 00:02:01,560 --> 00:02:04,920 we're using a Source Options query to find the average, 43 00:02:04,920 --> 00:02:08,210 and then storing that in a cache sink as well. 44 00:02:08,210 --> 00:02:11,290 Finally, we're taking the same product information 45 00:02:11,290 --> 00:02:13,800 and using a derived column. 46 00:02:13,800 --> 00:02:15,846 Within this derived column, 47 00:02:15,846 --> 00:02:17,880 we're utilizing all that we've put together so far 48 00:02:17,880 --> 00:02:19,463 in this expression. 49 00:02:20,330 --> 00:02:21,860 Now I'll open up the expression builder, 50 00:02:21,860 --> 00:02:24,050 so you can see it a little bit better. 51 00:02:24,050 --> 00:02:25,210 But we're using this formula 52 00:02:25,210 --> 00:02:27,840 to find those normalized values.
53 00:02:27,840 --> 00:02:30,230 We're taking our standard cost column 54 00:02:30,230 --> 00:02:34,600 and subtracting the stored standard cost average. 55 00:02:34,600 --> 00:02:36,088 And you can see, 56 00:02:36,088 --> 00:02:39,373 this is how we reference the information in that cache sink. 57 00:02:40,700 --> 00:02:42,920 We're then taking the result of that 58 00:02:42,920 --> 00:02:46,630 and dividing by the standard cost deviation, 59 00:02:46,630 --> 00:02:50,083 again, referencing one of those cache sinks. 60 00:02:51,360 --> 00:02:55,313 And altogether, this formula produces our normalized values. 61 00:02:58,150 --> 00:02:59,955 To take a look at this, 62 00:02:59,955 --> 00:03:02,460 we can come over to the Data Preview tab. 63 00:03:02,460 --> 00:03:03,293 And you can see 64 00:03:03,293 --> 00:03:06,510 that our standard cost doesn't look like a cost at all. 65 00:03:06,510 --> 00:03:10,790 Instead, we have values on the scale of 0s and 1s, 66 00:03:10,790 --> 00:03:13,730 allowing us to easily see how many standard deviations each value is from the mean 67 00:03:13,730 --> 00:03:15,520 and making things much easier to work with 68 00:03:15,520 --> 00:03:16,743 in a standardized way. 69 00:03:19,010 --> 00:03:22,600 By way of review, normalizing values scales them 70 00:03:22,600 --> 00:03:26,450 so that the mean is 0 and the standard deviation is 1. 71 00:03:26,450 --> 00:03:28,580 So if the normalized value is greater than 0, 72 00:03:28,580 --> 00:03:30,970 it means that it's higher than the mean. 73 00:03:30,970 --> 00:03:33,440 Likewise, if the normalized value is less than 0, 74 00:03:33,440 --> 00:03:35,770 then it's lower than the mean. 75 00:03:35,770 --> 00:03:37,580 And so, it allows you to quickly see 76 00:03:37,580 --> 00:03:40,410 how many standard deviations the original data point 77 00:03:40,410 --> 00:03:41,603 is from the mean.
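The formula walked through in the demo (subtract the cached average, then divide by the cached standard deviation) can be sketched outside Data Factory in a few lines of Python. This is only an illustration under stated assumptions, not the data flow expression from the demo: the function names and the sample costs are hypothetical, and the population standard deviation is assumed because the lesson does not say which form the SQL query uses.

```python
from statistics import fmean, pstdev


def normalize(values):
    """Scale values so their mean is 0 and their standard deviation is 1.

    Returns the scaled values plus the mean and standard deviation,
    which play the role of the two cached aggregates in the demo.
    """
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("cannot normalize a column with zero spread")
    return [(v - mean) / std for v in values], mean, std


def denormalize(scaled, mean, std):
    """Reverse the scaling: multiply by the deviation, then add the mean."""
    return [z * std + mean for z in scaled]


# Hypothetical standard cost values, not taken from the lesson's dataset.
standard_costs = [10.0, 20.0, 30.0, 40.0, 50.0]

scaled_costs, cost_mean, cost_std = normalize(standard_costs)
restored_costs = denormalize(scaled_costs, cost_mean, cost_std)

for cost, z in zip(standard_costs, scaled_costs):
    # A positive z is above the mean, a negative z is below it.
    print(f"{cost:>6.2f} -> {z:+.3f}")
```

Keeping the mean and standard deviation alongside the scaled values is what makes denormalizing possible later, much as the demo keeps both aggregates in cache sinks.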
78 00:03:42,500 --> 00:03:45,500 You can use mapping data flows in Azure Data Factory, 79 00:03:45,500 --> 00:03:47,730 or modules in Azure Machine Learning, 80 00:03:47,730 --> 00:03:49,593 to accomplish this normalization. 81 00:03:50,770 --> 00:03:53,720 And this is particularly helpful in machine learning, 82 00:03:53,720 --> 00:03:56,340 ensuring that features have similar ranges of values 83 00:03:56,340 --> 00:03:57,173 to work with. 84 00:03:58,210 --> 00:03:59,320 That's it, Gurus. 85 00:03:59,320 --> 00:04:00,760 I hope this helps give you an idea 86 00:04:00,760 --> 00:04:02,850 of how to work with normalization. 87 00:04:02,850 --> 00:04:05,400 When you are ready, I'll see you in the next video.