1 00:00:00,770 --> 00:00:05,080 Hey, it's Andre here with a quick little tip for you now. 2 00:00:05,120 --> 00:00:11,450 Over the next couple of videos, Daniel is going to show you his machine learning workflow, and one of 3 00:00:11,450 --> 00:00:17,810 the things that he's going to do is, well, clean the data, transform it, and do all sorts of manipulation 4 00:00:17,810 --> 00:00:18,810 to it. 5 00:00:18,950 --> 00:00:25,490 Now, I want us to take a high-level view right now, because when we're working with Jupyter notebooks 6 00:00:25,820 --> 00:00:27,560 it can get really confusing. 7 00:00:27,590 --> 00:00:33,360 It almost seems like we're doing a bunch of steps without understanding the why. 8 00:00:33,380 --> 00:00:38,600 But I want to take a step back here and understand why we do the things we're going to do over the next 9 00:00:38,600 --> 00:00:42,190 couple of videos. You see, as a data scientist, 10 00:00:42,560 --> 00:00:46,000 more data doesn't necessarily mean it's always good. 11 00:00:46,010 --> 00:00:50,710 I mean, yes, generally the more data we have, the more data we have to work with. 12 00:00:50,810 --> 00:00:55,060 But at the end of the day, we can have completely useless data. 13 00:00:55,130 --> 00:01:03,290 We only want useful data so that we can get real results, real actionable steps that we can take. And 14 00:01:03,290 --> 00:01:04,270 what we're going to see 15 00:01:04,270 --> 00:01:11,440 Daniel do is do this step before we even create our machine learning model. 16 00:01:11,510 --> 00:01:20,160 You see, we want to run this step: cleaning data, transforming data, and reducing data. So let's talk about 17 00:01:20,160 --> 00:01:20,370 that. 18 00:01:20,910 --> 00:01:23,100 Why do we need to clean data? 19 00:01:23,100 --> 00:01:30,180 Well, we want to remove and replace data because sometimes data is missing. 20 00:01:30,180 --> 00:01:30,510 Right? 21 00:01:30,510 --> 00:01:35,340 We might have tables with missing values, missing labels.
22 00:01:35,340 --> 00:01:36,650 That's not going to help us. 23 00:01:36,660 --> 00:01:44,990 And it's not going to allow us to build the right machine learning model. So we usually remove a row 24 00:01:44,990 --> 00:01:52,040 or a column that's empty or has missing fields, or we might calculate some sort of average. If we're doing 25 00:01:52,040 --> 00:01:57,470 house prices, we'll fill an empty price with maybe an average house price. 26 00:01:57,620 --> 00:02:06,020 You might even notice some outliers in your data and just remove them. Next is data transformation. 27 00:02:06,020 --> 00:02:09,680 You see, computers don't really understand concepts, right? 28 00:02:09,710 --> 00:02:17,210 Most of the time computers just understand ones and zeros, just numbers really. In order for us to teach 29 00:02:17,210 --> 00:02:19,020 computers, most of the time 30 00:02:19,020 --> 00:02:24,920 we want to transform our data into some sort of form that a computer can understand. 31 00:02:24,920 --> 00:02:31,490 So what we're going to see is Daniel actually convert some of our information into numbers. 32 00:02:31,490 --> 00:02:38,090 You might have things like the price of a house or a salary that are numbers and easy for us to create 33 00:02:38,150 --> 00:02:45,640 models with, because computers understand numbers fairly easily, but a computer might find it hard to know what 34 00:02:46,040 --> 00:02:47,290 a color is. 35 00:02:47,400 --> 00:02:49,360 What is green? What is blue? 36 00:02:49,360 --> 00:02:50,950 A computer doesn't understand that. 37 00:02:50,950 --> 00:03:00,400 Instead, we convert colors into numbers, like RGB colors, so that a computer understands. A good way 
38 00:03:00,400 --> 00:03:06,850 we transform data is usually between zeros and ones, such as: do you have heart disease or do you not? 39 00:03:07,330 --> 00:03:14,290 No heart disease is zero, heart disease is one. We have to make sure that data across the board uses 40 00:03:14,290 --> 00:03:19,590 the same units; maybe one dataset uses meters, another dataset uses yards. 41 00:03:19,690 --> 00:03:25,570 This idea of transforming data into a useful form is really, really difficult and it just comes with 42 00:03:25,570 --> 00:03:26,140 practice. 43 00:03:26,200 --> 00:03:31,180 But you're going to see how Daniel transforms data so it's more and more workable. 44 00:03:31,210 --> 00:03:34,690 So once we've cleaned the data, once we've transformed the data, 45 00:03:34,690 --> 00:03:44,660 next is to reduce our data. Why reduce our data? Don't we want as much data as possible? 46 00:03:44,670 --> 00:03:50,560 Well, theoretically yes, but remember, everything costs money. 47 00:03:50,580 --> 00:03:53,950 The more data we have, the more CPU, 48 00:03:54,150 --> 00:03:58,470 the more energy it takes for us to run our computation. 49 00:03:58,470 --> 00:04:07,380 So if we're able to get the same result on less data, that actually saves us money. It saves companies money, 50 00:04:07,500 --> 00:04:13,220 especially big companies like Google that have a lot of data and run a lot of CPU. 51 00:04:13,260 --> 00:04:21,400 Sometimes the idea of data reduction can also be called dimensionality reduction or column reduction. 52 00:04:21,400 --> 00:04:27,790 You can have many columns in your table, but it makes things really, really slow, so you can actually remove 53 00:04:27,790 --> 00:04:33,730 some columns that maybe you find irrelevant and won't be needed for your machine learning model. 54 00:04:33,820 --> 00:04:39,060 Again, these are things that Daniel is going to go through, but I want you to keep this in mind. 
55 00:04:39,130 --> 00:04:45,130 This idea that, as a data scientist, you can't just assume all the data you have is automatically going to 56 00:04:45,130 --> 00:04:45,730 be perfect. 57 00:04:45,730 --> 00:04:46,730 You have to clean it. 58 00:04:46,780 --> 00:04:48,060 You have to transform it. 59 00:04:48,070 --> 00:04:49,400 You have to reduce it. 60 00:04:49,420 --> 00:04:57,610 This idea of wrangling data, of moving data around into a form that is useful to you, is part of the skill. 61 00:04:57,610 --> 00:05:00,540 So watch Daniel do that and learn along the way. 62 00:05:00,940 --> 00:05:02,380 I'll see you in the next one. 63 00:05:02,380 --> 00:05:02,650 Bye bye.
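
To make the three steps from this video concrete, here is a minimal sketch in plain Python of cleaning, transforming, and reducing a tiny table. Everything in it is an assumption made up for illustration: the `houses` rows, the column names, the `RGB` lookup, and the helper functions `clean`, `transform`, and `reduce_columns`. It is not the workflow Daniel shows in the following videos, just one possible way to picture the ideas.

```python
from statistics import mean

YARDS_TO_METERS = 0.9144  # one international yard is exactly 0.9144 meters
RGB = {"green": (0, 128, 0), "blue": (0, 0, 255)}  # illustrative color lookup

# Hypothetical toy data: all names and values are invented for this sketch.
houses = [
    {"price": 300000.0, "lot_length": 20.0, "unit": "meters",
     "color": "green", "sold": "yes", "note": "corner lot"},
    {"price": None, "lot_length": 30.0, "unit": "yards",
     "color": "blue", "sold": "no", "note": "needs paint"},
    {"price": 500000.0, "lot_length": 25.0, "unit": "meters",
     "color": "blue", "sold": "yes", "note": "new roof"},
    {"price": None, "lot_length": None, "unit": None,
     "color": None, "sold": None, "note": None},
]


def clean(rows):
    """Drop fully empty rows, then fill a missing price with the average price."""
    kept = [r for r in rows if any(v is not None for v in r.values())]
    average_price = mean(r["price"] for r in kept if r["price"] is not None)
    return [
        {**r, "price": average_price if r["price"] is None else r["price"]}
        for r in kept
    ]


def transform(rows):
    """Turn every field into a number and put lengths in the same unit."""
    out = []
    for r in rows:
        length = r["lot_length"]
        if r["unit"] == "yards":
            length = length * YARDS_TO_METERS  # same units across the board
        red, green, blue = RGB[r["color"]]  # color name -> three numbers
        out.append({
            "price": r["price"],
            "lot_length_m": length,
            "color_r": red,
            "color_g": green,
            "color_b": blue,
            "sold": 1 if r["sold"] == "yes" else 0,  # yes/no -> 1/0
            "note": r["note"],
        })
    return out


def reduce_columns(rows, drop):
    """Remove columns judged irrelevant for the model."""
    return [{k: v for k, v in r.items() if k not in drop} for r in rows]


prepared = reduce_columns(transform(clean(houses)), drop={"note"})
for row in prepared:
    print(row)
```

The order matters in this sketch: the empty row is removed before the average is computed, and the free-text `note` column is dropped last, once every remaining field is numeric. Filling with the mean and dropping a column by hand are only the simplest choices; other fill strategies or reduction techniques may suit real data better.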