Welcome back. In the last video we saw an end-to-end scikit-learn workflow. Now we're going to break that down and jump into each of those steps individually. The first one is getting the data ready. You probably noticed I've defined the steps as a list, just because I'm a bit of a nerd, so that way we don't have to scroll right back up to the top to see what we're covering. We can use this cool little trick wherever we want in the Jupyter notebook, because we've run that cell and instantiated it as a list variable. So we're going to tick off getting the data ready in this section.

Another thing we probably should have done right at the top, and which we can usually do in any machine learning notebook we're working on, is our standard imports, just so we have them ready in our arsenal. We've already seen some of these: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt. And then we'll do %matplotlib inline so our plots appear in the notebook. Those are three or four lines of code you can run at the top of most notebooks, and as we start to use different scikit-learn functions throughout, we could put those up the top too, but we'll leave it at those for now.

So let's go down to where we were. In this section we're getting our data ready. The reason we have to do so is because most of the time, data doesn't come ready to be used with a scikit-learn machine learning model, so we have to get it ready. We'll give it a little heading: getting data ready to be used with machine learning, and then the three main things that we'll have to do. Number one is... let's communicate a bit better, Daniel. The three main things we have to do are: one, split the data into features and labels, usually called X and y. That's generally what you'll find them called if you're on the internet somewhere, and the scikit-learn library usually calls features X and labels y. Number two is filling, also called imputing, or disregarding missing values. So if any of the rows in our dataset have missing values, if there are any incomplete fields, we may have to fill them or we may have to get rid of those samples completely, because a machine learning model can't learn when there's nothing there — you'll see it throws an error. Then the final one is converting non-numerical values to numerical values, also called feature encoding.
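Before we move on, here's a minimal sketch of the standard imports mentioned above — assuming we're working in a Jupyter notebook, since %matplotlib inline is a notebook-only magic command:

```python
# Standard imports for most machine learning notebooks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Jupyter-only magic so plots render inside the notebook
%matplotlib inline
```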
As an example of that last one: say we have our car sales data — we need to turn that cell into markdown, beautiful. So if we have our car sales data and we have Toyota, we have the colour, et cetera, a machine learning model can't understand "Toyota". What have I done there? It can't understand "Honda", it can't understand "Red". We have to turn these into numbers, and we'll see how to do that in this section.

All right. Well, we need a dataset to begin with first, and I think we still have our heart disease dataset imported, so let's check that. Go ahead. Wonderful. So we're on step one first. We've seen this before in our workflow, but we'll just get a succinct version of how to actually do it. In this case we want to use the feature columns to predict y. So what are we doing here? We're keeping it nice and simple, using pandas: X equals heart_disease.drop, because we want to remove the target column along axis=1. And remember, in a pandas DataFrame, axis=1 means this axis here, the columns axis, and axis=0 is the rows axis. So X is now going to be every single column except for target. Wonderful. And you know what our y is going to be, because that's the labels of our machine learning problem: our y is going to be the target column of heart_disease, and we'll select it like that. We'll look at y.head() — there are our first five samples. Wonderful. If we go back up here — 1, 1, 1, 1 — you can see that's our y. Beautiful.

Now the next thing we have to do is split the data into training and test sets. In machine learning, one of the most fundamental principles is to never evaluate or test your models on data they have learned from, which is why we split it into training and test sets, and scikit-learn has a convenient function for allowing us to do that. So: split the data into training and test sets. Remember right back at the start, when we were going over concepts: looking at the test data is like looking at the final exam before you've looked at the practice exam — not what you want to be doing, right? If the professor accidentally leaked the final exam, everyone would be getting perfect marks and no one would actually be learning anything.
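Here's a minimal sketch of the feature/label split just described — assuming the heart disease data lives in a CSV with a "target" column, as in the earlier videos (the file name here is an assumption; point it at your own copy):

```python
import pandas as pd

# Assumed file name from earlier in the course; adjust to wherever your CSV lives
heart_disease = pd.read_csv("heart-disease.csv")

# Features: every column except the label we're trying to predict
X = heart_disease.drop("target", axis=1)  # axis=1 drops a column, axis=0 would drop a row

# Labels: just the target column
y = heart_disease["target"]

X.head(), y.head()  # peek at the first five samples of each
```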
So we want to type from sklearn.model_selection — scikit-learn has a fair few different modules, but we'll see some of the most useful ones — import train_test_split. Beautiful. Now, when we call train_test_split, it's going to return four different values: one is X_train, one is X_test, one is y_train and one is y_test. So let's see what happens. train_test_split — and if we check what it actually does, it says "split arrays or matrices into random train and test subsets". Beautiful, that's what we want. We want some data to train on, and we want some data to evaluate our machine learning models on. So we'll pass it our features, we'll pass it our labels, and we'll define the test_size as being 0.2.

So let's see what happens. Wonderful, that runs smoothly. Let's check out the shapes of our new matrices, because remember, our data here is really just a matrix in a DataFrame — a NumPy array in a DataFrame. So we'll look at X_train.shape, X_test.shape, y_train.shape and, finally, y_test.shape. Let's see this. Okay, so 242 rows with 13 columns — so 242 by 13, and 61 by 13. Okay, this makes sense. If we look up here: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen — thirteen different features. That's why our X_train variable is 242 by 13. But where did this 242 come from? Well, that's because we've decided here that we want our test dataset to be 20 per cent of the overall data. So let's have a look at this. If we go X.shape — because X is from before we split it, right, we've got X up here — X.shape is 303, so we have 303 samples in total. Let's just check that: 303 samples total, and 80 per cent of them are going to be training data for the machine learning model. If we take that and multiply it by 0.8 we get 242.4, and if we add these together, 242 plus 61 is 303. All it's done is rounded it down and automatically carved the other 20 per cent off into the test set, and it's done the same for y_train and y_test. Beautiful.
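As a quick sketch, the split and the shape checks might look like this, using the X and y from above (test_size=0.2 holds out 20 per cent of the samples for testing):

```python
from sklearn.model_selection import train_test_split

# Split features and labels into training (~80%) and test (~20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# With 303 samples this comes out to roughly 242 training rows and 61 test rows
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

One thing to keep in mind: train_test_split shuffles the data randomly, so the exact rows that end up in each split will differ between runs unless you pass a random_state value.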
Now we've split our data into training and test sets, the next step — we'll go back up here — is to fill the data or convert it, making sure it's all numerical. So which should we do first? Maybe we fill it first. Yeah, that's a good idea. We'll come back and do that in the next video. So maybe have a go at playing around with one of our CSV files and splitting it into training and test data, or splitting it into X and y first, and then change around this test_size parameter here to see what happens when we change it to 0.3. What do you think will happen if I press shift and enter here? The numbers will change. Maybe you want to play around with a different fraction. Otherwise, I'll see you in the next video and we'll look at how to fill some missing data.
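If you want to try that experiment before the next video, here's a small sketch, again assuming the X and y from above — changing test_size just changes how many rows land in each split:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing instead of 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# With 303 samples, expect roughly 212 training rows and 91 test rows
X_train.shape, X_test.shape
```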