Welcome back. In the last video we finished off by checking out the correlation matrix, and we looked at our workflow and saw that we're now up to step five. So we've covered the problem definition, the data, the evaluation metric, we've looked at the features and we've done a little bit of data analysis. Now it's time to start bringing in machine learning. So let's see how we do that. Let's go in here, we'll create a little heading, we might do it as "5.0 Modelling", and we're going to turn this into markdown. And so, what's our problem? Let's scroll back up. We defined it up here, right back at the start, so problem definition is step 1. So in a statement: given clinical parameters about a patient, can we predict whether or not they have heart disease? So we're working with a classification problem. Okay, that's what we're trying to answer with machine learning. And evaluation, so this is our evaluation metric, because right now we've got a dataset and we're just exploring it, we're experimenting here. So if we wanted to actually get this into production, into a real-life setting, we'd probably want it to be fairly good, right? Because it's predicting something as important as whether or not someone has heart disease.
So we want a minimum of 95 per cent accuracy at predicting whether or not a patient has heart disease during our proof of concept, which is kind of what we're working through here. And if we can make it to that, we'll keep pursuing it, we'll keep going, see if we can improve it, make it better, or see what we need to improve. So this is what we're going to be trying to solve with machine learning, we're trying to answer this problem statement, and this is our evaluation metric, something that we'll be working towards. And of course, each of these could change, but they just give us a little roadmap, something we can head towards. So let's go down and see what we would do. The first thing we might do is have another look at our dataset and remind ourselves of what we're doing. So we're going to be using these columns to try and predict the target column, a.k.a. using the independent variables to predict the dependent variable here. And the first thing we might do is split our data into X and y, so we'll split it into features and labels, and then what we might do is create a training and test split, so modelling what we did in the scikit-learn section when we worked through the scikit-learn machine learning workflow. So let's do that. So let's go: split data into X and y.
X is going to be equal to every column except the target column. That's what we want, so df.drop, we're going to drop the target, and we're going to set the axis equal to 1. Beautiful. And then y is going to be the labels, so we're going to go df, and we can just go "target". There we go. Now let's visualise X. Wonderful, no target column. And we'll visualise y to remind ourselves it's just 0 or 1, whether or not someone has heart disease, a binary classification, because there are only two options, binary meaning 0 or 1. And now what we have to do is split our data into a training and test split. Now we've seen this before, but it's worth reiterating: why not use all the data to train a model? What is a train and test split? Well, let's say you wanted to take your model into the hospital and start using it on patients. How would you know how well your model goes on a new patient not included in the original full dataset? So if we built a model on every single sample in this DataFrame, how would we have an insight into how it would go on someone our model has never seen before? This is where the test set comes in. It's used to mimic taking our model into a real environment, or at least it tries to as much as possible. It's important to never let our model learn from the test set; it should only be evaluated on it.
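As a quick sketch of the X/y split described above (using a tiny made-up DataFrame in place of the real 303-row heart disease one, which has the same `target` column):

```python
import pandas as pd

# Small synthetic stand-in for the heart disease DataFrame
# (the real df has 303 rows and many more clinical columns).
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

# Features (X): every column except target. axis=1 means drop a column, not a row.
X = df.drop("target", axis=1)

# Labels (y): just the target column, 0 or 1.
y = df["target"]
```

Note `df.drop` returns a new DataFrame; the original `df` still keeps its `target` column.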
So we're going to have a training set, which is like when you're studying for an exam, it's like the course materials. So the machine learning model is going to learn the patterns in the course materials, and then it's going to be evaluated on the final exam, which is the test set. So what we're going to do is split our data into train and test sets. To do so, we can use NumPy's random seed first, so we can reproduce our results, and we can split our data into train and test sets using scikit-learn's train_test_split. So let's go: split into train and test set, X_train, X_test... We saw a lot of this in the scikit-learn section, but it's paramount to reiterate that whenever you evaluate your model, it should be on data the model has never seen before. For the test size we'll do the standard 0.2. So this X is what we've created up here, our independent variables, and this is y, which is our labels. So now we're going to press Shift and Enter, wonderful, and we're going to have a look at the training data. So we've got X_train, we can see that it's shuffled it, and it's got 242 rows out of the total 303 rows, so it's got about 80 per cent of the entire rows.
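The seeding and splitting step above might look something like this (again with synthetic stand-in data of the same 303-row shape, since the real heart disease DataFrame isn't reproduced here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the heart disease dataset
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(303, 3)), columns=["age", "chol", "thalach"])
df["target"] = rng.integers(0, 2, size=303)

X = df.drop("target", axis=1)   # features
y = df["target"]                # labels

np.random.seed(42)  # seed NumPy so the shuffle (and later results) are reproducible

# 20% of the rows go to the test set, the rest to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

With 303 rows and `test_size=0.2`, scikit-learn rounds the test set up to 61 rows, leaving 242 training rows, which matches the roughly 80/20 split mentioned above.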
Now we can do the same with y_train, and we might find the len of y_train as well. y_train is the same length as X_train, but it's only got the labels. And so if we compared the indexes here, it's got the labels for each of these samples: 1, 3, 2, 1, 3, 2. Wonderful. Now, what do we do? Well, now we've got our data split into X and y, and we've got train and test as well, so we've got training data and we've got testing data. The next thing is to build a machine learning model, and we're working with classification. This comes to the point where it's like, what model should we use? Now we've got our data split into training and test sets, it's time to build a machine learning model. We'll train it, so find the patterns, on the training set, and we'll test it, so use the patterns it's found, on the test set. But what machine learning model should we use? Well, if you remember, in the scikit-learn section we checked out the scikit-learn machine learning map, so we're going to go in here. Wonderful. So this is our start. So we follow this through, we've seen this before: we've got a classification problem, we're trying to classify whether or not someone has heart disease based on their health parameters, based on their medical parameters.
So what we might do is only try a few of these. You could try all of them, but what we're going to do is use KNeighborsClassifier, we'll use this one, so we'll open that in a new tab. And then we're also going to use our trusty random forest, so we'll open that as well. So, ensemble methods, is a random forest in here? Random forest... RandomForestClassifier, that's what we're after. Yes, beautiful. And here's how we got here: the nearest neighbors. Wonderful. So if we looked through that, we'd see how we could use it, but we're going to see it in practice. But there's one more. If we go back here, back to our machine learning map, it's not listed on here. I wonder why that is. Well, we're going to see in the next video what it is. I'll give you a little clue: it's called logistic regression. And you might be thinking, hey, we're working on a classification problem, logistic regression has the word regression in it, so why isn't it in this classification part? Well, that's a great question, because I'm not really sure either, but we're going to have a look at those three models in the next video. So we'll create them and we'll see how each of them goes, because remember, we're experimenting now.
We're now in the modelling slash experimentation phase, and one part of the experimentation phase is trying different machine learning models, so that's what we're going to do. We're going to try three different machine learning models and see how they compare with their results on the test set, and then of course we want the best one. So let's do that.
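A minimal sketch of how comparing the three models might look, putting them in a dictionary and looping over it (the model names and synthetic data here are assumptions; the actual comparison happens in the next video with the real heart disease data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart disease features and labels
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(303, 5)),
                 columns=[f"feature_{i}" for i in range(5)])
y = pd.Series(rng.integers(0, 2, size=303))

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# The three candidate models in a dict so we can try them all the same way
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
}

model_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # find patterns in the training set
    model_scores[name] = model.score(X_test, y_test)  # mean accuracy on unseen test data
```

The dict-plus-loop pattern keeps the experiment fair: every model is fit on exactly the same training set and scored on exactly the same held-out test set.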