All right. I'm very excited. It's time to start modelling, time to start using machine learning to figure out whether or not we can classify whether someone has heart disease based on their health parameters.

In the previous video we said we were going to try out three machine learning models, so let's write them out here. We're going to try three different machine learning models: one, Logistic Regression; two, K-Nearest Neighbors classifier; and three, Random Forest classifier.

If you're wondering how we came up with these three, it's because we went through this diagram, the Scikit-Learn machine learning map, and followed it through based on our data. Actually, a good question you might be asking is, "Hey, I don't see Logistic Regression on the map." We kind of covered this before: Logistic Regression doesn't really seem to make sense at first, since it's got "regression" right there in the name, and we're no longer trying to predict a number, we're trying to predict a category. So why are we using it for classification? I was pretty stumped too when I found out that we could use Logistic Regression for classification.

You might be thinking even further back: how did you even find Logistic Regression if it's not on the map? Well, I was a bit confused by this too. I found it after searching for "sklearn logistic regression". And you might be wondering, what if I didn't even know to search for "sklearn logistic regression"? Well, the way I stumbled upon it is that I just searched something like "machine learning models used for classification problems", and someone suggested Logistic Regression. So I decided to search for Logistic Regression, and I found that Scikit-Learn had an implementation. I could of course build it from scratch, but I'm more of a practitioner: I want to apply models that someone else has already built.

Then I read through the docs, and maybe it wasn't on the LogisticRegression page itself, maybe it was in the user guide, I think that's where it was. Logistic regression, here we go: "Logistic regression, despite its name, is a linear model for classification rather than regression." So this line clarifies the confusion in the name. And of course you could read through all of this; there's even some math here on how it's actually implemented.

But the reason we're trying it is because we're pretending that we've tried LinearSVC (we actually have, in a previous Scikit-Learn video) and saying that it's not working. So we've followed the map through: not working, not working with text data, so we're going to try the K-Nearest Neighbors classifier and an ensemble classifier. That's how we've deduced these three different models. And why Logistic Regression?
Well, I've decided to throw it in here because it's not listed on the map and we want to check it out anyway; even if something isn't on the standard list of algorithms to try, we might still try it and figure it out for ourselves.

And again, if this process seems like it's not really structured at all, like you're asking, "Wait, how did Daniel find that machine learning model? How would I even think of it?", it just came from searching something like this. Once we've defined that our problem is classification, if you're trying to figure out which machine learning model to use, it's something as simple as that: you can search for it and explore.

And if we go back to our keynote here, that's exactly what we're doing. We're up to experiments: we're trying different machine learning models. That's enough talking, actually. Well, one more thing, because this is what I want to encourage you to do: because there are so many different ways to do things in machine learning, a lot of it is about just searching for the answer, or not even the answer, just asking questions and trying to implement them. That's what we're going to do now.
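To see for yourself that Scikit-Learn's LogisticRegression really does predict categories despite its name, here's a tiny sketch (the feature values and labels below are made up purely for illustration):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: two features per sample, binary class labels (hypothetical numbers)
X = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.2]]
y = [0, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X, y)

# Despite the "regression" in its name, .predict() returns a class label,
# not a continuous value
prediction = clf.predict([[0.05, 0.95]])
```

Under the hood it fits a linear model to the log-odds of each class, which is where the "regression" part of the name comes from.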
So, because we want to build three different models, and because we want to evaluate them with a couple of experiments to see which one is best, we want to train them on the training data and test them on the test data. What we might do is set up a little dictionary with our models in it, and then create a function to fit and score those models, so that rather than rewriting the same code for fitting and evaluating each one, we set it all up in a single function. Yeah, that's a good idea, Daniel. Let's do that.

All right, so: put models in a dictionary. We're going to write models equals, then "Logistic Regression" as a key with LogisticRegression as its value, then "KNN" with KNeighborsClassifier (we can probably press Tab to autocomplete, let's be real), and then "Random Forest" with our trusty RandomForestClassifier. Beautiful.

And then let's create a function. We want to fit and score the models, so we need a function to train our models on the training data and then evaluate them on the test data: a fit-and-score function.
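The dictionary just described might look like this in code (a sketch; the key names are simply the labels used in this walkthrough):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Put models in a dictionary: name -> instantiated (but not yet fitted) model
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}
```

Keeping the models in a dictionary means we can loop over all of them with the same fit-and-score code instead of repeating it three times.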
def fit_and_score, something simple. It'll need to take our dictionary of models, and then X_train, X_test, y_train, y_test. And we can put a little docstring here just to make sure our code is legible: if someone else were to use this function, what would it do? So: "Fits and evaluates given machine learning models." Then we'll describe the parameters: models is a dict of different Scikit-Learn machine learning models; X_train is the training data (no labels); X_test is the testing data (no labels); y_train is the training labels; and y_test is the test labels. Beautiful.

Now what should we do? We might set a random seed, even though it's within the function, to make sure that our results are reproducible. And then we'll make a list to keep the model scores in. Actually, this is better: make a dictionary. That's what we want, because we're working with dictionaries. So model_scores equals an empty dictionary, because we're going to fill it up in a second; we'll talk through this function as we write it. Then, loop through the models. This is where we write "for name, model", so name is the key and model is the value of our dictionary.
So: for name, model in models.items(), accessing the key-value pairs of the dictionary with .items(). Then we fit the model to the data: model.fit(X_train, y_train). So for each model, we're fitting it to the training data. Then we want to evaluate the model and append its score to model_scores. This is how we're going to evaluate each model in one hit: we save the model's name to our dictionary as the key, and its score as the value. So model_scores[name], which creates a key in the (initially empty) model_scores dictionary, equals model.score(X_test, y_test). For example, if the model were Logistic Regression, we've just fit it with its fit(X_train, y_train), and now we're appending its score(X_test, y_test) to model_scores under the name "Logistic Regression".

And if we've gone through that and you're thinking, well, that's a fair few steps, it's really only about two or three. When I first worked with these kinds of functions I wasn't used to it either; I was used to doing things line by line. But writing it as a function keeps us efficient in this project.
So we return model_scores, which gives us back a dictionary, and we'll hit Shift and Enter. Mm hmm. And I wonder if this will work. Basically, what the function does is take our dictionary of models, set up a random seed, set up an empty dictionary, and loop through the models dictionary. So for name, model in models.items(), let's pretend the current entry is Logistic Regression: it calls fit, telling the Logistic Regression model to find the patterns in the training data. Then it creates a key in our model_scores dictionary holding the score of how well the Logistic Regression model performs on the test data, a.k.a. using the patterns it found in the training data. Finally it returns our model_scores dictionary, so we can see how each of our models performs on the test dataset. With our three models here, we should get three different scores.

I'm going to leave it on a cliffhanger there, and we'll evaluate our three models in the next video. So we'll see you there!
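For reference, the function assembled across this walkthrough can be sketched as below. This is a sketch only: the actual heart-disease X_train, X_test, y_train and y_test come from the train/test split in earlier videos and aren't shown in this excerpt.

```python
import numpy as np

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed for reproducible results
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models: name is the key, model is the value
    for name, model in models.items():
        # Fit the model to the training data
        model.fit(X_train, y_train)
        # Evaluate the model on the test data and store its score under its name
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
```

Calling fit_and_score(models, X_train, X_test, y_train, y_test) with the dictionary of three models then returns a dict mapping each model's name to its test-set accuracy.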