0 1 00:00:00,870 --> 00:00:06,870 So now that we understand the intuition behind our regression algorithm, we can run and evaluate our 1 2 00:00:06,870 --> 00:00:08,250 regression. 2 3 00:00:08,250 --> 00:00:12,480 We will write the Python code to actually crunch the numbers. 3 4 00:00:12,480 --> 00:00:18,860 So I hope that at this point you still have your Jupyter notebook open and your session is still connected. 4 5 00:00:19,080 --> 00:00:23,940 The big advantage of using the online version of Jupyter notebook is that you can get started 5 6 00:00:23,940 --> 00:00:30,930 right away and you don't have to install anything, but if you're inactive for a while and you haven't 6 7 00:00:30,930 --> 00:00:36,150 been using it then it's possible that you can get disconnected and lose your work. 7 8 00:00:36,180 --> 00:00:38,690 So in that case you might see something like this. 8 9 00:00:39,510 --> 00:00:45,450 And it's important to remember that you can always save your work by saying "Download as" and then "Notebook" 9 10 00:00:46,140 --> 00:00:51,990 and you can always restore your work by going back to Try Jupyter with Python and then simply uploading 10 11 00:00:52,290 --> 00:00:57,420 the Jupyter notebook that you downloaded previously and your data file. 11 12 00:00:57,420 --> 00:01:03,620 So if you upload those, then you can continue where you left off. In the next module, 12 13 00:01:03,660 --> 00:01:07,580 I will walk you through how to install Jupyter locally on your machine. 13 14 00:01:07,650 --> 00:01:13,760 You might only encounter this situation if you are trying it out using Binder through the web portal. 14 15 00:01:13,860 --> 00:01:20,460 Now without further ado, let's give our notebook the capability to run a regression. This capability, just 15 16 00:01:20,460 --> 00:01:26,780 like the others, is going to come from a module. In this case this module is going to be called scikit-learn.
16 17 00:01:26,820 --> 00:01:33,870 Scikit-learn is one of the most popular machine learning modules in Python and we can get hold 17 18 00:01:33,870 --> 00:01:41,850 of it in our Jupyter notebook simply by typing "import sklearn", but we're only looking for something 18 19 00:01:41,850 --> 00:01:48,720 very specific out of scikit-learn, so instead of importing all of scikit-learn, what we're gonna do instead 19 20 00:01:48,720 --> 00:01:56,910 is we're going to say from sklearn.linear_model and I hit Tab on my keyboard here to bring 20 21 00:01:56,910 --> 00:02:01,290 up this option, we're gonna import LinearRegression. 21 22 00:02:01,350 --> 00:02:07,560 Once again, I typed the first few characters there and then I hit Enter to insert the rest of the code 22 23 00:02:08,040 --> 00:02:15,360 and this avoids any sort of typos like, you know, having a lowercase r here for example. After we've 23 24 00:02:15,420 --> 00:02:21,770 added this line of code, let's hit Shift+Enter on our keyboard or click "Run" to run the cell. 24 25 00:02:22,230 --> 00:02:29,490 Alternatively, if you've opened this notebook from fresh, you might have to go to "Cell" and then "Run All". 25 26 00:02:30,030 --> 00:02:37,170 Let me add a few more cells here at the bottom, and then we can add the code to run our linear regression. 26 27 00:02:38,100 --> 00:02:44,640 The task of running a linear regression for calculating the slope of our line and the intercept is 27 28 00:02:44,910 --> 00:02:50,760 once more going to be done by an object; so we're gonna create this object and we're gonna give it a name. 28 29 00:02:50,820 --> 00:02:58,200 So I'm going to call it regression, and I'm going to set it equal to LinearRegression with some 29 30 00:02:58,200 --> 00:03:00,330 parentheses at the end.
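As a rough sketch, the two notebook cells described so far (the targeted import and the creation of the object) might look like this. It assumes scikit-learn is already available in the notebook environment.

```python
# Import only the LinearRegression class instead of all of scikit-learn.
# Note the capital L and capital R in the class name.
from sklearn.linear_model import LinearRegression

# The parentheses create a new regression object that has not been fitted yet.
# We store it under the name "regression" so we can refer to it later.
regression = LinearRegression()
```

At this point the object exists but holds no slope or intercept; those only appear once it has been fitted to data.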
30 31 00:03:00,330 --> 00:03:08,340 This bit of code here will create our object and we're gonna be storing it inside here so we can always 31 32 00:03:08,340 --> 00:03:12,580 refer to our linear regression by the name "regression". 32 33 00:03:12,600 --> 00:03:18,150 So now that we've created our object we can actually tell it to do something; we can tell it to run our 33 34 00:03:18,150 --> 00:03:24,750 regression. And the way we're gonna do that is simply by using the regression and then putting a dot 34 35 00:03:24,750 --> 00:03:31,770 after it and then writing "fit". Inside the parentheses we have to tell it two things: namely the features - 35 36 00:03:32,130 --> 00:03:39,840 our X, and our labels - our lowercase y. When we hit Shift+Enter on the cell it will crunch the numbers 36 37 00:03:40,290 --> 00:03:43,060 and, just like that, we've actually run our regression. 37 38 00:03:43,110 --> 00:03:46,380 Now, I know we don't see any output here but trust me, 38 39 00:03:46,400 --> 00:03:52,380 the numbers have been crunched and to prove this we can pull up the slope coefficient and the intercept 39 40 00:03:52,590 --> 00:03:58,660 that were calculated by our regression. We can get hold of the slope through our regression object. 40 41 00:03:58,800 --> 00:04:07,110 So regression.coef and here I'm hitting Tab on my keyboard to bring up that menu there and then hit 41 42 00:04:07,230 --> 00:04:09,590 Enter to insert the code. 42 43 00:04:09,630 --> 00:04:14,790 Now you can, of course, type this out and you'll get the same result but just remember that there's an 43 44 00:04:14,850 --> 00:04:16,700 underscore here at the end. 44 45 00:04:16,770 --> 00:04:18,480 That's part of the name. 45 46 00:04:18,510 --> 00:04:23,950 So let me run the cell and let's take a look at what the slope coefficient is. 46 47 00:04:23,970 --> 00:04:27,630 So here we see that it's 3.11. 
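The fitting step and the parameter lookup can be sketched as follows. The course's movie data file is not reproduced here, so this uses a tiny made-up data set (an assumption, generated from y = 3x - 7) in place of the real X and y; the numbers it produces are therefore not the 3.11 quoted in the lesson.

```python
from sklearn.linear_model import LinearRegression

# Made-up toy data for illustration only, generated from y = 3x - 7.
X = [[1.0], [2.0], [3.0], [4.0]]   # features: one row per sample, one column
y = [-4.0, -1.0, 2.0, 5.0]         # labels: one value per sample

regression = LinearRegression()
regression.fit(X, y)               # estimates the slope and the intercept

# Both attribute names end in an underscore; that is part of the name.
print(regression.coef_)            # slope coefficient(s), one per feature
print(regression.intercept_)       # intercept
```

The exact shape of `coef_` and `intercept_` depends on how the labels are passed in (for example a one-dimensional list versus a single-column table), so the printed form in your own notebook may differ slightly.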
47 48 00:04:27,660 --> 00:04:33,840 Now one of the great things about Jupyter notebook is that we can add some markdown cells so I can add 48 49 00:04:33,840 --> 00:04:41,860 a cell here and I can change it to "Markdown" to add a little bit of explanation to what my code is doing. 49 50 00:04:41,970 --> 00:04:47,820 That way if I come back to this notebook in the future and I'm wondering what regression.coef_ does, 50 51 00:04:47,820 --> 00:04:53,550 I can look at the markdown cell above and I can leave myself a little note; for example, I can say slope 51 52 00:04:53,550 --> 00:04:59,050 coefficient and Shift+Enter and then I'll get this text inserted here. 52 53 00:04:59,070 --> 00:05:04,570 Now, if I wanted to leave myself little notes inside the actual cells where I've got my Python code, 53 54 00:05:04,800 --> 00:05:11,160 then I have to use something called a comment and I can add a comment here with a hashtag or a pound 54 55 00:05:11,160 --> 00:05:19,530 symbol and I can add this here and write, for example, theta_1 and you can see that the text 55 56 00:05:19,530 --> 00:05:25,530 here is green and that means that this text is considered to be a comment and will be ignored. 56 57 00:05:25,530 --> 00:05:32,130 It's not going to be treated as code and I can actually execute the cell and it'll run just fine, but 57 58 00:05:32,130 --> 00:05:37,170 if I were to delete this little symbol here - this pound symbol, and try to run it then I would get an error 58 59 00:05:37,200 --> 00:05:39,600 and it would tell me "invalid syntax". 59 60 00:05:39,780 --> 00:05:42,690 So that little pound symbol is very, very important. 60 61 00:05:42,690 --> 00:05:46,490 That's what's going to differentiate code from comments. 61 62 00:05:47,010 --> 00:05:49,440 OK, so we've got our slope coefficient. 62 63 00:05:49,440 --> 00:05:52,460 What about our intercept? Our intercept 63 64 00:05:52,470 --> 00:05:55,810 we can pull up similarly through this regression object.
64 65 00:05:55,860 --> 00:06:02,300 So "regression.", and then put "intercept" with the underscore at the end. 65 66 00:06:02,550 --> 00:06:08,970 Once again I hit Tab on my keyboard there - I'm not such a fast typer, and it inserted the code for 66 67 00:06:08,970 --> 00:06:09,910 me. 67 68 00:06:10,020 --> 00:06:12,650 So let's take a look at what the value of this intercept is. 68 69 00:06:12,840 --> 00:06:17,280 Here we can see that it's about negative 7.2 million. 69 70 00:06:17,850 --> 00:06:21,660 So, now we know the regression slope and the intercept. 70 71 00:06:21,660 --> 00:06:26,840 And while it's not bad to have these two numbers to hand, we can actually do better than this. 71 72 00:06:26,880 --> 00:06:31,360 Wouldn't it be nice if we could just plot a line on our chart? 72 73 00:06:31,410 --> 00:06:35,070 Because we've painstakingly visualized our data. 73 74 00:06:35,070 --> 00:06:38,640 Why not just plot the regression line on this chart as well? 74 75 00:06:38,640 --> 00:06:39,990 So let's do just that. 75 76 00:06:39,990 --> 00:06:46,350 What I'm going to do is I'm gonna take this entire cell and I'm going to copy it and then I'm going 76 77 00:06:46,350 --> 00:06:49,810 to come down here and I'm going to paste it. 77 78 00:06:49,980 --> 00:06:53,440 And the reason I'm doing this is simply to have a reference. 78 79 00:06:53,460 --> 00:06:58,500 So here I'm going to have the chart without the line and here I'm going to modify my code here so I 79 80 00:06:58,500 --> 00:07:02,460 get the chart with the line. And you can move these cells around as well. 80 81 00:07:02,490 --> 00:07:09,820 So if I use this little up arrow here, then I can move it above this cell here. Wonderful! 81 82 00:07:09,820 --> 00:07:12,340 So, how do we plot a line on here? 82 83 00:07:12,790 --> 00:07:16,210 Well, we can use matplotlib once again. 83 84 00:07:16,210 --> 00:07:16,840 So, matplotlib 84 85 00:07:16,840 --> 00:07:20,450 has a functionality called plot. 
85 86 00:07:20,590 --> 00:07:29,770 So plt.plot will allow us to plot a line on this chart, but we have to supply some information; we 86 87 00:07:29,770 --> 00:07:30,330 have to tell 87 88 00:07:30,410 --> 00:07:33,160 matplotlib what exactly to plot. 88 89 00:07:33,220 --> 00:07:34,990 And for that it will actually need two things. 89 90 00:07:35,000 --> 00:07:41,710 It will need the Xs and the y's so it'll need some information for where it should plot things on this axis, 90 91 00:07:41,710 --> 00:07:44,780 and where it should plot things on this axis. Now, 91 92 00:07:44,830 --> 00:07:47,140 production budgets are our feature. 92 93 00:07:47,170 --> 00:07:49,960 So for this axis we can use our Xs, right? 93 94 00:07:50,440 --> 00:07:56,920 So I'm going to put an X here, and then a comma and I have to supply the y value. And what should this 94 95 00:07:56,920 --> 00:07:57,510 be? 95 96 00:07:57,550 --> 00:08:02,820 We don't want to use the actual value that we have for the gross revenue. 96 97 00:08:02,860 --> 00:08:08,770 Instead, what we would like to do is we'd want to use the predicted value from our regression and we 97 98 00:08:08,770 --> 00:08:14,710 can get hold of those values by calculating a predicted value for each of the X values. 98 99 00:08:14,710 --> 00:08:19,640 To do that we'll need the regression object from scikit-learn that we used earlier. 99 100 00:08:19,660 --> 00:08:22,140 All we need to do is type regression.predict(X), 100 101 00:08:22,190 --> 00:08:28,960 then it will calculate a prediction for each of the budget values in our 101 102 00:08:28,960 --> 00:08:30,900 data and plot it on the graph. 102 103 00:08:30,920 --> 00:08:36,370 So let me hit Shift+Enter on the cell and let's take a look at what we've got scrolling down. 103 104 00:08:36,370 --> 00:08:39,520 I can see we've got a regression line right here. 104 105 00:08:39,580 --> 00:08:40,910 Fantastic!
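A minimal sketch of plotting the fitted line on top of the data, as described above. The data here is made up for illustration (the names `budgets` and `revenues` are hypothetical stand-ins for the course's real columns), and the original chart's labels and styling are left out.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up toy data for illustration only.
budgets = [1.0, 2.0, 3.0, 4.0]
revenues = [-3.0, -2.0, 3.0, 4.0]

# scikit-learn expects the features as rows of samples, one column per feature.
X = [[b] for b in budgets]

regression = LinearRegression()
regression.fit(X, revenues)

# One predicted revenue for every budget value in the data.
predicted = regression.predict(X)

plt.scatter(budgets, revenues)          # the actual data points
lines = plt.plot(budgets, predicted)    # the regression line through them
```

The y values passed to `plt.plot` are the model's predictions rather than the actual revenues, which is why the result is a straight line. In a notebook the chart appears when the cell is run; in a plain script you would typically also call `plt.show()`.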
105 106 00:08:40,910 --> 00:08:44,460 But, it'd be a bit nicer if it stood out a little more, right? 106 107 00:08:44,470 --> 00:08:45,910 Let's give it a color. 107 108 00:08:45,930 --> 00:08:47,100 Let's give it a width. 108 109 00:08:47,110 --> 00:08:53,540 Let's make it a bit thicker. And we can do that by going into this line of code here. 109 110 00:08:53,590 --> 00:09:00,100 And just before this ending parenthesis we can add a comma and then we're gonna add a color here. 110 111 00:09:00,100 --> 00:09:07,070 So I'm gonna say color is equal to red in single quotes and that will make our line red, 111 112 00:09:07,150 --> 00:09:13,050 as you can guess. And to make it thicker, we can specify the line width. 112 113 00:09:13,390 --> 00:09:22,030 So linewidth, all in lowercase in one word is equal to, let's try 4; if I hit Shift+Enter to refresh 113 114 00:09:22,040 --> 00:09:28,680 my cell, then I can see that my regression line now is a lot thicker and it's changed in color. 114 115 00:09:28,690 --> 00:09:36,850 What we see now is, we see the relationship between our production budgets and our movie revenue as predicted 115 116 00:09:36,850 --> 00:09:39,080 by our linear regression model. 116 117 00:09:39,220 --> 00:09:46,810 And that means we can move on to the final part of our data science workflow, namely evaluating and analyzing 117 118 00:09:46,840 --> 00:09:48,740 our algorithm's performance. 118 119 00:09:48,790 --> 00:09:52,090 How did we do? Did we do a good job or a bad job? 119 120 00:09:52,090 --> 00:09:56,000 What can a movie's budget tell us about the movie's revenue? 120 121 00:09:56,080 --> 00:10:00,250 This is the point where we have to think very, very hard about our model. 121 122 00:10:00,250 --> 00:10:06,010 The question that we should ask ourselves at this point is: are these parameters actually plausible? 122 123 00:10:06,010 --> 00:10:08,080 Let's take a look at that slope coefficient. 123 124 00:10:08,080 --> 00:10:10,220 It's 3.11.
124 125 00:10:10,270 --> 00:10:16,180 It means that there is a positive relationship between budget and revenue and not only that, it means 125 126 00:10:16,180 --> 00:10:21,790 that for each dollar that we spend on producing the movie, we should get around 3.1 dollars 126 127 00:10:21,790 --> 00:10:23,410 in revenue in return. 127 128 00:10:23,410 --> 00:10:27,180 And I actually think this seems to make sense. 128 129 00:10:27,340 --> 00:10:29,170 Bigger budget films tend to do better. 129 130 00:10:29,200 --> 00:10:34,740 So that's good news for us, right? Because we've put our life savings on the line for this zombie movie, 130 131 00:10:34,740 --> 00:10:35,070 right? 131 132 00:10:35,710 --> 00:10:38,850 Now, what about the other parameter? The intercept. 132 133 00:10:38,980 --> 00:10:42,230 This one is at -7.2 million. 133 134 00:10:42,250 --> 00:10:44,050 How do we interpret that? 134 135 00:10:44,320 --> 00:10:50,500 What this intercept is literally telling us is that a movie with a budget of zero would actually lose 135 136 00:10:50,740 --> 00:10:52,930 over 7 million dollars. 136 137 00:10:52,930 --> 00:10:55,140 So that's a bit problematic. 137 138 00:10:55,150 --> 00:10:56,120 Right? 138 139 00:10:56,140 --> 00:11:03,790 That seems quite unrealistic because if you and I went out to make a movie with a thousand dollars, it's 139 140 00:11:03,790 --> 00:11:09,340 pretty unlikely that seven million would just disappear from our bank accounts. 140 141 00:11:09,340 --> 00:11:12,090 So, this is a lot less realistic. 141 142 00:11:12,340 --> 00:11:16,300 And this begs the question - what should we conclude about our model? 142 143 00:11:16,300 --> 00:11:19,540 Well it means that we should actually take it with a grain of salt. 143 144 00:11:19,570 --> 00:11:26,210 We just have to accept that our model is a dramatic simplification of the real world. 
144 145 00:11:26,410 --> 00:11:32,140 And as such we should be a little bit careful on how much we believe the predictions of our model, especially 145 146 00:11:32,140 --> 00:11:33,820 at the extreme ends. 146 147 00:11:33,910 --> 00:11:40,150 Just look at the distance here, look at the gap between this data point and our line - the predictions 147 148 00:11:40,240 --> 00:11:45,640 of our model seem to fit the data a lot worse at the extreme. 148 149 00:11:45,640 --> 00:11:51,700 So how would we use this model to make a prediction, anyhow? Say we need to predict the revenue for a film with 149 150 00:11:51,700 --> 00:11:53,860 a 50 million dollar budget. 150 151 00:11:54,040 --> 00:11:55,350 We know what our parameters are. 151 152 00:11:55,690 --> 00:11:58,740 We know what the theta zero and the theta one are equal to. 152 153 00:11:58,990 --> 00:12:04,090 And if we want to know what the revenue would be for a film with a budget of 50 million, all we would 153 154 00:12:04,090 --> 00:12:13,160 have to do is to substitute the values of our parameters that we estimated into this equation. 154 155 00:12:13,420 --> 00:12:19,300 So our X is going to be 50 million and if we do the math then we get our prediction, at least according 155 156 00:12:19,300 --> 00:12:20,070 to our model. 156 157 00:12:20,110 --> 00:12:20,700 Right? 157 158 00:12:20,920 --> 00:12:23,350 On a chart it would look something like this. 158 159 00:12:23,400 --> 00:12:28,560 We can draw a line up from 50 million and then read off 159 160 00:12:28,650 --> 00:12:31,180 how much we're actually going to make from the movie. 160 161 00:12:31,180 --> 00:12:37,930 And it's gonna be a little bit less than three times the amount that we invested so about 148 million 161 162 00:12:37,930 --> 00:12:40,550 dollars - which is not bad, right? 162 163 00:12:40,630 --> 00:12:42,360 But, how do we know if it's accurate? 163 164 00:12:42,400 --> 00:12:45,060 How can we measure how good our model is?
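The substitution described above can be checked with a few lines of arithmetic. This sketch assumes the equation is the usual straight line, predicted revenue = theta_0 + theta_1 * budget, and uses the rounded parameter values quoted in the lesson (an intercept of about -7.2 million and a slope of 3.11), so the result is approximate.

```python
# Rounded parameter values as quoted in the lesson (approximate).
theta_0 = -7_200_000    # intercept
theta_1 = 3.11          # slope coefficient

budget = 50_000_000     # the film budget we want a prediction for

# Substitute the budget into the line: theta_0 + theta_1 * x
predicted_revenue = theta_0 + theta_1 * budget

print(f"{predicted_revenue:,.0f}")   # roughly 148.3 million
```

That is about 2.97 times the budget, which matches the "a little bit less than three times" figure in the lesson.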
164 165 00:12:45,070 --> 00:12:51,340 So even though it is very very simplistic we can still ask the question of how much of the real world 165 166 00:12:51,340 --> 00:12:53,560 data it actually explains. 166 167 00:12:53,560 --> 00:12:55,650 And for that we need some kind of measure. 167 168 00:12:55,720 --> 00:13:01,650 We need some kind of statistic and the measure that we're going to look at is called R squared. 168 169 00:13:01,840 --> 00:13:09,460 Also called the goodness of fit. To look at our R squared we'll simply take our regression and we write 169 170 00:13:09,460 --> 00:13:12,160 regression.score. 170 171 00:13:12,580 --> 00:13:20,350 And within the parentheses we supply our capital X and our lowercase y - our feature and our labels and 171 172 00:13:20,350 --> 00:13:27,460 we hit Shift+Enter and here what we see is that the R squared is approximately 0.55. 172 173 00:13:27,850 --> 00:13:34,100 This number here is the amount of variation in film revenue that is explained by the film's budget. 173 174 00:13:34,270 --> 00:13:42,010 And I've got to say that 55 percent is actually pretty good because think about it this way: our very simplistic 174 175 00:13:42,010 --> 00:13:48,580 model with a single feature, namely the production budget, can explain around 55 percent of the variation 175 176 00:13:48,580 --> 00:13:50,920 that we see in worldwide movie earnings. 176 177 00:13:51,340 --> 00:13:57,220 I'd say that's pretty good news for a first try but, of course, we should be a little bit cautious 177 178 00:13:57,220 --> 00:14:01,490 reading into this model too much, because we've still got a lot to learn. 178 179 00:14:01,570 --> 00:14:08,230 For example how would our model do if we added more features, like how long it took to make or if it's 179 180 00:14:08,230 --> 00:14:09,370 a sequel? 180 181 00:14:09,370 --> 00:14:10,960 Would we get more realism?
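To make the R squared figure discussed above a little more concrete, here is a sketch that computes it by hand from its standard definition: one minus the ratio of the unexplained variation to the total variation. For a LinearRegression object, `regression.score(X, y)` reports this same statistic. The numbers below are made up for illustration and the helper name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the model (squared prediction errors).
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the actual values around their mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Made-up actual values and model predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

score = r_squared(actual, predicted)
print(round(score, 3))
```

A score of 1 would mean the predictions match the data exactly, while a score near 0 would mean the model does no better than always guessing the average.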
181 182 00:14:10,960 --> 00:14:14,560 Would it make our model perform better and make our predictions more accurate? 182 183 00:14:14,800 --> 00:14:21,730 And perhaps we should evaluate our model, not just on the data that we used for training it, but on new 183 184 00:14:21,730 --> 00:14:29,110 data, data that it hasn't seen yet and also, what if the relationship that we have here is actually non-linear. 184 185 00:14:29,140 --> 00:14:32,770 What if we somehow need to transform the data to get a better fit? 185 186 00:14:33,280 --> 00:14:39,820 So in a way our analysis has left us with a lot more questions that we should investigate and we will 186 187 00:14:39,820 --> 00:14:43,000 do just that in the upcoming modules. 187 188 00:14:43,000 --> 00:14:48,670 Well, you got started on the first project and you went through the whole data science workflow; you've 188 189 00:14:48,670 --> 00:14:53,740 gathered the data, you've cleaned the data, you've visualized it and then you ran a machine learning algorithm 189 190 00:14:53,950 --> 00:14:58,690 and then you've evaluated the results and you've even made a prediction, but this is only the start. 190 191 00:14:59,230 --> 00:15:05,230 We've got a whole lot more ground to cover and we'll dive deep into a lot of these concepts and the 191 192 00:15:05,230 --> 00:15:11,140 techniques that we've introduced here. In the next module we're going to install Jupyter and we're also 192 193 00:15:11,140 --> 00:15:13,500 going to learn a little bit more about regression. 193 194 00:15:13,990 --> 00:15:17,550 But the real focus will be on learning more Python programming. 194 195 00:15:17,740 --> 00:15:23,290 From there, we're going to learn about gradient descent and how optimization works for many machine learning 195 196 00:15:23,290 --> 00:15:25,570 algorithms that we're going to encounter. 
196 197 00:15:25,570 --> 00:15:31,120 And after that we're going to use multivariable regression and predict some real estate prices in Boston. 197 198 00:15:31,660 --> 00:15:38,710 From there we're going to build an actual spam filter from scratch using a Naive Bayes classifier and 198 199 00:15:38,710 --> 00:15:40,280 then we're gonna take it up a notch, 199 200 00:15:40,290 --> 00:15:44,690 and we're gonna dive into deep learning with neural networks and TensorFlow. 200 201 00:15:44,800 --> 00:15:47,920 So for all of that and more, I'll see you on the next lessons.