0 1 00:00:00,370 --> 00:00:04,450 So, speaking of data, let's create some data for our example. 1 2 00:00:04,590 --> 00:00:10,240 I'll add a few more cells here so that I'm working more in the middle of the screen and then I'm gonna 2 3 00:00:10,260 --> 00:00:19,650 add a comment, say "Make sample data" and I'm going to create two variables here - an x_5 variable 3 4 00:00:20,310 --> 00:00:23,360 and a y_5 variable. 4 5 00:00:23,390 --> 00:00:24,490 That's our fifth example, 5 6 00:00:24,510 --> 00:00:35,820 after all. Our x_5 is going to be equal to a numpy array, so "np.array( 6 7 00:00:36,240 --> 00:00:37,570 [])", 7 8 00:00:37,950 --> 00:00:40,140 and I'm going to punch in here seven values. 8 9 00:00:40,230 --> 00:00:45,360 So if you type in something else, then we're gonna get different results. 9 10 00:00:45,450 --> 00:00:56,200 But I'm going to punch in 0.1, 1.2, 2.4, 3.2, 4.1, 10 11 00:00:57,130 --> 00:01:01,800 5.7 and 6.5. 11 12 00:01:03,010 --> 00:01:05,150 So these are gonna be our X values, 12 13 00:01:05,230 --> 00:01:09,540 seven of them in total. Our y values, 13 14 00:01:09,580 --> 00:01:11,110 so y_5 14 15 00:01:11,320 --> 00:01:15,160 is gonna be equal to also an np array, 15 16 00:01:15,250 --> 00:01:18,470 same thing parentheses, square brackets and here 16 17 00:01:18,470 --> 00:01:29,670 we're going to punch in 1.7, 2.4, 3.5, 3.0, 6.1 17 18 00:01:30,690 --> 00:01:36,130 9.4 and 8.2. 18 19 00:01:36,150 --> 00:01:43,410 So we have two arrays and let's print out the shape of these arrays below. 19 20 00:01:43,410 --> 00:01:55,040 So I'm going to say "print('Shape of x_5 array :', x_5. 20 21 00:01:55,620 --> 00:02:01,620 shape)" and we're gonna do the same thing for our y array. 21 22 00:02:01,810 --> 00:02:09,090 I'm going to fix my typo, shape and then I'm going to do the same thing for our y array - I'm going to say print and 22 23 00:02:09,090 --> 00:02:19,020 then add a string "'Shape of y_5 array: ', y_5.shape". 
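The cell described above can be sketched roughly as follows. It assumes numpy is imported as np, as the "np.array" calls in the narration suggest, and the output strings are only an approximation of what was typed on screen.

```python
import numpy as np

# Make sample data: seven x values and seven y values, typed in by hand
x_5 = np.array([0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5])
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2])

# At this point both arrays are flat: one dimension holding seven values
print('Shape of x_5 array:', x_5.shape)
print('Shape of y_5 array:', y_5.shape)
```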
23 24 00:02:19,470 --> 00:02:22,130 So let's take a look at the shape of these arrays. 24 25 00:02:22,560 --> 00:02:29,500 So they're both flattened, meaning they only have one dimension and they've got seven values. 25 26 00:02:29,810 --> 00:02:30,260 Brilliant. 26 27 00:02:30,270 --> 00:02:35,500 So there's our fake sample data and we've added our print statements. 27 28 00:02:35,610 --> 00:02:41,680 Now let's fit a linear regression to this sample data. 28 29 00:02:41,760 --> 00:02:47,640 That way we can calculate the intercept and the slope parameter and we can check our work that we're 29 30 00:02:47,640 --> 00:02:51,540 gonna do against what we know to be correct. 30 31 00:02:51,570 --> 00:02:59,490 So scroll to the very, very top of the file where you're adding your import statements and then in 31 32 00:02:59,490 --> 00:03:06,140 the cell add "from sklearn.linear_model 32 33 00:03:06,570 --> 00:03:13,190 import LinearRegression", capital L, 33 34 00:03:13,190 --> 00:03:25,120 capital R. Remember to hit Shift+Enter and then scroll back down to where you left off and let's run 34 35 00:03:25,180 --> 00:03:29,060 a quick regression. So I'm going to add a comment 35 36 00:03:29,110 --> 00:03:39,550 "Quick linear regression", we're gonna run it like so, we're gonna say "regr = LinearRegression", 36 37 00:03:41,600 --> 00:03:50,450 capital L, capital R, parentheses at the end and then we're gonna say "regr.fit()" then 37 38 00:03:50,630 --> 00:03:57,500 we're gonna supply two arguments within the parentheses - x_5, comma and then 38 39 00:03:57,560 --> 00:03:59,160 y_5. 39 40 00:03:59,930 --> 00:04:05,320 So these are two arrays, our two pieces of data. 40 41 00:04:05,420 --> 00:04:08,040 Now let's see what happens. 41 42 00:04:08,060 --> 00:04:16,300 Because I'm going to tell you right now this isn't gonna work just like that. 
When we hit Shift+Enter, 42 43 00:04:16,340 --> 00:04:24,320 we can scroll down and take a look at the error - what we get is a value error - "Found input variables with 43 44 00:04:24,380 --> 00:04:30,770 inconsistent number of samples: [1,7]". 44 45 00:04:30,860 --> 00:04:38,150 So that's a very cryptic error message, but let's take a look at the quick documentation for this fit 45 46 00:04:38,510 --> 00:04:41,690 method and see what we need. 46 47 00:04:41,690 --> 00:04:42,490 So. 47 48 00:04:42,890 --> 00:04:45,940 So our fit method takes an X and a y. 48 49 00:04:46,100 --> 00:04:53,310 So the X is our training data and the y is our target values. 49 50 00:04:53,330 --> 00:05:02,270 Now the interesting thing to note here is that this method requires a two dimensional input. 50 51 00:05:02,320 --> 00:05:09,960 So at the moment the shape of our x array and our y array is actually one dimensional, it's just got seven 51 52 00:05:09,960 --> 00:05:11,010 values in it. 52 53 00:05:11,070 --> 00:05:15,780 But we can see here that this method requires a two dimensional input. 53 54 00:05:16,020 --> 00:05:20,070 It needs to know how many samples and how many features. 54 55 00:05:20,070 --> 00:05:22,130 Now we've got only one feature, right? 55 56 00:05:22,140 --> 00:05:28,800 We've only got some x values but we still have to pay attention to the kind of array that we're giving. 56 57 00:05:29,020 --> 00:05:36,360 So even though we've only got one feature, namely our x's, we still have to pass in a two dimensional array 57 58 00:05:37,290 --> 00:05:40,180 and the same is true for our y's. 58 59 00:05:40,260 --> 00:05:47,700 So even though we're only predicting one value, namely the y value, we still have to pass in a two dimensional 59 60 00:05:48,060 --> 00:05:48,910 array. 60 61 00:05:48,960 --> 00:05:54,500 So let's go back up here and take a look at where we're creating these arrays. 
61 62 00:05:54,600 --> 00:06:02,100 We're going to have to add some additional Python code to reshape or transform our x's and our y array 62 63 00:06:02,700 --> 00:06:07,190 to meet this requirement from our fit method from our regression. 63 64 00:06:07,230 --> 00:06:11,000 I'll show you two possible ways we can solve this problem. 64 65 00:06:11,040 --> 00:06:13,430 Here's the first one - for our x's. 65 66 00:06:13,470 --> 00:06:24,540 If I add another pair of square brackets inside the parentheses and then I put a dot afterwards and 66 67 00:06:24,540 --> 00:06:32,520 write transpose, yeah with parentheses at the end and hit Shift+Enter, 67 68 00:06:33,240 --> 00:06:38,430 then you'll see in the print statement that our X array now is two dimensional. 68 69 00:06:38,430 --> 00:06:41,940 It's got seven rows and one column. 69 70 00:06:42,420 --> 00:06:49,950 So this is one way you can take a one dimensional array which was just seven values and make it into 70 71 00:06:50,100 --> 00:06:57,780 two dimensions even though it's only got one column - two pairs of square brackets and then calling the 71 72 00:06:57,840 --> 00:07:07,150 transpose method. Another way you can accomplish the very very same thing is by using a method called 72 73 00:07:07,360 --> 00:07:08,140 reshape. 73 74 00:07:08,140 --> 00:07:16,540 So writing ".reshape" and then in the parentheses I'm going to supply how many rows I want - 7 74 75 00:07:17,140 --> 00:07:25,410 comma and then how many columns I want - namely 1; 7 rows 1 column. 75 76 00:07:25,410 --> 00:07:31,220 Let me hit Shift+Enter and update the shape of our y array. 76 77 00:07:31,400 --> 00:07:32,010 Voila! 77 78 00:07:32,750 --> 00:07:34,850 I get exactly the same result. 78 79 00:07:34,920 --> 00:07:37,810 So two ways you can do it. 
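The two approaches just described can be sketched side by side. This assumes numpy is imported as np; the name y_alt is hypothetical and only appears here to show a common variant.

```python
import numpy as np

# Way 1: wrap the values in a second pair of square brackets (shape (1, 7)),
# then transpose to get seven rows and one column.
x_5 = np.array([[0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]]).transpose()

# Way 2: start from the flat array and reshape it to 7 rows, 1 column.
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(7, 1)

# Variant (not shown in the lesson): -1 asks numpy to infer the row count.
y_alt = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(-1, 1)

print('Shape of x_5 array:', x_5.shape)
print('Shape of y_5 array:', y_5.shape)
```

Either way the result is a two dimensional array with one column, which is the layout the fit method expects for its samples and features.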
79 80 00:07:38,040 --> 00:07:43,650 Now with that in place I'm going to click into our cell below where we're doing our regression 80 81 00:07:44,310 --> 00:07:49,560 and I'm going to hit Shift+Enter and voila! Our error goes away. 81 82 00:07:49,560 --> 00:07:52,840 So we've managed to fit our regression perfectly. 82 83 00:07:52,920 --> 00:07:54,790 Now what are we actually interested in? 83 84 00:07:54,810 --> 00:07:57,220 We're interested in the theta values, right - 84 85 00:07:57,240 --> 00:08:02,220 the intercept and our slope parameter - our theta zero and our theta one. 85 86 00:08:02,340 --> 00:08:03,290 Let's print these out. 86 87 00:08:03,780 --> 00:08:16,030 I'm going to put a print statement here and write the string "'Theta 0:', regr.intercept_ 87 88 00:08:16,060 --> 00:08:19,290 [0]". 88 89 00:08:19,420 --> 00:08:25,990 That's how we get our intercept, our theta zero; and then I'm going to add another print statement "'Theta 89 90 00:08:25,990 --> 00:08:37,900 1: ', regr.coef_" and this was two dimensional, remember. 90 92 00:08:37,900 --> 00:08:43,780 So we add [0] and then another pair of square brackets with a zero. 91 93 00:08:43,900 --> 00:08:54,070 So first column, first row of this two dimensional coefficient array. I'm going to hit Shift+Enter and we can see 92 94 00:08:54,190 --> 00:09:01,600 what the values are. Our intercept is at 0.85 approximately and our theta one, our 93 95 00:09:01,600 --> 00:09:07,120 slope has the value 1.2 approximately. 94 96 00:09:07,200 --> 00:09:15,270 So now that we have the parameters that describe our best fit line let's throw them up on screen because 95 97 00:09:15,300 --> 00:09:19,580 I'm a very, very visual person and I'm sure it'll help you as well. 96 98 00:09:19,590 --> 00:09:28,950 So I'm going to add a few more cells, scroll down and now let's plot this stuff. First let's plot our data points. 
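Put together, the regression cell might look something like this. It assumes scikit-learn and numpy are installed; the names theta_0 and theta_1 are hypothetical helpers that do not appear in the lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data, already shaped as 7 rows by 1 column
x_5 = np.array([[0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]]).transpose()
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(7, 1)

# Quick linear regression
regr = LinearRegression()
regr.fit(x_5, y_5)

# With a 2D y, intercept_ is a 1D array and coef_ is a 2D array,
# hence the [0] and [0][0] indexing.
theta_0 = regr.intercept_[0]
theta_1 = regr.coef_[0][0]
print('Theta 0:', theta_0)
print('Theta 1:', theta_1)
```

The printed values should come out close to the 0.85 and 1.2 mentioned above.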
97 99 00:09:29,050 --> 00:09:29,790 So this 98 100 00:09:29,790 --> 00:09:37,490 we can do with a scatter plot so "plt.scatter" then in the parentheses we'll have "x_5, 99 101 00:09:37,500 --> 00:09:46,400 y_5", and then size for these points maybe at 50. 100 102 00:09:46,420 --> 00:09:52,270 So that's our scatter plot and now we'll add our best fit line between these points. 101 103 00:09:52,300 --> 00:10:03,360 So I'll write "plt.plot(x_5, )", our x values and then for our y values, the predicted 102 104 00:10:03,360 --> 00:10:10,470 values, I'm going to write "regr", which is our linear regression, 103 105 00:10:10,470 --> 00:10:17,420 then ".predict()" and then in the parentheses I'm going to supply again the x values. 104 106 00:10:17,460 --> 00:10:21,420 So I'm going to predict the y values based on the x values. 105 107 00:10:21,590 --> 00:10:22,830 I'm going to make this line orange. 106 108 00:10:22,830 --> 00:10:36,180 So I'm going to say "color='orange'" and maybe "linewidth=3". 107 109 00:10:36,280 --> 00:10:38,280 Let's add some labels as well. 108 110 00:10:38,280 --> 00:10:42,070 So "plt.xlabel()", 109 111 00:10:42,250 --> 00:10:48,270 it's gonna be the string "x values" and "plt.ylabel()", 110 112 00:10:48,280 --> 00:10:51,760 is gonna be the string "y values". 111 113 00:10:52,060 --> 00:10:57,360 Last but not least "plt.show()". 112 114 00:10:57,420 --> 00:11:00,160 Now let's take a gander at what this looks like. 113 115 00:11:01,570 --> 00:11:08,560 Voila! Brilliant. Here we can clearly see our data points and the best fit line. 114 116 00:11:08,710 --> 00:11:17,950 Nothing new so far but I quickly want to mention something quite important and that's to do with a difference 115 117 00:11:17,980 --> 00:11:22,880 in the notation that we're gonna be using in this example compared with the previous examples. 116 118 00:11:23,380 --> 00:11:26,200 You see this isn't the cost function that we're looking at. 
117 119 00:11:26,260 --> 00:11:34,390 This is the best fit line and the data. The plots that we've been running previously have been the cost 118 120 00:11:34,390 --> 00:11:40,210 functions and we're going to be plotting the mean squared error shortly. 119 121 00:11:40,260 --> 00:11:46,110 Now the cost functions plot the relationship between the parameters that we're trying to estimate and 120 122 00:11:46,110 --> 00:11:49,720 how far off we were from the actual data. 121 123 00:11:49,920 --> 00:11:55,980 So the cost functions don't plot the actual data itself. 122 124 00:11:56,130 --> 00:11:57,960 That regression line right here, 123 125 00:11:58,080 --> 00:12:00,750 this graph - that's plotting the data. 124 126 00:12:00,750 --> 00:12:09,060 It's a function of x and y and not the theta parameters, theta zero and theta one, that we're gonna be 125 127 00:12:09,120 --> 00:12:13,710 interested in and looking at for our cost function. 126 128 00:12:13,890 --> 00:12:20,760 And the reason that I'm mentioning this is that previously we've been using x and y for our cost function 127 129 00:12:20,760 --> 00:12:26,020 parameters because that's how we tend to learn about mathematics in school and functions. 128 130 00:12:26,160 --> 00:12:32,610 And it's quite a comfortable sort of notation, but when looking at this cost function that's coming up, 129 131 00:12:32,610 --> 00:12:40,500 it's going to depend on the theta values and the theta values are not our x's and y's so just keep that 130 132 00:12:40,500 --> 00:12:43,560 in mind, x_5 and y_5 131 133 00:12:43,560 --> 00:12:49,620 in this example have a different purpose than the data that we generated in the previous lesson.
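To make the distinction concrete, here is a minimal sketch of a cost function whose inputs are the thetas rather than x and y. The lesson has not defined the mean squared error yet, so the formula used here (the average of the squared differences between the actual y values and theta_0 + theta_1 * x) is an assumption based on the standard definition, and the function name mse and the trial theta values are hypothetical.

```python
import numpy as np

# The data stays fixed; it is not what the cost function varies over
x_5 = np.array([0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]).reshape(7, 1)
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(7, 1)

def mse(theta_0, theta_1, x, y):
    """Mean squared error of the line theta_0 + theta_1 * x against y."""
    y_hat = theta_0 + theta_1 * x
    return np.mean((y - y_hat) ** 2)

# The cost changes as the thetas change, while x_5 and y_5 never move
cost_near_fit = mse(0.85, 1.22, x_5, y_5)   # close to the fitted values
cost_elsewhere = mse(0.0, 2.0, x_5, y_5)    # an arbitrary worse guess
print('Cost near the fitted thetas:', cost_near_fit)
print('Cost for another guess:', cost_elsewhere)
```

Plotting this cost over a grid of theta_0 and theta_1 values would give the kind of surface seen in the earlier examples, with the thetas on the axes instead of x and y.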