0 1 00:00:00,420 --> 00:00:06,450 Let's revisit our old friend the scatter plot. We saw in the previous lessons 1 2 00:00:06,540 --> 00:00:14,430 how important it is to use plots in conjunction with descriptive statistics to spot patterns and outliers. 2 3 00:00:15,030 --> 00:00:16,910 Using both of these tools together 3 4 00:00:17,100 --> 00:00:20,920 we get a more complete picture of what's actually going on. 4 5 00:00:21,210 --> 00:00:27,990 So far we've been visualizing our data and looked at the distribution of values of some individual features, 5 6 00:00:28,380 --> 00:00:36,270 like RM for like the average number of rooms, or RAD, the index of accessibility to highways. 6 7 00:00:36,270 --> 00:00:43,950 We can dig deeper into the relationships between these feature pairs as well as between the features 7 8 00:00:44,160 --> 00:00:47,760 and our target value with some scatter plots. 8 9 00:00:47,760 --> 00:00:53,520 The correlation matrix that we created already hinted at the fact that there are relationships amongst 9 10 00:00:53,520 --> 00:01:01,360 the features that we can visualize. So on that note, I'd like to start this lesson off with a challenge. 10 11 00:01:01,380 --> 00:01:09,650 First, I want you to picture what the relationship would look like between the NOX and DIS features. 11 12 00:01:09,930 --> 00:01:17,910 If you recall, NOX was a measure of pollution and DIS was a measure of distance from employment centers. 12 13 00:01:17,910 --> 00:01:25,610 Picture a graph in your head and then write the two lines of Python code to visualize the scatter plot. 13 14 00:01:25,630 --> 00:01:29,030 I'll give you a few seconds to pause the video before I show you the solution. 14 15 00:01:31,870 --> 00:01:32,770 Ready? 15 16 00:01:32,770 --> 00:01:33,580 Here we go. 
16 17 00:01:33,580 --> 00:01:40,750 So we've created a scatter plot many times before with "plt", which is our matplotlib module, 17 18 00:01:41,470 --> 00:01:47,930 ".scatter()" and then we have to supply two things, the data for the x axis and the data for the y axis. 18 19 00:01:47,950 --> 00:01:52,960 So on the x axis we're gonna have our "data[' 19 20 00:01:53,260 --> 00:01:54,290 DIS']", 20 21 00:01:54,520 --> 00:01:59,020 so this is going to be our distance which is going to be on the x axis, and on the y axis, 21 22 00:01:59,020 --> 00:02:07,030 we're going to have our pollution measure, which is going to be "data['NOX']" and then 22 23 00:02:07,030 --> 00:02:10,150 finally "plt.show()". 23 24 00:02:10,150 --> 00:02:14,560 So hitting Shift+Enter just gave me this error because I've come back to this notebook and I haven't 24 25 00:02:14,560 --> 00:02:16,260 run the cells above it yet. 25 26 00:02:16,420 --> 00:02:25,270 So I'm going to go to "Cell" > "Run All" and then I'm going to wait a little bit, scroll all the way down, 26 27 00:02:25,600 --> 00:02:26,950 and here we go. 27 28 00:02:26,950 --> 00:02:29,610 Was this the relationship that you imagined in your head? 28 29 00:02:29,790 --> 00:02:32,230 A kind of downward sloping line? 29 30 00:02:32,560 --> 00:02:35,740 Let me add some labels to this graph before I give you my interpretation. 30 31 00:02:35,890 --> 00:02:51,760 So "plt.xlabel('DIS - Distance from employment', fontsize = 14)", and then for 31 32 00:02:51,760 --> 00:02:56,100 the Y label, I'm going to copy this line, paste it in, 32 33 00:02:56,140 --> 00:02:57,510 change it to the Y label, 33 34 00:02:57,580 --> 00:03:09,480 change that to read "NOX - Nitric Oxide Pollution", and then my figure size I want to change as well. 34 35 00:03:09,700 --> 00:03:19,230 So I'm going to make it a bit bigger, so I'm going to say "plt.figure(figsize = ())", 35 36 00:03:19,770 --> 00:03:22,720 say maybe 9 and 6.
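Pulling the steps so far into one cell, a minimal sketch might look like the following. The `data` DataFrame here is a synthetic stand-in with made-up values (an assumption, since the notebook's real housing data isn't shown in this lesson), and the `Agg` backend line is only there so the snippet also runs outside Jupyter. One thing worth noticing: `plt.figure()` goes before `plt.scatter()`, otherwise the size would apply to a new, empty figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's `data` DataFrame (made-up values).
rng = np.random.default_rng(0)
dis = rng.uniform(1, 12, 200)
nox = 0.85 - 0.18 * np.log(dis) + rng.normal(0, 0.03, 200)
data = pd.DataFrame({"DIS": dis, "NOX": nox})

fig = plt.figure(figsize=(9, 6))        # set the size before drawing anything
plt.scatter(data["DIS"], data["NOX"])   # x axis: distance, y axis: pollution
plt.xlabel("DIS - Distance from employment", fontsize=14)
plt.ylabel("NOX - Nitric Oxide Pollution", fontsize=14)
plt.show()
```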
36 37 00:03:22,800 --> 00:03:32,730 And then I'm also going to add a title here, I'm going to say "plt.title('DIS vs 37 38 00:03:32,970 --> 00:03:42,300 NOX', fontsize = 14)". We're going to refresh this graph, see what we get. 38 39 00:03:43,800 --> 00:03:50,760 Okay so this makes the relationship between distance from employment centers and NOX, our nitric oxide 39 40 00:03:50,760 --> 00:03:57,660 pollution much, much more clear. What we can see here is that as distance increases, as we go more to the 40 41 00:03:57,660 --> 00:04:03,580 right of this chart here, pollution goes down and this makes sense, right? 41 42 00:04:03,750 --> 00:04:10,380 The city center of Boston is going to be an employment center but city centers would also have much 42 43 00:04:10,380 --> 00:04:14,510 more air pollution than in the suburbs or on the outskirts of the city. 43 44 00:04:15,540 --> 00:04:21,930 Now one thing that might be quite interesting to add to this graph is a little bit of transparency on 44 45 00:04:21,960 --> 00:04:29,190 these data points, as well as maybe putting down the correlation that we've calculated up here and 45 46 00:04:29,190 --> 00:04:31,530 including that in our title. 46 47 00:04:31,530 --> 00:04:35,670 So let's do that now. To calculate the correlation, 47 48 00:04:35,670 --> 00:04:48,890 I'm going to add a nox_dis_corr variable, set that equal to "data['NOX']. 48 49 00:04:49,280 --> 00:04:59,060 corr(data['DIS'])" and then what I'm going to do is in the title I'm going to 49 50 00:04:59,060 --> 00:05:08,420 use this variable here and I'm going to include it in my string and I'm going to use Python's f-string notation 50 51 00:05:08,600 --> 00:05:09,710 to accomplish this.
51 52 00:05:09,710 --> 00:05:16,400 So I'm going to put f in front of the single quote and then I'm going to modify my string as follows, I'm 52 53 00:05:16,400 --> 00:05:22,160 going to say "(Correlation )" and here's the key, 53 54 00:05:22,160 --> 00:05:27,800 "{nox_dis_corr}". 54 55 00:05:27,860 --> 00:05:33,750 So this is going to grab our variable from up here, 55 56 00:05:33,760 --> 00:05:41,680 it's gonna grab our correlation between distance and pollution and it's going to insert it into our string. 56 57 00:05:41,710 --> 00:05:44,100 And that's thanks to the fact that we have 57 58 00:05:44,100 --> 00:05:50,750 the curly bracket notation around the variable name and this little f in front. 58 59 00:05:50,770 --> 00:05:53,550 So let me hit Shift+Enter and see what this looks like. 59 60 00:05:54,870 --> 00:05:55,320 Voila! 60 61 00:05:56,210 --> 00:06:00,400 Now we've got a graphical representation of our data and the correlation, 61 62 00:06:00,470 --> 00:06:06,500 all in one place. And the correlation is indeed negative and it's quite high actually, at -0.77. 62 63 00:06:06,500 --> 00:06:09,440 Now in terms of styling, 63 64 00:06:09,450 --> 00:06:13,350 you might say to yourself: You know what, this number is way too precise, 64 65 00:06:13,350 --> 00:06:17,710 it's difficult to read because it's got too many digits after the decimal point. 65 66 00:06:17,760 --> 00:06:23,790 So why don't we round it? And we can do this with the Python round function. 66 67 00:06:23,790 --> 00:06:29,760 So I'm going to do it up here where I've actually calculated the correlation and I'm just going to surround 67 68 00:06:30,030 --> 00:06:37,590 my correlation calculation with this Python function, so "round", comma at the end and then a value for 68 69 00:06:37,590 --> 00:06:43,920 how many decimal places I want to round it to. So I'm going to round it to three decimal places and close my 69 70 00:06:43,920 --> 00:06:45,580 parentheses at the end.
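As a rough sketch of the correlation, rounding and f-string steps just described: the `data` DataFrame below is again a synthetic stand-in with made-up values (an assumption), so the number it prints will not match the lesson's figure. The variable name `nox_dis_corr` and the title text follow the lesson.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's `data` DataFrame (made-up values).
rng = np.random.default_rng(0)
dis = rng.uniform(1, 12, 200)
nox = 0.85 - 0.18 * np.log(dis) + rng.normal(0, 0.03, 200)
data = pd.DataFrame({"DIS": dis, "NOX": nox})

# Pearson correlation between the two columns, rounded to 3 decimal places.
nox_dis_corr = round(data["NOX"].corr(data["DIS"]), 3)

# The f prefix plus curly braces inserts the variable's value into the string.
title = f"DIS vs NOX (Correlation {nox_dis_corr})"
print(title)
```

An alternative, if you'd rather keep the full-precision value in the variable, is to round only for display with a format spec inside the braces, for example `f"{value:.3f}"`.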
70 71 00:06:45,810 --> 00:06:52,020 If I press Shift+Enter now it should refresh and we should get something like this, we should get 71 72 00:06:52,170 --> 00:06:56,760 -0.769. 72 73 00:06:56,760 --> 00:07:02,460 The other thing I quite like doing with scatter plots is adding a little bit of transparency to the 73 74 00:07:02,460 --> 00:07:10,050 data points so that we can get a better feel for how dense particular areas of the chart are. 74 75 00:07:10,050 --> 00:07:14,700 So in my line of code where I'm creating my scatter plot, namely this one, I'm going to add some other 75 76 00:07:14,730 --> 00:07:16,290 keyword arguments. 76 77 00:07:16,410 --> 00:07:25,430 The transparency is set with the alpha keyword, and I'm going to set it to a value of 0.6. 77 78 00:07:25,610 --> 00:07:32,140 Let me hit Shift+Enter and we can clearly see that there's a lot more data points here than over here. 78 79 00:07:32,150 --> 00:07:38,900 I think this is a nice touch, but we can make this even more explicit by changing the size of our dots 79 80 00:07:39,230 --> 00:07:40,830 and making them a little bit larger. 80 81 00:07:40,830 --> 00:07:48,740 So if I choose something like 80, "s = 80" as a keyword argument, changing the size, then I've got slightly 81 82 00:07:48,740 --> 00:07:51,150 larger dots for my data points. 82 83 00:07:51,170 --> 00:07:58,340 Now of course we can continue adding keyword arguments here to style the graph as we see fit, famously 83 84 00:07:58,520 --> 00:08:01,900 color and there's quite a few to choose from. 84 85 00:08:02,060 --> 00:08:08,980 I'm going to go with indigo and give my scatter plot a purple make over. Okay, 85 86 00:08:09,010 --> 00:08:13,970 so I think creating a scatterplot with matplotlib is pretty straightforward, 86 87 00:08:14,470 --> 00:08:21,160 but now let's do the same thing with the seaborn module to mix it up a little bit, because remember 87 88 00:08:21,220 --> 00:08:24,430 I said that seaborn builds upon matplotlib? 
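Here is a small sketch of the three styling keyword arguments mentioned above (`alpha`, `s` and `color`) on the matplotlib scatter call. As before, `data` is a synthetic stand-in with made-up values, which is an assumption, and the `Agg` line is only there so it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's `data` DataFrame (made-up values).
rng = np.random.default_rng(0)
dis = rng.uniform(1, 12, 200)
nox = 0.85 - 0.18 * np.log(dis) + rng.normal(0, 0.03, 200)
data = pd.DataFrame({"DIS": dis, "NOX": nox})

plt.figure(figsize=(9, 6))
points = plt.scatter(
    data["DIS"], data["NOX"],
    alpha=0.6,       # transparency: overlapping dots look darker
    s=80,            # marker size
    color="indigo",  # any named matplotlib color works here
)
plt.show()
```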
88 89 00:08:24,460 --> 00:08:30,390 Well you're gonna see in a minute how seaborn really adds some nice little touches to these visualizations. 89 90 00:08:30,400 --> 00:08:31,600 Check this out. 90 91 00:08:31,600 --> 00:08:37,770 So I'm going to come down here, add a few more cells and then I'm going to write the following code, 91 92 00:08:37,960 --> 00:08:47,980 I'm going to say "sns.jointplot" so sns being the name for the seaborn module and then jointplot 92 93 00:08:48,130 --> 00:08:52,620 being the function to create our scatter plot. 93 94 00:08:52,810 --> 00:09:04,870 So I'm going to say "jointplot(x=data['DIS'], y=data['NOX'])" and 94 95 00:09:04,870 --> 00:09:14,340 then on the next line I'm going to say "plt.show()", hit Shift+Enter and what we get is something like this. 95 96 00:09:14,410 --> 00:09:20,870 Now again I've only specified two parameters in my function call here, but you can already see that there 96 97 00:09:20,870 --> 00:09:27,250 is some sort of histogram on the side and there's some additional data being provided here in this corner. 97 98 00:09:27,330 --> 00:09:32,480 Now if you can't read this on your screen, this is actually the Pearson correlation coefficient down 98 99 00:09:32,480 --> 00:09:36,310 to two decimal places, -0.77. 99 100 00:09:36,530 --> 00:09:43,730 I can make the chart a little larger so that it's a bit more clear by going to the arguments and providing 100 101 00:09:43,730 --> 00:09:46,220 the size argument. 101 102 00:09:46,280 --> 00:09:53,170 So I'm going to say "size = 7", increase the size a little bit but not too much. 102 103 00:09:53,330 --> 00:09:59,010 Now you should see the chart appear a little bit larger on your screen, but I think these histograms 103 104 00:09:59,300 --> 00:10:05,420 and the correlation coefficient and the fact that it adds some labels for the y axis and the x axis 104 105 00:10:05,840 --> 00:10:12,960 automatically straight out of the box is a really, really nice touch.
In terms of styling, 105 106 00:10:12,980 --> 00:10:19,010 one thing that you might notice is that the Jupyter notebook remembers how you've styled charts 106 107 00:10:19,130 --> 00:10:20,420 previously. 107 108 00:10:20,540 --> 00:10:25,130 So if you're working in a new cell and you want a new look for the chart you might have to reset the 108 109 00:10:25,130 --> 00:10:31,670 styling. The way to reset the styling for seaborn is with a function called "set". 109 110 00:10:31,820 --> 00:10:37,540 So "sns.set()" will reset the styling to the default styling. 110 111 00:10:37,540 --> 00:10:44,780 So now if I press Shift+Enter I get the default parameters for the styling and we kind of get this look 111 112 00:10:44,900 --> 00:10:46,640 right here. 112 113 00:10:46,670 --> 00:10:52,100 This set function is a good function to remember if you've ever got a little bit of a longer notebook 113 114 00:10:52,130 --> 00:10:55,670 like we've got here and you might have written some code up above 114 115 00:10:55,670 --> 00:11:01,320 that changes the styling of these charts and you want to do something different and your notebook 115 116 00:11:01,330 --> 00:11:08,710 is behaving a little bit unexpectedly, so "sns.set()" resets the styling to default and 116 117 00:11:09,170 --> 00:11:18,730 "sns.set_style()" allows us to choose kind of like a template style to use for the chart. 117 118 00:11:18,770 --> 00:11:25,370 So there's a couple of templates to choose from, one of them is called "white", and this template 118 119 00:11:25,400 --> 00:11:29,510 will make our chart look like so, which is kind of what we had before. 119 120 00:11:29,930 --> 00:11:36,440 But there's another template called "whitegrid", which then adds these grid lines to the chart like so.
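To see the reset and the style templates in action without drawing a chart, a small sketch could inspect matplotlib's `axes.grid` setting after each call. The variable names `grid_with_white` and `grid_with_whitegrid` are just illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Jupyter
import seaborn as sns

sns.set()  # reset to seaborn's default styling

sns.set_style("white")  # plain white background, no grid lines
grid_with_white = matplotlib.rcParams["axes.grid"]

sns.set_style("whitegrid")  # white background with grid lines
grid_with_whitegrid = matplotlib.rcParams["axes.grid"]

print("grid lines with 'white':", grid_with_white)
print("grid lines with 'whitegrid':", grid_with_whitegrid)
```

Because these calls change global settings, every chart drawn afterwards in the same notebook picks them up, which is why a reset with `sns.set()` can be handy.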
120 121 00:11:36,460 --> 00:11:43,820 Now of course there's also like "darkgrid" and "dark" and pressing Shift+Tab on this function will actually 121 122 00:11:43,820 --> 00:11:49,690 show us what some of the options are - darkgrid, whitegrid, dark, white, ticks, 122 123 00:11:49,880 --> 00:11:57,020 got a couple to choose from if we want. And you even got some examples on how you would use them. 123 124 00:11:57,030 --> 00:12:03,870 So for example if you wanted to use ticks you can even provide the tick size as an additional argument. 124 125 00:12:03,870 --> 00:12:04,250 All right. 125 126 00:12:04,290 --> 00:12:11,280 So that's a little bit more detail on how you can control the aesthetics of your seaborn chart in your 126 127 00:12:11,280 --> 00:12:12,440 notebook. 127 128 00:12:12,600 --> 00:12:17,880 But the last thing I want to mention is that there is an additional template that you can mix and match 128 129 00:12:17,940 --> 00:12:27,540 with say "whitegrid" or "darkgrid" and these templates are called contexts, if you will. 129 130 00:12:27,570 --> 00:12:37,830 So "sns.set_context()" will allow us to put in a template here for how this 130 131 00:12:37,830 --> 00:12:46,710 chart is gonna be used. For example, a context might be "talk", and if I use that then you can see that the 131 132 00:12:46,710 --> 00:12:52,170 font size is a lot larger and the dots are a little bit more clear. 132 133 00:12:52,200 --> 00:12:58,170 So this is presumably because you want to present this chart somewhere. 133 134 00:12:58,200 --> 00:13:00,810 Now there's a couple of other contexts as well. 134 135 00:13:00,900 --> 00:13:05,560 You can use "notebook" which will make the chart look like this.
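One way to see what a context actually changes is to compare the font size it sets. This is only a sketch: it reads the current settings back with `sns.plotting_context()`, and the names `notebook_font` and `talk_font` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Jupyter
import seaborn as sns

sns.set()  # start from the defaults

sns.set_context("notebook")
notebook_font = sns.plotting_context()["font.size"]

sns.set_context("talk")  # scaled up for presenting the chart
talk_font = sns.plotting_context()["font.size"]

print(f"notebook font size: {notebook_font}, talk font size: {talk_font}")

# Style and context are independent, so they can be mixed and matched.
sns.set_style("whitegrid")
sns.set_context("talk")
```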
135 136 00:13:05,670 --> 00:13:10,230 This is a template that's quite good if you're viewing this kind of stuff on a monitor and you're not 136 137 00:13:10,500 --> 00:13:17,730 having to throw it up on a screen or like a presentation and pressing Shift+Tab on context shows 137 138 00:13:17,730 --> 00:13:20,920 us that there's a couple of other options as well. 138 139 00:13:21,060 --> 00:13:27,550 So there's "paper", there's "poster" and there's "talk" and "notebook" which we've already looked at. 139 140 00:13:27,570 --> 00:13:34,800 I'm going to go with "talk" just to make it a little bit more readable on the video. The very last thing 140 141 00:13:34,800 --> 00:13:43,260 I'm going to mention on the styling front is how you can get a similar sort of transparency and the color 141 142 00:13:43,950 --> 00:13:46,290 that we have here in matplotlib. 142 143 00:13:46,350 --> 00:13:52,860 So I'm going to show you how you can set that by supplying certain arguments. The color argument is pretty 143 144 00:13:52,860 --> 00:13:53,450 straightforward, 144 145 00:13:53,460 --> 00:14:05,530 so "color = 'indigo'" will give us a purple chart but when it comes to the transparency, you have to supply 145 146 00:14:05,530 --> 00:14:12,190 the argument in a different way, because jointplot doesn't take an argument called alpha, that's only 146 147 00:14:12,190 --> 00:14:14,180 for matplotlib. 147 148 00:14:14,320 --> 00:14:23,880 Instead you have to go to the keyword arguments, so "joint_kws = " 148 149 00:14:24,070 --> 00:14:31,450 and here you provide a dictionary, a Python dictionary, so that uses the curly braces notation and then 149 150 00:14:31,450 --> 00:14:42,130 a key value pair - "alpha" for the key, colon, and then say 0.5 for the value. If I press Shift+ 150 151 00:14:42,130 --> 00:14:47,260 Enter now we'll have the transparency applied to our data points.
151 152 00:14:47,860 --> 00:14:53,920 So I hope you find it useful to see two different ways of generating this chart with different modules. 152 153 00:14:53,920 --> 00:14:58,780 The first one being matplotlib and the second one being seaborn. 153 154 00:14:58,920 --> 00:15:01,390 Now there's a really cool thing I want to show you next. 154 155 00:15:01,800 --> 00:15:05,560 And that's to do with the fact that this jointplot method is actually incredibly powerful. 155 156 00:15:05,680 --> 00:15:08,840 So let me copy this cell and paste it below. 156 157 00:15:08,950 --> 00:15:16,270 So I have two copies of it now and what I'm going to do is just for comparison I'm going to modify how these 157 158 00:15:16,270 --> 00:15:19,000 data points are represented here. 158 159 00:15:19,090 --> 00:15:24,100 So I'm going to go with a blue color to set them apart. 159 160 00:15:24,100 --> 00:15:27,050 So I've got my blue one here, purple one here. 160 161 00:15:27,400 --> 00:15:34,300 And then what I'm going to do is I'm going to show you this keyword argument that I'm going to change in the 161 162 00:15:34,300 --> 00:15:43,630 quick documentation. So pressing Shift+Tab on jointplot shows us that this keyword argument, "kind", 162 163 00:15:44,170 --> 00:15:47,200 is set to scatter by default. 163 164 00:15:47,380 --> 00:15:57,250 But there's other values that this can take, for example "kde", "reg", "resid" and "hex". 164 165 00:15:57,340 --> 00:16:02,290 So we've actually got a choice between five different values. 165 166 00:16:02,290 --> 00:16:04,390 Let me show you what one of them does. 166 167 00:16:04,540 --> 00:16:10,960 I'm going to go ahead and delete this argument here where we've set our alpha value. 167 168 00:16:10,960 --> 00:16:18,060 And if I press Shift+Enter you can see that we no longer have any alpha values on our chart.
168 169 00:16:18,400 --> 00:16:27,910 But if I come in here and I change the kind to "hex", so writing "kind = 'hex'" 169 170 00:16:28,830 --> 00:16:36,540 and then I hit Shift+Enter, we get the following. We get a chart that looks like this and what this chart is 170 171 00:16:36,540 --> 00:16:43,620 doing is that it's aggregating the data points that all fall in a certain area and then it shades them 171 172 00:16:43,620 --> 00:16:51,900 in depending on how many data points there are in that particular sector. So you're aggregating the data 172 173 00:16:51,900 --> 00:16:53,980 points over like a little 2D area. 173 174 00:16:54,120 --> 00:17:00,030 And again the shading gives us a very good idea of the density of the data points in that particular 174 175 00:17:00,030 --> 00:17:02,520 part of the plot. 175 176 00:17:02,520 --> 00:17:07,350 In other words we're aggregating the data in a hexagonal grid. 176 177 00:17:07,350 --> 00:17:11,280 And I think this is quite a beautiful visualization actually. 177 178 00:17:11,370 --> 00:17:16,290 And it's one that you don't tend to see that often but it does remind me a little bit of a board game 178 179 00:17:16,290 --> 00:17:17,970 called Settlers of Catan. 179 180 00:17:18,030 --> 00:17:19,110 But maybe that's just me.