0 1 00:00:01,050 --> 00:00:06,870 Now that we've defined and formulated our question, it's time to gather the data. 1 2 00:00:06,870 --> 00:00:10,440 So let's have a think about the kind of data that we're looking to gather. 2 3 00:00:10,440 --> 00:00:16,040 First off, we need data on our feature, namely the movie budgets in US dollars. 3 4 00:00:16,050 --> 00:00:22,740 The second thing that we need is data on our target, namely the movie revenue in US dollars. 4 5 00:00:23,820 --> 00:00:26,220 So, where would we go about finding this kind of data? 5 6 00:00:27,120 --> 00:00:34,200 Well lucky for us, there exists a website called "The Numbers" and their slogan is "Where Data and the Movie 6 7 00:00:34,200 --> 00:00:35,770 Business Meet". 7 8 00:00:35,820 --> 00:00:39,030 This sounds perfect for our project. 8 9 00:00:39,030 --> 00:00:40,520 So let's check it out. 9 10 00:00:40,560 --> 00:00:42,660 So this is what the website looks like. 10 11 00:00:42,660 --> 00:00:50,640 If we go to Movies > Budgets and Finances, then we get the performance records of various films, and they've 11 12 00:00:50,700 --> 00:00:56,670 even got some tables here on some common searches or common questions that people might be interested 12 13 00:00:56,670 --> 00:00:57,240 in. 13 14 00:00:57,360 --> 00:01:01,080 For example, they've got the biggest budget films right here. 14 15 00:01:01,080 --> 00:01:07,050 And if we scroll down, they've got the movies with the lowest budgets to earn one million dollars at 15 16 00:01:07,050 --> 00:01:08,850 the U.S. box office. 16 17 00:01:08,850 --> 00:01:15,480 And I think a notable film here is El Mariachi, which apparently only had a production budget of seven 17 18 00:01:15,480 --> 00:01:19,850 thousand dollars but made about two million in the U.S. box office. 18 19 00:01:19,980 --> 00:01:27,660 If we click on this link here where it says "the complete list", then we get the entire table of all the 19 20 00:01:27,660 --> 00:01:30,990 films in the database right here. 20 21 00:01:30,990 --> 00:01:32,430 And this table is actually enormous. 21 22 00:01:32,430 --> 00:01:33,420 It's very, very long. 22 23 00:01:33,450 --> 00:01:35,670 There's quite a few different entries here. 23 24 00:01:36,420 --> 00:01:44,280 And what I've done is I've selected all of these and I've copy- pasted them into a spreadsheet. Now, to 24 25 00:01:44,280 --> 00:01:46,280 save you this kind of tedious work, 25 26 00:01:46,350 --> 00:01:51,260 I've made a CSV file available under the lesson resources. 26 27 00:01:51,390 --> 00:01:58,120 All you have to do is download it and open it in Excel or in Google Sheets. 27 28 00:01:58,410 --> 00:02:03,900 And once you open this file, what you should see is the movies sorted by their rank. 28 29 00:02:04,020 --> 00:02:10,010 So at place number one, we've got Avatar then a Star Wars film then Pirates Of The Caribbean. 29 30 00:02:10,170 --> 00:02:13,170 All of these films are sorted by their production budget, 30 31 00:02:13,230 --> 00:02:19,980 so column D. And if we scroll down, then we can see that there's a lot of entries in this file. 31 32 00:02:20,130 --> 00:02:25,260 As a matter of fact, there's about 5000 entries in this CSV file. 32 33 00:02:25,260 --> 00:02:31,950 So, this completes our data gathering step and we can move on to the next part of the workflow, namely 33 34 00:02:32,190 --> 00:02:33,330 cleaning the data. 34 35 00:02:33,780 --> 00:02:36,790 So, let's take a closer look at what we've actually got. 35 36 00:02:36,810 --> 00:02:43,350 If I scroll down here, then I can see there's a lot of numbers, but occasionally we might spot something 36 37 00:02:43,350 --> 00:02:44,610 a little bit odd. 37 38 00:02:44,790 --> 00:02:50,490 For example, here in columns E and F we have two big fat zeros. Scrolling up, 38 39 00:02:50,700 --> 00:02:58,680 I can see these are the columns for Worldwide Gross and Domestic Gross and yet this movie entitled Singularity 39 40 00:02:59,070 --> 00:03:07,170 with a production value of 175 million dollars has zero revenue. 40 41 00:03:07,200 --> 00:03:11,690 And the reason for this might be found in this column, our second column here. 41 42 00:03:11,850 --> 00:03:14,830 What we have here is the projected release date. 42 43 00:03:15,030 --> 00:03:21,610 So, end of 2020 is when this movie is due to be released according to this dataset. 43 44 00:03:21,660 --> 00:03:28,080 And what they've done is they've put a placeholder of zero into these two columns instead of removing 44 45 00:03:28,080 --> 00:03:29,490 this movie entirely. 45 46 00:03:29,490 --> 00:03:35,010 They've included it in the data set with one hundred and seventy five million dollars production budget, 46 47 00:03:35,430 --> 00:03:39,930 but because it hasn't made any revenue to date, they've added some zeros here. 47 48 00:03:39,930 --> 00:03:44,610 Now, this might actually be a little bit of a problem for us, because we're looking to make a prediction 48 49 00:03:44,910 --> 00:03:46,590 for our movie revenue. 49 50 00:03:46,800 --> 00:03:53,340 And if our dataset is going to include films that have not been released yet and have big fat zeros 50 51 00:03:53,340 --> 00:03:58,210 for the revenue, it might make our estimates a little bit unreliable. 51 52 00:03:58,230 --> 00:03:58,900 Right? 52 53 00:03:58,920 --> 00:04:04,750 So, one thing that might actually be worth doing, is checking how many films, out of the five thousand 53 54 00:04:04,750 --> 00:04:08,010 three hundred, actually have a revenue of zero. 54 55 00:04:08,130 --> 00:04:11,390 And we can do this by sorting by Worldwide Gross. 55 56 00:04:11,730 --> 00:04:19,830 If I select my data here and then open the menu and you go to Data > Sort Range, then I can actually sort 56 57 00:04:20,010 --> 00:04:21,250 by column number E. 57 58 00:04:21,570 --> 00:04:26,760 And, since we don't want to include that first row in the sort, I'm just going to tick this box here and 58 59 00:04:26,760 --> 00:04:33,780 then click Sort, and what we see here is that there's quite a few films that have a zero entry for Worldwide 59 60 00:04:33,810 --> 00:04:35,880 Gross and Domestic Gross. 60 61 00:04:35,880 --> 00:04:37,360 Now, why might that be? 61 62 00:04:37,380 --> 00:04:43,350 Well, for some of them the release date is in the future, but looking at this column, we see that not every 62 63 00:04:43,350 --> 00:04:48,470 film that has a 0 here is due to be released in the future. 63 64 00:04:48,480 --> 00:04:56,100 So for example, this one here "Black Water Transit" was supposedly released at the end of 2008, and this 64 65 00:04:56,100 --> 00:05:01,590 film also has a big fat zero in Worldwide Gross. Why might that be? 65 66 00:05:01,600 --> 00:05:06,860 Now, one possible answer is looking at the description of the website. 66 67 00:05:06,910 --> 00:05:13,680 So, if we go here then we see this line: "The data, to the best of our knowledge, is accurate, 67 68 00:05:13,840 --> 00:05:17,620 but there are gaps and disputed figures". 68 69 00:05:17,620 --> 00:05:21,120 So, some of these zeros might be gaps. 69 70 00:05:21,190 --> 00:05:26,770 It might just be that the data is incomplete for some films, but I've actually looked up this Blackwater 70 71 00:05:26,770 --> 00:05:34,780 Transit film and on Wikipedia I can see that in post-production there were actually quite a few lawsuits 71 72 00:05:34,780 --> 00:05:40,210 and there was quite a lot of back and forth between the production company and the filmmakers and the 72 73 00:05:40,210 --> 00:05:41,890 producers and so on. 73 74 00:05:41,890 --> 00:05:49,990 So this film actually never got released and this release date here of end of 2008 was probably a planned 74 75 00:05:49,990 --> 00:05:53,080 release date, but the movie actually never came out. 75 76 00:05:53,080 --> 00:05:59,530 Now, this might be something that happened to quite a few of these films, and they might have a 0 for 76 77 00:05:59,590 --> 00:06:01,180 various other reasons, too. 77 78 00:06:01,390 --> 00:06:09,100 But, this poses a big problem for our analysis. If we're looking to predict movie revenue based on the 78 79 00:06:09,100 --> 00:06:13,660 budget, then having these zero entries here is going to be a problem. 79 80 00:06:14,140 --> 00:06:20,350 So, what I would suggest, given that some of these films were probably not released, other ones the data 80 81 00:06:20,350 --> 00:06:26,260 might not be available, is that we should actually exclude the films that have 0 here for Worldwide 81 82 00:06:26,260 --> 00:06:27,960 Gross and Domestic Gross. 82 83 00:06:28,150 --> 00:06:33,320 We should select all of these entries and delete them from our dataset. 83 84 00:06:33,340 --> 00:06:41,320 So if I come down here to row 358 and scroll all the way up, then I can select all of these rows, I can 84 85 00:06:41,320 --> 00:06:43,860 right click and I can delete the rows. 85 86 00:06:44,020 --> 00:06:47,300 We're going to exclude these rows from our analysis. 86 87 00:06:47,350 --> 00:06:52,930 The next thing that we're going to do is, we're actually going to have to do some calculations and these 87 88 00:06:52,930 --> 00:06:56,430 involve the Production Budget and the Worldwide Gross. 88 89 00:06:56,560 --> 00:06:58,880 These other columns we're actually not going to use. 89 90 00:06:58,990 --> 00:07:00,640 We're not going to use the Domestic Gross. 90 91 00:07:00,640 --> 00:07:04,240 We're not going to use the Movie Title, or the Rank or the Release Date. 91 92 00:07:04,270 --> 00:07:09,910 So what I would suggest is that, as part of the data cleaning process, we're going to delete these entire 92 93 00:07:09,910 --> 00:07:10,830 columns. 93 94 00:07:10,930 --> 00:07:15,810 So, we're going to delete these columns and we're going to delete this column right here. 94 95 00:07:15,910 --> 00:07:20,640 That way we can trim our spreadsheet to what actually matters for our analysis. 95 96 00:07:20,680 --> 00:07:26,170 The next thing that we need to do is we need to have a think about these special characters that might 96 97 00:07:26,170 --> 00:07:31,210 be in the dataset, and by special characters I mean anything other than a number. 97 98 00:07:31,270 --> 00:07:38,230 So, we're gonna be doing some calculations with all of this data and those calculations might be troubled 98 99 00:07:38,230 --> 00:07:41,790 by the fact that there's a dollar sign here, or a comma. 99 100 00:07:41,800 --> 00:07:44,960 This type of formatting could cause problems. 100 101 00:07:45,100 --> 00:07:51,220 So, it's better to remove it and have a pure number in these cells, rather than all this extra stuff. 101 102 00:07:51,220 --> 00:07:55,960 And the way that we can format all these numbers and get rid of these symbols is by selecting the columns, 102 103 00:07:56,260 --> 00:08:04,420 going to Format > Number and then scrolling down where it says More Formats and going to Custom Number Format. 103 104 00:08:04,480 --> 00:08:09,410 And here we can pick the second one down which is this decimal number and click Apply. 104 105 00:08:09,430 --> 00:08:16,110 Now we've got no more dollar signs in our columns, so we're almost done cleaning our data. 105 106 00:08:16,150 --> 00:08:20,410 The last thing I'm going to do is, I'm going to rename my column headings. 106 107 00:08:20,410 --> 00:08:25,720 And the only reason I'm doing that is because I'm going to remove spaces and these custom symbols here - 107 108 00:08:25,930 --> 00:08:31,900 the parentheses and the dollar signs from the headings to make our lives a little bit easier on. I find 108 109 00:08:31,900 --> 00:08:37,800 that, as soon as I have special symbols or white spaces in my headings, I make a lot more typos when I've 109 110 00:08:37,800 --> 00:08:39,910 referred to particular columns. 110 111 00:08:39,910 --> 00:08:48,250 So for this one, I'm just gonna call this production_budget_usd, and this one here 111 112 00:08:48,550 --> 00:08:56,230 I'm going to call worldwide_gross_usd. With these two names for our headings, I've 112 113 00:08:56,230 --> 00:09:02,050 got very expressive headings, and yet there's no whitespace and there's no special characters. 113 114 00:09:02,050 --> 00:09:05,340 This will make our life a little bit easier down the line. 114 115 00:09:05,350 --> 00:09:12,250 So, in summary, for the data cleaning, what we've done is: we've looked for missing data, we've looked for 115 116 00:09:12,310 --> 00:09:19,210 incomplete data, we've looked for inaccurate data, and we've done some tidying up by checking the formatting 116 117 00:09:19,270 --> 00:09:21,380 of our data in detail. 117 118 00:09:21,520 --> 00:09:27,310 And the reason we've done all of this is so that we don't have problems when we provide this data to 118 119 00:09:27,310 --> 00:09:30,060 our machine learning algorithm down the line. 119 120 00:09:30,100 --> 00:09:35,470 Now we can move on to the next step, which is exploring and visualizing our data. 120 121 00:09:35,470 --> 00:09:40,210 Mind you, if we spot any more problems during this next step we might have to do a little bit more cleaning 121 122 00:09:40,210 --> 00:09:41,150 as well. 122 123 00:09:41,260 --> 00:09:46,300 So, you'll find that very often in practice, you're going to be doing the cleaning and the exploration 123 124 00:09:46,480 --> 00:09:47,500 side by side. 124 125 00:09:47,950 --> 00:09:49,370 I'll see you in the next lesson. 125 126 00:09:49,390 --> 00:09:49,930 Take care.