0
1
00:00:01,050 --> 00:00:06,870
Now that we've defined and formulated our question, it's time to gather the data.
1

2
00:00:06,870 --> 00:00:10,440
So let's have a think about the kind of data that we're looking to gather.
2

3
00:00:10,440 --> 00:00:16,040
First off, we need data on our feature, namely the movie budgets in US dollars.
3

4
00:00:16,050 --> 00:00:22,740
The second thing that we need is data on our target, namely the movie revenue in US dollars.
4

5
00:00:23,820 --> 00:00:26,220
So, where would we go about finding this kind of data?
5

6
00:00:27,120 --> 00:00:34,200
Well lucky for us, there exists a website called "The Numbers" and their slogan is "Where Data and the Movie
6

7
00:00:34,200 --> 00:00:35,770
Business Meet".
7

8
00:00:35,820 --> 00:00:39,030
This sounds perfect for our project.
8

9
00:00:39,030 --> 00:00:40,520
So let's check it out.
9

10
00:00:40,560 --> 00:00:42,660
So this is what the website looks like.
10

11
00:00:42,660 --> 00:00:50,640
If we go to Movies > Budgets and Finances, then we get the performance records of various films, and they've
11

12
00:00:50,700 --> 00:00:56,670
even got some tables here on some common searches or common questions that people might be interested
12

13
00:00:56,670 --> 00:00:57,240
in.
13

14
00:00:57,360 --> 00:01:01,080
For example, they've got the biggest budget films right here.
14

15
00:01:01,080 --> 00:01:07,050
And if we scroll down, they've got the movies with the lowest budgets to earn one million dollars at
15

16
00:01:07,050 --> 00:01:08,850
the U.S. box office.
16

17
00:01:08,850 --> 00:01:15,480
And I think a notable film here is El Mariachi, which apparently only had a production budget of seven
17

18
00:01:15,480 --> 00:01:19,850
thousand dollars but made about two million in the U.S. box office.
18

19
00:01:19,980 --> 00:01:27,660
If we click on this link here where it says "the complete list", then we get the entire table of all the
19

20
00:01:27,660 --> 00:01:30,990
films in the database right here.
20

21
00:01:30,990 --> 00:01:32,430
And this table is actually enormous.
21

22
00:01:32,430 --> 00:01:33,420
It's very, very long.
22

23
00:01:33,450 --> 00:01:35,670
There's quite a few different entries here.
23

24
00:01:36,420 --> 00:01:44,280
And what I've done is I've selected all of these and I've copy- pasted them into a spreadsheet. Now, to
24

25
00:01:44,280 --> 00:01:46,280
save you this kind of tedious work,
25

26
00:01:46,350 --> 00:01:51,260
I've made a CSV file available under the lesson resources.
26

27
00:01:51,390 --> 00:01:58,120
All you have to do is download it and open it in Excel or in Google Sheets.
27

28
00:01:58,410 --> 00:02:03,900
And once you open this file, what you should see is the movies sorted by their rank.
28

29
00:02:04,020 --> 00:02:10,010
So at place number one, we've got Avatar then a Star Wars film then Pirates Of The Caribbean.
29

30
00:02:10,170 --> 00:02:13,170
All of these films are sorted by their production budget,
30

31
00:02:13,230 --> 00:02:19,980
so column D. And if we scroll down, then we can see that there's a lot of entries in this file.
31

32
00:02:20,130 --> 00:02:25,260
As a matter of fact, there's about 5000 entries in this CSV file.
32

33
00:02:25,260 --> 00:02:31,950
So, this completes our data gathering step and we can move on to the next part of the workflow, namely
33

34
00:02:32,190 --> 00:02:33,330
cleaning the data.
34

35
00:02:33,780 --> 00:02:36,790
So, let's take a closer look at what we've actually got.
35

36
00:02:36,810 --> 00:02:43,350
If I scroll down here, then I can see there's a lot of numbers, but occasionally we might spot something
36

37
00:02:43,350 --> 00:02:44,610
a little bit odd.
37

38
00:02:44,790 --> 00:02:50,490
For example, here in columns E and F we have two big fat zeros. Scrolling up,
38

39
00:02:50,700 --> 00:02:58,680
I can see these are the columns for Worldwide Gross and Domestic Gross and yet this movie entitled Singularity
39

40
00:02:59,070 --> 00:03:07,170
with a production value of 175 million dollars has zero revenue.
40

41
00:03:07,200 --> 00:03:11,690
And the reason for this might be found in this column, our second column here.
41

42
00:03:11,850 --> 00:03:14,830
What we have here is the projected release date.
42

43
00:03:15,030 --> 00:03:21,610
So, end of 2020 is when this movie is due to be released according to this dataset.
43

44
00:03:21,660 --> 00:03:28,080
And what they've done is they've put a placeholder of zero into these two columns instead of removing
44

45
00:03:28,080 --> 00:03:29,490
this movie entirely.
45

46
00:03:29,490 --> 00:03:35,010
They've included it in the data set with one hundred and seventy five million dollars production budget,
46

47
00:03:35,430 --> 00:03:39,930
but because it hasn't made any revenue to date, they've added some zeros here.
47

48
00:03:39,930 --> 00:03:44,610
Now, this might actually be a little bit of a problem for us, because we're looking to make a prediction
48

49
00:03:44,910 --> 00:03:46,590
for our movie revenue.
49

50
00:03:46,800 --> 00:03:53,340
And if our dataset is going to include films that have not been released yet and have big fat zeros
50

51
00:03:53,340 --> 00:03:58,210
for the revenue, it might make our estimates a little bit unreliable.
51

52
00:03:58,230 --> 00:03:58,900
Right?
52

53
00:03:58,920 --> 00:04:04,750
So, one thing that might actually be worth doing, is checking how many films, out of the five thousand
53

54
00:04:04,750 --> 00:04:08,010
three hundred, actually have a revenue of zero.
54

55
00:04:08,130 --> 00:04:11,390
And we can do this by sorting by Worldwide Gross.
55

56
00:04:11,730 --> 00:04:19,830
If I select my data here and then open the menu and you go to Data > Sort Range, then I can actually sort
56

57
00:04:20,010 --> 00:04:21,250
by column number E.
57

58
00:04:21,570 --> 00:04:26,760
And, since we don't want to include that first row in the sort, I'm just going to tick this box here and
58

59
00:04:26,760 --> 00:04:33,780
then click Sort, and what we see here is that there's quite a few films that have a zero entry for Worldwide
59

60
00:04:33,810 --> 00:04:35,880
Gross and Domestic Gross.
60

61
00:04:35,880 --> 00:04:37,360
Now, why might that be?
61

62
00:04:37,380 --> 00:04:43,350
Well, for some of them the release date is in the future, but looking at this column, we see that not every
62

63
00:04:43,350 --> 00:04:48,470
film that has a 0 here is due to be released in the future.
63

64
00:04:48,480 --> 00:04:56,100
So for example, this one here "Black Water Transit" was supposedly released at the end of 2008, and this
64

65
00:04:56,100 --> 00:05:01,590
film also has a big fat zero in Worldwide Gross. Why might that be?
65

66
00:05:01,600 --> 00:05:06,860
Now, one possible answer is looking at the description of the website.
66

67
00:05:06,910 --> 00:05:13,680
So, if we go here then we see this line: "The data, to the best of our knowledge, is accurate,
67

68
00:05:13,840 --> 00:05:17,620
but there are gaps and disputed figures".
68

69
00:05:17,620 --> 00:05:21,120
So, some of these zeros might be gaps.
69

70
00:05:21,190 --> 00:05:26,770
It might just be that the data is incomplete for some films, but I've actually looked up this Blackwater
70

71
00:05:26,770 --> 00:05:34,780
Transit film and on Wikipedia I can see that in post-production there were actually quite a few lawsuits
71

72
00:05:34,780 --> 00:05:40,210
and there was quite a lot of back and forth between the production company and the filmmakers and the
72

73
00:05:40,210 --> 00:05:41,890
producers and so on.
73

74
00:05:41,890 --> 00:05:49,990
So this film actually never got released and this release date here of end of 2008 was probably a planned
74

75
00:05:49,990 --> 00:05:53,080
release date, but the movie actually never came out.
75

76
00:05:53,080 --> 00:05:59,530
Now, this might be something that happened to quite a few of these films, and they might have a 0 for
76

77
00:05:59,590 --> 00:06:01,180
various other reasons, too.
77

78
00:06:01,390 --> 00:06:09,100
But, this poses a big problem for our analysis. If we're looking to predict movie revenue based on the
78

79
00:06:09,100 --> 00:06:13,660
budget, then having these zero entries here is going to be a problem.
79

80
00:06:14,140 --> 00:06:20,350
So, what I would suggest, given that some of these films were probably not released, other ones the data
80

81
00:06:20,350 --> 00:06:26,260
might not be available, is that we should actually exclude the films that have 0 here for Worldwide
81

82
00:06:26,260 --> 00:06:27,960
Gross and Domestic Gross.
82

83
00:06:28,150 --> 00:06:33,320
We should select all of these entries and delete them from our dataset.
83

84
00:06:33,340 --> 00:06:41,320
So if I come down here to row 358 and scroll all the way up, then I can select all of these rows, I can
84

85
00:06:41,320 --> 00:06:43,860
right click and I can delete the rows.
85

86
00:06:44,020 --> 00:06:47,300
We're going to exclude these rows from our analysis.
86

87
00:06:47,350 --> 00:06:52,930
The next thing that we're going to do is, we're actually going to have to do some calculations and these
87

88
00:06:52,930 --> 00:06:56,430
involve the Production Budget and the Worldwide Gross.
88

89
00:06:56,560 --> 00:06:58,880
These other columns we're actually not going to use.
89

90
00:06:58,990 --> 00:07:00,640
We're not going to use the Domestic Gross.
90

91
00:07:00,640 --> 00:07:04,240
We're not going to use the Movie Title, or the Rank or the Release Date.
91

92
00:07:04,270 --> 00:07:09,910
So what I would suggest is that, as part of the data cleaning process, we're going to delete these entire
92

93
00:07:09,910 --> 00:07:10,830
columns.
93

94
00:07:10,930 --> 00:07:15,810
So, we're going to delete these columns and we're going to delete this column right here.
94

95
00:07:15,910 --> 00:07:20,640
That way we can trim our spreadsheet to what actually matters for our analysis.
95

96
00:07:20,680 --> 00:07:26,170
The next thing that we need to do is we need to have a think about these special characters that might
96

97
00:07:26,170 --> 00:07:31,210
be in the dataset, and by special characters I mean anything other than a number.
97

98
00:07:31,270 --> 00:07:38,230
So, we're gonna be doing some calculations with all of this data and those calculations might be troubled
98

99
00:07:38,230 --> 00:07:41,790
by the fact that there's a dollar sign here, or a comma.
99

100
00:07:41,800 --> 00:07:44,960
This type of formatting could cause problems.
100

101
00:07:45,100 --> 00:07:51,220
So, it's better to remove it and have a pure number in these cells, rather than all this extra stuff.
101

102
00:07:51,220 --> 00:07:55,960
And the way that we can format all these numbers and get rid of these symbols is by selecting the columns,
102

103
00:07:56,260 --> 00:08:04,420
going to Format > Number and then scrolling down where it says More Formats and going to Custom Number Format.
103

104
00:08:04,480 --> 00:08:09,410
And here we can pick the second one down which is this decimal number and click Apply.
104

105
00:08:09,430 --> 00:08:16,110
Now we've got no more dollar signs in our columns, so we're almost done cleaning our data.
105

106
00:08:16,150 --> 00:08:20,410
The last thing I'm going to do is, I'm going to rename my column headings.
106

107
00:08:20,410 --> 00:08:25,720
And the only reason I'm doing that is because I'm going to remove spaces and these custom symbols here -
107

108
00:08:25,930 --> 00:08:31,900
the parentheses and the dollar signs from the headings to make our lives a little bit easier on. I find
108

109
00:08:31,900 --> 00:08:37,800
that, as soon as I have special symbols or white spaces in my headings, I make a lot more typos when I've
109

110
00:08:37,800 --> 00:08:39,910
referred to particular columns.
110

111
00:08:39,910 --> 00:08:48,250
So for this one, I'm just gonna call this production_budget_usd, and this one here
111

112
00:08:48,550 --> 00:08:56,230
I'm going to call worldwide_gross_usd. With these two names for our headings, I've
112

113
00:08:56,230 --> 00:09:02,050
got very expressive headings, and yet there's no whitespace and there's no special characters.
113

114
00:09:02,050 --> 00:09:05,340
This will make our life a little bit easier down the line.
114

115
00:09:05,350 --> 00:09:12,250
So, in summary, for the data cleaning, what we've done is: we've looked for missing data, we've looked for
115

116
00:09:12,310 --> 00:09:19,210
incomplete data, we've looked for inaccurate data, and we've done some tidying up by checking the formatting
116

117
00:09:19,270 --> 00:09:21,380
of our data in detail.
117

118
00:09:21,520 --> 00:09:27,310
And the reason we've done all of this is so that we don't have problems when we provide this data to
118

119
00:09:27,310 --> 00:09:30,060
our machine learning algorithm down the line.
119

120
00:09:30,100 --> 00:09:35,470
Now we can move on to the next step, which is exploring and visualizing our data.
120

121
00:09:35,470 --> 00:09:40,210
Mind you, if we spot any more problems during this next step we might have to do a little bit more cleaning
121

122
00:09:40,210 --> 00:09:41,150
as well.
122

123
00:09:41,260 --> 00:09:46,300
So, you'll find that very often in practice, you're going to be doing the cleaning and the exploration
123

124
00:09:46,480 --> 00:09:47,500
side by side.
124

125
00:09:47,950 --> 00:09:49,370
I'll see you in the next lesson.
125

126
00:09:49,390 --> 00:09:49,930
Take care.