0 1 00:00:01,030 --> 00:00:08,590 Welcome to the first module! In this module we'll learn about how to use data science to solve a practical 1 2 00:00:08,590 --> 00:00:16,840 problem and we will also introduce our first machine learning algorithm - namely linear regression. Regression 2 3 00:00:16,930 --> 00:00:21,920 is the workhorse of data science, business analytics and research. 3 4 00:00:22,180 --> 00:00:28,120 You'll find people using regression whenever someone is looking to model a phenomenon or make a prediction 4 5 00:00:28,390 --> 00:00:31,620 or discover the hidden relationships among things. 5 6 00:00:31,630 --> 00:00:36,410 Also, I want to show you how to solve problems using Data Science. 6 7 00:00:36,430 --> 00:00:38,980 So what kind of problem are you going to look at? 7 8 00:00:38,980 --> 00:00:45,220 Well, imagine that you're a film producer. You provide money to make movies and when the movies make money, 8 9 00:00:45,790 --> 00:00:46,850 you make money. 9 10 00:00:47,140 --> 00:00:50,620 And when the movies lose money, you lose money. 10 11 00:00:50,620 --> 00:00:52,930 But today you discover a problem. 11 12 00:00:53,230 --> 00:00:58,710 You found out that you just put a whole bunch of money into a movie called "Zombies Rise Again 12 13 00:00:58,710 --> 00:00:59,800 Part 5". 13 14 00:01:00,430 --> 00:01:07,030 And if that doesn't sound like a terrible title for a film, then I don't know what does. Worse yet, you've 14 15 00:01:07,030 --> 00:01:09,070 staked your reputation on this film, 15 16 00:01:09,130 --> 00:01:11,990 you've mortgaged your house, you've sold your car, 16 17 00:01:12,340 --> 00:01:15,830 you've sold your grandmother's silverware to fund this film. 17 18 00:01:15,880 --> 00:01:17,740 So, is your money gone? 18 19 00:01:17,740 --> 00:01:19,210 Will you go bankrupt? 19 20 00:01:19,210 --> 00:01:22,490 Well, to answer these questions you would have to be able to predict the future. 20 21 00:01:22,810 --> 00:01:25,710 And this is where data science will come to the rescue. 21 22 00:01:26,410 --> 00:01:33,700 Let's have a think about how a data scientist would approach this problem. Well first, a data scientist 22 23 00:01:33,940 --> 00:01:38,910 would carefully formulate the question that they're looking to answer. Why? 23 24 00:01:39,000 --> 00:01:46,540 Well, a clear and well formulated question will determine the research and it will also affect the kind 24 25 00:01:46,540 --> 00:01:50,900 of data that you will go out and gather. In fact, 25 26 00:01:50,900 --> 00:01:59,930 step 2 is gathering the data that will help us answer the question. But real world data is also messy. 26 27 00:01:59,930 --> 00:02:02,030 So we have to clean the data. 27 28 00:02:02,030 --> 00:02:05,260 We have to look out for missing data or incomplete data. 28 29 00:02:05,270 --> 00:02:08,880 We have to look out for errors and even bound formatting. 29 30 00:02:08,990 --> 00:02:12,530 In fact, we have to explore the data that we've gathered. 30 31 00:02:12,530 --> 00:02:17,540 And often this means visualizing the data so that we can better understand what it is that we're working 31 32 00:02:17,540 --> 00:02:23,860 with. A graph or a chart is much more helpful than a table of numbers. 32 33 00:02:23,960 --> 00:02:25,730 And after we've done all that, 33 34 00:02:25,730 --> 00:02:33,110 the next step is actually training our algorithm using our computer to identify patterns in the data. 34 35 00:02:33,110 --> 00:02:37,000 In our case, that algorithm will be our linear regression. 35 36 00:02:38,090 --> 00:02:41,420 And finally, we have to evaluate the results. 36 37 00:02:41,480 --> 00:02:43,240 How did our algorithm do? 37 38 00:02:43,350 --> 00:02:47,090 Did it answer our question? How accurate was our algorithm 38 39 00:02:47,090 --> 00:02:53,900 in answering our question? The process that I've outlined here is the process data scientists use to 39 40 00:02:53,900 --> 00:02:55,470 solve problems. 40 41 00:02:55,490 --> 00:02:59,850 This is their workflow for understanding and making sense of the world. 41 42 00:02:59,960 --> 00:03:01,800 So let's tackle that first step. 42 43 00:03:01,850 --> 00:03:04,660 Let's formulate our question. 43 44 00:03:04,730 --> 00:03:06,450 So here's our first shot at this. 44 45 00:03:06,500 --> 00:03:09,830 How much money will our movie make? Now, 45 46 00:03:10,040 --> 00:03:12,830 this is actually not a very good question. 46 47 00:03:12,830 --> 00:03:16,260 The problem is is that this question is too vague. 47 48 00:03:16,400 --> 00:03:19,970 For starters, what do we actually mean with the word money? 48 49 00:03:20,060 --> 00:03:21,350 Can we be more specific? 49 50 00:03:22,280 --> 00:03:23,570 How about this? 50 51 00:03:23,570 --> 00:03:27,220 How about "How much revenue will our movie make?"? Now, 51 52 00:03:27,230 --> 00:03:34,040 this is already a much better question, because revenue is something that means something very specific 52 53 00:03:34,250 --> 00:03:35,510 in the business world. 53 54 00:03:35,540 --> 00:03:41,330 It's something that accountants actually measure and track and that means that we can actually find 54 55 00:03:41,330 --> 00:03:43,610 data on movie revenue. 55 56 00:03:43,880 --> 00:03:47,350 But, what does movie revenue actually depend on? 56 57 00:03:47,390 --> 00:03:48,950 Let's have a think about this. 57 58 00:03:49,040 --> 00:03:55,820 Maybe it depends on the actors. Do movies starring Scarlett Johansson do particularly well? Or maybe it depends 58 59 00:03:55,820 --> 00:04:01,470 on the script? Or maybe what actually matters is the film's overall budget? 59 60 00:04:01,790 --> 00:04:07,610 Because after all, you have to pay good actors and you have to pay for special effects and marketing 60 61 00:04:07,610 --> 00:04:09,310 costs a lot of money too. 61 62 00:04:09,350 --> 00:04:14,480 So maybe the overall budget would be the best thing to look at. 62 63 00:04:14,510 --> 00:04:22,040 So if that's the case then we have a hypothesis - the movie budget could indeed be the thing to investigate. 63 64 00:04:23,350 --> 00:04:28,930 After all, there are many famous examples of big budget films that did exceptionally well at the box 64 65 00:04:28,930 --> 00:04:29,790 office. 65 66 00:04:29,950 --> 00:04:38,240 Movies like Avatar, Titanic and The Avengers cost an absolute fortune, but they were enormously successful. 66 67 00:04:38,380 --> 00:04:44,200 But can the success of these movies tell us something about the success of our movie? 67 68 00:04:44,200 --> 00:04:52,090 If we knew how much movies like Avatar or The Avengers spent and how much they made, can we predict how 68 69 00:04:52,090 --> 00:04:53,420 much we'll make 69 70 00:04:53,470 --> 00:04:59,260 based on how much we're spending? So let's go back to our original question and let's make it a little 70 71 00:04:59,260 --> 00:05:00,720 bit more specific. 71 72 00:05:00,760 --> 00:05:06,060 "Can we use movie budgets to predict movie revenue?". Now, 72 73 00:05:06,220 --> 00:05:08,380 that's already a much better question. 73 74 00:05:08,410 --> 00:05:09,280 Why? 74 75 00:05:09,280 --> 00:05:15,910 Well, it's testable, and by testable, I mean that we can actually check if there is a relationship between 75 76 00:05:15,910 --> 00:05:18,760 movie budgets and movie revenue. 76 77 00:05:18,760 --> 00:05:23,630 This is something that we can measure using data and linear regression. 77 78 00:05:23,890 --> 00:05:29,590 And as data scientists we can actually look at this question and identify exactly what it is that we're 78 79 00:05:29,590 --> 00:05:36,930 trying to predict, namely the movie revenue. A data scientist would call this the dependent variable and 79 80 00:05:36,970 --> 00:05:40,000 in machine learning this would be called the target. 80 81 00:05:40,000 --> 00:05:46,930 Also, we can identify what it is that we're using to make the prediction, namely the movie budgets. A data 81 82 00:05:46,930 --> 00:05:53,410 scientist would call the budgets the independent variable and in machine learning this would be called 82 83 00:05:53,500 --> 00:05:54,620 a feature. 83 84 00:05:54,670 --> 00:05:58,600 So now that we've formulated our question we can move on to Step Two. 84 85 00:05:58,750 --> 00:06:05,320 We can go out there and gather some data. And we'll do just that in the next lesson. 85 86 00:06:05,350 --> 00:06:06,120 I'll see you there.