0 1 00:00:00,660 --> 00:00:03,810 Welcome to this brand new module. 1 2 00:00:03,810 --> 00:00:10,020 Let's start with outlining the premise and talking about the problem that we're going to be trying to 2 3 00:00:10,020 --> 00:00:17,720 solve in the upcoming lessons. Imagine that you're looking for a job in the field of data science or 3 4 00:00:17,720 --> 00:00:19,430 machine learning. 4 5 00:00:19,430 --> 00:00:22,400 Every company out there is swimming in data. 5 6 00:00:22,580 --> 00:00:26,980 So there is a lot of demand for data scientists and machine learning experts. 6 7 00:00:27,110 --> 00:00:33,260 And that's good news for you because, guess what, you've just found a position advertised in the career 7 8 00:00:33,260 --> 00:00:35,960 section of your favorite company. 8 9 00:00:36,380 --> 00:00:43,880 So naturally you go ahead and you polish your CV and you carefully research the company and you write 9 10 00:00:43,940 --> 00:00:50,870 a cover letter and you send off your application and you start practicing your interview questions with 10 11 00:00:50,870 --> 00:00:54,920 your most patient friend or career advisor. 11 12 00:00:54,920 --> 00:01:01,070 Now, meanwhile at the offices of your favorite company that you just applied to, the job applications 12 13 00:01:01,070 --> 00:01:04,200 from the candidates just start flooding in. 13 14 00:01:04,290 --> 00:01:10,820 There are many, many applications and it's not even easy from the company's perspective to select a good 14 15 00:01:10,820 --> 00:01:12,290 applicant. 15 16 00:01:12,290 --> 00:01:17,180 So you know, even though there are a lot of job opportunities in the field of data science and machine 16 17 00:01:17,180 --> 00:01:20,530 learning these days, it's still super competitive. 
17 18 00:01:21,600 --> 00:01:28,740 In fact, the classic job application process of selecting candidates based on their CVs, their cover 18 19 00:01:28,780 --> 00:01:36,870 letters, maybe numerical tests and then how people do in interviews, just isn't really working that well. 19 20 00:01:37,740 --> 00:01:45,240 So what more and more companies are doing is giving their applicants a project to work on. 20 21 00:01:45,510 --> 00:01:53,700 For example, my friend recently applied for a job at Uber and prior to the interview they basically wrote 21 22 00:01:53,700 --> 00:01:58,950 him an email that said "Well, we need you to predict demand for our taxis. 22 23 00:01:58,950 --> 00:02:02,970 Here's some data. Send us your solution to this problem by next week." 23 24 00:02:02,970 --> 00:02:03,660 Bam. 24 25 00:02:03,660 --> 00:02:04,780 And that was it. 25 26 00:02:04,830 --> 00:02:10,110 My friend spent the next week toiling over this problem to produce a solution for these guys and move 26 27 00:02:10,110 --> 00:02:13,700 on to the next step in the application process. 27 28 00:02:13,770 --> 00:02:20,090 So, let's imagine that this is exactly the kind of situation that you find yourself in right now. 28 29 00:02:20,190 --> 00:02:25,860 You receive an email from the company that you've just applied to in response to your job application 29 30 00:02:26,370 --> 00:02:31,350 and they liked your CV and they're giving you this opportunity to show what you're made of during a 30 31 00:02:31,350 --> 00:02:32,660 project. 31 32 00:02:32,670 --> 00:02:37,280 Here's the gist of the email: "Our team receives too much spam. 32 33 00:02:37,410 --> 00:02:42,920 We only want legitimate emails in our inbox and we want all spam to be filtered out." 33 34 00:02:42,930 --> 00:02:43,980 "Here's some data." 34 35 00:02:43,980 --> 00:02:46,950 "Send us your solution by next week." 35 36 00:02:46,950 --> 00:02:50,680 Tick tock, the clock is ticking. 
36 37 00:02:50,730 --> 00:02:52,500 So what do we do next? 37 38 00:02:52,500 --> 00:02:57,800 Well, the first step is always formulating the question. 38 39 00:02:57,960 --> 00:03:07,370 And what this usually means is translating a business problem into a machine learning problem. 39 40 00:03:07,380 --> 00:03:08,430 Now, what do I mean by that? 40 41 00:03:08,610 --> 00:03:15,150 Well, we have a clear business objective in this case from the company: we need to filter out spam emails. 41 42 00:03:16,380 --> 00:03:20,370 But what does this mean from a machine learning perspective? 42 43 00:03:20,370 --> 00:03:26,460 Well, to filter out the spam emails, we first have to know which emails are spam and which emails are 43 44 00:03:26,460 --> 00:03:27,720 legitimate. 44 45 00:03:27,780 --> 00:03:32,730 In other words, for each incoming email we have to assign a category. 45 46 00:03:32,730 --> 00:03:39,810 If the email is legitimate then we have to classify it as Not Spam and the email is then allowed to 46 47 00:03:39,810 --> 00:03:41,330 go on to the inbox. 47 48 00:03:42,270 --> 00:03:50,730 However, if the email has the characteristics of a spam email then our algorithm must detect this and 48 49 00:03:50,820 --> 00:03:55,970 label this email as spam and classify it as such. 49 50 00:03:56,010 --> 00:04:04,140 Thus our objective is to learn what characterizes a spam email so that we can start to classify all 50 51 00:04:04,140 --> 00:04:05,420 the emails. 51 52 00:04:05,490 --> 00:04:10,320 So if you think about it, this is actually quite a different kind of problem from when we were estimating 52 53 00:04:10,320 --> 00:04:14,060 the real estate prices in the previous module. 53 54 00:04:14,190 --> 00:04:20,180 Previously, we had to estimate a quantity - given a set of characteristics for a property, 54 55 00:04:20,340 --> 00:04:23,160 we were estimating the house price. 55 56 00:04:23,310 --> 00:04:27,740 We had something that was called a regression style problem. 
56 57 00:04:28,050 --> 00:04:29,920 This time we have a different problem. 57 58 00:04:29,970 --> 00:04:31,980 We have to categorize things. 58 59 00:04:31,980 --> 00:04:36,630 We have a classification style problem. With regression 59 60 00:04:36,870 --> 00:04:45,460 you're fitting to the data, but with classification you're separating the data. Regression and classification 60 61 00:04:45,460 --> 00:04:51,460 style problems are in fact two of the very, very common kinds of problems that you will find yourself 61 62 00:04:51,460 --> 00:04:52,930 solving. 62 63 00:04:52,930 --> 00:05:00,610 Typical examples of regression problems have to do with estimating a quantity, like, say, next year's sales 63 64 00:05:00,700 --> 00:05:08,200 or how much time a particular task takes, and common examples of classification style problems are things 64 65 00:05:08,200 --> 00:05:16,740 like segmenting your business's customers, detecting cancer, or in our case filtering out spam emails. 65 66 00:05:16,750 --> 00:05:23,770 But you know, the thing is there is an additional niggle with this problem here that we were given. Previously, 66 67 00:05:24,070 --> 00:05:29,710 we were working with numbers, we were doing calculations with these numbers and feeding these numbers 67 68 00:05:30,040 --> 00:05:31,140 into our algorithm. 68 69 00:05:32,170 --> 00:05:35,620 But now we're gonna be dealing with emails instead. 69 70 00:05:35,770 --> 00:05:38,790 We're gonna be working with text data. 70 71 00:05:39,040 --> 00:05:43,930 So how do we train our algorithm when we're given text? 71 72 00:05:44,020 --> 00:05:48,160 Can we just pipe our emails directly into the algorithm and get a sensible answer? 72 73 00:05:49,450 --> 00:05:51,140 Probably not. 73 74 00:05:51,310 --> 00:05:57,030 Text data just isn't suited to run calculations on in its raw form. 
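To make the point about raw text concrete, here is a minimal Python sketch of one common way to turn text into numbers: counting how often each word appears (often called a bag of words). This is an illustration under assumptions, not necessarily the approach the upcoming lessons take. The function names tokenize, build_vocabulary and vectorize are hypothetical, and the three example emails are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct word across the documents to a column index."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Turn one text into a list of word counts, one slot per vocabulary word."""
    counts = Counter(tokenize(text))
    # The vocabulary dict preserves insertion (alphabetical) order,
    # so position i in the result corresponds to the word with index i.
    return [counts.get(word, 0) for word in vocabulary]


# Made-up example emails, for illustration only.
emails = [
    "Win a free prize now",
    "Meeting notes for the team",
    "Free free prize",
]

vocabulary = build_vocabulary(emails)
vectors = [vectorize(email, vocabulary) for email in emails]

for email, vector in zip(emails, vectors):
    print(email, "->", vector)
```

Each email becomes a row of numbers of the same length, which is the kind of input an algorithm can do calculations on. Word order is thrown away in this sketch, which is a deliberate simplification.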
74 75 00:05:57,310 --> 00:06:05,530 So this means we have to find a way to translate our text data into a format that an algorithm can understand 75 76 00:06:05,590 --> 00:06:06,740 and work with. 76 77 00:06:06,850 --> 00:06:14,860 We have to process the emails before handing them off to our algorithm to do the calculations. So you 77 78 00:06:14,860 --> 00:06:19,000 can see there are already a lot of challenges ahead. 78 79 00:06:19,300 --> 00:06:22,180 But we've already taken the first step. 79 80 00:06:22,180 --> 00:06:29,980 We've looked at the business problem at hand and we've translated this problem into a machine learning 80 81 00:06:29,980 --> 00:06:31,200 problem. 81 82 00:06:31,270 --> 00:06:36,580 Here's how we can formulate our objective in a machine learning kind of way. 82 83 00:06:36,670 --> 00:06:43,090 We're gonna take the emails from the company and the first thing we're gonna do is pre-process that 83 84 00:06:43,090 --> 00:06:51,340 text data, then we're going to train a machine learning model that can classify an email as either spam 84 85 00:06:51,820 --> 00:06:53,420 or not spam. 85 86 00:06:53,500 --> 00:07:01,420 And finally, we're going to test and evaluate the performance of our trained model. And with our trained 86 87 00:07:01,540 --> 00:07:10,180 spam classifier at hand, we can filter out all incoming spam emails very, very quickly. And you never know, 87 88 00:07:10,180 --> 00:07:16,080 right, by sending our trained model back to the company along with an explanation of our methodology, 88 89 00:07:16,150 --> 00:07:22,650 we should be able to clinch that job offer or at least get invited to the next round for an interview. 89 90 00:07:22,650 --> 00:07:24,540 I'll see you in the next lesson. 90 91 00:07:24,550 --> 00:07:24,960 Take care.
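The three steps outlined above (pre-process the text, train a classifier, then test and evaluate it) can be sketched end to end in plain Python. The lesson does not say which algorithm will be used, so this sketch assumes a simple multinomial Naive Bayes classifier with add-one smoothing purely for illustration. The function names preprocess, train, classify and accuracy are hypothetical, and the handful of emails are made up rather than taken from any real dataset.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Step 1: lowercase the email and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_emails):
    """Step 2: count words per class (all a simple Naive Bayes model needs)."""
    word_counts = {"spam": Counter(), "not spam": Counter()}
    email_counts = Counter()
    for text, label in labelled_emails:
        email_counts[label] += 1
        word_counts[label].update(preprocess(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["not spam"])
    return word_counts, email_counts, vocabulary


def classify(text, model):
    """Return the label with the higher log-probability score."""
    word_counts, email_counts, vocabulary = model
    total_emails = sum(email_counts.values())
    scores = {}
    for label in word_counts:
        # Start from the log prior: how common this class was in training.
        score = math.log(email_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in preprocess(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids taking log(0).
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, labelled_emails):
    """Step 3: fraction of held-out emails that are classified correctly."""
    correct = sum(classify(text, model) == label for text, label in labelled_emails)
    return correct / len(labelled_emails)


# Made-up training and test emails, for illustration only.
training_emails = [
    ("Win a free prize now", "spam"),
    ("Cheap loans, claim your free money", "spam"),
    ("Meeting agenda for Monday", "not spam"),
    ("Please review the project report", "not spam"),
]
test_emails = [
    ("Claim your free prize", "spam"),
    ("Agenda for the project meeting", "not spam"),
]

model = train(training_emails)
print(classify("Claim your free prize", model))
print(accuracy(model, test_emails))
```

Keeping the test emails separate from the training emails is the important design choice here: the evaluation step only tells us something if the model has not already seen the emails it is being scored on. With only a few toy emails the resulting accuracy figure says nothing about real-world performance.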