0
1
00:00:00,660 --> 00:00:03,810
Welcome to this brand new module.
1

2
00:00:03,810 --> 00:00:10,020
Let's start with outlining the premise and talking about the problem that we're going to be trying to
2

3
00:00:10,020 --> 00:00:17,720
solve in the upcoming lessons. Imagine that you're looking for a job in the field of data science or
3

4
00:00:17,720 --> 00:00:19,430
machine learning.
4

5
00:00:19,430 --> 00:00:22,400
Every company out there is swimming in data.
5

6
00:00:22,580 --> 00:00:26,980
So there is a lot of demand for data scientists and machine learning experts.
6

7
00:00:27,110 --> 00:00:33,260
And that's good news for you because, guess what, you've just found a position advertised in the career
7

8
00:00:33,260 --> 00:00:35,960
section of your favorite company.
8

9
00:00:36,380 --> 00:00:43,880
So naturally you go ahead and you polish your CV and you carefully research the company and you write
9

10
00:00:43,940 --> 00:00:50,870
a cover letter and you send off your application and you start practicing your interview questions with
10

11
00:00:50,870 --> 00:00:54,920
your most patient friend or career advisor.
11

12
00:00:54,920 --> 00:01:01,070
Now, meanwhile at the offices of your favorite company that you just applied for, the job applications
12

13
00:01:01,070 --> 00:01:04,200
from the candidates just start flooding in.
13

14
00:01:04,290 --> 00:01:10,820
There is many, many applications and it's not even easy from the company's perspective to select a good
14

15
00:01:10,820 --> 00:01:12,290
applicant.
15

16
00:01:12,290 --> 00:01:17,180
So you know, even though there's a lot of job opportunities in the field of data science and machine
16

17
00:01:17,180 --> 00:01:20,530
learning these days, it's still super competitive.
17

18
00:01:21,600 --> 00:01:28,740
In fact, the classic job application process of selecting candidates based on their CV's, their cover
18

19
00:01:28,780 --> 00:01:36,870
letters, maybe numerical tests and then how people do in interviews, just isn't really working that well.
19

20
00:01:37,740 --> 00:01:45,240
So what more and more companies are doing is giving their applicants a project to work on.
20

21
00:01:45,510 --> 00:01:53,700
For example, my friend recently applied for a job at Uber and prior to the interview they basically wrote
21

22
00:01:53,700 --> 00:01:58,950
him an email that said "Well, we need you to predict demand for our taxis.
22

23
00:01:58,950 --> 00:02:02,970
Here's some data. Send us your solution to this problem by next week."
23

24
00:02:02,970 --> 00:02:03,660
Bam.
24

25
00:02:03,660 --> 00:02:04,780
And that was it.
25

26
00:02:04,830 --> 00:02:10,110
My friend spent the next week toiling over this problem to produce a solution for these guys and move
26

27
00:02:10,110 --> 00:02:13,700
on to the next step in the application process.
27

28
00:02:13,770 --> 00:02:20,090
So, let's imagine that this is exactly the kind of situation that you find yourself in right now.
28

29
00:02:20,190 --> 00:02:25,860
You receive an email from the company that you've just applied for in response to your job application
29

30
00:02:26,370 --> 00:02:31,350
and they liked your CV and they're giving you this opportunity to show what you're made of during a
30

31
00:02:31,350 --> 00:02:32,660
project.
31

32
00:02:32,670 --> 00:02:37,280
Here's the gist of the email "Our team receives too much spam.
32

33
00:02:37,410 --> 00:02:42,920
We only want legitimate emails in our inbox and we want all spam to be filtered out."
33

34
00:02:42,930 --> 00:02:43,980
"Here's some data."
34

35
00:02:43,980 --> 00:02:46,950
"Send us your solution by next week."
35

36
00:02:46,950 --> 00:02:50,680
Tick tock, the clock is ticking.
36

37
00:02:50,730 --> 00:02:52,500
So what do we do next?
37

38
00:02:52,500 --> 00:02:57,800
Well the first step is always formulating the question.
38

39
00:02:57,960 --> 00:03:07,370
And what this usually means is translating a business problem into a machine learning problem.
39

40
00:03:07,380 --> 00:03:08,430
Now, what I mean by that?
40

41
00:03:08,610 --> 00:03:15,150
Well we have a clear business objective in this case from the company - we need to filter out spam emails.
41

42
00:03:16,380 --> 00:03:20,370
But what does this mean from a machine learning perspective?
42

43
00:03:20,370 --> 00:03:26,460
Well, to filter out the spam emails, we first have to know which emails are spam and which emails are
43

44
00:03:26,460 --> 00:03:27,720
legitimate.
44

45
00:03:27,780 --> 00:03:32,730
In other words, for each incoming email we have to assign a category.
45

46
00:03:32,730 --> 00:03:39,810
If the e-mail is legitimate then we have to classify it as Not Spam and the e-mail is then allowed to
46

47
00:03:39,810 --> 00:03:41,330
go on to the inbox.
47

48
00:03:42,270 --> 00:03:50,730
However, if the email had the characteristics of a spam email then our algorithm must detect this and
48

49
00:03:50,820 --> 00:03:55,970
label this email as spam and classify it as such.
49

50
00:03:56,010 --> 00:04:04,140
Thus our objective is to learn what characterizes a spam email so that we can start to classify all
50

51
00:04:04,140 --> 00:04:05,420
the emails.
51

52
00:04:05,490 --> 00:04:10,320
So if you think about it, this is actually quite a different kind of problem from when we were estimating
52

53
00:04:10,320 --> 00:04:14,060
the real estate prices in the previous module.
53

54
00:04:14,190 --> 00:04:20,180
Previously, we had to estimate a quantity - given a set of characteristics for a property,
54

55
00:04:20,340 --> 00:04:23,160
we were estimating the house price.
55

56
00:04:23,310 --> 00:04:27,740
We had something that was called a regression style problem.
56

57
00:04:28,050 --> 00:04:29,920
This time we have a different problem.
57

58
00:04:29,970 --> 00:04:31,980
We have to categorize things.
58

59
00:04:31,980 --> 00:04:36,630
We have a classification style problem. With regression
59

60
00:04:36,870 --> 00:04:45,460
you're fitting to the data, but with classification you're separating the data. Regression and classification
60

61
00:04:45,460 --> 00:04:51,460
style problems are in fact two of the very, very common kind of problems that you will find yourself
61

62
00:04:51,460 --> 00:04:52,930
solving.
62

63
00:04:52,930 --> 00:05:00,610
Typical examples of regression problems have to do with estimating quantity, like say next year sales
63

64
00:05:00,700 --> 00:05:08,200
or how much time a particular task takes and common examples of classification style problems are things
64

65
00:05:08,200 --> 00:05:16,740
like segmenting your business's customers, detecting cancer, or in our case filtering out spam emails.
65

66
00:05:16,750 --> 00:05:23,770
But you know, the thing is there is an additional niggle with this problem here that we were given. Previously,
66

67
00:05:24,070 --> 00:05:29,710
we were working with numbers, we were doing calculations with these numbers and feeding these numbers
67

68
00:05:30,040 --> 00:05:31,140
into our algorithm.
68

69
00:05:32,170 --> 00:05:35,620
But now we're gonna be dealing with emails instead.
69

70
00:05:35,770 --> 00:05:38,790
We're gonna be working with text data.
70

71
00:05:39,040 --> 00:05:43,930
So how do we train our algorithm when we're given text?
71

72
00:05:44,020 --> 00:05:48,160
Can we just pipe in our emails directly into the algorithm and get a sensible answer?
72

73
00:05:49,450 --> 00:05:51,140
Probably not.
73

74
00:05:51,310 --> 00:05:57,030
Text data just isn't suited to run calculations on in its raw form.
74

75
00:05:57,310 --> 00:06:05,530
So this means we have to find a way to translate our text data into a format that an algorithm can understand
75

76
00:06:05,590 --> 00:06:06,740
and work with.
76

77
00:06:06,850 --> 00:06:14,860
We have to process the emails before handing them off to our algorithm to do the calculations. So you
77

78
00:06:14,860 --> 00:06:19,000
can see there's already a lot of challenges ahead.
78

79
00:06:19,300 --> 00:06:22,180
But we've already taken the first step.
79

80
00:06:22,180 --> 00:06:29,980
We've looked at the business problem at hand and we've translated this problem into a machine learning
80

81
00:06:29,980 --> 00:06:31,200
problem.
81

82
00:06:31,270 --> 00:06:36,580
Here's how we can formulate our objective in a machine learning kind of way.
82

83
00:06:36,670 --> 00:06:43,090
We're gonna take the emails from the company and the first thing we're gonna do is pre-process that
83

84
00:06:43,090 --> 00:06:51,340
text data, then we're going to train a machine learning model that can classify an email as either spam
84

85
00:06:51,820 --> 00:06:53,420
or not spam.
85

86
00:06:53,500 --> 00:07:01,420
And finally, we're going to test and evaluate the performance of our trained model. And with our trained
86

87
00:07:01,540 --> 00:07:10,180
spam classifier at hand, we can filter out all incoming spam emails very, very quickly. And you never know,
87

88
00:07:10,180 --> 00:07:16,080
right, by sending our train model back to the company along with an explanation of our methodology,
88

89
00:07:16,150 --> 00:07:22,650
we should be able to clinch that job offer or at least get invited to the next round for an interview.
89

90
00:07:22,650 --> 00:07:24,540
I'll see you in the next lesson.
90

91
00:07:24,550 --> 00:07:24,960
Take care.