0
1
00:00:00,320 --> 00:00:06,090
Alright! So in the last lesson we talked about file paths. In this lesson we're gonna put those to use.
1

2
00:00:06,580 --> 00:00:07,530
First stop,
2

3
00:00:07,590 --> 00:00:11,640
let's put some headings into our notebook. So this one here,
3

4
00:00:11,640 --> 00:00:22,100
I'm going to change to Markdown and we're gonna keep all our notebook imports in the cell below. The next
4

5
00:00:22,100 --> 00:00:30,920
section heading I'm going to add is gonna be called "Constants". This is where we're going to add all our
5

6
00:00:30,920 --> 00:00:32,320
file paths.
6

7
00:00:32,480 --> 00:00:38,930
We're gonna have a single cell where we're gonna put all the pieces of information that don't change.
7

8
00:00:38,930 --> 00:00:44,240
Now I want to work more in the middle of the screen for you guys, but one of the handy things about Jupyter
8

9
00:00:44,240 --> 00:00:50,840
notebook is that it's got all these handy keyboard shortcuts and under the Help menu, the keyboard
9

10
00:00:50,840 --> 00:00:53,790
shortcuts for Jupyter are outlined right here.
10

11
00:00:54,110 --> 00:01:01,340
So I can see that, scrolling down, my keyboard shortcut for inserting a new cell is just pressing the
11

12
00:01:01,340 --> 00:01:03,660
letter B when I'm in command mode.
12

13
00:01:03,950 --> 00:01:04,890
Yeah, there's two modes.
13

14
00:01:04,890 --> 00:01:08,540
There's the Edit mode and there's the Command mode.
14

15
00:01:08,540 --> 00:01:12,810
You can tell that you're in the Edit mode when the cell is green because your cursor is blinking there.
15

16
00:01:13,220 --> 00:01:17,780
And you can tell that you're in the Command mode when the cell is blue.
16

17
00:01:17,780 --> 00:01:25,570
So if I press B on my keyboard right now when I'm in command mode, then I get new cells appearing like
17

18
00:01:25,570 --> 00:01:26,330
so.
18

19
00:01:26,780 --> 00:01:36,770
Now what I'll do is I'm going to create a constant and call it "EXAMPLE_FILE" and this constant
19

20
00:01:36,950 --> 00:01:44,870
will have my file path to my "practice_email.txt" file and this is gonna be my relative
20

21
00:01:44,900 --> 00:01:48,670
path. A good naming convention by the way for constants,
21

22
00:01:48,830 --> 00:01:53,930
in other words for values that you don't want to update is to write them like so with capital letters
22

23
00:01:54,260 --> 00:02:00,530
separated by an underscore. The next thing I'm going to do is I'm going to copy the file path to this
23

24
00:02:00,530 --> 00:02:01,500
practice email.
24

25
00:02:01,640 --> 00:02:12,730
So I'm going to go to "Get Info" and I'm going to copy this file path here, close this, go here and then
25

26
00:02:12,880 --> 00:02:20,410
in my single quotes, I'm going to paste it in here. Because I'm just interested in the relative file path
26

27
00:02:20,680 --> 00:02:25,600
not the absolute file path, I'm going to delete this bit here.
27

28
00:02:25,660 --> 00:02:28,560
So it's just going to read "SpamData/
28

29
00:02:28,820 --> 00:02:31,450
01_Processing",
29

30
00:02:31,450 --> 00:02:36,740
and then what I'm going to do at the end is I'm going to tack on the file name.
30

31
00:02:36,850 --> 00:02:40,000
So "practice_
31

32
00:02:40,210 --> 00:02:41,970
email.txt".
32

33
00:02:42,430 --> 00:02:49,330
I'm going to include the extension as well. So I'll have my relative path from my MLProjects folder
33

34
00:02:49,600 --> 00:02:58,870
where this notebook lives, going into the 01_Processing folder and then I'll have my file name at the
34

35
00:02:58,870 --> 00:02:59,930
end.
35
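
For anyone following along, the constant described above might look like this in a notebook cell. This is only a minimal sketch: the two checks at the end (`os.path.isabs` and `PurePosixPath`) are additions for illustration and are not part of the lesson.

```python
import os
from pathlib import PurePosixPath

# Relative path, starting from the folder where the notebook lives.
# Capital letters separated by underscores mark the value as a constant.
EXAMPLE_FILE = 'SpamData/01_Processing/practice_email.txt'

# Illustrative checks (not part of the lesson): the path is relative,
# and its last component is the file name including the extension.
print(os.path.isabs(EXAMPLE_FILE))
print(PurePosixPath(EXAMPLE_FILE).name)
```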

36
00:02:59,950 --> 00:03:04,820
Now if you're working alongside with me, make sure you don't have any typos here.
36

37
00:03:04,840 --> 00:03:08,980
Make sure you are using forward slashes instead of backslashes.
37

38
00:03:08,980 --> 00:03:16,480
And if your text matches mine and your "practice_email.txt"  file is in exactly the same
38

39
00:03:16,480 --> 00:03:22,180
location as my file then we're not going to have any issues.
39

40
00:03:22,180 --> 00:03:28,900
Now let's practice reading a single file. So I'll add a new section heading here,
40

41
00:03:29,110 --> 00:03:39,020
call it "Reading Files" and in here we're going to start talking to our operating system.
41

42
00:03:39,640 --> 00:03:48,700
The kind of object, the kind of thing that we need to read a file from the disk is called a stream or
42

43
00:03:48,730 --> 00:03:52,210
a file object. In Python,
43

44
00:03:52,360 --> 00:03:59,620
you can open a particular file with the built-in "open()" function and this open function of course needs
44

45
00:03:59,620 --> 00:04:03,480
to know which file it's meant to open. In our case,
45

46
00:04:03,550 --> 00:04:06,710
we're just going to use the constant that we defined above.
46

47
00:04:06,880 --> 00:04:16,420
I'm going to type "EX" and hit Tab on my keyboard to insert the rest of this code. Because we've hit Shift+
47

48
00:04:16,450 --> 00:04:23,170
Enter on this cell, Jupyter notebook actually recognizes this name here and it will autocomplete the
48

49
00:04:23,170 --> 00:04:24,300
name for us
49

50
00:04:24,340 --> 00:04:29,990
which really speeds up our typing. Now that we've called the open function,
50

51
00:04:30,160 --> 00:04:34,060
it will return to us a file object or a stream.
51

52
00:04:34,060 --> 00:04:41,740
Now I want to store this object in a variable, so I'm going to say "stream = open(
52

53
00:04:41,740 --> 00:04:48,820
EXAMPLE_FILE)". Once I've got my stream I can read the individual lines in this file. So I'm going to say "stream.
53

54
00:04:49,030 --> 00:04:57,430
read()". This method reads the whole file in one go and returns its contents, and I'm going to
54

55
00:04:57,430 --> 00:05:04,540
store the output of this method in a variable called "message".
55

56
00:05:04,540 --> 00:05:10,660
When I'm done reading my file, I'm going to tell Python to stop looking at this file, because we've reached
56

57
00:05:10,660 --> 00:05:13,810
the end and we're not planning to do anything further with it.
57

58
00:05:13,850 --> 00:05:18,110
So I'm going to say "stream.close()".
58

59
00:05:18,460 --> 00:05:24,760
All right, so we've opened a file, we've read the contents, we've stored those contents in a variable and
59

60
00:05:24,760 --> 00:05:28,420
then we've closed our stream because we're done reading the file.
60
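
The open, read, close sequence described above can be sketched as follows. One assumption to flag: so that the sketch runs on any machine, it first writes a tiny stand-in file to a temporary folder. In the lesson, `EXAMPLE_FILE` points at 'SpamData/01_Processing/practice_email.txt' instead.

```python
import os
import tempfile

# Assumption: a tiny stand-in file so the sketch runs anywhere. In the
# lesson, EXAMPLE_FILE is 'SpamData/01_Processing/practice_email.txt'.
EXAMPLE_FILE = os.path.join(tempfile.mkdtemp(), 'practice_email.txt')
with open(EXAMPLE_FILE, 'w') as f:
    f.write('Subject: Hello\n\nThis is the body.\n')

stream = open(EXAMPLE_FILE)   # returns a file object (a stream)
message = stream.read()       # the whole file as one string
stream.close()                # release the file once we're done

print(message)
```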

61
00:05:28,420 --> 00:05:37,210
Let's print out the contents of this file so "print(message)", Shift+Enter and here's what we have.
61

62
00:05:37,730 --> 00:05:40,400
This is the structure of an email.
62

63
00:05:40,520 --> 00:05:48,920
The first bit is called the email header, so it has all this information about who sent the email, who
63

64
00:05:48,920 --> 00:05:54,110
it was to, who was cc'd, what was the subject and so on.
64

65
00:05:54,470 --> 00:05:57,850
And then after the header there's a blank line
65

66
00:05:57,890 --> 00:06:02,060
and what follows is the email body.
66
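
The header, blank line, body layout described above suggests one simple way to separate the two parts: split the text once at the first blank line. This is a rough sketch on a made-up sample message, not necessarily the approach the course takes later.

```python
# Hypothetical sample text; the real practice email is much longer.
message = (
    'From: sender@example.com\n'
    'To: recipient@example.com\n'
    'Subject: Hello\n'
    '\n'
    'Dear Sir or Madam,\n'
    '\n'
    'This is the body.\n'
)

# Split once, at the first blank line: header before, body after.
header, body = message.split('\n\n', 1)

print(header)
print('---')
print(body)
```

Passing `1` as the second argument keeps any later blank lines inside the body, where they belong.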

67
00:06:02,190 --> 00:06:10,470
Now this text here is an email body that I've copied from a book called "The Timewaster Letters" by
67

68
00:06:10,470 --> 00:06:11,550
Robin Cooper.
68

69
00:06:11,940 --> 00:06:20,010
And if you're ever looking for a humorous read, then check out this book. Coming back up to our Python code,
69

70
00:06:20,430 --> 00:06:28,650
let me press Shift+Tab on my open function and take a closer look at the quick documentation. We can
70

71
00:06:28,650 --> 00:06:33,060
see here that in the parameters there's a couple of inputs.
71

72
00:06:33,060 --> 00:06:36,080
The first one of course is the file.
72

73
00:06:36,180 --> 00:06:39,970
The second input has a default value of 'r'.
73

74
00:06:39,990 --> 00:06:45,590
So this means that by default we are only reading the file, not writing one.
74
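
To see what the default 'r' mode means in practice, here is a small sketch. The throwaway file and the names `path` and `default_mode` are assumptions made for the example; the point is that a stream opened without a mode can be read but refuses to write.

```python
import io
import os
import tempfile

# Assumption: a throwaway file stands in for the practice email.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as f:
    f.write('hello\n')

stream = open(path)           # no mode given, so the default applies
default_mode = stream.mode    # the mode the file was opened with
print(default_mode)

try:
    stream.write('more text')
except io.UnsupportedOperation as error:
    # A read-only stream refuses to write.
    print('Cannot write:', error)
finally:
    stream.close()
```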

75
00:06:45,750 --> 00:06:51,390
The next parameter that I want to talk about is this thing called "encoding", which by default is set to
75

76
00:06:51,390 --> 00:06:51,840
None.
76

77
00:06:53,450 --> 00:07:00,890
An Encoding is how the computer handles letters and text. After all,
77

78
00:07:00,890 --> 00:07:05,770
every character needs to be translated into ones and zeros at the end of the day.
78

79
00:07:06,050 --> 00:07:14,060
In the days of early computers, this was fairly simple because computers really only supported 128 American
79

80
00:07:14,180 --> 00:07:16,140
English characters.
80

81
00:07:16,160 --> 00:07:18,540
This was called ASCII.
81

82
00:07:18,800 --> 00:07:24,140
If you were French for example and had an accent above a letter then you were out of luck.
82

83
00:07:24,200 --> 00:07:28,190
No accents for you Frédéric Chopin.
83

84
00:07:28,220 --> 00:07:36,430
If you are Russian or Chinese or Thai or Indian or German or, well, from anywhere really, then you are
84

85
00:07:36,430 --> 00:07:37,790
also screwed.
85

86
00:07:37,790 --> 00:07:41,870
ASCII doesn't have a character for the letters in your alphabet.
86
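
The limitation is easy to reproduce in Python. The sketch below tries to encode the name from the example as ASCII, then compares two encodings that do handle the accented letter. The variable names are made up for illustration.

```python
name = 'Frédéric Chopin'

try:
    name.encode('ascii')
except UnicodeEncodeError as error:
    # ASCII has no code for 'é', so encoding fails.
    print('ASCII cannot encode this name:', error)

# Latin-1 stores 'é' in a single byte; UTF-8 uses two bytes for it.
latin1_bytes = name.encode('latin-1')
utf8_bytes = name.encode('utf-8')
print(len(latin1_bytes), len(utf8_bytes))
```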

87
00:07:42,990 --> 00:07:49,530
On the upside people more creative than I have used ASCII characters to create some really neat art
87

88
00:07:49,650 --> 00:07:53,040
to make the Internet a much more interesting place.
88

89
00:07:53,430 --> 00:07:55,590
Back in our Python code.
89

90
00:07:55,590 --> 00:08:02,910
If we leave the encoding blank as we have done here, we will use the default system encoding. What's the
90

91
00:08:02,910 --> 00:08:04,940
default encoding on our system?
91

92
00:08:04,950 --> 00:08:13,800
Well, let's make Jupyter actually tell us explicitly. If we import the sys module and then write "sys.
92

93
00:08:14,310 --> 00:08:23,850
getfilesystemencoding()" and hit Shift+Enter,
93

94
00:08:23,850 --> 00:08:29,220
we can see what the default file system encoding is on our own machines.
94

95
00:08:29,250 --> 00:08:38,940
So in my case it's utf-8. UTF-8 is a Unicode encoding, and it's the Python 3 default for source code as well.
95
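
Here is the check from the lesson as a runnable cell, with one addition that is not in the lesson. As far as I'm aware, `sys.getfilesystemencoding()` reports the encoding used for file names, while the encoding that `open()` falls back on for file contents is the one reported by `locale.getpreferredencoding(False)`. On many machines both say UTF-8, but they can differ.

```python
import locale
import sys

# Encoding Python uses to translate file *names* to and from bytes.
print(sys.getfilesystemencoding())

# Encoding that open() falls back on for file *contents* in text mode
# when no encoding argument is given (per the Python documentation).
print(locale.getpreferredencoding(False))
```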

96
00:08:39,570 --> 00:08:44,910
If you have a different default encoding set on your machine, I'd actually be quite curious to know, so
96

97
00:08:44,970 --> 00:08:48,110
please do share it in the comments section for this lesson.
97

98
00:08:49,400 --> 00:08:57,470
Now, since your default and my default might be different and I also know that the spam messages in
98

99
00:08:57,470 --> 00:09:04,670
our dataset are written in English, you and I should specify the same encoding. That way we will both
99

100
00:09:04,670 --> 00:09:13,880
get the same results and for that purpose, we will use an encoding called "Latin-1". Coming back
100

101
00:09:13,880 --> 00:09:20,420
up here to our open function, we add an argument for our encoding.
101

102
00:09:20,420 --> 00:09:30,720
I can type "en", Tab on my keyboard then Enter to insert 'latin-1' and if I hit Shift+
102

103
00:09:30,750 --> 00:09:39,930
Enter now on this cell, we can see that we read the exact same file and we can still retrieve our message
103

104
00:09:39,980 --> 00:09:41,150
body.
104
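
A short sketch of why naming the encoding matters. The stand-in file below is an assumption for the example: it contains a single Latin-1 byte for 'é'. Opened with encoding='latin-1' it reads back fine on any machine, while the same bytes are not valid UTF-8.

```python
import os
import tempfile

# Assumption: a stand-in file containing the Latin-1 byte for 'é' (0xE9).
path = os.path.join(tempfile.mkdtemp(), 'latin1_example.txt')
with open(path, 'wb') as f:
    f.write(b'Caf\xe9 offer\n')

# Passing the encoding explicitly gives the same result on every machine.
stream = open(path, encoding='latin-1')
message = stream.read()
stream.close()
print(message)

# The same bytes are not valid UTF-8, so decoding them that way fails.
try:
    stream = open(path, encoding='utf-8')
    stream.read()
except UnicodeDecodeError as error:
    print('UTF-8 could not decode the file:', error)
finally:
    stream.close()
```

Latin-1 assigns a character to every possible byte value, so reading with it never raises a decoding error, which makes it a forgiving choice for messy email data.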

105
00:09:41,220 --> 00:09:47,420
Now one thing that you might be quite curious about is what is this message variable here?
105

106
00:09:47,460 --> 00:09:53,230
In other words, what does this read method actually return from the stream?
106

107
00:09:53,310 --> 00:09:57,950
We can have a look at this by printing out the type of the message variable.
107

108
00:09:57,960 --> 00:10:08,760
So if I wrap my type function into a print statement, we can do just that. So "print(type(
108

109
00:10:08,970 --> 00:10:16,410
message))", and then closing the parentheses and hitting Shift+Enter, we can see that we are dealing with
109

110
00:10:16,560 --> 00:10:18,210
a string.
110
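
The type check can be tried without any file on disk. In this sketch an in-memory text stream (`io.StringIO`, an assumption made purely for the example) stands in for the opened file; its `read()` hands back a plain string just like a file object does in text mode.

```python
import io

# Assumption: an in-memory text stream stands in for the opened file.
stream = io.StringIO('Subject: Hello\n\nThis is the body.\n')
message = stream.read()
stream.close()

print(type(message))
# Because it is a string, the usual string tools apply.
print(len(message))
print(message.upper())
```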

111
00:10:18,390 --> 00:10:26,010
In other words, we open the file, we read the contents of the file. That leaves us with a string or a piece
111

112
00:10:26,010 --> 00:10:34,470
of text, then we close the file and printing the file out like this shows it to us below the cell.
112

113
00:10:34,470 --> 00:10:35,580
Brilliant.
113

114
00:10:35,670 --> 00:10:41,490
Now even though this is an example email that I've included as part of the lesson resources,
114

115
00:10:41,760 --> 00:10:48,870
this example email still shows us the general format for all the emails in the dataset and they all
115

116
00:10:48,870 --> 00:10:50,830
look very, very similar, right?
116

117
00:10:50,850 --> 00:10:56,640
They have a header at the very top which includes all sorts of information, namely where the email was
117

118
00:10:56,640 --> 00:11:00,300
sent from, who it was sent to, the subject of the email,
118

119
00:11:00,300 --> 00:11:07,950
the timestamps when the email was sent and even the timestamps of the email transfer agents and the
119

120
00:11:07,950 --> 00:11:10,340
routing information of the email.
120

121
00:11:11,130 --> 00:11:16,650
Now typically as a user if you fire up Gmail or Hotmail or Outlook or what have you,
121

122
00:11:16,770 --> 00:11:24,360
all of this is hidden but it's still there even though it's usually rendered invisible.
122

123
00:11:24,360 --> 00:11:32,460
Now for our purposes of training the spam classifier, what we will do is we will look exclusively at
123

124
00:11:32,460 --> 00:11:35,550
the body of our email files.
124

125
00:11:35,550 --> 00:11:43,830
We're actually going to ignore the header and just focus our attention on extracting the email body
125

126
00:11:44,580 --> 00:11:48,990
and I'll show you how to do just that in the next lessons.
126

127
00:11:49,020 --> 00:11:50,520
Now before I move on,
127

128
00:11:50,520 --> 00:11:56,400
I just wanna leave you with a nice tidbit. Encodings are actually a pretty interesting niche topic
128

129
00:11:56,760 --> 00:12:05,570
on their own and the XKCD webcomic has actually done a lovely strip on Unicode and, you know what,
129

130
00:12:05,640 --> 00:12:12,360
I've even seen the ASCII table get its own scene and be a major plot point in the movie The Martian
130

131
00:12:12,780 --> 00:12:17,320
which was, as far as I remember, all about saving Matt Damon.
131

132
00:12:17,640 --> 00:12:21,150
Anyhow check it out and I'll see you in the next lesson.