0 1 00:00:00,320 --> 00:00:06,090 Alright! So in the last lesson we've talked about file paths. In this lesson we're gonna put those to use. 1 2 00:00:06,580 --> 00:00:07,530 First stop, 2 3 00:00:07,590 --> 00:00:11,640 let's put some headings into our notebook. So this one here, 3 4 00:00:11,640 --> 00:00:22,100 I'm going to change to Markup and we're gonna keep all our notebook imports in the cell below. The next 4 5 00:00:22,100 --> 00:00:30,920 section heading I'm going to add is gonna be called "Constants". This is where we're going to add all our 5 6 00:00:30,920 --> 00:00:32,320 file paths. 6 7 00:00:32,480 --> 00:00:38,930 We're gonna have a single cell where we're gonna put all the pieces of information that don't change. 7 8 00:00:38,930 --> 00:00:44,240 Now I want to work more in the middle of the screen for you guys, but one of the handy things about Jupyter 8 9 00:00:44,240 --> 00:00:50,840 notebook is that it's got all these handy keyboard shortcuts and under the Help menu, the keyboard 9 10 00:00:50,840 --> 00:00:53,790 shortcuts for Jupyter are outlined right here. 10 11 00:00:54,110 --> 00:01:01,340 So I can see that scrolling down, my keyboard shortcut for inserting a new cell it's just pressing the 11 12 00:01:01,340 --> 00:01:03,660 letter B when I'm in command mode. 12 13 00:01:03,950 --> 00:01:04,890 Yeah, there's two modes. 13 14 00:01:04,890 --> 00:01:08,540 There's the Edit mode and there's the Command mode. 14 15 00:01:08,540 --> 00:01:12,810 You can tell that you're in the Edit mode when the cell is green because your cursor is blinking there. 15 16 00:01:13,220 --> 00:01:17,780 And you can tell that you're in the Command mode when the cell is blue. 16 17 00:01:17,780 --> 00:01:25,570 So if I press B on my keyboard right now when I'm in command mode, then I get new cells appearing like 17 18 00:01:25,570 --> 00:01:26,330 so. 18 19 00:01:26,780 --> 00:01:36,770 Now what I'll do is I'm going to create a constant and call it "EXAMPLE_FILE" and this constant 19 20 00:01:36,950 --> 00:01:44,870 will have my file path to my "practice_emal.txt" file and this is gonna be my relative 20 21 00:01:44,900 --> 00:01:48,670 path. A good naming convention by the way for constants, 21 22 00:01:48,830 --> 00:01:53,930 in other words for values that you don't want to update is to write them like so with capital letters 22 23 00:01:54,260 --> 00:02:00,530 separated by an underscore. The next thing I'm going to do is I'm going to copy the file path to this 23 24 00:02:00,530 --> 00:02:01,500 practice email. 24 25 00:02:01,640 --> 00:02:12,730 So I'm going to go to "Get Info" and I'm going to copy this file path here, close this, go here and then 25 26 00:02:12,880 --> 00:02:20,410 in my single quotes, I'm going to paste it in here. Because I'm just interested in the relative file path 26 27 00:02:20,680 --> 00:02:25,600 not the absolute file path, I'm going to delete this bit here. 27 28 00:02:25,660 --> 00:02:28,560 So it's just going to read "SpamData/ 28 29 00:02:28,820 --> 00:02:31,450 01_Processing", 29 30 00:02:31,450 --> 00:02:36,740 and then what I'm going to do at the end is I'm going to tack on the file name. 30 31 00:02:36,850 --> 00:02:40,000 So "practice_ 31 32 00:02:40,210 --> 00:02:41,970 email.txt". 32 33 00:02:42,430 --> 00:02:49,330 I'm going to include the extension as well. So I'll have my relative path from my MLProjects folder 33 34 00:02:49,600 --> 00:02:58,870 where this notebook lives, going into the 01_Processing folder and then I'll have my file name at the 34 35 00:02:58,870 --> 00:02:59,930 end. 35 36 00:02:59,950 --> 00:03:04,820 Now if you're working alongside with me, make sure you don't have any typos here. 36 37 00:03:04,840 --> 00:03:08,980 Make sure you are using the forward slashes instead of the back slashes. 37 38 00:03:08,980 --> 00:03:16,480 And if your text matches mine and your "practice_email.txt" file is in exactly the same 38 39 00:03:16,480 --> 00:03:22,180 location as my file then we're not going to have any issues. 39 40 00:03:22,180 --> 00:03:28,900 Now let's practice reading a single file. So I'll add a new section heading here, 40 41 00:03:29,110 --> 00:03:39,020 call it "Reading Files" and in here we're going to start talking to our operating system. 41 42 00:03:39,640 --> 00:03:48,700 The kind of object, the kind of thing that we need to read a file from the disk is called a stream or 42 43 00:03:48,730 --> 00:03:52,210 a file object. In Python, 43 44 00:03:52,360 --> 00:03:59,620 you can open a particular file with the built in "open()" function and this open function of course needs 44 45 00:03:59,620 --> 00:04:03,480 to know which file it's meant to open. In our case, 45 46 00:04:03,550 --> 00:04:06,710 we're just going to use the constant that we defined above. 46 47 00:04:06,880 --> 00:04:16,420 I'm going to type "EX" and hit Tab on my keyboard to insert the rest of this code. Because we've hit Shift+ 47 48 00:04:16,450 --> 00:04:23,170 Enter on this cell, Jupyter notebook actually recognizes this name here and it will autocomplete the 48 49 00:04:23,170 --> 00:04:24,300 name for us 49 50 00:04:24,340 --> 00:04:29,990 which really speeds up our typing. Now that we've called the open function, 50 51 00:04:30,160 --> 00:04:34,060 it will return to us a file object or a stream. 51 52 00:04:34,060 --> 00:04:41,740 Now I want to store this object in a variable, so I'm going to say "stream = open( 52 53 00:04:41,740 --> 00:04:48,820 EXAMPLE_FILE)". Once I've got my stream I can read the individual lines in this file. So I'm going to say "stream. 53 54 00:04:49,030 --> 00:04:57,430 read()". This method will go through the lines of the file one by one and I'm going to 54 55 00:04:57,430 --> 00:05:04,540 store the output of this method in a variable called "message". 55 56 00:05:04,540 --> 00:05:10,660 When I'm done reading my file, I'm going to tell Python to stop looking at this file, because we've reached 56 57 00:05:10,660 --> 00:05:13,810 the end and we're not planning to do anything further with it. 57 58 00:05:13,850 --> 00:05:18,110 So I'm going to say "stream.close()". 58 59 00:05:18,460 --> 00:05:24,760 All right, so we've opened a file, we've read the contents, we've stored those contents in a variable and 59 60 00:05:24,760 --> 00:05:28,420 then we've closed our stream because we're done reading the file. 60 61 00:05:28,420 --> 00:05:37,210 Let's print out the contents of this file so "print(message)", Shift+Enter and here's what we have. 61 62 00:05:37,730 --> 00:05:40,400 This is the structure of an email. 62 63 00:05:40,520 --> 00:05:48,920 The first bit is called the email header, so it has all this information about who sent the email, who 63 64 00:05:48,920 --> 00:05:54,110 it was to, who was cc'd, what was the subject and so on. 64 65 00:05:54,470 --> 00:05:57,850 And then after the header there's a blank line 65 66 00:05:57,890 --> 00:06:02,060 what follows is the email body. 66 67 00:06:02,190 --> 00:06:10,470 Now this text here is an email body that I've copied from a book called "The Timewaster Letters" by 67 68 00:06:10,470 --> 00:06:11,550 Robin Cooper. 68 69 00:06:11,940 --> 00:06:20,010 And if you're ever looking for a humorous read, then check out this book. Coming back up to our Python code, 69 70 00:06:20,430 --> 00:06:28,650 let me press Shift+Tab on my open function and take a closer look at the quick documentation. We can 70 71 00:06:28,650 --> 00:06:33,060 see here that in the parameters there's a couple of inputs. 71 72 00:06:33,060 --> 00:06:36,080 The first one of course is the file. 72 73 00:06:36,180 --> 00:06:39,970 The second input has a default value of 'r'. 73 74 00:06:39,990 --> 00:06:45,590 So this means that by default we are only reading the file, not writing one. 74 75 00:06:45,750 --> 00:06:51,390 The next parameter that I want to talk about is this thing called "encoding", which by default is set to 75 76 00:06:51,390 --> 00:06:51,840 None. 76 77 00:06:53,450 --> 00:07:00,890 An Encoding is how the computer handles letters and text. After all, 77 78 00:07:00,890 --> 00:07:05,770 every character needs to be translated into ones and zeros at the end of the day. 78 79 00:07:06,050 --> 00:07:14,060 In the days of early computers, this was fairly simple because computers really only supported 127 American 79 80 00:07:14,180 --> 00:07:16,140 English characters. 80 81 00:07:16,160 --> 00:07:18,540 This was called ASCII. 81 82 00:07:18,800 --> 00:07:24,140 If you were French for example and had an accent above a letter then you were out of luck. 82 83 00:07:24,200 --> 00:07:28,190 No accents for you Frédéric Chopin. 83 84 00:07:28,220 --> 00:07:36,430 If you are Russian or Chinese or Thai or Indian or German or, well, from anywhere really, then you are 84 85 00:07:36,430 --> 00:07:37,790 also screwed. 85 86 00:07:37,790 --> 00:07:41,870 ASCII doesn't have a character for the letters in your alphabet. 86 87 00:07:42,990 --> 00:07:49,530 On the upside people more creative than I have used ASCII characters to create some really neat art 87 88 00:07:49,650 --> 00:07:53,040 to make the Internet a much more interesting place. 88 89 00:07:53,430 --> 00:07:55,590 Back in our Python code. 89 90 00:07:55,590 --> 00:08:02,910 If we leave the encoding blank as we have done here, we will use the default system encoding. What's the 90 91 00:08:02,910 --> 00:08:04,940 default encoding on our system? 91 92 00:08:04,950 --> 00:08:13,800 Well let's make Jupyter actually tell us explicitly. If we import the system library and then write "sys. 92 93 00:08:14,310 --> 00:08:23,850 getfilesystemencoding()" and hit Shift+Enter, 93 94 00:08:23,850 --> 00:08:29,220 we can see what the default file system encoding is on our own machines. 94 95 00:08:29,250 --> 00:08:38,940 So in my case it's utf-8. UTF-8 is Unicode which is the Python 3 standard as well. 95 96 00:08:39,570 --> 00:08:44,910 If you have a different default encoding set on your machine, I'd actually be quite curious to know, so 96 97 00:08:44,970 --> 00:08:48,110 please do share the comments section for this lesson. 97 98 00:08:49,400 --> 00:08:57,470 Now, since your default and my default might be different and I also know that the spam messages in 98 99 00:08:57,470 --> 00:09:04,670 our dataset are written in English, you and I should specify the same encoding. That way we will both 99 100 00:09:04,670 --> 00:09:13,880 get the same results and for that purpose, we will use an encoding called "Latin-1". Coming back 100 101 00:09:13,880 --> 00:09:20,420 up here to our open function and we add an argument for our encoding. 101 102 00:09:20,420 --> 00:09:30,720 I can type "en", Tab on my keyboard then Enter to insert 'latin-1' and if I hit Shift+ 102 103 00:09:30,750 --> 00:09:39,930 Enter now on this cell, we can see that we read the exact same file and we can still retrieve our message 103 104 00:09:39,980 --> 00:09:41,150 body. 104 105 00:09:41,220 --> 00:09:47,420 Now one thing that you might be quite curious about is what is this message variable here? 105 106 00:09:47,460 --> 00:09:53,230 In other words, what does this read method actually return from the stream? 106 107 00:09:53,310 --> 00:09:57,950 We can have a look at this by printing out the type of the message variable. 107 108 00:09:57,960 --> 00:10:08,760 So if I wrap my type function into a print statement, we can do just that. So "print(type( 108 109 00:10:08,970 --> 00:10:16,410 message))", and then closing the parentheses and hitting Shift+Enter, we can see that we are dealing with 109 110 00:10:16,560 --> 00:10:18,210 a string. 110 111 00:10:18,390 --> 00:10:26,010 In other words, we open the file, we read the contents of the file. That leaves us with a string or a piece 111 112 00:10:26,010 --> 00:10:34,470 of text, then we close the file and printing the file out like this shows it to us below the cell. 112 113 00:10:34,470 --> 00:10:35,580 Brilliant. 113 114 00:10:35,670 --> 00:10:41,490 Now even though this is an example email that I've included as part of the lesson resources, 114 115 00:10:41,760 --> 00:10:48,870 this example email still shows us the general format for all the emails in the dataset and they all 115 116 00:10:48,870 --> 00:10:50,830 look very, very similar right. 116 117 00:10:50,850 --> 00:10:56,640 They have a header at the very top which includes all sorts of information, namely where the email was 117 118 00:10:56,640 --> 00:11:00,300 sent from, who it was sent to, the subject of the email, 118 119 00:11:00,300 --> 00:11:07,950 the timestamps when the email was sent and even the timestamps of the email transfer agents and the 119 120 00:11:07,950 --> 00:11:10,340 rooting information of the email. 120 121 00:11:11,130 --> 00:11:16,650 Now typically as a user if you fire up Gmail or Hotmail or Outlook or what have you, 121 122 00:11:16,770 --> 00:11:24,360 all of this is hidden but it's still there even though it's usually rendered invisible. 122 123 00:11:24,360 --> 00:11:32,460 Now for our purposes of training the spam classifier, what we will do is we will look exclusively at 123 124 00:11:32,460 --> 00:11:35,550 the body of our email files. 124 125 00:11:35,550 --> 00:11:43,830 We're actually going to ignore the header and just focus our attention on extracting the email body 125 126 00:11:44,580 --> 00:11:48,990 and I'll show you how to do just that in the next lessons. 126 127 00:11:49,020 --> 00:11:50,520 Now before I move on, 127 128 00:11:50,520 --> 00:11:56,400 I just wanna leave you with a nice tidbit. Encodings are actually a pretty interesting niche topic 128 129 00:11:56,760 --> 00:12:05,570 on their own and the XKCD webcomic has actually done a lovely strip on Unicode and, you know what, 129 130 00:12:05,640 --> 00:12:12,360 I've even seen the ASCII table get its own scene and be a major plot point in the movie The Martian 130 131 00:12:12,780 --> 00:12:17,320 which was, as far as I remember, all about saving Matt Damon. 131 132 00:12:17,640 --> 00:12:21,150 Anyhow check it out and I'll see you in the next lesson.