0 1 00:00:00,790 --> 00:00:07,720 Now that we've covered the theory behind the Naive Bayes Classifier in detail, it's time to get our hands 1 2 00:00:07,720 --> 00:00:11,430 dirty and write some Python code. In this lesson, 2 3 00:00:11,440 --> 00:00:17,080 we've got a big and important topic coming up, namely how do you get your code to talk to your computer's 3 4 00:00:17,170 --> 00:00:18,650 operating system? 4 5 00:00:18,670 --> 00:00:21,480 How do you read a file? 5 6 00:00:21,500 --> 00:00:25,850 Now a lot of this knowledge is going to be transferable to whatever programming language you're working 6 7 00:00:25,850 --> 00:00:27,900 in and whatever project you're working in. 7 8 00:00:28,040 --> 00:00:32,540 But with Python and machine learning, you're gonna be working with a lot of data and that data is gonna 8 9 00:00:32,540 --> 00:00:33,710 be stored somewhere. 9 10 00:00:34,250 --> 00:00:40,700 Most likely that's going to be stored locally on your computer or it's gonna be stored on a server. For 10 11 00:00:40,700 --> 00:00:45,890 many of our projects, our Python code is going to need to interact with our own computer's operating 11 12 00:00:45,890 --> 00:00:50,700 system and actually be able to read the files from the disk. 12 13 00:00:50,720 --> 00:00:55,580 So while we're on this topic about learning how to read files, we're also going to cover quite a few 13 14 00:00:55,580 --> 00:01:01,530 other topics as well, namely file paths, file locations, file extensions. 14 15 00:01:01,580 --> 00:01:06,650 We're also going to see how our email data is actually structured and we're also going to see how we 15 16 00:01:06,650 --> 00:01:11,120 can extract a message body from a raw email. 16 17 00:01:11,150 --> 00:01:12,810 But first things first. 17 18 00:01:12,950 --> 00:01:18,860 Whenever you're referencing a file in your Python code you're going to need to specify exactly where 18 19 00:01:18,860 --> 00:01:21,820 that file is and what that file is called. 
19 20 00:01:21,980 --> 00:01:31,430 And this means you have to specify two things, namely the file path and the file name. In the lesson resources 20 21 00:01:31,550 --> 00:01:35,120 that you've downloaded and added to your projects folder, 21 22 00:01:35,120 --> 00:01:38,260 I included a practice email. 22 23 00:01:38,270 --> 00:01:48,230 So taking a peek inside the spam data folder and then opening "01_Processing", you should see a file called 23 24 00:01:48,320 --> 00:01:52,150 "practice_email.txt". 24 25 00:01:52,280 --> 00:01:56,450 Now, the file name for this file should be pretty obvious because we're looking right at it. 25 26 00:01:56,510 --> 00:02:04,280 It's "practice_email.txt". That "txt" part is called the file extension and this signals 26 27 00:02:04,370 --> 00:02:08,340 what kind of file this is. A txt file 27 28 00:02:08,420 --> 00:02:16,520 is just a boring old text file, but the thing is, these file extensions are key to the computer or you 28 29 00:02:16,520 --> 00:02:20,270 or me understanding what kind of file it is that we're working with. 29 30 00:02:20,360 --> 00:02:24,710 The extension tells us something about a file's format. 30 31 00:02:24,710 --> 00:02:33,980 So for example, Microsoft Word documents have the extension "doc" or "docx" and PDFs have the extension 31 32 00:02:34,310 --> 00:02:41,920 ".pdf". Windows executable files on the other hand have the very dangerous extension ".exe". 32 33 00:02:42,260 --> 00:02:50,110 So if you ever receive an email that has an attachment with a ".exe" extension, don't open the attachment. 33 34 00:02:50,120 --> 00:02:55,310 This was always the classic way that people installed viruses on their computers back in the day. 
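The split between file name and file extension described above can be sketched in Python. This is only an illustration using the standard library's `pathlib` module, which the lesson itself has not introduced; the folder and file names are the ones mentioned in the lesson.

```python
from pathlib import PurePosixPath

# Location of the practice email inside the projects folder,
# written with forward slashes.
email_path = PurePosixPath("SpamData/01_Processing/practice_email.txt")

file_name = email_path.name    # file name, including the extension
extension = email_path.suffix  # extension, including the leading dot
stem = email_path.stem         # file name without the extension

print(file_name)  # practice_email.txt
print(extension)  # .txt
print(stem)       # practice_email
```

`PurePosixPath` only manipulates the text of the path, so the snippet runs even if the file is not on disk.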
34 35 00:02:55,490 --> 00:03:00,560 I was going through my spam directory recently and one thing I noticed actually was that more and more 35 36 00:03:00,560 --> 00:03:07,400 folks are sending out their malicious code with the ".js" or JavaScript extension instead of 36 37 00:03:07,430 --> 00:03:08,330 ".exe". 37 38 00:03:08,360 --> 00:03:16,760 So yeah, it's always good to pay close attention to these file extensions. Now that we've talked about 38 39 00:03:16,880 --> 00:03:20,070 the file name and the file extension, 39 40 00:03:20,090 --> 00:03:21,860 what about this other thing that I've mentioned? 40 41 00:03:21,860 --> 00:03:29,270 What about the path? The path or the file path is the location of the file. 41 42 00:03:29,270 --> 00:03:33,920 Let me quickly show you how you can actually view the file path on Mac 42 43 00:03:33,920 --> 00:03:37,250 and afterwards I'll show you how to see it on Windows. 43 44 00:03:37,250 --> 00:03:42,550 If you're using a Mac, then you can open Finder and you can go to this 44 45 00:03:42,560 --> 00:03:50,300 "practice_email.txt" file in Finder and then you can go to "File" > "Get Info" or Command+I 45 46 00:03:50,360 --> 00:03:56,820 and what you can see right here is the location of this file. 46 47 00:03:56,890 --> 00:04:02,590 It's on my hard drive, under my username, inside my projects folder, 47 48 00:04:02,740 --> 00:04:06,340 then SpamData and then 01_Processing. 48 49 00:04:06,340 --> 00:04:12,160 This here is the location of the "practice_email.txt" file. 49 50 00:04:12,220 --> 00:04:19,810 This right here is my file path. If you're running Windows on the other hand, you can always see the file 50 51 00:04:19,810 --> 00:04:24,490 path and the location in this address bar right up here. 51 52 00:04:24,490 --> 00:04:27,430 So right now this is the projects folder, 52 53 00:04:27,430 --> 00:04:35,700 double clicking on SpamData will open my folder and you can see now that the location here has updated.
53 54 00:04:35,770 --> 00:04:44,800 If I open "Processing" then I can clearly see the full file path to this practice email right here in 54 55 00:04:44,800 --> 00:04:53,800 my address bar. If I click inside the address bar, then we can see the names of the folders and the 55 56 00:04:53,800 --> 00:04:57,910 subfolders separated by a backslash. 56 57 00:04:57,910 --> 00:05:04,630 In other words, my Processing folder is inside SpamData which is inside my projects folder which is 57 58 00:05:04,630 --> 00:05:10,780 inside my username folder under Users and then on my C drive. 58 59 00:05:11,140 --> 00:05:17,850 This long piece of text that we're looking at right here is the full path. And this brings me to the 59 60 00:05:17,850 --> 00:05:21,510 two different kinds of paths that you'll be encountering. 60 61 00:05:21,570 --> 00:05:30,190 The first one is actually called an absolute path. Think of the absolute path as the long form. 61 62 00:05:30,250 --> 00:05:34,090 This is the full path to the file or folder. On Windows, 62 63 00:05:34,090 --> 00:05:40,330 you often see the absolute path starting with C and then going to your particular user profile and then 63 64 00:05:40,330 --> 00:05:44,050 down to a particular folder. On Mac, 64 65 00:05:44,050 --> 00:05:49,070 the absolute path to the practice email would look something more like this. 65 66 00:05:49,430 --> 00:05:49,880 All right. 66 67 00:05:49,930 --> 00:05:58,000 I think that's easy enough. But, while the absolute path is the long form, there is also a shorthand. You 67 68 00:05:58,000 --> 00:06:02,050 see, you don't always have to work with an absolute path. 68 69 00:06:02,050 --> 00:06:10,860 Instead, you can work with what's called a relative path. A relative path uses your current location as 69 70 00:06:10,860 --> 00:06:16,680 the starting location and then points to a file or folder from there.
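To make the difference between the long form and the shorthand concrete, here is a minimal sketch using `pathlib`. The absolute paths are hypothetical: `your_name` stands in for a real user folder, and the exact paths shown on screen in the lesson are not reproduced here.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical absolute paths; "your_name" is a placeholder user folder.
windows_absolute = PureWindowsPath(
    "C:/Users/your_name/projects/SpamData/01_Processing/practice_email.txt"
)
mac_absolute = PurePosixPath(
    "/Users/your_name/projects/SpamData/01_Processing/practice_email.txt"
)

# Relative path: it only makes sense from a starting location.
relative = PurePosixPath("SpamData/01_Processing/practice_email.txt")

print(windows_absolute.is_absolute())  # True
print(mac_absolute.is_absolute())      # True
print(relative.is_absolute())          # False

# Starting location + relative path gives the absolute path.
current_location = PurePosixPath("/Users/your_name/projects")
resolved = current_location / relative
print(resolved == mac_absolute)        # True
```

The last two lines show the idea behind a relative path: it is the part of the absolute path that remains once the current location is taken as given.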
70 71 00:06:16,710 --> 00:06:24,780 So suppose I'm in my MLProjects folder in Finder, then the relative path to the practice email would 71 72 00:06:24,780 --> 00:06:28,850 be SpamData, 01_Processing 72 73 00:06:29,100 --> 00:06:30,510 and there it is. 73 74 00:06:30,600 --> 00:06:38,110 There is my "practice_email.txt" file. As such, the relative path from my MLProjects 74 75 00:06:38,110 --> 00:06:44,620 folder to my practice_email.txt file would look something like this. 75 76 00:06:44,800 --> 00:06:50,700 What you're looking at here is actually the formatting for Windows. On a Mac, 76 77 00:06:50,710 --> 00:06:55,720 you'd have a relative path that would look more like this. Notice, 77 78 00:06:55,780 --> 00:07:02,620 the only difference is the slashes. Mac uses the forward slash and Windows uses the backslash in front 78 79 00:07:02,620 --> 00:07:04,900 of the file and folder names. 79 80 00:07:04,900 --> 00:07:08,980 Now this might be a small thing to point out, but the direction of the slashes actually matters when 80 81 00:07:08,980 --> 00:07:10,820 you're writing your Python code. 81 82 00:07:10,870 --> 00:07:16,930 So even though in Windows you see backslashes used in the path on the address bar, 82 83 00:07:16,990 --> 00:07:24,040 it's actually a little problematic to try and just copy paste what's in here and the reason is that 83 84 00:07:24,130 --> 00:07:30,940 in Python you actually want to use a forward slash, because the backslash has a special purpose in the 84 85 00:07:30,940 --> 00:07:33,150 Python programming language. 85 86 00:07:33,220 --> 00:07:39,370 Let's fire up a new Python notebook and I'll show you exactly what I mean and why we want to use forward 86 87 00:07:39,370 --> 00:07:46,130 slashes. In Jupyter, navigate to your MLProjects folder and then click "New" > 87 88 00:07:46,370 --> 00:07:59,540 "Python 3" notebook. Let's rename this notebook to "06 Bayes Classifier" and click "Rename".
Say we print a line 88 89 00:07:59,750 --> 00:08:02,970 'What is up?' and hit Shift+Enter. 89 90 00:08:02,990 --> 00:08:04,790 Then we see this text printed below. 90 91 00:08:05,630 --> 00:08:08,930 But what if we wanted to write 'What's up?'. 91 92 00:08:08,930 --> 00:08:13,130 So if I wanted to write "What's up?" with an apostrophe, 92 93 00:08:13,130 --> 00:08:22,940 you can see that I now have three single quotes in this string and that means that my string starts 93 94 00:08:22,940 --> 00:08:30,920 here, ends here and this part onwards here is considered Python code and then it starts another string 94 95 00:08:30,920 --> 00:08:31,800 here. 95 96 00:08:31,850 --> 00:08:39,030 In other words, Python needs to know that it should ignore this single quote and for that we use the 96 97 00:08:39,030 --> 00:08:44,480 backslash. The backslash escapes the character that follows it. 97 98 00:08:44,580 --> 00:08:50,130 So by using the backslash, this single quote here is treated as part of the string 98 99 00:08:50,280 --> 00:08:55,980 and if I hit Shift+Enter, I can get that apostrophe to show up in my output. 99 100 00:08:56,190 --> 00:09:03,960 And that's also why when you're providing a path as a string, you want to make sure to use forward slashes 100 101 00:09:04,230 --> 00:09:11,690 instead of the backslashes that you get from copy and pasting from the Windows address bar. Okay, 101 102 00:09:11,710 --> 00:09:20,420 so much for the theory behind relative paths, file names, file extensions and absolute paths. In the next 102 103 00:09:20,420 --> 00:09:21,050 lesson, 103 104 00:09:21,170 --> 00:09:25,760 I'll show you how to read that practice email into a Jupyter notebook. 104 105 00:09:25,760 --> 00:09:29,630 I'm gonna go ahead and get some coffee, but I hope I'll see you in the next lesson.
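For reference, the notebook steps above can be sketched roughly as follows. The variable names are made up for illustration, and the last part goes a little beyond what is shown on screen: it demonstrates what Python does with a backslash pasted from the Windows address bar, using the folder names from this lesson.

```python
# The backslash escapes the single quote, so it stays part of the string.
greeting = 'What\'s up?'
print(greeting)  # What's up?

# A pasted Windows-style path: Python reads "\01" as an escape sequence
# (an octal character code), not as a backslash followed by "01".
pasted = "SpamData\01_Processing"
print("\x01" in pasted)  # True: a control character sneaked in
print("01" in pasted)    # False: the digits are gone

# With forward slashes the string contains exactly what was typed.
safe = "SpamData/01_Processing/practice_email.txt"
print(safe)  # SpamData/01_Processing/practice_email.txt
```

The middle example is the reason for the forward-slash advice: the pasted path silently turns into a different string, so the file would not be found.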