0 1 00:00:00,580 --> 00:00:01,050 All right. 1 2 00:00:01,050 --> 00:00:07,080 So in this lesson I'm going to talk a little bit more about the data that we're working with 2 3 00:00:07,080 --> 00:00:14,460 and I'm also going to show you how you can get hold of it. In terms of where we are in the project, 3 4 00:00:14,460 --> 00:00:20,350 well, we've got our objective sorted and we've clarified what it is that we're going to do. 4 5 00:00:20,610 --> 00:00:27,390 In other words, we've nailed down the question that we're looking to answer that is: Is an e-mail spam 5 6 00:00:27,960 --> 00:00:35,580 or is an email legitimate? Now we're gonna go on to the next step, which is gathering the data to answer 6 7 00:00:35,730 --> 00:00:37,300 this question. 7 8 00:00:37,350 --> 00:00:42,700 Remember, the premise is that a company has provided the data for you. 8 9 00:00:42,930 --> 00:00:45,880 And I'm going to play the role of the company here. 9 10 00:00:45,900 --> 00:00:52,530 So in this lesson's resources there is a zip file that you can download with everything that you need 10 11 00:00:52,530 --> 00:00:54,200 for this module. 11 12 00:00:54,240 --> 00:01:00,670 Now I highly encourage you to pause the video and do that right now, just download the lesson resources. 12 13 00:01:01,470 --> 00:01:04,230 But this isn't what this video is about, right. 13 14 00:01:04,410 --> 00:01:11,130 What I really want to show you is where I got the data from in the first place and that way you'll know 14 15 00:01:11,460 --> 00:01:18,120 how to source the data and how you can pull up detailed information about the data yourself. 15 16 00:01:18,120 --> 00:01:22,890 You never know, this kind of stuff might really come in handy with future projects of your own. 16 17 00:01:22,920 --> 00:01:23,220 Right. 17 18 00:01:24,060 --> 00:01:24,290 OK. 18 19 00:01:24,300 --> 00:01:25,680 So let's dive right in. 19 20 00:01:25,830 --> 00:01:28,290 So where does the data actually come from? 
20 21 00:01:28,290 --> 00:01:35,760 Well, all the e-mail data that we're gonna be working with actually comes from an open-source anti-spam 21 22 00:01:35,760 --> 00:01:39,240 platform called SpamAssassin. 22 23 00:01:39,250 --> 00:01:43,410 Now I know that in terms of design, this website doesn't look like much but I think this is a really 23 24 00:01:43,410 --> 00:01:45,050 really great project. 24 25 00:01:45,150 --> 00:01:53,700 And the other thing is that SpamAssassin actually have a huge data set of spam emails and legitimate 25 26 00:01:53,730 --> 00:02:01,350 e-mails that they're making available to the public to do things like train a spam classifier. 26 27 00:02:01,350 --> 00:02:10,110 The easiest way that you can find this data set is to google the words SpamAssassin public corpus. 27 28 00:02:10,650 --> 00:02:16,680 And if you do that the top results should look something like this. 28 29 00:02:16,700 --> 00:02:20,570 If I click on this link it'll take you to this page here. 29 30 00:02:20,580 --> 00:02:23,610 Mind you, the URL up here might change, right. 30 31 00:02:23,610 --> 00:02:27,750 That's completely up to the people who are behind SpamAssassin. 31 32 00:02:27,840 --> 00:02:36,330 So hence googling SpamAssassin public corpus is a surefire way to find the right website. 32 33 00:02:36,330 --> 00:02:42,360 Again the design here can be a little bit off-putting, it looks a little bit dodge, but what you've essentially 33 34 00:02:42,360 --> 00:02:49,860 got here is a list of archives, you've got a list of collections of spam e-mails that you can download. 34 35 00:02:50,460 --> 00:02:58,520 The file naming here gives a little bit of a clue to the recency of the archive itself. 35 36 00:02:58,530 --> 00:02:58,740 Right.
36 37 00:02:58,740 --> 00:03:06,960 So sometimes you'll see like a duplicate because the archive was actually updated - for example this file 37 38 00:03:06,960 --> 00:03:07,470 here, 38 39 00:03:07,650 --> 00:03:11,690 it's probably just an updated version of this one here. 39 40 00:03:11,700 --> 00:03:13,020 I'm just going by the file name here. 40 41 00:03:13,320 --> 00:03:17,380 I don't think the last modified column is actually that useful in this case. 41 42 00:03:17,400 --> 00:03:24,600 Now a moment ago you saw me use the word corpus in my google search query and you see that word again 42 43 00:03:24,600 --> 00:03:25,070 here. 43 44 00:03:25,080 --> 00:03:25,950 Corpus. 44 45 00:03:25,950 --> 00:03:27,060 Right? 45 46 00:03:27,120 --> 00:03:33,720 This is actually quite an important piece of jargon to know, because it'll come up again and again when 46 47 00:03:33,810 --> 00:03:43,080 you're working with text datasets. A corpus is defined as a large and structured set of texts. 47 48 00:03:43,080 --> 00:03:48,240 You'll hear the word corpus also used when people refer to say all the written works of a particular 48 49 00:03:48,270 --> 00:03:50,380 author like Shakespeare, 49 50 00:03:50,430 --> 00:04:00,100 right? Now a related term that you'll come across as well when gathering text data is the word "Document". 50 51 00:04:00,120 --> 00:04:08,730 Now this actually means something quite specific in this context. A document refers to a particular email 51 52 00:04:09,240 --> 00:04:14,420 in our corpus, so the corpus refers to a set of all documents, 52 53 00:04:14,920 --> 00:04:19,250 a document refers to a particular item in the corpus. 53 54 00:04:20,130 --> 00:04:25,150 So you'll see both of these two pieces of jargon come up again and again in this module. 54 55 00:04:25,150 --> 00:04:28,560 And here's a bonus question for you. 55 56 00:04:28,560 --> 00:04:34,980 Can you guess the plural of the word corpus?
56 57 00:04:35,190 --> 00:04:36,780 I'm not gonna leave you hanging there. 57 58 00:04:36,850 --> 00:04:46,160 It's "corpora" - one corpus, three corpora, not corpuses or corpi or what have you. 58 59 00:04:47,080 --> 00:04:53,350 I found this out the hard way by the way, with people giving me quite funny looks at a conference and 59 60 00:04:53,440 --> 00:04:55,970 I don't want the same thing to happen to you. 60 61 00:04:55,990 --> 00:04:56,840 So there you go, 61 62 00:04:56,860 --> 00:05:03,760 corpora. Now you know. Back on the SpamAssassin website, let me point you towards the detailed 62 63 00:05:03,910 --> 00:05:07,630 description of the data set that we're working with. 63 64 00:05:07,630 --> 00:05:12,950 It's right here under this unassuming "readme.html". 64 65 00:05:13,030 --> 00:05:20,650 This is where you can see the revision history of the dataset and a brief explanation of what's actually 65 66 00:05:20,800 --> 00:05:23,010 in the archive files. 66 67 00:05:23,110 --> 00:05:31,240 Now, speaking of archives, you might be wondering how to actually download and open these files to look 67 68 00:05:31,240 --> 00:05:33,640 at an individual email. 68 69 00:05:33,700 --> 00:05:41,380 It's actually not that straightforward, because you don't tend to see these ".tar.bz2" file 69 70 00:05:41,380 --> 00:05:46,920 types very, very often and it's not an extension you come across every day, right. 70 71 00:05:46,960 --> 00:05:50,490 So let me show you how you can work with these files on a Mac 71 72 00:05:50,500 --> 00:05:56,550 first and after that I'll show you the workflow on a Windows machine too. 72 73 00:05:56,920 --> 00:06:01,330 Obviously, the first thing you have to do is download one of these files. 73 74 00:06:01,360 --> 00:06:03,980 So I'm gonna click on this spam archive here, 74 75 00:06:03,990 --> 00:06:10,990 the most recent one from 2005 and I'm going to save it to my downloads folder.
75 76 00:06:11,200 --> 00:06:14,780 If I pull up my downloads folder I can see it sitting right here, 76 77 00:06:14,800 --> 00:06:22,470 it's got the icon of a zip archive and it's 2.1 megabytes. On Mac, 77 78 00:06:22,480 --> 00:06:29,860 all you have to do is pretty much double click on it, it expands it meaning it unarchives or unzips 78 79 00:06:29,860 --> 00:06:34,390 it and it'll create a new folder for you like so. 79 80 00:06:34,390 --> 00:06:36,250 So let me open this folder. 80 81 00:06:36,250 --> 00:06:41,290 You can already see here it contains 1397 items. 81 82 00:06:41,300 --> 00:06:47,050 So there's quite a few things in here and they all look like this. 82 83 00:06:47,080 --> 00:06:49,960 That's pretty intimidating 83 84 00:06:49,960 --> 00:06:54,380 and looks like complete junk when you see it for like the first time. 84 85 00:06:54,430 --> 00:07:02,110 So let me show you what these files actually are and how you would look at them. To open one of these 85 86 00:07:02,110 --> 00:07:02,590 files, 86 87 00:07:02,620 --> 00:07:09,040 you need to use a text editor. On Windows, Notepad comes preinstalled. On Mac, 87 88 00:07:09,040 --> 00:07:16,150 it's TextEdit, but the thing is if you right click on this and you go to "Open With", my operating system 88 89 00:07:16,180 --> 00:07:19,410 isn't even suggesting a text editor here to open it with, 89 90 00:07:19,450 --> 00:07:21,760 so I manually have to choose it. 90 91 00:07:21,970 --> 00:07:26,860 One of my favorite text editors these days is actually called Atom, so you can download and install this 91 92 00:07:26,860 --> 00:07:28,390 thing if you like. 92 93 00:07:28,780 --> 00:07:36,150 Let me show you what it looks like when I open this file with this text editor. This is pretty much what 93 94 00:07:36,150 --> 00:07:37,290 you'll see. 94 95 00:07:37,290 --> 00:07:43,610 You'll see the top of the e-mail here followed by the body text of the email, 95 96 00:07:43,620 --> 00:07:44,720 like so. 
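That same split you see in the text editor, the header lines at the top and then the body text underneath, can also be pulled apart in code. Here's a rough sketch using Python's built-in `email` package. The helper name `split_email` is made up for illustration and isn't something from the lesson resources, and the sketch only deals with simple, single-part emails.

```python
from email import message_from_string


def split_email(raw_text):
    """Return (headers, body) for the raw text of one email.

    Hypothetical helper for illustration only.
    """
    message = message_from_string(raw_text)
    # Repeated headers (such as "Received") collapse to the last value here.
    headers = dict(message.items())
    # Multipart emails hold a list of parts instead of one string,
    # so this sketch returns an empty body for them.
    body = "" if message.is_multipart() else message.get_payload()
    return headers, body
```

To try it on one of the extracted files, read the file's text first, for example with `Path(some_file).read_text(encoding="latin-1")`, where `some_file` is whichever email you picked, and pass that text to `split_email`. The `latin-1` encoding is an assumption on my part, chosen because it accepts any byte and so won't trip over odd characters in spam.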
96 97 00:07:44,820 --> 00:07:53,610 So we've successfully opened one of these emails in a text editor after unarchiving the thing that we 97 98 00:07:53,610 --> 00:07:56,950 downloaded from the SpamAssassin website. 98 99 00:07:56,970 --> 00:07:58,990 This is the workflow on Mac 99 100 00:07:59,010 --> 00:08:01,070 in any case. Everybody who's on Windows, 100 101 00:08:01,200 --> 00:08:04,030 I apologize for making you wait this long. 101 102 00:08:04,050 --> 00:08:06,890 I'm going to get to that right now. 102 103 00:08:06,930 --> 00:08:07,250 All right. 103 104 00:08:07,320 --> 00:08:16,770 So this is a Windows 8 machine, but Windows 10 isn't so different and all the same principles apply. 104 105 00:08:16,770 --> 00:08:23,940 We first have to extract the zip file and then we can open a file in a text editor like Notepad or, yes, 105 106 00:08:24,220 --> 00:08:25,370 Atom. 106 107 00:08:25,530 --> 00:08:32,550 So I'm going to download this ".tar.bz2" file and save it to my Downloads folder which I've got 107 108 00:08:33,000 --> 00:08:39,750 open for you right here and then I can open it with an unarchiver of my choice. 108 109 00:08:39,750 --> 00:08:40,910 So I've got a couple. 109 110 00:08:41,070 --> 00:08:42,510 There's quite a few options here. 110 111 00:08:42,540 --> 00:08:49,980 I'm gonna use a program called "7-Zip" and you can follow the two step process here with me. 111 112 00:08:49,980 --> 00:09:00,180 So this is the "7-Zip" program with the archive file inside and I'm going to say "Extract" and I'll 112 113 00:09:00,180 --> 00:09:08,310 extract it to my Downloads folder and you should see the "tar" file pop up here. 113 114 00:09:08,370 --> 00:09:17,060 So this is actually another archive, so I'm going to have to repeat this step one more time.
So I'm going 114 115 00:09:17,050 --> 00:09:25,890 to open this again with "7-Zip" and then I'm going to extract my spam email folder from inside this 115 116 00:09:25,890 --> 00:09:30,390 one and put it into, say, my Downloads folder. 116 117 00:09:30,390 --> 00:09:38,070 So this is the process that I have to follow on my Windows machine to get to my spam emails - extracting 117 118 00:09:38,070 --> 00:09:47,140 them twice. Now let me show you how you can pull up one of these emails in Notepad. If I double click on 118 119 00:09:47,140 --> 00:09:55,600 this thing, I'm going to have to choose what program I want to use to open this file. The file extension 119 120 00:09:55,930 --> 00:10:03,010 of this particular file isn't helping Windows figure out which program is appropriate to open this file 120 121 00:10:03,010 --> 00:10:04,580 with by default. 121 122 00:10:04,630 --> 00:10:12,670 So again we have to choose it manually. If I go to "More Options" I can scroll down here and I will find 122 123 00:10:12,670 --> 00:10:15,820 the Notepad program listed here. 123 124 00:10:15,850 --> 00:10:17,600 So this is Notepad. 124 125 00:10:18,340 --> 00:10:20,960 Uh good ole, good ole Notepad. 125 126 00:10:21,190 --> 00:10:24,640 And it does open the file perfectly. 126 127 00:10:24,640 --> 00:10:32,580 It's not formatting it in a very friendly way, but it is opening it. 127 128 00:10:32,650 --> 00:10:40,270 So this is how you can look at a particular email with a default program not having to install anything 128 129 00:10:40,390 --> 00:10:41,920 on your machine. 129 130 00:10:41,920 --> 00:10:42,190 Right. 130 131 00:10:42,700 --> 00:10:49,660 So you shouldn't have a problem opening any of these files and looking at the emails individually if 131 132 00:10:49,660 --> 00:10:54,580 you want to. 
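If you'd rather script the extraction than click through an unarchiver, Python's standard `tarfile` module can undo both layers, the bzip2 compression and the tar archive, in a single call. This is just a minimal sketch, assuming the archive has already been downloaded. The function name `extract_archive` is a placeholder of my own, not something from the lesson.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.bz2 archive and return the extracted file paths.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" decompresses the bzip2 layer and reads the tar inside it,
    # so the two manual extraction steps collapse into one.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        # Only extract archives that come from a source you trust.
        archive.extractall(dest)
    # Keep the files themselves and skip the folder entries.
    return sorted(p for p in (dest / name for name in names) if p.is_file())
```

You would call it with the path to the file you downloaded and the folder you want the emails to end up in. It gives you back a sorted list of the individual email files, which is handy if you want to loop over them later.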
Now I think you'll tire of Notepad pretty quickly, 132 133 00:10:54,580 --> 00:11:00,310 so you know, if you have a decent text editor like Atom or say Notepad++ or what have you then 133 134 00:11:00,310 --> 00:11:04,620 it formats it for you in a much more user-friendly way. 134 135 00:11:05,060 --> 00:11:11,470 And this is what I would encourage you to use rather than the default program that's installed.