Now a really nice thing to do at this point is to save this information as a file. Because we've done all this work reading all the individual files, extracting the email bodies, cleaning up our data and creating this nice structure here that we're going to work with going forward, it's probably a good idea to save this dataframe in some sort of format as its own file. That way we can simply load this one file and skip all the work that we've done beforehand.

So I'm going to insert a markdown cell here and I'm going to write "Save to File using Pandas", add a few more rows, and here we're going to write the code that saves our dataframe to the disk. Now, we could be saving this file in a number of different formats, for example a CSV file or a text file or what have you. And most of these libraries, like pandas or numpy, have the functionality to create a file locally on your disk.

The method that I want to show you in this lesson is called "to_json"; "data.to_json()" will be the method that we're going to use. If I hit Shift+Tab, I get a quick description, and here you can see that the key information that we need to supply is a path or a buffer. In our case this is just a string with a relative path and a file name. We've actually worked with something very, very similar already with our practice email.
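To make the "path or a buffer" idea a bit more concrete, here is a minimal sketch. The tiny dataframe, its rows, and the temporary folder are all made up for illustration; only the column names and the "to_json" method come from the lesson.

```python
import io
import os
import tempfile

import pandas as pd

# Tiny stand-in for the lesson's dataframe (made-up rows, assumed column names)
data = pd.DataFrame({
    'CATEGORY': [1, 0],
    'MESSAGE': ['Win a prize now', 'See you at lunch'],
    'FILE_NAME': ['spam-0001', 'ham-0001'],
})

# Option 1: pass a path (a plain string) and pandas writes the file for us.
# The temporary folder is a hypothetical location, not the one from the lesson.
example_path = os.path.join(tempfile.mkdtemp(), 'example.json')
data.to_json(example_path)

# Option 2: pass a buffer (any file-like object) instead of a path
buffer = io.StringIO()
data.to_json(buffer)

# Both routes should produce the same JSON text
with open(example_path) as f:
    text_from_file = f.read()
text_from_buffer = buffer.getvalue()
print(text_from_file == text_from_buffer)
```

In the lesson we only need the first option, a string with a path and a file name.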
So let's create another constant up here and call it "DATA_JSON_FILE" and set that equal to single quotes, and then I'm going to copy this bit here, because I'm very, very lazy and I don't want to make any typos, and I'm going to paste it in here and then just tack on my file name as well. I'm going to call this "email-text-data.json"; "json" here is the file extension. Let me hit Shift+Enter on this. Now I can come down here and simply call on this constant.

Now before I hit Shift+Enter on the cell, I've pulled up my Processing folder here on the left so you can see what's happening when I actually execute this code. You should see this new JSON file being created. And here we go. There it is.

Now if you've done some web development or you've worked with APIs, then you're probably familiar with the JSON file format, but if you're not and you're seeing this for the first time, let me quickly show you what the JSON file format looks like. So what I'm going to do is I'm going to open this file in my text editor. Now, one thing to be aware of is that this file is actually huge. This is a file that is 17.7 megabytes in size. So it's quite a large file with a lot of text, as you can imagine from parsing about 5800 emails. When you open this file, you see something that looks like this.
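As a small-scale illustration of that raw, unformatted text, here is a sketch that saves a three-row stand-in dataframe through a "DATA_JSON_FILE" constant and prints the start of the file. The folder is a temporary one created just for this example (the real relative path from the lesson is not reproduced here), and the rows are made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical folder for this sketch; the lesson uses its own Processing folder
EXAMPLE_FOLDER = tempfile.mkdtemp()
DATA_JSON_FILE = os.path.join(EXAMPLE_FOLDER, 'email-text-data.json')

# Made-up rows standing in for the roughly 5800 parsed emails
data = pd.DataFrame({
    'CATEGORY': [1, 1, 0],
    'MESSAGE': ['Win a prize now', 'Cheap loans today', 'See you at lunch'],
    'FILE_NAME': ['spam-0001', 'spam-0002', 'ham-0001'],
})

# Save the dataframe to disk by calling on the constant
data.to_json(DATA_JSON_FILE)

# Open the file as plain text, the way a text editor would show it
with open(DATA_JSON_FILE) as f:
    raw_text = f.read()

# Everything sits on one line with no indentation
print(raw_text[:60])
```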
You see a curly brace, then the word "CATEGORY", then "0":1, "1":1 and so on, right. This formatting here looks very, very unfriendly. Right. But I can show you this in a much more readable fashion. If I select my entire JSON and paste it into a website like jsonmate.com, we can beautify our JSON and make it easier to read. After clicking "Beautify" and letting this website run for a little while, we can see that the formatting here is a lot more friendly now.

Scrolling down to the JSON editor, we start to recognize our column names from the dataframe as three different keys in this JSON. We've got the CATEGORY key, we've got the MESSAGE key and we've got the FILE_NAME key. The category was simply 1 or 0 for spam or non-spam, and these numbers here highlighted in blue, those are my document IDs, so document number 60 was a spam email. Similarly, if I expand the messages, I again have my document ID here, 0, 1, 2, 3, 4, 5, 6, 7 and so on, all the way to 5795, and I have the text of all the spam emails here.

In other words, even though the file formatting does look kind of scary in the JSON, especially for a massive file like this, it still preserves the overall structure of our dataframe.
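If you'd rather not paste a large file into a website, the same idea can be sketched locally. This is an alternative to the jsonmate.com step from the lesson, not what the lesson itself does: it uses Python's built-in json module to beautify a small made-up JSON string, and pandas' "read_json" to check that the column structure comes back.

```python
import io
import json

import pandas as pd

# Made-up JSON text in the same shape as the saved file:
# column names as outer keys, document IDs as inner keys
raw_text = (
    '{"CATEGORY":{"0":1,"1":0},'
    '"MESSAGE":{"0":"Win a prize now","1":"See you at lunch"},'
    '"FILE_NAME":{"0":"spam-0001","1":"ham-0001"}}'
)

# Beautify: parse the text, then write it back out with indentation
parsed = json.loads(raw_text)
print(json.dumps(parsed, indent=2))

# The three column names show up as the three top-level keys
print(list(parsed.keys()))

# Loading the JSON back into pandas rebuilds a dataframe with the same columns
restored = pd.read_json(io.StringIO(raw_text))
print(restored)
```

For the real 17.7 megabyte file you would probably only want to print a small slice of the beautified text.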
And JSON in fact is one of the most common ways that messages are passed around the web, so every time you hit up a server and request some weather data or what have you, the information that you get back is likely to be in this JSON format.

All right. So that pretty much wraps up the data cleaning that we're going to do. We've managed not just to read a bunch of files from our disk; we've also managed to save all the work that we've done to a JSON file. The next step in our workflow is exploring and visualizing the data in our dataset. Looking forward to seeing you in the next lesson. Take care.