0 1 00:00:00,390 --> 00:00:05,700 In this lesson I want to talk to you a little bit more about cleaning data. 1 2 00:00:05,700 --> 00:00:09,770 We've already gathered all our data and we've started on this process. 2 3 00:00:10,050 --> 00:00:17,640 In particular what we've done so far is we've extracted the relevant data, namely the e-mail bodies from 3 4 00:00:17,970 --> 00:00:25,770 all our text files and we've also converted all the text files into a pandas dataframe. 4 5 00:00:25,800 --> 00:00:32,190 In other words, we've taken over 5000 different files, we've extracted the actual pieces of data that 5 6 00:00:32,190 --> 00:00:40,020 we need from it and we've converted all of that into a format that we can work with in our Python code. 6 7 00:00:40,020 --> 00:00:46,500 Now it's time for the next steps, namely checking for empty emails, checking for missing values or no 7 8 00:00:46,500 --> 00:00:52,470 values in our data that we've imported into our Python notebook. 8 9 00:00:52,500 --> 00:00:55,890 So without further ado, let me show you how to do that 9 10 00:00:55,890 --> 00:01:05,670 in Jupyter notebook. I'm going to add another markdown cell right here that reads "Data Cleaning: Checking 10 11 00:01:05,670 --> 00:01:10,850 for Missing Values". 11 12 00:01:10,960 --> 00:01:13,770 Of course it would help if I could spell a bit better. 12 13 00:01:13,870 --> 00:01:15,820 But there we go. 13 14 00:01:16,240 --> 00:01:21,890 The handy thing is that we've put all our data into a pandas dataframe. 14 15 00:01:22,060 --> 00:01:32,680 The first thing I want to do is check if any message bodies are equal to null. By null I mean that the value 15 16 00:01:32,680 --> 00:01:33,950 is missing. 16 17 00:01:34,240 --> 00:01:41,830 The way I can access a particular column in my dataframe, namely the messages column, is by writing 17 18 00:01:41,830 --> 00:01:45,040 "data.MESSAGE". 18 19 00:01:45,040 --> 00:01:52,660 This will access my message column which looks like so. The alternative syntax that you'll see to accomplish 19 20 00:01:52,660 --> 00:02:00,940 the exact same thing is with the square brackets and then the actual column name between single quotes. 20 21 00:02:00,940 --> 00:02:05,750 So that reads "MESSAGE". Voila! 21 22 00:02:05,930 --> 00:02:06,870 Great. 22 23 00:02:06,880 --> 00:02:12,980 Now there are 5800 rows in this column. 23 24 00:02:12,980 --> 00:02:22,620 So if I wanted to check if any of them has a missing value, then I can use the method "isnull". What we 24 25 00:02:22,620 --> 00:02:34,010 get in this case is a value True or False based on whether or not any particular value was null or not. 25 26 00:02:34,020 --> 00:02:42,450 Now this is a very, very handy method, but it doesn't help me check if any of these values is equal to 26 27 00:02:42,450 --> 00:02:43,430 null. 27 28 00:02:43,540 --> 00:02:47,000 I mean part of the reason is is that, yes, I can scroll through here, 28 29 00:02:47,490 --> 00:02:56,240 but Jupyter will actually only display so many rows. It will hide the middle part of my dataframe. 29 30 00:02:56,400 --> 00:03:00,990 So what I'm going to do is I'm going to chain some more methods. 30 31 00:03:00,990 --> 00:03:04,860 The first one is this values attribute. 31 32 00:03:04,860 --> 00:03:12,600 And what this will do is instead of giving me the index name and the value here, it will just give me 32 33 00:03:13,410 --> 00:03:15,060 numpy array. 33 34 00:03:15,210 --> 00:03:18,760 It will just give me the values without the index. 34 35 00:03:18,810 --> 00:03:24,870 The nice thing about this is that if I wanted to check if any of the 5800 rows 35 36 00:03:25,200 --> 00:03:33,740 has a missing value, I can use the "any" method and chain that onto here and get my result. 36 37 00:03:33,780 --> 00:03:36,980 So the good news for us is that the result is False. 37 38 00:03:37,020 --> 00:03:46,380 We have no missing values in our message column. Now, missing values are actually not the same thing as 38 39 00:03:46,560 --> 00:03:57,010 empty emails, right? An empty email would be something like this, an empty string. If I check what this 39 40 00:03:57,010 --> 00:04:02,560 is I will actually see that is of type string. 40 41 00:04:02,590 --> 00:04:11,840 This would simply be an empty string, a string of length zero. So what would be an example of a missing 41 42 00:04:11,840 --> 00:04:21,140 value? Say I had a variable called "my_var" and we don't want to put any value inside this variable, we 42 43 00:04:21,140 --> 00:04:24,780 simply want it to be empty or contain nothing. 43 44 00:04:24,950 --> 00:04:28,010 How would we accomplish this in Python? 44 45 00:04:28,010 --> 00:04:33,650 Now you might think that given that we've used isnull as a method on our dataframe, you could simply 45 46 00:04:33,650 --> 00:04:39,800 write something like "myvar = null" and create your empty variable. 46 47 00:04:39,800 --> 00:04:45,770 However, even though this null keyword is commonly used in other programming languages like say Java, 47 48 00:04:46,340 --> 00:04:51,400 in Python there is another keyword that has the same meaning 48 49 00:04:51,500 --> 00:04:53,450 and this is the keyword None. 49 50 00:04:53,720 --> 00:04:58,250 And you can see that Jupyter notebook is instantly picking up the fact that this is a keyword with the 50 51 00:04:58,250 --> 00:05:00,500 syntax highlighting. 51 52 00:05:00,530 --> 00:05:05,750 So in this case, if I was to check the type of my variable, 52 53 00:05:08,560 --> 00:05:12,100 I would see that it is indeed a NoneType, 53 54 00:05:12,100 --> 00:05:14,950 my_var is truly empty. 54 55 00:05:14,950 --> 00:05:17,480 It truly contains nothing. 55 56 00:05:17,620 --> 00:05:18,830 Now conceptually, 56 57 00:05:18,970 --> 00:05:26,170 the way I like to remember this difference between zero or an empty string with a null value is with 57 58 00:05:26,170 --> 00:05:27,400 this picture. 58 59 00:05:27,400 --> 00:05:31,900 The Internet has truly blessed us with lots of terrible programming jokes and I'm going to do my 59 60 00:05:31,900 --> 00:05:33,640 best to share them with you. 60 61 00:05:33,820 --> 00:05:39,520 But in all seriousness, this is actually quite a helpful way about thinking about null values or None 61 62 00:05:39,520 --> 00:05:41,470 values as they're known in Python. 62 63 00:05:42,490 --> 00:05:48,730 Okay, so even though that we've established that there's no entry in our message column with the value 63 64 00:05:48,790 --> 00:05:57,740 null or None, it doesn't mean that we don't have empty emails that could be in our dataset. 64 65 00:05:57,790 --> 00:06:04,030 Okay, so let me show you how we would write a check for discovering if there are any strings of length 65 66 00:06:04,180 --> 00:06:07,870 zero in this entire message column in our dataframe. 66 67 00:06:09,440 --> 00:06:21,440 I'll add a little comment here that reads "check if there are empty emails (string length zero)". 67 68 00:06:21,450 --> 00:06:27,420 The way I would go about doing this is grabbing our dataframe, then grabbing the message column, then 68 69 00:06:27,450 --> 00:06:35,150 converting that to a string and then grabbing the length of these strings. 69 70 00:06:35,150 --> 00:06:37,410 Let's take a look at what this looks like. 70 71 00:06:37,460 --> 00:06:44,210 Here we can see how many characters each email body has in our messages column. 71 72 00:06:46,320 --> 00:06:50,950 Now if we wanted to check if any of them are equal to zero, 72 73 00:06:51,240 --> 00:06:57,630 in other words if any of them look like this, we can simply use the double equal sign and then zero for 73 74 00:06:57,630 --> 00:06:59,170 the logical test. 74 75 00:06:59,290 --> 00:07:06,710 If I press Shift+Enter now we again get an entire column of True or False values. 75 76 00:07:06,890 --> 00:07:07,320 All right. 76 77 00:07:07,320 --> 00:07:13,380 We can't look through 5800 different messages, so the any method is our best 77 78 00:07:13,380 --> 00:07:20,370 friend again to see if there are any rows where the value is equal to True instead of False. 78 79 00:07:20,370 --> 00:07:21,700 Let's take a look. 79 80 00:07:21,810 --> 00:07:22,830 Huh. 80 81 00:07:22,980 --> 00:07:28,620 So there are some empty emails somewhere in our dataframe. 81 82 00:07:28,640 --> 00:07:34,470 What we don't know yet is how many of these rows have a string length of zero. 82 83 00:07:34,530 --> 00:07:36,320 So how would we check that? 83 84 00:07:36,550 --> 00:07:40,170 Well we could take our logical test here, 84 85 00:07:40,170 --> 00:07:45,300 remember this returned 5800 entries of True or False. 85 86 00:07:45,360 --> 00:07:54,120 Now the Boolean value of False is equated with the numerical value of 0 and the Boolean value of True 86 87 00:07:54,540 --> 00:07:57,010 equates to the value 1. 87 88 00:07:57,030 --> 00:08:04,200 So if we wanted to know how many rows have a string length of 0, all we would have to do is sum them up. 88 89 00:08:04,200 --> 00:08:11,940 This sums up all the zeros and any ones, any true values in the 5800 entries. 89 90 00:08:13,560 --> 00:08:14,270 Okay. 90 91 00:08:14,440 --> 00:08:17,320 So we're going deeper down the rabbit hole. 91 92 00:08:17,530 --> 00:08:24,970 We now know that among the 5800 entries there are 4 strings of length 92 93 00:08:24,970 --> 00:08:32,360 zero. Now, we have to find out where in our enormous dataframe these strings are. 93 94 00:08:32,530 --> 00:08:36,100 We have to find the rows that contain these empty strings. 94 95 00:08:36,100 --> 00:08:39,090 But before we do that I want to pose a challenge to you. 95 96 00:08:39,910 --> 00:08:45,790 Namely, I'd like you to pause the video and write the Python code that would check for the number of 96 97 00:08:45,850 --> 00:08:49,720 entries with null or None values. 97 98 00:08:49,720 --> 00:08:51,570 Did you solve it? 98 99 00:08:51,580 --> 00:08:52,860 Here's the solution. 99 100 00:08:52,900 --> 00:08:59,620 Now I know we don't have any null values in this dataframe, but that doesn't mean that your future projects 100 101 00:08:59,830 --> 00:09:02,630 won't be having this issue. 101 102 00:09:02,680 --> 00:09:14,020 So with "data.MESSAGE" we can pull up our message column and then with "isnull()" we can find out if 102 103 00:09:14,080 --> 00:09:16,410 there are any null values. 103 104 00:09:16,690 --> 00:09:23,980 And by summing up all the True and False values that we get out of this bit of code here, we can find 104 105 00:09:23,980 --> 00:09:27,230 out the number of null values. 105 106 00:09:27,310 --> 00:09:32,380 The answer of zero shouldn't be surprising, because we've already established that there are no null 106 107 00:09:32,380 --> 00:09:34,300 values in the message column. 107 108 00:09:35,200 --> 00:09:41,710 But I do think that this is quite an important piece of the data cleaning workflow, so it's good to have 108 109 00:09:41,710 --> 00:09:45,430 this down and practice this once or twice. 109 110 00:09:45,430 --> 00:09:49,480 Let me add some more empty cells here. 110 111 00:09:49,480 --> 00:10:01,140 Now I'll add a markdown cell and what I'll do here is I'll just write "Locate empty emails". 111 112 00:10:01,710 --> 00:10:06,580 Let's find out the row names of our empty emails. 112 113 00:10:06,810 --> 00:10:11,010 As you can probably guess this condition is gonna be key. 113 114 00:10:11,010 --> 00:10:17,760 This is what we're going to be using to get a pandas series of True or False values for where these 114 115 00:10:17,760 --> 00:10:19,100 rows are empty. 115 116 00:10:20,300 --> 00:10:30,110 Now if I wanted to pull out the index names for where these empty values are, then I could select based 116 117 00:10:30,260 --> 00:10:41,060 on this criteria where these values are True from our dataframe and then access the index attribute 117 118 00:10:41,660 --> 00:10:43,070 from the dataframe. 118 119 00:10:43,070 --> 00:10:52,120 So "data[]" and then put the logical condition in here and then have ".index" afterwards. 119 120 00:10:52,120 --> 00:10:54,140 And now we'll get the index names. 120 121 00:10:54,330 --> 00:10:58,400 The row names for where this condition is True. 121 122 00:10:58,500 --> 00:10:59,030 Check it out. 122 123 00:11:00,000 --> 00:11:03,500 OK, so this is interesting right? 123 124 00:11:03,510 --> 00:11:11,100 "cmds, cmds, cmds" and then ".DS_Store". 124 125 00:11:11,190 --> 00:11:13,910 These are the index names. Now, 125 126 00:11:13,910 --> 00:11:19,340 does anything strike you as odd with these index names? 126 127 00:11:19,340 --> 00:11:22,630 Let's see how they compare to all the other file names. 127 128 00:11:22,760 --> 00:11:30,260 So just scrolling through our dataset here, we can see that most of these emails actually have file 128 129 00:11:30,260 --> 00:11:40,190 names that looks something like this. The "cmds" and ".DS_Store" as a file name does seem to be a little bit 129 130 00:11:40,280 --> 00:11:41,590 odd. 130 131 00:11:41,600 --> 00:11:45,080 Let's see if we can find out what these files are. 131 132 00:11:45,140 --> 00:11:52,460 If I go to my spam_1 folder and I scroll all the way down then I indeed see one of these 132 133 00:11:52,460 --> 00:11:54,280 files here. 133 134 00:11:54,320 --> 00:11:59,970 This actually does not seem to be an email, right? 134 135 00:12:00,020 --> 00:12:07,620 Taking a look at this file here in my Atom text editor I can see that it looks like this. 135 136 00:12:07,720 --> 00:12:11,140 This is definitely not part of our dataset. 136 137 00:12:11,190 --> 00:12:19,720 This "cmds" file is actually a system file that you'll see both on Windows and Mac when you unzip the 137 138 00:12:19,780 --> 00:12:22,270 original email archive. 138 139 00:12:22,270 --> 00:12:27,530 This is not an email file that we should include in our dataframe. 139 140 00:12:27,580 --> 00:12:35,460 You'll also find this cmds file in the spam_2 folder and in the easy_ham_1 140 141 00:12:35,460 --> 00:12:39,330 folder. Now what about this ".DS_ 141 142 00:12:39,430 --> 00:12:48,030 Store" file? Now, because the file name is preceded by a full stop, by a dot, 142 143 00:12:48,030 --> 00:12:52,000 this "DS_Store" file is actually a hidden file. 143 144 00:12:52,180 --> 00:13:00,460 This is a Mac OS specific file that stores the attributes of a particular folder. And this system file 144 145 00:13:00,490 --> 00:13:05,560 was created when I downloaded, unzipped and then rezipped the course materials. 145 146 00:13:05,620 --> 00:13:07,750 In other words, this is not an email. 146 147 00:13:07,750 --> 00:13:13,130 This is an artifact of me preparing the lesson resources for you. 147 148 00:13:13,390 --> 00:13:17,920 Now obviously we don't want any of these four files in our dataframe. 148 149 00:13:17,980 --> 00:13:20,110 We just want our email data. 149 150 00:13:20,110 --> 00:13:23,490 But why are these files in our dataframe in the first place 150 151 00:13:23,620 --> 00:13:24,870 if we don't want them there? 151 152 00:13:24,880 --> 00:13:27,100 What, what happened? 152 153 00:13:27,100 --> 00:13:30,310 Well let's take a look at our file reading function. 153 154 00:13:30,310 --> 00:13:39,130 Looking at the e-mail body generator code and this walk function from the os library, we can see that 154 155 00:13:39,340 --> 00:13:45,200 this function actually picks up all the files in a particular folder 155 156 00:13:45,430 --> 00:13:50,710 and this includes the system files that don't belong in the email corpus. 156 157 00:13:50,710 --> 00:13:53,790 That's how they got included in the dataframe. 157 158 00:13:53,800 --> 00:13:58,750 Now we know the row names are index names for undesirables in our dataframe. 158 159 00:13:58,750 --> 00:14:02,560 But what we don't know yet is where they are. Now, 159 160 00:14:02,590 --> 00:14:09,940 let me show you how to get the location for this ".DS_Store" entry. 160 161 00:14:09,940 --> 00:14:15,190 What we can do with our dataframe is get the index, 161 162 00:14:15,190 --> 00:14:17,140 so "index.get_ 162 163 00:14:20,120 --> 00:14:31,220 loc", get_location, loc, parentheses and then single quotes, and then ".DS_ 163 164 00:14:31,820 --> 00:14:40,970 Store" in the single quotes will give us the row number where this entry lives. So we can search the entire 164 165 00:14:40,970 --> 00:14:41,720 index, 165 166 00:14:41,840 --> 00:14:47,940 so all the names for a particular name which is our argument in this "get_location" method here. 166 167 00:14:48,800 --> 00:14:50,550 Let's take a look. 167 168 00:14:50,570 --> 00:14:58,700 So here we have it, 4609 is where this entry is in our dataframe. If I say 168 169 00:14:58,700 --> 00:15:09,630 "data[4609: 169 170 00:15:09,630 --> 00:15:17,870 4611]" then I can pull up this section of the dataframe and take a look at which rows precede 170 171 00:15:18,200 --> 00:15:27,190 and follow this ".DS_Store" entry and there we can see it just happily living between two 171 172 00:15:27,190 --> 00:15:35,410 non-spam emails. In the next lesson I'll show you how to remove these system files from our dataframe. 172 173 00:15:35,680 --> 00:15:41,200 I'll show you how to drop these entries and clean things up a little bit and much, much more. I'll see 173 174 00:15:41,200 --> 00:15:41,580 you there.