This lesson is going to be a very, very quick exercise. So I'll insert a markdown cell here. The exercise will consist of checking if a word is part of the vocabulary.

We've got 2500 words in the vocabulary, and I'd like you to write the code to check if a particular word is in this vocabulary. As a challenge, can you write a line of code that checks if a particular word is part of the vocabulary? Your code should return True if the word is among the 2500 words that comprise the vocabulary and False otherwise. And I'd like you to check the following words individually: 'machine', 'learning', 'fun', 'learn', 'data', 'science', 'app' and 'brewery'. I'll give you a few seconds to pause the video and give this a shot.

Did you have a go? Well, first let me show you the more inefficient way. You can use our dataframe of vocabulary words, select the "VOCAB_WORD" column, and then check if any of these words is equal to the word 'machine'. If I hit Shift+Enter on this, I'll get a huge, huge result, and because I get a whole series of booleans, I can't check these all individually. So what I would do then is wrap this in the "any" function. In this case I can find out that, yes, the word 'machine' is amongst the vocabulary words that are most frequent.
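The comparison-plus-"any" approach described above could look roughly like the sketch below. The tiny `vocab` dataframe here is a hypothetical stand-in for the lesson's 2500-word vocabulary; only the column name "VOCAB_WORD" comes from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's vocabulary dataframe
# (the real one holds 2500 words).
vocab = pd.DataFrame({'VOCAB_WORD': ['machine', 'learn', 'fun', 'data', 'science', 'app']})

# Element-wise comparison: produces one boolean per row of the column.
matches = vocab.VOCAB_WORD == 'machine'

# Collapse the whole boolean series into a single True/False.
is_in_vocab = any(matches)
print(is_in_vocab)
```

Every row is compared against the word before "any" can answer, which is why this version gets called the inefficient one.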
Now there's actually a better way, because we've already learned about sets. So I'll just add a little comment here and I'll write "inefficient". The better way of checking membership in a collection is to use a Python set. I can convert our vocabulary words to a set by using the built-in "set" function and feeding in our column "vocab.VOCAB_WORD". This converts our vocabulary to a set, and a set is very, very efficient at checking membership. The way you would do this is using the "in" keyword. So if I have the word 'machine' in single quotes, followed by the "in" keyword and then our set here, I get a much, much better way of checking for membership. And this, quite frankly, is the better way.

Now if you're just executing one line of code, you might be like "Well, why is this inefficient?", right? Why is using this "any" and this logical condition here inefficient? And the thing is, yes, you wouldn't actually notice in this case. It's a very, very quick thing to execute. But if you're running a loop, if you're running something that runs thousands and thousands of times, then you will actually see a massive difference in compute time using sets versus another type of data structure.

Going back up to our list, let's see which words were actually among the 2500. So 'machine' is True.
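As a rough sketch of the set-based check and of the timing difference mentioned above: the word list below is made up (2499 placeholder tokens plus 'machine') and simply stands in for the column "vocab.VOCAB_WORD", and the names `words` and `word_set` are illustrative, not from the lesson.

```python
import timeit

# Hypothetical stand-in vocabulary: 2499 made-up tokens plus one real word.
words = [f'word{i}' for i in range(2499)] + ['machine']
word_set = set(words)  # in the lesson this would be set(vocab.VOCAB_WORD)

# Membership test with the `in` keyword.
print('machine' in word_set)
print('brewery' in word_set)

# Rough timing comparison: a list is scanned item by item,
# while a set uses a hash lookup.
list_time = timeit.timeit(lambda: 'machine' in words, number=2000)
set_time = timeit.timeit(lambda: 'machine' in word_set, number=2000)
print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```

The exact timings depend on the machine, so none are quoted here; the point is only that the gap grows when the check runs thousands of times.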
What about 'learning'? 'learning' is False, right. So 'learning' is not amongst the words, but that could be because of the stemming. If I write 'learn' instead, then it is indeed included. The word 'fun' is as well. The word 'data' is as well. The word 'science' is as well. The word 'app' is also among the 2500 most common words in our data set. And the word 'brewery' is not. Yeah, so this one is not included. 'brewer' is also not, and 'brew' is also not. Nobody uses this word.

All right. So that completes the challenge. The rationale for always putting these challenges into the lessons is that programming is really like a sport. You can't just read about it to get good, and you can't just copy code to get good at it. You actually have to do it. It's kind of similar to how nobody really reads a book on how to surf and then jumps on a surfboard and knows how to surf. It doesn't work. Now it seems obvious with the surfing example, but with programming you really have to think about it in the same way. And that's why I've got another exercise for you. This one will fall into the realm of data exploration. I'll see you in the next lesson.
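For anyone who wants to check all of the challenge words in one go, here is a minimal sketch. The set below is a hypothetical stand-in that contains only the words the lesson reports as present, so the results match the walkthrough by construction rather than by consulting the real vocabulary.

```python
# Hypothetical stand-in set: only the words the lesson reports as present.
word_set = {'machine', 'learn', 'fun', 'data', 'science', 'app'}

to_check = ['machine', 'learning', 'fun', 'learn', 'data', 'science', 'app', 'brewery']

# Map each word to True/False depending on set membership.
results = {word: word in word_set for word in to_check}

for word, found in results.items():
    print(f'{word!r}: {found}')
```

With the real vocabulary, `word_set` would instead be built from the "VOCAB_WORD" column.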