1
00:00:00,750 --> 00:00:04,410
This lesson is going to be a very, very quick exercise.
1

2
00:00:04,410 --> 00:00:07,290
So I'll insert a markdown cell here,
2

3
00:00:07,290 --> 00:00:15,490
so the exercise will consist of checking if a word is part of the vocabulary.
3

4
00:00:16,620 --> 00:00:24,810
So we've got 2500 words in the vocabulary and what I'd like to do is I'd like you to write the code
4

5
00:00:25,230 --> 00:00:31,200
to check if a particular word is in this vocabulary. As a challenge,
5

6
00:00:31,230 --> 00:00:39,210
can you write a line of code that checks if a particular word is part of the vocabulary? Your code should
6

7
00:00:39,210 --> 00:00:45,480
return True if the word is among the 2500 words that comprise the vocabulary and
7

8
00:00:45,480 --> 00:00:47,800
False otherwise.
8

9
00:00:47,970 --> 00:00:51,370
And I'd like you to check the following words individually:
9

10
00:00:51,420 --> 00:00:59,330
'machine', 'learning', 'fun', 'learn', 'data', 'science', 'app' and 'brewery'.
10

11
00:00:59,560 --> 00:01:06,810
I'll give you a few seconds to pause the video and give this a shot.
11

12
00:01:06,870 --> 00:01:08,690
Did you have a go?
12

13
00:01:09,220 --> 00:01:18,040
Well first, let me show you the more inefficient way. You can use our dataframe of vocabulary words and
13

14
00:01:18,190 --> 00:01:27,820
select the "VOCAB_WORD" column and then check if any of these words is equal to the word 'machine'.
14

15
00:01:28,600 --> 00:01:34,900
So if I hit Shift+Enter on this I'll get a huge, huge result
15

16
00:01:35,710 --> 00:01:41,850
and because I get a whole series of booleans I can't check these all individually.
16

17
00:01:41,860 --> 00:01:46,480
So what I would do then is wrap this in the "any" function.
17

18
00:01:47,230 --> 00:01:49,350
So in this case I can find out,
18

19
00:01:49,420 --> 00:01:58,530
yes, the word 'machine' is amongst the vocabulary words that are most frequent. Now there's actually a better
19

20
00:01:58,530 --> 00:02:03,040
way because we've already learned about sets.
20
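
As a rough sketch of the inefficient approach just described: the snippet below assumes a dataframe named vocab with a "VOCAB_WORD" column, as in the lesson, but fills it with a handful of made-up words rather than the real 2500-word vocabulary.

```python
import pandas as pd

# Toy stand-in for the lesson's vocabulary dataframe (hypothetical words,
# not the real 2500-word vocabulary).
vocab = pd.DataFrame({'VOCAB_WORD': ['machine', 'learn', 'fun', 'data']})

# Element-wise comparison: produces one boolean per row of the column.
matches = vocab.VOCAB_WORD == 'machine'
print(matches.tolist())

# inefficient: wrap the comparison in any() to collapse it to one answer.
is_present = any(matches)
print(is_present)
```

Every word in the column gets compared on each check, which is why this counts as the slower option.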

21
00:02:03,060 --> 00:02:10,590
So I'll just add a little comment here and I'll write "inefficient". The better way of doing this, of checking
21

22
00:02:10,590 --> 00:02:21,420
membership in a collection is to use Python sets. I can convert our vocabulary words to a set by
22

23
00:02:21,420 --> 00:02:30,840
using the built-in "set" function and then feeding in our column "vocab.VOCAB_WORD".
23

24
00:02:30,870 --> 00:02:34,330
This converts our vocabulary to a set
24

25
00:02:34,770 --> 00:02:42,660
and this is very, very efficient at checking membership. And the way you would do this is using this "in" keyword.
25

26
00:02:42,690 --> 00:02:50,940
So if I have the word 'machine' in single quotes followed by the 'in' keyword and then our set here, I get
26

27
00:02:50,940 --> 00:02:54,760
a much, much better way of checking for membership.
27

28
00:02:54,930 --> 00:02:59,190
And this quite frankly is the better way.
28
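
A minimal sketch of the set-based check, under the assumption that a plain list of made-up words stands in for the "vocab.VOCAB_WORD" column so the snippet runs on its own. In the lesson's notebook the set would be built from the column itself.

```python
# Toy stand-in for the vocab.VOCAB_WORD column (hypothetical words).
vocab_words = ['machine', 'learn', 'fun', 'data', 'science', 'app']

# Convert the words to a set once; in the notebook this would be
# set(vocab.VOCAB_WORD).
word_set = set(vocab_words)

# The "in" keyword then answers the membership question directly.
print('machine' in word_set)
print('brewery' in word_set)
```

The conversion to a set happens once, and each later check is a single hash lookup instead of a scan over every word.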

29
00:02:59,190 --> 00:03:04,910
Now if you're just executing one line of code, you might be like "Well, why is this inefficient?", right.
29

30
00:03:04,920 --> 00:03:10,580
Why is using this "any" and this logical condition here inefficient?
30

31
00:03:10,580 --> 00:03:14,040
And the thing is - yes, you wouldn't actually notice in this case, right.
31

32
00:03:14,080 --> 00:03:14,610
It's
32

33
00:03:14,690 --> 00:03:17,270
a very, very quick thing to execute.
33

34
00:03:17,310 --> 00:03:22,740
But if you're running a loop, if you're running something that runs thousands and thousands of times,
34

35
00:03:23,190 --> 00:03:31,020
then you will actually see a massive difference in the compute time using sets versus another type
35

36
00:03:31,020 --> 00:03:38,670
of data structure. Going back up to our list, let's see which words were actually among the 2500.
36
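
To make the point about repeated checks concrete, here is a hedged timing sketch using the standard-library timeit module. The 2500 placeholder words, the function names and the repeat count are all assumptions for illustration, and the actual timings will vary by machine.

```python
import timeit

# 2500 placeholder words standing in for the vocabulary (hypothetical data).
words = [f'word{i}' for i in range(2500)]
word_set = set(words)

# Last element: the worst case for a scan that walks the whole list.
target = 'word2499'

def scan_lookup():
    # Compares the target against the words one by one.
    return any(w == target for w in words)

def set_lookup():
    # A single hash lookup in the set.
    return target in word_set

# Repeat each check many times, as a loop in real code would.
scan_time = timeit.timeit(scan_lookup, number=2000)
set_time = timeit.timeit(set_lookup, number=2000)

print(f'scan: {scan_time:.4f}s, set: {set_time:.4f}s')
```

A single check is fast either way; the gap only shows up once the check is repeated thousands of times.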

37
00:03:38,810 --> 00:03:41,200
So 'machine' is True.
37

38
00:03:41,250 --> 00:03:44,290
What about 'learning'? 'learning' is False,
38

39
00:03:44,310 --> 00:03:44,550
right.
39

40
00:03:44,550 --> 00:03:51,630
So 'learning' is not amongst the words, but that could be because of the stemming, right.
40

41
00:03:51,780 --> 00:04:03,170
So if I write 'learn' instead, then it is indeed included. The word 'fun' is as well. The word 'data' is as well.
41

42
00:04:04,250 --> 00:04:13,010
The word 'science' is as well. The word 'app' is also among the 2500 most common words in our data set. And the
42

43
00:04:13,010 --> 00:04:16,750
word 'brewery' is not.
43

44
00:04:16,790 --> 00:04:17,010
Yeah.
44

45
00:04:17,030 --> 00:04:18,770
So this one is not included.
45

46
00:04:18,770 --> 00:04:20,810
'brewer' also not,
46

47
00:04:21,110 --> 00:04:25,680
and 'brew' also not. Nobody uses this word.
47
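
Instead of checking the words one at a time, the whole list could be checked in one go. This is only a sketch: vocab_set below is a small made-up set chosen to mirror the results reported above, not the real vocabulary, and the variable names are hypothetical.

```python
# Toy vocabulary (hypothetical); the real set would come from vocab.VOCAB_WORD.
vocab_set = {'machine', 'learn', 'fun', 'data', 'science', 'app'}

# The words from the challenge.
words_to_check = ['machine', 'learning', 'fun', 'learn',
                  'data', 'science', 'app', 'brewery']

# Map each word to True/False depending on set membership.
results = {word: word in vocab_set for word in words_to_check}

for word, found in results.items():
    print(f'{word!r}: {found}')
```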

48
00:04:25,790 --> 00:04:26,120
All right.
48

49
00:04:26,120 --> 00:04:33,320
So that completes the challenge. And the rationale for putting these challenges always into the lessons
49

50
00:04:33,770 --> 00:04:37,190
is that programming is really like a sport.
50

51
00:04:37,280 --> 00:04:37,630
Right.
51

52
00:04:37,640 --> 00:04:44,030
You can't really read about it to get good and you can't just copy code to get good at it.
52

53
00:04:44,060 --> 00:04:50,750
You actually have to do it and it's kind of similar to how nobody really reads the book on how to surf and
53

54
00:04:50,750 --> 00:04:57,220
then jumps on a surfboard and, you know, surfs around and knows how to surf, doesn't work.
54

55
00:04:57,230 --> 00:05:02,270
Now it seems obvious with the surfing example, but with programming you really have to think about it
55

56
00:05:02,270 --> 00:05:03,530
in the same way.
56

57
00:05:03,800 --> 00:05:06,920
And that's why I've got another exercise for you.
57

58
00:05:06,920 --> 00:05:12,560
This one will fall into the realm of data exploration. I'll see you in the next lesson.