0 1 00:00:00,650 --> 00:00:05,960 In this lesson we're gonna be looking at variables again, only this time these variables are going 1 2 00:00:05,960 --> 00:00:12,400 to be containing lots of data, instead of just a single number or a single piece of text. 2 3 00:00:12,440 --> 00:00:20,270 In other words, we're going to introduce you to collections. And by collections we mean lists, arrays, 3 4 00:00:20,300 --> 00:00:22,640 data frames and series. 4 5 00:00:22,940 --> 00:00:29,270 In the last lesson we learned about how to store data inside variables and we learned about data types. 5 6 00:00:29,930 --> 00:00:31,910 In machine learning and data science, 6 7 00:00:31,940 --> 00:00:33,550 we often need to work with, 7 8 00:00:33,710 --> 00:00:35,460 well, a lot of data. 8 9 00:00:35,630 --> 00:00:42,650 As such it would be silly to put each piece of data inside a separate variable. 9 10 00:00:42,800 --> 00:00:49,430 The alternative to assigning a single value to a single variable, like this, is that we have a single 10 11 00:00:49,430 --> 00:00:57,800 variable that actually contains lots of data and that variable will represent a collection. One of the 11 12 00:00:57,800 --> 00:01:02,960 most common types of collections that you'll encounter is the list. 12 13 00:01:02,960 --> 00:01:07,490 Lists are built into Python; they're part of the language. 13 14 00:01:07,520 --> 00:01:09,950 Let me show you what a list looks like. 14 15 00:01:09,950 --> 00:01:19,280 I'm going to create a list called 'primeNumbers' and set it equal to a collection of 5 primes. 15 16 00:01:19,490 --> 00:01:23,670 And the way to do that is by using the square brackets syntax. 
16 17 00:01:23,750 --> 00:01:32,650 I'm going to open a square bracket and then give my first value - 3, then put in a comma, give my second 17 18 00:01:32,650 --> 00:01:34,760 value - say 7, 18 19 00:01:34,990 --> 00:01:36,230 another comma, 19 20 00:01:36,460 --> 00:01:45,450 third value - 61, comma 29 and 199. And, just like that, we've created 20 21 00:01:45,540 --> 00:01:52,110 a variable called primeNumbers that holds onto a collection of five values. 21 22 00:01:52,140 --> 00:01:59,670 Now, just out of curiosity, let's take a look at what the type of the primeNumbers variable is. 22 23 00:01:59,670 --> 00:02:06,540 type(primeNumbers), Shift+Enter and we see that it's a list. 23 24 00:02:06,600 --> 00:02:12,020 Now one thing that I'll mention is that a list in Python, you know, doesn't have to be numbers, right. 24 25 00:02:12,140 --> 00:02:14,580 A list can contain all sorts of data. 25 26 00:02:14,730 --> 00:02:17,990 For example, here's what a list of strings would look like. 26 27 00:02:18,330 --> 00:02:22,830 coolPeople = 27 28 00:02:22,830 --> 00:02:35,920 ['Jay Z', 'Gandhi', 'me'] 28 29 00:02:35,940 --> 00:02:36,720 There we go. 29 30 00:02:37,740 --> 00:02:43,560 The important thing to note is how we substituted Jay-Z for Kevin Spacey in this list. 30 31 00:02:43,630 --> 00:02:50,350 In any case, a list is just a variable that holds on to a bunch of values and it could even hold on to 31 32 00:02:50,350 --> 00:02:53,380 values of different types. 32 33 00:02:53,380 --> 00:03:00,470 Check this out: primesAndPeople = 33 34 00:03:00,910 --> 00:03:12,340 ['King Arthur', 17, 11, 'Jennifer Lopez'] 34 35 00:03:12,340 --> 00:03:18,190 Now even though we've got three different lists; one with whole numbers, one with text and one with a 35 36 00:03:18,190 --> 00:03:24,550 mix of different types of data, all of these three variables are actually of the same type - they're of 36 37 00:03:24,550 --> 00:03:26,410 type list. 
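To make the three lists from this part of the lesson concrete, here is a minimal sketch in Python. The variable names and values are the ones typed in the video; the print calls are only added here so you can see the type of each variable.

```python
# A list of whole numbers, written with the square brackets syntax
primeNumbers = [3, 7, 61, 29, 199]

# A list of strings
coolPeople = ['Jay Z', 'Gandhi', 'me']

# A list that mixes strings and whole numbers
primesAndPeople = ['King Arthur', 17, 11, 'Jennifer Lopez']

# All three variables have the same type: list
print(type(primeNumbers))
print(type(coolPeople))
print(type(primesAndPeople))
```

What goes inside the brackets doesn't change the type of the variable: each one is still a list.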
37 38 00:03:26,650 --> 00:03:31,750 So now that we've covered the syntax for creating a list, I'm going to show you something even cooler 38 39 00:03:31,810 --> 00:03:39,650 than storing data in lists - and that's retrieving data from a list. So the question is how do you pull 39 40 00:03:39,650 --> 00:03:43,420 out a particular value from one of these lists? 40 41 00:03:43,490 --> 00:03:44,920 Here's the secret. 41 42 00:03:45,110 --> 00:03:51,200 The way we grab a particular item from the list is by that item's index. 42 43 00:03:51,230 --> 00:03:53,220 What do I mean by index? 43 44 00:03:53,270 --> 00:03:56,660 The index is just the item's position in the list. 44 45 00:03:56,690 --> 00:03:58,220 Let me show you what I mean. 45 46 00:03:58,460 --> 00:04:07,730 If I type primeNumbers, and open the square brackets; between these two square brackets I can put a number 46 47 00:04:08,810 --> 00:04:11,950 and that number is the value of the index. 47 48 00:04:11,960 --> 00:04:20,150 So, say I put in the number two and hit Shift+Enter, I get back the value 61, but 61 is the 48 49 00:04:20,150 --> 00:04:22,470 third item in this list. 49 50 00:04:22,670 --> 00:04:23,870 Why is it the third item? 50 51 00:04:24,440 --> 00:04:32,090 Well it turns out programmers really, really, really, really like to start counting from zero. 51 52 00:04:32,180 --> 00:04:40,310 In other words that first item, number 3, is at position 0, 7 is at position 1 and 61 is at position 52 53 00:04:40,310 --> 00:04:41,630 2. 53 54 00:04:41,720 --> 00:04:48,040 Now oftentimes what we're gonna be doing is storing the value that we pull out from a list in another 54 55 00:04:48,050 --> 00:04:50,960 variable - so we can do something like this: 55 56 00:04:51,200 --> 00:04:59,110 bestPrimeEver = primeNumbers[4] 56 57 00:04:59,260 --> 00:05:07,010 In this case we're pulling out one of the numbers from the list and storing it in a variable called 57 58 00:05:07,280 --> 00:05:09,180 bestPrimeEver. 
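Here is a small sketch of that zero-based indexing, reusing the primeNumbers list from the lesson. The name thirdPrime and the loop at the end are made up for this illustration and aren't part of the video.

```python
primeNumbers = [3, 7, 61, 29, 199]

# Indexes start at 0, so index 2 is the third item in the list
thirdPrime = primeNumbers[2]  # thirdPrime is a hypothetical name

# Store the item at index 4 (the fifth and last item) in its own variable
bestPrimeEver = primeNumbers[4]

# Print every index next to the value stored at that position
for index, value in enumerate(primeNumbers):
    print(index, value)
```

The loop is just a convenient way to see each position next to its value, starting from 0.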
58 59 00:05:09,340 --> 00:05:14,150 Now, I know you might have a different favorite prime number too, but I'm sorry to say it's not going 59 60 00:05:14,150 --> 00:05:16,520 to be as good as 199. 60 61 00:05:16,520 --> 00:05:18,940 Let me give you another example of a list. 61 62 00:05:18,980 --> 00:05:27,890 Say we had a list of eggs or egg objects and we want to pull out the egg at index 1, which egg 62 63 00:05:27,890 --> 00:05:29,390 would we be pulling out? 63 64 00:05:29,660 --> 00:05:33,760 Which egg is my egg equal to? 64 65 00:05:33,820 --> 00:05:39,230 The answer is we would be grabbing the second egg in the list. 65 66 00:05:39,550 --> 00:05:40,760 The spotted one. 66 67 00:05:41,080 --> 00:05:46,950 Again, this is because programmers start counting from zero. 67 68 00:05:47,080 --> 00:05:53,680 Now the reason I'm belaboring this point so much is because forgetting to count from zero is super common. 68 69 00:05:53,890 --> 00:05:58,850 And these off by 1 errors will also crash your Python program. 69 70 00:05:58,920 --> 00:06:07,130 Check out what happens when we write primeNumbers[5] and hit Shift+Enter. 70 71 00:06:07,200 --> 00:06:15,430 We get an error that reads "index out of range" and that's because we're trying to select an item beyond 71 72 00:06:15,580 --> 00:06:17,230 the length of the list. 72 73 00:06:17,380 --> 00:06:28,480 Our list has five items and we're trying to select the item at position 0, 1, 2, 3, 4, 5. The item at position 73 74 00:06:28,480 --> 00:06:38,560 5 is the sixth one on the list. Since there is no sixth item, our Python program goes down in flames - "index 74 75 00:06:38,560 --> 00:06:40,000 out of range". 75 76 00:06:40,180 --> 00:06:48,140 So if you ever see these cursed words in your crash logs, now you know why. Let's change that five back 76 77 00:06:48,140 --> 00:06:55,950 to a four. Now lists are pretty common in Python and we'll be working with them a lot. 
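The out-of-range crash described above can be reproduced safely with a sketch like this one. The try/except wrapper and the names lastIndex and errorMessage are assumptions added for illustration; in the video the error simply appears below the notebook cell.

```python
primeNumbers = [3, 7, 61, 29, 199]

# Five items means the valid indexes run from 0 to 4
lastIndex = len(primeNumbers) - 1  # hypothetical helper name

errorMessage = None
try:
    primeNumbers[5]  # there is no sixth item, so this raises an IndexError
except IndexError as error:
    errorMessage = str(error)
    print(errorMessage)
```

Checking an index against len(primeNumbers) - 1 before using it is one simple way to avoid this kind of off-by-one mistake.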
77 78 00:06:56,090 --> 00:07:02,510 However, we'll be encountering quite a few other kinds of collections too and a lot of them are very, 78 79 00:07:02,510 --> 00:07:10,760 very similar to the list but they are classified as a different data type because there are subtle differences. 79 80 00:07:11,780 --> 00:07:14,490 In scientific computing and machine learning, 80 81 00:07:14,510 --> 00:07:22,320 we're going to be working with another data structure very, very often - namely the array. An array is like 81 82 00:07:22,320 --> 00:07:23,850 the list's stepbrother. 82 83 00:07:23,950 --> 00:07:28,440 He's somewhat similar to his sister, but he's got a different dad. 83 84 00:07:28,440 --> 00:07:31,290 You see the Python list is like a ballerina. 84 85 00:07:31,290 --> 00:07:34,160 She's flexible, friendly, versatile. 85 86 00:07:34,350 --> 00:07:40,140 She can grow and shrink in size and she can hold onto different types of data like numbers and text 86 87 00:07:40,230 --> 00:07:45,010 in the same list. The array on the other hand is a weightlifter. 87 88 00:07:45,120 --> 00:07:50,430 He's strong and he's very good at particular tasks but he's less versatile. 88 89 00:07:50,430 --> 00:07:55,860 He's also so buff that if he tried to punch you he couldn't reach around his own pecs, but that's beside 89 90 00:07:55,860 --> 00:07:56,900 the point. 90 91 00:07:56,910 --> 00:08:03,090 All the data in an array must be of the same type, so an array must be all numbers or all strings, 91 92 00:08:03,090 --> 00:08:09,250 for example. Now, believe it or not, you've actually already encountered an array. 92 93 00:08:09,310 --> 00:08:15,730 Remember how we were looking at the regression intercept when we were estimating the movie revenue? Below 93 94 00:08:15,730 --> 00:08:16,390 the cell, 94 95 00:08:16,390 --> 00:08:24,340 we saw the output array([-7236000]). 
95 96 00:08:25,140 --> 00:08:26,100 In this output, 96 97 00:08:26,110 --> 00:08:29,640 you even see the word array so we know that we're working with one. 97 98 00:08:29,890 --> 00:08:31,610 But let's make this more explicit. 98 99 00:08:31,630 --> 00:08:35,130 Let's check the type of 99 100 00:08:35,290 --> 00:08:42,110 regr.intercept_ and hit Shift+Enter. Now, if you see this error, 100 101 00:08:42,130 --> 00:08:49,170 just like I do, it's because I haven't actually run any of the Python code in the prior cells so I'm 101 102 00:08:49,170 --> 00:08:57,040 gonna go to Cell > Run All and my notebook will know about the regression intercept so that now it will 102 103 00:08:57,100 --> 00:09:06,880 correctly evaluate this cell and what we discover is that the full name of this data type is 103 104 00:09:07,240 --> 00:09:10,510 numpy.ndarray 104 105 00:09:10,510 --> 00:09:15,150 The cool thing is that working with arrays is actually very similar to working with lists. 105 106 00:09:15,490 --> 00:09:18,520 In other words we can use that square brackets syntax. 106 107 00:09:18,520 --> 00:09:29,890 So if I write regr.intercept_[0] I'm pulling out the first and only 107 108 00:09:29,890 --> 00:09:33,140 value from our array. Hitting Shift+Enter, 108 109 00:09:33,190 --> 00:09:37,200 I see this value printed below the cell. Up here, 109 110 00:09:37,240 --> 00:09:42,910 in contrast, I was looking at the entire array, but in this cell I'm just looking at one of the values 110 111 00:09:43,120 --> 00:09:50,720 inside of this array and I'm accessing this value with the square brackets notation providing an index 111 112 00:09:51,020 --> 00:09:53,160 at value 0. 112 113 00:09:53,400 --> 00:10:00,980 Okay, so we learnt about lists and arrays in Python. Both of them use the square bracket notation and 113 114 00:10:00,980 --> 00:10:06,470 through both of them we can access individual items using an index. 
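As a rough sketch of how arrays behave, here is an example that doesn't need the regression notebook. It assumes NumPy is installed. The variable intercept is a hypothetical stand-in for regr.intercept_, built by hand with the value shown in the lesson, and mixedList and mixedArray are names made up for this illustration.

```python
import numpy as np

# Hypothetical stand-in for regr.intercept_: a one-element NumPy array
intercept = np.array([-7236000.0])

# Check the data type of the variable, as done in the notebook
print(type(intercept))

# Square brackets with index 0 pull out the first and only value,
# exactly like indexing a list
firstValue = intercept[0]
print(firstValue)

# A list can mix text and numbers, but an array stores a single type,
# so NumPy converts the number to text when building the array
mixedList = ['King Arthur', 17]
mixedArray = np.array(mixedList)
print(mixedArray.dtype)
```

The last part illustrates the weightlifter's lack of flexibility: the mixed list keeps 17 as a number, while the array ends up holding only text.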
114 115 00:10:06,680 --> 00:10:12,860 The index is just the position of the item in the list starting from zero. 115 116 00:10:12,860 --> 00:10:18,530 And this brings me to a very, very important point - namely do you know why the programmer quit his job? 116 117 00:10:20,420 --> 00:10:22,210 It's because he didn't get arrays...