1 00:00:01,120 --> 00:00:02,550 Welcome back. 2 00:00:02,550 --> 00:00:04,610 So what is data? 3 00:00:04,710 --> 00:00:10,440 I mean, we've talked about it already and we should have an idea of what data is. 4 00:00:10,620 --> 00:00:14,300 And we also know why businesses care about data, right? 5 00:00:14,370 --> 00:00:22,440 We live in a world now where data is valuable, and some of the most valuable companies are valuable because 6 00:00:22,500 --> 00:00:24,400 they collect data. 7 00:00:24,420 --> 00:00:30,450 Companies like Facebook and Google sell data to make money. 8 00:00:30,450 --> 00:00:36,720 They provide products for customers by using the data to improve their products. 9 00:00:36,720 --> 00:00:45,150 Data is the new gold, so businesses care because, by understanding their data, it could be part of the 10 00:00:45,150 --> 00:00:50,540 product, such as creating a YouTube recommendation engine. 11 00:00:50,670 --> 00:01:00,090 We can ask if the company is doing OK, to monitor the company's sales and statistics, and you can also use 12 00:01:00,090 --> 00:01:04,120 data to improve the company and ask: can we do better? 13 00:01:04,140 --> 00:01:05,610 What can we improve on? 14 00:01:05,820 --> 00:01:12,180 How can we make our apps faster? So businesses care a lot about data. 15 00:01:12,240 --> 00:01:20,340 But I want to take a bit of time to really understand what data engineers do by manipulating or collecting 16 00:01:20,340 --> 00:01:29,470 this data. So let's talk about the four different types of data. We learned that there's something called 17 00:01:29,480 --> 00:01:31,280 structured data. 18 00:01:31,280 --> 00:01:38,510 Now this is data that is actually organized well for us to understand, because data gets collected 19 00:01:38,510 --> 00:01:41,560 in different ways from different sources. 20 00:01:41,570 --> 00:01:42,180 Right?
21 00:01:42,230 --> 00:01:50,150 But once on a computer there's a standard way, usually in a table, or what we call a matrix, a table that 22 00:01:50,150 --> 00:01:51,560 makes it easy to read. 23 00:01:51,890 --> 00:02:00,050 Attributes are usually columns and rows are usually instances, and sometimes we have the outputs in there, in the 24 00:02:00,110 --> 00:02:00,410 rightmost 25 00:02:00,410 --> 00:02:07,850 column, like we saw with machine learning, where we predict an output based on the inputs. And structured 26 00:02:07,850 --> 00:02:16,790 data usually comes from things such as relational databases like MySQL, Postgres and other relational 27 00:02:16,790 --> 00:02:21,460 databases that allow us to do something called SQL queries. 28 00:02:21,800 --> 00:02:30,800 That is, to perform actions on the structured data to collect information. Then there's semi-structured 29 00:02:30,800 --> 00:02:31,720 data. 30 00:02:31,920 --> 00:02:39,710 It's still structured data, but often in something like XML, CSV, which we've seen, or JSON form, 31 00:02:40,960 --> 00:02:47,970 and these are just different extensions and different ways to store data in different files. For example, 32 00:02:48,510 --> 00:02:58,110 a really popular website to grab data that's freely open online and find datasets, maybe even do competitions 33 00:02:58,140 --> 00:03:02,870 or see other people's notebooks and talk about machine learning topics: 34 00:03:03,090 --> 00:03:12,090 well, Kaggle allows us to actually look at datasets that are available for us to use and play with. And 35 00:03:12,180 --> 00:03:19,270 in here, if I type something like XML, it'll give me all of the information, or all the data, 36 00:03:19,290 --> 00:03:22,760 let's make this bigger, that are in XML format. 37 00:03:23,490 --> 00:03:31,390 So for example if I click on blood cell images and I take a look over here,
38 00:03:31,510 --> 00:03:37,890 I see that in the data sources there are some CSV files but there are also some XML files. 39 00:03:37,990 --> 00:03:40,680 And if I scroll down there are also some images as well. 40 00:03:40,990 --> 00:03:45,440 So different formats of data that we can use. 41 00:03:45,550 --> 00:03:52,570 By the way, if you click on filter here you'll actually see some of the file types available for data. 42 00:03:52,570 --> 00:03:54,360 Now we know about CSVs. 43 00:03:54,370 --> 00:04:03,940 There's also JSON, where JSON, if we take a look, and let's go to this beer styles one, if I scroll down 44 00:04:04,000 --> 00:04:07,150 we see that it's in JSON format, and JSON format, 45 00:04:07,150 --> 00:04:12,930 you can see, has this tree-like structure to explore data. 46 00:04:13,030 --> 00:04:15,450 So just a little bit different than CSV. 47 00:04:15,550 --> 00:04:23,650 Just another way to organize data. Like I've mentioned, on Kaggle you'll find all these types of datasets 48 00:04:24,040 --> 00:04:25,160 using different things. 49 00:04:25,240 --> 00:04:31,770 SQLite, for example, is a type of database, and using SQL you also have structured data. 50 00:04:31,780 --> 00:04:38,710 So if I do SQLite over here and search for something using relational databases, 51 00:04:43,640 --> 00:04:51,060 you'll see that this uses SQL and it tells us what kind of columns it has. 52 00:04:51,060 --> 00:04:57,900 So lots of ways to explore Kaggle, but these two types of data are usually well organized and easy for 53 00:04:57,900 --> 00:05:05,190 us to manipulate with tools such as pandas for CSVs, and you can actually do SQL as well 54 00:05:05,190 --> 00:05:06,390 with pandas. 55 00:05:06,390 --> 00:05:09,090 The third kind is unstructured data. 56 00:05:09,960 --> 00:05:12,220 And Daniel has already mentioned this. 57 00:05:12,270 --> 00:05:12,900 Right?
58 00:05:12,960 --> 00:05:20,160 This idea of unstructured data: well, they're not usually in simple formats that we can really 59 00:05:20,160 --> 00:05:22,570 manipulate and analyze easily. 60 00:05:22,620 --> 00:05:31,050 These are often things such as emails or PDFs or other sorts of documents where perhaps actually understanding 61 00:05:31,050 --> 00:05:33,390 them is a little bit more difficult. 62 00:05:33,600 --> 00:05:34,800 It's unstructured. 63 00:05:34,830 --> 00:05:37,860 Finally we have binary data. 64 00:05:38,160 --> 00:05:44,330 These are things such as audio files, image files, video files; they're in binary. 65 00:05:44,330 --> 00:05:47,220 That is, ones and zeros that computers can understand. 66 00:05:47,250 --> 00:05:51,270 But for us to categorize them, that's really, really difficult. 67 00:05:51,270 --> 00:06:00,250 As you can see it goes from top to bottom, from really organized, well-labeled data to something that 68 00:06:00,250 --> 00:06:05,620 is a little bit harder to organize, especially for machines. 69 00:06:05,620 --> 00:06:08,260 So why do we have to talk about these four types of data? 70 00:06:09,100 --> 00:06:17,220 Well, because most businesses, the bigger they get, aren't only going to have structured data; they're going 71 00:06:17,220 --> 00:06:24,300 to have all sorts of data, all sorts of inputs that need to somehow all work together. 72 00:06:24,300 --> 00:06:32,070 So one of the jobs of a data engineer is to essentially use the fact that there are all these types of 73 00:06:32,070 --> 00:06:39,300 data and somehow combine them or organize them in a way that is useful to the business. 74 00:06:39,300 --> 00:06:41,790 Let's take a break here and explore some more in the next video.
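
Companion sketch (not part of the spoken video): the lesson says structured data can be queried with SQL, that CSV and JSON can be handled with pandas, and that a data engineer combines these sources into something useful. The Python below is one possible illustration of that idea, not the instructor's own code. The `sales` table, the customer names, and the `profile.country` field are all made-up sample data, and an in-memory SQLite database stands in for a real relational database.

```python
import io
import json
import sqlite3

import pandas as pd

# Structured data: a relational table queried with SQL.
# Assumption: an in-memory SQLite database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 20.0), (2, 35.5), (1, 4.5)],
)
conn.commit()
sales = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM sales GROUP BY customer_id ORDER BY customer_id",
    conn,
)
conn.close()

# Semi-structured data, CSV: rows and columns stored as plain text.
# Assumption: hypothetical customer records, read from a string instead of a file.
csv_text = "customer_id,name\n1,Ada\n2,Grace\n"
customers = pd.read_csv(io.StringIO(csv_text))

# Semi-structured data, JSON: a tree-like structure with nested fields.
# json_normalize flattens the nested "profile" object into a "profile.country" column.
json_text = (
    '[{"customer_id": 1, "profile": {"country": "UK"}},'
    ' {"customer_id": 2, "profile": {"country": "US"}}]'
)
profiles = pd.json_normalize(json.loads(json_text))

# Combine the three sources into one table keyed on customer_id.
combined = customers.merge(profiles, on="customer_id").merge(sales, on="customer_id")
print(combined)
```

Unstructured and binary data (emails, PDFs, images, audio) are left out of this sketch because, as the lesson notes, they do not map onto a table as directly and usually need extra processing first.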