1 00:00:00,560 --> 00:00:03,110 Now we've figured out what problem we're trying to solve. 2 00:00:03,110 --> 00:00:07,730 We've matched our specific problem to one of the different types of machine learning problem. 3 00:00:07,730 --> 00:00:10,640 It's time to have a look at what data we have. 4 00:00:10,700 --> 00:00:16,350 As you may have guessed, the question we're trying to answer here is: what kind of data do we have? 5 00:00:16,370 --> 00:00:23,180 Data comes in many different shapes and sizes, but the two main types are structured and unstructured. 6 00:00:23,960 --> 00:00:29,030 Structured data is something you'd expect to see in an Excel file, such as rows and columns of different 7 00:00:29,030 --> 00:00:36,390 patient medical records and whether or not they have heart disease, or customer purchase transactions. 8 00:00:36,440 --> 00:00:42,920 It's called structured data because all of the samples, the different patient records, are typically in 9 00:00:42,920 --> 00:00:50,450 a similar format, meaning one column might contain numbers of a certain type, such as average blood pressure 10 00:00:50,570 --> 00:00:58,190 or sex or weight of a patient, and another column might have whether they have chest pain or not and 11 00:00:58,190 --> 00:01:00,640 what the level of intensity is. 12 00:01:00,680 --> 00:01:08,870 Unstructured data are things like images, natural language text such as transcribed phone calls, videos 13 00:01:09,110 --> 00:01:15,120 and audio files. Although we can turn these into numbers and create structure, 14 00:01:15,120 --> 00:01:17,960 they typically come in many varying formats. 15 00:01:18,000 --> 00:01:24,180 One picture of a dog may look completely different to another image of a dog, and the emails you write 16 00:01:24,180 --> 00:01:29,760 back and forth with a friend may have a completely different structure to the emails you'd write to 17 00:01:29,760 --> 00:01:31,150 a co-worker.
18 00:01:31,170 --> 00:01:37,590 Now, within these two data types, there's static and streaming data. Static data is data which doesn't 19 00:01:37,590 --> 00:01:39,170 change over time. 20 00:01:39,300 --> 00:01:45,690 You may have a spreadsheet of patient records in .csv format, which stands for comma-separated 21 00:01:45,690 --> 00:01:52,360 values, which simply means all of the different data is in one file, separated by commas. 22 00:01:52,500 --> 00:01:53,890 It looks like this. 23 00:01:53,940 --> 00:01:55,680 If you check this table, we've got ID, 24 00:01:55,680 --> 00:01:56,720 comma, weight, 25 00:01:56,760 --> 00:02:03,030 comma, sex. And if we were to read that into a data frame using a tool like pandas (we'll have a 26 00:02:03,030 --> 00:02:04,690 look at this in a future lesson), 27 00:02:04,800 --> 00:02:06,510 it would look something like this. 28 00:02:06,540 --> 00:02:11,970 So a lot of data you'll actually come across comes in a simple format like this, 29 00:02:11,970 --> 00:02:18,150 but to turn it into something a little bit more structured, you can convert it to this. 30 00:02:18,290 --> 00:02:21,720 Now, CSV is one of the most common types of static data formats. 31 00:02:21,800 --> 00:02:24,920 We're going to get very used to this by the end of the course. 32 00:02:25,140 --> 00:02:31,890 And since these values won't really change over time, they're called static. Usually, what you'll want 33 00:02:31,980 --> 00:02:35,120 is a lot of these examples. In machine learning, 34 00:02:35,130 --> 00:02:38,660 there's a saying: the more data, the better. 35 00:02:38,850 --> 00:02:44,520 Which makes sense if you think about it: the more examples you have of something, such as the inputs and 36 00:02:44,610 --> 00:02:52,140 outputs of patient records, where the inputs are a patient's body parameters and the outputs are whether 37 00:02:52,140 --> 00:02:54,260 they have heart disease or not,
38 00:02:54,540 --> 00:02:58,090 the more chances you'll have to find patterns between them. 39 00:02:58,110 --> 00:03:00,390 The same goes for machine learning algorithms. 40 00:03:00,390 --> 00:03:07,380 The more examples they can look at, the more chance they have of finding patterns and thus using those 41 00:03:07,380 --> 00:03:10,160 patterns to predict something in the future, 42 00:03:10,260 --> 00:03:14,970 like whether a new patient who comes along, who isn't in this table, has heart disease or 43 00:03:14,970 --> 00:03:15,240 not. 44 00:03:17,070 --> 00:03:20,830 Streaming data is data which is constantly changing over time. 45 00:03:20,880 --> 00:03:26,430 For example, say you wanted to predict how a stock price will change based on news headlines. You'll be 46 00:03:26,430 --> 00:03:27,990 working with streaming data. 47 00:03:28,050 --> 00:03:34,380 Since news headlines are being updated constantly, you'll want to be the first to see how they change 48 00:03:34,380 --> 00:03:42,790 stocks. Most of the work you will do in practice will start on static data, and then, if your data analysis 49 00:03:42,850 --> 00:03:48,580 and machine learning efforts show some insights, you'll move towards streaming data for when 50 00:03:48,580 --> 00:03:51,470 you go to deployment or into production. 51 00:03:51,910 --> 00:03:58,510 A common data science workflow begins by opening a CSV file in a Jupyter notebook, a tool for building 52 00:03:58,510 --> 00:04:05,830 machine learning projects, then exploring the data and performing data analysis using pandas, a Python 53 00:04:05,830 --> 00:04:12,790 library for data analysis, and making visualizations such as graphs and comparing different data points 54 00:04:12,790 --> 00:04:21,280 using Matplotlib, then building machine learning models on the data using scikit-learn, such as a machine 55 00:04:21,280 --> 00:04:25,060 learning model to predict, using these patterns here,
56 00:04:25,060 --> 00:04:31,790 whether or not a patient has heart disease. Don't worry if you're thinking, what's a Jupyter notebook? 57 00:04:32,110 --> 00:04:35,730 And pandas? What, are we at the zoo? 58 00:04:35,840 --> 00:04:40,750 We've got dedicated sections and projects coming up for each of these tools. 59 00:04:40,810 --> 00:04:46,030 For now, think about the different kinds of data you create or use every day. 60 00:04:46,030 --> 00:04:48,670 Are they structured or unstructured? 61 00:04:48,670 --> 00:04:49,710 How much data is there?
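
The table shown on screen during the CSV explanation is not captured in this transcript, so here is a minimal sketch of what comma-separated patient records might look like and how they turn into structured rows. The column names follow the ones read out in the lesson (ID, weight, sex); the values, the `CSV_TEXT` constant and the `read_patient_records` helper are made up for illustration. The lesson uses pandas for this step, whereas this sketch swaps in Python's built-in `csv` module so it runs without extra installs.

```python
import csv
import io

# Hypothetical CSV content: the first row is the header, each later row is one patient.
# The values are invented for illustration and do not come from the lesson.
CSV_TEXT = """id,weight,sex
1,80,male
2,65,female
3,72,male
"""


def read_patient_records(text):
    """Parse comma-separated text into a list of dicts, one dict per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


records = read_patient_records(CSV_TEXT)
for record in records:
    print(record)
```

Every row ends up with the same three fields, which is what makes the data structured. Note that the `csv` module reads every value as a string. With pandas installed, `pandas.read_csv` loads the same kind of file into a data frame.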