1 00:00:07,080 --> 00:00:11,250 In this section, we will cover the most important part of data analysis. 2 00:00:12,230 --> 00:00:14,270 This is called data preprocessing. 3 00:00:15,760 --> 00:00:21,910 As I told you in the introduction to Machine Learning, Section four, practicing data analysts spend nearly 4 00:00:21,940 --> 00:00:25,300 80 percent of their time on data preprocessing. 5 00:00:26,410 --> 00:00:31,390 And how well your model will perform eventually depends on how well you have prepared your data. 6 00:00:33,180 --> 00:00:37,110 So let's start with the first thing you should know when you are trying to solve a problem. 7 00:00:38,240 --> 00:00:41,060 You should know about the business context of that problem. 8 00:00:41,630 --> 00:00:47,810 You should know which factors are important and have an impact on the variables of interest. 9 00:00:49,480 --> 00:00:55,720 Only if we know the business will we be able to identify the relevant variables and gather the required data 10 00:00:55,810 --> 00:00:56,860 for the analysis. 11 00:00:58,290 --> 00:00:59,070 Remember this. 12 00:00:59,540 --> 00:01:03,090 The quality of your input will decide the quality of your output. 13 00:01:04,180 --> 00:01:07,840 Now, there are two primary ways to gather information about the business. 14 00:01:08,710 --> 00:01:14,830 One is to gather it yourself by talking to relevant people or stakeholders of that business, 15 00:01:15,460 --> 00:01:22,210 people who are impacted by the problem at hand. You can also gather information by actually doing 16 00:01:22,210 --> 00:01:23,320 the things yourself. 17 00:01:24,160 --> 00:01:28,780 For example, if you have to increase sales of a particular product, 18 00:01:29,960 --> 00:01:32,260 try to go and sell it yourself in the market. 19 00:01:33,490 --> 00:01:36,950 Go and try to buy it also, to understand the customer's perspective. 20 00:01:38,500 --> 00:01:43,390 Gathering information by yourself is called doing primary research. 
21 00:01:44,760 --> 00:01:50,480 In secondary research, you read or listen to the information gathered by others. 22 00:01:51,410 --> 00:01:58,310 Maybe there is some industry-related research report, or some study done by some consulting firm, or someone 23 00:01:58,310 --> 00:02:01,050 else has worked on the same problem before. 24 00:02:02,360 --> 00:02:06,710 We should look at the methodology and findings of those earlier works 25 00:02:06,800 --> 00:02:07,250 also. 26 00:02:09,910 --> 00:02:10,870 Let us take an example. 27 00:02:12,060 --> 00:02:15,880 Suppose you're part of a company which is online. 28 00:02:16,530 --> 00:02:19,650 And the problem you are facing is of cart abandonment. 29 00:02:19,920 --> 00:02:26,130 That is, a lot of your customers add products to their cart, but do not actually purchase them. 30 00:02:27,850 --> 00:02:30,940 This is a fairly common problem for online businesses. 31 00:02:33,410 --> 00:02:37,990 Now, if you are given the task to tackle this problem, where do you start? 32 00:02:40,070 --> 00:02:42,900 First, we should go and talk with the relevant teams involved. 33 00:02:45,050 --> 00:02:51,050 Usually there is a marketing team and a product team which cross-functionally handle the online sales 34 00:02:51,050 --> 00:02:51,560 process. 35 00:02:52,400 --> 00:02:53,650 We should go and talk to them. 36 00:02:55,800 --> 00:03:00,810 Then we should try purchasing from the site ourselves also, to understand what the customer journey is 37 00:03:00,810 --> 00:03:01,320 online. 38 00:03:02,730 --> 00:03:04,440 All this will be primary research. 39 00:03:05,530 --> 00:03:11,350 As part of the secondary research, we should look for previous studies on cart abandonment within and outside 40 00:03:11,350 --> 00:03:12,130 the organization. 41 00:03:14,440 --> 00:03:20,230 Once we have done all this, we would have a fairly good idea of which variables and what type of 42 00:03:20,230 --> 00:03:22,060 model we are going to use. 
43 00:03:23,670 --> 00:03:25,830 The next task is to get the data.