1 00:00:00,330 --> 00:00:06,270 All in all, in our previous sessions we have gone through lots of analysis, from importing the data from the 2 00:00:06,270 --> 00:00:12,330 database, and at times we have performed some sentiment analysis with respect to positive and with respect 3 00:00:12,330 --> 00:00:16,950 to negative reviews, lots of things, this word cloud and so on. 4 00:00:17,220 --> 00:00:23,670 And this is where we found the top 10 users to whom Amazon is going to recommend products. 5 00:00:24,230 --> 00:00:32,580 So in this session, the very first part of the problem statement is that we have to analyze the length of the comments, whatever 6 00:00:32,580 --> 00:00:34,210 the users are going to give. 7 00:00:34,470 --> 00:00:38,420 So we have to analyze whether they are lengthy ones or short ones. 8 00:00:38,430 --> 00:00:42,110 Or you can say you can analyze their distribution as well. 9 00:00:42,720 --> 00:00:51,990 So maybe some of you do not have that good a specification in terms of the processor or hard disk. 10 00:00:52,260 --> 00:00:59,220 So according to your system specifications, you can consider some sample of this huge chunk 11 00:00:59,220 --> 00:01:04,640 of data, because you will figure out this is very huge data, so you can consider some sample of the data. 12 00:01:04,650 --> 00:01:10,100 And yeah, if you don't have any issue with the specification, just go with this bulky data. 13 00:01:10,110 --> 00:01:11,340 It's all up to you, literally. 14 00:01:11,340 --> 00:01:16,670 It's all about you. Say, for your sake, I'm going to consider some sample of the data. 15 00:01:16,980 --> 00:01:23,310 So here I'm going to say either you can consider some sample of the data using this DataFrame's sample method, passing whatever 16 00:01:23,310 --> 00:01:30,510 sample size you want to assign; using this, you can get some sample. Or you guys can consider some sample 17 00:01:30,510 --> 00:01:31,280 by slicing, 18 00:01:31,350 --> 00:01:36,170 say, when you consider some top two thousand rows. It's all up to you.
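The two sampling options described above can be sketched as follows. This is a minimal sketch on a small hypothetical DataFrame standing in for the Amazon reviews data; the column names here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the full Amazon reviews data.
df = pd.DataFrame({
    "Text": ["Great product", "Not good at all", "Okay value for money"] * 1000,
    "Score": [5, 1, 3] * 1000,
})

# Option 1: a random sample of a fixed size (random_state makes it reproducible).
sample = df.sample(n=2000, random_state=42)

# Option 2: simply slice off the top 2,000 rows.
final = df.head(2000)
```

Either way you end up with a 2,000-row DataFrame; `sample` shuffles before picking, while `head` keeps the original order.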
19 00:01:36,990 --> 00:01:42,630 Let's say this is my final data frame. I just executed it; now on this final 20 00:01:42,630 --> 00:01:43,720 DataFrame I'm going to call head. 21 00:01:44,040 --> 00:01:47,610 So this is the data frame on which I have to perform this analysis. 22 00:01:48,370 --> 00:01:51,750 Check whether I have any missing value in this data or not. For this, 23 00:01:51,750 --> 00:02:00,290 I'm just going to say isna dot sum. You guys can also use isnull dot sum. 24 00:02:00,300 --> 00:02:01,170 It's all up to you. 25 00:02:01,470 --> 00:02:04,680 And on this DataFrame, I'm just going to call it. It will return 26 00:02:04,680 --> 00:02:08,070 that I don't have any missing value in my data. 27 00:02:08,200 --> 00:02:13,530 Now, let me check whether I have any duplicates in it or not. 28 00:02:13,530 --> 00:02:18,560 So for this, I'm just going to say final dot — 29 00:02:18,570 --> 00:02:24,750 I have a function which is exactly duplicated — dot sum. Just execute it. 30 00:02:24,780 --> 00:02:26,110 Now it will return: 31 00:02:26,250 --> 00:02:26,700 yeah, 32 00:02:26,700 --> 00:02:29,590 you don't have any duplicates in your data. 33 00:02:29,880 --> 00:02:35,130 Now what you have to do is analyze the length of the customers' comments. 34 00:02:35,370 --> 00:02:42,900 So now what I am going to do, I'm just going to say that from this Text column, I have to extract exactly 35 00:02:42,900 --> 00:02:47,230 the length of each of the feedbacks given by my customers. 36 00:02:47,430 --> 00:02:52,070 So here I am going to say, let's say, final of Text. 37 00:02:52,080 --> 00:02:52,680 Very first, 38 00:02:52,680 --> 00:02:53,820 I have to access this. 39 00:02:54,120 --> 00:02:57,180 Let's say I'm just going to access my zeroth-index data. 40 00:02:57,180 --> 00:02:57,960 Just print it. 41 00:02:57,960 --> 00:03:03,290 You will see it has that many words in this feedback. 42 00:03:03,540 --> 00:03:05,550 Let's say I have to split it.
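The missing-value and duplicate checks described above look like this in pandas. A minimal sketch, assuming a small hypothetical `final` DataFrame with a `Text` and a `Score` column:

```python
import pandas as pd

# Hypothetical reviews data; the last row is a deliberate duplicate of the first.
final = pd.DataFrame({
    "Text": ["Great product", "Not good at all", "Great product"],
    "Score": [5, 1, 5],
})

# Count missing values per column; isnull() is simply an alias for isna().
missing_per_column = final.isna().sum()

# Count fully duplicated rows (every column identical to an earlier row).
duplicate_count = final.duplicated().sum()
```

Here `missing_per_column` reports zero missing values in both columns, and `duplicate_count` is 1 because the third row repeats the first.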
43 00:03:05,550 --> 00:03:10,530 Let's say I have to split it on the basis of, let's say, space as the separator. Just execute it. 44 00:03:10,530 --> 00:03:12,690 It will return this list. 45 00:03:12,930 --> 00:03:18,850 And then on this list, I'm going to calculate the length of the entire list. 46 00:03:18,990 --> 00:03:24,510 Now, you will figure out this comment has that much length. In a similar way, 47 00:03:24,510 --> 00:03:28,590 you can compute the length of each and every comment. 48 00:03:29,040 --> 00:03:35,670 So for this, you guys can define a function to do this operation in a much more user-friendly way. 49 00:03:35,700 --> 00:03:42,660 So here I'm going to say calc_len, and whatever text I'm going to pass to my function, it will 50 00:03:42,660 --> 00:03:44,810 exactly return me its length. 51 00:03:45,090 --> 00:03:47,880 So here I am going to say, whatever text I have, on this, 52 00:03:47,880 --> 00:03:52,680 very first I have to call split, and I have to split it on the basis of the space separator. 53 00:03:53,010 --> 00:03:59,880 Once I have all this stuff, then I'm going to compute the length. It's as simple as that. OK, 54 00:03:59,890 --> 00:04:03,370 now whatever length you have, you have to just return it. 55 00:04:03,390 --> 00:04:04,190 That's it. 56 00:04:04,230 --> 00:04:05,880 That's the task of the function. 57 00:04:06,180 --> 00:04:13,320 Now what you have to do is map this function, or you can say you have to apply this function on your 58 00:04:13,560 --> 00:04:14,580 Text column. 59 00:04:14,590 --> 00:04:23,250 So final Text dot apply, and here you have to say calc underscore — just press Tab. 60 00:04:23,710 --> 00:04:25,850 So it is exactly calc_len. 61 00:04:26,280 --> 00:04:30,270 Now whatever length it will return me, I'm going to store it somewhere else. 62 00:04:30,270 --> 00:04:32,160 Let's say I define a column for that.
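The split-then-count function and the `apply` call described above can be sketched as below, again on a small hypothetical `final` DataFrame:

```python
import pandas as pd

# Hypothetical customer feedback texts.
final = pd.DataFrame({
    "Text": ["Great product overall", "Not good", "Okay value for money here"],
})

def calc_len(text):
    # Split the comment on spaces and count the resulting words.
    words = text.split(" ")
    return len(words)

# Apply the function to every comment and store the result in a new column.
final["text_len"] = final["Text"].apply(calc_len)
```

For these three sample comments the new `text_len` column holds 3, 2, and 5 words respectively.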
63 00:04:32,490 --> 00:04:39,170 So I'm going to say it is nothing but text underscore len, or whatever name you want to assign. Just 64 00:04:39,180 --> 00:04:42,030 execute it, and it will give me some warning. 65 00:04:42,030 --> 00:04:43,680 Just don't worry at all. 66 00:04:43,680 --> 00:04:46,500 It is just because of some of your Anaconda issues — 67 00:04:46,800 --> 00:04:49,980 some of you have some version issues, library issues. 68 00:04:50,220 --> 00:04:52,440 It is just because of those issues. 69 00:04:52,440 --> 00:04:53,400 Just ignore this 70 00:04:53,400 --> 00:04:56,430 warning. Now, what you have to do: 71 00:04:56,430 --> 00:04:59,790 you have to basically — let's say I just need the 72 00:05:00,160 --> 00:05:08,510 distribution of this text length. For this, you guys can use a very handy function, which is exactly 73 00:05:08,510 --> 00:05:16,220 the box plot. So very first, I have to import my Plotly, because I want to use that box plot, which is 74 00:05:16,220 --> 00:05:18,070 exactly in my plotly.express module. 75 00:05:18,080 --> 00:05:27,070 You can also use the boxplot in your pandas, seaborn, and many other libraries, but I just need some user- 76 00:05:27,230 --> 00:05:28,190 friendly, interactive visuals. 77 00:05:28,220 --> 00:05:30,180 That's why I'm going to use it here. 78 00:05:30,200 --> 00:05:39,050 So I'm going to say plotly dot express as px, and if you haven't installed it, you can install it using 79 00:05:39,320 --> 00:05:47,600 pip install: pip install plotly. You guys can install it using this basic command. 80 00:05:47,960 --> 00:05:50,780 Now what I have to do: I have to very first import it. 81 00:05:51,020 --> 00:05:56,930 Now, using this, I have a function which is exactly box. Just press Shift+Tab; 82 00:05:56,930 --> 00:05:59,850 you will see all these different parameters, as usual. 83 00:06:00,240 --> 00:06:04,250 Now, the very first parameter is exactly the data frame, which is final.
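The warning mentioned above is most likely pandas' SettingWithCopyWarning, which appears when you assign a new column to a slice of another DataFrame. A minimal sketch of how an explicit `.copy()` avoids it (the data here is a hypothetical stand-in):

```python
import pandas as pd

df = pd.DataFrame({"Text": ["Great product", "Not good at all"]})

# head() returns a view-like slice; assigning a column to it directly can
# trigger SettingWithCopyWarning. An explicit .copy() makes the sample its
# own DataFrame, so the column assignment below is unambiguous.
final = df.head(2).copy()
final["text_len"] = final["Text"].apply(lambda t: len(t.split(" ")))
```

This is just one common cause; the instructor's warning may equally come from a library-version mismatch, as he says, in which case it can indeed be ignored.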
84 00:06:04,760 --> 00:06:14,480 And on the y-axis, I'm just going to say I have to pass this text underscore len. Just execute it. 85 00:06:14,480 --> 00:06:21,120 It will take a couple of seconds, and it will return my beautiful box plot in a while. 86 00:06:21,200 --> 00:06:25,050 So this is exactly that box plot which I'm talking about. 87 00:06:25,340 --> 00:06:30,020 So if you have to conclude from this box plot, you will figure out: 88 00:06:30,410 --> 00:06:37,790 it seems that almost 50 percent of the users are going to give their feedback in almost — 89 00:06:37,790 --> 00:06:43,250 you will figure it out — almost 50 words, whereas there are only a few users — 90 00:06:43,430 --> 00:06:50,240 you will see there are a few users who are going to give very lengthy feedbacks, because these dots 91 00:06:50,660 --> 00:06:53,590 are exactly my outliers over here. 92 00:06:54,140 --> 00:06:58,640 So that's the conclusion that I have derived here from my data. 93 00:06:59,000 --> 00:07:04,850 So let's go ahead with our next problem statement, in which I have to analyze the Score. 94 00:07:05,150 --> 00:07:09,290 So in this final data frame, here you will figure it out: 95 00:07:09,860 --> 00:07:12,320 here you have a column named Score. 96 00:07:12,680 --> 00:07:16,160 So you have to analyze this feature. For this, 97 00:07:16,170 --> 00:07:20,480 what I'm going to do is just call my countplot over here. 98 00:07:20,480 --> 00:07:28,670 So sns dot countplot, and here I'm just going to say final Score. Just execute it. 99 00:07:28,670 --> 00:07:30,560 It will return this beautiful plot. 100 00:07:31,010 --> 00:07:38,360 So from this, you guys can figure out most of the customers are going to give a five score on any 101 00:07:38,360 --> 00:07:39,160 of the products. 102 00:07:39,560 --> 00:07:44,570 So that's the type of conclusion you can fetch from all the visuals that you have over here.
103 00:07:44,930 --> 00:07:53,390 But believe me, 70 to 80 percent of the time still goes into your data preprocessing and into 104 00:07:53,390 --> 00:07:53,840 your data analysis. 105 00:07:54,230 --> 00:08:00,440 That's what I have learned from my own experience working in the analytics domain and in the data science industry. 106 00:08:01,040 --> 00:08:03,350 So I hope you loved this session very much. 107 00:08:03,680 --> 00:08:04,440 Thank you. 108 00:08:04,490 --> 00:08:05,510 Have a nice day. 109 00:08:05,540 --> 00:08:06,380 Keep learning. 110 00:08:06,380 --> 00:08:07,190 Keep growing. 111 00:08:07,610 --> 00:08:08,450 Keep practicing.