Scrapy overview.

Hi, everyone. Today we're going to talk about a very interesting tool, and this is actually the next level that we're going to build up to after learning the BeautifulSoup toolkit. So today we're going to talk about Scrapy, which is a Python tool used for web scraping, and you can find all the information about it at scrapy.org. This tool essentially allows you to extract data from web pages, and many people use it for data mining, information processing and so on. It also runs on absolutely every platform, since it is written in Python, so if you have a Mac, Windows or Linux operating system, you can use it on all of them without any issues. And even though the main purpose of Scrapy is to extract data from web pages using Python, you can also use it to pull data through APIs.

So let's talk about the advantages of Scrapy. Scrapy is actually the bridge that takes you to full automation, because here you simply write the rules and then Scrapy does the job instead of you. So once you write your rules, Scrapy goes to the web, finds the web page and extracts all the information that you require, without the need for you to do any additional work. You can also easily modify your source code and use it on different platforms, regardless of whether you use Linux, Windows, macOS and so on. So this is quite a useful tool that will definitely simplify the way you extract information from websites.

To extract data from the HTML file, Scrapy usually uses the CSS selectors and XPath expressions that we talked about in the previous lectures. You can simply write your code in the Python console and then use these rules in order to extract information from the web. It also supports many formats, like JSON, CSV, XML and so on, so regardless of what you use, you can definitely access most of the more popular file types out there. And the other thing that I really like about Scrapy is that it allows you to be more flexible with what you write and what you get: if you have errors in the code, there is a big chance that Scrapy will still execute the rules and find the best solution. You can also plug in your own components, such as signals, extensions and pipelines.
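To make the idea of writing rules and letting Scrapy do the work concrete, here is a minimal sketch of a spider (my own illustration, not from the video) that uses the CSS selectors mentioned above; it assumes quotes.toscrape.com, Scrapy's demo site, as the target:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        # The "rules" are just this class: where to start and how to parse.
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]  # assumed demo target

        def parse(self, response):
            # CSS selectors pick out each quote block and its text/author.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

Running it with something like "scrapy runspider quotes_spider.py -o quotes.json" lets Scrapy handle the requests, the downloading and the export to JSON (or CSV, XML) on its own.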
But let's now talk specifically about the architecture of Scrapy. Scrapy allows you to simply scan the content of a website and extract the information that is useful for you, but there is a very important architecture working in the background of that tool.

So the first part of the architecture is the interpreter. The interpreter simply allows you to create projects and test them quickly. The interpreter uses some crawler routines called spiders, and they are used for making requests to a list of domains that you predefine. So once you predefine these rules, an HTTP request is made, and then Scrapy applies the rules that you predefined to this list of HTTP requests, so you can execute all of them. So, as I said, Scrapy uses XPath expressions, and with them you can actually get pretty detailed information about what we extract from the website. So, for example, let's say you want to extract download links from a page: you can simply write an XPath expression using Scrapy, and you will easily be able to access all of the attributes in that website. And finally, there are the items, which are basically containers of information that allow you to store the information that is returned from the rules. So, for example, if you write a rule that accesses a website and you get a response from this website, you're basically going to store all of this information in these containers, or the items, that are returned from the rules.

On the next figure, guys, you can actually see the workflow of Scrapy and how the Scrapy engine is implemented in order to cope with all of these web features. OK, so we can see here that first we have the Internet, and the Internet works with the downloader, OK, which is connected to the Scrapy engine. We keep making requests and getting responses, so once we make a request to the Internet, we're downloading the responses, and we're using the spiders to do that. So as you can see, all of those four components are actually connected to the Scrapy engine, and the information that goes inside them moves dynamically, which means that the rules are constantly exchanging data with the website, so you can constantly access different locations on the Internet depending on the rules that you write. So, as you can see on the image, the spiders use the items in order to pass the data to the item pipelines. And as you can see, Scrapy actually has different spiders.
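As a rough illustration of how spiders, XPath expressions, items and pipelines fit together, here is a small sketch of my own (not from the video): the start URL, the field names and the ".zip" filter are all hypothetical, and the pipeline class would still need to be listed in the project's ITEM_PIPELINES setting to take effect.

    import scrapy


    class DownloadLinkItem(scrapy.Item):
        # An item is the container that stores what the rules return.
        url = scrapy.Field()
        label = scrapy.Field()


    class DownloadLinksSpider(scrapy.Spider):
        name = "download_links"
        # Hypothetical starting page; put the domains you predefine here.
        start_urls = ["https://example.com/downloads"]

        def parse(self, response):
            # XPath expression: every <a> whose href points at a .zip file.
            for link in response.xpath('//a[contains(@href, ".zip")]'):
                item = DownloadLinkItem()
                item["url"] = response.urljoin(link.xpath("@href").get())
                item["label"] = link.xpath("text()").get()
                yield item


    class FillEmptyLabelPipeline:
        # Item pipelines receive every item the spiders yield and can clean it up.
        def process_item(self, item, spider):
            if not item.get("label"):
                item["label"] = item["url"]
            return item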
Some of them pass items through the pipelines, and other ones are sending requests to a scheduler over here. So the requests from the spiders to the scheduler are actually the ones that end up reaching the actual server, because from the scheduler the requests are sent to the server, and this is how we obtain the information. So once we request information from the server, the information is fed back after that to the spiders, and so the spider is fed back with each response that we get from the server.

So, guys, this was the overview of Scrapy. I hope it was clear, because in the next video we're actually going to start coding and we are going to make some Scrapy requests to the web. So that's it, thanks very much for watching this video for today, and I will see you in the next one.