1 00:00:04,990 --> 00:00:06,940 Working with spiders. 2 00:00:07,600 --> 00:00:08,380 Hello, everyone. 3 00:00:08,410 --> 00:00:13,300 Today we're going to talk about what spiders are and how we can actually work with them. 4 00:00:13,810 --> 00:00:22,030 So spiders are classes, OK, spiders are Python classes that help us to navigate to a 5 00:00:22,030 --> 00:00:23,940 specific website or domain. 6 00:00:24,730 --> 00:00:28,990 And they basically define how to extract the information from there. 7 00:00:29,290 --> 00:00:32,980 So you can simply write a class and add some rules there. 8 00:00:33,160 --> 00:00:41,260 And then once you run this file, or actually the Scrapy project, you are actually running this spider 9 00:00:41,260 --> 00:00:43,930 and extracting the information depending on the rules there. 10 00:00:44,350 --> 00:00:49,930 And you saw last time we actually created a spider, which extracted the information from two links, 11 00:00:49,930 --> 00:00:54,180 and we got all the information for those links in HTML format. 12 00:00:54,400 --> 00:00:57,130 So today we are going to extend this a little bit. 13 00:00:57,490 --> 00:01:05,080 And in general, spiders basically generate an initial request to the website. 14 00:01:05,080 --> 00:01:09,670 And once you generate the request to the URL, it gets a response. 15 00:01:10,180 --> 00:01:15,120 And this response is usually all the HTML that gets downloaded onto your computer. 16 00:01:16,360 --> 00:01:21,380 So once you get this download, you basically have all the information for the website. 17 00:01:21,850 --> 00:01:23,830 So, guys, now we're in the terminal. 18 00:01:23,830 --> 00:01:25,180 And let's do one thing. 19 00:01:26,140 --> 00:01:31,640 Let me show you how you can actually use Scrapy to extract data. 20 00:01:32,050 --> 00:01:37,290 So we are going to use the shell today, and I will show you a very simple way. 21 00:01:37,300 --> 00:01:39,430 So let's write scrapy.
22 00:01:40,750 --> 00:01:41,710 OK, scrapy. 23 00:01:43,200 --> 00:01:50,400 Shell, OK, and now we're actually going to run this shell, so we're going to run it for the following 24 00:01:50,400 --> 00:01:52,830 website, http. 25 00:01:52,830 --> 00:01:53,460 That's right. 26 00:01:54,490 --> 00:01:55,290 Quotes. 27 00:01:56,670 --> 00:01:58,380 Dot to. 28 00:02:00,030 --> 00:02:00,870 Scrape. 29 00:02:02,200 --> 00:02:08,580 Dot com and then a forward slash, page, slash one, same as last time. 30 00:02:08,700 --> 00:02:17,460 OK, let's hit Enter here and now you can see that we're actually downloading here all the 31 00:02:17,460 --> 00:02:19,080 information for the website. 32 00:02:19,730 --> 00:02:28,080 OK, we're getting a list here of the available objects and the downloaded information and so on and so 33 00:02:28,080 --> 00:02:28,440 forth. 34 00:02:28,890 --> 00:02:33,060 And you can see different options here that you can use. 35 00:02:33,330 --> 00:02:40,020 So one particular option that we can use is the. 36 00:02:41,370 --> 00:02:48,700 Response option, OK, response, because it stores the information for the website, so if we do 37 00:02:48,720 --> 00:02:59,250 response dot css and then we do title, we're getting the actual name of the website. 38 00:02:59,520 --> 00:03:04,530 So you can see that we get Selector, XPath and so on and so forth. 39 00:03:04,900 --> 00:03:09,320 And this is the title of the website, Quotes to Scrape. 40 00:03:09,570 --> 00:03:17,680 OK, and this is directly extracted from the HTML file using the XPath expressions here. 41 00:03:18,270 --> 00:03:22,590 So for example, let's try another thing: if we do response. 42 00:03:23,690 --> 00:03:36,770 Dot css and then title, double colon, text, OK, let's close the bracket and let's write extract. 43 00:03:37,170 --> 00:03:37,760 That's it. 44 00:03:38,300 --> 00:03:42,680 So in that way we can actually extract text from the website.
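The shell session narrated above boils down to `scrapy shell "http://quotes.toscrape.com/page/1/"` followed by `response.css('title').extract()` and `response.css('title::text').extract()`. Since Scrapy may not be installed here, the sketch below mimics that title extraction with the standard library on a hardcoded snippet of the page's HTML (the snippet itself is an assumption; a real shell session works on the live download):

```python
import re

# Hardcoded stand-in for the <head> of quotes.toscrape.com/page/1
# (an assumption; the Scrapy shell downloads the real page).
html = "<html><head><title>Quotes to Scrape</title></head></html>"

# Rough equivalent of response.css('title').extract():
# the whole <title> element, markup included.
title_element = re.findall(r"<title>.*?</title>", html)
print(title_element)   # ['<title>Quotes to Scrape</title>']

# Rough equivalent of response.css('title::text').extract():
# only the text inside the element.
title_text = re.findall(r"<title>(.*?)</title>", html)
print(title_text)      # ['Quotes to Scrape']
```

The contrast between the two results is exactly the point made in the video: `::text` drops the surrounding markup and keeps only the text node.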
45 00:03:44,090 --> 00:03:45,490 So let me do it. 46 00:03:46,070 --> 00:03:52,980 And here I specified that I just want to extract the text without the additional markup. 47 00:03:53,210 --> 00:03:55,370 So two things are very important here. 48 00:03:56,120 --> 00:04:05,180 The first is that we actually added text with a double colon here, and the second thing is the extract. 49 00:04:05,390 --> 00:04:11,270 So the text means that we want to extract specifically only the text elements from the file. 50 00:04:11,540 --> 00:04:14,650 And the only text element here is Quotes to Scrape. 51 00:04:14,960 --> 00:04:18,930 And also we specified that we want them from the title element. 52 00:04:19,190 --> 00:04:26,870 So since there is only one text in the title element, we extracted the actual title of the website. 53 00:04:27,320 --> 00:04:35,030 And there is an important thing, because if we don't write text, actually, so if we do response 54 00:04:35,540 --> 00:04:42,950 dot css, OK, and if I simply write here title, OK. 55 00:04:44,060 --> 00:04:54,010 Dot extract, what we're getting is still the title, but also we get the tags, we get the exact 56 00:04:54,010 --> 00:05:01,930 markers of Quotes to Scrape, and for that reason we use this text to get specifically only the text 57 00:05:02,120 --> 00:05:03,640 inside without the markers. 58 00:05:04,030 --> 00:05:06,940 So you can see the difference is actually quite obvious. 59 00:05:08,740 --> 00:05:14,810 Also, another thing we can use, actually, if you'd like to remove the brackets here, you can simply 60 00:05:14,830 --> 00:05:16,240 write response. 61 00:05:17,460 --> 00:05:23,340 Dot css, and in the brackets you can write title. 62 00:05:24,920 --> 00:05:26,810 Double colon, text, OK? 63 00:05:27,500 --> 00:05:31,490 And after that, we can write here zero dot. 64 00:05:32,990 --> 00:05:34,000 Extract. 65 00:05:34,550 --> 00:05:34,940 OK.
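The command typed above is `response.css('title::text')[0].extract()`: indexing the selector list with `[0]` before extracting yields a bare string instead of a one-element list. A minimal stdlib sketch of that difference, again on a hardcoded snippet since Scrapy may not be available here:

```python
import re

# Stand-in HTML (assumption; the shell works on the live page).
html = "<html><head><title>Quotes to Scrape</title></head></html>"

texts = re.findall(r"<title>(.*?)</title>", html)

# Without indexing we get a one-element list, brackets and all,
# like response.css('title::text').extract().
print(texts)       # ['Quotes to Scrape']

# Indexing with [0], like response.css('title::text')[0].extract(),
# gives the bare string with no brackets.
print(texts[0])    # Quotes to Scrape
```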
66 00:05:37,040 --> 00:05:46,220 And you can see that we actually removed the brackets and there's only Quotes to Scrape in quotes 67 00:05:46,220 --> 00:05:47,670 because the list brackets are removed. 68 00:05:48,020 --> 00:05:55,400 So actually this is the best way to extract the header from the website. 69 00:05:56,390 --> 00:06:00,830 Also, another way to do that is, for example, you can write response. 70 00:06:02,500 --> 00:06:06,250 Dot css and then here we can write title. 71 00:06:08,710 --> 00:06:10,310 Double colon, text, OK. 72 00:06:11,020 --> 00:06:13,360 And after that, dot re. 73 00:06:14,650 --> 00:06:16,840 And we can write here r. 74 00:06:18,650 --> 00:06:21,500 And then Quotes. 75 00:06:23,230 --> 00:06:24,520 Dot star. 76 00:06:25,090 --> 00:06:32,080 So I don't care what's after that; close the bracket, and if I hit Enter here, you can see that we 77 00:06:32,080 --> 00:06:33,360 get basically the same thing. 78 00:06:33,700 --> 00:06:35,480 So we get Quotes to Scrape. 79 00:06:36,080 --> 00:06:41,170 So whatever method you use, I think actually the first one is a bit better. 80 00:06:42,310 --> 00:06:49,090 You can always extract the title of the website that you found with the shell. 81 00:06:50,080 --> 00:06:58,750 So that said, guys, this was a brief intro of what we are going to do with the full Scrapy toolkit, 82 00:06:58,750 --> 00:06:59,210 actually. 83 00:06:59,470 --> 00:07:01,920 So that said, thank you very much for watching. 84 00:07:01,930 --> 00:07:08,080 In the next video, we'll continue with the exploration of the Scrapy toolkit. 85 00:07:08,410 --> 00:07:09,760 Thank you very much for watching.
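As a recap, the regex variant dictated above is `response.css('title::text').re(r'Quotes.*')`: a regular expression applied to the selected text, returning every match as a list of strings. Since Scrapy may not be installed here, this stdlib sketch shows the same pattern applied to the title string (the string itself is assumed from the page shown in the video):

```python
import re

# Stand-in for the extracted title text (assumption).
title_text = "Quotes to Scrape"

# Rough equivalent of response.css('title::text').re(r'Quotes.*'):
# keep whatever matches 'Quotes' followed by anything (.*).
matches = re.findall(r"Quotes.*", title_text)
print(matches)     # ['Quotes to Scrape']
```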