1 00:00:04,990 --> 00:00:06,940 Working with spiders. 2 00:00:07,600 --> 00:00:08,380 Hello, everyone. 3 00:00:08,410 --> 00:00:13,300 Today we're going to talk about what spiders are and how we can actually work with them. 4 00:00:13,810 --> 00:00:22,030 So spiders are classes, OK, spiders are Python classes that help us to navigate to a 5 00:00:22,030 --> 00:00:23,940 specific website or domain. 6 00:00:24,730 --> 00:00:28,990 And they basically define how to extract the information from there. 7 00:00:29,290 --> 00:00:32,980 So you can simply write a class and add some rules there. 8 00:00:33,160 --> 00:00:41,260 And then once you run this file, or actually the Scrapy project, you are actually running this spider 9 00:00:41,260 --> 00:00:43,930 and extracting the information depending on the rules there. 10 00:00:44,350 --> 00:00:49,930 And you saw last time we actually created a spider, which extracted the information from two links, 11 00:00:49,930 --> 00:00:54,180 and we got all the information for those links in HTML format. 12 00:00:54,400 --> 00:00:57,130 So today we are going to extend this a little bit. 13 00:00:57,490 --> 00:01:05,080 And in general, spiders basically generate an initial request to the website. 14 00:01:05,080 --> 00:01:09,670 And once you generate the request to the URL, it gets a response. 15 00:01:10,180 --> 00:01:15,120 And this response is usually all the HTML that gets downloaded onto your computer. 16 00:01:16,360 --> 00:01:21,380 So once you get this download, you basically have all the information for the website. 17 00:01:21,850 --> 00:01:23,830 So, guys, now we're in the terminal. 18 00:01:23,830 --> 00:01:25,180 And let's do one thing. 19 00:01:26,140 --> 00:01:31,640 Let me show you how you can actually use Scrapy to extract data. 20 00:01:32,050 --> 00:01:37,290 So we are going to use the shell today, and I will show you a very simple way. 21 00:01:37,300 --> 00:01:39,430 So let's write scrapy.
22 00:01:40,750 --> 00:01:41,710 OK, scrapy. 23 00:01:43,200 --> 00:01:50,400 Shell, OK, and now we're actually going to run this shell, so we're going to run it for the following 24 00:01:50,400 --> 00:01:52,830 website, http. 25 00:01:52,830 --> 00:01:53,460 That's right. 26 00:01:54,490 --> 00:01:55,290 Quotes. 27 00:01:56,670 --> 00:01:58,380 Dot to. 28 00:02:00,030 --> 00:02:00,870 Scrape. 29 00:02:02,200 --> 00:02:08,580 Dot com and then a forward slash, page, slash one, same as last time. 30 00:02:08,700 --> 00:02:17,460 OK, let's hit Enter here and now you can see that we're actually downloading here all the 31 00:02:17,460 --> 00:02:19,080 information for the website. 32 00:02:19,730 --> 00:02:28,080 OK, we're getting a list here of the available objects and the downloaded information and so on and so 33 00:02:28,080 --> 00:02:28,440 forth. 34 00:02:28,890 --> 00:02:33,060 And you can see different options here that you can use. 35 00:02:33,330 --> 00:02:40,020 So one particular option that we can use is the. 36 00:02:41,370 --> 00:02:48,700 Response option, OK, response, because it stores the information for the website, so if we do 37 00:02:48,720 --> 00:02:59,250 response dot css and then we do title, we're getting the actual name of the website. 38 00:02:59,520 --> 00:03:04,530 So you can see that we get Selector, XPath and so on and so forth. 39 00:03:04,900 --> 00:03:09,320 And this is the title of the website, Quotes to Scrape. 40 00:03:09,570 --> 00:03:17,680 OK, and this is directly extracted from the HTML file using the XPath expressions here. 41 00:03:18,270 --> 00:03:22,590 So for example, let's try another thing: if we do response. 42 00:03:23,690 --> 00:03:36,770 Dot css and then title, double colon, text, OK, let's close the bracket and let's write extract. 43 00:03:37,170 --> 00:03:37,760 That's it. 44 00:03:38,300 --> 00:03:42,680 So in that way we can actually extract text from the website.
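The shell session narrated above boils down to `scrapy shell "http://quotes.toscrape.com/page/1/"` followed by `response.css('title').extract()` and `response.css('title::text').extract()`. Since Scrapy may not be installed here, the sketch below mimics that title extraction with the standard library on a hardcoded snippet of the page's HTML (the snippet itself is an assumption; a real shell session works on the live download):

```python
import re

# Hardcoded stand-in for the <head> of quotes.toscrape.com/page/1
# (an assumption; the Scrapy shell downloads the real page).
html = "<html><head><title>Quotes to Scrape</title></head></html>"

# Rough equivalent of response.css('title').extract():
# the whole <title> element, markup included.
title_element = re.findall(r"<title>.*?</title>", html)
print(title_element)   # ['<title>Quotes to Scrape</title>']

# Rough equivalent of response.css('title::text').extract():
# only the text inside the element.
title_text = re.findall(r"<title>(.*?)</title>", html)
print(title_text)      # ['Quotes to Scrape']
```

The contrast between the two results is exactly the point made in the video: `::text` drops the surrounding markup and keeps only the text node.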
45 00:03:44,090 --> 00:03:45,490 So let me do it. 46 00:03:46,070 --> 00:03:52,980 And here I specified that I just want to extract the text without the additional markup. 47 00:03:53,210 --> 00:03:55,370 So two things are very important here. 48 00:03:56,120 --> 00:04:05,180 The first is that we actually added text with a double colon here, and the second thing is the extract. 49 00:04:05,390 --> 00:04:11,270 So the text means that we want to extract specifically only the text elements from the file. 50 00:04:11,540 --> 00:04:14,650 And the only text element here is Quotes to Scrape. 51 00:04:14,960 --> 00:04:18,930 And also we specified that we want them from the title element. 52 00:04:19,190 --> 00:04:26,870 So since there is only one text in the title element, we extracted the actual title of the website. 53 00:04:27,320 --> 00:04:35,030 And there is an important thing, because if we don't write text, actually, so if we do response 54 00:04:35,540 --> 00:04:42,950 dot css, OK, and if I simply write here title, OK. 55 00:04:44,060 --> 00:04:54,010 Dot extract, what we're getting is still the title, but also we get the tags, we get the exact 56 00:04:54,010 --> 00:05:01,930 markers of Quotes to Scrape, and for that reason we use this text to get specifically only the text 57 00:05:02,120 --> 00:05:03,640 inside without the markers. 58 00:05:04,030 --> 00:05:06,940 So you can see the difference is actually quite obvious. 59 00:05:08,740 --> 00:05:14,810 Also, another thing we can use, actually, if you'd like to remove the brackets here, you can simply 60 00:05:14,830 --> 00:05:16,240 write response. 61 00:05:17,460 --> 00:05:23,340 Dot css, and in the brackets you can write title. 62 00:05:24,920 --> 00:05:26,810 Double colon, text, OK? 63 00:05:27,500 --> 00:05:31,490 And after that, we can write here zero dot. 64 00:05:32,990 --> 00:05:34,000 Extract. 65 00:05:34,550 --> 00:05:34,940 OK.
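The command typed above is `response.css('title::text')[0].extract()`: indexing the selector list with `[0]` before extracting yields a bare string instead of a one-element list. A minimal stdlib sketch of that difference, again on a hardcoded snippet since Scrapy may not be available here:

```python
import re

# Stand-in HTML (assumption; the shell works on the live page).
html = "<html><head><title>Quotes to Scrape</title></head></html>"

texts = re.findall(r"<title>(.*?)</title>", html)

# Without indexing we get a one-element list, brackets and all,
# like response.css('title::text').extract().
print(texts)       # ['Quotes to Scrape']

# Indexing with [0], like response.css('title::text')[0].extract(),
# gives the bare string with no brackets.
print(texts[0])    # Quotes to Scrape
```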
66 00:05:37,040 --> 00:05:46,220 And you can see that we actually removed the brackets and there's only Quotes to Scrape in quotes 67 00:05:46,220 --> 00:05:47,670 because the list brackets are removed. 68 00:05:48,020 --> 00:05:55,400 So actually this is the best way to extract the header from the website. 69 00:05:56,390 --> 00:06:00,830 Also, another way to do that is, for example, you can write response. 70 00:06:02,500 --> 00:06:06,250 Dot css and then here we can write title. 71 00:06:08,710 --> 00:06:10,310 Double colon, text, OK. 72 00:06:11,020 --> 00:06:13,360 And after that, dot re. 73 00:06:14,650 --> 00:06:16,840 And we can write here r. 74 00:06:18,650 --> 00:06:21,500 And then Quotes. 75 00:06:23,230 --> 00:06:24,520 Dot star. 76 00:06:25,090 --> 00:06:32,080 So I don't care what's after that; close the bracket, and if I hit Enter here, you can see that we 77 00:06:32,080 --> 00:06:33,360 get basically the same thing. 78 00:06:33,700 --> 00:06:35,480 So we get Quotes to Scrape. 79 00:06:36,080 --> 00:06:41,170 So whatever method you use, I think actually the first one is a bit better. 80 00:06:42,310 --> 00:06:49,090 You can always extract the title of the website that you found with the shell. 81 00:06:50,080 --> 00:06:58,750 So that said, guys, this was a brief intro of what we are going to do with the full Scrapy toolkit, 82 00:06:58,750 --> 00:06:59,210 actually. 83 00:06:59,470 --> 00:07:01,920 So that said, thank you very much for watching. 84 00:07:01,930 --> 00:07:08,080 In the next video, we'll continue with the exploration of the Scrapy toolkit. 85 00:07:08,410 --> 00:07:09,760 Thank you very much for watching.
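As a recap, the regex variant dictated above is `response.css('title::text').re(r'Quotes.*')`: a regular expression applied to the selected text, returning every match as a list of strings. Since Scrapy may not be installed here, this stdlib sketch shows the same pattern applied to the title string (the string itself is assumed from the page shown in the video):

```python
import re

# Stand-in for the extracted title text (assumption).
title_text = "Quotes to Scrape"

# Rough equivalent of response.css('title::text').re(r'Quotes.*'):
# keep whatever matches 'Quotes' followed by anything (.*).
matches = re.findall(r"Quotes.*", title_text)
print(matches)     # ['Quotes to Scrape']
```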