Scrapy overview.

Hi, everyone. Today we're going to talk about a very interesting tool, and this is actually the next level that we're going to build up to after learning the BeautifulSoup toolkit. So today we're going to talk about Scrapy, which is a Python tool used for web scraping, and you can find all the information about it at scrapy.org. This tool essentially allows you to extract data from web pages, and many people use it for data mining, information processing and so on. It also runs on absolutely every platform, since it is written in Python, so if you have a Mac, Windows or Linux operating system, you can use it on all of them without any issues. And even though the main purpose of Scrapy is to extract data from web pages using Python, you can also use it to pull data through APIs.

So let's talk about the advantages of Scrapy. Scrapy is actually the bridge that takes you to full automation, because here you simply write the rules and then Scrapy does the job instead of you. So once you write your rules, Scrapy goes to the web, finds the web page and extracts all the information that you require, without the need for you to do any additional work. You can also easily modify your source code and use it on different platforms, regardless of whether you use Linux, Windows, macOS and so on. So this is quite a useful tool that will definitely simplify the way you extract information from websites.

To extract data from the HTML file, Scrapy usually uses the CSS selectors and XPath expressions that we talked about in the previous lectures. You can simply write your code in the Python console and then use these rules in order to extract information from the web. It also supports many formats, like JSON, CSV, XML and so on, so regardless of what you use, you can definitely access most of the more popular file types out there. And the other thing that I really like about Scrapy is that it allows you to be more flexible with what you write and what you get: if you have errors in the code, there is a big chance that Scrapy will still execute the rules and find the best solution. You can also plug in your own components, such as signals, extensions and pipelines.
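To make the idea of writing rules and letting Scrapy do the work concrete, here is a minimal sketch of a spider (my own illustration, not from the video) that uses the CSS selectors mentioned above; it assumes quotes.toscrape.com, Scrapy's demo site, as the target:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        # The "rules" are just this class: where to start and how to parse.
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]  # assumed demo target

        def parse(self, response):
            # CSS selectors pick out each quote block and its text/author.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

Running it with something like "scrapy runspider quotes_spider.py -o quotes.json" lets Scrapy handle the requests, the downloading and the export to JSON (or CSV, XML) on its own.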
But let's now talk specifically about the architecture of Scrapy. Scrapy allows you to simply scan the content of a website and extract the information that is useful for you, but there is a very important architecture working in the background of that tool.

So the first part of the architecture is the interpreter. The interpreter simply allows you to create projects and test them quickly. The interpreter uses some crawler routines called spiders, and they are used for making requests to a list of domains that you predefine. So once you predefine these rules, an HTTP request is made, and then Scrapy applies the rules that you predefined to this list of HTTP requests, so you can execute all of them. So, as I said, Scrapy uses XPath expressions, and with them you can actually get pretty detailed information about what we extract from the website. So, for example, let's say you want to extract download links from a page: you can simply write an XPath expression using Scrapy, and you will easily be able to access all of the attributes in that website. And finally, there are the items, which are basically containers of information that allow you to store the information that is returned from the rules. So, for example, if you write a rule that accesses a website and you get a response from this website, you're basically going to store all of this information in these containers, or the items, that are returned from the rules.

On the next figure, guys, you can actually see the workflow of Scrapy and how the Scrapy engine is implemented in order to cope with all of these web features. OK, so we can see here that first we have the Internet, and the Internet works with the downloader, OK, which is connected to the Scrapy engine. We keep making requests and getting responses, so once we make a request to the Internet, we're downloading the responses, and we're using the spiders to do that. So as you can see, all of those four components are actually connected to the Scrapy engine, and the information that goes inside them moves dynamically, which means that the rules are constantly exchanging data with the website, so you can constantly access different locations on the Internet depending on the rules that you write. So, as you can see on the image, the spiders use the items in order to pass the data to the item pipelines. And as you can see, Scrapy actually has different spiders.
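As a rough illustration of how spiders, XPath expressions, items and pipelines fit together, here is a small sketch of my own (not from the video): the start URL, the field names and the ".zip" filter are all hypothetical, and the pipeline class would still need to be listed in the project's ITEM_PIPELINES setting to take effect.

    import scrapy


    class DownloadLinkItem(scrapy.Item):
        # An item is the container that stores what the rules return.
        url = scrapy.Field()
        label = scrapy.Field()


    class DownloadLinksSpider(scrapy.Spider):
        name = "download_links"
        # Hypothetical starting page; put the domains you predefine here.
        start_urls = ["https://example.com/downloads"]

        def parse(self, response):
            # XPath expression: every <a> whose href points at a .zip file.
            for link in response.xpath('//a[contains(@href, ".zip")]'):
                item = DownloadLinkItem()
                item["url"] = response.urljoin(link.xpath("@href").get())
                item["label"] = link.xpath("text()").get()
                yield item


    class FillEmptyLabelPipeline:
        # Item pipelines receive every item the spiders yield and can clean it up.
        def process_item(self, item, spider):
            if not item.get("label"):
                item["label"] = item["url"]
            return item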
Some of them pass items through the pipelines, and other ones are sending requests to a scheduler over here. So the requests from the spiders to the scheduler are actually the ones that end up reaching the actual server, because from the scheduler the requests are sent to the server, and this is how we obtain the information. So once we request information from the server, the information is fed back after that to the spiders, and so the spider is fed back with each response that we get from the server.

So, guys, this was the overview of Scrapy. I hope it was clear, because in the next video we're actually going to start coding and we are going to make some Scrapy requests to the web. So that's it, thanks very much for watching this video for today, and I will see you in the next one.