XPath expressions. Hello, everyone. Today we're going to write some code, and as you can see, I'm already inside our terminal and I also have the IDE open in the background. So today we're going to use Scrapy, and we're going to write some XPath expressions, which are the selectors most commonly used by Scrapy. And so, guys, I will show you how to set up Scrapy inside your computer. So the first thing I'll do here, as you can see, is write the following command: scrapy startproject tutorial. And you can see that immediately afterwards a directory is created inside our section-four folder, as you can see here. And if I open it, you can see that it is not an empty folder; it already has some items here. You can see that it has items, pipelines, settings, middlewares and so on. So simply by typing scrapy startproject tutorial in the terminal, this generates a tutorial project that you can run from your computer. And you can see, if you open some of the files, that they are already populated with code: we have different functions here, different pipelines, and the settings as well. So this is simply a small example Scrapy project that you can see in front of you. So inside this folder, guys, let's actually create our first spider.
So I will click here on spiders, as you can see, I will select New Python File, and I will simply name it quotes_spider.py. OK, and after that, let's write our first Scrapy spider. So I will do import scrapy, and then let's create a class. So class QuotesSpider, and then inside the brackets write scrapy.Spider. Let's write a colon, and then name equals... actually, let me get a little bit closer. So name = "quotes". And now let's create a function, so let's write def start_requests, OK, and then self. OK. And just under it, let's write urls = [ and add some URLs here. So I will write http://, and the first URL will be quotes.toscrape.com/page/1/. That's it. And after that, the second URL: http://quotes.toscrape.com/page/2/. OK. Let's write a comma here and close the bracket, and after that I will create a for loop. So let's write for url in urls:, OK. And then yield scrapy.Request, and then I will write url=url, and then callback=self.parse. OK.
And after that, let's just follow up, and I'll create another function. So let's write def parse, and here we have self, response. OK, and then let's write page = response.url.split, and here let's write a slash, and then [-2]. OK, then let's write filename = 'quotes-%s.html', that's it, and then % page. And then let's write with open(filename, 'wb') as f:, OK, and then f.write(response.body). OK, and here, finally, I will write self.log, and in the log I'll write 'Saved file %s' % filename. OK, so let's go up, because I noticed there is a slight typo here. So here it is: "quotes". OK, that's it, and let's save this file, guys. So now that the file is saved, we can actually run it. And as you can see here, what we actually created with this spider are the initial requests that Scrapy will make to the websites we're going to access. So we state how to follow the linked pages and how to parse the downloaded page content right here.
So this is what both of those functions are doing in our project. So this file specifically defines what the spider is going to do, and the rest of the files simply define the scraper's functionality. So our usual workflow will be to create a Scrapy file like this and then add different spiders, or different rules, to access the website. But enough talking; let's run the quotes spider and see what's going to happen. So let's write ls, and let's cd into tutorial. OK, let's write ls, then cd into spiders, OK? Let's write ls, and you can see that our file is here, so let's try it. I will write python and then quotes_spider.py. OK, I'll save it. So I save it and let's try it once again, OK. I think there's another typo somewhere. Let's search for it. So I will do... OK, so I imported Scrapy wrong. So it should be import scrapy. That's it. Let's save it, and I will run the file again. OK, so after you run the file, in order to extract the links we need to write scrapy crawl and then quotes. Just quotes. OK, let's run this. And as you can see now, we actually started Scrapy, and it started following all the links and getting responses.
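To double-check what the crawl will save, here is the filename logic from parse() run on its own (plain Python, no Scrapy needed; the URL is one of the two from the spider):

```python
# The trailing slash matters: split("/") produces a final empty string,
# so index -2 is the page number rather than ""
url = "http://quotes.toscrape.com/page/2/"
parts = url.split("/")   # ['http:', '', 'quotes.toscrape.com', 'page', '2', '']
page = parts[-2]         # '2'
filename = "quotes-%s.html" % page
print(filename)          # quotes-2.html
```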
So this is the output that you can see in front of you. So now, if we go back to the directory, you can notice that two new files have been stored in the directory. And these are quotes-1 and quotes-2, which refer to each of the websites. And how we know that is because both of them are actually HTML files, which are specifically web page files, and you can see them in the two locations in front of you. So select them, and you can see that these are actually the HTML files of the websites that were accessed. So this is how to actually extract information from the web with Scrapy. And in the next video, we're going to talk a little bit more in depth about how to use the information that I just showed you, and how to extract some additional information that you require. That's it. Thanks very much for watching, and I'll see you in the next video.