1 00:00:04,960 --> 00:00:09,440 Getting pages and images with XPath. Hello, everyone. 2 00:00:09,610 --> 00:00:15,490 Today, I'm going to teach you how to get images and pages using XPath. 3 00:00:15,880 --> 00:00:27,970 So to do that, you can first create a single file that will actually get you some links and images once 4 00:00:27,970 --> 00:00:28,420 you run it. 5 00:00:28,750 --> 00:00:31,540 So let's create a new file here. 6 00:00:32,200 --> 00:00:34,840 And this will be called get 7 00:00:36,270 --> 00:00:36,990 images. 8 00:00:38,450 --> 00:00:40,220 Let me just fix this big mistake. 9 00:00:41,930 --> 00:00:42,610 Images. 10 00:00:43,160 --> 00:00:43,700 That's it. 11 00:00:44,330 --> 00:00:55,460 So let's write here at the top #!/usr/bin/python3. 12 00:00:56,010 --> 00:01:01,400 OK, and let me actually fix this warning so it doesn't bother us. 13 00:01:03,050 --> 00:01:04,880 And let's start importing things. 14 00:01:04,890 --> 00:01:12,860 So let's import os, and then let's import another package, requests. 15 00:01:13,040 --> 00:01:24,630 And after that I will do from, so from lxml import html. 16 00:01:25,160 --> 00:01:30,850 OK, and then let's now create a class, and this class will be called Scraping. 17 00:01:31,070 --> 00:01:34,950 So I will do class Scraping. 18 00:01:36,440 --> 00:01:36,990 OK. 19 00:01:37,400 --> 00:01:37,910 And then 20 00:01:37,910 --> 00:01:38,390 let's write 21 00:01:38,390 --> 00:01:47,700 def scraping images, and then self, comma, url. 22 00:01:48,480 --> 00:01:57,170 OK, and let's start building our new class. So let's first print, and the first thing I would like 23 00:01:57,170 --> 00:01:59,540 to print is
24 00:02:01,000 --> 00:02:08,950 a message to the user, "Getting images from URL", and then the URL that we want to place. So I'll do 25 00:02:08,950 --> 00:02:22,600 print, and then here I will do backslash n, and then "Getting images from URL", OK? 26 00:02:24,280 --> 00:02:26,390 And after that, let's write 27 00:02:26,540 --> 00:02:29,920 plus url. 28 00:02:31,480 --> 00:02:32,140 That's it. 29 00:02:32,560 --> 00:02:34,990 And then let's write try. 30 00:02:37,000 --> 00:02:47,630 And here, write response equals requests.get(url). 31 00:02:47,950 --> 00:02:48,360 OK? 32 00:02:48,640 --> 00:02:56,230 And this url here that you see will actually store the value of the URL that you give to 33 00:02:56,230 --> 00:02:56,770 the system. 34 00:02:56,780 --> 00:03:02,520 So once you run the code, the method receives a URL. 35 00:03:02,800 --> 00:03:08,830 And once you pass the URL, it replaces the url that you see right here in the code. 36 00:03:09,130 --> 00:03:13,310 So let me actually get a bit closer to you so you can have a better look. 37 00:03:13,630 --> 00:03:17,060 So let's write then parsed_body. 38 00:03:17,120 --> 00:03:26,140 So I'll do the parsed_body variable, and this will be equal to html dot 39 00:03:27,650 --> 00:03:38,510 fromstring, and in here, write response, response.text. 40 00:03:39,980 --> 00:03:46,430 OK, and then I will write images equals 41 00:03:48,010 --> 00:04:05,960 parsed underscore body dot xpath, and then I can do, in quotes, two slashes, img, slash, at, src. 42 00:04:06,370 --> 00:04:11,170 OK, so this is our XPath expression, guys, for getting the images. 43 00:04:11,340 --> 00:04:17,710 OK, now after we get this, let's print "Found images 44 00:04:19,360 --> 00:04:26,970 percent s", and then we'll find the length of the images, or how many images we have, so that's len, 45 00:04:28,590 --> 00:04:32,970 OK, and in the brackets, images. That's it.
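The parsing step described so far can be tried on its own. What follows is only a minimal sketch, assuming lxml is installed: it uses a small made-up HTML string (sample_html, with hypothetical file names) in place of response.text, so it runs without a network connection. The names parsed_body and images follow the lecture.

```python
from lxml import html

# Hypothetical page content standing in for response.text (an assumption,
# not taken from the lecture).
sample_html = """
<html><body>
  <img src="logo.png">
  <img src="https://example.com/banner.jpg">
  <a href="/about">About</a>
</body></html>
"""

# Parse the HTML text into an element tree.
parsed_body = html.fromstring(sample_html)

# XPath: the src attribute of every img element anywhere in the document.
images = [str(src) for src in parsed_body.xpath('//img/@src')]

print('Found images %s' % len(images))
for image in images:
    print(image)
```

Note that the result is a plain list of attribute values, so relative paths such as logo.png come back exactly as they appear in the page.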
46 00:04:34,880 --> 00:04:39,290 OK, then let's write os.system. 47 00:04:41,220 --> 00:04:53,730 And then mkdir images, and, you know, mkdir is actually creating a directory, and this is because 48 00:04:53,730 --> 00:05:02,160 once you get all those images from the web address that you specify, you actually don't want to place 49 00:05:02,160 --> 00:05:05,600 them in random places or just in the folder that you're in. 50 00:05:05,940 --> 00:05:13,250 So instead of that, you can create a folder with a name and add all the images there. 51 00:05:13,260 --> 00:05:14,830 And this is what we're doing right now. 52 00:05:15,210 --> 00:05:18,390 OK, so once we're ready with that, let's create a for loop. 53 00:05:18,390 --> 00:05:23,550 So for image in images. 54 00:05:24,810 --> 00:05:26,580 Let's do if 55 00:05:29,210 --> 00:05:41,580 image.startswith http, so if image starts with http equals False. 56 00:05:43,020 --> 00:05:53,570 OK, and then I'll write download equals url plus slash plus image. 57 00:05:53,780 --> 00:06:01,640 OK, so this is how it will be displayed once we start downloading. And then I'll write else, download will 58 00:06:01,640 --> 00:06:04,280 simply be equal to image. 59 00:06:04,680 --> 00:06:05,150 OK. 60 00:06:05,510 --> 00:06:07,840 That's the case when it already contains the http. 61 00:06:08,360 --> 00:06:11,910 So we do this in order to have all the images in the same format. 62 00:06:12,260 --> 00:06:14,420 So let's do print, and then 63 00:06:16,160 --> 00:06:17,810 download, OK. 64 00:06:18,140 --> 00:06:22,100 And this is how we download the images into the images directory. 65 00:06:22,700 --> 00:06:34,010 Now, after we do this, at the same indentation level, let's write r equals requests.get 66 00:06:34,310 --> 00:06:36,360 download, OK? 67 00:06:36,860 --> 00:06:37,580 And 68 00:06:38,850 --> 00:06:48,580 and then let's write f equals open, images slash percent s.
69 00:06:50,600 --> 00:06:57,290 And then I'll write here percent download.split. 70 00:06:58,700 --> 00:07:04,340 And here, let's write a slash, and then minus one, 71 00:07:05,950 --> 00:07:13,540 comma, and wb, and let's close the bracket here. And here there's an error because I didn't have to put 72 00:07:13,540 --> 00:07:16,030 this bracket here, so just remove it. 73 00:07:17,200 --> 00:07:21,400 So we have a single bracket closed here and one closed here. 74 00:07:21,770 --> 00:07:22,740 OK, that's fine. 75 00:07:22,750 --> 00:07:26,110 And let's write now f.write, 76 00:07:26,590 --> 00:07:30,250 r.content, OK. 77 00:07:31,780 --> 00:07:38,110 So as you can see, we're making the request and storing in r the value from the variable 78 00:07:38,110 --> 00:07:38,860 download, 79 00:07:40,170 --> 00:07:46,320 OK, which is actually the actual image, or the actual images, because it's iterative. 80 00:07:46,800 --> 00:07:52,830 So for every image, we are going to do that and we're going to store its specific content. 81 00:07:53,040 --> 00:07:55,440 And then let's write f.close. 82 00:07:55,800 --> 00:07:56,320 OK. 83 00:07:56,430 --> 00:07:57,200 And that's it. 84 00:07:58,090 --> 00:08:07,580 Um, so let's go out of this for loop and let's go out of the try block here, and let's write now 85 00:08:07,800 --> 00:08:09,030 an exception. 86 00:08:09,030 --> 00:08:13,260 So except Exception as e. 87 00:08:14,750 --> 00:08:16,430 And then let's do print. 88 00:08:17,890 --> 00:08:19,930 And then "Connection 89 00:08:21,350 --> 00:08:22,910 error in", 90 00:08:24,470 --> 00:08:29,310 and then let's add the URL, plus url. 91 00:08:29,660 --> 00:08:31,100 OK, then I will pass. 92 00:08:31,340 --> 00:08:35,790 So this is just in case we have a connection error, and then I will pass. 93 00:08:36,140 --> 00:08:38,740 So we don't get an actual system error.
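The URL handling and file naming from the loop above can be checked without downloading anything. This is a rough sketch under stated assumptions: the helper names build_download_url and local_path are hypothetical (the lecture writes this logic inline), and the two "alternative" parts use the standard library's urljoin and os.makedirs, which are substitutions of mine rather than what the lecture types.

```python
import os
import tempfile
from urllib.parse import urljoin


def build_download_url(base_url, src):
    """Mirror the lecture's rule: prefix relative src values with the page URL."""
    if not src.startswith("http"):
        return base_url + "/" + src
    return src


def local_path(download_url, folder="images"):
    """Use the last path segment of the URL as the file name inside folder."""
    return os.path.join(folder, download_url.split("/")[-1])


base = "https://news.ycombinator.com"
print(build_download_url(base, "y18.svg"))
print(build_download_url(base, "https://example.com/pic.png"))
print(local_path("https://example.com/pic.png"))

# Alternative (not from the lecture): urljoin also copes with src values
# that start with a slash or with "../".
print(urljoin(base + "/", "/static/logo.png"))

# Alternative (not from the lecture) to os.system("mkdir images"):
# os.makedirs with exist_ok=True does not fail if the folder already exists.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "images")
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)  # second call is a no-op
    print(os.path.isdir(folder))
```

One design point worth noticing: plain string concatenation produces a double slash when src already starts with "/", which is why urljoin is shown as an option.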
94 00:08:40,010 --> 00:08:47,330 So once we are ready, guys, with the scraping images function, or the way that we're actually getting 95 00:08:47,330 --> 00:08:52,990 the images, let's go out of this function here and let's go to the next one, 96 00:08:54,080 --> 00:08:59,330 which is actually way shorter. So the next one will be def scraping 97 00:09:01,210 --> 00:09:01,960 links. 98 00:09:03,660 --> 00:09:11,040 OK, scraping links, and this will be our scraping links function, and it will also take the URL that we 99 00:09:11,040 --> 00:09:11,550 use. 100 00:09:12,120 --> 00:09:13,890 So let's write print. 101 00:09:15,410 --> 00:09:22,610 And here I'll write, OK, backslash n again, and then "Getting 102 00:09:24,120 --> 00:09:26,660 links from URL". 103 00:09:27,000 --> 00:09:27,580 OK. 104 00:09:28,230 --> 00:09:29,580 Plus url. 105 00:09:31,240 --> 00:09:43,540 OK, and then I will do try, and what we're going to try here is response equals requests.get, 106 00:09:44,170 --> 00:09:47,260 and we're going to obviously get the url here. 107 00:09:47,520 --> 00:09:49,150 We already have practice with that. 108 00:09:49,660 --> 00:09:51,130 And then I'll write, 109 00:09:52,060 --> 00:10:00,220 again, parsed_body equals html dot 110 00:10:01,870 --> 00:10:06,370 fromstring, response dot 111 00:10:07,610 --> 00:10:08,940 text, OK? 112 00:10:11,080 --> 00:10:16,380 So we actually get the website and we parse its HTML from a string. 113 00:10:16,810 --> 00:10:22,810 OK, and now let's write the XPath expression to get the links. 114 00:10:23,020 --> 00:10:25,780 So I'll simply write links 115 00:10:26,930 --> 00:10:27,650 equals 116 00:10:28,790 --> 00:10:29,440 parsed 117 00:10:31,330 --> 00:10:42,250 body dot xpath, and here I will do a slash, slash, a, slash, 118 00:10:43,410 --> 00:10:44,010 at, 119 00:10:45,620 --> 00:10:48,380 h-r-e-f. 120 00:10:49,700 --> 00:10:50,190 OK.
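To see what this expression returns, here is a small sketch, again assuming lxml is installed and using a made-up HTML string (sample_html) instead of a live response. Only parsed_body, links and the expression itself come from the lecture; everything else is illustrative.

```python
from lxml import html

# Hypothetical page content standing in for response.text (an assumption).
sample_html = """<html><body>
  <a href="https://news.ycombinator.com/newest">new</a>
  <a href="item?id=1">comments</a>
  <a name="anchor-without-href">no link here</a>
</body></html>"""

parsed_body = html.fromstring(sample_html)

# XPath: the href attribute of every a element; anchors that have no href
# attribute simply contribute nothing to the result.
links = [str(href) for href in parsed_body.xpath('//a/@href')]

for link in links:
    print(link)
```

As with the image sources, relative links such as item?id=1 are returned unchanged, so they would need to be joined with the page URL before being requested.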
121 00:10:50,540 --> 00:10:57,380 And you can see that there are slight differences between how we get the links here and how we get the 122 00:10:57,380 --> 00:10:58,430 images here. 123 00:10:58,520 --> 00:11:05,080 OK, because here we get them with img, slash, at, src. 124 00:11:05,450 --> 00:11:11,330 And here we're getting the links by a, href. 125 00:11:11,540 --> 00:11:15,950 OK, so this is the XPath expression for that, for getting the links. Then 126 00:11:15,950 --> 00:11:16,520 let's write 127 00:11:16,730 --> 00:11:17,360 print. 128 00:11:18,780 --> 00:11:31,680 And I'll write here "Found links percent s", and to the percent s we're actually going to pass the number 129 00:11:31,680 --> 00:11:33,330 of links, which is len, 130 00:11:34,280 --> 00:11:34,870 links. 131 00:11:35,390 --> 00:11:36,000 That's it. 132 00:11:36,440 --> 00:11:47,620 And here I'll do a for loop, so for link in links, print link. Actually, this is quite straightforward here, 133 00:11:47,750 --> 00:11:49,890 so I'll just print link. 134 00:11:50,900 --> 00:11:51,970 OK, that's it. 135 00:11:52,430 --> 00:12:00,190 And let's go out of this try block, and I will do except Exception. 136 00:12:00,950 --> 00:12:01,370 Oops. 137 00:12:01,370 --> 00:12:09,590 Sorry, the wrong one. Exception, that's it, as e, print. 138 00:12:09,620 --> 00:12:13,820 So if the previous lines of code do not work, we would like to print 139 00:12:15,230 --> 00:12:18,680 "Connection error in", 140 00:12:20,420 --> 00:12:26,300 and then add the url, you know, that's it, and then let's write pass. 141 00:12:27,200 --> 00:12:31,660 OK, so I'm just going to remove the p here because I think this is a typo. 142 00:12:32,660 --> 00:12:36,160 It shouldn't be "scrapping" with the double p, just with a single one. 143 00:12:36,560 --> 00:12:38,900 So that's it here, guys, with this function. 144 00:12:38,900 --> 00:12:40,550 And let's finally
145 00:12:41,450 --> 00:12:46,400 create the main function, which will actually run the code, because, as you know, we just created 146 00:12:46,400 --> 00:12:47,270 a class here, 147 00:12:48,420 --> 00:12:53,490 which is also a single one, by the way, and let's write if 148 00:12:55,090 --> 00:13:02,630 __name__ equals "__main__", and this is the standard running block here. 149 00:13:03,310 --> 00:13:07,840 So if that's the case, let's write target equals. 150 00:13:08,080 --> 00:13:12,770 And here I will place a website, Hacker News. 151 00:13:12,790 --> 00:13:14,600 Actually, it's quite interesting. 152 00:13:15,490 --> 00:13:16,900 So it's called news dot 153 00:13:17,230 --> 00:13:19,270 ycombinator dot com. 154 00:13:20,760 --> 00:13:22,020 And then let's do 155 00:13:23,210 --> 00:13:26,790 scraping equals 156 00:13:28,530 --> 00:13:31,800 Scraping, that's it, and then 157 00:13:33,730 --> 00:13:41,750 scraping dot scraping images, and then I will do target. 158 00:13:41,980 --> 00:13:46,210 OK, so this will be the URL that we'll be using, this one. 159 00:13:48,310 --> 00:13:53,710 And after that, let's do another one, scraping dot, 160 00:13:55,420 --> 00:14:00,340 let's write scraping links, and our argument will be the target. 161 00:14:00,580 --> 00:14:02,160 OK, we're ready here. 162 00:14:02,710 --> 00:14:07,080 And let's go a little bit up because I found some typos in the previous function. 163 00:14:07,390 --> 00:14:10,120 So this here should be self. 164 00:14:10,600 --> 00:14:15,600 And also here, startswith instead of "startwith". 165 00:14:16,150 --> 00:14:21,190 So let's save this, guys, and let's now go to our terminal. 166 00:14:22,350 --> 00:14:25,200 We run the files, and let's try ls. 167 00:14:25,630 --> 00:14:29,100 And you can see that now you have the file get images. 168 00:14:29,470 --> 00:14:30,500 So let's run it. 169 00:14:30,700 --> 00:14:36,100 Let's write python, get images dot py.
170 00:14:36,460 --> 00:14:39,800 And OK, here I can see that there was some error. 171 00:14:40,660 --> 00:14:42,130 Let's go back and fix it. 172 00:14:42,130 --> 00:14:43,180 Line fifty-nine. 173 00:14:43,840 --> 00:14:48,970 So yes, here I missed the r, and here I 174 00:14:48,970 --> 00:14:49,720 missed it again. 175 00:14:49,960 --> 00:14:52,580 OK, let's save that and let's go back. 176 00:14:53,140 --> 00:14:56,110 So let's run it again. Scraping, hmm. 177 00:14:57,310 --> 00:14:57,730 Yes. 178 00:14:57,750 --> 00:15:01,360 So here it is, scraping. 179 00:15:01,600 --> 00:15:02,230 That's it. 180 00:15:02,320 --> 00:15:09,660 OK, so now all the names are the same, and let's run the code once again and see if everything's OK. 181 00:15:10,360 --> 00:15:13,000 So you can see that now, guys, everything is good. 182 00:15:13,000 --> 00:15:22,510 And we've got quite a few links, and you can see how many actual links and images we got from this website. 183 00:15:22,750 --> 00:15:30,580 And actually, if you check the section folder, and I will just pull it up for you, you can see that we've 184 00:15:30,580 --> 00:15:37,060 also created the images folder and we've got some images here from that website, which are, of course, 185 00:15:37,060 --> 00:15:41,230 not that impressive, but it's still something to display there. 186 00:15:41,680 --> 00:15:44,930 Let's actually change the name of the website and let's see what we get. 187 00:15:45,190 --> 00:15:52,090 So if I do here youtube dot com, actually just YouTube, and let's get back here and let's run the 188 00:15:52,090 --> 00:15:52,510 code. 189 00:15:52,990 --> 00:16:00,940 And you can see that now you started getting more links and images, and you can see that you've got the 190 00:16:00,940 --> 00:16:06,060 YouTube policies, developers dot Google, YouTube, and so on. 191 00:16:06,550 --> 00:16:08,850 And this is just because we changed the website.
192 00:16:09,190 --> 00:16:18,370 So this is how you can actually automate getting all the links and images from a website using specifically 193 00:16:19,210 --> 00:16:22,120 this tool called XPath expressions. 194 00:16:22,340 --> 00:16:26,860 I hope you enjoyed this video and the lectures until now. 195 00:16:27,100 --> 00:16:33,010 So I will be waiting for you in the next ones, where we're going to start working with a very interesting 196 00:16:33,010 --> 00:16:35,140 tool called Beautiful Soup. 197 00:16:35,530 --> 00:16:36,580 Thanks for watching.