1 00:00:04,940 --> 00:00:12,980 Parsing HTML. Hello, everyone. Today I'm going to teach you how to parse HTML files, and to do 2 00:00:12,980 --> 00:00:19,130 that we usually need to use a third-party package called lxml. 3 00:00:19,580 --> 00:00:23,750 And this package is nothing more than simply an XML parser. 4 00:00:24,350 --> 00:00:27,650 This package is a quick way to navigate through documents. 5 00:00:27,650 --> 00:00:32,000 And actually this is one of the best XML parsers that you can find out there. 6 00:00:32,390 --> 00:00:39,480 So this package can be installed on basically every platform that supports Python. 7 00:00:39,680 --> 00:00:44,860 So this means that nothing can stop us from using it in our project. 8 00:00:45,200 --> 00:00:53,300 So let us install this package, and after that I will show you how we can actually parse an HTML 9 00:00:53,300 --> 00:00:53,760 file. 10 00:00:54,080 --> 00:01:06,680 So I will go here to the terminal, and actually let me first go to PyCharm and create a Section 4 folder for 11 00:01:07,070 --> 00:01:09,400 the rest of our files. 12 00:01:09,460 --> 00:01:14,780 So when I create Section 4 here, we will collect all of our files for this section. 13 00:01:14,990 --> 00:01:17,180 So, Section 14 00:01:18,440 --> 00:01:25,490 4. OK, and you can see now that you have here the folder for Section 4, and let's actually also create Section 15 00:01:25,490 --> 00:01:31,220 2, and we're going to add the files we already used in there. 16 00:01:31,880 --> 00:01:37,190 So I will create the directory Section 2. 17 00:01:37,610 --> 00:01:44,900 OK, so I will simply paste everything we've done so far into that section. 18 00:01:45,410 --> 00:01:46,070 Here we go. 19 00:01:46,520 --> 00:01:47,250 And that's it. 20 00:01:47,360 --> 00:01:50,560 So you can see that this way it is actually way more organized. 21 00:01:50,960 --> 00:01:54,230 So let me close all of those files here. 
22 00:01:55,160 --> 00:02:01,780 And after that, we can go back to the terminal and I will show you what you need to do. 23 00:02:02,150 --> 00:02:07,640 First of all, let's go to the Section 4 files. 24 00:02:08,060 --> 00:02:09,430 Here we are in Section 4. 25 00:02:09,830 --> 00:02:19,250 And before actually running the files, let's do pip install lxml. 26 00:02:20,680 --> 00:02:26,320 OK, so as you can see, I have it already installed, but if you don't have it, you can install it 27 00:02:26,320 --> 00:02:34,660 with that command. And let's actually do pip uninstall lxml. 28 00:02:35,170 --> 00:02:35,950 Yes. 29 00:02:36,670 --> 00:02:40,330 And then I will just reinstall it so you can see how easy it is. 30 00:02:41,390 --> 00:02:41,990 That's it. 31 00:02:42,520 --> 00:02:47,630 This is how much time it takes, truly one second, to install the lxml package. 32 00:02:48,380 --> 00:02:53,500 Anyway, let's now continue by writing in the Python command line. 33 00:02:54,130 --> 00:03:05,210 So write python, and then let's import, import resources. That's a mistake; actually this is 34 00:03:05,350 --> 00:03:06,220 the wrong package. 35 00:03:06,500 --> 00:03:10,060 We need to import requests. 36 00:03:10,360 --> 00:03:11,500 OK, that's fine. 37 00:03:11,830 --> 00:03:16,960 And now, let's first download a page, 38 00:03:18,210 --> 00:03:22,100 which is obviously going to be in HTML format. 39 00:03:22,470 --> 00:03:33,750 So how do we download a page? We write response equals requests.get, and here I write 40 00:03:33,750 --> 00:03:41,150 the address of the page, which will be https, www dot 41 00:03:41,670 --> 00:03:45,000 debian dot org, slash 42 00:03:46,610 --> 00:03:52,070 releases, slash stable, slash index 43 00:03:53,450 --> 00:03:57,500 dot en dot html. That's it. 44 00:03:58,910 --> 00:04:02,280 So everything has passed; we downloaded this page.
45 00:04:02,600 --> 00:04:07,070 Now let's use the package that we already installed. 46 00:04:07,070 --> 00:04:17,790 So let's write from lxml.etree import HTML. 47 00:04:17,810 --> 00:04:20,720 OK, and then I write root 48 00:04:22,100 --> 00:04:28,250 equals HTML of response.content. 49 00:04:29,450 --> 00:04:30,030 OK. 50 00:04:30,620 --> 00:04:38,360 So by using the HTML function, we're using a shortcut that reads the HTML that is passed to that 51 00:04:38,360 --> 00:04:38,900 function. 52 00:04:39,170 --> 00:04:45,470 And from that HTML, we're producing an XML tree. 53 00:04:45,680 --> 00:04:52,820 So I want to note here that we're not passing the text of the response, but the whole content into 54 00:04:52,820 --> 00:04:59,120 the HTML function, so we have the whole content when we create the XML tree. 55 00:05:00,290 --> 00:05:05,120 Now let's write here, in brackets, e.tag 56 00:05:06,770 --> 00:05:10,120 for e in root. 57 00:05:11,990 --> 00:05:21,740 OK, and you can see here that we have the head and body of our root. And now let's write root.find, 58 00:05:23,030 --> 00:05:30,620 and then head, .find, and in here let's write title, 59 00:05:32,280 --> 00:05:32,830 dot 60 00:05:34,490 --> 00:05:44,300 text. And OK, I've got an error here, so let's see what it is. So root.find, head. 61 00:05:44,600 --> 00:05:49,480 So there is a missing parenthesis here. 62 00:05:49,700 --> 00:05:57,560 So after title, when I put in the parenthesis, you can see that we're printing the actual text content 63 00:05:57,710 --> 00:05:58,850 of our website. 64 00:05:59,150 --> 00:06:07,790 So we're basically printing the title from the head, which is Debian buster, and then we get the release information. 65 00:06:08,150 --> 00:06:10,070 So, once again.
66 00:06:10,070 --> 00:06:16,430 Let's go now to the web page, because I want to show what the actual page looks like and from where we're 67 00:06:16,430 --> 00:06:18,490 actually getting the information. 68 00:06:18,980 --> 00:06:21,590 So this is the page that we're actually using. 69 00:06:21,590 --> 00:06:29,720 And let me inspect the code, so you can see that basically this one here is the HTML. 70 00:06:30,050 --> 00:06:34,480 OK, and here is the head and here's the body. 71 00:06:34,700 --> 00:06:44,120 OK, so if you actually go back to the terminal, you can see that up here is where we ran 72 00:06:45,170 --> 00:06:51,950 the listing of the sections in our HTML file. We got exactly the head, which is this one here, and then we've got 73 00:06:51,950 --> 00:06:52,520 the body. 74 00:06:52,940 --> 00:07:00,380 OK, so this is how the file is representing the website and this is how we're getting the information 75 00:07:00,380 --> 00:07:02,080 from the website into Python. 76 00:07:02,300 --> 00:07:05,210 So let's actually open the sections. 77 00:07:05,210 --> 00:07:11,390 And when I open the body, go to the content here and open it, you can see 78 00:07:11,390 --> 00:07:18,890 that the header of the content is the Debian buster release information. 79 00:07:18,890 --> 00:07:22,070 And actually this is exactly what we found here. 80 00:07:22,490 --> 00:07:24,900 OK, so this is exactly what we've got. 81 00:07:25,160 --> 00:07:32,390 So the only reason why we got this information is because up here we selected that we would like to 82 00:07:32,390 --> 00:07:35,470 see the actual content, 83 00:07:35,480 --> 00:07:39,880 so response.content of the website that we selected here.
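The interpreter steps described so far can be sketched offline as follows. This is a minimal sketch, not the exact session from the video: the inline `SAMPLE_PAGE` bytes are a made-up stand-in for `response.content`, so no network access is needed, and the title text is an assumption modelled on what the speaker reads out.

```python
from lxml.etree import HTML

# Hypothetical stand-in for response.content (bytes), so the sketch runs offline.
# In the video the bytes come from requests.get(...).content instead.
SAMPLE_PAGE = (
    b"<html><head><title>Debian buster release information</title></head>"
    b"<body><div id='content'><h1>Release information</h1></div></body></html>"
)

# HTML() parses the markup and returns the root <html> element of the tree.
root = HTML(SAMPLE_PAGE)

# The direct children of the root are the top-level sections of the page.
tags = [e.tag for e in root]
print(tags)

# Chain find() calls to walk from the root to <head>, then to <title>.
title = root.find("head").find("title").text
print(title)
```

Passing bytes rather than decoded text mirrors the speaker's note about using the response content instead of the response text; it leaves the encoding detection to the parser.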
84 00:07:40,130 --> 00:07:46,010 So the website that we selected was passed through HTML into the response, and then from the response we 85 00:07:46,010 --> 00:07:56,030 got the content field, and once we displayed it, we got the exact header of the content 86 00:07:56,030 --> 00:07:56,540 field. 87 00:07:56,630 --> 00:08:02,030 OK, so here we did head, and this is the header one, and we found the title. 88 00:08:03,020 --> 00:08:12,080 So this is how you can basically take any website and find its header and any other relevant information 89 00:08:12,380 --> 00:08:13,710 that you would like to use. 90 00:08:14,240 --> 00:08:16,220 So let's write another thing. 91 00:08:16,400 --> 00:08:21,260 Let's write root.find and then, 92 00:08:21,260 --> 00:08:23,120 let's write, body. 93 00:08:23,630 --> 00:08:25,970 OK, that's fine. 94 00:08:26,450 --> 00:08:30,650 Then .findall, div. OK. 95 00:08:30,710 --> 00:08:40,990 And then I will write one here, .find, and then p, .text, and that's it. 96 00:08:41,240 --> 00:08:49,100 So you can see here we found Debian 10.10, the version, and again this is part of the body and I will 97 00:08:49,100 --> 00:08:51,880 show you exactly where this part is right here. 98 00:08:52,280 --> 00:08:58,420 So we go to root.find, OK, and we go to the body here. 99 00:08:58,550 --> 00:09:07,220 So we're within the body, then div one, find p, and you can see that here we have Debian 10.10. OK, 100 00:09:07,580 --> 00:09:11,750 exactly the content under the p tag, under the p. 101 00:09:11,750 --> 00:09:12,140 Sorry. 102 00:09:12,410 --> 00:09:13,390 Quite simple, right? 103 00:09:13,880 --> 00:09:20,750 So there is actually one quite unpleasant disadvantage here. 104 00:09:21,620 --> 00:09:30,500 And this is because when we're searching by div, OK, there is quite a high chance that a div tag has 105 00:09:30,500 --> 00:09:36,060 been inserted before the actual section that we look for.
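The positional lookup and the drawback just mentioned can be illustrated with a small offline sketch. Both inline pages are invented for illustration. The id-based lookup at the end uses the ElementPath predicate `[@id='content']`, which is one possible approach under these assumptions rather than something typed in the video.

```python
from lxml.etree import HTML

# Two hypothetical pages: the second has an extra <div> inserted near the top.
PAGE_A = b"""<html><body>
<div id="header"><p>Navigation</p></div>
<div id="content"><p>Debian release text</p></div>
</body></html>"""

PAGE_B = b"""<html><body>
<div id="header"><p>Navigation</p></div>
<div id="banner"><p>Announcement</p></div>
<div id="content"><p>Debian release text</p></div>
</body></html>"""


def by_position(page: bytes) -> str:
    """Take the second <div> under <body>, then its first <p>."""
    root = HTML(page)
    return root.find("body").findall("div")[1].find("p").text


def by_id(page: bytes) -> str:
    """Find the <div> whose id is 'content', wherever it sits in the tree."""
    root = HTML(page)
    return root.find(".//div[@id='content']").find("p").text


# The positional lookup returns different paragraphs for the two pages,
# while the id-based lookup returns the same one for both.
print(by_position(PAGE_A), "|", by_position(PAGE_B))
print(by_id(PAGE_A), "|", by_id(PAGE_B))
```

The index in `findall("div")[1]` silently points at a different element once the page layout changes, which is why an attribute that is unique on the page tends to be the sturdier anchor.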
106 00:09:36,260 --> 00:09:42,820 So if we're searching by the div tag, we might not always find the right place. 107 00:09:43,250 --> 00:09:50,920 So in many cases it is better to search not by the div tag but by the id. 108 00:09:51,290 --> 00:09:54,890 And here in that case, the id is content. 109 00:09:55,760 --> 00:10:02,390 So most of the time you would prefer this way, because in that way you're sure that there is only 110 00:10:02,390 --> 00:10:09,800 one content place, or there is only one title piece, 111 00:10:09,800 --> 00:10:10,130 right? 112 00:10:10,550 --> 00:10:12,800 There are not multiple areas like this. 113 00:10:12,800 --> 00:10:16,400 But the divs, as you can see here, are quite a few. 114 00:10:16,830 --> 00:10:25,910 So to address this issue, we're going to create a simple code file in Section 4, and this 115 00:10:25,910 --> 00:10:27,110 will be our first file. 116 00:10:27,260 --> 00:10:32,900 So let's create the Python file, and I will create it 117 00:10:34,310 --> 00:10:42,530 because we are going to go to the DuckDuckGo website. So let's write from, and let me zoom in a bit, so from 118 00:10:43,160 --> 00:10:51,610 lxml.html import fromstring. 119 00:10:52,190 --> 00:10:56,170 OK, let's fix the syntax here. 120 00:10:57,510 --> 00:11:03,270 OK, and also tostring. 121 00:11:04,350 --> 00:11:04,890 OK. 122 00:11:04,980 --> 00:11:19,800 And also from lxml.html import parse and submit, underscore, form. 123 00:11:21,140 --> 00:11:33,800 OK, then let's import requests, OK, and let's create our response, equal to requests dot 124 00:11:33,800 --> 00:11:39,880 get, and here I need to write, as you can see, the URL of the website. 125 00:11:40,220 --> 00:11:46,820 So I'm going to write right here, https, 126 00:11:46,820 --> 00:11:56,390 duckduckgo dot com, and then form page, and I write from
127 00:11:57,530 --> 00:12:01,670 string of response.text. 128 00:12:02,090 --> 00:12:03,920 OK, let's move it a little bit up. 129 00:12:04,410 --> 00:12:09,680 OK, so after that, let's write from, or actually, write form 130 00:12:10,790 --> 00:12:13,370 equals form, 131 00:12:14,900 --> 00:12:17,880 OK, form here, page. 132 00:12:18,260 --> 00:12:22,400 So the previous variable, .forms, 133 00:12:24,860 --> 00:12:25,370 zero. 134 00:12:25,890 --> 00:12:27,890 OK, and then I write print, 135 00:12:30,650 --> 00:12:32,330 tostring, 136 00:12:33,790 --> 00:12:34,330 form. 137 00:12:36,000 --> 00:12:47,460 OK, and then let's write page equals parse, and here I will write the URL again, so the URL will 138 00:12:47,460 --> 00:12:49,080 be h, 139 00:12:50,230 --> 00:12:56,170 ttp, slash, duckduckgo dot com. 140 00:12:58,540 --> 00:13:02,950 OK, and then let's write .getroot. 141 00:13:05,000 --> 00:13:06,150 OK, we're ready here. 142 00:13:06,420 --> 00:13:11,630 Let's write page, OK, page.forms, 143 00:13:13,460 --> 00:13:14,030 zero, 144 00:13:15,650 --> 00:13:16,220 dot 145 00:13:18,190 --> 00:13:23,740 fields, q, equals Python. 146 00:13:25,130 --> 00:13:29,480 And actually put this in parentheses, OK? 147 00:13:29,630 --> 00:13:32,360 And we're almost done, actually, so let's write the result. 148 00:13:34,120 --> 00:13:45,280 OK, this will be a variable, so result here will be equal to parse, submit_form, 149 00:13:47,520 --> 00:13:50,400 page.forms, 150 00:13:52,190 --> 00:13:56,550 zero, .getroot. 151 00:13:58,020 --> 00:14:05,020 That's it, and then let's write print, tostring, the result. That's it. 152 00:14:05,700 --> 00:14:06,990 So everything looks good. 153 00:14:07,020 --> 00:14:13,790 Guys, let's save this file and let's now go to the terminal and let's exit the Python mode.
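As a rough offline companion to the script dictated above, the sketch below builds a form from an inline string instead of downloading a page. The markup, the `example.invalid` URL, and the `fake_open_http` helper are all hypothetical. The helper is passed through the `open_http` argument of `submit_form` so that nothing is sent over the network.

```python
from lxml.html import fromstring, submit_form, tostring

# Hypothetical search page; the script in the video downloads a real one.
PAGE = """<html><body>
<form id="search_form" action="https://example.invalid/search" method="get">
  <input type="text" name="q" value="">
  <input type="submit" value="Search">
</form>
</body></html>"""

page = fromstring(PAGE)
form = page.forms[0]          # first <form> element in the document
form.fields["q"] = "Python"   # fill in the text input named "q"

# Serialise the form back to markup (bytes) to inspect it.
print(tostring(form))

captured = {}


def fake_open_http(method, url, values):
    """Record what would be sent instead of opening a real connection."""
    captured.update(method=method, url=url, values=list(values))
    return captured


# submit_form collects the form values and hands them to the opener.
submit_form(form, open_http=fake_open_http)
print(captured)
```

In the dictated script no opener is supplied, so `submit_form` performs a real HTTP request and its result is handed to `parse`; the stub here only makes the collected method, URL, and field values visible.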
154 00:14:13,810 --> 00:14:21,450 And as you can see, I'm in Section 4, so I can simply run this file by writing python and then the 155 00:14:21,840 --> 00:14:25,230 file name, and it will run for a second. 156 00:14:25,230 --> 00:14:29,730 And you can see the enormous size of the output, how big it is. 157 00:14:29,730 --> 00:14:36,090 And obviously this is because we cannot display the whole XML in the way we would like to see it. 158 00:14:36,340 --> 00:14:42,870 But if you look at just the first part of the output, you can see the object and you can see the id, which 159 00:14:42,870 --> 00:14:44,330 is search_form_homepage. 160 00:14:44,340 --> 00:14:51,840 You can see the class, that is search, and all the other elements of the header and of the form that 161 00:14:51,840 --> 00:14:52,930 we obtained online. 162 00:14:52,950 --> 00:14:55,810 So this is the way that you can actually do it. 163 00:14:56,190 --> 00:15:04,350 So this is what I wanted to share with you for today, basically how you can parse an HTML file into an XML 164 00:15:04,350 --> 00:15:04,720 tree. 165 00:15:04,950 --> 00:15:10,350 I hope it was quite useful, specifically because here we didn't use any APIs. 166 00:15:10,540 --> 00:15:14,360 Guys, thanks very much for watching and I'll see you in the next video.