Hi there. In this lecture, we will learn how to do web scraping. Web scraping is grabbing data from a website, getting all the data from a web page, and any website or web page can be scraped. After we scrape data from a website or web page, we can use that data, then clean it using data tools like the pandas module or PostgreSQL query tools. So when we scrape any website on the Internet, we have our browser, and on the other side there is a server, which has the ability to save and serve files like JavaScript files, CSS files, and HTML files. We will focus on HTML files, which hold all the content of the website in the shape of text and image data. So when we search for something using the browser, the browser contacts the server and gets all the information about that search. This information, as we said before, is saved on servers as JavaScript, CSS, and HTML files, which open as a web page in our browser. If we search for the Wikipedia page on the programming languages used in the most popular websites, Wikipedia shows the page of programming languages used in most popular websites. If we right-click here and choose "View page source", as you can see, it is very long and holds the whole data of our web page.
Now we can search for any information we want using Ctrl+F, and type what we want to search for in that search panel. So web scraping is grabbing all this data, or whatever part of this data we want. Now the question is: why do we use web scraping? The answer to that question is as follows. We scrape data from websites to use it in the process of data analysis, and also in market research. It is also used in other research and studies, for example to increase sales of some products or increase the benefit of something, and it can be used effectively to serve humanity and invent new products that solve problems, and so on, but on condition that we get permission to do that and maintain privacy. So collecting data is useful for public benefit, but on condition that we get the permission to do that and maintain the privacy of anyone on the Internet. To do that, we must check the scraping permissions of any website or web page before we begin this process of collecting data, and respect privacy on the Internet, as follows. At the end of the URL of the web page, we check whether there is permission for scraping, and see what we can scrape and what items are prevented from scraping. So we will type, at the end of the URL, robots.txt, then press Enter, and we will get all the items allowed to be scraped and the items that we cannot scrape.
So Wikipedia does contain a robots.txt file, so we can see which information we are allowed to get from Wikipedia. Again, we will look at the Disallow entries. Disallow means that you can't scrape these items, like /w/ or /users/ and so on. There are also Allow entries, and User-agent lines that name the crawler the rules apply to, and so on. So Allow means you can scrape, and Disallow means you can't scrape, to maintain privacy on the Internet. Now we will install Beautiful Soup by using pip in CMD, the Windows terminal, and type the following: pip install beautifulsoup4. Then press Enter, and it will begin to install Beautiful Soup 4. After Beautiful Soup 4 has installed, we will install the requests module as follows: pip install requests. After the requests module has installed, we will jump into our Jupyter Lab to do the web scraping from the Wikipedia web page, as follows. We will open a new Python notebook. Before we start the web scraping, we will head over to the official website to see the documentation of Beautiful Soup 4 and how this module works in scraping. You should review the documentation of Beautiful Soup to understand how it works.
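As a minimal offline sketch of how robots.txt rules behave, Python's standard urllib.robotparser can evaluate a rules snippet. The snippet below is a simplified, hypothetical stand-in, not a copy of Wikipedia's real file, which lives at https://en.wikipedia.org/robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in rules (assumption for illustration, not Wikipedia's actual file)
rules = """\
User-agent: *
Allow: /w/load.php?
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths with no matching rule are allowed by default
print(parser.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page"))
# Paths under the Disallow prefix are blocked
print(parser.can_fetch("*", "https://en.wikipedia.org/w/index.php"))
```

The same parser can also load the real file directly with set_url() and read(), once you have an Internet connection.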
We will do that also for the requests module. You should know that these libraries or modules will help us to grab HTML files from Wikipedia, so reading the documentation of these two modules will help you learn how to use them effectively. Now we will use our Jupyter Lab as follows. In the first cell, we will type import requests on the first line, then on the second line we will type from bs4 import BeautifulSoup, then run the first cell. So we import Beautiful Soup, which will make us able to grab data from HTML files, and the requests module, which will make us able to download the HTML files from which Beautiful Soup 4 will then grab the data. Then in the second cell, we will type response = requests.get() and, inside the parentheses, we will add the URL of the website that we want to scrape, as follows. So let's do that: we will add the link of the web page we want to scrape data from between the two parentheses of requests.get. In the third cell, we will print our response to get the response status code. So we will type print(response), which prints the status code of the response from our website, which is 200. To confirm that, we will go to our website in the browser, open the developer tools, and click on Network.
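Those first cells can be sketched as follows; the article URL is my assumption for the page named in the lecture, and the request only succeeds with a working Internet connection:

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL of the article discussed in the lecture
url = "https://en.wikipedia.org/wiki/Programming_languages_used_in_most_popular_websites"

# Download the page's HTML from the server
response = requests.get(url)

print(response)              # the Response object, e.g. <Response [200]>
print(response.status_code)  # 200 means the request succeeded
```

Status code 200 is the HTTP "OK" response; other codes (404, 403, and so on) mean the page was not delivered and there is nothing useful to parse.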
With the Network tab open, we refresh our page, and we will see that most of the responses are 200 responses, as you see. If you check these responses after refreshing the web page, you find that the response codes in most cases are 200. Then in the next cell, we will type print(response.text), which will print all the text of the website inside the HTML file of our website. This is all the text data inside the HTML file, printed by this command: print, and between the parentheses, response.text. When we use this attribute, it returns all the text data from the HTML file of the website. We can assign this data to a variable as follows: soup = BeautifulSoup(response.text, "html.parser"). If you review the Beautiful Soup documentation, you will find that Beautiful Soup supports both an HTML parser and an XML parser, so Beautiful Soup can be used with HTML files and with XML files. All this text data will be assigned to a variable called soup, as follows. Now we can use Beautiful Soup methods to get any sort of text data that we scraped from the HTML file of our website. So we will print soup.body.
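A minimal sketch of the parsing step, using a tiny inline HTML string as a stand-in for response.text so it runs without a network connection (the real page is far longer):

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; only the structure matters here
html = """
<html>
  <head><title>Programming languages used in most popular websites - Wikipedia</title></head>
  <body><p>Example content.</p></body>
</html>
"""

# "html.parser" is the built-in HTML parser; XML parsing requires the lxml package
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)  # the page title as plain text
```

With the real page, you would pass response.text instead of the inline string.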
So soup.body: this attribute returns the whole body from the text data that we scraped. We can also print another attribute, called soup.title, to print the title of the text data from the HTML file of our website. The title is "Programming languages used in most popular websites - Wikipedia", and so on. We can also use the soup.find_all method to find any item, or all items, that begin with a specific tag, a specific div, or something specific in our text data, and return them successfully. This method is the most popular method of the bs4 module. For example, we can use it to return or find all items that begin with the b tag, as follows. We can also use the find method, without "all", to find the first item beginning with a specific character or a specific tag. So print(soup.find("b")) returns the first item that begins with the tag b. We can also use a specific method called select to print all the items inside a specific selector or a specific structure, like a class. We will find a specific class and print the items inside it, or the data inside it, as follows. So we will print soup.select with the class between the parentheses.
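The three lookup methods just described can be sketched side by side on a small sample snippet (the markup below is invented for illustration, not taken from the Wikipedia page):

```python
from bs4 import BeautifulSoup

# Small sample markup standing in for the scraped page
html = '<p class="intro">First <b>one</b> then <b>two</b>.</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("b"))     # every matching tag: [<b>one</b>, <b>two</b>]
print(soup.find("b"))         # only the first match: <b>one</b>
print(soup.select(".intro"))  # CSS selector: all elements with class "intro"
```

find_all returns a list of every match, find returns the first match (or None), and select accepts any CSS selector, which is why a leading dot means "class".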
We try the wikitable sortable class. Written with a space, this selector doesn't return anything, so we will use the class indicator, a dot, for each of these two classes that contain the data we want to scrape or find, as follows. We will search our text for the specific class that contains the specific, important information, as follows. So we will add this class, and we succeed in grabbing data from this class, which is the most important class in the HTML file of our web page. Now we will use this class to grab the information in the popularity column of this table, as follows. So at this point, we have reached the end of this lecture. I hope you enjoyed this lecture and got it all. Thank you for being here. Thanks for watching. See you in the next video.
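The class-based table extraction described above can be sketched offline with a miniature table; the rows and numbers below are invented stand-ins, and only the class="wikitable sortable" attribute mirrors the real page:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the article's table (values are made up)
html = """
<table class="wikitable sortable">
  <tr><th>Websites</th><th>Popularity</th></tr>
  <tr><td>Google</td><td>2,500,000,000</td></tr>
  <tr><td>Facebook</td><td>2,000,000,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# ".wikitable.sortable" (dots, no space) selects elements carrying both classes
table = soup.select("table.wikitable.sortable")[0]

# Skip the header row, then take the second cell (the Popularity column) of each row
popularity = [row.find_all("td")[1].text for row in table.find_all("tr")[1:]]
print(popularity)
```

On the real page, the same two lines run against the soup built from response.text, returning the full popularity column as a list of strings ready for cleaning with pandas.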