Web scraping overview.

Hello, everyone. In the last section we talked about how to use APIs in order to extract data from the web, but many websites do not actually offer an API that you can use to extract their data. And I'm talking about the smaller websites, not the big ones like Amazon or Twitter, but the smaller ones that do not offer this option. So what do we do in that case? In that case we use scraping techniques in order to recover data from the web automatically. Some of the most powerful and best-known tools, which work with Python 3.7, are Beautiful Soup and Scrapy. Scrapy, for example, is a framework written specifically for Python, and it is used to extract data from the web in an automated way. And even today, and actually especially today, many people are using Scrapy for data processing and specifically for data mining. So if you would like to use Python to extract data from the web and for data mining, Scrapy is a very, very good tool that you can use for mining your data.

So in general, in this section I would like to cover the following topics, so you are aware of what we are going to learn today. We're going to talk about web scraping tools; as I told you, we're going to do an overview of Scrapy and Beautiful Soup. We're going to use them to extract information from pages and to parse the HTML files, which are the files that all websites use to deliver their data. We're obviously going to learn how to use the Scrapy toolkit, we're going to learn the web crawling process and how to analyze data with Scrapy, and we're going to do some scraping on the web using the cloud, so I will show you how you can use Scrapy and scrape web data in the cloud.
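To give a first idea of what parsing HTML looks like, here is a minimal sketch that fetches a page with the requests library and parses it with Beautiful Soup. The URL and the tags it looks for are just placeholders for illustration, not part of the course material.

```python
# A minimal sketch of parsing HTML with requests + Beautiful Soup.
# The URL and the tags we search for are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"          # hypothetical page to scrape
response = requests.get(url)
response.raise_for_status()          # stop early if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and every link found on the page.
print(soup.title.string if soup.title else "No title found")
for link in soup.find_all("a"):
    print(link.get("href"), link.get_text(strip=True))
```

The point is simply that the unstructured HTML becomes a tree of objects you can query from Python instead of raw text.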
So when we extract content from the Internet, there are a few widely used techniques that I'd like to discuss today. The first and simplest one is screen scraping, where you obtain information by moving around the screen and checking what is rendered for the user. Then you have website scraping, which aims to obtain the data or the information from a resource, such as an HTML resource, and to process that information once we extract the relevant data. The next technique is report mining, and this is a technique that uses trees to obtain information. If you're not familiar with trees, I advise you to research them on the web, because this is quite an interesting structure, and it can be used with multiple file formats such as HTML, PDF, CSV and so on. In that way we can create a simple and quite fast mechanism without needing to use an API, and sometimes we wouldn't require any connection at all, because we can simply extract the data and analyze it afterwards. So this is quite a good way to mine the data you need.

But let's focus now on the next one, which is called spiders. Spiders refer to defining specific rules for moving around a website, and while moving around following these rules, it's like a robot. So while moving around the website, we can extract information about the interactions with the users, what data has been created on the website and so on. The idea here is that software developers like you and me only need to write the rules for the data extraction, and after we write those rules, we can just leave the code to execute and obtain the relevant information. And finally, we have the crawlers, which are not spiders but processes that automatically parse and extract content from a website. They index this content from the website and provide it to the search engine immediately, and this is how page indexes are actually built.

And before ending this video for today, let's talk about what web scraping is. Web scraping is a technique that allows you to extract information from websites, and it basically helps you transform the unstructured data collected from the website in the HTML file into structured data that is easy to use for programmers and from programming languages.

So that was the intro for this section, guys. I hope you really enjoyed it, and I think it will be quite useful for you, especially once you start writing the code, which is actually going to be from the next video. So that's it. Thanks very much for watching this video for today, and I will see you in the next one.
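As a small preview of the spider rules described above, before we start writing code in the next video, here is a minimal sketch of a Scrapy spider. It uses the public practice site quotes.toscrape.com, so the spider name and CSS selectors are specific to that site and are not the code we will write for this course.

```python
# A minimal sketch of a Scrapy spider: we only write the rules
# (where to start, what to extract, which links to follow) and
# Scrapy's engine does the crawling for us.
# quotes.toscrape.com is a public practice site; selectors match its layout.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Rule 1: extract the data we care about from the current page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Rule 2: follow the "next page" link and repeat the same parsing.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

You could run a standalone file like this with `scrapy runspider quotes_spider.py -o quotes.json`; we will walk through the proper Scrapy workflow step by step in the coming videos.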