In the last two videos, we talked about transform loaders, which transform or load data from a specific format into the LangChain Document format. In that case, the data was loaded from local files: us_constitution.pdf and the_great_gatsby.docx are local files.

Now let's see how to load data from online services into LangChain. Here we don't deal with files, but with the different protocols or APIs that connect to those services. Since the format and code differ for each service, I would create a unique function for each dataset or service loader that I want to support in my application.

Let's do this for Wikipedia. First, I am installing the wikipedia module, which is a prerequisite: pip install wikipedia -q. Then def load_from_wikipedia, and the function will take two arguments: query, and lang, which defaults to 'en'. query is free text used to find documents in Wikipedia, and lang, English by default, selects the language edition of Wikipedia to search.

Next, I am importing the Wikipedia loader: from langchain.document_loaders import WikipediaLoader. Then loader = WikipediaLoader(query=query, lang=lang, load_max_docs=2). load_max_docs can be used to limit the number of downloaded documents. You can use a hard-coded value or add a third argument to the function, like this: load_max_docs=2 as a default value in the function signature, and load_max_docs=load_max_docs in the loader call. That is more flexible. Finally, data = loader.load() and return data. And that's it. I am running the cell.

Now I am loading something from Wikipedia in the LangChain Document format: data = load_from_wikipedia. Ok, it's load; that was a typo. Let's say 'GPT-4' as the query.

I remind you that the training data for GPT-4 was cut off in September 2021. GPT-4 itself was launched in March 2023, so information about it was not included in its training data. Without loading the data from external sources, LLMs like GPT-3.5 Turbo or GPT-4 have no knowledge of it.

Then print(data[0].page_content), and I am running the cell. This is the data about GPT-4, downloaded from Wikipedia.
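Assembled from the narration above, here is a minimal sketch of the function built in this lecture. The names follow what is dictated in the video; the import path matches the older langchain package layout (newer releases expose the same loader from langchain_community.document_loaders):

    # prerequisite: pip install wikipedia -q
    from langchain.document_loaders import WikipediaLoader

    def load_from_wikipedia(query, lang='en', load_max_docs=2):
        # query: free-text search string used to find Wikipedia articles
        # lang: language edition of Wikipedia to search ('en' by default)
        # load_max_docs: limits the number of downloaded documents
        loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
        data = loader.load()  # returns a list of LangChain Document objects
        return data

    data = load_from_wikipedia('GPT-4')
    print(data[0].page_content)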
If you want to see it in another language, just add a second argument with the language code, for example 'de' for German (Deutsch). I am running it again.

All right. In this lecture, I have shown you how to load data from an online service like Wikipedia. We will take a short break, and in the next video we will discuss chunking documents. These chunks will then be embedded into numeric vectors and uploaded to a vector store.
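For reference, the German-language call demonstrated in this lecture, using the load_from_wikipedia function sketched above, would look like this:

    data = load_from_wikipedia('GPT-4', 'de')  # 'de' selects the German Wikipedia
    print(data[0].page_content)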