1 00:00:00,950 --> 00:00:06,460 In this video, we will discuss how to load your private documents into LangChain. 2 00:00:06,870 --> 00:00:10,960 This is the first step in building the question-answering application. 3 00:00:11,770 --> 00:00:16,700 The purpose of this project is to ask questions about the content of your 4 00:00:16,710 --> 00:00:21,940 custom or private data, which is stored in different formats, such as Pandas 5 00:00:21,950 --> 00:00:27,680 DataFrames, PDFs, CSV or JSON files, HTML or Office documents. 6 00:00:28,730 --> 00:00:34,560 LangChain provides document loaders, which load the data into documents. 7 00:00:35,390 --> 00:00:41,680 In LangChain terminology, a document is a fancy way of saying some piece of text. 8 00:00:42,470 --> 00:00:47,180 These document loaders are aimed at making this task easy. 9 00:00:47,810 --> 00:00:55,760 There are two types of document loaders: transform loaders, and public and 10 00:00:55,770 --> 00:00:59,960 proprietary dataset or service loaders. 11 00:01:01,750 --> 00:01:04,640 Transform loaders will transform the data 12 00:01:04,650 --> 00:01:08,460 from a specific format into the LangChain document format. 13 00:01:09,030 --> 00:01:15,620 This is what you will use most of the time for formats such as PDFs, CSV and 14 00:01:15,630 --> 00:01:18,900 JSON files, Office documents, and many more. 15 00:01:19,810 --> 00:01:22,280 You can see the entire list in the 16 00:01:22,290 --> 00:01:23,500 LangChain documentation. 17 00:01:26,950 --> 00:01:30,210 The public dataset and service loaders 18 00:01:30,220 --> 00:01:36,250 are used to query and search for data on public websites such as Project 19 00:01:36,260 --> 00:01:41,050 Gutenberg, Hacker News, Wikipedia, and so on.
20 00:01:41,600 --> 00:01:44,610 The proprietary dataset and service 21 00:01:44,620 --> 00:01:50,950 loaders are used for private or proprietary services such as Azure, AWS 22 00:01:50,960 --> 00:01:55,970 S3, Google Cloud, Discord, Twitter, and many more. 23 00:01:57,930 --> 00:02:03,680 In this video, I will show you how to load your data from PDFs, which is a very 24 00:02:03,690 --> 00:02:04,800 common data format. 25 00:02:05,330 --> 00:02:08,020 In the next video, we will generalize it 26 00:02:08,030 --> 00:02:09,820 for other formats as well. 27 00:02:11,710 --> 00:02:14,220 I am loading the environment variables 28 00:02:14,230 --> 00:02:18,580 used for authentication with OpenAI and Pinecone. 29 00:02:19,770 --> 00:02:22,720 To load PDF files, I am going to install 30 00:02:22,730 --> 00:02:26,340 another required library called pypdf. 31 00:02:27,630 --> 00:02:36,100 So pip install pypdf. I want my code to be 32 00:02:36,110 --> 00:02:38,680 modular, so I am defining a function. 33 00:02:39,630 --> 00:02:45,940 So def load_document(file). The function 34 00:02:45,950 --> 00:02:51,400 will take as an argument a PDF file and return its text content. 35 00:02:52,110 --> 00:02:54,680 I am importing the required libraries. 36 00:02:57,240 --> 00:03:01,090 From langchain.document_loaders, I am 37 00:03:01,100 --> 00:03:04,010 importing the PyPDFLoader class. 38 00:03:07,290 --> 00:03:09,750 By the way, the standard recommendation 39 00:03:09,760 --> 00:03:13,170 is to put import statements at the top of the file. 40 00:03:13,920 --> 00:03:14,110 Here. 41 00:03:14,940 --> 00:03:18,070 However, there are cases when putting 42 00:03:18,080 --> 00:03:21,750 import statements inside the function is even better. 43 00:03:22,440 --> 00:03:27,510 This can help avoid circular dependencies, and you will also benefit from a more 44 00:03:27,520 --> 00:03:30,130 reliable refactoring of your code.
45 00:03:31,140 --> 00:03:34,110 When you move a function from one module 46 00:03:34,120 --> 00:03:38,550 to another, you will know that the function will continue to work because it 47 00:03:38,560 --> 00:03:40,570 contains everything inside it. 48 00:03:41,840 --> 00:03:45,350 This function loads the PDFs using a 49 00:03:45,360 --> 00:03:51,030 library called pypdf into a list of documents, where each document contains 50 00:03:51,040 --> 00:03:55,010 the page content and metadata with a page number. 51 00:03:55,440 --> 00:03:57,550 I am printing a message for the user. 52 00:03:58,560 --> 00:04:00,510 Loading, and the file name. 53 00:04:02,080 --> 00:04:06,730 And loader equals PyPDFLoader(file). 54 00:04:08,000 --> 00:04:12,030 Note that it is also able to load online PDFs. 55 00:04:12,500 --> 00:04:17,190 Just pass the URL of a PDF to PyPDFLoader. 56 00:04:18,260 --> 00:04:22,030 The next step is to call loader.load(). 57 00:04:22,700 --> 00:04:25,670 This will return a list of LangChain documents. 58 00:04:26,080 --> 00:04:28,010 One document for each page. 59 00:04:28,500 --> 00:04:30,490 And I am returning the data. 60 00:04:34,030 --> 00:04:34,780 The list. 61 00:04:41,320 --> 00:04:43,930 This will be the section for the running code. 62 00:04:49,590 --> 00:04:50,700 Let's test the function. 63 00:04:51,390 --> 00:04:53,380 I am loading a local PDF file. 64 00:04:55,870 --> 00:05:00,440 In the current directory, there is a subdirectory called files that contains a 65 00:05:00,450 --> 00:05:02,940 PDF called us-constitution. 66 00:05:03,910 --> 00:05:04,400 This one. 67 00:05:06,450 --> 00:05:07,280 I am loading it. 68 00:05:07,750 --> 00:05:14,240 So data equals load_document, and the 69 00:05:14,250 --> 00:05:17,280 argument is the path: files, then us-constitution. 70 00:05:19,350 --> 00:05:23,060 The data was split by pages, and you can 71 00:05:23,070 --> 00:05:25,920 use indexes to display a specific page.
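The function described above can be sketched roughly as follows. This is a minimal reconstruction from the narration, not the exact code shown on screen. It assumes the `langchain` and `pypdf` packages are installed and that `PyPDFLoader` is importable from `langchain.document_loaders`, as stated in the video; newer LangChain releases expose the same class from `langchain_community.document_loaders` instead.

```python
def load_document(file):
    """Load a PDF (local path or URL) into a list of documents, one per page."""
    # Imported inside the function, as in the video, so the function keeps
    # working when it is moved to another module.
    # Assumption: on newer LangChain versions, import PyPDFLoader from
    # langchain_community.document_loaders instead.
    from langchain.document_loaders import PyPDFLoader

    print(f"Loading {file}")
    loader = PyPDFLoader(file)
    data = loader.load()  # a list with one document per PDF page
    return data
```

A call would then look like `data = load_document("files/us-constitution.pdf")`, where the exact file name and extension are assumed from the narration.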
72 00:05:26,510 --> 00:05:29,440 For example, print data[1]. 73 00:05:31,150 --> 00:05:35,080 This is the second page because indexing starts from 0. 74 00:05:35,690 --> 00:05:36,380 Dot page_content. 75 00:05:39,190 --> 00:05:42,220 I am running the cell that contains the function. 76 00:05:43,130 --> 00:05:44,240 And this one. 77 00:05:50,910 --> 00:05:52,400 And I've got the page. 78 00:05:53,410 --> 00:05:59,860 If you want to see the metadata of this document, use the metadata attribute. 79 00:06:00,210 --> 00:06:04,140 For example, data[10].metadata. 80 00:06:13,010 --> 00:06:13,940 This is a dictionary. 81 00:06:16,960 --> 00:06:19,270 Let's check how many pages there are. 82 00:06:21,870 --> 00:06:29,320 You have, then len(data) in curly braces, pages in your data. 83 00:06:33,680 --> 00:06:35,130 There are 41 pages. 84 00:06:36,940 --> 00:06:38,310 And if you want to see how many 85 00:06:38,320 --> 00:06:42,870 characters are in one page, you can do print. 86 00:06:44,060 --> 00:06:48,230 There are, then in curly braces, len of data 87 00:06:48,240 --> 00:06:51,190 at index, let's say, 20, 88 00:06:51,860 --> 00:06:52,870 dot page_content. 89 00:06:55,680 --> 00:06:57,290 Characters in the page. 90 00:07:03,750 --> 00:07:07,160 If you want to load remote PDFs, just 91 00:07:07,170 --> 00:07:10,020 pass the URL to the function as an argument. 92 00:07:11,640 --> 00:07:12,300 We are done. 93 00:07:12,750 --> 00:07:17,660 In this lecture, I showed you how to load PDFs into LangChain documents. 94 00:07:18,150 --> 00:07:19,180 We'll take a quick break. 95 00:07:19,650 --> 00:07:21,900 And in the next video, I'll show you how 96 00:07:21,910 --> 00:07:24,940 to load other document formats into LangChain. 97 00:07:25,370 --> 00:07:25,760 Bye-bye.
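The inspection steps from this lecture (indexing a page, reading its metadata, counting pages and characters) can be tried without a PDF using the sketch below. The `Document` class here is a hypothetical stand-in with the same two attributes mentioned in the video, `page_content` and `metadata`. The three-page `data` list, the `example.pdf` name, and the `source` metadata key are made up for illustration; with a real file, `data` would come from the loader.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a LangChain document: some text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical three-page result, shaped like the list returned by loader.load().
data = [
    Document(page_content=f"Text of page {i}",
             metadata={"source": "example.pdf", "page": i})
    for i in range(3)
]

print(data[1].page_content)  # the second page, because indexing starts at 0
print(data[1].metadata)      # a dictionary that includes the page number
print(f"You have {len(data)} pages in your data")
print(f"There are {len(data[1].page_content)} characters in the page")
```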