1 00:00:00,570 --> 00:00:06,740 In the previous video, I showed you how to load PDF files into LangChain documents. 2 00:00:07,270 --> 00:00:12,700 However, your private unstructured data isn't limited to PDF format. 3 00:00:13,530 --> 00:00:19,080 It can be found in various other formats, such as Office documents, Google Docs, 4 00:00:19,210 --> 00:00:19,500 and more. 5 00:00:20,370 --> 00:00:23,320 Next, I'll demonstrate how to load these 6 00:00:23,330 --> 00:00:25,920 different document formats into LangChain. 7 00:00:26,510 --> 00:00:29,520 Let's change the load_document function so 8 00:00:29,530 --> 00:00:31,940 that it accepts other formats as well. 9 00:00:33,290 --> 00:00:35,600 The function takes the filename to load 10 00:00:35,610 --> 00:00:36,360 as an argument. 11 00:00:36,870 --> 00:00:39,040 I'm going to check the extension of the file 12 00:00:39,050 --> 00:00:45,380 and, based on its extension, I'll load it using the specific LangChain loader. 13 00:00:46,410 --> 00:00:48,060 I'm importing os. 14 00:00:49,290 --> 00:00:54,060 I'm splitting the filename into name and extension. 15 00:00:54,190 --> 00:01:03,700 So name, extension equals os.path.splitext of file. 16 00:01:04,750 --> 00:01:08,700 You can print name and extension if you want to see their values. 17 00:01:10,230 --> 00:01:16,920 If the extension is .pdf, I'll use the existing code which loads PDF documents. 18 00:01:19,800 --> 00:01:31,740 So if extension equals equals .pdf, I'm loading the data using PyPDFLoader. 19 00:01:32,860 --> 00:01:36,020 I'm indenting these three lines of code. 20 00:01:36,950 --> 00:01:41,620 But if the extension is .docx, I'll use 21 00:01:41,630 --> 00:01:44,460 the code that loads Office documents. 22 00:01:44,750 --> 00:01:53,300 So, elif extension equals equals .docx. 23 00:01:54,310 --> 00:01:57,220 I'm importing the required library. 24 00:01:58,090 --> 00:02:04,670 from langchain.document_loaders import 25 00:02:04,680 --> 00:02:09,650 Docx2txtLoader.
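The extension check described so far relies only on the standard library, so it can be tried on its own before any LangChain loader is involved. Here is a minimal sketch; the path `files/report.docx` is a hypothetical example and does not need to exist on disk, because `os.path.splitext` only inspects the string.

```python
import os

# Hypothetical path, used only for illustration; the file need not exist.
file = 'files/report.docx'

# splitext splits at the last dot: everything before it is the name,
# and the extension keeps its leading dot.
name, extension = os.path.splitext(file)

print(name)       # files/report
print(extension)  # .docx
```

Because the extension includes the leading dot, the comparisons in the function are written against `'.pdf'` and `'.docx'` rather than `'pdf'` and `'docx'`.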
26 00:02:10,240 --> 00:02:13,630 This covers how to load Word documents 27 00:02:13,640 --> 00:02:18,250 into a document format that we can use later in the application. 28 00:02:19,700 --> 00:02:26,230 It is also necessary to install another Python module called docx2txt. 29 00:02:27,200 --> 00:02:29,110 I'm installing the package. 30 00:02:30,160 --> 00:02:34,390 pip install docx2txt -q 31 00:02:34,580 --> 00:02:35,810 And I'm running the cell. 32 00:02:39,350 --> 00:02:40,840 OK, it was installed. 33 00:02:41,710 --> 00:02:46,200 And I'm restarting the kernel to use the updated package. 34 00:02:51,700 --> 00:02:55,430 And I'm running the cell that loads the environment variables. 35 00:02:57,460 --> 00:02:59,470 Let's continue implementing the function. 36 00:03:01,300 --> 00:03:03,070 I'm printing a message for the user. 37 00:03:04,220 --> 00:03:08,430 It's the same message, which says which file is loading. 38 00:03:09,330 --> 00:03:15,250 And loader equals Docx2txtLoader of file. 39 00:03:16,460 --> 00:03:19,370 You'll add an elif branch for each 40 00:03:19,380 --> 00:03:21,210 document format you want to support. 41 00:03:22,220 --> 00:03:23,270 There are so many. 42 00:03:24,720 --> 00:03:25,510 Take a look here. 43 00:03:26,700 --> 00:03:28,570 When you are done with all the formats, 44 00:03:29,060 --> 00:03:32,210 add an else clause and print a message for the user. 45 00:03:32,460 --> 00:03:32,930 Like this. 46 00:03:32,940 --> 00:03:37,390 else, print document format is not supported. 47 00:03:41,130 --> 00:03:42,420 And I'm returning None. 48 00:03:43,230 --> 00:03:45,580 Outside the if, elif, else block, I'm loading the 49 00:03:45,590 --> 00:03:47,140 data and returning it. 50 00:03:48,650 --> 00:03:49,860 Let's test the function. 51 00:03:50,170 --> 00:03:51,100 I'm running the cell.
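Put together, the function described above might look like the sketch below. It is a reconstruction from the narration, not the exact code shown on screen: the wording of the printed messages is an assumption, and the imports are placed inside each branch (also an assumption) so that the unsupported-format path runs even when LangChain is not installed. The import path `langchain.document_loaders` follows the narration; newer LangChain releases expose the same loaders from `langchain_community.document_loaders`.

```python
import os


def load_document(file):
    """Load a file into LangChain documents, choosing a loader by extension."""
    name, extension = os.path.splitext(file)

    if extension == '.pdf':
        # Requires the pypdf package.
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')  # message text is an assumption
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        # Requires the docx2txt package (pip install docx2txt).
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')  # message text is an assumption
        loader = Docx2txtLoader(file)
    else:
        # Add further elif branches above for other formats as needed.
        print('Document format is not supported!')
        return None

    # Outside the if/elif/else block: load the data and return it.
    data = loader.load()
    return data
```

Calling the function with an extension that has no branch, for example `load_document('notes.xyz')` (a hypothetical filename), prints the message and returns `None` without touching any loader.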
52 00:03:52,330 --> 00:03:55,580 And here in the running code section, I'm 53 00:03:55,590 --> 00:04:03,840 going to load a docx file that contains the novel The Great Gatsby by F. Scott Fitzgerald. 54 00:04:04,270 --> 00:04:06,040 It's in the files directory. 55 00:04:06,410 --> 00:04:06,840 This file. 56 00:04:09,200 --> 00:04:15,830 data equals load_document of files slash The Great Gatsby dot docx. 57 00:04:17,180 --> 00:04:23,110 This time data is a list with a single element, and the content is in the 58 00:04:23,120 --> 00:04:24,950 page_content attribute. 59 00:04:26,420 --> 00:04:31,130 So print data of zero dot page_content. 60 00:04:34,460 --> 00:04:35,490 And I'm running it. 61 00:04:42,950 --> 00:04:45,720 It has displayed the document contents. 62 00:04:48,300 --> 00:04:53,750 If you take a look at LangChain document loaders in the documentation, you'll 63 00:04:53,760 --> 00:04:59,090 notice that in addition to transform loaders, there are also public and 64 00:04:59,100 --> 00:05:02,210 proprietary dataset or service loaders. 65 00:05:02,660 --> 00:05:05,030 We'll take a break, and in the next video, 66 00:05:05,400 --> 00:05:09,530 I'll show you how to load the data from public or private services. 67 00:05:10,180 --> 00:05:10,970 Bye-bye.