1
00:00:00,570 --> 00:00:06,740
In the previous video, I showed you how
to load PDF files into LangChain documents.

2
00:00:07,270 --> 00:00:12,700
However, your private unstructured data
isn't limited to PDF format.

3
00:00:13,530 --> 00:00:19,080
It can be found in various other formats,
such as Office documents, Google Docs,

4
00:00:19,210 --> 00:00:19,500
and more.

5
00:00:20,370 --> 00:00:23,320
Next, I'll demonstrate how to load these

6
00:00:23,330 --> 00:00:25,920
different document formats into LangChain.

7
00:00:26,510 --> 00:00:29,520
Let's change the load_document function so

8
00:00:29,530 --> 00:00:31,940
that it accepts other formats as well.

9
00:00:33,290 --> 00:00:35,600
The function takes the filename to load

10
00:00:35,610 --> 00:00:36,360
as an argument.

11
00:00:36,870 --> 00:00:39,040
I'm gonna check the extension of the file

12
00:00:39,050 --> 00:00:45,380
and based on its extension, I'll load it
using the specific LangChain loader.

13
00:00:46,410 --> 00:00:48,060
I'm importing os.

14
00:00:49,290 --> 00:00:54,060
I'm splitting the filename into name and extension.

15
00:00:54,190 --> 00:01:03,700
So name, extension equals
os.path.splitext(file).

16
00:01:04,750 --> 00:01:08,700
You can print name and extension if you
want to see their values.
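
As a minimal sketch of what the split returns, here is os.path.splitext applied to a few sample names. The file names are made up for illustration and are not from the video.

```python
import os

# Hypothetical path, used only to show the two returned values.
name, extension = os.path.splitext('files/report.pdf')
print(name)       # files/report
print(extension)  # .pdf

# Only the last suffix counts as the extension.
print(os.path.splitext('archive.tar.gz'))  # ('archive.tar', '.gz')

# A name without a dot yields an empty extension.
print(os.path.splitext('README'))          # ('README', '')
```

Note that the extension keeps its leading dot, which is why the comparisons that follow use '.pdf' and '.docx'.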

17
00:01:10,230 --> 00:01:16,920
If the extension is .pdf, I'll use the
existing code which loads PDF documents.

18
00:01:19,800 --> 00:01:31,740
So if extension == '.pdf', and
I'm loading the data using PyPDFLoader.

19
00:01:32,860 --> 00:01:36,020
I'm indenting these three lines of code.

20
00:01:36,950 --> 00:01:41,620
But if the extension is .docx, I'll use

21
00:01:41,630 --> 00:01:44,460
the code that loads Office documents.

22
00:01:44,750 --> 00:01:53,300
So, elif extension == '.docx'.

23
00:01:54,310 --> 00:01:57,220
I'm importing the required library.

24
00:01:58,090 --> 00:02:04,670
from langchain.document_loaders import

25
00:02:04,680 --> 00:02:09,650
Docx2txtLoader.

26
00:02:10,240 --> 00:02:13,630
This covers how to load Word documents

27
00:02:13,640 --> 00:02:18,250
into a document format that we can use
later in the application.

28
00:02:19,700 --> 00:02:26,230
It is also necessary to install another
Python module called docx2txt.

29
00:02:27,200 --> 00:02:29,110
I'm installing the package.

30
00:02:30,160 --> 00:02:34,390
pip install docx2txt -q

31
00:02:34,580 --> 00:02:35,810
And I'm running the cell.

32
00:02:39,350 --> 00:02:40,840
OK, it was installed.

33
00:02:41,710 --> 00:02:46,200
And I'm restarting the kernel to use the
updated package.

34
00:02:51,700 --> 00:02:55,430
And I'm running the cell that loads the
environment variables.

35
00:02:57,460 --> 00:02:59,470
Let's continue implementing the function.

36
00:03:01,300 --> 00:03:03,070
I'm printing a message for the user.

37
00:03:04,220 --> 00:03:08,430
It's the same message which says which
file is loading.

38
00:03:09,330 --> 00:03:15,250
And loader equals Docx2txtLoader(file).

39
00:03:16,460 --> 00:03:19,370
You'll add an elif branch for each

40
00:03:19,380 --> 00:03:21,210
document format you want to support.

41
00:03:22,220 --> 00:03:23,270
There are so many.

42
00:03:24,720 --> 00:03:25,510
Take a look here.

43
00:03:26,700 --> 00:03:28,570
When you are done with all the formats,

44
00:03:29,060 --> 00:03:32,210
add an else clause and print a message
for the user.

45
00:03:32,460 --> 00:03:32,930
Like this.

46
00:03:32,940 --> 00:03:37,390
else: print('Document format is not supported.')

47
00:03:41,130 --> 00:03:42,420
And I'm returning None.

48
00:03:43,230 --> 00:03:45,580
Outside the if/elif/else block, I'm loading the

49
00:03:45,590 --> 00:03:47,140
data and returning it.
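
Putting the steps so far together, the function might look roughly like the sketch below. It assumes the loaders are imported inside each branch, as narrated, and it uses the langchain.document_loaders import path from the video. Newer LangChain releases expose the same loaders from langchain_community.document_loaders, so adjust the import to your installed version. The exact wording of the printed messages is also an assumption.

```python
import os


def load_document(file):
    """Load a file into a list of LangChain documents based on its extension."""
    # The extension includes the leading dot, e.g. '.pdf'.
    name, extension = os.path.splitext(file)

    if extension == '.pdf':
        # Import path as used in the video; may differ in newer releases.
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        # Requires the docx2txt package (pip install docx2txt).
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    else:
        # Any extension without a branch above ends up here.
        print('Document format is not supported!')
        return None

    # Common to every supported format: load and return the documents.
    data = loader.load()
    return data
```

Because the comparison is case-sensitive, a file named REPORT.PDF would fall through to the else clause in this sketch. Lowercasing the extension first is one way to handle that, if you need it.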

50
00:03:48,650 --> 00:03:49,860
Let's test the function.

51
00:03:50,170 --> 00:03:51,100
I'm running the cell.

52
00:03:52,330 --> 00:03:55,580
And here in the running code section, I'm

53
00:03:55,590 --> 00:04:03,840
going to load a docx file that contains
the novel The Great Gatsby by F. Scott Fitzgerald.

54
00:04:04,270 --> 00:04:06,040
It's in the files directory.

55
00:04:06,410 --> 00:04:06,840
This file.

56
00:04:09,200 --> 00:04:15,830
data equals load_document() with the path to
The Great Gatsby .docx in the files directory.

57
00:04:17,180 --> 00:04:23,110
This time, data is a list with a single
element, and the content is in the

58
00:04:23,120 --> 00:04:24,950
page_content attribute.

59
00:04:26,420 --> 00:04:31,130
So print(data[0].page_content).

60
00:04:34,460 --> 00:04:35,490
And I'm running it.

61
00:04:42,950 --> 00:04:45,720
It has displayed the document contents.

62
00:04:48,300 --> 00:04:53,750
If you take a look at LangChain document
loaders in the documentation, you'll

63
00:04:53,760 --> 00:04:59,090
notice that in addition to transform
loaders, there are also public and

64
00:04:59,100 --> 00:05:02,210
proprietary dataset or service loaders.

65
00:05:02,660 --> 00:05:05,030
We'll take a break and in the next video,

66
00:05:05,400 --> 00:05:09,530
I'll show you how to load the data from
public or private services.

67
00:05:10,180 --> 00:05:10,970
Bye Bye.