1
00:00:00,950 --> 00:00:06,460
In this video, we will discuss how to
load your private documents into LangChain.

2
00:00:06,870 --> 00:00:10,960
This is the first step in building the
question-answering application.

3
00:00:11,770 --> 00:00:16,700
The purpose of this project is to ask
questions about the content of your

4
00:00:16,710 --> 00:00:21,940
custom or private data, which is stored
in different formats, such as Pandas

5
00:00:21,950 --> 00:00:27,680
DataFrames, PDFs, CSV or JSON files, HTML
or Office documents.

6
00:00:28,730 --> 00:00:34,560
LangChain provides document loaders,
which load the data into documents.

7
00:00:35,390 --> 00:00:41,680
In LangChain terminology, a document
is a fancy way of saying some piece of text.

8
00:00:42,470 --> 00:00:47,180
These document loaders are aimed at
making this task easy.

9
00:00:47,810 --> 00:00:55,760
There are two types of document loaders:
transform loaders, and public and

10
00:00:55,770 --> 00:00:59,960
proprietary data set or service loaders.

11
00:01:01,750 --> 00:01:04,640
Transform loaders will transform the data

12
00:01:04,650 --> 00:01:08,460
from a specific format into the LangChain
document format.

13
00:01:09,030 --> 00:01:15,620
This is what you will use most of the
time for formats such as PDFs, CSV and

14
00:01:15,630 --> 00:01:18,900
JSON files, Office documents, and many more.

15
00:01:19,810 --> 00:01:22,280
You can see the entire list in the

16
00:01:22,290 --> 00:01:23,500
LangChain documentation.

17
00:01:26,950 --> 00:01:30,210
The public data set and service loaders

18
00:01:30,220 --> 00:01:36,250
are used to query and search for data on
public websites such as Project

19
00:01:36,260 --> 00:01:41,050
Gutenberg, Hacker News, Wikipedia, and so on.

20
00:01:41,600 --> 00:01:44,610
The proprietary data set and service

21
00:01:44,620 --> 00:01:50,950
loaders are used for private or
proprietary services such as Azure, AWS

22
00:01:50,960 --> 00:01:55,970
S3, Google Cloud, Discord, Twitter, and
many more.

23
00:01:57,930 --> 00:02:03,680
In this video, I will show you how to
load your data from PDFs, which is a very

24
00:02:03,690 --> 00:02:04,800
common data format.

25
00:02:05,330 --> 00:02:08,020
In the next video, we will generalize it

26
00:02:08,030 --> 00:02:09,820
for other formats as well.

27
00:02:11,710 --> 00:02:14,220
I am loading the environment variables

28
00:02:14,230 --> 00:02:18,580
used for authentication with OpenAI and Pinecone.

29
00:02:19,770 --> 00:02:22,720
To load PDF files, I am going to install

30
00:02:22,730 --> 00:02:26,340
another library called pypdf that is required.

31
00:02:27,630 --> 00:02:36,100
So: pip install pypdf. I want my code to be

32
00:02:36,110 --> 00:02:38,680
modular so I am defining a function.

33
00:02:39,630 --> 00:02:45,940
So: def load_document(file). The function

34
00:02:45,950 --> 00:02:51,400
will take as an argument a PDF file and
return its text content.

35
00:02:52,110 --> 00:02:54,680
I am importing the required libraries.

36
00:02:57,240 --> 00:03:01,090
From langchain.document_loaders, I am

37
00:03:01,100 --> 00:03:04,010
importing the PyPDFLoader class.

38
00:03:07,290 --> 00:03:09,750
By the way, the standard recommendation

39
00:03:09,760 --> 00:03:13,170
is to put import statements at the top of
the file.

40
00:03:13,920 --> 00:03:14,110
Here.

41
00:03:14,940 --> 00:03:18,070
However, there are cases when putting

42
00:03:18,080 --> 00:03:21,750
import statements inside the function is
even better.

43
00:03:22,440 --> 00:03:27,510
This can help avoid circular imports,
and you will also benefit from a more

44
00:03:27,520 --> 00:03:30,130
reliable refactoring of your code.

45
00:03:31,140 --> 00:03:34,110
When you move a function from one module

46
00:03:34,120 --> 00:03:38,550
to another, you will know that the
function will continue to work because it

47
00:03:38,560 --> 00:03:40,570
contains everything inside it.
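
As a minimal sketch of the point about function-local imports (not the code typed in the video), the hypothetical helper below imports what it needs inside its own body, so it keeps working if it is moved to another module. The name `parse_settings` and the use of the standard `json` module are assumptions chosen only for illustration.

```python
def parse_settings(text):
    """Parse a JSON string into a dictionary (hypothetical helper)."""
    # The import lives inside the function, so the function carries
    # its own dependency with it when it is moved to another module.
    import json

    return json.loads(text)


settings = parse_settings('{"chunk_size": 256}')
print(settings["chunk_size"])
```

The trade-off is that the dependency is no longer visible at the top of the file, which is why top-level imports remain the usual recommendation.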

48
00:03:41,840 --> 00:03:45,350
This function loads the PDFs using a

49
00:03:45,360 --> 00:03:51,030
library called pypdf into an array of
documents where each document contains

50
00:03:51,040 --> 00:03:55,010
the page content and metadata with a page number.

51
00:03:55,440 --> 00:03:57,550
I am printing a message for the user.

52
00:03:58,560 --> 00:04:00,510
Loading, and then the file.

53
00:04:02,080 --> 00:04:06,730
And loader equals PyPDFLoader(file).

54
00:04:08,000 --> 00:04:12,030
Note that it is also able to load online PDFs.

55
00:04:12,500 --> 00:04:17,190
Just pass the URL of a PDF to PyPDFLoader.

56
00:04:18,260 --> 00:04:22,030
The next step is to call loader.load().

57
00:04:22,700 --> 00:04:25,670
This will return a list of LangChain documents.

58
00:04:26,080 --> 00:04:28,010
One document for each page.

59
00:04:28,500 --> 00:04:30,490
And I am returning the data.

60
00:04:34,030 --> 00:04:34,780
The list.
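
The function described in the steps above might look roughly like the sketch below. It is a reconstruction from the narration, not a copy of the on-screen code: the import path is the one spoken in the video and may differ in newer LangChain releases, and the exact wording of the printed message is an assumption.

```python
def load_document(file):
    """Load a PDF into a list of LangChain documents, one per page."""
    # Import path as spoken in the video; newer LangChain releases may
    # expose the loader from a different package (assumption to verify).
    from langchain.document_loaders import PyPDFLoader

    # Message wording is an assumption based on the narration.
    print(f"Loading {file}")
    loader = PyPDFLoader(file)  # requires the pypdf package
    data = loader.load()  # one document per page
    return data
```

Because the import sits inside the function, defining it does not require LangChain to be installed; the library is only needed when the function is called.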

61
00:04:41,320 --> 00:04:43,930
This will be the section for the running code.

62
00:04:49,590 --> 00:04:50,700
Let's test the function.

63
00:04:51,390 --> 00:04:53,380
I am loading a local PDF file.

64
00:04:55,870 --> 00:05:00,440
In the current directory, there is a
subdirectory called files that contains a

65
00:05:00,450 --> 00:05:02,940
PDF called us-constitution.

66
00:05:03,910 --> 00:05:04,400
This one.

67
00:05:06,450 --> 00:05:07,280
I am loading it.

68
00:05:07,750 --> 00:05:14,240
So data equals load_document, and the

69
00:05:14,250 --> 00:05:17,280
argument is files and us-constitution.

70
00:05:19,350 --> 00:05:23,060
The data was split by pages and you can

71
00:05:23,070 --> 00:05:25,920
use indexes to display a specific page.

72
00:05:26,510 --> 00:05:29,440
For example, print data[1].

73
00:05:31,150 --> 00:05:35,080
This is the second page because it starts
from 0.

74
00:05:35,690 --> 00:05:36,380
Dot page_content.

75
00:05:39,190 --> 00:05:42,220
I am running the cell that contains the function.

76
00:05:43,130 --> 00:05:44,240
And this one.

77
00:05:50,910 --> 00:05:52,400
And I've got the page.

78
00:05:53,410 --> 00:05:59,860
If you want to see the metadata of this
document, use the metadata attribute.

79
00:06:00,210 --> 00:06:04,140
For example, data[10].metadata.

80
00:06:13,010 --> 00:06:13,940
This is a dictionary.

81
00:06:16,960 --> 00:06:19,270
Let's check how many pages there are.

82
00:06:21,870 --> 00:06:29,320
You have, in curly braces, len(data), pages in your data.

83
00:06:33,680 --> 00:06:35,130
There are 41 pages.

84
00:06:36,940 --> 00:06:38,310
And if you want to see how many

85
00:06:38,320 --> 00:06:42,870
characters are in one page, you can do print.

86
00:06:44,060 --> 00:06:48,230
There are, in curly braces, the length of data

87
00:06:48,240 --> 00:06:51,190
of, let's say, 20.

88
00:06:51,860 --> 00:06:52,870
Dot page_content.

89
00:06:55,680 --> 00:06:57,290
Characters in the page.
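
The inspection steps narrated above can be tried without a PDF by using a stand-in for the loaded documents. In this sketch the `Page` class, the three fake pages, and the metadata values are all assumptions for illustration; real documents returned by the loader expose the same `page_content` and `metadata` attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a loaded document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Fake three-page result, shaped like the list returned by the loader.
data = [
    Page(f"Text of page {i}", {"source": "example.pdf", "page": i})
    for i in range(3)
]

print(data[1].page_content)  # second page, because indexing starts at 0
print(data[1].metadata)      # metadata is a dictionary
print(f"You have {len(data)} pages in your data")
print(f"There are {len(data[1].page_content)} characters in the page")
```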

90
00:07:03,750 --> 00:07:07,160
If you want to load remote PDFs, just

91
00:07:07,170 --> 00:07:10,020
pass the URL to the function as an argument.

92
00:07:11,640 --> 00:07:12,300
We are done.

93
00:07:12,750 --> 00:07:17,660
In this lecture, I showed you how to load
PDFs into LangChain documents.

94
00:07:18,150 --> 00:07:19,180
We'll take a quick break.

95
00:07:19,650 --> 00:07:21,900
And in the next video, I'll show you how

96
00:07:21,910 --> 00:07:24,940
to load other document formats into LangChain.

97
00:07:25,370 --> 00:07:25,760
Bye-bye.