Welcome back. We are starting a series of lectures where we'll build on what we've learned so far and lay the foundation of an AI application that can answer questions about the content of private documents.

At the beginning of the course, we set up the environment and signed up for a Pinecone account. We have a Pinecone API key in a .env file in the current directory. You should keep the API key secure. In my case, I'll rotate or generate another key after recording this video, and I'll do the same for all the keys shown in this course. I'm loading them as environment variables using python-dotenv. This is the code that does the job.

When dealing with very long pieces of text, it's necessary to split the text into chunks, while keeping semantically related pieces of text together. Let's look at an example.

We'll load the data to be split from a text file, but you can load it from many other document types such as PDFs, CSV files, spreadsheets, SQL databases, and so on. In this example, we'll work with a text file.
However, there are LangChain loaders that can be used to load data from almost any type of document, including CSV, Evernote, file directories, Facebook chats, JSON files, PowerPoint, PDFs, and many more. There are also public service loaders for services such as Project Gutenberg, Hacker News, Wikipedia, and YouTube. Additionally, there are proprietary service loaders for Amazon, Azure, Google, and others.

In this example, I'll take the famous speech by Winston Churchill, "We shall fight on the beaches", which was delivered to the House of Commons of the Parliament of the United Kingdom in June 1940. This is the text file, and I'll split it into chunks.

The default and recommended text splitter is called RecursiveCharacterTextSplitter. I'm importing it: from langchain.text_splitter import RecursiveCharacterTextSplitter. By default, the separators it tries to split on are "\n\n" (a double newline), "\n" (a newline), and a space.

Now I'm creating the text splitter object by calling the RecursiveCharacterTextSplitter constructor: text_splitter = RecursiveCharacterTextSplitter(...). The first argument, chunk_size, is the maximum size of your chunks as measured by the length function.
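To make the separator behavior concrete, here is a rough, stdlib-only sketch of the recursive idea. This is not LangChain's actual implementation — the real splitter also merges small pieces back together and applies the overlap — but it shows how the splitter falls back from coarse separators (paragraphs) to finer ones (lines, then words):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Illustrative only: if the text fits within chunk_size, keep it;
    otherwise split on the coarsest separator and recurse on the pieces
    with the finer separators."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        pieces.extend(recursive_split(piece, chunk_size, finer))
    return [p for p in pieces if p]

sample = "First line of text\nSecond line of text\n\nA new paragraph starts here"
print(recursive_split(sample, chunk_size=20))
```

Notice that a long paragraph with no newlines ends up split into single words here; the real RecursiveCharacterTextSplitter avoids that by merging small pieces back up toward chunk_size.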
We'll set a really small chunk size, around a line of text, just to show you how it works. However, you should experiment with different values to see which one works best; you'll normally use a higher value here. We'll talk more about chunking strategies for LLM applications later in the course. Also note that this is the maximum size of a chunk; in practice, a chunk can be smaller.

The next argument is chunk_overlap. This is the maximum overlap between chunks, which helps maintain some continuity between them. And finally, the last argument is length_function. It indicates how the length of a chunk is calculated. The default is to just count the number of characters, but because we'll work with LLMs, and LLMs use tokens rather than characters, it's pretty common to pass a token counter here.

I'm creating the chunks: chunks = text_splitter.create_documents(...), which takes as an argument a list of texts, let's say [churchill_speech]. Of course, I first have to open the file and read its contents as a string into this variable. I'm doing it now, opening the file as usual.
So with open('files/churchill_speech.txt') as f: churchill_speech = f.read(). This file is in a directory called files inside the current one; you should use a correct relative or absolute path. I'm running the cell.

Let's see the first chunks: chunks[0]. This is the first one; this is the second chunk, the third chunk, and so on. Each chunk is approximately a line. If you want to see only the text, use chunks with an index, let's say chunks[10].page_content, and I get the text only.

Let's see how many chunks I have: len(chunks). Good, there are 300 chunks.

We'll be using OpenAI's text-embedding-ada-002 model, which has a cost. It's a good idea to calculate the embedding cost in advance to avoid any surprises. We'll use the tiktoken library for this. I've already shown you how in the first part of the course, so I'll just copy and paste the code now. This is the function, and I'm calling it: print_embedding_cost(chunks). So there are 5,820 tokens in total, and the embeddings will cost a fraction of a cent.
It's really cheap. The next step is to import and instantiate OpenAI embeddings. I'm importing the required class: from langchain.embeddings import OpenAIEmbeddings, and then embeddings = OpenAIEmbeddings() to instantiate the object. If the API key is loaded in an environment variable, it is not necessary to pass it as an argument to the constructor.

The OpenAIEmbeddings class can be used to embed text into vectors. For example, let's turn some text into a vector of floating-point numbers: vector = embeddings.embed_query(text), where the argument can be any text. This is the embedding of the string "abc". Let's see the embedding of the first chunk: embeddings.embed_query(chunks[0].page_content). This is the embedding of the first chunk.

Great. In this video, I've shown you how to split text into chunks and embed them into vectors using LangChain and OpenAI. We'll take a break, and in the next video, we'll move to Pinecone and see how to insert the embeddings into a Pinecone index.
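As a closing aside before we move on to Pinecone: these vectors become useful when you compare them, since semantically similar texts produce vectors pointing in similar directions. A stdlib-only illustration of the usual comparison, cosine similarity — the three-dimensional vectors below are made up; real text-embedding-ada-002 embeddings have 1,536 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means the
    vectors point the same way, close to 0.0 means they are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy three-dimensional stand-ins for real 1,536-dimensional embeddings.
v1 = [0.1, 0.9, 0.2]
v2 = [0.1, 0.8, 0.3]
print(cosine_similarity(v1, v2))  # close to 1.0: similar direction
```

This is essentially the comparison a vector database like Pinecone performs at scale when you query an index.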