Welcome back.

Up until now, we've loaded our custom or private data into LangChain documents. The next step is to split, or chunk, the documents into smaller parts.

In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It's an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. As you know, we first need to embed any content that we index in Pinecone. The main reason for chunking is to make sure that we are embedding a piece of content with as little noise as possible while still keeping it semantically relevant.

For example, in semantic search, we index a corpus of documents, where each document contains valuable information on a specific topic. By applying an effective chunking strategy, we can make sure that our search results accurately capture the essence of the user's query. If our chunks are too small or too large, we may get imprecise search results or miss opportunities to surface relevant content.

As a rule of thumb: if a chunk of text makes sense to a human without the surrounding context, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensure that the search results are accurate and relevant.

I am defining a function called chunk_data: def chunk_data(data).

LangChain provides many text splitters, but RecursiveCharacterTextSplitter is the recommended one for generic text. I am importing it from langchain.text_splitter: from langchain.text_splitter import RecursiveCharacterTextSplitter. By default, the separators it tries to split on are "\n\n", "\n", and a space. I have already explained how RecursiveCharacterTextSplitter works in the previous section, so now I am just writing the code.
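A minimal sketch of what's being typed here, assuming the classic langchain.text_splitter import path (newer releases move it to langchain_text_splitters) and a placeholder hard-coded chunk size, since the exact initial value isn't stated in the narration:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(data):
    # Recursively tries the separators "\n\n", "\n", " " (and finally "")
    # until every piece fits within the chunk_size character limit.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=256,   # hard-coded for now; parameterized below
        chunk_overlap=0,
    )
    # data is the list of LangChain Documents returned by the loader.
    return text_splitter.split_documents(data)
```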
When a sentence is embedded, the resulting vector focuses on the sentence's specific meaning, and comparison with other sentence embeddings naturally happens at that level. This also implies that the embedding may miss broader contextual information found in a paragraph. When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning of the text.

To be flexible, I'll add a second argument with a default value for the chunk size, instead of this hard-coded value. So, chunk_size=256 in the function's signature, and here chunk_size equals chunk_size, the function's parameter. It's better now.

And chunks equals text_splitter.split_documents(data). It returns a list of documents. Use create_documents instead of split_documents when the data isn't already split into pages; this depends on how you have loaded the data. And I'm returning the chunks. I'm running the cell.

Let's test it in the running-code part. OK, I'm gonna load the US Constitution. I'm commenting out the other code and running this cell to load the document. And I'm running chunks = chunk_data(data), where data was returned by load_document. Let's see how many chunks I have: print(len(chunks)). I'm running it, and there are 190 chunks. If you want to see one chunk of data, you can do print(chunks[10].page_content). This is the chunk at index 10. Good.

We'll be using OpenAI's text-embedding-ada-002 model, which has a cost. It's prudent to calculate the embedding cost in advance to avoid any surprises. We'll utilize the tiktoken library for this. As I've already shown you how to do this in the first part of the course, I'll simply copy-paste the code. I'm running the cell and checking the cost: print_embedding_cost(chunks). OK, it's really cheap.
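Putting the pieces together, here's a hedged sketch of the final function, the copied cost-estimation helper, and the test cells. The file name, the load_document helper (defined in the previous video), and the $0.0004-per-1K-token price are assumptions, so verify them against your own notebook and current OpenAI pricing:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(data, chunk_size=256):
    # chunk_size is now a parameter with a default instead of a hard-coded value.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
    )
    # split_documents() expects Documents (e.g. pages from a loader);
    # use create_documents() instead when starting from raw strings.
    return text_splitter.split_documents(data)

def print_embedding_cost(texts):
    # text-embedding-ada-002 uses the cl100k_base encoding.
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total tokens: {total_tokens}')
    # Assumed price of $0.0004 per 1,000 tokens; check current pricing.
    print(f'Embedding cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

# Test run, mirroring the notebook cells (file name is illustrative):
data = load_document('us_constitution.pdf')  # loader from the previous video
chunks = chunk_data(data)
print(len(chunks))              # 190 in this run
print(chunks[10].page_content)  # inspect the chunk at index 10
print_embedding_cost(chunks)
```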
After chunking the document, the next step is to embed the chunks and upload them to a vector database like Pinecone. We'll take a break, and in the next video, we'll discuss embedding and uploading to a vector database.