Welcome back.

Up until now, we've loaded our custom or private data into LangChain documents. The next step is to split, or chunk, the documents into smaller parts.

In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It's an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. As you know, we first need to embed any content that we index in Pinecone. The main reason for chunking is to make sure that we are embedding a piece of content with as little noise as possible while still keeping it semantically relevant.

For example, in semantic search, we index a corpus of documents, where each document contains valuable information on a specific topic. By applying an effective chunking strategy, we can make sure that our search results accurately capture the essence of the user's query. If our chunks are too small or too large, we may get imprecise search results or miss opportunities to surface relevant content.

As a rule of thumb: if a chunk of text makes sense to a human without the surrounding context, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensure that the search results are accurate and relevant.

I am defining a function called chunk_data: def chunk_data(data).

LangChain provides many text splitters, but RecursiveCharacterTextSplitter is the recommended one for generic text. I am importing it from langchain.text_splitter: from langchain.text_splitter import RecursiveCharacterTextSplitter. By default, the separators it tries to split on are "\n\n", "\n", and a space. I have already explained how RecursiveCharacterTextSplitter works in the previous section, so now I am just writing the code.
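A minimal sketch of what's being typed here, assuming the classic langchain.text_splitter import path (newer releases move it to langchain_text_splitters) and a placeholder hard-coded chunk size, since the exact initial value isn't stated in the narration:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(data):
    # Recursively tries the separators "\n\n", "\n", " " (and finally "")
    # until every piece fits within the chunk_size character limit.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=256,   # hard-coded for now; parameterized below
        chunk_overlap=0,
    )
    # data is the list of LangChain Documents returned by the loader.
    return text_splitter.split_documents(data)
```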
When a sentence is embedded, the resulting vector focuses on the sentence's specific meaning, and comparison with other sentence embeddings naturally happens at that level. This also implies that the embedding may miss broader contextual information found in a paragraph. When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning of the text.

To be flexible, I'll add a second argument with a default value for the chunk size, instead of this hard-coded value. So, chunk_size=256 in the function's signature, and here chunk_size equals chunk_size, the function's parameter. It's better now.

And chunks equals text_splitter.split_documents(data). It returns a list of documents. Use create_documents instead of split_documents when the data isn't already split into pages; this depends on how you have loaded the data. And I'm returning the chunks. I'm running the cell.

Let's test it in the running-code part. OK, I'm gonna load the US Constitution. I'm commenting out the other code and running this cell to load the document. And I'm running chunks = chunk_data(data), where data was returned by load_document. Let's see how many chunks I have: print(len(chunks)). I'm running it, and there are 190 chunks. If you want to see one chunk of data, you can do print(chunks[10].page_content). This is the chunk at index 10. Good.

We'll be using OpenAI's text-embedding-ada-002 model, which has a cost. It's prudent to calculate the embedding cost in advance to avoid any surprises. We'll utilize the tiktoken library for this. As I've already shown you how to do this in the first part of the course, I'll simply copy-paste the code. I'm running the cell and checking the cost: print_embedding_cost(chunks). OK, it's really cheap.
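Putting the pieces together, here's a hedged sketch of the final function, the copied cost-estimation helper, and the test cells. The file name, the load_document helper (defined in the previous video), and the $0.0004-per-1K-token price are assumptions, so verify them against your own notebook and current OpenAI pricing:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(data, chunk_size=256):
    # chunk_size is now a parameter with a default instead of a hard-coded value.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
    )
    # split_documents() expects Documents (e.g. pages from a loader);
    # use create_documents() instead when starting from raw strings.
    return text_splitter.split_documents(data)

def print_embedding_cost(texts):
    # text-embedding-ada-002 uses the cl100k_base encoding.
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total tokens: {total_tokens}')
    # Assumed price of $0.0004 per 1,000 tokens; check current pricing.
    print(f'Embedding cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

# Test run, mirroring the notebook cells (file name is illustrative):
data = load_document('us_constitution.pdf')  # loader from the previous video
chunks = chunk_data(data)
print(len(chunks))              # 190 in this run
print(chunks[10].page_content)  # inspect the chunk at index 10
print_embedding_cost(chunks)
```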
After chunking the document, the next step is to embed the chunks and upload them to a vector database like Pinecone. We'll take a break, and in the next video, we'll discuss embedding and uploading to a vector database.