Welcome back. We are starting a series of lectures where we'll build on what we've learned so far and lay the foundation of an AI application that can answer questions about the content of private documents.

At the beginning of the course, we set up the environment and signed up for a Pinecone account. We have a Pinecone API key in a .env file in the current directory. You should keep the API key secure. In my case, I'll rotate or generate another key after recording this video, and I'll do the same for all the keys shown in this course. I'm loading them as environment variables using python-dotenv. This is the code that does the job.

When dealing with very long pieces of text, it's necessary to split the text into chunks, while keeping semantically related pieces of text together. Let's look at an example.

We'll load the data to be split from a text file, but you can load it from many other document types such as PDFs, CSV files, spreadsheets, SQL databases, and so on. In this example, we'll work with a text file.
However, there are LangChain loaders that can be used to load data from almost any type of document, including CSV, Evernote, file directories, Facebook chats, JSON files, PowerPoint, PDFs, and many more. There are also public service loaders for services such as Project Gutenberg, Hacker News, Wikipedia, and YouTube. Additionally, there are proprietary service loaders for Amazon, Azure, Google, and others.

In this example, I'll take the famous speech by Winston Churchill, "We shall fight on the beaches", which was delivered to the House of Commons of the Parliament of the United Kingdom in June 1940. This is the text file, and I'll split it into chunks.

The default and recommended text splitter is called RecursiveCharacterTextSplitter. I'm importing it: from langchain.text_splitter import RecursiveCharacterTextSplitter. By default, the separators it tries to split on are "\n\n" (a double newline), "\n" (a newline), and a space.

Now I'm creating the text splitter object by calling the RecursiveCharacterTextSplitter constructor: text_splitter = RecursiveCharacterTextSplitter(...). The first argument, chunk_size, is the maximum size of your chunks as measured by the length function.
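To make the separator behavior concrete, here is a rough, stdlib-only sketch of the recursive idea. This is not LangChain's actual implementation — the real splitter also merges small pieces back together and applies the overlap — but it shows how the splitter falls back from coarse separators (paragraphs) to finer ones (lines, then words):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Illustrative only: if the text fits within chunk_size, keep it;
    otherwise split on the coarsest separator and recurse on the pieces
    with the finer separators."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        pieces.extend(recursive_split(piece, chunk_size, finer))
    return [p for p in pieces if p]

sample = "First line of text\nSecond line of text\n\nA new paragraph starts here"
print(recursive_split(sample, chunk_size=20))
```

Notice that a long paragraph with no newlines ends up split into single words here; the real RecursiveCharacterTextSplitter avoids that by merging small pieces back up toward chunk_size.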
We'll set a really small chunk size, around a line of text, just to show you how it works. However, you should experiment with different values to see which one works best; you'll normally use a higher value here. We'll talk more about chunking strategies for LLM applications later in the course. Also note that this is the maximum size of a chunk; in practice, a chunk can be smaller.

The next argument is chunk_overlap. This is the maximum overlap between chunks, which helps maintain some continuity between them. And finally, the last argument is length_function. It indicates how the length of a chunk is calculated. The default is to just count the number of characters, but because we'll work with LLMs, and LLMs use tokens rather than characters, it's pretty common to pass a token counter here.

I'm creating the chunks: chunks = text_splitter.create_documents(...), which takes as an argument a list of texts, let's say [churchill_speech]. Of course, I first have to open the file and read its contents as a string into this variable. I'm doing it now, opening the file as usual.
So with open('files/churchill_speech.txt') as f: churchill_speech = f.read(). This file is in a directory called files inside the current one; you should use a correct relative or absolute path. I'm running the cell.

Let's see the first chunks: chunks[0]. This is the first one; this is the second chunk, the third chunk, and so on. Each chunk is approximately a line. If you want to see only the text, use chunks with an index, let's say chunks[10].page_content, and I get the text only.

Let's see how many chunks I have: len(chunks). Good, there are 300 chunks.

We'll be using OpenAI's text-embedding-ada-002 model, which has a cost. It's a good idea to calculate the embedding cost in advance to avoid any surprises. We'll use the tiktoken library for this. I've already shown you how in the first part of the course, so I'll just copy and paste the code now. This is the function, and I'm calling it: print_embedding_cost(chunks). So there are 5,820 tokens in total, and the embeddings will cost a fraction of a cent.
It's really cheap. The next step is to import and instantiate OpenAI embeddings. I'm importing the required class: from langchain.embeddings import OpenAIEmbeddings, and then embeddings = OpenAIEmbeddings() to instantiate the object. If the API key is loaded in an environment variable, it is not necessary to pass it as an argument to the constructor.

The OpenAIEmbeddings class can be used to embed text into vectors. For example, let's turn some text into a vector of floating-point numbers: vector = embeddings.embed_query(text), where the argument can be any text. This is the embedding of the string "abc". Let's see the embedding of the first chunk: embeddings.embed_query(chunks[0].page_content). This is the embedding of the first chunk.

Great. In this video, I've shown you how to split text into chunks and embed them into vectors using LangChain and OpenAI. We'll take a break, and in the next video, we'll move to Pinecone and see how to insert the embeddings into a Pinecone index.
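As a closing aside before we move on to Pinecone: these vectors become useful when you compare them, since semantically similar texts produce vectors pointing in similar directions. A stdlib-only illustration of the usual comparison, cosine similarity — the three-dimensional vectors below are made up; real text-embedding-ada-002 embeddings have 1,536 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means the
    vectors point the same way, close to 0.0 means they are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy three-dimensional stand-ins for real 1,536-dimensional embeddings.
v1 = [0.1, 0.9, 0.2]
v2 = [0.1, 0.8, 0.3]
print(cosine_similarity(v1, v2))  # close to 1.0: similar direction
```

This is essentially the comparison a vector database like Pinecone performs at scale when you query an index.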