1 00:00:00,630 --> 00:00:06,920 We have loaded the data into LangChain documents and split them into smaller chunks. 2 00:00:07,690 --> 00:00:14,160 Next, we will embed the chunks and upload both the chunks and the embeddings to a 3 00:00:14,170 --> 00:00:18,040 Pinecone vector store for fast retrieval and similarity search. 4 00:00:29,750 --> 00:00:33,940 I am defining a function called insert_or_fetch_embeddings. 5 00:00:40,380 --> 00:00:43,630 This function will create an index if the 6 00:00:43,640 --> 00:00:50,590 index doesn't exist, embed the chunks and add both the chunks and embeddings into 7 00:00:50,600 --> 00:00:51,490 the Pinecone index. 8 00:00:52,280 --> 00:00:55,270 If the index already exists, the function 9 00:00:55,280 --> 00:00:58,170 will just load the embeddings from that index. 10 00:01:00,600 --> 00:01:02,850 This function will have two arguments, 11 00:01:03,780 --> 00:01:05,830 index_name and chunks. 12 00:01:08,050 --> 00:01:10,560 I am importing the necessary libraries. 13 00:01:11,950 --> 00:01:18,880 I am importing the pinecone library and from langchain_community.vectorstores, 14 00:01:22,900 --> 00:01:25,070 I am importing the Pinecone class. 15 00:01:30,130 --> 00:01:33,260 From langchain_openai, I am importing 16 00:01:33,270 --> 00:01:44,880 OpenAIEmbeddings and from pinecone, I am importing the PodSpec class. 17 00:01:45,610 --> 00:01:49,040 This is necessary to create an index. 18 00:01:56,530 --> 00:01:59,160 Next, I will initialize the Pinecone client. 19 00:02:00,050 --> 00:02:02,900 To do this, you will need the Pinecone API key. 20 00:02:03,390 --> 00:02:08,300 We covered this in the last section where we discussed Pinecone for the first time. 21 00:02:08,790 --> 00:02:13,480 If you skipped that video, I can tell you in just a few words what you have to do now. 22 00:02:14,810 --> 00:02:20,200 Go to pinecone.io and create a free Pinecone account. 
23 00:02:21,050 --> 00:02:26,380 Log into your Pinecone account, go to API keys and generate one. 24 00:02:27,530 --> 00:02:36,680 Copy the key to the clipboard and paste it in the .env file where the OpenAI API key is too. 25 00:02:37,910 --> 00:02:45,340 This is my .env file that contains both the OpenAI API key and the Pinecone API key. 26 00:02:46,110 --> 00:02:54,120 I am creating a Pinecone object, pc equals pinecone.Pinecone. 27 00:02:57,070 --> 00:03:02,860 The constructor expects an environment variable called PINECONE_API_KEY. 28 00:03:03,330 --> 00:03:08,200 If you have already created and loaded such a variable into memory as we did, 29 00:03:08,590 --> 00:03:11,240 the authentication is automatically handled. 30 00:03:11,250 --> 00:03:15,280 However, if you haven't loaded the 31 00:03:15,290 --> 00:03:21,120 environment variable, you should explicitly pass an argument named api_key 32 00:03:21,130 --> 00:03:25,460 with your key's value to the Pinecone constructor like this. 33 00:03:26,090 --> 00:03:30,260 api_key equals your API key. 34 00:03:36,730 --> 00:03:39,620 Next, I will create an instance of the 35 00:03:39,630 --> 00:03:44,700 OpenAIEmbeddings class which will be used to embed text into vectors. 36 00:03:47,220 --> 00:03:56,490 embeddings equals OpenAIEmbeddings and the arguments model equals text-embedding- 37 00:03:56,500 --> 00:04:05,380 3-small and dimensions equals 1536. 38 00:04:08,320 --> 00:04:12,250 If the OpenAI API key is loaded as an 39 00:04:12,260 --> 00:04:15,970 environment variable, it is not necessary to specify it as an argument. 40 00:04:18,200 --> 00:04:24,070 If the Pinecone index, which is the function's argument, already exists, you 41 00:04:24,080 --> 00:04:26,030 can load the embeddings like this. 42 00:04:26,460 --> 00:04:28,790 First, I am checking that the index 43 00:04:28,800 --> 00:04:29,710 already exists. 
44 00:04:30,340 --> 00:04:37,190 If index_name in pc.list_indexes().names(), 45 00:04:38,910 --> 00:04:47,060 I am printing a message and loading the embeddings: Index, index_name in curly 46 00:04:47,070 --> 00:05:03,860 braces, already exists. Loading embeddings, and end equals empty string and 47 00:05:03,870 --> 00:05:19,810 vector_store equals Pinecone.from_existing_index of index_name and 48 00:05:19,820 --> 00:05:23,310 embeddings, the OpenAIEmbeddings object. 49 00:05:27,000 --> 00:05:35,120 I am also printing Ok, but if the index 50 00:05:35,130 --> 00:05:39,300 does not exist, I create the index and insert the embeddings. 51 00:05:41,230 --> 00:05:43,440 I am printing a message for the user. 52 00:05:45,350 --> 00:05:51,100 Creating index, index_name in curly braces, 53 00:05:51,110 --> 00:05:54,120 and embeddings. 54 00:06:00,330 --> 00:06:03,140 I am creating a new index, pc 55 00:06:03,150 --> 00:06:10,550 .create_index and the arguments name equals index_name, the function's 56 00:06:10,560 --> 00:06:33,200 parameter, dimensions equals 1536, metric equals cosine and the spec equals PodSpec 57 00:06:33,210 --> 00:06:44,730 of environment equals gcp-starter and if you want to create a serverless index 58 00:06:44,740 --> 00:06:48,890 instead of a pod-based one, use this configuration instead. 59 00:06:50,080 --> 00:06:59,890 I am calling Pinecone.from_documents, vector_store equals the Pinecone class 60 00:06:59,900 --> 00:07:11,710 .from_documents and the arguments are chunks, embeddings and index_name 61 00:07:11,720 --> 00:07:13,270 equals index_name. 62 00:07:13,280 --> 00:07:18,070 This method processes the input 63 00:07:18,080 --> 00:07:24,810 documents, the chunks, generates the embeddings using the provided OpenAI 64 00:07:24,820 --> 00:07:32,830 embeddings instance, inserts the embeddings into the index and returns a 65 00:07:32,840 --> 00:07:34,530 new Pinecone vector store object. 
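The function narrated up to this point can be sketched roughly as follows. This is a reconstruction from the spoken description, not the author's exact notebook cell: the snake_case names (`insert_or_fetch_embeddings`, `index_name`, `vector_store`) are assumed renderings of the spoken identifiers, and the `dimension` keyword and `gcp-starter` value already reflect the corrections made later in the video. It assumes the `pinecone`, `langchain-community` and `langchain-openai` packages are installed and that `PINECONE_API_KEY` and `OPENAI_API_KEY` are set in the environment.

```python
def insert_or_fetch_embeddings(index_name, chunks):
    # Imports live inside the function, as in the narration.
    import pinecone
    from langchain_community.vectorstores import Pinecone
    from langchain_openai import OpenAIEmbeddings
    from pinecone import PodSpec

    # The client reads PINECONE_API_KEY from the environment when no
    # api_key argument is passed.
    pc = pinecone.Pinecone()

    # Embedding model and vector size as stated in the video.
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)

    if index_name in pc.list_indexes().names():
        # Index already exists: only load the stored embeddings.
        print(f'Index {index_name} already exists. Loading embeddings ... ', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print('Ok')
    else:
        # Index is missing: create it, then embed and upload the chunks.
        print(f'Creating index {index_name} and embeddings ... ', end='')
        pc.create_index(
            name=index_name,
            dimension=1536,  # must match the embedding size above
            metric='cosine',
            spec=PodSpec(environment='gcp-starter'),
        )
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
        print('Ok')

    return vector_store
```

The `dimension` passed to `create_index` has to equal the `dimensions` requested from the embedding model, otherwise the upload in the second branch would be rejected by the index.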
66 00:07:35,900 --> 00:07:41,550 I am also printing Ok and returning vector_store. 67 00:07:43,570 --> 00:07:46,600 That is it for insert_or_fetch_embeddings. 68 00:07:48,090 --> 00:07:50,660 I will also create a function that 69 00:07:50,670 --> 00:07:56,460 deletes a Pinecone index or all the indexes. Because the Pinecone free tier 70 00:07:56,470 --> 00:08:02,240 supports only one index, it could be necessary to delete the existing index frequently. 71 00:08:02,890 --> 00:08:14,190 So, def delete_pinecone_index and the function will have one argument, index_name, 72 00:08:14,200 --> 00:08:16,790 with the default value all. 73 00:08:18,640 --> 00:08:22,170 I am importing the pinecone library and 74 00:08:22,180 --> 00:08:28,570 pc equals pinecone.Pinecone. 75 00:08:30,100 --> 00:08:33,710 By the way, pinecone written in lowercase 76 00:08:33,720 --> 00:08:41,950 is the library and Pinecone written with an uppercase P is a class contained 77 00:08:41,960 --> 00:08:51,230 in this library. And if index_name equals equals all, the default value, 78 00:08:51,480 --> 00:08:53,750 I am deleting all the indexes. 79 00:08:54,340 --> 00:08:58,550 So, indexes equals pc.list_indexes() 80 00:08:58,560 --> 00:09:13,530 .names(), I am displaying a message for the user, Deleting all indexes, and a for loop, 81 00:09:13,540 --> 00:09:28,080 for index in indexes colon, pc.delete_index of index, and I am printing 82 00:09:28,090 --> 00:09:41,240 Ok. Else, if the user provided an argument, a name for index_name, I 83 00:09:41,250 --> 00:09:45,120 will delete only that index. I am displaying a message. 84 00:09:46,710 --> 00:09:59,460 Deleting index, index_name in curly braces, and end equals empty string, and I am 85 00:09:59,470 --> 00:10:13,520 calling delete_index, pc.delete_index of index_name. That is it. 86 00:10:13,870 --> 00:10:16,720 I am running the code in these two cells. 87 00:10:26,920 --> 00:10:28,590 Let's test the functions. 
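The delete helper described above might look like the sketch below. It follows the narration step by step, with `delete_pinecone_index` and `index_name` as assumed snake_case renderings of the spoken names. It assumes the `pinecone` package is installed and `PINECONE_API_KEY` is set in the environment.

```python
def delete_pinecone_index(index_name='all'):
    # pinecone (lowercase) is the library, Pinecone (uppercase P) is the client class.
    import pinecone
    pc = pinecone.Pinecone()

    if index_name == 'all':
        # Default: remove every index in the account.
        indexes = pc.list_indexes().names()
        print('Deleting all indexes ... ')
        for index in indexes:
            pc.delete_index(index)
        print('Ok')
    else:
        # Remove only the index that was named by the caller.
        print(f'Deleting index {index_name} ... ', end='')
        pc.delete_index(index_name)
        print('Ok')
```

Calling `delete_pinecone_index()` with no argument clears the account, which is convenient on the free tier before creating a new index. Passing a name limits the deletion to that one index.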
88 00:10:30,040 --> 00:10:34,390 To make sure there are no Pinecone indexes, I will remove all indexes first. 89 00:10:34,760 --> 00:10:39,710 I am using the Pinecone free tier and I want to avoid getting an error when I try 90 00:10:39,720 --> 00:10:40,870 to create a new index. 91 00:10:42,940 --> 00:10:44,710 So, delete_pinecone_index. 92 00:10:46,100 --> 00:10:51,470 Before running the function, I will check my account on pinecone.io. 93 00:10:52,200 --> 00:10:55,290 There is an index called ask a document. 94 00:10:56,280 --> 00:10:56,630 Very well. 95 00:10:56,940 --> 00:10:58,790 I am calling the function. 96 00:11:01,820 --> 00:11:04,350 It is deleting all the indexes. 97 00:11:05,420 --> 00:11:08,810 It is done and I am refreshing this page. 98 00:11:12,140 --> 00:11:14,070 There is no index on Pinecone. 99 00:11:17,110 --> 00:11:24,000 Next, I will create an index, embed the chunks and upload both the chunks and the 100 00:11:24,010 --> 00:11:25,300 embeddings to Pinecone. 101 00:11:26,250 --> 00:11:30,580 Let's say index_name equals ask a document 102 00:11:32,670 --> 00:11:39,800 and vector_store equals insert_or_fetch_embeddings, 103 00:11:40,690 --> 00:11:43,200 the function I have just defined. 104 00:11:43,810 --> 00:11:46,760 The arguments are index_name and chunks. 105 00:11:47,950 --> 00:11:52,340 I have already created the chunks by calling chunk_data. 106 00:11:52,610 --> 00:12:00,260 I am running the code in this cell again just to be sure that the data was chunked. 107 00:12:01,640 --> 00:12:08,720 I am running the code and I have got an error. 108 00:12:09,950 --> 00:12:13,000 The correct name is dimension, not 109 00:12:13,010 --> 00:12:19,660 dimensions, and I noticed another mistake. 110 00:12:20,390 --> 00:12:25,360 It is GCP, Google Cloud Platform. 111 00:12:27,980 --> 00:12:29,950 I am running the code again. 112 00:12:35,510 --> 00:12:38,220 All right, the index is being created. 113 00:12:41,580 --> 00:12:42,610 It is done. 
114 00:12:43,580 --> 00:12:45,570 Let's check it on Pinecone. 115 00:12:47,570 --> 00:12:55,950 I am refreshing the page and we notice a new index called ask a document. 116 00:12:57,120 --> 00:13:03,490 We can also notice that the index contains 190 vectors. 117 00:13:04,740 --> 00:13:08,370 These vectors are the chunks that were embedded. 118 00:13:11,570 --> 00:13:14,140 Great, we have loaded the data into 119 00:13:14,150 --> 00:13:20,000 LangChain documents, split the documents into chunks, embedded the chunks and 120 00:13:20,010 --> 00:13:22,420 uploaded them to the Pinecone index. 121 00:13:23,790 --> 00:13:25,700 We have accomplished a lot of work. 122 00:13:26,030 --> 00:13:31,200 In the next video, we will start asking questions and getting answers based on 123 00:13:31,210 --> 00:13:32,040 similarity search.