In the last two videos, we talked about transform loaders, which transform or load data from a specific format into the LangChain Document format. In that case, the data was loaded from local files: us_constitution.pdf and the_great_gatsby.docx are local files.

Now let's see how to load data from online services into LangChain. Here we don't deal with files, but with the different protocols or APIs that connect to those services. Since the format and code differ for each service, I would create a unique function for each dataset or service loader that I want to support in my application.

Let's do this for Wikipedia. First, I am installing the wikipedia module, which is a prerequisite: pip install wikipedia -q. Then def load_from_wikipedia, and the function will take two arguments: query, and lang, which defaults to 'en'. query is free text used to find documents in Wikipedia, and lang, English by default, selects the language edition of Wikipedia to search.

Next, I am importing the Wikipedia loader: from langchain.document_loaders import WikipediaLoader. Then loader = WikipediaLoader(query=query, lang=lang, load_max_docs=2). load_max_docs can be used to limit the number of downloaded documents. You can use a hard-coded value or add a third argument to the function, like this: load_max_docs=2 as a default value in the function signature, and load_max_docs=load_max_docs in the loader call. That is more flexible. Finally, data = loader.load() and return data. And that's it. I am running the cell.

Now I am loading something from Wikipedia in the LangChain Document format: data = load_from_wikipedia. Ok, it's load; that was a typo. Let's say 'GPT-4' as the query.

I remind you that the training data for GPT-4 was cut off in September 2021. GPT-4 itself was launched in March 2023, so information about it was not included in its training data. Without loading the data from external sources, LLMs like GPT-3.5 Turbo or GPT-4 have no knowledge of it.

Then print(data[0].page_content), and I am running the cell. This is the data about GPT-4, downloaded from Wikipedia.
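Assembled from the narration above, here is a minimal sketch of the function built in this lecture. The names follow what is dictated in the video; the import path matches the older langchain package layout (newer releases expose the same loader from langchain_community.document_loaders):

    # prerequisite: pip install wikipedia -q
    from langchain.document_loaders import WikipediaLoader

    def load_from_wikipedia(query, lang='en', load_max_docs=2):
        # query: free-text search string used to find Wikipedia articles
        # lang: language edition of Wikipedia to search ('en' by default)
        # load_max_docs: limits the number of downloaded documents
        loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
        data = loader.load()  # returns a list of LangChain Document objects
        return data

    data = load_from_wikipedia('GPT-4')
    print(data[0].page_content)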
If you want to see it in another language, just add a second argument with the language code, for example 'de' for German (Deutsch). I am running it again.

All right. In this lecture, I have shown you how to load data from an online service like Wikipedia. We will take a short break, and in the next video we will discuss chunking documents. These chunks will then be embedded into numeric vectors and uploaded to a vector store.
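For reference, the German-language call demonstrated in this lecture, using the load_from_wikipedia function sketched above, would look like this:

    data = load_from_wikipedia('GPT-4', 'de')  # 'de' selects the German Wikipedia
    print(data[0].page_content)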