0
1
00:00:11,160 --> 00:00:11,910
Hello everyone.
1

2
00:00:11,940 --> 00:00:16,230
Let us start analyzing a bunch of malicious PDF files.
2

3
00:00:16,310 --> 00:00:23,270
So again, we'll be coming back to our FLARE collection of tools.
3

4
00:00:23,310 --> 00:00:29,280
You can see there is a folder for PDF. When you go inside, you will find that there is a tool called pdf-
4

5
00:00:29,280 --> 00:00:38,070
parser, which is critical for parsing the complete PDF file and extracting malicious artifacts.
5

6
00:00:38,140 --> 00:00:43,460
Then there is a shortcut for pdf-parser and there is a shortcut for pdfid.
6

7
00:00:43,480 --> 00:00:47,450
So these are basically the compiled executables of the actual program.
7

8
00:00:47,490 --> 00:00:51,670
If you go inside pdf-parser, you'll see that it's simply a Python file.
8

9
00:00:51,700 --> 00:00:57,810
We can run it just like we ran the other OLE file analysis tools in the previous videos.
9

10
00:00:58,240 --> 00:01:05,020
The original pdfid tool is not here, so we can just open the properties of the shortcut and we
10

11
00:01:05,020 --> 00:01:08,240
can see exactly where this shortcut is pointing.
11

12
00:01:08,450 --> 00:01:11,880
We can see that it's in Program Files/pdfid.
12

13
00:01:12,010 --> 00:01:18,550
So I don't really enjoy running the shortcuts. It's always better to run the Python program directly
13

14
00:01:18,640 --> 00:01:23,110
so that in case there is any error, you can look at it and try to resolve it.
14

15
00:01:23,110 --> 00:01:25,480
So let's quickly go to Program Files,
15

16
00:01:27,000 --> 00:01:36,010
then pdfid, and just copy it and move it to the FLARE folder.
16

17
00:01:36,080 --> 00:01:41,510
So you now have both pdf-parser and pdfid in the same FLARE directory.
17

18
00:01:41,510 --> 00:01:43,640
So pdfid is more of a 
18

19
00:01:45,100 --> 00:01:51,650
meta information tool which gives you a bunch of information about the PDF file.
19

20
00:01:51,770 --> 00:01:57,260
For example, how many pages there are, whether there is any JavaScript inside it, and things like that.
20

21
00:01:57,260 --> 00:02:02,870
Whereas pdf-parser does more of a dynamic parsing of the PDF file.
21

22
00:02:02,900 --> 00:02:06,410
So let's begin with using pdfid.
22

23
00:02:06,410 --> 00:02:07,350
For the first
23

24
00:02:12,680 --> 00:02:22,600
So I will come to my pdfid directory, and my files are stored in course files/PDF files/PDF examples.
24

25
00:02:22,610 --> 00:02:24,350
I have three examples here.
25

26
00:02:24,350 --> 00:02:28,700
So we'll be going through them one by one.
26

27
00:02:28,700 --> 00:02:37,420
We will run
>python pdfid.py
followed by the location of the file.
27
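
The same command can be driven from a script when there are several samples to go through. The following is a minimal sketch, assuming pdfid.py sits in the current directory and the samples are in a folder named "PDF examples"; the helper names pdfid_command and run_pdfid are made up for this illustration.

```python
import subprocess
import sys
from pathlib import Path


def pdfid_command(pdfid_script, pdf_path):
    """Build the argument list equivalent to: python pdfid.py <pdf_path>."""
    return [sys.executable, str(pdfid_script), str(pdf_path)]


def run_pdfid(pdfid_script, pdf_path):
    """Run pdfid.py on one file and return everything it printed."""
    result = subprocess.run(
        pdfid_command(pdfid_script, pdf_path),
        capture_output=True,
        text=True,
        check=False,  # do not raise on failure, so any error text stays readable
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Assumed layout: pdfid.py next to this script, samples in "PDF examples".
    for sample in sorted(Path("PDF examples").glob("*.pdf")):
        print(run_pdfid("pdfid.py", sample))
```

Running the Python file directly like this, rather than through the shortcut, keeps any error message visible in the returned text.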

28
00:02:39,500 --> 00:02:43,360
So once you press enter it will give us a bunch of information.
28

29
00:02:43,360 --> 00:02:47,660
For example this PDF file has 26 objects inside it.
29

30
00:02:47,660 --> 00:02:53,240
If you remember from our previous discussion, we talked about how the body of a PDF file
30

31
00:02:53,240 --> 00:02:57,920
consists of different objects, and all those objects begin with
31

32
00:02:58,010 --> 00:03:01,650
'obj' and end with 'endobj'.
32

33
00:03:01,880 --> 00:03:08,780
So there are 26 obj and 26 endobj markers, so all the objects are closed properly.
33

34
00:03:08,990 --> 00:03:15,500
There are nine streams. Again, the body of a PDF file contains streams, and these streams hold the
34

35
00:03:15,500 --> 00:03:17,060
data.
35

36
00:03:17,240 --> 00:03:18,990
Then there is one cross-reference.
36

37
00:03:19,010 --> 00:03:21,690
There is one trailer and one startxref.
37

38
00:03:21,770 --> 00:03:23,790
There are three pages.
38
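
The counts described here can be approximated with a few lines of Python. This is only a rough sketch of the idea of counting structural keywords in the raw bytes; it does not reproduce pdfid's actual logic and does not handle obfuscated names. The function name count_keywords and the sample bytes are invented for illustration.

```python
import re

# Keywords mentioned in the walkthrough. Entries starting with "/" are PDF
# names; the others are bare structural keywords.
KEYWORDS = ["obj", "endobj", "stream", "endstream", "xref", "trailer",
            "startxref", "/Page", "/JS", "/JavaScript", "/OpenAction"]


def count_keywords(data: bytes) -> dict:
    """Count whole-word occurrences of each keyword in raw PDF bytes."""
    counts = {}
    for kw in KEYWORDS:
        # Bare keywords must not be preceded by a letter or digit, so "obj"
        # is not counted inside "endobj". Names already start with "/".
        prefix = rb"" if kw.startswith("/") else rb"(?<![A-Za-z0-9])"
        # No keyword may be followed by a letter or digit, so "/Page" is not
        # counted inside "/Pages".
        pattern = prefix + re.escape(kw.encode()) + rb"(?![A-Za-z0-9])"
        counts[kw] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    # Tiny hand-written fragment, not a real sample from the course files.
    sample = (b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R "
              b"/OpenAction 3 0 R >>\nendobj\n3 0 obj\n"
              b"<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
              b"xref\ntrailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF\n")
    for name, value in count_keywords(sample).items():
        print(f"{name:<12} {value}")
```

Equal obj and endobj counts in the result correspond to the "all objects are closed properly" check described in the walkthrough.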

39
00:03:23,960 --> 00:03:27,050
There is one JavaScript entry too.
39

40
00:03:27,050 --> 00:03:33,800
That is the /JS tag, and it has been picked up by pdfid
40

41
00:03:33,800 --> 00:03:35,100
as well.
41

42
00:03:35,200 --> 00:03:37,190
There is an open action as well.
42

43
00:03:37,190 --> 00:03:42,640
So what I mean by open action here is that once you launch the PDF file, whatever is
43

44
00:03:42,650 --> 00:03:46,370
marked as open action will be immediately executed.
44

45
00:03:46,670 --> 00:03:55,060
So it's very important to understand all these meta properties that we have got from pdfid.
45
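
To work with these properties in a script, the printed report can be turned into a dictionary. The sketch below assumes each count is printed as a keyword followed by an integer on its own line; that layout and the sample text are assumptions for illustration (the numbers mirror the counts read out for the first example), and parse_pdfid_report is a hypothetical helper name.

```python
def parse_pdfid_report(report: str) -> dict:
    """Turn "<keyword> <integer>" rows of a report into a dictionary."""
    counts = {}
    for line in report.splitlines():
        parts = line.split()
        # Header lines and anything that is not exactly "<keyword> <integer>"
        # are skipped rather than guessed at.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


# Assumed report layout, filled with the counts mentioned for the first file.
SAMPLE_REPORT = """\
 PDF Header: %PDF-1.4
 obj                   26
 endobj                26
 stream                 9
 xref                   1
 trailer                1
 startxref              1
 /Page                  3
 /JS                    1
 /JavaScript            1
 /OpenAction            1
"""

if __name__ == "__main__":
    print(parse_pdfid_report(SAMPLE_REPORT))
```

Rows that do not fit the assumed two-column shape are ignored, so an unexpected line in a real report would be dropped silently rather than cause an error.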

46
00:03:55,160 --> 00:04:00,710
We already know a bunch of them, but some of them are new, and the important ones are things
46

47
00:04:00,710 --> 00:04:04,860
like /JS, /JavaScript, /AA, /OpenAction,
47

48
00:04:04,920 --> 00:04:05,720
/XFA and /URI.
48

49
00:04:05,720 --> 00:04:11,930
So /URI tells us whether there is any URI present inside the PDF. The /EmbeddedFile entry
49

50
00:04:11,930 --> 00:04:12,470
tells us
50

51
00:04:12,470 --> 00:04:20,540
whether there is any embedded file, for example an executable or a Flash file, inside the PDF. So the interesting
51

52
00:04:20,540 --> 00:04:22,340
parts here are the JavaScript entries.
52

53
00:04:22,370 --> 00:04:28,550
We know that this file contains JavaScript and there is an open action that is performed as well, which
53

54
00:04:28,550 --> 00:04:33,980
means that as soon as we launch the PDF, the PDF tries to do something without, you know, asking
54

55
00:04:33,980 --> 00:04:37,410
you for any kind of permission.
55

56
00:04:37,410 --> 00:04:47,260
All you have to do is just open that PDF itself. Let us run it for our second file as well.
56

57
00:04:47,260 --> 00:04:49,410
For this file we get something similar.
57

58
00:04:49,510 --> 00:04:54,770
There are 12 objects and two streams, and it has two pages.
58

59
00:04:54,910 --> 00:05:03,580
And again, it has JavaScript inside it and it performs an open action as well, and there is no embedded file
59

60
00:05:03,820 --> 00:05:06,950
and there is no URI inside that PDF file.
60

61
00:05:08,580 --> 00:05:13,110
Let us try with our third example.
61

62
00:05:13,230 --> 00:05:18,060
We have eight objects, one stream and one page.
62

63
00:05:18,060 --> 00:05:19,470
There is no JavaScript
63

64
00:05:19,470 --> 00:05:25,900
in this case, and there is one /XFA entry and no URI. That's it.
64

65
00:05:25,920 --> 00:05:31,950
So this is how we first collect some static information about the PDF file using pdfid, and
65

66
00:05:31,950 --> 00:05:37,230
this can help us make a heuristic assessment of the PDF file by looking at the number
66

67
00:05:37,230 --> 00:05:43,020
of pages, whether it has some JavaScript in it or not, whether it performs some open action or not, and
67

68
00:05:43,020 --> 00:05:44,350
things like that.
68
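
The heuristics above can be written down as a small triage function over the keyword counts. This is a sketch of one possible rule set based on the indicators named in this video, not a definitive detection method; the function name triage and the wording of the findings are made up, and the example dictionaries only contain counts that were read out for the first and third samples.

```python
def triage(counts: dict) -> list:
    """Return human-readable findings for a dictionary of keyword counts."""
    findings = []
    has_script = counts.get("/JS", 0) > 0 or counts.get("/JavaScript", 0) > 0
    has_auto = counts.get("/OpenAction", 0) > 0 or counts.get("/AA", 0) > 0

    if has_script and has_auto:
        findings.append("JavaScript combined with an automatic action")
    elif has_script:
        findings.append("JavaScript present")
    elif has_auto:
        findings.append("automatic action present")

    if counts.get("/EmbeddedFile", 0) > 0:
        findings.append("embedded file present")
    if counts.get("/XFA", 0) > 0:
        findings.append("XFA form present")
    if counts.get("/URI", 0) > 0:
        findings.append("URI present")

    # Only compare obj and endobj when both counts are actually known.
    if "obj" in counts and "endobj" in counts and counts["obj"] != counts["endobj"]:
        findings.append("obj and endobj counts differ")
    return findings


if __name__ == "__main__":
    first = {"obj": 26, "endobj": 26, "stream": 9, "/Page": 3,
             "/JS": 1, "/JavaScript": 1, "/OpenAction": 1}
    third = {"obj": 8, "stream": 1, "/Page": 1, "/XFA": 1, "/URI": 0}
    print(triage(first))
    print(triage(third))
```

A finding here only says where to look next; the elements themselves still have to be inspected before drawing any conclusion about the file.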

69
00:05:44,370 --> 00:05:50,940
So once we have some kind of static heuristics about the PDF file, the next thing that we can do
69

70
00:05:50,940 --> 00:05:56,710
is we can start using pdf-parser to actually look into these elements.