1 00:00:00,360 --> 00:00:05,150 Let us start by understanding why one should learn about the PE file format. 2 00:00:06,360 --> 00:00:13,740 The primary reason is that PE is one of the most common file formats for executables on the Windows platform. 3 00:00:14,670 --> 00:00:18,570 And because the number of people using Microsoft operating systems is huge, 4 00:00:19,080 --> 00:00:23,220 we usually see more malware written for the Windows platform. 5 00:00:24,210 --> 00:00:31,060 Here is a graph that shows almost 50 percent of malware is in the PE file format, followed by 6 00:00:31,080 --> 00:00:34,830 document types like Word, PDF, PPT, etc. 7 00:00:36,740 --> 00:00:42,290 Considering all these factors, it is worth investing our time in learning about the PE file structure. 8 00:00:43,340 --> 00:00:46,250 PE stands for Portable Executable. 9 00:00:47,790 --> 00:00:57,000 All .exe and .dll files are PE files. The PE file contains the information required for the operating 10 00:00:57,000 --> 00:00:58,860 system to run the executable. 11 00:01:00,220 --> 00:01:06,580 During malware analysis, this information gives us hints about the functionality of the malware and how 12 00:01:06,580 --> 00:01:08,410 it interacts with the operating system. 13 00:01:09,850 --> 00:01:15,030 If you are interested in knowing the most common executable format on Linux distributions, it is ELF, 14 00:01:15,460 --> 00:01:18,280 the Executable and Linkable Format. 15 00:01:20,650 --> 00:01:24,520 With this background, let's start exploring the PE file format. 16 00:01:25,610 --> 00:01:33,380 At this point, I will give you a warning that this module might sound difficult to grasp, but don't 17 00:01:33,380 --> 00:01:34,210 get discouraged. 18 00:01:34,820 --> 00:01:39,430 Just watch this module a couple of times to get a fair knowledge of the PE file format. 19 00:01:41,280 --> 00:01:43,950 A typical PE file has the following parts. 
20 00:01:45,790 --> 00:01:50,320 However, a malware analyst is not required to know all the parts of a PE file. 21 00:01:50,950 --> 00:01:57,010 So we will simplify the PE structure like this and understand what each part does. 22 00:01:59,360 --> 00:02:01,890 The first part of a PE file is called the DOS header. 23 00:02:03,050 --> 00:02:05,720 It is also referred to as the MZ header. 24 00:02:06,500 --> 00:02:09,290 It identifies the file as an executable binary. 25 00:02:10,040 --> 00:02:16,550 The DOS header holds something called a file signature, or magic number, for Windows executables. 26 00:02:16,790 --> 00:02:23,420 The magic number is 4D 5A in hexadecimal, as the bytes appear in the file, and MZ in ASCII. 27 00:02:24,460 --> 00:02:31,630 The DOS header also contains a value called e_lfanew, which tells the operating system where 28 00:02:31,630 --> 00:02:37,960 to find the PE header, which, as you see, is present down here. 29 00:02:39,760 --> 00:02:45,490 Next, we have the DOS stub. This mostly exists for backward compatibility. 30 00:02:46,300 --> 00:02:54,040 DOS stands for Disk Operating System, a predecessor of Microsoft Windows. In recent times, as 31 00:02:54,040 --> 00:03:00,610 there are no applications built for DOS, this part of the PE file is just used to print the message "This 32 00:03:00,610 --> 00:03:02,970 program cannot be run in DOS mode." 33 00:03:04,840 --> 00:03:12,490 After the DOS stub, we have the PE header. This is used to identify the executable as PE format. 34 00:03:14,220 --> 00:03:27,090 The signature that represents a PE file is PE followed by two null bytes in ASCII, and 50 45 00 00 in hexadecimal. 35 00:03:27,090 --> 00:03:29,990 The PE header also holds the machine type 36 00:03:30,150 --> 00:03:36,330 the application is designed to run on, like 32-bit or 64-bit Intel or AMD chipsets. 37 00:03:37,700 --> 00:03:41,330 This part contains information about the number of sections present in the file. 
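The header checks described so far can be sketched in a few lines of Python. This is only a minimal illustration, not a full parser: the names `parse_pe_headers` and `build_sample` and the synthetic sample bytes are assumptions made for this example, while the offsets (e_lfanew at 0x3C, a 20-byte file header right after the PE signature) follow the published PE layout.

```python
import struct
from datetime import datetime, timezone

# A small, incomplete lookup of machine type values (illustrative only).
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}


def parse_pe_headers(data: bytes) -> dict:
    """Read the MZ magic, e_lfanew, the PE signature and the file header."""
    if data[:2] != b"MZ":  # bytes 4D 5A at the very start of the file
        raise ValueError("missing MZ magic number")

    # e_lfanew is a 4-byte little-endian file offset stored at 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)

    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":  # bytes 50 45 00 00
        raise ValueError("missing PE signature")

    # The 20-byte file header starts right after the 4-byte signature.
    (machine, number_of_sections, timestamp, _symbol_table, _symbol_count,
     size_of_optional_header, _characteristics) = struct.unpack_from(
        "<HHIIIHH", data, e_lfanew + 4)

    return {
        "e_lfanew": e_lfanew,
        "machine": machine,
        "machine_name": MACHINE_NAMES.get(machine, "other"),
        "number_of_sections": number_of_sections,
        "compiled_at": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "size_of_optional_header": size_of_optional_header,
    }


def build_sample() -> bytes:
    """Build a tiny synthetic header blob (hypothetical values, not a real EXE)."""
    dos = bytearray(0x80)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x80)  # PE header placed at offset 0x80
    file_header = struct.pack("<HHIIIHH", 0x8664, 3, 0x5F000000, 0, 0, 0xF0, 0x0022)
    return bytes(dos) + b"PE\x00\x00" + file_header


if __name__ == "__main__":
    for key, value in parse_pe_headers(build_sample()).items():
        print(f"{key}: {value}")
```

To try it on a real file, you could pass `open(path, "rb").read()` to `parse_pe_headers` instead of the synthetic sample.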
38 00:03:43,170 --> 00:03:46,560 What is a section? We will discuss this at the end. 39 00:03:47,580 --> 00:03:51,650 Additionally, we can get the date and time stamp of when the file was compiled. 40 00:03:52,110 --> 00:03:56,550 The PE header also holds the size of the optional header. 41 00:03:57,900 --> 00:03:59,910 Next is the optional header. 42 00:04:01,530 --> 00:04:08,880 This consists of information like the size of code. Code here generally refers to the specific section called 43 00:04:08,910 --> 00:04:09,290 .text. 44 00:04:12,530 --> 00:04:13,860 Address of entry point: 45 00:04:14,090 --> 00:04:18,250 it is the address in memory where the PE loader will begin executing. 46 00:04:19,280 --> 00:04:24,140 This is very important during malware analysis, as it tells exactly where the code begins. 47 00:04:25,530 --> 00:04:31,320 Preferred base address: it is the address of the first byte of the image when loaded into memory. 48 00:04:32,250 --> 00:04:35,240 It must be a multiple of 64 K. 49 00:04:36,690 --> 00:04:41,550 The default for 32-bit executables on Windows is 0x00400000. 50 00:04:42,270 --> 00:04:48,270 It also contains the size of the image, which is the size of the image, including all the headers, 51 00:04:48,450 --> 00:04:50,500 as the image is loaded in memory. 52 00:04:52,120 --> 00:04:59,410 And operating system version refers to the major and minor operating system versions. More about address, 53 00:04:59,410 --> 00:05:05,550 relative address, and size of each part in the PE file will be discussed later. 54 00:05:07,350 --> 00:05:15,330 Moving on, next we have the section table. This holds information like virtual size, which is the total 55 00:05:15,330 --> 00:05:19,560 size of the section when loaded into memory; size of raw data. 
56 00:05:20,040 --> 00:05:26,400 This is the size of the initialized data on disk; and characteristics, the flags that describe 57 00:05:26,400 --> 00:05:31,800 the characteristics of a section, like whether it is readable, writable, executable, etc. 58 00:05:33,010 --> 00:05:36,610 Finally, we have the sections part of the PE file format. 59 00:05:37,810 --> 00:05:43,030 This part contains multiple sections, depending on what the application is trying to achieve. For example, the 60 00:05:43,390 --> 00:05:50,620 .text section contains the executable code for the application, .bss holds uninitialized data 61 00:05:50,620 --> 00:05:52,120 for the application, and so on. 62 00:05:53,180 --> 00:05:58,620 All put together, here is a snapshot of all the parts of a PE file and their brief descriptions. 63 00:05:59,360 --> 00:06:05,100 We have noticed that a couple of terms appear again and again through this section, like size of 64 00:06:05,100 --> 00:06:12,590 the specific section, address, relative address, etc. What do they mean, and how do we understand them? 65 00:06:13,170 --> 00:06:16,780 Let's consider a real-life analogy to understand these terms better. 66 00:06:17,720 --> 00:06:23,720 Consider a class of students who are asked to submit an academic thesis. Each student should 67 00:06:23,720 --> 00:06:31,400 pick his or her topic of interest and should research and write a two-hundred-page thesis. In order 68 00:06:31,400 --> 00:06:34,640 to help students with how their thesis has to be structured, 69 00:06:35,000 --> 00:06:42,050 the faculty comes up with a format with sections like title of the thesis, author, advisor, abstract 70 00:06:42,050 --> 00:06:43,830 of the thesis, and so on. 71 00:06:44,360 --> 00:06:50,750 Obviously, every thesis will include a table of contents, which will help the faculty in quickly navigating 72 00:06:50,750 --> 00:06:52,250 through a 200-page thesis. 
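To make the section table fields a little more concrete, here is a minimal Python sketch that unpacks one 40-byte section table entry and turns its characteristics flags into a readable/writable/executable string. The function name `parse_section_header` and the sample values are assumptions made for this example; a real file would supply the bytes.

```python
import struct

# One section table entry is 40 bytes: an 8-byte name followed by sizes,
# addresses, relocation/line-number fields and the characteristics flags.
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")

# Memory permission bits inside the characteristics field.
MEM_EXECUTE = 0x20000000
MEM_READ = 0x40000000
MEM_WRITE = 0x80000000


def parse_section_header(raw: bytes) -> dict:
    """Decode one section table entry into the fields discussed above."""
    (name, virtual_size, virtual_address, raw_size, raw_pointer,
     _reloc_ptr, _line_ptr, _reloc_count, _line_count,
     characteristics) = SECTION_HEADER.unpack(raw)

    permissions = "".join(
        letter if characteristics & bit else "-"
        for letter, bit in (("r", MEM_READ), ("w", MEM_WRITE), ("x", MEM_EXECUTE))
    )
    return {
        "name": name.rstrip(b"\x00").decode("ascii"),
        "virtual_size": virtual_size,        # size once loaded into memory
        "virtual_address": virtual_address,  # relative to the image base
        "raw_size": raw_size,                # size of initialized data on disk
        "raw_pointer": raw_pointer,          # file offset of that data
        "permissions": permissions,
    }


# Hypothetical .text entry: readable and executable, not writable.
sample = SECTION_HEADER.pack(
    b".text", 0x1A2B, 0x1000, 0x1C00, 0x400, 0, 0, 0, 0, 0x60000020)

if __name__ == "__main__":
    print(parse_section_header(sample))
```

A section that is both writable and executable shows up as `rwx` here, which is often worth a second look during analysis.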
73 00:06:53,270 --> 00:06:59,510 Here is where it gets interesting. Instead of mentioning only the actual page number in the TOC, the student 74 00:06:59,510 --> 00:07:07,490 will also mention the number of pages consumed by each section, like this: the title of the thesis is covered in 75 00:07:07,490 --> 00:07:10,610 a total of one page, and it is on page number one. 76 00:07:11,990 --> 00:07:15,980 Author and advisor is a total of one page, 77 00:07:17,380 --> 00:07:25,090 and it is present on page two. Abstract is three pages, and it starts from page number three. Problem 78 00:07:25,090 --> 00:07:29,080 statement is two pages, and it starts from page number six, and so on. 79 00:07:30,960 --> 00:07:34,590 Now, are we making the table of contents simpler by doing something like this? 80 00:07:35,440 --> 00:07:42,270 No, in fact, it is more complicated now, but this example helps us to understand the size and addressing 81 00:07:42,270 --> 00:07:45,720 involved in executing a program. 82 00:07:45,720 --> 00:07:49,710 Here, the number of pages is similar to the size of a section in an executable 83 00:07:49,710 --> 00:07:55,530 when loaded into memory. The number of pages, that is, the size of each section in the thesis, 84 00:07:55,800 --> 00:08:02,780 gives us the information of where the section ends and where the next section starts, which is the 85 00:08:02,790 --> 00:08:03,870 relative virtual address. 86 00:08:04,890 --> 00:08:07,560 The page numbers are the absolute addresses. 87 00:08:09,660 --> 00:08:15,510 Let's get back to the technical learning and see how this correlates with file execution. The file, when 88 00:08:15,510 --> 00:08:20,540 executed, is loaded into memory, that is, RAM, but in some random location. 89 00:08:21,300 --> 00:08:24,660 This is what RAM with unused space would look like. 90 00:08:25,290 --> 00:08:30,770 When an executable is loaded into memory, it will take some random location in RAM. 
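Going back to the thesis analogy for a moment, the arithmetic can be sketched in Python: given only the number of pages in each section, we can work out where each section starts. The function name `start_pages` is made up for this illustration. Keep in mind that this is a simplification: in a real PE file, each section's virtual address is stored explicitly in the section table and is aligned, rather than being derived by summing sizes.

```python
from itertools import accumulate

# Section name and number of pages, as in the thesis example.
thesis = [
    ("Title", 1),
    ("Author and advisor", 1),
    ("Abstract", 3),
    ("Problem statement", 2),
]


def start_pages(sections, first_page=1):
    """Derive each section's starting page from the sizes before it."""
    sizes = [size for _, size in sections]
    # Running total of pages consumed before each section begins.
    offsets = [0, *accumulate(sizes)][:-1]
    return {name: first_page + offset
            for (name, _), offset in zip(sections, offsets)}


if __name__ == "__main__":
    for name, page in start_pages(thesis).items():
        print(f"{name}: starts on page {page}")
```

The offset of each section from the start of the thesis plays the role of the relative address, and the offset plus the first page number plays the role of the absolute address.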
91 00:08:31,800 --> 00:08:38,460 The operating system needs to understand where to find the instructions to run. The size of a specific 92 00:08:38,460 --> 00:08:42,030 section tells the computer where to find the next part of the file. 93 00:08:42,780 --> 00:08:48,240 It is worth noting that all memory locations are written in hexadecimal format, like this.
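As a small worked example of the addressing idea, the Python sketch below adds a relative virtual address to the base address where the image was loaded and prints the result in hexadecimal. The helper names (`rva_to_va`, `format_address`, `is_valid_image_base`) and the sample values are assumptions for illustration only; the actual load address is decided by the operating system at run time.

```python
ALLOCATION_GRANULARITY = 64 * 1024  # the base address must be a multiple of 64 K


def is_valid_image_base(image_base: int) -> bool:
    """Check that a base address is a multiple of 64 K."""
    return image_base % ALLOCATION_GRANULARITY == 0


def rva_to_va(image_base: int, rva: int) -> int:
    """Absolute (virtual) address = load address + relative virtual address."""
    return image_base + rva


def format_address(address: int) -> str:
    """Render an address as 8 hexadecimal digits with a 0x prefix."""
    return f"{address:#010x}"


if __name__ == "__main__":
    image_base = 0x00400000   # sample preferred base address
    entry_point_rva = 0x1000  # hypothetical address of entry point
    print(is_valid_image_base(image_base))
    print(format_address(rva_to_va(image_base, entry_point_rva)))
```

If the loader places the image somewhere other than the preferred base, only `image_base` changes; the relative addresses inside the file stay the same.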