Metasm classes (2)

Fri 08 May 2009 by jj

This article will explore a bit of the framework internals to show how decoding executable files and instructions is handled.

Last week we saw a high-level overview of disassembly using metasm.

This article will dive into the code to see how things are done under the hood.

Executable file decoding

All executable file formats supported by the framework are implemented in files under the metasm/exe_format/ directory. Every format has an associated ruby class implemented in the file holding the same name (pe.rb for the PE file format, etc). Some of the most complex are split in three files, e.g. elf.rb, elf_decode.rb and elf_encode.rb.

This makes use of a ruby-specific feature: in ruby, classes can be reopened and extended at will. So the main file defines the core components (which attributes the class has, which subclasses are used, what the constants for this format are), and the other files enhance those classes with methods to decode (resp. encode) a binary file in this format.
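A minimal sketch of class reopening follows. ToyFormat and its methods are hypothetical names chosen for illustration, not metasm classes; the two class bodies stand in for a "main" file and a "decode" file.

```ruby
# "Main" file: core definition (constants and attributes only).
class ToyFormat
  MAGIC = "TOY!".b
  attr_accessor :encoded, :header

  def initialize(encoded = "".b)
    @encoded = encoded
  end
end

# "Decode" file: reopens the very same class to add decoding methods.
class ToyFormat
  def decode_header
    @header = { magic: @encoded[0, 4], version: @encoded.getbyte(4) }
  end
end

toy = ToyFormat.new("TOY!\x02".b)
toy.decode_header
```

After both definitions are loaded, a single ToyFormat class carries the attributes from the first body and the decode_header method from the second.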

The executable format classes are all subclasses of the ExeFormat class, which is defined in exe_format/main.rb. This class defines standard methods available for all subclasses, like decoding from a file (the decode_file class method). This one is a simple shortcut that will call the load_file and the decode methods. The latter, decode, must be implemented by all subclasses that want decode_file to work.

load_file is in charge of making the data in the file passed as argument available for the ExeFormat instance. Instead of reading the whole content in memory, which may be quite big, it uses the special VirtualFile class. This one is defined in metasm/os/main.rb. It behaves as a standard ruby String, and will read on demand a small subset of the underlying file. The (virtual) file content is then passed to the load method, which will create a new instance of the specific ExeFormat and initialize its encoded attribute with this data.
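The on-demand reading idea can be sketched as follows. LazyFile, its PAGE size and its reads counter are hypothetical and only illustrate the principle; this is not how VirtualFile is actually written.

```ruby
require "tempfile"

# Hypothetical stand-in for the idea behind VirtualFile: offer a
# String-like slice API, but read the file one small page at a time.
class LazyFile
  PAGE = 4096
  attr_reader :reads

  def initialize(path)
    @fd = File.open(path, "rb")
    @size = @fd.size
    @pages = {}   # page index => cached page content
    @reads = 0    # number of actual disk reads, for illustration
  end

  def length
    @size
  end

  # Return len bytes starting at off, loading only the pages touched.
  def [](off, len)
    return nil if off > @size
    len = [len, @size - off].min
    out = "".b
    while len > 0
      idx, pos = off.divmod(PAGE)
      chunk = page(idx)[pos, len]
      out << chunk
      off += chunk.length
      len -= chunk.length
    end
    out
  end

  private

  def page(idx)
    @pages[idx] ||= begin
      @reads += 1
      @fd.seek(idx * PAGE)
      @fd.read(PAGE) || "".b
    end
  end
end

# Usage: a 20 KB sample file, of which only the first page is read.
tmp = Tempfile.new("lazy")
tmp.binmode
tmp.write("\x7fELF".b + "\0".b * 20_000)
tmp.flush
lazy = LazyFile.new(tmp.path)
magic = lazy[0, 4]
```

Checking a signature this way costs a single page read, whatever the size of the file.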

The load method is in fact a constructor: it is a class method that will return an initialized instance. AutoExe (exe_format/autoexe.rb) takes advantage of this to do its job. It behaves like a normal ExeFormat, and only implements the load method. In it, it checks the first bytes of the data passed, looking for a known format signature (e.g. 0x7f454c46 for the ELF file format). If it finds a match, it then forwards the load call to the specific class.

AutoExe.decode_file()
  calls AutoExe.load_file()
    calls AutoExe.load()
      returns ELF.load()
      (inherited from ExeFormat)
      calls elf = ELF.new
      (initializes elf.encoded)
  calls elf.decode
returns elf
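The call chain above can be imitated with a few toy classes. Every name here (ToyExeFormat, ToyELF, ToyPE, ToyAutoExe, decode_string) is hypothetical; the sketch works on an in-memory string and skips the load_file step.

```ruby
# Hypothetical miniature of the ExeFormat / AutoExe relationship.
class ToyExeFormat
  attr_accessor :encoded

  # Class-level constructor: returns an initialized instance.
  def self.load(data)
    exe = new
    exe.encoded = data
    exe
  end

  # Shortcut: load, then decode (stands in for decode_file).
  def self.decode_string(data)
    exe = load(data)
    exe.decode
    exe
  end

  def decode
    raise NotImplementedError, "#{self.class} must implement decode"
  end
end

class ToyELF < ToyExeFormat
  attr_reader :decoded

  def decode
    @decoded = true
  end
end

class ToyPE < ToyExeFormat
  attr_reader :decoded

  def decode
    @decoded = true
  end
end

class ToyAutoExe < ToyExeFormat
  SIGNATURES = { "\x7fELF".b => ToyELF, "MZ".b => ToyPE }

  # Only load is overridden: sniff the signature, then forward the call.
  def self.load(data)
    _sig, klass = SIGNATURES.find { |sig, _| data.start_with?(sig) }
    raise ArgumentError, "unknown file format" unless klass
    klass.load(data)
  end
end

exe = ToyAutoExe.decode_string("\x7fELF\x01\x01".b)
```

Because load is the only entry point that creates instances, overriding it is enough for the inherited shortcut to hand back an object of the detected class.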

The decode method is used to interpret all the data available from the file, e.g. symbols, relocations, imports, etc. If you just want a certain type of information, you can use the lighter decode_header, which will just decode the main file header; you are then free to call only the methods you need on the file (check the decode method implementation to find what you need, it is often a simple list of decode_imports, decode_exports, ..., just pick what you need).

Most file formats (ELF and PE/COFF) use a whole bunch of subclasses to represent parts of the executable (e.g. header, imports, relocs). Those are descendants of the SerialStruct class (exe_format/serialstruct.rb), which allows a very simple definition of the underlying structures (words, bitfields, ...) and automatically provides binary encoding and decoding routines.
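To give a feel for this kind of declarative structure definition, here is a small sketch. ToyStruct and its byte/half/word declarations are hypothetical and much simpler than SerialStruct (little-endian only, no bitfields); they are not the metasm API.

```ruby
# Hypothetical mini-DSL in the spirit of SerialStruct.
class ToyStruct
  # field type => [pack format, size in bytes], all little-endian
  FORMATS = { byte: ["C", 1], half: ["v", 2], word: ["V", 4] }

  def self.fields
    @fields ||= []
  end

  # Record the field and create its accessor.
  def self.field(type, name)
    fields << [name, type]
    attr_accessor name
  end

  def self.byte(name); field(:byte, name); end
  def self.half(name); field(:half, name); end
  def self.word(name); field(:word, name); end

  # Build an instance from data, reading fields in declaration order.
  def self.decode(data, offset = 0)
    obj = new
    fields.each do |name, type|
      fmt, size = FORMATS.fetch(type)
      obj.public_send("#{name}=", data[offset, size].unpack1(fmt))
      offset += size
    end
    obj
  end

  # Serialize the instance back to its binary form.
  def encode
    self.class.fields.map { |name, type|
      [public_send(name)].pack(FORMATS.fetch(type)[0])
    }.join
  end
end

# A structure is then just a list of declarations.
class ToyHeader < ToyStruct
  word :magic
  half :machine
  byte :flags
end

raw = [0x464c457f, 3, 1].pack("VvC")
hdr = ToyHeader.decode(raw)
```

The subclass only lists its fields; decoding and encoding come for free from the parent class.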

Instruction decoding

Metasm can interpret binary executable files, and it can also interpret binary machine code.

This takes place in the file <cpu>/decode.rb, which adds the decoding routines to the CPU class. The CPU holds a list of opcode descriptions (<cpu>/opcodes.rb):

  • Instruction name
  • Binary encoding
  • Formal parameters (register, memory indirection, immediate...)
  • Other bitfields
  • Opcode properties (diverts code flow, only available in 16bit mode, ...)

For fixed-size architectures (e.g. MIPS, where all instructions are 32 bits), the binary information is stored as an integer; but for variable-length archs (Ia32), it is stored in an array of bytes. In this last case, a field offset is the pair [byte index, shift inside the byte].
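The two ways of locating a bitfield can be sketched as below. The field names, positions and masks are made up for the example and do not come from the metasm opcode tables.

```ruby
# Variable-length case: field => [byte index, shift inside the byte].
BYTE_FIELD_OFFSETS = { reg: [0, 0], mod: [1, 6] }
# Fixed-size case: field => shift inside the whole instruction word.
WORD_FIELD_OFFSETS = { rs: 21, rt: 16 }
# Width of each field, expressed as a mask applied after shifting.
FIELD_MASKS = { reg: 0x7, mod: 0x3, rs: 0x1f, rt: 0x1f }

# Extract a field from an instruction given as an array of bytes.
def byte_field(bytes, name)
  index, shift = BYTE_FIELD_OFFSETS.fetch(name)
  (bytes[index] >> shift) & FIELD_MASKS.fetch(name)
end

# Extract a field from an instruction given as a single integer.
def word_field(word, name)
  (word >> WORD_FIELD_OFFSETS.fetch(name)) & FIELD_MASKS.fetch(name)
end

reg = byte_field([0x51, 0xc0], :reg)   # low 3 bits of byte 0
mod = byte_field([0x51, 0xc0], :mod)   # top 2 bits of byte 1
rs  = word_field(0x0220_0000, :rs)     # bits 21..25 of the word
```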

Before decoding the first instruction, the CPU will build a lookaside table, which holds, for each possible byte value, the list of opcodes that may have this byte as first byte. This is done in order to speed up decoding: it reduces the length of the table that must be checked at every new instruction to decode. It will also create for each opcode a binary mask using the binary value for the opcode and the masks of all parameters/bitfields.
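A toy version of the lookaside table could look like this. ToyOpcode, its arg_mask field and the four sample opcodes are assumptions made for the sketch; real opcode descriptions carry more information.

```ruby
# Hypothetical opcode descriptions: the fixed bytes, plus the bits of the
# first byte that belong to a parameter and therefore vary.
ToyOpcode = Struct.new(:name, :bin, :arg_mask)

TOY_OPCODES = [
  ToyOpcode.new("nop",      [0x90], 0x00),
  ToyOpcode.new("push reg", [0x50], 0x07), # low 3 bits select the register
  ToyOpcode.new("pop reg",  [0x58], 0x07),
  ToyOpcode.new("ret",      [0xc3], 0x00)
]

# For each possible first byte, list the opcodes that may start with it.
def build_lookaside(opcodes)
  table = Array.new(256) { [] }
  opcodes.each do |op|
    fixed_mask = 0xff & ~op.arg_mask   # bits that must match exactly
    256.times do |byte|
      table[byte] << op if (byte & fixed_mask) == (op.bin[0] & fixed_mask)
    end
  end
  table
end

TOY_LOOKASIDE = build_lookaside(TOY_OPCODES)
candidates = TOY_LOOKASIDE[0x51].map(&:name)
```

With the table built once, decoding a byte such as 0x51 only has to consider the opcodes listed in its slot instead of the whole opcode list.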

The instruction to decode is given to the CPU through an EncodedData object. This object wraps a String (or VirtualString), and adds methods and attributes to give names to certain offsets of this string (exports) and to add relocation information (relocs, e.g. bytes 0 through 3 hold 0x12345678, bytes 4 through 7 hold the value of the symbol named 'toto' encoded as a 32bit bigendian word). It also has a pointer that points to something to decode, which is advanced once the thing has been decoded. In this case, it points to the beginning of the instruction to decode.

The instruction decoding method is CPU#decode_instruction (decode.rb). It takes the following steps (arch-specific, implemented in <cpu>/decode.rb):

  • Find the opcode corresponding to the encoded data. This is done by checking the lookaside list and trying the raw data against each opcode binary mask. This method returns a DecodedInstruction di, with the opcode attribute set to the matching Opcode object.
  • Decode the actual Instruction in di.instruction. This object stores the specific arguments used by the encoded instruction, e.g. a :reg argument whose bitfield value is 1 in the raw data will be converted to the Ia32::Reg object representing the ecx register.
  • Contextualize the instruction. This may be needed for e.g. jumps in the Ia32 architecture, where the convention is for the instruction to have an absolute address as argument, whereas the binary encoding holds only a delta from the current instruction pointer.

This process also sets the bin_length field of the DecodedInstruction, which holds the size in bytes of the binary data occupied by the encoded instruction.
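The three steps can be sketched on a toy CPU. All names below (MiniOpcode, MiniDecoded, the mini_* methods) are hypothetical, the instruction set is reduced to three one-byte opcodes, and the code works on a plain array of bytes rather than an EncodedData.

```ruby
# Hypothetical toy CPU: one-byte opcodes, optional 8-bit relative target.
MiniOpcode = Struct.new(:name, :bin, :mask, :args)
MiniDecoded = Struct.new(:opcode, :instruction, :bin_length, :address)
MINI_REGS = %w[eax ecx edx ebx esp ebp esi edi]

MINI_OPCODES = [
  MiniOpcode.new("push", 0x50, 0xf8, [:reg]),
  MiniOpcode.new("jmp",  0xeb, 0xff, [:rel8]),
  MiniOpcode.new("ret",  0xc3, 0xff, [])
]

# Step 1: find the opcode whose masked bits match the raw data.
def mini_find_opcode(bytes, off)
  op = MINI_OPCODES.find { |o| (bytes[off] & o.mask) == o.bin }
  MiniDecoded.new(op) if op
end

# Step 2: turn the raw bitfields into concrete arguments.
def mini_decode_args(bytes, off, di)
  len = 1
  args = di.opcode.args.map do |type|
    case type
    when :reg
      MINI_REGS[bytes[off] & 7]
    when :rel8
      len += 1
      delta = bytes[off + 1]
      delta >= 0x80 ? delta - 0x100 : delta   # sign-extend the byte
    end
  end
  di.instruction = [di.opcode.name, *args]
  di.bin_length = len
  di
end

# Step 3: contextualize, turning the relative delta into an absolute address.
def mini_contextualize(di, addr)
  di.address = addr
  if di.opcode.args == [:rel8]
    di.instruction[1] = addr + di.bin_length + di.instruction[1]
  end
  di
end

def mini_decode_instruction(bytes, off, addr)
  di = mini_find_opcode(bytes, off)
  return nil unless di
  mini_decode_args(bytes, off, di)
  mini_contextualize(di, addr)
end

code = [0x51, 0xeb, 0xfe, 0xc3]
push = mini_decode_instruction(code, 0, 0x1000)
jmp  = mini_decode_instruction(code, 1, 0x1001)
```

Here the jump holds a delta of -2 from the end of a 2-byte instruction, so after contextualization its argument is the address of the jump itself.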

These two primitives (ExeFormat/Instruction decoding) are the building blocks used to create the Disassembler, which we'll see next week.