Book Image

Windows Malware Analysis Essentials

By : Victor Marak
Book Image

Windows Malware Analysis Essentials

By: Victor Marak

Overview of this book

Table of Contents (13 chapters)

Sharpening the scalpel


The regular disassembler is a static analysis software tool that performs many different processes and extracts information out of a binary executable. It parses the binary executable, takes apart the individual sections, and presents a list of annotated assembly code from the binary string of opcodes embedded inside the executable.

Additional embellishments arrange relevant data such as symbolic function and variable names (if present), stack frames and variable lists, common data structures, strings, and import and export function jump lists. There are two primary algorithms that are implemented in a disassembler:

  • Linear Sweep (Windbg, Win32Dasm, Sourcer)

  • Recursive Traversal (OllyDbg, IDA Pro)

Linear sweep processes a binary executable by navigating to the code segment and reading the binary strings as a linear sequence of bytes, which are decoded according to a table of opcodes for a specific instruction set, much like a mapping process with an incrementing position counter to keep track of the next byte sequence to decode. The primary caveat is that because linear disassembly assumes the code section as pristine without being tainted by data elements, additional code constructs such as unreachable code, code interspersed with data (switch tables and function pointer tables), opaque predicates (fixed-value false conditional expressions), and stack frame pointer resolution cannot be done with confidence as cross references such as function call statements are not maintained. Thus, complicated machine code sequences can confuse the disassembly and result in junk code. However, code coverage is a feature that can be availed of when necessary.

Recursion in computer science might remind you of mathematical induction and a growing stack-based nested function call sequence of a function calling itself. The recursive function requires a terminating condition in order to halt a repeated procedure for input values by calling itself repeatedly till the terminating condition is met. However, recursive traversal disassembly algorithm is a relatively complex undertaking that can be implemented in numerous ways. As a generic approach, all conditional (jnz/jxx) and unconditional code constructs (jmp/call) are taken into account and the control flow is traversed wherein the pseudo custom C data structure is as follows:

typedef struct _instruction_metadata {
  unsigned int *instr_offset; /* instruction offset in executable
    or eip if emulated */
  unsigned short op_length; /* processed opcode length */
  unsigned int *dest_address;
  char array [op_length]; /* opcode sequence */
  unsigned int *return_address; /* for address of next instruction after call */
  /* also current offset + opcode size */
  /* data structure representing internal parameters required by
     the disassembler */
  MetaData meta;
}INS_META;

This structure is saved in the disassembler's internal data structures as branch lists (also known as jump list – which is confirmed code instructions or return list – which is yet to be identified addresses-code/data/tables) for resolving all control flow pathways. After each linear control path analysis pass, another address is retrieved from the branch list and the evaluate list, and the disassembly process resumes till both lists are empty. This list is re-processed in analysis passes, and cross references are drawn from the prior analysis, resulting in a much more coherent disassembly listing. Quite a bit of heuristics (for instance, compiler type-based assembly code templates and EBP or ESP-based stack frames) and code instructions vs. data statistical analyses strive to make this a more reliable process. The disassembly also annotates the disassembly code accordingly with identifiers and labels, and could even provide with disassembler-generated comments for familiar code sequences, to get the final code listing. Binary code disassembly can be an intractable problem, particularly if requiring user input or external data and with no named symbols in the executable, and things such as variable and function names are lost. While the disassembly will provide a lot of insight by presenting the code in the best possible light, whatever is remaining will have to be semantically reconstructed from the disassembly manually or using advanced algorithm-based code analysis tools. We will cover the standard disassemblers in vogue, as well as code analysis tools aimed for high-quality reconstruction and analysis. Because of the complexity involved, recursive traversal is more time consuming than linear sweep but more accurate and resilient to the issues that can halt the linear sweep process, and therefore the algorithm of choice for our purposes.