Windows Malware Analysis Essentials

By Victor Marak

Entropy


The byte distribution of any binary file on your computer has a certain entropy to it. Entropy can be defined simply as a measure of disorder or uncertainty in a given system.

To explain the value of this metric in simpler terms: since file structures (binary or text) follow a set template for the most part, the data structures associated with them develop certain expected patterns. The rules that give the file meaning to its parser or parent software expect that grammar to be followed. However, if the data is random and does not follow the set sequence, the rules that expect the structure in sequence will fail to validate the input stream (series of bytes). This incoherence or discrepancy is directly proportional to the entropy of the file, or of the selected regions thereof. High entropy means that the file is filled with junk or random data, uses a custom format, or is corrupted, packed, compressed, or encrypted, or some combination thereof. However, as more information is accumulated about such a system, the sample data can be used to reduce the uncertainty: analyzing the input and its failure conditions gives a clearer scope of the sample's parameters.

A byte probability distribution assigns to each byte value the probability of its occurrence in the entire file. A byte can take decimal values from 0 to 255; notated in hexadecimal, the values run from 0x00 to 0xFF. The probability of each byte value occurring in the file stream is as follows:

P(b) = total count of occurrences of byte value b in the file / total number of bytes in the file

Taking the sigma (or summation) of each of these probabilities, weighted by the negative base 2 logarithm of the probability, gives us a value from 0.0 to 8.0. The scale is calibrated to the 8 bits used to encode a byte: the result is the average number of bits required to represent a byte in the current data stream.

Entropy = -Sigma(b = 0 to 255){P(b) * log2(P(b))}

The values can be fractional as well. The negation removes the negative sign that base 2 logarithms produce for negative powers of 2: log2(1/8) = -3 because 1/(2^3) = 2^-3. Probabilities will normally lie between 0 and 1; a probability of exactly 1 occurs only in the degenerate case where the stream consists of a single repeated byte value. Say, for a byte input stream of size 256 where every byte value from 0-255 occurs exactly once, each byte has an equal probability of 1/256.

We know that log2(1/256) = ln(1/256)/ln(2) = -8.

For each byte value, the expression {P(b) * log2(P(b))} evaluates to (1/256) * -8 = -(1/256 * 8).

Performing the sigma operation and negating: -1 * 256 * -(1/256 * 8) = 8. Now that we know the significance of the negative sign, we can say that the entropy is 8. Information theory-wise, it would mean that the file carries the maximum possible information. However, for our purposes, this file certainly has no defining structure, other than the fact that the distribution is anomalously uniform: every byte value that could occur does occur, and each with equal frequency.
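As a quick sanity check of this arithmetic, the following short C# program (a minimal sketch, independent of the entropy class shown later in this section) builds such a 256-byte stream and computes its entropy, printing 8:

using System;

class UniformStreamCheck
{
    static void Main()
    {
        // Build a 256-byte stream in which every byte value 0x00-0xFF occurs exactly once.
        byte[] stream = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            stream[i] = (byte)i;
        }

        // Histogram of byte values: every bin ends up with a count of exactly 1.
        int[] histogram = new int[256];
        foreach (byte b in stream) histogram[b]++;

        // -Sigma P(b) * log2(P(b)) with every P(b) = 1/256 sums to 256 * (1/256) * 8 = 8.
        double entropy = 0.0;
        foreach (int count in histogram)
        {
            if (count == 0) continue;
            double p = (double)count / stream.Length;
            entropy -= p * Math.Log(p, 2.0);
        }

        Console.WriteLine(entropy); // prints 8
    }
}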

A base 2 logarithm gives the number of bits (information units) required to represent or distinguish n states/symbols; for instance, log2(256) = 8, so 8 bits suffice to distinguish all 256 byte values. It boils down to permutation and statistical metrics expressed in a more compact manner.

The following C# class computes the entropy value and returns it as a string. The class exposes a static method, so there is no need to create an instance in the OOP paradigm; further, it can be used from any of the .NET-supported languages.

The method can be called using the following:

string value = Entropy.GetEntropy(<byte array of the input file>);

You need to pass the byte array of the input file. In C#, you can use the File class and its ReadAllBytes() method, which returns a byte array.

using System;

namespace ENTROPY
{
    class Entropy
    {
        public static string GetEntropy(byte[] c)
        {
            byte[] buffer = c;
            int[] numArray = new int[0x100]; // one histogram bin per byte value; C# zero-initializes arrays

            for (int j = 0; j < buffer.Length; j++) // histogram of each byte (j < Length, so the last byte is counted too)
            {
                numArray[buffer[j]]++;
            }

            int length = buffer.Length;
            float entropy = 0f;
            for (int k = 0; k < 0x100; k++) // include byte value 0x00; only absent values are skipped
            {
                if (numArray[k] != 0)
                {
                    float p = (float)numArray[k] / length; // P(b) for byte value k
                    entropy += -p * (float)Math.Log(p, 2.0); // accumulate -P(b) * log2(P(b))
                }
            }

            return entropy.ToString();
        }
    }
}
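Putting it together, a minimal driver might look like the following (the file path is just a placeholder; substitute the sample you want to measure):

using System;
using System.IO;
using ENTROPY;

class Program
{
    static void Main()
    {
        // Read the whole file into a byte array and hand it to the static method.
        byte[] data = File.ReadAllBytes("sample.exe"); // placeholder path
        string value = Entropy.GetEntropy(data);
        Console.WriteLine("Entropy: " + value);
    }
}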

Analyzing sosex_64.zip from http://www.stevestechspot.com/downloads/sosex_64.zip will give you a value of 7.96, which is a very high entropy value, as expected of a compressed ZIP archive. You can read more on building a visualizer component in C# for entropy analysis at http://resources.infosecinstitute.com/building-custom-controls-in-c-part-1/.

Some range normalizing or scaling methods compact values into the range 0 to 1 and can be used with probability distributions. Taking the reciprocal is one of the most common and simplest methods; other variants work on the mathematical properties of e to map values onto sigmoid or hyperbolic curves on a plot:

Sigmoid(X) = 1/(1 + e^(-X))

Hyperbolic(X) = (e^(2X) - 1)/(e^(2X) + 1)

Reciprocal(X) = 1/X
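As a sketch, the three mappings translate directly into C# (the method names are merely illustrative):

using System;

class Scaling
{
    // Sigmoid: maps any real X into the open interval (0, 1).
    static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

    // Hyperbolic tangent: maps any real X into (-1, 1); equivalent to Math.Tanh(x).
    static double Hyperbolic(double x) => (Math.Exp(2 * x) - 1) / (Math.Exp(2 * x) + 1);

    // Reciprocal: maps X > 1 into (0, 1); undefined at X = 0.
    static double Reciprocal(double x) => 1.0 / x;

    static void Main()
    {
        Console.WriteLine(Sigmoid(0.0));    // 0.5
        Console.WriteLine(Hyperbolic(1.0)); // ~0.7616
        Console.WriteLine(Reciprocal(8.0)); // 0.125
    }
}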


For our purposes, the final value represents the average number of bits required to encode each byte of the input stream. If the value is high, the byte stream is most likely encrypted or obfuscated, or is simply junk or corrupted data; you still need to differentiate these cases by using other analyses to complement the initial red flags.

Entropy analysis is a very useful metric for detecting compressed files, encrypted files, packed files, and obfuscated data, and hence is indispensable to malware analysis and malware forensics. Compiled code rarely produces this kind of randomness, as it follows a strict grammar derived from the source code text. Hence, when binary executables are tampered with or armored in any way, this simple metric can give that fact away. You can think of entropy as an anomaly detector against a given rule set for our purpose of malware analysis.
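To make this concrete, one practical refinement is to compute entropy over fixed-size blocks of a file rather than over the whole file at once, so that a packed or encrypted region stands out against the rest of the binary. The following is a minimal sketch of that idea; the 4096-byte block size and the 7.0 flag threshold are illustrative choices, not values prescribed in this chapter:

using System;
using System.IO;

class BlockEntropyScan
{
    const int BlockSize = 4096;   // illustrative block size
    const double Threshold = 7.0; // illustrative "high entropy" flag

    static void Main(string[] args)
    {
        // Pass the path of the file to scan as the first command-line argument.
        byte[] data = File.ReadAllBytes(args[0]);

        for (int offset = 0; offset < data.Length; offset += BlockSize)
        {
            int len = Math.Min(BlockSize, data.Length - offset);
            double e = BlockEntropy(data, offset, len);
            if (e >= Threshold)
            {
                Console.WriteLine("0x{0:X8}  entropy {1:F2}  <- possibly packed/encrypted", offset, e);
            }
        }
    }

    // Shannon entropy (bits per byte) of data[offset .. offset + length - 1].
    static double BlockEntropy(byte[] data, int offset, int length)
    {
        int[] histogram = new int[256];
        for (int i = 0; i < length; i++) histogram[data[offset + i]]++;

        double entropy = 0.0;
        foreach (int count in histogram)
        {
            if (count == 0) continue;
            double p = (double)count / length;
            entropy -= p * Math.Log(p, 2.0);
        }
        return entropy;
    }
}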