We will now discuss encoding ASCII data as bytes and base64 encoding these bytes. We will also cover base64 encoding for binary data and decoding to get back to the original input.
In ASCII, each character turns into one byte:
- A is 65 in base 10, and in binary, it is 0b01000001. Here, you have 0 in the most significant bit because there's no 128, then you have 1 in the next bit for 64 and 1 in the last bit, so you have 64 + 1 = 65.
- The next is B, which is 66 in base 10, and C, which is 67. The binary for B is 0b01000010, and for C, it is 0b01000011.
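The values above can be checked with a short sketch. It assumes Python 3 and uses only the built-in ord and format functions; the dictionary name codes is ours, chosen for illustration.

```python
# Show the decimal and 8-bit binary value of each ASCII character.
codes = {}
for ch in "ABC":
    value = ord(ch)               # decimal code, e.g. 65 for "A"
    bits = format(value, "08b")   # zero-padded 8-bit binary string
    codes[ch] = (value, bits)
    print(ch, value, "0b" + bits)
```

Each character maps to exactly one byte, and the leading bit of every byte is 0 because no ASCII code reaches 128.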
The three-letter string ABC can be interpreted as a 24-bit string that looks like this:
01000001 | 01000010 | 01000011
The vertical bars (blue lines in the original figure) just show where the bytes are broken out. To interpret that as base64, you need to break it into groups of 6 bits. 6 bits have a total of 64 combinations, so you need 64 characters to encode it.
The characters used are as follows. We use the capital letters for the first 26, lowercase letters for the next 26, and the digits for another 10, which gets you up to 62 characters. In the most common form of base64, you use + and / for the last two characters.
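As a minimal sketch of that alphabet, assuming Python 3 and the standard string module, the 64 symbols can be built in order. The name alphabet is ours for illustration.

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
alphabet = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

print(len(alphabet))   # one symbol for each possible 6-bit value
print(alphabet[0])     # the symbol for the value 0
print(alphabet[62:])   # the last two symbols
```

The position of a symbol in this string is the 6-bit number it represents.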
If you have an ASCII string of three characters, it turns into 24 bits interpreted as 3 groups of 8. If you instead break them up into 4 groups of 6, you have 4 numbers between 0 and 63, and in this case, they turn into Q, U, J, and D. In Python 2, which this section uses, you just have a string followed by the command:
>>> "ABC".encode("base64")
'QUJD\n'
This does the encoding. Python adds an extra newline (\n) at the end, which doesn't matter and doesn't affect the decoding.
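The regrouping from 3 groups of 8 bits into 4 groups of 6 bits can be sketched by hand. This is an illustration, not how the codec is implemented internally; it assumes Python 3, and the names bits, groups, indexes, and encoded are ours.

```python
import base64
import string

alphabet = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

# Join the three ASCII bytes into one 24-bit string.
bits = "".join(format(byte, "08b") for byte in b"ABC")

# Cut the 24 bits into four 6-bit groups and convert each to a number.
groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
indexes = [int(group, 2) for group in groups]

# Look each number up in the alphabet.
encoded = "".join(alphabet[i] for i in indexes)
print(groups)
print(indexes)
print(encoded)

# Cross-check against the standard library's base64 module.
assert encoded == base64.b64encode(b"ABC").decode("ascii")
```

The four numbers are 16, 20, 9, and 3, which are the alphabet positions of Q, U, J, and D.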
What if you have something other than a group of 3 bytes?
If you have four bytes of input, the base64 encoding ends with two equals signs, indicating that two bytes of padding had to be added to reach a multiple of three. If you have five bytes, you have one equals sign, and if you have six bytes, there are no equals signs, indicating that the input fit neatly into base64 with no need for padding. The padding consists of null (zero) bytes.
You can see this by encoding ABCD, and then encoding ABCD followed by an explicit byte of zero. \x00 means a single character with eight bits of zero, and you get the same result except with an extra A and only one equals sign. If you fill it out all the way with two bytes of zero, the padding characters become capital A all the way. Remember: a capital A is the very first character in the base64 alphabet. It stands for six bits of zero.
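The padding rule can be sketched as a small hand-written encoder. This is a teaching sketch under our own assumptions (Python 3, a hypothetical helper named b64_encode_sketch), checked against the standard base64 module rather than taken from it.

```python
import base64
import string

ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

def b64_encode_sketch(data):
    """Encode bytes as base64 text, padding the last group with zero bits."""
    missing = (3 - len(data) % 3) % 3      # bytes needed to reach a multiple of 3
    padded = data + b"\x00" * missing      # fill with explicit zero bytes
    bits = "".join(format(byte, "08b") for byte in padded)
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    if missing:
        # Characters that came purely from padding are written as "=".
        chars[-missing:] = "=" * missing
    return "".join(chars)

for sample in (b"ABCD", b"ABCD\x00", b"ABCD\x00\x00"):
    result = b64_encode_sketch(sample)
    assert result == base64.b64encode(sample).decode("ascii")
    print(sample, result)
```

Comparing the three results shows that each equals sign stands in for an A that would have come from a padding byte of zeros.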
Let's take a look at base64 encoding in Python:
- We will start python up and make a string. If you just make a string with quotes and press Enter, it will print it in immediate mode:
>>> "ABC"
'ABC'
- Python will print the result of each calculation automatically. If we encode that with base64, we will get this:
>>> "ABC".encode("base64")
'QUJD\n'
- It turns into QUJD with an extra newline at the end. Let's make the input longer:
>>> "ABCD".encode("base64")
'QUJDRA==\n'
- This has two equals signs because we started with four bytes, and it had to add two more to make it a multiple of three:
>>> "ABCDE".encode("base64")
'QUJDREU=\n'
>>> "ABCDEF".encode("base64")
'QUJDREVG\n'
- With a five-byte input, we have one equals sign, and with six bytes of input, we have no equals signs; instead, we have a total of eight base64 characters.
- Let's go back to ABCD with the two equals signs:
>>> "ABCD".encode("base64")
'QUJDRA==\n'
- You can see how the padding was done by putting it in explicitly here:
>>> "ABCD\x00".encode("base64")
'QUJDRAA=\n'
There's a first byte of zero, and now we get only a single equals sign.
- Let's put in a second byte of zero:
>>> "ABCD\x00\x00".encode("base64")
'QUJDRAAA\n'
We have no padding here, and we see that the last characters are all A, indicating that there's been a filling of binary zeros.
The next issue is handling binary data. Executable files are binary and not ASCII. Also, images, movies, and many other files have binary data. Every byte of ASCII data has zero as its first bit, which is not true of binary data, but base64 works fine with binary data. Here is a common executable file, a forensic utility; it starts with MZê and has unprintable ASCII characters:
As this is a hex viewer, you see the raw data in hexadecimal, and on the right, it attempts to print it as ASCII. Windows programs have the string This program cannot be run in DOS mode near the start, but they also have a lot of unprintable characters, such as FF and 0, which don't matter to Python at all. An easy way to encode data like that is to read it directly from the file. You can use the with statement: it opens a file with a given filename in read binary mode with the handle f, and then you can read it. The with statement also tells Python to close the handle when the block ends, even if an error occurs. You then encode the data exactly the same way as before. To decode data you've encoded in this fashion, you just take the output string and use .decode instead of .encode.
Now let's take a look at how to handle binary data:
- We will first exit Python so that we can see the filesystem, and then we'll look for files whose names start with Ac using the command shown here:
>>> exit()
$ ls Ac*
AccessData Registry Viewer_1.8.3.exe
There's the filename. Since that's kind of a long name, we are just going to copy and paste it.
- Now we clear the screen using the following command:
$ clear
- We will start python again:
$ python
- Alright, so, now we use the following command:
>>> with open("AccessData Registry Viewer_1.8.3.exe", "rb") as f:
...     data = f.read()
...     print data.encode("base64")
...
Here we enter the filename first and then the mode, which is read binary. We give it a file handle of f. We take all the data and put it in a single variable, data, then encode the data in base64 and print it. If you have an indented block in Python, you have to press Enter twice so it knows the block is done; then the base64-encoded data is printed.
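For readers on Python 3, where strings no longer have a base64 codec, the same steps can be sketched with the standard base64 module. To stay self-contained, this sketch writes a few hypothetical bytes to a temporary file instead of opening the real executable; the byte values and the names sample, path, data, and encoded are our assumptions.

```python
import base64
import os
import tempfile

# Hypothetical stand-in for an executable: "MZ" followed by non-ASCII bytes.
sample = b"MZ\x90\x00\x03\x00\xff\xff"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(sample)
    path = tmp.name

# Open in read binary mode; the with block closes the handle automatically.
with open(path, "rb") as f:
    data = f.read()

encoded = base64.b64encode(data)   # bytes in, base64 bytes out
print(encoded.decode("ascii"))

os.remove(path)                    # clean up the temporary file
```

To encode a real file, replace path with its filename. Unlike the Python 2 codec, base64.b64encode does not append a trailing newline.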
>>> "ABC".encode("base64")
'QUJD\n'
- If we want to play with it, put that in a c variable using the following command:
>>> c = "ABC".encode("base64")
>>> print c
QUJD
- Now we can print c to make sure that we have got what we expected. We have QUJD, which is what we expected. So, now we can decode it using the following command:
>>> c.decode("base64")
'ABC'
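The same round trip can be sketched in Python 3 with the standard base64 module, assuming bytes input rather than a Python 2 string. The names c, original, and binary mirror or extend the walkthrough and are otherwise ours.

```python
import base64

c = base64.b64encode(b"ABC")     # encode three ASCII bytes
original = base64.b64decode(c)   # decode back to the input
print(c, original)

# The round trip also works for every possible byte value, and it needs
# no key, which is why base64 provides no secrecy.
binary = bytes(range(256))
assert base64.b64decode(base64.b64encode(binary)) == binary
```

Anyone who sees the encoded text can recover the input with a single call.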
base64 is not encryption. It does not hide anything; it is just another way to represent the same data. In the next section, we'll cover XOR.