Hashing data is a common technique in the forensics community to "fingerprint" a file. Normally, we create a hash of an entire file; however, here, we will use hash chunks of a file to evaluate the similarity between two files. This technique is referred to as rolling hashing since the stream of data, known as the window, to hash rolls through the file. This allows us to generate hashes from a known file and compare them with unknown files. To generate this hash set for comparison, we must hash fixed chunks of a file and append them to a list. This allows us to compare chunks between files to see how many hashes are identified.
Before we explore the process of creating a rolling hash, let's begin by looking at a simpler scenario—hashing a file in Python. To start, we must decide which algorithm we would like to use in creating a hash for a file. This can be a tough question, as there are multiple factors to consider. The Message Digest Algorithm...