Book Image

Clojure Data Structures and Algorithms Cookbook

By : Rafik Naccache
Book Image

Clojure Data Structures and Algorithms Cookbook

By: Rafik Naccache

Overview of this book

Table of Contents (14 chapters)
Clojure Data Structures and Algorithms Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Efficiently compressing a byte array


Compressing a byte array is a matter of recognizing repeating patterns within a byte sequence and devising a method that can represent the same underlying information to take advantage of these discovered repetitions.

To get a rough idea of how this works, imagine having a sequence as:

["a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "c" "c"]

It is intuitively more efficient to represent this as:

[3 times "a", 7 times "b", 2 times "c"]

Now, we are going to use a methodology based on a well-known algorithm, that is, the LZ77 compression method. LZ77 is, despite being quite old, the basis of most of all the well-known and currently used compression methods, especially the Deflate algorithm.

Note

Deflate is at the heart of the ZIP family of compression algorithms. It uses a slightly modified version of LZ77 plus a special encoding, that is, the Huffman encoding.

The point of LZ77 is to walk a sequence and recognize a pattern in the past elements that will occur in the upcoming elements, replacing those with a couple of values: how many elements should go backwards in order to locate the recurring pattern, which is called "distance"; and how long is the recurring pattern, which is labeled as "length".

The iteration of the LZ77 compression would look as follows:

  1. At any point of time, the algorithm is processing a particular element, which is located at the current position. Consider a window of the size n, as a set of n elements preceding the one that is occupying current position, and consider lookahead as the rest of the elements up until the input's end.

  2. Begin with the first element of the input.

  3. Move on to the next element.

  4. Find in the window (that is, past n elements), the longest pattern that can be found in lookahead.

  5. If such a sequence is found, consider distance as the location where, the matching sequence was found, expressed in regards to the current position, consider length as the length of the matching pattern, and proceed with the two following actions:

    • Replace the match in lookahead by "distance" and "length".

    • Move forward using the "length" elements and resume algorithm execution at step 4.

  6. Otherwise, resume at step 3.

The procedure to uncompress is as follows:

  1. Walk the compressed sequence.

  2. If the "distance" and "length" are found, go back to the "distance" elements and replace this couple with the "length" elements.

  3. If not, lay out the element that you've found.

Let's see this in action in Clojure!

How to do it...

  1. First of all, here is the ns declaration containing the Clojure facilities that we are going to use:

     (ns recipe1.core
      (:require [clojure.set :as cset])) 
       ;; => we'll need set operations later on.
    
  2. Let's begin by working on the uncompressing part. First of all, we need an expand function that takes the source array as a vector of the elements distance and length and generates a repetition of a sub-vector of the last distance characters from the source array until the length is reached:

    (defn expand
      [the-vector
       distance
       length]
      (let [end (count the-vector)
            start (- end
                     distance)     
    ;;=> Here we go backwards 'distance' elements.
            pattern (subvec the-vector
                            start
                            end)]      ;=> We have our pattern.
        (into [] (take length    ;=> We exactly take "length" from 
                   (cycle pattern)))))  
    ;; an infinite repetition of our pattern.
    
  3. Now, let's define un-LZ77 using expand function while walking through a sequence of bytes:

    (defn un-LZ77
      [bytes]
      (loop [result []
             remaining bytes]    
    ;;=> We recur over the contents of the array.
        (if (seq remaining)
          (let [current (first remaining)
                the-rest (rest remaining)]  
    ;=> Current element under scrutiny;
            (if-not (vector? Current)  
    ;=> If it is not a vector, add to result
              (recur (conj result    
    ;;      the very element, and move on.
                           current)
                     the-rest)
              (recur (into result (expand result  
    ;;=> This is a vector, then we'll expand here and move on  
                                          (current 0)         
                                          (current 1)))
                     the-rest)))               
          result)))                                                 
    ;;=> end of recursion, return result.
    
  4. Now let's address the topic of compressing. First of all, we need to grab all sub-vectors, as we'll have to find matches between window and lookahead and then pick the longest one among them:

    (defn all-subvecs-from-beginning       
    ;;=> this function will generate a set of all sub-vectors starting ;; from begin
      [v]            
      (set (map #(subvec v 0 %)             
    ;;=> we apply subvec from 0 to all indices from 1 up to the size ;; of the array + 1. 
                (range 1 (inc (count v))))))      
      
    (defn all-subvecs                             
    ;;=> this function will generate all 
      [v]          ;       sub-vectors, applying 
      (loop [result #{}        
    ;;       all-subvecs from beginning to
    ;;       all possible beginnings.
             remaining v]                            
        (if (seq remaining)
          (recur (into result
                       (all-subvecs-from-beginning remaining))
                 (into[]  (rest remaining)))     
    ;;=> We recur fetching all sub-vectors for next beginning.
                  
          result)))
    ;;=> end of recursion, I return result.
    
  5. Now we define a function to grab the longest match in left array with the beginning of right array:

    (defn longest-match-w-beginning
      [left-array right-array]
      (let [all-left-chunks (all-subvecs left-array) 
                                     all-right-chunks-from-beginning
    ;;=> I take all sub-vectors from left-array
            (all-subvecs-from-beginning right-array)
    ;;=> I take all sub-vectors from right-array
        all-matches (cset/intersection all-right-chunks-from-beginning
                                           all-left-chunks)]
    ;;=> I get all the matchings using intersection on sets
        (->> all-matches
             (sort-by count >)
             first)))     
    ;=> Then I return the longest match, sorting them 
    ;; by decreasing order and taking the first element.
    
  6. With the longest match function in hand, we need a function to tell us where where is this match exactly located inside the window:

    (defn pos-of-subvec
      [sv v]  
      {:pre [(<= (count sv)
                 (count v))]} 
    ;;=> I verify that sv elements are less than v's.
      (loop
          [cursor 0]                       
        (if (or (empty? v)                  
    ;;=> If on of the vectors is empty
                (empty? sv)
                (= cursor   (count v)))   
    ;; or the cursor ended-up exiting v,
          nil              ;; we return nil  
          (if (= (subvec v cursor         
    ;; => If we found that the v sub-vector 
                         (+ (count sv)         
    ;;   beginning with cursor up to sv count  
                            cursor)) sv)         
    ;; is equal to sv cursor
    ;; we return cursor, this is where the match is.  
            (recur (inc cursor))))))     
    ;=> We recur incrementing the cursor
    
  7. Armed with the toolbox we've built so far, let's devise an LZ77 step:

    (defn LZ77-STEP
      [window look-ahead]
      (let [longest (longest-match-w-beginning window
      look-ahead)]         ;;=> We find the Longest match,
        (if-let [pos-subv-w (pos-of-subvec longest window)]     
    ;;=> If there is a match  we find its position in window.
          (let [distance (-  (count window) pos-subv-w) 
    ;;=> the distance,
                pos-subv-l (pos-of-subvec longest
                                          look-ahead)
    ;;=> the position of the match in look-ahead
                the-char (first (subvec look-ahead
                                        (+ pos-subv-l
                                           (count longest))))]
    ;;=> the first element occuring after the match
            {:distance distance                 
             :length (count longest)
             :char the-char})      
    ;;=> and we return information about match
          {:distance 0      
           :length 0                                  
           :char (first look-ahead)})))      
    ;;=> We did not find a match, we emit zeros  for "distance" 
    ;; and "length", and first element of lookahead as first char 
    ;; occurring after the (non-)match.
    
  8. Finally, we will write the main LZ77 compression function as follows:

    (defn LZ77
    [bytes-array
     window-size]
    (->> (loop [result []
                cursor 0
                window []
                look-ahead bytes-array]    
    ;;=> we begin with position 0; and everything as look-ahead.
           (if (empty? look-ahead)
             result
    ;;=> end of recursion, I emit result.
             (let [this-step-output (LZ77-STEP window look-ahead) 
                   distance (:distance this-step-output)
                   length (:length this-step-output)
                   literal (:char this-step-output)
    ;;=> We grab informations about this step output
                   raw-new-cursor (+ cursor
                                     length
                                     1)
                   new-cursor (min raw-new-cursor
                                   (count bytes-array))
    ;;=> We compute the new-cursor, that is, where to go in the next ;; step
    ;;which is capped by count of bytes-array
                   new-window (subvec bytes-array
                                      (max 0 (inc (- new-cursor
                                                     window-size)))  
                                      new-cursor)          
    ;;=> new window is window-size elements back from new cursor.
                   new-look-ahead (subvec bytes-array
                                          new-cursor )]
    ;;=> new look-ahead is everything from new cursor on.
               (recur (conj result
                            [distance length]
                            literal) 
                      new-cursor 
                      new-window
                      new-look-ahead)))) 
    ;; and we recur with the new elements.
         (filter   (partial
                    not=
                    [0 0]))   
    ;;=> We eliminate the entries related to non-matches
         (filter (comp
                  not
                  nil?))   ;;=> and any nils
         (into [])))       ;;=> and make a vector out of the output.
    

That's it! Now, let's see our code in action. Input into your REPL as follows:

(LZ77 ["a" "b" "c" "f" "a" "b" "c" "d"] 5)
;; => ["a" "b" "c" "f" [4 3] "d"]
(un-LZ77 ["a" "b" "c" "f" [4 3] "d"])
;; => ["a" "b" "c" "f" "a" "b" "c" "d"]