Haskell High Performance Programming

By: Samuli Thomasson

Overview of this book

Haskell, with its optimizing compiler and capacity for high performance, is a natural candidate for high performance programming. It is especially well suited to stacking abstractions high with a relatively low performance cost. This book addresses the challenges of writing efficient code with lazy evaluation, along with the techniques commonly used to optimize the performance of Haskell programs. We open with an in-depth look at the evaluation of Haskell expressions and discuss optimization and benchmarking. You will learn to use parallelism, and we'll explore the concept of streaming. We'll demonstrate the benefits of running multithreaded and concurrent applications. Next, we'll guide you through various profiling tools that will help you identify performance issues in your program. We'll end our journey by looking at GPGPU, cloud, and functional reactive programming in Haskell, closing with a catalogue of robust library recommendations with code samples. By the end of the book, you will be able to boost the performance of any app and prepare it to stand up to real-world punishment.

Inspecting time and space usage


It is often necessary to have concrete numbers about the time and space usage of Haskell programs, either as an indicator of how well the program performs or to identify unnecessary allocations. The GHC Runtime System flag -s enables printing allocation and garbage-collection statistics when the program finishes.

Let's try this with an example program, which naively calculates the covariance of two lists:

-- file: time_and_space.hs
import Data.List (foldl')

-- Strict left fold: the accumulator is forced at every step.
sum' :: [Double] -> Double
sum' = foldl' (+) 0

mean :: [Double] -> Double
mean v = sum' v / fromIntegral (length v)

covariance :: [Double] -> [Double] -> Double
covariance xs ys =
    sum' (zipWith (\x y -> (x - mean xs) * (y - mean ys)) xs ys)
    / fromIntegral (length xs)

main :: IO ()
main = do
    let xs = [1, 1.1 .. 500]
        ys = [2, 2.1 .. 501]
    print $ covariance xs ys

To enable passing options to the Runtime System, we must compile with -rtsopts:

$ ghc -rtsopts time_and_space.hs

For the time being, we ignore optimizations GHC could do for us and compile the program without any:

$ ./time_and_space +RTS -s
20758.399999992813
     802,142,688 bytes allocated in the heap
       1,215,656 bytes copied during GC
         339,056 bytes maximum residency (2 sample(s))
          88,104 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1529 colls,     0 par    0.008s   0.007s     0.0000s    0.0004s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0003s    0.0006s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.072s  (  1.073s elapsed)
  GC      time    0.008s  (  0.008s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.083s  (  1.082s elapsed)

  %GC     time       0.8%  (0.7% elapsed)

  Alloc rate    747,988,284 bytes per MUT second

  Productivity  99.2% of total user, 99.3% of total elapsed

On the first line of output from the Runtime System, we see that we allocated over 800 megabytes of memory. This is quite a lot for a program that only handles two lists of about 5,000 double-precision values each, so there is definitely something in our code that could be made a lot more efficient. The output also contains other useful information, such as the total memory in use and, more importantly, some statistics on garbage collection. Our program spent only 0.8% of its time in GC, meaning the program was doing productive work 99.2% of the time. So the performance problem lies in the calculations themselves.

If we look at the definition of covariance, we can spot the many invocations of mean in the lambda we pass to zipWith: we actually calculate the means of both lists thousands of times over. So let's optimize that away:

covariance' :: [Double] -> [Double] -> Double
covariance' xs ys =
    let mean_xs = mean xs
        mean_ys = mean ys
    in sum' (zipWith (\x y -> (x - mean_xs) * (y - mean_ys)) xs ys)
       / fromIntegral (length xs)
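
To measure the improved version, we can simply point main at covariance' (assuming we edit main in the same time_and_space.hs file) and recompile:

main :: IO ()
main = do
    let xs = [1, 1.1 .. 500]
        ys = [2, 2.1 .. 501]
    print $ covariance' xs ys

$ ghc -rtsopts time_and_space.hs
$ ./time_and_space +RTS -s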

With covariance' we get down to three megabytes of allocation:

       3,263,680 bytes allocated in the heap
         915,024 bytes copied during GC
         339,032 bytes maximum residency (2 sample(s))
         112,936 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         5 colls,     0 par    0.002s   0.002s     0.0003s    0.0005s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0005s    0.0010s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.003s  (  0.003s elapsed)
  GC      time    0.003s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.008s  (  0.006s elapsed)

  %GC     time      35.3%  (44.6% elapsed)

  Alloc rate    1,029,648,194 bytes per MUT second

  Productivity  63.1% of total user, 79.6% of total elapsed

That's nearly a 250-fold decrease in heap allocation! With the new version, a considerable share of the run time, about a third, now goes to GC, but only because the mutator has so little work left to do. This is about as good as we can get without enabling compiler optimizations; if we compiled with -O, we would get to under two megabytes of heap allocation. And if you tried the original covariance with optimizations on, you should get exactly the same performance as with the hand-optimized variant: in fact, both versions compile to the same assembly code. This is a demonstration of the sophistication of GHC's optimizer, which we will take a deeper look at in a later chapter.
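
To verify this yourself, recompile with optimizations and rerun (the -fforce-recomp flag makes GHC rebuild even though the source is unchanged; exact figures will vary by machine and GHC version):

$ ghc -O -rtsopts -fforce-recomp time_and_space.hs
$ ./time_and_space +RTS -s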

Tip

GHCi tip:

By setting +s in the interpreter, you can get time and space statistics of every evaluation, which can be handy for quick testing. Keep in mind though that no optimizations can be enabled for interpreted code, so compiled code can have very different performance characteristics. To test with optimizations, you should compile the module with optimizations and then import it into GHCi.
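
As a quick illustration, here is the unoptimized example loaded into GHCi (the timing and allocation figures on the last line are indicative only; they depend on your machine and GHC version):

$ ghci time_and_space.hs
ghci> :set +s
ghci> covariance [1, 1.1 .. 500] [2, 2.1 .. 501]
20758.399999992813
(1.08 secs, 802,142,688 bytes)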

Increasing sharing and minimizing allocation

In the covariance example, we observed that we could improve code performance by explicitly sharing the result of an expensive calculation. Alternatively, enabling compiler optimizations would have had the same effect (with some extras). Most of the time, the optimizer does the right thing, but that is not always the case. Consider the following versions of a rather silly function:

-- file: time_and_space_2.hs

goGen        u = sum [1..u] + product [1..u]
goGenShared  u = let xs = [1..u] in sum xs + product xs

Try reasoning which of these functions executes faster. The first one builds two possibly very large lists and then immediately consumes them, independent of each other. The second one shares the list between sum and product.

The list-sharing function is about 40% slower than the list-rebuilding function (0.070s versus 0.050s in the table below). When we share the list, we need to keep the whole list in memory, whereas by just enumerating the elements we can discard them as we go. The following table confirms our reasoning: the list-sharing function has a larger maximum residency and does more GC:

U = 10000      Time     Allocated heap   Copied during GC   Maximum residency   Total memory   Time in GC
goGen          0.050s   87 MB            10 MB              0.7 MB              6 MB           60%
goGenShared    0.070s   88 MB            29 MB              0.9 MB              7 MB           70%

Recall that, in the covariance example, the compiler automatically shared the values of mean xs and mean ys for us when we enabled optimizations. But in the previous example, we didn't get implicit sharing of the lists, even though they are thunks just like the results of mean xs and mean ys. So what magic enabled the GHC optimizer to choose the better sharing scheme in both cases? The optimizer is non-trivial, and unfortunately, in practice it's not feasible to blindly rely on it always doing "the right thing." If you need to be sure that some sharing will take place, you should test it yourself.
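
One straightforward way to test it, as a minimal sketch (the file name and the driver main here are illustrative): put each variant into a small program compiled with -rtsopts, run it with +RTS -s as before, and compare the statistics:

-- file: test_sharing.hs
goGen, goGenShared :: Double -> Double
goGen       u = sum [1..u] + product [1..u]
goGenShared u = let xs = [1..u] in sum xs + product xs

main :: IO ()
main = print (goGenShared 10000)  -- swap in goGen and recompile to compare

$ ghc -rtsopts test_sharing.hs
$ ./test_sharing +RTS -s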

Let's go back to our previous example of sum and product. Surely we could do better than spending 60% of the time in GC. The obvious improvement would be to make only one pass through the list and calculate both the sum and product of the elements simultaneously. The code is then a bit more involved:

goGenOnePass :: Double -> Double
goGenOnePass u = su + pu
  where
    (su, pu) = foldl f (0, 1) [1..u]
    -- Force both components at each step so that no thunk chains
    -- build up inside the accumulator pair.
    f (s, p) i = let s' = s + i
                     p' = p * i
                 in s' `seq` p' `seq` (s', p')

Note the use of seq in the definition of goGenOnePass to force both components of the accumulator at every step. This version performs much better: only 10% of the time goes to GC, and it runs in about half the time of our first version:

U = 10000      Time     Allocated heap   Copied during GC   Maximum residency   Total memory   Time in GC
goGenOnePass   0.025s   86 MB            0.9 MB             0.05 MB             2 MB           10%

The takeaway message is that once again algorithmic complexity matters more than technical optimizations. The one-pass version executed in half the time of the original two-pass version, as would be expected.

Note

With the Bang Patterns (BangPatterns) language extension (available since GHC 6.6), the f binding could have been written more cleanly as f (!s, !p) i = (s + i, p * i), with very slightly degraded performance (0.7%). Annotating a variable in a pattern with a bang means that the variable is evaluated (to weak head normal form) as soon as the pattern it appears in is matched.
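
For reference, here is a sketch of the full bang-pattern variant (named goGenOnePass' here to distinguish it; behaviorally equivalent to the seq-based definition above):

{-# LANGUAGE BangPatterns #-}

goGenOnePass' :: Double -> Double
goGenOnePass' u = su + pu
  where
    (su, pu) = foldl f (0, 1) [1..u]
    -- The bangs force s and p as soon as the pair is matched, playing
    -- the same role as the seq calls in the earlier definition.
    f (!s, !p) i = (s + i, p * i)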