Dask is a parallel computing library that offers not only a general framework for distributing complex computations on many nodes, but also a set of convenient high-level APIs to deal with out-of-core computations on large arrays. Dask provides data structures resembling NumPy arrays (dask.array
) and Pandas DataFrames (dask.dataframe
) that efficiently scale to huge datasets. The core idea of Dask is to split a large array into smaller arrays (chunks).
In this recipe, we illustrate the basic principles of dask.array
.
Dask should already be installed in Anaconda, but you can always install it manually with conda install dask
. You also need memory_profiler
, which you can install with conda install memory_profiler
.