In-Memory Analytics with Apache Arrow

By Matthew Topol

Overview of this book

Apache Arrow is designed to accelerate analytics and make data interchange across big data systems easy. In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format before moving on to helping you understand Arrow's versatility and benefits through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, and working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well-versed in the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn about Dremio's use of Apache Arrow to enhance SQL analytics and discover how Arrow can be used in browser-based web apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve. By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.
Table of Contents (16 chapters)

Section 1: Overview of What Arrow Is, its Capabilities, Benefits, and Goals
Section 2: Interoperability with Arrow: pandas, Parquet, Flight, and Datasets
Section 3: Real-World Examples, Use Cases, and Future Development

Why does Arrow use a columnar in-memory format?

Most systems that process tabular data, such as query engines and data services, have traditionally had their own custom data structures for representing and managing those datasets in memory. Of course, custom data structures mean custom serialization protocols between file formats, network protocols, libraries, and any other interface you could think of. I can vouch from experience that the result is a huge amount of developer time and CPU cycles wasted on these various serialization schemes, rather than being spent on the analytical workloads themselves. One goal of the Arrow project is for fewer systems to have to create their own data structures, utilizing Arrow as their internal format instead. Doing so would allow those components to expose Arrow directly as a wire format and benefit from not having to pay a serialization or deserialization cost to pass the data around.
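To make this concrete, here is a minimal sketch in Python of what that looks like, assuming the pyarrow library is installed and using some made-up archer data; the bytes produced by the Arrow IPC stream format mirror the in-memory buffers, so neither side needs a bespoke serialization scheme:

    import pyarrow as pa

    # Build a small Arrow table directly in the columnar format.
    table = pa.table({
        "archer": ["Legolas", "Oliver", "Merida"],
        "location": ["Murkwood", "Star City", "Scotland"],
        "year": [1954, 1941, 2012],
    })

    # Write the table to the Arrow IPC stream format. The bytes on the
    # wire have the same layout as the in-memory buffers, so there is
    # no custom serialization step.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # The receiving side reads the buffers back without a costly
    # deserialization pass.
    received = pa.ipc.open_stream(buf).read_all()
    assert received.equals(table)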

There is often a lot of debate about whether a database should be row-oriented or column-oriented, but that debate primarily concerns the on-disk format of the underlying storage files. What sets Arrow apart is that its columnar organization applies directly to the data structures in memory. If you're not familiar with the term columnar, let's take a look at what exactly it means. First, imagine the following table of data:

Figure 1.3 – Sample data table

Traditionally, if you were to read this table into memory, you'd likely have some structure to represent a row, such as struct { string archer; string location; int year; }, and then read the data one row at a time. The result is that the memory for each row is grouped closely together, which is great if you always want to read all the columns of every row. But if this were a much bigger table and you only wanted column-wise analytics, such as the minimum and maximum years or the unique locations, you would have to read the whole table into memory and then jump around in it, skipping the fields you didn't care about, just to read the value of one column for each row.
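As an illustrative sketch in Python (the data is hypothetical), here is the difference between the two layouts and what a column-wise scan must touch in each:

    # Row-oriented: one record per row, fields interleaved in memory.
    rows = [
        {"archer": "Legolas", "location": "Murkwood", "year": 1954},
        {"archer": "Oliver", "location": "Star City", "year": 1941},
        {"archer": "Merida", "location": "Scotland", "year": 2012},
    ]

    # Column-oriented: one contiguous sequence per column.
    columns = {
        "archer": ["Legolas", "Oliver", "Merida"],
        "location": ["Murkwood", "Star City", "Scotland"],
        "year": [1954, 1941, 2012],
    }

    # Min/max of year touches a single sequence in the columnar layout...
    lo, hi = min(columns["year"]), max(columns["year"])

    # ...but must walk every record, skipping the other fields, in the
    # row-oriented layout.
    lo, hi = min(r["year"] for r in rows), max(r["year"] for r in rows)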

Most operating systems, as they read data into main memory and the CPU caches, attempt to predict which memory will be needed next. In our example table of archers, consider how many memory pages would have to be accessed and traversed to get a list of unique locations if the data were organized in a row orientation versus a column orientation:

Figure 1.4 – Row versus columnar memory buffers

A columnar format keeps the data organized by column instead of by row, as shown in the preceding figure. As a result, operations such as grouping, filtering, or aggregations based on column values become much more efficient to perform since the entire column is already contiguous in memory. Considering memory pages again, it's plain to see that for a large table, there would be significantly more pages that need to be traversed to get a list of unique locations from a row-oriented buffer than a columnar one. Fewer page faults and more cache hits mean increased performance and a happier CPU. Computational routines and query engines tend to operate on subsets of the columns for a dataset rather than needing every column for a given computation, making it significantly more efficient to operate on columnar data.
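For instance, with the pyarrow compute module (a minimal sketch using the same made-up data as before), column-wise analytics such as the unique locations or the minimum and maximum years each scan a single contiguous column rather than the whole table:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({
        "archer": ["Legolas", "Oliver", "Merida"],
        "location": ["Murkwood", "Star City", "Scotland"],
        "year": [1954, 1941, 2012],
    })

    # Each kernel reads only the one column it needs.
    unique_locations = pc.unique(table["location"])
    year_range = pc.min_max(table["year"])  # struct scalar with "min"/"max"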

If you look closely at the construction of the column-oriented data buffer on the right side of Figure 1.4, you can see how it benefits the queries mentioned earlier. If we wanted all the archers that are in Europe, we could easily scan through just the location column to discover which rows we want, and then spin through just the archer block and grab only the rows corresponding to the indexes we found. This will come into play again when we start looking at the physical memory layout of Arrow arrays; because the data is column-oriented, the CPU can better predict the instructions to execute and maintain memory locality between instructions.
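A rough sketch of that query using pyarrow's compute kernels (the table contents and the "Europe" value are stand-ins) might look like this:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({
        "archer": ["Legolas", "Oliver", "Artemis"],
        "location": ["Murkwood", "Star City", "Europe"],
        "year": [1954, 1941, 2012],
    })

    # Scan just the location column to build a boolean mask of the
    # rows we want...
    mask = pc.equal(table["location"], "Europe")

    # ...then pull only the matching rows from the archer column,
    # either via explicit indexes or directly with a filter.
    indices = pc.indices_nonzero(mask)
    europe_archers = pc.take(table["archer"], indices)
    europe_archers = pc.filter(table["archer"], mask)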

Keeping the column data contiguous in memory also enables vectorization of computations. Most modern processors provide single instruction, multiple data (SIMD) instructions that can be taken advantage of to speed up computations, and these require the data to be in a contiguous block of memory to operate on it. This concept is heavily utilized by graphics cards, and in fact, Arrow provides libraries to take advantage of Graphics Processing Units (GPUs) for precisely this reason. Consider an example where you want to multiply every element of a list by a static value, such as performing a currency conversion on a column of prices with an exchange rate:

Figure 1.5 – SIMD/vectorized versus non-vectorized

From the figure, you can see the following:

  • The left side of the figure shows that an ordinary CPU performing the computation in a non-vectorized fashion requires loading each value into a register, multiplying it with the exchange rate, and then saving the result back into RAM.
  • On the right side of the figure, we see that vectorized computation, such as using SIMD, performs the same operation on multiple inputs at once, enabling a single load, multiply, and save to produce the results for the entire group of prices. Vectorizing a computation comes with various constraints; often, one of them is that the data being operated on must be in a contiguous chunk of memory, which is why columnar data lends itself to vectorization (see the sketch following this list).
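As a small illustration (assuming pyarrow and a made-up exchange rate), the vectorized path is a single kernel call over a contiguous buffer, whereas the non-vectorized path touches one element per iteration:

    import pyarrow as pa
    import pyarrow.compute as pc

    # A column of prices and a hypothetical USD-to-EUR exchange rate.
    prices_usd = pa.array([10.00, 24.50, 3.99, 150.00])
    usd_to_eur = 0.92

    # Non-vectorized: a Python loop handles one element at a time.
    converted = [p.as_py() * usd_to_eur for p in prices_usd]

    # Vectorized: one call over the contiguous buffer; Arrow's compute
    # kernels can use SIMD to process several elements per instruction.
    converted = pc.multiply(prices_usd, usd_to_eur)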

    SIMD versus Multithreading

    If you're not familiar with SIMD, you may wonder how it differs from another parallelization technique: multithreading. Multithreading operates at a higher conceptual level than SIMD. Each thread has its own set of registers and memory space representing its execution context. These contexts could be spread across separate CPU cores or possibly interleaved by a single CPU core switching whenever it needs to wait for I/O. SIMD is a processor-level concept that refers to the specific instructions being executed. Put simply, multithreading is multitasking and SIMD is doing less work to achieve the same result.

Another benefit of utilizing column-oriented data comes into play when considering compression techniques. At some point, your data will become large enough that sending it across the network can become a bottleneck purely due to size and bandwidth. With the data grouped into contiguous columns that are each of a single type, we end up with significantly better compression ratios than we would get with the same data in a row-oriented layout, simply because data of the same type is easier to compress together than interleaved data of different types.
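As a quick back-of-the-envelope illustration (pure Python standard library with synthetic data; the exact ratio depends on the data and the codec), packing the same records column by column instead of row by row tends to compress into fewer bytes:

    import struct
    import zlib

    # Synthetic dataset: 10,000 (id, price) records.
    ids = list(range(10_000))
    prices = [float(i % 100) for i in range(10_000)]

    # Row-oriented bytes: id and price interleaved, record by record.
    row_bytes = b"".join(struct.pack("<id", i, p) for i, p in zip(ids, prices))

    # Column-oriented bytes: all ids together, then all prices.
    col_bytes = (struct.pack(f"<{len(ids)}i", *ids)
                 + struct.pack(f"<{len(prices)}d", *prices))

    # Same data; the homogeneous runs in the columnar layout typically
    # compress better.
    print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))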