Azure Data Scientist Associate Certification Guide

By: Andreas Botsikas, Michael Hlobil

Overview of this book

The Azure Data Scientist Associate Certification Guide helps you acquire practical knowledge for machine learning experimentation on Azure. It covers everything you need to pass the DP-100 exam and become a certified Azure Data Scientist Associate. Starting with an introduction to data science, you'll learn the terminology that will be used throughout the book and then move on to the Azure Machine Learning (Azure ML) workspace. You'll discover the studio interface and manage various components, such as data stores and compute clusters. Next, the book focuses on no-code and low-code experimentation, and shows you how to use the Automated ML wizard to locate and deploy optimal models for your dataset. You'll also learn how to run end-to-end data science experiments using the designer provided in Azure ML Studio. You'll then explore the Azure ML Software Development Kit (SDK) for Python and advance to creating experiments and publishing models using code. The book also guides you in optimizing your model's hyperparameters using Hyperdrive before demonstrating how to use responsible AI tools to interpret and debug your models. Once you have a trained model, you'll learn to operationalize it for batch or real-time inferences and monitor it in production. By the end of this Azure certification study guide, you'll have gained the knowledge and the practical skills required to pass the DP-100 exam.
Table of Contents (17 chapters)

Section 1: Starting your cloud-based data science journey
Section 2: No code data science experimentation
Section 3: Advanced data science tooling and capabilities

Using Spark in data science

At the beginning of the twenty-first century, the big data problem became a reality. Data stored in data centers was growing in both volume and velocity. In 2021, we refer to datasets as big data when they reach at least a couple of terabytes in size, and it is not uncommon to see petabytes of data in large organizations. These datasets grow at a rapid rate, anywhere from a couple of gigabytes per day to a couple of gigabytes per minute, for example, when you store user interactions with an online store's website to perform clickstream analysis.

In 2009, a research project started at the University of California, Berkeley, aiming to provide the parallel computing tools needed to handle big data. In 2014, the first version of Apache Spark was released from this research project. Members of that research team founded Databricks, one of the most significant contributors to the open source Apache Spark project.

Apache Spark provides an easy-to-use, scalable solution for processing data in parallel across a distributed cluster. The main idea behind the Spark architecture is that a driver node is responsible for executing your code. The code is split into smaller jobs that operate on smaller portions of the data, and those jobs are scheduled to run on the worker nodes, as seen in Figure 1.6:

Figure 1.6 – Parallel processing of big data in a Spark cluster

For example, suppose you wanted to calculate how many products your company sold during the last year. In that case, Spark could spin up 12 jobs that produce the monthly aggregates, and the results would then be processed by another job that sums up the totals for all 12 months. If you were instead tempted to load the entire dataset into memory and compute those aggregates directly, consider how much memory you would need on that single computer. Assume that the sales data for a single month is stored in a CSV file that is 1 GB; this file will require approximately 10 GB of memory to load. Compressed Parquet files require even more memory relative to their size on disk: a similar 1 GB Parquet file may end up needing 40 GB of memory to load as a pandas.DataFrame object. As you can see, loading all 12 files into memory simultaneously is not feasible on a typical machine. You need to parallelize the processing, something Spark can do for you automatically.
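
The following sketch shows how such an aggregation could look in PySpark; the file pattern (sales_2021_*.csv) and the month and quantity columns are hypothetical and not part of the book's dataset:

# A minimal PySpark sketch of the monthly aggregation described above.
# The file pattern and column names (month, quantity) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("yearly-sales").getOrCreate()

# Spark reads the 12 monthly files as a single distributed DataFrame;
# each worker node processes only the partitions assigned to it.
sales = spark.read.csv("sales_2021_*.csv", header=True, inferSchema=True)

# Workers compute partial aggregates per month, and Spark combines them
# into the final total, mirroring the 12-jobs-plus-one description above.
monthly = sales.groupBy("month").agg(F.sum("quantity").alias("units_sold"))
total_units = monthly.agg(F.sum("units_sold")).collect()[0][0]
print(total_units)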

Important note

Parquet files are stored in a columnar format, which allows you to load only the columns you need. In the 1 GB Parquet example, if you load only half of the columns from the dataset, you will probably need only 20 GB of memory. This is one of the reasons why the Parquet format is widely used in analytical workloads.
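
As a quick illustration of this columnar benefit, the following sketch loads only two columns of a Parquet file with pandas; the file and column names are made up for the example:

import pandas as pd

# Only the requested columns are read and deserialized, so memory usage
# drops roughly in proportion to the columns you skip.
df = pd.read_parquet("sales_2021_01.parquet", columns=["product_id", "quantity"])
df.info(memory_usage="deep")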

Spark is written in the Scala programming language and offers APIs for Scala, Python, Java, R, and even C#. Still, the data science community tends to work either in Scala, to achieve maximum computational performance and take advantage of the Java library ecosystem, or in Python, which is widely adopted by the modern data science community. When you write Python code that utilizes the Spark engine, you use the PySpark tool to perform operations on top of Resilient Distributed Datasets (RDDs) or the Spark DataFrame objects that were introduced later in the Spark framework. To benefit from the distributed nature of Spark, you need to be handling big datasets. This means that Spark may be overkill if you are dealing with only hundreds of thousands of records, or even a couple of million records.
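
To make those two programming surfaces more concrete, here is a minimal, illustrative PySpark snippet that computes the same sum once through the low-level RDD API and once through the DataFrame API; the sample records are invented for the example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
records = [("jacket", 2), ("boots", 1), ("jacket", 3)]

# RDD API: you chain transformations and actions explicitly.
rdd_total = (spark.sparkContext.parallelize(records)
             .map(lambda row: row[1])
             .reduce(lambda a, b: a + b))

# DataFrame API: you declare the result you want and Spark plans the work.
df = spark.createDataFrame(records, ["product", "quantity"])
df_total = df.groupBy().sum("quantity").collect()[0][0]

print(rdd_total, df_total)  # both print 6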

Spark offers two machine learning libraries: the older MLlib and the newer Spark ML. Spark ML works on the Spark DataFrame structure, a distributed collection of data that offers functionality similar to the DataFrame objects used in Python's pandas or in R. Moreover, the Koalas project provides an implementation that allows data scientists with existing knowledge of pandas.DataFrame manipulations to reuse their coding skills on top of Spark.
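
As a rough sketch of that pandas-like experience, the snippet below uses the pandas API on Spark (the successor to the Koalas package, available as pyspark.pandas from Spark 3.2 onward); the file and column names are hypothetical:

import pyspark.pandas as ps

# Familiar pandas-style syntax, but the computation is distributed
# across the Spark cluster instead of running on a single machine.
sales = ps.read_csv("sales_2021_01.csv")
per_product = sales.groupby("product_id")["quantity"].sum()
print(per_product.head())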

AzureML allows you to execute PySpark jobs, either on its native compute clusters or by attaching to Azure Databricks or Synapse Spark pools. Although you will not write any PySpark code in this book, in Chapter 12, Operationalizing Models with Code, you will learn how to achieve similar parallelization benefits without the need for Spark or a driver node.

Whether you are coding in regular Python, PySpark, R, or Scala, you are producing code artifacts that are probably part of a larger system. In the next section, you will explore the DevOps mindset, which emphasizes communication and collaboration between software engineers, data scientists, and system administrators to achieve faster releases of valuable product features.