Data Processing with Optimus

By: Dr. Argenis Leon, Luis Aguirre

Overview of this book

Optimus is a Python library that works as a unified API for cleaning, processing, and merging data. It can be used for handling small and big data on your local laptop or on remote clusters using CPUs or GPUs. The book begins by covering the internals of Optimus and how it works in tandem with existing technologies to serve your data processing needs. You'll then learn how to use Optimus for loading and saving data from text data formats such as CSV and JSON files, exploring binary files such as Excel, and for columnar data processing with Parquet, Avro, and ORC. Next, you'll get to grips with the profiler and its data types, a unique feature of Optimus Dataframe that assists with data quality. You'll see how to use the plots available in Optimus, such as histograms, frequency charts, and scatter and box plots, and understand how Optimus lets you connect to libraries such as Plotly and Altair. You'll also delve into advanced applications such as feature engineering, machine learning, cross-validation, and natural language processing functions, and explore the advancements in Optimus. Finally, you'll learn how to create data cleaning and transformation functions and add a hypothetical new data processing engine with Optimus. By the end of this book, you'll be able to improve your data science workflow with Optimus easily.
Table of Contents (16 chapters)
Section 1: Getting Started with Optimus
Section 2: Optimus – Transform and Rollout
Section 3: Advanced Features of Optimus

Bumblebee

Bumblebee is an open source, low-code web app that aims to make big data preparation easy. It builds on top of Optimus so that you have all the flexibility the library provides.

Bumblebee sends automatically generated Optimus code to a Python kernel gateway, which runs it to operate on our datasets and configuration settings. All of this is done over a secure connection.
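To make the idea of sending generated code to a kernel gateway more concrete, here is a minimal sketch of how a code string could be wrapped in a Jupyter-style execute_request message. This assumes the gateway accepts messages in the Jupyter messaging format, which the text does not confirm; the function name build_execute_request and the username value are hypothetical, and the sketch only builds the message without opening a connection.

```python
import json
import uuid
from datetime import datetime, timezone


def build_execute_request(code: str, session_id: str) -> dict:
    """Wrap a code string in a Jupyter-style execute_request message.

    Assumption: the gateway speaks the Jupyter messaging protocol.
    This is an illustration, not Bumblebee's actual implementation.
    """
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,        # unique ID for this message
            "username": "bumblebee",           # hypothetical client name
            "session": session_id,
            "msg_type": "execute_request",
            "version": "5.3",
            "date": datetime.now(timezone.utc).isoformat(),
        },
        "parent_header": {},
        "metadata": {},
        "content": {
            "code": code,                      # the generated Optimus code
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
        "channel": "shell",                    # execution requests go on the shell channel
    }


# Serialize the message as it would be sent over a (secure) WebSocket.
message = build_execute_request('df = op.load.file("data.csv")', session_id="demo-session")
payload = json.dumps(message)
```

In a real deployment the payload would travel over an encrypted connection (for example, a wss:// WebSocket), in line with the secure connection mentioned above.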

For example, when we ask Bumblebee to load a file, it automatically uploads the file to a location where Optimus can find it (since Optimus may not be able to read the file from your local storage) and then loads it using op.load.file.
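The two steps described above (stage the file, then generate the load call) can be sketched as follows. This is an illustration under stated assumptions, not Bumblebee's actual code: the helper stage_and_generate_load and the staging directory are hypothetical, and only op.load.file comes from the text. The function returns the generated code as a string rather than running it, so it does not require Optimus to be installed.

```python
import shutil
from pathlib import Path


def stage_and_generate_load(local_path: str, staging_dir: str, df_name: str = "df") -> str:
    """Copy a local file to a directory the engine can read and return
    the Optimus code that would load it.

    Hypothetical helper: the staging directory stands in for whatever
    storage Bumblebee actually uploads to.
    """
    if not df_name.isidentifier():
        raise ValueError(f"{df_name!r} is not a valid variable name")

    source = Path(local_path)
    target = Path(staging_dir) / source.name
    shutil.copy(source, target)  # stand-in for the upload step

    # repr() quotes and escapes the path so the generated line is valid Python.
    return f"{df_name} = op.load.file({str(target)!r})"
```

The returned string is the kind of generated code that would then be sent to the kernel for execution, where op is assumed to be an already initialized Optimus instance.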

Bumblebee has a broad range of available operations, and almost every Optimus function is mapped to a user-friendly interface.

In the web app, we can take advantage of its profiling functionality to give the user insight into the loaded data. This also includes loading the actual values of the dataset into a table in real time:

Figure 10.1 – Bumblebee default view with...