Learn Python by Building Data Science Applications

By: Philipp Kats, David Katz

Overview of this book

Python is the most widely used programming language for building data science applications. Complete with step-by-step instructions, this book contains easy-to-follow tutorials to help you learn Python and develop real-world data science projects. The “secret sauce” of the book is its curated list of topics and solutions, put together using a range of real-world projects, covering initial data collection, data analysis, and production.

This Python book starts by taking you through the basics of programming, right from variables and data types to classes and functions. You’ll learn how to write idiomatic code and test and debug it, and discover how you can create packages or use the range of built-in ones.

You’ll also be introduced to the extensive ecosystem of Python data science packages, including NumPy, pandas, scikit-learn, Altair, and Datashader. Furthermore, you’ll be able to perform data analysis, train models, and interpret and communicate the results. Finally, you’ll get to grips with structuring and scheduling scripts using Luigi and sharing your machine learning models with the world as a microservice. By the end of the book, you’ll have learned not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards.
Table of Contents (26 chapters)

Section 1: Getting Started with Python
Section 2: Hands-On with Data
Section 3: Moving to Production

What this book covers

This book consists of three main sections. The first one is focused on language fundamentals, the second introduces data analysis in Python, and the final section covers different ways to deliver the results of your work. The last chapter of each section is focused on non-Python tools and topics related to the section subject.

Section 1, Getting Started with Python, introduces the Python programming language and explains how to install Python and all of the packages and tools we'll be using.

Chapter 1, Preparing the Workspace, covers all the tools we'll need throughout the book—what they are, how to install them, and how to use their interfaces. This includes the installation process for Python 3.7, all of the packages we'll require throughout the book, how to install all of them at once in a separate environment, as well as two code development tools we'll use—the Jupyter Notebook and VS Code. Finally, we'll run our first script to ensure everything works fine! By the end of this chapter, you will have everything you need to execute the book's code, ready to go.

Chapter 2, First Steps in Coding – Variables and Data Types, gives an introduction to fundamental programming concepts, such as variables and data types. You'll start writing code in Jupyter, and will even solve a simple problem using the knowledge you've just acquired.
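As a taste of what that chapter covers, here is a minimal sketch of variables and built-in data types; the specific names and values are illustrative, not taken from the book:

```python
# Variables are created by assignment; Python infers the type at runtime.
city = "London"          # str
population = 8_982_000   # int (underscores are just visual separators)
area_km2 = 1572.0        # float
is_capital = True        # bool

# type() reports the runtime type of any value.
print(type(city).__name__)        # prints: str

# Dividing two numbers with / always yields a float.
density = population / area_km2
print(round(density, 1))
```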

Chapter 3, Functions, introduces yet another concept fundamental to programming—functions. This chapter covers the most important built-in functions and teaches you about writing new ones. Finally, you will revisit the problem from the previous chapter, and write an alternative solution, using functions.
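To illustrate the idea, a small sketch of a user-defined function combined with built-ins such as `max`, `min`, and `round`; the conversion example is hypothetical, not the book's own exercise:

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (f - 32) * 5 / 9

# User-defined functions compose naturally with built-ins:
readings_f = [32, 68, 212]
readings_c = [round(fahrenheit_to_celsius(f), 1) for f in readings_f]
print(max(readings_c), min(readings_c))
```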

Chapter 4, Data Structures, covers different types of data structures in Python—lists, sets, dictionaries, and many others. You will learn about the properties of each structure, their interfaces, how to work with them, and when to use each.
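The core structures can be sketched in a few lines; the sample values are made up for illustration:

```python
# A list keeps order and allows duplicates; a set stores unique items;
# a dict maps keys to values; a tuple is an immutable sequence.
battles = ["Midway", "Kursk", "Midway"]
unique_battles = set(battles)            # duplicates collapse
years = {"Midway": 1942, "Kursk": 1943}  # key -> value lookup
coordinates = (48.0, 37.3)               # fixed two-element record

print(len(battles), len(unique_battles))
print(years["Kursk"])
```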

Chapter 5, Loops and Other Compound Statements, illustrates Python's compound statements: loops, if/else conditionals, try/except blocks, one-liners, and others. These represent core logic in the code and allow non-linear code execution. At the end of this chapter, you'll be able to operate large data structures using short, expressive code.
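A short sketch of the statements named above, with invented sample data:

```python
numbers = [3, -1, 7, 0, -5]

# if/else inside a for loop:
positives = []
for n in numbers:
    if n > 0:
        positives.append(n)

# The same logic as a one-line list comprehension:
positives_short = [n for n in numbers if n > 0]

# try/except handles errors without stopping the program:
try:
    ratio = 1 / 0
except ZeroDivisionError:
    ratio = float("inf")
```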

Chapter 6, First Script – Geocoding with Web APIs, introduces the concept of APIs and shows how to work with HTTP-based geocoding services from Python. At the end of this chapter, you'll have fully operational code for geocoding addresses from the dataset—code that you'll be using extensively throughout the rest of the book, but that's also highly applicable to many tasks beyond it.
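The general pattern can be sketched as building a request URL for a geocoding endpoint. The endpoint and parameters below follow the public Nominatim (OpenStreetMap) API as an assumed example; the book may use a different service, and the actual HTTP call is omitted to keep the sketch offline:

```python
from urllib.parse import urlencode

# Assumed endpoint: the public Nominatim (OpenStreetMap) geocoding API.
BASE_URL = "https://nominatim.openstreetmap.org/search"

def build_geocoding_url(address):
    """Build a geocoding request URL for a free-form address."""
    params = {"q": address, "format": "json", "limit": 1}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_geocoding_url("221B Baker Street, London")
# A real script would now fetch the URL (e.g. with the requests package)
# and parse the JSON response for latitude and longitude.
```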

Chapter 7, Scraping Data from the Web with Beautiful Soup 4, illustrates a solution to a similar but more complex task of data extraction from HTML pages—scraping. Step by step, you will build a script that collects pages and extracts data on all the battles of World War II, as described on Wikipedia. At the end of this chapter, you'll know the limitations and challenges of scraping, the main packages used for the task, and will be able to write your own scrapers.

Chapter 8, Simulation with Classes and Inheritance, introduces one more critical concept for programming in Python—classes. Using classes, we will build a simple simulation model of an ecological system. We'll compute, collect, and visualize metrics, and use them to analyze the system's behavior.
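A minimal sketch of classes and inheritance in the spirit of such a simulation; the `Animal`/`Herbivore` names and energy rules are invented for illustration, not the book's model:

```python
class Animal:
    """Base class: every animal has energy and can advance one time step."""
    def __init__(self, energy=10):
        self.energy = energy

    def step(self):
        self.energy -= 1  # living costs energy

class Herbivore(Animal):
    """Subclass overriding step: grazing partly restores energy."""
    def step(self):
        super().step()        # reuse the base-class behavior
        self.energy += 0.5    # then add grazing on top

population = [Animal(), Herbivore()]
for creature in population:
    creature.step()

print([creature.energy for creature in population])
```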

Chapter 9, Shell, Git, Conda, and More – at Your Command, covers the basic tools essential for the development process—from Shell and Git, to Conda packaging and virtual environments, to the use of makefiles and the Cookiecutter tool. The information we share in this chapter is essential for code development in general, and Python development in particular, and will allow you to collaborate and talk the same language with other developers.

Section 2, Hands-On with Data, focuses on using Python for data processing and analysis, including cleaning, visualization, and training machine learning models.

Chapter 10, Python for Data Applications, works as an introduction to the Python data analysis ecosystem: a distinct group of packages that simplify working with data, its processing, and analysis. As a result, you will get familiar with the main packages, their purpose and special syntax, and will understand what makes them work substantially faster than plain Python for numeric calculations.
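The speed difference comes largely from vectorization; here is a minimal sketch with NumPy, using toy numbers rather than any dataset from the book:

```python
import numpy as np

# A vectorized operation applies to the whole array at once, in compiled
# code, instead of looping element by element in the interpreter.
values = np.array([1.0, 2.0, 3.0, 4.0])
doubled = values * 2          # no explicit Python loop
total = values.sum()

# The equivalent pure-Python version, typically much slower on large data:
doubled_py = [v * 2 for v in [1.0, 2.0, 3.0, 4.0]]
```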

Chapter 11, Data Cleaning and Manipulation, shows how to use the pandas package to process and clean our data, and make it ready for analysis. As an example, we'll clean and prepare the dataset we obtained from Wikipedia in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. Through the process, we'll learn how to use regular expressions, use the geocoding code we wrote in Chapter 6, First Script – Geocoding with Web APIs, and an array of other techniques to clean the data.
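A small sketch of the kind of cleaning pandas makes possible; the raw strings below are invented to resemble scraped text, and the column names are hypothetical, not the book's actual dataset:

```python
import pandas as pd

# Hypothetical raw column resembling scraped Wikipedia text.
raw = pd.DataFrame({"casualties": ["~2,500 killed", "1,000 killed", "unknown"]})

# Extract the numeric part with a regular expression, strip thousands
# separators, and convert to a numeric column (NaN where no match).
raw["count"] = (
    raw["casualties"]
    .str.extract(r"([\d,]+)")[0]
    .str.replace(",", "", regex=False)
    .astype("float")
)
```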

Chapter 12, Data Exploration and Visualization, explains how to explore an arbitrary dataset and ask and answer questions about it, using queries, statistics, and visualizations. You'll learn how to use two visualization libraries, Matplotlib and Altair. Both can quickly produce static charts as well as more advanced, interactive ones. As our case example, we'll use the dataset we cleaned in the previous chapter.

Chapter 13, Training a Machine Learning Model, presents the core idea of machine learning and shows how to apply unsupervised learning with the k-means clustering algorithm, and supervised learning with KNN, linear regression, and decision trees, to a given dataset.
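As an illustration of the unsupervised side, here is a minimal k-means sketch using scikit-learn, on toy 2-D points with two obvious groups (not the book's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points.
points = np.array([[0, 0], [0.1, 0.2], [0.2, 0.1],
                   [5, 5], [5.1, 5.2], [4.9, 5.0]])

# Ask k-means for two clusters; fit_predict returns a label per point.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)
# Points in the same group should receive the same cluster label.
```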

Chapter 14, Improving Your Model – Pipelines and Experiments, highlights ways to improve your model by using feature engineering and cross-validation, and by applying a more sophisticated algorithm. In addition, you will learn how to track your experiments and keep both code and data under version control, using the dvc (data version control) tool.

Section 3, Moving to Production, is focused on delivering the results of your work with Python, in different formats.

Chapter 15, Packaging and Testing with Poetry and PyTest, explains the process of packaging. Using our Wikipedia scraper as an example, we'll create a package using the poetry library, set dependencies and a development environment, and make the package accessible for installation using pip from GitHub. To ensure the package's functionality, we will add a few unit tests using the pytest testing library.
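A pytest-style unit test is just a function whose name starts with `test_` and which uses plain `assert` statements. The `parse_year` function below is a stand-in invented for this sketch; the book tests its own Wikipedia scraper:

```python
import re

def parse_year(text):
    """Extract a four-digit year from a string, or return None."""
    match = re.search(r"\b(\d{4})\b", text)
    return int(match.group(1)) if match else None

def test_parse_year():
    assert parse_year("Battle of Midway, 1942") == 1942
    assert parse_year("no date here") is None

# pytest discovers and runs test_ functions automatically; here we call
# the test directly so the example is self-contained.
test_parse_year()
```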

Chapter 16, Data Pipelines with Luigi, introduces ETL pipelines and explains how to build and schedule one using the luigi framework. We will build a set of interdependent tasks for data collection and processing and set them to work on a scheduled basis, writing data to local files, S3 buckets, or a database.

Chapter 17, Let's Build a Dashboard, covers a few ways to build and share a dashboard online. We'll start by writing a static dashboard based on the charts we made with the Altair library in Chapter 12, Data Exploration and Visualization. As an alternative, we will also deploy a dynamic dashboard that pulls data from a database upon request, using the panel library.

Chapter 18, Serving Models with a RESTful API, brings us back to the API theme—but this time, we'll build an API on our own, using the FastAPI framework and the pydantic package for validation. Using a machine learning model, we'll build a fully operational API server, with the OpenAPI documentation and strict request validation. As FastAPI supports asynchronous execution, we'll also discuss what that means and when to use it.

Chapter 19, Serverless API Using Chalice, goes beyond serving an API with a personal server and shows how to achieve similar results with a serverless application, using AWS Lambda and the chalice package. This includes building an API endpoint, a scheduled pipeline, and serving a machine learning model. Along the way, we discuss the pros and cons of running serverless, its limitations, and ways to mitigate them.

Chapter 20, Best Practices and Python Performance, comprises three distinct parts. The first part showcases different ways to make your code faster: by using NumPy's vectorized computations or a specific data structure (in our case, a k-d tree), by extending computations to multiple cores or even machines with Dask, or by leveraging the performance (and potential GIL release) of just-in-time compilation with Numba. We also discuss different ways to achieve concurrency in Python—using threads, asynchronous tasks, or multiple processes.
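One of the concurrency options can be sketched with the standard library alone. The `fetch` function below is a stand-in for real I/O, invented for this example:

```python
from concurrent.futures import ThreadPoolExecutor

# Threads help with I/O-bound work (network, disk); for CPU-bound Python
# code the GIL usually makes ProcessPoolExecutor the better choice.
def fetch(url):
    # Stand-in for a real network call; returns the URL length.
    return len(url)

urls = ["https://a.example", "https://bb.example", "https://ccc.example"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map runs fetch concurrently and preserves input order in the output.
    results = list(pool.map(fetch, urls))
```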

The second part of the chapter focuses on improving the speed and quality of development. In particular, we'll cover the use of linters and formatters—the black package in particular; code maintainability measurements with wily; and advanced, data-driven code testing with the hypothesis package.

Finally, the third part of this chapter goes over a few technologies beyond Python, but that are still potentially useful to you. This list includes different Python interpreters, such as Jython, Brython, and Iodide; Docker technology; and Kubernetes.