Learn Python by Building Data Science Applications

By: Philipp Kats, David Katz

Overview of this book

Python is the most widely used programming language for building data science applications. Complete with step-by-step instructions, this book contains easy-to-follow tutorials to help you learn Python and develop real-world data science projects. The “secret sauce” of the book is its curated list of topics and solutions, put together using a range of real-world projects, covering initial data collection, data analysis, and production. This Python book starts by taking you through the basics of programming, right from variables and data types to classes and functions. You’ll learn how to write idiomatic code and test and debug it, and discover how you can create packages or use the range of built-in ones. You’ll also be introduced to the extensive ecosystem of Python data science packages, including NumPy, Pandas, scikit-learn, Altair, and Datashader. Furthermore, you’ll be able to perform data analysis, train models, and interpret and communicate the results. Finally, you’ll get to grips with structuring and scheduling scripts using Luigi and sharing your machine learning models with the world as a microservice. By the end of the book, you’ll have learned not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards.
Table of Contents (26 chapters)
Section 1: Getting Started with Python
Section 2: Hands-On with Data
Section 3: Moving to Production

Data Pipelines with Luigi

Until now, we have been writing code as separate notebooks and scripts. In the previous chapter, we learned how to group those scripts into a package so that it can be distributed and tested properly. In many cases, however, we need to execute certain tasks on a strict schedule. Often, the goal is to process data: run analytics, collect information from external sources, or retrain an ML model. All of this is prone to errors: tasks may depend on other tasks, and some must not run before others have finished. It is therefore important that tasks are easy to orchestrate, monitor, and re-run.
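
To make the idea of task dependencies concrete, here is a minimal sketch that uses only Python's standard library (`graphlib`, available from Python 3.9), not Luigi itself. The task names and the `execution_order` helper are hypothetical and chosen purely for illustration; the point is that once dependencies are declared, a valid execution order can be derived automatically.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "collect_data": set(),
    "run_analytics": {"collect_data"},
    "retrain_model": {"collect_data"},
    "publish_report": {"run_analytics", "retrain_model"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

In this sketch, `collect_data` always comes first and `publish_report` always comes last, while the two middle tasks may appear in either order because neither depends on the other. An orchestration tool adds scheduling, monitoring, and re-runs on top of this kind of dependency resolution.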

In this chapter, we will learn to build and orchestrate our own data pipelines. Building good pipelines is an important skill that can save tons of time and stress for anyone who masters it.

In particular, we will cover the following topics...