
Learn Python by Building Data Science Applications

By: Philipp Kats, David Katz

Overview of this book

Python is the most widely used programming language for building data science applications. Complete with step-by-step instructions, this book contains easy-to-follow tutorials to help you learn Python and develop real-world data science projects. The “secret sauce” of the book is its curated list of topics and solutions, put together using a range of real-world projects, covering initial data collection, data analysis, and production. This Python book starts by taking you through the basics of programming, right from variables and data types to classes and functions. You’ll learn how to write idiomatic code and test and debug it, and discover how you can create packages or use the range of built-in ones. You’ll also be introduced to the extensive ecosystem of Python data science packages, including NumPy, Pandas, scikit-learn, Altair, and Datashader. Furthermore, you’ll be able to perform data analysis, train models, and interpret and communicate the results. Finally, you’ll get to grips with structuring and scheduling scripts using Luigi and sharing your machine learning models with the world as a microservice. By the end of the book, you’ll have learned not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards.
Table of Contents (26 chapters)
Section 1: Getting Started with Python
Section 2: Hands-On with Data
Section 3: Moving to Production

Quality assurance

I know we have spent a lot of time cleaning the data, but there is still one last task we need to perform – quality assurance. Proper quality assurance is a very important practice. In a nutshell, you need to define certain assumptions about the dataset (for example, minimum and maximum values, the acceptable number of missing values, standard deviation, medians, the number of unique values, and so on). The key is to start with something reasonable, then run tests to check whether the data fits your assumptions. If it doesn't, investigate the specific data points to decide whether your assumptions were incorrect (and update them) or whether there are still issues with the data. It just gets a little trickier with multilevel columns. Consider the following code:

assumptions = {
    'killed': [0, 1_500_000],
    'wounded'...
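
The snippet is cut off here, but the idea generalizes. Below is a minimal sketch of how such checks might look in pandas, assuming a DataFrame with two-level columns whose last level holds the metric name; the check_assumptions helper, the bound values, and the column names are illustrative assumptions rather than the book's exact code.

import pandas as pd

# Illustrative bounds only; replace them with assumptions that make
# sense for your own dataset.
assumptions = {
    'killed': [0, 1_500_000],
    'wounded': [0, 1_000_000],
}

def check_assumptions(df: pd.DataFrame, assumptions: dict) -> None:
    """Assert that every column whose metric name appears in
    `assumptions` stays within its [minimum, maximum] bounds."""
    for col in df.columns:
        # With multilevel columns, `col` is a tuple such as
        # ('allies', 'killed'); the metric name lives in the last level.
        metric = col[-1] if isinstance(col, tuple) else col
        if metric not in assumptions:
            continue
        low, high = assumptions[metric]
        values = df[col].dropna()
        assert values.between(low, high).all(), \
            f"{col} has values outside of [{low}, {high}]"

If an assertion fails, pull the offending rows (for example, with df[~df[col].between(low, high)]) and decide whether the data is wrong or the assumption is.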