Python Web Scraping Cookbook

By: Michael Heydt

Overview of this book

Python Web Scraping Cookbook is a solution-focused book that teaches techniques for developing high-performance scrapers and dealing with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios that cover every part of the development and product life cycle. You will not only develop the skills needed to design and build reliable, high-performance data flows, but also deploy your codebase to AWS. If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful, as each recipe has a clear purpose and objective. The independent recipes range from extracting data from websites to writing a sophisticated web crawler. The book covers Python libraries such as requests and BeautifulSoup. You will learn about crawling, web spidering, working with Ajax websites, paginated items, and more. You will also learn to tackle problems such as 403 errors, working with proxies, scraping images, and using lxml. By the end of this book, you will be able to scrape websites more efficiently and to deploy and operate your scraper in the cloud.

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Getting Started with Scraping

Introduction

Setting up a Python development environment

Scraping Python.org with Requests and Beautiful Soup

Scraping Python.org in urllib3 and Beautiful Soup

Scraping Python.org with Scrapy

Scraping Python.org with Selenium and PhantomJS

Data Acquisition and Extraction

Introduction

How to parse websites and navigate the DOM using BeautifulSoup

Searching the DOM with Beautiful Soup's find methods

Querying the DOM with XPath and lxml

Querying data with XPath and CSS selectors

Using Scrapy selectors

Loading data in unicode / UTF-8

Processing Data

Introduction

Working with CSV and JSON data

Storing data using AWS S3

Storing data using MySQL

Storing data using PostgreSQL

Storing data in Elasticsearch

How to build robust ETL pipelines with AWS SQS

Working with Images, Audio, and other Assets

Introduction

Downloading media content from the web

Parsing a URL with urllib to get the filename

Determining the type of content for a URL

Determining the file extension from a content type

Downloading and saving images to the local file system

Downloading and saving images to S3

Generating thumbnails for images

Taking a screenshot of a website

Taking a screenshot of a website with an external service

Performing OCR on an image with pytesseract

Creating a Video Thumbnail

Ripping an MP4 video to an MP3

Scraping - Code of Conduct

Introduction

Scraping legality and scraping politely

Respecting robots.txt

Crawling using the sitemap

Crawling with delays

Using identifiable user agents

Setting the number of concurrent requests per domain

Using auto throttling

Using an HTTP cache for development

Scraping Challenges and Solutions

Introduction

Retrying failed page downloads

Supporting page redirects

Waiting for content to be available in Selenium

Limiting crawling to a single domain

Processing infinitely scrolling pages

Controlling the depth of a crawl

Controlling the length of a crawl

Handling paginated websites

Handling forms and forms-based authorization

Handling basic authorization

Preventing bans by scraping via proxies

Randomizing user agents

Caching responses

Text Wrangling and Analysis

Introduction

Installing NLTK

Performing sentence splitting

Performing tokenization

Performing stemming

Performing lemmatization

Determining and removing stop words

Calculating the frequency distributions of words

Identifying and removing rare words

Removing punctuation marks

Piecing together n-grams

Scraping a job listing from StackOverflow

Reading and cleaning the description in the job listing

Searching, Mining and Visualizing Data

Introduction

Geocoding an IP address

How to collect IP addresses of Wikipedia edits

Visualizing contributor location frequency on Wikipedia

Creating a word cloud from a StackOverflow job listing

Crawling links on Wikipedia

Visualizing page relationships on Wikipedia

Calculating degrees of separation

Creating a Simple Data API

Introduction

Creating a REST API with Flask-RESTful

Integrating the REST API with scraping code

Adding an API to find the skills for a job listing

Storing data in Elasticsearch as the result of a scraping request

Checking Elasticsearch for a listing before scraping

Creating Scraper Microservices with Docker

Introduction

Installing Docker

Installing a RabbitMQ container from Docker Hub

Running a Docker container (RabbitMQ)

Creating and running an Elasticsearch container

Stopping/restarting a container and removing the image

Creating a generic microservice with Nameko

Creating a scraping microservice

Creating a scraper container

Creating an API container

Composing and running the scraper locally with docker-compose

Making the Scraper as a Service Real

Introduction

Creating and configuring an Elastic Cloud trial account

Accessing the Elastic Cloud cluster with curl

Connecting to the Elastic Cloud cluster with Python

Performing an Elasticsearch query with the Python API

Using Elasticsearch to query for jobs with specific skills

Modifying the API to search for jobs by skill

Storing configuration in the environment

Creating an AWS IAM user and a key pair for ECS

Configuring Docker to authenticate with ECR

Pushing containers into ECR

Creating an ECS cluster

Creating a task to run our containers

Starting and accessing the containers in AWS

Other Books You May Enjoy

Leave a review - let other readers know what you think

Setting up a Python development environment

If you have not used Python before, it is important to have a working development environment. The recipes in this book are all in Python; a few are interactive examples, but most are implemented as scripts run by the Python interpreter. This recipe shows you how to set up an isolated development environment with virtualenv and manage project dependencies with pip. We also get the code for the book and install it into the Python virtual environment.

Getting ready

We will exclusively be using Python 3.x, and specifically in my case 3.6.1. Mac and Linux systems normally have Python version 2 installed, while Windows systems have no Python installed at all. So in any case, it is likely that Python 3 will need to be installed. You can find references for Python installers at www.python.org.

You can check Python's version with python --version
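The same check can be made from inside Python, which is handy at the top of a script. The following is only a minimal sketch; the is_python3 helper is a name made up for this illustration and is not part of the book's code.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


if __name__ == "__main__":
    # Print the running version, in the same spirit as `python --version`.
    print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
    if not is_python3():
        raise SystemExit("The recipes in this book require Python 3.x")
```

Passing the version tuple as a parameter keeps the helper easy to try with other values, such as (2, 7, 18).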

pip comes installed with Python 3.x, so we will omit instructions on its installation. Additionally, all command line examples in this book are run on a Mac. For Linux users the commands should be identical. On Windows, there are alternate commands (like dir instead of ls), but these alternatives will not be covered.

How to do it...

We will be installing a number of packages with pip. These packages are installed into a Python environment, where they can conflict with the versions of other packages that are already present. A good practice for following along with the recipes in this book is therefore to create a new virtual Python environment in which the packages we use are sure to work properly.

Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command:

~ $ pip install virtualenv
Collecting virtualenv
 Using cached virtualenv-15.1.0-py2.py3-none-any.whl
Installing collected packages: virtualenv
Successfully installed virtualenv-15.1.0

Now we can use virtualenv. But before that, let's briefly look at pip. This command installs Python packages from PyPI, a package repository with tens of thousands of packages. We just saw the install subcommand of pip, which ensures a package is installed. We can also see all currently installed packages with pip list:

~ $ pip list
alabaster (0.7.9)
amqp (1.4.9)
anaconda-client (1.6.0)
anaconda-navigator (1.5.3)
anaconda-project (0.4.1)
aniso8601 (1.3.0)

I've truncated to the first few lines as there are quite a few. For me there are 222 packages installed.
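A similar listing can be produced from Python itself. This sketch assumes Python 3.8 or later, where importlib.metadata is part of the standard library (it is not available in the 3.6.1 interpreter used for the book), and the installed_packages helper is a hypothetical name used only here.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_packages():
    """Return sorted (name, version) pairs for every installed distribution."""
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            pairs.add((name, dist.metadata.get("Version")))
    return sorted(pairs, key=lambda pair: pair[0].lower())


packages = installed_packages()
# Show the first few entries in the same "name (version)" style as pip list.
for name, version in packages[:6]:
    print("%s (%s)" % (name, version))
print("%d packages installed" % len(packages))
```

The names and count will differ from the pip output above, since they depend on whatever is installed in the interpreter that runs the script.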

Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try.

Now back to virtualenv. Using virtualenv is very simple. Let's use it to create an environment and install the code from GitHub. Let's walk through the steps:

Create a directory to represent the project and enter the directory:

~ $ mkdir pywscb
~ $ cd pywscb

Initialize a virtual environment folder named env:

pywscb $ virtualenv env
Using base prefix '/Users/michaelheydt/anaconda'
New python executable in /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib
Installing setuptools, pip, wheel...done.

This creates an env folder. Let's take a look at what was installed.

pywscb $ ls -la env
total 8
drwxr-xr-x 6  michaelheydt staff 204 Jan 18 15:38 .
drwxr-xr-x 3  michaelheydt staff 102 Jan 18 15:35 ..
drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin
drwxr-xr-x 3  michaelheydt staff 102 Jan 18 15:35 include
drwxr-xr-x 4  michaelheydt staff 136 Jan 18 15:38 lib
-rw-r--r-- 1  michaelheydt staff 60 Jan 18 15:38  pip-selfcheck.json
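As an aside, a comparable environment can be created from Python with the standard library's venv module. Note that this is a different tool from the virtualenv command used in this recipe, so the folder contents will not match the listing above exactly. The create_env helper below is a hypothetical name, and the sketch works in a throwaway temporary directory rather than in pywscb.

```python
import os
import tempfile
import venv


def create_env(project_dir, name="env"):
    """Create a virtual environment folder inside project_dir and return its path."""
    env_dir = os.path.join(project_dir, name)
    # with_pip=False keeps the sketch fast and offline; pass True to bootstrap pip.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Demonstrate in a throwaway directory rather than a real project folder.
with tempfile.TemporaryDirectory() as project_dir:
    env_dir = create_env(project_dir)
    created = sorted(os.listdir(env_dir))
    print(created)
```

Among the entries created is a pyvenv.cfg file, which records the interpreter the environment was created from.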

Now we activate the virtual environment. This command uses the content in the env folder to configure Python. After this, all python activities are relative to this virtual environment.

pywscb $ source env/bin/activate
(env) pywscb $

We can check that python is indeed using this virtual environment with the following command:

(env) pywscb $ which python
/Users/michaelheydt/pywscb/env/bin/python
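A script can make the same check for itself. The sketch below relies on the interpreter's prefix attributes; in_virtual_environment is a hypothetical helper name, and the exact attributes set can vary between virtualenv releases, which is why both are checked.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases leave sys.base_prefix pointing at the original installation.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("virtual environment active:", in_virtual_environment())
```

Run with the environment activated, sys.executable should be the same path that which python reports.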

With our virtual environment created, let's clone the book's sample code and take a look at its structure.

(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git
 Cloning into 'PythonWebScrapingCookbook'...
 remote: Counting objects: 420, done.
 remote: Compressing objects: 100% (316/316), done.
 remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0
 Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.
 Resolving deltas: 100% (164/164), done.
 Checking connectivity... done.

This created a PythonWebScrapingCookbook directory.

(env) pywscb $ ls -l
 total 0
 drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook
 drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env

Let's change into it and examine the content.

(env) PythonWebScrapingCookbook $ ls -l
 total 0
 drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py
 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www

There are two directories. Most of the Python code is in the py directory. www contains some web content that we will use from time to time with a local web server. Let's look at the contents of the py directory:

(env) py $ ls -l
 total 0
 drwxr-xr-x 9  michaelheydt staff 306 Jan 18 16:21 01
 drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03
 drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04
 drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05
 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06
 drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07
 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08
 drwxr-xr-x 7  michaelheydt staff 238 Jan 18 16:21 09
 drwxr-xr-x 7  michaelheydt staff 238 Jan 18 16:21 10
 drwxr-xr-x 9  michaelheydt staff 306 Jan 18 16:21 11
 drwxr-xr-x 8  michaelheydt staff 272 Jan 18 16:21 modules

Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2 as it is all interactive Python).

Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. Make sure that your Python path points to this folder. On Mac and Linux you can set this in your .bash_profile file (and in the environment variables dialog on Windows), substituting the path to the modules folder on your own system:

export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"
export PYTHONPATH
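If you would rather not change your shell profile, a script can add the folder to its own search path at startup instead. This is a sketch of an alternative, not the approach the recipes rely on; add_modules_dir is a hypothetical helper, and the example path assumes the repository was cloned into ~/pywscb as in the steps above.

```python
import os
import sys


def add_modules_dir(modules_dir):
    """Put modules_dir at the front of sys.path unless it is already listed."""
    modules_dir = os.path.abspath(modules_dir)
    if modules_dir not in sys.path:
        sys.path.insert(0, modules_dir)
    return modules_dir


# Assumed location: adjust to wherever your clone of the repository lives.
example_dir = os.path.join(os.path.expanduser("~"), "pywscb",
                           "PythonWebScrapingCookbook", "py", "modules")
added = add_modules_dir(example_dir)
print(added in sys.path)
```

Unlike PYTHONPATH, this only affects the process that runs it, so each script needing the modules would have to make the call before its imports.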

The contents of each folder generally follow a numbering scheme matching the sequence of the recipes in the chapter. The following are the contents of the chapter 6 folder:

(env) py $ ls -la 06
 total 96
 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .
 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..
 -rw-r--r-- 1  michaelheydt staff 902 Jan 18 16:21  01_scrapy_retry.py
 -rw-r--r-- 1  michaelheydt staff 656 Jan 18 16:21  02_scrapy_redirects.py
 -rw-r--r-- 1  michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py
 -rw-r--r-- 1  michaelheydt staff 488 Jan 18 16:21  04_press_and_wait.py
 -rw-r--r-- 1  michaelheydt staff 580 Jan 18 16:21  05_allowed_domains.py
 -rw-r--r-- 1  michaelheydt staff 826 Jan 18 16:21  06_scrapy_continuous.py
 -rw-r--r-- 1  michaelheydt staff 704 Jan 18 16:21  07_scrape_continuous_twitter.py
 -rw-r--r-- 1  michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py
 -rw-r--r-- 1  michaelheydt staff 526 Jan 18 16:21  09_limit_length.py
 -rw-r--r-- 1  michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py
 -rw-r--r-- 1  michaelheydt staff 597 Jan 18 16:21  11_file_cache.py
 -rw-r--r-- 1  michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py

In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>.
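Because the filenames start with the recipe number, the scripts for a chapter can be listed in recipe order with a few lines of Python. This is a sketch only: recipe_scripts and RECIPE_NAME are hypothetical names, and the demonstration builds a temporary folder with three of the filenames from the listing above instead of reading the real 06 directory.

```python
import re
import tempfile
from pathlib import Path

# Matches names such as 01_scrapy_retry.py and captures the leading number.
RECIPE_NAME = re.compile(r"^(\d+)_.+\.py$")


def recipe_scripts(chapter_dir):
    """Return (recipe_number, filename) pairs for a chapter folder, in recipe order."""
    found = []
    for path in Path(chapter_dir).iterdir():
        match = RECIPE_NAME.match(path.name)
        if path.is_file() and match:
            found.append((int(match.group(1)), path.name))
    return sorted(found)


# Demonstrate on a temporary folder holding a few names from the chapter 6 listing.
with tempfile.TemporaryDirectory() as chapter_dir:
    for name in ("10_forms_auth.py", "01_scrapy_retry.py", "02_scrapy_redirects.py"):
        Path(chapter_dir, name).touch()
    demo = recipe_scripts(chapter_dir)
    for number, name in demo:
        print(number, name)
```

Converting the prefix to an integer keeps the ordering correct even if a folder were to mix one-digit and two-digit prefixes.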

Congratulations, you've now got a Python environment configured with the book's code!

Now, just to be complete, if you want to get out of the Python virtual environment, you can exit using the following command:

(env) py $ deactivate
 py $

Checking which python again, we can see that it has switched back:

py $ which python
 /Users/michaelheydt/anaconda/bin/python

I won't be using the virtual environment for the rest of the book. When you see command prompts they will be either of the form "<directory> $" or simply "$".

Now let's move on to doing some scraping.
