Python Web Scraping Cookbook

By : Michael Heydt

Overview of this book

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios where every part of the development/product life cycle will be fully covered. You will not only develop the skills needed to design and develop reliable, performant data flows, but also deploy your codebase to AWS. If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful, as each recipe has a clear purpose and objective. From extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be a godsend. This book covers Python libraries such as requests and BeautifulSoup. You will learn about crawling, web spidering, working with Ajax websites, paginated items, and more. You will also learn to tackle problems such as 403 errors, working with proxies, scraping images, and LXML. By the end of this book, you will be able to scrape websites more efficiently and to deploy and operate your scraper in the cloud.

Scraping Python.org with Selenium and PhantomJS

This recipe introduces Selenium and PhantomJS, two frameworks that are very different from those in the previous recipes. In fact, Selenium and PhantomJS are often used in functional/acceptance testing. We want to demonstrate these tools because they offer unique benefits from a scraping perspective. Several of these, which we will look at later in the book, are the ability to fill out forms, press buttons, and wait for dynamic JavaScript to be downloaded and executed.

Selenium itself is a language-neutral framework. It offers bindings for a number of programming languages, such as Python, Java, C#, and PHP (amongst others). The framework also provides many components that focus on testing. Three commonly used components are:

IDE, for recording and replaying tests
WebDriver, which launches a web browser (such as Firefox, Chrome, or Internet Explorer), sends commands to it, and returns the results
Grid server, which executes tests with a web browser on a remote server and can run multiple test cases in parallel

Getting ready

First, we need to install Selenium. We do this with our trusty pip:

~ $ pip install selenium
Collecting selenium
 Downloading selenium-3.8.1-py2.py3-none-any.whl (942kB)
 100% |████████████████████████████████| 952kB 236kB/s
Installing collected packages: selenium
Successfully installed selenium-3.8.1

This installs the Selenium Client Driver for Python (the language bindings). More information is available at https://github.com/SeleniumHQ/selenium/blob/master/py/docs/source/index.rst.

For this recipe, we also need the driver for Firefox (named geckodriver) in the recipe's directory. This file is operating system specific. I've included the file for Mac in the 01 folder. To get other versions, visit https://github.com/mozilla/geckodriver/releases.

Still, when running this sample you may get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

If you do, put the geckodriver file somewhere on your system's PATH, or add the 01 folder to your PATH. Oh, and you will need to have Firefox installed.
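A quick way to confirm that the driver is visible before launching Selenium is to look it up on PATH from Python. The following is a minimal sketch using only the standard library; the helper name ensure_on_path is hypothetical, and the 01 folder is the recipe folder mentioned above.

```python
import os
import shutil


def ensure_on_path(executable, extra_dir):
    """Return the full path to `executable`, or None if it cannot be found.

    If the executable is not on PATH but `extra_dir` exists, that folder is
    prepended to PATH for the current process and the lookup is retried.
    (Hypothetical helper, not part of Selenium.)
    """
    found = shutil.which(executable)
    if found is None and os.path.isdir(extra_dir):
        # Prepend the folder so this process (and Selenium) can see it.
        os.environ["PATH"] = (
            os.path.abspath(extra_dir) + os.pathsep + os.environ.get("PATH", "")
        )
        found = shutil.which(executable)
    return found


if __name__ == "__main__":
    path = ensure_on_path("geckodriver", "01")
    print(path or "geckodriver not found; download it and put it on your PATH")
```

Changing os.environ only affects the running Python process, so this does not replace adding the folder to PATH permanently in your shell profile.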

Finally, you need to have PhantomJS installed. You can download it and find installation instructions at http://phantomjs.org/

How to do it...

The script for this recipe is 01/04_events_with_selenium.py.

The following is the code:

from selenium import webdriver

def get_upcoming_events(url):
    # Start Firefox (through geckodriver) and load the page
    driver = webdriver.Firefox()
    try:
        driver.get(url)

        # Each <li> in the recent-events list is one event
        events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

        for event in events:
            # These XPath expressions are relative to the current <li>
            event_details = dict()
            event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
            event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
            event_details['time'] = event.find_element_by_xpath('p/time').text
            print(event_details)
    finally:
        # Shut the browser down even if a lookup fails
        driver.quit()

get_upcoming_events('https://www.python.org/events/python-events/')

Run the script with Python. You will see familiar output:

~ $ python 04_events_with_selenium.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan.'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan.'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb.'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb.'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb.'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb.'}

During this process, Firefox will pop up and open the page. We have reused the code from the previous recipe and adapted it to use Selenium.

[Figure: The window popped up by Firefox]

How it works

The primary difference in this recipe is the following code:

driver = webdriver.Firefox()
driver.get(url)

This gets the Firefox driver and uses it to get the content of the specified URL. It works by starting Firefox and automating it to go to the page; Firefox then returns the page content to our app. This is why Firefox popped up. The other difference is that, to find things, we call find_elements_by_xpath and find_element_by_xpath on the driver and on the elements it returns, which search the resulting HTML.
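The lookup pattern (select every list item first, then query each item with paths relative to it) can be tried without launching a browser. The sketch below applies the same idea to a small, hypothetical HTML fragment using the standard library's xml.etree.ElementTree. ElementTree supports only a subset of XPath and has no contains(), so the class is matched exactly here; the function name parse_events and the sample markup are assumptions made for illustration, not the recipe's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like the python.org events list.
SAMPLE = """
<div>
  <ul class="list-recent-events">
    <li>
      <h3 class="event-title"><a href="#">PyCascades 2018</a></h3>
      <p><time>22 Jan. - 24 Jan.</time>
         <span class="event-location">Vancouver, Canada</span></p>
    </li>
    <li>
      <h3 class="event-title"><a href="#">FOSDEM 2018</a></h3>
      <p><time>03 Feb. - 05 Feb.</time>
         <span class="event-location">Bruxelles, Belgium</span></p>
    </li>
  </ul>
</div>
"""


def parse_events(markup):
    """Return one dict per <li>, using paths relative to each item."""
    root = ET.fromstring(markup)
    events = []
    # Step 1: select every list item in the events list.
    for item in root.findall(".//ul[@class='list-recent-events']/li"):
        # Step 2: query each item with paths relative to that item.
        events.append({
            "name": item.find("h3[@class='event-title']/a").text,
            "location": item.find("p/span[@class='event-location']").text,
            "time": item.find("p/time").text,
        })
    return events


if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event)
```

Because the relative paths start from each item rather than from the document root, every lookup returns the data for that one event only.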

There's more...

PhantomJS, in many ways, is very similar to Selenium. It has fast and native support for various web standards, with features such as DOM handling, CSS selectors, JSON, Canvas, and SVG. It is often used in web testing, page automation, screen capturing, and network monitoring.

There is one key difference between Selenium and PhantomJS: PhantomJS is headless and uses WebKit. As we saw, Selenium opens and automates a browser. This is not ideal in a continuous integration or testing environment where no browser is installed, and where we also don't want thousands of browser windows or tabs being opened. Being headless makes PhantomJS faster and more efficient.

The example for PhantomJS is in the 01/05_events_with_phantomjs.py file. There is a single one-line change:

driver = webdriver.PhantomJS('phantomjs')

Running the script produces output similar to the Selenium/Firefox example, but no browser pops up and the script takes less time to complete.
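Since the two scripts differ only in how the driver is created, one way to keep a single script is to isolate that choice in a small factory function. This is a sketch under assumptions: make_driver and its kind argument are hypothetical names, and it presumes the Selenium 3.x bindings installed earlier in this recipe, where webdriver.PhantomJS is available.

```python
def make_driver(kind="firefox"):
    """Create a WebDriver; `kind` is a hypothetical switch, not a Selenium option."""
    if kind not in ("firefox", "phantomjs"):
        raise ValueError("unknown driver kind: %r" % kind)

    # Imported lazily so the argument check above works without Selenium installed.
    from selenium import webdriver

    if kind == "phantomjs":
        # Headless: no browser window appears.
        return webdriver.PhantomJS("phantomjs")
    # Opens a visible Firefox window through geckodriver.
    return webdriver.Firefox()
```

With this in place, get_upcoming_events could accept a driver created by make_driver('phantomjs') or make_driver('firefox') rather than constructing one itself.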
