Web Scraping with Python

By: Richard Penman

Summary

In this chapter, we walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.
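As a rough side-by-side illustration of the three approaches named above, the following sketch extracts the same table cell with a regular expression, with BeautifulSoup, and with lxml. The sample markup, the `places_area__row` and `w2p_fw` names, and the function names are assumptions made up for this example. The two third-party parsers are imported defensively so that the script still runs if only one of them is installed.

```python
import re

# Optional third-party parsers: an approach is skipped if its library is missing.
try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None
try:
    import lxml.html
except ImportError:
    lxml = None

# Hypothetical sample markup, invented for this sketch.
HTML = """
<table>
  <tr id="places_area__row">
    <td class="w2p_fl">Area: </td>
    <td class="w2p_fw">244,820 square kilometres</td>
  </tr>
</table>
"""


def scrape_regex(html):
    """Match the value cell directly, without parsing the whole page."""
    match = re.search(r'<td class="w2p_fw">(.*?)</td>', html)
    return match.group(1) if match else None


def scrape_bs(html):
    """Parse the page with BeautifulSoup and navigate to the value cell."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find("tr", id="places_area__row")
    return row.find("td", class_="w2p_fw").text


def scrape_lxml(html):
    """Parse the page with lxml and select the value cell with XPath."""
    tree = lxml.html.fromstring(html)
    cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')
    return cells[0].text_content()


if __name__ == "__main__":
    print("regex:", scrape_regex(HTML))
    if BeautifulSoup is not None:
        print("BeautifulSoup:", scrape_bs(HTML))
    if lxml is not None:
        print("lxml:", scrape_lxml(HTML))
```

The regular expression is the shortest of the three but depends on the exact attribute layout of the cell, whereas the two parser-based versions keep working if, for example, extra attributes or whitespace are added to the tags.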

In the next chapter, we will introduce caching, which allows us to save web pages so that they only need to be downloaded the first time a crawler is run.
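To make the idea concrete, here is a minimal sketch of a disk cache that downloads a page only on the first request and reads it from disk afterwards. It is not the implementation developed in the next chapter. The function names, the `cache` directory, and the choice of a SHA-256 hash of the URL as the file name are all assumptions for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch(url):
    """Download a URL and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_download(url, cache_dir="cache", downloader=fetch):
    """Return the page body, downloading it only if it is not cached yet."""
    # Hash the URL so that any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / name
    if cache_path.exists():
        # Cache hit: no network request is made.
        return cache_path.read_bytes()
    # Cache miss: download once and save the result for later runs.
    body = downloader(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(body)
    return body
```

Passing the downloader as a parameter keeps the caching logic separate from the network code, so the cache can be exercised with a stand-in function instead of a real request.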