Web Scraping with Python

By: Richard Penman

Summary

In this chapter, we learned that caching downloaded web pages saves time and reduces bandwidth use when recrawling a website. The main drawback is that the cache takes up disk space, which can be reduced through compression. Additionally, building the cache on top of an existing database system, such as MongoDB, avoids filesystem limitations.
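
As a rough illustration of the compressed disk cache idea summarized above, the following is a minimal sketch and not the chapter's actual implementation. The names `CompressedDiskCache` and `cached_download` are hypothetical, and the `download` argument stands in for whatever function fetches a page. It assumes pages are stored as text, one zlib-compressed file per URL.

```python
import hashlib
import os
import zlib


class CompressedDiskCache:
    """Hypothetical dict-like cache that keeps one compressed file per URL."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url):
        # Hash the URL so the filename is short and safe on any filesystem.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name + ".zlib")

    def __contains__(self, url):
        return os.path.exists(self._path(url))

    def __getitem__(self, url):
        try:
            with open(self._path(url), "rb") as f:
                return zlib.decompress(f.read()).decode("utf-8")
        except FileNotFoundError:
            raise KeyError(url) from None

    def __setitem__(self, url, html):
        # Compress before writing to reduce the disk space the cache uses.
        with open(self._path(url), "wb") as f:
            f.write(zlib.compress(html.encode("utf-8")))


def cached_download(url, cache, download):
    """Return the cached page if present; otherwise download and store it."""
    if url in cache:
        return cache[url]
    html = download(url)
    cache[url] = html
    return html
```

Calling `cached_download(url, cache, download)` twice for the same URL should invoke `download` only once, since the second call is served from disk. Hashing the URL sidesteps filename length and character restrictions, which is one of the filesystem limitations a database-backed cache would also avoid.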

In the next chapter, we will add further functionality to our crawler so that it can download multiple web pages concurrently and crawl faster.