3. Caching Downloads | Python Web Scraping

Python Web Scraping

By: Richard Penman

Overview of this book

The Internet contains the most useful set of data ever assembled, much of it publicly accessible for free. However, this data is not easily reusable: it is embedded within the structure and style of websites and needs to be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. Using a simple language like Python, you can crawl information out of complex websites with straightforward code. This book is the ultimate guide to using Python to scrape data from websites. The early chapters cover how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, we get our hands dirty building a more sophisticated crawler with threads, then move on to more advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and indirectly scrape data. Discover further scraping details such as using a browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHA. The book wraps up with how to create high-level scrapers with Scrapy and how to apply what has been learned to real websites.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

1. Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Background research

Crawling your first website

Summary

2. Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Summary

3. Caching Downloads

Adding cache support to the link crawler

Disk cache

Database cache

Summary

4. Concurrent Downloading

One million web pages

Sequential crawler

Threaded crawler

Performance

Summary

5. Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Rendering a dynamic web page

Summary

6. Interacting with Forms

The Login form

Extending the login script to update content

Automating forms with the Mechanize module

Summary

7. Solving CAPTCHA

Registering an account

Optical Character Recognition

Solving complex CAPTCHAs

Summary

8. Scrapy

Installation

Starting a project

Visual scraping with Portia

Automated scraping with Scrapely

Summary

9. Overview

Google search engine

Facebook

Gap

BMW

Summary

Index

Database cache

What is NoSQL?
