Index
A
- absolute link
- about / Link crawler
- account
- registering / Registering an account
- URL, for registration / Registering an account
- CAPTCHA image, loading / Loading the CAPTCHA image
- advanced features, link crawler
- robots.txt file, parsing / Parsing robots.txt
- proxies, supporting / Supporting proxies
- downloads, throttling / Throttling downloads
- spider traps, avoiding / Avoiding spider traps
- maximum depth, setting / Final version
- advanced search
- Alexa
- URL / One million web pages
- Alexa list
- URL / One million web pages
- parsing / Parsing the Alexa list
- annotation, Portia
- about / Annotation
- Asynchronous JavaScript and XML (AJAX)
- about / An example dynamic web page
- automated scraping
- with Scrapely / Automated scraping with Scrapely
B
- Beautiful Soup
- about / Beautiful Soup
- overview / Beautiful Soup
- common methods / Beautiful Soup
- URL / Beautiful Soup
- Blink
- about / Rendering a dynamic web page
- BMW
- about / BMW
- builtwith
- about / Identifying the technology used by a website
C
- 2Captcha
- about / Using a CAPTCHA solving service
- cache
- implementing, in MongoDB / MongoDB cache implementation
- compression, adding / Compression
- testing, in MongoDB / Testing the cache
- URL, for testing / Testing the cache
- cache support
- adding, to link crawler / Adding cache support to the link crawler
- CAPTCHA API
- about / 9kw CAPTCHA API
- implementation / 9kw CAPTCHA API
- example / 9kw CAPTCHA API
- integrating, with registration form / Integrating with registration
- CaptchaAPI class
- reference link / 9kw CAPTCHA API
- CAPTCHA image
- loading / Loading the CAPTCHA image
- CAPTCHA solving service
- using / Using a CAPTCHA solving service
- complex CAPTCHA
- solving / Solving complex CAPTCHAs
- cookies
- about / The Login form
- loading, from browser / Loading cookies from the web browser
- crawl
- interrupting / Interrupting and resuming a crawl
- resuming / Interrupting and resuming a crawl
- crawl command
- about / Installation
- crawling
- about / Crawling your first website
- web page, downloading / Downloading a web page
- sitemap crawler / Sitemap crawler
- ID iteration crawler / ID iteration crawler
- link crawler / Link crawler
- cross-process crawler
- about / Cross-process crawler
- CSS selectors
- about / Sitemap crawler, CSS selectors
- references / CSS selectors
D
- Death by Captcha
- about / Using a CAPTCHA solving service
- disk cache
- about / Disk cache
- implementation / Implementation
- testing / Testing the cache
- URL, for source code / Testing the cache
- drawbacks / Drawbacks
- MongoDB / Database cache
- NoSQL / What is NoSQL?
- disk space
- saving / Saving disk space
- stale data, expiring / Expiring stale data
- Downloader class
- URL, for source code / Adding cache support to the link crawler
- dynamic web page
- example / An example dynamic web page
- reference link, for example / An example dynamic web page
- reverse engineering / Reverse engineering a dynamic web page
- rendering / Rendering a dynamic web page
- rendering, with PyQt / PyQt or PySide
- rendering, with PySide / PyQt or PySide
- JavaScript, executing / Executing JavaScript
- website interaction, with WebKit / Website interaction with WebKit
- rendering, with Selenium / Selenium
E
- edge cases
- about / Edge cases
- example web scraping website
- URL / Executing JavaScript
F
- Facebook
- about / Facebook
- website / The website
- API / The API
- Firebug Lite
- URL / Analyzing a web page
- form encodings
- about / The Login form
- reference link / The Login form
G
- Gap
- about / Gap
- Gecko
- about / Rendering a dynamic web page
- genspider command
- about / Installation
- Google
- URL / Google search engine
- Google search engine
- about / Google search engine
- homepage / Google search engine
- test search, performing / Google search engine
- Google Translate
- Google Web Toolkit (GWT)
- about / Rendering a dynamic web page
- Graph API
- about / The API
H
- HTTP requests
- reference link / Supporting proxies
I
- ID iteration crawler
- about / ID iteration crawler
- Internet Engineering Task Force
- URL / Retrying downloads
- items.py file
- about / Starting a project
J
- JavaScript
- executing / Executing JavaScript
- JSONP format
- about / BMW
K
- 9kw
- using / Getting started with 9kw
- URL / Getting started with 9kw
- 9kw API
- URL / 9kw CAPTCHA API
L
- link crawler
- about / Link crawler
- advanced features, adding / Advanced features
- scrape callback, adding / Adding a scrape callback to the link crawler
- URL / Adding a scrape callback to the link crawler
- cache support, adding / Adding cache support to the link crawler
- Link Extractors
- reference link / Testing the spider
- Login form
- about / The Login form
- URL / The Login form
- automating / The Login form
- examples, reference link / The Login form
- cookies, loading from browser / Loading cookies from the web browser
- content, updating / Extending the login script to update content
- automating, with Mechanize module / Automating forms with the Mechanize module
- Lxml
- about / Lxml
- URL / Lxml
- CSS selectors / CSS selectors
M
- Mechanize module
- Login form, automating / Automating forms with the Mechanize module
- URL / Automating forms with the Mechanize module
- model, Scrapy
- defining / Defining a model
- URL / Defining a model
- MongoDB
- about / Database cache
- installing / Installing MongoDB
- URL / Installing MongoDB
- overview / Overview of MongoDB
- URL, for documentation / Overview of MongoDB
- cache, implementing / MongoDB cache implementation
- compression, adding to cache / Compression
- cache, testing / Testing the cache
N
- no country redirect (ncr)
- URL / Google search engine
- about / Google search engine
- NoSQL
- about / What is NoSQL?
O
- OCR
- about / Optical Character Recognition
- example / Optical Character Recognition
- performance, improving / Further improvements
- complex CAPTCHA, solving / Solving complex CAPTCHAs
- CAPTCHA solving service, using / Using a CAPTCHA solving service
- 9kw, using / Getting started with 9kw
- CAPTCHA API / 9kw CAPTCHA API
- one million web pages
- downloading / One million web pages
- Alexa list, parsing / Parsing the Alexa list
- owner, website
- searching / Finding the owner of a website
P
- padding
- about / BMW
- Pillow library
- using / Loading the CAPTCHA image
- URL / Loading the CAPTCHA image
- versus Python Imaging Library (PIL) / Loading the CAPTCHA image
- pip command
- about / Installation
- Portia
- used, for visual scraping / Visual scraping with Portia
- about / Installation
- URL / Installation
- installing / Installation
- URL, for downloading / Installation
- annotation / Annotation
- spider, tuning / Tuning a spider
- results, checking / Checking results
- automated scraping, with Scrapely / Automated scraping with Scrapely
- Presto
- about / Rendering a dynamic web page
- process_link_crawler
- URL / Cross-process crawler
- PyQt
- about / PyQt or PySide
- URL / PyQt or PySide
- PySide
- about / PyQt or PySide
- URL / PyQt or PySide
- Python Imaging Library (PIL)
- versus Pillow library / Loading the CAPTCHA image
- about / Loading the CAPTCHA image
Q
- Qt 4.8
- URL / Executing JavaScript
R
- regular expressions
- about / Regular expressions
- URL / Regular expressions
- relative link
- about / Link crawler
- Render class
- reference link / The Render class
- reverse engineering
- dynamic web page / Reverse engineering a dynamic web page
- about / Reverse engineering a dynamic web page
- edge cases / Edge cases
- robots.txt file
- checking / Checking robots.txt
- URL / Checking robots.txt
S
- scrape callback
- adding, to link crawler / Adding a scrape callback to the link crawler
- Scrapely
- URL / Automated scraping with Scrapely
- used, for automated scraping / Automated scraping with Scrapely
- scraping approaches
- regular expressions / Regular expressions
- Beautiful Soup / Beautiful Soup
- Lxml / Lxml
- comparing / Comparing performance
- results, testing / Scraping results
- advantages / Overview
- disadvantages / Overview
- Scrapy
- installing / Installation
- URL, for installation / Installation
- URL, for commands / Installation
- URL / Interrupting and resuming a crawl
- scrapy command
- about / Installation
- Scrapy project
- starting / Starting a project
- model, defining / Defining a model
- spider, creating / Creating a spider
- Selenium
- about / Selenium
- sequential crawler
- about / Sequential crawler
- URL / Sequential crawler
- settings.py file
- about / Starting a project
- shell command
- about / Installation
- using / Scraping with the shell command
- sitemap crawler
- about / Sitemap crawler
- Sitemap file
- examining / Examining the Sitemap
- reference link / Examining the Sitemap
- special class methods, Python
- spider
- creating / Creating a spider
- about / Creating a spider
- reference link / Creating a spider
- settings, tuning / Tuning settings
- URL, for settings / Tuning settings
- testing / Testing the spider
- scraping, with shell command / Scraping with the shell command
- results, checking / Checking results
- tuning / Tuning a spider
- spider trap
- about / Avoiding spider traps
- avoiding / Avoiding spider traps
- startproject command
- about / Installation
T
- technology
- identifying / Identifying the technology used by a website
- Tesseract OCR engine
- about / Optical Character Recognition
- threaded crawler
- about / Threaded crawler
- process / Threaded crawler, How threads and processes work
- implementation / Implementation
- URL / Implementation
- cross-process crawler / Cross-process crawler
- performance / Performance
- thresholding
- about / Optical Character Recognition
- Trident
- about / Rendering a dynamic web page
V
- virtualenv
- about / Installation
- URL / Installation
- visual scraping
- with Portia / Visual scraping with Portia
W
- WebKit
- about / Rendering a dynamic web page
- website interaction / Website interaction with WebKit
- search results, scraping / Waiting for results
- Render class, using / The Render class
- web page
- downloading, for crawling / Downloading a web page
- downloads, retrying / Retrying downloads
- user agent, setting / Setting a user agent
- analyzing / Analyzing a web page
- web scraping
- usage / When is web scraping useful?
- legality / Is web scraping legal?
- references, for legal cases / Is web scraping legal?
- website
- background research / Background research
- robots.txt file, checking / Checking robots.txt
- Sitemap file, examining / Examining the Sitemap
- size, estimating / Estimating the size of a website
- technology, identifying / Identifying the technology used by a website
- owner, searching / Finding the owner of a website
- Whois
- about / Finding the owner of a website