Book Image

Python Web Scraping Cookbook

By : Michael Heydt
Book Image

Python Web Scraping Cookbook

By: Michael Heydt

Overview of this book

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios where every part of the development/product life cycle will be fully covered. You will not only develop the skills needed to design and develop reliable performance data flows, but also deploy your codebase to AWS. If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful as each recipe has a clear purpose and objective. Right from extracting data from the websites to writing a sophisticated web crawler, the book's independent recipes will be a godsend. This book covers Python libraries, requests, and BeautifulSoup. You will learn about crawling, web spidering, working with Ajax websites, paginated items, and more. You will also learn to tackle problems such as 403 errors, working with proxy, scraping images, and LXML. By the end of this book, you will be able to scrape websites more efficiently and able to deploy and operate your scraper in the cloud.
Table of Contents (13 chapters)

Scraping Python.org with Selenium and PhantomJS

This recipe will introduce Selenium and PhantomJS, two frameworks that are very different from the frameworks in the previous recipes. In fact, Selenium and PhantomJS are often used in functional/acceptance testing. We want to demonstrate these tools as they offer unique benefits from the scraping perspective. Several that we will look at later in the book are the ability to fill out forms, press buttons, and wait for dynamic JavaScript to be downloaded and executed.

Selenium itself is a programming language neutral framework. It offers a number of programming language bindings, such as Python, Java, C#, and PHP (amongst others). The framework also provides many components that focus on testing. Three commonly used components are:

  • IDE for recording and replaying tests
  • Webdriver, which actually launches a web browser (such as Firefox, Chrome, or Internet Explorer) by sending commands and sending the results to the selected browser
  • A grid server executes tests with a web browser on a remote server. It can run multiple test cases in parallel.

Getting ready

First we need to install Selenium. We do this with our trusty pip:

~ $ pip install selenium
Collecting selenium
Downloading selenium-3.8.1-py2.py3-none-any.whl (942kB)
100% |████████████████████████████████| 952kB 236kB/s
Installing collected packages: selenium
Successfully installed selenium-3.8.1

This installs the Selenium Client Driver for Python (the language bindings). You can find more information on it at https://github.com/SeleniumHQ/selenium/blob/master/py/docs/source/index.rst if you want to in the future.

For this recipe we also need to have the driver for Firefox in the directory (it's named geckodriver). This file is operating system specific. I've included the file for Mac in the folder. To get other versions, visit https://github.com/mozilla/geckodriver/releases.

Still, when running this sample you may get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

If you do, put the geckodriver file somewhere on your systems PATH, or add the 01 folder to your path. Oh, and you will need to have Firefox installed.

Finally, it is required to have PhantomJS installed. You can download and find installation instructions at: http://phantomjs.org/

How to do it...

The script for this recipe is 01/04_events_with_selenium.py.

  1. The following is the code:
from selenium import webdriver

def get_upcoming_events(url):
driver = webdriver.Firefox()
driver.get(url)

events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

for event in events:
event_details = dict()
event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
event_details['time'] = event.find_element_by_xpath('p/time').text
print(event_details)

driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
  1. And run the script with Python. You will see familiar output:
~ $ python 04_events_with_selenium.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan.'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan.'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb.'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb.'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb.'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb.'}

During this process, Firefox will pop up and open the page. We have reused the previous recipe and adopted Selenium.

The Window Popped up by Firefox

How it works

The primary difference in this recipe is the following code:

driver = webdriver.Firefox()
driver.get(url)

This gets the Firefox driver and uses it to get the content of the specified URL. This works by starting Firefox and automating it to go the the page, and then Firefox returns the page content to our app. This is why Firefox popped up. The other difference is that to find things we need to call find_element_by_xpath to search the resulting HTML.

There's more...

PhantomJS, in many ways, is very similar to Selenium. It has fast and native support for various web standards, with features such as DOM handling, CSS selector, JSON, Canvas, and SVG. It is often used in web testing, page automation, screen capturing, and network monitoring.

There is one key difference between Selenium and PhantomJS: PhantomJS is headless and uses WebKit. As we saw, Selenium opens and automates a browser. This is not very good if we are in a continuous integration or testing environment where the browser is not installed, and where we also don't want thousands of browser windows or tabs being opened. Being headless, makes this faster and more efficient.

The example for PhantomJS is in the 01/05_events_with_phantomjs.py file. There is a single one line change:

driver = webdriver.PhantomJS('phantomjs')

And running the script results in similar output to the Selenium / Firefox example, but without a browser popping up and also it takes less time to complete.