Python Web Scraping
To test the performance of concurrent downloading, it is preferable to have a larger set of target websites. For this reason, we will use the Alexa list in this chapter, which tracks the top 1 million most popular websites according to users who have installed the Alexa Toolbar. Only a small percentage of people use this browser plugin, so the data is not authoritative, but it is fine for our purposes.
These top 1 million websites can be browsed on the Alexa website at http://www.alexa.com/topsites. Additionally, a compressed spreadsheet of this list is available at http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, so scraping Alexa is not necessary.
The Alexa list is provided in a spreadsheet with columns for the rank and domain:

Extracting this data requires a number of steps, as follows:
1. Download the .zip file.
2. Extract the CSV file from this .zip file.
3. Parse the CSV file.
4. Iterate over each row of the CSV file to extract the domain.
Here is an implementation...
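As a minimal sketch of the four steps above (not necessarily the book's own listing), the following uses only the Python 3 standard library. The function names `download` and `extract_domains` and the `max_urls` parameter are hypothetical, chosen for illustration. It also assumes that the archive contains a single CSV file and that each row holds the rank followed by the domain, as described earlier.

```python
import csv
import io
import zipfile
from urllib.request import urlopen

# URL taken from the text above; whether it is still served is not guaranteed.
ALEXA_URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"


def download(url):
    """Step 1: fetch the raw bytes of the .zip file (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def extract_domains(zipped_data, max_urls=None):
    """Steps 2-4: unzip in memory, parse the CSV, and collect the domains.

    `zipped_data` is the .zip file as bytes. `max_urls` is an illustrative
    optional cap on how many domains are returned.
    """
    domains = []
    # Step 2: wrap the bytes in a file-like object so ZipFile can read them.
    with zipfile.ZipFile(io.BytesIO(zipped_data)) as archive:
        # Assumption: the archive holds exactly one file, the CSV.
        csv_name = archive.namelist()[0]
        with archive.open(csv_name) as raw:
            # Step 3: csv.reader needs text, so decode the binary stream.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Step 4: each row is assumed to be (rank, domain).
            for row in csv.reader(text):
                if len(row) < 2:
                    continue  # skip blank or malformed rows
                domains.append(row[1])
                if max_urls is not None and len(domains) >= max_urls:
                    break
    return domains
```

With network access, `extract_domains(download(ALEXA_URL), max_urls=100)` would return the first 100 domains in the list. Keeping the download separate from the parsing means the parsing step can be exercised on any locally built archive without touching the network.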