Python Web Scraping

By Richard Penman
Overview of this book

The Internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to gather and make sense of the plethora of information available online. Using a simple language like Python, you can crawl the information out of complex websites with simple programming. This book is the ultimate guide to using Python to scrape data from websites. The early chapters cover how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, we'll get our hands dirty building a more sophisticated crawler with threads and more advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and scrape data indirectly. Discover more scraping nitty-gritty, such as using the browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHA. The book wraps up with how to create high-level scrapers with the Scrapy library and apply what has been learned to real websites.

Background research

Before diving into crawling a website, we should develop an understanding of the scale and structure of our target website. The website itself can help us through its robots.txt and Sitemap files, and there are also external tools, such as Google Search and WHOIS, that can provide further details.

Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling to minimize the chance of being blocked, and also to discover hints about a website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following code is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with user agent BadCrawler not to crawl their website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
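
As a quick preview, the following minimal sketch checks a URL against these rules using Python's standard-library robotparser module (renamed urllib.robotparser in Python 3); the GoodCrawler user agent string is just an assumption for illustration:

    >>> import robotparser
    >>> rp = robotparser.RobotFileParser()
    >>> rp.set_url('http://example.webscraping.com/robots.txt')
    >>> rp.read()
    >>> url = 'http://example.webscraping.com'
    >>> rp.can_fetch('GoodCrawler', url)
    True
    >>> rp.can_fetch('BadCrawler', url)
    False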

Section 2 specifies a crawl delay of 5 seconds between download requests for all user agents, which should be respected to avoid overloading the server. There is also a /trap link to try to block malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP address for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
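
One simple way to honour this delay is to sleep between downloads. The following sketch is only an illustration, assuming urllib2 (urllib.request in Python 3) and a couple of example country pages:

    import time
    import urllib2

    CRAWL_DELAY = 5  # seconds, as requested by section 2 of robots.txt
    urls = [
        'http://example.webscraping.com/view/Afghanistan-1',
        'http://example.webscraping.com/view/Aland-Islands-2',
    ]
    for url in urls:
        html = urllib2.urlopen(url).read()
        # pause before the next request so we do not overload the server
        time.sleep(CRAWL_DELAY)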

Section 3 defines a Sitemap file, which will be examined in the next section.

Examining the Sitemap

Sitemap files are provided by websites to help crawlers locate their updated content without needing to crawl every web page. The sitemap standard is defined at http://www.sitemaps.org/protocol.html. Here is the content of the Sitemap file discovered in the robots.txt file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>

This sitemap provides links to all the web pages, which will be used in the next section to build our first crawler. Sitemap files provide an efficient way to crawl a website, but they need to be treated carefully because they are often missing, out of date, or incomplete.
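
As a rough sketch of how this Sitemap could be used, the following downloads it with urllib2 (urllib.request in Python 3) and extracts the page URLs with a simple regular expression; real code would also need to handle download errors and missing sitemaps:

    >>> import re
    >>> import urllib2
    >>> sitemap = urllib2.urlopen('http://example.webscraping.com/sitemap.xml').read()
    >>> links = re.findall('<loc>(.*?)</loc>', sitemap)
    >>> links[:2]
    ['http://example.webscraping.com/view/Afghanistan-1', 'http://example.webscraping.com/view/Aland-Islands-2']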

Estimating the size of a website

The size of the target website will affect how we crawl it. If the website has just a few hundred URLs, such as our example website, efficiency is not important. However, if the website has over a million web pages, downloading each page sequentially would take months. This problem is addressed later in Chapter 4, Concurrent Downloading, which covers distributed downloading.
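
To put a rough number on this: at the 5-second crawl delay requested in the example robots.txt, one million pages downloaded sequentially would take 1,000,000 × 5 seconds, or roughly 58 days; even at one page per second, it would still take over 11 days.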

A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search with the site keyword to filter the results to our domain. An interface to this and other advanced search parameters is available at http://www.google.com/advanced_search.

Here are the site search results for our example website when searching Google for site:example.webscraping.com:

[Screenshot: Google search results for site:example.webscraping.com]

As we can see, Google currently estimates 202 web pages, which is about as expected. For larger websites, I have found Google's estimates to be less accurate.

We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:

[Screenshot: Google search results for site:example.webscraping.com/view]

This additional filter is useful because ideally you will only want to crawl the part of a website containing useful data rather than every page of it.

Identifying the technology used by a website

The type of technology used to build a website will affect how we crawl it. A useful tool to check the kind of technologies a website is built with is the builtwith module, which can be installed with:

    pip install builtwith

This module will take a URL, download and analyze it, and then return the technologies used by the website. Here is an example:

    >>> import builtwith
    >>> builtwith.parse('http://example.webscraping.com')
    {u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],
     u'programming-languages': [u'Python'],
     u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],
     u'web-servers': [u'Nginx']}

We can see here that the example website uses the Web2py Python web framework along with some common JavaScript libraries, so its content is likely embedded in the HTML and should be relatively straightforward to scrape. If the website was instead built with AngularJS, then its content would likely be loaded dynamically. Or, if the website used ASP.NET, then it would be necessary to use sessions and form submissions to crawl web pages. Working with these more difficult cases will be covered later in Chapter 5, Dynamic Content, and Chapter 6, Interacting with Forms.

Finding the owner of a website

For some websites, it may matter to us who the owner is. For example, if the owner is known to block web crawlers, then it would be wise to be more conservative in our download rate. To find who owns a website, we can use the WHOIS protocol to see who the registered owner of the domain name is. There is a Python wrapper for this protocol, documented at https://pypi.python.org/pypi/python-whois, which can be installed via pip:

    pip install python-whois

Here is the key part of the WHOIS response when querying the appspot.com domain with this module:

    >>> import whois
    >>> print whois.whois('appspot.com')
    {
      ...
      "name_servers": [
        "NS1.GOOGLE.COM", 
        "NS2.GOOGLE.COM", 
        "NS3.GOOGLE.COM", 
        "NS4.GOOGLE.COM", 
        "ns4.google.com", 
        "ns2.google.com", 
        "ns1.google.com", 
        "ns3.google.com"
      ], 
      "org": "Google Inc.", 
      "emails": [
        "[email protected]", 
        "[email protected]"
      ]
    }

We can see here that this domain is owned by Google, which is correct: this domain is for the Google App Engine service. We would need to be careful when crawling this domain because Google often blocks web crawlers, despite being fundamentally a web crawling business itself.
