Background research
Before diving into crawling a website, we should develop an understanding of the scale and structure of our target website. The website itself can help us through its robots.txt and Sitemap files, and there are also external tools available to provide further details, such as Google Search and WHOIS.
Checking robots.txt
Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover hints about a website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
Section 2 specifies a crawl delay of 5 seconds between download requests for all user agents, which should be respected to avoid overloading the server. There is also a /trap link, intended to catch malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
Section 3 defines a Sitemap file, which will be examined in the next section.
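The three sections above can also be checked programmatically. The following is a minimal sketch, assuming Python 3 and its standard urllib.robotparser module; it is not the crawler built later in this chapter. To keep it self-contained, the robots.txt content is pasted in as a string rather than downloaded, and GoodCrawler is a made-up user agent name used only for illustration.

```python
from urllib import robotparser

# Content of the example robots.txt, copied from the listing above.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

base = 'http://example.webscraping.com'

# Section 1: BadCrawler is asked not to crawl anything.
bad_allowed = parser.can_fetch('BadCrawler', base + '/')

# Section 2: every other agent may crawl the site, except /trap.
# 'GoodCrawler' is a hypothetical user agent name.
good_allowed = parser.can_fetch('GoodCrawler', base + '/')
trap_allowed = parser.can_fetch('GoodCrawler', base + '/trap')

# Section 2 also requests a delay (in seconds) between downloads.
delay = parser.crawl_delay('GoodCrawler')

# Section 3: the sitemap locations (site_maps() needs Python 3.8 or later).
sitemaps = parser.site_maps()

print(bad_allowed, good_allowed, trap_allowed, delay, sitemaps)
```

Against a live website, the same parser could be pointed at the real file with set_url() followed by read() instead of parse().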
Examining the Sitemap
Sitemap files are provided by websites to help crawlers locate their updated content without needing to crawl every web page. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html. Here is the content of the Sitemap file discovered in the robots.txt file:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>
This sitemap provides links to all the web pages, which will be used in the next section to build our first crawler. Sitemap files provide an efficient way to crawl a website, but need to be treated carefully because they are often missing, out of date, or incomplete.
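As a rough illustration of how the links in a Sitemap file can be read, here is a minimal sketch assuming Python 3 and the standard xml.etree.ElementTree module. It is one possible approach, not necessarily the one used for the crawler in the next section. The sample is cut down to the three entries shown above, and sitemap_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The first three entries of the example sitemap, as bytes so that the
# XML encoding declaration is handled by the parser.
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must use it.
NAMESPACES = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def sitemap_links(xml_bytes):
    """Return the URL inside every <loc> element of a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip()
            for loc in root.findall('sm:url/sm:loc', NAMESPACES)]


links = sitemap_links(SITEMAP_XML)
for link in links:
    print(link)
```

Because sitemaps are often missing, out of date, or incomplete, a crawler relying on this list alone may skip pages that exist on the website.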
Estimating the size of a website
The size of the target website will affect how we crawl it. If the website is just a few hundred URLs, such as our example website, efficiency is not important. However, if the website has over a million web pages, downloading each sequentially would take months. This problem is addressed later in Chapter 4, Concurrent Downloading, on distributed downloading.
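To put rough numbers on this, here is a back-of-the-envelope sketch in Python 3. It assumes one download every 5 seconds, the crawl delay requested by the example robots.txt, and ignores the download time itself; the page counts of 200 and 1,000,000 are illustrative, and sequential_crawl_seconds is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sequential_crawl_seconds(num_pages, seconds_per_page):
    """Time to download num_pages one after another at a fixed pace."""
    return num_pages * seconds_per_page


# Assumption: 5 seconds per page, matching the example crawl delay.
small_site_minutes = sequential_crawl_seconds(200, 5) / 60
large_site_days = sequential_crawl_seconds(1000000, 5) / SECONDS_PER_DAY

print('200 pages: about %.0f minutes' % small_site_minutes)
print('1,000,000 pages: about %.0f days' % large_site_days)
```

Under these assumptions the small website takes about 17 minutes, while the large one takes about 58 days, which is why sequential downloading stops being practical at that scale.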
A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search with the site keyword to filter the results to our domain. An interface to this and other advanced search parameters is available at http://www.google.com/advanced_search.
Here are the site search results for our example website when searching Google for site:example.webscraping.com:
As we can see, Google currently estimates 202 web pages, which is about as expected. For larger websites, I have found Google's estimates to be less accurate.
We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:
This additional filter is useful because ideally you will only want to crawl the part of a website containing useful data rather than every page of it.
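The site queries used above can also be turned into search URLs programmatically. The following is a minimal sketch, assuming Python 3 and assuming that Google's search page takes the query text in a q parameter; site_search_url is a hypothetical helper name. It only builds the URL and does not download or parse any results.

```python
from urllib.parse import urlencode


def site_search_url(domain, path=''):
    """Build a Google search URL restricted to a domain and optional path."""
    query = 'site:' + domain + path
    # urlencode() percent-encodes the ':' and '/' characters in the query.
    return 'https://www.google.com/search?' + urlencode({'q': query})


whole_site = site_search_url('example.webscraping.com')
country_pages = site_search_url('example.webscraping.com', '/view')

print(whole_site)
print(country_pages)
```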
Identifying the technology used by a website
The type of technology used to build a website will affect how we crawl it. A useful tool to check the kind of technologies a website is built with is the builtwith module, which can be installed with:
pip install builtwith
This module will take a URL, download and analyze it, and then return the technologies used by the website. Here is an example:
>>> import builtwith
>>> builtwith.parse('http://example.webscraping.com')
{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],
 u'programming-languages': [u'Python'],
 u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],
 u'web-servers': [u'Nginx']}
We can see here that the example website uses the Web2py Python web framework alongside some common JavaScript libraries, so its content is likely embedded in the HTML and should be relatively straightforward to scrape. If the website were instead built with AngularJS, then its content would likely be loaded dynamically. Or, if the website used ASP.NET, then it would be necessary to use sessions and form submissions to crawl web pages. Working with these more difficult cases will be covered later in Chapter 5, Dynamic Content, and Chapter 6, Interacting with Forms.
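The reasoning in this paragraph can be written down as a small lookup. The following is a minimal sketch in Python 3 that works on a dictionary shaped like the builtwith output shown above, so it does not need the module or a network connection. The crawl_hints helper is hypothetical, and the labels 'AngularJS' and 'ASP.NET' are assumed spellings rather than values confirmed against the builtwith module.

```python
# Technologies reported for the example website, copied from the output above.
EXAMPLE_TECHNOLOGIES = {
    'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI'],
    'programming-languages': ['Python'],
    'web-frameworks': ['Web2py', 'Twitter Bootstrap'],
    'web-servers': ['Nginx'],
}


def crawl_hints(technologies):
    """Map detected technologies to rough hints about how to crawl."""
    # Flatten every category into one set of lowercase technology names.
    names = {name.lower()
             for values in technologies.values()
             for name in values}
    hints = []
    if 'angularjs' in names:  # assumed label
        hints.append('content is probably loaded dynamically')
    if 'asp.net' in names:  # assumed label
        hints.append('sessions and form submissions are probably needed')
    if not hints:
        hints.append('content is probably embedded in the HTML')
    return hints


print(crawl_hints(EXAMPLE_TECHNOLOGIES))
```

These hints are only a starting point; the actual pages still need to be inspected before deciding how to scrape them.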
Finding the owner of a website
For some websites it may matter to us who the owner is. For example, if the owner is known to block web crawlers, then it would be wise to be more conservative in our download rate. To find who owns a website, we can use the WHOIS protocol to see who is the registered owner of the domain name. There is a Python wrapper to this protocol, documented at https://pypi.python.org/pypi/python-whois, which can be installed via pip:
pip install python-whois
Here is the key part of the WHOIS response when querying the appspot.com domain with this module:
>>> import whois
>>> print whois.whois('appspot.com')
{
  ...
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns4.google.com",
    "ns2.google.com",
    "ns1.google.com",
    "ns3.google.com"
  ],
  "org": "Google Inc.",
  "emails": [
    "[email protected]",
    "[email protected]"
  ]
}
We can see here that this domain is owned by Google, which is correct: this domain is for the Google App Engine service. We would need to be careful when crawling this domain because Google often blocks web crawlers, despite being fundamentally a web crawling business itself.
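One way to act on this information is to pick a download delay based on the registered owner. The following is a minimal sketch in Python 3 that works on a plain dictionary shaped like the response above, so it does not query WHOIS itself. The choose_delay helper, the CAUTIOUS_OWNERS list, and the delay values of 1 and 5 seconds are all assumptions made for illustration.

```python
# Assumption: owners we have decided to crawl more conservatively.
CAUTIOUS_OWNERS = ('google',)

# Part of the WHOIS response shown above, as a plain dictionary.
EXAMPLE_RECORD = {
    'name_servers': ['NS1.GOOGLE.COM', 'NS2.GOOGLE.COM'],
    'org': 'Google Inc.',
}


def choose_delay(whois_record, default_delay=1, cautious_delay=5):
    """Return a download delay in seconds based on the registered owner."""
    # The 'org' field may be missing or empty for some domains.
    org = (whois_record.get('org') or '').lower()
    if any(owner in org for owner in CAUTIOUS_OWNERS):
        return cautious_delay
    return default_delay


print(choose_delay(EXAMPLE_RECORD))
```

A crawl delay requested in the website's robots.txt should still take priority over a value chosen this way.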