Web Scraping with Python

By: Richard Penman
Gap


Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:

Sitemap: http://www.gap.com/products/sitemap_index.xml
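Locating this line by hand works for a single site, but the same check can be scripted. The following is a minimal sketch, assuming the robots.txt body has already been downloaded as a string. The function name find_sitemaps and the sample body are illustrative and not from the book; only the Sitemap line itself comes from the text above.

```python
def find_sitemaps(robots_txt):
    """Return the URLs declared on Sitemap lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop any trailing comment, then split on the first colon only,
        # so the colon inside "http://" stays part of the value.
        line = line.split('#', 1)[0].strip()
        field, sep, value = line.partition(':')
        if sep and field.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls


# Illustrative robots.txt body: the User-agent line is a placeholder,
# the Sitemap line is the one quoted in the text.
sample_robots = """User-agent: *
Sitemap: http://www.gap.com/products/sitemap_index.xml
"""

print(find_sitemaps(sample_robots))
```

The field name is compared case-insensitively because robots.txt files in the wild are not consistent about capitalization.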

Here are the contents of the linked Sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>
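An index like the one above can be read programmatically rather than by eye. The sketch below uses the standard library's xml.etree.ElementTree to pull out each loc URL. It assumes the index has already been downloaded as bytes; the names parse_sitemap_index and SAMPLE_INDEX are illustrative and not from the book.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sitemapindex> root element
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def parse_sitemap_index(xml_bytes):
    """Return the <loc> URL of every <sitemap> entry in a Sitemap index."""
    root = ET.fromstring(xml_bytes)
    # Elements inherit the default namespace, so each tag needs the prefix
    locs = root.findall('sm:sitemap/sm:loc', SITEMAP_NS)
    return [loc.text.strip() for loc in locs]


# The index shown above, embedded as bytes so no network access is needed
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>
"""

for url in parse_sitemap_index(SAMPLE_INDEX):
    print(url)
```

Passing bytes rather than a string lets the parser honor the encoding named in the XML declaration.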

As shown here, this Sitemap link is just an index and contains links to other Sitemap files. These other Sitemap files then...