As mentioned in previous chapters, the robots.txt file should only be used to let search engines know which pages/paths on the website we wish or do not wish to be crawled. Ideally, we would only want our main pages to be crawled and cached by search engines (products, categories, and CMS pages).
The robots.txt file should be updated whenever a page is created that we do not wish to be crawled; however, the following list is a good place to start and will help to reduce the number of unnecessary pages cached by search engines.
Inside the robots.txt file, we would add the following options (one per line, under User-agent: *):
Disallow: /checkout/ # To stop our checkout pages being crawled
Disallow: /review/ # To disallow our product review pages (especially if we are also showing reviews directly on our product pages)
Disallow: /catalogsearch/ # To disallow our search-results pages from being indexed by search engines
Disallow: /catalog/product/view/ # A further...
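Before deploying changes like these, it can be worth checking that the rules behave as intended. As a minimal sketch, Python's standard-library urllib.robotparser can parse the directives and report whether a given path would be blocked for a crawler that honours robots.txt (the example paths below are hypothetical and not taken from this book):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt fragment mirroring the rules discussed above
robots_txt = """\
User-agent: *
Disallow: /checkout/
Disallow: /review/
Disallow: /catalogsearch/
Disallow: /catalog/product/view/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Compliant crawlers should skip the disallowed sections...
print(rp.can_fetch("*", "/checkout/cart/"))        # False
print(rp.can_fetch("*", "/catalogsearch/result/")) # False
# ...but remain free to crawl the main catalogue pages
print(rp.can_fetch("*", "/mens-shoes.html"))       # True
```

Note that robots.txt is advisory: well-behaved crawlers respect it, but it is not an access-control mechanism, and disallowed URLs can still appear in search results if they are linked from elsewhere.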