Python Web Scraping - Second Edition

By: Katharine Jarmul

Overview of this book

The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers. You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites. By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics.
Table of Contents (10 chapters)

Is web scraping legal?

What is legally permissible when web scraping is still being established, despite numerous rulings over the past two decades. If the scraped data is used for personal and private purposes, and within the fair use provisions of copyright law, there is usually no problem. However, if the data is going to be republished, if the scraping is aggressive enough to take down the site, or if the content is copyrighted and the scraper violates the terms of service, then there are several legal precedents to note.

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court decided that scraping and republishing facts, such as telephone listings, is allowed. A similar case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd, demonstrated that only data with an identifiable author can be copyrighted. Another scraped content case in the United States, evaluating the reuse of Associated Press stories for an aggregated news product, was ruled a violation of copyright in Associated Press v. Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular crawling and deep linking are permissible.

There have also been several cases in which companies have accused the defendant of aggressive scraping and attempted to stop the scraping via a legal order. The most recent case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it could not be considered intentional harm, despite the crawler activity leading to some site stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business locations and telephone listings), it can be republished following fair use rules. However, if the data is original (such as opinions and reviews or private user data), it most likely cannot be republished for copyright reasons. In any case, when you are scraping data from a website, remember you are their guest and need to behave politely; otherwise, they may ban your IP address or proceed with legal action. This means you should make download requests at a reasonable rate and define a user agent to identify your crawler. You should also take measures to review the Terms of Service of the site and ensure the data you are taking is not considered private or copyrighted.
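The two technical courtesies mentioned above, a reasonable request rate and an identifying user agent, might be sketched as follows. This is only a minimal illustration using the Python 3 standard library, not code from the book: the names Throttle, build_request, and polite_download, the delay value, and the user agent string are all assumptions chosen for the example.

```python
import time
import urllib.request
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the waiting logic can be exercised without real delays.
    """

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay      # minimum seconds between hits on one domain
        self.clock = clock
        self.sleep = sleep
        self.last_seen = {}     # domain -> time of the most recent request

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                # This domain was accessed too recently, so pause first.
                self.sleep(remaining)
        self.last_seen[domain] = self.clock()


def build_request(url, user_agent):
    """Create a request that identifies the crawler through User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_download(url, throttle, user_agent):
    """Wait out the per-domain delay, then download the page body as bytes."""
    throttle.wait(url)
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


# Example usage (placeholder URL and user agent, not run here):
# throttle = Throttle(delay=5)
# html = polite_download("http://example.com/", throttle,
#                        "my-crawler/0.1 (contact: admin@example.com)")
```

Keeping the delay per domain rather than global means a crawl that spans several sites only slows down when it returns to a site it has contacted recently. A suitable delay depends on the site, so treat any fixed number as a starting point.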

If you have doubts or questions, it may be worthwhile to consult a media lawyer regarding the precedents in your area of residence.

You can read more about these legal cases at the following sites: