Web scraping is now widely used to collect data from websites. Whether the target is e-mail addresses, contact information, or the selling prices of items, web scraping techniques let us gather large amounts of data with minimal effort. They also require no database or other backend access, because the data is already presented as web pages.
Beautiful Soup allows us to get data from HTML and XML pages. This book explains how to install Beautiful Soup and how to create a sample website scraper with it. Searching and navigation methods are explained with the help of simple examples, screenshots, and code samples. The parsers supported by Beautiful Soup, its support for scraping pages with different encodings, formatting the output, and other tasks related to scraping a page are all explained in detail. In addition, practical approaches to understanding patterns on a page using the developer tools in browsers will enable you to write similar scrapers for any other website.
Also, the practical approach followed in this book will help you to design a simple web scraper to scrape and compare the selling prices of various books from three websites, namely, Amazon, Barnes and Noble, and PacktPub.
Chapter 1, Installing Beautiful Soup, covers installing Beautiful Soup 4 on Windows, Linux, and Mac OS, and verifying the installation.
Chapter 2, Creating a BeautifulSoup Object, describes creating a BeautifulSoup object from a string, file, and web page; discusses different objects such as Tag and NavigableString, as well as parser support; and specifies parsers that can scrape XML too.
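As a rough illustration of the objects named above, the following minimal sketch builds a BeautifulSoup object from a string and inspects a Tag and a NavigableString. The markup is hypothetical, and Python's built-in "html.parser" is assumed here only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, used only for illustration.
html = "<p>Hello <b>world</b></p>"

# "html.parser" ships with Python; other parsers such as lxml are optional.
soup = BeautifulSoup(html, "html.parser")

first_tag = soup.p                # the first <p> element, a Tag object
first_text = soup.p.contents[0]   # the text directly inside <p>, a NavigableString

print(type(first_tag).__name__, first_tag.name)
print(type(first_text).__name__, repr(first_text))
```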
Chapter 3, Search Using Beautiful Soup, discusses in detail the different search methods in Beautiful Soup, namely, find(), find_all(), find_next(), and find_parents(); code examples for a scraper using search methods to get information from a website; and understanding the application of search methods in combination.
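The four search methods listed above can be sketched on a small document as follows. The markup, the "book" class, and the "catalog" id are assumptions made up for this example, not taken from any of the websites scraped in the book.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<div id="catalog">
  <ul>
    <li class="book">Title A</li>
    <li class="book">Title B</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_book = soup.find("li", class_="book")      # first matching tag only
all_books = soup.find_all("li", class_="book")   # every matching tag
next_book = first_book.find_next("li")           # next <li> in document order
div_parents = first_book.find_parents("div")     # enclosing <div> ancestors

print(first_book.get_text())
print(len(all_books))
print(next_book.get_text())
print(div_parents[0]["id"])
```

Combining methods in this way, for example calling find_next() on a result of find(), is one simple form of using search methods in combination.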
Chapter 4, Navigation Using Beautiful Soup, discusses in detail the different navigation methods provided by Beautiful Soup, including methods for navigating downwards, upwards, and sideways to the previous and next elements of the HTML tree.
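A minimal sketch of downward, upward, and sideways navigation might look like the following. The list markup is hypothetical and is written without whitespace between tags so that the siblings are tags and not whitespace strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between tags keeps the siblings simple.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

middle = soup.find_all("li")[1]

parent_name = middle.parent.name                           # upwards
previous_text = middle.previous_sibling.string             # sideways, backwards
next_text = middle.next_sibling.string                     # sideways, forwards
child_texts = [child.string for child in soup.ul.children] # downwards

print(parent_name, previous_text, next_text, child_texts)
```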
Chapter 5, Modifying Content Using Beautiful Soup, discusses modifying the HTML tree using Beautiful Soup, and the creation and deletion of HTML tags. Altering the HTML tag attributes is also covered with the help of simple examples.
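The kinds of modification mentioned above (changing an attribute, creating a tag, and deleting a tag) can be sketched as follows. The markup and the attribute values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<p class="old">Price: <span>10</span></p>'
soup = BeautifulSoup(html, "html.parser")

# Alter an attribute of an existing tag.
soup.p["class"] = "new"

# Create a new tag, give it text, and append it to <p>.
new_tag = soup.new_tag("em")
new_tag.string = "sale"
soup.p.append(new_tag)

# Delete the <span> tag and its contents from the tree.
soup.span.decompose()

print(soup)
```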
Chapter 6, Encoding Support in Beautiful Soup, discusses the encoding support in Beautiful Soup, creating a BeautifulSoup object for a page with a specific encoding, and the encoding support for output.
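As a small sketch of both directions, the example below parses bytes in a known encoding and then writes the document out in another. The page content is hypothetical, and ISO-8859-1 is assumed only for the sake of the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded as ISO-8859-1 bytes for illustration.
raw = "<p>caf\u00e9</p>".encode("iso-8859-1")

# from_encoding tells Beautiful Soup which encoding to try for the input bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

text = soup.p.string               # text is held as Unicode internally
utf8_bytes = soup.encode("utf-8")  # output encoded as UTF-8 bytes

print(soup.original_encoding)
print(text)
print(utf8_bytes)
```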
Chapter 7, Output in Beautiful Soup, discusses formatted and unformatted printing support in Beautiful Soup, specifications of different formatters to format the output, and getting just text from an HTML page.
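The difference between formatted output, unformatted output, formatter choice, and text-only output can be sketched as follows. The markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<div><p>Tom &amp; Jerry</p></div>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()                   # formatted: one element per line, indented
compact = str(soup)                        # unformatted: no added whitespace
unescaped = soup.p.decode(formatter=None)  # formatter=None skips entity escaping
text = soup.get_text()                     # text only, with all tags stripped

print(pretty)
print(compact)
print(unescaped)
print(text)
```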
Chapter 8, Creating a Web Scraper, discusses creating a web scraper for three websites, namely, Amazon, Barnes and Noble, and PacktPub, to get the book selling price based on ISBN. Searching and navigation methods used to create the parser, use of developer tools so as to identify the patterns required to create the parser, and the full code sample for scraping the mentioned websites are also explained in this chapter.
You will need Python Version 2.7.5 or higher and Beautiful Soup Version 4 for this book.
For Chapter 3, Search Using Beautiful Soup and Chapter 8, Creating a Web Scraper, you must have an Internet connection to scrape different websites using the code examples provided.
This book is for beginners in web scraping using Beautiful Soup. Knowing the basics of Python programming (such as functions, variables, and values) and the basics of HTML and CSS is important for following all of the steps in this book. Although it is not mandatory, knowing how to use the developer tools in browsers such as Google Chrome and Firefox will be an advantage when working through the scraper examples in chapters 3 and 8.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The prettify() method can be called either on a Beautiful Soup object or any of the Tag objects."
A block of code is set as follows:
from bs4 import BeautifulSoup

html_markup = """<html>
<body>& & ampersand
¢ ¢ cent
© © copyright
÷ ÷ divide
> > greater than
</body>
</html>
"""
soup = BeautifulSoup(html_markup, "lxml")
print(soup.prettify())
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
UserWarning: "http://www.packtpub.com/books" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup
Any command-line input or output is written as follows:
sudo easy_install beautifulsoup4
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "The output methods in Beautiful Soup escape only the HTML entities of >, <, and & as &gt;, &lt;, and &amp;."
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.