Getting Started with Beautiful Soup

By: Vineeth G Nair

Overview of this book

Beautiful Soup is a Python library designed for quick-turnaround projects such as screen scraping. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need without writing excess code.

Getting Started with Beautiful Soup is a practical guide to using Beautiful Soup with Python. The book starts by walking you through the different ways to install Beautiful Soup on both Linux and Windows systems. It then covers each feature of Beautiful Soup using simple examples, with sample Python code and, wherever they aid understanding, diagrams and screenshots. You will learn about searching, navigating, content modification, encoding support, and output formatting, with sample code for each topic that you can try out yourself. The book also discusses the problem of getting data out of a website and works through a solution using a real website and sample code.

If you want to learn how to scrape pages from websites efficiently, then this book is for you.

Using get_text()

Extracting just the text from a web page is a common task, and Beautiful Soup provides the get_text() method for this purpose.

Calling get_text() on a BeautifulSoup or Tag object returns only the text inside it. For example:

from bs4 import BeautifulSoup

html_markup = """<p class="ecopyramid">
<ul id="producers">
  <li class="producerlist">
    <div class="name">plants</div>
    <div class="number">100000</div>
  </li>
  <li class="producerlist">
    <div class="name">algae</div>
    <div class="number">100000</div>
  </li>
</ul>"""
# Parse the markup with the lxml parser and print only its text
soup = BeautifulSoup(html_markup, "lxml")
print(soup.get_text())

#output (indentation and some blank lines omitted)
plants
100000

algae
100000
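Because get_text() keeps the whitespace that sits between tags, the raw result usually needs cleaning up. The following is a minimal sketch, not taken from the book, of the optional separator and strip arguments of get_text() and of the related stripped_strings generator. The variable names are illustrative, and the built-in "html.parser" is used here only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

html_markup = """<ul id="producers">
  <li class="producerlist">
    <div class="name">plants</div>
    <div class="number">100000</div>
  </li>
  <li class="producerlist">
    <div class="name">algae</div>
    <div class="number">100000</div>
  </li>
</ul>"""

# "html.parser" is an assumption made for portability; "lxml" works too.
soup = BeautifulSoup(html_markup, "html.parser")

# strip=True trims each piece of text and drops whitespace-only pieces;
# the first argument is the separator used to join the remaining pieces.
all_text = soup.get_text(" ", strip=True)
print(all_text)    # plants 100000 algae 100000

# get_text() also works on a single Tag object.
first_item = soup.find("li").get_text(" ", strip=True)
print(first_item)  # plants 100000

# stripped_strings yields the same cleaned pieces one at a time.
pieces = list(soup.stripped_strings)
print(pieces)      # ['plants', '100000', 'algae', '100000']
```

Joining with a separator is handy when the text will be processed further, since the individual values stay distinguishable instead of running together.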

The get_text() method returns the text inside the BeautifulSoup or Tag object as a single Unicode string. However, get_text() can return more than the visible text of a web page: pages often contain JavaScript code, and get_text() returns that code as well. For example, in Chapter...
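One common way to keep JavaScript out of the extracted text is to delete the script tags from the tree before calling get_text(). The sketch below is an illustration under stated assumptions, not the book's own listing: the markup and variable names are made up, style tags are removed as well, and "html.parser" is chosen only for portability. Newer Beautiful Soup releases may already skip script and style contents with some parsers; removing the tags explicitly does not rely on that behavior.

```python
from bs4 import BeautifulSoup

# Hypothetical page containing a script and a style block.
html_markup = """<html><head>
<script>var tracker = "alert";</script>
<style>p { color: green; }</style>
</head><body><p>plants</p><p>algae</p></body></html>"""

soup = BeautifulSoup(html_markup, "html.parser")

# Remove every <script> and <style> tag, together with its contents.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# Only the human-readable text is left in the tree.
text = soup.get_text(" ", strip=True)
print(text)  # plants algae
```

Note that decompose() modifies the soup in place, so parse the document again if the original tags are still needed afterwards.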
