Modern Python Cookbook

Modern Python Cookbook - Third Edition

By: Steven F. Lott

Overview of this book

Python is the go-to language for developers, engineers, data scientists, and hobbyists worldwide. Known for its versatility, Python can efficiently power applications, offering remarkable speed, safety, and scalability. This book distills Python into a collection of straightforward recipes, providing insights into specific language features within various contexts, making it an indispensable resource for mastering Python and using it to handle real-world use cases. The third edition of Modern Python Cookbook provides an in-depth look into Python 3.12, offering more than 140 new and updated recipes that cater to both beginners and experienced developers. This edition introduces new chapters on documentation and style, data visualization with Matplotlib and Pyplot, and advanced dependency management techniques using tools like Poetry and Anaconda. With practical examples and detailed explanations, this cookbook helps developers solve real-world problems, optimize their code, and get up to date with the latest Python features.

1.8 Decoding bytes – how to get proper characters from some bytes

How can we work with files that aren’t properly encoded? What do we do with files written in ASCII encoding?

A download from the internet is almost always in bytes—not characters. How do we decode the characters from that stream of bytes?

Also, when we use the subprocess module, the results of an OS command are in bytes. How can we recover proper characters?
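As a minimal sketch of the subprocess case, the following runs a child Python process as a stand-in for an arbitrary OS command and decodes what it prints. The command and the choice of UTF-8 are assumptions made for illustration; a real command may use whatever encoding the OS is configured for.

```python
import subprocess
import sys

# Run a child Python process that prints one line.
# It stands in for any OS command whose output we want to read.
result = subprocess.run(
    [sys.executable, "-c", "print('gale warning')"],
    capture_output=True,
    check=True,
)

raw_output = result.stdout                # bytes, as delivered by the OS pipe
text_output = raw_output.decode("utf-8")  # str, after explicit decoding

print(type(raw_output).__name__, type(text_output).__name__)
print(text_output.strip())
```

The captured output stays as bytes until we decode it; the decoding step is the same one used for downloaded data in the rest of this recipe.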

Much of this is also relevant to the material in Chapter 11. We’ve included this recipe here because it’s the inverse of the previous recipe, Encoding strings – creating ASCII and UTF-8 bytes.

1.8.1 Getting ready

Let’s say we’re interested in offshore marine weather forecasts. Perhaps this is because we are departing the Chesapeake Bay for the Caribbean.

Are there any special warnings coming from the National Weather Service office in Wakefield, Virginia?

Here’s the link: https://forecast.weather.gov/product.php?site=AKQ&product=SMW&issuedby=AKQ.

We can download this with Python’s urllib module:

>>> import urllib.request
>>> warnings_uri = (
...     'https://forecast.weather.gov/'
...     'product.php?site=AKQ&product=SMW&issuedby=AKQ'
... )
>>> with urllib.request.urlopen(warnings_uri) as source:
...     forecast_text = source.read()

Note that we’ve enclosed the URI string in () and broken it into two separate string literals. Python will concatenate these two adjacent literals into a single string. We’ll look at this in some depth in Chapter 2.
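A quick way to confirm the concatenation is to compare the two-literal form with the same URI written as one literal. This sketch makes no network request; the use of urllib.parse.urlsplit() to inspect the result is an addition for illustration.

```python
from urllib.parse import urlsplit

# Two adjacent literals inside () are joined by Python into one string.
warnings_uri = (
    'https://forecast.weather.gov/'
    'product.php?site=AKQ&product=SMW&issuedby=AKQ'
)

# The same URI written as a single literal.
single_literal = 'https://forecast.weather.gov/product.php?site=AKQ&product=SMW&issuedby=AKQ'

print(warnings_uri == single_literal)
parts = urlsplit(warnings_uri)
print(parts.netloc, parts.path, parts.query)
```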

As an alternative, we can use programs like curl or wget to get this. At the OS Terminal prompt, we might run the following (long) command:

(cookbook3) % curl 'https://forecast.weather.gov/product.php?site=AKQ&product=SMW&issuedby=AKQ' -o AKQ.html

Typesetting this book tends to break the command onto many lines. It’s really one very long line.

The code repository includes a sample file, ch01/Text Products for SMW Issued by AKQ.html.

The forecast_text value is a stream of bytes. It’s not a proper string. We can tell because it starts like this:

>>> forecast_text[:80]
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/x'

The data goes on for a while, providing details from the web page. Because the displayed value starts with b', it’s bytes, not proper Unicode characters. It was probably encoded with UTF-8, which means some characters could appear as weird-looking \xnn escape sequences instead of proper characters. We want to have the proper characters.

Generally, a bytes object behaves somewhat like a string object. Sometimes, we can work with bytes directly. Most of the time, we’ll want to decode the bytes and create proper Unicode characters from them.
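To make the difference concrete, here is a small sketch using a made-up value, "37°N", rather than the forecast data. It shows where bytes resemble a string and where they differ.

```python
# A short piece of text with one non-ASCII character, encoded as UTF-8.
raw = "37°N".encode("utf-8")

print(raw)                    # b'37\xc2\xb0N'  (the degree sign takes two bytes)
print(len(raw))               # 5 bytes
print(raw[0])                 # 51, an integer byte value, not a character
print(raw.startswith(b"37"))  # True; many string-like methods exist on bytes

text = raw.decode("utf-8")
print(len(text))              # 4 characters
print(text[0])                # 3, a one-character string
```

Indexing a bytes object yields integers, and the length counts bytes rather than characters. Both are reasons to decode before doing any text processing.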

1.8.2 How to do it...

  1. Determine the encoding scheme if possible. In order to decode bytes to create proper Unicode characters, we need to know what encoding scheme was used. When we read XML documents, there’s a big hint provided within the document:

    <?xml version="1.0" encoding="UTF-8"?>

    When browsing web pages, there’s often a header containing this information:

    Content-Type: text/html; charset=ISO-8859-4

    Sometimes, an HTML page may include this as part of the header:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    In other cases, we’re left to guess. In the case of US weather data, a good first guess is UTF-8. Another good guess is ISO-8859-1. In some cases, the guess will depend on the language.

  2. The codecs — Codec registry and base classes section of the Python Standard Library lists the standard encodings available. Decode the data:

    >>> document = forecast_text.decode("UTF-8")
    >>> document[:80]
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/x'

    The b' prefix is gone, which shows that the value is no longer bytes. We’ve created a proper string of Unicode characters from the stream of bytes.

  3. If this step fails with a UnicodeDecodeError exception, we guessed wrong about the encoding. We need to try another encoding before we can parse the document.
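The steps above can be sketched as a pair of small helper functions. The names charset_from_content_type() and decode_with_fallback() are hypothetical, invented for this illustration. The order of the guesses (UTF-8, then ISO-8859-1) follows the suggestion in step 1 and is an assumption, not a rule.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    headers = Message()
    headers["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return headers.get_content_charset()


def decode_with_fallback(
    data: bytes,
    declared: Optional[str] = None,
    guesses: tuple = ("utf-8", "iso-8859-1"),
) -> tuple:
    """Try the declared encoding first, then each guess in order.

    Returns the decoded text and the name of the encoding that worked.
    """
    candidates = ([declared] if declared else []) + list(guesses)
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python doesn't know.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(charset_from_content_type("text/html; charset=ISO-8859-4"))
print(decode_with_fallback("37°N".encode("utf-8")))
print(decode_with_fallback("37°N".encode("iso-8859-1")))
```

One caution about this design: ISO-8859-1 assigns a character to every possible byte value, so it never raises an exception. It works as a last resort, but a successful decode with it doesn't prove the guess was right.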

Since this is an HTML document, we should use Beautiful Soup to extract the data. See http://www.crummy.com/software/BeautifulSoup/.

We can, however, extract one nugget of information from this document without completely parsing the HTML:

>>> import re
>>> content_pattern = re.compile(r"// CONTENT STARTS(.*?)// CONTENT ENDS", re.MULTILINE | re.DOTALL)
>>> content_pattern.search(document)
<re.Match object; span=(8530, 9113), match='// CONTENT STARTS HERE -->\n\n<span style="font-s>

This tells us what we need to know: there are no warnings at this time. This doesn’t mean smooth sailing, but it does mean that there aren’t any major weather systems that could cause catastrophes.
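The match object's repr shows only the first few characters. To see the captured text itself, we can ask for group 1. The sketch below uses a made-up HTML fragment in place of the real page, so the sample text is an assumption and not the actual forecast content.

```python
import re

content_pattern = re.compile(
    r"// CONTENT STARTS(.*?)// CONTENT ENDS", re.MULTILINE | re.DOTALL
)

# A hypothetical fragment shaped like the page's content markers.
sample_document = (
    "<html><body>\n"
    "<!-- // CONTENT STARTS HERE -->\n"
    "<span>No warnings in effect.</span>\n"
    "<!-- // CONTENT ENDS HERE -->\n"
    "</body></html>"
)

match = content_pattern.search(sample_document)
if match:
    # group(1) is the text between the two markers, HTML tags included.
    captured = match.group(1)
    print(captured)
else:
    captured = None
    print("content markers not found")
```

The captured text still contains markup, which is why a real HTML parser is the better tool for anything beyond a quick look.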

1.8.3 How it works...

See the Encoding strings – creating ASCII and UTF-8 bytes recipe for more information on Unicode and the different ways that Unicode characters can be encoded into streams of bytes.

At the foundation of the OS, files and network connections are built up from bytes. It’s our software that decodes the bytes to discover the content. It might be characters, images, or sounds. In some cases, the default assumptions are wrong and we need to do our own decoding.
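Here is a small sketch of the same idea using a temporary file. The file name and its content are made up for illustration. The same bytes produce different characters depending on which encoding we tell Python to use.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "forecast.txt"  # hypothetical file name

    # On disk, the file is nothing more than bytes.
    path.write_bytes("Seas 2 m, 37°N".encode("utf-8"))

    raw = path.read_bytes()                         # no decoding at all
    text = path.read_text(encoding="utf-8")         # the right assumption
    wrong = path.read_text(encoding="iso-8859-1")   # a wrong assumption

print(raw)    # b'Seas 2 m, 37\xc2\xb0N'
print(text)   # Seas 2 m, 37°N
print(wrong)  # Seas 2 m, 37Â°N
```

The wrong encoding raises no exception here. It quietly produces an extra character, which is why stating the encoding explicitly is safer than relying on a default.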

1.8.4 See also
