Modern Python Cookbook

Overview of this book

Python is the preferred choice of developers, engineers, data scientists, and hobbyists everywhere. It is a great scripting language that can power your applications and provide great speed, safety, and scalability. By exploring Python through a series of simple recipes, you can gain insight into specific language features in a particular context. Having a tangible context helps make the language or standard library feature easier to understand. This book comes with over 100 recipes on the latest version of Python. The recipes will benefit everyone from beginners to experts. The book is broken down into 13 chapters that build from simple language concepts to more complex applications of the language. The recipes touch upon all the necessary Python concepts related to data structures, OOP, functional programming, and statistical programming. You will get acquainted with the nuances of Python syntax and how to use its advantages effectively. You will end the book equipped with knowledge of testing, web services, and configuration and application integration tips and tricks. The recipes take a problem-solution approach to resolve issues commonly faced by Python programmers across the globe. You will be armed with the knowledge to create applications with flexible logging, powerful configuration and command-line options, automated unit tests, and good documentation.

Decoding bytes – how to get proper characters from some bytes


How can we work with files that aren't properly encoded? What do we do with files written in the ASCII encoding?

A download from the Internet is almost always in bytes—not characters. How do we decode the characters from that stream of bytes?

Also, when we use the subprocess module, the results of an OS command are in bytes. How can we recover proper characters?

Much of this is also relevant to the material in Chapter 8, Input/Output, Physical Format, Logical Layout. We've included the recipe here because it's the inverse of the previous recipe, Encoding strings – creating ASCII and UTF-8 bytes.
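
For the subprocess case mentioned above, here is a minimal sketch. It assumes a child Python process as a stand-in for an arbitrary OS command, and it assumes the command writes UTF-8 (here, plain ASCII) output; the message text is made up for illustration.

```python
import subprocess
import sys

# Run a child Python process as a stand-in for any OS command.
# The printed message is a hypothetical example, not real forecast data.
result = subprocess.run(
    [sys.executable, "-c", "print('small craft advisory')"],
    stdout=subprocess.PIPE,
    check=True,
)

raw = result.stdout                  # captured output arrives as bytes
text = raw.decode("utf-8").strip()   # decode, then drop the trailing newline
print(type(raw).__name__, text)
```

The captured output is a bytes object until we decode it explicitly, which is the same situation we face with a download.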

Getting ready

Let's say we're interested in offshore marine weather forecasts, perhaps because we own a large sailboat, or because good friends of ours have a large sailboat and are departing the Chesapeake Bay for the Caribbean.

Are there any special warnings coming from the National Weather Service office in Wakefield, Virginia?

Here's where we can get the warnings: http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ.

We can download this with Python's urllib module:

>>> import urllib.request
>>> warnings_uri = 'http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ'
>>> with urllib.request.urlopen(warnings_uri) as source:
...     warnings_text = source.read()

Or, we can use programs like curl or wget to get this. We might do:

curl -O http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ
mv national.php\?prod\=SMW AKQ.html

Since curl left us with an awkward file name, we needed to rename the file.

The warnings_text value is a stream of bytes. It's not a proper string. We can tell because it starts like this:

>>> warnings_text[:80]
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or'

And goes on for a while providing details. Because it starts with b', it's bytes, not proper Unicode characters. It was probably encoded with UTF-8, which means some characters could have weird-looking \xnn escape sequences instead of proper characters. We want to have the proper characters.

Note

Bytes vs. strings: Bytes are often displayed using printable characters. We'll see b'hello' as shorthand for a five-byte value. The letters are chosen using the old ASCII encoding scheme. Byte values from 0x20 to 0x7E are shown as characters; most other values appear as \xnn escape sequences. This can be confusing. The prefix of b' is our hint that we're looking at bytes, not proper Unicode characters.

Generally, bytes behave somewhat like strings. Sometimes we can work with bytes directly. Most of the time, we'll want to decode the bytes and create proper Unicode characters.
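
To make the difference concrete, here is a small sketch. The sample value is made up for illustration and is not taken from the weather download.

```python
# Illustrative value only: "caf" followed by e-acute (U+00E9).
data = "caf\u00e9".encode("utf-8")   # five bytes: c, a, f, then two bytes for the accented e

# The repr shows printable ASCII bytes as letters and the rest as \xNN escapes.
print(data)                          # b'caf\xc3\xa9'

# Bytes support many string-like operations...
shouted = data.upper()               # only the ASCII letters change
starts = data.startswith(b"caf")

# ...but indexing yields an integer, not a one-character string.
first = data[0]                      # 99, the ASCII code for "c"

# Decoding produces proper Unicode characters.
text = data.decode("utf-8")
print(len(data), len(text))          # 5 bytes, 4 characters
```

The byte count and the character count differ because UTF-8 uses two bytes for the accented letter.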

How to do it...

  1. Determine the coding scheme if possible. In order to decode bytes to create proper Unicode characters, we need to know what encoding scheme was used. When we read XML documents, there's a big hint provided within the document:

<?xml version="1.0" encoding="UTF-8"?>

When browsing web pages, there's often a header with this information:

Content-Type: text/html; charset=ISO-8859-4

Sometimes an HTML page may include this as part of the header:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In other cases, we're left to guess. In the case of US Weather data, a good first guess is UTF-8. Other good guesses include ISO-8859-1. In some cases, the guess will depend on the language. Section 7.2.3, Python Standard Library, lists the standard encodings available.

  2. Decode the data:

>>> document = warnings_text.decode("UTF-8")
>>> document[:80]
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or'

The b' prefix is gone. We've created a proper string of Unicode characters from the stream of bytes.

If this step fails with an exception, we guessed wrong about the encoding. We need to try another encoding.

  3. Parse the resulting document.

Since this is an HTML document, we should use Beautiful Soup. See http://www.crummy.com/software/BeautifulSoup/.

We can, however, extract one nugget of information from this document without completely parsing the HTML:

>>> import re
>>> title_pattern = re.compile(r"\<h3\>(.*?)\</h3\>")
>>> title_pattern.search(document)
<_sre.SRE_Match object; span=(3438, 3489), match='<h3>There are no products active at this time.</h>

This tells us what we need to know: there are no warnings at this time. That doesn't mean smooth sailing, but it does mean that there aren't any major weather systems that can cause catastrophes.
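
The first two steps above can be sketched as a pair of small helpers. This is one possible approach, not the recipe's own code: the function names charset_from_content_type and decode_with_fallback are hypothetical, and the sample bytes are made up rather than taken from the weather page.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


def decode_with_fallback(data, candidates=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"none of {candidates!r} could decode the data")


# Step 1: look for a hint in the header.
hinted = charset_from_content_type("text/html; charset=ISO-8859-4")

# Step 2: decode, falling back when the first guess raises an exception.
# A lone 0xE9 byte is not valid UTF-8, so the first guess fails here.
sample = b"<h3>caf\xe9</h3>"
text, used = decode_with_fallback(sample)
print(hinted, used, text)
```

ISO-8859-1 assigns a character to every byte value, so it never raises an exception. For that reason it belongs at the end of the candidate list; a successful decode with it does not prove the guess was right.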

How it works...

See the Encoding strings – creating ASCII and UTF-8 bytes recipe for more information on Unicode and the different ways that Unicode characters can be encoded into streams of bytes.

At the foundation of the operating system, files and network connections are built up from bytes. It's our software that decodes the bytes to discover the content. It might be characters, or images, or sounds. In some cases, the default assumptions are wrong and we need to do our own decoding.
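
As a small illustration of a wrong assumption, the following sketch writes some bytes to a temporary file and decodes them two ways. The file name and the sample text are hypothetical.

```python
import tempfile
from pathlib import Path

original = "Temperature 21\u00b0C"   # hypothetical text containing a degree sign

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample.txt"           # hypothetical file name
    path.write_bytes(original.encode("utf-8"))   # the file holds bytes, nothing more

    raw = path.read_bytes()                      # the bytes exactly as stored
    right = raw.decode("utf-8")                  # correct assumption
    wrong = raw.decode("iso-8859-1")             # wrong assumption, yet no exception
    via_open = path.read_text(encoding="utf-8")  # let Python do the decoding

print(repr(right))
print(repr(wrong))   # an extra character appears before the degree sign
```

Decoding with the wrong encoding does not always fail loudly. Here it silently produces an extra character, which is why it helps to name the encoding explicitly instead of relying on a default.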

See also