Book Image

Learning Python Network Programming

By : Dr. M. O. Faruque Sarker, Samuel B Washington, Sam Washington
Book Image

Learning Python Network Programming

By: Dr. M. O. Faruque Sarker, Samuel B Washington, Sam Washington

Overview of this book

Table of Contents (17 chapters)
Learning Python Network Programming
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Breaking a few eggs


The power of the layer model of network protocols is that a higher layer can easily build on the services provided by the lower layers and this enables them to add new services to the network. Python provides modules for interfacing with protocols at different levels in the network stack, and modules that support higher-layer protocols follow the aforementioned principle by using the interfaces supplied by the lower level protocols. How can we visualize this?

Well, sometimes a good way to see inside something like this is by breaking it. So, let's break Python's network stack. Or, more specifically, let's generate a traceback.

Yes, this means that the first piece of Python that we're going to write is going to generate an exception. But, it will be a good exception. We'll learn from it. So, fire up your Python shell and run the following command:

>>> import smtplib
>>> smtplib.SMTP('127.0.0.1', port=66000)

What are we doing here? We are importing smtplib, which is Python's standard library for working with the SMTP protocol. SMTP is an application layer protocol, which is used for sending e-mails. We will then try to open an SMTP connection by instantiating an SMTP object. We want the connection to fail and that is why we've specified the port number 66000, which is an invalid port. We will specify the local host for the connection, as this will cause it to fail quickly, rather than make it wait for a network timeout.

On running the preceding command, you should get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/smtplib.py", line 242, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib/python3.4/smtplib.py", line 321, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib/python3.4/smtplib.py", line 292, in _get_socket
    self.source_address)
  File "/usr/lib/python3.4/socket.py", line 509, in create_connection
    raise err
  File "/usr/lib/python3.4/socket.py", line 500, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

This was generated by using Python 3.4.1 on a Debian 7 machine. The final error message will be slightly different from this if you run this on Windows, but the stack trace will remain the same.

Inspecting it will reveal how the Python network modules act as a stack. We can see that the call stack starts in smtplib.py, and then as we go down, it moves into socket.py. The socket module is Python's standard interface for the transport layer, and it provides the functions for interacting with TCP and UDP as well as for looking up hostnames through DNS. We'll learn much more about this in Chapter 7, Programming with Sockets, and Chapter 8, Client and Server Applications.

From the preceding program, it's clear that the smtplib module calls into the socket module. The application layer protocol has employed a transport layer protocol (which in this case is TCP).

Right at the bottom of the traceback, we can see the exception itself and the Errno 111. This is an error message from the operating system. You can verify this by going through /usr/include/asm-generic/errno.h (asm/errno.h on some systems) for the error message number 111 (on Windows the error will be a WinError, so you can see that it has clearly been generated by the OS). From this error message we can see that the socket module is calling down yet again and asking the operating system to manage the TCP connection for it.

Python's network modules are working as the protocol stack designers intended them to. They call on the lower levels in the stack to employ their services to perform the network tasks. We can work by using simple calls made to the application layer protocol, which in this case is SMTP, without having to worry about the underlying network layers. This is network encapsulation in action, and we want to make as much use of this as we can in our applications.

Taking it from the top

Before we start writing code for a new network application, we want to make sure that we're taking as much advantage of the existing stack as possible. This means finding a module that provides an interface to the services that we want to use, and that is as high up the stack as we can find. If we're lucky, someone has already written a module that provides an interface that provides the exact service we need.

Let's use an example to illustrate this process. Let's write a tool for downloading Request for Comments (RFC) documents from IETF, and then display them on screen.

Let's keep the RFC downloader simple. We'll make it a command-line program that just accepts an RFC number, downloads the RFC in text format, and then prints it to stdout.

Now, it's possible that somebody has already written a module for doing this, so let's see if we can find anything.

The first place we look should always be the Python standard library. The modules in the library are well maintained, and well documented. When we use a standard library module, the users of your application won't need to install any additional dependencies for running it.

A look through the Library Reference at https://docs.python.org doesn't seem to show anything directly relevant to our requirement. This is not entirely surprising!

So, next we will turn to third-party modules. The Python package index, which can be found at https://pypi.python.org, is the place where we should look for these. Here as well, running a few searches around the theme of RFC client and RFC download doesn't seem to reveal anything useful. The next place to look will be Google, though again, the searches don't reveal anything promising. This is slightly disappointing, but this is why we're learning network programming, to fill these gaps!

There are other ways in which we may be able to find out about useful third-party modules, including mailing lists, Python user groups, the programming Q&A site http://stackoverflow.com, and programming textbooks.

For now, let's assume that we really can't find a module for downloading RFCs. What next? Well, we need to think lower in the network stack. This means that we need to identify the network protocol that we'll need to use for getting hold of the RFCs in text format by ourselves.

The IETF landing page for RFCs is http://www.ietf.org/rfc.html, and reading through it tell us exactly what we want to know. We can access a text version of an RFC using a URL of the form http://www.ietf.org/rfc/rfc741.txt. The RFC number in this case is 741. So, we can get text format of RFCs using HTTP.

Now, we need a module that can speak HTTP for us. We should look at the standard library again. You will notice that there is, in fact, a module called http. Sounds promising, though looking at its documentation will tell us that it's a low level library and that something called urllib will prove to be more useful.

Now, looking at the urllib documentation, we find that it does indeed do what we need. It downloads the target of a URL through a straightforward API. We've found our protocol module.

Downloading an RFC

Now we can write our program. For this, create a text file called RFC_downloader.py and save the following code to it:

import sys, urllib.request

try:
    rfc_number = int(sys.argv[1])
except (IndexError, ValueError):
    print('Must supply an RFC number as first argument')
    sys.exit(2)

template = 'http://www.ietf.org/rfc/rfc{}.txt'
url = template.format(rfc_number)
rfc_raw = urllib.request.urlopen(url).read()
rfc = rfc_raw.decode()
print(rfc)

We can run the preceding code by using the following command:

$ python RFC_downloader.py 2324 | less

On Windows, you'll need to use more instead of less. RFCs can run to many pages, hence we use a pager here. If you try this, then you should see some useful information on the remote control of coffee pots.

Let's go through our code and look at what we've done so far.

First, we import our modules and check whether an RFC number has been supplied on the command line. Then, we construct our URL by substituting the supplied RFC number. Next, the main activity, the urlopen() call will construct an HTTP request for our URL, and then it will contact the IETF web server over the Internet and download the RFC text. Next, we decode the text to Unicode, and finally we print it out to screen.

So, we can easily view any RFC that we like from the command line. In retrospect, it's not entirely surprising that there isn't a module for this, because we can use urllib to do most of the hard work!

Looking deeper

But, what if HTTP was brand new and there were no modules, such as urllib, which we could use to speak HTTP for us? Well, then we would have to step down the stack again and use TCP for our purposes. Let's modify our program according to this scenario, as follows:

import sys, socket

try:
    rfc_number = int(sys.argv[1])
except (IndexError, ValueError):
    print('Must supply an RFC number as first argument')
    sys.exit(2)

host = 'www.ietf.org'
port = 80
sock = socket.create_connection((host, port))

req = (
    'GET /rfc/rfc{rfcnum}.txt HTTP/1.1\r\n'
    'Host: {host}:{port}\r\n'
    'User-Agent: Python {version}\r\n'
    'Connection: close\r\n'
    '\r\n'
)
req = req.format(
    rfcnum=rfc_number,
    host=host,
    port=port,
    version=sys.version_info[0]
)
sock.sendall(req.encode('ascii'))
rfc_raw = bytearray()
while True:
    buf = sock.recv(4096)
    if not len(buf):
        break
    rfc_raw += buf
rfc = rfc_raw.decode('utf-8')
print(rfc)

The first noticeable change is that we have used socket instead of urllib. Socket is Python's interface for the operating system's TCP and UDP implementation. The command-line check remains the same, but then we will see that we now need to handle some of the things that urllib was doing for us before.

We have to tell socket which transport layer protocol that we want to use. We do this by using the socket.create_connection() convenience function. This function will always create a TCP connection. You'll notice that we have to explicitly supply the TCP port number that socket should use to establish the connection as well. Why 80? 80 is the standard port number for web services over HTTP. We've also had to separate the host from the URL, since socket has no understanding of URLs.

The request string that we create to send to the server is also much more complicated than the URL that we used before: it's a full HTTP request. In the next chapter, we'll be looking at these in detail.

Next, we deal with the network communication over the TCP connection. We send the entire request string to the server using the sendall() call. The data sent through TCP must be in raw bytes, so we have to encode the request text as ASCII before sending it.

Then, we piece together the server's response as it arrives in the while loop. Bytes that are sent to us through a TCP socket are presented to our application in a continuous stream. So, like any stream of unknown length, we have to read it iteratively. The recv() call will return the empty string after the server sends all its data and closes the connection. Hence, we can use this as a condition for breaking out and printing the response.

Our program is clearly more complicated. Compared to our previous one, this is not good in terms of maintenance. Also, if you run the program and look at the start of the output RFC text, then you'll notice that there are some extra lines at the beginning, and these are as follows:

HTTP/1.1 200 OK
Date: Thu, 07 Aug 2014 15:47:13 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=d1983ad4f7…
Last-Modified: Fri, 27 Mar 1998 22:45:31 GMT
ETag: W/"8982977-4c9a-32a651f0ad8c0"

Because we're now dealing with a raw HTTP protocol exchange, we're seeing the extra header data that HTTP includes in a response. This has a similar purpose to the lower-level packet headers. The HTTP header contains HTTP-specific metadata about the response that tells the client how to interpret it. Before, urllib parsed this for us, added the data as attributes to the response object, and removed the header data from the output data. We would need to add code to do this as well to make this program as capable as our first one.

What can't immediately be seen from the code is that we're also missing out on the urllib module's error checking and handling. Although low-level network errors will still generate exceptions, we will no longer catch any problems in the HTTP layer, which urllib would have done.

The 200 value in the first line of the aforementioned headers is an HTTP status code, which tells us whether there were any problems with the HTTP request or response. 200 means that everything went well, but other codes, such as the infamous 404 'not found' can mean something went wrong. The urllib module would check these for us and raise an exception. But here, we need to handle these ourselves.

So, there are clear benefits of using modules as far up the stack as possible. Our resulting programs will be less complicated, which will make them quicker to write, and easier to maintain. It also means that their error handling will be more robust, and we will benefit from the expertise of the modules' developers. Also, we benefit from the testing that the module would have undergone for catching unexpected and tricky edge-case problems. Over the next few chapters, we'll be discussing more modules and protocols that live at the top of the stack.

Programming for TCP/IP networks

To round up, we're going to look at a few frequently encountered aspects of TCP/IP networks that can cause a lot of head-scratching for application developers who haven't encountered them before. These are: firewalls, Network Address Translation, and some of the differences between IPv4 and IPv6.

Firewalls

A firewall is a piece of hardware or software that inspects the network packets that flow through it and, based on the packet's properties, it filters what it lets through. It is a security mechanism for preventing unwanted traffic from moving from one part of a network to another. Firewalls can sit at network boundaries or can be run as applications on network clients and servers. For example, iptables is the de facto firewall software for Linux. You'll often find a firewall built into desktop anti-virus programs.

The filtering rules can be based on any property of the network traffic. The commonly used properties are: the transport layer protocol (that is, whether traffic uses TCP or UDP), the source and destination IP addresses, and the source and destination port numbers.

A common filtering strategy is to deny all inbound traffic and only allow traffic that matches very specific parameters. For example, a company might have a web server it wants to allow access to from the Internet, but it wants to block all traffic from the Internet that is directed towards any of the other devices on its network. To do so, it would put a firewall directly in front of or behind its gateway, and then configure it to block all incoming traffic, except TCP traffic with the destination IP address of the web server, and the destination port number 80 (since port 80 is the standard port number for the HTTP service).

Firewalls can also block outbound traffic. This may be done to stop malicious software that finds its way onto internal network devices from calling home or sending spam e-mail.

Because firewalls block network traffic, they can cause obvious problems for network applications. When testing our applications over a network, we need to be sure that the firewalls that exist between our devices are configured such that they let our application's traffic through. Usually, this means that we need to make sure that the ports which we need are open on the firewall for the traffic between the source and the destination IP addresses to flow freely. This may take some negotiating with an IT support team or two, and maybe looking at our operating system's and local network router's documentation. Also, we need to make sure that our application users are aware of any firewall configuration that they need to perform in their own environments in order to make use of our program.

Network Address Translation

Earlier, we discussed private IP address ranges. While they are potentially very useful, they come with a small catch. Packets with source or destination addresses in the private ranges are forbidden from being routed over the public Internet! So, without some help, devices using private range addresses can't talk to devices using addresses on the public Internet. However, with Network Address Translation (NAT), we can solve this. Since most home networks use private range addresses, NAT is likely to be something that you'll encounter.

Although NAT can be used in other circumstances, it is most commonly performed by a gateway at the boundary of the public Internet and a network that is using private range IP addresses. To enable the packets from the gateway's network to be routed on the public Internet as the gateway receives packets from the network that are destined for the Internet, it rewrites the packets' headers and replaces the private range source IP addresses with its own public range IP address. If the packets contain TCP or UDP packets, and these contain a source port, then it may also open up a new source port for listening on its external interface and rewrite the source port number in the packets to match this new number.

As it does these rewrites, it records the mapping between the newly opened source port and the source device on the internal network. If it receives a reply to the new source port, then it reverses the translation process and sends the received packets to the original device on the internal network. The originating network device shouldn't be made aware of the fact that its traffic is undergoing NAT.

There are several benefits of using NAT. The internal network devices are shielded from malicious traffic directed toward the network from the Internet, devices which use NAT devices are provided with a layer of privacy since their private addresses are hidden, and the number of network devices that need to be assigned precious public IP addresses is reduced. It's actually the heavy use of NAT that allows the Internet to continue functioning despite having run out of IPv4 addresses.

NAT can cause some problems for network's applications, if it is not taken into consideration at design time.

If the transmitted application data includes information about a device's network configuration and that device is behind a NAT router, then problems can occur if the receiving device acts on the assumption that the application data matches the IP and the TCP/UDP header data. NAT routers will rewrite the IP and TCP/UDP header data, but not the application data. This is a well known problem in the FTP protocol.

Another problem that FTP has with NAT is that in FTP active mode, a part of the protocol operation involves the client opening a port for listening on, and the server creating a new TCP connection to that port (as opposed to just a regular reply). This fails when the client is behind a NAT router because the router doesn't know what to do with the server's connection attempt. So, be careful about assuming that servers can create new connections to clients, since they may be blocked by a NAT router, or firewall. In general, it's best to program under the assumption that it's not possible for a server to establish a new connection to a client.

IPv6

We mentioned that the earlier discussion is based on IPv4, but that there is a new version called IPv6. IPv6 is ultimately designed to replace IPv4, but this process is unlikely to be completed for a while yet.

Since most Python standard library modules have now been updated to support IPv6 and to accept IPv6 addresses, moving to IPv6 in Python shouldn't have much impact on our applications. However, there are a few small glitches to watch out for.

The main difference that you'll notice in IPv6 is that the address format has been changed. One of the main design goals of the new protocol was to alleviate the global shortage of IPv4 addresses and to prevent it from happening again the IETF quadrupled the length of an address, to 128 bits, creating a large enough address space to give each human on the planet a billion times as many addresses as there are in the entire IPv4 address space.

The new format IP addresses are written differently, they look like this:

2001:0db8:85a3:0000:0000:b81a:63d6:135b

Note the use of colons and hexadecimal format.

There are rules for writing IPv6 addresses in more compact forms as well. This is principally done by omitting runs of consecutive zeros. For example, the address in the preceding example could be shortened to:

2001:db8:85a3::b81a:63d6:135b

If a program needs to compare or parse text-formatted IPv6 addresses, then it will need to be made aware of these compacting rules, as a single IPv6 address can be represented in more than one way. Details of these rules can be found in RFC 4291, which is available at http://www.ietf.org/rfc/rfc4291.txt.

Since colons may cause conflicts when used in URIs, IPv6 addresses need to be enclosed in square brackets when they are used in this manner, for example:

http://[2001:db8:85a3::b81a:63d6:135b]/index.html

Also, in IPv6, it is now standard practice for network interfaces to have multiple IP addresses assigned to them. IPv6 addresses are classified by what scope they are valid in. The scopes include the global scope (that is, the public Internet) and the link-local scope, which is only valid for the local subnet. An IP address's scope can be determined by inspecting its high-order bits. If we enumerate the IP addresses of local interfaces to use for a certain purpose, then we need to check if we have used the correct address for the scope that we intend to work with. There are more details in RFC 4291.

Finally, with the mind-boggling cornucopia of addresses that are available in IPv6, the idea is that every device (and component, and bacterium) can be given a globally unique public IP address, and NAT will become a thing of the past. Though it sounds great in theory, some concerns have been raised about the implications that this has for issues like user privacy. As such, additions designed for alleviating these concerns have been made to the protocol (http://www.ietf.org/rfc/rfc3041.txt). This is a welcome progression; however, it can cause problems for some applications. So reading through the RFC is worth your while, if you're planning for your program to employ IPv6.