Book Image

Python Data Structures and Algorithms

By : Benjamin Baka
Book Image

Python Data Structures and Algorithms

By: Benjamin Baka

Overview of this book

Data structures allow you to organize data in a particular way efficiently. They are critical to any problem, provide a complete solution, and act like reusable code. In this book, you will learn the essential Python data structures and the most common algorithms. With this easy-to-read book, you will be able to understand the power of linked lists, double linked lists, and circular linked lists. You will be able to create complex data structures such as graphs, stacks and queues. We will explore the application of binary searches and binary search trees. You will learn the common techniques and structures used in tasks such as preprocessing, modeling, and transforming data. We will also discuss how to organize your code in a manageable, consistent, and extendable way. The book will explore in detail sorting algorithms such as bubble sort, selection sort, insertion sort, and merge sort. By the end of the book, you will learn how to build components that are easy to understand, debug, and use in different applications.
Table of Contents (20 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
5
Stacks and Queues
7
Hashing and Symbol Tables

Python for data


Python has several built-in data structures, including lists, dictionaries, and sets, that we use to build customized objects. In addition, there are a number of internal libraries, such as collections and the math object, which allow us to create more advanced structures as well as perform calculations on those structures. Finally, there are the external libraries such as those found in the SciPy packages. These allow us to perform a range of advanced data tasks such as logistic and linear regression, visualization, and mathematical calculations such as operations on matrixes and vectors. External libraries can be very useful for an out-of-the-box solution. However, we must also be aware that there is often a performance penalty compared to building customized objects from the ground up. By learning how to code these objects ourselves, we can target them to specific tasks, making them more efficient. This is not to exclude the role of external libraries and we will look at this in Chapter 12, Design Techniques and Strategies.

To begin, we will take an overview of some of the key language features that make Python such a great choice for data programming.

The Python environment

A feature of the Python environment is its interactive console allowing you to both use Python as a desktop programmable calculator and also as an environment to write and test snippets of code. The read-evaluate-print loop of the console is a very convenient way to interact with a larger code base, such as to run functions and methods or to create instances of classes. This is one of the major advantages of Python over compiled languages such as C/C++ or Java, where the write-compile-test-recompile cycle can increase development time considerably compared to Python's read - evaluate - print loop. Being able to type in expressions and get an immediate response can greatly speed up data science tasks.

There are some excellent distributions of Python apart from the official CPython version. Two of the most popular are Anaconda (https://www.continuum.io/downloads) and Canopy (https://www.enthought.com/products/canopy/). Most distributions come with their own developer environments. Both Canopy and Anaconda include libraries for scientific, machine learning, and other data applications. Most distributions come with an editor.

There are also a number of implementations of the Python console, apart from the CPython version. Most notable amongst these is the Ipython/Jupyter platform that includes a web-based computational environment.

Variables and expressions

To translate a real-world problem into one that can be solved by an algorithm, there are two interrelated tasks. Firstly, select the variables, and secondly, find the expressions that relate to these variables. Variables are labels attached to objects; they are not the object itself. They are not containers for objects either. A variable does not contain the object, rather it acts as a pointer or reference to an object. For example, consider the following code:

Here we have created a variable, a, which points to a list object. We create another variable, b, which points to this same list object. When we append an element to this list object, this change is reflected in both a and b.

Python is a dynamically typed language. Variable names can be bound to different values and types during program execution. Each value is of a type, a string, or integer for example; however, the name that points to this value does not have a specific type. This is different from many languages such as C and Java where a name represents a fixed size, type, and location in memory. This means when we initialize variables in Python, we do not need to declare a type. Also, variables, or more specifically the objects they point to, can change type depending on the values assigned to them, for example:

Variable scope

It is important to understand the scoping rules of variables inside functions. Each time a function executes, a new local namespace is created. This represents a local environment that contains the names of the parameters and variables that are assigned by the function. To resolve a namespace when a function is called, the Python interpreter first searches the local namespace (that is, the function itself) and if no match is found, it searches the global namespace. This global namespace is the module in which the function was defined. If the name is still not found, it searches the built-in namespace. Finally, if this fails then the interpreter raises a NameError exception. Consider the following code:

    a=10; b=20 
    def my_function(): 
        global a 
        a=11; b=21 
    my_function() 
    print(a) #prints 11 
    print(b) #prints 20

Here is the output of the preceding code:

In the preceding code, we define two global variables. We need to tell the interpreter, using the keyword global, that inside the function, we are referring to a global variable. When we change this variable to 11, these changes are reflected in the global scope. However, the variable b we set to 21 is local to the function, and any changes made to it inside the function are not reflected in the global scope. When we run the function and print b, we see that it retains its global value.

Flow control and iteration

Python programs consist of a sequence of statements. The interpreter executes each statement in order until there are no more statements. This is true if both files run as the main program as well as files that are loaded via import. All statements, including variable assignment, function definitions, class definitions, and module imports, have equal status. There are no special statements that have higher priority than any other and every statement can be placed anywhere in a program. There are two main ways of controlling the flow of program execution, conditional statements and loops.

The ifelse, and elif statements control the conditional execution of statements. The general format is a series of if and elif statements followed by a final else statement:

    x='one' 
    if x==0:  
        print('False') 
    elif x==1: 
        print('True') 
    else: print('Something else') 
    #prints 'Something else' 

Note the use of the == operator to test for the same values. This returns true if the values are equal; it returns false otherwise. Note also that setting x to a string will return something else rather than generate a type error as may happen in languages that are not dynamically typed. Dynamically typed languages such as Python allow flexible assignment of objects with different types.

The other way of controlling program flow is with loops. They are created using the while or for statements, for example:

Overview of data types and objects

Python contains 12 built-in data types. These include four numeric types (int, float, complex, bool), four sequence types (str, list, tuple, range), one mapping type (dict), and two set types. It is also possible to create user-defined objects such as functions or classes. We will look at the string and the list data types in this chapter and the remaining built-in types in the next chapter.

All data types in Python are objects. In fact, pretty much everything is an object in Python, including modules, classes, and functions, as well as literals such as strings and integers. Each object in Python has a type, a value, and an identity. When we write greet = "hello world" we are creating an instance of a string object with the value "hello world" and the identity of greet. The identity of an object acts as a pointer to the object's location in memory. The type of an object, also known as the object's class, describes the object's internal representation as well as the methods and operations it supports. Once an instance of an object is created, its identity and type cannot be changed.

We can get the identity of an object by using the built-in function id(). This returns an identifying integer and on most systems this refers to its memory location, although you should not rely on this in any of your code.

Also, there are a number of ways to compare objects, for example:

    if a== b: #a and b have the same value 

    if a is b: # if a and b are the same object 
    if type(a) is type(b): # a and b are the same type 

An important distinction needs to be made between mutable and immutable objects. Mutable object's such as lists can have their values changed. They have methods, such as insert() or append(), that change an objects value. Immutable objects, such as strings, cannot have their values changed, so when we run their methods, they simply return a value rather than change the value of an underlying object. We can, of course, use this value by assigning it to a variable or using it as an argument in a function.

Strings

Strings are immutable sequence objects, with each character representing an element in the sequence. As with all objects, we use methods to perform operations. Strings, being immutable, do not change the instance; each method simply returns a value. This value can be stored as another variable or given as an argument to a function or method.

The following table is a list of some of the most commonly used string methods and their descriptions:

Methods

Descriptions

s.count(substring, [start,end])

Counts the occurrences of a substring with optional start and end parameters.

s.expandtabs([tabsize])

Replaces tabs with spaces.

s.find(substring, [start, end])

Returns the index of the first occurrence of a substring or returns -1 if the substring is not found.

s.isalnum()

Returns True if all characters are alphanumeric, returns False otherwise.

s.isalpha()

Returns True if all characters are alphabetic, returns False otherwise.

s.isdigit()

Returns True if all characters are digits, returns False otherwise.

s.join(t)

Joins the strings in sequence t.

s.lower()

Converts the string to all lowercase.

s.replace(old, new [maxreplace])

Replaces old substring with new substring.

s.strip([characters])

Removes whitespace or optional characters.

s.split([separator], [maxsplit])

Splits a string separated by whitespace or an optional separator. Returns a list.

Strings, like all sequence types, support indexing and slicing. We can retrieve any character from a string by using its index s[i]. We can retrieve a slice of a string by using s[i:j], where i and j are the start and end points of the slice. We can return an extended slice by using a stride, as in the following: s[i:j:stride]. The following code should make this clear:

The first two examples are pretty straightforward, returning the character located at index 1 and the first seven characters of the string, respectively. Notice that indexing begins at 0. In the third example, we are using a stride of 2. This results in every second character being returned. In the final example, we omit the end index and the slice returns every second character in the entire string.

You can use any expression, variable, or operator as an index as long as the value is an integer, for example:

Another common operation is traversing through a string with a loop, for example:

Given that strings are immutable, a common question that arises is how we perform operations such inserting values. Rather than changing a string, we need to think of ways to build new string objects for the results we need. For example, if we wanted to insert a word into our greeting, we could assign a variable to the following:

As this code shows, we use the slice operator to split the string at index position 5 and use + to concatenate. Python never interprets the contents of a string as a number. If we need to perform mathematical operations on a string, we need to first convert them to a numeric type, for example:

Lists

Lists are probably the most used built-in data structures in Python because they can be composed of any number of other data types. They are a simple representation of arbitrary objects. Like strings, they are indexed by integers starting with zero. The following table contains the most commonly used list methods and their descriptions:

Method

Description

list(s)

Returns a list of the sequence s.

s.append(x)

Appends element x to the end of s.

s.extend(x)

Appends the list x to s.

s.count(x)

Counts the occurrence of x in s.

s.index(x, [start], [stop])

Returns the smallest index, i, where s[i] ==x. Can include optional start and stop index for the search.

s.insert(i,e)

Inserts x at index i.

s.pop(i)

Returns the element i and removes it from the list.

s.remove(x)

Removes x from s.

s.reverse()

Reverses the order of s.

s.sort(key ,[reverse])

Sorts s with optional key and reverse.

 

When we are working with lists, and other container objects, it is important to understand the internal mechanism that Python uses to copy them. Python creates real copies only if it has to. When we assign the value of a variable to another variable, both of these variables point to the same memory location. A new slot in memory will only be allocated if one of the variables changes. This has important consequences for mutable compound objects such as lists. Consider the following code:

Here, both the list1 and list2 variables point to the same slot in memory. When we change the y variable to 4, we are changing the same y variable that list1 is pointing to.

An important feature of list's is that they can contain nested structures, that is, other lists, for example:

We can access the lists values using the bracket operators and since lists are mutable, they are copied in place. The following example demonstrates how we can use this to update elements; for example, here we are raising the price of flour by 20 percent:

A common and very intuitive way to create lists from expressions is using list comprehensions. This allows us to create a list by writing an expression directly into the list, for example:

List comprehensions can be quite flexible; for example, consider the following code. It essentially shows two different ways to performs a function composition, where we apply one function (x * 4) to another (x * 2). The following code prints out two lists representing the function composition of f1 and f2 calculated first using a for loop and then using a list comprehension:

    def f1(x): return x*2 
    def f2(x): return x*4 

    lst = [] 
    for i in range(16): 
        lst.append(f1(f2(i))) 

    print(lst) 

    print([f1(x)  for x in range(64) if x in [f2(j) for j in range(16)]]) 

The first line of output is from the for loop construct. The second is from the list comprehension expression:

List comprehensions can also be used to replicate the action of nested loops in a more compact form. For example, we multiply each of the elements contained within list1 with each other:

We can also use list comprehensions with other objects such as strings, to build more complex structures. For example, the following code creates a list of words and their letter count:

As we will see, lists form the foundation of many of the data structures we will look at. Their versatility, ease of creation, and use enables them to build more specialized and complex data structures.

Functions as first class objects

In Python, it is not only data types that are treated as objects. Both functions and classes are what are known as first class objects, allowing them to be manipulated in the same ways as built-in data types. By definition, first class objects are:

  • Created at runtime
  • Assigned as a variable or in a data structure
  • Passed as an argument to a function
  • Returned as the result of a function

In Python, the term first class object is a bit of a misnomer since it implies some sort of hierarchy, whereas all Python objects are essentially first class.

To have a look at how this works, let's define a simple function:

    def greeting(language): 
    if language== 'eng': 
             return 'hello world' 
       if language  == 'fr' 
             return 'Bonjour le monde' 
       else: return 'language not supported' 

Since user-defined functions are objects, we can do things such as include them in other objects, such as lists:

Functions can also be used as arguments for other functions. For example, we can define the following function:

Here, callf() takes a function as an argument, sets a language variable to 'eng', and then calls the function with the language variable as its argument. We could see how this would be useful if, for example, we wanted to produce a program that returns specific sentences in a variety of languages, perhaps for some sort of natural language application. Here we have a central place to set the language. As well as our greeting function, we could create similar functions that return different sentences. By having one point where we set the language, the rest of the program logic does not have to worry about this. If we want to change the language, we simply change the language variable and we can keep everything else the same.

Higher order functions

Functions that take other functions as arguments, or that return functions, are called higher order functions. Python 3 contains two built-in higher order functions, filter() and map(). Note that in earlier versions of Python, these functions returned lists; in Python 3, they return an iterator, making them much more efficient. The map() function provides an easy way to transform each item into an iterable object. For example, here is an efficient, compact way to perform an operation on a sequence. Note the use of the lambda anonymous function:

Similarly, we can use the filter built-in function to filter items in a list:

Note that both map and filter perform the identical function as to what can be achieved by list comprehensions. There does not seem to be a great deal of difference in the performance characteristics apart from a slight performance advantage when using the in built functions map and filter without the lambda operator, compared to list comprehensions. Despite this, most style guides recommend the use of list comprehensions over built-in functions, possibly because they tend to be easier to read.

Creating our own higher order functions is one of the hallmarks of functional programming style. A practical example of how higher order functions can be useful is demonstrated by the following. Here we are passing the len function as the key to the sort function. This way, we can sort a list of words by length:

Here is another example for case-insensitive sorting:

Note the difference between the list.sort() method and the sorted built-in function. list.sort(), a method of the list object, sorts the existing instance of a list without copying it. This method changes the target object and returns None. It is an important convention in Python that functions or methods that change the object return None to make it clear that no new object was created and that the object itself was changed.

On the other hand, the sorted built-in function returns a new list. It actually accepts any iterable object as an argument, but it will always return a list. Both list sort and sorted take two optional keyword arguments as key.

A simple way to sort more complex structures is to use the index of the element to sort using the lambda operator, for example:

Here we have sorted the items by price.

Recursive functions

Recursion is one of the most fundamental concepts of computer science. In Python, we can implement a recursive function simply by calling it within its own function body. To stop a recursive function turning into an infinite loop, we need at least one argument that tests for a terminating case to end the recursion. This is sometimes called the base case. It should be pointed out that recursion is different from iteration. Although both involve repetition, iteration loops through a sequence of operations, whereas recursion repeatedly calls a function. Both need a selection statement to end. Technically, recursion is a special case of iteration known as tail iteration, and it is usually always possible to convert an iterative function to a recursive function and vice versa. The interesting thing about recursive functions is that they are able to describe an infinite object within a finite statement.

The following code should demonstrate the difference between recursion and iteration. Both these functions simply print out numbers between low and high, the first one using iteration and the second using recursion:

Notice, iterTest, the iteration example, we use a while statement to test for the condition, then call the print method, and finally increment the low value. The recursive example tests for the condition, prints, then calls itself, incrementing the low variable in its argument. In general, iteration is more efficient; however, recursive functions are often easier to understand and write. Recursive functions are also useful for manipulating recursive data structures such as linked lists and trees, as we will see.

Generators and co-routines

We can create functions that do not just return one result, but rather an entire sequence of results, by using the yield statement. These functions are called generators. Python contains generator functions, which are an easy way to create iterators and they are especially useful as a replacement for unworkably long lists. A generator yields items rather than build lists. For example, the following code shows why we might choose to use a generator as opposed to creating a list:

    # compares the running time of a list compared to a generator 
    import time 
    #generator function creates an iterator of odd numbers between n and m 
    def oddGen(n, m):         
        while n < m: 
            yield n 
            n += 2 
    #builds a list of odd numbers between n and m 
    def oddLst(n,m): 
        lst=[] 
        while n<m: 
            lst.append(n) 
            n +=2 
        return lst 
    #the time it takes to perform sum on an iterator    
    t1=time.time() 
    sum(oddGen(1,1000000)) 
    print("Time to sum an iterator: %f" % (time.time() - t1)) 

    #the time it takes to build and sum a list 
    t1=time.time() 
    sum(oddLst(1,1000000)) 
    print("Time to build and sum a list: %f" % (time.time() - t1))      

This prints out the following:

As we can see, building a list to do this calculation takes significantly longer. The performance improvement as a result of using generators is because the values are generated on demand, rather than saved as a list in memory. A calculation can begin before all the elements have been generated and elements are generated only when they are needed.

In the preceding example, the sum method loads each number into memory when it is needed for the calculation. This is achieved by the generator object repeatedly calling the __next__() special method. Generators never return a value other than None.

Typically, generator objects are used in for loops. For example, we can make use of the oddcount generator function created in the preceding code to print out odd integers between 1 and 10:

    for i in oddcount(1,10):print(i) 

We can also create a generator expression, which, apart from replacing square brackets with parentheses, uses the same syntax and carries out the same operation as list comprehensions. Generator expressions, however, do not create a list, they create a generator object. This object does not create the data, but rather creates that data on demand. This means that generator objects do not support sequence methods such as append() and insert(). You can, however, change a generator into a list using the list() function:

Classes and object programming

Classes are a way to create new kinds of objects and they are central to object-oriented programming. A class defines a set of attributes that are shared across instances of that class. Typically, classes are sets of functions, variables, and properties.

The object-oriented paradigm is compelling because it gives us a concrete way to think about and represent the core functionality of our programs. By organizing our programs around objects and data rather than actions and logic, we have a robust and flexible way to build complex applications. The actions and logic are still present of course, but by embodying them in objects, we have a way to encapsulate functionality, allowing objects to change in very specific ways. This makes our code less error-prone, easier to extend and maintain, and able to model real-world objects.

Classes are created in Python using the class statement. This defines a set of shared attributes associated with a collection of class instances. A class usually consists of a number of methods, class variables, and computed properties. It is important to understand that defining a class does not, by itself, create any instances of that class. To create an instance, a variable must be assigned to a class. The class body consists of a series of statements that execute during the class definition. The functions defined inside a class are called instance methods. They apply some operations to the class instance by passing an instance of that class as the first argument. This argument is called self by convention, but it can be any legal identifier. Here is a simple example:

    class Employee(object): 
        numEmployee = 0 
        def __init__(self, name, rate): 
            self.owed = 0         
            self.name = name 
            self.rate=rate 
            Employee.numEmployee += 1 

        def __del__(self): 
            Employee.numEmployee -= 1 

        def hours(self, numHours): 
            self.owed += numHours * self.rate 
            return("%.2f hours worked" % numHours) 

        def pay(self):                 
            self.owed = 0 
            return("payed %s " % self.name) 

Class variables, such as numEmployee, share values among all the instances of the class. In this example, numEmployee is used to count the number of employee instances. Note that the Employee class implements the __init__ and __del__ special methods, which we will discuss in the next section.

We can create instances of the Employee objects, run methods, and return class and instance variables by doing the following:

Special methods

We can use the dir(object) function to get a list of attributes of a particular object. The methods that begin and end with two underscores are called special methods. Apart from the following exception, special method, are generally called by the Python interpreter rather than the programmer; for example, when we use the + operator, we are actually invoking a call to __add__(). For example, rather than using my_object.__len__() we can use len(my_object) using len() on a string object is actually much faster because it returns the value representing the object's size in memory, rather than making a call to the object's __len__ method. The only special method we actually call in our programs, as common practice, is the __init__ method, to invoke the initializer of the superclass in our own class definitions. It is strongly advised not to use the double underscore syntax for your own objects because of potential current or future conflicts with Python's own special methods.

We may, however, want to implement special methods in custom objects, to give them some of the behavior of built-in types. In the following code, we create a class that implements the __repr__ method. This method creates a string representation of our object that is useful for inspection purposes:

    class my_class(): 
        def __init__(self, greet): 
            self.greet = greet 
        def __repr__(self): 
            return 'a custom object (%r)' % (self.greet) 

When we create an instance of this object and inspect it, we can see we get our customized string representation. Notice the use of the %r format placeholder to return the standard representation of the object. This is useful and best practice, because, in this case, it shows us that the greet object is a string indicated by the quotation marks:

Inheritance

It is possible to create a new class that modifies the behavior of an existing class through inheritance. This is done by passing the inherited class as an argument in the class definition. It is often used to modify the behavior of existing methods, for example:

    class specialEmployee(Employee): 
        def hours(self, numHours): 
            self.owed += numHours * self.rate * 2 
            return("%.2f hours worked" % numHours)    

An instance of the specialEmployee class is identical to an Employee instance except for the changed hours() method.

For a subclass to define new class variables, it needs to define an __init__() method, as follows:

    class specialEmployee(Employee): 
        def __init__(self,name,rate, bonus): 
            Employee.__init__(self, name, rate) #calls the base classes 
            self.bonus = bonus 

        def hours(self, numHours): 
            self.owed += numHours * self.rate + self.bonus  
            return("%.2f hours worked" % numHours)      

Notice that the methods of the base class are not automatically invoked and it is necessary for the derived class to call them. We can test for class membership using the built-in function isintance(obj1, obj2). This returns true if obj1 belongs to the class of obj2 or any class derived from obj2.

Within a class definition, it is assumed that all methods operate on the instance, but this is not a requirement. There are, however, other types of methods: static methods and class methods. A static method is just an ordinary function that just happens to be defined in a class. It does not perform any operations on the instance and it is defined using the @staticmethod class decorator. Static methods cannot access the attributes of an instance, so their most common usage is as a convenience to group utility functions together.

Class methods operate on the class itself, not the instance, in the same way that class variables are associated with the classes rather than instances of that class. They are defined using the @classmethod decorator, and are distinguished from instance methods in that the class is passed as the first argument. This is named cls by convention.

    class Aexp(object): 
        base=2 
        @classmethod 
        def exp(cls,x): 
            return(cls.base**x) 

    class Bexp(Aexp): 
            base=3 

The class Bexp inherits from the Aexp class and changes the base class variable to 3. We can run the parent class's exp() method as follows:

Although this example is a little contrived, there are several reasons why class methods may be useful. For example, because a subclass inherits all the same features of its parent, there is the potential for it to break inherited methods. Using class methods is a way to define exactly what methods are run.

Data encapsulation and properties

Unless otherwise specified, all attributes and methods are accessible without restriction. This also means that everything defined in a base class is accessible from a derived class. This may cause problems when we are building object-oriented applications where we may want to hide the internal implementation of an object. This can lead to namespace conflicts between objects defined in derived classes with the base class. To prevent this, the methods we define private attributes with have a double underscore, such as __privateMethod(). These method names are automatically changed to _Classname__privateMethod() to prevent name conflicts with methods defined in base classes. Be aware that this does not strictly hide private attributes, rather it just provides a mechanism for preventing name conflicts.

It is recommended to use private attributes when using a class property to define mutable attributes. A property is a kind of attribute that rather than returning a stored value, computes its value when called. For example, we could redefine the exp() property with the following:

    class Bexp(Aexp): 
        __base=3 
        def __exp(self): 
            return(x**cls.base)     

In this chapter, we have looked at some of the fundamentals of the Python programming language, from basic operations to functions, classes, and objects in Python. In the next chapter, we will examine, in detail, the built-in data structures of Python.