Book Image

Python Object-Oriented Programming - Fourth Edition

By : Steven F. Lott, Dusty Phillips
2 (2)
Book Image

Python Object-Oriented Programming - Fourth Edition

2 (2)
By: Steven F. Lott, Dusty Phillips

Overview of this book

Object-oriented programming (OOP) is a popular design paradigm in which data and behaviors are encapsulated in such a way that they can be manipulated together. Python Object-Oriented Programming, Fourth Edition dives deep into the various aspects of OOP, Python as an OOP language, common and advanced design patterns, and hands-on data manipulation and testing of more complex OOP systems. These concepts are consolidated by open-ended exercises, as well as a real-world case study at the end of every chapter, newly written for this edition. All example code is now compatible with Python 3.9+ syntax and has been updated with type hints for ease of learning. Steven and Dusty provide a comprehensive, illustrative tour of important OOP concepts, such as inheritance, composition, and polymorphism, and explain how they work together with Python’s classes and data structures to facilitate good design. In addition, the book also features an in-depth look at Python’s exception handling and how functional programming intersects with OOP. Two very powerful automated testing systems, unittest and pytest, are introduced. The final chapter provides a detailed discussion of Python's concurrent programming ecosystem. By the end of the book, you will have a thorough understanding of how to think about and apply object-oriented principles using Python syntax and be able to confidently create robust and reliable programs.
Table of Contents (17 chapters)
15
Other Books You May Enjoy
16
Index

Case study

This section expands on the object-oriented design of our realistic example. We'll start with the diagrams created using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.

We'll describe the various considerations that are part of the Python implementation of the class definitions. We'll start with a review of the diagrams that describe the classes to be defined.

Logical view

Here's the overview of the classes we need to build. This is (except for one new method) the previous chapter's model:

Diagram

Description automatically generated

Figure 2.2: Logical view diagram

There are three classes that define our core data model, plus some uses of the generic list class. We've shown it using the type hint of List. Here are the four central classes:

  • The TrainingData class is a container with two lists of data samples, a list used for training our model and a list used for testing our model. Both lists are composed of KnownSample instances. Additionally, we'll also have a list of alternative Hyperparameter values. In general, these are tuning values that change the behavior of the model. The idea is to test with different hyperparameters to locate the highest-quality model.

    We've also allocated a little bit of metadata to this class: the name of the data we're working with, the datetime of when we uploaded the data the first time, and the datetime of when we ran a test against the model.

  • Each instance of the Sample class is the core piece of working data. In our example, these are measurements of sepal lengths and widths and petal lengths and widths. Steady-handed botany graduate students carefully measured lots and lots of flowers to gather this data. We hope that they had time to stop and smell the roses while they were working.
  • A KnownSample object is an extended Sample. This part of the design foreshadows the focus of Chapter 3, When Objects Are Alike. A KnownSample is a Sample with one extra attribute, the assigned species. This information comes from skilled botanists who have classified some data we can use for training and testing.
  • The Hyperparameter class has the k used to define how many of the nearest neighbors to consider. It also has a summary of testing with this value of k. The quality tells us how many of the test samples were correctly classified. We expect to see that small values of k (like 1 or 3) don't classify well. We expect middle values of k to do better, and very large values of k to not do as well.

The KnownSample class on the diagram may not need to be a separate class definition. As we work through the details, we'll look at some alternative designs for each of these classes.

We'll start with the Sample (and KnownSample) classes. Python offers three essential paths for defining a new class:

  • class definition; we'll focus on this to start.
  • @dataclass definition. This provides a number of built-in features. While it's handy, it's not ideal for programmers who are new to Python, because it can obscure some implementation details. We'll set this aside for Chapter 7, Python Data Structures.
  • An extension to the typing.NamedTuple class. The most notable feature of this definition will be that the state of the object is immutable; the attribute values cannot be changed. Unchanging attributes can be a useful feature for making sure a bug in the application doesn't mess with the training data. We'll set this aside for Chapter 7, also.

Our first design decision is to use Python's class statement to write a class definition for Sample and its subclass KnownSample. This may be replaced in the future (i.e., Chapter 7) with alternatives that use data classes as well as NamedTuple.

Samples and their states

The diagram in Figure 2.2 shows the Sample class and an extension, the KnownSample class. This doesn't seem to be a complete decomposition of the various kinds of samples. When we review the user stories and the process views, there seems to be a gap: specifically, the "make classification request" by a User requires an unknown sample. This has the same flower measurements attributes as a Sample, but doesn't have the assigned species attribute of a KnownSample. Further, there's no state change that adds an attribute value. The unknown sample will never be formally classified by a Botanist; it will be classified by our algorithm, but it's only an AI, not a Botanist.

We can make a case for two distinct subclasses of Sample:

  • UnknownSample: This class contains the initial four Sample attributes. A User provides these objects to get them classified. 
  • KnownSample: This class has the Sample attributes plus the classification result, a species name. We use these for training and testing the model.

Generally, we consider class definitions as a way to encapsulate state and behavior. An UnknownSample instance provided by a user starts out with no species. Then, after the classifier algorithm computes a species, the Sample changes state to have a species assigned by the algorithm.

A question we must always ask about class definitions is this:

Is there any change in behavior that goes with the change in state?

In this case, it doesn't seem like there's anything new or different that can happen. Perhaps this can be implemented as a single class with some optional attributes.

We have another possible state change concern. Currently, there's no class that owns the responsibility of partitioning Sample objects into the training or testing subsets. This, too, is a kind of state change.

This leads to a second important question: 

What class has responsibility for making this state change?

In this case, it seems like the TrainingData class should own the discrimination between testing and training data.

One way to help look closely at our class design is to enumerate all of the various states of individual samples. This technique helps uncover a need for attributes in the classes. It also helps to identify the methods to make state changes to objects of a class.

Sample state transitions

Let's look at the life cycles of Sample objects. An object's life cycle starts with object creation, then state changes, and (in some cases) the end of its processing life when there are no more references to it. We have three scenarios:

  1. Initial load: We'll need a load() method to populate a TrainingData object from some source of raw data. We'll preview some of the material in Chapter 9, Strings, Serialization, and File Paths, by saying that reading a CSV file often produces a sequence of dictionaries. We can imagine a load() method using a CSV reader to create Sample objects with a species value, making them KnownSample objects. The load() method splits the KnownSample objects into the training and testing lists, which is an important state change for a TrainingData object.
  2. Hyperparameter testing: We'll need a test() method in the Hyperparameter class. The body of the test() method works with the test samples in the associated TrainingData object. For each sample, it applies the classifier and counts the matches between Botanist-assigned species and the best guess of our AI algorithm. This points out the need for a classify() method for a single sample that's used by the test() method for a batch of samples. The test() method will update the state of the Hyperparameter object by setting the quality score.
  3. User-initiated classification: A RESTful web application is often decomposed into separate view functions to handle requests. When handling a request to classify an unknown sample, the view function will have a Hyperparameter object used for classification; this will be chosen by the Botanist to produce the best results. The user input will be an UnknownSample instance. The view function applies the Hyperparameter.classify() method to create a response to the user with the species the iris has been classed as. Does the state change that happens when the AI classifies an UnknownSample really matter? Here are two views:
    • Each UnknownSample can have a classified attribute. Setting this is a change in the state of the Sample. It's not clear that there's any behavior change associated with this state change. 
    • The classification result is not part of the Sample at all. It's a local variable in the view function. This state change in the function is used to respond to the user, but has no life within the Sample object.

There's a key concept underlying this detailed decomposition of these alternatives:

There's no "right" answer.

Some design decisions are based on non-functional and non-technical considerations. These might include the longevity of the application, future use cases, additional users who might be enticed, current schedules and budgets, pedagogical value, technical risk, the creation of intellectual property, and how cool the demo will look in a conference call.

In Chapter 1, Object-Oriented Design, we dropped a hint that this application is the precursor to a consumer product recommender. We noted: "The users eventually want to tackle complex consumer products, but recognize that solving a difficult problem is not a good way to learn how to build this kind of application. It's better to start with something of a manageable level of complexity and then refine and expand it until it does everything they need."

Because of that, we'll consider a change in state from UnknownSample to ClassifiedSample to be very important. The Sample objects will live in a database for additional marketing campaigns or possibly reclassification when new products are available and the training data changes.

We'll decide to keep the classification and the species data in the UnknownSample class.

This analysis suggests we can coalesce all the various Sample details into the following design:

Diagram

Description automatically generated

Figure 2.3: The updated UML diagram

This view uses the open arrowhead to show a number of subclasses of Sample. We won't directly implement these as subclasses. We've included the arrows to show that we have some distinct use cases for these objects. Specifically, the box for KnownSample has a condition species is not None to summarize what's unique about these Sample objects. Similarly, the UnknownSample has a condition, species is None, to clarify our intent around Sample objects with the species attribute value of None.

In these UML diagrams, we have generally avoided showing Python's "special" methods. This helps to minimize visual clutter. In some cases, a special method may be absolutely essential, and worthy of showing in a diagram. An implementation almost always needs to have an __init__() method.

There's another special method that can really help: the  __repr__() method is used to create a representation of the object. This representation is a string that generally has the syntax of a Python expression to rebuild the object. For simple numbers, it's the number. For a simple string, it will include the quotes. For more complex objects, it will have all the necessary Python punctuation, including all the details of the class and state of the object. We'll often use an f-string with the class name and the attribute values.

Here's the start of a class, Sample, which seems to capture all the features of a single sample: 

class Sample:
    def __init__(
        self,
        sepal_length: float,
        sepal_width: float,
        petal_length: float,
        petal_width: float,
        species: Optional[str] = None,
    ) -> None:
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width
        self.petal_length = petal_length
        self.petal_width = petal_width
        self.species = species
        self.classification: Optional[str] = None
    def __repr__(self) -> str:
        if self.species is None:
            known_unknown = "UnknownSample"
        else:
            known_unknown = "KnownSample"
        if self.classification is None:
            classification = ""
        else:
            classification = f", {self.classification}"
        return (
            f"{known_unknown}("
            f"sepal_length={self.sepal_length}, "
            f"sepal_width={self.sepal_width}, "
            f"petal_length={self.petal_length}, "
            f"petal_width={self.petal_width}, "
            f"species={self.species!r}"
            f"{classification}"
            f")"
        )

The __repr__() method reflects the fairly complex internal state of this Sample object. The states implied by the presence (or absence) of a species and the presence (or absence) of a classification lead to small behavior changes. So far, any changes in object behavior are limited to the __repr__() method used to display the current state of the object. 

What's important is that the state changes do lead to a (tiny) behavioral change.

We have two application-specific methods for the Sample class. These are shown in the next code snippet:

    def classify(self, classification: str) -> None:
        self.classification = classification
    def matches(self) -> bool:
        return self.species == self.classification

The classify() method defines the state change from unclassified to classified. The matches() method compares the results of classification with a Botanist-assigned species. This is used for testing.

Here's an example of how these state changes can look:

>>> from model import Sample
>>> s2 = Sample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
>>> s2.classification = "wrong"
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa', classification='wrong')

We have a workable definition of the Sample class. The __repr__() method is quite complex, suggesting there may be some improvements possible.

It can help to define responsibilities for each class. This can be a focused summary of the attributes and methods with a little bit of additional rationale to tie them together.

Class responsibilities

Which class is responsible for actually performing a test? Does the Training class invoke the classifier on each KnownSample in a testing set? Or, perhaps, does it provide the testing set to the Hyperparameter class, delegating the testing to the Hyperparameter class? Since the Hyperparameter class has responsibility for the k value, and the algorithm for locating the k-nearest neighbors, it seems sensible for the Hyperparameter class to run the test using its own k value and a list of KnownSample instances provided to it.

It also seems clear the TrainingData class is an acceptable place to record the various Hyperparameter trials. This means the TrainingData class can identify which of the Hyperparameter instances has a value of k that classifies irises with the highest accuracy.

There are multiple, related state changes here. In this case, both the Hyperparameter and TrainingData classes will do part of the work. The system – as a whole – will change state as individual elements change state. This is sometimes described as emergent behavior. Rather than writing a monster class that does many things, we've written smaller classes that collaborate to achieve the expected goals.

This test() method of TrainingData is something that we didn't show in the UML image. We included test() in the Hyperparameter class, but, at the time, it didn't seem necessary to add it to TrainingData.

Here's the start of the class definition:

class Hyperparameter:
    """A hyperparameter value and the overall quality of the classification."""
    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        self.data: weakref.ReferenceType["TrainingData"] = weakref.ref(training)
        self.quality: float

Note how we write type hints for classes not yet defined. When a class is defined later in the file, any reference to the yet-to-be-defined class is a forward reference. The forward references to the not-yet-defined TrainingData class are provided as strings, not the simple class name. When mypy is analyzing the code, it resolves the strings into proper class names.

The testing is defined by the following method:

    def test(self) -> None:
        """Run the entire test suite."""
        training_data: Optional["TrainingData"] = self.data()
        if not training_data:
            raise RuntimeError("Broken Weak Reference")
        pass_count, fail_count = 0, 0
        for sample in training_data.testing:
            sample.classification = self.classify(sample)
            if sample.matches():
                pass_count += 1
            else:
                fail_count += 1
        self.quality = pass_count / (pass_count + fail_count)

We start by resolving the weak reference to the training data. This will raise an exception if there's a problem. For each testing sample, we classify the sample, setting the sample's classification attribute. The matches method tells us if the model's classification matches the known species. Finally, the overall quality is measured by the fraction of tests that passed. We can use the integer count, or a floating-point ratio of tests passed out of the total number of tests.

We won't look at the classification method in this chapter; we'll save that for Chapter 10, The Iterator Pattern. Instead, we'll finish this model by looking at the TrainingData class, which combines the elements seen so far.

The TrainingData class

The TrainingData class has lists with two subclasses of Sample objects. The KnownSample and UnknownSample can be implemented as extensions to a common parent class, Sample.

We'll look at this from a number of perspectives in Chapter 7. The TrainingData class also has a list with Hyperparameter instances. This class can have simple, direct references to previously defined classes.

This class has the two methods that initiate the processing:

  • The load() method reads raw data and partitions it into training data and test data. Both of these are essentially KnownSample instances with different purposes. The training subset is for evaluating the k-NN algorithm; the testing subset is for determining how well the k hyperparameter is working.
  • The test() method uses a Hyperparameter object, performs the test, and saves the result.

Looking back at Chapter 1's context diagram, we see three stories: Provide Training Data, Set Parameters and Test Classifier, and Make Classification Request. It seems helpful to add a method to perform a classification using a given Hyperparameter instance. This would add a classify() method to the TrainingData class. Again, this was not clearly required at the beginning of our design work, but seems like a good idea now.

Here's the start of the class definition:

class TrainingData:
    """A set of training data and testing data with methods to load and test the samples."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.uploaded: datetime.datetime
        self.tested: datetime.datetime
        self.training: List[Sample] = []
        self.testing: List[Sample] = []
        self.tuning: List[Hyperparameter] = []

We've defined a number of attributes to track the history of the changes to this class. The uploaded time and the tested time, for example, provide some history. The training, testing, and tuning attributes have Sample objects and Hyperparameter objects.

We won't write methods to set all of these. This is Python and direct access to attributes is a huge simplification to complex applications. The responsibilities are encapsulated in this class, but we don't generally write a lot of getter/setter methods.

In Chapter 5, When to Use Object-Oriented Programming, we'll look at some clever techniques, like Python's property definitions, additional ways to handle these attributes.

The load() method is designed to process data given by another object. We could have designed the load() method to open and read a file, but then we'd bind the TrainingData to a specific file format and logical layout. It seems better to isolate the details of the file format from the details of managing training data. In Chapter 5, we'll look closely at reading and validating input. In Chapter 9, Strings, Serialization, and File Paths, we'll revisit the file format considerations.

For now, we'll use the following outline for acquiring the training data:

    def load(
            self, 
            raw_data_source: Iterable[dict[str, str]]
    ) -> None:
        """Load and partition the raw data"""
        for n, row in enumerate(raw_data_source):
            ... filter and extract subsets (See Chapter 6)
            ... Create self.training and self.testing subsets 
        self.uploaded = datetime.datetime.now(tz=datetime.timezone.utc)

We'll depend on a source of data. We've described the properties of this source with a type hint, Iterable[dict[str, str]]. The Iterable states that the method's results can be used by a for statement or the list function. This is true of collections like lists and files. It's also true of generator functions, the subject of Chapter 10, The Iterator Pattern.

The results of this iterator need to be dictionaries that map strings to strings. This is a very general structure, and it allows us to require a dictionary that looks like this:

{
    "sepal_length": 5.1, 
    "sepal_width": 3.5, 
    "petal_length": 1.4, 
    "petal_width": 0.2, 
    "species": "Iris-setosa"
}

This required structure seems flexible enough that we can build some object that will produce it. We'll look at the details in Chapter 9.

The remaining methods delegate most of their work to the Hyperparameter class. Rather than do the work of classification, this class relies on another class to do the work:

def test(
        self, 
        parameter: Hyperparameter) -> None:
    """Test this Hyperparameter value."""
    parameter.test()
    self.tuning.append(parameter)
    self.tested = datetime.datetime.now(tz=datetime.timezone.utc)
def classify(
        self, 
        parameter: Hyperparameter, 
        sample: Sample) -> Sample:
    """Classify this Sample."""
    classification = parameter.classify(sample)
    sample.classify(classification)
    return sample

In both cases, a specific Hyperparameter object is provided as a parameter. For testing, this makes sense because each test should have a distinct value. For classification, however, the "best" Hyperparameter object should be used for classification.

This part of the case study has built class definitions for Sample, KnownSample, TrainingData, and Hyperparameter. These classes capture parts of the overall application. This isn't complete, of course; we've omitted some important algorithms. It's good to start with things that are clear, identify behavior and state change, and define the responsibilities. The next pass of design can then fill in details around this existing framework.