Comparative visualizations

Let's suppose we'd like to compare the distributions of electorate data between the UK and Russia. We've already seen in this chapter how to make use of CDFs and box plots, so let's investigate an alternative that's similar to a histogram.

We could try to plot both datasets on a single histogram, but this would be a bad idea. We wouldn't be able to interpret the results, for two reasons:

  • The sizes of the voting districts, and therefore the means of the distributions, are very different
  • The overall numbers of voting districts are so different that the histogram bars will have different heights

An alternative to the histogram that addresses both of these issues is the probability mass function (PMF).

Probability mass functions

The probability mass function, or PMF, has a lot in common with a histogram. Rather than plotting the counts of values falling into bins, however, it plots the probability that a number drawn from the distribution will be exactly equal to a given value. For example, the PMF of a fair six-sided die assigns a probability of one-sixth to each of the values one through six. Since the function assigns a probability to every value that can possibly be returned by the distribution, and because probabilities are measured on a scale from zero to one (with one corresponding to certainty), the area under the probability mass function is equal to one.

Thus, the PMF ensures that the area under our plots will be comparable between datasets. However, we still have the issue that the sizes of the voting districts—and therefore the means of the distributions—can't be compared. This can be addressed by a separate technique—normalization.

Note

Normalizing the data isn't related to the normal distribution. It's the name given to the general task of bringing one or more sequences of values into alignment. Depending on the context, it could mean simply adjusting the values so they fall within the same range, or more sophisticated procedures to ensure that the distributions of data are the same. In general, the goal of normalization is to facilitate the comparison of two or more series of data.
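
For instance, the simplest kind of normalization just rescales a series into the range zero to one. The following helper is a minimal sketch to illustrate the idea; it isn't one of the book's examples and isn't used later:

;; Rescale a sequence of numbers into the range zero to one
;; (min-max normalization). Illustrative only.
(defn min-max-normalize [xs]
  (let [lo (apply min xs)
        hi (apply max xs)]
    (map #(/ (- % lo) (- hi lo)) xs)))

;; (min-max-normalize [10 15 20]) ;=> (0 1/2 1)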

There are innumerable ways to normalize data, but one of the most basic is to ensure that each series lies in the range zero to one. None of our counts fall below zero, so we can accomplish this by simply dividing each bin's count by the total count, which also guarantees that the resulting probabilities sum to one:

(defn as-pmf [bins]
  (let [histogram (frequencies bins)
        total     (reduce + (vals histogram))]
    (->> histogram
         (map (fn [[k v]]
                [k (/ v total)]))
         (into {}))))
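
As a quick check (not one of the book's numbered examples), the probabilities returned by as-pmf should sum to one, in line with the definition above:

;; Sanity check on arbitrary illustrative data: the probabilities
;; of a PMF always sum to one.
(reduce + (vals (as-pmf [1 1 2 2 2 3])))
;; => 1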

With the as-pmf function in place, we can normalize both the UK and Russia data and plot them side by side on the same axes:

(defn ex-1-32 []
  (let [n-bins 40
        uk (->> (load-data :uk-victors)
                (i/$ :turnout)
                (bin n-bins)
                (as-pmf))
        ru (->> (load-data :ru-victors)
                (i/$ :turnout)
                (bin n-bins)
                (as-pmf))]
    (-> (c/xy-plot (keys uk) (vals uk)
                   :series-label "UK"
                   :legend true
                   :x-label "Turnout Bins"
                   :y-label "Probability")
        (c/add-lines (keys ru) (vals ru)
                     :series-label "Russia")
        (i/view))))

The preceding example generates the following chart:

[Figure: probability mass functions of turnout for the UK and Russia]

After normalization, the two distributions can be compared more readily. It's clearly apparent how—in spite of having a lower mean turnout than the UK—the Russia election had a massive uplift towards 100-percent turnout. Insofar as it represents the combined effect of many independent choices, we would expect election results to conform to the central limit theorem and be approximately normally distributed. In fact, election results from around the world generally conform to this expectation.
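
One way to eyeball this expectation for the UK data is to overlay a fitted normal curve on a density histogram of turnout. The following is a minimal sketch rather than one of the book's numbered examples; it assumes the i, c, and s namespace aliases (incanter.core, incanter.charts, and incanter.stats) and the load-data function used throughout this chapter, and the function name is our own:

;; Sketch: compare the UK turnout distribution against a normal curve
;; fitted to its mean and standard deviation.
(defn ex-normal-overlay []
  (let [turnout (->> (load-data :uk-victors)
                     (i/$ :turnout))
        mu      (s/mean turnout)
        sigma   (s/sd turnout)]
    (-> (c/histogram turnout
                     :nbins 40
                     :density true
                     :x-label "UK Turnout")
        (c/add-function #(s/pdf-normal % :mean mu :sd sigma)
                        (apply min turnout) (apply max turnout))
        (i/view))))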

Although not quite as high as the modal peak in the center of the distribution (corresponding to approximately 50-percent turnout), the spike near 100-percent turnout in the Russian election data is a very anomalous result. Researcher Peter Klimek and his colleagues at the Medical University of Vienna have gone as far as to suggest that this is a clear signature of ballot-rigging.

Scatter plots

We've observed the curious results for the turnout at the Russian election and identified that it has a different signature from the UK election. Next, let's see how the proportion of votes for the winning candidate is related to the turnout. After all, if the unexpectedly high turnout really is a sign of foul play by the incumbent government, we'd anticipate that they'll be voting for themselves rather than anyone else. Thus we'd expect most, if not all, of these additional votes to be for the ultimate election winners.

Chapter 3, Correlation, will cover the statistics behind correlating two variables in much more detail, but for now it would be interesting simply to visualize the relationship between turnout and the proportion of votes for the winning party.

The final visualization we'll introduce in this chapter is the scatter plot. Scatter plots are very useful for visualizing correlations between two variables: where a linear correlation exists, it will be evident as a diagonal tendency in the scatter plot. Incanter's c/scatter-plot function produces this kind of chart, and it takes the same arguments as the c/xy-plot function.

(defn ex-1-33 []
  (let [data (load-data :uk-victors)]
    (-> (c/scatter-plot (i/$ :turnout data)
                        (i/$ :victors-share data)
                        :x-label "Turnout"
                        :y-label "Victor's Share")
        (i/view))))

The preceding code generates the following chart:

[Figure: scatter plot of turnout against victor's share for the UK]

Although the points are arranged broadly in a fuzzy ellipse, a diagonal tendency towards the top right of the scatter plot is clearly apparent. This indicates an interesting result—turnout is correlated with the proportion of votes for the ultimate election winners. We might have expected the reverse: voter complacency leading to a lower turnout where there was a clear victor in the running.
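
Chapter 3 will quantify relationships like this properly, but as a rough, purely illustrative check we can already compute Pearson's correlation coefficient with Incanter's s/correlation function (again assuming the i and s aliases used elsewhere in this chapter; the function name below is our own):

;; Sketch: correlation between turnout and the victors' share for the
;; UK data. Values near zero indicate no linear relationship; values
;; near one indicate a strong positive one.
(defn ex-uk-correlation []
  (let [data (load-data :uk-victors)]
    (s/correlation (i/$ :turnout data)
                   (i/$ :victors-share data))))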

Note

As mentioned earlier, the UK election of 2010 was far from ordinary, resulting in a hung parliament and a coalition government. In fact, the "winners" in this case represent two parties who had, up until election day, been opponents. A vote for either counts as a vote for the winners.

Next, we'll create the same scatter plot for the Russia election:

(defn ex-1-34 []
  (let [data (load-data :ru-victors)]
    (-> (c/scatter-plot (i/$ :turnout data)
                        (i/$ :victors-share data)
                        :x-label "Turnout"
                        :y-label "Victor's Share")
        (i/view))))

This generates the following plot:

[Figure: scatter plot of turnout against victor's share for Russia]

Although a diagonal tendency in the Russia data is clearly evident from the outline of the points, the sheer volume of data obscures the internal structure. In the last section of this chapter, we'll show a simple technique for extracting structure from a chart such as the earlier one using opacity.

Scatter transparency

In situations such as the preceding one where a scatter plot is overwhelmed by the volume of points, transparency can help to visualize the structure of the data. Since translucent points that overlap will be more opaque, and areas with fewer points will be more transparent, a scatter plot with semi-transparent points can show the density of the data much better than solid points can.

We can set the alpha transparency of points plotted on an Incanter chart with the c/set-alpha function. It accepts two arguments: the chart and a number between zero and one. One signifies fully opaque and zero fully transparent.

(defn ex-1-35 []
  (let [data (-> (load-data :ru-victors)
                 (s/sample :size 10000))]
    (-> (c/scatter-plot (i/$ :turnout data)
                        (i/$ :victors-share data)
                        :x-label "Turnout"
                        :y-label "Victor's Share")
        (c/set-alpha 0.05)
        (i/view))))

The preceding example generates the following chart:

[Figure: scatter plot of turnout against victor's share for Russia, with alpha transparency]

The preceding scatter plot shows the general tendency of the victor's share and the turnout to vary together. We can see a correlation between the two values, and a "hot spot" in the top right corner of the chart corresponding to close to 100-percent turnout and 100-percent votes for the winning party. This in particular is the sign that the researchers at the Medical University of Vienna have highlighted as being the signature of electoral fraud. It's evident in the results of other disputed elections around the world, such as those of the 2011 Ugandan presidential election, too.

Tip

The district-level results for many other elections around the world are available at http://www.complex-systems.meduniwien.ac.at/elections/election.html. Visit the site for links to the research paper and to download other datasets on which to practice what you've learned in this chapter about scrubbing and transforming real data.

We'll cover correlation in more detail in Chapter 3, Correlation, where we'll learn how to quantify the strength of the relationship between two values and build a predictive model based on it. We'll also revisit this data in Chapter 10, Visualization, when we implement a custom two-dimensional histogram to visualize the relationship between turnout and the winner's proportion of the vote even more clearly.