Machine Learning: End-to-End guide for Java developers

By: Boštjan Kaluža, Jennifer L. Reese, Krishna Choppella, Richard M. Reese, Uday Kamath

Overview of this book

Machine learning is one of the core areas of artificial intelligence, in which computers are trained to learn, adapt, and improve on their own without being explicitly programmed. In this course, we cover how Java is used to build powerful machine learning models that address problems in data science. The course demonstrates data extraction and statistical analysis techniques supported by Java, applies various machine learning methods, explores machine learning sub-domains, and examines real-world use cases such as recommendation systems, fraud detection, and natural language processing.

The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section gives a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The following sections apply machine learning methods in Java to a variety of tasks, including classification, prediction, forecasting, market basket analysis, clustering, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning. The last section highlights real-world test cases such as activity recognition, image recognition, text classification, and anomaly detection.

The course includes premium content from three of our most popular books: Java for Data Science, Machine Learning in Java, and Mastering Java Machine Learning.

On completion of this course, you will understand various machine learning techniques, the Java machine learning algorithms you can use to gain insights from data, how to build data models to analyze larger, complex datasets, and how to develop applications using Java and machine learning algorithms in the field of artificial intelligence.
Chapter 4. Data Visualization

The human mind is often good at seeing patterns, trends, and outliers in visual representations. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences, ranging from analysts, to upper-level management, to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.

In this chapter, we will illustrate how to create different types of graph, plot, and chart. The majority of the examples use JavaFX, with a few using a free library called GRAphing Library (GRAL). There are several open source Java plotting libraries available. A brief comparison of several of these libraries can be found at https://github.com/eseifert/gral/wiki/comparison. We chose JavaFX because it is packaged as part of Java SE.

GRAL is used to illustrate plots that are not as easily created using JavaFX. GRAL is a free Java library useful for creating a variety of charts and graphs. This graphing library provides flexibility in types of plots, axis formatting, and export options. GRAL resources (http://trac.erichseifert.de/gral/) include example code and helpful how-to sections.

Visualization is an important step in data analysis because it allows us to conceive of large datasets in practical and meaningful ways. We can look at small datasets of values and perhaps draw conclusions from the patterns we see, but this is an overwhelming and unreliable process. Using visualization tools helps us identify potential problems or unexpected data results, as well as construct meaningful interpretations of good data.

One example of the usefulness of data visualization comes with the presence of outliers. Visualizing data allows us to quickly see data results significantly outside of our expectations, and we can choose how to modify the data to build a clean and usable dataset. This process allows us to see errors quickly and deal with them before they become a problem later on. Additionally, visualization allows us to easily classify information and help analysts organize their inquiries in a manner best suited to their particular dataset.

Understanding plots and graphs

There are many types of visual expression available to aid in visualization. We are going to briefly discuss the most common and useful ones, and then demonstrate several Java techniques for achieving these types of expression. The choice of graph or other visualization tool will depend on the dataset and on the needs and constraints of the application.

A bar chart is a very common technique for displaying relationships in data. In this type of graph, data is represented in either vertical or horizontal bars placed along an X and Y axis. The data is scaled so the values represented by each bar can be compared to one another. The following is a simple example of a bar chart we will create in the Using country as the category section:

[Figure: example bar chart]

A pie chart is most useful when you want to demonstrate a value in relation to a larger set. Think of this as a way to visualize how large the piece of pie is in relation to the entire pie. The following is a simple example of a pie chart showing the distribution of population for selected European countries:

[Figure: example pie chart of the population distribution for selected European countries]

Time series graphs are a special type of graph used for displaying time-related values. These are most appropriate when the data analysis requires an understanding of how data changes over a period of time. In these graphs, the vertical axis corresponds to the values and the horizontal axis corresponds to particular points in time. In particular, this type of graph can be useful for identifying trends across time, or suggesting correlations between data values and particular events within a given time period.

For example, stock prices and home prices will change, but their rate of change varies. Pollution levels and crime rates also change over time. There are several techniques that visualize this type of data. Often, specific values are not as important as their trend over time.

An index chart is also called a line chart. Line charts use the X and Y axes to plot points on a grid. They can be used to represent time series data. These points are connected by lines, and these lines are used to compare the values of multiple data series at one time. This comparison is usually achieved by plotting independent variables, such as time, along the X axis, and dependent variables, such as frequency or percentages, along the Y axis.

The following is a simple example of an index chart showing the distribution of population for selected European countries:

[Figure: example index chart]

When we wish to arrange larger amounts of data in a compact and useful manner, we may opt for a stem and leaf plot. This type of visual expression allows you to demonstrate the correlation of one value to many values in a readable manner. The stem refers to a data value, and the leaves are the corresponding data points. One common example of this is a train timetable. In the following table, the departure times for a train are listed:

06:15  06:20  06:25  06:30  06:40  06:45
06:55  07:15  07:20  07:25  07:30  07:40
07:45  07:55  08:00  08:12  08:24  08:36
08:48  09:00  09:12  09:24  09:36  09:48
10:00  10:12  10:24  10:36  10:48

However, this table can be hard to read. Instead, in the following partial stem and leaf plot, the stem represents the hours at which a train may depart, while the leaves represent the minutes within each hour:

Hour  Minute
06    :15 :20 :25 :30 :40 :45 :55
07    :15 :20 :25 :30 :40 :45 :55
08    :00 :12 :24 :36 :48
09    :00 :12 :24 :36 :48
10    :00 :12 :24 :36 :48

This is much easier to read and process.
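The grouping behind this plot can be sketched in a few lines of plain Java. This is only an illustration: the StemAndLeaf class and its group method are hypothetical names introduced here, and the sketch assumes every time is a well-formed HH:mm string:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StemAndLeaf {

    // Groups "HH:mm" strings by hour (the stem); the minutes become the leaves.
    // A TreeMap keeps the stems in ascending order.
    public static Map<String, List<String>> group(List<String> times) {
        Map<String, List<String>> plot = new TreeMap<>();
        for (String time : times) {
            String[] parts = time.split(":");
            plot.computeIfAbsent(parts[0], k -> new ArrayList<>())
                .add(":" + parts[1]);
        }
        return plot;
    }

    public static void main(String[] args) {
        // A few of the departure times from the timetable above
        List<String> times = Arrays.asList(
            "06:15", "06:20", "06:25", "07:15", "07:20", "08:00", "08:12");
        group(times).forEach((hour, minutes) ->
            System.out.println(hour + "  " + String.join(" ", minutes)));
    }
}
```

Each printed line holds one stem followed by its leaves, mirroring the rows of the plot.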

A very popular form of visualization in statistical analysis is the histogram. Histograms allow you to display frequencies within data using bars, similar to a bar chart. The main difference is that histograms are used to identify frequencies and trends within a dataset while bar charts are used to compare specific data values within a dataset. The following is an example of a histogram we will create in the Creating histograms section:

[Figure: example histogram]
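The frequencies a histogram displays come from counting how many values fall into each interval, or bin. A minimal sketch of that counting step, assuming equal-width bins, might look as follows. The HistogramBins class, its count method, and the sample ages are made up for illustration and do not come from the chapter's code:

```java
import java.util.Arrays;

public class HistogramBins {

    // Counts how many values fall into each of binCount equal-width bins.
    // Bin i covers [min + i * binWidth, min + (i + 1) * binWidth);
    // values outside the covered range are ignored.
    public static int[] count(double[] values, double min,
            double binWidth, int binCount) {
        int[] counts = new int[binCount];
        for (double v : values) {
            int bin = (int) Math.floor((v - min) / binWidth);
            if (bin >= 0 && bin < binCount) {
                counts[bin]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data
        double[] ages = {3, 7, 12, 15, 18, 21, 22, 29, 34, 38};
        // Four bins of width 10, starting at 0
        System.out.println(Arrays.toString(count(ages, 0, 10, 4)));
    }
}
```

The resulting counts are what a charting library would then draw as adjacent bars.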

A scatter plot is simply a collection of points. Analysis techniques, such as correlation or regression, can be used to identify trends within this type of graph. In the following scatter chart, as developed in Creating scatter charts, the population along the X axis is plotted against the decade along the Y axis:

[Figure: example scatter chart]
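As one illustration of the correlation analysis mentioned above, the following sketch computes the Pearson correlation coefficient for two samples. The Correlation class and pearson method are hypothetical names, and the data is the Belgian population series used in the Creating index charts section of this chapter:

```java
public class Correlation {

    // Pearson correlation coefficient of two equally long samples.
    // The result lies between -1 and 1; values near 1 suggest a strong
    // increasing linear relationship.
    public static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException(
                "Samples must have the same length of at least 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] decades = {1950, 1960, 1970, 1980, 1990, 2000};
        double[] population = {8639369, 9118700, 9637800,
            9846800, 9969310, 10263618};
        System.out.printf("r = %.3f%n", pearson(decades, population));
    }
}
```

A coefficient close to 1 for this data would be consistent with the steady growth visible when the points are plotted.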

Visual analysis goals

Each type of visual expression lends itself to different types of data and data analysis purposes. One common purpose of data analysis is data classification. This involves determining which subset within a dataset a particular data value belongs to. This process may occur early in the data analysis process because breaking data apart into manageable and related pieces simplifies the analysis process. Often, classification is not the end goal but rather an important intermediary step before further analysis can be undertaken.

Regression analysis is a complex and important form of data analysis. It involves studying relationships between independent and dependent variables, as well as multiple independent variables. This type of statistical analysis allows the analyst to identify ranges of acceptable or expected values and determine how individual values may fit into a larger dataset. Regression analysis is a significant part of machine learning, and we will discuss it in more detail in Chapter 5, Statistical Data Analysis Techniques.

Clustering allows us to identify groups of data points within a particular set or class. While classification sorts data into similar types of datasets, clustering is concerned with the data within the set. For example, we may have a large dataset containing all feline species in the world, in the family Felidae. We could then classify these cats into two groups, Pantherinae (containing most larger cats) and Felinae (all other cats). Clustering would involve grouping subsets of similar cats within one of these classifications. For example, all tigers could be a cluster within the Pantherinae group.

Sometimes, our data analysis requires that we extract specific types of information from our dataset. The process of selecting the data to extract is known as attribute selection or feature selection. This process helps analysts simplify the data models and allows us to overcome issues with redundant or irrelevant information within our dataset.

With this introduction to basic plot and chart types, we will discuss Java support for creating these plots and charts.

Creating index charts

An index chart is a line chart that shows the percentage change of something over time. Frequently, such a chart is based on a single data attribute. In the following example, we will be using the Belgian population for six decades. The data is a subset of population data found at https://ourworldindata.org/grapher/population-by-country?tab=data:

Decade  Population
1950    8639369
1960    9118700
1970    9637800
1980    9846800
1990    9969310
2000    10263618
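The example that follows plots these raw figures. If percentage change relative to the first decade is wanted instead, the values could first be converted to an index in which the base period equals 100. The following is a rough sketch of that conversion; the IndexValues class and toIndex method are hypothetical names that are not part of the chapter's code:

```java
public class IndexValues {

    // Expresses each value as a percentage of the first (base) value,
    // so the base period is always 100.
    public static double[] toIndex(long[] values) {
        double[] index = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            index[i] = 100.0 * values[i] / values[0];
        }
        return index;
    }

    public static void main(String[] args) {
        String[] decades = {"1950", "1960", "1970", "1980", "1990", "2000"};
        long[] population = {8639369, 9118700, 9637800,
            9846800, 9969310, 10263618};
        double[] index = toIndex(population);
        for (int i = 0; i < decades.length; i++) {
            System.out.printf("%s  %.1f%n", decades[i], index[i]);
        }
    }
}
```

An index value of 105, for example, would mean a 5 percent increase over the 1950 population.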

We start by creating the MainApp class, which extends Application. We create a series of instance variables. The XYChart.Series class represents a series of data points for some plot. In our case, this will be for the decades and population, which we will initialize shortly. The next declaration is for the CategoryAxis and NumberAxis instances. These represent the X and Y axes respectively. The declaration for the Y axis includes range and increment values for the population. This makes the chart a bit more readable. The last declaration is a string variable for the country:

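import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.chart.CategoryAxis;
import javafx.scene.chart.LineChart;
import javafx.scene.chart.NumberAxis;
import javafx.scene.chart.XYChart;
import javafx.stage.Stage;

public class MainApp extends Application {
    final XYChart.Series<String, Number> series =
        new XYChart.Series<>();
    final CategoryAxis xAxis = new CategoryAxis();
    final NumberAxis yAxis =
        new NumberAxis(8000000, 11000000, 1000000);
    final static String belgium = "Belgium";
    ...
}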

In JavaFX, the main method usually launches the application using the base class launch method. Eventually, the start method is called, which we override. In this example, we call the simpleIndexChart method where the user interface is created:

public static void main(String[] args) { 
    launch(args); 
} 
 
public void start(Stage stage) { 
    simpleIndexChart (stage); 
} 

The simpleIndexChart method follows and is passed an instance of the Stage class. This represents the client area of the application's window. We set a title for the application and for the line chart, and label the Y axis. The LineChart instance, which represents the line chart, is initialized using the X and Y axis instances:

public void simpleIndexChart(Stage stage) {
    stage.setTitle("Index Chart");
    // The chart must be created before its title can be set
    final LineChart<String, Number> lineChart
            = new LineChart<>(xAxis, yAxis);
    lineChart.setTitle("Belgium Population");
    yAxis.setLabel("Population");

    ...
}

The series is given a name, and then the population for each decade is added to the series using the addDataItem helper method:

series.setName("Population"); 
addDataItem(series, "1950", 8639369); 
addDataItem(series, "1960", 9118700); 
addDataItem(series, "1970", 9637800); 
addDataItem(series, "1980", 9846800); 
addDataItem(series, "1990", 9969310); 
addDataItem(series, "2000", 10263618); 

The addDataItem method follows, which creates an XYChart.Data class instance using the String and Number values passed to it. It then adds the instance to the series:

public void addDataItem(XYChart.Series<String, Number> series, 
        String x, Number y) { 
    series.getData().add(new XYChart.Data<>(x, y)); 
} 

The last part of the simpleIndexChart method creates a Scene class instance that represents the content of the stage. JavaFX uses the concept of a stage and scene to deal with the internals of the application's GUI.

The scene is created using the line chart, and the application's size is set to 800 by 600 pixels. The series is then added to the line chart and the scene is added to the stage. The show method displays the application:

Scene scene = new Scene(lineChart, 800, 600); 
lineChart.getData().add(series); 
stage.setScene(scene); 
stage.show(); 

When the application executes, the following window will be displayed:

[Figure: index chart window for the Belgian population]

Creating bar charts

A bar chart uses two axes with rectangular bars that can be positioned either vertically or horizontally. The length of a bar is proportional to the value it represents. A bar chart can be used to show time series data.

In the following series of examples, we will be using a set of European country populations for three decades, as listed in the following table. The data is a subset of population data found at https://ourworldindata.org/grapher/population-by-country?tab=data:

Country         1950        1960        1970
Belgium         8,639,369   9,118,700   9,637,800
France          42,518,000  46,584,000  51,918,000
Germany         68,374,572  72,480,869  77,783,164
Netherlands     10,113,527  11,486,000  13,032,335
Sweden          7,014,005   7,480,395   8,042,803
United Kingdom  50,127,000  52,372,000  55,632,000

The first of three bar charts will be constructed using JavaFX. We start with a series of declarations for the countries as part of a class that extends the Application class:

public class MainApp extends Application { 
    final static String belgium = "Belgium"; 
    final static String france = "France"; 
    final static String germany = "Germany"; 
    final static String netherlands = "Netherlands"; 
    final static String sweden = "Sweden"; 
    final static String unitedKingdom = "United Kingdom"; 
 
... 
} 

Next, we declare a series of instance variables that represent the parts of a graph. The first are CategoryAxis and NumberAxis instances:

final CategoryAxis xAxis = new CategoryAxis(); 
final NumberAxis yAxis = new NumberAxis(); 

The population and country data is stored in a series of XYChart.Series instances. Here, we have declared six different series, which use a string and number pair. The first example does not use all six series, but later examples will. We will initially assign a country string and its corresponding population to three series. These series will represent the populations for the decades 1950, 1960, and 1970:

final XYChart.Series<String, Number> series1 =  
    new XYChart.Series<>(); 
final XYChart.Series<String, Number> series2 =
    new XYChart.Series<>(); 
final XYChart.Series<String, Number> series3 =  
    new XYChart.Series<>(); 
final XYChart.Series<String, Number> series4 =  
    new XYChart.Series<>(); 
final XYChart.Series<String, Number> series5 =  
    new XYChart.Series<>(); 
final XYChart.Series<String, Number> series6 =  
    new XYChart.Series<>(); 

We will build three bar charts. The first two are simple bar charts: one shows the countries as categories along the X axis, with a bar for each decade within each category and the population along the Y axis; the other shows the decades as categories, each containing the countries. The last example is a stacked bar chart.
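The two simple layouts differ only in which attribute supplies the series and which supplies the categories. As a rough, library-independent sketch, the same figures can be regrouped either way by swapping the two levels of a nested map. The PopulationPivot class and pivot method are hypothetical names, not part of JavaFX or of the chapter's code, and only a subset of the table is used:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PopulationPivot {

    // Swaps the two levels of a nested map: the outer keys become
    // the inner keys and vice versa. Insertion order is preserved.
    public static Map<String, Map<String, Integer>> pivot(
            Map<String, Map<String, Integer>> table) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        table.forEach((outer, row) ->
            row.forEach((inner, value) ->
                result.computeIfAbsent(inner, k -> new LinkedHashMap<>())
                      .put(outer, value)));
        return result;
    }

    public static void main(String[] args) {
        // decade -> (country -> population): one series per decade
        Map<String, Map<String, Integer>> byDecade = new LinkedHashMap<>();
        Map<String, Integer> y1950 = new LinkedHashMap<>();
        y1950.put("Belgium", 8639369);
        y1950.put("Sweden", 7014005);
        Map<String, Integer> y1960 = new LinkedHashMap<>();
        y1960.put("Belgium", 9118700);
        y1960.put("Sweden", 7480395);
        byDecade.put("1950", y1950);
        byDecade.put("1960", y1960);

        // country -> (decade -> population): one series per country
        System.out.println(pivot(byDecade));
    }
}
```

In chart terms, the outer key plays the role of the series name and the inner key the role of the category.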

Using country as the category

The elements of the bar chart are set up in the simpleBarChartByCountry method. The title of the chart is set and a BarChart class instance is created using the two axes. The chart, its X axis, and its Y axis also have labels that are initialized here:

public void simpleBarChartByCountry(Stage stage) { 
    stage.setTitle("Bar Chart"); 
    final BarChart<String, Number> barChart 
                = new BarChart<>(xAxis, yAxis); 
    barChart.setTitle("Country Summary"); 
    xAxis.setLabel("Country"); 
    yAxis.setLabel("Population"); 
    ... 
} 

Next, the first three series are initialized with a name, and then the country and population data for that series. A helper method, addDataItem, as introduced in the previous section, is used to add the data to each series:

series1.setName("1950"); 
addDataItem(series1,belgium, 8639369); 
addDataItem(series1,france, 42518000); 
addDataItem(series1,germany, 68374572); 
addDataItem(series1,netherlands, 10113527); 
addDataItem(series1,sweden, 7014005); 
addDataItem(series1,unitedKingdom, 50127000); 
 
series2.setName("1960"); 
addDataItem(series2,belgium, 9118700); 
addDataItem(series2,france, 46584000); 
addDataItem(series2,germany, 72480869); 
addDataItem(series2,netherlands, 11486000); 
addDataItem(series2,sweden, 7480395); 
addDataItem(series2,unitedKingdom, 52372000); 
 
series3.setName("1970"); 
addDataItem(series3,belgium, 9637800); 
addDataItem(series3,france, 51918000); 
addDataItem(series3,germany, 77783164); 
addDataItem(series3,netherlands, 13032335); 
addDataItem(series3,sweden, 8042803); 
addDataItem(series3,unitedKingdom, 55632000); 

The last part of the method creates a Scene instance. The three series are added to the bar chart, and the scene is attached to the stage using the setScene method. A stage is a class that essentially represents the client area of a window:

Scene scene = new Scene(barChart, 800, 600); 
barChart.getData().addAll(series1, series2, series3); 
stage.setScene(scene); 
stage.show(); 

The last of the two remaining methods is the start method, which JavaFX calls automatically once the application has been launched. It is passed the Stage instance. Here, we call the simpleBarChartByCountry method:

public void start(Stage stage) { 
    simpleBarChartByCountry(stage); 
} 

The main method consists of a call to the Application class's launch method:

public static void main(String[] args) { 
    launch(args); 
} 

When the application is executed, the following graph is displayed:

[Figure: bar chart with country as the category]

Using decade as the category

In the following example, we will demonstrate how to display the same information, but we will organize the X axis categories by year. We will use the simpleBarChartByYear method, as shown next. The axis and titles are set up in the same way as before, but with different values for the title and labels:

public void simpleBarChartByYear(Stage stage) { 
    stage.setTitle("Bar Chart"); 
    final BarChart<String, Number> barChart 
            = new BarChart<>(xAxis, yAxis); 
    barChart.setTitle("Year Summary"); 
    xAxis.setLabel("Year"); 
    yAxis.setLabel("Population"); 
    ... 
} 

The following string variables are declared for the three decades:

String year1950 = "1950"; 
String year1960 = "1960"; 
String year1970 = "1970"; 

The data series are created in the same way as before, except the country name is used for the series name and the year is used for the category. In addition, six series are used, one for each country:

series1.setName(belgium); 
addDataItem(series1, year1950, 8639369); 
addDataItem(series1, year1960, 9118700); 
addDataItem(series1, year1970, 9637800); 
 
series2.setName(france); 
addDataItem(series2, year1950, 42518000); 
addDataItem(series2, year1960, 46584000); 
addDataItem(series2, year1970, 51918000); 
 
series3.setName(germany); 
addDataItem(series3, year1950, 68374572); 
addDataItem(series3, year1960, 72480869); 
addDataItem(series3, year1970, 77783164); 
 
series4.setName(netherlands); 
addDataItem(series4, year1950, 10113527); 
addDataItem(series4, year1960, 11486000); 
addDataItem(series4, year1970, 13032335); 
 
series5.setName(sweden); 
addDataItem(series5, year1950, 7014005); 
addDataItem(series5, year1960, 7480395); 
addDataItem(series5, year1970, 8042803); 
 
series6.setName(unitedKingdom); 
addDataItem(series6, year1950, 50127000); 
addDataItem(series6, year1960, 52372000); 
addDataItem(series6, year1970, 55632000); 

The scene is created and attached to the stage:

Scene scene = new Scene(barChart, 800, 600); 
barChart.getData().addAll(series1, series2,  
    series3, series4, series5, series6); 
stage.setScene(scene); 
stage.show(); 

The main method is unchanged, but the start method calls the simpleBarChartByYear method instead:

public void start(Stage stage) { 
    simpleBarChartByYear(stage); 
} 

When the application is executed, the following graph is displayed:

[Figure: bar chart with decade as the category]

Creating stacked graphs

An area chart depicts information by allocating more space for larger values. By stacking area charts on top of each other we create a stacked graph, sometimes called a stream graph. However, stacked graphs do not work well with negative values and cannot be used for data where summation does not make sense, such as with temperatures. If too many graphs are stacked, then it can become difficult to interpret.
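Stacking works by summation: each segment starts where the previous one ends, which is why the values must be meaningful when added together. A minimal sketch of that running total, using the Belgian figures from the bar chart table, is shown here. The StackedTotals class and cumulative method are hypothetical names used only for illustration:

```java
import java.util.Arrays;

public class StackedTotals {

    // Returns the running totals of one category's values; entry i is
    // the height at which segment i of the stacked bar ends.
    public static long[] cumulative(long[] segments) {
        long[] tops = new long[segments.length];
        long sum = 0;
        for (int i = 0; i < segments.length; i++) {
            sum += segments[i];
            tops[i] = sum;
        }
        return tops;
    }

    public static void main(String[] args) {
        // Belgium's population for 1950, 1960, and 1970
        long[] belgium = {8639369, 9118700, 9637800};
        System.out.println(Arrays.toString(cumulative(belgium)));
    }
}
```

The last entry is the total height of the stacked bar for that category.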

Next, we will show how to create a stacked bar chart. The stackedGraphExample method contains the code to create the bar chart. We start with familiar code to set the title and labels. However, for the X axis, the setCategories method is used to set the categories explicitly. Its argument is an ObservableList created by the FXCollections.<String>observableArrayList method, which in turn is passed a list of the country names created by the Arrays class's asList method:

public void stackedGraphExample(Stage stage) { 
    stage.setTitle("Stacked Bar Chart"); 
    final StackedBarChart<String, Number> stackedBarChart 
            = new StackedBarChart<>(xAxis, yAxis); 
    stackedBarChart.setTitle("Country Population"); 
    xAxis.setLabel("Country"); 
    xAxis.setCategories( 
        FXCollections.<String>observableArrayList( 
                Arrays.asList(belgium, germany, france, 
                    netherlands, sweden, unitedKingdom))); 
    yAxis.setLabel("Population"); 
    ... 
} 

Each series is initialized with the year as its name, and each country's population is added to it using the addDataItem helper method. The scene is then created:

series1.setName("1950"); 
addDataItem(series1, belgium, 8639369); 
addDataItem(series1, france, 42518000); 
addDataItem(series1, germany, 68374572); 
addDataItem(series1, netherlands, 10113527); 
addDataItem(series1, sweden, 7014005); 
addDataItem(series1, unitedKingdom, 50127000); 
 
series2.setName("1960"); 
addDataItem(series2, belgium, 9118700); 
addDataItem(series2, france, 46584000); 
addDataItem(series2, germany, 72480869); 
addDataItem(series2, netherlands, 11486000); 
addDataItem(series2, sweden, 7480395); 
addDataItem(series2, unitedKingdom, 52372000); 
 
series3.setName("1970"); 
addDataItem(series3, belgium, 9637800); 
addDataItem(series3, france, 51918000); 
addDataItem(series3, germany, 77783164); 
addDataItem(series3, netherlands, 13032335); 
addDataItem(series3, sweden, 8042803); 
addDataItem(series3, unitedKingdom, 55632000); 
 
Scene scene = new Scene(stackedBarChart, 800, 600); 
stackedBarChart.getData().addAll(series1, series2, series3); 
stage.setScene(scene); 
stage.show(); 
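
To make explicit what the stacked bars represent, the following is a minimal sketch in plain Java, independent of JavaFX. It computes the height at which each stacked segment ends for one category, using Belgium's three values from the series above. The class and method names (StackedSegments, segmentTops) are hypothetical and are not part of the book's code. The sketch also rejects negative values, reflecting the limitation of stacked graphs noted at the start of this section:

```java
import java.util.Arrays;

/** Computes where each stacked segment ends for a single category. */
final class StackedSegments {

    /**
     * Returns the running totals of the given values, that is, the
     * height at which each stacked segment ends.
     */
    static long[] segmentTops(long[] values) {
        long[] tops = new long[values.length];
        long runningTotal = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0) {
                // Stacking only makes sense for non-negative, summable data
                throw new IllegalArgumentException(
                        "Negative value at index " + i + ": " + values[i]);
            }
            runningTotal += values[i];
            tops[i] = runningTotal;
        }
        return tops;
    }

    public static void main(String[] args) {
        // Belgium's values from series1 (1950), series2 (1960), series3 (1970)
        long[] belgium = {8639369L, 9118700L, 9637800L};
        System.out.println(Arrays.toString(segmentTops(belgium)));
    }
}
```

The last running total is the full height of the bar for that country, which is why stacking is only meaningful when the sum of the series has an interpretation of its own.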

The main method is unchanged, but the start method calls the stackedGraphExample method instead:

public void start(Stage stage) { 
    stackedGraphExample(stage); 
} 

When the application is executed, the following graph is displayed:

Creating stacked graphs

Creating pie charts

The following pie chart example is based on the 2000 population of selected European countries as summarized here:

Country           Population    Percentage
Belgium           10,263,618    3
France            61,137,000    26
Germany           82,187,909    35
Netherlands       15,907,853    7
Sweden            8,872,000     4
United Kingdom    59,522,468    25

The JavaFX implementation uses the same Application base class and main method as used in the previous examples. We will not use a separate method for creating the GUI, but instead place this code in the start method, as shown here:

public class PieChartSample extends Application { 
 
    public void start(Stage stage) { 
        Scene scene = new Scene(new Group()); 
        stage.setTitle("European Country Population"); 
        stage.setWidth(500); 
        stage.setHeight(500); 
        ... 
    } 
 
    public static void main(String[] args) { 
        launch(args); 
    } 
     
} 

A pie chart is represented by the PieChart class. We can create and initialize the pie chart in the constructor by using an ObservableList of pie chart data. This data consists of a series of PieChart.Data instances, each containing a text label and a percentage value.

The next sequence creates an ObservableList instance based on the European population data presented earlier. The FXCollections class's observableArrayList method returns an ObservableList instance with a list of pie chart data:

ObservableList<PieChart.Data> pieChartData = 
        FXCollections.observableArrayList( 
        new PieChart.Data("Belgium", 3), 
        new PieChart.Data("France", 26), 
        new PieChart.Data("Germany", 35), 
        new PieChart.Data("Netherlands", 7), 
        new PieChart.Data("Sweden", 4), 
        new PieChart.Data("United Kingdom", 25)); 
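
A pie chart draws each slice in proportion to its share of the total of all values. The following plain Java sketch illustrates that idea by converting the six percentages above into slice angles. It is an illustration only and does not use or reproduce JavaFX internals. The class and method names (PieSliceAngles, toDegrees) are hypothetical:

```java
/** Converts pie slice values into the angles they would occupy. */
final class PieSliceAngles {

    /** Returns one angle in degrees per value, proportional to its share of the total. */
    static double[] toDegrees(double[] values) {
        double total = 0.0;
        for (double value : values) {
            total += value;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("Values must sum to a positive number");
        }
        double[] angles = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            angles[i] = 360.0 * values[i] / total;
        }
        return angles;
    }

    public static void main(String[] args) {
        String[] countries = {"Belgium", "France", "Germany",
            "Netherlands", "Sweden", "United Kingdom"};
        // The percentages passed to PieChart.Data in the listing above
        double[] percentages = {3, 26, 35, 7, 4, 25};
        double[] angles = toDegrees(percentages);
        for (int i = 0; i < countries.length; i++) {
            System.out.printf("%-15s %6.1f degrees%n", countries[i], angles[i]);
        }
    }
}
```

Because the angles depend only on each value's share of the total, the values passed to the chart do not have to be percentages. Raw population counts would produce the same proportions.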

We then create the pie chart and set its title. The pie chart is then added to the scene, the scene is associated with the stage, and then the window is displayed:

final PieChart pieChart = new PieChart(pieChartData); 
pieChart.setTitle("Country Population"); 
((Group) scene.getRoot()).getChildren().add(pieChart); 
stage.setScene(scene); 
stage.show(); 

When the application is executed, the following graph is displayed:

Creating pie charts

Creating scatter charts

Scatter charts also use the XYChart.Series class in JavaFX. For this example, we will use a set of European data that combines the previous European countries and covers their population from 1500 through 2000. This information is stored in a file called EuropeanScatterData.csv. The first part of this file is shown here:

1500 1400000
1600 1600000
1650 1500000
1700 2000000
1750 2250000
1800 3250000
1820 3434000
1830 3750000
1840 4080000
...

We start with the declaration of the JavaFX MainApp class, as shown next. The main method launches the application and the start method creates the user interface:

public class MainApp extends Application { 
    @Override 
    public void start(Stage stage) throws Exception { 
        ... 
    } 
 
    public static void main(String[] args) { 
        launch(args); 
    } 
} 

Within the start method we set the title, create the axes, and create an instance of the ScatterChart class that represents the scatter plot. Each NumberAxis constructor is passed a lower bound, an upper bound, and a tick unit that match the data range better than the defaults used by the no-argument constructor:

stage.setTitle("Scatter Chart Sample"); 
final NumberAxis yAxis = new NumberAxis(1400, 2100, 100); 
final NumberAxis xAxis = new NumberAxis(500000, 90000000, 
    1000000); 
final ScatterChart<Number, Number> scatterChart = new  
    ScatterChart<>(xAxis, yAxis); 
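
The bounds above are hard-coded. As a hedged sketch of one way such bounds could be derived from the data instead, the following plain Java code snaps the minimum and maximum outward to the tick unit and pads each end by one tick. This is an assumption about a reasonable approach, not the method the listing above uses. The names AxisRange and paddedRange are hypothetical:

```java
/** Derives padded axis bounds from a set of values. */
final class AxisRange {

    /**
     * Returns {lowerBound, upperBound}: the data range snapped outward
     * to a multiple of tickUnit and then padded by one tick on each side.
     */
    static double[] paddedRange(double[] values, double tickUnit) {
        if (values.length == 0 || tickUnit <= 0.0) {
            throw new IllegalArgumentException("Need data and a positive tick unit");
        }
        double min = values[0];
        double max = values[0];
        for (double value : values) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        double lower = Math.floor(min / tickUnit) * tickUnit - tickUnit;
        double upper = Math.ceil(max / tickUnit) * tickUnit + tickUnit;
        return new double[] {lower, upper};
    }

    public static void main(String[] args) {
        // Years from the start of the file, plus the stated end year 2000
        double[] years = {1500, 1600, 1650, 1700, 1750, 1800, 1820, 1830, 1840, 2000};
        double[] range = paddedRange(years, 100);
        System.out.println(range[0] + " to " + range[1]);
    }
}
```

For years from 1500 through 2000 and a tick unit of 100, this rule yields the same 1400 and 2100 used for the Y axis above.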

Next, the axes' labels are set along with the scatter chart's title:

xAxis.setLabel("Population"); 
yAxis.setLabel("Decade"); 
scatterChart.setTitle("Population Scatter Graph"); 

An instance of the XYChart.Series class is created to hold the data points:

XYChart.Series<Number, Number> series = new XYChart.Series<>(); 

The series is populated using a CSVReader class instance and the file EuropeanScatterData.csv. This process was discussed in Chapter 3, Data Cleaning:

try (CSVReader dataReader = new CSVReader(
        new FileReader("EuropeanScatterData.csv"), ',')) { 
    String[] nextLine; 
    while ((nextLine = dataReader.readNext()) != null) { 
        int decade = Integer.parseInt(nextLine[0]); 
        int population = Integer.parseInt(nextLine[1]); 
        // Population is plotted on the X axis and decade on the Y axis
        series.getData().add(new XYChart.Data<Number, Number>( 
            population, decade)); 
        out.println("Decade: " + decade +  
            "  Population: " + population); 
    } 
} 
scatterChart.getData().addAll(series); 
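
The file excerpt shown earlier separates the two fields with a space, while the reader above is configured with a comma. The following plain Java sketch, which does not use the CSVReader class, parses the same two-column data and accepts either separator. It is an illustrative alternative under that assumption, and the names DecadePopulationParser and parse are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/** Parses "decade population" pairs separated by commas and/or whitespace. */
final class DecadePopulationParser {

    /** Returns one {decade, population} pair per non-blank line of text. */
    static List<int[]> parse(String text) {
        List<int[]> rows = new ArrayList<>();
        for (String rawLine : text.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // Skip blank lines
            }
            String[] fields = line.split("[,\\s]+");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected two fields: " + line);
            }
            rows.add(new int[] {
                Integer.parseInt(fields[0]), Integer.parseInt(fields[1])});
        }
        return rows;
    }

    public static void main(String[] args) {
        // The first two lines of the excerpt, plus one comma-separated line
        String sample = "1500 1400000\n1600 1600000\n1650,1500000\n";
        for (int[] row : parse(sample)) {
            System.out.println("Decade: " + row[0] + "  Population: " + row[1]);
        }
    }
}
```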

The JavaFX scene and stage are created, and then the plot is displayed:

Scene scene = new Scene(scatterChart, 500, 400); 
stage.setScene(scene); 
stage.show(); 

When the application is executed, the following graph is displayed:

Creating scatter charts

Creating histograms

Histograms, though similar in appearance to bar charts, are used to display the frequency of data items in relation to other items within the dataset. Each of the following examples using GRAL will use the DataTable class to initially hold the data to be displayed. In this example, we will read data from a sample file called AgeofMarriage.csv. This comma-separated file holds a list of ages at which people were first married.

We will create a new class, called HistogramExample, which extends the JFrame class and contains the following code within its constructor. We first create a DataReader object to specify that the data is in CSV format. We then use a try-catch block to handle IO exceptions and call the DataReader class's read method to place the data directly into a DataTable object. The first parameter of the read method is a FileInputStream object, and the second specifies the type of data expected from within the file:

DataReader readType = 
    DataReaderFactory.getInstance().get("text/csv"); 
String fileName = "C://AgeofMarriage.csv"; 
try { 
    DataTable histData = (DataTable) readType.read( 
        new FileInputStream(fileName), Integer.class); 
    ... 
} 
 

Next, we create a Number array to specify the ages for which we expect to have data. In this case, we expect the ages of marriage will range from 19 to 30. We use this array to create our Histogram object. We include our DataTable from earlier and specify the orientation as well. Then we create our DataSource, specify our starting age, and specify the spacing along our X axis:

 
Number ageRange[] = {19,20,21,22,23,24,25,26,27,28,29,30}; 
Histogram sampleHisto = new Histogram1D( 
    histData, Orientation.VERTICAL, ageRange); 
DataSource sampleHistData = new EnumeratedData(sampleHisto, 19,
  1.0); 
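
To illustrate what a histogram computes from an array of break values, here is a minimal plain Java sketch that counts how many values fall between each pair of consecutive breaks. It assumes half-open bins of the form [lower, upper), which is a common convention; GRAL's exact boundary handling may differ and is not reproduced here. The class name, method name, and sample ages are all hypothetical:

```java
import java.util.Arrays;

/** Counts values per bin, where bins are defined by consecutive break values. */
final class BreakBinning {

    /** Returns counts for the bins [breaks[i], breaks[i + 1]). */
    static int[] countPerBin(int[] values, int[] breaks) {
        if (breaks.length < 2) {
            throw new IllegalArgumentException("Need at least two breaks");
        }
        int[] counts = new int[breaks.length - 1];
        for (int value : values) {
            for (int bin = 0; bin < counts.length; bin++) {
                if (value >= breaks[bin] && value < breaks[bin + 1]) {
                    counts[bin]++;
                    break; // Each value belongs to at most one bin
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] breaks = {19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30};
        // Hypothetical ages, not taken from AgeofMarriage.csv
        int[] ages = {19, 19, 20, 22, 22, 22, 25, 29};
        System.out.println(Arrays.toString(countPerBin(ages, breaks)));
    }
}
```

Note that twelve break values define eleven bins, and under this convention a value equal to the last break falls outside the final bin.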
 

We use the BarPlot class to create our histogram from the data we read in earlier:

 
BarPlot testPlot = new BarPlot(sampleHistData); 
 

The next few steps serve to format various aspects of our histogram. We use the setInsets method to specify how much space to place around each side of the graph within the window. We can provide a title for our graph and specify the bar width:

testPlot.setInsets(new Insets2D.Double(20.0, 50.0, 50.0, 20.0)); 
testPlot.getTitle().setText("Average Age of Marriage"); 
testPlot.setBarWidth(0.7);

We also need to format our X and Y axes. We have chosen to set our range for the X axis to closely match our expected age range but to provide some space on the side of the graph. Because we know the amount of sample data, we set our Y axis to range from 0 to 10. In a business application, these ranges would be calculated by examining the actual dataset. We can also specify whether we want tick marks to show and where we would like the axes to intersect:

testPlot.getAxis(BarPlot.AXIS_X).setRange(18, 30.0); 
testPlot.getAxisRenderer(BarPlot.AXIS_X).setTickAlignment(0.0); 
testPlot.getAxisRenderer(BarPlot.AXIS_X).setTickSpacing(1); 
testPlot.getAxisRenderer(BarPlot.AXIS_X).setMinorTicksVisible(false); 

testPlot.getAxis(BarPlot.AXIS_Y).setRange(0.0, 10.0); 
testPlot.getAxisRenderer(BarPlot.AXIS_Y).setTickAlignment(0.0); 
testPlot.getAxisRenderer(BarPlot.AXIS_Y).setMinorTicksVisible(false); 
testPlot.getAxisRenderer(BarPlot.AXIS_Y).setIntersection(0); 
 

We also have a lot of flexibility with the color and values displayed on the graph. In this example, we have chosen to display the frequency value for each age and set our graph color to black:

 
PointRenderer renderHist = 
    testPlot.getPointRenderers(sampleHistData).get(0); 
renderHist.setColor(GraphicsUtils.deriveWithAlpha(Color.black,
  128)); 
renderHist.setValueVisible(true); 
 

Finally, we set several properties for how we want our window to display:

InteractivePanel pan = new InteractivePanel(testPlot); 
pan.setPannable(false); 
pan.setZoomable(false); 
add(pan); 
setSize(1500, 700); 
this.setVisible(true); 
 

When the application is executed, the following graph is displayed:

Creating histograms

Creating donut charts

Donut charts are similar to pie charts, but they are missing the middle section (hence the name donut). Some analysts prefer donut charts to pie charts because they do not emphasize the size of each piece within the chart and are easier to compare to other donut charts. They also provide the added advantage of taking up less space, allowing for more formatting options in the display.

In this example, we will assume our data is already populated in a two-dimensional array called ageCount. The first row of the array contains the possible age values, ranging again from 19 to 30 (inclusive). The second row contains the number of data values equal to each age. For example, in our dataset, there are six data values equal to 19, so ageCount[1][0] contains the number six.
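
The text assumes that ageCount already exists. As a hedged sketch of how such an array might be built from a list of raw ages, the following plain Java code fills the first row with the ages and the second row with their counts. The class name, method name, and sample ages are hypothetical and are not taken from the book's dataset:

```java
/** Builds a two-row array: row 0 holds ages, row 1 holds their frequencies. */
final class AgeCountBuilder {

    /** Counts how often each age from minAge to maxAge (inclusive) occurs. */
    static int[][] build(int[] ages, int minAge, int maxAge) {
        int size = maxAge - minAge + 1;
        int[][] ageCount = new int[2][size];
        for (int i = 0; i < size; i++) {
            ageCount[0][i] = minAge + i; // Row 0: the possible age values
        }
        for (int age : ages) {
            if (age >= minAge && age <= maxAge) {
                ageCount[1][age - minAge]++; // Row 1: count for that age
            }
        }
        return ageCount;
    }

    public static void main(String[] args) {
        // Hypothetical sample with six values of 19 and no values of 23
        int[] ages = {19, 19, 19, 19, 19, 19, 20, 24, 30};
        int[][] ageCount = build(ages, 19, 30);
        System.out.println("Count for age " + ageCount[0][0]
            + ": " + ageCount[1][0]);
    }
}
```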

We create a DataTable and use the add method to add our values from the array. Notice we are testing to see if the value of a particular age is zero. In our test case, there will be zero data values equal to 23. We are opting to add a blank space in our donut chart if there are no data values for that point. This is accomplished by using a negative number as the first parameter in the add method. This will set an empty space of size 3:

DataTable donutData = new DataTable(Integer.class, Integer.class); 
for(int y = 0; y < ageCount[0].length; y++){ 
    if(ageCount[1][y] == 0){ 
        donutData.add(-3, ageCount[0][y]); 
    }else{ 
        donutData.add(ageCount[1][y], ageCount[0][y]); 
    } 
} 
 

Next, we create our donut plot using the PiePlot class. We set basic properties of the plot, including specifying the values for the legend. In this case, we want our legend to reflect our age possibilities, so we use the setLabelColumn method to change the default labels. We also set our insets as we did in the previous example:

PiePlot testPlot = new PiePlot(donutData); 
((ValueLegend) testPlot.getLegend()).setLabelColumn(1); 
testPlot.getTitle().setText("Donut Plot Example"); 
testPlot.setRadius(0.9); 
testPlot.setLegendVisible(true); 
testPlot.setInsets(new Insets2D.Double(20.0, 20.0, 20.0, 20.0)); 

Next, we create a PieSliceRenderer object to set more advanced properties. Because a donut plot is essentially a pie plot with a hole in the middle, we turn the pie into a donut by calling the setInnerRadius method. We also specify the gap between the pie slices, the colors used, and the style of the labels:

 
PieSliceRenderer renderPie = (PieSliceRenderer) 
    testPlot.getPointRenderer(donutData); 
renderPie.setInnerRadius(0.4); 
renderPie.setGap(0.2); 
LinearGradient colors = new LinearGradient( 
    Color.blue, Color.green); 
renderPie.setColor(colors); 
renderPie.setValueVisible(true); 
renderPie.setValueColor(Color.WHITE); 
renderPie.setValueFont(Font.decode(null).deriveFont(Font.BOLD)); 
 

Finally, we create our panel and set its size:

add(new InteractivePanel(testPlot), BorderLayout.CENTER); 
setSize(1500, 700); 
setVisible(true); 
 

When the application is executed, the following graph is displayed:

Creating donut charts

Creating bubble charts

Bubble charts are similar to scatter plots except they represent data with three dimensions. The first two dimensions are expressed on the X and Y axes and the third is represented by the size of the point plotted. This can be helpful in determining relationships between data values.

We will again use the DataTable class to initially hold the data to be displayed. In this example, we will read data from a sample file called MarriageByYears.csv. This is also a CSV file, and contains one column representing the year a marriage occurred, a second column holding the age at which a person was married, and a third column holding integers representing marital satisfaction on a scale from 1 (least satisfied) to 10 (most satisfied). We create a DataSeries to wrap the data we want to plot and then create an XYPlot object:

DataReader readType =  
    DataReaderFactory.getInstance().get("text/csv"); 
String fileName = "C://MarriageByYears.csv"; 
try { 
    DataTable bubbleData = (DataTable) readType.read( 
        new FileInputStream(fileName), Integer.class,  
            Integer.class, Integer.class); 
    DataSeries bubbleSeries = new DataSeries("Bubble", bubbleData); 
    XYPlot testPlot = new XYPlot(bubbleSeries); 
    ... 
} 
 

Next, we set basic property information about our chart. We will set the background color and turn off the vertical and horizontal grids. We will also make our X and Y axes invisible. Notice that we still set a range for the axes, even though they are not displayed:

 
testPlot.setInsets(new Insets2D.Double(30.0)); 
testPlot.setBackground(new Color(0.75f, 0.75f, 0.75f)); 
XYPlotArea2D areaProp = (XYPlotArea2D) testPlot.getPlotArea(); 
areaProp.setBorderColor(null);    
areaProp.setMajorGridX(false);    
areaProp.setMajorGridY(false);    
areaProp.setClippingArea(null);   
 
testPlot.getAxisRenderer(XYPlot.AXIS_X).setShapeVisible(false);   
testPlot.getAxisRenderer(XYPlot.AXIS_X).setTicksVisible(false);   
testPlot.getAxisRenderer(XYPlot.AXIS_Y).setShapeVisible(false);   
testPlot.getAxisRenderer(XYPlot.AXIS_Y).setTicksVisible(false);   
testPlot.getAxis(XYPlot.AXIS_X).setRange(1940, 2020);   
testPlot.getAxis(XYPlot.AXIS_Y).setRange(17, 30);   
 

We can also set properties related to the bubbles drawn on the chart. Here, we set the color and shape, and specify which column of the data will be used to scale the shapes. In this case, the third column, with the marital satisfaction rating, will be used. We set it using the setColumn method:

Color color = GraphicsUtils.deriveWithAlpha(Color.black, 96); 
SizeablePointRenderer renderBubble = new SizeablePointRenderer(); 
renderBubble.setShape(new Ellipse2D.Double(-3.5, -3.5, 4.0, 4.0));   
renderBubble.setColor(color); 
renderBubble.setColumn(2);   
testPlot.setPointRenderers(bubbleSeries, renderBubble);   
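
The renderer above scales each shape internally, and how it does so is not shown in this chapter. As a general illustration of one common design choice for bubble charts, the following plain Java sketch maps a rating to a diameter so that the bubble's area, rather than its diameter, is proportional to the value. This is an assumption for illustration and is not a description of GRAL's implementation. The names BubbleScale and diameterFor are hypothetical:

```java
/** Maps a data value to a bubble diameter with area proportional to the value. */
final class BubbleScale {

    /**
     * Returns the diameter for value, where maxValue maps to maxDiameter.
     * The square root keeps the bubble's area proportional to the value.
     */
    static double diameterFor(double value, double maxValue, double maxDiameter) {
        if (value < 0.0 || maxValue <= 0.0) {
            throw new IllegalArgumentException(
                "value must be non-negative and maxValue positive");
        }
        return maxDiameter * Math.sqrt(value / maxValue);
    }

    public static void main(String[] args) {
        // Satisfaction ratings run from 1 to 10 in MarriageByYears.csv
        for (int rating = 1; rating <= 10; rating++) {
            System.out.printf("rating %2d -> diameter %.2f%n",
                rating, diameterFor(rating, 10, 40));
        }
    }
}
```

Scaling the diameter linearly instead would make a rating of 10 look far more than twice as large as a rating of 5, because the visible area grows with the square of the diameter.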
 

Finally, we create our panel and set its size:

 
add(new InteractivePanel(testPlot), BorderLayout.CENTER);  
setSize(new Dimension(1500, 700)); 
setVisible(true); 
 

When the application is executed, the following graph is displayed. Notice that the size of each point reflects the value in the third column, the marital satisfaction rating, and that because the point color is semi-transparent, areas where points overlap appear darker:

Creating bubble charts

Summary

In this chapter, we introduced basic graphs, plots, and charts used to visualize data. The process of visualization enables an analyst to graphically examine the data under review. This is more intuitive, and often facilitates the rapid identification of anomalies that can be hard to spot in the raw data.

Several visual representations were examined, including line charts, a variety of bar charts, pie charts, scatterplots, histograms, donut charts, and bubble charts. Each of these graphical depictions of data provides a different perspective of the data being analyzed. The most appropriate technique depends on the nature of the data being used. While we have not covered all of the possible graphical techniques, this sample provides a good overview of what is available.

We were also concerned with how Java is used to draw these graphics. Many of the examples used JavaFX. This is a readily available tool that was bundled with Oracle's Java SE 8; since JDK 11 it has been distributed separately as OpenJFX. However, there are several other libraries available. We used GRAL to illustrate how to generate some graphs.

With the overview of visualization techniques, we are ready to move on to other topics, where visualization will be used to better convey the essence of data science techniques. In the next chapter, we will introduce basic statistical processes, including linear regression, and we will use the techniques introduced in this chapter.